My experience contributing to pandas

Background

I have been using pandas extensively at work since 2019. It is one of those killer utilities that is both performant and feature-rich. Considering how much pandas has helped me at work, I wanted to contribute back.

The first issue

I started with pandas-dev/pandas#45506. The first thing to look for in a bug report is the error backtrace, but this issue did not provide one, so I had to reproduce it myself.

The failing code

This was a problem in pandas version 1.4.0, so I installed it using pip install pandas==1.4.0 and ran the following code, which was provided in the issue description:

import pandas as pd

data = [
    {"f_0": "2015-07-01", "f_2": "08335394550"},
    {"f_0": "2015-07-02", "f_2": "+49 (0) 0345 300033"},
    {"f_0": "2015-07-03", "f_2": "+49(0)2598 04457"},
    {"f_0": "2015-07-04", "f_2": "0741470003"},
    {"f_0": "2015-07-05", "f_2": "04181 83668"},
]

dtypes = dict(f_0='datetime64[ns]', f_2='string')
df = pd.DataFrame(data=data).astype(dtypes)
df['f_0'].eq(df['f_2'])

I got the following backtrace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/ops/__init__.py", line 185, in flex_wrapper
    return self._binop(other, op, level=level, fill_value=fill_value)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/series.py", line 2986, in _binop
    result = func(this_vals, other_vals)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/ops/common.py", line 70, in new_method
    return method(self, other)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arraylike.py", line 40, in __eq__
    return self._cmp_method(other, operator.eq)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimelike.py", line 1008, in _cmp_method
    other = self._validate_comparison_value(other)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimelike.py", line 549, in _validate_comparison_value
    other = self._validate_listlike(other, allow_object=True)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimelike.py", line 711, in _validate_listlike
    value = type(self)._from_sequence(value, dtype=self.dtype)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py", line 339, in _from_sequence
    return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py", line 358, in _from_sequence_not_strict
    subarr, tz, inferred_freq = _sequence_to_dt64ns(
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py", line 2082, in _sequence_to_dt64ns
    data, inferred_tz = objects_to_datetime64ns(
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py", line 2199, in objects_to_datetime64ns
    result, tz_parsed = tslib.array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 381, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 613, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 751, in pandas._libs.tslib._array_to_datetime_object
  File "pandas/_libs/tslib.pyx", line 742, in pandas._libs.tslib._array_to_datetime_object
  File "pandas/_libs/tslibs/parsing.pyx", line 281, in pandas._libs.tslibs.parsing.parse_datetime_string
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 1368, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 649, in parse
    ret = self._build_naive(res, default)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 1235, in _build_naive
    naive = default.replace(**repl)
OverflowError: signed integer is greater than maximum

Observations

The first thing I would like to point out is that comparing a date with a phone number looks like an odd thing to do. Who does that!! (Pun intended)

Jokes aside

From the backtrace, it looks like pandas uses the dateutil module behind the scenes to convert strings into datetime-like objects.

But why would it try to convert a string into a datetime object at all? The answer is that comparing datetime objects is more robust than comparing strings.

For instance, let's take two strings, "12 Feb 2000 2am" and "12/02/2000 2am", which are the same timestamp written in different formats (reading 12/02 as day/month). Compared as strings they are unequal, but compared as datetime objects they are equal.
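To make this concrete, here is a small sketch that uses only the standard library rather than the parsing path pandas takes. The format strings are my assumption about how each string should be read; in particular, "12/02/2000" is taken as day/month/year.

```python
from datetime import datetime

first = "12 Feb 2000 2am"
second = "12/02/2000 2am"

# Assumed formats: "day month-name year" and "day/month/year",
# each followed by a 12-hour clock value with an am/pm marker.
first_dt = datetime.strptime(first, "%d %b %Y %I%p")
second_dt = datetime.strptime(second, "%d/%m/%Y %I%p")

strings_equal = first == second          # compares characters
datetimes_equal = first_dt == second_dt  # compares points in time

print(strings_equal, datetimes_equal)
```

Read month-first instead, the second string would be 2 December 2000, which is one more reason a parser has to make assumptions when it sees free-form text.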

In our case, one column was already a datetime column while the other was a string column. So it makes sense to assume that the string column contains timestamp information, which is why it is first converted to datetime.

If pandas fails to convert the strings to datetime, it falls back to treating them as plain strings instead. Although there is a lot of exception handling to recover from a failed datetime conversion, one case was missed: the OverflowError exception, which went uncaught.
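Why would an OverflowError slip past handlers written for ValueError? OverflowError is not a subclass of ValueError, so an except ValueError clause never sees it. The sketch below uses only the standard library. The backtrace ends in a datetime replace call inside dateutil, so the sketch assumes the parser handed that call a very large integer. The year values are made up for illustration, and classify_replace_failure is a hypothetical helper, not pandas or dateutil code.

```python
from datetime import datetime

default = datetime(2022, 1, 1)


def classify_replace_failure(year):
    """Report which exception datetime.replace raises for a given year."""
    try:
        default.replace(year=year)
    except ValueError:
        # Out of the supported year range, but still a small integer.
        return "ValueError"
    except OverflowError:
        # On CPython: the integer does not even fit into a C int.
        return "OverflowError"
    return "ok"


print(classify_replace_failure(2015))
print(classify_replace_failure(99999))
print(classify_replace_failure(8335394550))
print(issubclass(OverflowError, ValueError))
```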

There are many ways to solve this issue. However, catching this exception and recovering from it should be enough.

Possible solutions

  1. Fix the /home/.../pandas/core/arrays/datetimelike.py file at around line 711 by catching the OverflowError exception and doing nothing (pass).
  2. Fix the /home/.../pandas/core/arrays/datetimes.py file at around line 2199 by catching the OverflowError exception and raising a ValueError instead, which would eventually get caught in datetimelike.py at line 711.

The implementation

I chose the second solution because it would solve the OverflowError exception wherever the objects_to_datetime64ns function was called.
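The idea behind the second solution can be sketched in isolation. All function names below are hypothetical stand-ins: low_level_parse plays the role of the dateutil-backed parser, convert_all the role of objects_to_datetime64ns, and validate_listlike the role of the caller in datetimelike.py. This is not the pandas code, only the exception-translation pattern.

```python
def low_level_parse(text):
    """Hypothetical parser that can fail in two different ways."""
    if not text.isdigit():
        raise ValueError(f"cannot parse {text!r}")
    number = int(text)
    if number > 2**31 - 1:
        raise OverflowError("signed integer is greater than maximum")
    return number


def convert_all(values):
    """Translate OverflowError into ValueError at the conversion layer."""
    try:
        return [low_level_parse(v) for v in values]
    except OverflowError as err:
        raise ValueError("Out of range") from err


def validate_listlike(values):
    """Caller that already recovers from ValueError, and only from that."""
    try:
        return convert_all(values)
    except ValueError:
        return values  # fall back to the unconverted input


print(validate_listlike(["08335394550", "0741470003"]))
```

Because the translation happens in one place, every caller that already handles ValueError recovers from the overflow without any change of its own.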

The previous implementation

try:
    result, tz_parsed = tslib.array_to_datetime(
        data.ravel("K"),
        errors=errors,
        utc=utc,
        dayfirst=dayfirst,
        yearfirst=yearfirst,
        require_iso8601=require_iso8601,
        allow_mixed=allow_mixed,
    )
    result = result.reshape(data.shape, order=order)
except ValueError as err:
    try:
        values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
        # If tzaware, these values represent unix timestamps, so we
        #  return them as i8 to distinguish from wall times
        values = values.reshape(data.shape, order=order)
        return values.view("i8"), tz_parsed
    except (ValueError, TypeError):
        raise err

Implementation after applying the fix

try:
    result, tz_parsed = tslib.array_to_datetime(
        data.ravel("K"),
        errors=errors,
        utc=utc,
        dayfirst=dayfirst,
        yearfirst=yearfirst,
        require_iso8601=require_iso8601,
        allow_mixed=allow_mixed,
    )
    result = result.reshape(data.shape, order=order)
except ValueError as err:
    try:
        values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
        # If tzaware, these values represent unix timestamps, so we
        #  return them as i8 to distinguish from wall times
        values = values.reshape(data.shape, order=order)
        return values.view("i8"), tz_parsed
    except (ValueError, TypeError):
        raise err
except OverflowError as err:
    # New: re-raise the overflow as a ValueError, which callers already handle
    raise ValueError("Out of range") from err

When you run the failing code again, it gives the following output:

0    False
1    False
2    False
3    False
4    False
dtype: bool

No exceptions were raised. I felt relieved: I had finally fixed a bug in one of the major Python libraries.

Further steps

I then forked the project and followed the steps to raise a pull request. I was asked to make some changes and also had to add a test. The pandas devs reviewed my code and gave clear feedback on what was needed to get it merged. Overall, it was a nice experience.

Conclusion

I am looking forward to contributing again. As of February 2022, the list of bugs stands at nearly 3300, so there is a lot of work to do. The good thing is that the devs are quick to respond to issues as well as pull requests.
