My experience contributing to pandas
Background
I have been using pandas extensively at work since 2019. It is one of those killer utilities that is both performant and feature-rich. Considering how much pandas has helped me at work, I wanted to contribute back.
The first issue
I started with pandas-dev/pandas#45506. The first thing to look for is the error backtrace. But here the issue didn’t provide a backtrace. So, I had to test it out by myself.
The failing code
This was a problem in pandas version 1.4.0, so I installed it using pip install pandas==1.4.0 and ran the following code, which was provided in the issue description:
import pandas as pd

data = [
    {"f_0": "2015-07-01", "f_2": "08335394550"},
    {"f_0": "2015-07-02", "f_2": "+49 (0) 0345 300033"},
    {"f_0": "2015-07-03", "f_2": "+49(0)2598 04457"},
    {"f_0": "2015-07-04", "f_2": "0741470003"},
    {"f_0": "2015-07-05", "f_2": "04181 83668"},
]
dtypes = dict(f_0='datetime64[ns]', f_2='string')
df = pd.DataFrame(data=data).astype(dtypes)
df['f_0'].eq(df['f_2'])
I got the following backtrace:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/ops/__init__.py", line 185, in flex_wrapper
    return self._binop(other, op, level=level, fill_value=fill_value)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/series.py", line 2986, in _binop
    result = func(this_vals, other_vals)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/ops/common.py", line 70, in new_method
    return method(self, other)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arraylike.py", line 40, in __eq__
    return self._cmp_method(other, operator.eq)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimelike.py", line 1008, in _cmp_method
    other = self._validate_comparison_value(other)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimelike.py", line 549, in _validate_comparison_value
    other = self._validate_listlike(other, allow_object=True)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimelike.py", line 711, in _validate_listlike
    value = type(self)._from_sequence(value, dtype=self.dtype)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py", line 339, in _from_sequence
    return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py", line 358, in _from_sequence_not_strict
    subarr, tz, inferred_freq = _sequence_to_dt64ns(
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py", line 2082, in _sequence_to_dt64ns
    data, inferred_tz = objects_to_datetime64ns(
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py", line 2199, in objects_to_datetime64ns
    result, tz_parsed = tslib.array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 381, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 613, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 751, in pandas._libs.tslib._array_to_datetime_object
  File "pandas/_libs/tslib.pyx", line 742, in pandas._libs.tslib._array_to_datetime_object
  File "pandas/_libs/tslibs/parsing.pyx", line 281, in pandas._libs.tslibs.parsing.parse_datetime_string
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 1368, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 649, in parse
    ret = self._build_naive(res, default)
  File "/home/comprop/.cache/pypoetry/virtualenvs/xyz-IBWMwS8H-py3.10/lib/python3.10/site-packages/dateutil/parser/_parser.py", line 1235, in _build_naive
    naive = default.replace(**repl)
OverflowError: signed integer is greater than maximum
Observations
First thing I would like to point out is that this guy looks as if he is out of his mind trying to compare a date and a phone number. Who does that!! (Pun intended)
Jokes aside
From the backtrace, it looks like pandas uses the dateutil module behind the scenes to convert dates from a string to a datetime-like object.
But the question arises: why would it try to convert a string into a datetime object? The answer is that comparing datetime objects is more robust than comparing strings.
For instance, let's take two strings, "12 Feb 2000 2am" and "12/02/2000 2am", which are the same timestamp written in different formats (reading the second one day-first). Compared as strings they are unequal, but compared as datetime objects they are equal.
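Here is a minimal sketch of that difference using only the standard library. It passes an explicit format for each string instead of relying on dateutil's guessing, and it assumes the second string is read day-first (12 February). The format strings are my own choice for illustration, not what pandas does internally.

```python
from datetime import datetime

first = "12 Feb 2000 2am"
second = "12/02/2000 2am"

# Compared as raw strings, the two values differ.
same_as_strings = first == second

# Parsed with explicit formats (the second one read day-first),
# both strings describe 12 February 2000, 02:00.
first_dt = datetime.strptime(first, "%d %b %Y %I%p")
second_dt = datetime.strptime(second, "%d/%m/%Y %I%p")
same_as_datetimes = first_dt == second_dt

print(same_as_strings, same_as_datetimes)
```

The string comparison sees different characters, while the datetime comparison sees the same instant.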
In our case, one column was already marked as datetime while the other was a string column. So it makes sense to assume that the string column contains timestamp information too. This is the reason why it is first converted to datetime.
After pandas fails to convert to datetime, it falls back to comparing them as raw strings instead. Although there is a lot of exception handling in place to recover from a failed datetime conversion, one case was missed: the uncaught OverflowError exception.
There are many ways to solve this issue. However, catching this exception and recovering from it should be enough.
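To make the gap concrete, here is a small sketch with hypothetical stand-in functions (to_datetime_strict and equals_with_fallback are names I made up, not pandas or dateutil APIs). It mimics the last frame of the backtrace, where a number taken from the string ends up in datetime.replace(), and shows that a fallback written only for ValueError lets OverflowError escape.

```python
from datetime import datetime


def to_datetime_strict(text: str) -> datetime:
    """Hypothetical stand-in for a parser that reads a digit string as a year."""
    # Like the last frame in the backtrace, the value is handed to
    # datetime.replace(), which cannot fit a huge number into a C int.
    return datetime(2000, 1, 1).replace(year=int(text))


def equals_with_fallback(stamp: datetime, text: str) -> bool:
    """Compare as datetimes; fall back to raw strings if parsing fails."""
    try:
        return stamp == to_datetime_strict(text)
    except ValueError:
        # The failure mode the code expects: "not a date".
        return str(stamp) == text


stamp = datetime(2015, 7, 1)
print(equals_with_fallback(stamp, "not a date"))  # handled by the fallback

try:
    equals_with_fallback(stamp, "08335394550")
except OverflowError as err:
    # The phone number is too large for a C int, so a different
    # exception type slips past the ValueError handler.
    print("leaked:", type(err).__name__)
```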
Possible solutions
- Fix the /home/.../pandas/core/arrays/datetimelike.py file at around line 711 by catching the OverflowError exception and doing nothing (pass).
- Fix the /home/.../pandas/core/arrays/datetimes.py file at around line 2199 by catching the OverflowError exception and raising a ValueError exception, which would eventually get caught in datetimelike.py at line 711.
The implementation
I chose the second solution because it would solve the OverflowError exception wherever the objects_to_datetime64ns function was called.
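The reasoning can be sketched in isolation. In the snippet below, parse_year, compare and try_convert are hypothetical names for illustration only. The point is that translating the exception once, in the shared low-level function, means every caller that already handles ValueError keeps working without changes.

```python
from datetime import datetime


def parse_year(text: str) -> datetime:
    """Hypothetical low-level converter shared by several callers."""
    try:
        return datetime(2000, 1, 1).replace(year=int(text))
    except OverflowError as err:
        # Translate once, here, so callers only ever see ValueError.
        raise ValueError("Out of range") from err


def compare(stamp: datetime, text: str) -> bool:
    """First caller: treats an unparseable string as 'not equal'."""
    try:
        return stamp == parse_year(text)
    except ValueError:
        return False


def try_convert(text: str):
    """Second caller: returns None when the string is not a valid year."""
    try:
        return parse_year(text)
    except ValueError:
        return None


print(compare(datetime(2015, 7, 1), "08335394550"))
print(try_convert("08335394550"))
print(try_convert("2015"))
```

With the first solution, each of these callers would have needed its own except OverflowError clause.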
The previous implementation
try:
    result, tz_parsed = tslib.array_to_datetime(
        data.ravel("K"),
        errors=errors,
        utc=utc,
        dayfirst=dayfirst,
        yearfirst=yearfirst,
        require_iso8601=require_iso8601,
        allow_mixed=allow_mixed,
    )
    result = result.reshape(data.shape, order=order)
except ValueError as err:
    try:
        values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
        # If tzaware, these values represent unix timestamps, so we
        # return them as i8 to distinguish from wall times
        values = values.reshape(data.shape, order=order)
        return values.view("i8"), tz_parsed
    except (ValueError, TypeError):
        raise err
Implementation after applying the fix
try:
    result, tz_parsed = tslib.array_to_datetime(
        data.ravel("K"),
        errors=errors,
        utc=utc,
        dayfirst=dayfirst,
        yearfirst=yearfirst,
        require_iso8601=require_iso8601,
        allow_mixed=allow_mixed,
    )
    result = result.reshape(data.shape, order=order)
except ValueError as err:
    try:
        values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
        # If tzaware, these values represent unix timestamps, so we
        # return them as i8 to distinguish from wall times
        values = values.reshape(data.shape, order=order)
        return values.view("i8"), tz_parsed
    except (ValueError, TypeError):
        raise err
except OverflowError as err:
    raise ValueError("Out of range") from err
When you run the failing code again, it gives the following output:
0    False
1    False
2    False
3    False
4    False
dtype: bool
No exceptions were raised. I felt relieved: I had finally fixed a bug in one of the major Python libraries.
Further steps
I then forked the project and followed the steps to raise a pull request. I was asked to make some changes and to add a test. The pandas devs reviewed my code and gave clear feedback on what it needed to get merged. Overall, it was a nice experience.
Conclusion
Looking forward to contributing again. The list of bugs as of February 2022 is nearly 3300, so it will be a lot of work. The good thing is that the devs are quick to respond to issues as well as pull requests.