
BUG: drop_duplicates() doesn't work for object dtype series containing numpy nans #16632


Open
ran404 opened this issue Jun 8, 2017 · 1 comment
Labels
Bug duplicated duplicated, drop_duplicates Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments


ran404 commented Jun 8, 2017

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
import pandas.testing as pdt

s1 = pd.Series([np.nan, np.nan, 'text'])
s2 = pd.Series([np.float64(np.nan), np.float64(np.nan),'text'])

# This doesn't blow up, thinks s1 and s2 are the same
pdt.assert_series_equal(s1, s2)

s1_unique = s1.drop_duplicates()
s2_unique = s2.drop_duplicates()

# This blows up
pdt.assert_series_equal(s1_unique, s2_unique)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-415-6908d74400cd> in <module>()
----> 1 pdt.assert_series_equal(s1_unique, s2_unique)

~/local/ts3/lib/python3.6/site-packages/pandas/util/testing.py in assert_series_equal(left, right, check_dtype, check_index_type, check_series_type, check_less_precise, check_names, check_exact, check_datetimelike_compat, check_categorical, obj)
   1276         raise_assert_detail(obj, 'Series length are different',
   1277                             '{0}, {1}'.format(len(left), left.index),
-> 1278                             '{0}, {1}'.format(len(right), right.index))
   1279 
   1280     # index comparison

~/local/ts3/lib/python3.6/site-packages/pandas/util/testing.py in raise_assert_detail(obj, message, left, right, diff)
   1147         msg = msg + "\n[diff]: {diff}".format(diff=diff)
   1148 
-> 1149     raise AssertionError(msg)
   1150 
   1151 

AssertionError: Series are different

Series length are different
[left]:  2, Int64Index([0, 2], dtype='int64')
[right]: 3, Int64Index([0, 1, 2], dtype='int64')

Problem description

When working with object-dtype Series that hold mixed values (for example, the result of .T followed by a slice on a DataFrame), drop_duplicates() behaves surprisingly: it does not treat separate np.float64(np.nan) values as duplicates of each other. I would expect the htable.duplicated_object(values) call to handle mixed values containing np.float64 NaNs as well.

The drop_duplicates() call does work for np.nan itself, which is a plain Python float, however.
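A small sketch of the difference between the two inputs, assuming only NumPy is installed. The variable names are illustrative. np.nan is one module-level Python float, so repeating it in a list repeats the same object, whereas each np.float64(np.nan) call builds a new scalar object:

```python
import numpy as np

# The same module-level float object appears twice.
s1_values = [np.nan, np.nan, "text"]
# Each call constructs a fresh np.float64 scalar.
s2_values = [np.float64(np.nan), np.float64(np.nan), "text"]

# Count how many distinct NaN objects each list holds.
s1_nan_objects = len({id(v) for v in s1_values[:2]})
s2_nan_objects = len({id(v) for v in s2_values[:2]})

# NaN never compares equal to itself, so equality cannot match the two
# NaNs in either list; only object identity can.
nan_equals_itself = bool(s2_values[0] == s2_values[0])
```

With one NaN object in the first list and two in the second, any deduplication that can only match NaNs by identity will collapse the first list but not the second.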

Expected Output

import numpy as np
import pandas as pd
import pandas.testing as pdt

s1 = pd.Series([np.nan, np.nan, 'text'])
s2 = pd.Series([np.float64(np.nan), np.float64(np.nan),'text'])

# This doesn't blow up, thinks s1 and s2 are the same
pdt.assert_series_equal(s1, s2)

s1_unique = s1.drop_duplicates()
s2_unique = s2.drop_duplicates()

# The following assertions should not blow up
assert len(s1_unique) == 2
assert len(s2_unique) == 2
pdt.assert_series_equal(s1_unique, s2_unique)

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.2
pytest: 3.1.1
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.5
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.10
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.4.0

Contributor

dsm054 commented Jun 8, 2017

It does work for numpy nans, but only when they are the same nan object. The problem boils down to

In [10]: pd._libs.hashtable.duplicated_object(np.array([1,2,2], dtype=object))
Out[10]: array([False, False,  True], dtype=bool)

In [11]: pd._libs.hashtable.duplicated_object(np.array([1,np.nan, np.nan], dtype=object))
Out[11]: array([False, False,  True], dtype=bool)

In [12]: f = float("nan")

In [13]: pd._libs.hashtable.duplicated_object(np.array([1, f, f], dtype=object))
Out[13]: array([False, False,  True], dtype=bool)

In [14]: pd._libs.hashtable.duplicated_object(np.array([1, float('nan'), float('nan')], dtype=object))
Out[14]: array([False, False, False], dtype=bool)

which happens because we're only working with identity:

                  kh_put_pymap(table, <PyObject*> values[i], &ret)
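As an analogy only (this is a plain Python dict, not the khash table that pandas uses), the same identity-only matching of NaN keys can be reproduced with the standard library. The names below are illustrative:

```python
# Two separate NaN objects, like float('nan') called twice in In [14].
nan_a = float("nan")
nan_b = float("nan")

# Insert only the first NaN object as a key.
table = {nan_a: "first"}

# Lookup with the very same object succeeds through the identity check.
same_object_found = nan_a in table

# Lookup with a different NaN object fails: it is not the same object,
# and NaN == NaN is False, so the equality fallback does not match either.
other_object_found = nan_b in table
```

This mirrors the outputs above: a repeated reference to one NaN is flagged as a duplicate, while two independently created NaNs are not.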

It would probably be straightforward to canonicalize anything null in _libs/hashtable_func_helper.pxi.in before insertion.
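A minimal pure-Python sketch of that canonicalization idea, not the actual Cython change. The helper names (canonicalize, drop_duplicates_keep_first) are hypothetical, and only float NaNs are handled here; other nulls such as None or NaT would need their own rule:

```python
import math

# One shared NaN object that stands in for every float NaN key.
_CANONICAL_NAN = float("nan")


def canonicalize(value):
    """Map any float NaN (np.float64 subclasses float) to one shared object."""
    if isinstance(value, float) and math.isnan(value):
        return _CANONICAL_NAN
    return value


def drop_duplicates_keep_first(values):
    """Return values with later duplicates removed, treating all NaNs as equal."""
    seen = set()
    result = []
    for value in values:
        key = canonicalize(value)
        # The set lookup matches the canonical NaN by identity.
        if key not in seen:
            seen.add(key)
            result.append(value)
    return result


deduplicated = drop_duplicates_keep_first([float("nan"), float("nan"), "text"])
```

Because every NaN is replaced by the same object before it reaches the set, the identity check is enough to recognize the second NaN as a duplicate.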

@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug labels Jun 8, 2017
@jorisvandenbossche jorisvandenbossche added this to the Next Major Release milestone Jun 8, 2017
@jreback jreback added Difficulty Advanced Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jun 9, 2017
@mroeschke mroeschke added duplicated duplicated, drop_duplicates and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels May 13, 2020
@simonjayhawkins simonjayhawkins changed the title drop_duplicates() doesn't work for mixed dtype series containing numpy nans BUG: drop_duplicates() doesn't work for mixed dtype series containing numpy nans Jun 11, 2022
@simonjayhawkins simonjayhawkins changed the title BUG: drop_duplicates() doesn't work for mixed dtype series containing numpy nans BUG: drop_duplicates() doesn't work for object dtype series containing numpy nans Jun 11, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022