BUG: drop_duplicates() doesn't work for object dtype series containing numpy nans #16632

ran404 · 2017-06-08T10:26:27Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
import pandas.testing as pdt

s1 = pd.Series([np.nan, np.nan, 'text'])
s2 = pd.Series([np.float64(np.nan), np.float64(np.nan), 'text'])

# This doesn't blow up, thinks s1 and s2 are the same
pdt.assert_series_equal(s1, s2)

s1_unique = s1.drop_duplicates()
s2_unique = s2.drop_duplicates()

# This blows up
pdt.assert_series_equal(s1_unique, s2_unique)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-415-6908d74400cd> in <module>()
----> 1 pdt.assert_series_equal(s1_unique, s2_unique)

~/local/ts3/lib/python3.6/site-packages/pandas/util/testing.py in assert_series_equal(left, right, check_dtype, check_index_type, check_series_type, check_less_precise, check_names, check_exact, check_datetimelike_compat, check_categorical, obj)
   1276         raise_assert_detail(obj, 'Series length are different',
   1277                             '{0}, {1}'.format(len(left), left.index),
-> 1278                             '{0}, {1}'.format(len(right), right.index))
   1279 
   1280     # index comparison

~/local/ts3/lib/python3.6/site-packages/pandas/util/testing.py in raise_assert_detail(obj, message, left, right, diff)
   1147         msg = msg + "\n[diff]: {diff}".format(diff=diff)
   1148 
-> 1149     raise AssertionError(msg)
   1150 
   1151 

AssertionError: Series are different

Series length are different
[left]:  2, Int64Index([0, 2], dtype='int64')
[right]: 3, Int64Index([0, 1, 2], dtype='int64')

Problem description

When dealing with object-dtype Series that mix types (sometimes the result of .T followed by a slice operation on a DataFrame), drop_duplicates() behaves surprisingly: it does not treat repeated np.float64(np.nan) values as duplicates. I would expect the htable.duplicated_object(values) call to also work with mixed dtypes containing np.float64 NaN values.

The drop_duplicates() call does work for np.nan, which is a plain Python float, however.
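
To make the difference between the two inputs concrete, here is a minimal sketch (assuming only NumPy is available; it does not call pandas). It shows that `np.nan` is one shared Python float object, while every `np.float64(np.nan)` call builds a new scalar object. The variable names are illustrative only.

```python
import numpy as np

# np.nan is a module-level constant: a single plain Python float object.
nan_is_builtin_float = type(np.nan) is float
same_object = np.nan is np.nan

# Each np.float64(...) call creates a fresh scalar object.
a = np.float64(np.nan)
b = np.float64(np.nan)
distinct_objects = a is not b

# NaN never compares equal to anything, itself included.
self_equal = bool(a == a)

print(nan_is_builtin_float, same_object, distinct_objects, self_equal)
```

So `s1` holds the same object twice, whereas `s2` holds two different objects that are neither identical nor equal.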

Expected Output

import numpy as np
import pandas as pd
import pandas.testing as pdt

s1 = pd.Series([np.nan, np.nan, 'text'])
s2 = pd.Series([np.float64(np.nan), np.float64(np.nan), 'text'])

# This doesn't blow up, thinks s1 and s2 are the same
pdt.assert_series_equal(s1, s2)

s1_unique = s1.drop_duplicates()
s2_unique = s2.drop_duplicates()

# The following assertions should not blow up
assert len(s1_unique) == 2
assert len(s2_unique) == 2
pdt.assert_series_equal(s1_unique, s2_unique)

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.2
pytest: 3.1.1
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.5
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.10
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.4.0


dsm054 · 2017-06-08T16:33:35Z

It does work for numpy NaNs, but only when the duplicates are the same NaN object. The problem boils down to

In [10]: pd._libs.hashtable.duplicated_object(np.array([1,2,2], dtype=object))
Out[10]: array([False, False,  True], dtype=bool)

In [11]: pd._libs.hashtable.duplicated_object(np.array([1,np.nan, np.nan], dtype=object))
Out[11]: array([False, False,  True], dtype=bool)

In [12]: f = float("nan")

In [13]: pd._libs.hashtable.duplicated_object(np.array([1, f, f], dtype=object))
Out[13]: array([False, False,  True], dtype=bool)

In [14]: pd._libs.hashtable.duplicated_object(np.array([1, float('nan'), float('nan')], dtype=object))
Out[14]: array([False, False, False], dtype=bool)

which happens because we're only working with identity:

                  kh_put_pymap(table, <PyObject*> values[i], &ret)
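
As a rough analogy only (this uses Python's built-in `set`, not pandas' khash table), the same effect can be reproduced in pure Python. A set lookup checks identity before equality, so one NaN object inserted twice collapses to a single entry, while two separately created NaN objects are both kept.

```python
f = float("nan")

# Same object twice: the identity check succeeds, so the set keeps one entry.
shared = {f, f}

# Two separate NaN objects: not identical and not equal, so both are kept.
separate = {float("nan"), float("nan")}

print(len(shared), len(separate))
```

This mirrors `Out[13]` versus `Out[14]` above: the duplicate is only detected when both slots point at the same object.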

It would probably be straightforward to canonicalize anything null in _libs/hashtable_func_helper.pxi.in before insertion.
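
The canonicalization idea could look roughly like the following pure-Python sketch. It is not the pandas implementation: `_NAN`, `canonical_key`, and `duplicated` are hypothetical names made up for illustration, only float NaNs are handled (not `None` or `pd.NaT`), and all values are assumed to be hashable.

```python
import math

# Hypothetical sentinel standing in for "any float NaN".
_NAN = object()


def canonical_key(value):
    """Map every float NaN (including float subclasses) to one shared key."""
    if isinstance(value, float) and math.isnan(value):
        return _NAN
    return value


def duplicated(values):
    """Return a list of flags: True where a value repeats an earlier one."""
    seen = set()
    flags = []
    for value in values:
        key = canonical_key(value)
        flags.append(key in seen)
        seen.add(key)
    return flags


print(duplicated([1, float("nan"), float("nan"), "text"]))
```

Because `np.float64` subclasses Python's `float`, the same `isinstance` check would also catch `np.float64(np.nan)` values, which is the case reported in this issue.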

jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug labels Jun 8, 2017

jorisvandenbossche added this to the Next Major Release milestone Jun 8, 2017

jreback added Difficulty Advanced Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jun 9, 2017

chris-b1 mentioned this issue May 17, 2018

Msgpack round trip changes duplicated() behavior for NaN's #21089

Closed

jbrockmendel removed Effort Low labels Oct 21, 2019

mroeschke added duplicated duplicated, drop_duplicates and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels May 13, 2020

mzeitlin11 mentioned this issue Jan 2, 2021

BUG: hash_pandas_object hash differs for NaN #28363

Open

simonjayhawkins changed the title ~~drop_duplicates() doesn't work for mixed dtype series containing numpy nans~~ BUG: drop_duplicates() doesn't work for mixed dtype series containing numpy nans Jun 11, 2022

simonjayhawkins changed the title ~~BUG: drop_duplicates() doesn't work for mixed dtype series containing numpy nans~~ BUG: drop_duplicates() doesn't work for object dtype series containing numpy nans Jun 11, 2022

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
