Dropped categories when constructing Series or DF with categorical dtype, scalar data, and list index #19565

Closed
ryan-feeley opened this issue Feb 7, 2018 · 1 comment
Labels
Categorical (Categorical Data Type), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Milestone

Comments

@ryan-feeley
ryan-feeley commented Feb 7, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

# Requires pandas >= 0.21.
from pandas.api.types import CategoricalDtype

# Create categorical type with three unordered categories for future use
cats = ['a', 'b', 'c']
catType = CategoricalDtype(categories=cats, ordered=False)

# Use categorical type to create series from list
s1 = pd.Series(['a', 'a'], dtype=catType)

# Use categorical type to create series from a scalar with a two-element
# index (the constructor broadcasts the scalar)
s2 = pd.Series('a', index=s1.index, dtype=catType)

# I expect s1 and s2 to be identical. They are not.
print(s1)
print(s2)  # Notice that only one category is shown for the dtype

# I can assign any member of the original categories to s1.
s1.loc[0] = 'c'

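# However, this call fails because the other categories were dropped.
# The exception type is assumed to be ValueError or TypeError.
try:
    s2.loc[0] = 'c'
except (ValueError, TypeError) as err:
    print("Assignment failed because of dropped category members:", err)
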
# Workaround: explicitly add back the categories lost by the constructor
s2 = s2.cat.add_categories(cats[1:])
s2.loc[0] = 'c'

Problem description

The constructors of both Series and DataFrame accept the combination of a scalar value and an n-element index. Given this calling syntax, they mimic numpy broadcasting (scalar expansion) and repeat the scalar value n times to produce the object.
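
For reference, here is a minimal sketch of that scalar expansion with ordinary (non-categorical) data. The index and column labels are made up for illustration.

```python
import pandas as pd

# Scalar data plus a three-element index: the scalar is repeated once per label.
s = pd.Series(1.5, index=["x", "y", "z"])
print(s.tolist())

# DataFrame accepts the same combination when both index and columns are given.
df = pd.DataFrame(0, index=["x", "y"], columns=["p", "q"])
print(df.shape)
```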

This behavior breaks when a predefined categorical dtype is specified. The dtype of the resulting pandas object has only one category, corresponding to the original scalar. All other categories associated with the dtype are lost.

If, instead, the constructor is called with n-element data and an n-element index, all categories are retained in the dtype, even if they were not all included in the n-element data.
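
Since n-element data keeps every category, one possible way to sidestep the scalar code path is to expand the scalar into a list before calling the constructor. The sketch below assumes that approach; `broadcast_categorical` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def broadcast_categorical(value, index, dtype):
    """Repeat `value` once per index label, then build the Series from a list.

    Hypothetical helper (not part of pandas): it hands the constructor
    n-element data instead of a scalar.
    """
    index = pd.Index(index)
    return pd.Series([value] * len(index), index=index, dtype=dtype)


cat_type = CategoricalDtype(categories=["a", "b", "c"], ordered=False)
s = broadcast_categorical("a", [0, 1], cat_type)
print(list(s.cat.categories))

# Assigning another member of the declared categories is then allowed.
s.loc[0] = "c"
```

This trades the convenience of broadcasting for an explicit Python list, which should be acceptable for small indexes.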

Expected Output

In the above code, I expect s1 and s2 to be identical. Further, I expect to be able to assign other category members to s2.

For s1 I get this expected output:

0    a
1    a
dtype: category
Categories (3, object): [a, b, c]

but for s2 I get this output, showing only one category member:

0    a
1    a
dtype: category
Categories (1, object): [a]
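
The difference between the two outputs can also be checked programmatically by comparing dtypes. In this sketch, `same_categories` is a hypothetical helper, and `narrow` is only a stand-in for the affected result: it is built with `dtype="category"` so that its categories are inferred from the data alone.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def same_categories(left, right):
    """Return True when two categorical Series have equal dtypes."""
    return bool(left.dtype == right.dtype)


cat_type = CategoricalDtype(categories=["a", "b", "c"], ordered=False)
full = pd.Series(["a", "a"], dtype=cat_type)

# Stand-in for the one-category result: categories inferred from the values.
narrow = pd.Series(["a", "a"], dtype="category")

print(list(full.cat.categories))
print(list(narrow.cat.categories))
print(same_categories(full, narrow))
```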

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.26.1
numpy: 1.14.0
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Thanks. This seems closely related to, but distinct from, #19342.

Let me know if you're interested in taking a look! We can point you in the right direction.

@TomAugspurger added the Reshaping, Categorical, and Difficulty Intermediate labels Feb 7, 2018
@TomAugspurger added this to the 0.23.0 milestone Feb 7, 2018
@jreback modified the milestones: 0.23.0, Next Major Release Feb 10, 2018
@jreback modified the milestones: Next Major Release, 0.23.0 Feb 24, 2018