Reading a CSV with duplicated MultiRow columns #18062


Closed
jpiabrantes opened this issue Nov 1, 2017 · 11 comments · Fixed by #18094
Labels: IO CSV read_csv, to_csv MultiIndex
Milestone: 0.22.0

Comments

jpiabrantes commented Nov 1, 2017

Problem description

I have the following .csv file:

A,A,B
one,one,two
0,34,0.1

I want to use the first two rows to create MultiIndex columns. However, the first and second columns are duplicated and pandas is handling that in a weird way.

Code Sample, a copy-pastable example if possible

import pandas as pd

print(pd.read_csv('test.csv', header=[0, 1]).columns)
>> Index([('A', 'one'), '('A', 'one').1', ('B', 'two')], dtype='object')

This output is a total mess: the first and last columns are tuples, while the second column is a string. Also, this is not a MultiIndex, it's a plain Index. Raising an error or having the option to drop the duplicate columns would be a better solution.
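
For comparison, a minimal sketch (column names here are illustrative, not from the report) of what read_csv does when the header pairs are unique: it builds a proper two-level MultiIndex.

import io
import pandas as pd

# With unique (top, bottom) header pairs, header=[0, 1] produces a proper
# two-level MultiIndex rather than a flat Index of mixed types.
data = "A,B,C\none,two,three\n0,34,0.1"
df = pd.read_csv(io.StringIO(data), header=[0, 1])
print(df.columns)
# -> a MultiIndex containing ('A', 'one'), ('B', 'two'), ('C', 'three')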

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-97-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.21.0
pytest: 2.9.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.13.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
feather: None
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger (Contributor) commented:

Looks like a bad interaction between the name mangling and multiple headers.

This is a bit tricky, though, since (IIUC) not mangling the names here would cause data loss, given that you don't have unique columns. I'm not sure of the best behavior, but the most consistent option is probably to just mangle the inner-most level's names?
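
For illustration, a rough sketch of what mangling only the inner-most level could look like, using a hypothetical helper rather than the actual pandas code:

from collections import defaultdict

def mangle_innermost(names):
    # Hypothetical helper: append '.1', '.2', ... to the innermost (last)
    # element of any duplicated column tuple, leaving the outer levels alone.
    # (A real implementation would also guard against colliding with an
    # existing name such as ('A', 'one.1').)
    counts = defaultdict(int)
    result = []
    for name in names:
        seen = counts[name]
        counts[name] += 1
        if seen:
            name = name[:-1] + ('{}.{}'.format(name[-1], seen),)
        result.append(name)
    return result

print(mangle_innermost([('A', 'one'), ('A', 'one'), ('B', 'two')]))
# [('A', 'one'), ('A', 'one.1'), ('B', 'two')]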

TomAugspurger added the IO CSV read_csv, to_csv label Nov 1, 2017
TomAugspurger added this to the Next Major Release milestone Nov 1, 2017
WillAyd (Member) commented Nov 1, 2017

I’d have to confirm later, but at first glance I think the problem stems from line 799 in parsers.pyx, where the mangling coerces the duplicated name into a string value.

If that holds true, would it not make sense to check the type of the name variable and only perform the mangling if it is a string, instead providing some warning to the user in the case of a sequence?

WillAyd (Member) commented Nov 1, 2017

My original thought was wrong, but the problem here seems to stem from the _maybe_dedup_names function. I edited the function to the below:

def _maybe_dedup_names(self, names):
    ...
    if self.mangle_dupe_cols:
        names = list(names)  # so we can index
        counts = defaultdict(int)

        for i, col in enumerate(names):
            if isinstance(col, tuple):
                # tuple names come from the multi-row header; skip mangling them
                continue
            ...

And this worked as intended. However, this assumes that multi-indexed columns will always be passed in as a tuple, which I'm not sure is valid. An alternate approach would be to differentiate between strings and tuples / lists, but I wasn't sure off the top of my head of a completely reliable way to do that across all Python versions, so there could be some complexity trade-offs there.

@TomAugspurger - do you have a point of view on this? The above would be the simplest solution, but I'm certainly happy to work on something more robust if we want to handle more than tuples and feel it's worth the added complexity. Let me know what you think and I can work something up in a PR.
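
For illustration, the string-versus-sequence distinction being discussed could be sketched roughly as below; pandas ships its own compat helpers for this, so the following is only an assumption-laden sketch:

try:
    string_types = (str, unicode)  # Python 2: str and unicode
except NameError:
    string_types = (str,)          # Python 3: str only

def should_mangle(col):
    # Only mangle plain string names; leave tuples/lists from
    # multi-row headers untouched (or warn about them instead).
    return isinstance(col, string_types)

print(should_mangle('A'))           # True
print(should_mangle(('A', 'one')))  # False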

TomAugspurger (Contributor) commented:

cc @gfyoung who is more familiar with this code.

gfyoung (Member) commented Nov 2, 2017

@jpiabrantes : Thanks for reporting this! Indeed, duplicate data is annoying to handle in pandas, and while we've mostly ironed out the single-column case, it does not surprise me that something would have broken in the MultiIndex case.

I think you should augment _maybe_dedup_names to account for the MultiIndex case more carefully. I agree with your suggestion, though, to "dedup" at the lowest row. Give it a shot in a PR!

WillAyd (Member) commented Nov 2, 2017

Well, I was able to get the correct columns returned with the modifications called out above, but what I didn't notice until I started testing was that the change impacted the values within the DataFrame. The issue is that the read function within the CParserWrapper converts the data into a dict using the col_names as keys, so when the names are not mangled the values of the returned DataFrame are incorrect.

Still debugging, but if anyone has experience dealing with that, or thoughts on how to resolve it, let me know.

gfyoung (Member) commented Nov 2, 2017

@WillAyd : Could you provide an example?

WillAyd (Member) commented Nov 2, 2017

Assuming you add the code I provided earlier to the _maybe_dedup_names function (the actual solution would be more robust, but that will work for this comment), try the following:

data = "A,A,B\none,one,two\n0,34,0.1"
pd.read_csv(io.StringIO(data), header=[0,1])

Doing so yields:

    A       B
  one one  two
0  34  34  0.1

The correct MultiIndex columns will be built, but you'll notice that the values of the DataFrame are 34, 34 and 0.1, not 0, 34 and 0.1 as expected.

I think this gets mucked up within the read functions where you see lines like:

data = dict((k, v) for k, (i, v) in zip(names, data))

Because there are only 2 unique names to be shared across the 3 columns, using the names as the keys of the dict prevents the data from parsing correctly.
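
A tiny, simplified illustration of why that dict step loses data when names repeat (values taken from the example above; the real code zips (index, values) pairs, which is omitted here):

names = [('A', 'one'), ('A', 'one'), ('B', 'two')]
columns = [[0], [34], [0.1]]

# With a duplicated key, the later value overwrites the earlier one,
# so the first column's data is silently dropped.
data = dict(zip(names, columns))
print(data)
# {('A', 'one'): [34], ('B', 'two'): [0.1]}   (key order may vary)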

gfyoung (Member) commented Nov 2, 2017

@WillAyd : It doesn't look like you're mangling the second level of the names (or the name tuple, rather), judging from what I can see here.

WillAyd (Member) commented Nov 2, 2017

Are you expecting the output to look something like this?

    A          B
  one  one.1  two
0   0     34  0.1

gfyoung (Member) commented Nov 2, 2017

Yes, that would be what I would expect.
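
For reference, that agreed-upon result built by hand, purely as an illustration of the target columns:

import pandas as pd

# Duplicates mangled at the innermost level only: the repeated ('A', 'one')
# becomes ('A', 'one.1'), while the outer level 'A' is left untouched.
expected = pd.MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'one.1'), ('B', 'two')]
)
print(expected.get_level_values(0).tolist())  # ['A', 'A', 'B']
print(expected.get_level_values(1).tolist())  # ['one', 'one.1', 'two']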

WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 3, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 3, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 4, 2017
jreback modified the milestones: Next Major Release, 0.22.0 Nov 6, 2017
gfyoung pushed a commit that referenced this issue Nov 6, 2017
watercrossing pushed a commit to watercrossing/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017