-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Reading a CSV with duplicated MultiRow columns #18062
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Looks like a bad interaction between the name mangling and multiple headers. This is a bit tricky though, since (IIUC) not name mangling here would cause data lose since you don't have unique columns. I'm not sure the best behavior, but the most consistent is to probably just mangle the inner-most level's names? |
I’d have to confirm later but on initial glance I think the problem stems from line 799 in parsers.pyx where the mangling is coercing the duplicated name into a string value. If that holds true would it not make sense to check the type of the name variable and only perform the mangling if it is a string, providing some warning to the user instead in the case of a sequence? |
My original thought was wrong, but the problem here seems to stem from the _maybe_dedup_names function. I edited the function to the below:
And this worked as intended. However, this assumes that multi-indexed columns will always be passed in as a tuple, which I'm not sure is valid. An alternate approach would be to differentiate between strings and tuples / lists, but I wasn't sure off the top of my head of a completely reliable way to do that across all Python versions, so there could be some complexity trade-offs there. @TomAugspurger - do you have a point of view on this? The above would be the simplest solution, but I'm certainly happy to work on something a more robust if we want to handle more than tuples and feel its worth the added complexity. Let me know what you think and I can work something up in a PR |
cc @gfyoung who is more familiar with this code. |
@jpiabrantes : Thanks for reporting this! Indeed, duplicate data is annoying to handle in I think you should augment |
Well I was able to get the correct columns returned with the modifications called out above, but what I didn't notice earlier until I started testing was that it impacted the values within the DataFrame. The issue for that is that the read function within the CParserWrapper converts the data into a dict, using the col_names as a key...So when not mangled the values of the data frame being returned are incorrect. Still debugging but if anyone has experience dealing with that or thoughts on how to resolve let me know |
@WillAyd : Could you provide an example? |
Assuming you add the code I provided earlier to the _maybe_dedup_names function (actual solution would be more robust, but that will work for this comment) try the following:
Doing so yields:
The correct column / MultiIndex will be built, but you'll notice that the values of the DataFrame are 34, 34 and 0.1; not 0, 34 and 0.1 as expected. I think this gets mucked up within the read functions where you see lines like:
Because there are only 2 unique names to be shared across the 3 columns, and using the names within the keys of the dict prevent that from parsing correctly |
@WillAyd : It doesn't look like you're mangling the second level of the names (or name tuple, rather), judging from what I can see from here. |
Are you expecting the output to look something like this?
|
Yes, that would be what I would expect. |
Problem description
I have the following .csv file:
I want to use the first two rows to create MultiIndex columns. However, the first and second columns are duplicated and pandas is handling that in a weird way.
Code Sample, a copy-pastable example if possible
This output is a total mess, the first column and last are tuples while the second column is a string. Also, this is not a MultiIndex, it's an Index. Raising an error or having the option to delete the duplicate columns would be a better solution.
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-97-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.21.0
pytest: 2.9.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.13.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
feather: None
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: