align_chunks not working for datasets #10516
Conversation
for more information, see https://pre-commit.ci
Looks like the failing tests are not related to this PR.
Hi @max-sixty, sorry to bother you. I'm not sure if you have some free time to review this PR. I'm also not sure how reviewers are assigned in Xarray in this case, since no other person was involved in the issue, and probably no one else was notified to review this.
@josephnowak if it helps the review process, I can confirm that these changes work for me.
@josephnowak do you think someone else could maybe look at this?
Hi @dcherian, sorry to bother you. I'm not sure if you have some free time to take a look at this; if you do, or if you know someone else who has the time to review this PR, that would be awesome.
Hi @lbesnard, unfortunately I don't know of anyone else who could review the PR. The last time I sent a PR to Xarray, it was reviewed within a couple of days, so most of the maintainers are probably busy this month. As a temporary solution, you can copy the grid_rechunk function from my branch and use it directly in your code, as I did here. It is more cumbersome because you need to keep track of the actual chunk structure of your data, but it should help.
Oh, I've been using your commit hash, which is great. But I'm using this in a production tool that uses Poetry (tl;dr: pointing the xarray package to this hash fails in my CI/CD pipeline), so I'm a bit blocked at the moment. It also creates more work for you, since you always have to rebase. Thanks a lot anyway! I'm surprised no one else seems to have noticed this bug, though.
Hi @max-sixty @dcherian, sorry to bother you again, but I would like to know whether either of you has some free time to review this PR, whether someone else can review it, or whether you can give an estimate of when you could review it, so that I can avoid the extra work on my side of rebasing the PR multiple times.
Hi @shoyer, I saw that you have made some contributions related to the to_zarr method. Is it possible for you to review this PR? If not, pointing me to someone else who could would be very helpful.
Hi @rabernat, I saw that you have made some contributions related to the to_zarr method. Is it possible for you to review this PR? If not, pointing me to someone else who could would be very helpful.
shoyer left a comment
Thanks @josephnowak!
xarray/backends/chunks.py
Outdated
# This is useful for the scenarios where the enc_chunks are bigger than the
# variable chunks, which happens when the user specifies the enc_chunks manually.
enc_chunks = tuple(
    min(enc_chunk, sum(var_chunk))
    for enc_chunk, var_chunk in zip(enc_chunks, nd_var_chunks, strict=True)
)
Are we sure we want to convert enc_chunks rather than raising an error?
If so, I think this definitely deserves a unit test.
Thanks a lot for taking a look at the PR; the test that I added covers this scenario.
It is necessary to convert the enc_chunks because there are cases where the array being stored is smaller, on at least one dimension, than the enc_chunks, and that caused the align-chunks logic to fail, because it expected the array to always be bigger than or equal to the chunks.
I thought about this, and I think it was not clear that the modification of the enc_chunks was only for the align_chunks algorithm, so I added this change https://github.com/pydata/xarray/pull/10516/files#diff-6462c27c36592f9134c381565c8f30eb59b48ea92d9bcaca371502bdeb8a030aR145-R149 and removed the enc_chunks modification; that should help clarify the code.
By the way, I changed "var" to "v" in the variable names because I noticed that "v" is more common in the Xarray code; for example, nd_var_chunks changed to nd_v_chunks, and so on.
…st of Xarray, move the modification of the enc_chunks to the build_grid_chunks function, add an additional test to cover the scenario where the chunk is bigger than the size of the array
@josephnowak sorry I missed this! Thank you as ever for these PRs; happy to see that Stephan took a look.
I forgot to pass align_chunks to the to_zarr method for datasets, which made the feature useless for that data structure. I added a specific test to cover this issue.
Now align_chunks also works in the cases where the data is smaller than a single Zarr chunk (a test was added to cover this scenario as well).
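A minimal sketch of what "aligning" means along one dimension, including the case where the whole array fits inside a single Zarr chunk (the function name and logic here are a simplification for illustration, not the actual xarray implementation):

```python
def align_1d(size, zarr_chunk):
    """Split an array of `size` elements into chunks that never straddle
    a Zarr chunk boundary: full Zarr-sized chunks plus a remainder.
    Simplified illustration, not xarray's implementation."""
    n_full, rem = divmod(size, zarr_chunk)
    return (zarr_chunk,) * n_full + ((rem,) if rem else ())

print(align_1d(10, 4))  # (4, 4, 2)
print(align_1d(3, 10))  # (3,)  -- data smaller than a single Zarr chunk
```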
I modified (again) the error message shown with safe_chunks; it now includes information about the two chunks that overlap a single Zarr chunk. From what I saw in issue 10501, the original message was not helping users understand what was happening.
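The overlap condition that the improved safe_chunks message describes can be sketched along one dimension like this (a simplified model that assumes the write starts at the beginning of the array; the function is hypothetical, not the actual xarray code):

```python
def first_unsafe_pair(var_chunks, zarr_chunk):
    """Return (i, i + 1) for the first pair of variable chunks that both
    write into the same Zarr chunk, or None if the chunking is safe.
    Simplified: assumes the region being written starts at offset 0."""
    pos = 0
    for i, size in enumerate(var_chunks[:-1]):
        pos += size
        # An interior chunk boundary that does not fall on a Zarr boundary
        # means chunks i and i + 1 both touch the Zarr chunk containing pos.
        if pos % zarr_chunk != 0:
            return (i, i + 1)
    return None

print(first_unsafe_pair((4, 4, 2), 4))  # None  (boundaries at 4 and 8)
print(first_unsafe_pair((3, 5), 4))     # (0, 1): both overlap Zarr chunk [0, 4)
```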
whats-new.rst