feat: read_gbq creates order deterministically without table copy #191
Conversation
Thanks! A few changes, but otherwise looking good. Always nice when `session.py` can get a little bit simpler.
bigframes/session/__init__.py
Outdated
    self,
    table_ref: bigquery.table.TableReference,
    *,
    api_name: str,
    enforce_region: bool = False,
Could you add some docstring for this? It's not 100% clear to me what this means just from the variable name.
Removing the parameter, turns out it is always `True`.
bigframes/session/__init__.py
Outdated
array_value = self._create_total_ordering(table_expression)

if col_order:
    array_value = array_value.select_columns(tuple(col_order))
As we discussed in chat, we should really do the column filter before making the hash-based ordering. Any chance we can do that now? Otherwise, let's file an issue and add a TODO.
Yeah, definitely want to minimize the set of columns hashed (even if it affects the ordering). Done.
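To illustrate the idea with a self-contained toy (plain pandas and hashlib here, not the actual bigframes/ibis code): project down to the requested columns first, then derive the deterministic ordering key by hashing only those columns.

```python
import hashlib

import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": ["x", "y", "z"], "unused": [9, 9, 9]})
col_order = ["a", "b"]

# Project first so the ordering hash only covers the columns we keep.
projected = df[col_order]


def row_key(row: pd.Series) -> str:
    # Deterministic, content-based ordering key for one row.
    joined = ",".join(str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


ordering_key = projected.apply(row_key, axis=1)
ordered = projected.assign(_order=ordering_key).sort_values("_order").drop(columns="_order")
print(ordered)
```

Same output every run, and the dropped column never participates in the hash.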
)
if max_results:
    block = block.slice(stop=max_results)
This requires ordering, right? Kinda defeats the purpose of `max_results`, since it results in a full table scan to calculate the ordering. Maybe we do a `SELECT */columns+index_cols LIMIT max_results` query before we do anything else for now?
The problem with LIMIT is that it is non-deterministic in the absence of ordering. We would have to immediately cache the sample.
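One possible way to make such a sample stable, sketched below with the google-cloud-bigquery client (the project, dataset, and table names are placeholders, and this is not necessarily how bigframes would implement it): run the LIMIT query once with a destination table, then read only from that materialized table afterwards.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers; any writable dataset works.
source = "my-project.my_dataset.big_table"
sample_dest = "my-project.my_dataset.sampled_rows"

# LIMIT without ORDER BY is non-deterministic, so materialize the sample once
# and treat the destination table as the source of truth from then on.
job_config = bigquery.QueryJobConfig(
    destination=sample_dest,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(f"SELECT * FROM `{source}` LIMIT 1000", job_config=job_config).result()

for row in client.list_rows(sample_dest, max_results=5):
    print(dict(row))
```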
Ack. Filed issue 310257606 to track adding non-deterministic sampling methods in `read_gbq`.
bigframes/session/__init__.py
Outdated
def _convert_to_string(column: ibis_types.Column) -> ibis_types.StringColumn:
    # Some of these probably don't work
Should we check? Sounds like we need some targeted tests to cover the branches in this function.
Should all work now; added a test specifically for the JSON case. Other datatypes are covered by existing tests.
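For intuition, here is a toy sketch of the kind of per-type dispatch such a helper needs (plain pandas and json, not the actual `_convert_to_string` implementation): most columns can be cast to string directly, but JSON-like values need explicit serialization.

```python
import json

import pandas as pd

df = pd.DataFrame(
    {
        "ints": [1, 2],
        "floats": [1.5, 2.5],
        "payload": [{"k": 1}, {"k": 2}],  # stand-in for a JSON column
    }
)


def to_string(col: pd.Series) -> pd.Series:
    # JSON-like values need explicit serialization; astype(str) would yield a
    # Python repr ("{'k': 1}") rather than valid JSON ('{"k": 1}').
    if col.map(lambda v: isinstance(v, (dict, list))).all():
        return col.map(json.dumps)
    return col.astype(str)


for name in df.columns:
    print(name, to_string(df[name]).tolist())
```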
bigframes/session/__init__.py
Outdated
)
return table, ordering

def _ibis_to_session_table(
We can remove this method now, I believe.
Still used by `ArrayValue.cached()`. Renaming to the more accurate `_ibis_to_temp_table` since the session dataset isn't being used anymore.
@@ -2719,7 +2719,8 @@ def _get_block(self) -> blocks.Block:
        return self._block

    def _cached(self) -> DataFrame:
-       return DataFrame(self._block.cached())
+       self._set_block(self._block.cached())
+       return self
Wunderbar! This should help with taking better advantage of our cached data in the cases where we do call this automatically.
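A tiny illustration of the pattern (a toy class, not the real bigframes DataFrame): by swapping the block in place and returning self, the object the caller already holds keeps pointing at the cached data.

```python
class Frame:
    """Toy stand-in for a DataFrame that wraps a block of data."""

    def __init__(self, block: str):
        self._block = block

    def _cached(self) -> "Frame":
        # Swap in the cached block on this same object, then return self so
        # call sites that expect a Frame back keep working.
        self._block = f"cached({self._block})"
        return self


frame = Frame("raw_block")
frame._cached()
print(frame._block)  # later operations on `frame` now see the cached block
```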
@@ -534,6 +534,11 @@ def _read_gbq_table_to_ibis_with_total_ordering(
        # the same assumption and use these columns as the total ordering keys.
        table = self.bqclient.get_table(table_ref)

+       if table.location.casefold() != self._location.casefold():
Today I learned `casefold()`, thanks.
"Casefolded strings may be used for caseless matching." https://docs.python.org/3/library/stdtypes.html#str.casefold
doctest failures:
Let's remove the
🤖 I have created a release *beep* *boop*

---

## [0.14.0](https://togithub.com/googleapis/python-bigquery-dataframes/compare/v0.13.0...v0.14.0) (2023-11-14)

### Features

* Add 'cross' join support ([#176](https://togithub.com/googleapis/python-bigquery-dataframes/issues/176)) ([765446a](https://togithub.com/googleapis/python-bigquery-dataframes/commit/765446a929abe1ac076c3037afa7892f64105356))
* Add 'index', 'pad', 'nearest' interpolate methods ([#162](https://togithub.com/googleapis/python-bigquery-dataframes/issues/162)) ([6a28403](https://togithub.com/googleapis/python-bigquery-dataframes/commit/6a2840349a23035bdfdabacd1e231b41bbb5ed7a))
* Add series.sample (identical to existing dataframe.sample) ([#187](https://togithub.com/googleapis/python-bigquery-dataframes/issues/187)) ([37914a4](https://togithub.com/googleapis/python-bigquery-dataframes/commit/37914a4077c681881491f5c36d1a9c9f4255e18f))
* Add unordered sql compilation ([#156](https://togithub.com/googleapis/python-bigquery-dataframes/issues/156)) ([58f420c](https://togithub.com/googleapis/python-bigquery-dataframes/commit/58f420c91d94ca085e9810f36513ffe772bfddcf))
* Log most recent API calls as `recent-bigframes-api-xx` labels on BigQuery jobs ([#145](https://togithub.com/googleapis/python-bigquery-dataframes/issues/145)) ([4ea33b7](https://togithub.com/googleapis/python-bigquery-dataframes/commit/4ea33b7433532ae3a386a6ffa9eb57360ea39526))
* Read_gbq creates order deterministically without table copy ([#191](https://togithub.com/googleapis/python-bigquery-dataframes/issues/191)) ([8ab81de](https://togithub.com/googleapis/python-bigquery-dataframes/commit/8ab81dee4d0eee499094f2dd576550f0c59d7551))
* Support `date_series.astype("string[pyarrow]")` to cast DATE to STRING ([#186](https://togithub.com/googleapis/python-bigquery-dataframes/issues/186)) ([aee0e8e](https://togithub.com/googleapis/python-bigquery-dataframes/commit/aee0e8e2518c59bd1e0b07940c3309871fde8899))
* Support `series.at[row_label] = scalar` ([#173](https://togithub.com/googleapis/python-bigquery-dataframes/issues/173)) ([0c8bd33](https://togithub.com/googleapis/python-bigquery-dataframes/commit/0c8bd33806bb99206b8b12dbdf7d7485c6ffb759))
* Temporary resources no longer use BigQuery Sessions ([#194](https://togithub.com/googleapis/python-bigquery-dataframes/issues/194)) ([4a02cac](https://togithub.com/googleapis/python-bigquery-dataframes/commit/4a02cac88c7d7b46bed1fa813a862fc2ef9ef084))

### Bug Fixes

* All sort operation are now stable ([#195](https://togithub.com/googleapis/python-bigquery-dataframes/issues/195)) ([3a2761f](https://togithub.com/googleapis/python-bigquery-dataframes/commit/3a2761f3c38d0de8b8eda47fffa15b8412aa84b0))
* Default to 7 days expiration for `read_csv`, `read_json`, `read_parquet` ([#193](https://togithub.com/googleapis/python-bigquery-dataframes/issues/193)) ([03606cd](https://togithub.com/googleapis/python-bigquery-dataframes/commit/03606cda30eb7645bfd4534460112dcca56b0ab0))
* Deprecate the `remote_service_type` in llm model ([#180](https://togithub.com/googleapis/python-bigquery-dataframes/issues/180)) ([a8a409a](https://togithub.com/googleapis/python-bigquery-dataframes/commit/a8a409ab0bd1f99dfb442df0703bf8786e0fe58e))
* For reset_index on unnamed multiindex, always use level_[n] label ([#182](https://togithub.com/googleapis/python-bigquery-dataframes/issues/182)) ([f95000d](https://togithub.com/googleapis/python-bigquery-dataframes/commit/f95000d3f88662be4d88c8b0152f1b838e99ec55))
* Match pandas behavior when assigning listlike to empty dfs ([#172](https://togithub.com/googleapis/python-bigquery-dataframes/issues/172)) ([c1d1f42](https://togithub.com/googleapis/python-bigquery-dataframes/commit/c1d1f42a21cc089877f79ebb46a39ddef6958e04))
* Use anonymous dataset instead of session dataset for temp tables ([#181](https://togithub.com/googleapis/python-bigquery-dataframes/issues/181)) ([800d44e](https://togithub.com/googleapis/python-bigquery-dataframes/commit/800d44eb5eb77da5d87b2e005f5a2ed53842e7b5))
* Use random table for `read_pandas` ([#192](https://togithub.com/googleapis/python-bigquery-dataframes/issues/192)) ([741c75e](https://togithub.com/googleapis/python-bigquery-dataframes/commit/741c75e5797e26a1487ff3da76a07953d9537f3f))
* Use random table when loading data for `read_csv`, `read_json`, `read_parquet` ([#175](https://togithub.com/googleapis/python-bigquery-dataframes/issues/175)) ([9d2e6dc](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9d2e6dc1ae4e11e80da4aabe0daa3a6044137cc6))

### Documentation

* Add code samples for `read_gbq_function` using community UDFs ([#188](https://togithub.com/googleapis/python-bigquery-dataframes/issues/188)) ([7506eab](https://togithub.com/googleapis/python-bigquery-dataframes/commit/7506eabf2e58159507809e36abfe90c417dfe92f))
* Add docstring code samples for `Series.apply` and `DataFrame.map` ([#185](https://togithub.com/googleapis/python-bigquery-dataframes/issues/185)) ([c816d84](https://togithub.com/googleapis/python-bigquery-dataframes/commit/c816d843e6f3c5a944cd4395ed0e1e91cec49812))
* Add llm kmeans notebook as an included example ([#177](https://togithub.com/googleapis/python-bigquery-dataframes/issues/177)) ([d49ae42](https://togithub.com/googleapis/python-bigquery-dataframes/commit/d49ae42a379fafd601cc94227e7f8f14b3d5f8c3))
* Use `head()` to get top `n` results, not to preview results ([#190](https://togithub.com/googleapis/python-bigquery-dataframes/issues/190)) ([87f84c9](https://togithub.com/googleapis/python-bigquery-dataframes/commit/87f84c9e58e7d0ea521ac386c9f02791cdddd19f))

---
This PR was generated with [Release Please](https://togithub.com/googleapis/release-please). See [documentation](https://togithub.com/googleapis/release-please#release-please).