
feat: read_gbq creates order deterministically without table copy #191


Merged
merged 14 commits into main from hashorder on Nov 11, 2023

Conversation

TrevorBergeron (Contributor)

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@TrevorBergeron TrevorBergeron requested review from a team as code owners November 9, 2023 23:01
@TrevorBergeron TrevorBergeron requested a review from tswast November 9, 2023 23:01
@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Nov 9, 2023
@tswast (Collaborator) left a comment:

Thanks! A few changes, but otherwise looking good. Always nice when session.py can get a little bit simpler.

self,
table_ref: bigquery.table.TableReference,
*,
api_name: str,
enforce_region: bool = False,
Collaborator:

Could you add some docstring for this? It's not 100% clear to me what this means just from the variable name.

Contributor Author:

Removing this parameter; it turns out it is always true.

array_value = self._create_total_ordering(table_expression)

if col_order:
array_value = array_value.select_columns(tuple(col_order))
Collaborator:

As we discussed in chat, we should really do the column filter before making the hash-based ordering. Any chance we can do that now? Otherwise, let's file an issue and add a TODO.

Contributor Author:

yeah, definitely want to minimize the set of columns hashed (even if it affects the ordering). done
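For illustration, a minimal sketch of the approach being discussed, not the bigframes implementation (table and column names are hypothetical): project down to the requested columns first, then derive the hash-based ordering key from only those columns, so unused columns are never hashed.

```python
# Minimal sketch, not the bigframes code: narrow to the requested columns,
# then build a hash-based ordering key from only those columns.
def deterministic_order_sql(table_id: str, columns: list[str]) -> str:
    """Build a query whose ordering key is a hash of the selected columns only."""
    col_list = ", ".join(f"`{c}`" for c in columns)
    # FARM_FINGERPRINT over the serialized row gives a stable (if arbitrary)
    # ordering without copying the table; ties are possible and would need
    # an extra tie-breaker in a real implementation.
    ordering_key = f"FARM_FINGERPRINT(TO_JSON_STRING(STRUCT({col_list})))"
    return f"SELECT {col_list}, {ordering_key} AS ordering_key FROM `{table_id}`"


print(deterministic_order_sql("my-project.my_dataset.my_table", ["id", "name"]))
```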

)
if max_results:
block = block.slice(stop=max_results)
Collaborator:

This requires ordering, right? Kinda defeats the purpose of max_results since it results in a full table scan to calculate the ordering.

Maybe we do a SELECT */columns+index_cols LIMIT max_results query before we do anything else for now?

Contributor Author:

The problem with LIMIT is that it is non-deterministic in the absence of ordering. We would have to immediately cache the sample.

Collaborator:

Ack. Filed issue 310257606 to track adding non-deterministic sampling methods in read_gbq.
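To illustrate the trade-off in this thread (hypothetical table name, not code from this PR): a bare LIMIT is cheap but can return a different subset of rows on each run, while slicing after a total ordering is deterministic but has to compute the ordering key over the whole table first.

```python
# Hypothetical contrast of the two options discussed above; not bigframes code.
table_id = "my-project.my_dataset.my_table"

# Non-deterministic: without ORDER BY, BigQuery may return a different
# subset of rows on every execution, so the sample would have to be cached
# immediately to be reusable.
bare_limit = f"SELECT * FROM `{table_id}` LIMIT 100"

# Deterministic: number rows by a hash-based total ordering, then slice.
# Stable across runs, but the ordering key is computed over the full table.
ordered_slice = f"""
SELECT * EXCEPT(rn)
FROM (
  SELECT t.*, ROW_NUMBER() OVER (
    ORDER BY FARM_FINGERPRINT(TO_JSON_STRING(t))
  ) AS rn
  FROM `{table_id}` AS t
)
WHERE rn <= 100
"""
```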



def _convert_to_string(column: ibis_types.Column) -> ibis_types.StringColumn:
# Some of these probably don't work
Collaborator:

Should we check? Sounds like we need some targeted tests to cover the branches in this function.

Contributor Author:

Should all work now. Added a test specifically for the JSON case; other datatypes are covered by existing tests.
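A rough sketch of the kind of per-type normalization a function like this needs, expressed as BigQuery SQL fragments rather than the actual ibis code (hypothetical helper, not the implementation in this PR): every column must have a canonical STRING form before it can feed the hash-based ordering key.

```python
# Hypothetical sketch, not the actual _convert_to_string implementation:
# each column type needs a canonical STRING form before hashing.
def to_string_sql(column_name: str, bq_type: str) -> str:
    """Return a BigQuery SQL fragment rendering `column_name` as a STRING."""
    if bq_type in ("STRUCT", "ARRAY", "JSON"):
        # Complex types cannot simply be CAST to STRING; serialize them.
        return f"TO_JSON_STRING(`{column_name}`)"
    if bq_type == "BYTES":
        # CAST(BYTES AS STRING) fails on non-UTF-8 data; hex-encode instead.
        return f"TO_HEX(`{column_name}`)"
    # Scalar types (INT64, FLOAT64, BOOL, DATE, TIMESTAMP, ...) cast cleanly.
    return f"CAST(`{column_name}` AS STRING)"
```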

)
return table, ordering

def _ibis_to_session_table(
Collaborator:

We can remove this method now, I believe.

Contributor Author:

Still used by ArrayValue.cached(). Renaming to the more accurate _ibis_to_temp_table since the session dataset isn't being used anymore.

@@ -2719,7 +2719,8 @@ def _get_block(self) -> blocks.Block:
         return self._block
 
     def _cached(self) -> DataFrame:
-        return DataFrame(self._block.cached())
+        self._set_block(self._block.cached())
+        return self
Collaborator:

Wunderbar! This should help with taking better advantage of our cached data in the cases where we do call this automatically.
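A toy illustration of why the in-place version helps (just the pattern, not bigframes code): when caching is triggered internally, the object the user already holds is the one that ends up pointing at the cached block, rather than the cache living only in a return value the caller might discard.

```python
# Toy illustration of the pattern above; not bigframes code.
class Block:
    def __init__(self, is_cached: bool = False):
        self.is_cached = is_cached

    def cached(self) -> "Block":
        # Stand-in for materializing the block to a cached table.
        return Block(is_cached=True)


class Frame:
    def __init__(self, block: Block):
        self._block = block

    def _cached(self) -> "Frame":
        # Old pattern: `return Frame(self._block.cached())` left the caller's
        # original Frame pointing at the uncached block. Mutating in place
        # means later operations on the same object reuse the cache.
        self._block = self._block.cached()
        return self


df = Frame(Block())
df._cached()                 # e.g. triggered automatically by the library
assert df._block.is_cached   # the reference the user kept benefits
```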

@@ -534,6 +534,11 @@ def _read_gbq_table_to_ibis_with_total_ordering(
# the same assumption and use these columns as the total ordering keys.
table = self.bqclient.get_table(table_ref)

if table.location.casefold() != self._location.casefold():
Collaborator:

Today I learned casefold(), thanks.

"Casefolded strings may be used for caseless matching." https://docs.python.org/3/library/stdtypes.html#str.casefold

@TrevorBergeron TrevorBergeron added the automerge Merge the pull request once unit tests and other checks pass. label Nov 10, 2023
@tswast (Collaborator) commented Nov 10, 2023

doctest failures:

FAILED third_party/bigframes_vendored/pandas/io/gbq.py::third_party.bigframes_vendored.pandas.io.gbq.GBQIOMixin.read_gbq
FAILED third_party/bigframes_vendored/pandas/io/parquet.py::third_party.bigframes_vendored.pandas.io.parquet.ParquetIOMixin.read_parquet
FAILED bigframes/pandas/__init__.py::bigframes.pandas.read_gbq
FAILED bigframes/pandas/__init__.py::bigframes.pandas.read_parquet
FAILED bigframes/session/__init__.py::bigframes.session.Session.read_gbq_table
FAILED bigframes/pandas/__init__.py::bigframes.pandas.read_gbq_query
FAILED bigframes/session/__init__.py::bigframes.session.Session.read_gbq_query
FAILED bigframes/pandas/__init__.py::bigframes.pandas.read_gbq_table

Let's remove the df.head(2) calls from samples where the data source doesn't have a guaranteed ordering.
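A hypothetical before/after for one of those samples (the table and column here are illustrative, not the exact doctest that failed):

```python
# Hypothetical illustration; the table and column are examples, not the
# exact doctest from the repo.
import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Flaky in a doctest: with no guaranteed ordering on the source, head(2)
# can print different rows on each run, so literal expected output breaks.
preview = df.head(2)

# Stable alternative: take the top n of an explicit sort instead.
top_two = df.sort_values("body_mass_g").head(2)
```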

@gcf-merge-on-green gcf-merge-on-green bot merged commit 8ab81de into main Nov 11, 2023
@gcf-merge-on-green gcf-merge-on-green bot deleted the hashorder branch November 11, 2023 00:12
@gcf-merge-on-green gcf-merge-on-green bot removed the automerge Merge the pull request once unit tests and other checks pass. label Nov 11, 2023
gcf-merge-on-green bot pushed a commit that referenced this pull request Nov 15, 2023
🤖 I have created a release *beep* *boop*
---


## [0.14.0](https://togithub.com/googleapis/python-bigquery-dataframes/compare/v0.13.0...v0.14.0) (2023-11-14)


### Features

* Add 'cross' join support ([#176](https://togithub.com/googleapis/python-bigquery-dataframes/issues/176)) ([765446a](https://togithub.com/googleapis/python-bigquery-dataframes/commit/765446a929abe1ac076c3037afa7892f64105356))
* Add 'index', 'pad', 'nearest' interpolate methods ([#162](https://togithub.com/googleapis/python-bigquery-dataframes/issues/162)) ([6a28403](https://togithub.com/googleapis/python-bigquery-dataframes/commit/6a2840349a23035bdfdabacd1e231b41bbb5ed7a))
* Add series.sample (identical to existing dataframe.sample) ([#187](https://togithub.com/googleapis/python-bigquery-dataframes/issues/187)) ([37914a4](https://togithub.com/googleapis/python-bigquery-dataframes/commit/37914a4077c681881491f5c36d1a9c9f4255e18f))
* Add unordered sql compilation ([#156](https://togithub.com/googleapis/python-bigquery-dataframes/issues/156)) ([58f420c](https://togithub.com/googleapis/python-bigquery-dataframes/commit/58f420c91d94ca085e9810f36513ffe772bfddcf))
* Log most recent API calls as `recent-bigframes-api-xx` labels on BigQuery jobs ([#145](https://togithub.com/googleapis/python-bigquery-dataframes/issues/145)) ([4ea33b7](https://togithub.com/googleapis/python-bigquery-dataframes/commit/4ea33b7433532ae3a386a6ffa9eb57360ea39526))
* Read_gbq creates order deterministically without table copy ([#191](https://togithub.com/googleapis/python-bigquery-dataframes/issues/191)) ([8ab81de](https://togithub.com/googleapis/python-bigquery-dataframes/commit/8ab81dee4d0eee499094f2dd576550f0c59d7551))
* Support `date_series.astype("string[pyarrow]")` to cast DATE to STRING ([#186](https://togithub.com/googleapis/python-bigquery-dataframes/issues/186)) ([aee0e8e](https://togithub.com/googleapis/python-bigquery-dataframes/commit/aee0e8e2518c59bd1e0b07940c3309871fde8899))
* Support `series.at[row_label] = scalar` ([#173](https://togithub.com/googleapis/python-bigquery-dataframes/issues/173)) ([0c8bd33](https://togithub.com/googleapis/python-bigquery-dataframes/commit/0c8bd33806bb99206b8b12dbdf7d7485c6ffb759))
* Temporary resources no longer use BigQuery Sessions ([#194](https://togithub.com/googleapis/python-bigquery-dataframes/issues/194)) ([4a02cac](https://togithub.com/googleapis/python-bigquery-dataframes/commit/4a02cac88c7d7b46bed1fa813a862fc2ef9ef084))


### Bug Fixes

* All sort operations are now stable ([#195](https://togithub.com/googleapis/python-bigquery-dataframes/issues/195)) ([3a2761f](https://togithub.com/googleapis/python-bigquery-dataframes/commit/3a2761f3c38d0de8b8eda47fffa15b8412aa84b0))
* Default to 7 days expiration for `read_csv`, `read_json`, `read_parquet` ([#193](https://togithub.com/googleapis/python-bigquery-dataframes/issues/193)) ([03606cd](https://togithub.com/googleapis/python-bigquery-dataframes/commit/03606cda30eb7645bfd4534460112dcca56b0ab0))
* Deprecate the `remote_service_type` in llm model ([#180](https://togithub.com/googleapis/python-bigquery-dataframes/issues/180)) ([a8a409a](https://togithub.com/googleapis/python-bigquery-dataframes/commit/a8a409ab0bd1f99dfb442df0703bf8786e0fe58e))
* For reset_index on unnamed multiindex, always use level_[n] label ([#182](https://togithub.com/googleapis/python-bigquery-dataframes/issues/182)) ([f95000d](https://togithub.com/googleapis/python-bigquery-dataframes/commit/f95000d3f88662be4d88c8b0152f1b838e99ec55))
* Match pandas behavior when assigning listlike to empty dfs ([#172](https://togithub.com/googleapis/python-bigquery-dataframes/issues/172)) ([c1d1f42](https://togithub.com/googleapis/python-bigquery-dataframes/commit/c1d1f42a21cc089877f79ebb46a39ddef6958e04))
* Use anonymous dataset instead of session dataset for temp tables ([#181](https://togithub.com/googleapis/python-bigquery-dataframes/issues/181)) ([800d44e](https://togithub.com/googleapis/python-bigquery-dataframes/commit/800d44eb5eb77da5d87b2e005f5a2ed53842e7b5))
* Use random table for `read_pandas` ([#192](https://togithub.com/googleapis/python-bigquery-dataframes/issues/192)) ([741c75e](https://togithub.com/googleapis/python-bigquery-dataframes/commit/741c75e5797e26a1487ff3da76a07953d9537f3f))
* Use random table when loading data for `read_csv`, `read_json`, `read_parquet` ([#175](https://togithub.com/googleapis/python-bigquery-dataframes/issues/175)) ([9d2e6dc](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9d2e6dc1ae4e11e80da4aabe0daa3a6044137cc6))


### Documentation

* Add code samples for `read_gbq_function` using community UDFs ([#188](https://togithub.com/googleapis/python-bigquery-dataframes/issues/188)) ([7506eab](https://togithub.com/googleapis/python-bigquery-dataframes/commit/7506eabf2e58159507809e36abfe90c417dfe92f))
* Add docstring code samples for `Series.apply` and `DataFrame.map` ([#185](https://togithub.com/googleapis/python-bigquery-dataframes/issues/185)) ([c816d84](https://togithub.com/googleapis/python-bigquery-dataframes/commit/c816d843e6f3c5a944cd4395ed0e1e91cec49812))
* Add llm kmeans notebook as an included example ([#177](https://togithub.com/googleapis/python-bigquery-dataframes/issues/177)) ([d49ae42](https://togithub.com/googleapis/python-bigquery-dataframes/commit/d49ae42a379fafd601cc94227e7f8f14b3d5f8c3))
* Use `head()` to get top `n` results, not to preview results ([#190](https://togithub.com/googleapis/python-bigquery-dataframes/issues/190)) ([87f84c9](https://togithub.com/googleapis/python-bigquery-dataframes/commit/87f84c9e58e7d0ea521ac386c9f02791cdddd19f))

---
This PR was generated with [Release Please](https://togithub.com/googleapis/release-please). See [documentation](https://togithub.com/googleapis/release-please#release-please).