feat: add streaming.StreamingDataFrame class #864


Merged

GarrettWu merged 15 commits from garrettwu-streaming into main on Jul 31, 2024

Conversation

GarrettWu
Contributor

StreamingDataFrame implements basic create, projection, filter, and repr operations. It delegates these operations to a member DataFrame with caching and snapshots disabled.
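A minimal sketch of that delegation pattern, assuming a helper shaped like the _return_type_wrapper used in the method snippets quoted later in this review (the re-wrapping logic here is an assumption):

import functools

from bigframes import dataframe


def _return_type_wrapper(method, cls):
    """Wrap a DataFrame method so DataFrame results come back as `cls`."""

    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        result = method(*args, **kwargs)
        # Assumption: re-wrap plain DataFrame results so chained calls
        # stay on the streaming wrapper.
        if isinstance(result, dataframe.DataFrame):
            result = cls(result, create_key=cls._create_key)
        return result

    return wrapper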

@GarrettWu GarrettWu self-assigned this Jul 26, 2024

@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Jul 26, 2024
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: xl Pull request size is extra large. labels Jul 29, 2024
@GarrettWu
Copy link
Contributor Author

Not sure what is causing b/354024943. Tests pass locally. Need to keep an eye on it.

@GarrettWu GarrettWu marked this pull request as ready for review July 29, 2024 05:00
@GarrettWu GarrettWu requested review from a team as code owners July 29, 2024 05:00

>>> sdf = bpd.read_gbq_table_streaming("bigquery-public-data.ml_datasets.penguins")
"""
from bigframes import streaming
Collaborator

Please make sure we raise a PreviewWarning and mark this as a preview method in the docstring.

Contributor Author

Done
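For reference, a hedged sketch of what was requested, assuming the PreviewWarning class lives in bigframes.exceptions (that path is an assumption):

import warnings

import bigframes.exceptions


def read_gbq_table_streaming(table: str) -> StreamingDataFrame:
    """Turn a BigQuery table into a StreamingDataFrame.

    .. note::
        **Preview**: this API is in preview and may change without notice.
    """
    # Assumption: PreviewWarning is the project's preview marker class.
    warnings.warn(
        "read_gbq_table_streaming is in preview.",
        category=bigframes.exceptions.PreviewWarning,
        stacklevel=2,
    )
    ...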

@GarrettWu GarrettWu requested a review from tswast July 29, 2024 18:29
Comment on lines 775 to 777
df = self._read_gbq_table(
    table, api_name="read_gbq_table_streaming", enable_snapshot=False
)
Contributor

I think this will probably set up a default index? We should probably explicitly set a null index to ensure we don't end up putting a window function into the final SQL. Also, do we need to validate that the table is streaming-compatible? I'm guessing some table types don't work.

Contributor Author

Done for the NULL index; the generated SQL is simpler. Thanks.

For table types, it should be fine to let BigQuery emit the errors. Do you have an idea of how we can validate those? I don't see a clear way.
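A sketch of the null-index change, assuming _read_gbq_table accepts an index_col argument like the public read APIs do (that parameter name is an assumption):

import bigframes.enums

df = self._read_gbq_table(
    table,
    api_name="read_gbq_table_streaming",
    enable_snapshot=False,
    # A null index avoids the window function that a sequential default
    # index would otherwise add to the generated SQL.
    index_col=bigframes.enums.DefaultIndexKind.NULL,
)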

Contributor

We fetch metadata for each table when we read it into bigframes. My guess is that only "real" BigQuery tables work with streaming, and probably not views, external tables, etc.
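One way to check that with the metadata already fetched, sketched against the google-cloud-bigquery client (the helper name is hypothetical):

from google.cloud import bigquery


def _check_streaming_compatible(client: bigquery.Client, table_id: str) -> None:
    # Table.table_type is "TABLE" for plain tables, "VIEW" for views,
    # and "EXTERNAL" for external tables.
    table = client.get_table(table_id)
    if table.table_type != "TABLE":
        raise NotImplementedError(
            f"Streaming is not supported for {table.table_type} {table_id}; "
            "only regular BigQuery tables are expected to work."
        )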

Comment on lines 209 to 212
def __getitem__(self, *args, **kwargs):
return _return_type_wrapper(self._df.__getitem__, StreamingDataFrame)(
*args, **kwargs
)
Contributor

Some parameterizations here are going to return just a Series, causing issues later on.

Contributor Author

The fix should be adding a StreamingSeries; logged in b/356201125.
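Until a StreamingSeries exists, a guard like this hypothetical one could fail fast instead of leaking a plain Series:

from bigframes import series


def __getitem__(self, *args, **kwargs):
    result = self._df.__getitem__(*args, **kwargs)
    if isinstance(result, series.Series):
        # Hypothetical guard: single-column selection returns a Series,
        # which has no streaming wrapper yet (b/356201125).
        raise NotImplementedError(
            "Selecting a single column is not yet supported for streaming."
        )
    return StreamingDataFrame(result, create_key=StreamingDataFrame._create_key)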

Comment on lines 250 to 252
@property
def sql(self):
return self._df.sql
Contributor

df.sql automatically applies cached executions and isn't intended for streaming queries. We should have another path for streaming SQL that guarantees streaming compatibility.

Contributor Author

Added a param to disable caching in those paths. We can refactor if we need totally separate paths.
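Sketched, the property might thread that flag through like this (the enable_cache parameter name and the compile helper are assumptions):

@property
def sql(self) -> str:
    # Assumption: the SQL-compile path grew an enable_cache flag; disabling
    # it keeps cached execution tables out of the generated streaming SQL.
    return self._df._to_sql(enable_cache=False)  # hypothetical helper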

Comment on lines 234 to 239
def __repr__(self, *args, **kwargs):
return _return_type_wrapper(self._df.__repr__, StreamingDataFrame)(
*args, **kwargs
)

__repr__.__doc__ = _curate_df_doc(inspect.getdoc(dataframe.DataFrame.__repr__))
Contributor

This is a streaming frame, but we want to print it out just like a static frame? That could be misleading.

Contributor Author

No, that's what _curate_df_doc() does.
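A plausible sketch of what _curate_df_doc does, inferred from its use in the diff above (the exact rewrites are assumptions):

from typing import Optional


def _curate_df_doc(doc: Optional[str]) -> Optional[str]:
    # Assumption: rewrite inherited DataFrame docstrings so the help text
    # and examples name the streaming class instead of the static one.
    if doc is None:
        return None
    return doc.replace("DataFrame", "StreamingDataFrame")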

Comment on lines 218 to 221
def __setitem__(self, *args, **kwargs):
return _return_type_wrapper(self._df.__setitem__, StreamingDataFrame)(
*args, **kwargs
)
Contributor

This could cause joins, which are not allowed. We should catch this before query time.

Contributor Author

Yes, if the input is another Series and it comes from another table, it's a join. And it should return an error for the null index, I think, so it won't be sent as a query.

Maybe this is a place where we need a different definition and implementation than the normal DataFrame. Logged in b/356201125.

# Private constructor
_create_key = object()

def __init__(self, df: dataframe.DataFrame, *, create_key=0):
Contributor

I would really prefer not to wrap the main dataframe.DataFrame object. This reduces the flexibility to modify that object, since changes could break streaming behavior by accident.

Contributor Author

I think this is the best way, at least for the current moment, while we want to get some StreamingDataFrame APIs working. If any code is sharable, I would like to share it instead of duplicating it. It is better to find breakage and fix it in one place than to let the implementations diverge until it is hard to tell where they differ and they cause separate issues.

Now, with snapshots and caching disabled, we can see that some operations just work. I think that is decent for now. Later on, Tardis will increase SQL coverage and we will want to add more APIs; by then we may have a better idea of how much they need to diverge.
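For context, the create-key idiom in the diff above typically works like this sketch (the factory name is hypothetical):

from bigframes import dataframe


class StreamingDataFrame:
    # Private constructor: only code holding this sentinel can instantiate.
    _create_key = object()

    def __init__(self, df: dataframe.DataFrame, *, create_key=0):
        if create_key is not StreamingDataFrame._create_key:
            raise ValueError(
                "StreamingDataFrame has a private constructor; use the "
                "streaming read APIs to create one."
            )
        self._df = df

    @classmethod
    def _from_table_df(cls, df: dataframe.DataFrame) -> "StreamingDataFrame":
        # Hypothetical factory: module-internal code passes the key.
        return cls(df, create_key=cls._create_key)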

@GarrettWu GarrettWu requested a review from TrevorBergeron July 29, 2024 22:36
@@ -598,6 +599,18 @@ def read_gbq_table(
read_gbq_table.__doc__ = inspect.getdoc(bigframes.session.Session.read_gbq_table)


def read_gbq_table_streaming(table: str) -> bigframes.streaming.StreamingDataFrame:
Collaborator

Could we move this to bigframes.streaming.read_gbq_table?

I think the Session method is OK for now, though I wonder if we would want a separate Session for streaming contexts? Session has a lot of configuration and implementation specific to pandas emulation.

Contributor Author

Done
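With the move, usage would look like this sketch:

import bigframes.streaming

sdf = bigframes.streaming.read_gbq_table(
    "bigquery-public-data.ml_datasets.penguins"
)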

@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Jul 30, 2024
@GarrettWu GarrettWu requested a review from tswast July 30, 2024 21:38
@GarrettWu GarrettWu merged commit a7d7197 into main Jul 31, 2024
23 checks passed
@GarrettWu GarrettWu deleted the garrettwu-streaming branch July 31, 2024 19:53