feat: recover struct column from exploded Series #904
Conversation
bigframes/bigquery/__init__.py
Outdated
@@ -239,6 +239,15 @@ def json_extract(
    return series._apply_unary_op(ops.JSONExtract(json_path=json_path))


def struct(value: bigframes.dataframe.DataFrame) -> series.Series:
    data: List[Dict[str, Any]] = [{} for _ in value.index]
This creates a local copy of the data in memory, which won't fit if the original DataFrame is a large table in BigQuery. @chelsea-lin, do you have any suggestions?
Just so I can learn: does struct.explode() [code pointer] create a local data copy in memory? @GarrettWu cc @tswast @TrevorBergeron, as Tim introduced this change [PR] and Trevor reviewed it.
If explode() does create a local copy, I think we are okay to create a local copy for struct, as the use case of this function is to unexplode a DataFrame back into a Series of structs (see b/357588049).
An alternative idea was to use DataFrame.loc, but that seems to have some unnecessary overhead and also creates local copies by calling to_pandas() when the Series is a BigFrames Series.
explode() doesn't. Columns and rows aren't equivalent in BQ. Looping over columns is OK, since a table can contain only up to 10k columns, but the number of rows can be much larger. https://cloud.google.com/bigquery/quotas#standard_tables
You may want to add a STRUCT operator and apply it to all the columns in the DF.
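To illustrate why looping over columns is cheap, here is a minimal sketch that builds a single STRUCT(...) SQL expression from column names alone. The helper name struct_sql is hypothetical and is not part of BigFrames; it only shows that the work scales with the number of columns and needs no row data.

```python
from typing import Iterable


def struct_sql(column_names: Iterable[str]) -> str:
    """Build one STRUCT(...) expression from column names (hypothetical helper).

    The loop runs once per column, never once per row, so no table data
    has to be downloaded to build the expression.
    """
    fields = []
    for name in column_names:
        if "`" in name:
            # Keep the sketch simple: refuse names that would need escaping.
            raise ValueError(f"unsupported column name: {name!r}")
        fields.append(f"`{name}` AS `{name}`")
    if not fields:
        raise ValueError("STRUCT needs at least one column")
    return "STRUCT(" + ", ".join(fields) + ")"


print(struct_sql(["name", "age"]))
```

A real operator would plug into the library's own expression and compiler machinery rather than concatenating strings; the string form is used here only to make the cost argument concrete.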
The majority of BigFrames operators are designed to be deferred. This means that we construct a static expression tree representing your operations, and these trees won't be executed (triggering compilation and data downloads) until you explicitly trigger them, typically through actions like to_pandas. Taking struct.explode() as an example, it calls struct.field and series.rename, which adhere to this deferred model as well. You can confirm this behavior at https://screenshot.googleplex.com/Pqf69h7ZKUTmHGP, where no BQ job is run for s.struct.explode().
Following Garrett's suggestion, we can implement this by creating a new STRUCT operator. A similar implementation approach can be seen in this pull request, where a JSONExtract unary operation is defined. This operation takes a single argument, json_path, and its compiler rule (defined in the json_extract method) generates the SQL expression JSON_EXTRACT(json_obj, json_path).
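As a rough illustration of the deferred model, the toy below builds expression nodes that only produce SQL when an explicit execute step is called. Every class and function here (Column, StructField, JSONExtract, explode, execute, JOBS_RUN) is a stand-in written for this sketch and does not reflect the actual BigFrames internals.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Stands in for BigQuery jobs: every executed SQL string is recorded here.
JOBS_RUN: List[str] = []


@dataclass(frozen=True)
class Column:
    name: str

    def to_sql(self) -> str:
        return f"`{self.name}`"


@dataclass(frozen=True)
class StructField:
    """Deferred field access on a struct column."""
    operand: Column
    field_name: str

    def to_sql(self) -> str:
        return f"{self.operand.to_sql()}.`{self.field_name}`"


@dataclass(frozen=True)
class JSONExtract:
    """Unary op with one parameter, json_path, as described in the comment."""
    operand: Column
    json_path: str

    def to_sql(self) -> str:
        return f"JSON_EXTRACT({self.operand.to_sql()}, '{self.json_path}')"


Expr = Union[Column, StructField, JSONExtract]


def explode(column: Column, field_names: List[str]) -> Dict[str, Expr]:
    """Build one renamed field-access node per struct field; runs no query."""
    return {name: StructField(column, name) for name in field_names}


def execute(named_exprs: Dict[str, Expr], table: str = "t") -> str:
    """The only place work happens, analogous to an action like to_pandas."""
    select_list = ", ".join(f"{e.to_sql()} AS `{n}`" for n, e in named_exprs.items())
    sql = f"SELECT {select_list} FROM `{table}`"
    JOBS_RUN.append(sql)
    return sql


exploded = explode(Column("s"), ["x", "y"])  # tree construction only
jobs_after_build = len(JOBS_RUN)             # still zero: nothing has run
sql = execute(exploded)                      # the explicit trigger
```

A STRUCT operator would fit the same pattern as one more node type whose compile rule emits the STRUCT expression, so building it would not download any rows.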
import bigframes.series as series


def test_struct_from_dataframe():
I am curious whether the following cases work for bbq.struct(df):
- When the df has a struct type column.
- When the df has an int column that contains a None element.
- When the df has an array type column.
If they work, could you please add more tests for them?
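One way to pin down expected values for these three cases is a small local reference that zips columns into one dict per row. This is only a sketch of the expected shape: local_struct_rows and the CASES inputs are made up for illustration, and whether the result of bbq.struct(df) matches them exactly after a round trip through BigQuery (for example, how a missing int is represented) is an assumption that the real tests would need to check.

```python
from typing import Any, Dict, List


def local_struct_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Zip equal-length columns into one dict per row (local reference only)."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    n_rows = lengths.pop() if lengths else 0
    return [{name: values[i] for name, values in columns.items()} for i in range(n_rows)]


# Hypothetical inputs, one per case raised in the review comment.
CASES: Dict[str, Dict[str, List[Any]]] = {
    "struct_column": {"id": [1, 2], "inner": [{"x": 1}, {"x": 2}]},
    "int_with_none": {"id": [1, 2], "n": [10, None]},
    "array_column": {"id": [1, 2], "tags": [["a", "b"], []]},
}

expected = {name: local_struct_rows(cols) for name, cols in CASES.items()}
```

Since this reference materializes every row locally, it suits small test fixtures only, which is the same limitation raised earlier in the review about the row-wise implementation.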
done!
Thanks Matthew. LGTM!
Fixes #357588049 internal 🦕