Skip to content

feat: recover struct column from exploded Series #904

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Sep 3, 2024

Conversation

mattyopl
Copy link
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #357588049 internal 🦕

@mattyopl mattyopl requested a review from GarrettWu August 19, 2024 17:51
@mattyopl mattyopl self-assigned this Aug 19, 2024
@mattyopl mattyopl requested review from a team as code owners August 19, 2024 17:51
@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Aug 19, 2024
@GarrettWu GarrettWu requested a review from chelsea-lin August 19, 2024 20:44
@@ -239,6 +239,15 @@ def json_extract(
return series._apply_unary_op(ops.JSONExtract(json_path=json_path))


def struct(value: bigframes.dataframe.DataFrame) -> series.Series:
data: List[Dict[str, Any]] = [{} for _ in value.index]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it creates a local data copy in memory. It won't fit if the original DF is a large table in BQ. @chelsea-lin Do you have any suggestions?

Copy link
Contributor Author

@mattyopl mattyopl Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just so I can learn: does struct.explode() [code pointer] create a local data copy in memory? @GarrettWu cc @tswast @TrevorBergeron as Tim introduced this change [PR] and Trevor reviewed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If explode() does create a local copy, I think we are okay to create local copy for struct, as the use case of this function is to unexplode a Dataframe back into a Series of structs (see b/357588049).

An alternative idea was to use DataFrame.loc but seems like that has some unnecessary overhead and also creates local copies through calling to_pandas() when Series is a BigFrames Series

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explode() doesn't. Column and Row aren't equivalent in BQ. Looping columns is OK since it can contain only up to 10k columns. But much larger size in rows. https://cloud.google.com/bigquery/quotas#standard_tables

You may want to add a STRUCT operator and apply to all the columns in the DF.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The majority of BigFrames operators are designed to be deferred. This means that we construct static expression tree representing your operations, and these trees won't be executed (causing compiler and data downloads) until you explicitly trigger it, typically within actions like to_pandas. Taken struct.explode() as example, it calls struct.field and series.rename, adhere to this deferred model as well. You can confirm this behavior by https://screenshot.googleplex.com/Pqf69h7ZKUTmHGP, where no BQ job for s.struct.explode().

Following Gerrett's suggestion, we can implement this operator by creating a new STRUCT operator. A similar implementation approach can be seen in this pull request, where a JSONExtract unary operation is defined. This operation takes a single argument, json_path, and its compiler rule (defined in the json_extract method) generates the SQL expression JSON_EXTRACT(json_obj, json_path).

@@ -239,6 +239,15 @@ def json_extract(
return series._apply_unary_op(ops.JSONExtract(json_path=json_path))


def struct(value: bigframes.dataframe.DataFrame) -> series.Series:
data: List[Dict[str, Any]] = [{} for _ in value.index]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The majority of BigFrames operators are designed to be deferred. This means that we construct static expression tree representing your operations, and these trees won't be executed (causing compiler and data downloads) until you explicitly trigger it, typically within actions like to_pandas. Taken struct.explode() as example, it calls struct.field and series.rename, adhere to this deferred model as well. You can confirm this behavior by https://screenshot.googleplex.com/Pqf69h7ZKUTmHGP, where no BQ job for s.struct.explode().

Following Gerrett's suggestion, we can implement this operator by creating a new STRUCT operator. A similar implementation approach can be seen in this pull request, where a JSONExtract unary operation is defined. This operation takes a single argument, json_path, and its compiler rule (defined in the json_extract method) generates the SQL expression JSON_EXTRACT(json_obj, json_path).

@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Aug 29, 2024
@mattyopl mattyopl requested a review from GarrettWu August 29, 2024 19:45
import bigframes.series as series


def test_struct_from_dataframe():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious the following cases work for bbq.struct(df)

  1. When the df has a struct type column.
  2. When the df has a int column, which has a None element.
  3. When the df has a array type column.
    If they're working, could you please add more tests for them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@mattyopl mattyopl requested a review from chelsea-lin August 30, 2024 13:48
Copy link
Contributor

@chelsea-lin chelsea-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matthew. LGTM!

@mattyopl mattyopl enabled auto-merge (squash) August 30, 2024 18:32
@mattyopl mattyopl added the owlbot:run Add this label to trigger the Owlbot post processor. label Sep 3, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Sep 3, 2024
@mattyopl mattyopl merged commit 7dd304c into googleapis:main Sep 3, 2024
22 of 23 checks passed
@mattyopl mattyopl deleted the structs branch September 3, 2024 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants