feat: recover struct column from exploded Series #904
mattyopl merged 21 commits into googleapis:main
bigframes/bigquery/__init__.py
Outdated
def struct(value: bigframes.dataframe.DataFrame) -> series.Series:
    data: List[Dict[str, Any]] = [{} for _ in value.index]
This creates a local copy of the data in memory. It won't fit if the original DF is backed by a large table in BQ. @chelsea-lin Do you have any suggestions?
Just so I can learn: does struct.explode() [code pointer] create a local copy of the data in memory? @GarrettWu cc @tswast @TrevorBergeron, as Tim introduced this change [PR] and Trevor reviewed it.
If explode() does create a local copy, I think we are okay to create a local copy for struct, as the use case of this function is to unexplode a DataFrame back into a Series of structs (see b/357588049).
An alternative idea was to use DataFrame.loc, but that seems to have some unnecessary overhead and also creates local copies by calling to_pandas() when the Series is a BigFrames Series.
explode() doesn't. Columns and rows aren't equivalent in BQ. Looping over columns is OK, since a table can contain at most 10k columns, but the number of rows can be much larger. https://cloud.google.com/bigquery/quotas#standard_tables
You may want to add a STRUCT operator and apply it to all the columns in the DF.
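To make the suggestion concrete, here is a minimal sketch of the SQL such a STRUCT operator might emit. It loops over column names only, never over rows. The helper name `struct_sql` and the backtick quoting are assumptions made for illustration; they are not BigFrames internals.

```python
from typing import Sequence


def struct_sql(column_names: Sequence[str]) -> str:
    """Build a BigQuery STRUCT(...) expression from column names.

    Hypothetical helper: it only touches column metadata, so its cost is
    proportional to the number of columns, not the number of rows.
    """
    if not column_names:
        raise ValueError("at least one column is required")
    # Each column becomes a named struct field: `col` AS `col`.
    fields = ", ".join(f"`{name}` AS `{name}`" for name in column_names)
    return f"STRUCT({fields})"


print(struct_sql(["name", "age"]))
```

Because the expression is plain SQL text, the row-level work stays inside BigQuery and nothing is downloaded to build it.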
The majority of BigFrames operators are designed to be deferred. This means that we construct a static expression tree representing your operations, and these trees won't be executed (triggering compilation and data downloads) until you explicitly trigger them, typically through actions like to_pandas. Taking struct.explode() as an example, it calls struct.field and series.rename, which adhere to this deferred model as well. You can confirm this behavior in https://screenshot.googleplex.com/Pqf69h7ZKUTmHGP, which shows no BQ job for s.struct.explode().
Following Garrett's suggestion, we can implement this operator by creating a new STRUCT operator. A similar implementation approach can be seen in this pull request, where a JSONExtract unary operation is defined. This operation takes a single argument, json_path, and its compiler rule (defined in the json_extract method) generates the SQL expression JSON_EXTRACT(json_obj, json_path).
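The deferred model and the "operation plus compiler rule" pattern described above can be sketched with a toy expression tree. This is not the BigFrames implementation: the `Column` and `JSONExtract` classes and the `compile_expr` function below are simplified stand-ins written for this example, and the string quoting is deliberately naive.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Column:
    """Leaf node: a reference to a column by name."""
    name: str


@dataclass(frozen=True)
class JSONExtract:
    """Unary op node: one input expression plus a json_path argument."""
    input: "Expr"
    json_path: str


Expr = Union[Column, JSONExtract]


def compile_expr(expr: Expr) -> str:
    """Toy compiler rule: turn the tree into SQL text. Nothing is executed."""
    if isinstance(expr, Column):
        return f"`{expr.name}`"
    if isinstance(expr, JSONExtract):
        return f"JSON_EXTRACT({compile_expr(expr.input)}, '{expr.json_path}')"
    raise TypeError(f"unsupported node: {expr!r}")


# Building the tree is cheap and runs no query.
tree = JSONExtract(Column("payload"), "$.user.id")
# SQL is produced only when something explicitly asks for it.
sql = compile_expr(tree)
print(sql)
```

A STRUCT operator would presumably follow the same shape, except that its node would hold several input expressions instead of one.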
import bigframes.series as series


def test_struct_from_dataframe():
I am curious whether the following cases work for bbq.struct(df):
- When the df has a struct type column.
- When the df has an int column that contains a None element.
- When the df has an array type column.
If they work, could you please add more tests for them?
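One way to pin down the expected values for these cases is a small plain-Python reference model that zips columns into one dict per row. This is only a sketch for reasoning about expected results on tiny local data; `rows_to_structs` is a hypothetical helper and says nothing about how bbq.struct is implemented.

```python
from typing import Any, Dict, List


def rows_to_structs(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Zip equal-length columns into one dict ("struct") per row."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    n_rows = lengths.pop() if lengths else 0
    return [
        {name: values[i] for name, values in columns.items()}
        for i in range(n_rows)
    ]


# The three cases from the review: nested struct, None in an int column, array.
expected = rows_to_structs(
    {
        "nested": [{"a": 1}, {"a": 2}],
        "count": [10, None],
        "tags": [["x", "y"], []],
    }
)
print(expected)
```

Nested values, missing values, and lists pass through unchanged in this model, which is the behavior the extra tests would need to confirm for the real function.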
chelsea-lin
left a comment
Thanks Matthew. LGTM!
Fixes #357588049 internal 🦕