feat: Support write api as loading option #1617
Conversation
bigframes/core/local_data.py
Outdated
    for field in self.data.schema
)

# Can't use RecordBatch.cast until set higher min pyarrow version
Could we include which version of pyarrow that is? Also, this seems like a TODO to switch to RecordBatch.cast when we can.
# TODO: Use RecordBatch.cast once our minimum pyarrow version is at least X.Y.Z+.
Yeah, updated to a TODO with the needed version, 16.0, which is higher than our current floor of 15.0.2.
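For reference, a minimal sketch of the two forms, assuming pyarrow's RecordBatch.cast (added in 16.0); the helper name is illustrative:

import pyarrow as pa

def cast_batch(batch: pa.RecordBatch, schema: pa.Schema) -> pa.RecordBatch:
    # Pre-16.0 workaround: cast each column individually and rebuild the batch.
    # Once the pyarrow floor reaches 16.0, this whole body collapses to:
    #     return batch.cast(schema)
    return pa.record_batch(
        [arr.cast(type) for arr, type in zip(batch.columns, schema.types)],
        schema=schema,
    )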
bigframes/core/local_data.py
Outdated
if duration_type == "int":

    @_recursive_map_types
    def durations_to_ints(type: pa.DataType) -> pa.DataType:
        if pa.types.is_duration(type):
            return pa.int64()
        return type

    schema = pa.schema(
        pa.field(field.name, durations_to_ints(field.type))
        for field in self.data.schema
    )

    # Can't use RecordBatch.cast until set higher min pyarrow version
    def convert_batch(batch: pa.RecordBatch) -> pa.RecordBatch:
        return pa.record_batch(
            [arr.cast(type) for arr, type in zip(batch.columns, schema.types)],
            schema=schema,
        )

    batches = map(convert_batch, batches)
This type conversion logic seems like it could be pulled out into a separate function for easier unit testing.
Optional: parameterize it to support conversions from any type into any type. This might be useful for JSON in the future, for example. Then again, https://wiki.c2.com/?YouArentGonnaNeedIt, so let's just start with a refactor of the duration logic and parameterize it when we actually need that functionality.
Yeah, I don't want to generalize this too hard; the general case might involve more complex things than a simple cast, and I don't want to build the scaffolding for it without reason. The next iteration will pull out the duration logic, though.
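One possible shape for that refactor, as a hypothetical sketch rather than the actual change (the real code applies the mapping recursively via _recursive_map_types to handle nested types):

import pyarrow as pa

def durations_to_ints_schema(schema: pa.Schema) -> pa.Schema:
    # Hypothetical standalone helper for unit testing; maps top-level
    # duration fields to int64 and leaves everything else unchanged.
    def to_int(type: pa.DataType) -> pa.DataType:
        return pa.int64() if pa.types.is_duration(type) else type

    return pa.schema(
        pa.field(field.name, to_int(field.type)) for field in schema
    )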
    raise ValueError(
        f"Problem loading at least one row from DataFrame: {response.row_errors}. {constants.FEEDBACK_LINK}"
    )
destination_table = self._bqclient.get_table(bq_table_ref)
Do we need to do something here to finalize the stream? https://cloud.google.com/bigquery/docs/write-api-streaming
Not strictly necessary, but it can help avoid limits, per the docs: "This step is optional in committed type, but helps to prevent exceeding the limit on active streams."
Added finalize in the new iteration.
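For context, a minimal sketch of finalizing a committed-type stream with the Storage Write API client; the project, table, and stream names below are placeholders:

from google.cloud import bigquery_storage_v1

client = bigquery_storage_v1.BigQueryWriteClient()

# Placeholder stream name; the real one comes back from create_write_stream().
stream_name = "projects/my-project/datasets/my_dataset/tables/my_table/streams/my-stream-id"

# Optional for committed-type streams, but helps avoid exceeding the
# limit on concurrently active streams.
client.finalize_write_stream(name=stream_name)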
e2e failure: Looks like a flake, unrelated to this change.
tswast left a comment
Love it. Thanks! Just one last thing
    return self._loader.read_pandas(
        pandas_dataframe, method="stream", api_name=api_name
    )
elif write_engine == "bigquery_write":
Could you also add this to the various docstrings? Let's mark the "bigquery_write" option as [Preview] in the docs, too.
Added to the only docstring that enumerates the engines.
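For anyone landing here later, usage would look roughly like this (assuming the read_pandas write_engine parameter this PR extends; "bigquery_write" is marked [Preview]):

import pandas as pd
import bigframes.pandas as bpd

pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Load the local DataFrame via the BigQuery Storage Write API engine added in this PR.
df = bpd.read_pandas(pdf, write_engine="bigquery_write")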