chore: use faster query_and_wait API in _read_gbq_colab #1777
Conversation
bigframes/session/loader.py
Outdated
# This is somewhat wasteful, but we convert from Arrow to pandas
# to try to duplicate the same dtypes we'd have if this were a
# table node as best we can.
Hmm, ideally, we would not make this round trip. We should be able to convert directly to a managed storage table, but we need to override some default type inferences (notably, need to specify geo, json or they will infer as string). Constructor: https://github.com/googleapis/python-bigquery-dataframes/blob/main/bigframes/core/local_data.py#L85
# Not every job populates these. For example, slot_millis is missing
# from queries that came from cached results.
bytes_processed if bytes_processed else 0,
slot_millis if slot_millis else 0,
I do slightly worry about these zeros being misinterpreted if we ever summarize into averages, but I guess it's not a problem for now.
In some cases (cached query) 0 probably makes sense for the average. I agree that it could be misleading, though.
# If there was no destination table and we've made it this far, that
# means the query must have been DDL or DML. Return some job metadata,
# instead.
if not destination:
This function is getting absurdly long; maybe we can pull this job -> stats_df conversion out.
Good idea. That is a good candidate to split out. I think I can make a few others as well, such as RowIterator -> DataFrame.
bigframes/session/loader.py
Outdated
array_value = core.ArrayValue.from_managed(mat, self._session)
array_with_offsets, offsets_col = array_value.promote_offsets()
The promote_offsets makes sense; the only problem is that it will invalidate the local engines until I implement that node type. Might it be better, if messier, to just manually add offsets to the pyarrow table for now?
Makes sense. I can do that. I wonder why the test_read_gbq_colab_repr_avoids_requery test in tests/system/small/session/test_read_gbq_colab.py wasn't failing, though? Maybe because we do an upload and then a download instead of running a query?
Edit: I think it's because I wasn't testing with google-cloud-bigquery 3.34.0 (which includes googleapis/python-bigquery#2190), nor with the query preview environment variable.
Edit 2: I did still have to add support for slice to the local executor, because repr does a head, but I suspect that's still easier than implementing promote_offsets.
bigframes/session/loader.py
Outdated
) -> Tuple[google.cloud.bigquery.table.RowIterator, Optional[bigquery.QueryJob]]:
    ...

def _start_query(
Maybe two separate methods would be less code than working around all the mypy nonsense to reuse the same symbol?
I can give it a try. Note that start_query_with_client already has this parameter though, so we'd end up with some inconsistency there.
Edit: Done in da2bf08
return None

- # TODO: Can support some slicing, sorting
+ # TODO: Can support some sorting
TODO(tswast): Add unit tests for new slice support.
The e2e failure for isdigit in the prerelease tests is tracked in b/333484335. It's due to a fix in pyarrow that bigframes hasn't been updated to emulate yet.
Fixes internal issue b/405372623 🦕