chore: use faster query_and_wait API in _read_gbq_colab #1777
Conversation
bigframes/session/loader.py
Outdated
# This is somewhat wasteful, but we convert from Arrow to pandas
# to try to duplicate the same dtypes we'd have if this were a
# table node as best we can.
Hmm, ideally, we would not make this round trip. We should be able to convert directly to a managed storage table, but we need to override some default type inferences (notably, need to specify geo, json or they will infer as string). Constructor: https://github.com/googleapis/python-bigquery-dataframes/blob/main/bigframes/core/local_data.py#L85
# Not every job populates these. For example, slot_millis is missing
# from queries that came from cached results.
bytes_processed if bytes_processed else 0,
slot_millis if slot_millis else 0,
I do slightly worry about these zeros being misinterpreted if we ever summarize into averages, but I guess it's not a problem for now.
In some cases (cached query) 0 probably makes sense for the average. I agree that it could be misleading, though.
# If there was no destination table and we've made it this far, that
# means the query must have been DDL or DML. Return some job metadata,
# instead.
if not destination:
This function is getting absurdly long; maybe we can pull this job -> stats_df conversion out.
Good idea. That is a good candidate to split out. I think I can make a few others as well, such as RowIterator -> DataFrame.
bigframes/session/loader.py
Outdated
array_value = core.ArrayValue.from_managed(mat, self._session)
array_with_offsets, offsets_col = array_value.promote_offsets()
The promote_offsets makes sense; the only problem is that it will invalidate the local engines until I implement that node type. Might it be better, if messier, to just manually add offsets to the pyarrow table for now?
Makes sense. I can do that. I wonder why the test_read_gbq_colab_repr_avoids_requery test in tests/system/small/session/test_read_gbq_colab.py wasn't failing, though? Maybe because we do an upload and then a download instead of running a query?
Edit: I think it's because I wasn't testing with google-cloud-bigquery 3.34.0 (which includes googleapis/python-bigquery#2190), nor with the query preview environment variable.
Edit 2: I did still have to add support for slice to the local executor, because repr does a head, but I suspect that's still easier than implementing promote_offsets.
bigframes/session/loader.py
Outdated
) -> Tuple[google.cloud.bigquery.table.RowIterator, Optional[bigquery.QueryJob]]:
    ...

def _start_query(
Maybe two separate methods would be less code than working around all the mypy nonsense to reuse the same symbol?
I can give it a try. Note that start_query_with_client already has this parameter though, so we'd end up with some inconsistency there.
Edit: Done in da2bf08
return None

- # TODO: Can support some slicing, sorting
+ # TODO: Can support some sorting
TODO(tswast): Add unit tests for new slice support.
The e2e failure for isdigit in the prerelease tests is tracked in b/333484335. It's due to a fix in pyarrow that bigframes hasn't been updated to emulate yet.
Fixes internal issue b/405372623 🦕