feat: read_gbq creates order deterministically without table copy#191
feat: read_gbq creates order deterministically without table copy#191gcf-merge-on-green[bot] merged 14 commits intomainfrom
Conversation
tswast
left a comment
There was a problem hiding this comment.
Thanks! A few changes, but otherwise looking good. Always nice when session.py can get a little bit simpler.
bigframes/session/__init__.py
Outdated
| table_ref: bigquery.table.TableReference, | ||
| *, | ||
| api_name: str, | ||
| enforce_region: bool = False, |
There was a problem hiding this comment.
Could you add some docstring for this? It's not 100% clear to me what this means just from the variable name.
There was a problem hiding this comment.
removing parameter, turns out it is always 'true'
bigframes/session/__init__.py
Outdated
| array_value = self._create_total_ordering(table_expression) | ||
|
|
||
| if col_order: | ||
| array_value = array_value.select_columns(tuple(col_order)) |
There was a problem hiding this comment.
As we discussed in chat, we should really do the column filter before making the hash-based ordering. Any chance we can do that now? Otherwise, let's file an issue and add a TODO.
There was a problem hiding this comment.
yeah, definitely want to minimize the set of columns hashed (even if it affects the ordering). done
| index_labels=index_cols, | ||
| ) | ||
| if max_results: | ||
| block = block.slice(stop=max_results) |
There was a problem hiding this comment.
This requires ordering, right? Kinda defeats the purpose of max_results since it results in a full table scan to calculate the ordering.
Maybe we do a SELECT */columns+index_cols LIMIT max_results query before we do anything else for now?
There was a problem hiding this comment.
The problem with LIMIT is that it is non-deterministic in the absence of ordering. We would have to immediately cache the sample.
There was a problem hiding this comment.
Ack. Filed issue 310257606 to track adding non-deterministic sampling methods in read_gbq.
bigframes/session/__init__.py
Outdated
|
|
||
|
|
||
| def _convert_to_string(column: ibis_types.Column) -> ibis_types.StringColumn: | ||
| # Some of these probably don't work |
There was a problem hiding this comment.
Should we check? Sounds like we need some targeted tests to cover the branches in this function.
There was a problem hiding this comment.
Should all work now, added a test specifically for the json case, other datatypes are covered by existing tests.
bigframes/session/__init__.py
Outdated
| ) | ||
| return table, ordering | ||
|
|
||
| def _ibis_to_session_table( |
There was a problem hiding this comment.
We can remove this method now, I believe.
There was a problem hiding this comment.
Still used by ArrayValue.cached(). Renaming to the more accurate _ibis_to_temp_table since the session dataset isn't being used anymore.
Co-authored-by: Tim Swast <swast@google.com>
| def _cached(self) -> DataFrame: | ||
| return DataFrame(self._block.cached()) | ||
| self._set_block(self._block.cached()) | ||
| return self |
There was a problem hiding this comment.
Wunderbar! This should help with taking better advantage of our cached data in the cases where we do call this automatically.
| # the same assumption and use these columns as the total ordering keys. | ||
| table = self.bqclient.get_table(table_ref) | ||
|
|
||
| if table.location.casefold() != self._location.casefold(): |
There was a problem hiding this comment.
Today I learned casefold(), thanks.
"Casefolded strings may be used for caseless matching." https://docs.python.org/3/library/stdtypes.html#str.casefold
| index_labels=index_cols, | ||
| ) | ||
| if max_results: | ||
| block = block.slice(stop=max_results) |
There was a problem hiding this comment.
Ack. Filed issue 310257606 to track adding non-deterministic sampling methods in read_gbq.
|
doctest failures: Let's remove the |
🤖 I have created a release *beep* *boop* --- ## [0.14.0](https://togithub.com/googleapis/python-bigquery-dataframes/compare/v0.13.0...v0.14.0) (2023-11-14) ### Features * Add 'cross' join support ([#176](https://togithub.com/googleapis/python-bigquery-dataframes/issues/176)) ([765446a](https://togithub.com/googleapis/python-bigquery-dataframes/commit/765446a929abe1ac076c3037afa7892f64105356)) * Add 'index', 'pad', 'nearest' interpolate methods ([#162](https://togithub.com/googleapis/python-bigquery-dataframes/issues/162)) ([6a28403](https://togithub.com/googleapis/python-bigquery-dataframes/commit/6a2840349a23035bdfdabacd1e231b41bbb5ed7a)) * Add series.sample (identical to existing dataframe.sample) ([#187](https://togithub.com/googleapis/python-bigquery-dataframes/issues/187)) ([37914a4](https://togithub.com/googleapis/python-bigquery-dataframes/commit/37914a4077c681881491f5c36d1a9c9f4255e18f)) * Add unordered sql compilation ([#156](https://togithub.com/googleapis/python-bigquery-dataframes/issues/156)) ([58f420c](https://togithub.com/googleapis/python-bigquery-dataframes/commit/58f420c91d94ca085e9810f36513ffe772bfddcf)) * Log most recent API calls as `recent-bigframes-api-xx` labels on BigQuery jobs ([#145](https://togithub.com/googleapis/python-bigquery-dataframes/issues/145)) ([4ea33b7](https://togithub.com/googleapis/python-bigquery-dataframes/commit/4ea33b7433532ae3a386a6ffa9eb57360ea39526)) * Read_gbq creates order deterministically without table copy ([#191](https://togithub.com/googleapis/python-bigquery-dataframes/issues/191)) ([8ab81de](https://togithub.com/googleapis/python-bigquery-dataframes/commit/8ab81dee4d0eee499094f2dd576550f0c59d7551)) * Support `date_series.astype("string[pyarrow]")` to cast DATE to STRING ([#186](https://togithub.com/googleapis/python-bigquery-dataframes/issues/186)) ([aee0e8e](https://togithub.com/googleapis/python-bigquery-dataframes/commit/aee0e8e2518c59bd1e0b07940c3309871fde8899)) * Support `series.at[row_label] = scalar` ([#173](https://togithub.com/googleapis/python-bigquery-dataframes/issues/173)) ([0c8bd33](https://togithub.com/googleapis/python-bigquery-dataframes/commit/0c8bd33806bb99206b8b12dbdf7d7485c6ffb759)) * Temporary resources no longer use BigQuery Sessions ([#194](https://togithub.com/googleapis/python-bigquery-dataframes/issues/194)) ([4a02cac](https://togithub.com/googleapis/python-bigquery-dataframes/commit/4a02cac88c7d7b46bed1fa813a862fc2ef9ef084)) ### Bug Fixes * All sort operation are now stable ([#195](https://togithub.com/googleapis/python-bigquery-dataframes/issues/195)) ([3a2761f](https://togithub.com/googleapis/python-bigquery-dataframes/commit/3a2761f3c38d0de8b8eda47fffa15b8412aa84b0)) * Default to 7 days expiration for `read_csv`, `read_json`, `read_parquet` ([#193](https://togithub.com/googleapis/python-bigquery-dataframes/issues/193)) ([03606cd](https://togithub.com/googleapis/python-bigquery-dataframes/commit/03606cda30eb7645bfd4534460112dcca56b0ab0)) * Deprecate the `remote_service_type` in llm model ([#180](https://togithub.com/googleapis/python-bigquery-dataframes/issues/180)) ([a8a409a](https://togithub.com/googleapis/python-bigquery-dataframes/commit/a8a409ab0bd1f99dfb442df0703bf8786e0fe58e)) * For reset_index on unnamed multiindex, always use level_[n] label ([#182](https://togithub.com/googleapis/python-bigquery-dataframes/issues/182)) ([f95000d](https://togithub.com/googleapis/python-bigquery-dataframes/commit/f95000d3f88662be4d88c8b0152f1b838e99ec55)) * Match pandas behavior when assigning listlike to empty dfs ([#172](https://togithub.com/googleapis/python-bigquery-dataframes/issues/172)) ([c1d1f42](https://togithub.com/googleapis/python-bigquery-dataframes/commit/c1d1f42a21cc089877f79ebb46a39ddef6958e04)) * Use anonymous dataset instead of session dataset for temp tables ([#181](https://togithub.com/googleapis/python-bigquery-dataframes/issues/181)) ([800d44e](https://togithub.com/googleapis/python-bigquery-dataframes/commit/800d44eb5eb77da5d87b2e005f5a2ed53842e7b5)) * Use random table for `read_pandas` ([#192](https://togithub.com/googleapis/python-bigquery-dataframes/issues/192)) ([741c75e](https://togithub.com/googleapis/python-bigquery-dataframes/commit/741c75e5797e26a1487ff3da76a07953d9537f3f)) * Use random table when loading data for `read_csv`, `read_json`, `read_parquet` ([#175](https://togithub.com/googleapis/python-bigquery-dataframes/issues/175)) ([9d2e6dc](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9d2e6dc1ae4e11e80da4aabe0daa3a6044137cc6)) ### Documentation * Add code samples for `read_gbq_function` using community UDFs ([#188](https://togithub.com/googleapis/python-bigquery-dataframes/issues/188)) ([7506eab](https://togithub.com/googleapis/python-bigquery-dataframes/commit/7506eabf2e58159507809e36abfe90c417dfe92f)) * Add docstring code samples for `Series.apply` and `DataFrame.map` ([#185](https://togithub.com/googleapis/python-bigquery-dataframes/issues/185)) ([c816d84](https://togithub.com/googleapis/python-bigquery-dataframes/commit/c816d843e6f3c5a944cd4395ed0e1e91cec49812)) * Add llm kmeans notebook as an included example ([#177](https://togithub.com/googleapis/python-bigquery-dataframes/issues/177)) ([d49ae42](https://togithub.com/googleapis/python-bigquery-dataframes/commit/d49ae42a379fafd601cc94227e7f8f14b3d5f8c3)) * Use `head()` to get top `n` results, not to preview results ([#190](https://togithub.com/googleapis/python-bigquery-dataframes/issues/190)) ([87f84c9](https://togithub.com/googleapis/python-bigquery-dataframes/commit/87f84c9e58e7d0ea521ac386c9f02791cdddd19f)) --- This PR was generated with [Release Please](https://togithub.com/googleapis/release-please). See [documentation](https://togithub.com/googleapis/release-please#release-please).
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕