Conversation
00eed4b to
fe82c6d
Compare
|
Heads up: I suspect we might encounter some conflicts after #1448 merges. I think that |
bigframes/core/blocks.py
Outdated
| df = pd.DataFrame( | ||
| data={ | ||
| "dry_run_stats": [ | ||
| *self.dtypes, | ||
| tuple(index_types) if len(index_types) > 1 else index_types[0], | ||
| query_job.total_bytes_processed, | ||
| ] | ||
| }, | ||
| index=[*self.column_labels, "[index]", "total_bytes_processed"], | ||
| ) |
There was a problem hiding this comment.
I find this pretty confusing. I think it would be more typical to have the job properties as separate columns.
Alternatively, we could return a pd.Series if we do want a 1 dimension object.
bigframes/core/blocks.py
Outdated
| query_job.total_bytes_processed, | ||
| ] | ||
| }, | ||
| index=[*self.column_labels, "[index]", "total_bytes_processed"], |
There was a problem hiding this comment.
Why always [index]? Sometimes the index has a name. Also, sometimes the index is a multi-index.
bigframes/core/blocks.py
Outdated
| df = pd.DataFrame( | ||
| data={ | ||
| "dry_run_stats": [ | ||
| *self.dtypes, |
There was a problem hiding this comment.
This feels a bit risky to me. What if the columns has the name total_bytes_processed? I would rather see dtypes as it's own object column/row that contains the predicted dtypes as a single object.
bigframes/core/blocks.py
Outdated
| data={ | ||
| "dry_run_stats": [ | ||
| *self.dtypes, | ||
| tuple(index_types) if len(index_types) > 1 else index_types[0], |
There was a problem hiding this comment.
Same here, maybe introduce index_dtypes and have that be the whole object of 0+ dtypes for a potential multi-index.
|
I used a DataFrame for dry run stats, just like what It looks like this: https://screenshot.googleplex.com/gEEBU7aMqUy93SB Plus, I applied some "@overload" magic to make Let me know your thoughts @tswast |
|
Here's what I had in mind: import copy
import pandas as pd
from google.colab import auth
from google.cloud import bigquery
# Authenticate the user
auth.authenticate_user()
# Initialize a BigQuery client
client = bigquery.Client(project='bigframes-dev') # Replace with your project ID
job_config = bigquery.QueryJobConfig()
job_config.dry_run = True
query = """
SELECT
name,
SUM(number) AS total
FROM
`bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY
name
ORDER BY
total DESC
LIMIT
10;
"""
job = client.query(query, job_config=job_config)
job_api_repr = copy.deepcopy(job._properties)
print(job_api_repr)
index = []
values = []
query_configuration = job_api_repr['configuration'].pop("query")
index.extend(query_configuration.keys())
values.extend(query_configuration.values())
query_statistics = job_api_repr['statistics'].pop("query")
index.extend(query_statistics.keys())
values.extend(query_statistics.values())
remaining_statistics = job_api_repr['statistics']
index.extend(remaining_statistics.keys())
values.extend(remaining_statistics.values())
series = pd.Series(values, index=index)
print(series) |
|
Re: #1436 (comment) It would need
|
|
Now the stats look like these: https://screenshot.googleplex.com/BhaUb8uTYrdz7iM I hard-coded all the keys for value lookup in the dry run job. |
tswast
left a comment
There was a problem hiding this comment.
Please find another way to make the type checker happy. I dislike the repeated code.
bigframes/core/blocks.py
Outdated
|
|
||
| if dry_run: | ||
| if sampling.enable_downsampling: | ||
| raise NotImplementedError("Dry run with sampling is not supproted") |
There was a problem hiding this comment.
| raise NotImplementedError("Dry run with sampling is not supproted") | |
| raise NotImplementedError("Dry run with sampling is not supported") |
bigframes/core/blocks.py
Outdated
| series, query_job = self._block.select_columns([]).to_pandas( | ||
| ordered=ordered, | ||
| allow_large_results=allow_large_results, | ||
| dry_run=dry_run, | ||
| ) | ||
| return series, query_job | ||
|
|
||
| df, query_job = self._block.select_columns([]).to_pandas( |
There was a problem hiding this comment.
The select_columns([]) confuses me, but I see that was here before. Please refactor these a bit so that self._block.select_columns([]) is saved to a variable since it is in common with both.
There was a problem hiding this comment.
Alternatively, we can get rid of this if statement and rename the variable df_or_series.
bigframes/core/blocks.py
Outdated
| if dry_run: | ||
| series, query_job = self._block.select_columns([]).to_pandas( | ||
| ordered=ordered, | ||
| allow_large_results=allow_large_results, |
There was a problem hiding this comment.
I think allow_large_results shouldn't have an effect on dry run queries, as that controls the destination table property.
bigframes/core/blocks.py
Outdated
|
|
||
| df, query_job = self._block.select_columns([]).to_pandas( | ||
| ordered=ordered, allow_large_results=allow_large_results | ||
| ordered=ordered, allow_large_results=allow_large_results, dry_run=dry_run |
There was a problem hiding this comment.
Why include the dry_run argument here if we know it's false?
bigframes/core/blocks.py
Outdated
|
|
||
| df, query_job = self._block.select_columns([]).to_pandas( | ||
| ordered=ordered, allow_large_results=allow_large_results | ||
| ordered=ordered, allow_large_results=allow_large_results, dry_run=dry_run |
There was a problem hiding this comment.
Why include the dry_run argument here if we know it's false?
bigframes/core/indexes/base.py
Outdated
| self._query_job = query_job | ||
| return series | ||
|
|
||
| # Repeat the to_pandas() call to make mypy deduce type correctly, because mypy cannot resolve |
There was a problem hiding this comment.
Why don't you just use bool consistently, then?
Yeah that's true... I moved the |
|
Presubmits failed: Hard to say if it's related to this PR. |
* feat: Support dry_run in * centralize dry_run logics at block level * fix lint errors * remove unnecessary code * use dataframe for dry_run stats * flatten the job stats to a series * fix lint * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * fix query job issue * Make pandas surface directly call block._compute_dry_run * type hint update --------- Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
No description provided.