perf: Directly read gbq table for simple plans#1607
Conversation
3a9dc4b to
84c4aa0
Compare
84c4aa0 to
83312f0
Compare
| @@ -574,6 +583,31 @@ def with_id(self, id: identifiers.ColumnId) -> ScanItem: | |||
| class ScanList: | |||
There was a problem hiding this comment.
Could we get a docstring for ScanList, please?
bigframes/core/nodes.py
Outdated
| class ScanList: | ||
| items: typing.Tuple[ScanItem, ...] | ||
|
|
||
| def filter( |
There was a problem hiding this comment.
Should we call this project to align with https://en.wikipedia.org/wiki/Projection_(relational_algebra) ?
There was a problem hiding this comment.
renamed to project
There was a problem hiding this comment.
actually renamed this one to filter_cols
bigframes/core/nodes.py
Outdated
| ) -> ScanList: | ||
| result = ScanList(tuple(item for item in self.items if item.id in ids)) | ||
| if len(result.items) == 0: | ||
| # We need to select something, or stuff breaks |
There was a problem hiding this comment.
Ideally we'd pick the smallest size column, right? Let's add a TODO if so. Or does this column not actually get scanned?
Would be good to describe more what stuff breaks, too.
There was a problem hiding this comment.
specified that its the sql syntax that breaks. The columns won't actually get scanned, as it will just be pruned out by the optimizer anyways. Amended commend to say that its the sql syntax
bigframes/core/nodes.py
Outdated
| result = ScanList(self.items[:1]) | ||
| return result | ||
|
|
||
| def select( |
There was a problem hiding this comment.
A docstring would be great. Are we talking about https://en.wikipedia.org/wiki/Selection_(relational_algebra) (aka row predicates) here?
| @@ -0,0 +1,81 @@ | |||
| from typing import Any, Optional | |||
There was a problem hiding this comment.
Needs license header.
| from typing import Any, Optional | |
| # Copyright 2025 Google LLC | |
| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. | |
| from __future__ import annotations | |
| from typing import Any, Optional |
| from bigframes.session import executor, semi_executor | ||
|
|
||
|
|
||
| class ReadApiSemiExecutor(semi_executor.SemiExecutor): |
There was a problem hiding this comment.
A docstring would be great. I assume the "Semi" part means that not all plans/expressions are supported?
There was a problem hiding this comment.
Added docstring to SemiExecutor abc, as well as ReadApiSemiExecutor. And yeah, SemiExecutor means exactly that, they can execute a subset of possible plans.
| if node.source.sql_predicate: | ||
| read_options["row_restriction"] = node.source.sql_predicate |
There was a problem hiding this comment.
Aside: How are we protecting against plans that we can't run. Hopefully we've got an allowlist of properties. I'm worried about adding a property to BigFrameNode and forgetting to add it to the executor.
There was a problem hiding this comment.
Ultimately, each semi-executor has to specify the subset of plans it can or cannot execute, whether through allowlists, if-else statements, or some other mechanism. The real problem is if it drifts out of sync with the expected reference behavior. This shouldn't be a problem for now, as all tests that can use this semi-executor, will, and will catch issues. However, as we add more semi-executors, we should directly compare the results against the reference implementation, and maybe even do some fuzzing.
| return None | ||
|
|
||
| import google.cloud.bigquery_storage_v1.types as bq_storage_types | ||
| from google.protobuf.timestamp_pb2 import Timestamp |
There was a problem hiding this comment.
Nit: Use import statements for packages and modules only, not for individual types, classes, or functions.
https://google.github.io/styleguide/pyguide.html#22-imports
Timestamp isn't in the list of excemptions. https://google.github.io/styleguide/pyguide.html#2241-exemptions
bigframes/core/nodes.py
Outdated
| return 0 | ||
|
|
||
| @property | ||
| def double_references_ids(self) -> bool: |
There was a problem hiding this comment.
naming nit: maybe "has_multi_reference_ids" is better, since the return type is bool?
Plus, technically speaking you can have triple references too 🧐
|
|
||
|
|
||
| def try_reduce_to_table_scan(node: nodes.BigFrameNode) -> Optional[nodes.ReadTableNode]: | ||
| if not all( |
There was a problem hiding this comment.
Shall we re-write this into a plain-old "for" loop?
for n in node.unique_nodes():
if not isinstance(n, ...):
return None
This might be more readable.
There was a problem hiding this comment.
yeah, makes sense, I tend too much towards "everything is an expression". fixed
bigframes/session/semi_executor.py
Outdated
| # Unstable interface, in development | ||
| class SemiExecutor(abc.ABC): | ||
| """ | ||
| A semi executor executes a subset of possible plans, returns None if it cannot execute a given plan. |
There was a problem hiding this comment.
"returns None if it cannot execute a given plan."
Maybe it should be "... if cannot execute any/all given plans?"
There was a problem hiding this comment.
Rephrased, but it is give a single (logical) plan at a time, and if it cannot find a way to physically execute it, it returns None.
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕