feat: Add Series.peek to preview data efficiently #727
Conversation
Force-pushed 5a803a4 to b3771b8
tswast left a comment:
Could we split out the session-aware caching to a separate PR or is this pretty tightly coupled to Series.peek()?
bigframes/core/pruning.py
Outdated
def cluster_cols_for_predicate(predicate: ex.Expression) -> Sequence[str]:
    """Try to determine cluster col candidates that work with given predicates."""
    # TODO: Prioritize equality predicates over ranges
Could you rephrase? It took me a while to understand what you meant.
Perhaps add that since equality is a narrower filter, it's more likely to reduce the data read if it's the first clustering filter.
Maybe this TODO should be a sort by how selective the predicates are?
Yeah, the idea is to cluster on the filter predicted to be most selective. Updated the TODO.
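To make the idea concrete, here is a minimal sketch of ranking predicates by a selectivity heuristic. The `Predicate` class and its string ops are invented for illustration and are not bigframes' actual expression types; the point is just that equality, being a narrower filter than a range, is preferred as the first clustering filter.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Predicate:
    # Hypothetical stand-in for a filter predicate on one column.
    column: str
    op: str  # "==" for equality; anything else is treated as a range comparison

def selectivity_rank(pred: Predicate) -> int:
    # Lower rank = predicted more selective. Equality predicates are ranked
    # ahead of range predicates because they are more likely to reduce the
    # data read when used as the first clustering filter.
    return 0 if pred.op == "==" else 1

def order_by_selectivity(preds: List[Predicate]) -> List[Predicate]:
    # Stable sort keeps the original order among predicates of equal rank.
    return sorted(preds, key=selectivity_rank)

preds = [Predicate("ts", ">="), Predicate("id", "=="), Predicate("ts", "<")]
print([p.column for p in order_by_selectivity(preds)])  # ['id', 'ts', 'ts']
```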
bigframes/operations/__init__.py
Outdated
@property
def pruning_compatible(self) -> bool:
    """Whether the operation preserves locality o"""
I see a hanging "o". Was that meant to be "or ..."?
I'd also like some more information for help in determining when an operation would be pruning compatible.
Actually, I didn't end up using this; removed it from the new revision. Later on I'll add some concept of an "inverse" operation to help normalize predicates. For now, only range and equality predicates between a column and a constant are considered.
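A rough sketch of that narrowed scope, with invented node classes (not bigframes' `ex.Expression` tree): a comparison yields a cluster column candidate only when exactly one side is a column reference and the other is a constant.

```python
from dataclasses import dataclass
from typing import Sequence, Union

# Hypothetical expression nodes, for illustration only.
@dataclass(frozen=True)
class ColumnRef:
    name: str

@dataclass(frozen=True)
class Constant:
    value: object

@dataclass(frozen=True)
class Comparison:
    op: str  # one of "==", "<", "<=", ">", ">="
    left: Union[ColumnRef, Constant]
    right: Union[ColumnRef, Constant]

def cluster_cols_for_predicate(pred: Comparison) -> Sequence[str]:
    # Only column-vs-constant comparisons can prune clustered data;
    # column-vs-column or constant-vs-constant comparisons yield nothing.
    if isinstance(pred.left, ColumnRef) and isinstance(pred.right, Constant):
        return [pred.left.name]
    if isinstance(pred.left, Constant) and isinstance(pred.right, ColumnRef):
        return [pred.right.name]
    return []
```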
bigframes/series.py
Outdated
Preview n arbitrary elements from the series. No guarantees about row selection or ordering.
``Series.peek(force=False)`` will always be very fast, but will not succeed if data requires
Let's try to keep the first line summary short.
- Preview n arbitrary elements from the series. No guarantees about row selection or ordering.
+ Preview n arbitrary elements from the series without guarantees about row selection or ordering.
  ``Series.peek(force=False)`` will always be very fast, but will not succeed if data requires
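A toy stand-in for the documented contract, to make the "arbitrary elements" wording concrete. This local `peek` has nothing to do with bigframes' actual execution (which reads from BigQuery); it only illustrates that n elements come back with no promise about which rows are selected or in what order.

```python
import random

def peek(values, n: int = 5):
    # Return n arbitrary elements: no guarantee about selection or ordering.
    values = list(values)
    return random.sample(values, min(n, len(values)))

# Fast preview of a large sequence; which five elements you get is unspecified.
preview = peek(range(1_000_000), 5)
print(len(preview))  # 5
```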
bigframes/session/__init__.py
Outdated
  def objects(
      self,
- ) -> collections.abc.Set[
+ ) -> Tuple[
Technically a breaking change. Maybe OK since we didn't actually document this property, but might be better to change from Set to a broader type like Iterable.
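A minimal sketch of that suggestion, with simplified names (not the actual bigframes `Session`): annotating the property with a broad read-only type such as `Iterable` lets the concrete container change between releases (set, tuple, ...) without a breaking API change.

```python
from typing import Iterable, Tuple

class Session:
    # Simplified illustration; the real session tracks live table objects.
    def __init__(self) -> None:
        self._objects: Tuple[str, ...] = ()

    @property
    def objects(self) -> Iterable[str]:
        # Callers only rely on iteration, so the backing container is free
        # to change from Set to Tuple (or anything else iterable).
        return self._objects
```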
Hmm, yeah, actually, should we just make this private? I added this very recently and wasn't really intending it for user consumption.
I worry that implementing
``Series.peek(force=False)`` will always be very fast, but will not succeed if data requires
full data scanning. Using ``force=True`` will always succeed, but may perform queries.
Query results will be cached so that future steps will benefit from these queries.
Do we need a caveat here that caching is session-aware and will attempt to cache the optimal subtree? (Not sure exactly how to phrase that in a friendlier way.)
Yeah, not sure how/if we should communicate this to users. I also don't want to lock in any specific execution strategy other than "we might cache if force=True, but we will make that cache as useful as possible using some unspecified approach".
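That contract can be sketched without committing to any execution strategy. All names here are invented: `force=False` fails fast when a full scan would be needed, `force=True` may run a query, and the result is cached so later calls over the same plan reuse it.

```python
class TooExpensiveError(Exception):
    """Raised when force=False and the preview would need a full scan."""

_cache: dict = {}

def run_query(plan: str) -> str:
    # Stand-in for actually executing a query against the backend.
    return f"result-of-{plan}"

def peek_with_cache(plan: str, needs_full_scan: bool, force: bool = False) -> str:
    if plan in _cache:
        return _cache[plan]          # a prior forced peek already paid the cost
    if needs_full_scan and not force:
        raise TooExpensiveError(plan)
    result = run_query(plan)
    _cache[plan] = result            # future steps benefit from this query
    return result
```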
bigframes/session/__init__.py
Outdated
).node
self._cached_executions[array_value.node] = cached_replacement

def _session_aware_caching(self, array_value: core.ArrayValue) -> None:
Let's verbify this.
- def _session_aware_caching(self, array_value: core.ArrayValue) -> None:
+ def _cache_with_session_awareness(self, array_value: core.ArrayValue) -> None:
bigframes/core/pruning.py
Outdated
op = predicate.op
if isinstance(op, COMPARISON_OP_TYPES):
    return cluster_cols_for_comparison(predicate.inputs[0], predicate.inputs[1])
if isinstance(op, (type(ops.invert_op))):
Let's add a TODO for geo, too. Looks like functions like st_dwithin can take advantage of clustering on geo columns. https://cloud.google.com/blog/products/data-analytics/best-practices-for-spatial-clustering-in-bigquery?e=48754805
bigframes/session/planner.py
Outdated
Returns the node to cache, and optionally a clustering column.
"""
node_counts = traversals.count_nodes(session_forest)
# These node types are cheap to re-compute
Let's complete the thought in this comment for clarity.
- # These node types are cheap to re-compute
+ # These node types are cheap to re-compute, so it makes more sense to cache their children.
bigframes/session/planner.py
Outdated
if cur_node_refs > caching_target_refs:
    caching_target, caching_target_refs = cur_node, cur_node_refs
cluster_col = None
# Just pick the first cluster-compatible predicate
TODO to sort by a selectivity heuristic? Seems like this layer might be a better place than cluster_cols_for_predicate to do that sort.
Force-pushed 5341d18 to 41f6083
bigframes/dtypes.py
Outdated
return (
    not is_array_like(type)
    and not is_struct_like(type)
    and (type not in (GEO_DTYPE, TIME_DTYPE, FLOAT_DTYPE))
Geo is clusterable but not orderable.
Should we make this an allowlist, instead? I suspect as new types are added they aren't likely to be clusterable.
Added a clusterable property to the dtype metadata struct, defaulting to False.
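A sketch of that allowlist approach, with a simplified metadata struct (not bigframes' actual one): clusterability is an opt-in flag defaulting to False, so newly added types are non-clusterable unless explicitly marked, and geography can be clusterable without being orderable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DtypeMetadata:
    # Hypothetical shape of the metadata struct; the real one carries more fields.
    name: str
    clusterable: bool = False  # allowlist: new types default to non-clusterable
    orderable: bool = False

INT64 = DtypeMetadata("Int64", clusterable=True, orderable=True)
FLOAT64 = DtypeMetadata("Float64", clusterable=False, orderable=True)
GEO = DtypeMetadata("Geography", clusterable=True, orderable=False)

def clusterable_cols(schema):
    # schema: iterable of (column_name, DtypeMetadata) pairs
    return [name for name, dtype in schema if dtype.clusterable]
```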
Force-pushed 5610df5 to 06c9866
Co-authored-by: Tim Sweña (Swast) <swast@google.com>
Force-pushed 6fed30b to 81e5a02
tswast left a comment:
LGTM with a couple of comments that would be good to resolve before merging.
bigframes/session/planner.py
Outdated
caching_target, caching_target_refs = cur_node, cur_node_refs
schema = cur_node.schema
# Cluster cols only consider the target object and not other session objects
# Note, this
Looks like this comment ended mid-sentence.
cur_node = cur_node.child
cur_node_refs = node_counts.get(cur_node, 0)
if cur_node_refs > caching_target_refs:
    caching_target, caching_target_refs = cur_node, cur_node_refs
Do we need to do anything to make sure we aren't selecting more columns than needed? I have some worries that column selection wouldn't have the desired effect.
Though, I suppose that'll only matter with unordered + unindexed DataFrames due to our hashing of the row. Maybe worth a TODO to be resolved with that project?
That said, I'd be curious to see if unordered/unindexed would benefit from caching at all due to the difficulties of using the cache in row identity joins.
Row hashing shouldn't matter, as that only happens for initial table scan, which shouldn't need to be cached. However, yes, we could try to prune columns unused by the session before caching. Would need to be careful not to invalidate existing caching or join->projection rewriter, but should be possible. This could be done in a few ways, such as a partial cache (containing only some columns), or by rewriting all the session BFETs with a column pruning pass before caching.
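The traversal quoted above can be sketched as follows. The `Node` shape and the node names are invented; the logic mirrors the quoted diff: walk down through nodes that are cheap to re-compute and cache the descendant with the highest reference count across the session forest.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class Node:
    # Hypothetical plan node; the real planner walks bigframes expression trees.
    name: str
    cheap: bool                  # cheap to re-compute => prefer caching a child
    child: Optional["Node"] = None

def choose_cache_target(root: Node, node_counts: Dict[str, int]) -> Node:
    # Start with the root, then descend through cheap nodes, keeping the
    # node referenced most often by the session as the caching target.
    caching_target, caching_target_refs = root, node_counts.get(root.name, 0)
    cur_node = root
    while cur_node.cheap and cur_node.child is not None:
        cur_node = cur_node.child
        cur_node_refs = node_counts.get(cur_node.name, 0)
        if cur_node_refs > caching_target_refs:
            caching_target, caching_target_refs = cur_node, cur_node_refs
    return caching_target
```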