feat: Add Series.peek to preview data efficiently #727
Conversation
Force-pushed 5a803a4 to b3771b8
tswast left a comment:
Could we split out the session-aware caching to a separate PR or is this pretty tightly coupled to Series.peek()?
bigframes/core/pruning.py
Outdated
def cluster_cols_for_predicate(predicate: ex.Expression) -> Sequence[str]:
    """Try to determine cluster col candidates that work with given predicates."""
    # TODO: Prioritize equality predicates over ranges
Could you rephrase? It took me a while to understand what you meant.
Perhaps add that since equality is a narrower filter, it's more likely to reduce the data read if it's the first clustering filter.
Maybe this TODO should be a sort by how selective the predicates are?
Yeah, the idea is to cluster on the filter predicted to be most selective. Updated the TODO.
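To make the idea concrete, here is a minimal sketch of ranking predicates by a selectivity heuristic. The `Predicate` class and its string ops are invented for illustration and are not bigframes' actual expression types; the point is just that equality, being a narrower filter than a range, is preferred as the first clustering filter.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Predicate:
    # Hypothetical stand-in for a filter predicate on one column.
    column: str
    op: str  # "==" for equality; anything else is treated as a range comparison

def selectivity_rank(pred: Predicate) -> int:
    # Lower rank = predicted more selective. Equality predicates are ranked
    # ahead of range predicates because they are more likely to reduce the
    # data read when used as the first clustering filter.
    return 0 if pred.op == "==" else 1

def order_by_selectivity(preds: List[Predicate]) -> List[Predicate]:
    # Stable sort keeps the original order among predicates of equal rank.
    return sorted(preds, key=selectivity_rank)

preds = [Predicate("ts", ">="), Predicate("id", "=="), Predicate("ts", "<")]
print([p.column for p in order_by_selectivity(preds)])  # ['id', 'ts', 'ts']
```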
bigframes/operations/__init__.py
Outdated
@property
def pruning_compatible(self) -> bool:
    """Whether the operation preserves locality o"""
I see a hanging "o". Was that meant to be "or ..."?
I'd also like some more information for help in determining when an operation would be pruning compatible.
Actually, I didn't end up using this; removed it from the new revision. Later on I'll add some concept of an "inverse" operation to help normalize predicates. For now, only range and equality predicates between a column and a constant are considered.
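A rough sketch of that narrowed scope, with invented node classes (not bigframes' `ex.Expression` tree): a comparison yields a cluster column candidate only when exactly one side is a column reference and the other is a constant.

```python
from dataclasses import dataclass
from typing import Sequence, Union

# Hypothetical expression nodes, for illustration only.
@dataclass(frozen=True)
class ColumnRef:
    name: str

@dataclass(frozen=True)
class Constant:
    value: object

@dataclass(frozen=True)
class Comparison:
    op: str  # one of "==", "<", "<=", ">", ">="
    left: Union[ColumnRef, Constant]
    right: Union[ColumnRef, Constant]

def cluster_cols_for_predicate(pred: Comparison) -> Sequence[str]:
    # Only column-vs-constant comparisons can prune clustered data;
    # column-vs-column or constant-vs-constant comparisons yield nothing.
    if isinstance(pred.left, ColumnRef) and isinstance(pred.right, Constant):
        return [pred.left.name]
    if isinstance(pred.left, Constant) and isinstance(pred.right, ColumnRef):
        return [pred.right.name]
    return []
```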
bigframes/series.py
Outdated
Preview n arbitrary elements from the series. No guarantees about row selection or ordering.
``Series.peek(force=False)`` will always be very fast, but will not succeed if data requires
Let's try to keep the first line summary short.
- Preview n arbitrary elements from the series. No guarantees about row selection or ordering.
+ Preview n arbitrary elements from the series without guarantees about row selection or ordering.
  ``Series.peek(force=False)`` will always be very fast, but will not succeed if data requires
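A toy stand-in for the documented contract, to make the "arbitrary elements" wording concrete. This local `peek` has nothing to do with bigframes' actual execution (which reads from BigQuery); it only illustrates that n elements come back with no promise about which rows are selected or in what order.

```python
import random

def peek(values, n: int = 5):
    # Return n arbitrary elements: no guarantee about selection or ordering.
    values = list(values)
    return random.sample(values, min(n, len(values)))

# Fast preview of a large sequence; which five elements you get is unspecified.
preview = peek(range(1_000_000), 5)
print(len(preview))  # 5
```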
bigframes/session/__init__.py
Outdated
  def objects(
      self,
- ) -> collections.abc.Set[
+ ) -> Tuple[
Technically a breaking change. Maybe OK since we didn't actually document this property, but might be better to change from Set to a broader type like Iterable.
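A minimal sketch of that suggestion, with simplified names (not the actual bigframes `Session`): annotating the property with a broad read-only type such as `Iterable` lets the concrete container change between releases (set, tuple, ...) without a breaking API change.

```python
from typing import Iterable, Tuple

class Session:
    # Simplified illustration; the real session tracks live table objects.
    def __init__(self) -> None:
        self._objects: Tuple[str, ...] = ()

    @property
    def objects(self) -> Iterable[str]:
        # Callers only rely on iteration, so the backing container is free
        # to change from Set to Tuple (or anything else iterable).
        return self._objects
```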
Hmm, yeah, actually, should we just make this private? I added this very recently and wasn't really intending it for user consumption.
I worry that implementing
``Series.peek(force=False)`` will always be very fast, but will not succeed if data requires
full data scanning. Using ``force=True`` will always succeed, but may perform queries.
Query results will be cached so that future steps will benefit from these queries.
Do we need a caveat here that caching is session-aware and will attempt to cache the optimal subtree? (Not sure exactly how to phrase that in a friendlier way.)
Yeah, not sure how/if we should communicate this to users. I also don't want to lock in any specific execution strategy other than "we might cache if force=True, but we will make that cache as useful as possible using some unspecified approach".
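That contract can be sketched without committing to any execution strategy. All names here are invented: `force=False` fails fast when a full scan would be needed, `force=True` may run a query, and the result is cached so later calls over the same plan reuse it.

```python
class TooExpensiveError(Exception):
    """Raised when force=False and the preview would need a full scan."""

_cache: dict = {}

def run_query(plan: str) -> str:
    # Stand-in for actually executing a query against the backend.
    return f"result-of-{plan}"

def peek_with_cache(plan: str, needs_full_scan: bool, force: bool = False) -> str:
    if plan in _cache:
        return _cache[plan]          # a prior forced peek already paid the cost
    if needs_full_scan and not force:
        raise TooExpensiveError(plan)
    result = run_query(plan)
    _cache[plan] = result            # future steps benefit from this query
    return result
```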
bigframes/session/__init__.py
Outdated
).node
self._cached_executions[array_value.node] = cached_replacement

def _session_aware_caching(self, array_value: core.ArrayValue) -> None:
Let's verbify this.
- def _session_aware_caching(self, array_value: core.ArrayValue) -> None:
+ def _cache_with_session_awareness(self, array_value: core.ArrayValue) -> None:
bigframes/core/pruning.py
Outdated
op = predicate.op
if isinstance(op, COMPARISON_OP_TYPES):
    return cluster_cols_for_comparison(predicate.inputs[0], predicate.inputs[1])
if isinstance(op, (type(ops.invert_op))):
Let's add a TODO for geo, too. Looks like functions like st_dwithin can take advantage of clustering on geo columns. https://cloud.google.com/blog/products/data-analytics/best-practices-for-spatial-clustering-in-bigquery?e=48754805
bigframes/session/planner.py
Outdated
Returns the node to cache, and optionally a clustering column.
"""
node_counts = traversals.count_nodes(session_forest)
# These node types are cheap to re-compute
Let's complete the thought in this comment for clarity.
- # These node types are cheap to re-compute
+ # These node types are cheap to re-compute, so it makes more sense to cache their children.
bigframes/session/planner.py
Outdated
if cur_node_refs > caching_target_refs:
    caching_target, caching_target_refs = cur_node, cur_node_refs
cluster_col = None
# Just pick the first cluster-compatible predicate
TODO to sort by a selectivity heuristic? Seems like this layer might be a better place than cluster_cols_for_predicate to do that sort.
Force-pushed 5341d18 to 41f6083
bigframes/dtypes.py
Outdated
return (
    not is_array_like(type)
    and not is_struct_like(type)
    and (type not in (GEO_DTYPE, TIME_DTYPE, FLOAT_DTYPE))
Geo is clusterable but not orderable.
Should we make this an allowlist, instead? I suspect as new types are added they aren't likely to be clusterable.
Added a clusterable property to the dtype metadata struct, defaulting to False.
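A sketch of that allowlist approach, with a simplified metadata struct (not bigframes' actual one): clusterability is an opt-in flag defaulting to False, so newly added types are non-clusterable unless explicitly marked, and geography can be clusterable without being orderable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DtypeMetadata:
    # Hypothetical shape of the metadata struct; the real one carries more fields.
    name: str
    clusterable: bool = False  # allowlist: new types default to non-clusterable
    orderable: bool = False

INT64 = DtypeMetadata("Int64", clusterable=True, orderable=True)
FLOAT64 = DtypeMetadata("Float64", clusterable=False, orderable=True)
GEO = DtypeMetadata("Geography", clusterable=True, orderable=False)

def clusterable_cols(schema):
    # schema: iterable of (column_name, DtypeMetadata) pairs
    return [name for name, dtype in schema if dtype.clusterable]
```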
Force-pushed 5610df5 to 06c9866
Co-authored-by: Tim Sweña (Swast) <swast@google.com>
Force-pushed 6fed30b to 81e5a02
tswast left a comment:
LGTM with a couple of comments that would be good to resolve before merging.
bigframes/session/planner.py
Outdated
caching_target, caching_target_refs = cur_node, cur_node_refs
schema = cur_node.schema
# Cluster cols only consider the target object and not other session objects
# Note, this
Looks like this comment ended mid-sentence.
cur_node = cur_node.child
cur_node_refs = node_counts.get(cur_node, 0)
if cur_node_refs > caching_target_refs:
    caching_target, caching_target_refs = cur_node, cur_node_refs
Do we need to do anything to make sure we aren't selecting more columns than needed? I have some worries that column selection wouldn't have the desired effect.
Though, I suppose that'll only matter with unordered + unindexed DataFrames due to our hashing of the row. Maybe worth a TODO to be resolved with that project?
That said, I'd be curious to see if unordered/unindexed would benefit from caching at all due to the difficulties of using the cache in row identity joins.
Row hashing shouldn't matter, as that only happens for initial table scan, which shouldn't need to be cached. However, yes, we could try to prune columns unused by the session before caching. Would need to be careful not to invalidate existing caching or join->projection rewriter, but should be possible. This could be done in a few ways, such as a partial cache (containing only some columns), or by rewriting all the session BFETs with a column pruning pass before caching.
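The traversal quoted above can be sketched as follows. The `Node` shape and the node names are invented; the logic mirrors the quoted diff: walk down through nodes that are cheap to re-compute and cache the descendant with the highest reference count across the session forest.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class Node:
    # Hypothetical plan node; the real planner walks bigframes expression trees.
    name: str
    cheap: bool                  # cheap to re-compute => prefer caching a child
    child: Optional["Node"] = None

def choose_cache_target(root: Node, node_counts: Dict[str, int]) -> Node:
    # Start with the root, then descend through cheap nodes, keeping the
    # node referenced most often by the session as the caching target.
    caching_target, caching_target_refs = root, node_counts.get(root.name, 0)
    cur_node = root
    while cur_node.cheap and cur_node.child is not None:
        cur_node = cur_node.child
        cur_node_refs = node_counts.get(cur_node.name, 0)
        if cur_node_refs > caching_target_refs:
            caching_target, caching_target_refs = cur_node, cur_node_refs
    return caching_target
```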