feat: Add filters and columns arguments to read_gbq for enhanced data querying#198
feat: Add filters and columns arguments to read_gbq for enhanced data querying#198
Conversation
bigframes/pandas/__init__.py
Outdated
| index_col: Iterable[str] | str = (), | ||
| col_order: Iterable[str] = (), | ||
| max_results: Optional[int] = None, | ||
| filters: Optional[List[Tuple]] = None, |
There was a problem hiding this comment.
Let's make this an Iterable[Tuple] instead. I presume empty list of filters is semantically the same as None, right?
Also, we can document the supported operators with a Literal type.
| filters: Optional[List[Tuple]] = None, | |
| filters: Iterable[Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]] = (), |
bigframes/session/__init__.py
Outdated
| index_col: Iterable[str] | str = (), | ||
| col_order: Iterable[str] = (), | ||
| max_results: Optional[int] = None, | ||
| filters: Optional[List[Tuple]] = None |
There was a problem hiding this comment.
Likewise, let's update these types.
| filters: Optional[List[Tuple]] = None | |
| filters: Iterable[Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]] = (), |
bigframes/session/__init__.py
Outdated
| if (filters is None) or (len(filters) == 0): | ||
| return query_or_table | ||
|
|
||
| valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="] |
There was a problem hiding this comment.
Please use pandas / python syntax. We can transform to SQL with a dictionary.
| valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="] | |
| valid_operators = { | |
| "in": "IN", | |
| "not in": "NOT IN", | |
| "==": "=", | |
| ">": ">", | |
| "<": "<", | |
| ">=": ">=", | |
| "<=": "<=", | |
| "!=": "!=", | |
| } |
bigframes/session/__init__.py
Outdated
|
|
||
| where_clause = " WHERE " + " OR ".join(grouped_expressions) | ||
|
|
||
| full_query = f"SELECT * FROM {sub_query} AS sub{where_clause}" |
There was a problem hiding this comment.
I'd like to include the column filter here too, please.
There was a problem hiding this comment.
Sorry, I meant using col_order which is already used as a column filter.
bigframes/session/__init__.py
Outdated
| value_list = ", ".join( | ||
| [f'"{v}"' if isinstance(v, str) else str(v) for v in value] | ||
| ) | ||
| expression = f"{column} {operator} ({value_list})" |
There was a problem hiding this comment.
Column names could contain spaces. We need to enclose them in backticks.
| expression = f"{column} {operator} ({value_list})" | |
| expression = f"`{column}` {operator} ({value_list})" |
bigframes/session/__init__.py
Outdated
| f"Value for operator {operator} should be a list." | ||
| ) | ||
| value_list = ", ".join( | ||
| [f'"{v}"' if isinstance(v, str) else str(v) for v in value] |
There was a problem hiding this comment.
| [f'"{v}"' if isinstance(v, str) else str(v) for v in value] | |
| [repr(v) for v in value] |
bigframes/session/__init__.py
Outdated
| value = f'"{value}"' if isinstance(value, str) else value | ||
| expression = f"{column} {operator} {value}" |
There was a problem hiding this comment.
Column names could contain spaces. We need to enclose them in backticks.
| value = f'"{value}"' if isinstance(value, str) else value | |
| expression = f"{column} {operator} {value}" | |
| expression = f"`{column}` {operator} {repr(value)}" |
bigframes/session/__init__.py
Outdated
| if not isinstance(group, list): | ||
| raise ValueError("Each filter group should be a list.") | ||
|
|
There was a problem hiding this comment.
Oh, I see: it's for the OR. You can ignore this feedback. :-)
There was a problem hiding this comment.
Changed the list names to or_expressions and and_expressions, to make the logic more clear.
bigframes/session/__init__.py
Outdated
| if not isinstance(filters, list): | ||
| raise ValueError("Filters should be a list.") |
There was a problem hiding this comment.
Don't need to check this. Any Iterable should be fine.
bigframes/session/__init__.py
Outdated
| if not ( | ||
| all(isinstance(item, list) for item in filters) | ||
| or all(isinstance(item, tuple) for item in filters) | ||
| ): | ||
| raise ValueError( | ||
| "All items in filters should be either all lists or all tuples." | ||
| ) |
There was a problem hiding this comment.
Better to just catch this when we encounter it so we can let them know which item was incorrect.
bigframes/session/__init__.py
Outdated
| if not isinstance(value, list): | ||
| raise ValueError( | ||
| f"Value for operator {operator} should be a list." | ||
| ) |
There was a problem hiding this comment.
We don't need this check. Any iterable should be OK.
| If set, limit the maximum number of rows to fetch from the | ||
| query results. | ||
| filters (List[Tuple], default []): To filter out data. Filter syntax: | ||
| [[(column, op, val), …],…] where op is [=, >, >=, <, <=, !=, in, |
There was a problem hiding this comment.
We are doing a type annotation of List[Tuple], but [[(column, op, val), …],…] looks like a List[List[Tuple]]?
There was a problem hiding this comment.
Sorry, thought I have made change here.
bigframes/session/__init__.py
Outdated
| api_name="read_gbq", | ||
| ) | ||
|
|
||
| def _filters_to_query(self, query_or_table, filters): |
There was a problem hiding this comment.
Seems like a great candidate to write unit tests (string-in-string-out). In fact, at a later point it can go in a sql helper module like bigframes/ml/sql.py.
There was a problem hiding this comment.
Moved to unit test.
| FiltersType = ( | ||
| Iterable[ | ||
| Union[ | ||
| Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any], |
There was a problem hiding this comment.
Looks like this itself can be defined outside and reused for better readability
FilterType = Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]
FiltersType = Iterable[Union[FilterType, Iterable[FilterType]]]
tests/system/small/test_session.py
Outdated
| pytest.param( | ||
| "{scalars_table_id}", | ||
| [ | ||
| (("rowindex", "not in", [0, 6])), |
There was a problem hiding this comment.
double parentheses here is effectively same as single parentheses, did you mean it to represent a test case with Iterable[Iterable[Tuple]] filter type, then need to add a comma - (("rowindex", "not in", [0, 6]),)
| sets of filters through an OR operation. A single list of tuples | ||
| can also be used, meaning that no OR operation between set of | ||
| filters is to be conducted. | ||
| filters (Iterable[Iterable[[Tuple]], default ()): To filter out data. |
There was a problem hiding this comment.
Looks like this would be Iterable[Union[Tuple, Iterable[Tuple]]]
| sets of filters through an OR operation. A single list of tuples | ||
| can also be used, meaning that no OR operation between set of | ||
| filters is to be conducted. | ||
| filters (Iterable[Iterable[[Tuple]], default ()): To filter out data. |
There was a problem hiding this comment.
Would be great to add a code sample (in the EXAMPLES section)
tswast
left a comment
There was a problem hiding this comment.
@Genesis929 Please resolve the conflicts and ping me when this is ready for another review. Thanks.
bigframes/pandas/__init__.py
Outdated
| index_col: Iterable[str] | str = (), | ||
| col_order: Iterable[str] = (), | ||
| max_results: Optional[int] = None, | ||
| columns: Iterable[str] = (), |
There was a problem hiding this comment.
Please pull the columns change out into a separate PR to be reviewed outside of the filters change. IMO, columns and col_order are redundant and both should select a subset of columns. We need to keep both for compatibility see: googleapis/python-bigquery-pandas#701
bigframes/session/__init__.py
Outdated
| # Add a verify index argument that fails if the index is not unique. | ||
| ) -> dataframe.DataFrame: | ||
| # TODO(b/281571214): Generate prompt to show the progress of read_gbq. | ||
| query_or_table = self._filters_to_query(query_or_table, columns, filters) |
There was a problem hiding this comment.
Please use col_order here. In a future PR, we can add columns as an alias for col_order. See: googleapis/python-bigquery-pandas#701
| query_or_table = self._filters_to_query(query_or_table, columns, filters) | |
| query_or_table = self._filters_to_query(query_or_table, col_order, filters) |
There was a problem hiding this comment.
Done, changed in read_gbq.
| >>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant'])] | ||
| >>> df = bpd.read_gbq( | ||
| ... "bigquery-public-data.baseball.games_wide", | ||
| ... columns=columns, |
There was a problem hiding this comment.
In a future PR, let's make columns an alias for col_order.
| ... columns=columns, | |
| ... col_order=columns, |
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes b/299514019 🦕