Refactor PyArrow DataFiles Projection functions#1043
Conversation
kevinjqliu
left a comment
There was a problem hiding this comment.
Thanks for working on this! I like the overall approach. Added a few comments.
I'm curious what the difference between project_table and project_batches from an API standpoint (and if we should document that)
| def _fs_from_file_path(file_path: str, io: FileIO) -> FileSystem: | ||
| scheme, netloc, _ = _parse_location(file_path) | ||
| if isinstance(io, PyArrowFileIO): | ||
| return io.fs_by_scheme(scheme, netloc) | ||
| else: | ||
| try: | ||
| from pyiceberg.io.fsspec import FsspecFileIO | ||
|
|
||
| if isinstance(io, FsspecFileIO): | ||
| from pyarrow.fs import PyFileSystem | ||
|
|
||
| return PyFileSystem(FSSpecHandler(io.get_fs(scheme))) | ||
| else: | ||
| raise ValueError(f"Expected PyArrowFileIO or FsspecFileIO, got: {io}") | ||
| except ModuleNotFoundError as e: | ||
| # When FsSpec is not installed | ||
| raise ValueError(f"Expected PyArrowFileIO or FsspecFileIO, got: {io}") from e |
There was a problem hiding this comment.
Yes, copied it over since it'll be deprecated!
| Returns an Iterator of pa.RecordBatch with data from the Iceberg table | ||
| by resolving the right columns that match the current table schema. | ||
| Only data that matches the provided row_filter expression is returned. | ||
|
|
There was a problem hiding this comment.
nit: add a snippet about why use this instead of project_table
| ) -> None: | ||
| self._table_metadata = table_metadata | ||
| self._io = io | ||
| self._fs = _fs_from_file_path(table_metadata.location, io) # TODO: use different FileSystem per file |
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM!! a few minor comments.
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
| _limit: Optional[int] | ||
| """Scan the Iceberg Table and create an Arrow construct. | ||
|
|
||
| Attributes: |
There was a problem hiding this comment.
nit: More of a style thing. I think it is more valuable to have a docstring at the __init__ method since that's wat probably will show up people's IDE.
Fokko
left a comment
There was a problem hiding this comment.
This looks good, some much needed refactoring, thanks @sungwy for picking this up, and thanks @corleyma, @kevinjqliu and @ndrluis for the review!
* refactoring * refactor more * docstring * apache#1042 * adopt review feedback * thanks Kevin! Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com> --------- Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
* refactoring * refactor more * docstring * apache#1042 * adopt review feedback * thanks Kevin! Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com> --------- Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
No description provided.