feat: pagination + metadata for csv and excel readers [TCTC-1810] (#77)
* feat: preview on huge files (excel and csv) - added two new readers (csv and excel) with custom preview features - updates allowed methods depending on preview arguments
* fix: clean lint
* fix: fix pandas.read_csv
* feat: with pd.excel
* feat: update code from params and optim
* fix: updates on read_excel
* feat: add generator for xldr reader + offset/limit
* feat: refactor read_excel
* feat: clean numpy typing
* fix: fix dependency
* fix: fix tests
* fix: remove kwargs from read_csv
* fix: removed numpy + allow backward compatibility
* refactor: removed preview_args in favor of preview
* feat: add keep_default_na to read_csv through read_pandas
* fix: fix tests - test_basic_excel was dealing with a multi-sheet excel file; fixed by creating a new fixture file with only one sheet, 'fixture-single-sheet.xlsx' - reset the initial assert value in test_csv_with_sep
* fix: fix read_pandas
* fix: fix read_csv with a chunksize
* fix: fix read_excel and clean
* fix: attempt to fix test_ftp
* feat: add coverage for preview
* cov: add preview params
* cov: add tests for coverage on nrows and skiprows
* feat: added EXCEL_TYPE enum class
* test: added tests for new file format type
* feat: catch limit/offset once
* Update peakina/readers/csv.py (Co-authored-by: Eric Jolibois <[email protected]>)
* feat: updates from PR review
* test: added nrows/skiprows tests
* prr: ordering imports in readers
* prr: keep_default_na as bool instead of Any
* prr: add common module for readers - created readers/common.py to share classes between some readers
* prr: removed context manager from read_csv for chunks
* prr: clean docstrings
* prr: PreviewArgs as a dataclass
* prr: prioritize returning OLD xls format over NEW one
* feat: add Generator type - add Generator typing - split read_sheet into multiple submethods
* prr: renamed iterator methods
* prr: renaming
* prr: excel_type check uses `is` instead of equals
* prr: updates from PR review
* fix: csv reader preview/paginate - fix csv reader (by adding the right columns) - add descriptive tests
* tests: update fixture for relevant tests on pagination preview (last fixture row reordered from `3 4` to `4 3`)
* refactor: move the extract-columns helper because it is only used by read_csv
* fix: fix nrows/skiprows ambiguity
* fix: extract column names to properly handle preview
* cov: added coverage tests
* fix: fix wraps args
* fix: reduce unnecessary loop
* test: added test coverage
* Update peakina/readers/excel.py (Co-authored-by: David Nowinsky <[email protected]>, twice)
* prr: updates from PR review
* chores: rename `io` package so it is possible to debug tests (the `io` name clashes with the native `io` module)
* feat(excel files): preview args are split into 2 arguments; the `io` package gets its old name back
* feat(excel files): add skiprows to the list of allowed kwargs
* feat(excel files): the first row, which should contain headers, is now always skipped
* feat(pagination): pagination arguments are now passed to (and then ignored by) the read_json and read_xml functions
* feat(pagination): fix None error
* feat(pagination): downloaded files now keep their extensions
* feat(pagination): removes the new way of handling excel files by transforming them to CSV first; keeps only the bare minimum to handle pagination on peakina's side
* feat(pagination): remove irrelevant tests
* feat(pagination): copy tests from main
* feat(pagination): lint
* feat(pagination): no longer read excel file twice for metadata
* feat(pagination): no longer a lambda for skiprows
* feat(pagination): more coverage
* feat(pagination): more coverage and lint
* feat(pagination): lint (3 lint-only commits)
* feat(pagination): coverage
* Update peakina/readers/csv.py (Co-authored-by: Eric Jolibois <[email protected]>)
* feat(pagination): fix off-by-1 error
* feat(pagination): removed useless `reader_kwargs` in helpers.py
* fix: lint
* refactor: rewrite test to add more tests afterwards
* fix: csv reader
* fix: excel reader
* export meta readers
* fix coverage
* remove out-of-scope code
* test: update docstrings
* chore: remove useless sentinel
* test: more explicit
* feat: add total_rows and df_rows for csv
* fix: updates from PR review + added test scenarios for a csv file having 12 lines
* fix(excel): handle edge case with multiple sheets
* fix: csv reader should use preview_offset if set alone
* fix: excel reader should use preview_offset if set alone
* fix: fix tests
* reorder tests
* support skipfooter, nrows, ...
* support pandas kwargs in metadata

Co-authored-by: sanix-darker <[email protected]>
Co-authored-by: Eric Jolibois <[email protected]>
Co-authored-by: David Nowinsky <[email protected]>
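The preview semantics introduced by this commit (keep the header row, skip `preview_offset` data rows, then return at most `preview_nrows` rows) can be sketched with the standard-library `csv` module. `preview_rows` below is a hypothetical helper for illustration, not part of peakina's API:

```python
import csv
import io


def preview_rows(text, preview_offset=0, preview_nrows=None):
    # Hypothetical helper mirroring the commit's preview semantics:
    # the header row is always kept, `preview_offset` data rows are
    # skipped, and at most `preview_nrows` rows are returned.
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)[preview_offset:]
    if preview_nrows is not None:
        rows = rows[:preview_nrows]
    return header, rows


header, rows = preview_rows("a,b\n1,2\n3,4\n5,6\n7,8\n", preview_offset=1, preview_nrows=2)
# header == ['a', 'b'] and rows == [['3', '4'], ['5', '6']]
```

Note that `preview_offset` counts data rows only, which is why the real reader passes `skiprows=range(1, preview_offset + 1)` to pandas: row 0 (the header) is never skipped.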
1 parent c2485f1 · commit f89bac2 · 14 changed files, 583 additions, 43 deletions
Readers package `__init__` (`@@ -1,7 +1,17 @@`):

```python
from .csv import csv_meta, read_csv
from .excel import excel_meta, read_excel
from .json import read_json
from .xml import read_xml

__all__ = (
    # CSV
    "read_csv",
    "csv_meta",
    # EXCEL
    "read_excel",
    "excel_meta",
    # JSON
    "read_json",
    # XML
    "read_xml",
)
```
`peakina/readers/csv.py` (new file, `@@ -0,0 +1,99 @@`):

```python
"""
Module to add csv support
"""
from functools import wraps
from typing import TYPE_CHECKING, Any, Dict, Optional, Union

import pandas as pd

if TYPE_CHECKING:
    from os import PathLike

    FilePathOrBuffer = Union[str, bytes, PathLike[str], PathLike[bytes]]

# The chunksize value for previews
PREVIEW_CHUNK_SIZE = 1024


@wraps(pd.read_csv)
def read_csv(
    filepath_or_buffer: "FilePathOrBuffer",
    *,
    # extra `peakina` reader kwargs
    preview_offset: int = 0,
    preview_nrows: Optional[int] = None,
    # changed default values
    keep_default_na: bool = False,  # pandas default: `True`
    error_bad_lines: bool = False,  # pandas default: `True`
    **kwargs: Any,
) -> pd.DataFrame:
    """
    Reads a CSV file; when preview arguments are given, builds the preview by reading chunks.
    """
    if preview_nrows is not None or preview_offset:
        chunks = pd.read_csv(
            filepath_or_buffer,
            keep_default_na=keep_default_na,
            error_bad_lines=error_bad_lines,
            **kwargs,
            # keep row 0 (the header) and then skip everything else up to row `preview_offset`
            skiprows=range(1, preview_offset + 1),
            nrows=preview_nrows,
            chunksize=PREVIEW_CHUNK_SIZE,
        )
        return next(chunks)

    return pd.read_csv(
        filepath_or_buffer,
        keep_default_na=keep_default_na,
        error_bad_lines=error_bad_lines,
        **kwargs,
    )


def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
    with open(filepath_or_buffer) as f:
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read  # loop optimization

        buf = read_f(buf_size)
        while buf:
            lines += buf.count("\n")
            buf = read_f(buf_size)

    return lines
```
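The buffered newline-counting approach used by `_line_count` can be exercised standalone with only the standard library; `line_count` below is a re-implementation for illustration:

```python
import tempfile


def line_count(path, buf_size=1024 * 1024):
    # Count newline characters in fixed-size chunks rather than
    # loading the whole file, so huge CSVs stay cheap to measure.
    lines = 0
    with open(path) as f:
        read_f = f.read  # loop optimization, as in `_line_count`
        buf = read_f(buf_size)
        while buf:
            lines += buf.count("\n")
            buf = read_f(buf_size)
    return lines


# A throwaway fixture: header + 2 data rows, each newline-terminated.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
    path = tmp.name

assert line_count(path) == 3
```

Since the count includes the header line, `csv_meta` reports a `total_rows` that is one more than the number of data rows for a newline-terminated file with a header.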
`csv_meta` (same file, continued):

```python
def csv_meta(
    filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    total_rows = _line_count(filepath_or_buffer)

    if "nrows" in reader_kwargs:
        return {
            "total_rows": total_rows,
            "df_rows": reader_kwargs["nrows"],
        }

    start = 0 + reader_kwargs.get("skiprows", 0)
    end = total_rows - reader_kwargs.get("skipfooter", 0)

    preview_offset = reader_kwargs.get("preview_offset", 0)
    preview_nrows = reader_kwargs.get("preview_nrows", None)

    if preview_nrows is not None:
        return {
            "total_rows": total_rows,
            "df_rows": min(preview_nrows, max(end - start - preview_offset, 0)),
        }
    elif preview_offset:  # and `preview_nrows` is None
        return {
            "total_rows": total_rows,
            "df_rows": max(end - start - preview_offset, 0),
        }
    else:
        return {
            "total_rows": total_rows,
            "df_rows": end - start,
        }
```
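The row arithmetic in `csv_meta` is pure bookkeeping, so it can be checked without pandas or a file on disk. `csv_rows_meta` below is a hypothetical re-implementation of just that arithmetic, taking the precomputed line count as input:

```python
def csv_rows_meta(total_rows, reader_kwargs):
    # Hypothetical mirror of the `csv_meta` arithmetic above:
    # `total_rows` is the raw newline count of the file.
    if "nrows" in reader_kwargs:
        return {"total_rows": total_rows, "df_rows": reader_kwargs["nrows"]}

    start = reader_kwargs.get("skiprows", 0)
    end = total_rows - reader_kwargs.get("skipfooter", 0)
    offset = reader_kwargs.get("preview_offset", 0)
    nrows = reader_kwargs.get("preview_nrows")

    available = max(end - start - offset, 0)  # rows left after skips and offset
    if nrows is not None:
        return {"total_rows": total_rows, "df_rows": min(nrows, available)}
    if offset:
        return {"total_rows": total_rows, "df_rows": available}
    return {"total_rows": total_rows, "df_rows": end - start}


# A 13-line file (header + 12 data rows), previewing 5 rows after skipping 2:
assert csv_rows_meta(13, {"preview_offset": 2, "preview_nrows": 5}) == {
    "total_rows": 13,
    "df_rows": 5,
}
```

The `min(..., available)` clamp is what keeps `df_rows` honest when the requested preview window runs past the end of the file.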
`peakina/readers/excel.py` (new file, `@@ -0,0 +1,70 @@`):

```python
"""
Module to add excel files support
"""
import logging
from functools import wraps
from typing import TYPE_CHECKING, Any, Dict, Optional, Union

import pandas as pd

if TYPE_CHECKING:
    from os import PathLike

    FilePathOrBuffer = Union[str, bytes, PathLike[str], PathLike[bytes]]

LOGGER = logging.getLogger(__name__)


@wraps(pd.read_excel)
def read_excel(
    filepath_or_buffer: "FilePathOrBuffer",
    *,
    # extra `peakina` reader kwargs
    preview_offset: int = 0,
    preview_nrows: Optional[int] = None,
    # changed default values
    keep_default_na: bool = False,  # pandas default: `True`
    **kwargs: Any,
) -> pd.DataFrame:
    df = pd.read_excel(
        filepath_or_buffer,
        keep_default_na=keep_default_na,
        **kwargs,
    )
    # if there are several sheets, pd.read_excel returns a dict {sheet_name: df}
    if isinstance(df, dict):
        for sheet_name, sheet_df in df.items():
            sheet_df["__sheet__"] = sheet_name
        df = pd.concat(df.values(), sort=False)

    if preview_nrows is not None or preview_offset:
        offset = None if preview_nrows is None else preview_offset + preview_nrows
        return df[preview_offset:offset]
    return df


def excel_meta(
    filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Returns a dictionary with the meta information of the excel file.
    """
    excel_file = pd.ExcelFile(filepath_or_buffer)
    sheet_names = excel_file.sheet_names

    df = read_excel(excel_file, **reader_kwargs)

    if (sheet_name := reader_kwargs.get("sheet_name", 0)) is None:
        # multiple sheets concatenated together
        return {
            "sheetnames": sheet_names,
            "df_rows": df.shape[0],
            "total_rows": sum(excel_file.parse(name).shape[0] for name in sheet_names),
        }
    else:
        # single sheet
        return {
            "sheetnames": sheet_names,
            "df_rows": df.shape[0],
            "total_rows": excel_file.parse(sheet_name).shape[0],
        }
```
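Unlike the CSV reader, `read_excel` paginates after the full read, by slicing the DataFrame with `df[preview_offset:offset]`. That slicing logic works on any sequence, so it can be sketched with plain lists; `preview_slice` is a hypothetical stand-in for illustration:

```python
def preview_slice(rows, preview_offset=0, preview_nrows=None):
    # Mirrors the pagination slice in `read_excel` above:
    # an open-ended slice when no row limit is requested,
    # otherwise a window of `preview_nrows` rows starting at `preview_offset`.
    end = None if preview_nrows is None else preview_offset + preview_nrows
    return rows[preview_offset:end]


data = ["r0", "r1", "r2", "r3", "r4"]
assert preview_slice(data, preview_offset=1, preview_nrows=2) == ["r1", "r2"]
assert preview_slice(data, preview_offset=3) == ["r3", "r4"]  # offset set alone
assert preview_slice(data, preview_nrows=2) == ["r0", "r1"]  # nrows set alone
```

The `preview_offset=3` case is the "preview_offset if set alone" behaviour the commit message calls out as a fix for both readers.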
3 binary files changed (contents not shown).