feat: pagination + metadata for csv and excel readers [TCTC-1810] (#77)
* feat: preview on huge files (excel and csv)

- added two new readers (csv and excel) with custom preview features
- updated allowed methods depending on preview arguments

* fix: clean lint

* fix: fix pandas.read_csv

* feat: with pd.excel

* feat: update code from params and optim

* fix: updates on read_excel

* feat: add generator for xlrd reader + offset/limit

* feat: refacto readexcel

* feat: clean numpy typing

* fix: fix dependency

* fix: fix tests

* fix: remove kwargs from read_csv

* fix: removed numpy + allow backward compatibility

* refacto: removed preview_args in favor of preview

* feat: add keep_default_na to read_csv through read_pandas

* fix: fixes tests

- The test_basic_excel test was dealing with a multi-sheet Excel file;
	we fixed that by creating a new fixture file with only one sheet, 'fixture-single-sheet.xlsx'
- Reset the initial assert test value on test_csv_with_sep

* fix: fix read_pandas

* fix: fix read_csv with a chunksize

* fix: fix read_excel and clean

* fix: attempt on fix test_ftp

* feat: add coverage for preview

* cov: add preview params

* cov: add tests for cov on nrows and skiprows

* feat: added EXCEL_TYPE enum class

* test: added tests for new file format type

* feat: catch limit/offset once

* Update peakina/readers/csv.py

Co-authored-by: Eric Jolibois <[email protected]>

* feat: updates from PR review

* test: added nrows/skiprows tests

* prr: ordering imports in readers

* prr: keep_default_na as bool instead of Any

* prr: add common for readers

created readers/common.py so shared classes
can be imported by several readers

* prr: removed context manager from read_csv for chunks

* prr: clean docstrings

* prr: PreviewArgs as a dataclass

* prr: prioritize returning OLD xls format over NEW one

* feat: add Generator type

- add Generator typing
- split read-sheet into multiple submethods

* prr: renamed iterators methods

* prr: renaming

* prr: excel_type check uses `is` instead of equals

* prr: updates from PR review

* fix: csv reader preview/paginate

- fix csv reader (by adding right columns)
- add descriptive tests

* tests: update fixture for relevant tests on pagination preview

- from:
   a   b
0  3   4
1  3   4

- to:
   a   b
0  3   4
1  4   3

* refacto: move the extract-columns because only used by read_csv

* fix: fix nrows/skiprows ambiguity

* fix: extract column names to properly handle preview

* cov: added cov tests

* fix: fix wraps args

* fix: reduce unnecessary looping

* test: added test coverage

* Update peakina/readers/excel.py

Co-authored-by: David Nowinsky <[email protected]>

* Update peakina/readers/excel.py

Co-authored-by: David Nowinsky <[email protected]>

* prr: updates from pr review

* chores: rename `io` package so it is possible to debug tests. the `io` name clashes with the native `io` module

* feat(excel files): preview args are split into 2 arguments. `io` package gets its old name back

* feat(excel files): add skiprows to the list of allowed kwargs

* feat(excel files): the first row, which should contain headers, is now always skipped

* feat(pagination): pagination arguments are now passed to (and then ignored by) the read_json and read_xml functions

* feat(pagination): fix None error

* feat(pagination): downloaded files now keep their extensions

* feat(pagination): this commit removes the new way of handling excel files by transforming them to CSV first. it only contains the bare minimum to handle the pagination on the peakina side

* feat(pagination): remove irrelevant tests

* feat(pagination): copy/paste tests from main

* feat(pagination): lint

* feat(pagination): no longer read excel file twice for metadata

* feat(pagination): no longer a lambda for skiprows

* feat(pagination): more coverage

* feat(pagination): more coverage and lint

* feat(pagination): lint

* feat(pagination): lint

* feat(pagination): lint

* feat(pagination): coverage

* Update peakina/readers/csv.py

Co-authored-by: Eric Jolibois <[email protected]>

* feat(pagination): fix off by 1 error

* feat(pagination): removed useless `reader_kwargs` in helpers.py

* fix: lint

* refactor: rewrite test to add more tests afterwards

* fix: csv reader

* fix: excel reader

* export meta readers

* fix coverage

* remove out of scope code

* test: update docstrings

* chore: remove useless sentinel

* test: more explicit

* feat: add total_rows and df_rows for csv

* fix: updates from pr review + added test scenarios for a csv file having 12 lines

* fix(excel): handle edge case with multiple sheets

* fix: csv reader should use preview_offset if set alone

* fix: excel reader should use preview_offset if set alone

* fix: fix tests

* reorder tests

* support skipfooter, nrows, ...

* support pandas kwargs in metadata

Co-authored-by: sanix-darker <[email protected]>
Co-authored-by: Eric Jolibois <[email protected]>
Co-authored-by: David Nowinsky <[email protected]>
4 people authored Feb 22, 2022
1 parent c2485f1 commit f89bac2
Showing 14 changed files with 583 additions and 43 deletions.
4 changes: 2 additions & 2 deletions peakina/datapool.py
@@ -1,8 +1,8 @@
 from os import path
 from typing import TYPE_CHECKING, Any, Dict, Hashable, Optional

-from .cache import Cache
-from .datasource import DataSource
+from peakina.cache import Cache
+from peakina.datasource import DataSource

 if TYPE_CHECKING:
     import pandas as pd
17 changes: 5 additions & 12 deletions peakina/datasource.py
@@ -16,8 +16,8 @@
 from pydantic.dataclasses import dataclass
 from slugify import slugify

-from .cache import Cache
-from .helpers import (
+from peakina.cache import Cache
+from peakina.helpers import (
     TypeEnum,
     detect_encoding,
     detect_sep,
@@ -29,11 +29,10 @@
     validate_kwargs,
     validate_sep,
 )
-from .io import Fetcher, MatchEnum
+from peakina.io import Fetcher, MatchEnum

 AVAILABLE_SCHEMES = set(Fetcher.registry) - {""}  # discard the empty string scheme
 PD_VALID_URLS = set(uses_relative + uses_netloc + uses_params) | AVAILABLE_SCHEMES
-NOTSET = object()


 @dataclass
@@ -75,7 +74,7 @@ def get_metadata(self) -> Dict[str, Any]:
             return {}  # no metadata for matched datasources
         with self.fetcher.open(self.uri) as f:
             assert self.type is not None
-            return get_metadata(f.name, self.type)
+            return get_metadata(f.name, self.type, self.reader_kwargs)

     @staticmethod
     def _get_single_df(
@@ -91,7 +90,7 @@ def _get_single_df(
         allowed_params = get_reader_allowed_params(filetype)

         # Check encoding
-        encoding = kwargs.get("encoding")
+        encoding = kwargs.get("encoding", "utf-8")
         if "encoding" in allowed_params:
             if not validate_encoding(stream.name, encoding):
                 encoding = detect_encoding(stream.name)
@@ -107,12 +106,6 @@
         finally:
             stream.close()

-        # In case of sheets, the df can be a dictionary
-        if kwargs.get("sheet_name", NOTSET) is None:
-            for sheet_name, _df in df.items():
-                _df["__sheet__"] = sheet_name
-            df = pd.concat(df.values(), sort=False)
-
         return df

     def get_matched_datasources(self) -> Generator["DataSource", None, None]:
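
For illustration, a minimal sketch of the new metadata flow from the caller's side — the `uri` value and the exact `DataSource` constructor arguments are assumptions based on this diff, not a documented API:

    from peakina.datasource import DataSource

    # hypothetical datasource: the uri and reader_kwargs values are made up
    ds = DataSource(
        uri="file:///tmp/data.csv",
        reader_kwargs={"preview_offset": 2, "preview_nrows": 5},
    )
    # get_metadata() now forwards reader_kwargs to the type-specific metadata reader
    print(ds.get_metadata())  # e.g. {"total_rows": 12, "df_rows": 5}
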
37 changes: 26 additions & 11 deletions peakina/helpers.py
@@ -19,7 +19,14 @@
 import chardet
 import pandas as pd

-from .readers import read_json, read_xml
+from peakina.readers import (
+    csv_meta,
+    excel_meta,
+    read_csv,
+    read_excel,
+    read_json,
+    read_xml,
+)


 class TypeInfos(NamedTuple):
@@ -37,18 +44,23 @@ class TypeInfos(NamedTuple):
 # For files without MIME types, we make fake MIME types based on detected extension
 CUSTOM_MIMETYPES = {".parquet": "peakina/parquet"}

+EXTRA_PEAKINA_READER_KWARGS = ["preview_offset", "preview_nrows"]
+
 SUPPORTED_FILE_TYPES = {
-    "csv": TypeInfos(["text/csv", "text/tab-separated-values"], pd.read_csv),
+    "csv": TypeInfos(
+        ["text/csv", "text/tab-separated-values"],
+        read_csv,
+        [],
+        csv_meta,
+    ),
     "excel": TypeInfos(
         [
             "application/vnd.ms-excel",
             "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
         ],
-        pd.read_excel,
-        # these options are missing from read_excel signature in pandas 0.23:
-        ["keep_default_na", "encoding", "decimal"],
-        lambda f: {"sheetnames": pd.ExcelFile(f).sheet_names},
+        read_excel,
+        ["encoding", "decimal"],
+        excel_meta,
     ),
     "json": TypeInfos(
         ["application/json"],
@@ -133,13 +145,15 @@ def detect_sep(filepath: str, encoding: Optional[str] = None) -> str:
     return csv.Sniffer().sniff(str_head(filepath, 100, encoding)).delimiter


-def validate_sep(filepath: str, sep: str = ",", encoding: Optional[str] = None) -> bool:
+def validate_sep(filepath: str, sep: str = ",", encoding: str = "utf-8") -> bool:
     """
     Validates if the `sep` is a right separator of a CSV file
     (i.e. the dataframe has more than one column).
     """
     try:
-        df = pd.read_csv(filepath, sep=sep, encoding=encoding, nrows=2)
+        # we want an error to be raised if we can't read the first two lines
+        # hence the parameter `error_bad_lines` set to `True`
+        df = read_csv(filepath, sep=sep, encoding=encoding, nrows=2, error_bad_lines=True)
         return len(df.columns) > 1
     except pd.errors.ParserError:
         return False
@@ -161,6 +175,7 @@ def validate_kwargs(kwargs: Dict[str, Any], t: Optional[TypeEnum]) -> bool:
         allowed_kwargs += get_reader_allowed_params(t)
         # Add extra allowed kwargs
         allowed_kwargs += SUPPORTED_FILE_TYPES[t].reader_kwargs
+    allowed_kwargs += EXTRA_PEAKINA_READER_KWARGS
     bad_kwargs = set(kwargs) - set(allowed_kwargs)
     if bad_kwargs:
         raise ValueError(f'Unsupported kwargs: {", ".join(map(repr, bad_kwargs))}')
@@ -176,6 +191,6 @@ def pd_read(filepath: str, t: str, kwargs: Dict[str, Any]) -> pd.DataFrame:
     return SUPPORTED_FILE_TYPES[t].reader(filepath, **kwargs)


-def get_metadata(filepath: str, t: str) -> Dict[str, Any]:
-    read = SUPPORTED_FILE_TYPES[t].metadata_reader
-    return read(filepath) if read else {}
+def get_metadata(filepath: str, type: str, reader_kwargs: Dict[str, Any]) -> Dict[str, Any]:
+    metadata_reader = SUPPORTED_FILE_TYPES[type].metadata_reader
+    return metadata_reader(filepath, reader_kwargs) if metadata_reader else {}
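
A short sketch of the new dispatch, assuming a local CSV file; constructing the enum as `TypeEnum("csv")` is an assumption about its values:

    from peakina.helpers import TypeEnum, get_metadata, validate_kwargs

    # preview_offset/preview_nrows are now whitelisted for every file type
    validate_kwargs({"preview_nrows": 50, "preview_offset": 10}, TypeEnum("csv"))  # no ValueError

    # get_metadata dispatches on the file type and forwards the reader kwargs,
    # here ending up in csv_meta("/tmp/data.csv", {"preview_nrows": 50})
    meta = get_metadata("/tmp/data.csv", "csv", {"preview_nrows": 50})
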
10 changes: 10 additions & 0 deletions peakina/readers/__init__.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,17 @@
from .csv import csv_meta, read_csv
from .excel import excel_meta, read_excel
from .json import read_json
from .xml import read_xml

__all__ = (
# CSV
"read_csv",
"csv_meta",
# EXCEL
"read_excel",
"excel_meta",
# JSON
"read_json",
# XML
"read_xml",
)
99 changes: 99 additions & 0 deletions peakina/readers/csv.py
@@ -0,0 +1,99 @@
+"""
+Module to add csv support
+"""
+from functools import wraps
+from typing import TYPE_CHECKING, Any, Dict, Optional, Union
+
+import pandas as pd
+
+if TYPE_CHECKING:
+    from os import PathLike
+
+    FilePathOrBuffer = Union[str, bytes, PathLike[str], PathLike[bytes]]
+
+# The chunksize value for previews
+PREVIEW_CHUNK_SIZE = 1024
+
+
+@wraps(pd.read_csv)
+def read_csv(
+    filepath_or_buffer: "FilePathOrBuffer",
+    *,
+    # extra `peakina` reader kwargs
+    preview_offset: int = 0,
+    preview_nrows: Optional[int] = None,
+    # change of default values
+    keep_default_na: bool = False,  # pandas default: `True`
+    error_bad_lines: bool = False,  # pandas default: `True`
+    **kwargs: Any,
+) -> pd.DataFrame:
+    """
+    The read_csv method is able to make a preview by reading on chunks
+    """
+    if preview_nrows is not None or preview_offset:
+        chunks = pd.read_csv(
+            filepath_or_buffer,
+            keep_default_na=keep_default_na,
+            error_bad_lines=error_bad_lines,
+            **kwargs,
+            # keep the first row 0 (as the header) and then skip everything else up to row `preview_offset`
+            skiprows=range(1, preview_offset + 1),
+            nrows=preview_nrows,
+            chunksize=PREVIEW_CHUNK_SIZE,
+        )
+        return next(chunks)
+
+    return pd.read_csv(
+        filepath_or_buffer,
+        keep_default_na=keep_default_na,
+        error_bad_lines=error_bad_lines,
+        **kwargs,
+    )
+
+
+def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
+    with open(filepath_or_buffer) as f:
+        lines = 0
+        buf_size = 1024 * 1024
+        read_f = f.read  # loop optimization
+
+        buf = read_f(buf_size)
+        while buf:
+            lines += buf.count("\n")
+            buf = read_f(buf_size)
+
+    return lines
+
+
+def csv_meta(
+    filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
+) -> Dict[str, Any]:
+    total_rows = _line_count(filepath_or_buffer)
+
+    if "nrows" in reader_kwargs:
+        return {
+            "total_rows": total_rows,
+            "df_rows": reader_kwargs["nrows"],
+        }
+
+    start = 0 + reader_kwargs.get("skiprows", 0)
+    end = total_rows - reader_kwargs.get("skipfooter", 0)
+
+    preview_offset = reader_kwargs.get("preview_offset", 0)
+    preview_nrows = reader_kwargs.get("preview_nrows", None)
+
+    if preview_nrows is not None:
+        return {
+            "total_rows": total_rows,
+            "df_rows": min(preview_nrows, max(end - start - preview_offset, 0)),
+        }
+    elif preview_offset:  # and `preview_nrows` is None
+        return {
+            "total_rows": total_rows,
+            "df_rows": max(end - start - preview_offset, 0),
+        }
+    else:
+        return {
+            "total_rows": total_rows,
+            "df_rows": end - start,
+        }
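
A usage sketch for the reader above — the fixture path and its 12-line size are made up for illustration:

    from peakina.readers import csv_meta, read_csv

    # keep the header (row 0), skip the next 2 data rows, return at most 5 rows;
    # only a single PREVIEW_CHUNK_SIZE chunk is read instead of the whole file
    df = read_csv("/tmp/fixture.csv", preview_offset=2, preview_nrows=5)

    # csv_meta counts physical lines once, then derives the preview size
    meta = csv_meta("/tmp/fixture.csv", {"preview_offset": 2, "preview_nrows": 5})
    # e.g. for a 12-line file: {"total_rows": 12, "df_rows": 5}
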
70 changes: 70 additions & 0 deletions peakina/readers/excel.py
@@ -0,0 +1,70 @@
+"""
+Module to add excel files support
+"""
+import logging
+from functools import wraps
+from typing import TYPE_CHECKING, Any, Dict, Optional, Union
+
+import pandas as pd
+
+if TYPE_CHECKING:
+    from os import PathLike
+
+    FilePathOrBuffer = Union[str, bytes, PathLike[str], PathLike[bytes]]
+
+LOGGER = logging.getLogger(__name__)
+
+
+@wraps(pd.read_excel)
+def read_excel(
+    filepath_or_buffer: "FilePathOrBuffer",
+    *,
+    # extra `peakina` reader kwargs
+    preview_offset: int = 0,
+    preview_nrows: Optional[int] = None,
+    # change of default values
+    keep_default_na: bool = False,  # pandas default: `True`
+    **kwargs: Any,
+) -> pd.DataFrame:
+    df = pd.read_excel(
+        filepath_or_buffer,
+        keep_default_na=keep_default_na,
+        **kwargs,
+    )
+    # if there are several sheets, pd.read_excel returns a dict {sheet_name: df}
+    if isinstance(df, dict):
+        for sheet_name, sheet_df in df.items():
+            sheet_df["__sheet__"] = sheet_name
+        df = pd.concat(df.values(), sort=False)
+
+    if preview_nrows is not None or preview_offset:
+        offset = None if preview_nrows is None else preview_offset + preview_nrows
+        return df[preview_offset:offset]
+    return df
+
+
+def excel_meta(
+    filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
+) -> Dict[str, Any]:
+    """
+    Returns a dictionary with the meta information of the excel file.
+    """
+    excel_file = pd.ExcelFile(filepath_or_buffer)
+    sheet_names = excel_file.sheet_names
+
+    df = read_excel(excel_file, **reader_kwargs)
+
+    if (sheet_name := reader_kwargs.get("sheet_name", 0)) is None:
+        # multiple sheets together
+        return {
+            "sheetnames": sheet_names,
+            "df_rows": df.shape[0],
+            "total_rows": sum(excel_file.parse(sheet_name).shape[0] for sheet_name in sheet_names),
+        }
+    else:
+        # single sheet
+        return {
+            "sheetnames": sheet_names,
+            "df_rows": df.shape[0],
+            "total_rows": excel_file.parse(sheet_name).shape[0],
+        }
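
And the equivalent sketch on the Excel side — file name and sheet layout are assumed:

    from peakina.readers import excel_meta, read_excel

    # unlike the CSV reader, pagination happens after the whole file is loaded:
    # rows [1, 1 + 3) are sliced out of the (possibly concatenated) dataframe
    df = read_excel("/tmp/fixture.xlsx", preview_offset=1, preview_nrows=3)

    # with sheet_name=None, all sheets are concatenated and total_rows sums every sheet
    meta = excel_meta("/tmp/fixture.xlsx", {"sheet_name": None})
    # e.g. {"sheetnames": ["Sheet1", "Sheet2"], "df_rows": 20, "total_rows": 20}
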
6 changes: 5 additions & 1 deletion peakina/readers/xml.py
@@ -29,7 +29,11 @@ def transform_with_jq(data: Any, jq_filter: str) -> Union[PdDataList, PdDataDict
     return cast(PdDataList, all_data)


-def read_xml(filepath: str, encoding: str = "utf-8", filter: Optional[str] = None) -> pd.DataFrame:
+def read_xml(
+    filepath: str,
+    encoding: str = "utf-8",
+    filter: Optional[str] = None,
+) -> pd.DataFrame:
     data = xmltodict.parse(open(filepath).read(), encoding=encoding)
     if filter is not None:
         data = transform_with_jq(data, filter)
Binary file modified tests/fixtures/0_2.xls
Binary file added tests/fixtures/fixture-single-sheet.xlsx
Binary file added tests/fixtures/fixture_new_format.xls