feat: pagination + metadata for csv and excel readers [TCTC-1810] (#77)
* feat: preview on huge files (excel and csv)

- added two new readers (csv and excel) with custom preview features
- updated allowed methods depending on preview arguments

* fix: clean lint

* fix: fix pandas.read_csv

* feat: with pd.excel

* feat: update code from params and optim

* fix: updates on read_excel

* feat: add generator for xlrd reader + offset/limit

* feat: refacto readexcel

* feat: clean numpy typing

* fix: fix dependency

* fix: fix tests

* fix: remove kwargs from read_csv

* fix: removed numpy + allow backward compatibility

* refacto: removed preview_args in favor of preview

* feat: add keep_default_na to read_csv through read_pandas

* fix: fixes tests

- The test_basic_excel test was dealing with a multi-sheet Excel file;
	we fixed that by creating a new fixture file with only one sheet, 'fixture-single-sheet.xlsx'
- Reset the initial assert test value on test_csv_with_sep

* fix: fix read_pandas

* fix: fix read_csv with a chunksize

* fix: fix read_excel and clean

* fix: attempt on fix test_ftp

* feat: add coverage for preview

* cov: add preview params

* cov: add tests for cov on nrows and skiprows

* feat: added EXCEL_TYPE enum class

* test: added tests for new file format type

* feat: catch limit/offset once

* Update peakina/readers/csv.py

Co-authored-by: Eric Jolibois <[email protected]>

* feat: updates from PR review

* test: added nrows/skiprows tests

* prr: ordering imports in readers

* prr: keep_default_na as bool instead of Any

* prr: add common for readers

created readers/common.py so shared classes
can be imported by several readers

* prr: removed context manager from read_csv for chunks

* prr: clean docstrings

* prr: PreviewArgs as a dataclass

* prr: prioritize returning OLD xls format over NEW one

* feat: add Generator type

- add Generator typing
- split read-sheet into multiple submethods

* prr: renamed iterators methods

* prr: renaming

* prr: excel_type check uses `is` instead of equals

* prr: updates from PR review

* fix: csv reader preview/paginate

- fix csv reader (by adding right columns)
- add descriptive tests

* tests: update fixture for relevant tests on pagination preview

- from:
   a   b
0  3   4
1  3   4

- to:
   a   b
0  3   4
1  4   3

* refacto: move the extract-columns because only used by read_csv

* fix: fix nrows/skiprows ambiguity

* fix: extract column names to properly handle preview

* cov: added cov tests

* fix: fix wraps args

* fix: reduce unnecessary looping

* test: added test coverage

* Update peakina/readers/excel.py

Co-authored-by: David Nowinsky <[email protected]>

* Update peakina/readers/excel.py

Co-authored-by: David Nowinsky <[email protected]>

* prr: updates from pr review

* chores: rename `io` package so it is possible to debug tests. the `io` name clashes with the native `io` module

* feat(excel files): preview args are split into 2 arguments. `io` package gets its old name back

* feat(excel files): add skiprows to the list of allowed kwargs

* feat(excel files): the first row, which should contain headers, is now always skipped

* feat(pagination): pagination arguments are now passed to (and then ignored by) the read_json and read_xml functions

* feat(pagination): fix None error

* feat(pagination): downloaded files now keep their extensions

* feat(pagination): this commit removes the new way of handling excel files by transforming them to CSV first. it only contains the bare minimum to handle the pagination on the peakina side

* feat(pagination): remove irrelevant tests

* feat(pagination): copy/paste tests from main

* feat(pagination): lint

* feat(pagination): no longer read excel file twice for metadata

* feat(pagination): no longer a lambda for skiprows

* feat(pagination): more coverage

* feat(pagination): more coverage and lint

* feat(pagination): lint

* feat(pagination): lint

* feat(pagination): lint

* feat(pagination): coverage

* Update peakina/readers/csv.py

Co-authored-by: Eric Jolibois <[email protected]>

* feat(pagination): fix off by 1 error

* feat(pagination): removed useless `reader_kwargs` in helpers.py

* fix: lint

* refactor: rewrite test to add more tests afterwards

* fix: csv reader

* fix: excel reader

* export meta readers

* fix coverage

* remove out of scope code

* test: update docstrings

* chore: remove useless sentinel

* test: more explicit

* feat: add total_rows and df_rows for csv

* fix: updates from pr review + added test scenarios for a csv file having 12 lines

* fix(excel): handle edge case with multiple sheets

* fix: csv reader should use preview_offset if set alone

* fix: excel reader should use preview_offset if set alone

* fix: fix tests

* reorder tests

* support skipfooter, nrows, ...

* support pandas kwargs in metadata

Co-authored-by: sanix-darker <[email protected]>
Co-authored-by: Eric Jolibois <[email protected]>
Co-authored-by: David Nowinsky <[email protected]>
4 people authored Feb 22, 2022
1 parent c2485f1 commit f89bac2
Showing 14 changed files with 583 additions and 43 deletions.
4 changes: 2 additions & 2 deletions peakina/datapool.py
@@ -1,8 +1,8 @@
 from os import path
 from typing import TYPE_CHECKING, Any, Dict, Hashable, Optional

-from .cache import Cache
-from .datasource import DataSource
+from peakina.cache import Cache
+from peakina.datasource import DataSource

 if TYPE_CHECKING:
     import pandas as pd
17 changes: 5 additions & 12 deletions peakina/datasource.py
@@ -16,8 +16,8 @@
 from pydantic.dataclasses import dataclass
 from slugify import slugify

-from .cache import Cache
-from .helpers import (
+from peakina.cache import Cache
+from peakina.helpers import (
     TypeEnum,
     detect_encoding,
     detect_sep,
@@ -29,11 +29,10 @@
     validate_kwargs,
     validate_sep,
 )
-from .io import Fetcher, MatchEnum
+from peakina.io import Fetcher, MatchEnum

 AVAILABLE_SCHEMES = set(Fetcher.registry) - {""}  # discard the empty string scheme
 PD_VALID_URLS = set(uses_relative + uses_netloc + uses_params) | AVAILABLE_SCHEMES
-NOTSET = object()


 @dataclass
@@ -75,7 +74,7 @@ def get_metadata(self) -> Dict[str, Any]:
             return {}  # no metadata for matched datasources
         with self.fetcher.open(self.uri) as f:
             assert self.type is not None
-            return get_metadata(f.name, self.type)
+            return get_metadata(f.name, self.type, self.reader_kwargs)

     @staticmethod
     def _get_single_df(
@@ -91,7 +90,7 @@ def _get_single_df(
         allowed_params = get_reader_allowed_params(filetype)

         # Check encoding
-        encoding = kwargs.get("encoding")
+        encoding = kwargs.get("encoding", "utf-8")
         if "encoding" in allowed_params:
             if not validate_encoding(stream.name, encoding):
                 encoding = detect_encoding(stream.name)
@@ -107,12 +106,6 @@
         finally:
             stream.close()

-        # In case of sheets, the df can be a dictionary
-        if kwargs.get("sheet_name", NOTSET) is None:
-            for sheet_name, _df in df.items():
-                _df["__sheet__"] = sheet_name
-            df = pd.concat(df.values(), sort=False)
-
         return df

     def get_matched_datasources(self) -> Generator["DataSource", None, None]:
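
For illustration, a minimal sketch of the new metadata flow from the caller's side — the `uri` value and the exact `DataSource` constructor arguments are assumptions based on this diff, not a documented API:

    from peakina.datasource import DataSource

    # hypothetical datasource: the uri and reader_kwargs values are made up
    ds = DataSource(
        uri="file:///tmp/data.csv",
        reader_kwargs={"preview_offset": 2, "preview_nrows": 5},
    )
    # get_metadata() now forwards reader_kwargs to the type-specific metadata reader
    print(ds.get_metadata())  # e.g. {"total_rows": 12, "df_rows": 5}
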
37 changes: 26 additions & 11 deletions peakina/helpers.py
@@ -19,7 +19,14 @@
 import chardet
 import pandas as pd

-from .readers import read_json, read_xml
+from peakina.readers import (
+    csv_meta,
+    excel_meta,
+    read_csv,
+    read_excel,
+    read_json,
+    read_xml,
+)


 class TypeInfos(NamedTuple):
@@ -37,18 +44,23 @@ class TypeInfos(NamedTuple):
 # For files without MIME types, we make fake MIME types based on detected extension
 CUSTOM_MIMETYPES = {".parquet": "peakina/parquet"}

+EXTRA_PEAKINA_READER_KWARGS = ["preview_offset", "preview_nrows"]
+
 SUPPORTED_FILE_TYPES = {
-    "csv": TypeInfos(["text/csv", "text/tab-separated-values"], pd.read_csv),
+    "csv": TypeInfos(
+        ["text/csv", "text/tab-separated-values"],
+        read_csv,
+        [],
+        csv_meta,
+    ),
     "excel": TypeInfos(
         [
             "application/vnd.ms-excel",
             "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
         ],
-        pd.read_excel,
-        # these options are missing from read_excel signature in pandas 0.23:
-        ["keep_default_na", "encoding", "decimal"],
-        lambda f: {"sheetnames": pd.ExcelFile(f).sheet_names},
+        read_excel,
+        ["encoding", "decimal"],
+        excel_meta,
     ),
     "json": TypeInfos(
         ["application/json"],
@@ -133,13 +145,15 @@ def detect_sep(filepath: str, encoding: Optional[str] = None) -> str:
     return csv.Sniffer().sniff(str_head(filepath, 100, encoding)).delimiter


-def validate_sep(filepath: str, sep: str = ",", encoding: Optional[str] = None) -> bool:
+def validate_sep(filepath: str, sep: str = ",", encoding: str = "utf-8") -> bool:
     """
     Validates if the `sep` is a right separator of a CSV file
     (i.e. the dataframe has more than one column).
     """
     try:
-        df = pd.read_csv(filepath, sep=sep, encoding=encoding, nrows=2)
+        # we want an error to be raised if we can't read the first two lines
+        # hence the parameter `error_bad_lines` set to `True`
+        df = read_csv(filepath, sep=sep, encoding=encoding, nrows=2, error_bad_lines=True)
         return len(df.columns) > 1
     except pd.errors.ParserError:
         return False
@@ -161,6 +175,7 @@ def validate_kwargs(kwargs: Dict[str, Any], t: Optional[TypeEnum]) -> bool:
         allowed_kwargs += get_reader_allowed_params(t)
         # Add extra allowed kwargs
         allowed_kwargs += SUPPORTED_FILE_TYPES[t].reader_kwargs
+    allowed_kwargs += EXTRA_PEAKINA_READER_KWARGS
     bad_kwargs = set(kwargs) - set(allowed_kwargs)
     if bad_kwargs:
         raise ValueError(f'Unsupported kwargs: {", ".join(map(repr, bad_kwargs))}')
@@ -176,6 +191,6 @@ def pd_read(filepath: str, t: str, kwargs: Dict[str, Any]) -> pd.DataFrame:
     return SUPPORTED_FILE_TYPES[t].reader(filepath, **kwargs)


-def get_metadata(filepath: str, t: str) -> Dict[str, Any]:
-    read = SUPPORTED_FILE_TYPES[t].metadata_reader
-    return read(filepath) if read else {}
+def get_metadata(filepath: str, type: str, reader_kwargs: Dict[str, Any]) -> Dict[str, Any]:
+    metadata_reader = SUPPORTED_FILE_TYPES[type].metadata_reader
+    return metadata_reader(filepath, reader_kwargs) if metadata_reader else {}
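
A short sketch of the new dispatch, assuming a local CSV file; constructing the enum as `TypeEnum("csv")` is an assumption about its values:

    from peakina.helpers import TypeEnum, get_metadata, validate_kwargs

    # preview_offset/preview_nrows are now whitelisted for every file type
    validate_kwargs({"preview_nrows": 50, "preview_offset": 10}, TypeEnum("csv"))  # no ValueError

    # get_metadata dispatches on the file type and forwards the reader kwargs,
    # here ending up in csv_meta("/tmp/data.csv", {"preview_nrows": 50})
    meta = get_metadata("/tmp/data.csv", "csv", {"preview_nrows": 50})
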
10 changes: 10 additions & 0 deletions peakina/readers/__init__.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,17 @@
from .csv import csv_meta, read_csv
from .excel import excel_meta, read_excel
from .json import read_json
from .xml import read_xml

__all__ = (
# CSV
"read_csv",
"csv_meta",
# EXCEL
"read_excel",
"excel_meta",
# JSON
"read_json",
# XML
"read_xml",
)
99 changes: 99 additions & 0 deletions peakina/readers/csv.py
@@ -0,0 +1,99 @@
+"""
+Module to add csv support
+"""
+from functools import wraps
+from typing import TYPE_CHECKING, Any, Dict, Optional, Union
+
+import pandas as pd
+
+if TYPE_CHECKING:
+    from os import PathLike
+
+    FilePathOrBuffer = Union[str, bytes, PathLike[str], PathLike[bytes]]
+
+# The chunksize value for previews
+PREVIEW_CHUNK_SIZE = 1024
+
+
+@wraps(pd.read_csv)
+def read_csv(
+    filepath_or_buffer: "FilePathOrBuffer",
+    *,
+    # extra `peakina` reader kwargs
+    preview_offset: int = 0,
+    preview_nrows: Optional[int] = None,
+    # change of default values
+    keep_default_na: bool = False,  # pandas default: `True`
+    error_bad_lines: bool = False,  # pandas default: `True`
+    **kwargs: Any,
+) -> pd.DataFrame:
+    """
+    The read_csv method is able to make a preview by reading on chunks
+    """
+    if preview_nrows is not None or preview_offset:
+        chunks = pd.read_csv(
+            filepath_or_buffer,
+            keep_default_na=keep_default_na,
+            error_bad_lines=error_bad_lines,
+            **kwargs,
+            # keep the first row 0 (as the header) and then skip everything else up to row `preview_offset`
+            skiprows=range(1, preview_offset + 1),
+            nrows=preview_nrows,
+            chunksize=PREVIEW_CHUNK_SIZE,
+        )
+        return next(chunks)
+
+    return pd.read_csv(
+        filepath_or_buffer,
+        keep_default_na=keep_default_na,
+        error_bad_lines=error_bad_lines,
+        **kwargs,
+    )
+
+
+def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
+    with open(filepath_or_buffer) as f:
+        lines = 0
+        buf_size = 1024 * 1024
+        read_f = f.read  # loop optimization
+
+        buf = read_f(buf_size)
+        while buf:
+            lines += buf.count("\n")
+            buf = read_f(buf_size)
+
+    return lines
+
+
+def csv_meta(
+    filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
+) -> Dict[str, Any]:
+    total_rows = _line_count(filepath_or_buffer)
+
+    if "nrows" in reader_kwargs:
+        return {
+            "total_rows": total_rows,
+            "df_rows": reader_kwargs["nrows"],
+        }
+
+    start = 0 + reader_kwargs.get("skiprows", 0)
+    end = total_rows - reader_kwargs.get("skipfooter", 0)
+
+    preview_offset = reader_kwargs.get("preview_offset", 0)
+    preview_nrows = reader_kwargs.get("preview_nrows", None)
+
+    if preview_nrows is not None:
+        return {
+            "total_rows": total_rows,
+            "df_rows": min(preview_nrows, max(end - start - preview_offset, 0)),
+        }
+    elif preview_offset:  # and `preview_nrows` is None
+        return {
+            "total_rows": total_rows,
+            "df_rows": max(end - start - preview_offset, 0),
+        }
+    else:
+        return {
+            "total_rows": total_rows,
+            "df_rows": end - start,
+        }
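
A usage sketch for the reader above — the fixture path and its 12-line size are made up for illustration:

    from peakina.readers import csv_meta, read_csv

    # keep the header (row 0), skip the next 2 data rows, return at most 5 rows;
    # only a single PREVIEW_CHUNK_SIZE chunk is read instead of the whole file
    df = read_csv("/tmp/fixture.csv", preview_offset=2, preview_nrows=5)

    # csv_meta counts physical lines once, then derives the preview size
    meta = csv_meta("/tmp/fixture.csv", {"preview_offset": 2, "preview_nrows": 5})
    # e.g. for a 12-line file: {"total_rows": 12, "df_rows": 5}
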
70 changes: 70 additions & 0 deletions peakina/readers/excel.py
@@ -0,0 +1,70 @@
+"""
+Module to add excel files support
+"""
+import logging
+from functools import wraps
+from typing import TYPE_CHECKING, Any, Dict, Optional, Union
+
+import pandas as pd
+
+if TYPE_CHECKING:
+    from os import PathLike
+
+    FilePathOrBuffer = Union[str, bytes, PathLike[str], PathLike[bytes]]
+
+LOGGER = logging.getLogger(__name__)
+
+
+@wraps(pd.read_excel)
+def read_excel(
+    filepath_or_buffer: "FilePathOrBuffer",
+    *,
+    # extra `peakina` reader kwargs
+    preview_offset: int = 0,
+    preview_nrows: Optional[int] = None,
+    # change of default values
+    keep_default_na: bool = False,  # pandas default: `True`
+    **kwargs: Any,
+) -> pd.DataFrame:
+    df = pd.read_excel(
+        filepath_or_buffer,
+        keep_default_na=keep_default_na,
+        **kwargs,
+    )
+    # if there are several sheets, pd.read_excel returns a dict {sheet_name: df}
+    if isinstance(df, dict):
+        for sheet_name, sheet_df in df.items():
+            sheet_df["__sheet__"] = sheet_name
+        df = pd.concat(df.values(), sort=False)
+
+    if preview_nrows is not None or preview_offset:
+        offset = None if preview_nrows is None else preview_offset + preview_nrows
+        return df[preview_offset:offset]
+    return df
+
+
+def excel_meta(
+    filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
+) -> Dict[str, Any]:
+    """
+    Returns a dictionary with the meta information of the excel file.
+    """
+    excel_file = pd.ExcelFile(filepath_or_buffer)
+    sheet_names = excel_file.sheet_names
+
+    df = read_excel(excel_file, **reader_kwargs)
+
+    if (sheet_name := reader_kwargs.get("sheet_name", 0)) is None:
+        # multiple sheets together
+        return {
+            "sheetnames": sheet_names,
+            "df_rows": df.shape[0],
+            "total_rows": sum(excel_file.parse(sheet_name).shape[0] for sheet_name in sheet_names),
+        }
+    else:
+        # single sheet
+        return {
+            "sheetnames": sheet_names,
+            "df_rows": df.shape[0],
+            "total_rows": excel_file.parse(sheet_name).shape[0],
+        }
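
And the equivalent sketch on the Excel side — file name and sheet layout are assumed:

    from peakina.readers import excel_meta, read_excel

    # unlike the CSV reader, pagination happens after the whole file is loaded:
    # rows [1, 1 + 3) are sliced out of the (possibly concatenated) dataframe
    df = read_excel("/tmp/fixture.xlsx", preview_offset=1, preview_nrows=3)

    # with sheet_name=None, all sheets are concatenated and total_rows sums every sheet
    meta = excel_meta("/tmp/fixture.xlsx", {"sheet_name": None})
    # e.g. {"sheetnames": ["Sheet1", "Sheet2"], "df_rows": 20, "total_rows": 20}
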
6 changes: 5 additions & 1 deletion peakina/readers/xml.py
@@ -29,7 +29,11 @@ def transform_with_jq(data: Any, jq_filter: str) -> Union[PdDataList, PdDataDict
     return cast(PdDataList, all_data)


-def read_xml(filepath: str, encoding: str = "utf-8", filter: Optional[str] = None) -> pd.DataFrame:
+def read_xml(
+    filepath: str,
+    encoding: str = "utf-8",
+    filter: Optional[str] = None,
+) -> pd.DataFrame:
     data = xmltodict.parse(open(filepath).read(), encoding=encoding)
     if filter is not None:
         data = transform_with_jq(data, filter)
Binary file modified tests/fixtures/0_2.xls
Binary file added tests/fixtures/fixture-single-sheet.xlsx
Binary file added tests/fixtures/fixture_new_format.xls