Commit

Source file: Fix error on zipped files (airbytehq#39909)
Co-authored-by: Marcos Marx <[email protected]>
chenriquealvarenga and marcosmarxm authored Aug 1, 2024
1 parent 6da294f commit bc36dd2
Showing 7 changed files with 279 additions and 252 deletions.
Binary file not shown.
2 changes: 1 addition & 1 deletion airbyte-integrations/connectors/source-file/metadata.yaml
@@ -10,7 +10,7 @@ data:
   connectorSubtype: file
   connectorType: source
   definitionId: 778daa7c-feaf-4db6-96f3-70fd645acc77
-  dockerImageTag: 0.5.3
+  dockerImageTag: 0.5.4
   dockerRepository: airbyte/source-file
   documentationUrl: https://docs.airbyte.com/integrations/sources/file
   githubIssueLabel: source-file
496 changes: 248 additions & 248 deletions airbyte-integrations/connectors/source-file/poetry.lock

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion airbyte-integrations/connectors/source-file/pyproject.toml
@@ -3,7 +3,7 @@ requires = [ "poetry-core>=1.0.0",]
 build-backend = "poetry.core.masonry.api"

 [tool.poetry]
-version = "0.5.3"
+version = "0.5.4"
 name = "source-file"
 description = "Source implementation for File"
 authors = [ "Airbyte <[email protected]>",]
@@ -22,6 +22,7 @@ beautifulsoup4 = "==4.11.1"
 openpyxl = "==3.1.0"
 google-cloud-storage = "==2.5.0"
 pandas = "2.2.2"
+numpy = "<2"
 airbyte-cdk = "^0"
 paramiko = "==2.11.0"
 xlrd = "==2.0.1"
@@ -75,7 +75,7 @@ def __init__(self, url: str, provider: dict, binary=None, encoding=None):
         self._file = None
         self.args = {
             "mode": "rb" if binary else "r",
-            "encoding": encoding,
+            "encoding": None if binary else encoding,
         }

     def __enter__(self):
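
The hunk above stops passing `encoding` when the file is opened in binary mode. A minimal sketch of why that matters, assuming the opener behaves like Python's built-in `open` (the connector may route through a different opener, and `build_open_args` and `demo` are hypothetical names used only for illustration):

```python
import os
import tempfile
from typing import Optional, Tuple


def build_open_args(binary: bool, encoding: Optional[str]) -> dict:
    """Hypothetical helper that mirrors the patched `self.args` dict."""
    return {
        "mode": "rb" if binary else "r",
        # Binary handles must not receive an encoding.
        "encoding": None if binary else encoding,
    }


def demo() -> Tuple[bool, bytes]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as fh:
            fh.write(b"\xff\xfeh\x00i\x00")

        # Pre-patch behaviour: binary mode combined with an encoding.
        # Built-in open() rejects this combination with ValueError.
        try:
            open(path, mode="rb", encoding="utf_16")
            old_raised = False
        except ValueError:
            old_raised = True

        # Patched behaviour: the encoding is dropped for binary handles.
        with open(path, **build_open_args(binary=True, encoding="utf_16")) as fh:
            data = fh.read()
    return old_raised, data
```

With the encoding dropped, the handle yields raw bytes and decoding is left to whatever consumes it.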
@@ -452,7 +452,7 @@ def _unzip(self, fp):
         logger.info("Temp dir content: " + str(os.listdir(tmp_dir.name)))
         final_file: str = os.path.join(tmp_dir.name, os.listdir(tmp_dir.name)[0])
         logger.info("Pick up first file: " + final_file)
-        fp_tmp = open(final_file, "r")
+        fp_tmp = open(final_file, "rb")
         return fp_tmp

     def _cache_stream(self, fp):
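
The `_unzip` change returns a binary handle for the extracted file, so the configured encoding can be applied by the reader instead of the platform default. A rough, standard-library-only sketch of that flow follows; `make_zipped_csv` and `read_first_zipped_csv` are hypothetical helpers, and `csv` stands in for whatever parser the connector actually uses:

```python
import csv
import io
import os
import tempfile
import zipfile


def make_zipped_csv(directory: str, encoding: str) -> str:
    """Write a tiny CSV in the given encoding and zip it (illustrative fixture)."""
    zip_path = os.path.join(directory, "sample.csv.zip")
    payload = "header1,header2\r\nabc,1.5\r\n".encode(encoding)
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("sample.csv", payload)
    return zip_path


def read_first_zipped_csv(zip_path: str, encoding: str) -> list:
    """Extract the archive, open the first file as bytes, decode explicitly."""
    tmp_dir = tempfile.TemporaryDirectory()
    try:
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(tmp_dir.name)
        first = os.path.join(tmp_dir.name, os.listdir(tmp_dir.name)[0])
        # Binary handle, as in the patched _unzip; the caller picks the encoding.
        with open(first, "rb") as fp:
            text = io.TextIOWrapper(fp, encoding=encoding, newline="")
            return list(csv.reader(text))
    finally:
        tmp_dir.cleanup()
```

Opening the same extracted file in text mode with a mismatched default encoding would fail or garble the UTF-16 bytes, which is the situation the new zipped UTF-16 test exercises.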
@@ -64,6 +64,31 @@ def test_csv_with_utf16_encoding(absolute_path, test_files):
     assert stream.json_schema == expected_schema


+def test_zipped_csv_with_utf16_encoding(absolute_path, test_files):
+    config_local_zipped_csv_utf16 = {
+        "dataset_name": "AAA",
+        "format": "csv",
+        "reader_options": '{"encoding":"utf_16", "parse_dates": ["header5"]}',
+        "url": f"{absolute_path}/{test_files}/test_utf16.csv.zip",
+        "provider": {"storage": "local"},
+    }
+    expected_schema = {
+        "$schema": "http://json-schema.org/draft-07/schema#",
+        "properties": {
+            "header1": {"type": ["string", "null"]},
+            "header2": {"type": ["number", "null"]},
+            "header3": {"type": ["number", "null"]},
+            "header4": {"type": ["boolean", "null"]},
+            "header5": {"type": ["string", "null"], "format": "date-time"},
+        },
+        "type": "object",
+    }
+
+    catalog = SourceFile().discover(logger=logger, config=config_local_zipped_csv_utf16)
+    stream = next(iter(catalog.streams))
+    assert stream.json_schema == expected_schema
+
+
 def get_catalog(properties):
     return ConfiguredAirbyteCatalog(
         streams=[
1 change: 1 addition & 0 deletions docs/integrations/sources/file.md
@@ -234,6 +234,7 @@ In order to read large files from a remote location, this connector uses the [sm

 | Version | Date | Pull Request | Subject |
 | :------ | :--------- | :------------------------------------------------------- | :------------------------------------------------------------------------------------------------------ |
+| 0.5.4 | 2024-07-01 | [39909](https://github.com/airbytehq/airbyte/pull/39909) | Fix error with zip files and encoding. |
 | 0.5.3 | 2024-06-27 | [40215](https://github.com/airbytehq/airbyte/pull/40215) | Replaced deprecated AirbyteLogger with logging.Logger |
 | 0.5.2 | 2024-06-06 | [39192](https://github.com/airbytehq/airbyte/pull/39192) | [autopull] Upgrade base image to v1.2.2 |
 | 0.5.1 | 2024-05-03 | [37799](https://github.com/airbytehq/airbyte/pull/37799) | Add fastparquet engine for parquet file reader. |
