fix: patch pyarrow.open_stream to support pyarrow>0.17
percevalw committed Dec 7, 2023
1 parent 053a943 commit 1d27b7d
Showing 4 changed files with 52 additions and 2 deletions.
6 changes: 6 additions & 0 deletions changelog.md
@@ -1,10 +1,16 @@
# Changelog

## Unreleased

### Changed

- Support for pyarrow > 0.17.0

### Fixed
- Caching in spark instead of koalas to improve speed

## v0.1.6 (2023-09-27)

### Added
- Module ``event_sequences`` to visualize individual sequences of events.
- Module ``age_pyramid`` to quickly visualize the age and gender distributions in a cohort.
8 changes: 7 additions & 1 deletion eds_scikit/__init__.py
@@ -11,6 +11,7 @@

import importlib
import os
import pathlib
import sys
import time
from packaging import version
@@ -19,15 +20,20 @@

import pandas as pd
import pyarrow
import pyarrow.ipc
import pyspark
from loguru import logger
from pyspark import SparkContext
from pyspark.sql import SparkSession

import eds_scikit.biology # noqa: F401 --> To register functions

import eds_scikit.utils.logging
pyarrow.open_stream = pyarrow.ipc.open_stream

sys.path.insert(
0, (pathlib.Path(__file__).parent / "package-override").absolute().as_posix()
)
os.environ["PYTHONPATH"] = ":".join(sys.path)

# Remove SettingWithCopyWarning
pd.options.mode.chained_assignment = None
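
Reviewer note: the hunk above puts the override directory at the front of `sys.path` and then mirrors `sys.path` into `PYTHONPATH` so that processes started later inherit it. Below is a minimal sketch of that idea, not the project's code. The helper names `export_search_path` and `child_sees` are hypothetical, and the sketch uses `os.pathsep` instead of the literal `":"` from the commit, which is an assumption made for portability.

```python
import os
import subprocess
import sys
from pathlib import Path


def export_search_path(override_dir: Path) -> str:
    """Prepend override_dir to sys.path and mirror sys.path into PYTHONPATH.

    Hypothetical helper for illustration; the commit does this inline.
    """
    entry = override_dir.absolute().as_posix()
    if entry not in sys.path:
        sys.path.insert(0, entry)
    # Skip empty entries: an empty PYTHONPATH item means "current directory".
    os.environ["PYTHONPATH"] = os.pathsep.join(p for p in sys.path if p)
    return entry


def child_sees(entry: str) -> bool:
    """Start a fresh interpreter and report whether entry is on its sys.path."""
    code = "import sys; print(sys.argv[1] in sys.path)"
    result = subprocess.run(
        [sys.executable, "-c", code, entry],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"
```

A child interpreter started after `export_search_path(...)` inherits the environment variable, which is the same route the commit relies on for the Spark executors.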
38 changes: 38 additions & 0 deletions eds_scikit/package-override/pyarrow/__init__.py
@@ -0,0 +1,38 @@
"""
PySpark 2 needs pyarrow.open_stream, which was deprecated in 0.17.0 in favor of
pyarrow.ipc.open_stream. Here is the explanation of how we monkey-patch pyarrow
to add back pyarrow.open_stream for versions > 0.17 and how we make this work with
pyspark distributed computing:
1. We add this fake eds_scikit/package-override/pyarrow package to python lookup list
(the PYTHONPATH env var) in eds_scikit/__init__.py : this env variable will be shared
with the executors
2. When an executor starts and imports packages, it looks for them by inspecting
the paths in PYTHONPATH. It finds our fake pyarrow package first and executes the
current eds_scikit/package-override/pyarrow/__init__.py file
3. In this file, we remove the fake pyarrow package path from the lookup list, unload
the current module from python modules cache (sys.modules) and re-import pyarrow
=> the executor's python will this time load the true pyarrow and store it in
sys.modules. Subsequent "import pyarrow" calls will return the sys.modules["pyarrow"]
value, which is the true pyarrow module.
4. We are not finished: we add back the deprecated "open_stream" function that was
removed in pyarrow 0.17.0 (the reason for all this hacking) by setting it
on the true pyarrow module
5. We still export the pyarrow module content (*) such that the first import, which
is the only one that resolves to this very module, still gets what it asked for:
the pyarrow module's content.
"""

import sys

sys.path.remove(next((p for p in sys.path if "package-override" in p), None))
del sys.modules["pyarrow"]
import pyarrow # noqa: E402, F401

try:
import pyarrow.ipc

pyarrow.open_stream = pyarrow.ipc.open_stream
except ImportError:
pass

from pyarrow import * # noqa: F401, F403, E402
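
Reviewer note: the five steps in the docstring above can be tried out without pyarrow or Spark. The sketch below is an illustration under stated assumptions, not the commit's code: `demo_pkg` stands in for pyarrow, `legacy_open` and `new_open` stand in for `open_stream` and `ipc.open_stream`, and `build_demo` and `import_through_shim` are hypothetical helpers. Unlike the committed shim, the sketch guards against the override entry being absent from `sys.path` before removing it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the real package (plays the role of pyarrow).
REAL_SOURCE = '''
def new_open():
    return "stream"
'''

# Stand-in for package-override/pyarrow/__init__.py.
SHIM_SOURCE = '''
import sys

# Step 3a: drop the override directory from the search path, if present.
override = next((p for p in sys.path if "package-override" in p), None)
if override is not None:
    sys.path.remove(override)

# Step 3b: forget this shim so the next import resolves to the real package.
del sys.modules["demo_pkg"]
import demo_pkg

# Step 4: restore the legacy alias on the real module.
demo_pkg.legacy_open = demo_pkg.new_open

# Step 5: re-export the real module's names from the shim.
from demo_pkg import *
'''


def build_demo(root: Path) -> Path:
    """Write the real package and the shim package under root."""
    layout = (
        (root / "site" / "demo_pkg", REAL_SOURCE),
        (root / "package-override" / "demo_pkg", SHIM_SOURCE),
    )
    for directory, source in layout:
        directory.mkdir(parents=True)
        (directory / "__init__.py").write_text(source)
    return root


def import_through_shim(root: Path):
    """Import demo_pkg with the override directory first on sys.path."""
    real_entry = str(root / "site")
    shim_entry = str(root / "package-override")
    sys.path.insert(0, real_entry)
    sys.path.insert(0, shim_entry)  # steps 1 and 2: the shim is found first
    importlib.invalidate_caches()
    sys.modules.pop("demo_pkg", None)
    try:
        return importlib.import_module("demo_pkg")
    finally:
        # Leave sys.path as we found it; the shim has already removed its entry.
        for entry in (real_entry, shim_entry):
            if entry in sys.path:
                sys.path.remove(entry)


if __name__ == "__main__":
    module = import_through_shim(build_demo(Path(tempfile.mkdtemp())))
    print(module.legacy_open())
```

After the import, `sys.modules["demo_pkg"]` refers to the real package with the alias attached, so later imports never touch the shim again.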
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -41,7 +41,7 @@ dependencies = [
"loguru==0.7.0",
"pypandoc==1.7.5",
"pyspark==2.4.3",
"pyarrow==0.17.0", #"pyarrow>=0.10, <0.17.0",
"pyarrow>=0.10.0",
"pretty-html-table>=0.9.15, <0.10.0",
"catalogue",
"schemdraw>=0.15.0, <1.0.0",