Speed up data processing #683
Generate a CPU intensity graph of the functions used during pytest.
* Add a pytest project file
* Add pytest-profiling
* Add graphviz
* Document how to profile performance
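For readers who want to see what kind of data such a profile contains, here is a minimal sketch using only the standard library's `cProfile` and `pstats`. It is not the setup added in this PR: `profile_call` is a hypothetical helper, and pytest-profiling (normally run as `pytest --profile` or `pytest --profile-svg`, as far as I know; check the plugin's README) collects the same kind of statistics for a whole test run.

```python
import cProfile
import io
import pstats
from typing import Any, Callable, Tuple


def profile_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, str]:
    """Run func under cProfile and return (result, text report).

    Hypothetical helper for illustration; not part of this repository.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Render the ten entries with the highest cumulative time into a string.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    total, report = profile_call(sum, range(1_000_000))
    print(report)
```

The text report lists call counts and cumulative time per function, which is the information a call graph rendered with graphviz presents visually.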
This shaved 6 seconds off the test.
* Use calamine when importing SMHI data
Speed up repeat usage of SMHI data to a fraction of the time.
* Add feather-format
* Use a feather file cache for the imported SMHI data.
…yran#682 Generalize the df cache decorator.
Refactor out the cache_df decorator.
@joakimbits is attempting to deploy a commit to the Klimatbyrån Team on Vercel. A member of the Team first needs to authorize it.
feather is part of pyarrow, which is already in requirements.
* Remove feather-format
Do not mention excel in cache_df. Also clarify why column names are cached separately.
* More explicit hint on functions supported. * Clarify options.
* Make default path a valid string. * Remove obsolete error handling for no path.
* Hint a return type that is the same as the decorated function.
Great work! I got a minor error, would love to hear your thoughts on it.
     """
     Downloads data from SMHI and loads it into a pandas dataframe.

     Returns:
         pandas.DataFrame: The dataframe containing the SMHI data.
     """

-    df_raw = pd.read_excel(PATH_SMHI)
+    df_raw = pd.read_excel(path, engine="calamine")
I get an error here when I run the tests, and I take it to mean that calamine is not supported by pandas. Would it be better to use another engine? Am I missing something here?
python3 -m pip install -r requirements.txt
Does it work after that?
I added calamine there, and pandas should find it once it is installed.
@elvbom Maybe I should make calamine optional? Calamine speeds up this read_excel by 6 seconds, but the big improvement is cache_df, which takes practically no time after a first read_excel.
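One way to make calamine optional, sketched under assumptions: pick the first engine whose backing package is importable and fall back otherwise. `pick_excel_engine` and `ENGINE_MODULES` are hypothetical names, not code from this PR. The mapping assumes that the `calamine` engine is provided by the `python-calamine` package (module `python_calamine`); pandas itself only accepts `engine="calamine"` from version 2.2 on, which may explain the error above.

```python
import importlib.util
from typing import Mapping, Optional, Sequence

# pandas engine name -> module that must be importable for that engine.
ENGINE_MODULES = {"calamine": "python_calamine", "openpyxl": "openpyxl"}


def pick_excel_engine(
    preferred: Sequence[str] = ("calamine", "openpyxl"),
    modules: Mapping[str, str] = ENGINE_MODULES,
) -> Optional[str]:
    """Return the first preferred engine whose module is installed, else None."""
    for engine in preferred:
        if importlib.util.find_spec(modules[engine]) is not None:
            return engine
    # None lets pd.read_excel choose its default engine.
    return None


if __name__ == "__main__":
    print(pick_excel_engine())
```

The result can be passed straight to `pd.read_excel(path, engine=pick_excel_engine())`. Note that this only checks that the package is installed, not that the installed pandas version knows the engine name.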
I tried xlrd and openpyxl, which worked. Do you think it would be an idea to use one of them instead? I am not familiar with calamine from before, but the alternatives above are usually good choices for parsing Excel files with pandas.
Oh, great! But will it be as fast as calamine? I'll check!
quflow@Joakims-MacBook-Air data % py.test tests
- 8 seconds: pd.read_excel(path, engine="calamine")
- 9 seconds: pd.read_excel(path, engine="openpyxl")
- 17 seconds: pd.read_excel(path)
xlrd did not work with the SMHI data: FAILED tests/test_emission_calculations.py::TestEmissionCalculations::test_get_n_prep_data_from_smhi - xlrd.biffh.XLRDError: Excel xlsx file; not supported
I'll switch to openpyxl!
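The timings above come from whole test-suite runs. To compare the engines on the read alone, something like the following sketch could be used. `time_call` and `compare_engines` are hypothetical helpers, and `smhi.xlsx` is a placeholder path; an engine that is missing or unknown to the installed pandas is recorded as `None` instead of aborting the comparison.

```python
import time
from typing import Any, Callable, Dict, Optional, Sequence, Tuple


def time_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_engines(
    path: str,
    engines: Sequence[Optional[str]] = (None, "openpyxl", "calamine"),
    read_excel: Optional[Callable[..., Any]] = None,
) -> Dict[Optional[str], Optional[float]]:
    """Time read_excel(path, engine=...) once per engine (None = pandas default)."""
    if read_excel is None:
        import pandas as pd  # imported lazily so the helper itself has no hard dependency

        read_excel = pd.read_excel

    timings: Dict[Optional[str], Optional[float]] = {}
    for engine in engines:
        try:
            _, seconds = time_call(read_excel, path, engine=engine)
        except (ImportError, ValueError):
            # Engine package not installed, or engine name unknown to this pandas.
            seconds = None
        timings[engine] = seconds
    return timings


if __name__ == "__main__":
    for engine, seconds in compare_engines("smhi.xlsx").items():
        print(engine, seconds)
```

A single run per engine is noisy, so repeating the measurement a few times gives a fairer comparison.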
Awesome! :)
Fixes a problem when using unittest rather than pytest.
This reverts commit 216339e
Use openpyxl to read SMHI data. Note: calamine is only 15% faster, which is not worth an additional dependency.
5449161 to 0d2575f
Good job! :)
Closes #682 by adding a test profiling tool, a faster Excel parser and a DataFrame caching decorator. Reduces test suite execution time from 15 seconds to 8 seconds on the first run and 0.3 seconds on the second run when tested on a MacBook Air M1.