Speed up data processing #683

Merged: 13 commits, Oct 23, 2024

joakimbits

Closes #682 by adding a test profiling tool, a faster Excel parser, and a DataFrame caching decorator. This reduces the test suite execution time from 15 seconds to 8 seconds on the first run and to 0.3 seconds on the second run, measured on a MacBook Air M1.

Generate a CPU intensity graph of the functions used during pytest.

* Add a pytest project file
* Add pytest-profiling
* Add graphviz
* Document how to profile performance
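
As a rough illustration of what such a profile measures, the sketch below uses only the standard library's `cProfile` and `pstats` rather than pytest-profiling itself. `slow_parse`, `fast_lookup` and `run_workload` are hypothetical stand-ins for code exercised by the tests, not functions from this repository.

```python
import cProfile
import io
import pstats


def slow_parse(n: int) -> int:
    """Hypothetical stand-in for an expensive step such as parsing a file."""
    return sum(i * i for i in range(n))


def fast_lookup(n: int) -> int:
    """Hypothetical stand-in for a cheap step."""
    return n + 1


def run_workload() -> int:
    return slow_parse(200_000) + fast_lookup(1)


def profile_report(func, top: int = 5) -> str:
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_report(run_workload)
print(report)
```

Sorting by cumulative time puts the functions that dominate the run at the top, which is the same information a call graph rendered with graphviz conveys visually.
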
This shaved off 6 seconds from the test.

* Use calamine when importing SMHI data
Speed up repeat usage of SMHI data to a fraction.

* Add feather-format
* Use a feather file cache for the imported SMHI data.
Refactor out the cache_df decorator.

vercel bot commented Sep 22, 2024

@joakimbits is attempting to deploy a commit to the Klimatbyrån Team on Vercel.

A member of the Team first needs to authorize it.

@joakimbits joakimbits changed the base branch from staging to main September 22, 2024 00:09
Feather is part of pyarrow, which is already in requirements.

* Remove feather-format
Do not mention Excel in cache_df.

Also clarify why column names are cached separately.
* More explicit hint on functions supported.
* Clarify options.
* Make default path a valid string.
* Remove obsolete error handling for no path.
* Hint a return type that is the same as the decorated function.
Collaborator

@elvbom elvbom left a comment

Great work! I got a minor error, would love to hear your thoughts on it

"""
Downloads data from SMHI and loads it into a pandas dataframe.

Returns:
pandas.DataFrame: The dataframe containing the SMHI data.
"""

- df_raw = pd.read_excel(PATH_SMHI)
+ df_raw = pd.read_excel(path, engine="calamine")
Collaborator

I get an error here when I run the tests, and as I understand it calamine isn't supported by pandas. Would it be better to use another engine? Am I missing something here?

Author

@joakimbits joakimbits Sep 26, 2024

python3 -m pip install -r requirements.txt

Does it work after that?

I added calamine there, and pandas should find it once it is installed.

Author

@elvbom Maybe I should make calamine optional? Calamine speeds up this read_excel by 6 seconds, but the big improvement is cache_df, which takes practically no time after a first read_excel.
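
A minimal sketch of what an optional engine could look like, assuming the calamine engine is backed by the `python_calamine` package and that falling back to openpyxl is acceptable. `pick_excel_engine`, `is_installed` and `ENGINE_MODULES` are hypothetical helpers, not part of pandas or of this PR.

```python
from importlib.util import find_spec

# Import names of the packages assumed to back each pandas Excel engine.
ENGINE_MODULES = {"calamine": "python_calamine", "openpyxl": "openpyxl"}


def is_installed(module_name: str) -> bool:
    """Report whether a top-level module can be imported, without importing it."""
    return find_spec(module_name) is not None


def pick_excel_engine(preferred=("calamine", "openpyxl"), installed=is_installed):
    """Return the first preferred engine whose package is importable, else None."""
    for engine in preferred:
        if installed(ENGINE_MODULES[engine]):
            return engine
    return None


print(pick_excel_engine())
```

The result could then be passed as `pd.read_excel(path, engine=pick_excel_engine())`. A value of `None` leaves the choice of engine to pandas.
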

Collaborator

I tried xlrd and openpyxl, which worked. Do you think it would be an idea to use one of them instead? I'm not familiar with calamine, but the alternatives above are usually good choices for parsing Excel files with pandas.

Author

Yes, great! But will it be as fast as calamine? I'll check!

Author

quflow@Joakims-MacBook-Air data % py.test tests

  • 8 seconds: pd.read_excel(path, engine="calamine")
  • 9 seconds: pd.read_excel(path, engine="openpyxl")
  • 17 seconds: pd.read_excel(path)

xlrd did not work with the SMHI data: FAILED tests/test_emission_calculations.py::TestEmissionCalculations::test_get_n_prep_data_from_smhi - xlrd.biffh.XLRDError: Excel xlsx file; not supported

I'll switch to openpyxl!
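
The figures above are whole-suite timings. To compare candidates in isolation, one option is a small timing harness like the sketch below. `time_call` and `compare` are hypothetical helpers, and the two lambdas are placeholders for calls such as reading the same file with different engines.

```python
import time


def time_call(func, *args, repeats: int = 3, **kwargs) -> float:
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def compare(candidates: dict) -> list[tuple[str, float]]:
    """Time each zero-argument callable and sort the results fastest first."""
    timings = [(name, time_call(func)) for name, func in candidates.items()]
    return sorted(timings, key=lambda item: item[1])


results = compare({
    "small": lambda: sum(range(1_000)),
    "large": lambda: sum(range(1_000_000)),
})
for name, seconds in results:
    print(f"{name}: {seconds:.4f} s")
```

Taking the best of several runs reduces noise from other processes, at the cost of hiding first-run effects such as cold file caches.
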

Collaborator

Awesome! :)


vercel bot commented Oct 9, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
klimatkollen ✅ Ready (Inspect) Visit Preview Oct 23, 2024 0:15am

Fixes a problem when using unittest rather than pytest.
Use openpyxl to read SMHI data.

Note: calamine is only 15% faster, which is not worth an additional dependency.
Collaborator

@elvbom elvbom left a comment

Good job! :)

@elvbom elvbom merged commit 19f34d1 into Klimatbyran:main Oct 23, 2024
3 checks passed
@joakimbits joakimbits deleted the profiling-data-tests branch November 7, 2024 13:38
Linked issue: Speed-up data processing
2 participants