Speed up data processing #683

joakimbits · 2024-09-22T00:09:02Z

Closes #682 by adding a test profiling tool, a faster Excel parser, and a DataFrame caching decorator. Reduces test suite execution time from 15 seconds to 8 seconds on the first run and 0.3 seconds on the second run when tested on a MacBook Air M1.

Generate a CPU intensity graph of the functions used during pytest.

* Add a pytest project file
* Add pytest-profiling
* Add graphviz
* Document how to profile performance
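For readers who want to see what this kind of profiling measures, here is a minimal sketch using only `cProfile` and `pstats` from the standard library. It is not the setup added in this PR: pytest-profiling exposes similar measurements through pytest options such as `--profile` and `--profile-svg`, and `profile_call` and `slow_sum` are hypothetical names used only for illustration.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first,
    # and limit the report to the ten heaviest entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_sum(n):
    # Hypothetical stand-in for an expensive data-loading step.
    return sum(i * i for i in range(n))
```

Calling `profile_call(slow_sum, 100_000)` returns the computed value together with a text report in which the slowest call chains are listed first, which is the same information the generated graph presents visually.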

This shaved off 6 seconds from the test.

* Use calamine when importing SMHI data

Speed up repeat usage of SMHI data to a fraction of the original time.

* Add feather-format
* Use a feather file cache for the imported SMHI data.

…yran#682 Generalize the df cache decorator.

Refactor-out the cache_df decorator.
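A minimal sketch of the file-cache idea behind `cache_df`, not the code from this PR. It assumes one cache file per decorated function, ignores call arguments, and swaps Feather for `pickle` from the standard library so that it runs without pandas or pyarrow. The names `cache_to_file` and `max_age_seconds` are hypothetical.

```python
import functools
import os
import pickle
import time


def cache_to_file(path, max_age_seconds=None):
    """Cache the decorated function's return value in a file at `path`.

    Hypothetical stand-in for cache_df: pickle replaces Feather here.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(path):
                age = time.time() - os.path.getmtime(path)
                # Reuse the cached value while the file is fresh enough.
                if max_age_seconds is None or age <= max_age_seconds:
                    with open(path, "rb") as cache_file:
                        return pickle.load(cache_file)
            # Cache miss or stale file: compute and store the result.
            result = func(*args, **kwargs)
            with open(path, "wb") as cache_file:
                pickle.dump(result, cache_file)
            return result
        return wrapper
    return decorator
```

With a decorator like this, only the first call pays for the slow load; later calls read the cached file until it is older than the configured period. A DataFrame version would presumably write with `df.to_feather(path)` and read with `pd.read_feather(path)` instead of using pickle.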

vercel · 2024-09-22T00:09:06Z

@joakimbits is attempting to deploy a commit to the Klimatbyrån Team on Vercel.

A member of the Team first needs to authorize it.

feather is part of pyarrow which is already in requirements.

* Remove feather-format

Do not mention excel in cache_df. Also clarify why column names are cached separately.

* More explicit hint on functions supported.
* Clarify options.

* Make default path a valid string.
* Remove obsolete error handling for no path.

* Hint a return type that is the same as the decorated function.

elvbom

Great work! I got a minor error and would love to hear your thoughts on it.

elvbom · 2024-09-26T15:33:13Z

data/issues/emissions/historical_data_calculations.py

    """
    Downloads data from SMHI and loads it into a pandas dataframe.

    Returns:
        pandas.DataFrame: The dataframe containing the SMHI data.
    """

-    df_raw = pd.read_excel(PATH_SMHI)
+    df_raw = pd.read_excel(path, engine="calamine")


I get an error here when I run the tests, and as I understand it calamine is not supported by pandas. Would it be better to use a different engine? Is there something I'm missing here?

python3 -m pip install -r requirements.txt

Does it work after that?

I added calamine there, and pandas should find it once it is installed.

@elvbom Maybe I should make calamine optional? Calamine speeds up this read_excel by 6 seconds, but the big improvement is cache_df, which takes practically no time after a first read_excel.
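One way an optional engine could be handled is sketched below. This is an assumption about how it might be done, not something the PR implements: it checks which parser modules are importable and falls back to the next candidate. The function name `pick_excel_engine` and the `ENGINE_MODULES` mapping are hypothetical, and the calamine entry assumes the python-calamine package, which is imported as `python_calamine`.

```python
import importlib.util

# Engine name -> importable module name. The calamine mapping assumes the
# python-calamine package, which is imported as python_calamine.
ENGINE_MODULES = {
    "calamine": "python_calamine",
    "openpyxl": "openpyxl",
}


def pick_excel_engine(preferred=("calamine", "openpyxl"), modules=ENGINE_MODULES):
    """Return the first preferred engine whose module is installed, else None."""
    for engine in preferred:
        module_name = modules.get(engine)
        # find_spec returns None for a top-level module that is not installed.
        if module_name and importlib.util.find_spec(module_name) is not None:
            return engine
    return None
```

The result could then be passed as `pd.read_excel(path, engine=pick_excel_engine())`; a value of `None` lets pandas choose its default engine. As far as I know, `engine="calamine"` also requires pandas 2.2 or later, which a check like this does not cover.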

I tried xlrd and openpyxl, which worked. Do you think it would be an idea to use one of them instead? I'm not familiar with calamine from before, but the alternatives above are usually good choices for parsing Excel files with pandas.

Oh, great! But will it be as fast as calamine? I'll check!

quflow@Joakims-MacBook-Air data % py.test tests

8 seconds: pd.read_excel(path, engine="calamine")

9 seconds: pd.read_excel(path, engine="openpyxl")

17 seconds: pd.read_excel(path)

xlrd did not work with the SMHI data: FAILED tests/test_emission_calculations.py::TestEmissionCalculations::test_get_n_prep_data_from_smhi - xlrd.biffh.XLRDError: Excel xlsx file; not supported

I'll switch to openpyxl!
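The figures in this thread are whole-suite timings from `py.test tests`. To time only the parsing step, a small helper along these lines could be used. It is a sketch with hypothetical names (`time_call`, `compare`) and records a failing candidate's exception instead of stopping, which is how an engine like xlrd would show up.

```python
import time


def time_call(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for one call of func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def compare(candidates):
    """Time each zero-argument callable; failures are recorded, not raised."""
    timings = {}
    for name, func in candidates.items():
        try:
            _, elapsed = time_call(func)
            timings[name] = elapsed
        except Exception as error:  # e.g. an engine that rejects the file
            timings[name] = error
    return timings
```

Each candidate would be a lambda such as `lambda: pd.read_excel(path, engine="openpyxl")`, so that every engine is measured against the same file.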

vercel · 2024-10-09T11:18:10Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated (UTC)
klimatkollen	✅ Ready (Inspect)	Visit Preview	Oct 23, 2024 0:15am

Fixes a problem when using unittest rather than pytest.

This reverts commit 216339e

Use openpyxl to read SMHI data. Note: calamine is only 15% faster, which is not worth an additional dependency.

elvbom

Well done! :)

joakimbits added 5 commits September 19, 2024 16:19

Profile data tests Klimatbyran#682

28138b8

Generate a CPU intensity graph of the functions used during pytest.

* Add a pytest project file
* Add pytest-profiling
* Add graphviz
* Document how to profile performance

Speedup read_excel for SMHI data Klimatbyran#682

216339e

This shaved off 6 seconds from the test.

* Use calamine when importing SMHI data

Cache SMHI data Klimatbyran#682

53eeb3f

Speed up repeat usage of SMHI data to a fraction of the original time.

* Add feather-format
* Use a feather file cache for the imported SMHI data.

Make default excel file path and df cache period configurable Klimatb…

b829564

…yran#682 Generalize the df cache decorator.

Add a cache_utilities module Klimatbyran#682

d2e9590

Refactor-out the cache_df decorator.

joakimbits changed the base branch from staging to main September 22, 2024 00:09

Remove duplicate requirement Klimatbyran#682

e1b86bc

feather is part of pyarrow which is already in requirements.

* Remove feather-format

joakimbits mentioned this pull request Sep 22, 2024

Speed-up data processing #682

Open

joakimbits added 4 commits September 22, 2024 02:43

cache_df is not just for read_excel Klimatbyran#682

7516b67

Do not mention excel in cache_df. Also clarify why column names are cached separately.

Improve documentation of cache_df Klimatbyran#682

06e1102

* More explicit hint on functions supported.
* Clarify options.

Make cache_df work without path Klimatbyran#682

5627a2d

* Make default path a valid string.
* Remove obsolete error handling for no path.

Make cache_df decoration more apparent Klimatbyran#682

9dac37d

* Hint a return type that is the same as the decorated function.

elvbom requested changes Sep 26, 2024

vercel bot deployed to Preview October 9, 2024 11:19

joakimbits added 3 commits October 15, 2024 10:00

Use relative import of cache_df Klimatbyran#682

1e5e324

Fixes a problem when using unittest rather than pytest.

Revert "Speedup read_excel for SMHI data Klimatbyran#682"

b39530e

This reverts commit 216339e

Speedup read_excel for SMHI data using openpyxl Klimatbyran#682

0d2575f

Use openpyxl to read SMHI data. Note: calamine is only 15% faster, which is not worth an additional dependency.

joakimbits force-pushed the profiling-data-tests branch from 5449161 to 0d2575f Compare October 15, 2024 08:25

vercel bot deployed to Preview October 23, 2024 12:15

elvbom approved these changes Oct 23, 2024

elvbom merged commit 19f34d1 into Klimatbyran:main Oct 23, 2024
3 checks passed

joakimbits deleted the profiling-data-tests branch November 7, 2024 13:38

joakimbits commented Sep 22, 2024

vercel bot commented Sep 22, 2024

elvbom left a comment

elvbom Sep 26, 2024

joakimbits Sep 26, 2024 (edited)

joakimbits Sep 30, 2024

elvbom Oct 9, 2024

joakimbits Oct 15, 2024

joakimbits Oct 15, 2024

elvbom Oct 23, 2024

vercel bot commented Oct 9, 2024 (edited)

elvbom left a comment