How to reduce memory usage? #74

Open · Kazuuk opened this issue Nov 22, 2024 · 2 comments

Kazuuk commented Nov 22, 2024

Hello,

I tried to use export_data to convert a Parquet file, whose source is a large (2 GB) CSV, to SBDF.
When I tried this with the C library, I used tableslice as recommended and it worked fine.
But in Python, I first convert the Parquet file to a Pandas dataframe and then convert the Pandas dataframe to SBDF.
The conversion from Pandas dataframe to SBDF consumes a lot of memory.
Is there any way to use the tableslice method in Python, or a better approach?

Thank you in advance.

bbassett-tibco (Collaborator) commented

Welcome back, @Kazuuk! The Python SBDF module already uses table slices as you describe (as I originally commented in spotfiresoftware/spotfire-sbdf-c#8); slicing is controlled by the rows_per_slice argument to export_data, where the default of 0 computes an appropriate setting so that close to 100,000 values end up in each slice (see the sketch after the list below). That said, the behavior you've described sounds like it could be due to one of two causes:

  • Pandas is generally an in-memory structure (and export_data requires its data to be in memory); a 2 GB file read in via Arrow likely uses an amount of memory on the same order of magnitude as the file size.
  • We have a memory leak in the export code.
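
For reference, here is a minimal sketch of setting rows_per_slice explicitly (the file paths are placeholders, and the pandas/spotfire imports are assumptions based on the usage shown in this thread):

import pandas as pd
from spotfire import sbdf

# Placeholder input; any in-memory DataFrame works the same way.
df = pd.read_parquet("data.parquet")
# Cap each table slice at 10,000 rows instead of letting the default
# (rows_per_slice=0) compute a slice size automatically.
sbdf.export_data(df, "data.sbdf", rows_per_slice=10000)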

If you're willing to augment your code and install an additional package from PyPI, we can figure out which is the culprit. (We can't use the built-in tracemalloc module to debug this, because the C SBDF library does not use Python's memory allocation functions; we have an open issue to allow this in spotfiresoftware/spotfire-sbdf-c#5.) Please try the following and report the output here:

  • Install the psutil package from PyPI:
    python -m pip install psutil
    
  • Add the following near the top of your script:
    import psutil
    _process = psutil.Process()
    
  • Add the following just before you read the Parquet file into Pandas:
    print(f"SFPY: before parquet read {_process.memory_info().rss}")
    
  • Add the following just before you call export_data:
    print(f"SFPY: before export_data {_process.memory_info().rss}")
    
  • Add the following just after you call export_data:
    print(f"SFPY: after export_data {_process.memory_info().rss}")
    
  • Post the output lines that start with SFPY:

This captures the RSS (resident set size) memory usage of the Python process at each step; this is where the psutil package comes in, giving us an RSS value no matter what operating system you are using. With those numbers we'll be able to see where the memory usage is coming from.
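
Putting those steps together, the instrumented script might look like the sketch below (the paths and the pyarrow/spotfire imports are assumptions based on the code style in this thread):

import psutil
import pyarrow.parquet as pq
from spotfire import sbdf

_process = psutil.Process()

input_path = "data.parquet"   # placeholder Parquet input
output_path = "data.sbdf"     # placeholder SBDF output

print(f"SFPY: before parquet read {_process.memory_info().rss}")
table = pq.read_table(input_path)   # load the Parquet file as an Arrow table
df = table.to_pandas()              # materialize it as a pandas DataFrame

print(f"SFPY: before export_data {_process.memory_info().rss}")
sbdf.export_data(df, output_path)   # write the DataFrame out as SBDF

print(f"SFPY: after export_data {_process.memory_info().rss}")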

Kazuuk (Author) commented Nov 27, 2024

Thank you for your kind help :)
Here is the data you requested and the code that I tried.

The Python code is as follows:
import psutil
import pyarrow.parquet as pq
from spotfire import sbdf

_process = psutil.Process()

table = pq.read_table(sPath)  # sPath: path to the source Parquet file
print(f"SFPY: before parquet read {_process.memory_info().rss}")
df = table.to_pandas()
print(f"SFPY: before export_data {_process.memory_info().rss}")
sbdf.export_data(df, tPath)  # tPath: path for the SBDF output
print(f"SFPY: after export_data {_process.memory_info().rss}")

The output lines starting with SFPY are as follows:

Case #1 (Parquet file size 0.22 MB, originally 72.89 MB as CSV):
SFPY: before parquet read 336879616
SFPY: before export_data 469250048
SFPY: after export_data 486871040

Case #2 (Parquet file size 573 MB, originally 2.1 GB as CSV):
SFPY: before parquet read 4455198720
SFPY: before export_data 7983198208

The last SFPY line is not shown, due to the following error: "numpy.core._exceptions._ArrayMemoryError: Unable to allocate 9.30 MiB for an array with shape (1219272,) and data type int64"

Thank you again in advance :)
