How to reduce memory usage? #74

Open · Kazuuk opened this issue Nov 22, 2024 · 2 comments

Kazuuk commented Nov 22, 2024

Hello,

I tried to use export_data to convert a Parquet file, whose source is a large (2 GB) CSV, to SBDF.
When I tried this with the C library, I used tableslice as recommended and it worked fine.
But in Python, I first convert the Parquet file to a Pandas dataframe and then convert the Pandas dataframe to SBDF.
The conversion from Pandas dataframe to SBDF consumes a lot of memory.
Is there any way to use the tableslice method in Python, or a better approach?

Thank you in advance.

bbassett-tibco (Collaborator) commented

Welcome back, @Kazuuk! The Python SBDF module already uses table slices as you describe (as I originally commented in spotfiresoftware/spotfire-sbdf-c#8); slicing is controlled by the rows_per_slice argument to export_data, where the default of 0 computes an appropriate setting so that close to 100,000 values end up in each slice (see the sketch after the list below). That said, the behavior you've described sounds like it could be due to one of two causes:

  • Pandas is generally an in-memory structure (and export_data requires its data to be in memory); a 2 GB file read in via Arrow likely uses an amount of memory on the same order of magnitude as the file size.
  • We have a memory leak in the export code.
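
For reference, here is a minimal sketch of setting rows_per_slice explicitly (the file paths are placeholders, and the pandas/spotfire imports are assumptions based on the usage shown in this thread):

import pandas as pd
from spotfire import sbdf

# Placeholder input; any in-memory DataFrame works the same way.
df = pd.read_parquet("data.parquet")
# Cap each table slice at 10,000 rows instead of letting the default
# (rows_per_slice=0) compute a slice size automatically.
sbdf.export_data(df, "data.sbdf", rows_per_slice=10000)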

If you're willing to augment your code and install an additional package from PyPI, we can figure out which is the culprit. (We can't use the built-in tracemalloc module to debug this, because the C SBDF library does not use Python's memory allocation functions; we have an open issue to allow this in spotfiresoftware/spotfire-sbdf-c#5.) Please try the following and report the output here:

  • Install the psutil package from PyPI:
    python -m pip install psutil
    
  • Add the following near the top of your script:
    import psutil
    _process = psutil.Process()
    
  • Add the following just before you read the Parquet file into Pandas:
    print(f"SFPY: before parquet read {_process.memory_info().rss}")
    
  • Add the following just before you call export_data:
    print(f"SFPY: before export_data {_process.memory_info().rss}")
    
  • Add the following just after you call export_data:
    print(f"SFPY: after export_data {_process.memory_info().rss}")
    
  • Post the output lines that start with SFPY:

This captures the RSS (resident set size) memory usage of the Python process at each step; this is where the psutil package comes in, giving us an RSS value no matter what operating system you are using. With those numbers we'll be able to see where the memory usage is coming from.
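
Putting those steps together, the instrumented script might look like the sketch below (the paths and the pyarrow/spotfire imports are assumptions based on the code style in this thread):

import psutil
import pyarrow.parquet as pq
from spotfire import sbdf

_process = psutil.Process()

input_path = "data.parquet"   # placeholder Parquet input
output_path = "data.sbdf"     # placeholder SBDF output

print(f"SFPY: before parquet read {_process.memory_info().rss}")
table = pq.read_table(input_path)   # load the Parquet file as an Arrow table
df = table.to_pandas()              # materialize it as a pandas DataFrame

print(f"SFPY: before export_data {_process.memory_info().rss}")
sbdf.export_data(df, output_path)   # write the DataFrame out as SBDF

print(f"SFPY: after export_data {_process.memory_info().rss}")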

Kazuuk (Author) commented Nov 27, 2024

Thank you for your kind help :)
Here is the data you requested and the code that I tried.

The Python code is as follows:
import psutil
import pyarrow.parquet as pq
from spotfire import sbdf

_process = psutil.Process()

table = pq.read_table(sPath)  # sPath: path to the source Parquet file
print(f"SFPY: before parquet read {_process.memory_info().rss}")
df = table.to_pandas()
print(f"SFPY: before export_data {_process.memory_info().rss}")
sbdf.export_data(df, tPath)  # tPath: path for the SBDF output
print(f"SFPY: after export_data {_process.memory_info().rss}")

The output lines starting with SFPY are as follows:

Case #1 (Parquet file size 0.22 MB, originally 72.89 MB as CSV):
SFPY: before parquet read 336879616
SFPY: before export_data 469250048
SFPY: after export_data 486871040

Case #2 (Parquet file size 573 MB, originally 2.1 GB as CSV):
SFPY: before parquet read 4455198720
SFPY: before export_data 7983198208

The last SFPY line is not shown, due to the following error: "numpy.core._exceptions._ArrayMemoryError: Unable to allocate 9.30 MiB for an array with shape (1219272,) and data type int64"

Thank you again in advance :)
