Expose PySpark's persist() method to the Evaluation class #292

Open
samlamont opened this issue Oct 30, 2024 · 2 comments

@samlamont (Collaborator)

Wondering if we could make use of the persist or cache methods in PySpark to load the DataFrame into memory. The cached DataFrame could be attached to the Evaluation class object (similar to how the df is attached to the accessor class), eliminating the need to re-calculate things for visualizations and method chaining.

I guess this could replace the accessor or complement it?

@mgdenno (Contributor) commented Oct 31, 2024

I am certainly in favor of exploring this. Not sure I totally understand where this would fit in, but let's discuss.

@samlamont (Collaborator, Author)

Some notes on persist() in PySpark:

The general idea is that persist() allows you to cache a DataFrame, making subsequent access to the data more efficient. You can cache the data in memory and/or on disk, which you control via the storageLevel argument when calling persist().
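For reference, a minimal, self-contained sketch of the behavior (the DataFrame here is just an illustration, not a TEEHR object):

  from pyspark.sql import SparkSession
  from pyspark import StorageLevel

  spark = SparkSession.builder.getOrCreate()
  df = spark.range(1_000_000)

  # MEMORY_AND_DISK is the default for DataFrame.persist(); other levels
  # (MEMORY_ONLY, DISK_ONLY, ...) trade memory use against recompute cost.
  df.persist(StorageLevel.MEMORY_AND_DISK)

  # persist() is lazy -- blocks are only materialized by the first action.
  df.count()
  df.is_cached
  # True

  # Release the cached blocks when finished.
  df.unpersist()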

My thought is that this could allow us to cache/persist the results of a query (on the metric and timeseries classes), which would enable us to make different types of plots (and run any other analyses) on the cached data more efficiently (without having to re-execute the query each time). This would be an alternative to the dataframe accessor class for visualization.

So instead of returning a pandas DataFrame from a query, we could return the class object (which we already do to support chaining) holding the cached DataFrame.

I added this method to the Metrics class:

  # Assumes psl is an alias for PySpark's StorageLevel, e.g.:
  # from pyspark import StorageLevel as psl

  def persist(self, storage_level: psl = psl.MEMORY_AND_DISK):
      """Persist the underlying Spark DataFrame at the specified storage level."""
      self.df.persist(storageLevel=storage_level)
      return self

then we can do:

  metrics_ds = ev.metrics.query(
      order_by=["primary_location_id", "month"],
      group_by=["primary_location_id", "month"],
      include_metrics=[
          DeterministicMetrics.KlingGuptaEfficiency(),
          DeterministicMetrics.NashSutcliffeEfficiency(),
          DeterministicMetrics.RelativeBias()
      ]
  ).persist()

  metrics_ds.df.is_cached
  # True

Then we could create different types of plots with methods on the class itself (each method would collect the cached Spark DataFrame as a pandas DataFrame):

  metrics_ds.plot_metrics_bar_chart()
  metrics_ds.plot_metrics_heat_map()
  metrics_ds.plot_metrics_...()
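As a rough sketch, one of those methods might look something like this (the method and column names are hypothetical placeholders, not existing TEEHR API):

  import matplotlib.pyplot as plt

  def plot_metrics_bar_chart(self):
      """Sketch: collect the cached Spark DataFrame and plot it."""
      # Served from the persisted blocks after the first action,
      # so this doesn't re-execute the full query.
      pdf = self.df.toPandas()
      # Hypothetical column names, for illustration only.
      ax = pdf.plot.bar(x="primary_location_id", y="kling_gupta_efficiency")
      ax.set_ylabel("KGE")
      plt.tight_layout()
      return ax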

The results of some limited testing are a bit confusing to me, though. If I run the above query and then convert to a pandas DataFrame twice, the second call is much, much faster than the first, even without persist:

  first = metrics_ds.to_pandas()
  second = metrics_ds.to_pandas()

It seems like something is already being cached, even though metrics_ds.df is still a PySpark DataFrame after calling `.to_pandas()`. Including persist() seems to have little effect. I think I'm missing something; maybe we can discuss.
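One way to check whether the cache is actually being hit is to inspect the physical plan (a persisted DataFrame should show an InMemoryTableScan over an InMemoryRelation instead of the full query) and time the two conversions:

  import time

  metrics_ds.df.explain()  # look for InMemoryTableScan / InMemoryRelation

  start = time.perf_counter()
  first = metrics_ds.to_pandas()
  print(f"first call:  {time.perf_counter() - start:.2f}s")

  start = time.perf_counter()
  second = metrics_ds.to_pandas()
  print(f"second call: {time.perf_counter() - start:.2f}s")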
