Expose PySpark's persist() method to the Evaluation class #292

Open
samlamont opened this issue Oct 30, 2024 · 2 comments

@samlamont (Collaborator)

Wondering if we could make use of the persist or cache methods in PySpark to load the DataFrame into memory. The cached DataFrame could be attached to the Evaluation class object (similar to how the df is attached to the accessor class), eliminating the need to re-calculate things for visualizations and method chaining.

I guess this could replace the accessor or complement it?

@mgdenno (Contributor) commented Oct 31, 2024

I am certainly in favor of exploring this. Not sure I totally understand where this would fit in, but let's discuss.

@samlamont (Collaborator, Author)

Some notes on persist() in PySpark:

The general idea is that persist() allows you to cache a DataFrame, making subsequent access to the data more efficient. You can cache the data in memory and/or on disk, which you control via the storageLevel argument when calling persist().
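For reference, a minimal, self-contained sketch of the behavior (the DataFrame here is just an illustration, not a TEEHR object):

  from pyspark.sql import SparkSession
  from pyspark import StorageLevel

  spark = SparkSession.builder.getOrCreate()
  df = spark.range(1_000_000)

  # MEMORY_AND_DISK is the default for DataFrame.persist(); other levels
  # (MEMORY_ONLY, DISK_ONLY, ...) trade memory use against recompute cost.
  df.persist(StorageLevel.MEMORY_AND_DISK)

  # persist() is lazy -- blocks are only materialized by the first action.
  df.count()
  df.is_cached
  # True

  # Release the cached blocks when finished.
  df.unpersist()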

My thought is that this could allow us to cache/persist the results of a query (on the metric and timeseries classes), which would enable us to make different types of plots (and run any other analyses) on the cached data more efficiently (without having to re-execute the query each time). This would be an alternative to the dataframe accessor class for visualization.

So instead of returning a pandas DataFrame from a query, we could return the class object (which we already do to support chaining) holding the cached DataFrame.

I added this method to the Metrics class:

  # Assumes psl is an alias for PySpark's StorageLevel, e.g.:
  # from pyspark import StorageLevel as psl

  def persist(self, storage_level: psl = psl.MEMORY_AND_DISK):
      """Persist the underlying Spark DataFrame at the specified storage level."""
      self.df.persist(storageLevel=storage_level)
      return self

then we can do:

  metrics_ds = ev.metrics.query(
      order_by=["primary_location_id", "month"],
      group_by=["primary_location_id", "month"],
      include_metrics=[
          DeterministicMetrics.KlingGuptaEfficiency(),
          DeterministicMetrics.NashSutcliffeEfficiency(),
          DeterministicMetrics.RelativeBias()
      ]
  ).persist()

  metrics_ds.df.is_cached
  # True

Then we could create different types of plots with methods on the class itself (each method would collect the cached Spark DataFrame as a pandas DataFrame):

  metrics_ds.plot_metrics_bar_chart()
  metrics_ds.plot_metrics_heat_map()
  metrics_ds.plot_metrics_...()
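As a rough sketch, one of those methods might look something like this (the method and column names are hypothetical placeholders, not existing TEEHR API):

  import matplotlib.pyplot as plt

  def plot_metrics_bar_chart(self):
      """Sketch: collect the cached Spark DataFrame and plot it."""
      # Served from the persisted blocks after the first action,
      # so this doesn't re-execute the full query.
      pdf = self.df.toPandas()
      # Hypothetical column names, for illustration only.
      ax = pdf.plot.bar(x="primary_location_id", y="kling_gupta_efficiency")
      ax.set_ylabel("KGE")
      plt.tight_layout()
      return ax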

The results of some limited testing are a bit confusing to me, though. If I run the above query and then convert to a pandas DataFrame twice, the second call is much, much faster than the first, even without persist:

  first = metrics_ds.to_pandas()
  second = metrics_ds.to_pandas()

It seems like something is already being cached, even though metrics_ds.df is still a PySpark DataFrame after calling `.to_pandas()`. Including persist() seems to have little effect. I think I'm missing something; maybe we can discuss.
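One way to check whether the cache is actually being hit is to inspect the physical plan (a persisted DataFrame should show an InMemoryTableScan over an InMemoryRelation instead of the full query) and time the two conversions:

  import time

  metrics_ds.df.explain()  # look for InMemoryTableScan / InMemoryRelation

  start = time.perf_counter()
  first = metrics_ds.to_pandas()
  print(f"first call:  {time.perf_counter() - start:.2f}s")

  start = time.perf_counter()
  second = metrics_ds.to_pandas()
  print(f"second call: {time.perf_counter() - start:.2f}s")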
