Allow analysis scripts to read data.json #239

Merged: 2 commits, Nov 4, 2024

WardBrian
Collaborator

Closes #238.

  • Adds data to the latestRun object
  • Adds a files argument to the pyodide and webr mechanisms to allow us to populate arbitrary files

This lets the SIR analysis.R example go from

# posterior predictive check using the pred_cases generated quantity

install.packages(c("outbreaks", "bayesplot"))
library(outbreaks)
library(posterior)
library(ggplot2)

# same as data generation
cases <- influenza_england_1978_school$in_bed
n_days <- length(cases)
ts <- 1:n_days

# Extract posterior predictive checks
pred_cases <- as.matrix(as_draws_df(as_draws_rvars(draws)$pred_cases))[, -(15:17)]

bayesplot::ppc_ribbon(y = cases, yrep = pred_cases,
                      x = ts, y_draw = "point") +
  theme_bw() +
  ylab("cases") + xlab("days")

to

# posterior predictive check using the pred_cases generated quantity

install.packages("bayesplot")
library(outbreaks)
library(posterior)
library(ggplot2)

# load from data
d <- jsonlite::read_json('./data.json')
cases <- unlist(d$cases)
n_days <- d$n_days
ts <- unlist(d$ts)

# Extract posterior predictive checks
pred_cases <- as.matrix(as_draws_df(as_draws_rvars(draws)$pred_cases))[, -(15:17)]

bayesplot::ppc_ribbon(y = cases, yrep = pred_cases,
                      x = ts, y_draw = "point") +
  theme_bw() +
  ylab("cases") + xlab("days")

This will be even nicer when the data has some randomization to it. In that case, re-running the data-generation code in the analysis script would not recover the same data, but reading data.json does.
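
For comparison, here is a minimal Python sketch of the same point. It is not code from this PR: the field names cases, n_days, and ts follow the R example above, while generate_data, load_data, and the temporary directory are hypothetical stand-ins. Re-running an unseeded generator produces fresh data, whereas reading data.json returns exactly what was written.

```python
import json
import random
import tempfile
from pathlib import Path


def generate_data(n_days: int = 14) -> dict:
    """Hypothetical randomized data generation (deliberately unseeded)."""
    return {
        "n_days": n_days,
        "ts": list(range(1, n_days + 1)),
        "cases": [random.randint(0, 300) for _ in range(n_days)],
    }


def load_data(path: Path) -> dict:
    """What an analysis script would do: read the data that was actually used."""
    with open(path) as f:
        return json.load(f)


# Stand-in for the directory the analysis script runs in (an assumption).
workdir = Path(tempfile.mkdtemp())

used = generate_data()
(workdir / "data.json").write_text(json.dumps(used))

loaded = load_data(workdir / "data.json")  # identical to `used`
regenerated = generate_data()              # almost certainly differs from `used`

same_as_used = loaded == used
```

The analysis side only depends on the file, so it stays correct no matter how the data was produced.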

Collaborator

@jsoules jsoules left a comment

So as I understand this, the crucial change is that the contents of the data.json window will be sent to the language-appropriate worker thread as part of invoking the analysis script, and the worker thread will write that content to a virtual filesystem.

This will make the content available to the internals of the analysis script through the same mechanism as executing the analysis script on a local machine where the data file is locally available.
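
A rough sketch of that flow in plain Python, under loud assumptions: run_with_files is a hypothetical helper, a scratch directory stands in for the worker's virtual filesystem, and exec stands in for the interpreter inside the worker. The PR's real code targets pyodide and webR and will differ.

```python
import json
import os
import tempfile
from pathlib import Path


def run_with_files(script: str, files: dict[str, str]) -> dict:
    """Write each file into a scratch directory, then run the script there.

    Hypothetical stand-in for a worker that populates files before
    executing the analysis script.
    """
    namespace: dict = {}
    previous = os.getcwd()
    with tempfile.TemporaryDirectory() as workdir:
        for name, content in files.items():
            Path(workdir, name).write_text(content)
        os.chdir(workdir)
        try:
            # The script sees data.json by relative path, as it would locally.
            exec(script, namespace)
        finally:
            os.chdir(previous)
    return namespace


analysis_script = """
import json
with open('./data.json') as f:
    d = json.load(f)
total_cases = sum(d['cases'])
"""

result = run_with_files(
    analysis_script,
    {"data.json": json.dumps({"cases": [3, 8, 26]})},
)
print(result["total_cases"])  # 37
```

Because the script only ever opens ./data.json, the same script text works unchanged on a local machine where the file sits next to it.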

The code looks good and I've confirmed it works for the Python analysis script attached to the SIR model example. (It's not clear to me if the current version of the R analysis file is using the FS-based data.json or not.)

I wonder if creating an additional copy of an in-memory JSON data file could further tax the scarce memory resource in the case of models with very large data, but the increased utility is probably worth the risk in this case. Especially as the copy doesn't have to exist until after the sampler's been run.

I think we are good to move forward here.

@WardBrian
Collaborator Author

> I wonder if creating an additional copy of an in-memory JSON data file could further tax the scarce memory resource in the case of models with very large data, but the increased utility is probably worth the risk in this case. Especially as the copy doesn't have to exist until after the sampler's been run.

If this does become a problem, I think we could work around it by using FS.createLazyFile, but that would require extra machinery to 'host' the file at a URL, so I didn't tackle it here. If you know a good way, we could definitely do that sooner rather than later.

@WardBrian WardBrian merged commit 7b7e63d into main Nov 4, 2024
2 checks passed
@WardBrian WardBrian deleted the analysis-data.json branch November 4, 2024 17:26
@WardBrian WardBrian restored the analysis-data.json branch November 4, 2024 17:28
@WardBrian WardBrian deleted the analysis-data.json branch November 4, 2024 17:28
@jsoules
Collaborator

jsoules commented Nov 4, 2024

To be clear--yeah, I don't have an answer here, or even evidence that it's going to be a problem; I suspect any such situation is either massively over-provisioned with data or is going to run into problems while still in the sampler phase, so realistically I'm not worried about it.

We can solve it if it's ever an issue.

Successfully merging this pull request may close these issues.

Make data.json available in analysis scripts