Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP]: Rethinking_2 Chp_04: update trace_to_dataframe() to InferenceData #143

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

joannadiong
Copy link
Contributor

In:

# Code 4.2 
trace_df = pm.trace_to_dataframe(trace_4_1)
trace_df.cov()

I have ported part of this code to save the trace as an InferenceData object:

with pm.Model() as m4_1:
    pm_data = az.from_pymc3(trace=trace_4_1)

The pm_data object will be needed in Code sections 4.33, 4.34, 4.36. (trace_to_dataframe is retained for now so the notebook does not break.)

I am unsure how to calculate the covariance matrix in trace_df.cov as above. I may have missed it, but there does not seem to be a function in arviz that does this. So the covariance matrix might need to be computed from pm_data.mu.values and pm_data.sigma.values.

Also, by default, the sampler generates 4 chains with 1000 draws. Should all chains be involved in computing the covariance matrix?

Could you advise what is the best approach to get the covariance matrix, for later separation into variances and correlations? Thanks for the pointers!

Ping @AlexAndorra or @aloctavodia

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@MarcoGorelli
Copy link
Contributor

Hey @joannadiong

Does it work to do this in 4.32

trace_df = pm_data.posterior.to_dataframe()
trace_df.cov()

?

Alternatively, maybe in 4.30 you could do

trace_4_1 = pm.sample(1000, tune=1000, return_inferencedata=True)

so that 4.32 just becomes

trace_df = trace_4_1.posterior.to_dataframe()
trace_df.cov()

?

@joannadiong
Copy link
Contributor Author

Hi @MarcoGorelli,

Many thanks for the suggestions. There are 2 traces in the chapter that call trace_to_dataframe().

trace_4_1

I received an error for the first approach:

# Code 4.32
trace_df = pm.posterior.to_dataframe()

>>> AttributeError: module 'pymc3' has no attribute 'posterior' 

But the second solution seemed to work for trace_4_1:

# 4.30
trace_4_1 = pm.sample(1000, tune=1000, return_inferencedata=True)

# 4.32
trace_df = trace_4_1.posterior.to_dataframe()
trace_df.cov()

In 4.34, I had to change trace_4_1["sigma"][:10] to trace_df["sigma"][:10].

Overall, I think the fix for this trace is done. The other trace is more problematic.

trace_4_3

Doing this in the notebook:

# 4.42
trace_4_3 = pm.sample(1000, tune=1000, return_inferencedata=True)

# 4.45
trace_4_3_df = trace_4_3.posterior.to_dataframe()

# 4.50
mu_at_50 = trace_4_3_df["a"] + trace_4_3_df["b"] * (50 - d2.weight.mean())

# 4.52
az.hpd(mu_at_50)

Gives this error at 4.52:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-55-676334186adf> in <module>
----> 1 az.hpd(mu_at_50)
      2 

~/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/arviz/stats/stats.py in hpd(ary, hdi_prob, circular, multimodal, skipna, group, var_names, filter_vars, coords, max_modes, dask_kwargs, **kwargs)
    341         ("hpd will be deprecated " "Please replace hdi"),
    342     )
--> 343     return hdi(
    344         ary,
    345         hdi_prob,

~/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/arviz/stats/stats.py in hdi(ary, hdi_prob, circular, multimodal, skipna, group, var_names, filter_vars, coords, max_modes, dask_kwargs, **kwargs)
    498         ary = np.expand_dims(ary, 0)
    499 
--> 500     ary = convert_to_dataset(ary, group=group)
    501     if coords is not None:
    502         ary = get_coords(ary, coords)

~/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/arviz/data/converters.py in convert_to_dataset(obj, group, coords, dims)
    177     xarray.Dataset
    178     """
--> 179     inference_data = convert_to_inference_data(obj, group=group, coords=coords, dims=dims)
    180     dataset = getattr(inference_data, group, None)
    181     if dataset is None:

~/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/arviz/data/converters.py in convert_to_inference_data(obj, group, coords, dims, **kwargs)
    129             "cmdstanpy fit",
    130         )
--> 131         raise ValueError(
    132             "Can only convert {} to InferenceData, not {}".format(
    133                 ", ".join(allowable_types), obj.__class__.__name__

ValueError: Can only convert xarray dataarray, xarray dataset, dict, netcdf filename, numpy array, pystan fit, pymc3 trace, emcee fit, pyro mcmc fit, numpyro mcmc fit, cmdstan fit csv filename, cmdstanpy fit to InferenceData, not Series

I wanted to run model m4_3 separately in a Python file to inspect the objects:

# 4.42
xbar = d2.weight.mean()
with pm.Model() as m4_3:
    a = pm.Normal("a", mu=178, sd=20)
    b = pm.Lognormal("b", mu=0, sd=1)
    sigma = pm.Uniform("sigma", 0, 50)
    mu = a + b * (d2.weight - xbar)
    height = pm.Normal("height", mu=mu, sd=sigma, observed=d2.height)
    trace_4_3 = pm.sample(1000, tune=1000, return_inferencedata=True)

but interestingly, the m4_3 code produces an error when run in the Python file:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, b, a]
█/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pymc3/math.py:246: RuntimeWarning: divide by zero encountered in log1p
  return np.where(x < 0.6931471805599453, np.log(-np.expm1(-x)), np.log1p(-np.exp(-x)))
/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pymc3/math.py:246: RuntimeWarning: divide by zero encountered in log1p
  return np.where(x < 0.6931471805599453, np.log(-np.expm1(-x)), np.log1p(-np.exp(-x)))
/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pymc3/math.py:246: RuntimeWarning: divide by zero encountered in log1p
  return np.where(x < 0.6931471805599453, np.log(-np.expm1(-x)), np.log1p(-np.exp(-x)))
/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pymc3/math.py:246: RuntimeWarning: divide by zero encountered in log1p
  return np.where(x < 0.6931471805599453, np.log(-np.expm1(-x)), np.log1p(-np.exp(-x)))
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
WARNING (theano.tensor.opt): Failed to infer_shape from Op DeepCopyOp.
Input shapes: [(TensorConstant{352},)]
Exception encountered during infer_shape: <class 'ValueError'>
Exception message: Cannot compute test value: input 0 (DeepCopyOp.0) of Op Shape_i{0}(DeepCopyOp.0) missing default value. 
Traceback: Traceback (most recent call last):
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/theano/graph/op.py", line 81, in compute_test_value
    storage_map[ins] = [ins.get_test_value()]
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/theano/graph/basic.py", line 422, in get_test_value
    raise TestValueError(f"{self} has no test value {detailed_err_msg}")
theano.graph.utils.TestValueError: DeepCopyOp.0 has no test value 
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/theano/tensor/opt.py", line 1097, in get_node_infer_shape
    o_shapes = shape_infer(
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/theano/tensor/opt.py", line 1204, in default_infer_shape
    rval.append(self.shape_tuple(r))
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/theano/tensor/opt.py", line 1191, in shape_tuple
    return tuple([self.shape_ir(i, r) for i in range(r.ndim)])
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/theano/tensor/opt.py", line 1191, in <listcomp>
    return tuple([self.shape_ir(i, r) for i in range(r.ndim)])
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/theano/tensor/opt.py", line 1179, in shape_ir
    s = Shape_i(i)(r)
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/theano/graph/op.py", line 253, in __call__
    compute_test_value(node)
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/theano/graph/op.py", line 94, in compute_test_value
    raise ValueError(
ValueError: Cannot compute test value: input 0 (DeepCopyOp.0) of Op Shape_i{0}(DeepCopyOp.0) missing default value. 

The issues here and here seem the most closely related, but I did not import Theano and PyMC3 libraries separately.

What might be the best way to proceed? Thanks in advance.

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Feb 13, 2021

# Code 4.32
trace_df = pm.posterior.to_dataframe()

>>> AttributeError: module 'pymc3' has no attribute 'posterior' 

pm_data, not pm ;)


I don't know about that theano error you've received, but Alex and Osvaldo are subscribed to this thread so they may have some more useful input. Perhaps if you share your entire Python script, as well versions of PyMC3 and theano-pymc, then that will help with debugging

@joannadiong
Copy link
Contributor Author

Thanks @MarcoGorelli!

Code 4.32

# Code 4.32
trace_df = pm_data.posterior.to_dataframe()
trace_df.cov()

I used pm instead of pm_data because I couldn't find where pm_data was defined earlier?

Error at Code 4.52

Interestingly, I ran model m4_3 in Python script afresh and the Theano error did not replicate! (Not sure how it got magicked away, but it's one less problem.)

The code still fails at 4.52. I ran these relevant bits in a script:

import arviz as az
import pandas as pd
import pymc3 as pm

# 4.26
d = pd.read_csv("Data/Howell1.csv", sep=";", header=0)
d2 = d[d.age >= 18]

# 4.42
xbar = d2.weight.mean()
with pm.Model() as m4_3:
    a = pm.Normal("a", mu=178, sd=20)
    b = pm.Lognormal("b", mu=0, sd=1)
    sigma = pm.Uniform("sigma", 0, 50)
    mu = a + b * (d2.weight - xbar)
    height = pm.Normal("height", mu=mu, sd=sigma, observed=d2.height)
    trace_4_3 = pm.sample(1000, tune=1000, return_inferencedata=True)

# Code 4.45
trace_4_3_df = trace_4_3.posterior.to_dataframe()
trace_4_3_df.cov().round(3)

# Code 4.50
mu_at_50 = trace_4_3_df["a"] + trace_4_3_df["b"] * (50 - d2.weight.mean())

# Code 4.52
az.hpd(mu_at_50)

This produced the following ValueError:

Backend TkAgg is interactive backend. Turning interactive mode on.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, b, a]
█/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pymc3/math.py:246: RuntimeWarning: divide by zero encountered in log1p
  return np.where(x < 0.6931471805599453, np.log(-np.expm1(-x)), np.log1p(-np.exp(-x)))
/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pymc3/math.py:246: RuntimeWarning: divide by zero encountered in log1p
  return np.where(x < 0.6931471805599453, np.log(-np.expm1(-x)), np.log1p(-np.exp(-x)))
/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pymc3/math.py:246: RuntimeWarning: divide by zero encountered in log1p
  return np.where(x < 0.6931471805599453, np.log(-np.expm1(-x)), np.log1p(-np.exp(-x)))
/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pymc3/math.py:246: RuntimeWarning: divide by zero encountered in log1p
  return np.where(x < 0.6931471805599453, np.log(-np.expm1(-x)), np.log1p(-np.exp(-x)))
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.
/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/arviz/stats/stats.py:340: UserWarning: hpd will be deprecated Please replace hdi
  warnings.warn(
Traceback (most recent call last):
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-f0b5e524fe36>", line 27, in <module>
    az.hpd(mu_at_50)
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/arviz/stats/stats.py", line 343, in hpd
    return hdi(
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/arviz/stats/stats.py", line 500, in hdi
    ary = convert_to_dataset(ary, group=group)
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/arviz/data/converters.py", line 179, in convert_to_dataset
    inference_data = convert_to_inference_data(obj, group=group, coords=coords, dims=dims)
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/arviz/data/converters.py", line 131, in convert_to_inference_data
    raise ValueError(
ValueError: Can only convert xarray dataarray, xarray dataset, dict, netcdf filename, numpy array, pystan fit, pymc3 trace, emcee fit, pyro mcmc fit, numpyro mcmc fit, cmdstan fit csv filename, cmdstanpy fit to InferenceData, not Series

Versions in case helpful:

Name: pymc3
Version: 3.11.0

Name: Theano-PyMC
Version: 1.1.0

@MarcoGorelli
Copy link
Contributor

I used pm instead of pm_data because I couldn't find where pm_data was defined earlier?

You defined it in 4.32:

trace_df = pm.trace_to_dataframe(trace_4_1)
trace_df.cov()

+ # update: trace saved with InferenceData format
+ # trace_to_dataframe() retained until update to InferenceData is stable
+ with pm.Model() as m4_1:
+     pm_data = az.from_pymc3(trace=trace_4_1)

@joannadiong
Copy link
Contributor Author

Got it, thanks!

Fixed the structure of trace_4_1 and trace_4_3 to get dataframes this way:

trace_df = pm_data.posterior.to_dataframe()
trace_df.cov()

It dealt with the long error. The notebook now breaks at Code 4.59, 4.60:

height_pred = pm.sample_posterior_predictive(trace_4_3, 200, m4_3)
height_pred_hpd = az.hpd(height_pred["height"])

height_pred is an empty dict, so this returns a key error:

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-61-0390e3860878> in <module>
----> 1 height_pred_hpd = az.hpd(height_pred["height"])
      2 

KeyError: 'height'

Tried height_pred = pm.sample_posterior_predictive(trace_4_3_df, 200, m4_3) instead, but also generates error:

height_pred = pm.sample_posterior_predictive(trace_4_3_df, 200, m4_3)
/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pymc3/sampling.py:1687: UserWarning: samples parameter is smaller than nchains times ndraws, some draws and/or chains may not be represented in the returned posterior predictive sample
  warnings.warn(
█Traceback (most recent call last):
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-f0c44fac5ce3>", line 1, in <module>
    height_pred = pm.sample_posterior_predictive(trace_4_3_df, 200, m4_3)
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pymc3/sampling.py", line 1729, in sample_posterior_predictive
    param = _trace[idx % len_trace]
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/joanna/Dropbox/Sketchbook/python/resources/env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 0

@savagej
Copy link

savagej commented Mar 1, 2022

trace_4_1 = pm.sample(1000, tune=1000, return_inferencedata=True)

worked well for me and also means that we don't run into errors with using az.from_pymc3 which doesn't currently work when you import pymc vs pymc3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants