Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TuringGLM.jl Roadmap and Internals/API Discussion #1

Closed
storopoli opened this issue Nov 3, 2021 · 20 comments
Closed

TuringGLM.jl Roadmap and Internals/API Discussion #1

storopoli opened this issue Nov 3, 2021 · 20 comments
Assignees
Labels
enhancement New feature or request

Comments

@storopoli
Copy link
Member

storopoli commented Nov 3, 2021

Hello, all!

I was speaking with @yebai about how Turing could support the whole Bayesian Workflow following Bob Carpenter presentation at ProbProg 2021 in October, 2021. And I gave him an idea about how an easy interface of Turing could attract more users and be a great transition from users coming from a stats background and/or R background. I am deeply interested in this because I mainly lecture graduate students on applied social sciences and if I show either Stan code or Turing model code they would run away. Thus, I use the friendly RStan interface {brms} that uses the formula syntax (as you can see in this still on-going course YouTube playlist).

I also use Turing a lot in my research and most of it are hierarchical/generalized linear models that could be easily specified by a formula. Thus, I also find myself having to copy and paste a LOT of boilerplate code to define a model.

I suggested this package and the following objectives, initial release, API design and roadmap. We discussed and we would really love feedback.

I am also commited to a long-term maintainance/development of TuringGLM, since this would solve all my lecturer and researcher needs. I am tired of having students complaining that they cannot install {brms} because of RStan Windows issues. And also I would move all my lecturing and research to Julia. Furthermore, I think this package has a great opportunity to attract users to Julia/Turing/Bayesian Statistics, which I am all up for.

Objective of the Package

Uses @formula syntax to specify several different (generalized) (hierarchical) linear models in Turing.jl. It is defined by StatsModels.jl and can be easly extended to GLMs (see GLM.jl) and to Hierarchical models (see MixedModels.jl).

Heavily inspired by {brms} (uses RStan or CmdStanR), bambi (uses PyMC3) and StatsModels.jl.

My ideia is just to create a syntactic sugar of model definition with a @formula macro that translates it to a instantiated Turing @model along with the data. The user just need to call sample, advi or another future Turing function (e.g. pathfinder) onto the returned model from @formula and he is good to go. So this package would be easy to maintain and just focus on whats important: the @formula -> @model translation.

Discussion Points

  • @formula syntax:
    turing_model will return an instantiate Turing model with the data, so the user has just to call sample or advi etc in the model. Example:

    m = turing_model(@formula y ~ x; data=my_df)
    chn = sample(m, NUTS(), 2_000)
  • How the user specify custom priors?

Feature list for 0.1.0 release (public repo)

  • Close interface with DataFrames.jl and CategoricalArrays.jl
    • Reading the length(unique(x)) of a group-level intercept/slope
    • Auto dummy variable ( K 1 ) with categorical vectors by reading the levels and using the first level as baseline.
  • Likelihoods:
    • Linear Regression: Gaussian and TDist (IdentityLink)
    • Logistic Regression: Bernoulli, Binomial (how would we accept a matrix of [success n - success]) and Geometric (LogitLink)
    • Count Data: Poisson and NegativeBinomial (LogLink)
  • Priors (Should be Distributions.jl dependent):
    • Default Priors and Custom Priors
    • Intercept
    • Coefficients
    • Auxiliary parameter (residuals)
  • Group-level effect (Hierarchical Models):
    • Non-Centered Parameterization and recompute the parameter back in the return statement
    • Varying-Intercept: (1 | group)
    • Varying-Slope: (x_1 | group)
    • Varying-Intercept-Slope: (1 + x_1 | group) or (1 | group) + (x1 | group)
    • Correlated Group-level effects: LKJ() (OBS: fast implementation depends on Feature request: Add LKJCholesky Turing.jl#1629) ???
  • Automatic QR decomposition with a keyword argument inside @formula

Roadmap

  • Likelihoods:
    • Survival stuff: LogNormal, Gamma, Exponential and Weibull
    • Zero-Inflated stuff (Mostly Mixtures):
      • Zero-Inflated Count Data: ZeroInflatedPoisson and ZeroInflatedNegativeBinomial.
      • Zero-Inflated Binary Data: ZeroInflatedBinomial and ZeroInflatedBeta
    • Ordinal Regression
  • Priors:
    • Horseshoe Prior and R2D2 prior for Sparse Bayesian Regression
  • Multivariate Models (more than 1 likelihood and/or dependent variable):
    • Stuff like f1 = @formula y1 ~ x1 and f2 = @formula y2 ~ x2.
  • Autoregressive Models Correlation Structures:
    • AR(p): Autoregressive
    • MA(q): Moving Average
    • ARMA(p, q): Autoregressive Moving Average
  • Warmup with Pathfinder.jl?
  • Integration with Bijectors.jl
  • Normalizing Flows?
@rikhuijzer
Copy link
Collaborator

m = @formula y ~ x; data=my_df
chn = sample(m, NUTS(), 2_000)

This macro syntax is not gonna work or at least is very confusing. Maybe @formula data=df y ~ x.

Also, using the same name as GLM formula isn't the best, I think. How about @modelformula data=df y~x or so?

@yebai
Copy link
Member

yebai commented Nov 3, 2021

cc @phipsgabler @cpfiffer

@cpfiffer
Copy link
Member

cpfiffer commented Nov 3, 2021

We've done a little bit of this already here: https://github.com/cpfiffer/BayesModels. Could be fairly quickly repurposed/improved/integrated.

@storopoli
Copy link
Member Author

storopoli commented Nov 3, 2021

Yeah! Nice! I wasn't aware of the BayesModel.jl! You've solved almost everything with the @formula and prior.

I still think that it should return a instantiated model with the data and the user do whatever he wants to with it inside Turing.jl. This is where my proposal differs from BayesModel.jl but I am not married to the ideia.

Two things that I forgot to mention:

  1. Stan developers and researchers are pretty aware that {rstanarn} and {brms} are MORE popular than Stan.
  2. Stan developers and researchers know that their manual and case studies are a real advantage for the user because it allows it for simple copy-paste/adapting.

I currently ask for students to present a Bayesian analysis complete with code and analyses as a final grade assessment in the Bayesian Statistics course. I could give them the option to translate a Stan case study to TuringGLM.jl (respecting the license of course) and them I could, with a few polishing, turn it into Turing.jl and TuringGLM.jl case studies.

EDIT: the blm function solves the macro syntax issue that @rikhuijzer described.

@devmotion
Copy link
Member

I was asked to comment on this issue but I'm not sure if I'm the right person to ask - I know the @formula style is popular in different research communities and also among people with a background in R but I don't like it and hence I don't use it and try to avoid it (even in Julia), which also explains why I haven't used any of the linked R packages 😄

I agree with @rikhuijzer that it would be confusing to have another @formula macro which is different from StatsModels.@formula (GLM does neither own nor define @formula). However, I would not use a different macro name but I think the best option, both for maintenance and users, would be to just use StatsModels.@formula. I don't think it's a good idea to create yet another DSL or macro if there exists already a well maintained alternative. E.g., StatsModels.@formula already can be combined with data (such as data frames) and can also be extended: https://juliastats.org/StatsModels.jl/stable/internals/

So my main suggestion would be to work with StatsModels.@formula, extend it when necessary (e.g., additional Turing-specific terms if needed), and only implement a function that constructs a Turing model from such a @formula object and coupled data and is called by the user (i.e., something like turing_model(@formula(y ~ ....), data, priors), similar to the GLM example with lm here: https://juliastats.org/GLM.jl/dev/#Categorical-variables). So basically something similar to BayesModels it seems 😄

A completely different approach would be to view the problem that this package would solve as a limited example of a symbolic syntax for PPLs, i.e., as a ModelingToolkit for PPLs. We discussed this before and I think there are great opportunities, also regarding interoperability of different PPLs. Here you would need only a Turing output of the ModelingToolkit representation. The advantage would be that it would be much more general, disadvantages could be that probably it takes more time to set up and design correctly and maybe it would not be helpful for students or users that want to use R syntax.

But in any case I don't recommend creating another different macro/DSL syntax.

@storopoli
Copy link
Member Author

@devmotion these are great contributions, thanks!

I also agree with you that the formula syntax is very popular and specific to certain communities (it has been used since the S, predecessor or R, in 1992).

I thought that the @formula macro was created by GLM.jl but it is from StatsModels.jl and also MixedModels.jl extends it to hierarchical models.

The ModelingToolkit.jl symbolic syntax is also very interesting, would be easy to build/maintain if we design it as a simple DSL to hand-off turing models to Turing.jl.

From what I seem, I think that most of what we would use of StatsModels.@formula is already defined in GLM.jl and MixedModels.jl. I still cannot find something that would be easily to specify Binomial models from Bernoulli using the @formula. Since Bernoulli(p) == Binominal(1, p) we could in some cases gain significant computation advantage by using a Binomial likelihood instead of a Bernoulli. {brms} and R's glm handles this with the trials and cbind respectively:

# Frequentist GLM Base R
glm(cbind(success, (n - sucess)) ~ 1, family = binomial, ...)

# brms Bayesian GLM
brm(success | trials(n) ~ 1, family = binomial, ...)

I much prefer the cbind syntax over the pipe | thing by accepting a matrix of [success n - success] or to give the option for the user to pass the n trials inside the Binomial(n) likelihood.

@phipsgabler
Copy link
Member

Awesome project! I thought about something like this recently but didn't continue due to my time being limited... I somehow got a start of coupling MixedModels.jl and their formula extensions with Turing here (note that I'm not a modelling person, the code might not be a good implementation...).

The "problems" I hit there were

  1. The extensions to the formula syntax are not separated from the rest of the package; MixedModel.jl defines what the | terms mean and the model matrix functions, but when using the package you also depend on the whole optimization machinery that you don't want to when using Turing instead.
  2. I was really keen on doint this with the new condition/decondition functionality of DPPL, not arguments, and tried to figure out a way of letting the user define a model in formula style without yet being conditioned on observations (e.g., if you just want to simulate the outcome). But I couldn't find a nice design for that yet -- that's why you see those @formula 0 ~ ... things.

@storopoli
Copy link
Member Author

  1. I was really keen on doint this with the new condition/decondition functionality of DPPL, not arguments, and tried to figure out a way of letting the user define a model in formula style without yet being conditioned on observations (e.g., if you just want to simulate the outcome). But I couldn't find a nice design for that yet -- that's why you see those @formula 0 ~ ... things.

What if we specify inside the turing_model:

turing_model(@formula(y ~ ....), data; priors=DefaultPriors(), prior_only=false)
# or
turing_model(@formula(y ~ ....), data; priors=DefaultPriors(), likelihood=true)

@phipsgabler
Copy link
Member

Well @storopoli, I don't have much prior experience with this kind of modelling, but what I meant is the following: say I want to do some simulation. data contains only columns of independent variables a, b. Then I'd like to instantiate something like a generative model

mymodel = turing_model(@formula(outcome ~ a + b), data; ...)

and sample outcome from that (which is what simulate in my code does). This should be the joint over outcome and the regression parameters.

The problem is that there is no simple way of including the name of outcome in the formula, because the model matrix instantiation then believes it should be in data and wants to include it. (You can strip it off the formula object, but then you lose it...)

Only later I'd like to be able to say

chain = sample(mymodel | (; outcome), ...)

to sample from the posterior given an actual value for the dependent variable.

The formula syntax isn't made for the use case of specifying "only a right hand side" -- but that would be more elegant from a PPL design perspective, since parameter inference is only the special case of the conditioned model.

@storopoli
Copy link
Member Author

I understand, but I do not find that people who uses formula syntax uses that often. In fact, I have never seen what you are trying to do being used in a formula approach. The formula syntax has the preconception that all variables are defined in a data structure data.

@storopoli storopoli self-assigned this Nov 6, 2021
@storopoli storopoli added the enhancement New feature or request label Nov 6, 2021
@phipsgabler
Copy link
Member

Alright, I'm probably a very atypical case, so you can ignore the point :). The formula style would always construct a conditioned model and require the outcomes to be present in the data, right?

To generate simulated outcomes, you could then always first instantiate a conditioned model with unused "dummy" outcomes, and then decondition that and sample.

@storopoli
Copy link
Member Author

storopoli commented Nov 7, 2021

We could include something like we do currently in Turing for posterior and prior predictive checks. If the lhs (i.e. the y in y ~ ...) is missing (the instantiated type) we could implement a simulated outcome. What about that?

@sethaxen
Copy link
Member

sethaxen commented Nov 7, 2021

I've never used brms, bambi, or really any other regression package, so I can't offer experienced input. But I do have some initial impressions:

  1. This would be a fantastic contribution to the Julia ecosystem.
  2. I agree with @devmotion that it would be better to use existing @formula macros, and if they're too limiting, then work to try to extend them. Using Symbolics/ModelingToolkit is interesting, but I'm not certain what it gets us unless there's a use for symbolic manipulation/simplifcation/rewriting of the formulas.
  3. What Julia usually uniquely brings to the table is composability of other packages. Are there unique opportunities for composeability in regression models? Are there regression models that are hard to code in brms/bambi that would be easier with Julia?
  4. Why build the package on top of a PPL? For brms and bambi it makes sense, because the samplers and transformations they use are baked into the PPLs they wrap, but in Julia, these all live in individual packages. What do you get by using Turing under the hood, and what limitations does that pose? Would the package support non-static models?
  5. Perhaps one answer to the above is that turing_model converting a formula to a DynamicPPL.Model is convenient and modular. From that perspective, why be Turing-specific? One could also imagine a similar soss_model converter.
  6. One complication might be that different choices of model implementation might be preferred depending on the autodiff engine being used. This seems like something that can be considered later.

@storopoli
Copy link
Member Author

@sethaxen thanks for the feedback!

  1. I agree with @devmotion that it would be better to use existing @formula macros, and if they're too limiting, then work to try to extend them. Using Symbolics/ModelingToolkit is interesting, but I'm not certain what it gets us unless there's a use for symbolic manipulation/simplifcation/rewriting of the formulas.

Yes , I agree with you both. From what I was playing around in the weekend, we can easily extend StatsModels.@formula by adding it as a dependency. The hierarchical random effect stuff (1 | group) etc stuff is extended by MixedModels.jl but the package has a huge dependency footprint. So this I (and anybody who wants to collaborate) would have to extend/adapt.

The Symbolics/ModelingToolkit I think that would not be a major priority right now, but we can keep it in the roadmap.

  1. What Julia usually uniquely brings to the table is composability of other packages. Are there unique opportunities for composeability in regression models? Are there regression models that are hard to code in brms/bambi that would be easier with Julia?

This is a great provocation and invitation for insight. I honestly don't know. The main focus now is to create an easy API for users coming from R/Stats background to migrate to Julia and Turing. But we'll definitely explore this avenue in the near future.

  1. Perhaps one answer to the above is that turing_model converting a formula to a DynamicPPL.Model is convenient and modular. From that perspective, why be Turing-specific? One could also imagine a similar soss_model converter.

This also address the number 4. I cannot speak from Soss.jl, since I am not aknowledge on it and haven't used it yet. But from what I reckon, it does uses AdvancedHMC.jl and Turing samplers as AD backend (or am I wrong?). So it should be easy to create a SossGLM.jl brother package and add the soss_model function. Much of what we will learn in TuringGLM.jl will be aplicable/transferable to SossGLM.jl.

  1. One complication might be that different choices of model implementation might be preferred depending on the autodiff engine being used. This seems like something that can be considered later.

We will favor, for now, NUTS() and ForwardDiff.jl (Turing's default). Advanced users who need different AD backends or samplers will be advised to go straight to Turing.jl ecosystem.

@phipsgabler
Copy link
Member

For Seth's remarks 4 and 5: that's a good point. It depends on how the "backend" PPLs are used.

Is the plan to create a separate model for each formula (@formula(y ~ a + b) -> generated_model(y, a, b), and a, b are plugged in from data), or to only bridge data frames to model matrices via formula application (fixed_model(y, X) and X = modelcols(formula, data))?

In the latter case (which I think is preferable), you don't need to create a model for each formula, but have one sufficiently flexible model per "regression type". In which case I think crossing PPLs is easier, since the only contribution required is a good implementation of the fixed model, not code generation.

Yes , I agree with you both. From what I was playing around in the weekend, we can easily extend StatsModels.@formula by adding it as a dependency. The hierarchical random effect stuff (1 | group) etc stuff is extended by MixedModels.jl but the package has a huge dependency footprint. So this I (and anybody who wants to collaborate) would have to extend/adapt.

Indeed, the packages extending StatsModels don't tend to separate syntax extension and implementation cleanly. That's something that would be good to refactor, and probably rather tedious. I only looked at MixedEffects, and couldn't easily figure out which parts would be necessary to extract; also, they make some very peculiar assumptions about the formula application.

There's a fundamental reason why this is complicated: the StatsModels formula thing is designed in such a way that formula application (i.e., creation of model matrices) is always tied to model implementation at later stages (semantics time), because at some time you have to decide how to set up the model matrix.

If it were only MixedEffects.jl to adapt, it's probably doable, but I'm not familiar enough with that ecosystem how much else there is (polynomial regression? splines?). Is it feasible for them to move from one package "mixed effects regression with syntax and implementation" to "mixed effects syntax and meaning (apply_schema)" plus multiple implementations?

@storopoli
Copy link
Member Author

storopoli commented Nov 8, 2021

I think that splines and polynomials should be in the roadmap. I still need to inspect deeply the apply_schema and @formula API. From what I saw, extracting the terms and reading the data using Tables.jl is "doable".

I will try to create an initial release with the feature list detailed above using just the @formula from StatsModels and extending manually the hierarchical terms. Polynomials and splines could be added by the user doing a transformation in the underlying data (i.e. DataFrames.transform or by applying a vectorized function transformation on a specific Tuple/Array) before it being passed to @formula.

@devmotion
Copy link
Member

Regarding Tables.jl, this reminds me of the point

Close interface with DataFrames.jl and CategoricalArrays.jl

in the feature list above. My suggestion would be to avoid any explicit dependency on DataFrames if possible and just support the Tables.jl interface (as StatsModels and @formula already does apparently).

@storopoli
Copy link
Member Author

Yes, @devmotion that is exactly what I am looking for.
I am depending on CategorialArrays.jl, though, to parse the categorical vector and, most importantly, get the reference dummy variable from the levels output. Is CategoricalArrays.jl an OK dependency for that?

@devmotion
Copy link
Member

StatsModels removed the dependency on CategoricalArrays two years ago (JuliaStats/StatsModels.jl#157) and replaced it with the DataAPI interface. For instance, levels is defined in DataAPI, so depending on your use case you might be able to even avoid CategoricalArrays. DataAPI is also supported by other array types such as https://github.com/JuliaData/PooledArrays.jl, which maybe could be supported automatically if one uses the DataAPI interface. But I would not be worried too much about CategoricalArrays if DataAPI is not sufficient, it is a much smaller dependency than DataFrames.

@storopoli
Copy link
Member Author

Great! That's all I need them! DataAPI.jl and Tables.jl.
I will close this issue. I think I have everthing for now.
Thank you all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

7 participants