TuringGLM.jl Roadmap and Internals/API Discussion #1
Comments
This macro syntax is not gonna work or at least is very confusing. Maybe Also, using the same name as GLM formula isn't the best, I think. How about
We've done a little bit of this already here: https://github.com/cpfiffer/BayesModels. Could be fairly quickly repurposed/improved/integrated.
Yeah! Nice! I wasn't aware of the I still think that it should return an instantiated model with the data, and the user does whatever he wants to with it inside Two things that I forgot to mention:
I currently ask students to present a Bayesian analysis, complete with code and analyses, as a final grade assessment in the Bayesian Statistics course. I could give them the option to translate a Stan case study to EDIT: the
I was asked to comment on this issue but I'm not sure if I'm the right person to ask - I know the I agree with @rikhuijzer that it would be confusing to have another So my main suggestion would be to work with A completely different approach would be to view the problem that this package would solve as a limited example of a symbolic syntax for PPLs, i.e., as a ModelingToolkit for PPLs. We discussed this before and I think there are great opportunities, also regarding interoperability of different PPLs. Here you would need only a Turing output of the ModelingToolkit representation. The advantage would be that it would be much more general; the disadvantages could be that it probably takes more time to set up and design correctly, and maybe it would not be helpful for students or users that want to use R syntax. But in any case I don't recommend creating another different macro/DSL syntax.
@devmotion these are great contributions, thanks! I also agree with you that the formula syntax is very popular and specific to certain communities (it has been used since S, the predecessor of R, in 1992). I thought that the The ModelingToolkit.jl symbolic syntax is also very interesting, and would be easy to build/maintain if we design it as a simple DSL to hand off Turing models to From what I have seen, I think that most of what we would use of
    # Frequentist GLM Base R
    glm(cbind(success, (n - success)) ~ 1, family = binomial, ...)
    # brms Bayesian GLM
    brm(success | trials(n) ~ 1, family = binomial, ...)
I much prefer the
Awesome project! I thought about something like this recently but didn't continue due to my time being limited... I somehow got a start of coupling MixedModels.jl and their formula extensions with Turing here (note that I'm not a modelling person, the code might not be a good implementation...). The "problems" I hit there were
What if we specify inside the
    turing_model(@formula(y ~ ....), data; priors=DefaultPriors(), prior_only=false)
    # or
    turing_model(@formula(y ~ ....), data; priors=DefaultPriors(), likelihood=true)
Well @storopoli, I don't have much prior experience with this kind of modelling, but what I meant is the following: say I want to do some simulation.
    mymodel = turing_model(@formula(outcome ~ a + b), data; ...)
and sample The problem is that there is no simple way of including the name of Only later I'd like to be able to say
    chain = sample(mymodel | (; outcome), ...)
to sample from the posterior given an actual value for the dependent variable. The formula syntax isn't made for the use case of specifying "only a right hand side" -- but that would be more elegant from a PPL design perspective, since parameter inference is only the special case of the conditioned model.
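The workflow described in this comment (simulate outcomes first, condition on them later) can be sketched with a hand-written Turing model. This is only an illustration under stated assumptions: the model `linear_regression`, its priors, and the synthetic predictors `a` and `b` are hypothetical and not part of any proposed TuringGLM API. `outcome` is deliberately left out of the model arguments so that it stays a random variable until it is conditioned with `|`.

```julia
using Turing
using LinearAlgebra: I
using Random

# Hypothetical hand-written model; `outcome` is not an argument,
# so it is treated as a random variable until we condition on it.
@model function linear_regression(a, b)
    α ~ Normal(0, 1)
    β_a ~ Normal(0, 1)
    β_b ~ Normal(0, 1)
    σ ~ Exponential(1)
    μ = α .+ β_a .* a .+ β_b .* b
    outcome ~ MvNormal(μ, σ^2 * I)
end

Random.seed!(1)
a, b = randn(20), randn(20)   # invented predictors
model = linear_regression(a, b)

# Unconditioned model: draw parameters and a simulated outcome from the prior.
simulated = rand(model)
outcome = simulated.outcome

# Conditioned model: parameter inference given a concrete outcome vector.
chain = sample(model | (; outcome), NUTS(), 200; progress=false)
```

A formula-based front end would always produce the conditioned form, which is why the outcome column has to be present in the data in that design.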
I understand, but I do not find that people who use formula syntax do that often. In fact, I have never seen what you are trying to do being used in a formula approach. The formula syntax assumes that all variables are defined in a data structure
Alright, I'm probably a very atypical case, so you can ignore the point :). The formula style would always construct a conditioned model and require the outcomes to be present in the data, right? To generate simulated outcomes, you could then always first instantiate a conditioned model with unused "dummy" outcomes, and then
We could include something like we do currently in Turing for posterior and prior predictive checks. If the lhs (i.e. the
I've never used brms, bambi, or really any other regression package, so I can't offer experienced input. But I do have some initial impressions:
@sethaxen thanks for the feedback!
Yes, I agree with you both. From what I was playing around with over the weekend, we can easily extend The Symbolics/ModelingToolkit I think that would not be a major priority right now, but we can keep it in the roadmap.
This is a great provocation and invitation for insight. I honestly don't know. The main focus now is to create an easy API for users coming from an R/Stats background to migrate to Julia and Turing. But we'll definitely explore this avenue in the near future.
This also addresses number 4. I cannot speak from
We will favor, for now,
For Seth's remarks 4 and 5: that's a good point. It depends on how the "backend" PPLs are used. Is the plan to create a separate model for each formula ( In the latter case (which I think is preferable), you don't need to create a model for each formula, but have one sufficiently flexible model per "regression type". In which case I think crossing PPLs is easier, since the only contribution required is a good implementation of the fixed model, not code generation.
Indeed, the packages extending StatsModels don't tend to separate syntax extension and implementation cleanly. That's something that would be good to refactor, and probably rather tedious. I only looked at MixedModels, and couldn't easily figure out which parts would be necessary to extract; also, they make some very peculiar assumptions about the formula application. There's a fundamental reason why this is complicated: the StatsModels formula thing is designed in such a way that formula application (i.e., creation of model matrices) is always tied to model implementation at later stages (semantics time), because at some time you have to decide how to set up the model matrix. If it were only MixedModels.jl to adapt, it's probably doable, but I'm not familiar enough with that ecosystem to know how much else there is (polynomial regression? splines?). Is it feasible for them to move from one package "mixed effects regression with syntax and implementation" to "mixed effects syntax and meaning (
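To make "formula application (i.e., creation of model matrices)" concrete, here is a minimal sketch of how a StatsModels.jl formula is turned into a response vector and a model matrix. The toy table `data` and its column names are made up for illustration; a backend model would then only need `y` and `X`.

```julia
using StatsModels

# Toy column table (a NamedTuple of vectors is a valid Tables.jl source); values are invented.
data = (y = [1.0, 2.0, 3.0, 4.0], x = [0.5, 1.5, 2.5, 3.5], g = ["a", "b", "a", "c"])

f = @formula(y ~ 1 + x + g)

# Schema time: inspect the data to decide how each term is represented
# (continuous column vs. categorical contrasts).
f_applied = apply_schema(f, schema(f, data))

# Data time: materialise the response vector and the model matrix.
y, X = modelcols(f_applied, data)
```

Because no model type is passed to `apply_schema` here, the default handling of each term applies; packages that extend the formula language hook in at this step, which is where syntax and implementation become coupled.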
I think that splines and polynomials should be in the roadmap. I still need to inspect deeply the I will try to create an initial release with the feature list detailed above using just the
Regarding Tables.jl, this reminds me of the point
in the feature list above. My suggestion would be to avoid any explicit dependency on DataFrames if possible and just support the Tables.jl interface (as StatsModels and
Yes, @devmotion that is exactly what I am looking for.
StatsModels removed the dependency on CategoricalArrays two years ago (JuliaStats/StatsModels.jl#157) and replaced it with the DataAPI interface. For instance,
Great! That's all I need then!
Hello, all!
I was speaking with @yebai about how Turing could support the whole Bayesian Workflow, following Bob Carpenter's presentation at ProbProg 2021 in October 2021. And I gave him an idea about how an easy interface to Turing could attract more users and be a great transition for users coming from a stats background and/or R background. I am deeply interested in this because I mainly lecture graduate students in applied social sciences, and if I show either Stan code or Turing model code they would run away. Thus, I use the friendly RStan interface {brms} that uses the formula syntax (as you can see in this still ongoing course YouTube playlist). I also use Turing a lot in my research, and most of it is hierarchical/generalized linear models that could be easily specified by a formula. Thus, I also find myself having to copy and paste a LOT of boilerplate code to define a model.
I suggested this package and the following objectives, initial release, API design and roadmap. We discussed and we would really love feedback.
I am also committed to the long-term maintenance/development of TuringGLM, since this would solve all my lecturer and researcher needs. I am tired of having students complaining that they cannot install {brms} because of RStan Windows issues. And also I would move all my lecturing and research to Julia. Furthermore, I think this package has a great opportunity to attract users to Julia/Turing/Bayesian Statistics, which I am all up for.
Objective of the Package
Uses @formula syntax to specify several different (generalized) (hierarchical) linear models in Turing.jl. It is defined by StatsModels.jl and can be easily extended to GLMs (see GLM.jl) and to hierarchical models (see MixedModels.jl).
Heavily inspired by {brms} (uses RStan or CmdStanR), bambi (uses PyMC3) and StatsModels.jl.
My idea is just to create syntactic sugar for model definition with a @formula macro that translates it to an instantiated Turing @model along with the data. The user just needs to call sample, advi or another future Turing function (e.g. pathfinder) on the returned model from @formula and he is good to go. So this package would be easy to maintain and just focus on what's important: the @formula -> @model translation.
Discussion Points
@formula syntax: turing_model will return an instantiated Turing model with the data, so the user just has to call sample or advi etc. on the model. Example:
How does the user specify custom priors?
Feature list for 0.1.0 release (public repo)
DataFrames.jl and CategoricalArrays.jl
length(unique(x)) of a group-level intercept/slope
categorical vectors by reading the levels and using the first level as baseline.
Gaussian and TDist (IdentityLink)
Bernoulli, Binomial (how would we accept a matrix of [success n - success]) and Geometric (LogitLink)
Poisson and NegativeBinomial (LogLink)
Distributions.jl dependent):
return statement
(1 | group)
(x_1 | group)
(1 + x_1 | group) or (1 | group) + (x1 | group)
LKJ() (OBS: fast implementation depends on Feature request: Add LKJCholesky Turing.jl#1629)
??? @formula
Roadmap
LogNormal, Gamma, Exponential and Weibull
ZeroInflatedPoisson and ZeroInflatedNegativeBinomial.
ZeroInflatedBinomial and ZeroInflatedBeta
f1 = @formula y1 ~ x1 and f2 = @formula y2 ~ x2.
AR(p): Autoregressive
MA(q): Moving Average
ARMA(p, q): Autoregressive Moving Average
Pathfinder.jl?
Bijectors.jl