Merge pull request #191 from JuliaAI/docs
Add Documenter.jl documentation, migrated from the MLJ manual
ablaom authored Feb 23, 2024
2 parents 28d4f22 + 306b824 commit 7857aec
Showing 35 changed files with 1,830 additions and 17 deletions.
10 changes: 9 additions & 1 deletion .gitignore
@@ -1,2 +1,10 @@
*DS_Store
Manifest.toml
.ipynb_checkpoints
*~
#*
.DS_Store
sandbox/
/docs/build/
/docs/site/
/docs/Manifest.toml
.vscode
17 changes: 6 additions & 11 deletions README.md
@@ -9,16 +9,11 @@ machine learning models into
| :-----------: | :------: |
| [![Build Status](https://github.com/JuliaAI/MLJModelInterface.jl/workflows/CI/badge.svg)](https://github.com/JuliaAI/MLJModelInterface.jl/actions) | [![codecov.io](http://codecov.io/github/JuliaAI/MLJModelInterface.jl/coverage.svg?branch=master)](http://codecov.io/github/JuliaAI/MLJModelInterface.jl?branch=master) |

[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://juliaai.github.io/MLJModelInterface.jl/stable/)

[MLJ](https://github.com/alan-turing-institute/MLJ.jl) is a framework
for evaluating, combining and optimizing machine learning models in
Julia. A third party package wanting to integrate their supervised or
unsupervised machine learning models must import the module
`MLJModelInterface` defined in this package.

### Instructions

- [Quick-start guide](https://alan-turing-institute.github.io/MLJ.jl/dev/quick_start_guide_to_adding_models/) to adding models to MLJ

- [Detailed API
specification](https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/)
[MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/) is a framework for evaluating,
combining and optimizing machine learning models in Julia. A third party package wanting
to integrate their machine learning models into MLJ must import the module
`MLJModelInterface` defined in this package, as described in the
[documentation](https://juliaai.github.io/MLJModelInterface.jl/stable/).
3 changes: 3 additions & 0 deletions docs/Project.toml
@@ -0,0 +1,3 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
MLJModelInterface = "e80e1ace-859a-464e-9ed9-23947d8ae3ea"
46 changes: 46 additions & 0 deletions docs/make.jl
@@ -0,0 +1,46 @@
using Documenter
using MLJModelInterface
import MLJModelInterface as MMI

makedocs(;
    modules=[MLJModelInterface, ],
    format=Documenter.HTML(),
    pages=[
        "Home" => "index.md",
        "Quick-start guide" => "quick_start_guide.md",
        "The model type hierarchy" => "the_model_type_hierarchy.md",
        "New model type declarations" => "type_declarations.md",
        "Supervised models" => "supervised_models.md",
        "Summary of methods" => "summary_of_methods.md",
        "The form of data for fitting and predicting" => "form_of_data.md",
        "The fit method" => "the_fit_method.md",
        "The fitted_params method" => "the_fitted_params_method.md",
        "The predict method" => "the_predict_method.md",
        "The predict_joint method" => "the_predict_joint_method.md",
        "Training losses" => "training_losses.md",
        "Feature importances" => "feature_importances.md",
        "Trait declarations" => "trait_declarations.md",
        "Iterative models and the update! method" => "iterative_models.md",
        "Implementing a data front end" => "implementing_a_data_front_end.md",
        "Supervised models with a transform method" =>
            "supervised_models_with_transform.md",
        "Models that learn a probability distribution" => "fitting_distributions.md",
        "Serialization" => "serialization.md",
        "Document strings" => "document_strings.md",
        "Unsupervised models" => "unsupervised_models.md",
        "Static models" => "static_models.md",
        "Outlier detection models" => "outlier_detection_models.md",
        "Convenience methods" => "convenience_methods.md",
        "Where to place code implementing new models" => "where_to_put_code.md",
        "How to add models to the MLJ Model Registry" => "how_to_register.md",
        "Reference" => "reference.md",
    ],
    sitename="MLJModelInterface",
    warnonly = [:cross_references, :missing_docs],
)

deploydocs(
    repo = "github.com/JuliaAI/MLJModelInterface.jl",
    devbranch="dev",
    push_preview=false,
)
16 changes: 16 additions & 0 deletions docs/src/convenience_methods.md
@@ -0,0 +1,16 @@
# Convenience methods

```@docs; canonical=false
MMI.table
MMI.matrix
MMI.int
MMI.UnivariateFinite
MMI.classes
MMI.decoder
MMI.select
MMI.selectrows
MMI.selectcols
```


52 changes: 52 additions & 0 deletions docs/src/document_strings.md
@@ -0,0 +1,52 @@
# Document strings

To be registered, MLJ models must include a detailed document string
for the model type, and this must conform to the standard outlined
below. We recommend adapting an existing compliant document string,
consulting the requirements below where you are unsure, or using them
as a checklist. Here are examples of compliant doc-strings (go to the
end of the linked files):

- Regular supervised models (classifiers and regressors): [MLJDecisionTreeInterface.jl](https://github.com/JuliaAI/MLJDecisionTreeInterface.jl/blob/master/src/MLJDecisionTreeInterface.jl)

- Transformers: [MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl/blob/dev/src/builtins/Transformers.jl)

A utility function is available for generating a standardized header
for your doc-strings (but you provide most detail by hand):

```@docs
MLJModelInterface.doc_header
```

## The document string standard

Your document string must include the following components, in order:

- A *header*, closely matching the example given above.

- A *reference describing the algorithm* or an actual description of
the algorithm, if necessary. Detail any non-standard aspects of the
implementation. Generally, defer details on the role of
hyperparameters to the "Hyperparameters" section (see below).

- Instructions on *how to import the model type* from MLJ (because a user can already inspect the doc-string in the Model Registry, without having loaded the code-providing package).

- Instructions on *how to instantiate* with default hyperparameters or with keywords.

- A *Training data* section: explains how to bind a model to data in a machine with all possible signatures (e.g., `machine(model, X, y)` but also `machine(model, X, y, w)` if, say, weights are supported); the role and scitype requirements for each data argument should be itemized.

- Instructions on *how to fit* the machine (in the same section).

- A *Hyperparameters* section (unless there aren't any): an itemized list of the parameters, with defaults given.

- An *Operations* section: each implemented operation (`predict`, `predict_mode`, `transform`, `inverse_transform`, etc.) is itemized and explained. This should include operations with no data arguments, such as `training_losses` and `feature_importances`.

- A *Fitted parameters* section: explains what is returned by `fitted_params(mach)` (the same as `MLJModelInterface.fitted_params(model, fitresult)`; see later), with the fields of that named tuple itemized.

- A *Report* section (if `report` is non-empty): explains what, if anything, is included in `report(mach)` (the same as the `report` return value of `MLJModelInterface.fit`), with the fields itemized.

- An optional but highly recommended *Examples* section, which includes MLJ examples, but which could also include others if the model type also implements a second "local" interface, i.e., defined in the same module. (Note that each module referring to a type can declare separate doc-strings which appear concatenated in doc-string queries.)

- A closing *"See also"* sentence which includes a `@ref` link to the raw model type (if you are wrapping one).
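
The ordering of components listed above can be sketched as a skeleton document string. Everything here is hypothetical and for illustration only: the `FooRegressor` type, the `Foo` package, the `lambda` hyperparameter and the `Foo.fit` reference are invented names, and the struct does not subtype an MLJ model type so that the snippet stays dependency-free. In a real implementation the first paragraphs would normally be generated with `MLJModelInterface.doc_header` rather than written by hand.

```julia
"""
    FooRegressor

A model type for constructing a foo regressor, based on Foo.jl (a hypothetical
package), and implementing the MLJ model interface.

Foo regression is described in [reference goes here]. Any non-standard aspects of
the implementation are detailed here.

From MLJ, the type can be imported using

    FooRegressor = @load FooRegressor pkg=Foo

Do `model = FooRegressor()` to construct an instance with default hyperparameters.
Provide keyword arguments to override defaults, as in `FooRegressor(lambda=...)`.

# Training data

In MLJ or MLJBase, bind an instance `model` to data with

    mach = machine(model, X, y)

where

- `X`: any table of input features whose columns have `Continuous` scitype

- `y`: the target, an `AbstractVector` with `Continuous` element scitype

Train the machine using `fit!(mach, rows=...)`.

# Hyperparameters

- `lambda=1.0`: regularization strength (non-negative)

# Operations

- `predict(mach, Xnew)`: return predictions of the target given features `Xnew`

# Fitted parameters

The fields of `fitted_params(mach)` are:

- `coefficients`: the learned coefficients

# Examples

    using MLJ
    # a short, complete usage example goes here

See also [`Foo.fit`](@ref).
"""
mutable struct FooRegressor
    lambda::Float64
end

# Keyword constructor supplying the default documented above.
FooRegressor(; lambda=1.0) = FooRegressor(lambda)
```

The *Report* section is omitted in this sketch because the hypothetical model is assumed to return an empty report.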


7 changes: 7 additions & 0 deletions docs/src/feature_importances.md
@@ -0,0 +1,7 @@
# Feature importances

```@docs; canonical=false
MLJModelInterface.feature_importances
```

Trait values can also be set using the `metadata_model` method; see below.
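
As a rough sketch of what an implementation might look like: the model type `FooRegressor` and the layout of its `fitresult` (a named tuple with `features` and `scores` fields) are assumptions made purely for illustration. The two overloaded methods, `reports_feature_importances` and `feature_importances`, are the ones the interface provides for this purpose.

```julia
import MLJModelInterface as MMI

# Hypothetical model type, for illustration only.
mutable struct FooRegressor <: MMI.Deterministic end

# Trait declaration for models that report feature importances:
MMI.reports_feature_importances(::Type{<:FooRegressor}) = true

# Assumption: `fit` stored the feature names and one score per feature in the
# fitresult, as a named tuple. The return value is a vector of
# `feature::Symbol => importance::Real` pairs.
function MMI.feature_importances(::FooRegressor, fitresult, report)
    return [name => score for (name, score) in zip(fitresult.features, fitresult.scores)]
end

# Calling the method directly, with a hand-made fitresult:
fitresult = (features = [:x1, :x2], scores = [0.7, 0.3])
importances = MMI.feature_importances(FooRegressor(), fitresult, NamedTuple())
```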
20 changes: 20 additions & 0 deletions docs/src/fitting_distributions.md
@@ -0,0 +1,20 @@
# Models that learn a probability distribution


!!! warning "Experimental"

    The following API is experimental. It is subject to breaking changes during minor or major releases without warning. Models implementing this interface will not work with MLJBase versions earlier than 0.17.5.

Models that fit a probability distribution to some `data` should be
regarded as `Probabilistic <: Supervised` models with target `y = data`
and `X = nothing`.

The `predict` method should return a single distribution.

A working implementation of a model that fits a `UnivariateFinite`
distribution to some categorical data using [Laplace
smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)
controlled by a hyperparameter `alpha` is given
[here](https://github.com/JuliaAI/MLJBase.jl/blob/d377bee1198ec179a4ade191c11fef583854af4a/test/interface/model_api.jl#L36).
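
For orientation, here is a minimal sketch of the same pattern for a continuous target. It is not the linked implementation: the `NormalFitter` type is hypothetical, and the use of Distributions.jl to do the fitting is an assumption made for this example.

```julia
import MLJModelInterface as MMI
import Distributions

# Hypothetical model type, for illustration only: fits a normal distribution.
mutable struct NormalFitter <: MMI.Probabilistic end

# `X` is ignored (the user supplies `nothing`); the target `y` is the data.
function MMI.fit(::NormalFitter, verbosity, X, y)
    fitresult = Distributions.fit(Distributions.Normal, y)
    cache = nothing
    report = NamedTuple()
    return fitresult, cache, report
end

# A single distribution is returned, whatever `Xnew` is:
MMI.predict(::NormalFitter, fitresult, Xnew) = fitresult

MMI.input_scitype(::Type{<:NormalFitter}) = Nothing
MMI.target_scitype(::Type{<:NormalFitter}) = AbstractVector{MMI.Continuous}

# Calling the model API directly, as a machine would do internally:
y = [1.0, 2.0, 3.0, 4.0]
model = NormalFitter()
fitresult, _, _ = MMI.fit(model, 0, nothing, y)
d = MMI.predict(model, fitresult, nothing)
```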


47 changes: 47 additions & 0 deletions docs/src/form_of_data.md
@@ -0,0 +1,47 @@
# The form of data for fitting and predicting

The model implementer does not have absolute control over the types of
data `X`, `y` and `Xnew` appearing in the `fit` and `predict` methods
they must implement. Rather, they can specify the *scientific type* of
this data by making appropriate declarations of the traits
`input_scitype` and `target_scitype` discussed later under [Trait
declarations](@ref).

*Important Note.* Unless it genuinely makes little sense to do so, the
MLJ recommendation is to specify a `Table` scientific type for `X`
(and hence `Xnew`) and an `AbstractVector` scientific type (e.g.,
`AbstractVector{Continuous}`) for targets `y`. Algorithms requiring
matrix input can coerce their inputs appropriately; see below.


## Additional type coercions

If the core algorithm being wrapped requires data in a different or
more specific form, then `fit` will need to coerce the table into the
form desired (and the same coercions applied to `X` will have to be
repeated for `Xnew` in `predict`). To assist with common cases, MLJ
provides the convenience method
[`MMI.matrix`](@ref). `MMI.matrix(Xtable)` has type `Matrix{T}` where
`T` is the tightest common type of elements of `Xtable`, and `Xtable`
is any table. (If `Xtable` is itself just a wrapped matrix,
`Xtable=Tables.table(A)`, then `A=MMI.matrix(Xtable)` will be returned
without any copying.)

Alternatively, a more performant option is to implement a data
front-end for your model; see [Implementing a data front-end](@ref).

Other auxiliary methods provided by MLJModelInterface for handling tabular data
are: `selectrows`, `selectcols`, `select` and `schema` (for extracting
the size, names and eltypes of a table's columns). See [Convenience
methods](@ref) below for details.


## Important convention

It is to be understood that the columns of table `X` correspond to
features and the rows to observations. So, for example, the predict
method for a linear regression model might look like `predict(model,
w, Xnew) = MMI.matrix(Xnew)*w`, where `w` is the vector of learned
coefficients.
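
The convention can be illustrated with a small least-squares sketch. The `SimpleLinearRegressor` type is hypothetical, and loading MLJBase is an assumption made so that `MMI.matrix` has its full implementation available for tables.

```julia
import MLJModelInterface as MMI
import MLJBase          # assumption: loaded so that `MMI.matrix` works on tables
using LinearAlgebra

# Hypothetical model type, for illustration only.
mutable struct SimpleLinearRegressor <: MMI.Deterministic end

function MMI.fit(::SimpleLinearRegressor, verbosity, X, y)
    A = MMI.matrix(X)   # rows are observations, columns are features
    w = A \ y           # ordinary least squares; one coefficient per feature
    return w, nothing, NamedTuple()
end

# The same coercion applied to `X` in `fit` is repeated for `Xnew`:
MMI.predict(::SimpleLinearRegressor, w, Xnew) = MMI.matrix(Xnew) * w

# A named tuple of equal-length vectors is a table (two features, four observations):
X = (x1 = [1.0, 2.0, 3.0, 4.0], x2 = [1.0, 0.0, 1.0, 0.0])
y = 2 .* X.x1 .- X.x2

model = SimpleLinearRegressor()
w, _, _ = MMI.fit(model, 0, X, y)
yhat = MMI.predict(model, w, X)
```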


15 changes: 15 additions & 0 deletions docs/src/how_to_register.md
@@ -0,0 +1,15 @@
# How to add models to the MLJ model registry

The MLJ model registry is located in the [MLJModels.jl
repository](https://github.com/JuliaAI/MLJModels.jl). To
add a model, follow these steps:

- Ensure your model conforms to the interface defined above.

- Raise an issue at
[MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl/issues)
and point out where the MLJ-interface implementation is, e.g. by
providing a link to the code.

- An administrator will then review your implementation and work with
you to add the model to the registry.
112 changes: 112 additions & 0 deletions docs/src/implementing_a_data_front_end.md
@@ -0,0 +1,112 @@
# Implementing a data front-end

!!! note

    It is suggested that packages implementing MLJ's model API that later implement a data front-end tag their changes in a breaking release. (The changes will not break the use of models for the ordinary MLJ user, who interacts with models exclusively through the machine interface. However, they will break usage for some external packages that have chosen to depend directly on the model API.)

```julia
MLJModelInterface.reformat(model, args...) -> data
MLJModelInterface.selectrows(::Model, I, data...) -> sampled_data
```

Models optionally overload `reformat` to define transformations of
user-supplied data into some model-specific representation (e.g., from
a table to a matrix). Computational overheads associated with multiple
`fit!`/`predict`/`transform` calls (on MLJ machines) are then avoided
when memory resources allow. The fallback returns `args` (no
transformation).

The `selectrows(::Model, I, data...)` method is overloaded to specify
how the model-specific data is to be subsampled, for some observation
indices `I` (a colon, `:`, or instance of
`AbstractVector{<:Integer}`). In this way, implementing a data
front-end also allows more efficient resampling of data (in user calls
to `evaluate!`).

After detailing formal requirements for implementing a data front-end,
we give a [Sample implementation](@ref). A simple [implementation](https://github.com/Evovest/EvoTrees.jl/blob/94b58faf3042009bd609c9a5155a2e95486c2f0e/src/MLJ.jl#L23)
also appears in the EvoTrees.jl package.

Here "user-supplied data" is what the MLJ user supplies when
constructing a machine, as in `machine(model, args...)`, which
coincides with the arguments expected by `fit(model, verbosity,
args...)` when `reformat` is not overloaded.

Overloading `reformat` is permitted for any `Model`
subtype, except for subtypes of `Static`. Here is a complete list of
responsibilities for such an implementation, for some
`model::SomeModelType` (a sample implementation follows after):

- A `reformat(model::SomeModelType, args...) -> data` method must be
implemented for each form of `args...` appearing in a valid machine
construction `machine(model, args...)` (there will be one for each
possible signature of `fit(::SomeModelType, ...)`).

- Additionally, if not included above, there must be a single-argument
  form of `reformat`, `reformat(model::SomeModelType, arg) -> (data,)`,
  serving as a data front-end for operations like `predict`. It must
  always hold that
  `reformat(model, args...)[1] == reformat(model, args[1])[1]`.

  The fallback is `reformat(model, args...) = args` (i.e., the provided data is returned unchanged, as a tuple).

  *Important.* `reformat(model::SomeModelType, args...)` must always return a tuple, even if
  this has length one. The length of the tuple need not match `length(args)`.

- `fit(model::SomeModelType, verbosity, data...)` should be
  implemented as if `data` is the output of `reformat(model,
  args...)`, where `args` is the data an MLJ user has bound to `model`
  in some machine. The same applies to any overloading of `update`.

- Each implemented operation, such as `predict` and `transform` - but
excluding `inverse_transform` - must be defined as if its data
arguments are `reformat`ed versions of user-supplied data. For
example, in the supervised case, `data_new` in
`predict(model::SomeModelType, fitresult, data_new)` is
`reformat(model, Xnew)`, where `Xnew` is the data provided by the MLJ
user in a call `predict(mach, Xnew)` (`mach.model == model`).

- To specify how the model-specific representation of data is to be
  resampled, implement `selectrows(model::SomeModelType, I, data...)
  -> resampled_data` for each overloading of `reformat(model::SomeModelType,
  args...) -> data` above. Here `I` is an arbitrary abstract integer
  vector or `:` (type `Colon`).

  *Important.* `selectrows(model::SomeModelType, I, args...)` must always
  return a tuple of the same length as `args`, even if this is one.

  The fallback for `selectrows` is described at [`selectrows`](@ref).


## Sample implementation

Suppose a supervised model type `SomeSupervised` supports sample
weights, leading to two different `fit` signatures, and that it has a
single operation `predict`:

    fit(model::SomeSupervised, verbosity, X, y)
    fit(model::SomeSupervised, verbosity, X, y, w)

    predict(model::SomeSupervised, fitresult, Xnew)

Without a data front-end implemented, suppose `X` is expected to be a
table and `y` a vector, but suppose the core algorithm always converts
`X` to a matrix with features as rows (each record corresponds to
a column of the matrix). Then a new data front-end might look like
this:

    const MMI = MLJModelInterface

    # for fit:
    MMI.reformat(::SomeSupervised, X, y) = (MMI.matrix(X)', y)
    MMI.reformat(::SomeSupervised, X, y, w) = (MMI.matrix(X)', y, w)
    MMI.selectrows(::SomeSupervised, I, Xmatrix, y) =
        (view(Xmatrix, :, I), view(y, I))
    MMI.selectrows(::SomeSupervised, I, Xmatrix, y, w) =
        (view(Xmatrix, :, I), view(y, I), view(w, I))

    # for predict:
    MMI.reformat(::SomeSupervised, X) = (MMI.matrix(X)',)
    MMI.selectrows(::SomeSupervised, I, Xmatrix) = (view(Xmatrix, :, I),)

With these additions, `fit` and `predict` are refactored, so that `X`
and `Xnew` represent matrices with features as rows.
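
To make the refactoring concrete, here is a self-contained sketch of the unweighted case, in which `fit` and `predict` consume the reformatted data. The least-squares core is an assumption standing in for whatever algorithm is being wrapped, and MLJBase is assumed to be loaded so that `MMI.matrix` works on tables. The last lines mimic, by hand, the calls a machine is expected to make.

```julia
import MLJModelInterface as MMI
import MLJBase          # assumption: loaded so that `MMI.matrix` works on tables
using LinearAlgebra

# Hypothetical model type, for illustration only (unweighted case).
mutable struct SomeSupervised <: MMI.Deterministic end

# Data front-end: tables become matrices with features as rows.
MMI.reformat(::SomeSupervised, X, y) = (MMI.matrix(X)', y)
MMI.reformat(::SomeSupervised, X) = (MMI.matrix(X)',)
MMI.selectrows(::SomeSupervised, I, Xmatrix, y) =
    (view(Xmatrix, :, I), view(y, I))
MMI.selectrows(::SomeSupervised, I, Xmatrix) = (view(Xmatrix, :, I),)

# `fit` and `predict` now receive reformatted data, not the user's table.
function MMI.fit(::SomeSupervised, verbosity, Xmatrix, y)
    A = Matrix(Xmatrix')        # observations as rows, for the solver
    coefs = A \ Vector(y)       # ordinary least squares (stand-in algorithm)
    return coefs, nothing, NamedTuple()
end
MMI.predict(::SomeSupervised, coefs, Xmatrix) = Xmatrix' * coefs

# User-supplied data: a table with two features and four observations.
X = (x1 = [1.0, 2.0, 3.0, 4.0], x2 = [1.0, 0.0, 1.0, 0.0])
y = 2 .* X.x1 .- X.x2

model = SomeSupervised()
data = MMI.reformat(model, X, y)               # done once
train = MMI.selectrows(model, 1:3, data...)    # cheap subsampling via views
coefs, _, _ = MMI.fit(model, 0, train...)
yhat = MMI.predict(model, coefs, MMI.reformat(model, X)...)
```

Because `selectrows` returns views into the already-converted matrix, repeated resampling does not repeat the table-to-matrix conversion.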