Completed MLJ interface for the Imputation module
The V2 API is now fully implemented, except for the perceptron-like and neural network models.

This will be BetaML 0.7, with the V2 API deemed experimental; the full V2 API will become the default in BetaML 0.8.
sylvaticus committed Aug 2, 2022
1 parent ee1123a commit de7c5e8
Showing 10 changed files with 214 additions and 72 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -39,6 +39,9 @@ The package can be easily used in R or Python employing [JuliaCall](https://gith

### Examples

!!! Note
We are currently implementing a new "V2" API that further simplifies the library usage with a "standard" `mod = Model([Options])`, `train!(mod,X,[Y])`, `predict(mod,[X])` workflow. In BetaML v0.7 this new API is still experimental, as documentation and implementation are not complete. We plan to make it the default API in BetaML 0.8, when the current API will be deemed deprecated.
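For instance, a minimal sketch of this workflow using the new `MeanImputer` (note that the imputers documented in this commit use `fit!` rather than `train!`; data values are illustrative):

```julia
using BetaML
X   = [1.0 2.0; missing 4.0; 5.0 6.0]  # a matrix with a missing entry
mod = MeanImputer()                    # 1. create the model (default options)
fit!(mod, X)                           # 2. fit it to the data
x̂   = predict(mod)                     # 3. X with the missing values imputed
```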

We see how to use three different algorithms to learn the relation between floral sepal and petal measures (first 4 columns) and the species' name (5th column) in the famous [iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

The first two algorithms are examples of _supervised_ learning; the third one of _unsupervised_ learning.
2 changes: 1 addition & 1 deletion src/BetaML.jl
@@ -53,7 +53,7 @@ include("Imputation/Imputation.jl") # (Missing) imputation algorithms
const MLJ_PERCEPTRON_MODELS = (PerceptronClassifier, KernelPerceptronClassifier, PegasosClassifier)
const MLJ_TREES_MODELS = (DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier, RandomForestRegressor)
const MLJ_CLUSTERING_MODELS = (KMeans, KMedoids, GMMClusterer, MissingImputator)
const MLJ_IMPUTERS_MODELS = (BetaMLMeanImputer, BetaMLGMMImputer, BetaMLRFImputer,) # these are the name of the MLJ models, not the BetaML ones...
const MLJ_IMPUTERS_MODELS = (BetaMLMeanImputer, BetaMLGMMImputer, BetaMLRFImputer,BetaMLGenericImputer) # these are the name of the MLJ models, not the BetaML ones...
const MLJ_OTHER_MODELS = (BetaMLGMMRegressor,)
const MLJ_INTERFACED_MODELS = (MLJ_PERCEPTRON_MODELS..., MLJ_TREES_MODELS..., MLJ_CLUSTERING_MODELS..., MLJ_IMPUTERS_MODELS..., MLJ_OTHER_MODELS...)
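A hedged sketch of how one of these interfaced models might be used from MLJ; the `@load` path and the unsupervised `machine`/`fit!`/`transform` protocol are assumptions here, not confirmed by this diff:

```julia
using MLJ
X = [1.0 2.0; missing 4.0; 5.0 6.0]
Imputer = @load BetaMLMeanImputer pkg=BetaML  # assumed registered model name and package
mach = machine(Imputer(), X)                  # assuming matrix input is accepted
fit!(mach)
x̂ = transform(mach, X)                        # assuming MLJ's unsupervised transform protocol
```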

2 changes: 1 addition & 1 deletion src/Clustering/Clustering.jl
@@ -19,7 +19,7 @@ The module provides the following functions. Use `?[function]` to access their f
"""
module Clustering

using LinearAlgebra, Random, Statistics, Reexport, CategoricalArrays
using LinearAlgebra, Random, Statistics, Reexport, CategoricalArrays, DocStringExtensions
import Distributions

using ForceImport
2 changes: 1 addition & 1 deletion src/GMM/GMM_clustering.jl
@@ -173,7 +173,7 @@ Base.@kwdef mutable struct GMMClusterHyperParametersSet <: BetaMLHyperParameters
"Number of mixtures (latent classes) to consider [def: 3]"
nClasses::Int64 = 3
"Initial probabilities of the categorical distribution (nClasses x 1) [default: `[]`]"
probMixtures::Vector{Float64} = []
probMixtures::Vector{Float64} = Float64[]
"An array (of length K) of the mixture to employ (see notes) [def: `[DiagonalGaussian() for i in 1:K]`]"
mixtures::Vector{AbstractMixture} = [DiagonalGaussian() for i in 1:nClasses]
"Tolerance to stop the algorithm [default: 10^(-6)]"
63 changes: 44 additions & 19 deletions src/Imputation/Imputation.jl
@@ -15,12 +15,19 @@ Implement the BetaML.Imputation module
Provide various imputation methods for missing data. Note that the interpretation of "missing" can be very wide.
For example, recommendation systems / collaborative filtering (e.g. suggesting which film to watch) can well be represented as a missing data imputation problem.
Using the original "V1" API:
- [`predictMissing`](@ref): Impute data using a Generative (Gaussian) Mixture Model (good trade off)
Using the v2 API (experimental):
- [`MeanImputer`](@ref): Simple imputer using the feature or record means, with optional record normalisation (fastest)
- [`GMMImputer`](@ref): Impute data using a Generative (Gaussian) Mixture Model (good trade off)
- [`RFImputer`](@ref): Impute missing data using Random Forests, with optional replicable multiple imputations (most accurate).
- [`GenericImputer`](@ref): Impute missing data using a vector (one per column) of arbitrary learning models (classifiers/regressors) that implement `m = Model([options])`, `train!(m,X,Y)` and `predict(m,X)`.
Imputations for all these models can be obtained by running `fit!([imputer model],X)`. The data with the missing values imputed can then be obtained with `predict(m::Imputer)`. Use `info(m::Imputer)` to retrieve further information concerning the imputation.
Imputations for all these models can be obtained by running `mod = ImputatorModel([options])`, `fit!(mod,X)`. The data with the missing values imputed can then be obtained with `predict(mod)`. Use `info(m::Imputer)` to retrieve further information concerning the imputation.
Trained models can also be used to impute missing values in new data with `predict(mod,xNew)`.
Note that if multiple imputations are run (for the supporting imputers) `predict()` will return a vector of predictions rather than a single one.
## Example
@@ -69,13 +76,13 @@ julia> medianValues = [median([v[r,c] for v in vals]) for r in 1:nR, c in 1:nC]
julia> infos = info(mod);
julia> infos.nImputedValues
julia> infos[:nImputedValues]
1
```
"""
module Imputation

using Statistics, Random, LinearAlgebra, StableRNGs
using Statistics, Random, LinearAlgebra, StableRNGs, DocStringExtensions
using ForceImport
@force using ..Api
@force using ..Utils
@@ -99,7 +106,7 @@ abstract type Imputer <: BetaMLModel end
"""
predictMissing(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance)
OLD API. Use [`GMMClusterer`](@ref) instead.
Note: This is the OLD API. See [`GMMClusterer`](@ref) for the V2 API.
Fill missing entries in a sparse matrix (i.e. perform a "matrix completion") assuming an underlying Gaussian Mixture probabilistic Model (GMM) and implementing
an Expectation-Maximisation algorithm.
Expand Down Expand Up @@ -224,20 +231,24 @@ function fit!(imputer::MeanImputer,X)
#X̂ = copy(X)
nR,nC = size(X)
missingMask = ismissing.(X)
cMeans = [mean(skipmissing(X[:,i])) for i in 1:nC]
overallMean = mean(skipmissing(X))
cMeans = [sum(ismissing.(X[:,i])) == nR ? overallMean : mean(skipmissing(X[:,i])) for i in 1:nC]

if imputer.hpar.norm == nothing
adjNorms = []
X̂ = [missingMask[r,c] ? cMeans[c] : X[r,c] for r in 1:nR, c in 1:nC]
else
adjNorms = [norm(collect(skipmissing(r)),imputer.hpar.norm) / (nC - sum(ismissing.(r))) for r in eachrow(X)]
adjNorms = [sum(ismissing.(r)) == nC ? missing : norm(collect(skipmissing(r)),imputer.hpar.norm) / (nC - sum(ismissing.(r))) for r in eachrow(X)]
adjNormsMean = mean(skipmissing(adjNorms))
adjNorms[ismissing.(adjNorms)] .= adjNormsMean
X̂ = [missingMask[r,c] ? cMeans[c]*adjNorms[r]/sum(adjNorms) : X[r,c] for r in 1:nR, c in 1:nC]
end
imputer.par = MeanImputerLearnableParameters(cMeans,adjNorms,X̂)
imputer.info[:nImputedValues] = sum(missingMask)
imputer.fitted = true
return true
end

"""
predict(m::MeanImputer)
@@ -246,21 +257,21 @@ Return the data with the missing values replaced with the imputed ones using [`M
predict(m::MeanImputer) = m.par.imputedValues

"""
predict(m::MeanImputer)
predict(m::MeanImputer, X)
Return the data with the missing values replaced with the imputed ones using [`MeanImputer`](@ref).
"""
function predict(m::MeanImputer,X)
nR,nC = size(X)
m.fitted || error()
nC == length(m.par.cMeans) || error()
(m.hpar.norm == nothing || nR == length(m.par.norms)) || error()
nC == length(m.par.cMeans) || error("`MeanImputer` can only predict missing values in matrices with the same number of columns as the matrix it has been trained with.")
(m.hpar.norm == nothing || nR == length(m.par.norms)) || error("If norms are used, `MeanImputer` can predict only matrices with the same number of rows as the matrix it has been trained with.")

missingMask = ismissing.(X)
if m.hpar.norm == nothing
X̂ = [missingMask[r,c] ? m.par.cMeans[c] : X[r,c] for r in 1:nR, c in 1:nC]
else
X̂ = [missingMask[r,c] ? m.par.cMeans[c]*m.par.adjNorms[r]/sum(m.par.adjNorms) : X[r,c] for r in 1:nR, c in 1:nC]
X̂ = [missingMask[r,c] ? m.par.cMeans[c]*m.par.norms[r]/sum(m.par.norms) : X[r,c] for r in 1:nR, c in 1:nC]
end
return X̂
end
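A hedged usage sketch of the two `predict` methods above (default options, illustrative data):

```julia
X   = [1.0 2.0 missing; 4.0 missing 6.0]
mod = MeanImputer()                  # norm == nothing by default
fit!(mod, X)
predict(mod)                         # the training data with the missing values imputed
predict(mod, [missing 3.0 5.0])      # new data: must have the same number of columns
```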
@@ -430,7 +441,17 @@ end
# ------------------------------------------------------------------------------
# RFImputer

"""
**`RFImputerHyperParametersSet`**
Hyperparameters for RFImputer
## Parameters:
- For the underlying random forest algorithm parameters (`nTrees`,`maxDepth`,`minGain`,`minRecords`,`maxFeatures`,`splittingCriterion`,`β`,`initStrategy`, `oob` and `rng`) see [`RFHyperParametersSet`](@ref) for the specific RF algorithm parameters
- `forcedCatCols`: specify the positions of the integer columns to treat as categorical instead of cardinal. [Default: empty vector (all numerical cols are treated as cardinal by default and the others as categorical)]
- `recursivePassages`: Define the number of times to go through the various columns to impute their data. Useful when there are data to impute on multiple columns. The order of the first passage is given by the decreasing number of missing values per column; the other passages are random [default: `1`].
- `multipleImputations`: Determine the number of independent imputations of the whole dataset to make. Note that while independent, the imputations share the same random number generator (RNG).
"""
Base.@kwdef mutable struct RFImputerHyperParametersSet <: BetaMLHyperParametersSet
rfhpar = RFHyperParametersSet()
forcedCatCols::Vector{Int64} = Int64[] # like in RF, normally integers are considered ordinal
@@ -450,12 +471,7 @@ end
Impute missing data using Random Forests, with optional replicable multiple imputations.
For the underlying random forest algorithm parameters (`nTrees`,`maxDepth`,`minGain`,`minRecords`,`maxFeatures`,`splittingCriterion`,`β`,`initStrategy`, `oob` and `rng`) see [`buildTree`](@ref) and [`buildForest`](@ref).
### Specific parameters:
- `forcedCategoricalCols`: specify the positions of the integer columns to treat as categorical instead of cardinal. [Default: empty vector (all numerical cols are treated as cardinal by default and the others as categorical)]
- `recursivePassages`: Define the number of times to go through the various columns to impute their data. Useful when there are data to impute on multiple columns. The order of the first passage is given by the decreasing number of missing values per column; the other passages are random [default: `1`].
- `multipleImputations`: Determine the number of independent imputations of the whole dataset to make. Note that while independent, the imputations share the same random number generator (RNG).
See [`RFImputerHyperParametersSet`](@ref) and [`RFHyperParametersSet`](@ref).
### Notes:
- Given a certain RNG and its status (e.g. `RFImputer(...,rng=StableRNG(FIXEDSEED))`), the algorithm is completely deterministic, i.e. replicable.
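A hedged example (keyword names follow the hyperparameter list above; that the constructor accepts them directly as keywords is an assumption):

```julia
using StableRNGs
mod  = RFImputer(multipleImputations=3, recursivePassages=2, rng=StableRNG(123))
fit!(mod, X)
vals = predict(mod)   # with multiple imputations: a 3-element vector of imputed matrices
```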
@@ -692,10 +708,20 @@ end
# ------------------------------------------------------------------------------
# GeneralImputer

"""
$(TYPEDEF)
Hyperparameters for GeneralImputer
## Parameters:
$(FIELDS)
"""
Base.@kwdef mutable struct GeneralImputerHyperParametersSet <: BetaMLHyperParametersSet
"Specify a regressor or classier model per column. Default to random forests."
models = nothing
"Define the times to go trough the various columns to impute their data. Useful when there are data to impute on multiple columns. The order of the first passage is given by the decreasing number of missing values per column, the other passages are random [default: `1`]."
recursivePassages::Int64 = 1
"Determine the number of independent imputation of the whole dataset to make. Note that while independent, the imputations share the same random number generator (RNG)."
multipleImputations::Int64 = 1
end

@@ -707,10 +733,9 @@ end
"""
GeneralImputer
Impute missing data using any regressor/classifier (not necessarily from BetaML) that implements `fit!(X,Y)` and `predict(X)`.
Impute missing data using any regressor/classifier (not necessarily from BetaML) that implements `m = Model([options])`, `fit!(m,X,Y)` and `predict(m,X)`.
### Specific parameters:
- `multipleImputations`: Determine the number of independent imputations of the whole dataset to make. Note that while independent, the imputations share the same random number generator (RNG).
See [`GeneralImputerHyperParametersSet`](@ref) for the hyper-parameters.
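A hedged sketch, relying on the default of one random-forest model per column (the keyword constructor is an assumption):

```julia
mod = GeneralImputer(recursivePassages=2)  # `models = nothing` → random forests per column
fit!(mod, X)
x̂ = predict(mod)
```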
"""
mutable struct GeneralImputer <: Imputer

2 comments on commit de7c5e8

@sylvaticus (Owner, Author) commented:

@JuliaRegistrator register

Release notes:

• new experimental V2 API that implements a "standard" `mod = Model([Options])`, `train!(mod,X,[Y])`, `predict(mod,[X])` workflow. In BetaML v0.7 this new API is still experimental, as documentation and implementation are not complete (perceptrons and neural networks are still missing). We plan to make it the default API in BetaML 0.8, when the current API will be deemed deprecated.
• new Imputation module with several missing-value imputers (MeanImputer, GMMImputer, RFImputer, GeneralImputer) and their corresponding MLJ interfaces. The last one, in particular, allows using any regressor/classifier (not necessarily from BetaML) for which the API described above is valid
• Clustering module reorganised to contain only hard-clustering algorithms (K-Means and K-Medoids), while GMM clustering and the new GMMRegressor1 and GMMRegressor2 models are in the new GMM module
• Split large files into subfiles, like Trees.jl, where DT and RF are now in separate (included) files
• New oneHotDecoder(x) function in the Utils module (see the sketch after this list)
• New dependency on DocStringExtensions.jl
  • Several bugfixes
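As a hedged sketch of the new `oneHotDecoder`, assuming it simply inverts `oneHotEncoder`'s integer-class encoding (exact return types unverified):

```julia
using BetaML
x = oneHotEncoder([2, 1, 3])   # assumed: a 3×3 matrix with a single 1 per row
oneHotDecoder(x)               # expected: [2, 1, 3]
```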

@JuliaRegistrator commented:

Registration pull request created: JuliaRegistries/General/65488

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:

```
git tag -a v0.7.0 -m "<description of version>" de7c5e8dead95e3091861ef4f545ad9357e516ad
git push origin v0.7.0
```
