Measures that rely on Tasks do not work for Pipelines #13

Open
henrifnk opened this issue Feb 14, 2021 · 7 comments
@henrifnk
Contributor

Simple Example:

library(mlr3)
library(mlr3cluster)
library(mlr3pipelines)

task = tsk("usarrests")
# one pipeline per number of centers: scale, then k-means
kmeans_centers = lapply(1:10, function(x) po("scale") %>>% lrn("clust.kmeans", centers = x))
design = benchmark_grid(
  tasks = task,
  learners = kmeans_centers,
  resamplings = rsmp("insample")
)
bmr = benchmark(design)
bmr$score(msr("clust.wss"))$clust.wss

will produce output like

[1] 355807.82 114846.81 81862.19 79208.07 70152.06 68255.12 68148.43 63241.63 54304.11 43632.32

The WSS values are clearly far too large to have been computed on the scaled features, so the measure must be using the raw data.
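
To see the scale dependence outside of mlr3, here is a rough sketch with base R's kmeans() and scale() on the same underlying data (exact numbers will vary with the random initialization):

X = as.matrix(USArrests)                      # the data behind tsk("usarrests")
set.seed(1)
kmeans(X, centers = 2)$tot.withinss           # large, dominated by the high-variance columns
kmeans(scale(X), centers = 2)$tot.withinss    # orders of magnitude smaller on standardized features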

The problem lies in MeasureClustInternal, which takes the "raw" task, without any preprocessing, to compute the features.
I think this is probably an issue only mlr3cluster suffers from, as all other measures depend solely on the predictions ...?

private = list(
  .score = function(prediction, task, ...) {
    # `task` is the unprocessed task passed to $score(), not the task the learner saw
    X = as.matrix(task$data(rows = prediction$row_ids))
    if (!is.double(X)) { # clusterCrit does not convert lgls/ints
      storage.mode(X) = "double"
    }
    intCriteria(X, prediction$partition, self$crit)[[1L]]
  }
)

This could be avoided if there were generic access to the preprocessed task in the pipeline.
In that case, one could exchange the task in the function for the one seen by the learner itself.
The problem is: if I inspect the state of a trained pipeline, the stored preprocessed tasks are empty...
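
One way to at least inspect what the learner actually saw is the Graph's keep_results flag (a debugging aid; this is a sketch of my understanding, not a proposal for how measures should obtain the task):

library(mlr3)
library(mlr3cluster)
library(mlr3pipelines)

gr = po("scale") %>>% lrn("clust.kmeans", centers = 2)
gr$keep_results = TRUE                         # keep intermediate PipeOp outputs for inspection
gr$train(tsk("usarrests"))

scaled_task = gr$pipeops$scale$.result[[1L]]   # the scaled task the learner was trained on
scaled_task$data()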

@pfistfl
Member

pfistfl commented Feb 26, 2021

Hey, I had a brief talk with Bernd about this today.

What we understood is the following:
Cluster measures internally expect fully numeric, scaled features. They do not necessarily require the exact pre-processed features. In fact, the exact pre-processed features might even be detrimental, because (*) e.g. applying PCA might destroy some relevant information about the high-dimensional data situation.
(*) Not perfectly sure whether this is true, is it?
Another thing I have not fully understood is whether scaling is actually required: what does the magnitude of the metric actually tell me? I would assume that scaling at least makes features comparable, so I see why it might be required.

  • Instead of scaling using the pre-processing pipeline, we should perhaps simply add a fixed scaling step.
  • We could for now state that the metrics are only defined for fully metric spaces and e.g. provide options / versions of the measure that include one-hot encoding.

In general, on an abstract level, shouldn't the cluster measure be computed with respect to the original data and not some processed version? If I tune against a cluster measure and the measure is computed with respect to the pre-processed data, I can pre-process the data such that the metric is optimal (e.g. by just dropping all variables or something).

Pinging for comment here @damirpolat @henrifnk

@henrifnk
Contributor Author

henrifnk commented Feb 27, 2021

Thank you for the thoughts @pfistfl.
I'm not sure if I understood you perfectly right...

In my opinion, it should be up to the user how the metric should be calculated.

  • A fixed scaling step would force scaling of the data for any learning algorithm and any metric? Maybe I understood you wrong, but that makes no sense to me, since some metrics might be independent of the scale of the features...

  • What about new (test) data? They would have to be scaled with the original task's scaling parameters in this case?

  • More than that, if one decides that the high-dimensional data of an arbitrary task should be shrunk to a smaller dataset, e.g. by PCA, shouldn't we leave this option to the user?

    • Here, they state:
    • in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem (...).

  • Since mlr3 learners base their clustering on the preprocessed task, shouldn't any metric rely on this preprocessed task, too?

  • If one calculated a metric on the unprocessed data, the metric could easily become pathological:

    • unscaled data might cause very high scores that, in the end, depend only on one feature with a very large scale
    • non-imputed data might cause NAs in the metric
    • If one is able to create data by preprocessing that enhances the performance, doesn't that simply mean the data is made more readable for the learner, e.g. by filtering out noise somehow?
      • Usually, dropping features should decrease your performance...

Please have a look at the PR I made yesterday.
I think it is very simple and clean, since it will only store the (preprocessed) task in the prediction.
This is somewhat analogous to the truth saved in predictions in the classification and regression context.
With this information saved, we only need to rely on the prediction to calculate any metric.
What I could think of, as an addition, is an optional task argument where users could pass the unpreprocessed task to calculate the measure on...
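
For illustration, the analogy to regression/classification, where the ground truth already travels inside the prediction object (the `$task` field below is hypothetical, just to sketch the idea of the PR, not its actual interface):

library(mlr3)

p = lrn("regr.rpart")$train(tsk("mtcars"))$predict(tsk("mtcars"))
head(p$truth)        # truth is stored in the prediction, msr("regr.mse") needs nothing else

# The PR's idea, roughly: a clustering prediction would likewise carry the
# (preprocessed) task it was computed from, e.g. something like
#   p$task             # hypothetical field name
# so msr("clust.wss") could be scored without passing a task separately.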

@giuseppec

giuseppec commented Feb 27, 2021

I think @henrifnk is right. Here is another example. In supervised learning, performance measures that can be "extracted" from the fitted model should match the ones computed from the "outside" via the $score() method (for single learners and pipelines), see e.g.:

task = tsk("boston_housing")
l1 = lrn("regr.lm")
l1$train(task)
mean(l1$model$residuals^2) # extract MSE from the model (residuals)
p1 = l1$predict(task)
p1$score(msr("regr.mse")) # computing MSE from "outside" gives the same value

The same thing can be done with a pipeline:

task = tsk("boston_housing")
pscale = po("scale")
l2 = pscale %>>% lrn("regr.lm")
l2$train(task)
mean(l2$pipeops$regr.lm$state$model$residuals^2) # extract mse from the model
p2 = l2$predict(task)
p2$regr.lm.output$score(msr("regr.mse")) # computing mse from "outside"  gives the same value

I would expect the same behavior for clustering tasks, i.e., measures that can be extracted from the cluster model should be the same as the ones that are computed from the "outside". @pfistfl would you agree here?
However, this does not happen if we combine a cluster learner with a pre-processing pipeline. As @henrifnk pointed out, the issue is that clustering measures should be computed on the data that was used to fit the cluster model, but the prediction object that is used for the measures does not have access to this data. Here is an example similar to the one above:

task = tsk("usarrests")
l1 = lrn("clust.kmeans", centers = 2)
l1$train(task)
l1$model$tot.withinss # extract wss from the model
p1 = l1$predict(task)
p1$score(msr("clust.wss"), task = task) # computing wss from "outside" gives the same value

pscale = po("scale")
l2 = pscale %>>% lrn("clust.kmeans", centers = 2)
l2$train(task)
p2 = l2$predict(task)
l2$pipeops$clust.kmeans$state$model$tot.withinss 
p2$clust.kmeans.output$score(msr("clust.wss"), task) # computing wss from "outside" is not the same
# you have to do this to fix it and obtain the same wss value as the one that can be extracted from the model
p2$clust.kmeans.output$score(msr("clust.wss"), task = pscale$train(list(task))$output) 

Obviously, the "fix" in the last line where we pass the scaled task does not work if you benchmark multiple learners.

@damirpolat
Member

I agree with @henrifnk. If I were to scale data and do clustering, I would expect measures to be applied to the preprocessed data since that's what cluster analysis was done on.

@pfistfl
Member

pfistfl commented Feb 28, 2021

I am happy that we disagree here since this gives us the possibility to flesh things out.
I might be wrong, I am the person with the least experience in clustering after all, but I am still not convinced.

To reduce confusion I am trying to re-state the discussion quickly. Given a graph such as:

<<HERE>> po("scale") %>>% ... %>>% <<THERE>> po(lrn("clust.kmeans"))

The open question is at which point we want to compute cluster measures: <<HERE>> (favoured by me) or <<THERE>> (favoured by you). <<HERE>> could be followed up by a fixed set of pre-processing steps, such as scale, that is independent of the rest of the pipeline.

@henrifnk stated

If one calculated a metric on the unprocessed data, the metric could easily become pathological:
unscaled data might cause very high scores that, in the end, depend only on one feature with a very large scale

This is exactly my problem. We would like to ensure that any data that is passed to the measure has the same scale.
I'll try to give examples:

  1. Suppose we tune a pre-processing pipeline along with a clustering algorithm that looks something like
po("scale") %>>% po(flt("anova")) %>>% po(lrn("clust.kmeans"))

and measure using the preprocessed task.
This could have a pathological optimum: drop all features but one that can be clustered easily.
This yields an objectively stupid clustering algorithm (one that only takes a single feature into account and disregards the rest of the data) that is nevertheless good with respect to the desired clustering metric.

  2. Assume we have only one pre-processing operator that divides each feature's value by a number a: po("col_divide", a).
    This will make the cluster learner look better for larger a and optimal as a -> Inf. This happens without any practical improvement to the underlying clustering model, purely due to a pathology in the measuring process (see the sketch below).
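
There is no col_divide PipeOp, but the effect can be sketched with po("colapply") (my own illustration): dividing every feature by the same constant leaves the k-means partition essentially unchanged while shrinking the WSS by the square of the divisor.

library(mlr3)
library(mlr3cluster)
library(mlr3pipelines)

task = tsk("usarrests")

shrink = po("colapply", applicator = function(x) x / 1000)   # plays the role of po("col_divide", a = 1000)
gr = shrink %>>% lrn("clust.kmeans", centers = 2)
gr$train(task)

# Measured on the transformed features, the WSS is ~1000^2 times smaller,
# although the induced clustering is effectively the same.
gr$pipeops$clust.kmeans$state$model$tot.withinss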

My general argument is the following:

By allowing transformations for the measure, we allow the pipeline to change the goal post (the values measured by our clustering metric). And if an agent (our pipeline) can move its own goal post (e.g. through tuning), it will often not become better but instead just move the goal towards something that is easier to solve (by simply ignoring conflicting information). The analogy is the cleaning robot that learned to put a bucket on its head so it does not see any dirt. Cannot see any dirt -> problem solved!

With respect to @henrifnk 's other comments:

Non-Imputed data might cause NA in the metric
-> Agree! Same holds for e.g. categorical features. But instead of using the pipeline here, we might want to have a FIXED preproc pipeline!

What about new (test) data, they would have to be scaled by the original tasks scaling parameters in this case?
Agree that we need to find a solution here, but this is orthogonal!

More than that, if one decides that the high-dimensional data used for an arbitrary task should be shrunken towards a smaller dataset, e.g. by PCA, we should leave this option to the user?
Agree, so you should be able to do PCA but we should not measure quality with respect to PCA transformed features.

since some metrics might be independent of the scale of the features...
In this case simply no scaling!

@giuseppec I get your problem, but in your case we look at the target variable, which is mostly unchanged throughout the pipeline. I think my suggestion is not optimal BUT it avoids falling into the traps mentioned above.

What we instead should have:

Each metric should know IF it is sensitive to scaling / can deal with NAs etc., and it should then treat its input accordingly (i.e. by re-scaling).
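
In terms of the .score() shown at the top of this issue, that would amount to something like the following (just a sketch; the scale_sensitive flag is hypothetical, and the right criterion is exactly what needs discussion):

.score = function(prediction, task, ...) {
  X = as.matrix(task$data(rows = prediction$row_ids))
  if (!is.double(X)) { # clusterCrit does not convert lgls/ints
    storage.mode(X) = "double"
  }
  if (isTRUE(self$scale_sensitive)) {   # hypothetical per-measure flag
    X = scale(X)                        # the measure re-scales its own input
  }
  intCriteria(X, prediction$partition, self$crit)[[1L]]
}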

@henrifnk
Contributor Author

Finally, I think I really understand your point, thank you for spelling that out :).
I'll try to wrap those two options up.

Option 1: Have a stable pipeline for cluster measures:

Independent of the pipeline of a given cluster learner, there is always the same fixed mechanism that preprocesses the task data on which a certain cluster measure is scored.
This pipeline could look somewhat like this:

po('imputemean') %>>% po('scale') %>>% po(lrn("clust.[lrn_id]"))

The pipe operators within that pipeline would have to be somewhat smart about the task and their measure, so that they can decide whether it is really necessary to apply them (see the sketch below).
E.g., scaling would not need to be performed on a task whose data is already within the same range,
and one-hot encoding only if there are non-numeric, non-binary features ...
@pfistfl please correct me if I am wrong or misunderstood something here.
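
A rough sketch of what such a check could look like (my own illustration, with an arbitrary criterion for "same range"):

# Hypothetical helper: decide whether the fixed measure pipeline needs a scaling step.
needs_scaling = function(task) {
  X = as.data.frame(task$data(cols = task$feature_names))
  rng = sapply(Filter(is.numeric, X), function(x) diff(range(x, na.rm = TRUE)))
  length(rng) > 1 && max(rng) > 10 * min(rng)   # arbitrary "very different scales" criterion
}

# e.g. only prepend po("scale") to the measure pipeline if needs_scaling(task) is TRUE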

Option 2: Mirror the preprocessed task from pipeline-learner

By default, the measure is calculated on the same task the learner was trained on.
If the data are manipulated as in the following pipeline, the measure is calculated on exactly this manipulated data, no matter how the user specified the manipulation.

po('imputemean') %>>% po('pca') %>>% po(lrn("clust.[lrn_id]"))

Additionally, an optional task argument in the measure enables the user to specify any preprocessed task for the measure to calculate the score on.
This could then, e.g., be a task that is specified like this:

po('imputemean') %>>% po('scale') %>>% po(lrn("clust.[lrn_id]"))

Addition: This might be supplemented by a warning if measures are calculated on tasks where features have very different scales, or for similar issues...

Let me briefly point out 2 scenarios where your approach would be problematic:

Scenario 1:

lrn("clust.kmeans")$train(tsk("usarrests"))

The user is training a scale-sensitive learner on a task with differently scaled features.

Option 1: Measures from the prediction would now be magically scaled and the user wouldn't notice their faulty design...
To be consistent, mlr3cluster would then also have to force the user to scale the learner in this scenario, right?
This would not make any sense to me, as we don't force scaling in the regression or classification context, where learners might be scale-sensitive too...

Option 2: Results would be biased by the features with a higher scale, but (!) the clusters made by the learner are biased by that problem as well...

Scenario 2

The user reads in very raw data that is not even in shape for the learner to use (e.g. images etc...). She/he wants to use mlr3 now.

Option 1: Does not work. The user could make predictions but couldn't calculate measures, as the pre-defined pipeline is not able to shape the raw data. Imagine seeing the error that the measure cannot handle the data. This will probably be counterintuitive and confusing...
To me, it was always one of the key features of mlr3 pipelines that it lets you do the whole workflow within mlr3, smoothly...

Option 2: No problems...

To be honest, to me the second option is still way more attractive, as it gives the user full freedom to calculate the measure on any data that might make sense in a certain situation!
It is more flexible, as the user can specify every detail of the task they want to have.
And it is more transparent, in that the user knows how they specified the pipeline that led to the measure in the output. If we dictate the pipeline, no one will ever know how it is really calculated...
I see your point that measures could be wrong if the pipeline was set up in a wrong way.
But the fixed pipeline seems to me like an arbitrary conglomerate of different preprocessing steps that might make sense.
Should we e.g. standardize or normalize, should we impute mean or median... These decisions would all be arbitrary...
In the end, this really sounds like a trade-off between the freedom and capabilities a user has with the mlr3 package and the security of a stable measure.
But you have this problem of wrong usage in any ML context when people specify things incorrectly...

@damirpolat
Member

I thought about it again recently. My opinion: measures should be calculated on the same data on which the clustering was done. I can see @pfistfl's argument about moving the goalpost, but at the same time I think users should be the ones responsible for ensuring that their pipeline makes sense for their task. Also, I would imagine this could become a problem if there were an automated way of tuning pipelines that takes preprocessing ops into account. But does mlr3 do that now? We could deal with that later when it comes up.
Any other final thoughts: @giuseppec or @mllg?
