
sending no seed to mlcontext does not actually randomize stuff #7259

Open
superichmann opened this issue Oct 6, 2024 · 0 comments
Labels
untriaged New issue has not been triaged


superichmann commented Oct 6, 2024

System Information (please complete the following information):

  • OS & Version: windows 10
  • ML.NET Version: 4.0.0-preview.24271.1
  • .NET Version: 8.0.8

Describe the bug
Initializing MLContext without a seed (new MLContext()) and training twice on the same data does not actually result in different models from Regression.Trainers.FastForest() or Regression.Trainers.LightGbm().

To Reproduce
Steps to reproduce the behavior:

  1. Initialize an MLContext without a seed.
  2. Train a model on some data with FastForest.
  3. Initialize a second MLContext, again without a seed.
  4. Train a model on the exact same data with FastForest.
  5. Compare the predictions of the two models (in my case, I compared 23,645 predictions).

Repeat the process with LightGbm.

Expected behavior
As I see it, the first FastForest model's predictions should differ from the second's, and the same goes for LightGbm.

Even the slightest change in the randomness of the bootstrapped dataset selection should lead to different results.

It seems the FastForest and LightGbm trainers under ML.NET are not so random after all.. :{
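To illustrate why an unseeded run should give different results: in a bagging ensemble like FastForest, each tree is fit on a bootstrap sample of the training rows, and two different seeds select different samples, so the resulting trees (and predictions) should differ. A minimal numpy sketch of that idea (not ML.NET code; the RNG and sampling here are stand-ins for whatever the trainer does internally):

```python
import numpy as np

def bootstrap_indices(n_rows, seed):
    """Draw one bootstrap sample (row indices drawn with replacement),
    as a bagging ensemble would do for each tree."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_rows, size=n_rows)

# Two different seeds select different subsets of training rows,
# so trees trained on them (and the ensemble's predictions) should differ.
sample_a = bootstrap_indices(1000, seed=1)
sample_b = bootstrap_indices(1000, seed=2)
print(np.array_equal(sample_a, sample_b))  # False: the samples differ

# The same seed reproduces the same sample, which is the behavior
# you would expect only when a fixed seed is passed in.
sample_c = bootstrap_indices(1000, seed=1)
print(np.array_equal(sample_a, sample_c))  # True
```

If the trainers behaved this way, two unseeded MLContext runs would be like two different seeds here and could not produce identical predictions.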

Further Research (please read this too)
I played with LightGBM in Python and was only able to introduce randomness by setting both of these params together: 'feature_fraction': 0.2 and 'seed': rand_num. Removing either one also removes the randomness from the results; see the code:

import lightgbm as lgb
import pandas as pd
import numpy as np
np.random.seed(42)
num_train_samples = 1000
num_test_samples = 10
num_features = 10
X = np.random.rand(num_train_samples + num_test_samples, num_features)
y = np.random.uniform(0, 2, num_train_samples + num_test_samples)
y = y + np.random.normal(0, 0.2, num_train_samples + num_test_samples)
X_train, X_test = X[:num_train_samples], X[num_train_samples:]
y_train, y_test = y[:num_train_samples], y[num_train_samples:]
# The two parameter sets are identical except for 'seed'; with
# 'feature_fraction' < 1.0 the seed actually affects training.
params1 = {
    'objective': 'regression',
    'verbose': -1,
    'feature_fraction': 0.2,
    'seed': 42
}
params2 = {
    'objective': 'regression',
    'verbose': -1,
    'feature_fraction': 0.2,
    'seed': 43
}
model1 = lgb.train(params1, lgb.Dataset(X_train, y_train), num_boost_round=1000)
model2 = lgb.train(params2, lgb.Dataset(X_train, y_train), num_boost_round=1000)
y_pred1 = model1.predict(X_test)
y_pred2 = model2.predict(X_test)
results = pd.DataFrame({'true_value': y_test, 'model1_pred': y_pred1, 'model2_pred': y_pred2})
print(results)

Workarounds
For LightGbm, a workaround is to add FeatureFraction to the trainer options.
For FastForest, a workaround is to pass a Seed as part of FastForestRegressionTrainer.Options.
