
sending no seed to mlcontext does not actually randomize stuff #7259

Open
superichmann opened this issue Oct 6, 2024 · 0 comments
Labels
untriaged New issue has not been triaged


superichmann commented Oct 6, 2024

System Information (please complete the following information):

  • OS & Version: windows 10
  • ML.NET Version: 4.0.0-preview.24271.1
  • .NET Version: 8.0.8

Describe the bug
Initializing MLContext without a seed (new MLContext()) and training twice on the same data does not actually result in different models from Regression.Trainers.FastForest() or Regression.Trainers.LightGbm().

To Reproduce
Steps to reproduce the behavior:

  1. Initialize an MLContext without a seed.
  2. Train a model on some data with FastForest.
  3. Initialize a second MLContext, again without a seed.
  4. Train a model on the exact same data with FastForest.
  5. Compare the predictions of the two models (in my case, I compared 23,645 predictions).

Repeat the process with LightGbm.

Expected behavior
As I see it, the first FastForest model's predictions should differ from the second's, and the same goes for LightGbm.

Even the slightest change in the randomness of the bootstrapped dataset selection should lead to different results.

It seems the FastForest and LightGbm trainers under ML.NET are not so random after all.. :{
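To illustrate why an unseeded run should give different results: in a bagging ensemble like FastForest, each tree is fit on a bootstrap sample of the training rows, and two different seeds select different samples, so the resulting trees (and predictions) should differ. A minimal numpy sketch of that idea (not ML.NET code; the RNG and sampling here are stand-ins for whatever the trainer does internally):

```python
import numpy as np

def bootstrap_indices(n_rows, seed):
    """Draw one bootstrap sample (row indices drawn with replacement),
    as a bagging ensemble would do for each tree."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_rows, size=n_rows)

# Two different seeds select different subsets of training rows,
# so trees trained on them (and the ensemble's predictions) should differ.
sample_a = bootstrap_indices(1000, seed=1)
sample_b = bootstrap_indices(1000, seed=2)
print(np.array_equal(sample_a, sample_b))  # False: the samples differ

# The same seed reproduces the same sample, which is the behavior
# you would expect only when a fixed seed is passed in.
sample_c = bootstrap_indices(1000, seed=1)
print(np.array_equal(sample_a, sample_c))  # True
```

If the trainers behaved this way, two unseeded MLContext runs would be like two different seeds here and could not produce identical predictions.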

Further Research (please read this too)
I played with LightGBM in Python and was only able to introduce randomness by setting both of these params together: 'feature_fraction': 0.2 and 'seed': rand_num. Removing either one also removes the randomness from the results; see the code:

import lightgbm as lgb
import pandas as pd
import numpy as np
np.random.seed(42)
num_train_samples = 1000
num_test_samples = 10
num_features = 10
X = np.random.rand(num_train_samples + num_test_samples, num_features)
y = np.random.uniform(0, 2, num_train_samples + num_test_samples)
y = y + np.random.normal(0, 0.2, num_train_samples + num_test_samples)
X_train, X_test = X[:num_train_samples], X[num_train_samples:]
y_train, y_test = y[:num_train_samples], y[num_train_samples:]
# The two parameter sets are identical except for 'seed'; with
# 'feature_fraction' < 1.0 the seed actually affects training.
params1 = {
    'objective': 'regression',
    'verbose': -1,
    'feature_fraction': 0.2,
    'seed': 42
}
params2 = {
    'objective': 'regression',
    'verbose': -1,
    'feature_fraction': 0.2,
    'seed': 43
}
model1 = lgb.train(params1, lgb.Dataset(X_train, y_train), num_boost_round=1000)
model2 = lgb.train(params2, lgb.Dataset(X_train, y_train), num_boost_round=1000)
y_pred1 = model1.predict(X_test)
y_pred2 = model2.predict(X_test)
results = pd.DataFrame({'true_value': y_test, 'model1_pred': y_pred1, 'model2_pred': y_pred2})
print(results)

Workarounds
For LightGbm, a workaround is to add FeatureFraction to the trainer options.
For FastForest, a workaround is to pass a Seed as part of FastForestRegressionTrainer.Options.
