Instance features break SMAC #1063

Closed
mfeurer opened this issue Aug 4, 2023 · 1 comment
mfeurer commented Aug 4, 2023

While working on an example (#1061), I encountered this warning:

--- Logging error ---
Traceback (most recent call last):
  File "/home/feurerm/miniconda/3-4.12.0/envs/smac3/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/home/feurerm/miniconda/3-4.12.0/envs/smac3/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/home/feurerm/miniconda/3-4.12.0/envs/smac3/lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File "/home/feurerm/miniconda/3-4.12.0/envs/smac3/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/feurerm/sync_dir/projects/smac3/examples/4_advanced_optimizer/4_intensify_crossvalidation.py", line 92, in <module>
    smac = HyperparameterOptimizationFacade(
  File "/home/feurerm/sync_dir/projects/smac3/smac/facade/abstract_facade.py", line 148, in __init__
    runhistory_encoder = self.get_runhistory_encoder(scenario)
  File "/home/feurerm/sync_dir/projects/smac3/smac/facade/hyperparameter_optimization_facade.py", line 210, in get_runhistory_encoder
    return RunHistoryLogScaledEncoder(scenario)
  File "/home/feurerm/sync_dir/projects/smac3/smac/runhistory/encoder/abstract_encoder.py", line 74, in __init__
    logger.warning(
Message: 'We strongly encourage to use instance features when using instances.'
Arguments: ('If no instance features are passed, the runhistory encoder can not distinguish between different instances and therefore returns the same data points with different values, all of which are used to train the surrogate model.\nConsider using instance indices as features.',)
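For context, the TypeError comes from Python's logging module applying %-formatting to the message string with the extra positional argument. A minimal standalone sketch of the same failure mode, independent of SMAC:

```python
# The logging module formats messages lazily: logger.warning(msg, arg)
# eventually evaluates msg % (arg,). If msg contains no %-placeholder,
# that step raises the TypeError seen in the "Logging error" above.
msg = "We strongly encourage to use instance features when using instances."
extra = "Consider using instance indices as features."

try:
    msg % (extra,)  # the same formatting step logging performs internally
except TypeError as exc:
    print(exc)  # not all arguments converted during string formatting
```

The fix on the SMAC side would be either a `%s` placeholder in the message or concatenating the two strings before calling `logger.warning`.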

When I then added instance features so that the warning would disappear, an exception appeared instead:

Traceback (most recent call last):
  File "/home/feurerm/sync_dir/projects/smac3/examples/4_advanced_optimizer/4_intensify_crossvalidation.py", line 106, in <module>
    incumbent = smac.optimize()
  File "/home/feurerm/sync_dir/projects/smac3/smac/facade/abstract_facade.py", line 319, in optimize
    incumbents = self._optimizer.optimize(data_to_scatter=data_to_scatter)
  File "/home/feurerm/sync_dir/projects/smac3/smac/main/smbo.py", line 300, in optimize
    trial_info = self.ask()
  File "/home/feurerm/sync_dir/projects/smac3/smac/main/smbo.py", line 153, in ask
    trial_info = next(self._trial_generator)
  File "/home/feurerm/sync_dir/projects/smac3/smac/intensifier/intensifier.py", line 226, in __iter__
    config = next(self.config_generator)
  File "/home/feurerm/sync_dir/projects/smac3/smac/main/config_selector.py", line 199, in __iter__
    x_best_array, best_observation = self._get_x_best(X_configurations)
  File "/home/feurerm/sync_dir/projects/smac3/smac/main/config_selector.py", line 324, in _get_x_best
    costs = list(
  File "/home/feurerm/sync_dir/projects/smac3/smac/main/config_selector.py", line 327, in <lambda>
    model.predict_marginalized(x.reshape((1, -1)))[0][0][0],  # type: ignore
  File "/home/feurerm/sync_dir/projects/smac3/smac/model/random_forest/random_forest.py", line 278, in predict_marginalized
    dat_ = self._rf.predict_marginalized_over_instances_batch(X, X_feat, self._log_y)
AttributeError: 'binary_rss_forest' object has no attribute 'predict_marginalized_over_instances_batch'
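An AttributeError like this on a compiled-extension object usually means the installed binary package predates the method the Python wrapper expects. A hedged sketch of a defensive check; the stub class and error message are hypothetical, only the method name comes from the traceback:

```python
# Sketch: when a compiled-extension object lacks an expected method, check
# for it before calling so the failure points at a version mismatch rather
# than a bare AttributeError.
class BinaryRssForestStub:
    """Hypothetical stand-in for an outdated binary_rss_forest object."""


def predict_batch(rf, X, X_feat, log_y):
    # Look up the method named in the traceback; None means the installed
    # backend is too old to provide it.
    method = getattr(rf, "predict_marginalized_over_instances_batch", None)
    if method is None:
        raise RuntimeError(
            "Random forest backend is missing "
            "predict_marginalized_over_instances_batch; "
            "try upgrading the package."
        )
    return method(X, X_feat, log_y)
```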

Code to reproduce:

"""
Speeding up Cross-Validation with Intensification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An example of optimizing a simple support vector machine on the digits dataset. In contrast to the
[simple example](examples/1_basics/2_svm_cv.py), in which all cross-validation folds are executed
at once, we use the intensification mechanism described in the original 
[SMAC paper](https://link.springer.com/chapter/10.1007/978-3-642-25566-3_40) as also demonstrated
by [Auto-WEKA](https://dl.acm.org/doi/10.1145/2487575.2487629). 
"""
__copyright__ = "Copyright 2023, AutoML.org Freiburg-Hannover"
__license__ = "3-clause BSD"

N_FOLDS = 10  # Global variable that determines the number of folds

from ConfigSpace import Configuration, ConfigurationSpace, Float
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold

from smac import HyperparameterOptimizationFacade, Scenario
from smac.intensifier import Intensifier

# We load the digits dataset, a small-scale 10-class digit recognition dataset
X, y = datasets.load_digits(return_X_y=True)


class SVM:
    @property
    def configspace(self) -> ConfigurationSpace:
        # Build Configuration Space which defines all parameters and their ranges
        cs = ConfigurationSpace(seed=0)

        # First we create our hyperparameters
        C = Float("C", (2 ** - 5, 2 ** 15), default=1.0, log=True)
        gamma = Float("gamma", (2 ** -15, 2 ** 3), default=1.0, log=True)

        # Add hyperparameters to our configspace
        cs.add_hyperparameters([C, gamma])

        return cs

    def train(self, config: Configuration, instance: str, seed: int = 0) -> float:
        """Create an SVM from a configuration and evaluate it on the given fold of the digits dataset.

        Parameters
        ----------
        config: Configuration
            The configuration to train the SVM.
        instance: str
            The name of the instance this configuration should be evaluated on. This is always of type
            string by definition. In our case we cast to int, but this could also be the filename of a
            problem instance to be loaded.
        seed: int
            The seed used for this call.
        """
        instance = int(instance)
        config_dict = config.get_dictionary()
        classifier = svm.SVC(**config_dict, random_state=seed)
        splitter = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=seed)
        for k, (train_idx, test_idx) in enumerate(splitter.split(X=X, y=y)):
            if k != instance:
                continue
            else:
                train_X = X[train_idx]
                train_y = y[train_idx]
                test_X = X[test_idx]
                test_y = y[test_idx]
                classifier.fit(train_X, train_y)
                cost = 1 - classifier.score(test_X, test_y)

        return cost


if __name__ == "__main__":
    classifier = SVM()

    # Next, we create an object, holding general information about the run
    scenario = Scenario(
        classifier.configspace,
        n_trials=50,  # We want to run max 50 trials (combination of config and seed)
        instances=[f"{i}" for i in range(N_FOLDS)],  # Specify all instances by their name (as a string)
        instance_features={f"{i}": [i] for i in range(N_FOLDS)}, # breaks SMAC
        deterministic=True  # To simplify the problem we make SMAC believe that we have a deterministic
                            # optimization problem.
        
    )

    # We want to run the facade's default initial design, but we want to change the number
    # of initial configs to 5.
    initial_design = HyperparameterOptimizationFacade.get_initial_design(scenario, n_configs=5)

    # Now we use SMAC to find the best hyperparameters
    smac = HyperparameterOptimizationFacade(
        scenario,
        classifier.train,
        initial_design=initial_design,
        overwrite=True,  # If the run exists, we overwrite it; alternatively, we can continue from last state
        # The next line defines the intensifier, i.e., the module that governs the selection of
        # instance-seed pairs. Since we set deterministic to True above, it only governs the instance in
        # this example. Technically, a user does not need to create the intensifier themselves, but we do
        # so here because we change the argument max_config_calls (the number of instance-seed pairs to
        # try per configuration) to the number of cross-validation folds; the default would be 3.
        intensifier=Intensifier(scenario=scenario, max_config_calls=N_FOLDS, seed=0),
    )

    incumbent = smac.optimize()

    # Get cost of default configuration
    default_cost = smac.validate(classifier.configspace.get_default_configuration())
    print(f"Default cost: {default_cost}")

    # Let's calculate the cost of the incumbent
    incumbent_cost = smac.validate(incumbent)
    print(f"Incumbent cost: {incumbent_cost}")

    # Let's see how many configurations we have evaluated. If this number is higher than 5, we have looked
    # at more configurations than would have been possible with regular cross-validation, where the number
    # of configurations would be determined by the number of trials divided by the number of folds (50 / 10).
    runhistory = smac.runhistory
    print(f"Number of evaluated configurations: {len(runhistory.config_ids)}")
mfeurer commented Aug 8, 2023

This issue was due to an old version of the random forest package.

@mfeurer mfeurer closed this as completed Aug 8, 2023