Replies: 12 comments 23 replies
-
@lacava @MilesCranmer @gbomarito @hengzhe-zhang @gAldeia @foolnotion
-
Are we considering keeping the Strogatz datasets in this track? If we use Feynman, I would suggest replacing our datasets with those from Matsubara et al. (https://arxiv.org/pdf/2206.10540), taking the same number of easy/medium/hard problems.
-
When selecting a subset of our current datasets, I like the idea that the subset should change the overall results of the current SRBench version as little as possible. I can see reasons to keep or drop many datasets. However, IMO the selection should not affect the overall results, so that we avoid inserting bias based on our own interpretation of them. Also, I would like to use the ol' reliable number of 30 repetitions for each dataset, since this gives us more statistical robustness and lets us expect less variation in the results. If you agree, the final number of datasets should be defined by taking this computational burden into account.
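A minimal sketch of how "as little effect as possible" could be checked, assuming the results are available as a matrix of scores with one row per dataset and one column per algorithm (higher is better). The function names and the Kendall-style agreement measure are illustrative choices, not something already used in SRBench.

```python
import numpy as np


def mean_ranks(scores):
    """Mean rank of each algorithm across datasets (rank 1 = best score).

    scores: array of shape (n_datasets, n_algorithms), higher is better.
    Ties are broken by column order, which is a simplification.
    """
    order = np.argsort(-scores, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, scores.shape[1] + 1)
    return ranks.mean(axis=0)


def ranking_agreement(scores, subset):
    """Kendall-style agreement in [-1, 1] between full and subset rankings."""
    full = mean_ranks(scores)
    sub = mean_ranks(scores[subset])
    n = len(full)
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            # +1 if the pair is ordered the same way in both rankings
            total += np.sign(full[i] - full[j]) * np.sign(sub[i] - sub[j])
    return total / pairs
```

A candidate subset with agreement close to 1 preserves the order of the algorithms seen on the full benchmark; values near 0 or below mean the subset changes the picture.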
-
Hi! Here is an update from my side. In our previous meeting, @lacava mentioned that our goal might be to find datasets that can differentiate algorithms. So I now select representative datasets using the pairwise differences between algorithms on each dataset. This approach is different from clustering. Intuitively, many Friedman datasets are selected because they differentiate the algorithms effectively. Other datasets may not distinguish the algorithms at all, for example if every algorithm achieves an R2 of -1. With clustering, I'm not sure whether a dataset where all algorithms perform equally poorly would be selected. If it were, we might pick problematic datasets from which no learning algorithm can extract useful information.
The following two tables use pairwise differences to select the top datasets based on R2 and on edit distance. They show stark differences. I think R2 might be more appropriate in the current situation.
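One way the pairwise-difference idea could be sketched, assuming a score matrix with one row per dataset and one column per algorithm (for example median test R2). The scoring rule used here, the mean absolute difference over all algorithm pairs, and the function names are assumptions for illustration; the selection behind the tables may differ.

```python
import numpy as np


def discrimination_score(scores):
    """Mean absolute pairwise difference between algorithms, per dataset.

    scores: array of shape (n_datasets, n_algorithms).
    """
    scores = np.asarray(scores, dtype=float)
    # diffs[d, i, j] = |score of algorithm i - score of algorithm j| on dataset d
    diffs = np.abs(scores[:, :, None] - scores[:, None, :])
    n_alg = scores.shape[1]
    n_pairs = n_alg * (n_alg - 1)  # ordered pairs; the diagonal contributes zero
    return diffs.sum(axis=(1, 2)) / n_pairs


def top_discriminating(scores, names, k):
    """Names of the k datasets with the largest discrimination score."""
    order = np.argsort(-discrimination_score(scores), kind="stable")
    return [names[i] for i in order[:k]]
```

A dataset where every algorithm scores -1 gets a score of exactly 0 under this rule, so it is never preferred over a dataset that separates the algorithms.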
-
Hey all, I just saw this at GECCO: https://arxiv.org/pdf/2301.01488. It's an informed down-sampling method for test cases in lexicase selection. As I was watching, I was thinking about how similar this is to down-sampling the datasets in SRBench. Could/should we do something similar, where we choose a subset of datasets by maximizing the distance between the "performance vectors" from our current algorithms?
-
@gbomarito This is a great idea. I have implemented a simple function to select datasets based on the error matrix:

    import random

    import numpy as np


    def euclidean_distance(vec1, vec2):
        return np.sqrt(sum((x1 - x2) ** 2 for x1, x2 in zip(vec1, vec2)))


    def farthest_first_traversal(T, r):
        # T must be a set of hashable cases (e.g. tuples of errors); r is the fraction to keep
        ds = set()  # the down-sample
        size = int(r * len(T))  # desired size of down-sample
        # Add a random case to the down-sample
        ds.add(random.choice(list(T)))
        T = T - ds  # remove the selected case from the training set
        while len(ds) < size:
            max_min_dist = -1
            case_to_add = None
            # Pick the case whose nearest already-selected case is farthest away
            for case in T:
                min_dist = min(euclidean_distance(case, c) for c in ds)
                if min_dist > max_min_dist:
                    max_min_dist = min_dist
                    case_to_add = case
            ds.add(case_to_add)
            T.remove(case_to_add)
        return ds
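For comparison, here is a hedged variant of the same farthest-first idea that works directly on a NumPy error matrix and returns row indices, so the selected rows can be mapped back to dataset names. The function name, the `seed` parameter, and the one-row-per-dataset layout are assumptions for illustration.

```python
import numpy as np


def farthest_first_indices(errors, n_select, seed=0):
    """Pick n_select row indices of `errors` by farthest-first traversal."""
    errors = np.asarray(errors, dtype=float)
    n_select = min(n_select, len(errors))
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(errors)))]  # random starting row
    # Distance from every row to its nearest already-chosen row.
    min_dist = np.linalg.norm(errors - errors[chosen[0]], axis=1)
    min_dist[chosen[0]] = -1.0  # never pick a chosen row again
    while len(chosen) < n_select:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(errors - errors[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
        min_dist[nxt] = -1.0
    return chosen
```

Keeping a running vector of nearest-selected distances avoids recomputing all pairwise distances in every iteration, which matters little at this scale but keeps the code short.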
-
During GECCO, Alberto Tonda mentioned this recently curated benchmark for regression. It contains only 23 datasets, and I think most of them are already in SRBench. Maybe we should consider integrating these for SRBench 2025?
-
Some datasets we should ignore when doing this analysis:

    print(df_results.shape)

    # Removing mislabeled datasets (these are classification problems, but PMLB v1.0 listed them as regression)
    df_results = df_results[~df_results["dataset"].isin(["banana", "titanic"])]

    # Ignoring new PMLB datasets that have not been benchmarked with other methods yet
    # df_results[['algorithm', 'dataset']].value_counts().unstack().sum(axis=0).sort_values()
    df_results = df_results[~df_results["dataset"].isin([
        "nikuradse_2",
        "nikuradse_1",
    ])]

    # Removing duplicated datasets
    # 562 and 227, 573 and 197, 1203 and 229, 207 and 195
    df_results = df_results[~df_results["dataset"].isin([
        "562_cpu_small",
        "573_cpu_act",
        "229_pwLinear",
        "207_autoPrice",
    ])]

    # with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    #     display(df_results.dataset.value_counts())

    print(df_results.shape)
-
Alright, so as we discussed yesterday, we will pick
I suggest we discard the Z and {0,1} and then we pick 10 R, 4 R+, 6 Z+ (more or less proportional). We should be careful to pick Z+ with many distinct values, as we are not compatible with multiclass yet.
From the 10 R we can try to select 2 of each:
From the above selection we try to pick 1 or 2 Friedman, maybe those with high variance in the results.
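A small sketch of how the target domains mentioned above could be labeled automatically. It assumes that "R+" and "Z+" include zero and that a dataset is identified only by its target vector `y`; both the function names and these conventions are assumptions.

```python
import numpy as np


def target_domain(y):
    """Label a target vector as '{0,1}', 'Z+', 'Z', 'R+' or 'R'."""
    y = np.asarray(y, dtype=float)
    is_integer = bool(np.all(y == np.round(y)))
    non_negative = bool(np.all(y >= 0))  # assumption: zero counts as positive
    if is_integer and set(np.unique(y)) <= {0.0, 1.0}:
        return "{0,1}"
    if is_integer:
        return "Z+" if non_negative else "Z"
    return "R+" if non_negative else "R"


def n_distinct(y):
    """Number of distinct target values, to spot Z+ targets that look multiclass."""
    return len(np.unique(y))
```

Sorting the Z+ candidates by `n_distinct` would make it easy to skip those with only a handful of values.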
-
EDIT: trying to do two clusters for the reduced SRBench, one with only Friedman datasets and the other with non-Friedman datasets.
I have a suggestion for the next iteration. There are four different tracks:
-
I've explored Kaggle competitions a bit more. Kaggle allows custom evaluation scripts, but the time budget for evaluation is limited to 30 minutes. Some existing competitions, especially those hosted by large companies, seem to have an 8-hour evaluation period, but it's unclear whether this is available to everyone. Nevertheless, we might consider using Kaggle for SRBench 2025. The advantage of Kaggle is that participants can see their results in real time. A potential drawback is that it might be hard to control the number of evaluations unless we run post-competition checks on their code.
-
We've been discussing the next iteration of SRBench internally. One of the key points is selecting a subset of the current datasets and introducing new ones. What we discussed so far:
Regarding the selection from the current set, we've come up with three different criteria:
Maybe we should start with the whole set and simply filter it by removing all datasets that don't change the ranking and all that are trivial (by the edit distance measure), and see what remains.
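A rough sketch of that filter, under two loudly stated assumptions: a dataset "doesn't change rank" if leaving it out gives the same ordering of algorithms by mean score, and triviality is supplied as a precomputed boolean flag per dataset (however the edit distance measure ends up defining it). The function names are hypothetical.

```python
import numpy as np


def ranking(scores):
    """Order of algorithms by mean score across datasets (best first)."""
    return tuple(int(i) for i in np.argsort(-scores.mean(axis=0), kind="stable"))


def filter_datasets(scores, trivial):
    """Indices of non-trivial datasets whose removal changes the ranking.

    scores: array of shape (n_datasets, n_algorithms), higher is better.
    trivial: boolean sequence, True where a dataset is considered trivial.
    """
    scores = np.asarray(scores, dtype=float)
    full = ranking(scores)
    keep = []
    for i in range(scores.shape[0]):
        if trivial[i]:
            continue
        without_i = np.delete(scores, i, axis=0)  # leave dataset i out
        if ranking(without_i) != full:
            keep.append(i)
    return keep
```

With many datasets, a single removal rarely flips the overall ranking, so this strict reading may keep very few datasets; comparing per-dataset ranks against the overall ranking would be a looser alternative.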