Prepare OpenML import #99

LizzAlice · 2023-08-31T13:37:51Z

Get an overview of the data available via the API. The findings are documented here:

dataset data: openml.datasets.list_datasets(output_format="dataframe")
- did: unique dataset ID
- name: non unique
- version: int, the combination of name and version seems to be unique in every case but one
- uploader: int (maybe this is a user id??)
- status: "active" for all of them
- format: one of ARFF, Sparse_ARFF, arff or sparse_arff
- MajorityClassSize: number or NaN
- MaxNominalAttDistinctValues: number or NaN
- MinorityClassSize: number or NaN
- NumberOfClasses: number or NaN
- NumberOfFeatures: number or NaN
- NumberOfInstances: number or NaN
- NumberOfInstancesWithMissingValues: number or NaN
- NumberOfMissingValues: number or NaN
- NumberOfNumericFeatures: number or NaN
- NumberOfSymbolicFeatures: number or NaN
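
The observation that name plus version is unique "in every case but one" can be checked directly. A minimal sketch, assuming the listing has been turned into a list of dicts (for example with `.to_dict("records")` on the dataframe); the helper name `duplicate_name_versions` and the sample rows are made up for illustration:

```python
from collections import defaultdict


def duplicate_name_versions(records):
    """Return {(name, version): [did, ...]} for pairs shared by more than one dataset."""
    dids_by_key = defaultdict(list)
    for row in records:
        dids_by_key[(row["name"], row["version"])].append(row["did"])
    return {key: dids for key, dids in dids_by_key.items() if len(dids) > 1}


# Illustrative rows only, not real OpenML data.
sample = [
    {"did": 1, "name": "a", "version": 1},
    {"did": 2, "name": "a", "version": 2},
    {"did": 3, "name": "b", "version": 1},
    {"did": 4, "name": "b", "version": 1},
]
print(duplicate_name_versions(sample))
```

An empty result would mean the pair can serve as a natural key; any entries are the cases that need special handling during import.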
evaluations (an evaluation function has to be given)
- run_id: run id
- task_id: task id
- setup_id: setup id
- flow_id: flow id
- flow_name: flow name
- data_id: dataset id?
- data_name: dataset name?
- function: evaluation function
- upload_time: time it was uploaded
- uploader: uploader number
- uploader_name: name string
- value: int
- values: always None?
- array_data: always None?
flows
- id: unique id
- full_name: name with number in parentheses
- name: name of python class or function
- version: number
- external_version: None or package versions with package name in the form 'openml==0.14.1,sklearn==1.3.0'
- uploader: number
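
For the import, the external_version string probably needs to be split into its packages. A minimal sketch, assuming the value is always either missing or a comma-separated list of `package==version` entries as in the example above; the function name `parse_external_version` is hypothetical:

```python
def parse_external_version(external_version):
    """Split a string like 'openml==0.14.1,sklearn==1.3.0' into {package: version}."""
    # None, NaN and empty strings all mean "no version information".
    if not isinstance(external_version, str) or not external_version.strip():
        return {}
    packages = {}
    for part in external_version.split(","):
        name, sep, version = part.strip().partition("==")
        # Entries without '==' are kept with version None rather than guessed.
        packages[name] = version if sep else None
    return packages


print(parse_external_version("openml==0.14.1,sklearn==1.3.0"))
```

Entries that do not follow the assumed pattern are kept with a version of None, so they show up in the result instead of being dropped silently.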
runs
- run_id: unique id
- task_id: task id
- setup_id: setup id
- flow_id: flow id
- uploader: number
- task_type: instance of task type in the following form: TaskType.LEARNING_CURVE
- upload_time: time in the format of 2014-04-06 23:30:40
- error_message: string
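
The upload_time format shown above can be parsed with the standard library. A minimal sketch; the names `UPLOAD_TIME_FORMAT` and `parse_upload_time` are made up, and since the listing does not say which time zone is used, the result is a naive datetime:

```python
from datetime import datetime

# Format of values such as '2014-04-06 23:30:40'.
UPLOAD_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_upload_time(value):
    """Parse an upload_time string into a naive datetime (time zone unknown)."""
    return datetime.strptime(value, UPLOAD_TIME_FORMAT)


print(parse_upload_time("2014-04-06 23:30:40").isoformat())
```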
setups:
- setup_id: unique id
- flow_id: flow id
- parameters: dict of things that are given as numbers; the dicts contain information such as flow information, data_type, default_value etc
study openml.study.list_studies(output_format="dataframe") (it is a bit unclear what this is; only two are returned, although the ids suggest there are more)
- id: unique id, only 123 and 226
- main_entity_type: "run"
- status: "active"
- creation_date: time in the format of 2019-02-21 19:55:30
- creator: number
- alias: NaN or "amlb"
tasks openml.tasks.list_tasks(output_format="dataframe")
- tid: unique task id
- ttid: String with task type in the form of TaskType.TASK_TYPE_NAME
- did: dataset id
- name: should be the task name, but actually looks like the dataset name
- task_type: task type as in ttid, but in words
- status: "active" for all of them
- estimation_procedure: string
- evaluation_measures: string or NaN
- source_data: seems to be the same as did
- target_feature: string
- MajorityClassSize: number or NaN (is this the value from the dataset?)
- MaxNominalAttDistinctValues: number or NaN (is this the value from the dataset?)
- MinorityClassSize: number or NaN (is this the value from the dataset?)
- NumberOfClasses: number or NaN (is this the value from the dataset?)
- NumberOfFeatures: number or NaN (is this the value from the dataset?)
- NumberOfInstances: number or NaN (is this the value from the dataset?)
- NumberOfInstancesWithMissingValues: number or NaN (is this the value from the dataset?)
- NumberOfMissingValues: number or NaN (is this the value from the dataset?)
- NumberOfNumericFeatures: number or NaN (is this the value from the dataset?)
- NumberOfSymbolicFeatures: number or NaN (is this the value from the dataset?)
- number_samples: number or NaN
- cost_matrix: NaN or matrix in list of lists format or string or number
- source_data_labeled: NaN or '1227' or '1451'
- target_feature_event: NaN, or 'event' or 'OS_event'
- target_feature_left: NaN
- target_feature_right: NaN or "time" or "OS_years"
- quality_measure: NaN or string
- target_value: NaN or string
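
Whether the quality columns on a task really repeat the values of its dataset can be tested by matching tasks to datasets on did. A minimal sketch, assuming both listings are available as lists of dicts; the names `QUALITY_COLUMNS`, `same_value` and `mismatching_tasks` and the sample rows are made up for illustration:

```python
import math

QUALITY_COLUMNS = [
    "MajorityClassSize",
    "MaxNominalAttDistinctValues",
    "MinorityClassSize",
    "NumberOfClasses",
    "NumberOfFeatures",
    "NumberOfInstances",
    "NumberOfInstancesWithMissingValues",
    "NumberOfMissingValues",
    "NumberOfNumericFeatures",
    "NumberOfSymbolicFeatures",
]


def is_missing(value):
    """True for None and NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def same_value(a, b):
    """Treat two missing values as equal; otherwise compare normally."""
    if is_missing(a) or is_missing(b):
        return is_missing(a) and is_missing(b)
    return a == b


def mismatching_tasks(tasks, datasets):
    """Return {tid: [column, ...]} for tasks whose values differ from their dataset's."""
    datasets_by_did = {row["did"]: row for row in datasets}
    mismatches = {}
    for task in tasks:
        dataset = datasets_by_did.get(task["did"])
        if dataset is None:
            continue  # dataset not in the listing, nothing to compare against
        differing = [
            column
            for column in QUALITY_COLUMNS
            if not same_value(task.get(column), dataset.get(column))
        ]
        if differing:
            mismatches[task["tid"]] = differing
    return mismatches


# Illustrative rows only, not real OpenML data.
datasets = [{"did": 1, "NumberOfClasses": 2.0, "NumberOfFeatures": float("nan")}]
tasks = [
    {"tid": 10, "did": 1, "NumberOfClasses": 2.0, "NumberOfFeatures": float("nan")},
    {"tid": 11, "did": 1, "NumberOfClasses": 3.0, "NumberOfFeatures": float("nan")},
]
print(mismatching_tasks(tasks, datasets))
```

If the result is empty for the real listings, the task-level columns are redundant and only the dataset values need to be imported.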

Dependencies: Task on Dataset; Run on Task, Setup and Flow; Setup on Flow; Evaluation on Run, Task, Setup, Flow and Dataset
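
These dependencies imply an order in which the entity types can be imported. A minimal sketch using `graphlib` from the standard library (Python 3.9+); the dict name `DEPENDS_ON` is made up, and Study is left out because no dependency is recorded for it:

```python
from graphlib import TopologicalSorter

# Each entity type maps to the entity types it depends on.
DEPENDS_ON = {
    "Dataset": set(),
    "Flow": set(),
    "Task": {"Dataset"},
    "Setup": {"Flow"},
    "Run": {"Task", "Setup", "Flow"},
    "Evaluation": {"Run", "Task", "Setup", "Flow", "Dataset"},
}

# static_order() yields every entity type after all of its dependencies.
import_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(import_order)
```

Several orders are valid (Dataset and Flow are independent of each other, for example), so the printed list is one possible import order rather than the only one.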


LizzAlice · 2023-11-03T13:35:23Z

Questions:

Dataset:
- why is dataset name not unique? --> just how it works
- uploader: is this ID unique? --> yes
- what does status=active mean and why are they all active? --> always active, can be ignored
Evaluation:
- is data_id the dataset_id? --> yes
- where do I find a list of evaluation functions? --> list_evaluation_measures
- what is values and when is it not None?
- what is array_data and when is it not None?
Study:
- what are studies, why are there only two although the ids suggest there are more, and why are they not linked to the other entities? --> seems to be a bug
Task:
- is name here the dataset name rather than the task name? --> tasks don't have a name
- what does source_data_labeled mean?
- target_feature_event: what is the difference between event and OS_event?
- are only the task types classification and regression important?

LizzAlice · 2023-11-23T10:50:55Z

excluded fields for Dataset:
- MaxNominalAttDistinctValues: this one is the number of distinct attributes overall, i.e. over several columns; doesn't make sense to show

LizzAlice · 2024-01-08T10:14:29Z

potential changes to prototype:

what about a field for the quality? --> verified/not
extra field with just text from "cites work"
rename "cites work" to something that makes it clear that it is an item and has a DOI?
is everything I get back from the API active? --> yes!

LizzAlice added the enhancement New feature or request label Aug 31, 2023

LizzAlice self-assigned this Aug 31, 2023
