We are introducing this public development roadmap to help Hub users and contributors understand the project's direction. For a longer discussion, please refer to this document. It is meant for anyone interested in Hub, including developers and users.
Features fall into one of two categories: "Data In" and "Data Out". Ideally, "Data In" features should improve usability and accessibility while "Data Out" features should improve benchmark performance.
"Data In" features help users push data to Hub. New features should improve usability.
Currently, the process consists of:
- Fetching an existing dataset from a remote URL, verifying its integrity (SHA-256), and unzipping it
- Identifying organizational structure of dataset (this is not easily automated since there isn't a standard way to organize large datasets)
- Defining a Hub schema for the dataset (which can be challenging if a suitable abstract data structure, e.g., a DataFrame, does not already exist)
- Pushing dataset to a Hub repo (or another location, whether local or another remote store)
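The first step in the list above (fetch, verify, unzip) can be sketched with the Python standard library alone. This is a minimal illustration, not Hub's implementation: the function names `sha256_of` and `fetch_verify_extract`, the `dataset.zip` and `extracted` paths, and the assumption that the dataset ships as a zip archive are all hypothetical.

```python
import hashlib
import urllib.request
import zipfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_verify_extract(url, expected_sha256, workdir):
    """Download a zip archive, check its digest, and unpack it.

    Hypothetical helper: `url`, `expected_sha256`, and `workdir` are
    supplied by the caller; nothing here is specific to Hub.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    archive = workdir / "dataset.zip"
    urllib.request.urlretrieve(url, str(archive))

    # Refuse to unpack anything whose digest does not match.
    actual = sha256_of(archive)
    if actual != expected_sha256:
        raise ValueError(
            f"SHA-256 mismatch: expected {expected_sha256}, got {actual}"
        )

    target = workdir / "extracted"
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target
```

Checking the digest before extraction means a corrupted or tampered download is never unpacked. The remaining steps (identifying the dataset's layout, defining a schema, pushing to a repo) depend on the dataset and on Hub itself, so they are left out of this sketch.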
We would like to abstract away as many of these steps as possible. Examples include schema generation and higher-level dataset objects.
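To make "schema generation" concrete, here is one way such a feature might work in principle: inspect a few samples and derive a per-field type and shape. This is only a sketch under stated assumptions. It does not use Hub's schema classes; `infer_schema` and its helpers are hypothetical names, and samples are assumed to be dictionaries of scalars or rectangular, non-empty nested lists.

```python
def infer_shape(value):
    """Return the shape of a nested list, or () for a scalar.

    Assumes rectangular nesting: only the first element of each
    level is inspected.
    """
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)


def leaf_type(value):
    """Return the type name of the first leaf element."""
    while isinstance(value, (list, tuple)) and value:
        value = value[0]
    return type(value).__name__


def infer_schema(samples):
    """Map each field to a (type name, shape) pair.

    A dimension whose size varies across samples is reported as None.
    """
    schema = {}
    for sample in samples:
        for field, value in sample.items():
            dtype, shape = leaf_type(value), infer_shape(value)
            if field not in schema:
                schema[field] = (dtype, shape)
                continue
            old_dtype, old_shape = schema[field]
            if old_dtype != dtype or len(old_shape) != len(shape):
                raise ValueError(f"inconsistent field {field!r}")
            merged = tuple(
                a if a == b else None for a, b in zip(old_shape, shape)
            )
            schema[field] = (dtype, merged)
    return schema


samples = [
    {"image": [[0, 1, 2], [3, 4, 5]], "label": 1},
    {"image": [[0, 1, 2]], "label": 0},
]
print(infer_schema(samples))
# {'image': ('int', (None, 3)), 'label': ('int', ())}
```

A real implementation would work on array types and would have to map the inferred result onto Hub's own schema definitions, which this sketch leaves open.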
"Data Out" features help users stream data from Hub. New features should meaningfully improve performance (benchmarking scripts can be found here.
Currently, the process consists of:
- Locating the relevant dataset on Hub
- Applying transformations (perhaps PyTorch transforms)
- Fetching relevant slices from the dataset (e.g., for a downstream task such as model training)
Examples of such features include tokenization and subsampling.
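The "Data Out" flow (apply a transformation, then fetch slices) can be illustrated with a small generator. This is a hedged sketch that works on any indexable Python sequence rather than on a Hub dataset. `stream_slices` and its `every` parameter are hypothetical names, whitespace splitting stands in for tokenization, and strided selection stands in for subsampling.

```python
from itertools import islice


def stream_slices(dataset, batch_size, transform=None, every=1):
    """Lazily yield transformed batches from an indexable dataset.

    `every` keeps one sample out of each `every` (strided subsampling).
    Samples are fetched only when a batch is requested.
    """
    indices = iter(range(0, len(dataset), every))
    while True:
        batch_idx = list(islice(indices, batch_size))
        if not batch_idx:
            return
        batch = [dataset[i] for i in batch_idx]
        if transform is not None:
            batch = [transform(x) for x in batch]
        yield batch


texts = ["a b", "c d e", "f", "g h", "i"]
for batch in stream_slices(texts, batch_size=2, transform=str.split, every=2):
    print(batch)
# [['a', 'b'], ['f']]
# [['i']]
```

Because the generator touches only the indices it needs, skipped samples are never fetched. That property matters when the underlying store is remote.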
Anyone from the community can propose a new project for the roadmap:
- Open a new issue in the repo
- Add the issue to the project board
When a new project is created, it will be placed in the Discussion column, where it receives feedback from the community and core maintainers (@AbhinavTuli, @davidbuniat, @edogrigqv2, @kristinagrig06, @haiyangdeperci, @mynameisvinn, @mikayelh).
For a project to happen (in other words, get "prioritized" in the roadmap), three things need to be in place:
- The scope is clear enough to understand the functional benefits and user impact on a high level
- A community member (ideally the proposer) is willing to dedicate the effort and resources to make the project happen
- A core maintainer is willing to sponsor the project, that is, define the scope, write the corresponding unit tests, and review the PR
When these conditions are met, the card is moved to the Committed column. The subsequent In Development and Done columns indicate the development status.