Commit

feat: 🚀 Updated devcontainer to use pre-built Python image, added dependabot for devcontainers, and made enhancements to docs and notebooks

- Devcontainer now uses a pre-built Microsoft Python image for efficiency.
- Introduced a Dependabot configuration to keep devcontainer dependencies up to date automatically.
- Removed local Dockerfile in favor of using the pre-built image, streamlining setup.
- Added a Makefile to automate environment setup and dependency installation for development.
- Updated README to better describe the project's goals and community contributions.
- Improved FAQ section in the documentation for clarity and user engagement.
- Simplified ROADMAP document, focusing on critical milestones for open data collaboration.
- Adjusted notebooks to demonstrate updated dataset handling and catalog interaction, reflecting the latest frictionless data practices.
- Added Hugging Face Hub to the list of requirements to support dataset sharing and versioning initiatives.
davidgasquez committed May 3, 2024
1 parent 9848f75 commit 9e56a43
Showing 10 changed files with 93 additions and 98 deletions.
7 changes: 2 additions & 5 deletions .devcontainer/devcontainer.json
@@ -1,7 +1,4 @@
 {
-    "name": "Datasets",
-    "build": {
-        "dockerfile": "../Dockerfile",
-        "context": ".."
-    }
+    "name": "Datonic Hub",
+    "image": "mcr.microsoft.com/devcontainers/python:1-3.12-bullseye"
 }
6 changes: 6 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,6 @@
version: 2
updates:
- package-ecosystem: "devcontainers"
directory: "/"
schedule:
interval: weekly
6 changes: 0 additions & 6 deletions Dockerfile

This file was deleted.

5 changes: 5 additions & 0 deletions Makefile
@@ -0,0 +1,5 @@
venv:
@command -v uv >/dev/null 2>&1 || pip install -U uv
uv venv
uv pip install -U -r requirements.txt
. .venv/bin/activate
5 changes: 2 additions & 3 deletions README.md
@@ -1,9 +1,8 @@
 # 📦 Datonic Hub
 
-The center of the Datonic community.
+The center of the Datonic community. A place to improve the way the world produces, share, consume and collaborate on open datasets.
 
-We aim to improve the way the world produces, share, consume and collaborate on open datasets.
-We aim for a world that produces **open data** with **open source software** using **open protocols** running on **open infrastructure**.
+Aiming to share **open data** generated with **open source software** using **open protocols** running on **open infrastructure**.
 
 ## 📖 Documentation
16 changes: 6 additions & 10 deletions docs/FAQ.md
@@ -1,6 +1,6 @@
 # FAQ
 
-Please open an issue if you have any other question!
+[Open an issue](https://github.com/datonic/hub/issues/new) if you have questions!

## Why Frictionless?

@@ -16,7 +16,7 @@ We need to solve the problem of "packaging data" as a community. Frictionless is

I've [tried quite a bunch of Data Package Managers](https://publish.obsidian.md/davidgasquez/Open+Data#Data+Package+Managers). Frictionless is the simplest and most flexible one. It also has a reasonable adoption and active community.

-That said, I'm open to other options. If you have a better idea, please open an issue and let's chat!
+That said, I'm open to other options. If you have a better idea, [let's chat](https://davidgasquez.com/)!

### How would you make datasets immutable?

@@ -31,7 +31,7 @@ resources:
    scheme: ipfs
```
-In the end, the Frictionless abstraction is just a URL. We can use anything we want in the backend as long as we provide a way to read the data. In this case:
+In the end, the Frictionless abstraction is just an URL. We can use anything we want in the backend as long as we provide a way to read the data. In this case:
```python
ipfs_package = Package("my-dataset-datapackage.yaml") # Could even be Package("bafyreca4sf...")
```
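For context, a minimal sketch of reading such a package with frictionless-py; the resource name is hypothetical, and resolving `ipfs://` URLs assumes an IPFS-aware filesystem plugin is installed:

```python
from frictionless import Package

# Hypothetical descriptor; could also be a bare CID as above.
package = Package("my-dataset-datapackage.yaml")

# Resources resolve through their URL scheme, so an IPFS-backed
# resource reads the same way as an HTTP- or file-backed one.
resource = package.get_resource("my-data")
print(resource.read_rows()[:5])
```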
@@ -45,7 +45,7 @@ ipfs_resource.sql("SELECT * FROM my-data")

### How would you backup datasets?

-An easy and cheap way to backup datasets is to preiodically backup the data resources on IPFS/Filecoin. This can be done using GitHub Actions and [Estuary](https://estuary.tech/)/[web3.storage](https://web3.storage/). Once the data in there, we can rely on the [`_cache` property of the Frictionless Specs](https://specs.frictionlessdata.io/patterns/#caching-of-resources) (or a `_backup` one) to point to the IPFS CID.
+Depending on the dataset, this feature could be pushed to the hosting later. If you publish in HuggingFace, you get versioning and backup for free! Once the data in there, we can rely on the [`_cache` property of the Frictionless Specs](https://specs.frictionlessdata.io/patterns/#caching-of-resources) (or a `_backup` one) to point to the previous backup.
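For context, a rough sketch of publishing a resource to the Hugging Face Hub with `huggingface_hub` (newly added to `requirements.txt`); the repo id and file paths are hypothetical:

```python
from huggingface_hub import HfApi

api = HfApi()

# Hypothetical dataset repo; every upload becomes a revisioned commit on the Hub,
# which is what gives the versioning and backup "for free".
api.create_repo("datonic/my-dataset", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="data/my-data.parquet",
    path_in_repo="my-data.parquet",
    repo_id="datonic/my-dataset",
    repo_type="dataset",
)
```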

### How would you make datasets discoverable?

@@ -89,7 +89,7 @@ Some interesting plugins ideas might be to integrate with Socrata ([Simon Wilson

### How would you make datasets reproducible?

-Need more thought but probably using something like Bacalhau to run the pipelines.
+By versioning the code and the data together, it should be possible to reproduce the dataset. The easiest way to do this is by publishing datasets via GitHub Actions, this way the code and the data are always in sync. Furthermore, attaching a Docker image and Dev Container environment makes it easy to reproduce the dataset in any environment.

### How would you make datasets versioned?

@@ -108,13 +108,9 @@ Yes, the new LLM models could help with this vision. A few things that could be
- Extract data and generate resources from anything. Define the schema and let GPT-N do the rest. [Some projects are already working on this](https://jamesturk.github.io/scrapeghost/).
- Can datapackages be written in natural language? Can we use GPT-N to generate them? The same way [plugins are starting to be written for ChatGPT](https://raw.githubusercontent.com/openai/chatgpt-retrieval-plugin/336ff64b96ef23bda164ab94ca6f349607bbc5b6/.well-known/ai-plugin.json) that only requires a `description_for_model` text. Could something like this work on data packages. Embeddings become the flexible metadata we all want.

-### How does Frictionless Data compare to other data management or data packaging tools?
-
-TODO: Explain how the project fits into the larger open data ecosystem and how it relates to other similar projects.
-
### Can Frictionless be used for non-tabular data formats?

-TODO: Explain how the project can be used for non-tabular data formats and add examples.
+Yes! It is probably not the best fit but the basic idea would be to have a table pointing to the URI of the non-tabular data. For example, you could have a datasets of sounds, images, or videos by having a column with the URI of the file.
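For context, a minimal sketch of that pattern with frictionless-py; the names and URIs are hypothetical:

```python
from frictionless import Package, Resource

# A plain table whose rows point at non-tabular files by URI (values hypothetical).
sounds = Resource(
    name="sounds",
    data=[
        ["id", "label", "uri"],
        [1, "birdsong", "https://example.com/sounds/birdsong.mp3"],
        [2, "whale", "https://example.com/sounds/whale.mp3"],
    ],
)

package = Package(name="sound-dataset", resources=[sounds])
package.to_yaml("sound-dataset-datapackage.yaml")
```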

### Why should people use Frictionless Data?

17 changes: 7 additions & 10 deletions docs/ROADMAP.md
@@ -1,25 +1,22 @@
 # ROADMAP
 
-## Overview
+## Goal
 
-Align the way we package data as an ecosystem.
+Create better ways to produce **open data** with **open source software** using **open protocols** running on **open infrastructure**.
 
 ## Milestones
 
 ### 0.1
 
-- [ ] Document how to backup datasets to IPFS
-- [ ] Create a Catalog of existing datasets
-- [ ] Make datasets retrievable via gateways
-- [ ] Make datasets retrievable via IPFS with `fsspec`
+- [ ] Create a sample repository for creating and sharing datasets
+- [ ] Make datasets easily retrievable
+- [ ] Make datasets discoverable
 - [ ] Early community reach out to look for potential datasets to package and collaborate on
 
 ### 0.2
 
-- [ ] Write HuggingFace plugin
-- [ ] Write Socrata plugin
-- [ ] Backup HuggingFace and Socrata datasets to IPFS/Filecoin
-- [ ] Integrate with other community projects like [OpSci Commons](https://commons.opsci.io/), [OpenNeuro](https://openneuro.org/), [OpenPanda](https://openpanda.io/).
+- [ ] Backup datasets to multiple locations
+- [ ] Automate dataset format conversion
 
 ### 0.3
 
2 changes: 1 addition & 1 deletion docs/working-group.md
@@ -1,6 +1,6 @@
 # 📦 Open Data Working Group
 
-Exploring a better way to produces **open data** with **open source software** using **open protocols** running on **open infrastructure**.
+Exploring better ways to produce **open data** with **open source software** using **open protocols** running on **open infrastructure**.
 
 ## 🧑‍🦱 Interesting Folks
 