Release v0.14.0: Filesystem API, Webhook Server, upload improvements, keep-alive connections, and more · huggingface/huggingface_hub

HfFileSystem: interact with the Hub through the Filesystem API

We introduce HfFileSystem, a pythonic filesystem interface compatible with fsspec. Built on top of HfApi, it offers typical filesystem operations like cp, mv, ls, du, glob, get_file and put_file.

>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()

# List all files in a directory
>>> fs.ls("datasets/myself/my-dataset/data", detail=False)
['datasets/myself/my-dataset/data/train.csv', 'datasets/myself/my-dataset/data/test.csv']

# Read the content of a remote file as text
>>> train_data = fs.read_text("datasets/myself/my-dataset/data/train.csv")

Its biggest advantage is to provide ready-to-use integrations with popular libraries like Pandas, DuckDB and Zarr.

import pandas as pd

# Read a remote CSV file into a dataframe
df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

# Write a dataframe to a remote CSV file
df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")
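To make the structure of these paths concrete, here is a minimal sketch of how an `hf://` URL could be split into a repo type, a repo id and a path inside the repo. `split_hf_path` and `HfPath` are hypothetical names used for illustration only. They are not part of huggingface_hub, and the sketch deliberately ignores revisions and repos without a namespace, so the real resolution logic may differ.

```python
from typing import NamedTuple


class HfPath(NamedTuple):
    """Hypothetical container, for illustration only."""
    repo_type: str     # "dataset", "space" or "model"
    repo_id: str       # "<namespace>/<repo-name>"
    path_in_repo: str  # path relative to the repo root ("" for the root)


# Assumption: a plural prefix selects the repo type, models have no prefix.
_PREFIXES = {"datasets": "dataset", "spaces": "space"}


def split_hf_path(path: str) -> HfPath:
    """Split 'hf://datasets/user/repo/file' into its three components."""
    if path.startswith("hf://"):
        path = path[len("hf://"):]
    parts = path.strip("/").split("/")
    repo_type = "model"
    if parts and parts[0] in _PREFIXES:
        repo_type = _PREFIXES[parts[0]]
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Expected '<namespace>/<repo-name>' in {path!r}")
    return HfPath(repo_type, "/".join(parts[:2]), "/".join(parts[2:]))


print(split_hf_path("hf://datasets/my-username/my-dataset-repo/train.csv"))
```

The same helper accepts the prefix-less form used with `fs.ls` above, e.g. `"datasets/myself/my-dataset/data/train.csv"`.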

For a more detailed overview, please have a look at this guide.

Transfer the hffs code to hfh by @mariosasko in #1420
Hffs misc improvements by @mariosasko in #1433

Webhook Server

WebhooksServer lets you implement, debug and deploy webhook endpoints on the Hub without any overhead. Creating a new endpoint is as easy as decorating a Python function.

# app.py
from huggingface_hub import webhook_endpoint, WebhookPayload

@webhook_endpoint
async def trigger_training(payload: WebhookPayload) -> None:
    if payload.repo.type == "dataset" and payload.event.action == "update":
        # Trigger a training job if a dataset is updated
        ...

For more details, check out this twitter thread or the documentation guide.

Note that this feature is experimental, which means the API/behavior might change without prior notice. A warning is displayed to the user when using it. As it is experimental, we would love to get feedback!
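Since the API is experimental, it can help to keep the decision logic in a plain function that can be tested without starting a server. The sketch below extracts the condition from the endpoint above. `should_trigger_training` and `fake_payload` are hypothetical helpers, and the `SimpleNamespace` objects are stand-ins that only expose the two attributes used here, not real `WebhookPayload` instances.

```python
from types import SimpleNamespace


def should_trigger_training(payload) -> bool:
    """Same condition as in the endpoint, extracted so it can be unit-tested."""
    return payload.repo.type == "dataset" and payload.event.action == "update"


def fake_payload(repo_type: str, action: str) -> SimpleNamespace:
    """Stand-in payload exposing only `repo.type` and `event.action`."""
    return SimpleNamespace(
        repo=SimpleNamespace(type=repo_type),
        event=SimpleNamespace(action=action),
    )


print(should_trigger_training(fake_payload("dataset", "update")))
print(should_trigger_training(fake_payload("model", "update")))
```

The endpoint body then reduces to a single `if should_trigger_training(payload):` check.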

[Feat] Webhook server by @Wauplin in #1410

Some upload QOL improvements

Faster upload with `hf_transfer`

huggingface_hub now integrates with hf_transfer, a Rust-based library, to upload large files in chunks and concurrently. Expect a 3x speed-up if your bandwidth allows it!

feat: add hf_transfer upload by @McPatate in #1395

Upload in multiple commits

Uploading a large folder in one go can be frustrating if an error occurs while committing (e.g. a connection error). It is now possible to upload a folder in multiple (smaller) commits. If a commit fails, you can re-run the script and resume the upload. Commits are pushed to a dedicated PR. Once completed, the PR is merged to the main branch, resulting in a single commit in your git history.

from huggingface_hub import upload_folder

upload_folder(
    folder_path="local/checkpoints",
    repo_id="username/my-dataset",
    repo_type="dataset",
    multi_commits=True,  # resumable multi-upload
    multi_commits_verbose=True,
)

Note that this feature is also experimental, meaning its behavior might be updated in the future.
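The resume behavior can be pictured with a small, library-independent sketch: files are split into batches, each batch is pushed separately, and a re-run skips the batches that already went through. This is not how huggingface_hub implements it. `upload_in_batches`, `chunked` and the in-memory `done` set are assumptions made for brevity, and `flaky_push` simulates a single connection error.

```python
from typing import Callable, Iterable, List, Set


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_in_batches(
    files: List[str],
    batch_size: int,
    push_batch: Callable[[List[str]], None],
    done: Set[int],
) -> None:
    """Push each batch once; batches whose index is in `done` are skipped."""
    for index, batch in enumerate(chunked(files, batch_size)):
        if index in done:
            continue
        push_batch(batch)  # may raise; `done` is then left untouched
        done.add(index)


files = [f"checkpoint-{i}.bin" for i in range(5)]
pushed: List[List[str]] = []
failures = {"remaining": 1}


def flaky_push(batch: List[str]) -> None:
    # Fail once on the second batch to simulate a connection error.
    if batch[0] == "checkpoint-2.bin" and failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise ConnectionError("simulated network failure")
    pushed.append(batch)


done: Set[int] = set()
try:
    upload_in_batches(files, 2, flaky_push, done)
except ConnectionError:
    pass  # the first run stops after the first batch
upload_in_batches(files, 2, flaky_push, done)  # the second run resumes
print(pushed)
```

After the second run, every batch has been pushed exactly once and nothing was uploaded twice.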

New endpoint: create_commits_on_pr by @Wauplin in #1375

Upload validation

More pre-validation is now done before committing files to the Hub. The .git folder (if any) is ignored in upload_folder, and invalid paths now fail early.

Fix path_in_repo validation when committing files by @Wauplin in #1382
Raise issue if trying to upload .git/ folder + ignore .git/ folder in upload_folder by @Wauplin in #1408
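As a rough illustration of this kind of pre-validation, the sketch below rejects suspicious paths for single files and silently skips anything under `.git/` when filtering a folder listing. `check_path_in_repo` and `filter_uploadable` are hypothetical helpers, and the exact set of paths treated as invalid here (absolute paths, empty, `.` or `..` segments) is an assumption, not the library's actual rule set.

```python
from typing import Iterable, List


def check_path_in_repo(path: str) -> str:
    """Return the path unchanged, or raise ValueError if it looks invalid."""
    segments = path.split("/")
    if any(segment in ("", ".", "..") for segment in segments):
        raise ValueError(f"Invalid path_in_repo: {path!r}")
    if ".git" in segments:
        raise ValueError(f"Cannot upload files from a .git/ folder: {path!r}")
    return path


def filter_uploadable(paths: Iterable[str]) -> List[str]:
    """Drop anything inside a .git/ folder, then validate what is left."""
    kept = []
    for path in paths:
        if ".git" in path.split("/"):
            continue  # ignored rather than rejected when uploading a folder
        kept.append(check_path_in_repo(path))
    return kept


print(filter_uploadable(["README.md", ".git/config", ".gitattributes", "data/train.csv"]))
```

Note that only path segments equal to `.git` are filtered, so files such as `.gitattributes` are still uploaded.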

Keep-alive connections between requests

Internal update to reuse the same HTTP session across huggingface_hub. The goal is to keep the connection open when making multiple calls to the Hub, which ultimately saves a lot of time. For instance, updating metadata in a README became 40% faster, while listing all models from the Hub is 60% faster. This has no impact on standalone calls (e.g. a single GET request).

Keep-alive connection between requests by @Wauplin in #1394
Accept backend_factory to configure Sessions by @Wauplin in #1442
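The effect of a keep-alive connection can be observed with the standard library alone. The sketch below starts a throwaway local server that replies with the client's source port, then sends three requests over one `http.client.HTTPConnection`. Seeing the same port three times means the same TCP connection was reused. This only illustrates the general mechanism. It uses `http.client` rather than the session object huggingface_hub manages internally, and `EchoPortHandler` and `fetch_port` are names made up for this demo.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class EchoPortHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # needed so the server keeps connections open

    def do_GET(self):
        # Reply with the client's source port: same port means same connection.
        body = str(self.client_address[1]).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


server = ThreadingHTTPServer(("127.0.0.1", 0), EchoPortHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address


def fetch_port(conn: http.client.HTTPConnection) -> str:
    """Send one GET request and return the source port seen by the server."""
    conn.request("GET", "/")
    return conn.getresponse().read().decode()


# One connection object, three requests: the TCP handshake happens only once.
conn = http.client.HTTPConnection(host, port)
reused_ports = {fetch_port(conn) for _ in range(3)}
conn.close()

server.shutdown()
server.server_close()
print(len(reused_ports))
```

Opening a new connection for every request would instead pay the connection setup cost each time, which is where the savings on repeated calls come from.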

Custom sleep time for Spaces

It is now possible to programmatically set a custom sleep time on your upgraded Space. After X seconds of inactivity, your Space will go to sleep to save you some $$$.

from huggingface_hub import set_space_sleep_time

# Put your Space to sleep after 1h of inactivity
set_space_sleep_time(repo_id=repo_id, sleep_time=3600)
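The value passed as `sleep_time` is a number of seconds, so a small conversion helper avoids hand-computed constants. `to_sleep_seconds` below is a hypothetical convenience function built on `datetime.timedelta`. It is not part of huggingface_hub.

```python
from datetime import timedelta


def to_sleep_seconds(duration: timedelta) -> int:
    """Convert a duration into a whole number of seconds."""
    return int(duration.total_seconds())


print(to_sleep_seconds(timedelta(hours=1)))  # 3600
print(to_sleep_seconds(timedelta(minutes=30)))  # 1800
```

The result can be passed directly, e.g. `sleep_time=to_sleep_seconds(timedelta(hours=1))`.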

[Feat] Add sleep_time for Spaces by @Wauplin in #1438

Breaking change

fsspec has been added as a main dependency. It's a lightweight Python library required for HfFileSystem.

No other breaking change expected in this release.

Bugfixes & small improvements

File-related

A lot of effort has been invested in making huggingface_hub's cache system more robust, especially when working with symlinks on Windows. Hope everything's fixed by now.

Fix relative symlinks in cache by @Wauplin in #1390
Hotfix - use relative symlinks whenever possible by @Wauplin in #1399
[hot-fix] Malicious repo can overwrite any file on disk by @Wauplin in #1429
Fix symlinks on different volumes on Windows by @Wauplin in #1437
[FIX] bug "Invalid cross-device link" error when using snapshot_download to local_dir with no symlink by @thaiminhpv in #1439
Raise after download if file size is not consistent by @Wauplin in #1403

ETag-related

Following a server-side configuration issue, we made huggingface_hub more robust when retrieving ETags from the Hub, which should also make it more future-proof.

Update file_download.py by @Wauplin in #1406
🧹 Use HUGGINGFACE_HEADER_X_LINKED_ETAG const by @julien-c in #1405
Normalize both possible variants of the Etag to remove potentially invalid path elements by @dwforbes in #1428

Documentation-related

Docs about how to hide progress bars by @Wauplin in #1416
[docs] Update docstring for repo_id in push_to_hub by @tomaarsen in #1436

Misc

Prepare for 0.14 by @Wauplin in #1381
Add force_download to snapshot_download by @Wauplin in #1391
Model card template: Move model usage instructions out of Bias section by @NimaBoscarino in #1400
typo by @Wauplin (direct commit on main)
Log as warning when waiting for ongoing commands by @Wauplin in #1415
Fix: notebook_login() does not update UI on Databricks by @fwetdb in #1414
Passing the headers to hf_transfer download. by @Narsil in #1444

Internal stuff

Fix CI by @Wauplin in #1392
PR should not fail if codecov is bad by @Wauplin (direct commit on main)
remove cov check in PR by @Wauplin (direct commit on main)
Fix restart space test by @Wauplin (direct commit on main)
fix move repo test by @Wauplin (direct commit on main)
