- [2023-03-30] ⛵ Project created;
- [2023-11] 🪨 Completed the collection and organization of the NFT1000 dataset;
- [2023-12-30] 📄 Paper based on NFT1000 was submitted to ICME 2024;
- [2024-03-12] 💔 Paper was rejected by ICME; 🩶
- [2024-04-12] 📄 A better paper was finished and submitted to ACM Multimedia 2024;
- [2024-07-15] 🥳 Paper “NFT1000: A Cross-Modal Dataset For Non-Fungible Token Retrieval” was accepted by MM! 🎊
- [2024-9] 💾 Open-sourced the whole dataset, progress: ████████████████████████ [1001/1001]
- [2024-10-25] 🎉 MM2024 poster was released!
Please visit Hugging Face for more details.
- ……
An NFT (Non-Fungible Token) is a type of digital asset that represents ownership or proof of authenticity of a unique item, such as artwork, music, video, or virtual goods, on a blockchain. Unlike cryptocurrencies such as Bitcoin, which are fungible and can be exchanged on a one-to-one basis, each NFT is one of a kind and cannot be swapped like-for-like with another token. Each NFT carries a unique identifier, which makes it valuable to collectors, creators, and digital markets. As an essential digital asset in the Web 3.0 world, NFTs are set to play an increasingly important role. Because the academic community currently lacks a dataset focused on NFTs, we created NFT-Net, aiming to inspire and foster research and development in the field of NFTs!
ImageNet is a milestone in the field of computer vision, driving advances and cross-industry applications such as autonomous driving and medical image analysis. Building on this legacy, we aim to create a comprehensive dataset for the Web 3.0 domain: NFT-Net, which is designed to be the Web 3.0 counterpart of ImageNet!
NFT-Net is a multi-chain, multi-category, and multimodal dataset focused on Non-Fungible Tokens (NFTs). Each NFT project in the dataset serves as a basic unit, encompassing metadata, standardized image data (img), captions (text descriptions extracted from metadata for image-text alignment training), prompts (text labels derived from metadata for generative model training), and a dashboard (an overview of the project). Our long-term goal is to collect NFT projects across multiple blockchains (e.g., Ethereum, Solana, BTC) and categories (PFP, Arts, Photographs, Games, etc.), thus advancing research in NFT-related areas such as retrieval, generation, and quantitative trading.
We have already reached a significant milestone with the NFT1000 dataset! NFT1000 consists of the top 1000 (1001, in fact) most popular PFP NFT projects on the Ethereum blockchain, comprising 7.56 million image-text pairs and totaling 1.75TB of data. The dataset covers 356 themes and about 600,000 noun phrases, making it suitable for various downstream tasks such as NFT retrieval, generation, and visual question answering. Additionally, our research based on the NFT1000 dataset has been recognized: the paper "NFT1000: A Cross-Modal Dataset For Non-Fungible Token Retrieval" was accepted by ACM Multimedia 2024, one of the top three conferences in the field of multimedia AI.
The NFT1000 dataset comprises 1000 outstanding PFP NFT projects, each containing approximately 7500 image-text pairs, encompassing a total of 7.56 million image-text pairs with a collective data volume of 1.75TB.
In the dataset, the training set includes 800 projects with 6,178,249 image-text pairs, the validation set comprises 50 projects with 383,916 image-text pairs, and the test set consists of 150 projects with 1,000,838 image-text pairs. The content spans a diverse range of artistic types, including 3D rendered images, 2D flat illustrations, pixel art, NPC characters, real photographs, etc. It covers a total of 356 different content themes and 595,504 unique descriptive phrases.
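As a quick sanity check on the split sizes quoted above, the short sketch below sums them and prints each split's share of the total. The numbers are copied from this README; the variable names are only illustrative. The three splits add up to 1000 projects and 7,563,003 image-text pairs, which is consistent with the rounded figure of 7.56 million.

```python
# Split sizes as reported in this README: (projects, image-text pairs).
splits = {
    "train": (800, 6_178_249),
    "validation": (50, 383_916),
    "test": (150, 1_000_838),
}

total_projects = sum(projects for projects, _ in splits.values())
total_pairs = sum(pairs for _, pairs in splits.values())

# Print one row per split with its share of all image-text pairs.
for name, (projects, pairs) in splits.items():
    share = pairs / total_pairs
    print(f"{name:<10} {projects:>4} projects  {pairs:>9,} pairs  ({share:.1%})")

print(f"{'total':<10} {total_projects:>4} projects  {total_pairs:>9,} pairs")
```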
The NFT1000 dataset comprises the 1000 most renowned avatar NFT projects on the Ethereum mainnet, ranked by sales data as of 2023-6-23. (Interestingly, there are actually 1001 projects included, as my own project, BanaCat, is among them.) These NFT projects laid the foundations of the early NFT ecosystem and heralded the golden era of NFTs!
🍊List of collections in NFT1000
Please see the 📃PDF for the full list!
🍉Introduction to the dataset directory structure
NFT1000
├── BoredApeYachtClub
│   ├── captions/                  # Caption of each image
│   │   ├── BoredApeYachtClub_0.txt
│   │   ├── BoredApeYachtClub_1.txt
│   │   ├── ...
│   │   └── BoredApeYachtClub_9999.txt
│   ├── images/                    # Image of each NFT
│   │   ├── BoredApeYachtClub_0.png
│   │   ├── BoredApeYachtClub_1.png
│   │   ├── ...
│   │   └── BoredApeYachtClub_9999.png
│   ├── metadata/                  # Metadata of each NFT
│   │   ├── BoredApeYachtClub_0.json
│   │   ├── BoredApeYachtClub_1.json
│   │   ├── ...
│   │   └── BoredApeYachtClub_9999.json
│   ├── prompts/                   # Prompt of each NFT
│   │   ├── BoredApeYachtClub_0.txt
│   │   ├── BoredApeYachtClub_1.txt
│   │   ├── ...
│   │   └── BoredApeYachtClub_9999.txt
│   └── metadata_dashboard.json    # Metadata dashboard: an overview of the NFT project
├── CRYPTOPUNKS
│   └── ...
├── MutantApeYachtClub
│   └── ...
├── Azuki
│   └── ...
└── ...
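The layout above suggests that an image and its caption share the same file stem (for example `BoredApeYachtClub_0.png` and `BoredApeYachtClub_0.txt`). Under that assumption, a minimal sketch for walking one downloaded project and pairing images with captions could look like the following. The function name `iter_image_caption_pairs` is hypothetical and not part of NFT-NET-HUB, and the sketch assumes all images are `.png` files as shown in the tree.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_image_caption_pairs(project_dir: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (image_path, caption_text) for every image with a caption file.

    Assumes <project>/images/<stem>.png and <project>/captions/<stem>.txt
    share the same stem, as in the directory tree above.
    """
    captions_dir = project_dir / "captions"
    # sorted() orders file names lexicographically, so "_10" comes before "_2".
    for image_path in sorted((project_dir / "images").glob("*.png")):
        caption_path = captions_dir / f"{image_path.stem}.txt"
        if caption_path.is_file():  # skip images without a matching caption
            yield image_path, caption_path.read_text(encoding="utf-8").strip()
```

For example, `list(iter_image_caption_pairs(Path("NFT1000") / "BoredApeYachtClub"))` would return the pairs for one project, assuming the dataset was downloaded into a local `NFT1000` folder. The same pattern applies to the `prompts/` and `metadata/` folders by swapping the directory name and file extension.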
There are two ways to download NFT1000:
- Visit the official Hugging Face repository, NFT-NET, and either clone the repository or download individual projects with one click.
- Use NFT-NET-HUB, a package management tool designed to accompany the NFT-NET dataset. Its download script lets you fetch specific projects, for example:
from utils.downloader import NFT1000

local_repo_path = "/absolute/path/to/local/repo"
# Modify NFT_name_list to list the NFT projects you want to download
NFT_name_list = ["BoredApeYachtClub", "CRYPTOPUNKS"]
# Use a separate variable name so the NFT1000 class is not shadowed
dataset = NFT1000("NFT1000", local_repo_path)
dataset.download(NFT_name_list)
For a more detailed tutorial, please refer to: NFT-NET-HUB
The NFT1000 paper focuses on cross-modal retrieval over NFT data. This work marks the first application of cross-modal retrieval to NFT data, bringing intelligent search technologies from Web 2.0 into the context of Web 3.0. The key contributions of the paper are:
- Dataset Construction: We constructed the first NFT visual-text dataset in the field of computer vision, named NFT1000.
- Training Methodology: We propose an effective training method for NFT-type data, termed the dynamic masking fine-tuning scheme, and have trained several models to serve as our baseline.
- Similarity Quantification: To quantify image-text similarity, we introduce the Comprehensive Variance Index (CVI), which accounts for similarities within images and within texts, as well as the degree of image-text matching.
- Application in Image Generation: We also explore the application of NFT data in the field of image generation.
This paper was accepted by ACM Multimedia 2024! Please refer to the 📄full paper for more details!
Based on the research in the paper, we jointly developed an NFT search engine with NFTScan. You can try our online search demo at: https://www.nftscan.com/ai-search
Thank you 🙏 to all our contributors!
WTF Academy | NFTScan | Alchemy | NFTGO | Hugging Face | OpenSea | GCC | BABEL
All data in the NFT-NET dataset is for scientific research only. Please do not use it for commercial or other non-academic purposes, such as secondary sales! By downloading the data you agree to these terms, and any disputes arising from its use are the sole responsibility of the downloader.
@inproceedings{10.1145/3664647.3680903,
author = {Wang, Shuxun and Lei, Yunfei and Zhang, Ziqi and Liu, Wei and Liu, Haowei and Yang, Li and Li, Bing and Li, Wenjuan and Gao, Jin and Hu, Weiming},
title = {NFT1000: A Cross-Modal Dataset For Non-Fungible Token Retrieval},
year = {2024},
isbn = {9798400706868},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3664647.3680903},
doi = {10.1145/3664647.3680903},
booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia},
pages = {2214–2222},
numpages = {9},
keywords = {aigc, blockchain, clip, cross-modal retrieval, nft},
location = {Melbourne VIC, Australia},
series = {MM '24}
}