DIGITS can download some standard datasets and store them locally in the format that DIGITS expects (see Image Folder Format for a detailed explanation). Once these folders are created, you can use them to create your datasets in DIGITS.
$ python -m digits.download_data -h
usage: __main__.py [-h] [-c] dataset output_dir

Download-Data tool - DIGITS

positional arguments:
  dataset      mnist/cifar10/cifar100
  output_dir   The output directory for the data

optional arguments:
  -h, --help   show this help message and exit
  -c, --clean  Clean out the directory first (if necessary)
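If you want to fetch several datasets in one go, the command line described by the help text above can be assembled from a small script. The following is a minimal sketch, not part of DIGITS: the helper names `build_download_command` and `download` are hypothetical, and it assumes that `python` on your PATH is the interpreter in which DIGITS is installed.

```python
import subprocess

# Dataset names taken from the tool's help text.
DATASETS = ("mnist", "cifar10", "cifar100")


def build_download_command(dataset, output_dir, clean=False, python="python"):
    """Return the argument list for one `python -m digits.download_data` call.

    Hypothetical helper: it only assembles the documented command line.
    """
    if dataset not in DATASETS:
        raise ValueError("unknown dataset: %r" % (dataset,))
    command = [python, "-m", "digits.download_data"]
    if clean:
        command.append("-c")  # clean out the directory first (if necessary)
    command += [dataset, output_dir]
    return command


def download(dataset, output_dir, clean=False):
    """Run the download tool and raise CalledProcessError if it fails."""
    subprocess.run(build_download_command(dataset, output_dir, clean), check=True)


if __name__ == "__main__":
    # Only print the commands here; call download() to actually run them.
    # Assumption for illustration: one folder per dataset under ./data
    for name in DATASETS:
        print(" ".join(build_download_command(name, "data/" + name)))
```

Keeping the command construction separate from the call that runs it makes it easy to inspect what would be executed before any download starts.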
Yann LeCun provides a dataset of 28x28 grayscale images of handwritten digits. You can read all about it here: http://yann.lecun.com/exdb/mnist/
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.
Run this:
$ python -m digits.download_data mnist ~/mnist
And these folders and files will be created for you (images and temporary files omitted):
mnist/
├── train/
│   ├── 0/
│   ├── 1/
│   ├── 2/
│   ├── 3/
│   ├── 4/
│   ├── 5/
│   ├── 6/
│   ├── 7/
│   ├── 8/
│   ├── 9/
│   ├── labels.txt
│   └── train.txt
└── test/
    ├── 0/
    ├── ...
    ├── 9/
    ├── labels.txt
    └── test.txt
Then, you can use ~/mnist/train for your training images and ~/mnist/test for your validation or test images.
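Before creating a dataset from one of these folders, it can be useful to check that every class subfolder was populated. The sketch below counts the image files in each class subfolder. It is only an illustration: the function name `count_images_per_class` is hypothetical, and the list of file extensions is an assumption that you may need to adjust.

```python
import os

# Assumption: the images use one of these extensions; adjust as needed.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".bmp")


def count_images_per_class(folder):
    """Map each class subfolder of `folder` to the number of images in it."""
    counts = {}
    for entry in sorted(os.listdir(folder)):
        class_dir = os.path.join(folder, entry)
        if not os.path.isdir(class_dir):
            continue  # skip plain files such as labels.txt
        counts[entry] = sum(
            1 for name in os.listdir(class_dir)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts


if __name__ == "__main__":
    train_dir = os.path.expanduser("~/mnist/train")
    if os.path.isdir(train_dir):
        for label, count in count_images_per_class(train_dir).items():
            print(label, count)
```

For MNIST, the per-class counts under the training folder should add up to the 60,000 training examples mentioned above.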
Alex Krizhevsky provides two datasets of 32x32 color images. You can read all about them here: http://www.cs.toronto.edu/~kriz/cifar.html
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
Run this:
$ python -m digits.download_data cifar10 ~/cifar10
And these folders and files will be created for you (images and temporary files omitted):
cifar10/
├── train/
│   ├── airplane/
│   ├── automobile/
│   ├── bird/
│   ├── cat/
│   ├── deer/
│   ├── dog/
│   ├── frog/
│   ├── horse/
│   ├── ship/
│   ├── truck/
│   ├── labels.txt
│   └── train.txt
└── test/
    ├── airplane/
    ├── ...
    ├── truck/
    ├── labels.txt
    └── test.txt
Then, you can use ~/cifar10/train for your training images and ~/cifar10/test for your validation or test images.
The CIFAR-100 dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
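The CIFAR-100 figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers in the text. The per-superclass values assume that the 100 classes are split evenly across the 20 superclasses.

```python
# Figures quoted in the dataset description.
NUM_FINE_CLASSES = 100
NUM_SUPERCLASSES = 20
TRAIN_PER_CLASS = 500
TEST_PER_CLASS = 100

images_per_class = TRAIN_PER_CLASS + TEST_PER_CLASS   # 600, as stated
total_train = NUM_FINE_CLASSES * TRAIN_PER_CLASS      # 50000
total_test = NUM_FINE_CLASSES * TEST_PER_CLASS        # 10000
total_images = total_train + total_test               # 60000, same as CIFAR-10

# Assumption: the fine classes are split evenly across the superclasses.
fine_per_superclass = NUM_FINE_CLASSES // NUM_SUPERCLASSES    # 5
train_per_superclass = fine_per_superclass * TRAIN_PER_CLASS  # 2500
test_per_superclass = fine_per_superclass * TEST_PER_CLASS    # 500

print(total_train, total_test, total_images)
print(fine_per_superclass, train_per_superclass, test_per_superclass)
```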
Run this:
$ python -m digits.download_data cifar100 ~/cifar100
And these folders and files will be created for you (images and temporary files omitted):
cifar100/
├── coarse/
│   ├── train/
│   │   └── ...
│   ├── test/
│   │   └── ...
│   ├── labels.txt
│   ├── test.txt
│   └── train.txt
└── fine/
    ├── train/
    │   └── ...
    ├── test/
    │   └── ...
    ├── labels.txt
    ├── test.txt
    └── train.txt
If you want to use the coarse dataset (20 classes), use ~/cifar100/coarse/train and ~/cifar100/coarse/test.
If you want to use the fine dataset (100 classes), use ~/cifar100/fine/train and ~/cifar100/fine/test.
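If you switch between the two label sets in your own scripts, a small helper can keep the folder selection in one place. This is a minimal sketch that assumes the directory layout shown above. The function name `cifar100_folders` is hypothetical and not part of DIGITS.

```python
import os

# The two label sets created by the CIFAR-100 download.
LABEL_SETS = ("coarse", "fine")


def cifar100_folders(root, label_set):
    """Return (train_dir, test_dir) for the chosen CIFAR-100 label set."""
    if label_set not in LABEL_SETS:
        raise ValueError("label_set must be one of %r" % (LABEL_SETS,))
    base = os.path.join(os.path.expanduser(root), label_set)
    return os.path.join(base, "train"), os.path.join(base, "test")


if __name__ == "__main__":
    for label_set in LABEL_SETS:
        train_dir, test_dir = cifar100_folders("~/cifar100", label_set)
        print(label_set, train_dir, test_dir)
```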