This crate gives access to image data and label data from the MNIST dataset without having to worry about the dataset's file formats.

Add the following to your Cargo.toml:

    [dependencies]
    mnist_dataset = { git = "https://github.com/daniel-j-anderson-dev/mnist_dataset.git" }

The dataset is accessed through the following:
- `DataSet` trait
  - implemented by `TrainingData` and `TestData`
  - abstracts between the training and testing data sets
  - gives access to the following iterators
    - `data_set.images()`
      - yields `[f32; IMAGE_SIZE]` for each datum
      - each image array is row major gray-scale values between 0.0 (white) and 1.0 (black)
    - `data_set.labels()`
      - yields `[f32; DigitClass::COUNT]` for each datum
      - one hot encoded `DigitClass`
- types for handles to a specific image/label
  - `TrainingImage`
  - `TrainingLabel`
  - `TestImage`
  - `TestLabel`
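
To make the two array shapes concrete, here is a minimal standalone sketch of scaling raw pixel bytes into the 0.0 to 1.0 range and one hot encoding a digit. It does not use this crate and is not necessarily how the crate does it: dividing by 255 is an assumption, and `CLASS_COUNT`, `normalize`, and `one_hot` are hypothetical names (`CLASS_COUNT` stands in for `DigitClass::COUNT`).

```rust
/// Number of pixels in one 28x28 MNIST image.
const IMAGE_SIZE: usize = 28 * 28;
/// Hypothetical stand-in for `DigitClass::COUNT` (digits 0 through 9).
const CLASS_COUNT: usize = 10;

/// Scales raw pixel bytes (0..=255) to floats in 0.0..=1.0.
/// Assumption: a plain division by 255.
fn normalize(bytes: &[u8; IMAGE_SIZE]) -> [f32; IMAGE_SIZE] {
    let mut out = [0.0f32; IMAGE_SIZE];
    for (scaled, &raw) in out.iter_mut().zip(bytes.iter()) {
        *scaled = f32::from(raw) / 255.0;
    }
    out
}

/// One hot encodes a digit: 1.0 at the digit's index, 0.0 elsewhere.
/// Returns `None` for values greater than 9.
fn one_hot(digit: u8) -> Option<[f32; CLASS_COUNT]> {
    let index = usize::from(digit);
    if index >= CLASS_COUNT {
        return None;
    }
    let mut out = [0.0f32; CLASS_COUNT];
    out[index] = 1.0;
    Some(out)
}
```

With these helpers, `one_hot(3)` yields an array whose only nonzero entry is at index 3.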
Accessing the actual image data or label is done using these traits:
- `Image`
  - implemented by `TrainingImage` and `TestImage`
  - use `image.as_bytes()` to get the 784 row major bytes of image data specified by an image handle
- `Label`
  - implemented by `TrainingLabel` and `TestLabel`
  - use `label.digit_class()` to get the `DigitClass` specified by a label handle
The `mnist_dataset::visualization` module has several tests that generate viewable images from the MNIST dataset:
- `$ cargo test ascii_art` generates two text files that contain ASCII art depictions of the dataset
- `$ cargo test pgm` generates a PGM image for every image in the dataset
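
For readers unfamiliar with PGM, the following is a rough sketch of how a block of row major gray-scale bytes could be encoded as a binary (P5) PGM file. It is not the code in the `visualization` module; `encode_pgm` is a hypothetical helper. One assumption worth flagging: PGM treats 0 as black and the maximum value as white, the opposite of MNIST, so this sketch inverts each pixel to keep digits dark on a light background.

```rust
/// Encodes row major gray-scale pixels as a binary PGM (P5) file.
/// Returns `None` if the pixel count does not match width * height.
fn encode_pgm(pixels: &[u8], width: usize, height: usize) -> Option<Vec<u8>> {
    if pixels.len() != width.checked_mul(height)? {
        return None;
    }
    // Header: magic "P5", then width and height, then the maximum gray value.
    let mut out = format!("P5\n{} {}\n255\n", width, height).into_bytes();
    // MNIST: 0 = white, 255 = black. PGM: 0 = black, 255 = white.
    // Assumption: invert so the rendered digit stays dark.
    out.extend(pixels.iter().map(|&p| 255 - p));
    Some(out)
}
```

The returned bytes can be written to disk with `std::fs::write` and opened in any viewer that supports PGM.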
dataset downloaded from: https://github.com/mrgloom/MNIST-dataset-in-different-formats/tree/master/data/Original%20dataset
information about the dataset: https://yann.lecun.com/exdb/mnist/ (dead link).
From the Wayback Machine of https://yann.lecun.com/exdb/mnist/:
The data is stored in a very simple file format designed for storing vectors and multidimensional matrices. General info on this format is given at the end of this page, but you don't need to read that to use the data files.
All the integers in the files are stored in the MSB first (high endian) format used by most non-Intel processors. Users of Intel processors and other low-endian machines must flip the bytes of the header.
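
In Rust there is no need to flip bytes by hand, because `u32::from_be_bytes` interprets four bytes as MSB first on any host. A small sketch, where `read_u32_be` is a hypothetical helper and not part of this crate:

```rust
/// Reads a 32 bit MSB first integer starting at `offset`.
/// Returns `None` if fewer than four bytes are available there.
fn read_u32_be(bytes: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    let s = bytes.get(offset..end)?;
    // from_be_bytes gives the same result on little and big endian hosts.
    Some(u32::from_be_bytes([s[0], s[1], s[2], s[3]]))
}
```

For example, the bytes `00 00 08 03` decode to 2051, the magic number of an image file.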
There are 4 files:
- `train-images-idx3-ubyte`: training set images
- `train-labels-idx1-ubyte`: training set labels
- `t10k-images-idx3-ubyte`: test set images
- `t10k-labels-idx1-ubyte`: test set labels
The training set contains 60000 examples, and the test set 10000 examples.
The first 5000 examples of the test set are taken from the original NIST training set. The last 5000 are taken from the original NIST test set. The first 5000 are cleaner and easier than the last 5000.
Training set label file (`train-labels-idx1-ubyte`):

| offset | type | value | description |
|---|---|---|---|
| 0000 | 32 bit integer | 0x00000801(2049) | magic number (MSB first) |
| 0004 | 32 bit integer | 60000 | number of items |
| 0008 | unsigned byte | ?? | label |
| 0009 | unsigned byte | ?? | label |
| ... | ... | ... | ... |
| xxxx | unsigned byte | ?? | label |
The labels values are 0 to 9.
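
The label file layout can be checked with a few lines of code. The sketch below validates the header against the table and returns the label bytes; `parse_labels` and `LABEL_MAGIC` are hypothetical names, and the crate's own parsing may differ.

```rust
/// Magic number of a label file: unsigned bytes, one dimension.
const LABEL_MAGIC: u32 = 0x0000_0801;

/// Validates a label file held in memory and returns the label bytes.
fn parse_labels(file: &[u8]) -> Result<&[u8], String> {
    if file.len() < 8 {
        return Err("file is shorter than the 8 byte header".to_string());
    }
    // Offset 0: magic number, MSB first.
    let magic = u32::from_be_bytes([file[0], file[1], file[2], file[3]]);
    if magic != LABEL_MAGIC {
        return Err(format!("unexpected magic number {:#010x}", magic));
    }
    // Offset 4: number of items, MSB first.
    let count = u32::from_be_bytes([file[4], file[5], file[6], file[7]]) as usize;
    // Offset 8 onward: one unsigned byte per label.
    let labels = &file[8..];
    if labels.len() != count {
        return Err(format!("header says {} labels, found {}", count, labels.len()));
    }
    if let Some(bad) = labels.iter().find(|&&label| label > 9) {
        return Err(format!("label {} is outside 0 to 9", bad));
    }
    Ok(labels)
}
```

Returning a borrowed slice avoids copying the labels; a caller that needs ownership can call `.to_vec()` on the result.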
Training set image file (`train-images-idx3-ubyte`):

| offset | type | value | description |
|---|---|---|---|
| 0000 | 32 bit integer | 0x00000803(2051) | magic number |
| 0004 | 32 bit integer | 60000 | number of images |
| 0008 | 32 bit integer | 28 | number of rows |
| 0012 | 32 bit integer | 28 | number of columns |
| 0016 | unsigned byte | ?? | pixel |
| 0017 | unsigned byte | ?? | pixel |
| ... | ... | ... | ... |
| xxxx | unsigned byte | ?? | pixel |
Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
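
Because pixels are stored row-wise, one image after another, the byte for a given image, row, and column can be located with simple arithmetic. The following sketch parses the 16 byte header from the table and looks up a single pixel. `ImageFile`, `parse_images`, and `be_u32` are hypothetical names used only for illustration, not this crate's API.

```rust
/// Magic number of an image file: unsigned bytes, three dimensions.
const IMAGE_MAGIC: usize = 0x0000_0803;

/// A parsed view over an image file held in memory.
struct ImageFile<'a> {
    count: usize,
    rows: usize,
    cols: usize,
    pixels: &'a [u8],
}

/// Reads a 32 bit MSB first integer at `offset` as a usize.
fn be_u32(bytes: &[u8], offset: usize) -> Option<usize> {
    let s = bytes.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_be_bytes([s[0], s[1], s[2], s[3]]) as usize)
}

/// Parses the header and checks that the pixel data has the expected length.
fn parse_images<'a>(file: &'a [u8]) -> Option<ImageFile<'a>> {
    if be_u32(file, 0)? != IMAGE_MAGIC {
        return None;
    }
    let count = be_u32(file, 4)?;
    let rows = be_u32(file, 8)?;
    let cols = be_u32(file, 12)?;
    let pixels = file.get(16..)?;
    let expected = count.checked_mul(rows)?.checked_mul(cols)?;
    if pixels.len() != expected {
        return None;
    }
    Some(ImageFile { count, rows, cols, pixels })
}

impl<'a> ImageFile<'a> {
    /// Returns the pixel at (`row`, `col`) of image number `image`.
    fn pixel(&self, image: usize, row: usize, col: usize) -> Option<u8> {
        if image >= self.count || row >= self.rows || col >= self.cols {
            return None;
        }
        // Row-wise layout: the column index changes fastest.
        let index = (image * self.rows + row) * self.cols + col;
        self.pixels.get(index).copied()
    }
}
```

For the real files, `rows` and `cols` are both 28, so each image occupies 784 consecutive bytes.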
Test set label file (`t10k-labels-idx1-ubyte`):

| offset | type | value | description |
|---|---|---|---|
| 0000 | 32 bit integer | 0x00000801(2049) | magic number (MSB first) |
| 0004 | 32 bit integer | 10000 | number of items |
| 0008 | unsigned byte | ?? | label |
| 0009 | unsigned byte | ?? | label |
| ... | ... | ... | ... |
| xxxx | unsigned byte | ?? | label |
The labels values are 0 to 9.
Test set image file (`t10k-images-idx3-ubyte`):

| offset | type | value | description |
|---|---|---|---|
| 0000 | 32 bit integer | 0x00000803(2051) | magic number |
| 0004 | 32 bit integer | 10000 | number of images |
| 0008 | 32 bit integer | 28 | number of rows |
| 0012 | 32 bit integer | 28 | number of columns |
| 0016 | unsigned byte | ?? | pixel |
| 0017 | unsigned byte | ?? | pixel |
| ... | ... | ... | ... |
| xxxx | unsigned byte | ?? | pixel |
Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
The IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.
The basic format is:

    magic number
    size in dimension 0
    size in dimension 1
    size in dimension 2
    .....
    size in dimension N
    data

- The magic number is an integer (MSB first).
- The first 2 bytes are always 0.
- The third byte codes the type of the data:
  - `0x08`: unsigned byte
  - `0x09`: signed byte
  - `0x0B`: short (2 bytes)
  - `0x0C`: int (4 bytes)
  - `0x0D`: float (4 bytes)
  - `0x0E`: double (8 bytes)
- The fourth byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices....
- The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).
- The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.
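
The rules above are enough to decode the header of any IDX file, not just the MNIST ones. Here is a minimal sketch under that reading of the format; `ElementType`, `IdxHeader`, and `parse_idx_header` are hypothetical names and are not taken from this crate.

```rust
/// Element types named by the third byte of the magic number.
#[derive(Debug, PartialEq)]
enum ElementType {
    U8,
    I8,
    I16,
    I32,
    F32,
    F64,
}

impl ElementType {
    /// Maps the type code from the magic number to an element type.
    fn from_code(code: u8) -> Option<ElementType> {
        match code {
            0x08 => Some(ElementType::U8),
            0x09 => Some(ElementType::I8),
            0x0B => Some(ElementType::I16),
            0x0C => Some(ElementType::I32),
            0x0D => Some(ElementType::F32),
            0x0E => Some(ElementType::F64),
            _ => None,
        }
    }
}

/// The decoded header of an IDX file.
#[derive(Debug, PartialEq)]
struct IdxHeader {
    element_type: ElementType,
    /// Size in each dimension, in file order.
    dimensions: Vec<usize>,
    /// Byte offset at which the data begins.
    data_offset: usize,
}

/// Decodes the magic number and the dimension sizes.
fn parse_idx_header(file: &[u8]) -> Option<IdxHeader> {
    let magic = file.get(0..4)?;
    // The first 2 bytes are always 0.
    if magic[0] != 0 || magic[1] != 0 {
        return None;
    }
    let element_type = ElementType::from_code(magic[2])?;
    // The fourth byte is the number of dimensions.
    let rank = usize::from(magic[3]);
    let mut dimensions = Vec::with_capacity(rank);
    for i in 0..rank {
        let start = 4 + 4 * i;
        let s = file.get(start..start + 4)?;
        // Each size is a 4-byte integer, MSB first.
        dimensions.push(u32::from_be_bytes([s[0], s[1], s[2], s[3]]) as usize);
    }
    Some(IdxHeader { element_type, dimensions, data_offset: 4 + 4 * rank })
}
```

Applied to the start of `train-images-idx3-ubyte`, this should report unsigned bytes, dimensions 60000, 28, 28, and a data offset of 16, matching the table for that file.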