Dataset Specification

Datasets are the foundation of machine learning: every model is built on one, so datasets need to be managed consistently. This document describes the standard format in which Pipcook saves a dataset after the datasource script has collected the data.

When the source data comes in a different format, the datasource script is responsible for smoothing over the differences and producing this standard format.

Image

Image datasets use the PascalVOC format. The directory layout is as follows:

📂dataset
   ┣ 📂annotations
   ┃ ┣ 📂train
   ┃ ┃ ┣ 📜...
   ┃ ┃ ┗ 📜${image_name}.xml
   ┃ ┣ 📂test
   ┃ ┗ 📂validation
   ┗ 📂images
     ┣ 📜...
     ┗ 📜${image_name}.jpg
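
The layout above can be created and sanity-checked with a few lines of code. The following is a minimal sketch, not part of Pipcook itself: the helper names `create_layout` and `find_missing_images` are hypothetical, and it assumes that `test` and `validation` hold annotation files in the same way as `train`, and that each annotation is matched to its image by file name.

```python
from pathlib import Path
from typing import List

# Assumption: all three splits store their XML files the same way as `train`.
SPLITS = ("train", "test", "validation")


def create_layout(root: Path) -> None:
    """Create the empty directory skeleton shown in the tree above."""
    for split in SPLITS:
        (root / "annotations" / split).mkdir(parents=True, exist_ok=True)
    (root / "images").mkdir(parents=True, exist_ok=True)


def find_missing_images(root: Path) -> List[str]:
    """Return "split/name" for every annotation without a matching .jpg."""
    missing = []
    for split in SPLITS:
        for xml_path in sorted((root / "annotations" / split).glob("*.xml")):
            image_path = root / "images" / f"{xml_path.stem}.jpg"
            if not image_path.is_file():
                missing.append(f"{split}/{xml_path.stem}")
    return missing
```

Running `find_missing_images(Path("dataset"))` after data collection gives a quick list of annotations whose image was not saved.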

Each annotation file is an XML document with the following structure:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<annotation>
  <folder>folder path</folder>
  <filename>image name</filename>
  <size>
    <width>width</width>
    <height>height</height>
  </size>
  <object>
    <name>category name</name>
    <bndbox> <!-- not required for image classification problems -->
      <xmin>left</xmin>
      <ymin>top</ymin>
      <xmax>right</xmax>
      <ymax>bottom</ymax>
    </bndbox>
  </object>
</annotation>
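
To illustrate how such an annotation might be consumed, here is a minimal sketch that parses one file with Python's standard `xml.etree.ElementTree` module. The function name `parse_annotation` and the sample values in `SAMPLE` are hypothetical, and the sketch assumes that sizes and box coordinates are integer pixel values. Objects without a `<bndbox>` (the image classification case) get `None` as their box.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample values; only the element names come from the format above.
SAMPLE = """<annotation>
  <folder>images</folder>
  <filename>cat_001.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>cat</name>
    <bndbox>
      <xmin>10</xmin>
      <ymin>20</ymin>
      <xmax>200</xmax>
      <ymax>220</ymax>
    </bndbox>
  </object>
</annotation>"""


def parse_annotation(xml_text):
    """Parse one annotation document into a plain dictionary."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        bbox = None  # stays None when <bndbox> is absent (classification)
        if box is not None:
            # Assumption: coordinates are integer pixel values.
            bbox = tuple(
                int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        objects.append({"name": obj.findtext("name"), "bbox": bbox})
    return {
        "filename": root.findtext("filename"),
        "width": int(root.findtext("size/width")),
        "height": int(root.findtext("size/height")),
        "objects": objects,
    }


annotation = parse_annotation(SAMPLE)
```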

Text

A text dataset should be a CSV file. The first column holds the text content and the second column holds the category name. Columns are separated by ',' and the file has no header row.

name, category
prod1, type1
prod2, type2
prod3, type2
prod4, type1
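
As a rough sketch of how a file in this format could be read, the snippet below uses Python's standard `csv` module. The function name `read_text_dataset` is hypothetical. It follows the stated no-header rule, so every non-empty line is treated as a sample, and it assumes that the space after each comma is not part of the category name.

```python
import csv
import io

# Hypothetical in-memory file using rows from the example above.
SAMPLE = "prod1, type1\nprod2, type2\nprod3, type2\nprod4, type1\n"


def read_text_dataset(stream):
    """Read (text, category) pairs from a two-column, header-less CSV stream."""
    # skipinitialspace drops the blank that follows each ',' delimiter.
    reader = csv.reader(stream, skipinitialspace=True)
    samples = []
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        samples.append((row[0], row[1]))
    return samples


samples = read_text_dataset(io.StringIO(SAMPLE))
```

For a file on disk, the same function can be given `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.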
