From 588074c26a010920f07049c33b605b055c0812b1 Mon Sep 17 00:00:00 2001
From: Percy Liang
Date: Tue, 9 Jul 2013 18:57:55 -0700
Subject: [PATCH] initial version specification (incomplete)

---
 SPECIFICATION.md | 217 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 217 insertions(+)
 create mode 100644 SPECIFICATION.md

diff --git a/SPECIFICATION.md b/SPECIFICATION.md
new file mode 100644
index 000000000..50a032b9d
--- /dev/null
+++ b/SPECIFICATION.md
@@ -0,0 +1,217 @@
+# Overview
+
+This document describes the design of the CodaLab website. This is still
+incomplete.
+
+## Entities
+
+These are the low-level entities in CodaLab:
+
+- **Bundle**: a Bundle is the basic building block used to represent programs
+  and datasets. Formally, it is an *immutable* directory containing files and
+  possibly further sub-directories. Each Bundle has a global ID which is
+  unique and stable (this is how we get reproducibility in research). Some
+  Bundles are uploaded by the user, and others are generated on CodaLab by
+  running programs. A Bundle has a designated directory inside it (currently,
+  this directory is represented as a YAML file):
+  - metadata
+    - name: name of the Bundle.
+    - description: free-form description of the Bundle
+    - tags: list of strings to help CodaLab classify the Bundle.
+
+- **Program**: a Program is a Bundle (which presumably contains executable
+  code) plus a *command-line string* which is executed to access that code.
+  The Bundle also must contain additional information:
+  - Architecture: which architectures can this program run on (e.g., Windows, Linux, MacOS)?
+  - Modules needed: what does the program need (in the appropriate path) to run (e.g., numpy, scipy, Matlab)?
+
+- **Run**: A Run represents the execution of a program. Formally, it is just a
+  Bundle, but is required to have the following directories:
+  - program: the Bundle of the program that is run.
+  - input: the Bundle of the input that is read by the program.
+  - output: the Bundle of the output written by the program.
+  - stdout: what was output to stdout
+  - stderr: what was output to stderr
+  - specification: contains information about the Run (optional)
+    - allowedMemory: how much memory to give the Run
+    - allowedTime: how much time to give the Run
+  - status: contains system statistics about the Run
+    - command: what command was executed
+    - exitCode: 0 means normal termination
+    - time: how long did the Run take?
+    - OS: what operating system was used to run it
+    - CPU: what CPU was used
+    - memory: how much memory did the Run take?
+  While the Run is executing, all of these directories should be updated in
+  real-time.
+
+These are the high-level entities in CodaLab:
+
+- **Worksheet**: A Worksheet represents a multi-stage experiment that a user
+  wants to perform (e.g., running and evaluating a machine learning algorithm).
+  Formally, a Worksheet is a sequence of *blocks*, where each block is one of
+  the following:
+  - A rich text block which allows users to document their experiment using free text.
+  - A single Bundle whose contents can be viewed and edited.
+  - A Table, where each row represents a Bundle and columns represent
+    different attributes of that Bundle. This is useful for comparing
+    different algorithms side-by-side according to some metric.
+
+- **Competition**: A Competition is an organized way for users to upload either
+  a program or predictions. An evaluation program will automatically score
+  those predictions against the dataset that the competition organizer has
+  provided and generate various metrics. In the case that competitors upload
+  predictions, a Competition contains:
+  - A dataset (Bundle) which competitors download during the competition and
+    can do whatever they want with.
+  - A Bundle containing the true predictions which is not revealed to the competitors.
+  - An evaluation Program, which takes as input the true predictions and the
+    predictions of a competitor and generates evaluation metrics in the
+    corresponding Run.
+  - Various metadata associated with the Competition (e.g., timeline, access,
+    etc.).
+
+Both Worksheets and Competitions are built on the common foundation of Bundles,
+Programs, and Runs. In the following sections, we will go through each of
+these entities in greater detail, including exactly what information we
+need to track about each one.
+
+## User
+
+A user represents someone who uses CodaLab to upload Bundles, download Bundles,
+create Runs, etc.
+
+### Database fields
+
+- Username (mandatory): immutable, Java identifier
+- Email (mandatory): used to validate passwords, etc.
+- First name, last name (mandatory)
+- Affiliation (mandatory)
+- Homepage (optional)
+- Interests (optional): free-form paragraph description
+- Groups that this user is part of (e.g., admin, medicalImaging), used to
+  determine access control.
+- Link to Facebook/Google IDs?
+- List of Programs/Bundles/Users they are interested in tracking. They will be
+  notified of any activity around them.
+- Analytics: creation time, login/logout times
+
+## Bundle
+
+A Bundle is the fundamental primitive in CodaLab: it is used to represent
+datasets, statistics of programs, trained models, and predictions. The idea is
+that many research workflows are quite heterogeneous, where one program's output is
+another's input (raw data to reformatting to feature extraction to machine
+learning algorithms to evaluation).
+
+### Database fields
+
+Some of these database fields are also in the metadata file,
+because that is possibly how the user first uploaded it. CodaLab must
+automatically keep the metadata and the database synchronized.
+
+- Name (mandatory if this is a program or dataset): immutable Java identifier
+- Description (text, mandatory): should be fairly descriptive (need to encourage this socially).
+- Tags: created by users or created by programs as a weak type system.
+- Owner (User who created this)
+- Creation time
+- Which groups are allowed to list/read/download/run on this Bundle.
+- Provenance: list of Bundles that this Bundle depends on (for example, it is
+  common for one of the sub-directories to be a link to a sub-directory inside
+  another Bundle).
+
+### Examples
+
+As we noted earlier, Bundles have many uses. Here are some examples of what
+Bundles are used to represent in machine learning:
+
+- **Dataset** (in ARFF format):
+  - metadata
+  - data.arff: contains the actual data in ARFF format
+  - status/numExamples: number of examples (produced by a program that inspects a Dataset).
+  - status/numAttributes: number of attributes
+- **Model** (for Weka):
+  - weka_classifier:
+- **Run** (was already shown above).
+
+## Program
+
+Recall that a Program includes a Bundle and a command-line string.
+
+### Database fields
+
+- (all the fields inherited from Bundle)
+- ID (should be hash of the Bundle name and command-line)
+- Command (string): what to run
+
+### Examples of types of programs
+
+- Learner (e.g., SVMlight): takes a datashard, and returns a model along with statistics.
+- Predictor: takes a model and a datashard, returns predictions.
+- Dataset splitter: divides one datashard into two datashards (based on training fraction, randomly or keeping the order (makes sense for NLP)).
+- Dataset stripper: removes the correct labels from each training example.
+- Dataset inspector: input is a datashard, output is a datashard. Makes sure
+  that the input dataset basically conforms to the correct format, can be lenient.
+  The output dataset should canonicalize things (for SvmLightFormat, remove
+  trailing spaces, replace tabs with spaces, sort features by index, etc.).
+- Dataset converters (e.g., convert ARFF or CSV to SVMlight): these converters
+  are not meant to be invertible, but commit to a particular encoding.
+  For example, ARFF can have categorical features (red,white,blue), whereas
+  SVMlight only takes real vectors.
+- Evaluator: takes the predictions and computes statistics (e.g., accuracy,
+  ROC, precision, recall); will be different for different tasks
+- Compiler: input is a program, output is a compiled program targeting a particular
+  architecture
+- Visualizer: takes predictions or statistics and generates a table or figure.
+
+## Run
+
+A Run is a Bundle with some additional fields.
+
+### Database fields
+
+- (all the fields inherited from Bundle)
+- (all the fields from the specification directory)
+- (all the fields from the status directory)
+- Whether the user should be emailed when a Run finishes.
+- Whether a Run should be terminated (user can select this).
+
+## Worksheet
+
+There are several applications of a worksheet:
+
+- Blog post: if I found some interesting patterns (e.g., linear classifiers
+  work well for high-dimensional data), I can add a set of CodaLab experiments
+  to a worksheet, write some analysis, and publish the worksheet to make it
+  public.
+- Paper: if I published a paper, I can put all the datasets, code, and
+  experiments in one worksheet and include the link to that page from my
+  website.
+- Working: if I am just developing a bunch of different algorithms, I can create
+  a worksheet as a scratch space to keep track of the algorithms that I am
+  working on, what the results are so far, etc.
+
+## Competition
+
+Any user can create a competition. The organizer has to supply the following:
+- Deadline dates (when datasets will be released, etc.)
+- Multiple datasets (one for training, one for testing)
+- A worksheet describing the competition rules, etc.
+- Access restrictions (the competition is only open to people in a certain group)
+
+When competitors upload predictions, an Evaluation program is launched, some
+metrics are computed, and the leaderboard is updated.
+
+## Miscellaneous
+
+- As much as possible, information should be represented as a directory so that
+  it can be downloaded.
+- The database should be used to index the information efficiently, especially
+  for numeric properties, so we can pull up the algorithms that have the lowest
+  error rates.
+- An inverted index (e.g., Lucene) should be used for search.
+
+## Macro
+
+Macros give users a way to easily create multiple related experiments
+given a set of arguments. TODO: complete this.
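The Macro section is still a TODO, so the following is only one possible reading and not the CodaLab design. It sketches a macro as a command template expanded over every combination of argument values, with each resulting command given a Program ID derived from the Bundle name and command line, as the Program section suggests. The names `program_id` and `expand_macro`, the choice of SHA-1, the `{name}` template syntax, and the example bundle name and command are all assumptions made for illustration.

```python
import hashlib
from itertools import product


def program_id(bundle_name, command):
    """Hash the Bundle name and command line into a stable ID.

    SHA-1 and the NUL separator are assumptions; the spec only says "hash".
    """
    payload = "\x00".join([bundle_name, command]).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def expand_macro(bundle_name, template, arguments):
    """Return one {id, command} record per combination of argument values.

    `arguments` maps each placeholder name in `template` to a list of values.
    """
    names = sorted(arguments)  # fixed order so the expansion is deterministic
    runs = []
    for values in product(*(arguments[name] for name in names)):
        command = template.format(**dict(zip(names, values)))
        runs.append({"id": program_id(bundle_name, command), "command": command})
    return runs


# Hypothetical learner bundle and flags, used only to show the expansion.
runs = expand_macro(
    "svmlight_learner",
    "./learn -c {c} -t {kernel} input output",
    {"c": [0.1, 1.0], "kernel": [0, 1]},
)
for run in runs:
    print(run["id"][:8], run["command"])
```

Hashing the fully expanded command means two experiments that differ in any argument get different IDs, while re-running the same macro reproduces the same ones.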