Initial commit
DrhF committed Jan 20, 2022
0 parents commit 9a4ed3b
Showing 36 changed files with 3,403 additions and 0 deletions.
3 changes: 3 additions & 0 deletions AUTHORS
@@ -0,0 +1,3 @@
The following authors have created the source code of "toloka-pachyderm" published and distributed by YANDEX LLC as the owner:

Daniil Fedulov [email protected]
35 changes: 35 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,35 @@
# Notice to external contributors


## General info

Hello! In order for us (YANDEX LLC) to accept patches and other contributions from you, you will have to adopt our Yandex Contributor License Agreement (the “**CLA**”). The current version of the CLA can be found here:
1) https://yandex.ru/legal/cla/?lang=en (in English) and
2) https://yandex.ru/legal/cla/?lang=ru (in Russian).

By adopting the CLA, you state the following:

* You genuinely wish to license, and are willingly licensing, your contributions to us for our open source projects under the terms of the CLA,
* You have read the terms and conditions of the CLA and agree with them in full,
* You are legally able to provide and license your contributions as stated,
* We may use your contributions for our open source projects and for any other project too,
* We rely on your assurances concerning the rights of third parties in relation to your contributions.

If you agree with these principles, please read and adopt our CLA. By providing us your contributions, you hereby declare that you have already read and adopted our CLA, and we may freely merge your contributions with our corresponding open source project and use them further in accordance with the terms and conditions of the CLA.

## Provide contributions

If you have already adopted the terms and conditions of the CLA, you can provide your contributions. When you submit your first pull request, please include the following statement in it:

```
I hereby agree to the terms of the CLA available at: [link].
```

Replace the bracketed text as follows:
* [link] is the link to the current version of the CLA: https://yandex.ru/legal/cla/?lang=en (in English) or https://yandex.ru/legal/cla/?lang=ru (in Russian).

You only need to provide this notification once.

## Other questions

If you have any questions, please mail us at [email protected].
13 changes: 13 additions & 0 deletions LICENSE
@@ -0,0 +1,13 @@
Copyright 2022 YANDEX LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
1 change: 1 addition & 0 deletions README.md
260 changes: 260 additions & 0 deletions example/README.md
@@ -0,0 +1,260 @@
# Pachyderm example: Use Toloka to enrich data for clickbait detection

We are going to use [Pachyderm](https://pachyderm.com/) to create a project, run news-headline annotation tasks in a [Toloka](https://toloka.ai/) project, aggregate the results,
and train a model.

In total, there will be five Pachyderm pipelines:

1. Project creation
2. Training creation (including tasks upload)
3. Pool creation (including tasks upload)
4. Starting a pool and waiting for pool to complete
5. Assignments aggregation

We will also run two pipelines as examples of further data processing:
* Datasets concatenation
* Training and testing Random Forest classifier for clickbait headings detection

To set up Pachyderm locally, please follow the [Pachyderm Local Installation documentation](https://docs.pachyderm.com/latest/getting_started/local_installation/).

Now, let's create these pipelines:

## Setup

1. Before creating pipelines, we need to make our Toloka API key available as a secret. First, create the secret
in Kubernetes (an example of the generated file follows this list):

```bash
kubectl create secret generic toloka-api --from-literal=token=<Your token> --dry-run=client --output=json > toloka-api-key.json
```

2. Then push this Kubernetes secret to Pachyderm and check that it was added correctly:

```bash
pachctl create secret -f toloka-api-key.json
pachctl list secret
```

3. Next, build the Docker image used by the pipelines (a sketch of a possible Dockerfile also follows this list):

```bash
docker build -f src/Dockerfile -t toloka_pachyderm:latest .
```

4. We also have to add our train and test datasets to a Pachyderm repo:

```bash
pachctl create repo clickbait_data
pachctl put file clickbait_data@master:train.csv -f ./data/train.csv
pachctl put file clickbait_data@master:test.csv -f ./data/test.csv
```
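
For reference, the `toloka-api-key.json` produced by the `--dry-run=client --output=json` invocation in step 1 should look roughly like this, with your token base64-encoded in the `data` field:

```json
{
    "kind": "Secret",
    "apiVersion": "v1",
    "metadata": {
        "name": "toloka-api",
        "creationTimestamp": null
    },
    "data": {
        "token": "<base64-encoded token>"
    }
}
```

The `src/Dockerfile` built in step 3 is part of the commit but not expanded on this page. Given that the pipeline specs run Python scripts from `/code` and the walkthrough relies on toloka-kit, crowd-kit, pandas, and scikit-learn, a plausible sketch (not the repo's actual file) is:

```dockerfile
# Hypothetical sketch; the real src/Dockerfile may pin different versions.
FROM python:3.9-slim
RUN pip install --no-cache-dir toloka-kit crowd-kit pandas scikit-learn
COPY src/ /code/
```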

## Project creation

You may run the project creation script:

```bash
./create_toloka_project.sh
```

or create project step-by-step:

1. Initialize a Pachyderm repo for the project config:

```bash
pachctl create repo toloka_project_config
```

2. Put the provided project config:

```bash
pachctl put file toloka_project_config@master:project.json -f ./configs/project.json
```

3. Check that the data was added to the repository correctly:

```bash
pachctl list file toloka_project_config@master
```

4. Now let's create a pipeline (a sketch of what this spec might contain follows this list):

```bash
pachctl create pipeline -f toloka_create_project.json
```

5. When the pipeline is created, a new job starts. Check that the pipeline's job completed successfully:

```bash
pachctl list job -p toloka_create_project
```
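
The `toloka_create_project.json` spec itself isn't reproduced in this commit view, but judging from the `concatenate_datasets.json` spec shown at the bottom of this page, it presumably follows the same shape: a `transform` that runs a script from the `toloka_pachyderm` image over the config repo, with the Toloka token injected from the `toloka-api` secret. A rough sketch (the script name and flags are guesses):

```json
{
  "pipeline": { "name": "toloka_create_project" },
  "description": "A pipeline that creates a Toloka project from project.json.",
  "transform": {
    "image": "toloka_pachyderm:latest",
    "cmd": [
      "python3", "/code/create_project.py",
      "--project-config", "/pfs/toloka_project_config/project.json",
      "--output", "/pfs/out/project.json"
    ],
    "secrets": [
      { "name": "toloka-api", "env_var": "TOLOKA_API_ACCESS_KEY", "key": "token" }
    ]
  },
  "input": {
    "pfs": { "repo": "toloka_project_config", "glob": "/" }
  }
}
```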

## Training creation

Training is needed for Tolokers to understand how to annotate your data properly.

You may run the training creation script:

```bash
./create_toloka_training.sh
```

or initialize training step-by-step:

1. Initialize a Pachyderm repo for the training config:

```bash
pachctl create repo toloka_training_config
```

2. Put the provided training config:

```bash
pachctl put file toloka_training_config@master:training.json -f ./configs/training.json
```

3. Check that the data was added to the repository correctly:

```bash
pachctl list file toloka_training_config@master
```

4. Now let's create a pipeline

```bash
pachctl create pipeline -f toloka_create_training.json
```

5. Let's check if the pipeline's job completed successfully

```bash
pachctl list job -p toloka_create_training
```

6. Our training still contains no tasks for Tolokers to practice on. To fix this, we create a repo for
the training data, upload it, and run the pipeline:
```bash
pachctl create repo toloka_training_tasks
pachctl put file toloka_training_tasks@master:training_tasks.csv -f ./data/training_tasks.csv
pachctl create pipeline -f toloka_create_training_tasks.json
pachctl list job -p toloka_create_training_tasks
```

## Create pool

Now let's create a pool – a set of paid tasks sent out for completion at the same time.

You may run the pool creation script:

```bash
./create_toloka_pool.sh
```

or initialize pool step-by-step:

1. Initialize a Pachyderm repo for the pool config:

```bash
pachctl create repo toloka_pool
```

2. Put the provided pool config and task data (we will work with the tasks in the next stage; a sketch of a possible pool config follows this list):

```bash
pachctl put file toloka_pool@master:pool.json -f ./configs/pool.json
pachctl put file toloka_pool@master:control_tasks.csv -f ./data/control_tasks.csv
pachctl put file toloka_pool@master:pool_tasks.csv -f ./data/pool_tasks.csv
```

3. Check that the data was added to the repository correctly:

```bash
pachctl list file toloka_pool@master
```

4. Now let's create a pipeline

```bash
pachctl create pipeline -f toloka_create_pool.json
```

5. Let's check if the pipeline's job completed successfully

```bash
pachctl list job -p toloka_create_pool
```
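
The provided `configs/pool.json` isn't shown in this view. A Toloka pool config sets pricing, timing, overlap, and task mixing; since the aggregation step later relies on an overlap of 5, a minimal sketch might look like the following (all values here are illustrative, not the repo's actual config):

```json
{
  "project_id": "<your project id>",
  "private_name": "Clickbait headlines",
  "may_contain_adult_content": false,
  "reward_per_assignment": 0.01,
  "assignment_max_duration_seconds": 600,
  "defaults": {
    "default_overlap_for_new_task_suites": 5
  },
  "mixer_config": {
    "real_tasks_count": 9,
    "golden_tasks_count": 1,
    "training_tasks_count": 0
  }
}
```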

## Adding tasks to the pool

Each item we want annotated in Toloka is called a task. There are three types of tasks: training tasks, control tasks, and (regular) tasks.
You already worked with training tasks when you created the training: they carry a correct answer and a hint for Tolokers.
Control tasks are used to check Tolokers' labelling quality: they contain a correct answer, and Toloka checks whether each Toloker's response matches it. Tolokers who answer too many control tasks incorrectly may be banned.
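
To make the distinction concrete, here is roughly how the two kinds of tasks are constructed with the [toloka-kit](https://github.com/Toloka/toloka-kit) Python library, which the pipeline scripts presumably use under the hood (the `headline` and `category` field names are illustrative, not necessarily this project's actual spec):

```python
import toloka.client as toloka

# A regular task only carries input values; Tolokers supply the label.
regular_task = toloka.Task(
    pool_id='<pool id>',
    input_values={'headline': "Ten Tricks Doctors Don't Want You To Know"},
)

# A control task additionally carries the known correct answer, so Toloka
# can score each Toloker's response against it.
control_task = toloka.Task(
    pool_id='<pool id>',
    input_values={'headline': 'Parliament passes the 2022 budget bill'},
    known_solutions=[
        toloka.task.BaseTask.KnownSolution(output_values={'category': 'not_clickbait'})
    ],
    infinite_overlap=True,  # control tasks are usually reused indefinitely
)
```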

You may run the pool tasks creation script:

```bash
./create_toloka_pool_tasks.sh
```

or upload the tasks step-by-step:

1. Let's run a pipeline to upload these control tasks to Toloka:
```bash
pachctl create pipeline -f toloka_create_control_tasks.json
pachctl list job -p toloka_create_control_tasks
```
2. We also need to run a pipeline to upload the tasks we want annotated:
```bash
pachctl create pipeline -f toloka_create_pool_tasks.json
pachctl list job -p toloka_create_pool_tasks
```

## Wait for pool to complete

In the previous steps we created a pool, but we haven't started it yet. Let's do that now.

1. Create a pipeline

```bash
pachctl create pipeline -f toloka_wait_pool.json
```

2. Wait for the pool to be annotated by Tolokers

```bash
pachctl list job -p toloka_wait_pool
```
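
Under the hood, the waiting step presumably polls the pool status via toloka-kit until Toloka closes the pool (which happens once every task reaches its overlap). A minimal sketch of such a loop, not necessarily the repo's actual script:

```python
import os
import time

import toloka.client as toloka

client = toloka.TolokaClient(os.environ['TOLOKA_API_ACCESS_KEY'], 'PRODUCTION')
pool_id = '<pool id>'

# Toloka closes the pool automatically when all tasks are fully annotated,
# so we simply poll until the status changes.
while client.get_pool(pool_id).status == toloka.Pool.Status.OPEN:
    time.sleep(60)
print(f'Pool {pool_id} is complete')
```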


## Assignments aggregation

In the pool, we configured each heading to be annotated by 5 different Tolokers. Now we need to aggregate their responses so that
each heading gets exactly one category. We will use the `crowd-kit` library:

```bash
pachctl create pipeline -f toloka_aggregate_assignments.json
pachctl list job -p toloka_aggregate_assignments
```
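
[crowd-kit](https://github.com/Toloka/crowd-kit) implements standard answer-aggregation algorithms. A minimal sketch of the kind of aggregation this pipeline performs (the actual script, and which aggregation method it uses, may differ):

```python
import pandas as pd
from crowdkit.aggregation import DawidSkene

# One row per Toloker response: which task, which Toloker, which label.
# Depending on the crowd-kit version, the expected columns are
# ['task', 'worker', 'label'] or ['task', 'performer', 'label'].
responses = pd.DataFrame(
    [
        ('headline_1', 'toloker_a', 'clickbait'),
        ('headline_1', 'toloker_b', 'clickbait'),
        ('headline_1', 'toloker_c', 'not_clickbait'),
    ],
    columns=['task', 'worker', 'label'],
)

# Dawid-Skene iteratively estimates each Toloker's reliability and
# outputs a single label per task.
aggregated = DawidSkene(n_iter=100).fit_predict(responses)
print(aggregated)  # a pd.Series indexed by task
```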

## Further data processing
### Datasets concatenation

Suppose we want to enrich our training data with the newly labelled examples. In this step we
concatenate the original train dataset with the dataset just labelled in Toloka:

```bash
pachctl create pipeline -f concatenate_datasets.json
pachctl list job -p concatenate_datasets
```
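
Judging from the `concatenate_datasets.json` spec below (`--datasets` takes several CSV paths and `--output` takes one), the concatenation script is presumably a small pandas job along these lines (a sketch, not the repo's actual `concatenate_datasets.py`):

```python
import argparse

import pandas as pd


def main() -> None:
    parser = argparse.ArgumentParser(description='Concatenate CSV datasets.')
    parser.add_argument('--datasets', nargs='+', required=True,
                        help='Input CSV files with identical columns.')
    parser.add_argument('--output', required=True, help='Path for the merged CSV.')
    args = parser.parse_args()

    # Stack the datasets row-wise; ignore_index renumbers the rows.
    merged = pd.concat((pd.read_csv(path) for path in args.datasets),
                       ignore_index=True)
    merged.to_csv(args.output, index=False)


if __name__ == '__main__':
    main()
```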

### Training and testing Random Forest classifier

In most cases we annotate data in order to train a model on it. In this step we create a pipeline that gets train and test data from different sources, concatenates them if necessary, and trains a Random Forest classifier, evaluating accuracy and F1 score.

```bash
pachctl create pipeline -f train_test_model.json
pachctl list job -p train_test_model
pachctl get file train_test_model@master:random_forest_test.json 1> random_forest_test.json
cat random_forest_test.json
```
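
The training script itself isn't expanded on this page; a minimal sketch of what training and evaluating such a classifier on headline text could look like with scikit-learn (the TF-IDF features, column names, and output schema here are assumptions, not the repo's actual code):

```python
import json

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline

# Assumed CSV layout: a 'headline' text column and a 'label' target column.
train = pd.read_csv('enriched_train.csv')
test = pd.read_csv('test.csv')

# TF-IDF turns each headline into a sparse feature vector for the forest.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(train['headline'], train['label'])

predictions = model.predict(test['headline'])
metrics = {
    'accuracy': accuracy_score(test['label'], predictions),
    'f1': f1_score(test['label'], predictions, average='macro'),
}

# Write metrics in the spirit of random_forest_test.json above
# (the repo file's exact schema is a guess).
with open('random_forest_test.json', 'w') as f:
    json.dump(metrics, f)
```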
43 changes: 43 additions & 0 deletions example/concatenate_datasets.json
@@ -0,0 +1,43 @@
{
"pipeline": {
"name": "concatenate_datasets"
},
"description": "A pipeline that concatenates two datasets into one.",
"transform": {
"image": "toloka_pachyderm:latest",
"cmd": [
"python3",
"/code/concatenate_datasets.py",
"--datasets", "/pfs/clickbait_data/train.csv", "/pfs/toloka_aggregate_assignments/results.csv",
"--output", "/pfs/out/enriched_train.csv"
],
"secrets": [
{
"name": "toloka-api",
"env_var": "TOLOKA_API_ACCESS_KEY",
"key": "token"
}
]
},
"parallelism_spec": {
"constant": "1"
},
"input": {
"join": [
{
"pfs": {
"repo": "clickbait_data",
"glob": "/",
"join_on": "$1"
}
},
{
"pfs": {
"repo": "toloka_aggregate_assignments",
"glob": "/",
"join_on": "$1"
}
}
]
}
}
