diff --git a/examples/0. get-start-with-your-level/.DS_Store b/examples/0. get-start-with-your-level/.DS_Store new file mode 100644 index 00000000..aeb2efe3 Binary files /dev/null and b/examples/0. get-start-with-your-level/.DS_Store differ diff --git a/examples/0.getting_started/0.learn_the_basics/img/OAuth.png b/examples/0. get-start-with-your-level/0. learn_the_basics/img/OAuth.png similarity index 100% rename from examples/0.getting_started/0.learn_the_basics/img/OAuth.png rename to examples/0. get-start-with-your-level/0. learn_the_basics/img/OAuth.png diff --git a/examples/0.getting_started/0.learn_the_basics/img/drop-down.svg b/examples/0. get-start-with-your-level/0. learn_the_basics/img/drop-down.svg similarity index 100% rename from examples/0.getting_started/0.learn_the_basics/img/drop-down.svg rename to examples/0. get-start-with-your-level/0. learn_the_basics/img/drop-down.svg diff --git a/examples/0.getting_started/0.learn_the_basics/img/pool_preview.png b/examples/0. get-start-with-your-level/0. learn_the_basics/img/pool_preview.png similarity index 100% rename from examples/0.getting_started/0.learn_the_basics/img/pool_preview.png rename to examples/0. get-start-with-your-level/0. learn_the_basics/img/pool_preview.png diff --git a/examples/0.getting_started/0.learn_the_basics/img/possible_results.png b/examples/0. get-start-with-your-level/0. learn_the_basics/img/possible_results.png similarity index 100% rename from examples/0.getting_started/0.learn_the_basics/img/possible_results.png rename to examples/0. get-start-with-your-level/0. learn_the_basics/img/possible_results.png diff --git a/examples/0.getting_started/0.learn_the_basics/img/preview.svg b/examples/0. get-start-with-your-level/0. learn_the_basics/img/preview.svg similarity index 100% rename from examples/0.getting_started/0.learn_the_basics/img/preview.svg rename to examples/0. get-start-with-your-level/0. 
learn_the_basics/img/preview.svg diff --git a/examples/0.getting_started/0.learn_the_basics/img/project_look.png b/examples/0. get-start-with-your-level/0. learn_the_basics/img/project_look.png similarity index 100% rename from examples/0.getting_started/0.learn_the_basics/img/project_look.png rename to examples/0. get-start-with-your-level/0. learn_the_basics/img/project_look.png diff --git a/examples/0.getting_started/0.learn_the_basics/img/project_with_pool.png b/examples/0. get-start-with-your-level/0. learn_the_basics/img/project_with_pool.png similarity index 100% rename from examples/0.getting_started/0.learn_the_basics/img/project_with_pool.png rename to examples/0. get-start-with-your-level/0. learn_the_basics/img/project_with_pool.png diff --git a/examples/0.getting_started/0.learn_the_basics/img/results_preview.png b/examples/0. get-start-with-your-level/0. learn_the_basics/img/results_preview.png similarity index 100% rename from examples/0.getting_started/0.learn_the_basics/img/results_preview.png rename to examples/0. get-start-with-your-level/0. learn_the_basics/img/results_preview.png diff --git a/examples/0.getting_started/0.learn_the_basics/img/task_look.png b/examples/0. get-start-with-your-level/0. learn_the_basics/img/task_look.png similarity index 100% rename from examples/0.getting_started/0.learn_the_basics/img/task_look.png rename to examples/0. get-start-with-your-level/0. learn_the_basics/img/task_look.png diff --git a/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb b/examples/0. get-start-with-your-level/0. learn_the_basics/learn_the_basics.ipynb similarity index 97% rename from examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb rename to examples/0. get-start-with-your-level/0. learn_the_basics/learn_the_basics.ipynb index 5faa4072..32f4121e 100644 --- a/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb +++ b/examples/0. get-start-with-your-level/0. 
learn_the_basics/learn_the_basics.ipynb @@ -1,812 +1,812 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction to Toloka and Toloka API\n", - "\n", - "Toloka is a crowdsourcing platform that helps to analyze large volumes of data in a short period of time.\n", - "\n", - "Examples of common tasks:\n", - "* Group the wide variety of items in your online store into categories.\n", - "* Find or verify information.\n", - "* Translate texts.\n", - "\n", - "[Toloka-Kit](https://github.com/Toloka/toloka-kit) is an open-source library, integrated into Toloka API functionality.\n", - "\n", - "**Useful links:**\n", - "\n", - "- [Toloka Kit documentation](https://toloka.ai/en/docs/toloka-kit/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", - "- [Toloka homepage](https://toloka.ai/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", - "- [Toloka requester's guide](https://toloka.ai/en/docs/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", - "- [Toloka API documentation](https://toloka.ai/en/docs/api/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", - "\n", - "The best way to start is to test Toloka web interface by trying out [one of the tutorials](https://toloka.ai/en/docs/guide/concepts/usecases?utm_source=github&utm_medium=site&utm_campaign=tolokakit).\n", - "\n", - "## Registration\n", - "\n", - "1. [Register](https://toloka.ai/en/docs/guide/concepts/access?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in Toloka as a requester.\n", - "2. Choose the backend:\n", - " * The [production backend](https://platform.toloka.ai/for-requesters/?utm_source=github&utm_medium=site&utm_campaign=tolokakit) is used by default in this example.\n", - " * The [sandbox backend](https://platform.sandbox.toloka.ai/for-requesters/?utm_source=github&utm_medium=site&utm_campaign=tolokakit) is a testing environment for Toloka. 
[Learn more](https://toloka.ai/en/docs/guide/concepts/sandbox?utm_source=github&utm_medium=site&utm_campaign=tolokakit).\n", - "3. [Add funds](https://toloka.ai/en/docs/guide/concepts/refill?utm_source=github&utm_medium=site&utm_campaign=tolokakit) to your Toloka account, if you're going to use the production version.\n", - "4. [Get an OAuth token](https://toloka.ai/en/docs/api/concepts/access#token?utm_source=github&utm_medium=site&utm_campaign=tolokakit) for your version. Go to **Profile** → **Integrations** → **Get OAuth Token**.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"OAuth\n", - "
\n", - " Figure 1. How to get an OAuth token.\n", - "
\n", - "\n", - "### Call to action\n", - "If you found some bugs or have a new feature idea, don't hesitate to [open a new issue on Github](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", - "Like our library and examples? Star [our repo on Github](https://github.com/Toloka/toloka-kit)\n", - "\n", - "## Getting started with Toloka-Kit\n", - "Install Toloka-Kit, import the necessary libraries into your Python script and set up logging." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%capture\n", - "!pip install toloka-kit==1.0.2\n", - "!pip install pandas\n", - "!pip install ipyplot\n", - "\n", - "import datetime\n", - "import time\n", - "import logging\n", - "import sys\n", - "import getpass\n", - "\n", - "import pandas\n", - "import ipyplot\n", - "\n", - "import toloka.client as toloka\n", - "import toloka.client.project.template_builder as tb\n", - "\n", - "logging.basicConfig(\n", - " format='[%(levelname)s] %(name)s: %(message)s',\n", - " level=logging.INFO,\n", - " stream=sys.stdout,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Сreate a Toloka client instance. All API calls will go through it." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'\n", - "# Lines below check that the OAuth token is correct and print your account's name\n", - "print(toloka_client.get_requester())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Cells below can help you learn more about an object or a method you are interested in." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "toloka.TolokaClient?" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "toloka.TolokaClient.get_requester?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "toloka.requester.Requester?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Toloka entities and how to manage them with Toloka-Kit\n", - "\n", - "### Project\n", - "A [project](https://toloka.ai/en/docs/guide/concepts/overview?utm_source=github&utm_medium=site&utm_campaign=tolokakit#project) is a top-level object. It contains instructions, task interface settings, input and output data specification, and default quality control rules for this project pools. Projects make it easier for you to post similar tasks in the future, because you don't have to re-configure the interface.\n", - "\n", - "The easier the task, the better the results. If your task contains more than one question, you should divide it into several projects.\n", - "\n", - "In this tutorial you will create a project with tasks that ask Tolokers to specify the type of animal depicted in a photo." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "new_project = toloka.Project(\n", - " public_name='Cat or Dog?',\n", - " public_description='Specify the type of animal depicted in a photo.',\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The cell above creates an object in your device memory. 
This is not all, the project must also contain:\n", - "* [Input and output data specification](https://toloka.ai/en/docs/guide/concepts/incoming?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", - "* [Task interface settings](https://toloka.ai/en/docs/guide/concepts/spec?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", - "* [An instruction](ttps://toloka.ai/en/docs/guide/concepts/instruction?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", - "\n", - "**Important:** Several cells below will add changes to the object stored in your device memory. The data will only be sent to the server after calling one of the `toloka_client` methods.\n", - "\n", - "#### Input and output data\n", - "\n", - "The `image` input field contains URLs of images that need to be labeled.\n", - "\n", - "The `result` output field will receive `cat` and `dog` labels." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "input_specification = {'image': toloka.project.UrlSpec()}\n", - "output_specification = {'result': toloka.project.StringSpec()}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Task interface\n", - "\n", - "The task interface displays the main task elements to Tolokers. It's important because it is how Tolokers see your tasks. If it's too complicated and unclear, the labeling results might be poor.\n", - "\n", - "There are two editors available in Toloka:\n", - "* [HTML/CSS/JS editor](https://toloka.ai/en/docs/guide/concepts/spec?utm_source=github&utm_medium=site&utm_campaign=tolokakit#interface-section)\n", - "* [Template Builder](https://toloka.ai/en/docs/template-builder/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", - "\n", - "Template Builder configures task interface at the entity level. We recommend it for your projects, especially for the first ones.\n", - "\n", - "The cell below creates a task interface for our project." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# This component shows images\n", - "image_viewer = tb.ImageViewV1(tb.InputData('image'), ratio=[1, 1])\n", - "\n", - "# This component allows to select a label\n", - "radio_group_field = tb.RadioGroupFieldV1(\n", - " tb.OutputData('result'),\n", - " [\n", - " tb.fields.GroupFieldOption('cat', 'Cat'),\n", - " tb.fields.GroupFieldOption('dog', 'Dog')\n", - " ],\n", - " validation=tb.RequiredConditionV1(),\n", - ")\n", - "\n", - "# Allows to set a width limit when displaying a task\n", - "task_width_plugin = tb.TolokaPluginV1(\n", - " 'scroll',\n", - " task_width=400,\n", - ")\n", - "\n", - "# How performers will see the task\n", - "project_interface = toloka.project.TemplateBuilderViewSpec(\n", - " view=tb.ListViewV1([image_viewer, radio_group_field]),\n", - " plugins=[task_width_plugin],\n", - ")\n", - "\n", - "# This block assigns task interface and input/output data specification to the project\n", - "# Note that this is done via the task specification class\n", - "new_project.task_spec = toloka.project.task_spec.TaskSpec(\n", - " input_spec=input_specification,\n", - " output_spec=output_specification,\n", - " view_spec=project_interface,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Task instructions\n", - "\n", - "The first thing the Tolokers see when they select a task are the [instructions](https://toloka.ai/en/docs/guide/concepts/instruction?utm_source=github&utm_medium=site&utm_campaign=tolokakit) that you wrote. Describe what needs to be done in simple and clear language, and give examples.\n", - "\n", - "Good instructions help the Tolokers complete the task correctly. The clarity and completeness of the instructions affect the response quality and the project rating. 
Unclear or too complex instructions, on the contrary, will scare off Tolokers.\n", - "\n", - "Create the instructions for your project with the following cell." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "new_project.public_instructions = 'Look at the picture. Determine what is on it: a cat or a dog. Choose the correct option.'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Create a project\n", - "\n", - "Now, use `toloka_client` defined at the beginning to create the project.\n", - "\n", - "The data is only sent to the server after calling one of the `toloka_client` methods, the cell below actually creates a project." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "new_project = toloka_client.create_project(new_project)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Project preview\n", - "\n", - "1. Preview project\n", - "\n", - "2. Go to the project page to make sure the task interface works correctly. To do this, click the link in the output of the cell above.\n", - "\n", - "3. In the upper-right corner of the project page click ![Project actions](./img/drop-down.svg) → **![Preview](./img/preview.svg) Preview**:\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Project\n", - "
\n", - " Figure 2. What the project interface might look like.\n", - "
\n", - "\n", - "4. In the upper part of the preview page click **Change input data**, and insert an image URL into the `image` field, then click **Apply**.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Task\n", - "
\n", - " Figure 3. What the task interface might look like and how to insert images in the preview.\n", - "
\n", - "\n", - "5. In the upper part of the preview demonstration click **Instructions**. Make sure the instructions are shown and that they say what you want them to.\n", - "\n", - "6. Select an option in your task. In the lower-left corner of the preview demonstration click **Submit**, then **View responses**. In the appeared result window, check that your results are written in expected format and that the entered data is correct.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Result\n", - "
\n", - " Figure 4. What the results might look like.\n", - "
\n", - "\n", - "Tips:\n", - "* We strongly recommend **checking the task interface and instructions** every time you create a project. This will help you to ensure that the Tolokers will complete the task and that your results will be useful.\n", - "* Do a **trial run** with a small amount of data. Make sure that after running the entire pipeline you get the data in the expected format and quality." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Pool\n", - "A [pool](https://toloka.ai/en/docs/guide/concepts/overview?utm_source=github&utm_medium=site&utm_campaign=tolokakit#pool) is a set of tasks that share common pricing, start date, selection of Tolokers, overlap, and quality control configurations. All task in a pool are processed in parallel. One project can have several pools. You can add new tasks to a pool at any time, as well as open or stop it.\n", - "\n", - "The cell below will create a pool as an object in your device memory. You will send it to Toloka with `toloka_client` method a bit later." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "new_pool = toloka.Pool(\n", - " project_id=new_project.id,\n", - " private_name='Pool 1', # Only you can see this information\n", - " may_contain_adult_content=False,\n", - " reward_per_assignment=0.01, # Sets the minimum payment amount for one task suite in USD\n", - " assignment_max_duration_seconds=60*5, # Gives performers 5 minutes to complete one task suite\n", - " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365), # Sets that the pool will close after one year\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Overlap\n", - "\n", - "To minimize the risk of getting wrong answers, you can ask several Tolokers to complete the same task. This is called _overlap_.\n", - "\n", - "In this example we set the overlap to `3`. 
This means that **every** task will be completed by **three** different Tolokers." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "new_pool.defaults = toloka.pool.Pool.Defaults(\n", - " default_overlap_for_new_tasks=3,\n", - " default_overlap_for_new_task_suites=0,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Task suite\n", - "\n", - "A [task suite](https://toloka.ai/en/docs/guide/concepts/overview?utm_source=github&utm_medium=site&utm_campaign=tolokakit#tasks-page) is a set of tasks that are shown on a single page.\n", - "\n", - "An important part of configuring pools is to decide how many tasks will be issued to a Toloker at once. For example, if you set 3 tasks for a task suite, a Toloker will see three images at once on one page.\n", - "\n", - "Note that the `reward_per_assignment` and `assignment_max_duration_seconds` fields in pool settings set the price and time for one **task suite**, not task.\n", - "\n", - "Why you should combine tasks in a task suite:\n", - "\n", - "* To set a more precise price for a single task.\n", - "* To calculate a Toloker's skill and use it to determine the correct answer more accurately. Learn more in the [Aggregation](#aggregation) section.\n", - "* To better apply quality control settings that improve the final quality of the response. Learn more in the [Quality control rules](#quality_control_rules) section." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "new_pool.set_mixer_config(\n", - " real_tasks_count=10, # The number of tasks per page.\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Filters\n", - "\n", - "[Filters](https://toloka.ai/en/docs/guide/concepts/filters?utm_source=github&utm_medium=site&utm_campaign=tolokakit) help you select Tolokers for your project.\n", - "\n", - "There may be different reasons to use filters, for example:\n", - "\n", - "* You require some specific group of Tolokers for your pool.\n", - "* You want to exclude a certain group of Tolokers.\n", - "\n", - "Tasks will only be shown to matching Tolokers, rather than to all of them.\n", - "\n", - "This example requires English-speaking Tolokers, because the project instructions are in English." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "new_pool.filter = toloka.filter.Languages.in_('EN')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Quality control rules\n", - "\n", - "[Quality control rules](https://toloka.ai/en/docs/guide/concepts/check-performers?utm_source=github&utm_medium=site&utm_campaign=tolokakit) regulate task completion and Toloker access.\n", - "\n", - "Quality control lets you get more accurate responses and restrict access to tasks for cheating users. All rules work independently. Learn more about [settting up quality control](https://toloka.ai/en/docs/guide/concepts/qa-pool-settings?utm_source=github&utm_medium=site&utm_campaign=tolokakit).\n", - "\n", - "This example uses the captcha rule. It is the simplest way to exclude fake users (robots) and cheaters." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Turns on captchas\n", - "new_pool.set_captcha_frequency('MEDIUM')\n", - "# Bans performers by captcha criteria\n", - "new_pool.quality_control.add_action(\n", - " # Type of quality control rule\n", - " collector=toloka.collectors.Captcha(history_size=5),\n", - " # This condition triggers the action below\n", - " # Here overridden comparison operator actually returns a Condition object\n", - " conditions=[toloka.conditions.FailRate > 20],\n", - " # What exactly should the rule do when the condition is met\n", - " # It bans the performer for 1 day\n", - " action=toloka.actions.RestrictionV2(\n", - " scope='PROJECT',\n", - " duration=1,\n", - " duration_unit='DAYS',\n", - " private_comment='Captcha',\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create pool\n", - "\n", - "The cell below creates a pool with all the information above which was stored in your device memory." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "new_pool = toloka_client.create_pool(new_pool)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Open your project page. You will see your new pool.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Project\n", - "
\n", - " Figure 5. Project interface with a pool.\n", - "
\n", - "\n", - "The pool interface looks like this.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Pool\n", - "
\n", - " Figure 6. Pool interface.\n", - "
\n", - "\n", - "Right now the pool is empty and closed. It has no tasks or task suites." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Upload tasks\n", - "\n", - "A [task](https://toloka.ai/en/docs/guide/concepts/overview?utm_source=github&utm_medium=site&utm_campaign=tolokakit#task) is the smallest portion of data you need to mark up.\n", - "\n", - "This example uses a small data set with images. This dataset is collected by the Toloka team and distributed under a\n", - "[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Download the data set\n", - "!curl https://tlk.s3.yandex.net/dataset/cats_vs_dogs/toy_dataset.tsv --output dataset.tsv\n", - "\n", - "dataset = pandas.read_csv('dataset.tsv', sep='\\t')\n", - "\n", - "print(f'Dataset contains {len(dataset)} rows\\n')\n", - "\n", - "dataset = dataset.sample(frac=1).reset_index(drop=True)\n", - "\n", - "ipyplot.plot_images(\n", - " images=[row['url'] for _, row in dataset.iterrows()],\n", - " labels=[row['label'] for _, row in dataset.iterrows()],\n", - " max_images=12,\n", - " img_width=300,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create tasks. One task will be created from one image.\n", - "\n", - "Toloka will automatically create task suites and show the tasks depending on a project overlap:\n", - "\n", - "1. One task suite will consist of 10 tasks.\n", - "2. Toloka will let 3 different Tolokers to complete the tasks.\n", - "\n", - "We configured these settings while creating the pool." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tasks = [\n", - " toloka.Task(input_values={'image': url}, pool_id=new_pool.id)\n", - " for url in dataset['url']\n", - "]\n", - "# Add tasks to a pool\n", - "toloka_client.create_tasks(tasks, allow_defaults=True)\n", - "print(f'Populated pool with {len(tasks)} tasks')\n", - "print(f'To view this pool, go to https://toloka.dev/requester/project/{new_project.id}/pool/{new_pool.id}')\n", - "# print(f'To view this pool, go to https://sandbox.toloka.dev/requester/project/{new_project.id}/pool/{new_pool.id}') # Print a sandbox version link\n", - "\n", - "# Opens the pool\n", - "new_pool = toloka_client.open_pool(new_pool.id)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When you open your pool, Tolokers will see your tasks in their mobile app or in Toloka web version, and start working on them.\n", - "\n", - "In small pools like this, it usually takes up to 10 minutes for all the tasks to be performed.\n", - "\n", - "With big pools, we recommend that you set up automatic waiting. 
See the example in the cell below.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pool_id = new_pool.id\n", - "\n", - "def wait_pool_for_close(pool_id, minutes_to_wait=1):\n", - " sleep_time = 60 * minutes_to_wait\n", - " pool = toloka_client.get_pool(pool_id)\n", - " while not pool.is_closed():\n", - " op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])\n", - " op = toloka_client.wait_operation(op)\n", - " percentage = op.details['value'][0]['result']['value']\n", - " print(\n", - " f' {datetime.datetime.now().strftime(\"%H:%M:%S\")}\\t'\n", - " f'Pool {pool.id} - {percentage}%'\n", - " )\n", - " time.sleep(sleep_time)\n", - " pool = toloka_client.get_pool(pool.id)\n", - " print('Pool was closed.')\n", - "\n", - "wait_pool_for_close(pool_id)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get responses\n", - "\n", - "When all the tasks are completed, look at the responses from Tolokers." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "answers_df = toloka_client.get_assignments_df(pool_id)\n", - "# Prepare dataframe for aggregation\n", - "answers_df = answers_df.rename(columns={\n", - " 'INPUT:image': 'task',\n", - " 'OUTPUT:result': 'label',\n", - " 'ASSIGNMENT:worker_id': 'worker',\n", - "})\n", - "\n", - "print(f'answers count: {len(answers_df)}')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "An `assignment` value is one Toloker's responses to all the tasks on a task suite.\n", - "\n", - "If a Toloker completed several task suites, then `toloka_client.get_assignments_df` will contain several `assignment` values." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Aggregation \n", - "\n", - "You should run the [results aggregation](https://toloka.ai/en/docs/guide/concepts/result-aggregation?utm_source=github&utm_medium=site&utm_campaign=tolokakit) only if you set the overlap for the tasks to 2 or higher.\n", - "\n", - "The [majority vote](https://toloka.ai/en/docs/guide/concepts/mvote?utm_source=github&utm_medium=site&utm_campaign=tolokakit) method is a quality control method based on matching responses from the majority of Tolokers who complete the same task. For example, if 2 out of 3 Tolokers selected the `cat` label, then the final label for this task will be `cat`.\n", - "\n", - "Majority vote is easily implemented, but you can also use our crowdsourcing [Crowd-Kit library](https://github.com/Toloka/crowd-kit?utm_source=github&utm_medium=site&utm_campaign=tolokakit). It contains a lot of new aggregation methods." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install crowd-kit==1.1.0\n", - "from crowdkit.aggregation import MajorityVote" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "MajorityVote?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Run majority vote aggregation\n", - "predicted_answers = MajorityVote().fit_predict(answers_df)\n", - "\n", - "print(predicted_answers)\n", - "\n", - "# Some preparations for displaying the results\n", - "predicted_answers = predicted_answers.sample(frac=1)\n", - "images = predicted_answers.index.values\n", - "labels = predicted_answers.values\n", - "start_with = 0" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Look at the results.\n", - "\n", - "Note: The cell below can be run several times." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "if start_with >= len(predicted_answers):\n", - " print('no more images')\n", - "else:\n", - " ipyplot.plot_images(\n", - " images=images[start_with:],\n", - " labels=labels[start_with:],\n", - " max_images=12,\n", - " img_width=300,\n", - " )\n", - "\n", - " start_with += 12" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can see the labeled images. Some possible results are shown in figure 7 below.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Possible\n", - "
\n", - " Figure 7. Possible results.\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Summary\n", - "\n", - "This example explained basic Toloka entities and how Toloka-Kit can work with them.\n", - "\n", - "The described project (classification) is very useful for:\n", - "* Accurate evaluation.\n", - "* Checking the results of a complex project, as in [object detection example](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/object_detection) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/object_detection/object_detection.ipynb)." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3.10.6 64-bit", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.6" - }, - "vscode": { - "interpreter": { - "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a" - } - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Introduction to Toloka and Toloka API\n", + "\n", + "Toloka is a crowdsourcing platform that helps to analyze large volumes of data in a short period of time.\n", + "\n", + "Examples of common tasks:\n", + "* Group the wide variety of items in your online store into categories.\n", + "* Find or verify information.\n", + "* Translate texts.\n", + "\n", + "[Toloka-Kit](https://github.com/Toloka/toloka-kit) is an open-source library, integrated into Toloka API functionality.\n", + "\n", + "**Useful links:**\n", + "\n", + "- [Toloka Kit documentation](https://toloka.ai/en/docs/toloka-kit/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", + "- [Toloka 
homepage](https://toloka.ai/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", + "- [Toloka requester's guide](https://toloka.ai/en/docs/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", + "- [Toloka API documentation](https://toloka.ai/en/docs/api/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", + "\n", + "The best way to start is to test Toloka web interface by trying out [one of the tutorials](https://toloka.ai/en/docs/guide/concepts/usecases?utm_source=github&utm_medium=site&utm_campaign=tolokakit).\n", + "\n", + "## Registration\n", + "\n", + "1. [Register](https://toloka.ai/en/docs/guide/concepts/access?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in Toloka as a requester.\n", + "2. Choose the backend:\n", + " * The [production backend](https://platform.toloka.ai/for-requesters/?utm_source=github&utm_medium=site&utm_campaign=tolokakit) is used by default in this example.\n", + " * The [sandbox backend](https://platform.sandbox.toloka.ai/for-requesters/?utm_source=github&utm_medium=site&utm_campaign=tolokakit) is a testing environment for Toloka. [Learn more](https://toloka.ai/en/docs/guide/concepts/sandbox?utm_source=github&utm_medium=site&utm_campaign=tolokakit).\n", + "3. [Add funds](https://toloka.ai/en/docs/guide/concepts/refill?utm_source=github&utm_medium=site&utm_campaign=tolokakit) to your Toloka account, if you're going to use the production version.\n", + "4. [Get an OAuth token](https://toloka.ai/en/docs/api/concepts/access#token?utm_source=github&utm_medium=site&utm_campaign=tolokakit) for your version. Go to **Profile** → **Integrations** → **Get OAuth Token**.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"OAuth\n", + "
\n", + " Figure 1. How to get an OAuth token.\n", + "
\n", + "\n", + "### Call to action\n", + "If you find a bug or have an idea for a new feature, don't hesitate to [open a new issue on GitHub](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", + "Like our library and examples? Star [our repo on GitHub](https://github.com/Toloka/toloka-kit).\n", + "\n", + "## Getting started with Toloka-Kit\n", + "Install Toloka-Kit, import the necessary libraries into your Python script, and set up logging." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==1.0.2\n", + "!pip install pandas\n", + "!pip install ipyplot\n", + "\n", + "import datetime\n", + "import time\n", + "import logging\n", + "import sys\n", + "import getpass\n", + "\n", + "import pandas\n", + "import ipyplot\n", + "\n", + "import toloka.client as toloka\n", + "import toloka.client.project.template_builder as tb\n", + "\n", + "logging.basicConfig(\n", + "    format='[%(levelname)s] %(name)s: %(message)s',\n", + "    level=logging.INFO,\n", + "    stream=sys.stdout,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a Toloka client instance. All API calls will go through it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION')  # Or switch to 'SANDBOX'\n", + "# The line below checks that the OAuth token is correct and prints your account's name\n", + "print(toloka_client.get_requester())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The cells below can help you learn more about an object or a method you are interested in." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "toloka.TolokaClient?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "toloka.TolokaClient.get_requester?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "toloka.requester.Requester?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Toloka entities and how to manage them with Toloka-Kit\n", + "\n", + "### Project\n", + "A [project](https://toloka.ai/en/docs/guide/concepts/overview?utm_source=github&utm_medium=site&utm_campaign=tolokakit#project) is a top-level object. It contains instructions, task interface settings, input and output data specification, and default quality control rules for this project pools. Projects make it easier for you to post similar tasks in the future, because you don't have to re-configure the interface.\n", + "\n", + "The easier the task, the better the results. If your task contains more than one question, you should divide it into several projects.\n", + "\n", + "In this tutorial you will create a project with tasks that ask Tolokers to specify the type of animal depicted in a photo." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_project = toloka.Project(\n", + " public_name='Cat or Dog?',\n", + " public_description='Specify the type of animal depicted in a photo.',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The cell above creates an object in your device memory. 
This is not all: the project must also contain:\n", + "* [Input and output data specification](https://toloka.ai/en/docs/guide/concepts/incoming?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", + "* [Task interface settings](https://toloka.ai/en/docs/guide/concepts/spec?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", + "* [An instruction](https://toloka.ai/en/docs/guide/concepts/instruction?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", + "\n", + "**Important:** The next several cells modify the object stored in your device memory. The data will only be sent to the server after calling one of the `toloka_client` methods.\n", + "\n", + "#### Input and output data\n", + "\n", + "The `image` input field contains URLs of images that need to be labeled.\n", + "\n", + "The `result` output field will receive `cat` and `dog` labels." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "input_specification = {'image': toloka.project.UrlSpec()}\n", + "output_specification = {'result': toloka.project.StringSpec()}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task interface\n", + "\n", + "The task interface displays the main task elements to Tolokers. It's important because it is how Tolokers see your tasks. If it's too complicated or unclear, the labeling results might be poor.\n", + "\n", + "There are two editors available in Toloka:\n", + "* [HTML/CSS/JS editor](https://toloka.ai/en/docs/guide/concepts/spec?utm_source=github&utm_medium=site&utm_campaign=tolokakit#interface-section)\n", + "* [Template Builder](https://toloka.ai/en/docs/template-builder/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)\n", + "\n", + "Template Builder configures the task interface at the entity level. We recommend it for your projects, especially for the first ones.\n", + "\n", + "The cell below creates a task interface for our project." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# This component shows images\n", + "image_viewer = tb.ImageViewV1(tb.InputData('image'), ratio=[1, 1])\n", + "\n", + "# This component allows to select a label\n", + "radio_group_field = tb.RadioGroupFieldV1(\n", + " tb.OutputData('result'),\n", + " [\n", + " tb.fields.GroupFieldOption('cat', 'Cat'),\n", + " tb.fields.GroupFieldOption('dog', 'Dog')\n", + " ],\n", + " validation=tb.RequiredConditionV1(),\n", + ")\n", + "\n", + "# Allows to set a width limit when displaying a task\n", + "task_width_plugin = tb.TolokaPluginV1(\n", + " 'scroll',\n", + " task_width=400,\n", + ")\n", + "\n", + "# How performers will see the task\n", + "project_interface = toloka.project.TemplateBuilderViewSpec(\n", + " view=tb.ListViewV1([image_viewer, radio_group_field]),\n", + " plugins=[task_width_plugin],\n", + ")\n", + "\n", + "# This block assigns task interface and input/output data specification to the project\n", + "# Note that this is done via the task specification class\n", + "new_project.task_spec = toloka.project.task_spec.TaskSpec(\n", + " input_spec=input_specification,\n", + " output_spec=output_specification,\n", + " view_spec=project_interface,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task instructions\n", + "\n", + "The first thing the Tolokers see when they select a task are the [instructions](https://toloka.ai/en/docs/guide/concepts/instruction?utm_source=github&utm_medium=site&utm_campaign=tolokakit) that you wrote. Describe what needs to be done in simple and clear language, and give examples.\n", + "\n", + "Good instructions help the Tolokers complete the task correctly. The clarity and completeness of the instructions affect the response quality and the project rating. 
Unclear or overly complex instructions, by contrast, will scare off Tolokers.\n", + "\n", + "Create the instructions for your project with the following cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_project.public_instructions = 'Look at the picture. Determine what is on it: a cat or a dog. Choose the correct option.'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Create a project\n", + "\n", + "Now, use the `toloka_client` defined at the beginning to create the project.\n", + "\n", + "Data is sent to the server only when you call one of the `toloka_client` methods, so the cell below is what actually creates the project." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_project = toloka_client.create_project(new_project)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Project preview\n", + "\n", + "1. Preview project\n", + "\n", + "2. Go to the project page to make sure the task interface works correctly. To do this, click the link in the output of the cell above.\n", + "\n", + "3. In the upper-right corner of the project page, click ![Project actions](./img/drop-down.svg) → **![Preview](./img/preview.svg) Preview**:\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Project\n", + "
\n", + " Figure 2. What the project interface might look like.\n", + "
\n", + "\n", + "4. In the upper part of the preview page click **Change input data**, and insert an image URL into the `image` field, then click **Apply**.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Task\n", + "
\n", + " Figure 3. What the task interface might look like and how to insert images in the preview.\n", + "
\n", + "\n", + "5. In the upper part of the preview demonstration click **Instructions**. Make sure the instructions are shown and that they say what you want them to.\n", + "\n", + "6. Select an option in your task. In the lower-left corner of the preview demonstration click **Submit**, then **View responses**. In the appeared result window, check that your results are written in expected format and that the entered data is correct.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Result\n", + "
\n", + " Figure 4. What the results might look like.\n", + "
\n", + "\n", + "Tips:\n", + "* We strongly recommend **checking the task interface and instructions** every time you create a project. This will help you to ensure that the Tolokers will complete the task and that your results will be useful.\n", + "* Do a **trial run** with a small amount of data. Make sure that after running the entire pipeline you get the data in the expected format and quality." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pool\n", + "A [pool](https://toloka.ai/en/docs/guide/concepts/overview?utm_source=github&utm_medium=site&utm_campaign=tolokakit#pool) is a set of tasks that share common pricing, start date, selection of Tolokers, overlap, and quality control configurations. All tasks in a pool are processed in parallel. One project can have several pools. You can add new tasks to a pool at any time, as well as open or stop it.\n", + "\n", + "The cell below will create a pool as an object in your device memory. You will send it to Toloka with a `toloka_client` method a bit later." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_pool = toloka.Pool(\n", + "    project_id=new_project.id,\n", + "    private_name='Pool 1',  # Only you can see this information\n", + "    may_contain_adult_content=False,\n", + "    reward_per_assignment=0.01,  # Sets the minimum payment amount for one task suite in USD\n", + "    assignment_max_duration_seconds=60*5,  # Gives performers 5 minutes to complete one task suite\n", + "    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),  # The pool will close after one year\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Overlap\n", + "\n", + "To minimize the risk of getting wrong answers, you can ask several Tolokers to complete the same task. This is called _overlap_.\n", + "\n", + "In this example we set the overlap to `3`. 
This means that **every** task will be completed by **three** different Tolokers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_pool.defaults = toloka.pool.Pool.Defaults(\n", + " default_overlap_for_new_tasks=3,\n", + " default_overlap_for_new_task_suites=0,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Task suite\n", + "\n", + "A [task suite](https://toloka.ai/en/docs/guide/concepts/overview?utm_source=github&utm_medium=site&utm_campaign=tolokakit#tasks-page) is a set of tasks that are shown on a single page.\n", + "\n", + "An important part of configuring pools is to decide how many tasks will be issued to a Toloker at once. For example, if you set 3 tasks for a task suite, a Toloker will see three images at once on one page.\n", + "\n", + "Note that the `reward_per_assignment` and `assignment_max_duration_seconds` fields in pool settings set the price and time for one **task suite**, not task.\n", + "\n", + "Why you should combine tasks in a task suite:\n", + "\n", + "* To set a more precise price for a single task.\n", + "* To calculate a Toloker's skill and use it to determine the correct answer more accurately. Learn more in the [Aggregation](#aggregation) section.\n", + "* To better apply quality control settings that improve the final quality of the response. Learn more in the [Quality control rules](#quality_control_rules) section." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_pool.set_mixer_config(\n", + "    real_tasks_count=10,  # The number of tasks per page.\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Filters\n", + "\n", + "[Filters](https://toloka.ai/en/docs/guide/concepts/filters?utm_source=github&utm_medium=site&utm_campaign=tolokakit) help you select Tolokers for your project.\n", + "\n", + "There may be different reasons to use filters, for example:\n", + "\n", + "* You require some specific group of Tolokers for your pool.\n", + "* You want to exclude a certain group of Tolokers.\n", + "\n", + "Tasks will only be shown to matching Tolokers, rather than to all of them.\n", + "\n", + "This example requires English-speaking Tolokers, because the project instructions are in English." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_pool.filter = toloka.filter.Languages.in_('EN')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Quality control rules\n", + "\n", + "[Quality control rules](https://toloka.ai/en/docs/guide/concepts/check-performers?utm_source=github&utm_medium=site&utm_campaign=tolokakit) regulate task completion and Toloker access.\n", + "\n", + "Quality control lets you get more accurate responses and restrict access to tasks for cheating users. All rules work independently. Learn more about [setting up quality control](https://toloka.ai/en/docs/guide/concepts/qa-pool-settings?utm_source=github&utm_medium=site&utm_campaign=tolokakit).\n", + "\n", + "This example uses the captcha rule. It is the simplest way to exclude fake users (robots) and cheaters." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Turns on captchas\n", + "new_pool.set_captcha_frequency('MEDIUM')\n", + "# Bans performers by captcha criteria\n", + "new_pool.quality_control.add_action(\n", + " # Type of quality control rule\n", + " collector=toloka.collectors.Captcha(history_size=5),\n", + " # This condition triggers the action below\n", + " # Here overridden comparison operator actually returns a Condition object\n", + " conditions=[toloka.conditions.FailRate > 20],\n", + " # What exactly should the rule do when the condition is met\n", + " # It bans the performer for 1 day\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='PROJECT',\n", + " duration=1,\n", + " duration_unit='DAYS',\n", + " private_comment='Captcha',\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create pool\n", + "\n", + "The cell below creates a pool with all the information above which was stored in your device memory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_pool = toloka_client.create_pool(new_pool)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Open your project page. You will see your new pool.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Project\n", + "
\n", + " Figure 5. Project interface with a pool.\n", + "
\n", + "\n", + "The pool interface looks like this.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Pool\n", + "
\n", + " Figure 6. Pool interface.\n", + "
\n", + "\n", + "Right now the pool is empty and closed. It has no tasks or task suites." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Upload tasks\n", + "\n", + "A [task](https://toloka.ai/en/docs/guide/concepts/overview?utm_source=github&utm_medium=site&utm_campaign=tolokakit#task) is the smallest portion of data you need to mark up.\n", + "\n", + "This example uses a small dataset with images. This dataset was collected by the Toloka team and is distributed under a\n", + "[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Download the dataset\n", + "!curl https://tlk.s3.yandex.net/dataset/cats_vs_dogs/toy_dataset.tsv --output dataset.tsv\n", + "\n", + "dataset = pandas.read_csv('dataset.tsv', sep='\\t')\n", + "\n", + "print(f'Dataset contains {len(dataset)} rows\\n')\n", + "\n", + "dataset = dataset.sample(frac=1).reset_index(drop=True)\n", + "\n", + "ipyplot.plot_images(\n", + "    images=[row['url'] for _, row in dataset.iterrows()],\n", + "    labels=[row['label'] for _, row in dataset.iterrows()],\n", + "    max_images=12,\n", + "    img_width=300,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create tasks. One task will be created from one image.\n", + "\n", + "Toloka will automatically create task suites and distribute the tasks according to the pool settings:\n", + "\n", + "1. One task suite will consist of 10 tasks.\n", + "2. Toloka will let 3 different Tolokers complete each task.\n", + "\n", + "We configured these settings while creating the pool." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tasks = [\n", + " toloka.Task(input_values={'image': url}, pool_id=new_pool.id)\n", + " for url in dataset['url']\n", + "]\n", + "# Add tasks to a pool\n", + "toloka_client.create_tasks(tasks, allow_defaults=True)\n", + "print(f'Populated pool with {len(tasks)} tasks')\n", + "print(f'To view this pool, go to https://toloka.dev/requester/project/{new_project.id}/pool/{new_pool.id}')\n", + "# print(f'To view this pool, go to https://sandbox.toloka.dev/requester/project/{new_project.id}/pool/{new_pool.id}') # Print a sandbox version link\n", + "\n", + "# Opens the pool\n", + "new_pool = toloka_client.open_pool(new_pool.id)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you open your pool, Tolokers will see your tasks in their mobile app or in Toloka web version, and start working on them.\n", + "\n", + "In small pools like this, it usually takes up to 10 minutes for all the tasks to be performed.\n", + "\n", + "With big pools, we recommend that you set up automatic waiting. 
See the example in the cell below.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool_id = new_pool.id\n", + "\n", + "def wait_pool_for_close(pool_id, minutes_to_wait=1):\n", + " sleep_time = 60 * minutes_to_wait\n", + " pool = toloka_client.get_pool(pool_id)\n", + " while not pool.is_closed():\n", + " op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])\n", + " op = toloka_client.wait_operation(op)\n", + " percentage = op.details['value'][0]['result']['value']\n", + " print(\n", + " f' {datetime.datetime.now().strftime(\"%H:%M:%S\")}\\t'\n", + " f'Pool {pool.id} - {percentage}%'\n", + " )\n", + " time.sleep(sleep_time)\n", + " pool = toloka_client.get_pool(pool.id)\n", + " print('Pool was closed.')\n", + "\n", + "wait_pool_for_close(pool_id)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Get responses\n", + "\n", + "When all the tasks are completed, look at the responses from Tolokers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "answers_df = toloka_client.get_assignments_df(pool_id)\n", + "# Prepare dataframe for aggregation\n", + "answers_df = answers_df.rename(columns={\n", + " 'INPUT:image': 'task',\n", + " 'OUTPUT:result': 'label',\n", + " 'ASSIGNMENT:worker_id': 'worker',\n", + "})\n", + "\n", + "print(f'answers count: {len(answers_df)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An `assignment` value is one Toloker's responses to all the tasks on a task suite.\n", + "\n", + "If a Toloker completed several task suites, then `toloka_client.get_assignments_df` will contain several `assignment` values." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Aggregation \n", + "\n", + "You should run the [results aggregation](https://toloka.ai/en/docs/guide/concepts/result-aggregation?utm_source=github&utm_medium=site&utm_campaign=tolokakit) only if you set the overlap for the tasks to 2 or higher.\n", + "\n", + "The [majority vote](https://toloka.ai/en/docs/guide/concepts/mvote?utm_source=github&utm_medium=site&utm_campaign=tolokakit) method is a quality control method based on matching responses from the majority of Tolokers who complete the same task. For example, if 2 out of 3 Tolokers selected the `cat` label, then the final label for this task will be `cat`.\n", + "\n", + "Majority vote is easily implemented, but you can also use our crowdsourcing [Crowd-Kit library](https://github.com/Toloka/crowd-kit?utm_source=github&utm_medium=site&utm_campaign=tolokakit). It contains a lot of new aggregation methods." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install crowd-kit==1.1.0\n", + "from crowdkit.aggregation import MajorityVote" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MajorityVote?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run majority vote aggregation\n", + "predicted_answers = MajorityVote().fit_predict(answers_df)\n", + "\n", + "print(predicted_answers)\n", + "\n", + "# Some preparations for displaying the results\n", + "predicted_answers = predicted_answers.sample(frac=1)\n", + "images = predicted_answers.index.values\n", + "labels = predicted_answers.values\n", + "start_with = 0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Look at the results.\n", + "\n", + "Note: The cell below can be run several times." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if start_with >= len(predicted_answers):\n", + " print('no more images')\n", + "else:\n", + " ipyplot.plot_images(\n", + " images=images[start_with:],\n", + " labels=labels[start_with:],\n", + " max_images=12,\n", + " img_width=300,\n", + " )\n", + "\n", + " start_with += 12" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can see the labeled images. Some possible results are shown in figure 7 below.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Possible\n", + "
\n", + " Figure 7. Possible results.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "This example explained basic Toloka entities and how Toloka-Kit can work with them.\n", + "\n", + "The described project (classification) is very useful for:\n", + "* Accurate evaluation.\n", + "* Checking the results of a complex project, as in [object detection example](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/object_detection) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/object_detection/object_detection.ipynb)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.10.6 64-bit", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "vscode": { + "interpreter": { + "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/0. get-start-with-your-level/1. learn_the_features/learn_the_features.ipynb b/examples/0. get-start-with-your-level/1. learn_the_features/learn_the_features.ipynb new file mode 100644 index 00000000..c0451ea2 --- /dev/null +++ b/examples/0. get-start-with-your-level/1. 
learn_the_features/learn_the_features.ipynb @@ -0,0 +1,1453 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Get your medium level" + ], + "metadata": { + "id": "W3LjutNLcuVp" + } + }, + { + "cell_type": "markdown", + "source": [ + "Welcome to the medium level Toloka-Kit tutorial! In this notebook, we will explore some of the powerful features and capabilities that Toloka-Kit has to offer. This tutorial is designed for users who already have some experience with Toloka-Kit and are looking to expand their knowledge and implement more complex workflows in their crowdsourcing projects.\n", + "\n", + "We will cover topics such as advanced quality control, custom interface design and using various collectors, conditions, and actions to create more sophisticated task pools." 
+ ], + "metadata": { + "id": "DCtfQgrO1WMX" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Importing libraries" + ], + "metadata": { + "id": "WqSrUAMNdMbN" + } + }, + { + "cell_type": "markdown", + "source": [ + "First, let's import the necessary libraries." + ], + "metadata": { + "id": "Vn9Z4FSFbduE" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "P52cvCgKzPZE" + }, + "outputs": [], + "source": [ + "import datetime\n", + "import getpass\n", + "import time\n", + "\n", + "import pandas as pd\n", + "import toloka.client as toloka\n", + "import toloka.client.project.template_builder as tb" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Authorization in the Toloka API" + ], + "metadata": { + "id": "nWLdAVkqdXKt" + } + }, + { + "cell_type": "markdown", + "source": [ + "To interact with the Toloka API, you need to provide an API key for authentication.\n", + "\n", + "You can obtain your API key from the Toloka website: https://platform.toloka.ai/requester/profile/integration\n", + "\n", + "Keep your API key secret and do not share it with anyone." 
+ ], + "metadata": { + "id": "7vXiAoNif_IN" + } + }, + { + "cell_type": "code", + "source": [ + "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION')  # Or switch to 'SANDBOX'\n", + "# The line below checks that the OAuth token is correct and prints your account's name\n", + "print(toloka_client.get_requester())" + ], + "metadata": { + "id": "4_jLOymszQUN" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Creating a new project" + ], + "metadata": { + "id": "sBi_aC2idonJ" + } + }, + { + "cell_type": "markdown", + "source": [ + "Here we create a Toloka project instance with a specified public name and description.\n", + "\n", + "The public name and description will be visible to the performers who work on your tasks.\n", + "\n", + "Make sure to provide a clear and concise name and description that accurately represent the purpose and requirements of your crowdsourcing project." + ], + "metadata": { + "id": "llDtUF7ZgtXP" + } + }, + { + "cell_type": "code", + "source": [ + "project = toloka.Project(\n", + "    public_name='Your project name',\n", + "    public_description='Your project description',\n", + ")" + ], + "metadata": { + "id": "3aBi2SrnzRCa" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Input and output data" + ], + "metadata": { + "id": "QVxd68KQ2fed" + } + }, + { + "cell_type": "markdown", + "source": [ + "Define the input and output data types for our project.\n", + "\n", + "The `input_specification` is a dictionary that describes the format of the input data (in this case, a URL pointing to an image).\n", + "\n", + "The `output_specification` is a dictionary that describes the format of the output data (in this case, a string that will contain the result provided by the performer)." 
+ ], + "metadata": { + "id": "_E35QfsTgv8n" + } + }, + { + "cell_type": "code", + "source": [ + "input_specification = {'image': toloka.project.UrlSpec()}\n", + "output_specification = {'result': toloka.project.StringSpec()}" + ], + "metadata": { + "id": "t_u04omt2evG" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Task interface" + ], + "metadata": { + "id": "nwPm_wd-E7bJ" + } + }, + { + "cell_type": "markdown", + "source": [ + "We will configure the task interface that performers will interact with while working on the tasks.\n", + "A well-designed interface is essential for obtaining high-quality results. We will create different input and output fields\n", + "to capture various types of user responses, and use Toloka's Template Builder to customize the interface appearance and layout." + ], + "metadata": { + "id": "kwi-X31ptqjQ" + } + }, + { + "cell_type": "markdown", + "source": [ + "In this section, we create an instance of the Interface Builder. The Interface Builder allows you to easily create and customize various UI elements, such as text fields, image displays, and buttons, to ensure that your tasks are user-friendly and accessible to the performers." + ], + "metadata": { + "id": "yJIe2W3ydxXy" + } + }, + { + "cell_type": "code", + "source": [ + "project_interface = tb.InterfaceBuilder()" + ], + "metadata": { + "id": "ftt29KiozREo" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here we create a text field UI element using the TextFieldV1 class from the Interface Builder.\n", + "The `output_name` parameter specifies the key that will be used to store the performer's input in the output data.\n", + "The `label` parameter sets the label that will be displayed next to the text field, providing instructions for the performer.\n", + "After creating the text field, we add it to the project interface using the `add_element` method." 
+ ], + "metadata": { + "id": "z0ywP2oDd6Zq" + } + }, + { + "cell_type": "code", + "source": [ + "text_field = tb.TextFieldV1(output_name='text_result', label='Enter your answer here:')\n", + "project_interface.view.add_element(text_field)" + ], + "metadata": { + "id": "EGEnktYJFCeZ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Next, we create a radio group UI element using the RadioGroupFieldV1 class from the Interface Builder.\n", + "The `data_output_path` parameter specifies the key that will be used to store the performer's selected option in the output data.\n", + "The `label` parameter sets the label that will be displayed.\n", + "The `options` parameter is a list of GroupFieldOption instances, each representing a selectable option.\n", + "After creating the radio group field, we add it to the project interface using the `add_element` method (not shown in this snippet)." + ], + "metadata": { + "id": "ORFpfOWSd9yn" + } + }, + { + "cell_type": "code", + "source": [ + "radio_group = tb.RadioGroupFieldV1(\n", + " data_output_path='radio_result',\n", + " label='Select an option:',\n", + " options=[\n", + " tb.GroupFieldOption(label='Option 1', value='option_1'),\n", + " tb.GroupFieldOption(label='Option 2', value='option_2'),\n", + " ],\n", + ")" + ], + "metadata": { + "id": "rj0SJdhcFGyk" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Creating a checkbox group UI element using the CheckboxGroupFieldV1 class from the Interface Builder.\n", + "The `data_output_path` parameter specifies the key that will be used to store the performer's selected options in the output data.\n", + "The `label` parameter sets the label that will be displayed.\n", + "The `options` parameter is a list of GroupFieldOption instances, each representing a selectable option in the checkbox group.\n", + "After creating the checkbox group field, we add it to the project interface using 
the `add_element` method." + ], + "metadata": { + "id": "vKnAUEoefhAC" + } + }, + { + "cell_type": "code", + "source": [ + "checkbox_group = tb.CheckboxGroupFieldV1(\n", + " data_output_path='checkbox_result',\n", + " label='Select all that apply:',\n", + " options=[\n", + " tb.GroupFieldOption(label='Option 1', value='option_1'),\n", + " tb.GroupFieldOption(label='Option 2', value='option_2'),\n", + " ],\n", + ")\n", + "project_interface.view.add_element(checkbox_group)" + ], + "metadata": { + "id": "An6U26weFywj" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Creating a dropdown UI element using the DropdownFieldV1 class from the Interface Builder.\n", + "The `data_output_path` parameter specifies the key that will be used to store the performer's selected option in the output data.\n", + "The `label` parameter sets the label that will be displayed.\n", + "The `options` parameter is a list of GroupFieldOption instances, each representing a selectable option in the dropdown menu.\n", + "After creating the dropdown field, we add it to the project interface using the `add_element` method." 
+ ], + "metadata": { + "id": "egcQi8bBfqBi" + } + }, + { + "cell_type": "code", + "source": [ + "dropdown = tb.DropdownFieldV1(\n", + " data_output_path='dropdown_result',\n", + " label='Choose an option:',\n", + " options=[\n", + " tb.GroupFieldOption(label='Option 1', value='option_1'),\n", + " tb.GroupFieldOption(label='Option 2', value='option_2'),\n", + " ],\n", + ")\n", + "project_interface.view.add_element(dropdown)" + ], + "metadata": { + "id": "-KoxlouMFyyv" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "In this section, we create a TemplateBuilderTaskSpec object to define the task specification for the project.\n", + "\n", + "The `interface` parameter is set to the project_interface, which we converted to a dictionary using the `to_dict` method.\n", + "The `task_defaults` parameter is used to set default values for tasks, such as the input and output data formats.\n", + "The `output_schemas` parameter is a list of OutputFieldSchema instances, each defining the output data format for a specific UI element.\n", + "\n", + "The OutputFieldSchema has three parameters: `name` is the key used in the output data, `field` specifies the output data type, and `required` indicates whether the performer must provide a value for this field.\n", + "\n", + "In this example, we have created output schemas for the text field, radio group, checkbox group, and dropdown field in the project interface." 
+ ], + "metadata": { + "id": "hc9bYGkVf2jT" + } + }, + { + "cell_type": "code", + "source": [ + "project.task_spec = toloka.project.TemplateBuilderTaskSpec(\n", + " interface=project_interface.to_dict(),\n", + " task_defaults=toloka.project.ProjectTaskDefaults(),\n", + " output_schemas=[\n", + " toloka.project.OutputFieldSchema(\n", + " name='text_result',\n", + " field=tb.OutputFieldType.STRING,\n", + " required=True,\n", + " ),\n", + " toloka.project.OutputFieldSchema(\n", + " name='radio_result',\n", + " field=tb.OutputFieldType.STRING,\n", + " required=True,\n", + " ),\n", + " toloka.project.OutputFieldSchema(\n", + " name='checkbox_result',\n", + " field=tb.OutputFieldType.ARRAY,\n", + " required=True,\n", + " ),\n", + " toloka.project.OutputFieldSchema(\n", + " name='dropdown_result',\n", + " field=tb.OutputFieldType.STRING,\n", + " required=True,\n", + " ),\n", + " ],\n", + ")" + ], + "metadata": { + "id": "dLzxcyZKzRG7" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Creating a TolokaPluginV1 object with the scroll plugin to set the task width in the project interface.\n", + "The `task_width` parameter is set to 400 pixels, but you can adjust it to fit the content of your tasks." + ], + "metadata": { + "id": "royjRI3w4RyD" + } + }, + { + "cell_type": "code", + "source": [ + "task_width_plugin = tb.TolokaPluginV1(\n", + " 'scroll',\n", + " task_width=400,\n", + ")" + ], + "metadata": { + "id": "RswF5sV24RiZ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here, we configure how the task will be presented to the performers.\n", + "We create a `TemplateBuilderViewSpec` object that combines the dropdown and radio group fields in a ListViewV1.\n", + "We also include the `task_width_plugin` we created earlier to control the task width.\n", + "This configuration ensures a visually appealing and organized task layout for the performers in the Toloka interface." 
+ ], + "metadata": { + "id": "vaepylZ_2Sbu" + } + }, + { + "cell_type": "code", + "source": [ + "project_interface = toloka.project.TemplateBuilderViewSpec(\n", + " view=tb.ListViewV1([dropdown, radio_group]),\n", + " plugins=[task_width_plugin],\n", + ")" + ], + "metadata": { + "id": "82dpy-aM2SDl" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Next, we assign the task interface and input/output data specifications to the project." + ], + "metadata": { + "id": "7YM87jew4fzK" + } + }, + { + "cell_type": "code", + "source": [ + "project.task_spec = toloka.project.task_spec.TaskSpec(\n", + " input_spec=input_specification,\n", + " output_spec=output_specification,\n", + " view_spec=project_interface,\n", + ")" + ], + "metadata": { + "id": "1VG4VLxQ103q" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Provide a text with instructions that will be visible to the performers before they start working on the tasks.\n", + "Clear and comprehensive instructions are crucial for obtaining high-quality results from the performers." + ], + "metadata": { + "id": "mE5_fcVr4xYV" + } + }, + { + "cell_type": "code", + "source": [ + "project.public_instructions = 'Your text with instruction'" + ], + "metadata": { + "id": "RsxIHH4C101N" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Finally, we submit the project configuration to the Toloka platform by calling the `create_project` method.\n", + "It creates a new project on the platform with the specified settings and returns the created project object." 
+ ], + "metadata": { + "id": "V3pmth8V01dp" + } + }, + { + "cell_type": "code", + "source": [ + "project = toloka_client.create_project(project)" + ], + "metadata": { + "id": "gObBgDevG39Z" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Training" + ], + "metadata": { + "id": "3EEOxKXZDAS8" + } + }, + { + "cell_type": "markdown", + "source": [ + "We should create a training pool for the project. Training pools help educate performers on how to complete the tasks correctly.\n", + "By setting up a training pool, you ensure that performers understand the instructions and requirements before they start working on the main tasks." + ], + "metadata": { + "id": "kpjee5wxzf6f" + } + }, + { + "cell_type": "code", + "source": [ + "training = toloka.Training(\n", + " project_id=project.id,\n", + " private_name='Training for your project',\n", + " may_contain_adult_content=False,\n", + " inherited_instructions=True,\n", + " assignment_max_duration_seconds=60 * 20,\n", + ")" + ], + "metadata": { + "id": "fxde5ZvdDAIm" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here we create an interface builder specifically for the training tasks.\n", + "The training interface can be similar or identical to the main task interface, depending on the project's requirements." + ], + "metadata": { + "id": "M79EUvrfzjmf" + } + }, + { + "cell_type": "code", + "source": [ + "training_interface = tb.InterfaceBuilder()" + ], + "metadata": { + "id": "mkW2nsRPDAGA" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Next, we configure the components of the training interface.\n", + "Here, we use a text field similar to the main task interface.\n", + "Then, we define the task specification for the training project with the appropriate input and output data schemas."
+ ], + "metadata": { + "id": "RuoEndTizu1G" + } + }, + { + "cell_type": "code", + "source": [ + "training_text_field = tb.TextFieldV1(output_name='result', label='Enter your answer here:')\n", + "training_interface.view.add_element(training_text_field)\n", + "\n", + "training.task_spec = toloka.project.TemplateBuilderTaskSpec(\n", + " interface=training_interface.to_dict(),\n", + " task_defaults=toloka.project.ProjectTaskDefaults(),\n", + " output_schemas=[\n", + " toloka.project.OutputFieldSchema(\n", + " name='result',\n", + " field=tb.OutputFieldType.STRING,\n", + " required=True,\n", + " ),\n", + " ],\n", + ")" + ], + "metadata": { + "id": "31lQ_2WuDADK" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Finally, we create the training project using the Toloka API client and the configured training object." + ], + "metadata": { + "id": "Fq4c9suEz7xg" + } + }, + { + "cell_type": "code", + "source": [ + "training = toloka_client.create_training(training)" + ], + "metadata": { + "id": "0NIQtxQJC__w" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Pool" + ], + "metadata": { + "id": "SemqBAELAQIP" + } + }, + { + "cell_type": "markdown", + "source": [ + "Here, we will create a pool for our project. A pool is a group of tasks with similar characteristics and settings, assigned to performers with specific skills or attributes. We will configure the pool settings, such as the reward per assignment, auto-acceptance of solutions, assignment duration, and dynamic pricing, to ensure efficient task distribution and management." 
+ ], + "metadata": { + "id": "cPAoy2Sy7N9y" + } + }, + { + "cell_type": "code", + "source": [ + "pool = toloka.Pool(\n", + " # The project ID to which the pool is related\n", + " project_id=project.id,\n", + " # A private name for the pool, visible only to you\n", + " private_name='Your pool name',\n", + " # Whether the pool may contain adult content\n", + " may_contain_adult_content=False,\n", + " # The pool's expiration date\n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),\n", + " # The reward per assignment for performers\n", + " reward_per_assignment=0.01,\n", + " # Whether to automatically accept solutions after the specified period\n", + " auto_accept_solutions=True,\n", + " # The number of days before auto-accepting solutions\n", + " auto_accept_period_day=7,\n", + " # Maximum duration for a performer to complete an assignment\n", + " assignment_max_duration_seconds=60 * 20,\n", + " dynamic_pricing_config=toloka.pool.DynamicPricingConfig(\n", + " # Whether dynamic pricing is enabled\n", + " enabled=True,\n", + " # The multiplier for dynamic pricing\n", + " multiplier=1.5,\n", + " ),\n", + ")\n" + ], + "metadata": { + "id": "qS_lW0GFzRJV" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Setting the default overlap for tasks in the pool. The overlap value determines the number of different performers who will complete the same task. In this case, each task will be completed by 3 different performers, allowing for more reliable results through aggregation." + ], + "metadata": { + "id": "V1rgpmza7Ygw" + } + }, + { + "cell_type": "code", + "source": [ + "pool.defaults = toloka.client.pool.Pool.Defaults (\n", + " overlap=3,\n", + ")" + ], + "metadata": { + "id": "aqLR_XZLef4x" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "In this section, we configure the Task Suite mixer for the pool. 
The mixer determines the composition of tasks within a Task Suite for performers. In this example, we set the following configuration:\n", + "\n", + "- real_tasks_count: The number of actual tasks in a Task Suite, which is set to 5.\n", + "- golden_tasks_count: The number of golden tasks (tasks with known correct answers) in a Task Suite, which is set to 1. Golden tasks help monitor and evaluate the quality of performer's work.\n", + "- training_tasks_count: The number of training tasks in a Task Suite, which is set to 0. Training tasks are used to teach performers how to complete tasks correctly before they start working on actual tasks." + ], + "metadata": { + "id": "7a2gBSys7jUG" + } + }, + { + "cell_type": "code", + "source": [ + "pool.set_mixer_config(real_tasks_count=5, golden_tasks_count=1, training_tasks_count=0)" + ], + "metadata": { + "id": "lpMMAkn8ey2M" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Filters" + ], + "metadata": { + "id": "gwFSOliwfjsk" + } + }, + { + "cell_type": "markdown", + "source": [ + "Here, we will apply various filters to the pool to ensure that only performers with certain characteristics can work on the tasks. We will combine different filters such as language, custom skill, number of completed tasks, average task completion rate, region based on the performer's IP address, and the type of client used by the performer. By using these filters, we can better control the quality of the work and attract performers who meet our requirements." 
+ ], + "metadata": { + "id": "4zG5y0V77UwA" + } + }, + { + "cell_type": "code", + "source": [ + "pool.filter = (\n", + " # Language filter\n", + " (toloka.filter.Languages.in_('EN')) &\n", + " # Filter by custom skill\n", + " (toloka.filter.Skill('your_custom_skill_id') > 0.8) &\n", + " # Filter by the number of completed tasks\n", + " (toloka.filter.AssignmentsAcceptedCount.gt(100)) &\n", + " # Filter by average task completion rate\n", + " (toloka.filter.AssignmentSubmitTimeRate.lt(60 * 2)) &\n", + " # Filter by the performer's region, determined from their IP address\n", + " (toloka.filter.RegionByIp.in_('US', 'CA')) &\n", + " # Client type filters combined with OR: browser or Toloka mobile app\n", + " (\n", + " toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.BROWSER) |\n", + " toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.TOLOKA_APP)\n", + " )\n", + ")" + ], + "metadata": { + "id": "HSNiRNgEAvbf" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here we create a custom skill that can be used to track and rate the performers based on their performance in our tasks. This skill will be hidden from the performers, and it won't be accessible through public requests. The custom skill can be used as a filter in the pool to select performers with specific skill levels."
+ ], + "metadata": { + "id": "0HkMCrbH7sV4" + } + }, + { + "cell_type": "code", + "source": [ + "custom_skill = toloka_client.create_skill(\n", + " toloka.Skill(\n", + " name='custom_skill',\n", + " hidden=True,\n", + " public_request=False,\n", + " )\n", + ")" + ], + "metadata": { + "id": "S8qygDVK7sQ4" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here are some more ways to work with skills." + ], + "metadata": { + "id": "MI5CYikLh9AR" + } + }, + { + "cell_type": "markdown", + "source": [ + "To get a list of all skills in your account, you can use the `get_skills` method." + ], + "metadata": { + "id": "mrmmEkDhjozB" + } + }, + { + "cell_type": "code", + "source": [ + "all_skills = toloka_client.get_skills()\n", + "for skill in all_skills:\n", + " print(f'Skill ID: {skill.id}, Name: {skill.name}')" + ], + "metadata": { + "id": "z8zHIitOiTP8" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "If you need to delete a skill, you can use the `delete_skill` method." + ], + "metadata": { + "id": "9dkYFT2Ujthj" + } + }, + { + "cell_type": "code", + "source": [ + "toloka_client.delete_skill(custom_skill.id)\n", + "print(f'Skill with ID {custom_skill.id} has been deleted.')" + ], + "metadata": { + "id": "P_xNbgVTjYxQ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "You can set a skill value for a specific performer using the `set_user_skill` method." + ], + "metadata": { + "id": "MQ4GTaukjty7" + } + }, + { + "cell_type": "code", + "source": [ + "performer_id = 'your_performer_id'\n", + "updated_skill_value = 0.9\n", + "\n", + "toloka_client.set_user_skill(\n", + " user_id=performer_id,\n", + " skill_id=custom_skill.id,\n", + " value=updated_skill_value\n", + ")" + ], + "metadata": { + "id": "zVx_66zrjYu5" + }, + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "markdown", + "source": [ + "You can update the properties of a skill using the `update_skill` method." + ], + "metadata": { + "id": "ofTutOYVkFkF" + } + }, + { + "cell_type": "code", + "source": [ + "updated_skill_name = 'Updated Custom Skill'\n", + "updated_skill_description = 'An updated description for the custom skill'\n", + "\n", + "updated_skill = toloka.Skill(\n", + " id=custom_skill.id,\n", + " name=updated_skill_name,\n", + " description=updated_skill_description\n", + ")\n", + "\n", + "# Call the method on the client instance created at the start of the notebook\n", + "updated_skill = toloka_client.update_skill(custom_skill.id, updated_skill)" + ], + "metadata": { + "id": "GBJfsPecjYsq" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Tasks in pool" + ], + "metadata": { + "id": "X0LqIwE0gZyG" + } + }, + { + "cell_type": "markdown", + "source": [ + "Load your data. It can be stored in any format that pandas can read (CSV, JSON, TSV, SQL); load it with pandas (`pd`)." + ], + "metadata": { + "id": "y566yG0Z9rOo" + } + }, + { + "cell_type": "code", + "source": [ + "dataset = pd.read_csv('Your dataset')" + ], + "metadata": { + "id": "ISNAy8NOfG3_" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Next, we prepare and upload tasks to the pool:\n", + "\n", + "- tasks: Creates a list of tasks from the dataset. The toloka.Task object contains various parameters that can be set to customize the tasks as required.\n", + "- toloka_client.create_tasks: Adds the tasks to the pool using the Toloka API client. The allow_defaults parameter is set to True, which allows default values for task parameters."
+ ], + "metadata": { + "id": "h0RaeeFTEz9Z" + } + }, + { + "cell_type": "code", + "source": [ + "# Forming tasks from data\n", + "# Only the commonly used parameters are set here. toloka.Task also accepts known_solutions,\n", + "# message_on_unknown_solution, overlap, infinite_overlap, reserved_for, unavailable_for,\n", + "# and baseline_solutions.\n", + "tasks = [\n", + "    toloka.Task(\n", + "        input_values={'image': url},  # keys must match input_specification\n", + "        pool_id=pool.id,\n", + "    )\n", + "    for url in dataset['url']  # assumes the dataset has a 'url' column\n", + "]\n", + "\n", + "# Add tasks to a pool\n", + "toloka_client.create_tasks(tasks, allow_defaults=True)" + ], + "metadata": { + "id": "44IyFf-nDEK6" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "For example, we download a dataset of cat and dog images, load it into a pandas DataFrame, shuffle it, and then create tasks from the image URLs in the dataset. Finally, we add these tasks to the pool."
+ ], + "metadata": { + "id": "UQC1atttn5iX" + } + }, + { + "cell_type": "code", + "source": [ + "!curl https://tlk.s3.yandex.net/dataset/cats_vs_dogs/toy_dataset.tsv --output dataset.tsv\n", + "\n", + "dataset = pd.read_csv('dataset.tsv', sep='\\t')\n", + "print(f'Dataset contains {len(dataset)} rows\\n')\n", + "dataset = dataset.sample(frac=1).reset_index(drop=True)\n", + "\n", + "tasks = [\n", + " toloka.Task(input_values={'image': url}, pool_id=pool.id)\n", + " for url in dataset['url']\n", + "]\n", + "# Add tasks to a pool\n", + "toloka_client.create_tasks(tasks, allow_defaults=True)\n", + "print(f'Populated pool with {len(tasks)} tasks')\n", + "print(f'To view this pool, go to https://toloka.dev/requester/project/{project.id}/pool/{pool.id}')" + ], + "metadata": { + "id": "BXo75A93n3rN" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "`TaskSuites` are a grouping of several tasks that are meant to be completed together by a single performer. This can be useful for a variety of reasons, such as maintaining context or consistency across tasks, or simply optimizing the workflow for performers.\n", + "\n", + "In contrast, individual Tasks represent a single unit of work that can be completed independently by a performer. These tasks are typically used for projects where context and consistency are not crucial or when tasks are unrelated to each other." 
+ ], + "metadata": { + "id": "vNWxF8e9BUO3" + } + }, + { + "cell_type": "code", + "source": [ + "# Create a list of Task Suites\n", + "# Only the commonly used parameters are set here. TaskSuite also accepts infinite_overlap,\n", + "# reserved_for, unavailable_for, issuing_order_override, mixed, traits_all_of,\n", + "# traits_any_of, traits_none_of_any, longitude, and latitude.\n", + "task_suites = [\n", + "    toloka.client.task_suite.TaskSuite(\n", + "        pool_id=pool.id,\n", + "        overlap=3,\n", + "        # Example: one suite built from the first five image URLs of the dataset\n", + "        tasks=[toloka.Task(input_values={'image': url}) for url in dataset['url'][:5]],\n", + "    )\n", + "]" + ], + "metadata": { + "id": "jY6eS2TxT3WK" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# Add the Task Suites to the pool\n", + "toloka_client.create_task_suites(task_suites)" + ], + "metadata": { + "id": "sEFdzB4gn_KG" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Quality control" + ], + "metadata": { + "id": "KzZ3U49Xgf5n" + } + }, + { + "cell_type": "markdown", + "source": [ + "Quality control in a Toloka task pool consists of three main components: collectors, conditions, and actions. They work together to monitor the performance of workers and take measures when quality issues are detected.\n", + "- Collectors: These collect information about the worker's performance while they complete tasks. 
For example, collectors may gather information about the task completion time, correctness scores, etc.\n", + "\n", + "- Conditions: Conditions determine under which circumstances actions will be applied to a worker. Conditions can check information gathered by collectors, or other properties of a worker's performance, such as the number of tasks completed or their current rating.\n", + "\n", + "- Actions: These are measures that are applied to a worker if the conditions are met. For example, an action can involve restricting a worker's access to the project or task pool for a certain period, adjusting the worker's rating, or notifying the administrator." + ], + "metadata": { + "id": "-_05nfjw1I7V" + } + }, + { + "cell_type": "markdown", + "source": [ + "In this section, we're setting a quality control requirement based on the performer's training. \n", + "\n", + "- *pool.quality_control.training_requirement*: This attribute of the pool object sets the training requirement for the quality control.\n", + "\n", + "- *toloka.pool.QualityControl.TrainingRequirement*: This class defines the training requirement configuration.\n", + "\n", + "1) *training_pool_id*: Replace with the ID of the training pool you want the performers to pass before they can work on the main pool tasks.\n", + "\n", + "2) *training_passing_score*: This is the minimum score (in percentage) that performers must achieve in the training pool to be allowed to work on the main pool tasks. In this example, the passing score is set to 80.\n", + "\n", + "By setting this training requirement, you ensure that only performers who have successfully passed the training with a score of at least 80% can access and work on the tasks in the main pool." 
+ ], + "metadata": { + "id": "NC-R9Of892JN" + } + }, + { + "cell_type": "code", + "source": [ + "pool.quality_control.training_requirement = toloka.pool.QualityControl.TrainingRequirement(\n", + "    training_pool_id='',  # replace with the ID of your training pool\n", + "    training_passing_score=80,\n", + ")" + ], + "metadata": { + "id": "xrhiS3Jx93QO" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "In this section, we're adding a quality control action to the pool based on the assignment submit time.\n", + "\n", + "- pool.quality_control.add_action: This method adds a quality control action to the pool.\n", + "\n", + "- collector=toloka.collectors.AssignmentSubmitTime(history_size=5): This collector gathers information on the average submit time for the last 5 assignments completed by a performer.\n", + "\n", + "- conditions=[toloka.conditions.AssignmentSubmitTime(minutes_ago=15)]: This condition checks whether the average submit time for the last 5 assignments is less than 15 minutes. If the condition is met, the specified action will be triggered.\n", + "\n", + "- action=toloka.actions.RestrictionV2(...): This action applies a restriction to the performer who meets the condition. The restriction parameters are as follows:\n", + "\n", + " 1) scope='PROJECT': The restriction applies to the entire project.\n", + "\n", + " 2) duration=3: The restriction will last for 3 units of time.\n", + "\n", + " 3) duration_unit='DAYS': The unit of time for the restriction duration is days. So, the restriction will last for 3 days.\n", + "\n", + " 4) reason_code='SUBMIT_TIME_TOO_FAST': This is a custom code to indicate the reason for the restriction.\n", + "\n", + "By adding this quality control action, you restrict access to the project for performers who complete tasks too quickly (average submit time for the last 5 assignments is less than 15 minutes). This can help to prevent low-quality submissions from performers who rush through tasks."
+ ], + "metadata": { + "id": "4p0xBxZDEkr1" + } + }, + { + "cell_type": "code", + "source": [ + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.AssignmentSubmitTime(history_size=5),\n", + " conditions=[\n", + " toloka.conditions.AssignmentSubmitTime(minutes_ago=15),\n", + " ],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='PROJECT',\n", + " duration=3,\n", + " duration_unit='DAYS',\n", + " reason_code='SUBMIT_TIME_TOO_FAST',\n", + " ),\n", + ")" + ], + "metadata": { + "id": "HJ58dh9TA4-C" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here, we're adding another quality control action to the pool based on the performer's skill.\n", + "\n", + "By adding this quality control action, you restrict access to the project for performers who have a custom skill score lower than 0.8. This helps ensure that only high-skilled performers can participate in the project. Note that the collector should be updated or removed to reflect the condition based on the custom skill, as the current collector is not relevant." 
+ ], + "metadata": { + "id": "1iVdKqj4EsEZ" + } + }, + { + "cell_type": "code", + "source": [ + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.AssignmentSubmitTime(history_size=5),\n", + " conditions=[\n", + " toloka.conditions.Skill(custom_skill.id, '<', 0.8),\n", + " ],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='PROJECT',\n", + " duration=3,\n", + " duration_unit='DAYS',\n", + " reason_code='LOW_SKILL',\n", + " ),\n", + ")" + ], + "metadata": { + "id": "yLm1egVRzRMQ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Final settings and pool creation" + ], + "metadata": { + "id": "rq2lgr_pg1dw" + } + }, + { + "cell_type": "markdown", + "source": [ + "Next, we set a limit on the number of assignments a single performer can complete in the pool." + ], + "metadata": { + "id": "WBeNvth3Ez-l" + } + }, + { + "cell_type": "code", + "source": [ + "pool.set_limit(toloka.limits.AssignmentsPerUserCountLimit(max_count=50))" + ], + "metadata": { + "id": "m4WrHn-FBnFj" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here we add a webhook for event notifications in the pool." + ], + "metadata": { + "id": "y8-2XQJPE4-_" + } + }, + { + "cell_type": "code", + "source": [ + "webhook = toloka.Webhook(\n", + " url='https://your-webhook-url.com',\n", + " events=[\n", + " toloka.Webhook.Event.ASSIGNMENT_SUBMITTED,\n", + " toloka.Webhook.Event.ASSIGNMENT_APPROVED,\n", + " ],\n", + ")\n", + "webhook = toloka_client.create_webhook(webhook)\n", + "\n", + "pool.webhooks.add(webhook.id)" + ], + "metadata": { + "id": "hcJ8qJihB-rG" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Finally, we create the pool and open it so that performers can start working on the tasks."
+ ], + "metadata": { + "id": "NhlQxx--JnQs" + } + }, + { + "cell_type": "code", + "source": [ + "pool = toloka_client.create_pool(pool)\n", + "toloka_client.open_pool(pool.id)" + ], + "metadata": { + "id": "QJM-2xDMB-oW" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Semi-Manual Acceptance and Rejection of Assignments" + ], + "metadata": { + "id": "TS0q2kCRMFXH" + } + }, + { + "cell_type": "markdown", + "source": [ + "In some cases, you might want to review the assignments submitted by performers manually and decide whether to accept or reject them.\n", + "\n", + "The usual pipeline is as follows:\n", + "- Get an assignment by ID.\n", + "- Review the assignment.\n", + "- Accept or reject the assignment.\n" + ], + "metadata": { + "id": "EEOJ5nTOMHzr" + } + }, + { + "cell_type": "code", + "source": [ + "# To retrieve an assignment using its ID, you can use the get_assignment method.\n", + "assignment_id = 'your_assignment_id'\n", + "assignment = toloka_client.get_assignment(assignment_id)" + ], + "metadata": { + "id": "w45eDjCTMQb-" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# If you're satisfied with the performer's work, you can accept the assignment using the accept_assignment method.\n", + "toloka_client.accept_assignment(assignment_id, 'Well done!')" + ], + "metadata": { + "id": "Wyez1tBONPsx" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# If the assignment doesn't meet your expectations or requirements, you can reject it using the reject_assignment method.\n", + "toloka_client.reject_assignment(assignment_id, 'Please follow the instructions carefully.')" + ], + "metadata": { + "id": "KyG6BM-aNPqX" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Get responses" + ], + "metadata": { + "id": "82W_cnGZJz2b" + } + }, + { + 
"cell_type": "markdown", + "source": [ + "In this section, we define a function to monitor the progress of the pool and wait for its completion" + ], + "metadata": { + "id": "LyfQwCGOKBWl" + } + }, + { + "cell_type": "code", + "source": [ + "pool_id = pool.id\n", + "\n", + "def wait_pool_for_close(pool_id, minutes_to_wait=1):\n", + " sleep_time = 60 * minutes_to_wait\n", + " pool = toloka_client.get_pool(pool_id)\n", + " while not pool.is_closed():\n", + " op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])\n", + " op = toloka_client.wait_operation(op)\n", + " percentage = op.details['value'][0]['result']['value']\n", + " print(\n", + " f' {datetime.datetime.now().strftime(\"%H:%M:%S\")}\\t'\n", + " f'Pool {pool.id} - {percentage}%'\n", + " )\n", + " time.sleep(sleep_time)\n", + " pool = toloka_client.get_pool(pool.id)\n", + " print('Pool was closed.')\n", + "\n", + "wait_pool_for_close(pool_id)" + ], + "metadata": { + "id": "pIul9A1YJ8l2" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Next, we use the Toloka API to retrieve the completed assignments from the specified pool and store the results in a pandas DataFrame called `answers_df`. This DataFrame will contain all the information about the completed tasks, such as the input values, the performer's answers, and other metadata. You can then use this DataFrame to analyze the results or perform additional quality control steps." 
+ ], + "metadata": { + "id": "NNPUXekZKXfQ" + } + }, + { + "cell_type": "code", + "source": [ + "answers_df = toloka_client.get_assignments_df(pool_id)" + ], + "metadata": { + "id": "3UrBEXEUKb2B" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Aggregate the results" + ], + "metadata": { + "id": "kV1NB5ouKilb" + } + }, + { + "cell_type": "markdown", + "source": [ + "For this step, please, see our prepared notebooks with examples here \n", + "https://github.com/Toloka/crowd-kit/tree/main/examples" + ], + "metadata": { + "id": "bM98a16yMgpX" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Summary" + ], + "metadata": { + "id": "kD0C81cENIOL" + } + }, + { + "cell_type": "markdown", + "source": [ + "In the advanced Toloka notebook, we cover the following main aspects:\n", + "\n", + "- Authorization in Toloka API: Involves authorizing with Toloka using OAuth 2.0 token.\n", + "\n", + "- Creating a new project: Describes the process of creating a new project in Toloka, adding a description, and public instructions.\n", + "\n", + "- Task interface description: Includes creating a task interface using various components like text fields, radio buttons, checkboxes, and dropdowns.\n", + "\n", + "- Creating and setting up a training pool: Describes the process of creating and setting up a training pool for training performers before they work on actual tasks.\n", + "\n", + "- Creating and setting up the main pool: Involves creating and configuring the main pool for tasks execution, setting up performer filters, dynamic pricing, limits, and quality control mechanisms.\n", + "\n", + "- Preparing and uploading data: Preparing data for creating tasks and uploading them to the pool.\n", + "\n", + "- Launching the pool and tracking progress: Describes the process of launching the pool, tracking its progress, and closing it upon completion.\n", + "\n", + "- Retrieving results: Explains how to retrieve the results of 
completed tasks and save them in a DataFrame.\n", + "\n", + "This advanced notebook provides a detailed guide on creating, configuring, and managing Toloka projects using Python and the Toloka API, as well as exploring various quality control mechanisms and workflow optimizations." + ], + "metadata": { + "id": "gm6TuTz-NKqf" + } + } + ] +} \ No newline at end of file diff --git a/examples/metrics/find_items_pipeline.py b/examples/0. get-start-with-your-level/2. metrics/find_items_pipeline.py similarity index 100% rename from examples/metrics/find_items_pipeline.py rename to examples/0. get-start-with-your-level/2. metrics/find_items_pipeline.py diff --git a/examples/metrics/graphite.ipynb b/examples/0. get-start-with-your-level/2. metrics/graphite.ipynb similarity index 100% rename from examples/metrics/graphite.ipynb rename to examples/0. get-start-with-your-level/2. metrics/graphite.ipynb diff --git a/examples/metrics/img/assignments_in_pool_dash.png b/examples/0. get-start-with-your-level/2. metrics/img/assignments_in_pool_dash.png similarity index 100% rename from examples/metrics/img/assignments_in_pool_dash.png rename to examples/0. get-start-with-your-level/2. metrics/img/assignments_in_pool_dash.png diff --git a/examples/metrics/img/grafana_metrics.png b/examples/0. get-start-with-your-level/2. metrics/img/grafana_metrics.png similarity index 100% rename from examples/metrics/img/grafana_metrics.png rename to examples/0. get-start-with-your-level/2. metrics/img/grafana_metrics.png diff --git a/examples/metrics/img/pool_expenses_dash.png b/examples/0. get-start-with-your-level/2. metrics/img/pool_expenses_dash.png similarity index 100% rename from examples/metrics/img/pool_expenses_dash.png rename to examples/0. get-start-with-your-level/2. metrics/img/pool_expenses_dash.png diff --git a/examples/metrics/jupyter_dashboard.ipynb b/examples/0. get-start-with-your-level/2. 
metrics/jupyter_dashboard.ipynb similarity index 100% rename from examples/metrics/jupyter_dashboard.ipynb rename to examples/0. get-start-with-your-level/2. metrics/jupyter_dashboard.ipynb diff --git a/examples/autoquality/autoquality_usage.ipynb b/examples/0. get-start-with-your-level/3. autoquality/autoquality_usage.ipynb similarity index 100% rename from examples/autoquality/autoquality_usage.ipynb rename to examples/0. get-start-with-your-level/3. autoquality/autoquality_usage.ipynb diff --git a/examples/6.streaming_pipelines/find_items_pool.json b/examples/0. get-start-with-your-level/4. streaming_pipelines/find_items_pool.json similarity index 100% rename from examples/6.streaming_pipelines/find_items_pool.json rename to examples/0. get-start-with-your-level/4. streaming_pipelines/find_items_pool.json diff --git a/examples/6.streaming_pipelines/find_items_project.json b/examples/0. get-start-with-your-level/4. streaming_pipelines/find_items_project.json similarity index 100% rename from examples/6.streaming_pipelines/find_items_project.json rename to examples/0. get-start-with-your-level/4. streaming_pipelines/find_items_project.json diff --git a/examples/6.streaming_pipelines/sbs_pool.json b/examples/0. get-start-with-your-level/4. streaming_pipelines/sbs_pool.json similarity index 100% rename from examples/6.streaming_pipelines/sbs_pool.json rename to examples/0. get-start-with-your-level/4. streaming_pipelines/sbs_pool.json diff --git a/examples/6.streaming_pipelines/sbs_project.json b/examples/0. get-start-with-your-level/4. streaming_pipelines/sbs_project.json similarity index 100% rename from examples/6.streaming_pipelines/sbs_project.json rename to examples/0. get-start-with-your-level/4. streaming_pipelines/sbs_project.json diff --git a/examples/6.streaming_pipelines/streaming_pipelines.ipynb b/examples/0. get-start-with-your-level/4. 
streaming_pipelines/streaming_pipelines.ipynb similarity index 100% rename from examples/6.streaming_pipelines/streaming_pipelines.ipynb rename to examples/0. get-start-with-your-level/4. streaming_pipelines/streaming_pipelines.ipynb diff --git a/examples/6.streaming_pipelines/verification_pool.json b/examples/0. get-start-with-your-level/4. streaming_pipelines/verification_pool.json similarity index 100% rename from examples/6.streaming_pipelines/verification_pool.json rename to examples/0. get-start-with-your-level/4. streaming_pipelines/verification_pool.json diff --git a/examples/6.streaming_pipelines/verification_project.json b/examples/0. get-start-with-your-level/4. streaming_pipelines/verification_project.json similarity index 100% rename from examples/6.streaming_pipelines/verification_project.json rename to examples/0. get-start-with-your-level/4. streaming_pipelines/verification_project.json diff --git a/examples/1.computer_vision/faces_detection/faces_detection.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/faces_detection.ipynb similarity index 100% rename from examples/1.computer_vision/faces_detection/faces_detection.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/faces_detection.ipynb diff --git a/examples/1.computer_vision/faces_detection/img/segmentation_example.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/img/segmentation_example.png similarity index 100% rename from examples/1.computer_vision/faces_detection/img/segmentation_example.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/img/segmentation_example.png diff --git a/examples/1.computer_vision/faces_detection/img/segmentation_pool_look.png b/examples/1. cases-for-CV-NLP-AUDIO/0. 
computer_vision/faces_detection/img/segmentation_pool_look.png similarity index 100% rename from examples/1.computer_vision/faces_detection/img/segmentation_pool_look.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/img/segmentation_pool_look.png diff --git a/examples/1.computer_vision/faces_detection/img/segmentation_results_preview.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/img/segmentation_results_preview.png similarity index 100% rename from examples/1.computer_vision/faces_detection/img/segmentation_results_preview.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/img/segmentation_results_preview.png diff --git a/examples/1.computer_vision/faces_detection/instructions/detection_instruction.html b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/instructions/detection_instruction.html similarity index 93% rename from examples/1.computer_vision/faces_detection/instructions/detection_instruction.html rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/instructions/detection_instruction.html index d3f61677..c1d81bc9 100644 --- a/examples/1.computer_vision/faces_detection/instructions/detection_instruction.html +++ b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/instructions/detection_instruction.html @@ -1,194 +1,194 @@ - - - - -

About the task

- -

You will see images with people. You need to outline all the people faces on it.

- - -

Nuances of markup

-
    -
  1. - You need to outline ALL PEOPLE FACES in the image. -
  2. -
  3. - The bounding box should include all visible part of head: from crown to chin. - The box boarders should be as close as possible to the head contour, - not to capture the entire face and not leave empty backround area between the box and face. -
  4. -
  5. - Outline only those people, that have at least one of the visibile face parts: cheek, eyes. - Do not outline people visible only from the back of their heads. -
  6. -
  7. - People in hats, or whose faces are partially covered by external objects (glasses, a medical mask, a scarf, etc.) - should be outlined as well, if this does not contradict the previous paragraphs. - Faces completely covered by opaque objects, so that no part of the face is visible should not be marked. -
  8. -
- - - - - + + + + +

About the task

+ +

You will see images with people. You need to outline all the people's faces in them.

+ + +

Nuances of markup

+
    +
  1. + You need to outline ALL PEOPLE FACES in the image. +
  2. +
  3. + The bounding box should include all visible parts of the head: from crown to chin. + The box borders should be as close as possible to the head contour, + not to capture the entire face and not leave empty background area between the box and face. +
  4. +
  5. + Outline only those people who have at least one visible face part: cheek, eyes. + Do not outline people visible only from the back of their heads. +
  6. +
  7. + People in hats, or whose faces are partially covered by external objects (glasses, a medical mask, a scarf, etc.) + should be outlined as well, if this does not contradict the previous paragraphs. + Faces completely covered by opaque objects, so that no part of the face is visible, should not be marked. +
  8. +
+ + + + + diff --git a/examples/1.computer_vision/faces_detection/instructions/verification_instruction.html b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/instructions/verification_instruction.html similarity index 94% rename from examples/1.computer_vision/faces_detection/instructions/verification_instruction.html rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/instructions/verification_instruction.html index dc9fdb09..73b4a311 100644 --- a/examples/1.computer_vision/faces_detection/instructions/verification_instruction.html +++ b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/faces_detection/instructions/verification_instruction.html @@ -1,202 +1,202 @@ - - - - -

About the task

- -

- You will see images with people and their faces outlined with bounding boxes in it. - Your task is to check whether the faces are outlined correctly. - There are two possible answers in the task - 'Yes' and 'No'. - For an 'Yes' answer each of the people's faces in image should be outlined correctly. - If there is at least one wrong markup or at least one bounding box is missed - you should select 'No' as answer. -

- Please learn the rules of markup in initial project to find out the rules of defining the correctness of task. -

- - -

Rules of markup

-
    -
  1. - You need to outline ALL PEOPLE FACES in the image. -
  2. -
  3. - The bounding box should include all visible part of head: from crown to chin. - The box boarders should be as close as possible to the head contour, - not to capture the entire face and not leave empty backround area between the box and face. -
  4. -
  5. - Outline only those people, that have at least one of the visibile face parts: cheek, eyes. - Do not outline people visible only from the back of their heads. -
  6. -
  7. - People in hats, or whose faces are partially covered by external objects (glasses, a medical mask, a scarf, etc.) - should be outlined as well, if this does not contradict the previous paragraphs. - Faces completely covered by opaque objects, so that no part of the face is visible should not be marked. -
  8. -
- - - - - + + + + +

About the task

+ +

+ You will see images of people with their faces outlined by bounding boxes. + Your task is to check whether the faces are outlined correctly. + There are two possible answers in the task: 'Yes' and 'No'. + For a 'Yes' answer, every face in the image must be outlined correctly. + If at least one outline is wrong or at least one bounding box is missing, select 'No' as the answer. +

+ Please study the markup rules of the initial project to learn how the correctness of a task is determined. +

+ + +

Rules of markup

+
    +
  1. + You need to outline ALL PEOPLE FACES in the image. +
  2. +
  3. + The bounding box should include all visible parts of the head: from crown to chin. + The box borders should be as close as possible to the head contour, + not to capture the entire face and not leave empty background area between the box and face. +
  4. +
  5. + Outline only those people who have at least one visible face part: cheek, eyes. + Do not outline people visible only from the back of their heads. +
  6. +
  7. + People in hats, or whose faces are partially covered by external objects (glasses, a medical mask, a scarf, etc.) + should be outlined as well, if this does not contradict the previous paragraphs. + Faces completely covered by opaque objects, so that no part of the face is visible, should not be marked. +
  8. +
+ + + + + diff --git a/examples/1.computer_vision/image_classification/image_classification.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_classification/image_classification.ipynb similarity index 100% rename from examples/1.computer_vision/image_classification/image_classification.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_classification/image_classification.ipynb diff --git a/examples/1.computer_vision/image_classification/img/created_project.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_classification/img/created_project.png similarity index 100% rename from examples/1.computer_vision/image_classification/img/created_project.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_classification/img/created_project.png diff --git a/examples/1.computer_vision/image_classification/img/dataset_preview.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_classification/img/dataset_preview.png similarity index 100% rename from examples/1.computer_vision/image_classification/img/dataset_preview.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_classification/img/dataset_preview.png diff --git a/examples/1.computer_vision/image_classification/img/possible_results.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_classification/img/possible_results.png similarity index 100% rename from examples/1.computer_vision/image_classification/img/possible_results.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_classification/img/possible_results.png diff --git a/examples/1.computer_vision/image_collection/image_collection.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_collection/image_collection.ipynb similarity index 100% rename from examples/1.computer_vision/image_collection/image_collection.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/0. 
computer_vision/image_collection/image_collection.ipynb diff --git a/examples/1.computer_vision/image_collection/img/performer_interface.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_collection/img/performer_interface.png similarity index 100% rename from examples/1.computer_vision/image_collection/img/performer_interface.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/image_collection/img/performer_interface.png diff --git a/examples/1.computer_vision/object_detection/img/segmentation_example.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/object_detection/img/segmentation_example.png similarity index 100% rename from examples/1.computer_vision/object_detection/img/segmentation_example.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/object_detection/img/segmentation_example.png diff --git a/examples/1.computer_vision/object_detection/img/segmentation_pool_look.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/object_detection/img/segmentation_pool_look.png similarity index 100% rename from examples/1.computer_vision/object_detection/img/segmentation_pool_look.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/object_detection/img/segmentation_pool_look.png diff --git a/examples/1.computer_vision/object_detection/img/segmentation_results_preview.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/object_detection/img/segmentation_results_preview.png similarity index 100% rename from examples/1.computer_vision/object_detection/img/segmentation_results_preview.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/object_detection/img/segmentation_results_preview.png diff --git a/examples/1.computer_vision/object_detection/img/segmentation_task_look.png b/examples/1. cases-for-CV-NLP-AUDIO/0. 
computer_vision/object_detection/img/segmentation_task_look.png similarity index 100% rename from examples/1.computer_vision/object_detection/img/segmentation_task_look.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/object_detection/img/segmentation_task_look.png diff --git a/examples/1.computer_vision/object_detection/object_detection.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/object_detection/object_detection.ipynb similarity index 100% rename from examples/1.computer_vision/object_detection/object_detection.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/object_detection/object_detection.ipynb diff --git a/examples/1.computer_vision/text_recognition/img/performer_interface.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/text_recognition/img/performer_interface.png similarity index 100% rename from examples/1.computer_vision/text_recognition/img/performer_interface.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/text_recognition/img/performer_interface.png diff --git a/examples/1.computer_vision/text_recognition/img/possible_result.png b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/text_recognition/img/possible_result.png similarity index 100% rename from examples/1.computer_vision/text_recognition/img/possible_result.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/text_recognition/img/possible_result.png diff --git a/examples/1.computer_vision/text_recognition/text_recognition.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/text_recognition/text_recognition.ipynb similarity index 100% rename from examples/1.computer_vision/text_recognition/text_recognition.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/text_recognition/text_recognition.ipynb diff --git a/examples/1.computer_vision/video_collection/img/performer_interface.png b/examples/1. cases-for-CV-NLP-AUDIO/0. 
computer_vision/video_collection/img/performer_interface.png similarity index 100% rename from examples/1.computer_vision/video_collection/img/performer_interface.png rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/video_collection/img/performer_interface.png diff --git a/examples/1.computer_vision/video_collection/video_collection.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/video_collection/video_collection.ipynb similarity index 100% rename from examples/1.computer_vision/video_collection/video_collection.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/0. computer_vision/video_collection/video_collection.ipynb diff --git a/examples/3.audio_analysis/audio_classification/audio_classification.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_classification/audio_classification.ipynb similarity index 100% rename from examples/3.audio_analysis/audio_classification/audio_classification.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_classification/audio_classification.ipynb diff --git a/examples/3.audio_analysis/audio_classification/img/tasks_preview.png b/examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_classification/img/tasks_preview.png similarity index 100% rename from examples/3.audio_analysis/audio_classification/img/tasks_preview.png rename to examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_classification/img/tasks_preview.png diff --git a/examples/3.audio_analysis/audio_collection/audio_collection.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_collection/audio_collection.ipynb similarity index 100% rename from examples/3.audio_analysis/audio_collection/audio_collection.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_collection/audio_collection.ipynb diff --git a/examples/3.audio_analysis/audio_collection/img/tasks_preview.png b/examples/1. cases-for-CV-NLP-AUDIO/1. 
audio_analysis/audio_collection/img/tasks_preview.png similarity index 100% rename from examples/3.audio_analysis/audio_collection/img/tasks_preview.png rename to examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_collection/img/tasks_preview.png diff --git a/examples/3.audio_analysis/audio_transcription/audio_transcription.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_transcription/audio_transcription.ipynb similarity index 100% rename from examples/3.audio_analysis/audio_transcription/audio_transcription.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_transcription/audio_transcription.ipynb diff --git a/examples/3.audio_analysis/audio_transcription/img/task_interface.png b/examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_transcription/img/task_interface.png similarity index 100% rename from examples/3.audio_analysis/audio_transcription/img/task_interface.png rename to examples/1. cases-for-CV-NLP-AUDIO/1. audio_analysis/audio_transcription/img/task_interface.png diff --git a/examples/SQUAD2.0/SQUAD2.0_processing.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/SQUAD2.0_processing.ipynb similarity index 100% rename from examples/SQUAD2.0/SQUAD2.0_processing.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/SQUAD2.0_processing.ipynb diff --git a/examples/SQUAD2.0/img/marking_project_instruction.png b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/img/marking_project_instruction.png similarity index 100% rename from examples/SQUAD2.0/img/marking_project_instruction.png rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/img/marking_project_instruction.png diff --git a/examples/SQUAD2.0/img/marking_project_interface.png b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/img/marking_project_interface.png similarity index 100% rename from examples/SQUAD2.0/img/marking_project_interface.png rename to examples/1. cases-for-CV-NLP-AUDIO/2. 
nlp/SQUAD2.0/img/marking_project_interface.png diff --git a/examples/SQUAD2.0/img/verification_project_instruction.png b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/img/verification_project_instruction.png similarity index 100% rename from examples/SQUAD2.0/img/verification_project_instruction.png rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/img/verification_project_instruction.png diff --git a/examples/SQUAD2.0/img/verification_project_interface.png b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/img/verification_project_interface.png similarity index 100% rename from examples/SQUAD2.0/img/verification_project_interface.png rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/img/verification_project_interface.png diff --git a/examples/SQUAD2.0/marking_public_instruction.html b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/marking_public_instruction.html similarity index 100% rename from examples/SQUAD2.0/marking_public_instruction.html rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/marking_public_instruction.html diff --git a/examples/SQUAD2.0/verification_public_instruction.html b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/verification_public_instruction.html similarity index 100% rename from examples/SQUAD2.0/verification_public_instruction.html rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/SQUAD2.0/verification_public_instruction.html diff --git a/examples/5.nlp/intent_classification/intent_classification.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/intent_classification/intent_classification.ipynb similarity index 100% rename from examples/5.nlp/intent_classification/intent_classification.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/intent_classification/intent_classification.ipynb diff --git a/examples/5.nlp/intent_classification/public_instructions/public_instruction.html b/examples/1. cases-for-CV-NLP-AUDIO/2. 
nlp/intent_classification/public_instructions/public_instruction.html similarity index 100% rename from examples/5.nlp/intent_classification/public_instructions/public_instruction.html rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/intent_classification/public_instructions/public_instruction.html diff --git a/examples/5.nlp/sentiment_analysis/img/pool_preview.png b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/sentiment_analysis/img/pool_preview.png similarity index 100% rename from examples/5.nlp/sentiment_analysis/img/pool_preview.png rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/sentiment_analysis/img/pool_preview.png diff --git a/examples/5.nlp/sentiment_analysis/sentiment_analysis.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/sentiment_analysis/sentiment_analysis.ipynb similarity index 100% rename from examples/5.nlp/sentiment_analysis/sentiment_analysis.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/sentiment_analysis/sentiment_analysis.ipynb diff --git a/examples/5.nlp/text_classification/img/tasks_preview.png b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/text_classification/img/tasks_preview.png similarity index 100% rename from examples/5.nlp/text_classification/img/tasks_preview.png rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/text_classification/img/tasks_preview.png diff --git a/examples/5.nlp/text_classification/text_classification.ipynb b/examples/1. cases-for-CV-NLP-AUDIO/2. nlp/text_classification/text_classification.ipynb similarity index 100% rename from examples/5.nlp/text_classification/text_classification.ipynb rename to examples/1. cases-for-CV-NLP-AUDIO/2. nlp/text_classification/text_classification.ipynb diff --git a/examples/2. cases-for-applied-tasks/.DS_Store b/examples/2. cases-for-applied-tasks/.DS_Store new file mode 100644 index 00000000..5008ddfc Binary files /dev/null and b/examples/2. 
cases-for-applied-tasks/.DS_Store differ diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/google_map1.png b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/google_map1.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/google_map1.png rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/google_map1.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/google_map2.png b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/google_map2.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/google_map2.png rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/google_map2.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/instruction1.png b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/instruction1.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/instruction1.png rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/instruction1.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/instruction2.png b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/instruction2.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/instruction2.png rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/instruction2.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/instruction3.png b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/instruction3.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/instruction3.png rename to examples/2. 
cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/instruction3.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/instruction4.png b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/instruction4.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/instruction4.png rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/instruction4.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/performer_interface1.png b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/performer_interface1.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/performer_interface1.png rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/performer_interface1.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/performer_interface2.png b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/performer_interface2.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/performer_interface2.png rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/performer_interface2.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/performer_interface3.png b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/performer_interface3.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/performer_interface3.png rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/performer_interface3.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/yandex_map1.png b/examples/2. 
cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/yandex_map1.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/yandex_map1.png rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/yandex_map1.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/img/yandex_map2.png b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/yandex_map2.png similarity index 100% rename from examples/2.spatial_crowdsourcing/0.simplest_example/img/yandex_map2.png rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/img/yandex_map2.png diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/instruction.html b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/instruction.html similarity index 98% rename from examples/2.spatial_crowdsourcing/0.simplest_example/instruction.html rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/instruction.html index 1247ff2c..c1d3a3c7 100644 --- a/examples/2.spatial_crowdsourcing/0.simplest_example/instruction.html +++ b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/instruction.html @@ -1,201 +1,201 @@ - - - -

Your task is to find the specified entrance to the metro and take a picture of it from both sides.

-

Phone and app settings

-
  1. Grant access to your device's camera, location, and memory. -
    - - -
    - Open Applications, find the Toloka app in the list, go to Permissions and slide the access switches for the camera, location, and memory to the active position. -
    - - -
    -

    Example

    -
    -
    -
    -
    -
    - - -
    - Open Settings and find the Toloka app in the list. Allow access to the camera by sliding the switch to the active position. Also, allow access to detect your location, and to read and write photos in the gallery. -
    - - -
    -

    Example

    -
    -
    -
    -
    -
  2. Set up geodata transfer. - -
    - - -
    - • On the phone. Go to the location settings and enable High accuracy detection. -
    - - -
    -

    Example

    -
    -
    - • In the camera app. Open the camera on your device, go to settings and enable the geotag option. -
    -
    -
    - - -
    - • Open Settings → Privacy → Location Services. Set the switch to the active position. -
    - - -
    -

    Example

    -
    -
    - • Find the Camera app in the list below, click on it and enable the option While Using the App. -
    - - -
    -

    Example

    -
    -
    -
  3. Make sure that the date and time setting matches the time zone of your region.
  4. In the Toloka app settings, select a method to submit tasks: Over Wi-Fi or Mobile data. -

    - Important. We recommend choosing the Mobile data method: tasks are sent immediately after you tap Submit. With the Over Wi-Fi method, if the phone does not connect to Wi-Fi before the task is completed, the app will not send the data, and the task will be considered not completed. -
-

Steps to follow

-
    -
  1. Open the map to see all available tasks. Select the task you want and click Reserve task or Start task.
  2. -
  3. Go to the point listed in the task.
  4. -
  5. Find the specified metro entrance.
  6. -
  7. Take two photos of the entrance, one from the right and one from the left. Each photo should show the entire entrance and the floor so that the cleanliness around the metro can be properly assessed.
  8. -
  9. If the specified entrance does not exist, take four photos in all directions around you so that we can see where you are and that there is no metro entrance there.
  10. -
  11. Tap Submit.
-

Important. If you try to complete a task too far from the point indicated on the map, or if the app cannot access your location, an error message will appear, and it will not be possible to submit the task.

- -
- - -
-

Steps to follow

-
    -
  1. Open the map to see all available tasks. Select the task you want and tap Reserve task or Start task.
  2. -
  3. Go to the point specified in the task.
  4. -
  5. Find the specified metro entrance.
  6. -
  7. Take two photos of the entrance, from the right and from the left. The photos must show all the entrances and the floor. Make sure the cleanliness around the metro entrance can be assessed from the photos.
  8. -
  9. If the specified entrance does not exist, take 4 photos in all directions so that they show where you are and that there is no metro entrance here.
  10. -
  11. Tap Submit.
-

Important. If you try to complete the task too far from the point indicated on the map, or if the app cannot access your location, an error message will appear and you will not be able to submit the task.

-
-
- -

Photo requirements

-
    -
  1. Use only those photos that you have taken while completing the task. Do not attach images from the internet, screenshots, or pictures taken on other days.
  2. -
  3. Photos must be sharp. We won't accept dark, overexposed, or blurred pictures.
-

Examples of photos

-

Correct. The photo is sharp and the ground is completely visible. If this is not a metro entrance, the panorama from the back is photographed to determine what kind of place it is.

-
- - -
-

Example

-
-
-
- - -
-

Example

-
-
-
- - -
-

Example

-
-
- -

Incorrect. The building entrance and the ground around the entrance are not visible. It's impossible to determine which metro entrance is in the photo.

-
- - -
-

Example

-
-
-
- - -
-

Example

-
-
-
- - -
-

Example

-
-
- -

Checking completed tasks

+ + + +

Your task is to find the specified entrance to the metro and take a picture of it from both sides.

+

Phone and app settings

+
  1. Grant access to your device's camera, location, and memory. +
    + + +
    + Open Applications, find the Toloka app in the list, go to Permissions and slide the access switches for the camera, location, and memory to the active position. +
    + + +
    +

    Example

    +
    +
    +
    +
    +
    + + +
    + Open Settings and find the Toloka app in the list. Allow access to the camera by sliding the switch to the active position. Also, allow access to detect your location, and to read and write photos in the gallery. +
    + + +
    +

    Example

    +
    +
    +
    +
    +
  2. Set up geodata transfer. + +
    + + +
    + • On the phone. Go to the location settings and enable High accuracy detection. +
    + + +
    +

    Example

    +
    +
    + • In the camera app. Open the camera on your device, go to settings and enable the geotag option. +
    +
    +
    + + +
    + • Open Settings → Privacy → Location Services. Set the switch to the active position. +
    + + +
    +

    Example

    +
    +
    + • Find the Camera app in the list below, click on it and enable the option While Using the App. +
    + + +
    +

    Example

    +
    +
    +
  3. Make sure that the date and time setting matches the time zone of your region.
  4. In the Toloka app settings, select a method to submit tasks: Over Wi-Fi or Mobile data. +

    + Important. We recommend choosing the Mobile data method: tasks are sent immediately after you tap Submit. With the Over Wi-Fi method, if the phone does not connect to Wi-Fi before the task is completed, the app will not send the data, and the task will be considered not completed. +
+

Steps to follow

+
    +
  1. Open the map to see all available tasks. Select the task you want and click Reserve task or Start task.
  2. +
  3. Go to the point listed in the task.
  4. +
  5. Find the specified metro entrance.
  6. +
  7. Take two photos of the entrance, one from the right and one from the left. Each photo should show the entire entrance and the floor so that the cleanliness around the metro can be properly assessed.
  8. +
  9. If the specified entrance does not exist, take four photos in all directions around you so that we can see where you are and that there is no metro entrance there.
  10. +
  11. Tap Submit.
+

Important. If you try to complete a task too far from the point indicated on the map, or if the app cannot access your location, an error message will appear, and it will not be possible to submit the task.

+ +
+ + +
+

Steps to follow

+
    +
  1. Open the map to see all available tasks. Select the task you want and tap Reserve task or Start task.
  2. +
  3. Go to the point specified in the task.
  4. +
  5. Find the specified metro entrance.
  6. +
  7. Take two photos of the entrance, from the right and from the left. The photos must show all the entrances and the floor. Make sure the cleanliness around the metro entrance can be assessed from the photos.
  8. +
  9. If the specified entrance does not exist, take 4 photos in all directions so that they show where you are and that there is no metro entrance here.
  10. +
  11. Tap Submit.
+

Important. If you try to complete the task too far from the point indicated on the map, or if the app cannot access your location, an error message will appear and you will not be able to submit the task.

+
+
+ +

Photo requirements

+
    +
  1. Use only those photos that you have taken while completing the task. Do not attach images from the internet, screenshots, or pictures taken on other days.
  2. +
  3. Photos must be sharp. We won't accept dark, overexposed, or blurred pictures.
+

Examples of photos

+

Correct. The photo is sharp and the ground is completely visible. If this is not a metro entrance, the panorama from the back is photographed to determine what kind of place it is.

+
+ + +
+

Example

+
+
+
+ + +
+

Example

+
+
+
+ + +
+

Example

+
+
+ +

Incorrect. The building entrance and the ground around the entrance are not visible. It's impossible to determine which metro entrance is in the photo.

+
+ + +
+

Example

+
+
+
+ + +
+

Example

+
+
+
+ + +
+

Example

+
+
+ +

Checking completed tasks

We usually check tasks within a few days. The maximum time for checking a task is 5 days.

\ No newline at end of file diff --git a/examples/2.spatial_crowdsourcing/0.simplest_example/spatial_crowdsourcing.ipynb b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/spatial_crowdsourcing.ipynb similarity index 97% rename from examples/2.spatial_crowdsourcing/0.simplest_example/spatial_crowdsourcing.ipynb rename to examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/spatial_crowdsourcing.ipynb index d794236d..71657869 100644 --- a/examples/2.spatial_crowdsourcing/0.simplest_example/spatial_crowdsourcing.ipynb +++ b/examples/2. cases-for-applied-tasks/0.spatial_crowdsourcing/0.simplest_example/spatial_crowdsourcing.ipynb @@ -1,804 +1,804 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Spatial crowdsourcing example\n", - "\n", - "In this example, we will work with *field tasks*. Field tasks are available for performers only in Toloka mobile apps. Requesters set the field tasks, and performers choose them as points on the map. 
Performers are usually requested to visit the place and check something in person or take photos.\n", - "\n", - "**Examples:**\n", - "\n", - "* monitor prices, products, and outdoor items of interest ([case study](https://toloka.ai/blog/fuel?utm_source=github&utm_medium=site&utm_campaign=tolokakit));\n", - "* play the role of *Secret Shopper*: leave reviews for stores and cafés;\n", - "* collect data on businesses: whether a particular organization is closed or has changed its office hours;\n", - "* monitor promotional events;\n", - "* check on outdoor advertising and many more.\n", - "\n", - "\n", - ">**Note:** For more information about field tasks, check out the Spatial Crowdsourcing page on our [website](https://toloka.ai/usecases/spatialcrowd?utm_source=github&utm_medium=site&utm_campaign=tolokakit) and more precisely [Toloka Requester's Guide](https://toloka.ai/en/docs/guide/tutorials/walk?utm_source=github&utm_medium=site&utm_campaign=tolokakit)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Call to action\n", - "If you found some bugs or have a new feature idea, don't hesitate to [open a new issue on Github](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", - "Like our library and examples? Star [our repo on Github](https://github.com/Toloka/toloka-kit)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Introduction\n", - "\n", - "In this example, we will collect pictures of the Moscow metro entrances.\n", - "Because usually spatial crowdsourcing assignments converge longer than others, for this example we have chosen the most frequently used stations. 
Many potential Toloka performers visit these subways every day.\n", - "\n", - "\n", - "This example also can be reused for production tasks such as monitoring the state of objects, checking the presence of an organization or other physical object.\n", - "\n", - "For example to check:\n", - "\n", - "- that the area around the store is clean and tidy,\n", - "- that the store has placed a new ad at the entrance,\n", - "- how your advertisement was placed on the billboards,\n", - "- the cleanliness of the waste disposal areas,\n", - "- that the bench is installed correctly in the right place." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Data preparation\n", - "\n", - "In most cases, field tasks are tied to a specific point on the map. In our case, these are subway entrances. The performer must come to the entrance and complete some actions.\n", - "\n", - "There is a way to set these points on the map through [toloka-kit](https://toloka.ai/en/docs/toloka-kit/?utm_source=github&utm_medium=site&utm_campaign=tolokakit).\n", - "\n", - "But first, let's take an example of the Okhotny Ryad metro station, from which it is easy to get to the Red Square and Kremlin. Let's open the maps in [Yandex](https://yandex.com/maps/213/moscow/search/metro%20Okhotny%20Ryad/?ll=37.617426%2C55.756762&sctx=ZAAAAAgBEAAaKAoSCfFmDd5XtSVAEXmxMEROl0hAEhIJAAAAAIAdV0ARe2X%2FuFTeQ0AoCjgAQNytB0gBVc3MzD5qAnJ1cACdAc3MTD2gAQCoAQC9AVmBJ6HCARH6tInp5AKeqLT7A%2Bi%2FyPT6BuoBAPIBAPgBAIICEm1ldHJvIE9raG90bnkgUnlhZIoCAA%3D%3D&sll=37.617426%2C55.756762&sspn=0.007170%2C0.002595&z=17.41) or [Google](https://www.google.com/maps/place/Okhotnyy+ryad/@55.7574454,37.615033,18z/data=!4m5!3m4!1s0x46b54a5adb3b8d5b:0x22be0270b70ed47d!8m2!3d55.757689!4d37.6164801). And find the required entrances.\n", - "\n", - "You will see something like this:\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \"Okhotny\n", - "
\n", - " Figure 1. Okhotny Ryad on Google Maps\n", - "


\n", - " \"Okhotny\n", - "
\n", - " Figure 2. Okhotny Ryad on Yandex Maps\n", - "
\n", - "\n", - "Now, to get the point on the map:\n", - "\n", - "1. Find the 1st entrance to the Okhotny Ryad metro station - a red-letter 'M' with a number 1 on the left side of the screen.\n", - "\n", - "2. *On Google Maps:* right-click on the entrance point and coordinates will immediately appear in a pop-up menu.\n", - "\n", - " *On Yandex Maps:* click next to the entrance point, then click on the pop-up card 'Tverskaya Street,1' and coordinates will appear in the left menu.\n", - "\n", - ">**Note:** It is also helpful to switch maps to satellite mode or check yourself by panorama views.\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \"Coordinates\n", - "
\n", - " Figure 3. Coordinates on Google Maps\n", - "


\n", - " \"Coordinates\n", - "
\n", - " Figure 4. Coordinates on Yandex Maps\n", - "
\n", - "\n", - "We will get the following coordinates: 55.756916, 37.614546.\n", - "\n", - "These are latitude and longitude geographic coordinates expressed as [decimal fractions of a degree](https://en.wikipedia.org/wiki/Decimal_degrees)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a project\n", - "\n", - "### Set up the environment\n", - "Specifically, we will use the following libraries:\n", - "\n", - "* `toloka-kit` to develop main Toloka functionalities;\n", - "* `pandas` and `numpy` to perform data manipulation;\n", - "* `ipyplot` to deal with images in Jupyter Notebooks." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%capture\n", - "!pip install toloka-kit==0.1.26\n", - "!pip install pandas\n", - "!pip install ipyplot\n", - "\n", - "import datetime\n", - "import io\n", - "import logging\n", - "import sys\n", - "import getpass\n", - "\n", - "import ipyplot\n", - "import pandas\n", - "from PIL import Image\n", - "\n", - "import toloka.client as toloka\n", - "import toloka.client.project.template_builder as tb" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "logging.basicConfig(\n", - " format='[%(levelname)s] %(name)s: %(message)s',\n", - " level=logging.INFO,\n", - " stream=sys.stdout,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a `TolokaClient` PRODUCTION or SANDBOX instance. All API calls will pass through it.\n", - "\n", - "Obtain your [OAuth token](https://toloka.ai/en/docs/api/concepts/access?utm_source=github&utm_medium=site&utm_campaign=tolokakit#access__token) from Toloka or [Toloka Sandbox](https://toloka.ai/en/docs/guide/concepts/sandbox?utm_source=github&utm_medium=site&utm_campaign=tolokakit)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'\n", - "print(toloka_client.get_requester())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create a dataset\n", - "We will use a ready-made dataset of Moscow metro entrances and select several stations from the circular (koltsevaya) line.\n", - "\n", - "This dataset is compiled by the Toloka team and is distributed under the Creative Commons Attribution 4.0 international license\n", - "[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!curl https://tlk.s3.yandex.net/dataset/moscow_metro_entrance_2020_en.tsv --output dataset.tsv\n", - "dataset = pandas.read_csv('dataset.tsv', sep='\\t')\n", - "dataset = dataset[dataset['line'].isin(['Koltsevaya line'])].sample(frac=1).iloc[:5]\n", - "print(dataset)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ">**Note:** Dataset contains not only the coordinates of the stations but also their names. In field tasks, the performer will come to the real object only once and must clearly understand what is required to do there.\n", - "> It is always helpful to give as much information as possible so that the performer does not make a mistake.\n", - "\n", - "For example, attach a reference photo: from which angle you want to capture the entrance to the subway station.\n", - "\n", - "Otherwise, the performer may misunderstand the task and capture the object too close or too shallow. For example, capture only the door when we need a general plan, or capture a general building plan when we want a sign with the opening hours of the organization." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Input and output data formats\n", - "\n", - "We need to decide what kind of data we have as input and what data we want to receive at the output.\n", - "\n", - "At the input, we have:\n", - "\n", - "- coordinates of the object (entrance);\n", - "- station name and entrance number.\n", - "\n", - "At the output, we want to get:\n", - "\n", - "- verdict - if the object was found or not;\n", - "- photos of the place itself, if the object is found;\n", - "- photos of the surrounding area, if the object is not found - to make sure that the performer was where;\n", - "- the coordinates of the performer at the time of task execution to check that it was the right place." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "input_specification = {\n", - " 'coordinates': toloka.project.StringSpec(),\n", - " 'entrance': toloka.project.StringSpec(),\n", - "}\n", - "output_specification = {\n", - " 'verdict': toloka.project.StringSpec(),\n", - " 'entrance_images': toloka.project.ArrayFileSpec(required=False),\n", - " 'around_images': toloka.project.ArrayFileSpec(required=False),\n", - " 'performer_coordinates': toloka.project.CoordinatesSpec(current_location=True, required=False),\n", - " 'comment': toloka.project.StringSpec(required=False),\n", - "}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create instructions for performers\n", - "\n", - "It is essential to prepare detailed instructions covering all the corner cases.\n", - "\n", - "Let's upload a prepared instruction from an HTML file and then analyze best practices for writing instruction for a task." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "prepared_instruction = open('instruction.html').read().strip()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For field tasks, a good start for the instruction is configuring a device.\n", - "\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"How\n", - "
\n", - " Figure 5. How to configure device\n", - "
\n", - "\n", - "Next, we describe the steps for completing the task.\n", - "\n", - "Be sure to provide exact steps what performers should do if something went wrong: there is no requested object or limited access to it, or something else went wrong.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Steps\n", - "
\n", - " Figure 6. Steps for performers\n", - "
\n", - "\n", - "Then, it is always helpful to include requirements for photographs or other information you require from the performer.\n", - "\n", - "Usually, these requirements match with rules on how you will check the task.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Photo\n", - "
\n", - " Figure 7. Photo requirements\n", - "
\n", - "\n", - "Another best practice is to provide examples of correctly taken photos and photos that will not be accepted in the task.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Photo\n", - "
\n", - " Figure 8. Photo examples\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Project interface\n", - "\n", - "In the cell below, we define some UI elements of the project, such as the task's title, links, text fields, buttons, conditions. We use [template builder](https://toloka.ai/en/docs/toloka-kit/source/toloka.client.project.template_builder?utm_source=github&utm_medium=site&utm_campaign=tolokakit) (`tb`) instance from `toloka-kit`.\n", - "\n", - ">**Note:** You can also create an interface for this task in [Template Builder](https://clck.ru/SMCHv) web version using our [documentation](https://toloka.ai/en/docs/template-builder/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Project title\n", - "header = tb.MarkdownViewV1('# Cleanliness at the entrance to the metro:\\n---')\n", - "# Name of metro station and number of the entrance\n", - "entrance_name = tb.TextViewV1(tb.InputData('entrance'), label='Entrance name:')\n", - "# Situation on the spot\n", - "# If the required entrance was found or not\n", - "workflow_options = tb.ButtonRadioGroupFieldV1(\n", - " tb.OutputData('verdict'),\n", - " [\n", - " tb.GroupFieldOption('ok', 'I found the right entrance'),\n", - " tb.GroupFieldOption('no_obj', 'The required entrance is not there'),\n", - " ],\n", - " validation=tb.RequiredConditionV1(hint='Choose one of the answer options'),\n", - ")\n", - "\n", - "# Task interface if the required object was found by a performer\n", - "# First, we set all the elements that will be shown\n", - "\n", - "# Description on what to do next\n", - "exist_header = tb.MarkdownViewV1('** Photos of the entrance from the right and left sides **\\n\\n_Take two photos of the entrance on the right and on the left. The photo should show the entire entrances and the floor. 
So that you can assess the cleanliness around the entrance to the metro._')\n", - "\n", - "# Links to examples\n", - "example_links = tb.LinkGroupViewV1(\n", - " [\n", - " tb.LinkGroupViewV1.Link(\n", - " 'https://tlk.s3.yandex.net/toloka-kit/images_for_instructions/0spatial_good1.png',\n", - " 'Example1',\n", - " ),\n", - " tb.LinkGroupViewV1.Link(\n", - " 'https://tlk.s3.yandex.net/toloka-kit/images_for_instructions/0spatial_good2.png',\n", - " 'Example2',\n", - " ),\n", - " tb.LinkGroupViewV1.Link(\n", - " 'https://tlk.s3.yandex.net/toloka-kit/images_for_instructions/0spatial_good3.png',\n", - " 'Example3',\n", - " ),\n", - " ]\n", - ")\n", - "\n", - "# Field for loading responses\n", - "image_loader = tb.MediaFileFieldV1(\n", - " tb.OutputData('entrance_images'),\n", - " tb.MediaFileFieldV1.Accept(photo=True, gallery=True),\n", - " multiple=True,\n", - " validation=tb.RequiredConditionV1(hint='There must be at least 2 photos of the entrance: from the right and from the left'),\n", - ")\n", - "\n", - "# Define the condition by which all the necessary interface elements will be shown\n", - "exist_ui = tb.IfHelperV1(\n", - " tb.EqualsConditionV1('ok', tb.OutputData('verdict')),\n", - " tb.ListViewV1([exist_header, example_links, image_loader]),\n", - ")\n", - "\n", - "# Task interface if the required object was NOT found by a performer\n", - "# First, we set all the elements that will be shown\n", - "\n", - "# Description on what to do next\n", - "miss_header = tb.MarkdownViewV1('**Take 4 photos in all directions**\\n\\n_So that we can understand where you are and that there is no entrance to the metro here._')\n", - "\n", - "# Field for loading responses\n", - "miss_image_loader = tb.MediaFileFieldV1(\n", - " tb.OutputData('around_images'),\n", - " tb.MediaFileFieldV1.Accept(photo=True, gallery=True),\n", - " multiple=True,\n", - " validation=tb.RequiredConditionV1(hint='There must be at least 4 photos of the place'),\n", - ")\n", - "\n", - "# Define the 
condition by which all the necessary interface elements will be shown\n", - "miss_ui = tb.IfHelperV1(\n", - " tb.EqualsConditionV1('no_obj', tb.OutputData('verdict')),\n", - " tb.ListViewV1([miss_header, miss_image_loader]),\n", - ")\n", - "\n", - "# Validation of the performer's geolocation\n", - "coordinates_validation = tb.AllConditionV1(\n", - " [\n", - " # Check that geolocation reading is generally possible\n", - " tb.RequiredConditionV1(\n", - " tb.OutputData('performer_coordinates'),\n", - " hint=\"Couldn't get your coordinates. Please enable geolocation.\",\n", - " ),\n", - " # Check that the performer is close enough to the required location\n", - " tb.DistanceConditionV1(\n", - " tb.LocationData(),\n", - " tb.InputData('coordinates'),\n", - " max=500,\n", - " hint='You are too far from the entrance.',\n", - " ),\n", - " ]\n", - ")\n", - "\n", - "# Plugin to make tasks look nice in the performer's interface\n", - "task_width_plugin = tb.TolokaPluginV1(\n", - " layout = tb.TolokaPluginV1.TolokaPluginLayout('scroll', task_width=400)\n", - ")\n", - "\n", - "# Plugin for writing device coordinates at the time of task's execution to the output\n", - "coordinates_save_plugin = tb.TriggerPluginV1(\n", - " fire_immediately=True,\n", - " action=tb.SetActionV1(\n", - " tb.OutputData('performer_coordinates'),\n", - " tb.LocationData()\n", - " ),\n", - ")\n", - "\n", - "# How performers will see the task\n", - "project_interface = toloka.project.TemplateBuilderViewSpec(\n", - " view=tb.ListViewV1(\n", - " [header, entrance_name, workflow_options, exist_ui, miss_ui],\n", - " validation=coordinates_validation,\n", - " ),\n", - " plugins=[task_width_plugin, coordinates_save_plugin]\n", - ")\n", - "\n", - "print('interface prepared')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The performer will see the interface like this:\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \"Photo\n", - " \n", - " \"Photo\n", - " \n", - " \"Photo\n", - "
\n", - " Figure 9. How performer will see your task\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create a project in Toloka\n", - "\n", - "Finally, we can create an instance of the [Project class](https://toloka.ai/en/docs/toloka-kit/source/toloka.client.project#module-toloka.client.project?utm_source=github&utm_medium=site&utm_campaign=tolokakit) and send it to Toloka or Toloka Sandbox.\n", - "\n", - "The project with all the instructions and interface will appear in your [Toloka Requester's account](https://platform.toloka.ai/requester?utm_source=github&utm_medium=site&utm_campaign=tolokakit)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "project = toloka.Project(\n", - " public_name='Cleanliness of metro entrances',\n", - " public_description='Take two photos of the entrance on the right and on the left. The photo should show the entire entrances and the floor. So that you can assess the cleanliness around the entrance to the metro.',\n", - " public_instructions=prepared_instruction,\n", - " # We indicate that this task is selected by the performer on the map.\n", - " # Tasks without reference to the map are displayed simply as a list.\n", - " assignments_issuing_type='MAP_SELECTOR',\n", - " # We will indicate how the title of the task and description will be displayed on the map\n", - " assignments_issuing_view_config=toloka.Project.AssignmentsIssuingViewConfig(\n", - " title_template='Photo metro entrance', # Set title as a constant\n", - " description_template='{{inputParams[\"entrance\"]}}', # Set description from the input parameters\n", - " # That way we can have different description for each point on the map\n", - " ),\n", - " task_spec=toloka.project.task_spec.TaskSpec(\n", - " input_spec=input_specification,\n", - " output_spec=output_specification,\n", - " view_spec=project_interface\n", - " ),\n", - ")\n", - "\n", - "project = toloka_client.create_project(project)" - ] - }, - { - "cell_type": 
"markdown", - "metadata": {}, - "source": [ - "## Create a pool\n", - "\n", - "Here we create an instance of the [Pool class](https://toloka.ai/en/docs/toloka-kit/source/toloka.client.pool?utm_source=github&utm_medium=site&utm_campaign=tolokakit) and send it to Toloka or Toloka Sandbox.\n", - "\n", - "The pool is a set of tasks sent out to performers. Learn more about working with a [pool](https://toloka.ai/en/docs/guide/concepts/pool-main?utm_source=github&utm_medium=site&utm_campaign=tolokakit)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pool = toloka.Pool(\n", - " project_id=project.id,\n", - " private_name='Metro entrances',\n", - " may_contain_adult_content=False,\n", - " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=10),\n", - " reward_per_assignment=2,\n", - " assignment_max_duration_seconds=60 * 60 * 2, # We give 2 hours to complete the task,\n", - " # So that performer has time to book a task, get there and complete it.\n", - " auto_accept_solutions=False, # We will check the completed tasks manually before paying for them.\n", - " auto_accept_period_day=5, # Number of days to determine if we pay\n", - " # Only performers from the Toloka mobile application are allowed to pick field tasks\n", - " filter=(\n", - " (toloka.filter.ClientType == 'TOLOKA_APP') &\n", - " (toloka.filter.Languages.in_('EN')) &\n", - " (toloka.filter.RegionByPhone.in_(225)) & # Russia\n", - " (toloka.filter.RegionByIp.in_(213)) # Moscow\n", - " ),\n", - " defaults=toloka.Pool.Defaults(\n", - " # Overlap is not used for most field tasks\n", - " default_overlap_for_new_task_suites=1,\n", - " default_overlap_for_new_tasks=1,\n", - " ),\n", - ")\n", - "\n", - "pool = toloka_client.create_pool(pool)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Add tasks and run the pool\n", - "\n", - "To add tasks and bind them to map coordinates, you need to create instances of 
[TaskSuite](https://toloka.ai/en/docs/toloka-kit/source/toloka.client?utm_source=github&utm_medium=site&utm_campaign=tolokakit#module-toloka.client.task_suite) class. Only `TaskSuites` allow bindings with geolocation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "task_suites = [\n", - " toloka.TaskSuite(\n", - " pool_id=pool.id,\n", - " latitude=round(row['latitude'], 10), # First number in a pair of coordinates\n", - " longitude=round(row['longitude'], 10), # Second number\n", - " overlap=1,\n", - " tasks=[\n", - " toloka.Task(input_values={\n", - " 'entrance': row['name'],\n", - " 'coordinates': f\"{round(row['latitude'], 10)}, {round(row['longitude'], 10)}\"\n", - " })\n", - " ],\n", - " )\n", - " for i, row in dataset.iterrows()]\n", - "\n", - "task_suites = toloka_client.create_task_suites(task_suites)\n", - "\n", - "pool_id = pool.id\n", - "pool = toloka_client.open_pool(pool.id)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get results and check them\n", - "\n", - "Validation of field tasks differs from validation of other types of tasks.\n", - "\n", - "We suggest using the following rules for field task validation:\n", - "\n", - "1. First, you need to check that the performer came where you wanted. We did this in the task interface by comparing the device coordinates with the required coordinates.\n", - "\n", - " But it will be helpful to cross-check that the user has attached real photographs of the place you required. For example, check panorama views on Yandex or Google Maps. Or an old photo of the same place, if you have one.\n", - "\n", - "2. Secondly, you need to remember what you pay the performer, even if the object does not exist on the spot. It is not the fault of the performer.\n", - "\n", - " The performer came and verified this information for you, and for that work should be paid with the same reward.\n", - "\n", - "3. 
In the third, field tasks usually converge much slower, and there is no point in waiting for the entire pool to close. We suggest retrieving completed tasks from the pool and sending them for verification periodically.\n", - "\n", - " This allows you to collect information faster. And the performers get the well-deserved reward more quickly." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you interrupted the execution of this notebook, uncomment the code below and include your `Pool ID`.\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# pool_id = '23461858'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's get all completed but not yet verified tasks from our pool:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "results_list = []\n", - "\n", - "for assignment in toloka_client.get_assignments(pool_id=pool_id, status='SUBMITTED'):\n", - " for task, solution in zip(assignment.tasks, assignment.solutions):\n", - " results_list.append({\n", - " 'assignment_id': assignment.id,\n", - " 'input_values': task.input_values,\n", - " 'output_values': solution.output_values\n", - " })\n", - "\n", - "print(f'New results received: {len(results_list)}' if len(results_list) > 0 else 'Not received any new results yet, try to run this cell later.')\n", - "\n", - "results_iter = iter(results_list)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you get at least one result, you can run the cells below. In the first one, you see all the information received. In the second and third, you can accept or reject the assignment.\n", - "\n", - "You can execute these cells several times." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "res = next(results_iter, None)\n", - "\n", - "if res is not None:\n", - " images = []\n", - " for id in res['output_values']['entrance_images']:\n", - " out_b = io.BytesIO()\n", - " toloka_client.download_attachment(id, out_b)\n", - " images.append(Image.open(out_b).convert('RGBA'))\n", - "\n", - " print(f\"Entrance name:\\t\\t{res['input_values']['station']}\")\n", - " print(f\"Object found:\\t\\t{res['output_values']['verdict']}\")\n", - " print(f\"Object coordinates:\\t{res['input_values']['coordinates']}\")\n", - " print(f\"Performer coordinates:\\t{res['output_values']['performer_coordinates']}\")\n", - " ipyplot.plot_images(\n", - " images,\n", - " max_images=10,\n", - " img_width=600,\n", - " )\n", - "else:\n", - " print('No more results')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's try to reject or accept the completed assignment.\n", - "\n", - "Remember that it will be automatically accepted after the time specified when creating the pool.\n", - "\n", - "If you try to accept or reject the same task twice, you will get an error." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# If you want to accept assignment\n", - "if res is not None:\n", - " updated_assignment = toloka_client.accept_assignment(res['assignment_id'], 'Well done!')\n", - " print(updated_assignment.status)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# If you want to reject assignment\n", - "if res is not None:\n", - " updated_assignment = toloka_client.reject_assignment(res['assignment_id'], 'Type your issue here')\n", - " print(updated_assignment.status)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ">**Note:** You do not have to check all the photos you received by yourself. You can also validate field tasks through Toloka. Check how it's done in our [object detection example.](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/object_detection) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/object_detection/object_detection.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Summary\n", - "We have considered the simple example of a field task. In fact, with field tasks, you can carry out various projects: from a Secret Shopper to monitoring the state of urban infrastructure.\n", - "\n", - "The main takeaway is not to forget about the specifics of such projects: give the most detailed instructions to the performers and carefully check the results." 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.12" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Spatial crowdsourcing example\n", + "\n", + "In this example, we will work with *field tasks*. Field tasks are available for performers only in Toloka mobile apps. Requesters set the field tasks, and performers choose them as points on the map. Performers are usually requested to visit the place and check something in person or take photos.\n", + "\n", + "**Examples:**\n", + "\n", + "* monitor prices, products, and outdoor items of interest ([case study](https://toloka.ai/blog/fuel?utm_source=github&utm_medium=site&utm_campaign=tolokakit));\n", + "* play the role of *Secret Shopper*: leave reviews for stores and cafés;\n", + "* collect data on businesses: whether a particular organization is closed or has changed its office hours;\n", + "* monitor promotional events;\n", + "* check on outdoor advertising and many more.\n", + "\n", + "\n", + ">**Note:** For more information about field tasks, check out the Spatial Crowdsourcing page on our [website](https://toloka.ai/usecases/spatialcrowd?utm_source=github&utm_medium=site&utm_campaign=tolokakit) and more precisely [Toloka Requester's Guide](https://toloka.ai/en/docs/guide/tutorials/walk?utm_source=github&utm_medium=site&utm_campaign=tolokakit)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Call to action\n", + "If you find a bug or have an idea for a new feature, don't hesitate to [open a new issue on GitHub](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", + "Like our library and examples? Star [our repo on GitHub](https://github.com/Toloka/toloka-kit)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "In this example, we will collect pictures of the Moscow metro entrances.\n", + "Because spatial crowdsourcing assignments usually take longer to complete than other tasks, we have chosen the busiest stations for this example. Many potential Toloka performers pass through these stations every day.\n", + "\n", + "\n", + "This example can also be reused for production tasks such as monitoring the state of objects or checking the presence of an organization or other physical object.\n", + "\n", + "For example, you can check:\n", + "\n", + "- that the area around the store is clean and tidy,\n", + "- that the store has placed a new ad at the entrance,\n", + "- how your advertisement was placed on the billboards,\n", + "- the cleanliness of the waste disposal areas,\n", + "- that the bench is installed correctly in the right place." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data preparation\n", + "\n", + "In most cases, field tasks are tied to a specific point on the map. In our case, these are subway entrances. The performer must come to the entrance and complete some actions.\n", + "\n", + "There is a way to set these points on the map through [toloka-kit](https://toloka.ai/en/docs/toloka-kit/?utm_source=github&utm_medium=site&utm_campaign=tolokakit).\n", + "\n", + "But first, let's take the Okhotny Ryad metro station as an example; from there it is easy to get to Red Square and the Kremlin. 
Let's open the maps in [Yandex](https://yandex.com/maps/213/moscow/search/metro%20Okhotny%20Ryad/?ll=37.617426%2C55.756762&sctx=ZAAAAAgBEAAaKAoSCfFmDd5XtSVAEXmxMEROl0hAEhIJAAAAAIAdV0ARe2X%2FuFTeQ0AoCjgAQNytB0gBVc3MzD5qAnJ1cACdAc3MTD2gAQCoAQC9AVmBJ6HCARH6tInp5AKeqLT7A%2Bi%2FyPT6BuoBAPIBAPgBAIICEm1ldHJvIE9raG90bnkgUnlhZIoCAA%3D%3D&sll=37.617426%2C55.756762&sspn=0.007170%2C0.002595&z=17.41) or [Google](https://www.google.com/maps/place/Okhotnyy+ryad/@55.7574454,37.615033,18z/data=!4m5!3m4!1s0x46b54a5adb3b8d5b:0x22be0270b70ed47d!8m2!3d55.757689!4d37.6164801). And find the required entrances.\n", + "\n", + "You will see something like this:\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
 \n", + " Figure 1. Okhotny Ryad on Google Maps\n", + " \n", + " Figure 2. Okhotny Ryad on Yandex Maps\n", + "
\n", + "\n", + "Now, to get the point on the map:\n", + "\n", + "1. Find the 1st entrance to the Okhotny Ryad metro station - a red-letter 'M' with a number 1 on the left side of the screen.\n", + "\n", + "2. *On Google Maps:* right-click on the entrance point and coordinates will immediately appear in a pop-up menu.\n", + "\n", + " *On Yandex Maps:* click next to the entrance point, then click on the pop-up card 'Tverskaya Street,1' and coordinates will appear in the left menu.\n", + "\n", + ">**Note:** It is also helpful to switch maps to satellite mode or check yourself by panorama views.\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
 \n", + " Figure 3. Coordinates on Google Maps\n", + " \n", + " Figure 4. Coordinates on Yandex Maps\n", + "
\n", + "\n", + "We will get the following coordinates: 55.756916, 37.614546.\n", + "\n", + "These are latitude and longitude geographic coordinates expressed as [decimal fractions of a degree](https://en.wikipedia.org/wiki/Decimal_degrees)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a project\n", + "\n", + "### Set up the environment\n", + "Specifically, we will use the following libraries:\n", + "\n", + "* `toloka-kit` to develop main Toloka functionalities;\n", + "* `pandas` and `numpy` to perform data manipulation;\n", + "* `ipyplot` to deal with images in Jupyter Notebooks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==0.1.26\n", + "!pip install pandas\n", + "!pip install ipyplot\n", + "\n", + "import datetime\n", + "import io\n", + "import logging\n", + "import sys\n", + "import getpass\n", + "\n", + "import ipyplot\n", + "import pandas\n", + "from PIL import Image\n", + "\n", + "import toloka.client as toloka\n", + "import toloka.client.project.template_builder as tb" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(\n", + " format='[%(levelname)s] %(name)s: %(message)s',\n", + " level=logging.INFO,\n", + " stream=sys.stdout,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a `TolokaClient` PRODUCTION or SANDBOX instance. All API calls will pass through it.\n", + "\n", + "Obtain your [OAuth token](https://toloka.ai/en/docs/api/concepts/access?utm_source=github&utm_medium=site&utm_campaign=tolokakit#access__token) from Toloka or [Toloka Sandbox](https://toloka.ai/en/docs/guide/concepts/sandbox?utm_source=github&utm_medium=site&utm_campaign=tolokakit)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'\n", + "print(toloka_client.get_requester())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create a dataset\n", + "We will use a ready-made dataset of Moscow metro entrances and select several stations from the circular (Koltsevaya) line.\n", + "\n", + "This dataset is compiled by the Toloka team and is distributed under the Creative Commons Attribution 4.0 international license\n", + "[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!curl https://tlk.s3.yandex.net/dataset/moscow_metro_entrance_2020_en.tsv --output dataset.tsv\n", + "dataset = pandas.read_csv('dataset.tsv', sep='\\t')\n", + "dataset = dataset[dataset['line'].isin(['Koltsevaya line'])].sample(frac=1).iloc[:5]\n", + "print(dataset)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + ">**Note:** The dataset contains not only the coordinates of the stations but also their names. In field tasks, the performer visits the real object only once and must clearly understand what to do there.\n", + "> It is always helpful to give as much information as possible so that the performer does not make a mistake.\n", + "\n", + "For example, attach a reference photo that shows the angle from which you want the entrance to the subway station captured.\n", + "\n", + "Otherwise, the performer may misunderstand the task and capture the object from too close or from too far away. 
For example, they may capture only the door when we need a general view, or capture a general view of the building when we want a sign with the opening hours of the organization."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Input and output data formats\n", + "\n", + "We need to decide what kind of data we have as input and what data we want to receive at the output.\n", + "\n", + "At the input, we have:\n", + "\n", + "- coordinates of the object (entrance);\n", + "- station name and entrance number.\n", + "\n", + "At the output, we want to get:\n", + "\n", + "- verdict - if the object was found or not;\n", + "- photos of the place itself, if the object is found;\n", + "- photos of the surrounding area, if the object is not found - to make sure that the performer was there;\n", + "- the coordinates of the performer at the time of task execution to check that it was the right place." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "input_specification = {\n", + " 'coordinates': toloka.project.StringSpec(),\n", + " 'entrance': toloka.project.StringSpec(),\n", + "}\n", + "output_specification = {\n", + " 'verdict': toloka.project.StringSpec(),\n", + " 'entrance_images': toloka.project.ArrayFileSpec(required=False),\n", + " 'around_images': toloka.project.ArrayFileSpec(required=False),\n", + " 'performer_coordinates': toloka.project.CoordinatesSpec(current_location=True, required=False),\n", + " 'comment': toloka.project.StringSpec(required=False),\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create instructions for performers\n", + "\n", + "It is essential to prepare detailed instructions covering all the corner cases.\n", + "\n", + "Let's load prepared instructions from an HTML file and then go over best practices for writing instructions for a task."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Read the instruction file and close it as soon as the content is loaded\n", + "with open('instruction.html') as instruction_file:\n", + "    prepared_instruction = instruction_file.read().strip()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For field tasks, a good way to start the instructions is to explain how to configure the device.\n", + "\n", + "\n", + "\n", + " \n", + " \n", + "
 \n", + " Figure 5. How to configure device\n", + "
 \n", + "\n", + "Next, we describe the steps for completing the task.\n", + "\n", + "Be sure to spell out exactly what performers should do if something goes wrong: for example, the requested object is missing or access to it is restricted.\n", + "\n", + "\n", + " \n", + " \n", + "
 \n", + " Figure 6. Steps for performers\n", + "
 \n", + "\n", + "Then, it is always helpful to include requirements for photographs or other information you require from the performer.\n", + "\n", + "Usually, these requirements match the rules you will use to check the task.\n", + "\n", + "\n", + " \n", + " \n", + "
 \n", + " Figure 7. Photo requirements\n", + "
\n", + "\n", + "Another best practice is to provide examples of correctly taken photos and photos that will not be accepted in the task.\n", + "\n", + "\n", + " \n", + " \n", + "
 \n", + " Figure 8. Photo examples\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Project interface\n", + "\n", + "In the cell below, we define some UI elements of the project, such as the task's title, links, text fields, buttons, conditions. We use [template builder](https://toloka.ai/en/docs/toloka-kit/source/toloka.client.project.template_builder?utm_source=github&utm_medium=site&utm_campaign=tolokakit) (`tb`) instance from `toloka-kit`.\n", + "\n", + ">**Note:** You can also create an interface for this task in [Template Builder](https://clck.ru/SMCHv) web version using our [documentation](https://toloka.ai/en/docs/template-builder/?utm_source=github&utm_medium=site&utm_campaign=tolokakit)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Project title\n", + "header = tb.MarkdownViewV1('# Cleanliness at the entrance to the metro:\\n---')\n", + "# Name of metro station and number of the entrance\n", + "entrance_name = tb.TextViewV1(tb.InputData('entrance'), label='Entrance name:')\n", + "# Situation on the spot\n", + "# If the required entrance was found or not\n", + "workflow_options = tb.ButtonRadioGroupFieldV1(\n", + " tb.OutputData('verdict'),\n", + " [\n", + " tb.GroupFieldOption('ok', 'I found the right entrance'),\n", + " tb.GroupFieldOption('no_obj', 'The required entrance is not there'),\n", + " ],\n", + " validation=tb.RequiredConditionV1(hint='Choose one of the answer options'),\n", + ")\n", + "\n", + "# Task interface if the required object was found by a performer\n", + "# First, we set all the elements that will be shown\n", + "\n", + "# Description on what to do next\n", + "exist_header = tb.MarkdownViewV1('** Photos of the entrance from the right and left sides **\\n\\n_Take two photos of the entrance on the right and on the left. The photo should show the entire entrances and the floor. 
So that you can assess the cleanliness around the entrance to the metro._')\n", + "\n", + "# Links to examples\n", + "example_links = tb.LinkGroupViewV1(\n", + " [\n", + " tb.LinkGroupViewV1.Link(\n", + " 'https://tlk.s3.yandex.net/toloka-kit/images_for_instructions/0spatial_good1.png',\n", + " 'Example1',\n", + " ),\n", + " tb.LinkGroupViewV1.Link(\n", + " 'https://tlk.s3.yandex.net/toloka-kit/images_for_instructions/0spatial_good2.png',\n", + " 'Example2',\n", + " ),\n", + " tb.LinkGroupViewV1.Link(\n", + " 'https://tlk.s3.yandex.net/toloka-kit/images_for_instructions/0spatial_good3.png',\n", + " 'Example3',\n", + " ),\n", + " ]\n", + ")\n", + "\n", + "# Field for loading responses\n", + "image_loader = tb.MediaFileFieldV1(\n", + " tb.OutputData('entrance_images'),\n", + " tb.MediaFileFieldV1.Accept(photo=True, gallery=True),\n", + " multiple=True,\n", + " validation=tb.RequiredConditionV1(hint='There must be at least 2 photos of the entrance: from the right and from the left'),\n", + ")\n", + "\n", + "# Define the condition by which all the necessary interface elements will be shown\n", + "exist_ui = tb.IfHelperV1(\n", + " tb.EqualsConditionV1('ok', tb.OutputData('verdict')),\n", + " tb.ListViewV1([exist_header, example_links, image_loader]),\n", + ")\n", + "\n", + "# Task interface if the required object was NOT found by a performer\n", + "# First, we set all the elements that will be shown\n", + "\n", + "# Description on what to do next\n", + "miss_header = tb.MarkdownViewV1('**Take 4 photos in all directions**\\n\\n_So that we can understand where you are and that there is no entrance to the metro here._')\n", + "\n", + "# Field for loading responses\n", + "miss_image_loader = tb.MediaFileFieldV1(\n", + " tb.OutputData('around_images'),\n", + " tb.MediaFileFieldV1.Accept(photo=True, gallery=True),\n", + " multiple=True,\n", + " validation=tb.RequiredConditionV1(hint='There must be at least 4 photos of the place'),\n", + ")\n", + "\n", + "# Define the 
condition by which all the necessary interface elements will be shown\n", + "miss_ui = tb.IfHelperV1(\n", + " tb.EqualsConditionV1('no_obj', tb.OutputData('verdict')),\n", + " tb.ListViewV1([miss_header, miss_image_loader]),\n", + ")\n", + "\n", + "# Validation of the performer's geolocation\n", + "coordinates_validation = tb.AllConditionV1(\n", + " [\n", + " # Check that geolocation reading is generally possible\n", + " tb.RequiredConditionV1(\n", + " tb.OutputData('performer_coordinates'),\n", + " hint=\"Couldn't get your coordinates. Please enable geolocation.\",\n", + " ),\n", + " # Check that the performer is close enough to the required location\n", + " tb.DistanceConditionV1(\n", + " tb.LocationData(),\n", + " tb.InputData('coordinates'),\n", + " max=500,\n", + " hint='You are too far from the entrance.',\n", + " ),\n", + " ]\n", + ")\n", + "\n", + "# Plugin to make tasks look nice in the performer's interface\n", + "task_width_plugin = tb.TolokaPluginV1(\n", + " layout = tb.TolokaPluginV1.TolokaPluginLayout('scroll', task_width=400)\n", + ")\n", + "\n", + "# Plugin for writing device coordinates at the time of task's execution to the output\n", + "coordinates_save_plugin = tb.TriggerPluginV1(\n", + " fire_immediately=True,\n", + " action=tb.SetActionV1(\n", + " tb.OutputData('performer_coordinates'),\n", + " tb.LocationData()\n", + " ),\n", + ")\n", + "\n", + "# How performers will see the task\n", + "project_interface = toloka.project.TemplateBuilderViewSpec(\n", + " view=tb.ListViewV1(\n", + " [header, entrance_name, workflow_options, exist_ui, miss_ui],\n", + " validation=coordinates_validation,\n", + " ),\n", + " plugins=[task_width_plugin, coordinates_save_plugin]\n", + ")\n", + "\n", + "print('interface prepared')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The performer will see the interface like this:\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
 \n", + " Figure 9. How performer will see your task\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create a project in Toloka\n", + "\n", + "Finally, we can create an instance of the [Project class](https://toloka.ai/en/docs/toloka-kit/source/toloka.client.project#module-toloka.client.project?utm_source=github&utm_medium=site&utm_campaign=tolokakit) and send it to Toloka or Toloka Sandbox.\n", + "\n", + "The project with all the instructions and interface will appear in your [Toloka Requester's account](https://platform.toloka.ai/requester?utm_source=github&utm_medium=site&utm_campaign=tolokakit)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project = toloka.Project(\n", + " public_name='Cleanliness of metro entrances',\n", + " public_description='Take two photos of the entrance on the right and on the left. The photo should show the entire entrances and the floor. So that you can assess the cleanliness around the entrance to the metro.',\n", + " public_instructions=prepared_instruction,\n", + " # We indicate that this task is selected by the performer on the map.\n", + " # Tasks without reference to the map are displayed simply as a list.\n", + " assignments_issuing_type='MAP_SELECTOR',\n", + " # We will indicate how the title of the task and description will be displayed on the map\n", + " assignments_issuing_view_config=toloka.Project.AssignmentsIssuingViewConfig(\n", + " title_template='Photo metro entrance', # Set title as a constant\n", + " description_template='{{inputParams[\"entrance\"]}}', # Set description from the input parameters\n", + " # That way we can have different description for each point on the map\n", + " ),\n", + " task_spec=toloka.project.task_spec.TaskSpec(\n", + " input_spec=input_specification,\n", + " output_spec=output_specification,\n", + " view_spec=project_interface\n", + " ),\n", + ")\n", + "\n", + "project = toloka_client.create_project(project)" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "## Create a pool\n", + "\n", + "Here we create an instance of the [Pool class](https://toloka.ai/en/docs/toloka-kit/source/toloka.client.pool?utm_source=github&utm_medium=site&utm_campaign=tolokakit) and send it to Toloka or Toloka Sandbox.\n", + "\n", + "The pool is a set of tasks sent out to performers. Learn more about working with a [pool](https://toloka.ai/en/docs/guide/concepts/pool-main?utm_source=github&utm_medium=site&utm_campaign=tolokakit)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka.Pool(\n", + " project_id=project.id,\n", + " private_name='Metro entrances',\n", + " may_contain_adult_content=False,\n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=10),\n", + " reward_per_assignment=2,\n", + " assignment_max_duration_seconds=60 * 60 * 2, # We give 2 hours to complete the task,\n", + " # So that performer has time to book a task, get there and complete it.\n", + " auto_accept_solutions=False, # We will check the completed tasks manually before paying for them.\n", + " auto_accept_period_day=5, # Number of days to determine if we pay\n", + " # Only performers from the Toloka mobile application are allowed to pick field tasks\n", + " filter=(\n", + " (toloka.filter.ClientType == 'TOLOKA_APP') &\n", + " (toloka.filter.Languages.in_('EN')) &\n", + " (toloka.filter.RegionByPhone.in_(225)) & # Russia\n", + " (toloka.filter.RegionByIp.in_(213)) # Moscow\n", + " ),\n", + " defaults=toloka.Pool.Defaults(\n", + " # Overlap is not used for most field tasks\n", + " default_overlap_for_new_task_suites=1,\n", + " default_overlap_for_new_tasks=1,\n", + " ),\n", + ")\n", + "\n", + "pool = toloka_client.create_pool(pool)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Add tasks and run the pool\n", + "\n", + "To add tasks and bind them to map coordinates, you need to create instances of 
[TaskSuite](https://toloka.ai/en/docs/toloka-kit/source/toloka.client?utm_source=github&utm_medium=site&utm_campaign=tolokakit#module-toloka.client.task_suite) class. Only `TaskSuite` objects can be bound to a geolocation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "task_suites = [\n", + " toloka.TaskSuite(\n", + " pool_id=pool.id,\n", + " latitude=round(row['latitude'], 10), # First number in a pair of coordinates\n", + " longitude=round(row['longitude'], 10), # Second number\n", + " overlap=1,\n", + " tasks=[\n", + " toloka.Task(input_values={\n", + " 'entrance': row['name'],\n", + " 'coordinates': f\"{round(row['latitude'], 10)}, {round(row['longitude'], 10)}\"\n", + " })\n", + " ],\n", + " )\n", + " for i, row in dataset.iterrows()]\n", + "\n", + "task_suites = toloka_client.create_task_suites(task_suites)\n", + "\n", + "pool_id = pool.id\n", + "pool = toloka_client.open_pool(pool.id)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Get results and check them\n", + "\n", + "Validation of field tasks differs from validation of other types of tasks.\n", + "\n", + "We suggest using the following rules for field task validation:\n", + "\n", + "1. First, you need to check that the performer went to the right place. We did this in the task interface by comparing the device coordinates with the required coordinates.\n", + "\n", + " It is also helpful to cross-check that the performer attached real photographs of the required place, for example, against panorama views on Yandex or Google Maps or against an older photo of the same place, if you have one.\n", + "\n", + "2. Secondly, you need to remember that you pay the performer even if the object does not exist on the spot. It is not the fault of the performer.\n", + "\n", + " The performer came and verified this information for you, and that work should be paid with the same reward.\n", + "\n", + "3. 
Third, field tasks usually converge much more slowly, and there is no point in waiting for the entire pool to close. We suggest periodically retrieving completed tasks from the pool and sending them for verification.\n", + "\n", + " This lets you collect information faster, and the performers get their well-deserved reward sooner." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you interrupted the execution of this notebook, uncomment the code below and insert your `Pool ID`.\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pool_id = '23461858'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's get all completed but not yet verified tasks from our pool:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "results_list = []\n", + "\n", + "for assignment in toloka_client.get_assignments(pool_id=pool_id, status='SUBMITTED'):\n", + "    for task, solution in zip(assignment.tasks, assignment.solutions):\n", + "        results_list.append({\n", + "            'assignment_id': assignment.id,\n", + "            'input_values': task.input_values,\n", + "            'output_values': solution.output_values\n", + "        })\n", + "\n", + "print(f'New results received: {len(results_list)}' if len(results_list) > 0 else 'No new results yet; try running this cell later.')\n", + "\n", + "results_iter = iter(results_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you get at least one result, you can run the cells below. The first one shows all the information received. The second and third let you accept or reject the assignment.\n", + "\n", + "You can execute these cells several times."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "res = next(results_iter, None)\n", + "\n", + "if res is not None:\n", + " images = []\n", + " for id in res['output_values']['entrance_images']:\n", + " out_b = io.BytesIO()\n", + " toloka_client.download_attachment(id, out_b)\n", + " images.append(Image.open(out_b).convert('RGBA'))\n", + "\n", + " print(f\"Entrance name:\\t\\t{res['input_values']['entrance']}\")\n", + " print(f\"Object found:\\t\\t{res['output_values']['verdict']}\")\n", + " print(f\"Object coordinates:\\t{res['input_values']['coordinates']}\")\n", + " print(f\"Performer coordinates:\\t{res['output_values']['performer_coordinates']}\")\n", + " ipyplot.plot_images(\n", + " images,\n", + " max_images=10,\n", + " img_width=600,\n", + " )\n", + "else:\n", + " print('No more results')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's try to reject or accept the completed assignment.\n", + "\n", + "Remember that it will be automatically accepted after the time specified when creating the pool.\n", + "\n", + "If you try to accept or reject the same assignment twice, you will get an error."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# If you want to accept assignment\n", + "if res is not None:\n", + " updated_assignment = toloka_client.accept_assignment(res['assignment_id'], 'Well done!')\n", + " print(updated_assignment.status)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# If you want to reject assignment\n", + "if res is not None:\n", + " updated_assignment = toloka_client.reject_assignment(res['assignment_id'], 'Type your issue here')\n", + " print(updated_assignment.status)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + ">**Note:** You do not have to check all the photos you received by yourself. You can also validate field tasks through Toloka. Check how it's done in our [object detection example.](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/object_detection) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/object_detection/object_detection.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "We have considered the simple example of a field task. In fact, with field tasks, you can carry out various projects: from a Secret Shopper to monitoring the state of urban infrastructure.\n", + "\n", + "The main takeaway is not to forget about the specifics of such projects: give the most detailed instructions to the performers and carefully check the results." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/4.ranking/side_by_side_image_comparision/img/performer_interface.png b/examples/2. cases-for-applied-tasks/1.ranking/side_by_side_image_comparision/img/performer_interface.png similarity index 100% rename from examples/4.ranking/side_by_side_image_comparision/img/performer_interface.png rename to examples/2. cases-for-applied-tasks/1.ranking/side_by_side_image_comparision/img/performer_interface.png diff --git a/examples/4.ranking/side_by_side_image_comparision/img/possible_results.png b/examples/2. cases-for-applied-tasks/1.ranking/side_by_side_image_comparision/img/possible_results.png similarity index 100% rename from examples/4.ranking/side_by_side_image_comparision/img/possible_results.png rename to examples/2. cases-for-applied-tasks/1.ranking/side_by_side_image_comparision/img/possible_results.png diff --git a/examples/4.ranking/side_by_side_image_comparision/side_by_side_comparision.ipynb b/examples/2. cases-for-applied-tasks/1.ranking/side_by_side_image_comparision/side_by_side_comparision.ipynb similarity index 96% rename from examples/4.ranking/side_by_side_image_comparision/side_by_side_comparision.ipynb rename to examples/2. cases-for-applied-tasks/1.ranking/side_by_side_image_comparision/side_by_side_comparision.ipynb index 8ea11e05..22977d78 100644 --- a/examples/4.ranking/side_by_side_image_comparision/side_by_side_comparision.ipynb +++ b/examples/2. 
cases-for-applied-tasks/1.ranking/side_by_side_image_comparision/side_by_side_comparision.ipynb @@ -1,627 +1,627 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Side-by-side image comparison\n", - "We have a set of icons.\n", - "We need to find out which icon people prefer and determine the top icon out of the set.\n", - "We ask performers to look at the icons and choose the one they prefer and then we aggregate these results to obtain the top icon.\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Call to action\n", - "If you found some bugs or have a new feature idea, don't hesitate to [open a new issue on Github](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", - "Like our library and examples? Star [our repo on Github](https://github.com/Toloka/toloka-kit)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Prepare environment and import all we'll need." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%capture\n", - "!pip install toloka-kit==0.1.26\n", - "!pip install crowd-kit==1.0.0\n", - "!pip install pandas\n", - "!pip install ipyplot\n", - "\n", - "import datetime\n", - "import itertools\n", - "import sys\n", - "import time\n", - "import logging\n", - "import getpass\n", - "\n", - "import ipyplot\n", - "import pandas\n", - "\n", - "import toloka.client as toloka\n", - "import toloka.client.project.template_builder as tb\n", - "from crowdkit.aggregation import NoisyBradleyTerry" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "logging.basicConfig(\n", - " format='[%(levelname)s] %(name)s: %(message)s',\n", - " level=logging.INFO,\n", - " stream=sys.stdout,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Сreate toloka-client instance. All api calls will go through it. 
More about OAuth token in our [Learn the basics example](https://github.com/Toloka/toloka-kit/tree/main/examples/0.getting_started/0.learn_the_basics) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'\n", - "print(toloka_client.get_requester())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a project" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "project = toloka.Project(\n", - " assignments_issuing_type='AUTOMATED',\n", - " public_name='Which icon do you like more?',\n", - " public_description='Look at the icons and decide which one you like more.',\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create task interface" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "main_interface = tb.SideBySideLayoutV1(\n", - " items=[\n", - " tb.ImageViewV1(url=tb.InputData(path='image_left'), full_height=True),\n", - " tb.ImageViewV1(url=tb.InputData(path='image_right'), full_height=True),\n", - " ],\n", - " controls=tb.ButtonRadioGroupFieldV1(\n", - " data=tb.OutputData(path='result'),\n", - " label='Which icon do you like more?',\n", - " options=[\n", - " tb.GroupFieldOption(label='Left', value='LEFT'),\n", - " tb.GroupFieldOption(label='Right', value='RIGHT'),\n", - " tb.GroupFieldOption(label='Loading error', value='ERROR'),\n", - " ]\n", - " )\n", - ")\n", - "\n", - "hot_keys_plugin = tb.HotkeysPluginV1(\n", - " key_0=tb.SetActionV1(data=tb.OutputData(path='result'), 
payload='ERROR'),\n", - " key_1=tb.SetActionV1(data=tb.OutputData(path='result'), payload='LEFT'),\n", - " key_2=tb.SetActionV1(data=tb.OutputData(path='result'), payload='RIGHT'),\n", - ")\n", - "\n", - "project_interface = toloka.project.view_spec.TemplateBuilderViewSpec(\n", - " config=tb.TemplateBuilder(\n", - " view=main_interface,\n", - " plugins=[hot_keys_plugin],\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Set data specification. And set task interface to project." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "input_specification = {\n", - " 'image_left': toloka.project.field_spec.UrlSpec(),\n", - " 'image_right': toloka.project.field_spec.UrlSpec(),\n", - "}\n", - "output_specification = {'result': toloka.project.field_spec.StringSpec()}\n", - "\n", - "project.task_spec = toloka.project.task_spec.TaskSpec(\n", - " input_spec=input_specification,\n", - " output_spec=output_specification,\n", - " view_spec=project_interface,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Write short and simple \tinstructions." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "project.public_instructions = \"\"\"

Look at the icons and decide which one you like more.

\n", - "

Select \"Left\" if you like the icon on the left more.

\n", - "

Select \"Right\" if you like the icon on the right more.

\n", - "

Select \"Loadinf error\" if the picture failed to load.

\"\"\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a project." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "project = toloka_client.create_project(project)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The performer will see the interface like this:\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \"Photo\n", - "
\n", - " Figure 1. How performer will see your task\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a pool\n", - "Specify the [pool parameters.](https://toloka.ai/en/docs/guide/concepts/pool_poolparams?utm_source=github&utm_medium=site&utm_campaign=tolokakit)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pool = toloka.Pool(\n", - " project_id=project.id,\n", - " # Give the pool any convenient name. You are the only one who will see it.\n", - " private_name='Which icon do you like more',\n", - " may_contain_adult_content=False,\n", - " # Set the price per task page.\n", - " reward_per_assignment=0.01,\n", - " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),\n", - " # Time given to complete a task suite\n", - " assignment_max_duration_seconds=600,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Select English-speaking performers" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pool.filter = toloka.filter.Languages.in_('EN')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Set up [Quality control](https://toloka.ai/en/docs/guide/concepts/control?utm_source=github&utm_medium=site&utm_campaign=tolokakit). Set up the Submitted responses quality control rule. Restrict the number of responses per user to one. This way you will only get one answer from each user and thus ensure a variety of opinions." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pool.quality_control.add_action(\n", - " collector=toloka.collectors.AnswerCount(),\n", - " conditions=[toloka.conditions.AssignmentsAcceptedCount == 1],\n", - " action=toloka.actions.RestrictionV2(\n", - " scope=toloka.user_restriction.UserRestriction.PROJECT,\n", - " duration=3,\n", - " duration_unit='DAYS',\n", - " private_comment='No need more answers from this performer',\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Overlap. This is the number of users who will complete the same task. Since you are interested in a variety of opinions, select a big overlap for each task. For example, 10." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pool.defaults = toloka.Pool.Defaults(default_overlap_for_new_task_suites=10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Specify\tthe number of tasks per page. 1 task per page. A performer will only see one pair of images on a page." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pool.set_mixer_config(\n", - " real_tasks_count=1,\n", - " golden_tasks_count=0,\n", - " training_tasks_count=0\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create pool" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pool = toloka_client.create_pool(pool)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Preparing and uploading tasks\n", - "\n", - "This example uses a small data set with images.\n", - "\n", - "The dataset used is collected by Toloka team and distributed under a Creative Commons Attribution 4.0 International license\n", - "[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!curl https://tlk.s3.yandex.net/dataset/toloka_logos/toloka_logos.tsv --output dataset.tsv\n", - "dataset = pandas.read_csv('dataset.tsv', sep='\\t')\n", - "with pandas.option_context(\"max_colwidth\", 80):\n", - " print(dataset)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Our project is a pairwise comparison of two images. But our dataset contains just a flat list. Let's create a dataset that contains all possible pairs for comparison." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = pandas.DataFrame(itertools.combinations(dataset['url'], 2), columns=['image_left', 'image_right'])\n", - "with pandas.option_context(\"max_colwidth\", 70):\n", - " display(dataset)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create pool tasks" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tasks = [\n", - " toloka.Task(\n", - " pool_id=pool.id,\n", - " input_values={\n", - " 'image_left': row['image_left'],\n", - " 'image_right': row['image_right'],\n", - " }\n", - " )\n", - " for i, row in dataset.iterrows()\n", - "]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Upload tasks" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "created_tasks = toloka_client.create_tasks(tasks, allow_defaults=True)\n", - "print(len(created_tasks.items))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Start the pool.\n", - "\n", - "**Important.** Remember that real Toloka performers will complete the tasks.\n", - "Double check that everything is correct\n", - "with your project configuration before you start the pool" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pool = toloka_client.open_pool(pool.id)\n", - "print(pool.status)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Receiving responses" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wait until the pool is completed." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pool_id = pool.id\n", - "\n", - "def wait_pool_for_close(pool_id, minutes_to_wait=1):\n", - " sleep_time = 60 * minutes_to_wait\n", - " pool = toloka_client.get_pool(pool_id)\n", - " while not pool.is_closed():\n", - " op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])\n", - " op = toloka_client.wait_operation(op)\n", - " percentage = op.details['value'][0]['result']['value']\n", - " print(\n", - " f' {datetime.datetime.now().strftime(\"%H:%M:%S\")}\\t'\n", - " f'Pool {pool.id} - {percentage}%'\n", - " )\n", - " time.sleep(sleep_time)\n", - " pool = toloka_client.get_pool(pool.id)\n", - " print('Pool was closed.')\n", - "\n", - "wait_pool_for_close(pool_id)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Get responses\n", - "\n", - "When all the tasks are completed, look at the responses from performers." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "answers = []\n", - "\n", - "for assignment in toloka_client.get_assignments(pool_id=pool_id, status='ACCEPTED'):\n", - " for task, solution in zip(assignment.tasks, assignment.solutions):\n", - " answers.append(\n", - " [\n", - " task.input_values['image_left'],\n", - " task.input_values['image_right'],\n", - " solution.output_values['result'],\n", - " assignment.user_id\n", - " ]\n", - " )\n", - "\n", - "print(f'answers count: {len(answers)}')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ranking after a pairwise comparison is quite a difficult task. We will use the Bradley-Terry algorithm, which is already implemented in the Crowd-Kit and allows you to get the result in a few lines of code.\n", - "\n", - "> David R. Hunter. 2004.\n", - "> MM algorithms for generalized Bradley-Terry models\n", - "> Ann. Statist., Vol. 
32, 1 (2004): 384–406.\n", - ">\n", - ">\n", - "> Bradley, R. A. and Terry, M. E. 1952.\n", - "> Rank analysis of incomplete block designs. I. The method of paired comparisons.\n", - "> Biometrika, Vol. 39 (1952): 324–345." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Prepare dataframe\n", - "answers_df = pandas.DataFrame(answers, columns=['left', 'right', 'label', 'worker'])\n", - "\n", - "answers_df = answers_df[(answers_df.label == 'LEFT') | (answers_df.label == 'RIGHT')]\n", - "answers_df['label'] = answers_df.apply(lambda row: row[row['label'].lower()], axis=1)\n", - "\n", - "# Run aggregation\n", - "result = NoisyBradleyTerry().fit_predict(answers_df).sort_values(ascending=False)\n", - "print(result)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's look at the ranking results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "images = result.index.values\n", - "labels = result.values\n", - "ipyplot.plot_images(\n", - " images=images,\n", - " labels=labels,\n", - " max_images=6,\n", - " img_width=200,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**You** can see the ranked images. Some possible results are shown in figure 2 below.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Possible\n", - "
\n", - " Figure 2. Possible results.\n", - "
" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.12" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Side-by-side image comparison\n", + "We have a set of icons.\n", + "We need to find out which icon people prefer and determine the top icon out of the set.\n", + "We ask performers to look at the icons and choose the one they prefer and then we aggregate these results to obtain the top icon.\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Call to action\n", + "If you found some bugs or have a new feature idea, don't hesitate to [open a new issue on Github](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", + "Like our library and examples? Star [our repo on Github](https://github.com/Toloka/toloka-kit)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Prepare environment and import all we'll need." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==0.1.26\n", + "!pip install crowd-kit==1.0.0\n", + "!pip install pandas\n", + "!pip install ipyplot\n", + "\n", + "import datetime\n", + "import itertools\n", + "import sys\n", + "import time\n", + "import logging\n", + "import getpass\n", + "\n", + "import ipyplot\n", + "import pandas\n", + "\n", + "import toloka.client as toloka\n", + "import toloka.client.project.template_builder as tb\n", + "from crowdkit.aggregation import NoisyBradleyTerry" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(\n", + " format='[%(levelname)s] %(name)s: %(message)s',\n", + " level=logging.INFO,\n", + " stream=sys.stdout,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a toloka-client instance. All API calls will go through it. 
More about OAuth token in our [Learn the basics example](https://github.com/Toloka/toloka-kit/tree/main/examples/0.getting_started/0.learn_the_basics) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'\n", + "print(toloka_client.get_requester())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a project" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project = toloka.Project(\n", + " assignments_issuing_type='AUTOMATED',\n", + " public_name='Which icon do you like more?',\n", + " public_description='Look at the icons and decide which one you like more.',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create task interface" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "main_interface = tb.SideBySideLayoutV1(\n", + " items=[\n", + " tb.ImageViewV1(url=tb.InputData(path='image_left'), full_height=True),\n", + " tb.ImageViewV1(url=tb.InputData(path='image_right'), full_height=True),\n", + " ],\n", + " controls=tb.ButtonRadioGroupFieldV1(\n", + " data=tb.OutputData(path='result'),\n", + " label='Which icon do you like more?',\n", + " options=[\n", + " tb.GroupFieldOption(label='Left', value='LEFT'),\n", + " tb.GroupFieldOption(label='Right', value='RIGHT'),\n", + " tb.GroupFieldOption(label='Loading error', value='ERROR'),\n", + " ]\n", + " )\n", + ")\n", + "\n", + "hot_keys_plugin = tb.HotkeysPluginV1(\n", + " key_0=tb.SetActionV1(data=tb.OutputData(path='result'), 
payload='ERROR'),\n", + " key_1=tb.SetActionV1(data=tb.OutputData(path='result'), payload='LEFT'),\n", + " key_2=tb.SetActionV1(data=tb.OutputData(path='result'), payload='RIGHT'),\n", + ")\n", + "\n", + "project_interface = toloka.project.view_spec.TemplateBuilderViewSpec(\n", + " config=tb.TemplateBuilder(\n", + " view=main_interface,\n", + " plugins=[hot_keys_plugin],\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Set data specification. And set task interface to project." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "input_specification = {\n", + " 'image_left': toloka.project.field_spec.UrlSpec(),\n", + " 'image_right': toloka.project.field_spec.UrlSpec(),\n", + "}\n", + "output_specification = {'result': toloka.project.field_spec.StringSpec()}\n", + "\n", + "project.task_spec = toloka.project.task_spec.TaskSpec(\n", + " input_spec=input_specification,\n", + " output_spec=output_specification,\n", + " view_spec=project_interface,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Write short and simple \tinstructions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project.public_instructions = \"\"\"

Look at the icons and decide which one you like more.

\n", + "

Select \"Left\" if you like the icon on the left more.

\n", + "

Select \"Right\" if you like the icon on the right more.

\n", + "

Select \"Loading error\" if the picture failed to load.

\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a project." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project = toloka_client.create_project(project)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The performer will see the interface like this:\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \"Photo\n", + "
\n", + " Figure 1. How performer will see your task\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a pool\n", + "Specify the [pool parameters.](https://toloka.ai/en/docs/guide/concepts/pool_poolparams?utm_source=github&utm_medium=site&utm_campaign=tolokakit)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka.Pool(\n", + " project_id=project.id,\n", + " # Give the pool any convenient name. You are the only one who will see it.\n", + " private_name='Which icon do you like more',\n", + " may_contain_adult_content=False,\n", + " # Set the price per task page.\n", + " reward_per_assignment=0.01,\n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),\n", + " # Time given to complete a task suite\n", + " assignment_max_duration_seconds=600,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Select English-speaking performers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool.filter = toloka.filter.Languages.in_('EN')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Set up [Quality control](https://toloka.ai/en/docs/guide/concepts/control?utm_source=github&utm_medium=site&utm_campaign=tolokakit). Set up the Submitted responses quality control rule. Restrict the number of responses per user to one. This way you will only get one answer from each user and thus ensure a variety of opinions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.AnswerCount(),\n", + " conditions=[toloka.conditions.AssignmentsAcceptedCount == 1],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope=toloka.user_restriction.UserRestriction.PROJECT,\n", + " duration=3,\n", + " duration_unit='DAYS',\n", + " private_comment='No need more answers from this performer',\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Overlap. This is the number of users who will complete the same task. Since you are interested in a variety of opinions, select a big overlap for each task. For example, 10." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool.defaults = toloka.Pool.Defaults(default_overlap_for_new_task_suites=10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Specify\tthe number of tasks per page. 1 task per page. A performer will only see one pair of images on a page." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool.set_mixer_config(\n", + " real_tasks_count=1,\n", + " golden_tasks_count=0,\n", + " training_tasks_count=0\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create pool" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka_client.create_pool(pool)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preparing and uploading tasks\n", + "\n", + "This example uses a small data set with images.\n", + "\n", + "The dataset used is collected by Toloka team and distributed under a Creative Commons Attribution 4.0 International license\n", + "[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!curl https://tlk.s3.yandex.net/dataset/toloka_logos/toloka_logos.tsv --output dataset.tsv\n", + "dataset = pandas.read_csv('dataset.tsv', sep='\\t')\n", + "with pandas.option_context(\"max_colwidth\", 80):\n", + " print(dataset)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our project is a pairwise comparison of two images. But our dataset contains just a flat list. Let's create a dataset that contains all possible pairs for comparison." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataset = pandas.DataFrame(itertools.combinations(dataset['url'], 2), columns=['image_left', 'image_right'])\n", + "with pandas.option_context(\"max_colwidth\", 70):\n", + " display(dataset)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create pool tasks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tasks = [\n", + " toloka.Task(\n", + " pool_id=pool.id,\n", + " input_values={\n", + " 'image_left': row['image_left'],\n", + " 'image_right': row['image_right'],\n", + " }\n", + " )\n", + " for i, row in dataset.iterrows()\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Upload tasks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "created_tasks = toloka_client.create_tasks(tasks, allow_defaults=True)\n", + "print(len(created_tasks.items))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Start the pool.\n", + "\n", + "**Important.** Remember that real Toloka performers will complete the tasks.\n", + "Double check that everything is correct\n", + "with your project configuration before you start the pool" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka_client.open_pool(pool.id)\n", + "print(pool.status)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Receiving responses" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Wait until the pool is completed." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool_id = pool.id\n", + "\n", + "def wait_pool_for_close(pool_id, minutes_to_wait=1):\n", + " sleep_time = 60 * minutes_to_wait\n", + " pool = toloka_client.get_pool(pool_id)\n", + " while not pool.is_closed():\n", + " op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])\n", + " op = toloka_client.wait_operation(op)\n", + " percentage = op.details['value'][0]['result']['value']\n", + " print(\n", + " f' {datetime.datetime.now().strftime(\"%H:%M:%S\")}\\t'\n", + " f'Pool {pool.id} - {percentage}%'\n", + " )\n", + " time.sleep(sleep_time)\n", + " pool = toloka_client.get_pool(pool.id)\n", + " print('Pool was closed.')\n", + "\n", + "wait_pool_for_close(pool_id)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Get responses\n", + "\n", + "When all the tasks are completed, look at the responses from performers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "answers = []\n", + "\n", + "for assignment in toloka_client.get_assignments(pool_id=pool_id, status='ACCEPTED'):\n", + " for task, solution in zip(assignment.tasks, assignment.solutions):\n", + " answers.append(\n", + " [\n", + " task.input_values['image_left'],\n", + " task.input_values['image_right'],\n", + " solution.output_values['result'],\n", + " assignment.user_id\n", + " ]\n", + " )\n", + "\n", + "print(f'answers count: {len(answers)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ranking after a pairwise comparison is quite a difficult task. We will use the Bradley-Terry algorithm, which is already implemented in the Crowd-Kit and allows you to get the result in a few lines of code.\n", + "\n", + "> David R. Hunter. 2004.\n", + "> MM algorithms for generalized Bradley-Terry models\n", + "> Ann. Statist., Vol. 
32, 1 (2004): 384–406.\n", + ">\n", + ">\n", + "> Bradley, R. A. and Terry, M. E. 1952.\n", + "> Rank analysis of incomplete block designs. I. The method of paired comparisons.\n", + "> Biometrika, Vol. 39 (1952): 324–345." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Prepare dataframe\n", + "answers_df = pandas.DataFrame(answers, columns=['left', 'right', 'label', 'worker'])\n", + "\n", + "answers_df = answers_df[(answers_df.label == 'LEFT') | (answers_df.label == 'RIGHT')]\n", + "answers_df['label'] = answers_df.apply(lambda row: row[row['label'].lower()], axis=1)\n", + "\n", + "# Run aggregation\n", + "result = NoisyBradleyTerry().fit_predict(answers_df).sort_values(ascending=False)\n", + "print(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's look at the ranking results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "images = result.index.values\n", + "labels = result.values\n", + "ipyplot.plot_images(\n", + " images=images,\n", + " labels=labels,\n", + " max_images=6,\n", + " img_width=200,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**You** can see the ranked images. Some possible results are shown in figure 2 below.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Possible\n", + "
\n", + " Figure 2. Possible results.\n", + "
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/2. cases-for-applied-tasks/2.streaming_pipelines/find_items_pool.json b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/find_items_pool.json new file mode 100644 index 00000000..802600e7 --- /dev/null +++ b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/find_items_pool.json @@ -0,0 +1,94 @@ +{ + "private_name": "Find a similar item in an online store", + "may_contain_adult_content": true, + "assignment_max_duration_seconds": 300, + "defaults": { + "default_overlap_for_new_task_suites": 1 + }, + "auto_close_after_complete_delay_seconds": 0, + "auto_accept_solutions": false, + "auto_accept_period_day": 14, + "assignments_issuing_config": { + "issue_task_suites_in_creation_order": false + }, + "priority": 0, + "filter": { + "and": [ + { + "or": [ + { + "operator": "IN", + "value": "EN", + "key": "languages", + "category": "profile" + } + ] + } + ] + }, + "quality_control": { + "configs": [ + { + "rules": [ + { + "action": { + "parameters": { + "scope": "PROJECT", + "duration": 3, + "duration_unit": "DAYS", + "private_comment": "fast responses" + }, + "type": "RESTRICTION_V2" + }, + "conditions": [ + { + "operator": "GTE", + "value": 1, + "key": "fast_submitted_count" + } + ] + } + ], + "collector_config": { + "uuid": "5492a0e3-6007-4895-ab86-78c4ad2e02eb", + "parameters": { + "fast_submit_threshold_seconds": 30 + }, + "type": "ASSIGNMENT_SUBMIT_TIME" + } + }, + { + "rules": [ + { + "action": { + "parameters": { + "scope": "ALL_PROJECTS", + "duration": 3, + "duration_unit": "DAYS", + 
"private_comment": "rejected assignments" + }, + "type": "RESTRICTION_V2" + }, + "conditions": [ + { + "operator": "GTE", + "value": 1.0, + "key": "rejected_assignments_rate" + } + ] + } + ], + "collector_config": { + "uuid": "a89f0a86-c1b8-4c3b-ae6e-0f8f7795a0b7", + "type": "ACCEPTANCE_RATE" + } + } + ] + }, + "mixer_config": { + "real_tasks_count": 1, + "golden_tasks_count": 0, + "training_tasks_count": 0 + }, + "type": "REGULAR" +} diff --git a/examples/2. cases-for-applied-tasks/2.streaming_pipelines/find_items_project.json b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/find_items_project.json new file mode 100644 index 00000000..6c018d07 --- /dev/null +++ b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/find_items_project.json @@ -0,0 +1,45 @@ +{ + "public_name": "Find a similar item in an online store 2", + "public_description": "Go to M&S online and find similar shoes on the website", + "task_spec": { + "input_spec": { + "image": { + "required": true, + "hidden": false, + "type": "url" + } + }, + "output_spec": { + "found_link": { + "required": true, + "hidden": false, + "type": "string" + } + }, + "view_spec": { + "localizationConfig": null, + "config": "{\n \"view\": {\n \"content\": {\n \"url\": {\n \"path\": \"image\",\n \"type\": \"data.input\"\n },\n \"fullHeight\": true,\n \"type\": \"view.image\"\n },\n \"controls\": {\n \"items\": [\n {\n \"content\": \"Find the same **shoes** on Marks and Spencer\",\n \"type\": \"view.markdown\"\n },\n {\n \"action\": {\n \"payload\": \"https://www.marksandspencer.com\",\n \"type\": \"action.open-link\"\n },\n \"label\": \"Marks and Spencer\",\n \"validation\": {\n \"url\": \"https://www.marksandspencer.com\",\n \"type\": \"condition.link-opened\"\n },\n \"type\": \"view.action-button\"\n },\n {\n \"content\": \"Shoes must be the same color and the same style.\",\n \"type\": \"view.text\"\n },\n {\n \"data\": {\n \"path\": \"found_link\",\n \"type\": \"data.output\"\n },\n \"label\": \"Paste 
the link here\",\n \"validation\": {\n \"conditions\": [\n {\n \"type\": \"condition.required\"\n },\n {\n \"data\": {\n \"path\": \"found_link\",\n \"type\": \"data.output\"\n },\n \"schema\": {\n \"type\": \"string\",\n \"pattern\": \"marksandspencer.com/?\"\n },\n \"hint\": \"the link must be from the Marks and Spencer website\",\n \"type\": \"condition.schema\"\n }\n ],\n \"type\": \"condition.all\"\n },\n \"type\": \"field.text\"\n }\n ],\n \"type\": \"view.list\"\n },\n \"controlsWidth\": 320.0,\n \"type\": \"layout.sidebar\"\n },\n \"plugins\": [\n {\n \"layout\": {\n \"kind\": \"scroll\",\n \"taskWidth\": 800.0\n },\n \"type\": \"plugin.toloka\"\n }\n ]\n}", + "type": "tb", + "lock": { + "core": "1.0.0", + "view.image": "1.0.0", + "view.markdown": "1.0.0", + "action.open-link": "1.0.0", + "condition.link-opened": "1.0.0", + "view.action-button": "1.0.0", + "view.text": "1.0.0", + "condition.required": "1.0.0", + "condition.schema": "1.0.0", + "condition.all": "1.0.0", + "field.text": "1.0.0", + "view.list": "1.0.0", + "layout.sidebar": "1.0.0", + "plugin.toloka": "1.0.0" + } + } + }, + "assignments_issuing_type": "AUTOMATED", + "assignments_automerge_enabled": false, + "public_instructions": "\nLook at the shoes the person is wearing in the picture. \nGo to M&S online store and find the same or similar pair of shoes on the website.\nThe shoes should be similar in color, style or height.\n", + "private_comment": "streaming_piplines example find_items_project" +} diff --git a/examples/2. cases-for-applied-tasks/2.streaming_pipelines/sbs_pool.json b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/sbs_pool.json new file mode 100644 index 00000000..ffa5ae66 --- /dev/null +++ b/examples/2. 
cases-for-applied-tasks/2.streaming_pipelines/sbs_pool.json @@ -0,0 +1,65 @@ +{ + "private_name": "Which suits best?", + "may_contain_adult_content": true, + "assignment_max_duration_seconds": 60, + "defaults": { + "default_overlap_for_new_task_suites": 3 + }, + "auto_close_after_complete_delay_seconds": 0, + "auto_accept_solutions": true, + "auto_accept_period_day": 14, + "assignments_issuing_config": { + "issue_task_suites_in_creation_order": false + }, + "priority": 0, + "filter": { + "and": [ + { + "operator": "IN", + "value": "EN", + "key": "languages", + "category": "profile" + } + ] + }, + "quality_control": { + "captcha_frequency": "HIGH", + "configs": [ + { + "rules": [ + { + "action": { + "parameters": { + "scope": "ALL_PROJECTS", + "duration": 3, + "duration_unit": "DAYS", + "private_comment": "Captcha" + }, + "type": "RESTRICTION_V2" + }, + "conditions": [ + { + "operator": "GTE", + "value": 20.0, + "key": "fail_rate" + } + ] + } + ], + "collector_config": { + "uuid": "7a023230-1304-47d7-b785-fc7e160bf247", + "parameters": { + "history_size": 5 + }, + "type": "CAPTCHA" + } + } + ] + }, + "mixer_config": { + "real_tasks_count": 1, + "golden_tasks_count": 0, + "training_tasks_count": 0 + }, + "type": "REGULAR" +} diff --git a/examples/2. cases-for-applied-tasks/2.streaming_pipelines/sbs_project.json b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/sbs_project.json new file mode 100644 index 00000000..c4f7e579 --- /dev/null +++ b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/sbs_project.json @@ -0,0 +1,53 @@ +{ + "public_name": "Which item is more similar? ", + "public_description": "Decide which pair of shoes look more similar to the initial pair. 
", + "task_spec": { + "input_spec": { + "image": { + "required": true, + "hidden": false, + "type": "url" + }, + "left_link": { + "required": true, + "hidden": false, + "type": "url" + }, + "right_link": { + "required": true, + "hidden": false, + "type": "url" + } + }, + "output_spec": { + "result": { + "required": true, + "hidden": false, + "type": "url" + } + }, + "view_spec": { + "localizationConfig": null, + "config": "{\n \"view\": {\n \"items\": [\n {\n \"url\": {\n \"path\": \"image\",\n \"type\": \"data.input\"\n },\n \"fullHeight\": true,\n \"type\": \"view.image\"\n },\n {\n \"items\": [\n {\n \"items\": [\n {\n \"content\": {\n \"path\": \"left_link\",\n \"type\": \"data.input\"\n },\n \"type\": \"view.text\"\n },\n {\n \"action\": {\n \"payload\": {\n \"path\": \"left_link\",\n \"type\": \"data.input\"\n },\n \"type\": \"action.open-link\"\n },\n \"label\": \"Go to site\",\n \"type\": \"view.action-button\"\n },\n {\n \"url\": {\n \"path\": \"left_link\",\n \"type\": \"data.input\"\n },\n \"fullHeight\": true,\n \"type\": \"view.iframe\"\n }\n ],\n \"type\": \"view.list\"\n },\n {\n \"items\": [\n {\n \"content\": {\n \"path\": \"right_link\",\n \"type\": \"data.input\"\n },\n \"type\": \"view.text\"\n },\n {\n \"action\": {\n \"payload\": {\n \"path\": \"right_link\",\n \"type\": \"data.input\"\n },\n \"type\": \"action.open-link\"\n },\n \"label\": \"Go to site\",\n \"type\": \"view.action-button\"\n },\n {\n \"url\": {\n \"path\": \"right_link\",\n \"type\": \"data.input\"\n },\n \"fullHeight\": true,\n \"type\": \"view.iframe\"\n }\n ],\n \"type\": \"view.list\"\n }\n ],\n \"type\": \"layout.side-by-side\"\n },\n {\n \"content\": \"Which photo is the most similar to the original one?\",\n \"type\": \"view.text\"\n },\n {\n \"data\": {\n \"path\": \"result\",\n \"type\": \"data.output\"\n },\n \"options\": [\n {\n \"value\": {\n \"path\": \"left_link\",\n \"type\": \"data.input\"\n },\n \"label\": \"The left one is better\"\n },\n {\n \"value\": {\n 
\"path\": \"right_link\",\n \"type\": \"data.input\"\n },\n \"label\": \"The right one is better\"\n }\n ],\n \"validation\": {\n \"type\": \"condition.required\"\n },\n \"type\": \"field.radio-group\"\n }\n ],\n \"type\": \"view.list\"\n },\n \"plugins\": [\n {\n \"1\": {\n \"data\": {\n \"path\": \"result\",\n \"type\": \"data.output\"\n },\n \"payload\": {\n \"path\": \"left_link\",\n \"type\": \"data.input\"\n },\n \"type\": \"action.set\"\n },\n \"2\": {\n \"data\": {\n \"path\": \"result\",\n \"type\": \"data.output\"\n },\n \"payload\": {\n \"path\": \"right_link\",\n \"type\": \"data.input\"\n },\n \"type\": \"action.set\"\n },\n \"type\": \"plugin.hotkeys\"\n }\n ]\n}", + "type": "tb", + "lock": { + "core": "1.0.0", + "view.image": "1.0.0", + "view.text": "1.0.0", + "action.open-link": "1.0.0", + "view.action-button": "1.0.0", + "view.iframe": "1.0.0", + "view.list": "1.0.0", + "layout.side-by-side": "1.0.0", + "condition.required": "1.0.0", + "field.radio-group": "1.0.0", + "action.set": "1.0.0", + "plugin.hotkeys": "1.0.0" + } + } + }, + "assignments_issuing_type": "AUTOMATED", + "assignments_automerge_enabled": false, + "public_instructions": "Look at the pictures and decide which pair of shoes are more similar to the initial pair of shoes above. Use your own sense of style, but also remember that they will look alike if they are similar color, form, fabric and style. 

Good luck!
", + "private_comment": "streaming_piplines example sbs_project" +} diff --git a/examples/2. cases-for-applied-tasks/2.streaming_pipelines/streaming_pipelines.ipynb b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/streaming_pipelines.ipynb new file mode 100644 index 00000000..ccd30b0c --- /dev/null +++ b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/streaming_pipelines.ipynb @@ -0,0 +1,441 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Building streaming pipelines in Toloka\n", + "\n", + "Let's solve the following task: find goods in an online store that match a given image and arrange the found results by relevance.\n", + "\n", + "It can be solved in 3 steps:\n", + "* For a given image, find the corresponding goods in the online shop;\n", + "* Verify that the selected goods are correct;\n", + "* Arrange the found goods by relevance using side-by-side comparison.\n", + "\n", + "Each step is represented by a Toloka pool. We should also connect those pools and move data between them.\n", + "\"Example" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Call to action\n", + "If you found some bugs or have a new feature idea, don't hesitate to [open a new issue on Github](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", + "Like our library and examples? 
Star [our repo on Github](https://github.com/Toloka/toloka-kit)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2021-08-26T15:33:19.731357Z", + "start_time": "2021-08-26T15:33:17.615813Z" + } + }, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==0.1.26\n", + "!pip install crowd-kit==1.0.0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2021-08-26T15:33:21.245650Z", + "start_time": "2021-08-26T15:33:21.228813Z" + } + }, + "outputs": [], + "source": [ + "import logging\n", + "import sys\n", + "import getpass\n", + "from toloka.client import TolokaClient\n", + "\n", + "logging.basicConfig(format='%(levelname)s - %(asctime)s - %(name)s: %(message)s',\n", + " level=logging.INFO,\n", + " stream=sys.stdout)\n", + "client = TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This example focuses on the connections between pools, so we don't pay much attention to project and pool configuration here.\n", + "Let's just load the configuration from files stored on GitHub." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import datetime\n", + "import requests\n", + "import os\n", + "from toloka.client import Pool, Project, structure\n", + "\n", + "GITHUB_RAW = 'https://raw.githubusercontent.com'\n", + "GITHUB_BASE_PATH = 'Toloka/toloka-kit/main/examples/6.streaming_pipelines'\n", + "\n", + "def _load_json_from_github(filename: str):\n", + " response = requests.get(os.path.join(GITHUB_RAW, GITHUB_BASE_PATH, filename))\n", + " response.raise_for_status()\n", + " return response.json()\n", + "\n", + "def create_project(filename: str) -> Project:\n", + " return client.create_project(_load_json_from_github(filename))\n", + "\n", + "def create_pool(filename: str, project_id: str, reward_per_assignment: float) -> Pool:\n", + " pool = structure(_load_json_from_github(filename), Pool)\n", + " pool.project_id = project_id\n", + " pool.reward_per_assignment = reward_per_assignment\n", + " pool.will_expire = datetime.datetime.now() + datetime.timedelta(days=3)\n", + " return client.create_pool(pool)\n", + "\n", + "find_items_project = create_project('find_items_project.json')\n", + "find_items_pool = create_pool('find_items_pool.json', find_items_project.id, 0.08)\n", + "\n", + "verification_project = create_project('verification_project.json')\n", + "verification_pool = create_pool('verification_pool.json', verification_project.id, 0.02)\n", + "\n", + "sbs_project = create_project('sbs_project.json')\n", + "sbs_pool = create_pool('sbs_pool.json', sbs_project.id, 0.04)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some data flows can be implemented implicitly using pool quality control rules.\n", + "\n", + "Here, if an assignment is rejected, the overlap of the corresponding tasks increases, so new microtasks appear." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from toloka.client.actions import ChangeOverlap\n", + "from toloka.client.collectors import AssignmentsAssessment\n", + "from toloka.client.conditions import AssessmentEvent\n", + "\n", + "find_items_pool.quality_control.add_action(\n", + " collector=AssignmentsAssessment(),\n", + " conditions=[AssessmentEvent == AssessmentEvent.REJECT],\n", + " action=ChangeOverlap(delta=1, open_pool=True),\n", + ")\n", + "client.update_pool(find_items_pool.id, find_items_pool);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Connections\n", + "\n", + "Now define each connection as a separate callable.\n", + "\n", + "Entire pipeline will be as follows:\n", + "\"Example" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import collections\n", + "import itertools\n", + "import pandas as pd\n", + "from typing import List\n", + "\n", + "from toloka.client.task import Task\n", + "from toloka.streaming.event import AssignmentEvent\n", + "\n", + "OVERLAP_FIND_ITEMS = 12\n", + "OVERLAP_VERIFICATION = 3\n", + "OVERLAP_SBS = 3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def handle_found_items(events: List[AssignmentEvent]) -> None:\n", + " \"\"\"find_items_pool -> verification_pool\"\"\"\n", + " verification_tasks = [\n", + " Task(\n", + " pool_id=verification_pool.id,\n", + " unavailable_for=[event.assignment.user_id],\n", + " overlap=OVERLAP_VERIFICATION,\n", + " input_values={\n", + " 'image': task.input_values['image'],\n", + " 'found_link': solution.output_values['found_link'],\n", + " 'assignment_id': event.assignment.id\n", + " },\n", + " )\n", + " for event in events\n", + " for task, solution in zip(event.assignment.tasks, event.assignment.solutions)\n", + " ]\n", + " 
client.create_tasks(verification_tasks, open_pool=True)\n", + " logging.info('Verification tasks created count: %d', len(verification_tasks))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from crowdkit.aggregation import MajorityVote\n", + "from toloka.client.exceptions import IncorrectActionsApiError\n", + "\n", + "\n", + "class VerificationDoneHandler:\n", + " \"\"\"verification_pool -> find_items_pool back using quality control rule\"\"\"\n", + " def __init__(self, client: TolokaClient):\n", + " self.client = client\n", + " self.waiting = collections.defaultdict(list)\n", + "\n", + " def __call__(self, events: List[AssignmentEvent]) -> None:\n", + " for event in events:\n", + " for task, solution in zip(event.assignment.tasks, event.assignment.solutions):\n", + " answer = (solution.output_values['result'], event.assignment.user_id)\n", + " self.waiting[task.input_values['assignment_id']].append(answer)\n", + "\n", + " to_aggregate = []\n", + " for assignment_id, answers in self.waiting.items():\n", + " if len(answers) >= OVERLAP_VERIFICATION:\n", + " to_aggregate.extend((assignment_id, result, user_id) for result, user_id in answers)\n", + "\n", + " if to_aggregate:\n", + " to_aggregate_df = pd.DataFrame(to_aggregate, columns=['task', 'label', 'worker'])\n", + " aggregated: pd.Series = MajorityVote().fit_predict(to_aggregate_df)\n", + " logging.info('Statuses to apply count: %s', collections.Counter(aggregated.values))\n", + "\n", + " for assignment_id, result in aggregated.items():\n", + " try:\n", + " if result == 'Yes':\n", + " self.client.accept_assignment(assignment_id, 'Well done!')\n", + " else:\n", + " self.client.reject_assignment(assignment_id, 'Incorrect object.')\n", + " except IncorrectActionsApiError: # You could have accepted or rejected it in the UI.\n", + " logging.exception('Can\\'t set status %s at %s', result, assignment_id)\n", + " del self.waiting[assignment_id]\n", + 
"\n", + " logging.info('Waiting for verification count: %d', len(self.waiting))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class AcceptedItemsToComparison:\n", + " \"\"\"find_items_pool -> sbs_pool\"\"\"\n", + " def __init__(self, client: TolokaClient):\n", + " self.client = client\n", + " self.waiting = collections.defaultdict(list)\n", + "\n", + " def __call__(self, events: List[AssignmentEvent]) -> None:\n", + " for event in events:\n", + " for task, solution in zip(event.assignment.tasks, event.assignment.solutions):\n", + " self.waiting[task.input_values['image']].append(solution.output_values['found_link'])\n", + "\n", + " to_sbs = [(image, found_links)\n", + " for image, found_links in self.waiting.items()\n", + " if len(found_links) >= OVERLAP_FIND_ITEMS]\n", + "\n", + " if to_sbs:\n", + " logging.info('Got images ready for SbS count: %d', len(to_sbs))\n", + "\n", + " sbs_tasks = []\n", + " for image, found_links in to_sbs:\n", + " for left_link, right_link in itertools.combinations(found_links, 2):\n", + " input_values = {'image': image, 'left_link': left_link, 'right_link': right_link}\n", + " sbs_tasks.append(Task(pool_id=sbs_pool.id, overlap=OVERLAP_SBS, input_values=input_values))\n", + "\n", + " logging.info('SbS tasks to create count: %d', len(sbs_tasks))\n", + " self.client.create_tasks(sbs_tasks, open_pool=True)\n", + "\n", + " for image, _ in to_sbs:\n", + " del self.waiting[image]\n", + " logging.info('Waiting for SbS count: %d', len(self.waiting))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from crowdkit.aggregation import BradleyTerry\n", + "\n", + "\n", + "class HandleSbS:\n", + " \"\"\"sbs_pool results aggregation\"\"\"\n", + " def __init__(self, client: TolokaClient):\n", + " self.client = client\n", + " self.waiting = collections.defaultdict(list)\n", + " self.scores_by_image = {}\n", + "\n", + " 
def __call__(self, events: List[AssignmentEvent]) -> None:\n", + " for event in events:\n", + " for task, solution in zip(event.assignment.tasks, event.assignment.solutions):\n", + " answer = {'image': task.input_values['image'],\n", + " 'worker': event.assignment.user_id,\n", + " 'left': task.input_values['left_link'],\n", + " 'right': task.input_values['right_link'],\n", + " 'label': solution.output_values['result']}\n", + " self.waiting[task.input_values['image']].append(answer)\n", + "\n", + " for image, answers in list(self.waiting.items()):\n", + " if len(answers) >= OVERLAP_SBS:\n", + " scores = BradleyTerry(n_iter=100).fit_predict(pd.DataFrame(answers))\n", + " self.scores_by_image[image] = scores.sort_values(ascending=False)\n", + " del self.waiting[image]\n", + "\n", + " logging.info('Waiting for SbS aggregation count: %d', len(self.waiting))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Putting it all together" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from toloka.streaming import AssignmentsObserver, Pipeline\n", + "\n", + "pipeline = Pipeline()\n", + "found_items_observer = pipeline.register(AssignmentsObserver(client, find_items_pool.id))\n", + "verification_observer = pipeline.register(AssignmentsObserver(client, verification_pool.id))\n", + "sbs_observer = pipeline.register(AssignmentsObserver(client, sbs_pool.id))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "found_items_observer.on_submitted(handle_found_items)\n", + "found_items_observer.on_accepted(AcceptedItemsToComparison(client))\n", + "verification_observer.on_accepted(VerificationDoneHandler(client))\n", + "sbs_handler = sbs_observer.on_accepted(HandleSbS(client))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create tasks for initial pool." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "images = [\n", + " 'https://tlk.s3.yandex.net/wsdm2020/photos/8ca087fe33065d75327cafdb8720204b.jpg',\n", + " 'https://tlk.s3.yandex.net/wsdm2020/photos/d0c9eb8737f48df5964d93b08ec0d758.jpg',\n", + " 'https://tlk.s3.yandex.net/wsdm2020/photos/9245eed8aa1d1e6f5d5d39d00ab044c6.jpg',\n", + " 'https://tlk.s3.yandex.net/wsdm2020/photos/0aff4fc1edbe6096a9a517092902627f.jpg',\n", + " 'http://tolokaadmin.s3.yandex.net/demo/abb61898-c886-4e20-b7cd-c0d359ddbb9a',\n", + "]\n", + "tasks = [\n", + " Task(pool_id=find_items_pool.id, overlap=OVERLAP_FIND_ITEMS, input_values={'image': image})\n", + " for image in images\n", + "]\n", + "client.create_tasks(tasks, open_pool=True);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Google Colab already runs a global event loop,\n", + "# so in order to run our pipeline we have to apply nest_asyncio to allow a nested loop\n", + "if 'google.colab' in str(get_ipython()):\n", + " import nest_asyncio, asyncio\n", + " nest_asyncio.apply()\n", + " asyncio.get_event_loop().run_until_complete(pipeline.run())\n", + "else:\n", + " await pipeline.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Display results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import Image, display\n", + "\n", + "for image, scores in sbs_handler.scores_by_image.items():\n", + " display(Image(url=image, height=200))\n", + " print(f'{scores.nlargest(1)}\\n')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + 
"file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/2. cases-for-applied-tasks/2.streaming_pipelines/verification_pool.json b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/verification_pool.json new file mode 100644 index 00000000..5c3ab5af --- /dev/null +++ b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/verification_pool.json @@ -0,0 +1,93 @@ +{ + "private_name": "Is ok?", + "may_contain_adult_content": true, + "assignment_max_duration_seconds": 60, + "defaults": { + "default_overlap_for_new_task_suites": 1 + }, + "auto_close_after_complete_delay_seconds": 0, + "auto_accept_solutions": true, + "auto_accept_period_day": 14, + "assignments_issuing_config": { + "issue_task_suites_in_creation_order": false + }, + "priority": 0, + "filter": { + "and": [ + { + "operator": "IN", + "value": "EN", + "key": "languages", + "category": "profile" + } + ] + }, + "quality_control": { + "configs": [ + { + "rules": [ + { + "action": { + "parameters": { + "scope": "ALL_PROJECTS", + "duration": 3, + "duration_unit": "DAYS", + "private_comment": "Doesn't match the majority" + }, + "type": "RESTRICTION_V2" + }, + "conditions": [ + { + "operator": "GTE", + "value": 4, + "key": "total_answers_count" + }, + { + "operator": "LT", + "value": 75.0, + "key": "correct_answers_rate" + } + ] + } + ], + "collector_config": { + "uuid": "8a34d3d8-fa33-4681-b728-194f5937a1c6", + "parameters": { + "answer_threshold": 3 + }, + "type": "MAJORITY_VOTE" + } + }, + { + "rules": [ + { + "action": { + "parameters": { + "delta": 1, + "open_pool": true + }, + "type": "CHANGE_OVERLAP" + }, + "conditions": [ + { + "operator": "EQ", + "value": "REJECT", + "key": "assessment_event" + } + ] + } + ], + "collector_config": { + "uuid": "1e5f1d39-a065-468a-95f3-36b7d0507be4", + "type": "ASSIGNMENTS_ASSESSMENT" 
+ } + } + ] + }, + "mixer_config": { + "real_tasks_count": 1, + "golden_tasks_count": 0, + "training_tasks_count": 0 + }, + "type": "REGULAR" +} diff --git a/examples/2. cases-for-applied-tasks/2.streaming_pipelines/verification_project.json b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/verification_project.json new file mode 100644 index 00000000..d022ff4e --- /dev/null +++ b/examples/2. cases-for-applied-tasks/2.streaming_pipelines/verification_project.json @@ -0,0 +1,56 @@ +{ + "public_name": "Do the shoes look similar to each other?", + "public_description": "", + "task_spec": { + "input_spec": { + "image": { + "required": true, + "hidden": false, + "type": "url" + }, + "found_link": { + "required": true, + "hidden": false, + "type": "url" + }, + "assignment_id": { + "required": true, + "hidden": false, + "type": "string" + } + }, + "output_spec": { + "result": { + "required": true, + "hidden": false, + "allowed_values": [ + "Yes", + "No" + ], + "type": "string" + } + }, + "view_spec": { + "localizationConfig": null, + "config": "{\n \"view\": {\n \"controls\": {\n \"items\": [\n {\n \"content\": \"Check that the uploaded image matches the product in the store.\",\n \"type\": \"view.text\"\n },\n {\n \"action\": {\n \"payload\": {\n \"path\": \"found_link\",\n \"type\": \"data.input\"\n },\n \"type\": \"action.open-link\"\n },\n \"label\": \"Check the item\",\n \"type\": \"view.action-button\"\n },\n {\n \"content\": \"Are these **shoes** similar to each other?\",\n \"type\": \"view.markdown\"\n },\n {\n \"content\": \"Shoes must be the same **color and the same style**\",\n \"type\": \"view.text\"\n },\n {\n \"data\": {\n \"path\": \"result\",\n \"type\": \"data.output\"\n },\n \"options\": [\n {\n \"value\": \"Yes\",\n \"label\": \"Yes\"\n },\n {\n \"value\": \"No\",\n \"label\": \"No\"\n }\n ],\n \"validation\": {\n \"type\": \"condition.required\"\n },\n \"type\": \"field.radio-group\"\n }\n ],\n \"type\": \"view.list\"\n },\n \"items\": [\n 
{\n \"url\": {\n \"path\": \"image\",\n \"type\": \"data.input\"\n },\n \"fullHeight\": true,\n \"type\": \"view.image\"\n },\n {\n \"url\": {\n \"path\": \"found_link\",\n \"type\": \"data.input\"\n },\n \"fullHeight\": true,\n \"type\": \"view.iframe\"\n }\n ],\n \"type\": \"layout.side-by-side\"\n }\n}", + "type": "tb", + "lock": { + "core": "1.0.0", + "view.text": "1.0.0", + "action.open-link": "1.0.0", + "view.action-button": "1.0.0", + "view.markdown": "1.0.0", + "condition.required": "1.0.0", + "field.radio-group": "1.0.0", + "view.list": "1.0.0", + "view.image": "1.0.0", + "view.iframe": "1.0.0", + "layout.side-by-side": "1.0.0" + } + } + }, + "assignments_issuing_type": "AUTOMATED", + "assignments_automerge_enabled": false, + "public_instructions": "\nTake a look at 2 pictures. Decide whether the shoes look similar or not.\n

\n
The shoes look similar if they are the same or similar color, fabric and style.
\n
If you do not see any shoes in the pictures, choose No.
\n
\n", + "private_comment": "streaming_pipelines example verification_project" +} diff --git a/examples/7.survey/simplest_survey/img/task_interface.png b/examples/2. cases-for-applied-tasks/2.survey/simplest_survey/img/task_interface.png similarity index 100% rename from examples/7.survey/simplest_survey/img/task_interface.png rename to examples/2. cases-for-applied-tasks/2.survey/simplest_survey/img/task_interface.png diff --git a/examples/2. cases-for-applied-tasks/2.survey/simplest_survey/simplest_survey.ipynb b/examples/2. cases-for-applied-tasks/2.survey/simplest_survey/simplest_survey.ipynb new file mode 100644 index 00000000..9deefdb0 --- /dev/null +++ b/examples/2. cases-for-applied-tasks/2.survey/simplest_survey/simplest_survey.ipynb @@ -0,0 +1,6091 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Survey manual" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Call to action\n", + "If you find a bug or have a new feature idea, don't hesitate to [open a new issue on GitHub](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", + "Like our library and examples? Star [our repo on GitHub](https://github.com/Toloka/toloka-kit)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Prepare the environment and import everything we'll need."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==0.1.26\n", + "!pip install pandas\n", + "!pip install plotly\n", + "\n", + "import datetime\n", + "import getpass\n", + "import sys\n", + "import time\n", + "import logging\n", + "\n", + "import plotly.express as px\n", + "import pandas\n", + "\n", + "import toloka.client as toloka\n", + "import toloka.client.project.template_builder as tb" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(\n", + " format='[%(levelname)s] %(name)s: %(message)s',\n", + " level=logging.INFO,\n", + " stream=sys.stdout,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a toloka-client instance. All API calls will go through it. Read more about the OAuth token in our [Learn the basics example](https://github.com/Toloka/toloka-kit/tree/main/examples/0.getting_started/0.learn_the_basics) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'\n", + "logging.info(toloka_client.get_requester())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a project\n", + "Enter a clear project name and description.\n", + "> The project name and description will be visible to the performers."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project = toloka.Project(\n", + " public_name='Survey on stress management',\n", + " public_description='This survey will take about 1-2 minutes',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create task interface.\n", + "Each question described by one field:\n", + "\n", + " - RadioGroupFieldV1 - used for question with only one possible answer.\n", + " - CheckboxGroupFieldV1 - used for question with several possible answers.\n", + "\n", + "\n", + "You can replace or add new questions:\n", + "\n", + " - create new field,\n", + " - add it to the ListView in project_interface,\n", + " - add output field to output_specification below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "work_mode_field = tb.RadioGroupFieldV1(\n", + " tb.OutputData('workmode'),\n", + " [\n", + " tb.GroupFieldOption('office', 'Office'),\n", + " tb.GroupFieldOption('home', 'Home office'),\n", + " ],\n", + " label='Where do you work?',\n", + " validation=tb.RequiredConditionV1(hint='Select an option'),\n", + ")\n", + "\n", + "stress_field = tb.RadioGroupFieldV1(\n", + " tb.OutputData('stress'),\n", + " [\n", + " tb.GroupFieldOption('alot', 'A lot'),\n", + " tb.GroupFieldOption('notmuch', 'Not much'),\n", + " ],\n", + " label='Is there a lot of stress in your everyday life?',\n", + " validation=tb.RequiredConditionV1(hint='Select an option'),\n", + ")\n", + "\n", + "cope_field = tb.CheckboxGroupFieldV1(\n", + " tb.OutputData('coping'),\n", + " [\n", + " tb.GroupFieldOption('family', 'Spending time with family'),\n", + " tb.GroupFieldOption('sleeping', 'Sleeping'),\n", + " tb.GroupFieldOption('goingout', 'Going out to restaurants, cinemas etc'),\n", + " tb.GroupFieldOption('sport', 'Sport'),\n", + " tb.GroupFieldOption('meditation', 'Meditation'),\n", + " 
tb.GroupFieldOption('therapy', 'Therapy'),\n", + " tb.GroupFieldOption('alcohol', 'Alcohol'),\n", + " tb.GroupFieldOption('other', 'Other'),\n", + " tb.GroupFieldOption('none', 'None of the above'),\n", + " ],\n", + " label='How do you cope with stress? You can select several options',\n", + " validation=tb.RequiredConditionV1(hint='Choose one or more options'),\n", + ")\n", + "\n", + "meditation_field = tb.RadioGroupFieldV1(\n", + " tb.OutputData('meditation'),\n", + " [\n", + " tb.GroupFieldOption('practice', 'I practice meditation'),\n", + " tb.GroupFieldOption('usedtopractice', 'I used to practice meditation'),\n", + " tb.GroupFieldOption('wanttotry', 'I have never practiced but I\\'d like to try'),\n", + " tb.GroupFieldOption('dontwant', 'I have never practiced and I don\\'t want to try'),\n", + " ],\n", + " label='How do you feel about meditation?',\n", + " validation=tb.RequiredConditionV1(hint='Select an option'),\n", + ")\n", + "\n", + "# Add an attention check question (or several). 
Since there are no correct\n", + "# answers to a survey and we can’t just check if they are right or wrong,\n", + "# we need to use some workaround techniques to ensure quality.\n", + "honeypot_field = tb.RadioGroupFieldV1(\n", + " tb.OutputData('honeypot'),\n", + " [\n", + " tb.GroupFieldOption('yes', 'Yes'),\n", + " tb.GroupFieldOption('no', 'No'),\n", + " ],\n", + " label='Are you now completing a survey on Toloka?',\n", + " validation=tb.RequiredConditionV1(hint='Select an option'),\n", + ")\n", + "\n", + "mobile_apps_field = tb.RadioGroupFieldV1(\n", + " tb.OutputData('apps'),\n", + " [\n", + " tb.GroupFieldOption('yes', 'Yes'),\n", + " tb.GroupFieldOption('dontneed', 'No, I don\\'t need them'),\n", + " tb.GroupFieldOption('dontpay', 'No, I\\'m not ready to pay'),\n", + " ],\n", + " label='Do you buy mobile apps?',\n", + " validation=tb.RequiredConditionV1(hint='Select an option'),\n", + ")\n", + "\n", + "project_interface = toloka.project.TemplateBuilderViewSpec(\n", + " view=tb.ListViewV1(\n", + " [\n", + " work_mode_field,\n", + " stress_field,\n", + " cope_field,\n", + " meditation_field,\n", + " honeypot_field,\n", + " mobile_apps_field\n", + " ]\n", + " ),\n", + " plugins=[tb.TolokaPluginV1(kind='scroll', task_width=500)],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Make sure the specifications include all output data paths that you have created.\n", + "> Specifications are a description of input data that will be used in a project and the output data that will be collected from the performers.\n", + "\n", + "Read more about [input and output data specifications](https://yandex.ru/support/toloka-tb/operations/create-specs.html?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in the Requester’s Guide." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "input_specification = {'theme': toloka.project.StringSpec()}\n", + "output_specification = {\n", + " 'workmode': toloka.project.StringSpec(),\n", + " 'stress': toloka.project.StringSpec(),\n", + " 'coping': toloka.project.JsonSpec(),\n", + " 'meditation': toloka.project.StringSpec(),\n", + " 'honeypot': toloka.project.StringSpec(),\n", + " 'apps': toloka.project.StringSpec(),\n", + "}\n", + "\n", + "project.task_spec = toloka.project.task_spec.TaskSpec(\n", + " input_spec=input_specification,\n", + " output_spec=output_specification,\n", + " view_spec=project_interface,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If there is anything important about the survey that the performers should know, put it in the instructions. In that case, the attention check question can be based on this information." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project.public_instructions = \"\"\"We are conducting research on how people cope with stress in their everyday life
\n", + "Answer the questions by selecting one or more possible answers.\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a project." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project = toloka_client.create_project(project)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a pool\n", + "A pool is a set of paid tasks grouped into task pages. These tasks are sent out for completion at the same time.\n", + "> All tasks within a pool have the same settings (price, quality control, etc.)\n", + "\n", + "We will use non-automatic acceptance. The reason for accepting the task will be a correct answer to the attention check question." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka.Pool(\n", + " project_id=project.id,\n", + " # Give the pool any name you find suitable. You are the only one who will see it.\n", + " private_name='Survey on stress management',\n", + " may_contain_adult_content=False,\n", + " # Set the price per task page.\n", + " reward_per_assignment=0.01,\n", + " # We will check the completed tasks manually before paying for them.\n", + " auto_accept_solutions=False,\n", + " # Number of days to determine if we pay.\n", + " auto_accept_period_day=1,\n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),\n", + " # Overlap. This is the number of users who will complete the same task.\n", + " defaults=toloka.Pool.Defaults(default_overlap_for_new_task_suites=50),\n", + " # Time allowed for completing a task page\n", + " assignment_max_duration_seconds=60*10,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Select English-speaking performers.\n", + "\n", + "\n", + "Add access from both the Toloka web version and Toloka for mobile. 
Most surveys are suitable for completion on a mobile device, and it will speed up pool completion.\n", + "\n", + "\n", + "We would like to run our survey on people living in the UK and the USA who are over 30.\n", + "> Please note that personal information like dates of birth is provided by Tolokers themselves. The platform does not control the accuracy of this info. The region can be double-checked using the Region by IP parameter." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool.filter = (\n", + " (toloka.filter.Languages.in_('EN')) &\n", + " ((toloka.filter.ClientType == 'BROWSER') | (toloka.filter.ClientType == 'TOLOKA_APP')) &\n", + " ((toloka.filter.Country == 'US') | (toloka.filter.Country == 'GB')) &\n", + " (toloka.filter.RegionByIp.in_(102) | toloka.filter.RegionByIp.in_(84)) &\n", + " (toloka.filter.DateOfBirth < int(datetime.datetime.strptime('01.01.1991', '%d.%m.%Y').timestamp()))\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Set up [Quality control](https://toloka.ai/en/docs/guide/concepts/control?utm_source=github&utm_medium=site&utm_campaign=tolokakit). Read more about [configuring this rule](https://toloka.ai/en/docs/guide/concepts/goldenset?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Requester’s Guide.\n", + "\n", + "\n", + "If the number of responses is at least 1 and the correctness of the responses is 100%, then the answer will be auto-accepted."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.GoldenSet(history_size=1),\n", + " conditions=[toloka.conditions.GoldenSetCorrectAnswersRate == 100],\n", + " action=toloka.actions.ApproveAllAssignments()\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " Create the “Stress management” skill that will reflect response quality. It can later be used if you re-run the survey and need to exclude those who have already taken part in it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "survey_skill = next(toloka_client.get_skills(name='stress-management'), None)\n", + "if survey_skill:\n", + " print('Detection skill already exists')\n", + "else:\n", + " survey_skill = toloka_client.create_skill(\n", + " name='stress-management',\n", + " hidden=True,\n", + " )\n", + "\n", + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.GoldenSet(history_size=1),\n", + " conditions=[toloka.conditions.TotalAnswersCount > 0],\n", + " action=toloka.actions.SetSkillFromOutputField(\n", + " skill_id=survey_skill.id,\n", + " from_field='correct_answers_rate',\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Add the Processing rejected and accepted assignments rule. If an assignment has been rejected, the task will be sent to another performer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.AssignmentsAssessment(),\n", + " conditions=[toloka.conditions.AssessmentEvent == 'REJECT'],\n", + " action=toloka.actions.ChangeOverlap(delta=1, open_pool=True),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Specify the number of tasks per page." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool.set_mixer_config(golden_tasks_count=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create the pool." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka_client.create_pool(pool)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preparing and uploading tasks\n", + "Create the pool task." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tasks = [\n", + " toloka.Task(\n", + " pool_id=pool.id,\n", + " input_values={'theme': 'Stress management'},\n", + " known_solutions=[\n", + " toloka.task.BaseTask.KnownSolution(\n", + " output_values={'honeypot': 'yes'}\n", + " )\n", + " ],\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Upload the tasks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "created_tasks = toloka_client.create_tasks(tasks, allow_defaults=True)\n", + "logging.info(len(created_tasks.items))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can open the pool in the web interface and preview the performer's interface.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Task\n", + "
\n", + " Figure 1. What the task page preview can look like.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Start the pool.\n", + "\n", + "**Important.** Remember that real Toloka performers will complete the tasks.\n", + "Double check that everything is correct\n", + "with your project configuration before you start the pool" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka_client.open_pool(pool.id)\n", + "logging.info(pool.status)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Receiving responses" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Wait until the pool is completed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool_id = pool.id\n", + "\n", + "def wait_pool_for_close(pool_id, minutes_to_wait=1):\n", + " sleep_time = 60 * minutes_to_wait\n", + " pool = toloka_client.get_pool(pool_id)\n", + " while not pool.is_closed():\n", + " op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])\n", + " op = toloka_client.wait_operation(op)\n", + " percentage = op.details['value'][0]['result']['value']\n", + " logging.info(\n", + " f' {datetime.datetime.now().strftime(\"%H:%M:%S\")}\\t'\n", + " f'Pool {pool.id} - {percentage}%'\n", + " )\n", + " time.sleep(sleep_time)\n", + " pool = toloka_client.get_pool(pool.id)\n", + " logging.info('Pool was closed.')\n", + "\n", + "wait_pool_for_close(pool_id)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Get responses. There are accepted assignments, and assignments that need to be reviewed. They were not accepted because a user failed the attention check. They need to be rejected." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for assignment in toloka_client.get_assignments(status='SUBMITTED', pool_id=pool_id):\n", + " toloka_client.reject_assignment(assignment.id, 'There was an attention check question that was failed.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's just show the distribution of answers in all our questions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "answers = []\n", + "answers_df = toloka_client.get_assignments_df(pool_id)\n", + "answers_df = answers_df.rename(columns={\n", + " 'OUTPUT:apps': 'pay for apps',\n", + " 'OUTPUT:coping': 'coping with stress',\n", + " 'OUTPUT:stress': 'stress level',\n", + " 'OUTPUT:workmode': 'work mode',\n", + " 'OUTPUT:meditation': 'using meditation',\n", + "})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One choice questions." + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "bingroup": "x", + "histnorm": "percent", + "hovertemplate": "work mode=%{x}
percent=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "type": "histogram", + "x": [ + "office", + "office", + "home", + "office", + "office", + "home", + "home", + "office", + "home", + "home", + "home", + "home", + "home", + "home", + "home", + "home", + "home", + "office", + "home", + "home", + "office", + "home", + "home", + "home", + "home", + "office", + "home", + "home", + "office", + "home", + "home", + "home", + "office", + "home", + "office", + "office", + "home", + "home", + "office", + "office", + "office", + "office", + "home", + "home", + "home", + "home", + "office", + "home", + "office", + "home" + ], + "xaxis": "x", + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "legend": { + "tracegroupgap": 0 + }, + "margin": { + "t": 60 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 
0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + 
], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": 
"" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 
0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", 
+ "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "title": { + "text": "work mode" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "percent" + } + } + } + }, + "text/html": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fig = px.histogram(answers_df, x='work mode', histnorm='percent')\n", + "fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "bingroup": "x", + "histnorm": "percent", + "hovertemplate": "work mode=office
stress level=%{x}
percent=%{y}", + "legendgroup": "office", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "office", + "offsetgroup": "office", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + "alot", + "alot", + "alot", + "notmuch", + "alot", + "notmuch", + "alot", + "alot", + "notmuch", + "alot", + "alot", + "notmuch", + "alot", + "alot", + "notmuch", + "alot", + "notmuch", + "alot" + ], + "xaxis": "x", + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "bingroup": "x", + "histnorm": "percent", + "hovertemplate": "work mode=home
stress level=%{x}
percent=%{y}", + "legendgroup": "home", + "marker": { + "color": "#EF553B", + "pattern": { + "shape": "" + } + }, + "name": "home", + "offsetgroup": "home", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + "alot", + "notmuch", + "notmuch", + "alot", + "notmuch", + "notmuch", + "alot", + "alot", + "notmuch", + "notmuch", + "alot", + "notmuch", + "alot", + "alot", + "notmuch", + "notmuch", + "alot", + "alot", + "alot", + "notmuch", + "alot", + "alot", + "alot", + "alot", + "notmuch", + "alot", + "notmuch", + "alot", + "alot", + "alot", + "notmuch", + "alot" + ], + "xaxis": "x", + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "legend": { + "title": { + "text": "work mode" + }, + "tracegroupgap": 0 + }, + "margin": { + "t": 60 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 
0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + 
], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + 
"ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + 
"#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + 
"x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "title": { + "text": "stress level" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "percent" + } + } + } + }, + "text/html": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fig = px.histogram(answers_df, x='stress level', histnorm='percent', color='work mode')\n", + "fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "bingroup": "x", + "histnorm": "percent", + "hovertemplate": "using meditation=%{x}
percent=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "type": "histogram", + "x": [ + "usedtopractice", + "wanttotry", + "usedtopractice", + "dontwant", + "usedtopractice", + "wanttotry", + "practice", + "practice", + "wanttotry", + "practice", + "usedtopractice", + "practice", + "usedtopractice", + "wanttotry", + "practice", + "usedtopractice", + "practice", + "wanttotry", + "wanttotry", + "wanttotry", + "wanttotry", + "practice", + "practice", + "practice", + "practice", + "practice", + "wanttotry", + "usedtopractice", + "wanttotry", + "wanttotry", + "usedtopractice", + "usedtopractice", + "usedtopractice", + "practice", + "wanttotry", + "wanttotry", + "dontwant", + "wanttotry", + "dontwant", + "wanttotry", + "wanttotry", + "practice", + "practice", + "usedtopractice", + "dontwant", + "practice", + "wanttotry", + "dontwant", + "usedtopractice", + "wanttotry" + ], + "xaxis": "x", + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "legend": { + "tracegroupgap": 0 + }, + "margin": { + "t": 60 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": 
"carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { 
+ "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, 
+ "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + 
"#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": 
"white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "title": { + "text": "using meditation" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "percent" + } + } + } + }, + "text/html": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fig = px.histogram(answers_df, x='using meditation', histnorm='percent')\n", + "fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "bingroup": "x", + "histnorm": "percent", + "hovertemplate": "pay for apps=%{x}
percent=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "type": "histogram", + "x": [ + "dontneed", + "yes", + "yes", + "dontpay", + "yes", + "dontneed", + "yes", + "dontneed", + "yes", + "dontneed", + "dontpay", + "dontpay", + "dontneed", + "dontpay", + "yes", + "dontpay", + "yes", + "dontpay", + "dontpay", + "yes", + "yes", + "dontneed", + "dontneed", + "dontneed", + "dontneed", + "yes", + "yes", + "yes", + "yes", + "dontneed", + "dontneed", + "dontneed", + "yes", + "yes", + "yes", + "yes", + "dontneed", + "yes", + "yes", + "dontpay", + "dontneed", + "yes", + "dontpay", + "dontpay", + "dontpay", + "yes", + "dontpay", + "dontpay", + "yes", + "yes" + ], + "xaxis": "x", + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "legend": { + "tracegroupgap": 0 + }, + "margin": { + "t": 60 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": 
[ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 
0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { 
+ "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + 
"#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + 
"linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "title": { + "text": "pay for apps" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "percent" + } + } + } + }, + "text/html": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fig = px.histogram(answers_df, x='pay for apps', histnorm='percent')\n", + "fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "bingroup": "x", + "hovertemplate": "variable=goingout
value=%{x}
count=%{y}", + "legendgroup": "goingout", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "goingout", + "offsetgroup": "goingout", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + true, + true, + null, + null, + null, + null, + null, + null, + null, + null, + true, + true, + null, + null, + true, + true, + null, + true, + true, + null, + null, + null, + null, + null, + null, + null, + true, + null, + null, + null, + null, + null, + null, + null, + null, + true, + null, + true, + null, + true, + null, + true, + null, + null, + null, + true, + null, + null, + true, + null + ], + "xaxis": "x", + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "bingroup": "x", + "hovertemplate": "variable=sleeping
value=%{x}
count=%{y}", + "legendgroup": "sleeping", + "marker": { + "color": "#EF553B", + "pattern": { + "shape": "" + } + }, + "name": "sleeping", + "offsetgroup": "sleeping", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + true, + null, + null, + true, + true, + null, + true, + null, + null, + true, + true, + null, + true, + null, + true, + true, + true, + true, + true, + true, + true, + null, + true, + true, + null, + true, + true, + true, + true, + true, + null, + null, + true, + true, + true, + true, + true, + true, + null, + null, + true, + true, + true, + true, + null, + true, + true, + null, + true, + true + ], + "xaxis": "x", + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "bingroup": "x", + "hovertemplate": "variable=family
value=%{x}
count=%{y}", + "legendgroup": "family", + "marker": { + "color": "#00cc96", + "pattern": { + "shape": "" + } + }, + "name": "family", + "offsetgroup": "family", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + true, + true, + true, + null, + null, + true, + true, + null, + null, + null, + true, + null, + true, + null, + true, + true, + null, + true, + null, + true, + true, + null, + null, + true, + true, + null, + true, + null, + null, + null, + null, + null, + null, + true, + null, + true, + true, + true, + true, + null, + true, + true, + null, + true, + true, + true, + null, + true, + null, + true + ], + "xaxis": "x", + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "bingroup": "x", + "hovertemplate": "variable=other
value=%{x}
count=%{y}", + "legendgroup": "other", + "marker": { + "color": "#ab63fa", + "pattern": { + "shape": "" + } + }, + "name": "other", + "offsetgroup": "other", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + null, + true, + null, + true, + null, + true, + null, + true, + true, + true, + null, + true, + null, + true, + null, + true, + null, + null, + null, + null, + null, + true, + null, + true, + null, + null, + null, + null, + null, + true, + null, + null, + null, + null, + true, + null, + null, + null, + true, + null, + null, + null, + true, + null, + true, + null, + null, + null, + null, + null + ], + "xaxis": "x", + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "bingroup": "x", + "hovertemplate": "variable=sport
value=%{x}
count=%{y}", + "legendgroup": "sport", + "marker": { + "color": "#FFA15A", + "pattern": { + "shape": "" + } + }, + "name": "sport", + "offsetgroup": "sport", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + null, + null, + null, + null, + true, + null, + true, + null, + null, + null, + true, + null, + null, + null, + null, + true, + true, + null, + null, + null, + null, + null, + true, + null, + null, + null, + null, + true, + null, + null, + null, + true, + null, + null, + null, + null, + null, + true, + null, + null, + true, + true, + null, + true, + null, + null, + null, + true, + null, + null + ], + "xaxis": "x", + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "bingroup": "x", + "hovertemplate": "variable=alcohol
value=%{x}
count=%{y}", + "legendgroup": "alcohol", + "marker": { + "color": "#19d3f3", + "pattern": { + "shape": "" + } + }, + "name": "alcohol", + "offsetgroup": "alcohol", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + null, + null, + null, + null, + true, + null, + null, + null, + true, + null, + true, + null, + true, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + true, + null, + null, + true, + null, + null, + true, + null, + true, + null, + null, + null, + null, + null, + null, + true, + null, + null, + null, + true + ], + "xaxis": "x", + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "bingroup": "x", + "hovertemplate": "variable=meditation
value=%{x}
count=%{y}", + "legendgroup": "meditation", + "marker": { + "color": "#FF6692", + "pattern": { + "shape": "" + } + }, + "name": "meditation", + "offsetgroup": "meditation", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + null, + null, + null, + null, + null, + null, + null, + null, + null, + true, + null, + true, + true, + null, + true, + null, + true, + null, + null, + null, + null, + null, + true, + true, + null, + true, + null, + null, + null, + null, + null, + null, + null, + true, + null, + null, + null, + null, + null, + null, + null, + true, + true, + null, + null, + true, + null, + null, + null, + null + ], + "xaxis": "x", + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "bingroup": "x", + "hovertemplate": "variable=therapy
value=%{x}
count=%{y}", + "legendgroup": "therapy", + "marker": { + "color": "#B6E880", + "pattern": { + "shape": "" + } + }, + "name": "therapy", + "offsetgroup": "therapy", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + true, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + true, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + true, + null, + null, + null, + true, + null, + true, + null, + null + ], + "xaxis": "x", + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "bingroup": "x", + "hovertemplate": "variable=none
value=%{x}
count=%{y}", + "legendgroup": "none", + "marker": { + "color": "#FF97FF", + "pattern": { + "shape": "" + } + }, + "name": "none", + "offsetgroup": "none", + "orientation": "v", + "showlegend": true, + "type": "histogram", + "x": [ + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + true, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null, + null + ], + "xaxis": "x", + "yaxis": "y" + } + ], + "layout": { + "barmode": "group", + "legend": { + "title": { + "text": "variable" + }, + "tracegroupgap": 0 + }, + "margin": { + "t": 60 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 
0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + 
], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + 
"colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + 
"#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": 
"white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "title": { + "text": "value" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "count" + } + } + } + }, + "text/html": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import json\n", + "coping_df = pandas.json_normalize(answers_df['coping with stress'].apply(lambda x : json.loads(x)))\n", + "fig = px.histogram(coping_df, barmode='group')\n", + "fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/8.search_relevance/img/performer_interface.png b/examples/2. cases-for-applied-tasks/3.search_relevance/img/performer_interface.png similarity index 100% rename from examples/8.search_relevance/img/performer_interface.png rename to examples/2. cases-for-applied-tasks/3.search_relevance/img/performer_interface.png diff --git a/examples/8.search_relevance/img/task_suite_interface.png b/examples/2. cases-for-applied-tasks/3.search_relevance/img/task_suite_interface.png similarity index 100% rename from examples/8.search_relevance/img/task_suite_interface.png rename to examples/2. cases-for-applied-tasks/3.search_relevance/img/task_suite_interface.png diff --git a/examples/2. cases-for-applied-tasks/3.search_relevance/search_relevance.ipynb b/examples/2. cases-for-applied-tasks/3.search_relevance/search_relevance.ipynb new file mode 100644 index 00000000..ff1fdd49 --- /dev/null +++ b/examples/2. 
cases-for-applied-tasks/3.search_relevance/search_relevance.ipynb @@ -0,0 +1,976 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Search relevance\n", + "\n", + "We have a set of search queries and products on a website. We need to determine the extent to which each query is relevant to the corresponding product on the website. We ask performers to look at the search query and the product image from the website and rate the relevance level." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Call to action\n", + "If you find a bug or have an idea for a new feature, don't hesitate to [open a new issue on GitHub](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", + "Like our library and examples? Star [our repo on GitHub](https://github.com/Toloka/toloka-kit)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Prepare the environment and import everything we'll need." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==0.1.26\n", + "!pip install crowd-kit==1.0.0\n", + "\n", + "import datetime\n", + "import sys\n", + "import time\n", + "import logging\n", + "import getpass\n", + "import urllib.request\n", + "\n", + "import pandas\n", + "import numpy as np\n", + "\n", + "import toloka.client as toloka\n", + "import toloka.client.project.template_builder as tb\n", + "from crowdkit.aggregation import DawidSkene" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(\n", + "    format='[%(levelname)s] %(name)s: %(message)s',\n", + "    level=logging.INFO,\n", + "    stream=sys.stdout,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a toloka-client instance. All API calls will go through it. 
More about OAuth token in our [Learn the basics example](https://github.com/Toloka/toloka-kit/tree/main/examples/0.getting_started/0.learn_the_basics) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'\n", + "print(toloka_client.get_requester())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a project\n", + "Enter a clear project name and description.\n", + "> Note: The project name and description will be visible to the performers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project = toloka.Project(\n", + " public_name='Classify search query relevance',\n", + " public_description='Analyze a website with a product and decide to what extent it meets the search query',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create task interface.\n", + "> Read about configuring the [task interface](https://toloka.ai/en/docs/guide/reference/interface-spec?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in the Requester’s Guide.\n", + "\n", + "> Check the [Interface section](https://toloka.ai/knowledgebase/interface?utm_source=github&utm_medium=site&utm_campaign=tolokakit) of our Knowledge Base for more tips on interface design.\n", + "\n", + "This interface contains a query, a picture of a product, and its title, which needs to be assessed. There is a button for checking this query in Google, which is handy because the query might not be obvious and performers will often need to look it up. 
There is also a plugin that checks if a label was really chosen." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# left column\n", + "product_image = tb.ImageViewV1(tb.InputData('imagepath'))\n", + "product_description = tb.MarkdownViewV1(tb.InputData('title'), label='Product title:')\n", + "\n", + "# right column\n", + "request = tb.AlertViewV1(tb.TextViewV1(tb.InputData('query')), label='Search query', theme='info')\n", + "google_link = tb.ActionButtonViewV1(tb.OpenLinkActionV1(tb.InputData('search_url')), label='Search query in Google')\n", + "divider = tb.DividerViewV1()\n", + "label = tb.RadioGroupFieldV1(\n", + " tb.OutputData('result_class'),\n", + " label='Choose relevance class',\n", + " options=[\n", + " tb.GroupFieldOption('relevant', 'Relevant'),\n", + " tb.GroupFieldOption('relevant_minus', 'Slightly relevant'),\n", + " tb.GroupFieldOption('irrelevant', 'Irrelevant'),\n", + " ],\n", + " validation=tb.RequiredConditionV1()\n", + " )\n", + "\n", + "# create interface with two columns\n", + "general_interface = tb.SidebarLayoutV1(\n", + " tb.ListViewV1([product_image, product_description], direction='vertical'),\n", + " tb.ListViewV1([request, google_link, divider, label], direction='vertical'),\n", + " min_width=400,\n", + ")\n", + "\n", + "task_width_plugin = tb.TolokaPluginV1(\n", + " layout=tb.TolokaPluginV1.TolokaPluginLayout(\n", + " kind='scroll',\n", + " task_width=600,\n", + " )\n", + ")\n", + "\n", + "project_interface = toloka.project.TemplateBuilderViewSpec(\n", + " view=tb.ListViewV1([general_interface]),\n", + " plugins=[task_width_plugin],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For performers, our interface will look like this.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Task\n", + "
\n", + " Figure 1. What the task can look like.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Specifications are a description of input data that will be used in a project and the output data that will be collected from the performers.\n", + "\n", + "We are using screenshots to make this demo more robust against possible webpage changes. Another way is to use an iframe and let the performers assess the whole webpage.\n", + "\n", + "> Read more about [input and output data specifications](https://yandex.ru/support/toloka-tb/operations/create-specs.html?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in the Requester’s Guide." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "input_specification = {\n", + " 'imagepath': toloka.project.UrlSpec(),\n", + " 'title': toloka.project.StringSpec(),\n", + " 'query': toloka.project.StringSpec(),\n", + " 'search_url': toloka.project.UrlSpec(),\n", + "}\n", + "output_specification = {'result_class': toloka.project.StringSpec()}\n", + "\n", + "project.task_spec = toloka.project.task_spec.TaskSpec(\n", + " input_spec=input_specification,\n", + " output_spec=output_specification,\n", + " view_spec=project_interface,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Write comprehensive instructions.\n", + "\n", + "Instructions are essential for complex tasks like relevance evaluation that are based on a set of rules and various criteria. Make sure to not only describe the general idea, but also go through examples and explain the evaluation logic in each case. We recommend trying to evaluate around two dozen cases yourself to get more insights for the instructions.\n", + "\n", + "> Get more tips on [designing instructions](https://toloka.ai/knowledgebase/instruction?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Knowledge Base." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project.public_instructions = \"\"\"Your task is to determine whether a product is relevant to the search query and to what degree.
\n", + "
\n", + "Imagine that you're searching for this product and get such an answer for your query.
\n", + "
\n", + "Basic steps:\n", + "\n", + "
\n", + "If image is too small click on the expand button!\n", + "
\n", + "Relevant:
\n", + "
  1. The product fully matches the query
\n", + "
\n", + "Slightly relevant:
\n", + "
  1. The product is somewhat right but some properties are different
\n", + "
\n", + "Irrelevant:
\n", + "
  1. There is a completely different product in the image
  2. \n", + "
  3. The title doesn't match the query at all
\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "example_images = [\n", + " {\n", + " 'label': 'Relevant',\n", + " 'product': 'Bodum Bistro Electric Burr Coffee Grinder-(Brand New)',\n", + " 'query': 'coffee grinder',\n", + " 'img_url': 'https://tlklab.s3.yandex.net/screenshots/1026.jpg'\n", + " },\n", + " {\n", + " 'label': 'Slightly relevant',\n", + " 'product': 'The Hobbit: The Desolation of Smaug',\n", + " 'query': 'Bluray Hobbit extended',\n", + " 'img_url': 'https://tlklab.s3.yandex.net/screenshots/1037.jpg'\n", + " },\n", + " {\n", + " 'label': 'Irrelevant',\n", + " 'product': 'NEW IKEA RUSCH BATTERY OPERATED WHITE WALL CLOCK',\n", + " 'query': 'stop watches',\n", + " 'img_url': 'https://tlklab.s3.yandex.net/screenshots/1066.jpg'\n", + " },\n", + "]\n", + "\n", + "table_rows = ''.join([\n", + " f'{row[\"label\"]}'\n", + " f'{row[\"product\"]}'\n", + " f'{row[\"query\"]}'\n", + " f'\"{row[\"label\"]}\"\\n'\n", + " for row in example_images\n", + "])\n", + "\n", + "project.public_instructions = project.public_instructions + f\"\"\"\n", + "
\n", + "Examples:
\n", + "\n", + "\n", + "{table_rows}\n", + "
ClassProductQueryImage
\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a project." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project = toloka_client.create_project(project)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preparing data\n", + "This example uses the [eCommerce search relevance](https://data.world/crowdflower/ecommerce-search-relevance) dataset, which is distributed under the Public Domain License [![License: PDDL](https://img.shields.io/badge/License-PDDL-brightgreen.svg)](https://opendatacommons.org/licenses/pddl/)\n", + "\n", + "Let's load this dataset and shuffle it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!curl https://tlk.s3.yandex.net/ext_dataset/ecommerce_search_relevance.csv --output dataset.csv\n", + "\n", + "dataset = pandas.read_csv('dataset.csv')\n", + "dataset = dataset.sample(frac=1).reset_index(drop=True)\n", + "\n", + "with pandas.option_context(\"max_colwidth\", 100):\n", + "    display(dataset)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because this is an old dataset, some image links may be broken, so we need to check the images. 
Let's take 80 rows with valid images:\n", + "- 10 - for training\n", + "- 10 - for exam\n", + "- 10 - for golden-set in the main pool\n", + "- 50 - main tasks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rows_cnt = 80\n", + "new_dataset = pandas.DataFrame(columns=['relevance', 'product_image', 'product_title', 'query'])\n", + "for row in dataset.itertuples():\n", + " try:\n", + " response = urllib.request.urlopen(row.product_image)\n", + " data = response.read()\n", + " if len(data) > 2000:\n", + " new_dataset = new_dataset.append(\n", + " {\n", + " 'relevance': row.relevance,\n", + " 'product_image': row.product_image,\n", + " 'product_title': row.product_title,\n", + " 'query': row.query,\n", + " },\n", + " ignore_index=True\n", + " )\n", + " print(len(new_dataset), row.product_image)\n", + " if len(new_dataset) >= rows_cnt:\n", + " break\n", + " except:\n", + " pass" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Split dataset into 4 parts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataset_with_answers = new_dataset[~new_dataset['relevance'].isna()].head(30)\n", + "main_dataset = new_dataset.drop(dataset_with_answers.index)\n", + "training_dataset, exam_dataset, gold_dataset = np.split(dataset_with_answers, [10, 20], axis=0)\n", + "\n", + "print(f'training_dataset - {len(training_dataset)}')\n", + "print(f'exam_dataset - {len(exam_dataset)}')\n", + "print(f'gold_dataset - {len(gold_dataset)}')\n", + "print(f'main_dataset - {len(main_dataset)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the dataset relevance is a float, where 1.0 is \"irrelevant\" and 4.0 is absolutely \"relevant\". But in our project, we need three string labels. Let's prepare function to convert one to another." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def str_relevance(relevance: float) -> str:\n", + " if relevance > 3:\n", + " return 'relevant'\n", + " if relevance > 2:\n", + " return 'relevant_minus'\n", + " return 'irrelevant'\n", + "\n", + "print(str_relevance(1.0))\n", + "print(str_relevance(3.0))\n", + "print(str_relevance(3.66))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a training pool\n", + "\n", + "Since relevance evaluation is based on rules, not just common sense or certain skills, we recommend investing some time on learning how to explain all the rules. Training needs to involve both common and extreme cases. The comments should explain the underlying logic rather than just state the correct answers.\n", + "\n", + "> A well-grounded training exercise is also a great tool for scaling your task, because you can run it any time you need new performers.\n", + "\n", + "Read more about [selecting performers](https://toloka.ai/knowledgebase/quality-control?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Knowledge Base.\n", + "\n", + "Read more about [training pools](https://toloka.ai/en/docs/guide/concepts/train?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Requester’s Guide." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "training = toloka.Training(\n", + " project_id=project.id,\n", + " private_name='Search relevance training',\n", + " may_contain_adult_content=False,\n", + " assignment_max_duration_seconds=60*10,\n", + " mix_tasks_in_creation_order=True,\n", + " shuffle_tasks_in_task_suite=True,\n", + " training_tasks_in_task_suite_count=2,\n", + " task_suites_required_to_pass=5,\n", + " retry_training_after_days=2,\n", + " inherited_instructions=True,\n", + ")\n", + "training = toloka_client.create_training(training)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Upload training tasks to the pool, without opening the training pool.\n", + "\n", + "> We recommend opening the training pool along with the main pool. Otherwise Tolokers will spend their time on training but get no access to real tasks, which is frustrating. Also, do not forget to close the training pool when there are no main tasks available anymore." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hint_messages = {\n", + " 'irrelevant': 'The product does not fit the request.',\n", + " 'relevant_minus': 'The product is similar, but does not fully satisfy the request.',\n", + " 'relevant': 'The product fully satisfies the request.',\n", + "}\n", + "\n", + "training_tasks = [\n", + " toloka.Task(\n", + " pool_id=training.id,\n", + " input_values={\n", + " 'imagepath': row.product_image,\n", + " 'title': row.product_title,\n", + " 'query': row.query,\n", + " 'search_url': f'https://www.google.ru/search?q={row.query}',\n", + " },\n", + " known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'result_class': str_relevance(row.relevance)})],\n", + " message_on_unknown_solution=hint_messages[str_relevance(row.relevance)],\n", + " )\n", + " for row in training_dataset.itertuples()\n", + "]\n", + "result = toloka_client.create_tasks(training_tasks, allow_defaults=True)\n", + "print(len(result.items))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create an exam pool\n", + "We recommend adding an exam pool along with the training because relevance evaluation projects are usually more complicated than most crowdsourcing projects, and it takes a certain effort to master all the guidelines. The more guidelines there are, the greater the need to check whether the performers have really learned them.\n", + "\n", + "Set up exam quality calculation via a skill.\n", + "Create a new skill." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "exam_skill = next(toloka_client.get_skills(name='Search relevance exam'), None)\n", + "if exam_skill:\n", + " print('Skill already exists')\n", + "else:\n", + " exam_skill = toloka_client.create_skill(\n", + " name='Search relevance exam',\n", + " hidden=True,\n", + " public_requester_description={'EN': 'How well the performer did on the search relevance exam'},\n", + " )\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Set the price per task suite (for example, $0.03).\n", + "> You can use a zero price as well. However, if the exam is time-consuming, a zero price might be unfair, as the performers will spend a lot of time completing it.\n", + "\n", + "Read more about [pricing principles](https://toloka.ai/knowledgebase/pricing?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Knowledge Base." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "exam = toloka.Pool(\n", + " project_id=project.id,\n", + " # Give the pool any convenient name. 
You are the only one who will see it.\n", + " private_name='Classify search query relevance - exam',\n", + " may_contain_adult_content=False,\n", + " # Set the price per task page.\n", + " reward_per_assignment=0.03,\n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),\n", + " # Time allowed for completing a task page\n", + " assignment_max_duration_seconds=600,\n", + " filter=(toloka.filter.Languages.in_('EN')),\n", + ")\n", + "\n", + "exam.set_mixer_config(golden_tasks_count=10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Attach the training pool.\n", + "\n", + "> The quality of the training can be low to just filter out potential deception, because we expect performers to make mistakes and learn from them (yet again, relevance is a complicated task type)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "exam.set_training_requirement(training_pool_id=training.id, training_passing_skill_value=10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will have 10 tasks in the exam pool, so the quality will be calculated after the whole exam has been passed.\n", + "> We will then use this parameter as an entry filter for the main pool." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "exam.quality_control.add_action(\n", + " collector=toloka.collectors.GoldenSet(history_size=10),\n", + " conditions=[toloka.conditions.TotalAnswersCount >= 10,],\n", + " action=toloka.actions.SetSkillFromOutputField(\n", + " skill_id=exam_skill.id,\n", + " from_field='correct_answers_rate',\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "exam = toloka_client.create_pool(exam)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Add tasks to exam." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "exam_tasks = [\n", + " toloka.Task(\n", + " pool_id=exam.id,\n", + " input_values={\n", + " 'imagepath': row.product_image,\n", + " 'title': row.product_title,\n", + " 'query': row.query,\n", + " 'search_url': f'https://www.google.ru/search?q={row.query}',\n", + " },\n", + " known_solutions = [toloka.task.BaseTask.KnownSolution(output_values={'result_class': str_relevance(row.relevance)})],\n", + " infinite_overlap=True,\n", + " )\n", + " for row in exam_dataset.itertuples()\n", + "]\n", + "result = toloka_client.create_tasks(exam_tasks, allow_defaults=True)\n", + "print(len(result.items))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create the main pool\n", + "A pool is a set of paid tasks grouped into task pages. These tasks are sent out for completion at the same time.\n", + "\n", + ">Note: All tasks within a pool have the same settings (price, quality control, etc.)\n", + "\n", + "Set the price per task suite to $0.03. Read more about [pricing principles](https://toloka.ai/knowledgebase/pricing?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Knowledge Base.\n", + "\n", + "Set an overlap of 3. This is the number of users who will complete the same task. We will aggregate the results after the pool is completed. To understand [how this rule works](https://toloka.ai/en/docs/guide/concepts/mvote?utm_source=github&utm_medium=site&utm_campaign=tolokakit), go to the Requester’s Guide.\n", + "\n", + "Let's add a language filter so that only performers who speak English are invited to complete this task." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka.Pool(\n", + " project_id=project.id,\n", + " # Give the pool any convenient name. 
You are the only one who will see it.\n", + " private_name='Classify search query relevance',\n", + " may_contain_adult_content=False,\n", + " # Set the price per task page.\n", + " reward_per_assignment=0.03,\n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),\n", + " # Overlap. This is the number of users who will complete the same task.\n", + " defaults=toloka.Pool.Defaults(default_overlap_for_new_task_suites=3),\n", + " # Time allowed for completing a task page\n", + " assignment_max_duration_seconds=600,\n", + " filter=(\n", + " (toloka.filter.Languages.in_('EN')) &\n", + " (toloka.filter.Skill(exam_skill.id) >= 90)\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Set up [Quality control](https://toloka.ai/en/docs/guide/concepts/control?utm_source=github&utm_medium=site&utm_campaign=tolokakit):\n", + " - Set the number of responses and the percentage of correct responses. We will record a quality parameter in the same skill we used in the quality filter.\n", + " - Set up the [Fast responses](https://toloka.ai/en/docs/guide/concepts/quick-answers?utm_source=github&utm_medium=site&utm_campaign=tolokakit) rule. This rule allows you to ban performers who submit tasks at a suspiciously high speed.\n", + "\n", + "Read more about [quality control principles](https://toloka.ai/knowledgebase/quality-control?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Knowledge Base or check out [control tasks settings](https://toloka.ai/en/docs/guide/concepts/goldenset?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in the Requester’s Guide." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.GoldenSet(history_size=20),\n", + " conditions=[toloka.conditions.TotalAnswersCount >= 1,],\n", + " action=toloka.actions.SetSkillFromOutputField(\n", + " skill_id=exam_skill.id,\n", + " from_field='correct_answers_rate',\n", + " ),\n", + ")\n", + "\n", + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=10),\n", + " conditions=[toloka.conditions.FastSubmittedCount >= 1],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope=toloka.user_restriction.UserRestriction.PROJECT,\n", + " duration=10,\n", + " duration_unit='DAYS',\n", + " private_comment='Fast responses',\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Specify the number of tasks per page. We recommend putting as many tasks on one page as a performer can complete in 1 to 5 minutes. That way, performers are less likely to get tired, and they won’t lose a significant amount of data if a technical issue occurs.\n", + "\n", + "To learn more about [grouping tasks](https://toloka.ai/en/docs/guide/concepts/distribute-tasks-by-pages?utm_source=github&utm_medium=site&utm_campaign=tolokakit) into suites, read the Requester’s Guide." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool.set_mixer_config(\n", + " real_tasks_count=4,\n", + " golden_tasks_count=1,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create pool" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka_client.create_pool(pool)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preparing and uploading tasks\n", + "\n", + "We recommend putting as many tasks on one page as a performer can complete in 1 to 5 minutes. That way, performers are less likely to get tired, and they won’t lose a significant amount of data if a technical issue occurs.\n", + "\n", + "To learn more about [grouping tasks](https://toloka.ai/en/docs/guide/concepts/distribute-tasks-by-pages?utm_source=github&utm_medium=site&utm_campaign=tolokakit) into suites, read the Requester’s Guide." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "golden_tasks = [\n", + " toloka.Task(\n", + " pool_id=pool.id,\n", + " input_values={\n", + " 'imagepath': row.product_image,\n", + " 'title': row.product_title,\n", + " 'query': row.query,\n", + " 'search_url': f'https://www.google.ru/search?q={row.query}',\n", + " },\n", + " known_solutions = [toloka.task.BaseTask.KnownSolution(output_values={'result_class': str_relevance(row.relevance)})],\n", + " infinite_overlap=True,\n", + " )\n", + " for row in gold_dataset.itertuples()\n", + "]\n", + "\n", + "tasks = [\n", + " toloka.Task(\n", + " pool_id=pool.id,\n", + " input_values={\n", + " 'imagepath': row.product_image,\n", + " 'title': row.product_title,\n", + " 'query': row.query,\n", + " 'search_url': f'https://www.google.ru/search?q={row.query}',\n", + " },\n", + " )\n", + " for row in main_dataset.itertuples()\n", + "]\n", + "created_tasks = toloka_client.create_tasks(golden_tasks + tasks, allow_defaults=True)\n", + "print(len(created_tasks.items))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can visit the web interface to preview the task suites.\n", + "\n", + " \n", + " \n", + "
\n", + " Figure 2. How performers will see your tasks\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Start the pools.\n", + "\n", + "**Important.** Remember that real Toloka performers will complete the tasks.\n", + "Double check that everything is correct\n", + "with your project configuration before you start the pool" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "training = toloka_client.open_training(training.id)\n", + "print(f'training - {training.status}')\n", + "\n", + "exam = toloka_client.open_pool(exam.id)\n", + "print(f'exam - {exam.status}')\n", + "\n", + "pool = toloka_client.open_pool(pool.id)\n", + "print(f'main pool - {pool.status}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Receiving responses" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Wait until the pool is completed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pool_id = pool.id\n", + "\n", + "def wait_pool_for_close(pool_id, minutes_to_wait=1):\n", + " sleep_time = 60 * minutes_to_wait\n", + " pool = toloka_client.get_pool(pool_id)\n", + " while not pool.is_closed():\n", + " op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])\n", + " op = toloka_client.wait_operation(op)\n", + " percentage = op.details['value'][0]['result']['value']\n", + " print(\n", + " f' {datetime.datetime.now().strftime(\"%H:%M:%S\")}\\t'\n", + " f'Pool {pool.id} - {percentage}%'\n", + " )\n", + " time.sleep(sleep_time)\n", + " pool = toloka_client.get_pool(pool.id)\n", + " print('Pool was closed.')\n", + "\n", + "wait_pool_for_close(pool_id)\n", + "\n", + "exam = toloka_client.close_pool(exam.id)\n", + "print(f'exam - {exam.status}')\n", + "\n", + "training = toloka_client.close_training(training.id)\n", + "print(f'training - {training.status}')" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "Get responses\n", + "\n", + "When all the tasks are completed, look at the responses from performers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "answers_df = toloka_client.get_assignments_df(pool.id, field=['ASSIGNMENT:task_id', 'ASSIGNMENT:worker_id'])\n", + "\n", + "answers_df = answers_df[answers_df['GOLDEN:result_class'].isna()]\n", + "\n", + "answers_df = answers_df.rename(columns={\n", + " 'ASSIGNMENT:task_id': 'task',\n", + " 'OUTPUT:result_class': 'label',\n", + " 'ASSIGNMENT:worker_id': 'worker',\n", + " 'INPUT:query': 'query',\n", + " 'INPUT:imagepath': 'imagepath',\n", + " 'INPUT:title': 'title',\n", + "})\n", + "\n", + "answers_to_aggregate = answers_df[['task', 'label', 'worker']]\n", + "\n", + "with pandas.option_context(\"max_colwidth\", None):\n", + " display(answers_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Aggregate the results using the Dawid-Skene model. We use this aggregation model because our questions are of comparable difficulty, and we don't have many control tasks.\n", + "\n", + "Read more about the [Dawid-Skene model](https://toloka.ai/en/docs/guide/concepts/result-aggregation?utm_source=github&utm_medium=site&utm_campaign=tolokakit#aggr__dawid-skene) in the Requester’s Guide or get an overview of different [aggregation models](https://toloka.ai/knowledgebase/aggregation) in our Knowledge Base.\n", + "\n", + "More aggregation models are available in [Crowd-Kit](https://github.com/Toloka/crowd-kit#crowd-kit-computational-quality-control-for-crowdsourcing)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "predicted_answers = DawidSkene(n_iter=20).fit_predict(answers_to_aggregate).reset_index(name='result')\n", + "\n", + "predicted_answers = pandas.merge(predicted_answers, answers_df.drop_duplicates(subset='task'), on='task')\n", + "\n", + "with pandas.option_context(\"max_colwidth\", None):\n", + " display(predicted_answers[['query', 'imagepath', 'title', 'result']])" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/2. cases-for-applied-tasks/3.survey/simplest_survey/img/task_interface.png b/examples/2. cases-for-applied-tasks/3.survey/simplest_survey/img/task_interface.png new file mode 100644 index 00000000..a4baa2d7 Binary files /dev/null and b/examples/2. cases-for-applied-tasks/3.survey/simplest_survey/img/task_interface.png differ diff --git a/examples/7.survey/simplest_survey/simplest_survey.ipynb b/examples/2. cases-for-applied-tasks/3.survey/simplest_survey/simplest_survey.ipynb similarity index 100% rename from examples/7.survey/simplest_survey/simplest_survey.ipynb rename to examples/2. cases-for-applied-tasks/3.survey/simplest_survey/simplest_survey.ipynb diff --git a/examples/2. cases-for-applied-tasks/4.search_relevance/img/performer_interface.png b/examples/2. cases-for-applied-tasks/4.search_relevance/img/performer_interface.png new file mode 100644 index 00000000..c545e2eb Binary files /dev/null and b/examples/2. cases-for-applied-tasks/4.search_relevance/img/performer_interface.png differ diff --git a/examples/2. 
cases-for-applied-tasks/4.search_relevance/img/task_suite_interface.png b/examples/2. cases-for-applied-tasks/4.search_relevance/img/task_suite_interface.png new file mode 100644 index 00000000..d1e46985 Binary files /dev/null and b/examples/2. cases-for-applied-tasks/4.search_relevance/img/task_suite_interface.png differ diff --git a/examples/8.search_relevance/search_relevance.ipynb b/examples/2. cases-for-applied-tasks/4.search_relevance/search_relevance.ipynb similarity index 100% rename from examples/8.search_relevance/search_relevance.ipynb rename to examples/2. cases-for-applied-tasks/4.search_relevance/search_relevance.ipynb diff --git a/examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/configs/pool.json b/examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/configs/pool.json new file mode 100644 index 00000000..c4f11015 --- /dev/null +++ b/examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/configs/pool.json @@ -0,0 +1,157 @@ +{ + "assignment_max_duration_seconds": 1200, + "assignments_issuing_config": { + "issue_task_suites_in_creation_order": false + }, + "auto_accept_period_day": 21, + "auto_accept_solutions": true, + "auto_close_after_complete_delay_seconds": 0, + "created": null, + "defaults": { + "default_overlap_for_new_task_suites": 3 + }, + "filter": { + "and": [ + { + "category": "profile", + "key": "languages", + "operator": "IN", + "value": "EN" + }, + { + "or": [ + { + "category": "computed", + "key": "client_type", + "operator": "EQ", + "value": "TOLOKA_APP" + }, + { + "category": "computed", + "key": "client_type", + "operator": "EQ", + "value": "BROWSER" + } + ] + } + ] + }, + "id": null, + "may_contain_adult_content": false, + "mixer_config": { + "golden_tasks_count": 1, + "real_tasks_count": 9, + "training_tasks_count": 0 + }, + "owner": { + "company_id": "1", + "id": "b39ea2ce2474c437ed0ee0d4aeec630b", + "myself": true + }, + "priority": 0, + "private_name": "Classify customer reviews as positive or 
negative", + "project_id": "61655", + "quality_control": { + "captcha_frequency": "MEDIUM", + "configs": [ + { + "collector_config": { + "parameters": { + "history_size": 10 + }, + "type": "CAPTCHA", + "uuid": "ca48a2a7-c100-4677-a85d-a933be5a94d5" + }, + "rules": [ + { + "action": { + "parameters": { + "duration": 1, + "duration_unit": "DAYS", + "private_comment": "captcha", + "scope": "PROJECT" + }, + "type": "RESTRICTION_V2" + }, + "conditions": [ + { + "key": "stored_results_count", + "operator": "GTE", + "value": 4 + }, + { + "key": "success_rate", + "operator": "LT", + "value": 75.0 + } + ] + } + ] + }, + { + "collector_config": { + "parameters": { + "fast_submit_threshold_seconds": 20 + }, + "type": "ASSIGNMENT_SUBMIT_TIME", + "uuid": "0877ba2f-3665-4c99-8003-3314ce4b8882" + }, + "rules": [ + { + "action": { + "parameters": { + "duration": 1, + "duration_unit": "DAYS", + "private_comment": "fast responses", + "scope": "PROJECT" + }, + "type": "RESTRICTION_V2" + }, + "conditions": [ + { + "key": "total_submitted_count", + "operator": "GT", + "value": 4 + }, + { + "key": "fast_submitted_count", + "operator": "GT", + "value": 2 + } + ] + } + ] + }, + { + "collector_config": { + "type": "ANSWER_COUNT", + "uuid": "1288e38e-83d0-4397-aee9-865761050951" + }, + "rules": [ + { + "action": { + "parameters": { + "duration": 1, + "duration_unit": "DAYS", + "private_comment": "too many responses", + "scope": "PROJECT" + }, + "type": "RESTRICTION_V2" + }, + "conditions": [ + { + "key": "assignments_accepted_count", + "operator": "GTE", + "value": 30 + } + ] + } + ] + } + ] + }, + "reward_per_assignment": 0.10, + "status": "CLOSED", + "type": "REGULAR", + "will_expire": null +} diff --git a/examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/configs/project.json b/examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/configs/project.json new file mode 100644 index 00000000..aa4e552a --- /dev/null +++ b/examples/2. 
cases-for-applied-tasks/4.toloka_and_ml_on_prefect/configs/project.json @@ -0,0 +1,47 @@ +{ + "assignments_automerge_enabled": false, + "assignments_issuing_type": "AUTOMATED", + "created": "2021-10-03T23:01:31.291000", + "id": null, + "owner": { + "company_id": "1", + "id": "b39ea2ce2474c437ed0ee0d4aeec630b", + "myself": true + }, + "public_description": "Decide whether a review is positive or negative", + "public_instructions": "

In the task you will have to read customer reviews and define whether they are positive or negative

\n", + "public_name": "Classify customer reviews as positive or negative", + "status": "ACTIVE", + "task_spec": { + "input_spec": { + "review": { + "hidden": false, + "required": true, + "type": "string" + } + }, + "output_spec": { + "sentiment": { + "hidden": false, + "required": true, + "type": "string" + } + }, + "view_spec": { + "config": "{\"plugins\": [{\"layout\": {\"kind\": \"scroll\", \"taskWidth\": 650.0}, \"type\": \"plugin.toloka\"}, {\"1\": {\"data\": {\"path\": \"sentiment\", \"type\": \"data.output\"}, \"payload\": \"pos\", \"type\": \"action.set\"}, \"2\": {\"data\": {\"path\": \"sentiment\", \"type\": \"data.output\"}, \"payload\": \"neg\", \"type\": \"action.set\"}, \"type\": \"plugin.hotkeys\"}], \"view\": {\"items\": [{\"content\": {\"content\": {\"path\": \"review\", \"type\": \"data.input\"}, \"type\": \"view.text\"}, \"type\": \"view.group\"}, {\"data\": {\"path\": \"sentiment\", \"type\": \"data.output\"}, \"label\": \"Is this review positive or negative?\", \"options\": [{\"label\": \"Positive\", \"value\": \"pos\"}, {\"label\": \"Negative\", \"value\": \"neg\"}], \"type\": \"field.button-radio-group\", \"validation\": {\"type\": \"condition.required\"}}], \"type\": \"view.list\"}}", + "localizationConfig": null, + "lock": { + "action.set": "1.0.0", + "condition.required": "1.0.0", + "core": "1.0.0", + "field.button-radio-group": "1.0.0", + "plugin.hotkeys": "1.0.0", + "plugin.toloka": "1.0.0", + "view.group": "1.0.0", + "view.list": "1.0.0", + "view.text": "1.0.0" + }, + "type": "tb" + } + } +} diff --git a/examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/example.ipynb b/examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/example.ipynb new file mode 100644 index 00000000..a411d8c4 --- /dev/null +++ b/examples/2. 
cases-for-applied-tasks/4.toloka_and_ml_on_prefect/example.ipynb @@ -0,0 +1,853 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "dd168c5d", + "metadata": {}, + "source": [ + "This example illustrates how crowdsourcing using Toloka can be made easier and cheaper by integrating an ML model (which we refer to as an autohelper) into the usual pipeline. Furthermore, it shows how to run the whole project in the cloud using [Prefect](https://www.prefect.io/), which makes workflow orchestration much simpler." + ] + }, + { + "cell_type": "markdown", + "id": "80d00e1b", + "metadata": {}, + "source": [ + "The main steps are:\n", + "* setting up Prefect\n", + "* getting predictions using ML\n", + "* evaluating predictions' quality\n", + "* sending tasks with prediction below a certain quality threshold to Toloka users\n", + "* aggregating the results\n", + "\n", + "Such a process leads to better quality and helps spend less by reducing the number of manual tasks" + ] + }, + { + "cell_type": "markdown", + "id": "ee4fac1b", + "metadata": {}, + "source": [ + "## Setting up Prefect" + ] + }, + { + "cell_type": "markdown", + "id": "b5142eca", + "metadata": {}, + "source": [ + "Prefect is a workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine." + ] + }, + { + "cell_type": "markdown", + "id": "a0286c77", + "metadata": {}, + "source": [ + "Prefect offers many options for workflow management. We'll use its [cloud-based service](https://www.prefect.io/cloud/) for orchestration and run examples using local machine and local storage. 
What follows is a quick guide for setting it up (for more detailed information refer to [this material](https://docs.prefect.io/orchestration/getting-started/quick-start.html))" + ] + }, + { + "cell_type": "markdown", + "id": "8179b38a", + "metadata": {}, + "source": [ + "First, let's make sure prefect is installed" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c821700", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install prefect\n", + "# !conda install -c conda-forge prefect\n", + "# !pipenv install --pre prefect" + ] + }, + { + "cell_type": "markdown", + "id": "4892dc48", + "metadata": {}, + "source": [ + "To use Prefect Cloud we'll need to login to (or set up an account for) Prefect Cloud at https://cloud.prefect.io. Once it's done, let's set the backend to use Prefect Cloud" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df466104", + "metadata": {}, + "outputs": [], + "source": [ + "!prefect backend cloud" + ] + }, + { + "cell_type": "markdown", + "id": "7c8dc6e3", + "metadata": {}, + "source": [ + "Next, we'll need to authenticate with the backend - follow [these instructions](https://docs.prefect.io/orchestration/getting-started/set-up.html#authenticate-with-prefect-cloud) to do that, then enter your key in the next cell" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be9f8032", + "metadata": {}, + "outputs": [], + "source": [ + "YOUR_KEY = input()\n", + "!prefect auth login --key $YOUR_KEY" + ] + }, + { + "cell_type": "markdown", + "id": "2e5b8ac2", + "metadata": {}, + "source": [ + "All that remains is to create a project and start an agent that will run Prefect flows on the local machine.\n", + "Prefect agent is responsible for starting and monitoring flow runs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e789d22c", + "metadata": {}, + "outputs": [], + "source": [ + "PROJECT_NAME = input()\n", + "# PROJECT_NAME = 'Toloka test project 1'\n", 
+ "!prefect create project $PROJECT_NAME\n", + "!prefect agent local start" + ] + }, + { + "cell_type": "markdown", + "id": "8691fd95", + "metadata": {}, + "source": [ + "Prefect uses an abstraction called [Executor](https://docs.prefect.io/api/latest/executors.html#executor) to run tasks, which is set to local by default, but also [natively supports](https://docs.prefect.io/orchestration/flow_config/executors.html#daskexecutor) dask. Other storage [types](https://docs.prefect.io/orchestration/flow_config/executors.html#daskexecutor) and agent [options](https://docs.prefect.io/orchestration/flow_config/executors.html#daskexecutor) are also supported, but we'll keep everything local for simplicity." + ] + }, + { + "cell_type": "markdown", + "id": "5a6888dd", + "metadata": {}, + "source": [ + "## Writing code" + ] + }, + { + "cell_type": "markdown", + "id": "7ad96d78", + "metadata": {}, + "source": [ + "For the project, we have a set of customer reviews, and we need to classify them as “Positive” or “Negative”. We ask performers to read a review and decide which category it belongs to.\n", + "For more details refer to an official Toloka-kit [example](https://github.com/Toloka/toloka-kit/blob/main/examples/5.nlp/sentiment_analysis/sentiment_analysis.ipynb) which this project is based on" + ] + }, + { + "cell_type": "markdown", + "id": "d8c1d051", + "metadata": {}, + "source": [ + "### Call to action\n", + "If you found some bugs or have a new feature idea, don't hesitate to [open a new issue on Github](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", + "Like our library and examples? Star [our repo on Github](https://github.com/Toloka/toloka-kit)" + ] + }, + { + "cell_type": "markdown", + "id": "6dac8f66", + "metadata": {}, + "source": [ + "Prepare environment and import all we'll need." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 182, + "id": "c4d20ebb", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==0.1.26\n", + "!pip install crowd-kit==1.0.0\n", + "\n", + "import datetime\n", + "import requests\n", + "import os\n", + "import time\n", + "import getpass\n", + "from typing import List, Tuple\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "import toloka.client as toloka\n", + "from toloka.client import Pool, Project, TolokaClient\n", + "from toloka.client.analytics_request import CompletionPercentagePoolAnalytics\n", + "from crowdkit.aggregation import DawidSkene\n", + "\n", + "import prefect\n", + "from prefect import Flow, task\n", + "from prefect.engine.results import LocalResult\n", + "\n", + "import torch\n", + "from transformers import AutoTokenizer, AutoModelForSequenceClassification" + ] + }, + { + "cell_type": "markdown", + "id": "825a0ec9", + "metadata": {}, + "source": [ + "Set up a helper for loading the JSON configs for the project and the pool." + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "id": "16f2f13d", + "metadata": {}, + "outputs": [], + "source": [ + "GITHUB_RAW = 'https://raw.githubusercontent.com'\n", + "GITHUB_BASE_PATH = 'Toloka/toloka-kit/main/examples/9.toloka_and_ml_on_prefect/configs'\n", + "\n", + "\n", + "def _load_json_from_github(filename: str):\n", + " # Build the URL with '/' explicitly: os.path.join would insert backslashes on Windows.\n", + " response = requests.get('/'.join((GITHUB_RAW, GITHUB_BASE_PATH, filename)))\n", + " response.raise_for_status()\n", + " return response.json()" + ] + }, + { + "cell_type": "markdown", + "id": "952ceca7", + "metadata": {}, + "source": [ + "Now we can start building the project.\n", + "Prefect refers to each step as a [*task*](https://docs.prefect.io/core/about_prefect/thinking-prefectly.html#tasks). 
In a simple sense, a task is just a Python function representing a logically distinct stage of a process.\n", + "This example is split into separate tasks for project and pool creation, data preparation, and finally task processing by the autohelper and by Toloka. Most tasks receive the Toloka API token and the `env` variable, which lets each task create its own Toloka client and avoids the difficulties of sharing such an object between tasks.\n", + "Some tasks also specify that `print()` statements inside should be sent to Prefect Cloud logs, which makes debugging easier: `@task(log_stdout=True)`" + ] + }, + { + "cell_type": "markdown", + "id": "713753f8", + "metadata": {}, + "source": [ + "Let's create all the necessary blocks for our flow.\n", + "First, create a project" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "id": "1f9bed0d", + "metadata": {}, + "outputs": [], + "source": [ + "@task\n", + "def create_project(token: str, env: str) -> str:\n", + " client = TolokaClient(token, env)\n", + " project = Project.structure(_load_json_from_github('project.json'))\n", + " project = client.create_project(project)\n", + " return project.id" + ] + }, + { + "cell_type": "markdown", + "id": "59f02e42", + "metadata": {}, + "source": [ + "Create a pool with a skill check." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 184, + "id": "92698bc2", + "metadata": {}, + "outputs": [], + "source": [ + "@task\n", + "def create_skill(token: str, env: str, name='sentiment-analysis') -> str:\n", + " client = TolokaClient(token, env)\n", + " skill = next(client.get_skills(name=name), None) or client.create_skill(name=name)\n", + " return skill.id\n", + "\n", + "@task\n", + "def create_pool(token: str, env: str, project_id: str, skill_id: str, reward: float) -> str:\n", + " client = TolokaClient(token, env)\n", + " pool = Pool.structure(_load_json_from_github('pool.json'))\n", + " pool.project_id = project_id\n", + " skill_filter = (toloka.filter.Skill(skill_id) == None) | (toloka.filter.Skill(skill_id) >= 90)\n", + " pool.set_filter(pool.filter & skill_filter)\n", + " pool.quality_control.add_action(\n", + " collector=toloka.collectors.GoldenSet(history_size=10),\n", + " conditions=[toloka.conditions.TotalAnswersCount > 4],\n", + " action=toloka.actions.SetSkillFromOutputField(skill_id=skill_id, from_field='correct_answers_rate')\n", + " )\n", + " pool.reward_per_assignment = reward\n", + " pool.will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=7)\n", + " pool = client.create_pool(pool)\n", + " return pool.id" + ] + }, + { + "cell_type": "markdown", + "id": "98b72981", + "metadata": {}, + "source": [ + "We will use [Grammar and Online Product Reviews](https://data.world/datafiniti/grammar-and-online-product-reviews) dataset under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license\n", + "\n", + "\n", + "[![CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)" + ] + }, + { + "cell_type": "markdown", + "id": "19c74391", + "metadata": {}, + "source": [ + "Download the necessary data and separate it into golden and non-golden tasks.\n", + "We'll use `cnt_tasks` of regular tasks and `cnt_golden` of golden 
tasks" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "id": "439e9e07", + "metadata": {}, + "outputs": [], + "source": [ + "@task(nout=2, log_stdout=True)\n", + "def prepare_dataset(\n", + " dataset_url: str,\n", + " cnt_tasks: int,\n", + " cnt_golden: int,\n", + ") -> Tuple[pd.DataFrame, pd.DataFrame]:\n", + " dataset = pd.read_csv(dataset_url)\n", + " print(f'Initial dataset size: {len(dataset)}')\n", + "\n", + " dataset = dataset[['reviews.text', 'reviews.doRecommend']].dropna().reset_index(drop=True)\n", + " dataset = dataset.replace({'reviews.doRecommend': {True: 'pos', False: 'neg'}})\n", + "\n", + " positive_tasks = dataset[dataset['reviews.doRecommend'] == 'pos']\n", + " negative_tasks = dataset[dataset['reviews.doRecommend'] == 'neg']\n", + " print(f'positive count: {len(positive_tasks)}. negative count: {len(negative_tasks)}')\n", + "\n", + " slice_tasks = cnt_tasks // 2\n", + " slice_golden = slice_tasks + cnt_golden // 2\n", + " pos_task_dataset, pos_golden_dataset, _ = np.split(positive_tasks, [slice_tasks, slice_golden])\n", + " neg_task_dataset, neg_golden_dataset, _ = np.split(negative_tasks, [slice_tasks, slice_golden])\n", + "\n", + " task_dataset = pd.concat([pos_task_dataset, neg_task_dataset])\n", + " golden_dataset = pd.concat([pos_golden_dataset, neg_golden_dataset])\n", + "\n", + " return task_dataset, golden_dataset" + ] + }, + { + "cell_type": "markdown", + "id": "5ae1c08d", + "metadata": {}, + "source": [ + "Create a function for getting the ML model and tokenizer, which will serve as an autohelper in our project. 
We'll use the readily-available models from [Hugging Face](https://huggingface.co/), namely [finetuned DistilBERT](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)" + ] + }, + { + "cell_type": "code", + "execution_count": 105, + "id": "2a154b1d", + "metadata": {}, + "outputs": [], + "source": [ + "def _get_resources(model_name: str) -> Tuple[AutoModelForSequenceClassification, AutoTokenizer]:\n", + " model = AutoModelForSequenceClassification.from_pretrained(model_name)\n", + " tokenizer = AutoTokenizer.from_pretrained(model_name)\n", + " return model, tokenizer" + ] + }, + { + "cell_type": "markdown", + "id": "1f240a6f", + "metadata": {}, + "source": [ + "Set up a function to get autohelper predictions.\n", + "We'll use confidence scores later on to decide whether to trust the autohelper answer or to send the task to Toloka" + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "id": "29032ed8", + "metadata": {}, + "outputs": [], + "source": [ + "# batch should be an array of reviews\n", + "def _make_predictions(batch: List[str], model: AutoModelForSequenceClassification, tokenizer: AutoTokenizer):\n", + " batch = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')\n", + " print('start apply...')\n", + " outputs = model(**batch)\n", + " print('apply done')\n", + " predictions = torch.nn.functional.softmax(outputs.logits, dim=-1).detach().numpy()\n", + " # predictions are an array of pairs containing confidence scores for classes 0 and 1\n", + " # (in this order)\n", + " # therefore, (predictions[idx,1] > 0.5) is True if the model thinks\n", + " # that element idx is in the class 'pos'\n", + " labels = np.vectorize(lambda flag: 'pos' if flag else 'neg')(predictions[:,1] > 0.5)\n", + " confidence = predictions.max(axis=1)\n", + " return labels, confidence" + ] + }, + { + "cell_type": "markdown", + "id": "b5a3f515", + "metadata": {}, + "source": [ + "Create a function to decide, which tasks got adequate answers from 
autohelper (`accepted_tasks`) and which should be sent to Toloka (`manual_tasks`).\n", + "We assume that if a model performs below a certain confidence threshold, then the task should be given to Toloka users. For simplicity, the threshold is calculated as the 90th percentile of confidence scores on answers where the autohelper was wrong (this is a robust enough estimate for our case). To find it, we can use golden tasks for which we know the correct responses" + ] + }, + { + "cell_type": "code", + "execution_count": 120, + "id": "50ca621d", + "metadata": {}, + "outputs": [], + "source": [ + "@task(nout=2, log_stdout=True)\n", + "def apply(\n", + " model_name: str,\n", + " task_dataset: pd.DataFrame,\n", + " golden_dataset: pd.DataFrame,\n", + ") -> Tuple[pd.DataFrame, pd.DataFrame]:\n", + " model, tokenizer = _get_resources(model_name)\n", + " print(f'Model loaded: {model_name}')\n", + "\n", + " # find the threshold using golden tasks:\n", + " golden_items = list(golden_dataset.iloc[:,0].values)\n", + " autohelper_golden_labels, golden_confidence = _make_predictions(golden_items, model, tokenizer)\n", + " # extract true answers\n", + " true_golden_labels = golden_dataset.iloc[:,1]\n", + " # find wrong answers\n", + " wrong_answers_mask = true_golden_labels != autohelper_golden_labels\n", + " # set threshold to 90th percentile of confidence scores when the model was wrong\n", + " # or if the model got all answers right, then set it to 0.95\n", + " if wrong_answers_mask.any():\n", + " threshold = np.percentile(golden_confidence[wrong_answers_mask], 90)\n", + " else:\n", + " threshold = 0.95\n", + "\n", + " # find elements where we think the model is likely to predict the right answer\n", + " nongolden_items = list(task_dataset.iloc[:,0].values)\n", + " nongolden_labels, nongolden_confidence = _make_predictions(nongolden_items, model, tokenizer)\n", + " accepted_solutions_mask = nongolden_confidence > threshold\n", + "\n", + " # make a dataframe from the answers 
which we accepted\n", + " accepted_tasks = pd.DataFrame({\n", + " 'review': task_dataset[accepted_solutions_mask]['reviews.text'],\n", + " 'sentiment': nongolden_labels[accepted_solutions_mask]\n", + " })\n", + " manual_tasks = task_dataset[~accepted_solutions_mask]\n", + "\n", + " print(f'accepted_tasks count: {len(accepted_tasks)}')\n", + " print(f'manual_tasks count: {len(manual_tasks)}')\n", + "\n", + " return accepted_tasks, manual_tasks" + ] + }, + { + "cell_type": "markdown", + "id": "9aff1b83", + "metadata": {}, + "source": [ + "Send golden and manual non-golden tasks to toloka" + ] + }, + { + "cell_type": "code", + "execution_count": 108, + "id": "8db1d70e", + "metadata": {}, + "outputs": [], + "source": [ + "@task\n", + "def send_to_toloka(\n", + " token: str,\n", + " env: str,\n", + " pool_id: str,\n", + " golden_dataset: pd.DataFrame,\n", + " manual_tasks: pd.DataFrame,\n", + ") -> None:\n", + " client = TolokaClient(token, env)\n", + "\n", + " golden_tasks = [\n", + " toloka.Task(\n", + " pool_id=pool_id,\n", + " input_values={'review': row['reviews.text']},\n", + " known_solutions = [{'output_values': {'sentiment': row['reviews.doRecommend']}}],\n", + " infinite_overlap=True,\n", + " )\n", + " for _, row in golden_dataset.iterrows()\n", + " ]\n", + "\n", + " tasks = [\n", + " toloka.Task(pool_id=pool_id, input_values={'review': review})\n", + " for review in manual_tasks['reviews.text']\n", + " ]\n", + "\n", + " client.create_tasks(golden_tasks + tasks, allow_defaults=True, open_pool=True)" + ] + }, + { + "cell_type": "markdown", + "id": "783ed081", + "metadata": {}, + "source": [ + "Create a process to await pool's completion" + ] + }, + { + "cell_type": "code", + "execution_count": 109, + "id": "2074e39a", + "metadata": {}, + "outputs": [], + "source": [ + "@task(log_stdout=True)\n", + "def wait_pool_for_close(token: str, env: str, pool_id: str) -> None:\n", + " client = TolokaClient(token, env)\n", + "\n", + " while True:\n", + " pool = 
client.get_pool(pool_id)\n", + " if pool.is_closed():\n", + " print(f'Pool {pool_id} is closed.')\n", + " return\n", + " op = client.get_analytics([CompletionPercentagePoolAnalytics(subject_id=pool_id)])\n", + " percentage = client.wait_operation(op).details['value'][0]['result']['value']\n", + " print(f' {datetime.datetime.now().strftime(\"%H:%M:%S\")}\\t'\n", + " f'Pool {pool_id} - {percentage}%')\n", + " time.sleep(60)" + ] + }, + { + "cell_type": "markdown", + "id": "75501352", + "metadata": {}, + "source": [ + "Create a task for processing Toloka responses and combining them with the autohelper's answers" + ] + }, + { + "cell_type": "markdown", + "id": "92aa5553", + "metadata": {}, + "source": [ + "We'll run aggregation using the Dawid-Skene model.\n", + "\n", + "We use this aggregation model because our questions are of the same difficulty, and we don't have many control tasks.\n", + "\n", + "Read more about the Dawid-Skene model in the Requester’s Guide or get an overview of different aggregation models in our Knowledge Base.\n" + ] + }, + { + "cell_type": "markdown", + "id": "1753d094", + "metadata": {}, + "source": [ + "In order to save the data, we'll use Prefect's [output persistence option](https://docs.prefect.io/core/concepts/persistence.html#persisting-output), setting the task's `checkpoint` flag to `True` and specifying the location where the pickled version of our information will be stored using Prefect's `LocalResult` (`dir` is the directory for the result, `location` is the file's name, so this file's relative path will be `./prefect_results/sentiments`)" + ] + }, + { + "cell_type": "code", + "execution_count": 183, + "id": "26ba115a", + "metadata": {}, + "outputs": [], + "source": [ + "@task(\n", + " log_stdout=True,\n", + " checkpoint=True,\n", + " result=LocalResult(dir=\"./prefect_results\", location='sentiments')\n", + ")\n", + "def collect_results(\n", + " token: str,\n", + " env: str,\n", + " pool_id: str,\n", + " autohelper_results: 
pd.DataFrame\n", + ") -> pd.DataFrame:\n", + " client = TolokaClient(token, env)\n", + "\n", + " toloka_answers_df = client.get_assignments_df(pool_id)\n", + " # Drop golden tasks\n", + " toloka_answers_df = toloka_answers_df[toloka_answers_df['GOLDEN:sentiment'].isna()]\n", + " # Prepare DataFrame for aggregation\n", + " toloka_answers_df = toloka_answers_df.rename(columns={\n", + " 'INPUT:review': 'task',\n", + " 'OUTPUT:sentiment': 'label',\n", + " 'ASSIGNMENT:worker_id': 'worker'\n", + " })\n", + "\n", + " print(f'Toloka answers count: {len(toloka_answers_df)}')\n", + "\n", + " toloka_predicted_answers = DawidSkene(n_iter=20).fit_predict(toloka_answers_df)\n", + " toloka_results = pd.DataFrame({\n", + " 'review': toloka_predicted_answers.index,\n", + " 'sentiment': toloka_predicted_answers.values\n", + " })\n", + "\n", + " answers = pd.concat([autohelper_results, toloka_results])\n", + "\n", + " return answers" + ] + }, + { + "cell_type": "markdown", + "id": "29bbec17", + "metadata": {}, + "source": [ + "Now we can finally set up our flow. 
We'll use Prefect's [Parameters](https://docs.prefect.io/core/concepts/parameters.html) to securely send the Toloka API token to the flow and to choose the environment (`SANDBOX` or `PRODUCTION`)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b16473a", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "with Flow('ML assisted pipeline example') as flow:\n", + " # Toloka API token\n", + " token = prefect.Parameter('token')\n", + " # project environment\n", + " env = prefect.Parameter('env')\n", + "\n", + " DATASET_URL = 'https://tlk.s3.yandex.net/ext_dataset/datafiniti_grammar_and_online_product_reviews.csv'\n", + " model_name = 'distilbert-base-uncased-finetuned-sst-2-english'\n", + "\n", + " project_id = create_project(token, env)\n", + " skill_id = create_skill(token, env)\n", + " pool_id = create_pool(token, env, project_id, skill_id, reward=0.03)\n", + "\n", + " task_dataset, golden_dataset = prepare_dataset(DATASET_URL, cnt_tasks=200, cnt_golden=20)\n", + " accepted_tasks, manual_tasks = apply(model_name, task_dataset, golden_dataset)\n", + "\n", + " sent = send_to_toloka(token, env, pool_id, golden_dataset, manual_tasks)\n", + " pooling = wait_pool_for_close(token, env, pool_id).set_upstream(sent)\n", + "\n", + " collect_results(token, env, pool_id, accepted_tasks).set_upstream(pooling)\n", + "\n", + "# register the flow with the project we created at the beginning\n", + "flow.register(project_name=PROJECT_NAME)" + ] + }, + { + "cell_type": "markdown", + "id": "11cc48ef", + "metadata": {}, + "source": [ + "Go to the link in the last cell's output, leading to the Prefect Cloud UI\n", + "\"Tabs\"" + ] + }, + { + "cell_type": "markdown", + "id": "6b5b703e", + "metadata": {}, + "source": [ + "Click on the *SETTINGS* tab and turn *Heartbeat* off. 
Tasks send *heartbeats* at regular intervals while they're making progress, which is Prefect's way of protecting against zombie tasks (more info [here](https://docs.prefect.io/orchestration/concepts/services.html#zombie-killer)). But in our case, Toloka users may be slow and may not submit their answers before Prefect decides that the *pooling* task is a zombie\n", + "\"Heartbeat\"" + ] + }, + { + "cell_type": "markdown", + "id": "5817b418", + "metadata": {}, + "source": [ + "Next, navigate to the *RUN* tab, input the *env* and *token* and click on run at the bottom of the page\n", + "\"Parameters\"" + ] + }, + { + "cell_type": "markdown", + "id": "fef3a1e8", + "metadata": {}, + "source": [ + "You can also use the Cloud UI to inspect the flow's progress, see its structure (the *SCHEMATIC* tab), view the logs (choose the necessary *run* in the *RUNS* tab and select the *LOGS* tab) and many other things." + ] + }, + { + "cell_type": "markdown", + "id": "672789ee", + "metadata": {}, + "source": [ + "## Viewing the results" + ] + }, + { + "cell_type": "markdown", + "id": "de20e044", + "metadata": {}, + "source": [ + "Let's unpickle the data we've saved and view it (by default, Prefect uses `cloudpickle` for data serialization)" + ] + }, + { + "cell_type": "code", + "execution_count": 188, + "id": "c290a23b", + "metadata": {}, + "outputs": [], + "source": [ + "import cloudpickle\n", + "\n", + "FILEPATH = './prefect_results/sentiments'\n", + "with open(FILEPATH, 'rb') as file:\n", + " results = cloudpickle.loads(file.read())" + ] + }, + { + "cell_type": "code", + "execution_count": 197, + "id": "476cb69e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " review sentiment\n", + "24 Great product! Exactly what it says works very... pos\n", + "49 This cream did not do much for my face or thro... neg\n", + "6 Got as a surprise for my husband there is noth... neg\n", + "53 I've been using this product for years and it ... neg\n", + "13 I bought this to try to spice things up, but I... neg\n", + "9 Bought this to enhance our time a bit, did abs... pos\n", + "22 Exceptional product, this is smooth, not slimy... pos\n", + "27 You will LOVE this lotion. I smile every time ... pos\n", + "14 I bought this because it had better reviews th... neg\n", + "52 I am so disappointed! I have used this product... neg" + ] + }, + "execution_count": 197, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results.sample(10)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/9.toloka_and_ml_on_prefect/images/heartbeat.png b/examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/images/heartbeat.png similarity index 100% rename from examples/9.toloka_and_ml_on_prefect/images/heartbeat.png rename to examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/images/heartbeat.png diff --git a/examples/9.toloka_and_ml_on_prefect/images/parameters.png b/examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/images/parameters.png similarity index 100% rename from examples/9.toloka_and_ml_on_prefect/images/parameters.png rename to examples/2. 
cases-for-applied-tasks/4.toloka_and_ml_on_prefect/images/parameters.png diff --git a/examples/9.toloka_and_ml_on_prefect/images/tabs.png b/examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/images/tabs.png similarity index 100% rename from examples/9.toloka_and_ml_on_prefect/images/tabs.png rename to examples/2. cases-for-applied-tasks/4.toloka_and_ml_on_prefect/images/tabs.png diff --git a/examples/9.toloka_and_ml_on_prefect/configs/pool.json b/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/configs/pool.json similarity index 100% rename from examples/9.toloka_and_ml_on_prefect/configs/pool.json rename to examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/configs/pool.json diff --git a/examples/9.toloka_and_ml_on_prefect/configs/project.json b/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/configs/project.json similarity index 100% rename from examples/9.toloka_and_ml_on_prefect/configs/project.json rename to examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/configs/project.json diff --git a/examples/9.toloka_and_ml_on_prefect/example.ipynb b/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/example.ipynb similarity index 100% rename from examples/9.toloka_and_ml_on_prefect/example.ipynb rename to examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/example.ipynb diff --git a/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/images/heartbeat.png b/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/images/heartbeat.png new file mode 100644 index 00000000..3ae23f5b Binary files /dev/null and b/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/images/heartbeat.png differ diff --git a/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/images/parameters.png b/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/images/parameters.png new file mode 100644 index 00000000..b762259b Binary files /dev/null and b/examples/2. 
cases-for-applied-tasks/5.toloka_and_ml_on_prefect/images/parameters.png differ diff --git a/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/images/tabs.png b/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/images/tabs.png new file mode 100644 index 00000000..0a836fcc Binary files /dev/null and b/examples/2. cases-for-applied-tasks/5.toloka_and_ml_on_prefect/images/tabs.png differ diff --git a/examples/3. benchmarks/.DS_Store b/examples/3. benchmarks/.DS_Store new file mode 100644 index 00000000..05eed74c Binary files /dev/null and b/examples/3. benchmarks/.DS_Store differ diff --git a/examples/3. benchmarks/benchmarks/image_classification_cinic10.ipynb b/examples/3. benchmarks/benchmarks/image_classification_cinic10.ipynb new file mode 100644 index 00000000..2e0787dd --- /dev/null +++ b/examples/3. benchmarks/benchmarks/image_classification_cinic10.ipynb @@ -0,0 +1,2393 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c41e70d2", + "metadata": {}, + "source": [ + "# Image classification on CINIC-10" + ] + }, + { + "cell_type": "markdown", + "id": "64318eff", + "metadata": {}, + "source": [ + "Dataset source: [https://github.com/BayesWatch/cinic-10](https://github.com/BayesWatch/cinic-10)\n", + "\n", + "License: [MIT](https://github.com/BayesWatch/cinic-10/blob/master/LICENSE)" + ] + }, + { + "cell_type": "markdown", + "id": "e87ed5a1", + "metadata": {}, + "source": [ + "### Call to action\n", + "If you found some bugs or have a new feature idea, don't hesitate to [open a new issue on Github](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", + "Like our library and examples? 
Star [our repo on Github](https://github.com/Toloka/toloka-kit)" + ] + }, + { + "cell_type": "markdown", + "id": "0e13cd62", + "metadata": {}, + "source": [ + "## Install dependencies and import" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9fe66427", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==0.1.26\n", + "!pip install crowd-kit==1.0.0\n", + "!pip install ipyplot # display images" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "0168a13b", + "metadata": {}, + "outputs": [], + "source": [ + "import datetime\n", + "import time\n", + "import pandas as pd\n", + "import numpy as np\n", + "import ipyplot\n", + "from sklearn.metrics import balanced_accuracy_score\n", + "import os\n", + "import logging\n", + "import sys\n", + "import getpass\n", + "\n", + "import toloka.client as toloka\n", + "import toloka.client.project.template_builder as tb\n", + "\n", + "from crowdkit.aggregation import DawidSkene\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "bdade39f", + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(\n", + " format='[%(levelname)s] %(name)s: %(message)s',\n", + " level=logging.INFO,\n", + " stream=sys.stdout,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "7ce0ec8c", + "metadata": {}, + "source": [ + "# Load the dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "dc24cf83", + "metadata": {}, + "outputs": [], + "source": [ + "N_ROWS = 1000" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "e292f1c4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " img_path label \\\n", + "0 test/ship/n03545470_11883.png ship \n", + "1 test/dog/n02111889_9460.png dog \n", + "2 test/frog/n01645776_9028.png frog \n", + "3 test/frog/n01639765_21028.png frog \n", + "4 test/dog/cifar10-train-45915.png dog \n", + "\n", + " img_url \n", + "0 https://tlk.s3.yandex.net/ext_dataset/CINIC-10... \n", + "1 https://tlk.s3.yandex.net/ext_dataset/CINIC-10... \n", + "2 https://tlk.s3.yandex.net/ext_dataset/CINIC-10... \n", + "3 https://tlk.s3.yandex.net/ext_dataset/CINIC-10... \n", + "4 https://tlk.s3.yandex.net/ext_dataset/CINIC-10... " + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def sample_stratified(df, label_column, n_rows):\n", + " \"\"\"Function to sample n_rows from a dataframe while preserving class distribution\"\"\"\n", + " return df.groupby(label_column, group_keys=False) \\\n", + " .apply(lambda x: x.sample(int(np.rint(n_rows*len(x)/len(df))))) \\\n", + " .sample(frac=1).reset_index(drop=True)\n", + "\n", + "base_url = 'https://tlk.s3.yandex.net/ext_dataset/CINIC-10'\n", + "df = pd.read_csv(os.path.join(base_url, 'test.csv'))\n", + "df['img_url'] = df.img_path.apply(lambda p: os.path.join(base_url, p))\n", + "\n", + "df = sample_stratified(df, 'label', n_rows=N_ROWS)\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "955ef75c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n", + " \n", + "
\n", + "[ipyplot class representations: one sample image per class (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)]\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "ipyplot.plot_class_representations(images=df.img_url, labels=df.label, img_width=70)" + ] + }, + { + "cell_type": "markdown", + "id": "361965f3", + "metadata": {}, + "source": [ + "# Setup the project" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "8799a3a7", + "metadata": {}, + "outputs": [], + "source": [ + "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'" + ] + }, + { + "cell_type": "markdown", + "id": "359ed525", + "metadata": {}, + "source": [ + "## Create project" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "8675ce81", + "metadata": {}, + "outputs": [], + "source": [ + "project = toloka.Project(\n", + " public_name='Small images classification',\n", + " public_description='Classify small images into 10 categories',\n", + " private_comment='OOTB: CINIC-10'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "f581f3d3", + "metadata": {}, + "outputs": [], + "source": [ + "input_specification = {'image': toloka.project.UrlSpec()}\n", + "output_specification = {'result': toloka.project.StringSpec()}" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "604d3d97", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10" + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "CINIC_LABELS = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "len(CINIC_LABELS)" + ] + }, + { + "cell_type": "markdown", + "id": "e98ba32c", + "metadata": {}, + "source": [ + "## Annotator interface" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "8f1172a4", + "metadata": {}, + "outputs": [], + "source": [ + "image_viewer = tb.ImageViewV1(tb.InputData('image'), \n", + " 
ratio=[1, 1],\n", + " popup=False,\n", + " )\n", + "\n", + "label_buttons = [tb.GroupFieldOption(l, l.capitalize()) for l in CINIC_LABELS]\n", + "radio_group_field = tb.ButtonRadioGroupFieldV1(\n", + " tb.OutputData('result'),\n", + " label_buttons,\n", + " validation=tb.RequiredConditionV1(),\n", + ")\n", + "\n", + "task_width_plugin = tb.TolokaPluginV1(\n", + " 'scroll',\n", + " task_width=300,\n", + ")\n", + "\n", + "hot_keys_plugin = tb.HotkeysPluginV1(\n", + " key_1=tb.SetActionV1(tb.OutputData('result'), 'airplane'),\n", + " key_2=tb.SetActionV1(tb.OutputData('result'), 'automobile'),\n", + " key_3=tb.SetActionV1(tb.OutputData('result'), 'bird'),\n", + " key_4=tb.SetActionV1(tb.OutputData('result'), 'cat'),\n", + " key_5=tb.SetActionV1(tb.OutputData('result'), 'deer'),\n", + " key_6=tb.SetActionV1(tb.OutputData('result'), 'dog'),\n", + " key_7=tb.SetActionV1(tb.OutputData('result'), 'frog'),\n", + " key_8=tb.SetActionV1(tb.OutputData('result'), 'horse'),\n", + " key_9=tb.SetActionV1(tb.OutputData('result'), 'ship'),\n", + " key_0=tb.SetActionV1(tb.OutputData('result'), 'truck'),\n", + ")\n", + "\n", + "project_interface = toloka.project.TemplateBuilderViewSpec(\n", + " config=tb.TemplateBuilder(\n", + " view=tb.ListViewV1([image_viewer, radio_group_field]),\n", + " plugins=[task_width_plugin, hot_keys_plugin],\n", + " )\n", + ")\n", + "\n", + "project.task_spec = toloka.project.task_spec.TaskSpec(\n", + " input_spec=input_specification,\n", + " output_spec=output_specification,\n", + " view_spec=project_interface,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "8c18b516", + "metadata": {}, + "outputs": [], + "source": [ + "project.public_instructions = \"\"\"\n", + "In this task, you will see images from 10 different classes.
\n", + "Your task is to classify these images.
\n", + "\n", + "Some images are blurry and hard to label. That's the nature of the task, so just assign whatever label seems most appropriate.\n", + "\n", + "How to complete the task:\n", + "
    \n", + "
  • Look at the picture.
  • \n", + "
  • Click on the image to resize it. You can rotate the image if it's in the wrong orientation.
  • \n", + "
  • Choose one of the possible answers. If the picture is unavailable or you run into any other technical difficulty, please write to us about it.
  • \n", + "
  • If you think that you cannot classify the image correctly, choose the label that seems most appropriate to you.
  • \n", + "
  • You can use keyboard shortcuts (number keys 1 to 9, then 0) to pick labels.
  • \n", + "
\n", + "\"\"\".strip()" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "8e09cde3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.client: A new project with ID \"60772\" has been created. Link to open in web interface: https://toloka.dev/requester/project/60772\n" + ] + } + ], + "source": [ + "project = toloka_client.create_project(project)" + ] + }, + { + "cell_type": "markdown", + "id": "9231cc41", + "metadata": {}, + "source": [ + "## Create training tasks" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "846c1478", + "metadata": {}, + "outputs": [], + "source": [ + "training_pool = toloka.training.Training(project_id=project.id,\n", + " private_name='Training pool', \n", + " training_tasks_in_task_suite_count=10, \n", + " task_suites_required_to_pass=1, \n", + " may_contain_adult_content=False,\n", + " inherited_instructions=True,\n", + " assignment_max_duration_seconds=60*5,\n", + " retry_training_after_days=1,\n", + " mix_tasks_in_creation_order=True,\n", + " shuffle_tasks_in_task_suite=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "f402f17f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.client: A new training with ID \"28225833\" has been created. 
Link to open in web interface: https://toloka.dev/requester/project/60772/training/28225833\n" + ] + } + ], + "source": [ + "training_pool = toloka_client.create_training(training_pool)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f52f3896", + "metadata": {}, + "outputs": [], + "source": [ + "label_examples = {label: df[df.label == label].head(1).img_url.item() for label in CINIC_LABELS}\n", + "tasks = [\n", + " toloka.Task(input_values={'image': url}, \n", + " known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'result': label})], \n", + " message_on_unknown_solution=f'Incorrect label! The actual label is: {label}',\n", + " infinite_overlap=True,\n", + " pool_id=training_pool.id)\n", + " for label, url in label_examples.items()\n", + "]\n", + "toloka_client.create_tasks(tasks, allow_defaults=True)" + ] + }, + { + "cell_type": "markdown", + "id": "cae34066", + "metadata": {}, + "source": [ + "## Create task Pool" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "fad033b4", + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka.Pool(\n", + " project_id=project.id,\n", + " private_name='Pool', \n", + " may_contain_adult_content=False,\n", + " reward_per_assignment=0.01, \n", + " assignment_max_duration_seconds=60*5, \n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),\n", + ")\n", + "pool.defaults = toloka.pool.Pool.Defaults(\n", + " default_overlap_for_new_tasks=5,\n", + " default_overlap_for_new_task_suites=0,\n", + ")\n", + "pool.set_mixer_config(\n", + " real_tasks_count=10,\n", + ")\n", + "pool.filter = toloka.filter.Languages.in_('EN')" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "ab3786fd", + "metadata": {}, + "outputs": [], + "source": [ + "pool.quality_control.training_requirement = toloka.quality_control.QualityControl.TrainingRequirement(\n", + " training_pool_id=training_pool.id, \n", + " training_passing_skill_value=30,\n", + ") 
\n", + "\n", + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.MajorityVote(\n", + " answer_threshold=4,\n", + " history_size=5,\n", + " ),\n", + " conditions=[\n", + " toloka.conditions.TotalAnswersCount >= 5,\n", + " toloka.conditions.IncorrectAnswersRate > 30,\n", + " ],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='PROJECT',\n", + " duration=1,\n", + " duration_unit='DAYS',\n", + " private_comment='Wrong on over 30% cases',\n", + " ), \n", + ")\n", + "\n", + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=15),\n", + " conditions=[\n", + " toloka.conditions.TotalSubmittedCount >= 5,\n", + " toloka.conditions.FastSubmittedCount >= 3],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='PROJECT',\n", + " duration=1,\n", + " duration_unit='DAYS',\n", + " private_comment='Answering too fast',\n", + " ), \n", + ")\n", + "\n", + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.SkippedInRowAssignments(),\n", + " conditions=[toloka.conditions.SkippedInRowCount >= 3],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope=toloka.user_restriction.UserRestriction.PROJECT,\n", + " duration=1,\n", + " duration_unit='DAYS',\n", + " private_comment='Lazy performer',\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "aa98caf9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.client: A new pool with ID \"28225839\" has been created. 
Link to open in web interface: https://toloka.dev/requester/project/60772/pool/28225839\n" + ] + } + ], + "source": [ + "pool = toloka_client.create_pool(pool)" + ] + }, + { + "cell_type": "markdown", + "id": "a865d420", + "metadata": {}, + "source": [ + "## Create tasks from dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "89dcaf9c", + "metadata": {}, + "outputs": [], + "source": [ + "tasks = [\n", + " toloka.Task(input_values={'image': url}, pool_id=pool.id)\n", + " for url in df.img_url\n", + "]\n", + "toloka_client.create_tasks(tasks, allow_defaults=True)" + ] + }, + { + "cell_type": "markdown", + "id": "c37b9b7d", + "metadata": {}, + "source": [ + "# Start annotation" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "268f4964", + "metadata": {}, + "outputs": [], + "source": [ + "training_pool = toloka_client.open_pool(training_pool.id)\n", + "pool = toloka_client.open_pool(pool.id)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "928f7ace", + "metadata": {}, + "outputs": [], + "source": [ + "pool_id = pool.id\n", + "\n", + "def wait_pool_for_close(pool_id, minutes_to_wait=0.5):\n", + " sleep_time = 60 * minutes_to_wait\n", + " pool = toloka_client.get_pool(pool_id)\n", + " while not pool.is_closed():\n", + " op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])\n", + " op = toloka_client.wait_operation(op)\n", + " percentage = op.details['value'][0]['result']['value']\n", + " print(\n", + " f' {datetime.datetime.now().strftime(\"%H:%M:%S\")}\\t'\n", + " f'Pool {pool.id} - {percentage}%'\n", + " )\n", + " time.sleep(sleep_time)\n", + " pool = toloka_client.get_pool(pool.id)\n", + " print('Pool was closed.')\n", + "\n", + "wait_pool_for_close(pool_id)" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "77d2d802", + "metadata": {}, + "outputs": [], + "source": [ + "training_pool = 
toloka_client.close_pool(training_pool.id)" + ] + }, + { + "cell_type": "markdown", + "id": "2efd83af", + "metadata": {}, + "source": [ + "# Extract results" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "81e5344e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[WARNING] toloka.client: Experimental method\n" + ] + } + ], + "source": [ + "answers_df = toloka_client.get_assignments_df(pool_id)\n", + "answers_df = answers_df.rename(columns={\n", + " 'INPUT:image': 'task',\n", + " 'OUTPUT:result': 'label',\n", + " 'ASSIGNMENT:worker_id': 'worker',\n", + "})" + ] + }, + { + "cell_type": "markdown", + "id": "8a698c75", + "metadata": {}, + "source": [ + "# Aggregate results" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "20c3bfd5", + "metadata": {}, + "outputs": [], + "source": [ + "aggregated_answers = DawidSkene(n_iter=100).fit_predict(answers_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "9e79d038", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
img_urlpred_labelimg_pathlabel
0https://tlk.s3.yandex.net/ext_dataset/CINIC-10...airplanetest/airplane/n02686121_2744.pngairplane
1https://tlk.s3.yandex.net/ext_dataset/CINIC-10...deertest/deer/cifar10-test-1720.pngdeer
2https://tlk.s3.yandex.net/ext_dataset/CINIC-10...deertest/deer/n02419796_6293.pngdeer
3https://tlk.s3.yandex.net/ext_dataset/CINIC-10...cattest/cat/n02125081_7666.pngcat
4https://tlk.s3.yandex.net/ext_dataset/CINIC-10...deertest/deer/cifar10-test-4957.pngdeer
\n", + "
" + ], + "text/plain": [ + " img_url pred_label \\\n", + "0 https://tlk.s3.yandex.net/ext_dataset/CINIC-10... airplane \n", + "1 https://tlk.s3.yandex.net/ext_dataset/CINIC-10... deer \n", + "2 https://tlk.s3.yandex.net/ext_dataset/CINIC-10... deer \n", + "3 https://tlk.s3.yandex.net/ext_dataset/CINIC-10... cat \n", + "4 https://tlk.s3.yandex.net/ext_dataset/CINIC-10... deer \n", + "\n", + " img_path label \n", + "0 test/airplane/n02686121_2744.png airplane \n", + "1 test/deer/cifar10-test-1720.png deer \n", + "2 test/deer/n02419796_6293.png deer \n", + "3 test/cat/n02125081_7666.png cat \n", + "4 test/deer/cifar10-test-4957.png deer " + ] + }, + "execution_count": 83, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aggregated_answers = aggregated_answers.reset_index()\n", + "aggregated_answers.columns = ['img_url', 'pred_label']\n", + "aggregated_answers = aggregated_answers.merge(df, on='img_url')\n", + "aggregated_answers.head()" + ] + }, + { + "cell_type": "markdown", + "id": "27eb9d65", + "metadata": {}, + "source": [ + "# View results" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "37a1942a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n", + " \n", + "
\n", + "
\n", + "
\n", + "

True: truck\n", + "Pred: truck

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/truck/cifar10-test-7955.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: frog\n", + "Pred: frog

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/frog/n01651059_5403.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: bird\n", + "Pred: bird

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/bird/n01580870_1174.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: deer\n", + "Pred: deer

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/deer/n02433546_735.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: truck\n", + "Pred: automobile

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/truck/n03930630_12390.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: truck\n", + "Pred: truck

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/truck/n03173929_11315.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: dog\n", + "Pred: cat

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/dog/n02110341_772.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: airplane\n", + "Pred: airplane

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/airplane/n03577672_2390.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: bird\n", + "Pred: bird

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/bird/n01536186_7536.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: deer\n", + "Pred: deer

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/deer/cifar10-test-2971.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "sample = aggregated_answers.sample(10)\n", + "captions = [f'True: {row.label}\\nPred: {row.pred_label}' for row in sample.itertuples()]\n", + "\n", + "ipyplot.plot_images(\n", + " images=sample.img_url.values,\n", + " labels=captions,\n", + " max_images=10,\n", + " img_width=100,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "29e3ba84", + "metadata": {}, + "source": [ + "# View mistakes" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "644a90f7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n", + " \n", + "
\n", + "
\n", + "
\n", + "

True: cat\n", + "Pred: dog

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/cat/n02129991_4072.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: deer\n", + "Pred: frog

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/deer/n02433546_10947.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: airplane\n", + "Pred: bird

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/airplane/n04012084_1706.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: dog\n", + "Pred: horse

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/dog/n02087046_6326.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: dog\n", + "Pred: cat

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/dog/n02119022_406.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: automobile\n", + "Pred: truck

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/automobile/n03670208_1021.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: cat\n", + "Pred: horse

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/cat/n02130308_10004.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: deer\n", + "Pred: bird

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/deer/n02439398_5532.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: cat\n", + "Pred: dog

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/cat/n02127808_8006.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "

True: deer\n", + "Pred: horse

\n", + "

https://tlk.s3.yandex.net/ext_dataset/CINIC-10/test/deer/n02410509_866.png

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "wrong_answers = aggregated_answers[aggregated_answers.pred_label != aggregated_answers.label]\n", + "sample = wrong_answers.sample(12)\n", + "captions = [f'True: {row.label}\\nPred: {row.pred_label}' for row in sample.itertuples()]\n", + "\n", + "ipyplot.plot_images(\n", + " images=sample.img_url.values,\n", + " labels=captions,\n", + " max_images=10,\n", + " img_width=100,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "69072b3b", + "metadata": {}, + "source": [ + "# Obtain accuracy" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "7dc1905f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Accuracy: 0.88\n", + "Error: 0.12\n" + ] + } + ], + "source": [ + "accuracy = balanced_accuracy_score(aggregated_answers.label, aggregated_answers.pred_label)\n", + "print(f'Accuracy: {accuracy:.2f}')\n", + "print(f'Error: {1-accuracy:.2f}')" + ] + } + ], + "metadata": { + "interpreter": { + "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" + }, + "jupytercloud": { + "share": { + "history": [ + { + "link": "https://a.yandex-team.ru/arc/trunk/arcadia/junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb?rev=8548703", + "path": "junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb", + "success": true, + "timestamp": 1629718035338 + }, + { + "link": "https://a.yandex-team.ru/arc/trunk/arcadia/junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb?rev=8560593", + "path": "junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb", + "success": true, + "timestamp": 1629970369047 + }, + { + "link": "https://a.yandex-team.ru/arc/trunk/arcadia/junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb?rev=8561371", + "path": "junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb", + "success": true, + "timestamp": 1629978932609 + }, + 
{ + "link": "https://a.yandex-team.ru/arc/trunk/arcadia/junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb?rev=8561829", + "path": "junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb", + "success": true, + "timestamp": 1629983215124 + } + ] + } + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/3. benchmarks/benchmarks/text_classification_imdb.ipynb b/examples/3. benchmarks/benchmarks/text_classification_imdb.ipynb new file mode 100644 index 00000000..0f1fd68c --- /dev/null +++ b/examples/3. benchmarks/benchmarks/text_classification_imdb.ipynb @@ -0,0 +1,1258 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "5ac1f909", + "metadata": {}, + "source": [ + "# Text classification on IMDB movie reviews\n", + "\n", + "Dataset source: https://ai.stanford.edu/~amaas/data/sentiment/\n", + "\n", + "Paper: https://aclanthology.org/P11-1015" + ] + }, + { + "cell_type": "markdown", + "id": "3d6ece80", + "metadata": {}, + "source": [ + "### Call to action\n", + "If you find a bug or have an idea for a new feature, don't hesitate to [open a new issue on Github](https://github.com/Toloka/toloka-kit/issues/new/choose).\n", + "Like our library and examples? 
Star [our repo on Github](https://github.com/Toloka/toloka-kit)" + ] + }, + { + "cell_type": "markdown", + "id": "33c32263", + "metadata": {}, + "source": [ + "## Install dependencies and import" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "146b9955", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==0.1.26\n", + "!pip install crowd-kit==1.0.0" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "2c4b2a73", + "metadata": {}, + "outputs": [], + "source": [ + "import datetime\n", + "import time\n", + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.metrics import balanced_accuracy_score\n", + "import os\n", + "import logging\n", + "import sys\n", + "import getpass\n", + "import requests\n", + "from tqdm.auto import tqdm\n", + "\n", + "import toloka.client as toloka\n", + "import toloka.client.project.template_builder as tb\n", + "\n", + "from crowdkit.aggregation import DawidSkene\n", + "%matplotlib inline\n", + "pd.options.display.max_colwidth = 300" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "bcc0b1e7", + "metadata": {}, + "outputs": [], + "source": [ + "logging.basicConfig(\n", + " format='[%(levelname)s] %(name)s: %(message)s',\n", + " level=logging.INFO,\n", + " stream=sys.stdout,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "91af6166", + "metadata": {}, + "source": [ + "# Load dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "02e50d7b", + "metadata": {}, + "outputs": [], + "source": [ + "N_ROWS = 1000" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "20015b63", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
pathlabel
0test/neg/3571_3.txtneg
1test/neg/4311_2.txtneg
2test/pos/8424_9.txtpos
3test/neg/9456_4.txtneg
4test/neg/3853_1.txtneg
\n", + "
" + ], + "text/plain": [ + " path label\n", + "0 test/neg/3571_3.txt neg\n", + "1 test/neg/4311_2.txt neg\n", + "2 test/pos/8424_9.txt pos\n", + "3 test/neg/9456_4.txt neg\n", + "4 test/neg/3853_1.txt neg" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def sample_stratified(df, label_column, n_rows):\n", + " \"\"\"Sample n_rows rows from a dataframe while preserving the class distribution\"\"\"\n", + " return df.groupby(label_column, group_keys=False) \\\n", + " .apply(lambda x: x.sample(int(np.rint(n_rows*len(x)/len(df))))) \\\n", + " .sample(frac=1)\n", + "\n", + "base_url = 'https://tlk.s3.yandex.net/ext_dataset/aclImdb'\n", + "df = pd.read_csv(base_url + '/test.csv')\n", + "df_control = sample_stratified(df, 'label', n_rows=100)\n", + "df = df.drop(df_control.index)\n", + "df_training = sample_stratified(df, 'label', n_rows=10)\n", + "df = df.drop(df_training.index)\n", + "df = sample_stratified(df, 'label', n_rows=N_ROWS)\n", + "df_control = df_control.reset_index(drop=True)\n", + "df = df.reset_index(drop=True)\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "e66c8ab1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "neg 500\n", + "pos 500\n", + "Name: label, dtype: int64" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.label.value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "16412480", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 100/100 [00:11<00:00, 8.87it/s]\n", + "100%|██████████| 10/10 [00:01<00:00, 7.74it/s]\n", + "100%|██████████| 1000/1000 [02:04<00:00, 8.04it/s]\n" + ] + } + ], + "source": [ + "def load_texts(urls):\n", + " texts = []\n", + " for url in tqdm(urls):\n", + " resp = requests.get(url)\n", + " texts.append(resp.text)\n", + " 
return texts\n", + "\n", + "df_control['text'] = load_texts(base_url + '/' + df_control.path)\n", + "df_training['text'] = load_texts(base_url + '/' + df_training.path)\n", + "df['text'] = load_texts(base_url + '/' + df.path)" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "a7a2bb94", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
pathlabeltext
0test/neg/3571_3.txtnegUpon writing this review I have difficulty trying to think of what to write about. Nothing much happens in this film. The storyline is a South Asian woman who falls for an English Londoner. The problem is he and his friends have had a racist streak. At the same time her friend at work is unknowi...
1test/neg/4311_2.txtnegWhat a clunker!<br /><br />It MUST have been made for TV or Cable.<br /><br />Look: forget the screenplay - forget the bunch of forgettable actors. Excuse me? Continuity? The NSA/NIA/whatever or whoever he is (an agent) takes-off in an F16 - is shown in an F18 chucking his guts up and, later, th...
2test/pos/8424_9.txtposThis review has been written by someone who has read it (several times) and knows what they are talking about!. Firstly I have read others comments and noticed that some of the objections were really quite stupid. People who have read the book and other Jane Austens or for that matter any good b...
3test/neg/9456_4.txtnegOuch!! What a mess we have here. Not so much of a mess as a painfully dull, half-assed excuse for exploitation. Brought to you by the one and only, J. G. \"Pat\" Patterson, yeah, the same one from Moonshine Mountain. Doctor Gore, formerly known as The Body Shop, is, I guess, somewhat inspired by F...
4test/neg/3853_1.txtnegI am almost tempted to demand my money back from the video store. This movie plumbs the depths of inanity and is almost completely unwatchable. I NEVER bail out of a film early but this was painful to view. A thorough waste of celluloid. My vote 1/10 (it would have been zero).
\n", + "
" + ], + "text/plain": [ + " path label \\\n", + "0 test/neg/3571_3.txt neg \n", + "1 test/neg/4311_2.txt neg \n", + "2 test/pos/8424_9.txt pos \n", + "3 test/neg/9456_4.txt neg \n", + "4 test/neg/3853_1.txt neg \n", + "\n", + " text \n", + "0 Upon writing this review I have difficulty trying to think of what to write about. Nothing much happens in this film. The storyline is a South Asian woman who falls for an English Londoner. The problem is he and his friends have had a racist streak. At the same time her friend at work is unknowi... \n", + "1 What a clunker!

It MUST have been made for TV or Cable.

Look: forget the screenplay - forget the bunch of forgettable actors. Excuse me? Continuity? The NSA/NIA/whatever or whoever he is (an agent) takes-off in an F16 - is shown in an F18 chucking his guts up and, later, th... \n", + "2 This review has been written by someone who has read it (several times) and knows what they are talking about!. Firstly I have read others comments and noticed that some of the objections were really quite stupid. People who have read the book and other Jane Austens or for that matter any good b... \n", + "3 Ouch!! What a mess we have here. Not so much of a mess as a painfully dull, half-assed excuse for exploitation. Brought to you by the one and only, J. G. \"Pat\" Patterson, yeah, the same one from Moonshine Mountain. Doctor Gore, formerly known as The Body Shop, is, I guess, somewhat inspired by F... \n", + "4 I am almost tempted to demand my money back from the video store. This movie plumbs the depths of inanity and is almost completely unwatchable. I NEVER bail out of a film early but this was painful to view. A thorough waste of celluloid. My vote 1/10 (it would have been zero). 
" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "29fe7760", + "metadata": {}, + "source": [ + "# Setup the project" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "13453217", + "metadata": {}, + "outputs": [], + "source": [ + "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'" + ] + }, + { + "cell_type": "markdown", + "id": "ba15a58d", + "metadata": {}, + "source": [ + "## Create project" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "0b298578", + "metadata": {}, + "outputs": [], + "source": [ + "project = toloka.Project(\n", + " public_name='Movie review classification',\n", + " public_description='Classify sentiment of movie reviews',\n", + " private_comment='OOTB: IMDb Movie Reviews'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "5ce8fb18", + "metadata": {}, + "outputs": [], + "source": [ + "input_specification = {'text': toloka.project.StringSpec()}\n", + "output_specification = {'result': toloka.project.StringSpec()}" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "c70b7ffe", + "metadata": {}, + "outputs": [], + "source": [ + "text_viewer = tb.TextViewV1(tb.InputData('text'))\n", + "\n", + "radio_group_field = tb.ButtonRadioGroupFieldV1(\n", + " tb.OutputData('result'),\n", + " [\n", + " tb.GroupFieldOption('pos', '😃 Positive'),\n", + " tb.GroupFieldOption('neg', '😡 Negative'),\n", + " ],\n", + " label='What is the review sentiment?',\n", + " validation=tb.RequiredConditionV1(hint='You need to select one answer'),\n", + ")\n", + "\n", + "task_width_plugin = tb.TolokaPluginV1(\n", + " layout=tb.TolokaPluginV1.TolokaPluginLayout(\n", + " kind='pager', \n", + " task_width=500,\n", + " )\n", + ")\n", + "\n", + "hot_keys_plugin = tb.HotkeysPluginV1(\n", + " 
key_1=tb.SetActionV1(tb.OutputData('result'), 'pos'),\n", + " key_2=tb.SetActionV1(tb.OutputData('result'), 'neg'),\n", + ")\n", + "\n", + "project_interface = toloka.project.TemplateBuilderViewSpec(\n", + " view=tb.ListViewV1([radio_group_field, text_viewer]),\n", + " plugins=[task_width_plugin, hot_keys_plugin],\n", + ")\n", + "\n", + "project.task_spec = toloka.project.task_spec.TaskSpec(\n", + " input_spec=input_specification,\n", + " output_spec=output_specification,\n", + " view_spec=project_interface,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "078efae9", + "metadata": {}, + "outputs": [], + "source": [ + "project.public_instructions = \"\"\"\n", + "

How to complete the task

\n", + "
    \n", + "
  • 1. Look at the movie review text.
  • \n", + "
  • 2. If it seems 😃 positive, assign the positive label. Otherwise assign the 😡 negative label.
  • \n", + "
  • 3. If you are unsure choose the label that seems most appropriate.
  • \n", + "
\n", + "\n", + "In case of problems send us a message. Good luck!\n", + "\"\"\".strip()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "0e359f15", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.client: A new project with ID \"61042\" has been created. Link to open in web interface: https://toloka.yandex.com/requester/project/61042\n" + ] + } + ], + "source": [ + "project = toloka_client.create_project(project)" + ] + }, + { + "cell_type": "markdown", + "id": "7f08588c", + "metadata": {}, + "source": [ + "## Create training tasks" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "ae976731", + "metadata": {}, + "outputs": [], + "source": [ + "training_pool = toloka.Training(\n", + " project_id=project.id,\n", + " private_name='Training pool', \n", + " training_tasks_in_task_suite_count=5, \n", + " task_suites_required_to_pass=1,\n", + " may_contain_adult_content=False,\n", + " inherited_instructions=True,\n", + " assignment_max_duration_seconds=60*5,\n", + " retry_training_after_days=1,\n", + " mix_tasks_in_creation_order=True,\n", + " shuffle_tasks_in_task_suite=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "48d38d06", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.client: A new training with ID \"28288532\" has been created. 
Link to open in web interface: https://toloka.yandex.com/requester/project/61042/training/28288532\n" + ] + } + ], + "source": [ + "training_pool = toloka_client.create_training(training_pool)" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "7440c365", + "metadata": {}, + "outputs": [], + "source": [ + "label_to_hint_map = {\n", + " 'pos': 'Positive', \n", + " 'neg': 'Negative',\n", + "}\n", + "\n", + "\n", + "tasks = []\n", + "for l in ['pos', 'neg']: \n", + " examples = df_training[df_training.label == l]\n", + " \n", + " for ex_tuple in examples.itertuples():\n", + " tasks.append(\n", + " toloka.Task(\n", + " input_values={'text': ex_tuple.text}, \n", + " known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'result': ex_tuple.label})], \n", + " message_on_unknown_solution=f'Incorrect label! The actual label is: {label_to_hint_map[ex_tuple.label]}',\n", + " infinite_overlap=True,\n", + " pool_id=training_pool.id,\n", + " )\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "55f1c4cd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n" + ] + } + ], + "source": [ + "result = toloka_client.create_tasks(tasks, allow_defaults=True)\n", + "print(len(result.items))" + ] + }, + { + "cell_type": "markdown", + "id": "45b21857", + "metadata": {}, + "source": [ + "## Create task Pool" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "82266781", + "metadata": {}, + "outputs": [], + "source": [ + "pool = toloka.Pool(\n", + " project_id=project.id,\n", + " private_name='Pool',\n", + " may_contain_adult_content=False,\n", + " reward_per_assignment=0.02, \n", + " assignment_max_duration_seconds=60*5, \n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365), \n", + " filter=(\n", + " (toloka.filter.Languages.in_('EN')) &\n", + " (toloka.filter.ClientType == 'BROWSER')\n", + " ),\n", + ")\n", + "pool.defaults = 
toloka.pool.Pool.Defaults(\n", + " default_overlap_for_new_task_suites=5,\n", + ")\n", + "pool.set_mixer_config(\n", + " real_tasks_count=4,\n", + " golden_tasks_count=1,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "07aaf41b", + "metadata": {}, + "outputs": [], + "source": [ + "pool.quality_control.training_requirement = toloka.quality_control.QualityControl.TrainingRequirement(\n", + " training_pool_id=training_pool.id, \n", + " training_passing_skill_value=50,\n", + ")\n", + "\n", + "\n", + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.GoldenSet(\n", + " history_size=5,\n", + " ),\n", + " conditions=[\n", + " toloka.conditions.GoldenSetAnswersCount >= 5,\n", + " toloka.conditions.IncorrectAnswersRate >= 30,\n", + " ],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='PROJECT',\n", + " duration=1,\n", + " duration_unit='DAYS',\n", + " private_comment='Low quality of responses',\n", + " ), \n", + ")\n", + "\n", + "\n", + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=15),\n", + " conditions=[\n", + " toloka.conditions.TotalSubmittedCount >= 5,\n", + " toloka.conditions.FastSubmittedCount >= 3\n", + " ],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='PROJECT',\n", + " duration=1,\n", + " duration_unit='DAYS',\n", + " private_comment='Answering too fast',\n", + " ), \n", + ")\n", + "\n", + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.SkippedInRowAssignments(),\n", + " conditions=[toloka.conditions.SkippedInRowCount >= 3],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope=toloka.user_restriction.UserRestriction.PROJECT,\n", + " duration=1,\n", + " duration_unit='DAYS',\n", + " private_comment='Too many skipped assignments',\n", + " )\n", + ")\n", + "\n", + "pool.quality_control.add_action(\n", + " collector=toloka.collectors.MajorityVote(\n", + " 
answer_threshold=4,\n", + " history_size=5,\n", + " ),\n", + " conditions=[\n", + " toloka.conditions.TotalAnswersCount >= 5,\n", + " toloka.conditions.IncorrectAnswersRate > 30,\n", + " ],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='PROJECT',\n", + " duration=1,\n", + " duration_unit='DAYS',\n", + " private_comment='Low quality responses',\n", + " ), \n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "45e6ff19", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.client: A new pool with ID \"28288533\" has been created. Link to open in web interface: https://toloka.yandex.com/requester/project/61042/pool/28288533\n" + ] + } + ], + "source": [ + "pool = toloka_client.create_pool(pool)" + ] + }, + { + "cell_type": "markdown", + "id": "27051f27", + "metadata": {}, + "source": [ + "## Create control tasks" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "5fbe842c", + "metadata": {}, + "outputs": [], + "source": [ + "tasks = []\n", + "for ex_tuple in df_control.itertuples():\n", + " tasks.append(\n", + " toloka.Task(\n", + " input_values={'text': ex_tuple.text}, \n", + " known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'result': ex_tuple.label})], \n", + " pool_id=pool.id,\n", + " infinite_overlap=True,\n", + " )\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "07da67c2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100\n" + ] + } + ], + "source": [ + "result = toloka_client.create_tasks(tasks=tasks, allow_defaults=True)\n", + "print(len(result.items))" + ] + }, + { + "cell_type": "markdown", + "id": "a58247c4", + "metadata": {}, + "source": [ + "## Create tasks from dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "0d5b5ea6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + 
"text": [ + "1000\n" + ] + } + ], + "source": [ + "tasks = []\n", + "for ex_tuple in df.itertuples():\n", + " tasks.append(\n", + " toloka.Task(\n", + " input_values={'text': ex_tuple.text}, \n", + " pool_id=pool.id,\n", + " )\n", + " )\n", + "result = toloka_client.create_tasks(tasks=tasks, allow_defaults=True)\n", + "print(len(result.items))" + ] + }, + { + "cell_type": "markdown", + "id": "a54129ff", + "metadata": {}, + "source": [ + "# Start annotation" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "13d912de", + "metadata": {}, + "outputs": [], + "source": [ + "training_pool = toloka_client.open_pool(training_pool.id)\n", + "pool = toloka_client.open_pool(pool.id)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f5f5979", + "metadata": {}, + "outputs": [], + "source": [ + "pool_id = pool.id\n", + "\n", + "def wait_pool_for_close(pool_id, minutes_to_wait=0.5):\n", + " sleep_time = 60 * minutes_to_wait\n", + " pool = toloka_client.get_pool(pool_id)\n", + " while not pool.is_closed():\n", + " op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])\n", + " op = toloka_client.wait_operation(op)\n", + " percentage = op.details['value'][0]['result']['value']\n", + " print(\n", + " f' {datetime.datetime.now().strftime(\"%H:%M:%S\")}\\t'\n", + " f'Pool {pool.id} - {percentage}%'\n", + " )\n", + " time.sleep(sleep_time)\n", + " pool = toloka_client.get_pool(pool.id)\n", + " print('Pool was closed.')\n", + "\n", + "wait_pool_for_close(pool_id)" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "d1318fa4", + "metadata": {}, + "outputs": [], + "source": [ + "training_pool = toloka_client.close_pool(training_pool.id)" + ] + }, + { + "cell_type": "markdown", + "id": "d9c3a3a1", + "metadata": {}, + "source": [ + "# Extract results" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "e3d0ba73", + "metadata": {}, + "outputs": [ + { + 
"name": "stdout", + "output_type": "stream", + "text": [ + "[WARNING] toloka.client: Experimental method\n" + ] + } + ], + "source": [ + "answers_df = toloka_client.get_assignments_df(pool.id, exclude_banned=True)\n", + "answers_df = answers_df[answers_df['GOLDEN:result'].isnull()]\n", + "answers_df = answers_df.rename(columns={\n", + " 'INPUT:text': 'task',\n", + " 'OUTPUT:result': 'label',\n", + " 'ASSIGNMENT:worker_id': 'worker',\n", + "})" + ] + }, + { + "cell_type": "markdown", + "id": "7ef15a6f", + "metadata": {}, + "source": [ + "# Aggregate results" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "916a6a80", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textpred_labelpathlabel
0I think that this short TV series, was absolutely wonderful, and gave both a in-depth and clear explanation of everything that was on the screen at the given time. This was by far David Attenborough at his best. I personally thought this was one of the best documentaries in the past decade. This...postest/pos/3585_10.txtpos
1Kill the scream queen may sound like a good slasher flick but it is terribly boring and very dumb.<br /><br />Kill the scream queen is about a crazy filmmaker who auditions girls to be in his snuff film. He rapes and tortures them. This is trash that is not amusing, suspenseful or entertaining.T...negtest/neg/2817_1.txtneg
2Well, okay, maybe not perfect, but it was pretty close. This movie jumped from crime drama to romantic goofball comedy and back again so quickly all the way throughout that it seemed like two different movies that played simultaneously and then joined up again at the end. But they did it smoothl...postest/pos/2847_8.txtpos
3Some movies want to make us think, some want to excite us, some want to exhilarate us. But sometimes, a movie wants only to make us laugh, and \"In & Out\" certainly succeeds in this department.<br /><br />Indiana high-school teacher Howard Brackett (Kevin Kline) is going to be married to fellow t...postest/pos/10922_8.txtpos
4In 1913, in Carlton Mine, Addytown, Pennsylvania, the cruel owner of a mine uses poor children in the exploration and after an explosion, a group of children is buried alive. On the present days, Karen Tunny (Lori Heuring) has just lost her husband after a long period of terminal disease when th...negtest/neg/1114_4.txtneg
\n", + "
" + ], + "text/plain": [ + " text \\\n", + "0 I think that this short TV series, was absolutely wonderful, and gave both a in-depth and clear explanation of everything that was on the screen at the given time. This was by far David Attenborough at his best. I personally thought this was one of the best documentaries in the past decade. This... \n", + "1 Kill the scream queen may sound like a good slasher flick but it is terribly boring and very dumb.

Kill the scream queen is about a crazy filmmaker who auditions girls to be in his snuff film. He rapes and tortures them. This is trash that is not amusing, suspenseful or entertaining.T... \n", + "2 Well, okay, maybe not perfect, but it was pretty close. This movie jumped from crime drama to romantic goofball comedy and back again so quickly all the way throughout that it seemed like two different movies that played simultaneously and then joined up again at the end. But they did it smoothl... \n", + "3 Some movies want to make us think, some want to excite us, some want to exhilarate us. But sometimes, a movie wants only to make us laugh, and \"In & Out\" certainly succeeds in this department.

Indiana high-school teacher Howard Brackett (Kevin Kline) is going to be married to fellow t... \n", + "4 In 1913, in Carlton Mine, Addytown, Pennsylvania, the cruel owner of a mine uses poor children in the exploration and after an explosion, a group of children is buried alive. On the present days, Karen Tunny (Lori Heuring) has just lost her husband after a long period of terminal disease when th... \n", + "\n", + " pred_label path label \n", + "0 pos test/pos/3585_10.txt pos \n", + "1 neg test/neg/2817_1.txt neg \n", + "2 pos test/pos/2847_8.txt pos \n", + "3 pos test/pos/10922_8.txt pos \n", + "4 neg test/neg/1114_4.txt neg " + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aggregated_answers = DawidSkene(n_iter=100).fit_predict(answers_df)\n", + "aggregated_answers = aggregated_answers.reset_index()\n", + "aggregated_answers.columns = ['text', 'pred_label']\n", + "aggregated_answers = aggregated_answers.merge(df, on='text')\n", + "aggregated_answers.head()" + ] + }, + { + "cell_type": "markdown", + "id": "cac70a10", + "metadata": {}, + "source": [ + "# View results" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "ead08680", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textpred_labellabel
253Don't ask me why I love this movie so much...Maybe it came at a time in my life I desperately wanted to fit in, maybe it is the amazing monster effects, maybe because I enjoyed the novel \"Cabal\", but It's probably because I LOVE Clive Barker. I think it's fair to warn you the movie and the novel...pospos
22With a name like \"10 Commandments\" you would expect a film to be representative of the account in the Bible, specifically Exodus. Not so here. This is standard procedure with any Biblical Hallmark-made film. Remember \"Noah\"?? That was utter fiction and one of the worst films ever made. At least ...negneg
852I've seen some bad things in my time. A half dead cow trying to get out of waist high mud; a head on collision between two cars; a thousand plates smashing on a kitchen floor; human beings living like animals.<br /><br />But never in my life have I seen anything as bad as The Cat in the Hat.<br ...negneg
298Talk about being boring!<br /><br />I got this expecting a fascinating insight into the life of the man who wrote the mythical Night on the Galactic Railroad. I expected to see crazy stories and hijinks of an eccentric man and to discover his inspirations for such bizarre material. Boy, was I wr...posneg
864This Blake Edwards film isn't too sure whether it wants to be a comedy, a drama or a musical. No matter, the sheer presence of Julie Andrews, is reason enough to see this comedy-drama-musical-spy spoof. Julie is beautiful and uses her many talents, throughout the film. Rock Hudson looks tired, b...pospos
\n", + "
" + ], + "text/plain": [ + " text \\\n", + "253 Don't ask me why I love this movie so much...Maybe it came at a time in my life I desperately wanted to fit in, maybe it is the amazing monster effects, maybe because I enjoyed the novel \"Cabal\", but It's probably because I LOVE Clive Barker. I think it's fair to warn you the movie and the novel... \n", + "22 With a name like \"10 Commandments\" you would expect a film to be representative of the account in the Bible, specifically Exodus. Not so here. This is standard procedure with any Biblical Hallmark-made film. Remember \"Noah\"?? That was utter fiction and one of the worst films ever made. At least ... \n", + "852 I've seen some bad things in my time. A half dead cow trying to get out of waist high mud; a head on collision between two cars; a thousand plates smashing on a kitchen floor; human beings living like animals.

But never in my life have I seen anything as bad as The Cat in the Hat.

I got this expecting a fascinating insight into the life of the man who wrote the mythical Night on the Galactic Railroad. I expected to see crazy stories and hijinks of an eccentric man and to discover his inspirations for such bizarre material. Boy, was I wr... \n", + "864 This Blake Edwards film isn't too sure whether it wants to be a comedy, a drama or a musical. No matter, the sheer presence of Julie Andrews, is reason enough to see this comedy-drama-musical-spy spoof. Julie is beautiful and uses her many talents, throughout the film. Rock Hudson looks tired, b... \n", + "\n", + " pred_label label \n", + "253 pos pos \n", + "22 neg neg \n", + "852 neg neg \n", + "298 pos neg \n", + "864 pos pos " + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aggregated_answers[['text', 'pred_label', 'label']].sample(5).head()" + ] + }, + { + "cell_type": "markdown", + "id": "42c9a9a0", + "metadata": {}, + "source": [ + "# View errors" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "710412d0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textpred_labellabel
755The only reason I watched this movie a second time, was to learn the name of the \"second banana\" girl playing opposite Katie Holms. Her name is Marisa Coughlan. Never heard of her before. She is lovely. Captivating. With an animated face, and cute bod, she is highly watchable... She's got real, ...posneg
608It was inferred by a previous poster that the military would not be subordinate to the police in a disaster as depicted in the film. In fact the military role would be to supply aid to the civil authorities when requested to do so. The civil authorities would retain primacy. In practise the Army...posneg
598On paper this looks a good film . Michael Caine plays a tough and ruthless boxing promoter who's son is up for a title eliminator . The pity is that when the story is transferred from paper to my television screen it loses a certain everything . I had hoped we'd be seen emulating his definitive ...posneg
609Young Warriors (1983) <br /><br />While this is a deeply flawed (and in some ways idiotic) movie, the way it continually defies expectations makes it decent viewing for the adventurous sleaze fan.<br /><br />Meet yuppie college student Kevin and his gang of lovable frat boy buddies. In what star...negpos
759I have recently gone to the movie theatres to see the new (2007) version of Bridge to Teribithia. After, I went to the library to rent the older version to see it again without paying again. I must say that I was extremely disappointed! I found the older version to have horrible acting as well a...posneg
\n", + "
" + ], + "text/plain": [ + " text \\\n", + "755 The only reason I watched this movie a second time, was to learn the name of the \"second banana\" girl playing opposite Katie Holms. Her name is Marisa Coughlan. Never heard of her before. She is lovely. Captivating. With an animated face, and cute bod, she is highly watchable... She's got real, ... \n", + "608 It was inferred by a previous poster that the military would not be subordinate to the police in a disaster as depicted in the film. In fact the military role would be to supply aid to the civil authorities when requested to do so. The civil authorities would retain primacy. In practise the Army... \n", + "598 On paper this looks a good film . Michael Caine plays a tough and ruthless boxing promoter who's son is up for a title eliminator . The pity is that when the story is transferred from paper to my television screen it loses a certain everything . I had hoped we'd be seen emulating his definitive ... \n", + "609 Young Warriors (1983)

While this is a deeply flawed (and in some ways idiotic) movie, the way it continually defies expectations makes it decent viewing for the adventurous sleaze fan.

Meet yuppie college student Kevin and his gang of lovable frat boy buddies. In what star... \n", + "759 I have recently gone to the movie theatres to see the new (2007) version of Bridge to Teribithia. After, I went to the library to rent the older version to see it again without paying again. I must say that I was extremely disappointed! I found the older version to have horrible acting as well a... \n", + "\n", + " pred_label label \n", + "755 pos neg \n", + "608 pos neg \n", + "598 pos neg \n", + "609 neg pos \n", + "759 pos neg " + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aggregated_answers[aggregated_answers.pred_label != aggregated_answers.label][['text', 'pred_label', 'label']].sample(5).head()" + ] + }, + { + "cell_type": "markdown", + "id": "cbcee026", + "metadata": {}, + "source": [ + "# Obtain accuracy" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "136d00c9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Accuracy: 0.89\n", + "Error: 0.11\n" + ] + } + ], + "source": [ + "accuracy = balanced_accuracy_score(aggregated_answers.label, aggregated_answers.pred_label)\n", + "print(f'Accuracy: {accuracy:.2f}')\n", + "print(f'Error: {1-accuracy:.2f}')" + ] + } + ], + "metadata": { + "interpreter": { + "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" + }, + "jupytercloud": { + "share": { + "history": [ + { + "link": "https://a.yandex-team.ru/arc/trunk/arcadia/junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb?rev=8548703", + "path": "junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb", + "success": true, + "timestamp": 1629718035338 + }, + { + "link": "https://a.yandex-team.ru/arc/trunk/arcadia/junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb?rev=8560593", + "path": "junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb", + "success": true, + "timestamp": 
1629970369047 + }, + { + "link": "https://a.yandex-team.ru/arc/trunk/arcadia/junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb?rev=8561371", + "path": "junk/btseytlin/jupyter/ootb/oottb_image_classification.ipynb", + "success": true, + "timestamp": 1629978932609 + }, + { + "link": "https://a.yandex-team.ru/arc/trunk/arcadia/junk/btseytlin/jupyter/ootb/oottb_text_classification.ipynb?rev=8565052", + "path": "junk/btseytlin/jupyter/ootb/oottb_text_classification.ipynb", + "success": true, + "timestamp": 1630057712069 + } + ] + } + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/4. cases-for-statistics-quality/autoquality/autoquality_usage.ipynb b/examples/4. cases-for-statistics-quality/autoquality/autoquality_usage.ipynb new file mode 100644 index 00000000..a9873358 --- /dev/null +++ b/examples/4. cases-for-statistics-quality/autoquality/autoquality_usage.ipynb @@ -0,0 +1,1590 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d9b60407", + "metadata": {}, + "source": [ + "# AutoQuality\n", + "\n", + "This example illustrates how to use the `toloka.autoquality` module. AutoQuality is a tool that helps set up quality control for a Toloka project. AutoQuality uses random search to find the optimal set of quality control parameters. Every parameter has its own distribution. AutoQuality creates several pools with different parameter values and compares them. The distributions and the optimality criteria can be modified by the user."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "94672aa7", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install pandas\n", + "!pip install toloka-kit[autoquality]==0.1.26" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "15ff8bec", + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "import sys\n", + "\n", + "logging.basicConfig(\n", + " format='[%(levelname)s] %(name)s: %(message)s',\n", + " level=logging.INFO,\n", + " stream=sys.stdout,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61f470aa", + "metadata": {}, + "outputs": [], + "source": [ + "import toloka.client as toloka\n", + "import toloka.client.project.template_builder as tb\n", + "from toloka.autoquality import AutoQuality\n", + "\n", + "import datetime\n", + "import numpy as np\n", + "import os\n", + "import requests\n", + "import pandas as pd\n", + "from tqdm import tqdm" + ] + }, + { + "cell_type": "markdown", + "id": "81cfe2c5", + "metadata": {}, + "source": [ + "In this example, our task is sentiment analysis of movie reviews. We will use the [Large Movie Review Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews):" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "759b152d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
pathtextlabel
0test/neg/8744_2.txtThis joins the endless line of corny, predicta...neg
1test/pos/6011_10.txtSwift's writing really has more in common with...pos
2test/pos/9149_8.txtThis film is a good start for novices that hav...pos
3test/pos/6504_9.txtWonderfully funny, awe-inspiring feature on th...pos
4test/neg/970_4.txtEddy Murphy and Robert De Niro should be a com...neg
\n", + "
" + ], + "text/plain": [ + " path text \\\n", + "0 test/neg/8744_2.txt This joins the endless line of corny, predicta... \n", + "1 test/pos/6011_10.txt Swift's writing really has more in common with... \n", + "2 test/pos/9149_8.txt This film is a good start for novices that hav... \n", + "3 test/pos/6504_9.txt Wonderfully funny, awe-inspiring feature on th... \n", + "4 test/neg/970_4.txt Eddy Murphy and Robert De Niro should be a com... \n", + "\n", + " label \n", + "0 neg \n", + "1 pos \n", + "2 pos \n", + "3 pos \n", + "4 neg " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "N_ROWS = 1000\n", + "\n", + "def sample_stratified(df, label_column, n_rows):\n", + " \"\"\"Function to sample n_rows from a dataframe while presenving class distribution\"\"\"\n", + " return df.groupby(label_column, group_keys=False) \\\n", + " .apply(lambda x: x.sample(int(np.rint(n_rows*len(x)/len(df))))) \\\n", + " .sample(frac=1)\n", + "\n", + "base_url = 'https://tlk.s3.yandex.net/ext_dataset/aclImdb'\n", + "df = pd.read_csv(os.path.join(base_url, 'test.csv'))\n", + "df_control = sample_stratified(df, 'label', n_rows=1000)\n", + "df = df.drop(df_control.index)\n", + "df = sample_stratified(df, 'label', n_rows=N_ROWS)\n", + "\n", + "df_control = df_control.reset_index(drop=True)\n", + "df = df.reset_index(drop=True)\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "9d39df9c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "neg 500\n", + "pos 500\n", + "Name: label, dtype: int64" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.label.value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13f66ff5", + "metadata": {}, + "outputs": [], + "source": [ + "def load_texts(urls):\n", + " texts = []\n", + " for url in tqdm(urls):\n", + " resp = requests.get(url)\n", + " 
 texts.append(resp.text)\n", + " return texts\n", + "\n", + "df['text'] = load_texts(base_url + '/' + df.path)\n", + "df_control['text'] = load_texts(base_url + '/' + df_control.path)" + ] + }, + { + "cell_type": "markdown", + "id": "84ed2020", + "metadata": {}, + "source": [ + "## Project setup\n", + "\n", + "Let's create an appropriate Toloka project. AutoQuality requires you to set up a training pool and a base pool. The base pool should be set up like the regular pools you will be running. AutoQuality will clone this pool and change its quality control settings to explore different configurations.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad0893b0", + "metadata": {}, + "outputs": [], + "source": [ + "token = input(\"Enter your token:\")\n", + "toloka_client = toloka.TolokaClient(token, 'PRODUCTION')" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "f802f9a5", + "metadata": {}, + "outputs": [], + "source": [ + "project = toloka.Project(\n", + " public_name='Movie review classification',\n", + " public_description='Classify sentiment of movie reviews',\n", + " private_comment='Auto quality control optimization experiments',\n", + ")\n", + "input_specification = {'text': toloka.project.StringSpec()}\n", + "output_specification = {'result': toloka.project.StringSpec()}" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "965eb4b6", + "metadata": {}, + "outputs": [], + "source": [ + "text_viewer = tb.TextViewV1(tb.InputData('text'))\n", + "\n", + "radio_group_field = tb.ButtonRadioGroupFieldV1(\n", + " tb.OutputData('result'),\n", + " [\n", + " tb.GroupFieldOption('pos', '😃 Positive'),\n", + " tb.GroupFieldOption('neg', '😡 Negative'),\n", + " ],\n", + " label='What is the review sentiment?',\n", + " validation=tb.RequiredConditionV1(hint='You need to select one answer'),\n", + ")\n", + "\n", + "task_width_plugin = tb.TolokaPluginV1(\n", + " layout=tb.TolokaPluginV1.TolokaPluginLayout(\n", + " kind='pager', 
\n", + " task_width=500,\n", + " )\n", + ")\n", + "\n", + "hot_keys_plugin = tb.HotkeysPluginV1(\n", + " key_1=tb.SetActionV1(tb.OutputData('result'), 'pos'),\n", + " key_2=tb.SetActionV1(tb.OutputData('result'), 'neg'),\n", + ")\n", + "\n", + "project_interface = toloka.project.TemplateBuilderViewSpec(\n", + " view=tb.ListViewV1([radio_group_field, text_viewer]),\n", + " plugins=[task_width_plugin, hot_keys_plugin],\n", + ")\n", + "\n", + "project.task_spec = toloka.project.task_spec.TaskSpec(\n", + " input_spec=input_specification,\n", + " output_spec=output_specification,\n", + " view_spec=project_interface,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "4aadf91c", + "metadata": {}, + "outputs": [], + "source": [ + "project.public_instructions = \"\"\"\n", + "

How to complete the task

\n", + "
    \n", + "
  • 1. Look at the movie review text.
  • \n", + "
  • 2. If it seems 😃 positive, assign the positive label. Otherwise assign the 😡 negative label.
  • \n", + "
  • 3. If you are unsure, choose the label that seems most appropriate.
  • \n", + "
\n", + "\n", + "In case of problems send us a message. Good luck!\n", + "\"\"\".strip()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "a36cd7f6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.client: A new project with ID \"100100\" has been created. Link to open in web interface: https://toloka.dev/requester/project/100100\n" + ] + } + ], + "source": [ + "project = toloka_client.create_project(project)" + ] + }, + { + "cell_type": "markdown", + "id": "74a8db95", + "metadata": {}, + "source": [ + "## Training and base pool setup" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "a984acc8", + "metadata": {}, + "outputs": [], + "source": [ + "training_pool = toloka.training.Training(project_id=project.id,\n", + " private_name='Training pool', \n", + " training_tasks_in_task_suite_count=5, \n", + " task_suites_required_to_pass=1,\n", + " may_contain_adult_content=False,\n", + " inherited_instructions=True,\n", + " assignment_max_duration_seconds=60*5,\n", + " retry_training_after_days=5,\n", + " mix_tasks_in_creation_order=True,\n", + " shuffle_tasks_in_task_suite=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "90777897", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.client: A new training with ID \"34223462\" has been created. 
Link to open in web interface: https://toloka.dev/requester/project/100100/training/34223462\n" + ] + } + ], + "source": [ + "training_pool = toloka_client.create_training(training_pool)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "54c6a329", + "metadata": {}, + "outputs": [], + "source": [ + "label_to_hint_map = {\n", + " 'pos': 'Positive', \n", + " 'neg': 'Negative',\n", + "}\n", + "\n", + "\n", + "tasks = []\n", + "for label in ['pos', 'neg']: \n", + " examples = df[df.label == label].head(3)\n", + " \n", + " for ex_tuple in examples.itertuples():\n", + " tasks.append(\n", + " toloka.Task(\n", + " input_values={'text': ex_tuple.text},\n", + " known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'result': ex_tuple.label})],\n", + " message_on_unknown_solution=f'Incorrect label! The actual label is: {label_to_hint_map[ex_tuple.label]}',\n", + " infinite_overlap=True,\n", + " pool_id=training_pool.id\n", + " )\n", + " )\n", + "\n", + "result = toloka_client.create_tasks(tasks, allow_defaults=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "bdc7b7d2", + "metadata": {}, + "outputs": [], + "source": [ + "base_pool = toloka.Pool(\n", + " project_id=project.id,\n", + " private_name='AutoQuality Base Pool',\n", + " may_contain_adult_content=False,\n", + " reward_per_assignment=0.01, \n", + " assignment_max_duration_seconds=60*7, \n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365), \n", + " filter=(\n", + " (toloka.filter.Languages.in_('EN')) &\n", + " (\n", + " (toloka.filter.ClientType == 'TOLOKA_APP') | \n", + " (toloka.filter.ClientType == 'BROWSER')\n", + " )\n", + " ),\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "46c6f9b2", + "metadata": {}, + "outputs": [], + "source": [ + "base_pool.set_mixer_config(\n", + " real_tasks_count=4,\n", + " golden_tasks_count=1\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": 
"11480cbf", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.client: A new pool with ID \"34223545\" has been created. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223545\n" + ] + } + ], + "source": [ + "base_pool = toloka_client.create_pool(base_pool)" + ] + }, + { + "cell_type": "markdown", + "id": "39b19ee1", + "metadata": {}, + "source": [ + "## AutoQuality basic usage\n", + "\n", + "To use the AutoQuality class, you need to set `project_id`, `base_pool_id`, and `training_pool_id`. If your target label field is different from `label`, then you also need to specify it." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "e7fc50ef", + "metadata": {}, + "outputs": [], + "source": [ + "aq = AutoQuality(\n", + " toloka_client=toloka_client,\n", + " project_id=project.id,\n", + " base_pool_id=base_pool.id,\n", + " training_pool_id=training_pool.id,\n", + " label_field='result',\n", + " # you can also use exam pool\n", + " # exam_pool_id = ...,\n", + " # exam_skill_id = ...,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "4681a079", + "metadata": {}, + "source": [ + "First, call `setup_pools` to create multiple pools with different quality control settings (autoquality pools)." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "acd38696", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.autoquality.optimizer: Creating pools\n", + "[INFO] toloka.client: A new pool with ID \"34223548\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223548\n", + "[INFO] toloka.client: A new skill with ID \"46804\" has been created. 
Link to open in web interface: https://toloka.dev/requester/quality/skill/46804\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.1477143300195338}, 'ExamRequirement': {'exam_passing_skill_value': 53.285216686727665}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 84.85131966116259}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 78.20005144646932}, 'TrainingRequirement': {'training_passing_skill_value': 54.66533962883042}, 'overlap': 2}\n", + "[INFO] toloka.client: A new pool with ID \"34223549\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223549\n", + "[INFO] toloka.client: A new skill with ID \"46805\" has been created. Link to open in web interface: https://toloka.dev/requester/quality/skill/46805\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.23201628347219955}, 'ExamRequirement': {'exam_passing_skill_value': 69.62253027714169}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 56.978409332827354}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 49.64673390232204}, 'TrainingRequirement': {'training_passing_skill_value': 88.51934128668412}, 'overlap': 4}\n", + "[INFO] toloka.client: A new pool with ID \"34223550\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223550\n", + "[INFO] toloka.client: A new skill with ID \"46806\" has been created. 
Link to open in web interface: https://toloka.dev/requester/quality/skill/46806\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.3343549452822557}, 'ExamRequirement': {'exam_passing_skill_value': 82.11583064708395}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 56.963714455271074}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 59.77426079537283}, 'TrainingRequirement': {'training_passing_skill_value': 47.590059740117084}, 'overlap': 2}\n", + "[INFO] toloka.client: A new pool with ID \"34223551\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223551\n", + "[INFO] toloka.client: A new skill with ID \"46807\" has been created. Link to open in web interface: https://toloka.dev/requester/quality/skill/46807\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.45514044809801646}, 'ExamRequirement': {'exam_passing_skill_value': 73.50951079166411}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 76.79440658958411}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 54.728594803367706}, 'TrainingRequirement': {'training_passing_skill_value': 55.763206387618524}, 'overlap': 2}\n", + "[INFO] toloka.client: A new pool with ID \"34223553\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223553\n", + "[INFO] toloka.client: A new skill with ID \"46808\" has been created. 
Link to open in web interface: https://toloka.dev/requester/quality/skill/46808\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.29346427012576315}, 'ExamRequirement': {'exam_passing_skill_value': 72.06472728354181}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 37.89247344156907}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 53.62586551022977}, 'TrainingRequirement': {'training_passing_skill_value': 19.888309636790538}, 'overlap': 2}\n", + "[INFO] toloka.client: A new pool with ID \"34223554\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223554\n", + "[INFO] toloka.client: A new skill with ID \"46809\" has been created. Link to open in web interface: https://toloka.dev/requester/quality/skill/46809\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.17296793978241715}, 'ExamRequirement': {'exam_passing_skill_value': 31.831342002205403}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 53.28453154405012}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 66.86767780051952}, 'TrainingRequirement': {'training_passing_skill_value': 29.63761557346693}, 'overlap': 4}\n", + "[INFO] toloka.client: A new pool with ID \"34223555\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223555\n", + "[INFO] toloka.client: A new skill with ID \"46810\" has been created. 
Link to open in web interface: https://toloka.dev/requester/quality/skill/46810\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.40328158825067395}, 'ExamRequirement': {'exam_passing_skill_value': 40.80936929832271}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 74.03553397977555}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 51.19501664153089}, 'TrainingRequirement': {'training_passing_skill_value': 60.870975371389}, 'overlap': 3}\n", + "[INFO] toloka.client: A new pool with ID \"34223556\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223556\n", + "[INFO] toloka.client: A new skill with ID \"46811\" has been created. Link to open in web interface: https://toloka.dev/requester/quality/skill/46811\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.29328743451998623}, 'ExamRequirement': {'exam_passing_skill_value': 45.54181665616344}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 48.43325160973314}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 30.988505985671708}, 'TrainingRequirement': {'training_passing_skill_value': 63.74196964105757}, 'overlap': 2}\n", + "[INFO] toloka.client: A new pool with ID \"34223557\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223557\n", + "[INFO] toloka.client: A new skill with ID \"46812\" has been created. 
Link to open in web interface: https://toloka.dev/requester/quality/skill/46812\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.5112505771972173}, 'ExamRequirement': {'exam_passing_skill_value': 19.45845961235321}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 65.48091762987362}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 75.51178871391954}, 'TrainingRequirement': {'training_passing_skill_value': 58.32543986930165}, 'overlap': 2}\n", + "[INFO] toloka.client: A new pool with ID \"34223558\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34223558\n", + "[INFO] toloka.client: A new skill with ID \"46813\" has been created. Link to open in web interface: https://toloka.dev/requester/quality/skill/46813\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.28818617778607497}, 'ExamRequirement': {'exam_passing_skill_value': 84.79545467497228}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 67.80647202040045}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 56.974801198749965}, 'TrainingRequirement': {'training_passing_skill_value': 33.18491499105997}, 'overlap': 4}\n" + ] + } + ], + "source": [ + "aq.setup_pools()" + ] + }, + { + "cell_type": "markdown", + "id": "1fe48890", + "metadata": {}, + "source": [ + "Then use `create_tasks` to add tasks to every autoquality pool. AutoQuality usually requires 300-500 tasks to work properly (you also need enough control tasks if Golden Set quality control is used)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "8f863735", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "((200, 3), (800, 3))" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "n_optim = 200\n", + "df_optim = df_control.iloc[:n_optim].copy()\n", + "df_optim_golden = df_control.iloc[n_optim:].copy()\n", + "df_optim.shape, df_optim_golden.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "ce4595a5", + "metadata": {}, + "outputs": [], + "source": [ + "aq_tasks = []" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "63e6c15e", + "metadata": {}, + "outputs": [], + "source": [ + "for row in df_optim.itertuples():\n", + " aq_tasks.append(\n", + " toloka.Task(\n", + " input_values={'text': row.text}, \n", + " )\n", + " )\n", + "for row in df_optim_golden.itertuples():\n", + " aq_tasks.append(\n", + " toloka.Task(\n", + " input_values={'text': row.text},\n", + " known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'result': row.label})]\n", + " )\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "480f05b8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.autoquality.optimizer: Creating tasks in pools\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34223548 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34223549 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34223550 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34223551 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34223553 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34223554 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34223555 with 1000 tasks\n", + "[INFO] 
toloka.autoquality.optimizer: Populated pool 34223556 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34223557 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34223558 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Setup complete, please verify\n" + ] + } + ], + "source": [ + "aq.create_tasks(aq_tasks)" + ] + }, + { + "cell_type": "markdown", + "id": "3f1bf67a", + "metadata": {}, + "source": [ + "Finally, call `run` to start AutoQuality." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad3649ba", + "metadata": {}, + "outputs": [], + "source": [ + "aq.run()" + ] + }, + { + "cell_type": "markdown", + "id": "295c136e", + "metadata": {}, + "source": [ + "After the run finishes, the AutoQuality instance exposes the results through several attributes." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "2318d3da", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'34223549'" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aq.best_pool_id" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "4379732b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'AssignmentSubmitTime': {'avg_page_seconds': 90,\n", + " 'history_size': 5,\n", + " 'too_fast_fraction': 0.23201628347219955},\n", + " 'ExamRequirement': {'exam_passing_skill_value': 69.62253027714169},\n", + " 'GoldenSet': {'history_size': 5,\n", + " 'incorrect_answers_rate': 56.978409332827354},\n", + " 'MajorityVote': {'history_size': 5,\n", + " 'incorrect_answers_rate': 49.64673390232204},\n", + " 'TrainingRequirement': {'training_passing_skill_value': 88.51934128668412},\n", + " 'overlap': 4}" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aq.best_pool_params" + ] + }, + { + "cell_type": "markdown", + "id": "40960ef4", + 
"metadata": 
{}, + "source": [ + "You can also compare all autoquality pools across a variety of metrics." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "7bbdb9ee", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " pool_id accuracy_golden accuracy_mv alpha_krippendorff uncertanity \\\n", + "0 34223548 0.295918 0.774235 0.257751 0.424374 \n", + "1 34223549 0.452862 0.850938 0.581601 0.374584 \n", + "2 34223550 0.319298 0.788596 0.331257 0.395112 \n", + "3 34223551 0.396078 0.813725 0.372302 0.422319 \n", + "4 34223553 0.255435 0.763587 0.207660 0.460486 \n", + "5 34223554 0.314234 0.799111 0.427109 0.427260 \n", + "6 34223555 0.332100 0.822762 0.476627 0.397302 \n", + "7 34223556 0.248355 0.803180 0.392882 0.398519 \n", + "8 34223557 0.251695 0.747279 0.294782 0.460207 \n", + "9 34223558 0.296667 0.765042 0.343792 0.529400 \n", + "\n", + " time_spent_seconds unique_submitters_count spent_budget \\\n", + "0 1017 98.0 1.000 \n", + "1 1158 165.0 2.003 \n", + "2 547 95.0 1.000 \n", + "3 1598 85.0 1.000 \n", + "4 1816 92.0 1.000 \n", + "5 2249 185.0 2.003 \n", + "6 3068 113.0 1.503 \n", + "7 2903 76.0 1.003 \n", + "8 3263 56.0 1.003 \n", + "9 3495 106.0 2.003 \n", + "\n", + " avg_submit_assignment_millis num_bans ... accuracy_golden_rank \\\n", + "0 92376.0 58 ... 3 \n", + "1 134946.0 79 ... 8 \n", + "2 98809.0 57 ... 5 \n", + "3 102025.0 43 ... 7 \n", + "4 83584.0 61 ... 2 \n", + "5 112834.0 113 ... 4 \n", + "6 103747.0 64 ... 6 \n", + "7 116097.0 55 ... 1 \n", + "8 131404.0 37 ... 1 \n", + "9 118670.0 62 ... 
3 \n", + "\n", + " accuracy_mv_rank alpha_krippendorff_rank spending_per_task_rank \\\n", + "0 3 2 2 \n", + "1 8 10 1 \n", + "2 4 4 2 \n", + "3 6 6 2 \n", + "4 2 1 2 \n", + "5 5 8 1 \n", + "6 7 9 1 \n", + "7 5 7 1 \n", + "8 1 3 1 \n", + "9 3 5 1 \n", + "\n", + " tasks_per_second_rank bans_ratio_rank avg_quality_rank avg_rank \\\n", + "0 9 6 2.67 4.9175 \n", + "1 8 10 8.67 6.9175 \n", + "2 10 5 4.33 5.3325 \n", + "3 7 9 6.33 6.0825 \n", + "4 6 2 1.67 2.9175 \n", + "5 5 4 5.67 3.9175 \n", + "6 3 8 7.33 4.8325 \n", + "7 4 1 4.33 2.5825 \n", + "8 2 3 1.67 1.9175 \n", + "9 1 7 3.67 3.1675 \n", + "\n", + " optimal_quality_rank main_rank \n", + "0 3.402000 3.402000 \n", + "1 7.268667 7.268667 \n", + "2 4.331333 4.331333 \n", + "3 5.864667 5.864667 \n", + "4 2.068667 2.068667 \n", + "5 4.468667 4.468667 \n", + "6 5.864667 5.864667 \n", + "7 3.198000 3.198000 \n", + "8 1.735333 1.735333 \n", + "9 3.402000 3.402000 \n", + "\n", + "[10 rows x 30 columns]" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aq.ranks" + ] + }, + { + "cell_type": "markdown", + "id": "c7907bae", + "metadata": {}, + "source": [ + "You can then archive all pools created by AutoQuality." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "9fe8bfe2", + "metadata": {}, + "outputs": [], + "source": [ + "aq.archive_autoquality_pools()" + ] + }, + { + "cell_type": "markdown", + "id": "4694fee4", + "metadata": {}, + "source": [ + "## AutoQuality advanced usage\n", + "\n", + "The AutoQuality class provides many ways to customize the optimization algorithm. Let's create another instance with different settings." + ] + }, + { + "cell_type": "markdown", + "id": "a695af55", + "metadata": {}, + "source": [ + "First, you can set the `n_iter` parameter, which determines how many autoquality pools will be created." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "f831d030", + "metadata": {}, + "outputs": [], + "source": [ + "aq = AutoQuality(\n", + " toloka_client=toloka_client,\n", + " project_id=project.id,\n", + " base_pool_id=base_pool.id,\n", + " training_pool_id=training_pool.id,\n", + " label_field='result',\n", + " n_iter=5\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "00762747", + "metadata": {}, + "source": [ + "You can also change the distributions of the quality control parameters that AutoQuality optimizes. In this example, we change the distributions for the majority vote rule. AutoQuality will sample new values for every autoquality pool from these distributions." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "8a390b6a", + "metadata": {}, + "outputs": [], + "source": [ + "from scipy import stats\n", + "aq.parameter_distributions['MajorityVote'] = dict(\n", + " history_size=[3, 5, 7], \n", + " incorrect_answers_rate=stats.norm(loc=70, scale=10)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6b4b9d8a", + "metadata": {}, + "source": [ + "Finally, you can customize the methods that calculate scores or ranks. Let's modify the ranking function to give preference to cheaper pools. Do not forget to assign your new rank to the `main_rank` column so that AutoQuality knows how to choose the best pool." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "151ced9a", + "metadata": {}, + "outputs": [], + "source": [ + "from toloka.autoquality.scoring import default_calc_ranks\n", + "def my_new_calc_ranks(scores_df: pd.DataFrame) -> pd.DataFrame:\n", + " ranks = default_calc_ranks(scores_df)\n", + " ranks['my_new_rank'] = (\n", + " 0.5 * scores_df['spending_per_task_rank']\n", + " + 0.4 * scores_df['avg_quality_rank']\n", + " + 0.05 * scores_df['bans_ratio_rank']\n", + " + 0.05 * scores_df['tasks_per_second_rank']\n", + " )\n", + " ranks['main_rank'] = ranks['my_new_rank']\n", + " return ranks\n", + "aq.ranking_func = my_new_calc_ranks" + ] + }, + { + "cell_type": "markdown", + "id": "a94fe867", + "metadata": {}, + "source": [ + "You can create completely new scoring and ranking functions to use AutoQuality the way you need. Just keep the same signature as in the [default methods](https://github.com/Toloka/toloka-kit/blob/main/src/autoquality/scoring.py)." + ] + }, + { + "cell_type": "markdown", + "id": "bfeda9d6", + "metadata": {}, + "source": [ + "Now let's run our modified AutoQuality instance again" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "316ea453", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.autoquality.optimizer: Creating pools\n", + "[INFO] toloka.client: A new pool with ID \"34224967\" has been cloned. 
Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34224967\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.23797148002513394}, 'ExamRequirement': {'exam_passing_skill_value': 67.66191223513454}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 77.59380523413068}, 'MajorityVote': {'history_size': 7, 'incorrect_answers_rate': 61.81437838740685}, 'TrainingRequirement': {'training_passing_skill_value': 42.91563427986361}, 'overlap': 3}\n", + "[INFO] toloka.client: A new pool with ID \"34224968\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34224968\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.21702013951044616}, 'ExamRequirement': {'exam_passing_skill_value': 75.87638640144884}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 65.29114837367328}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 69.32452329459505}, 'TrainingRequirement': {'training_passing_skill_value': 49.40059101881683}, 'overlap': 3}\n", + "[INFO] toloka.client: A new pool with ID \"34224971\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34224971\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.24919852639254328}, 'ExamRequirement': {'exam_passing_skill_value': 52.36319606278474}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 63.662734228475955}, 'MajorityVote': {'history_size': 3, 'incorrect_answers_rate': 64.71433960028887}, 'TrainingRequirement': {'training_passing_skill_value': 33.499919567379365}, 'overlap': 2}\n", + "[INFO] toloka.client: A new pool with ID \"34224972\" has been cloned. 
Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34224972\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.2431566430532228}, 'ExamRequirement': {'exam_passing_skill_value': 41.63475243222442}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 38.278764737735415}, 'MajorityVote': {'history_size': 7, 'incorrect_answers_rate': 78.1961475817833}, 'TrainingRequirement': {'training_passing_skill_value': 72.95585692343867}, 'overlap': 2}\n", + "[INFO] toloka.client: A new pool with ID \"34224973\" has been cloned. Link to open in web interface: https://toloka.dev/requester/project/100100/pool/34224973\n", + "[INFO] toloka.autoquality.optimizer: {'AssignmentSubmitTime': {'avg_page_seconds': 90, 'history_size': 5, 'too_fast_fraction': 0.3063388211369379}, 'ExamRequirement': {'exam_passing_skill_value': 52.124449490318256}, 'GoldenSet': {'history_size': 5, 'incorrect_answers_rate': 36.4322085974661}, 'MajorityVote': {'history_size': 5, 'incorrect_answers_rate': 82.1998010085741}, 'TrainingRequirement': {'training_passing_skill_value': 39.056099665979254}, 'overlap': 2}\n" + ] + } + ], + "source": [ + "aq.setup_pools()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "b72cca59", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[INFO] toloka.autoquality.optimizer: Creating tasks in pools\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34224967 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34224968 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34224971 with 1000 tasks\n", + "[WARNING] urllib3.connectionpool: Retrying (TolokaRetry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError(\"HTTPSConnectionPool(host='toloka.yandex.com', port=443): Read timed out. 
(read timeout=10.0)\")': /api/v1/operations/d0dfc196-4a20-4a77-9a6d-ad4302e1d1a4\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34224972 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Populated pool 34224973 with 1000 tasks\n", + "[INFO] toloka.autoquality.optimizer: Setup complete, please verify\n" + ] + } + ], + "source": [ + "aq.create_tasks(aq_tasks)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd046c0c", + "metadata": {}, + "outputs": [], + "source": [ + "aq.run()" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "ffae683b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'AssignmentSubmitTime': {'avg_page_seconds': 90,\n", + " 'history_size': 5,\n", + " 'too_fast_fraction': 0.2431566430532228},\n", + " 'ExamRequirement': {'exam_passing_skill_value': 41.63475243222442},\n", + " 'GoldenSet': {'history_size': 5,\n", + " 'incorrect_answers_rate': 38.278764737735415},\n", + " 'MajorityVote': {'history_size': 7,\n", + " 'incorrect_answers_rate': 78.1961475817833},\n", + " 'TrainingRequirement': {'training_passing_skill_value': 72.95585692343867},\n", + " 'overlap': 2}" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aq.best_pool_params" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "7c735d77", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[HTML rendering of the aq.ranks DataFrame: 5 rows × 31 columns; the same values appear in the text/plain output below]
" + ], + "text/plain": [ + " pool_id accuracy_golden accuracy_mv alpha_krippendorff uncertanity \\\n", + "0 34224967 0.280797 0.797883 0.352331 0.477875 \n", + "1 34224968 0.259922 0.835943 0.476938 0.391633 \n", + "2 34224971 0.330405 0.753688 0.166050 0.490057 \n", + "3 34224972 0.306494 0.815658 0.420929 0.403925 \n", + "4 34224973 0.310127 0.804282 0.381867 0.414965 \n", + "\n", + " time_spent_seconds unique_submitters_count spent_budget \\\n", + "0 949 92.0 1.502 \n", + "1 1051 103.0 1.506 \n", + "2 1395 74.0 1.000 \n", + "3 1550 77.0 1.000 \n", + "4 1720 79.0 1.003 \n", + "\n", + " avg_submit_assignment_millis num_bans ... accuracy_mv_rank \\\n", + "0 99863.0 48 ... 2 \n", + "1 107552.0 64 ... 4 \n", + "2 82150.0 45 ... 1 \n", + "3 117902.0 51 ... 3 \n", + "4 118541.0 56 ... 2 \n", + "\n", + " alpha_krippendorff_rank spending_per_task_rank tasks_per_second_rank \\\n", + "0 2 1 5 \n", + "1 5 1 4 \n", + "2 1 2 3 \n", + "3 4 2 2 \n", + "4 3 1 1 \n", + "\n", + " bans_ratio_rank avg_quality_rank avg_rank optimal_quality_rank main_rank \\\n", + "0 5 2.00 3.2500 2.400000 1.800 \n", + "1 3 3.33 2.8325 2.864667 2.182 \n", + "2 4 2.00 2.7500 2.333333 2.150 \n", + "3 2 3.33 2.3325 2.798000 2.532 \n", + "4 1 2.67 1.4175 2.002000 1.668 \n", + "\n", + " my_new_rank \n", + "0 1.800 \n", + "1 2.182 \n", + "2 2.150 \n", + "3 2.532 \n", + "4 1.668 \n", + "\n", + "[5 rows x 31 columns]" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aq.ranks" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "9291cc1d", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "aq.archive_autoquality_pools()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": 
"text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/examples/benchmarks/image_classification_cinic10.ipynb b/examples/4. cases-for-statistics-quality/benchmarks/image_classification_cinic10.ipynb similarity index 100% rename from examples/benchmarks/image_classification_cinic10.ipynb rename to examples/4. cases-for-statistics-quality/benchmarks/image_classification_cinic10.ipynb diff --git a/examples/benchmarks/text_classification_imdb.ipynb b/examples/4. cases-for-statistics-quality/benchmarks/text_classification_imdb.ipynb similarity index 100% rename from examples/benchmarks/text_classification_imdb.ipynb rename to examples/4. cases-for-statistics-quality/benchmarks/text_classification_imdb.ipynb diff --git a/examples/4. cases-for-statistics-quality/metrics/graphite.ipynb b/examples/4. cases-for-statistics-quality/metrics/graphite.ipynb new file mode 100644 index 00000000..b10bd7ca --- /dev/null +++ b/examples/4. cases-for-statistics-quality/metrics/graphite.ipynb @@ -0,0 +1,336 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Collect and show metrics in Graphite" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this example we will learn how to collect metrics using Toloka-kit and\n", + "send them to remote metrics server (we will use [Graphite](https://graphiteapp.org) but switching to any other solution is very easy)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==0.1.26\n", + "!pip install crowd-kit==1.0.0\n", + "\n", + "import socket\n", + "import asyncio\n", + "import logging\n", + "import getpass\n", + "\n", + "import toloka.metrics as metrics\n", + "import toloka.client as toloka\n", + "from toloka.metrics import MetricCollector" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'\n", + "print(toloka_client.get_requester())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this example we will run pipeline from [Streaming pipeline example](https://github.com/Toloka/toloka-kit/tree/main/examples/6.streaming_pipelines/streaming_pipelines.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/6.streaming_pipelines/streaming_pipelines.ipynb).\n", + "If you are running this jupyter notebook in colab please download necessary script with the following line of code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!wget --quiet --show-progress \"https://raw.githubusercontent.com/Toloka/toloka-kit/main/examples/metrics/find_items_pipeline.py\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from find_items_pipeline import FindItemsPipeline\n", + "pipeline = FindItemsPipeline(client=toloka_client)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create projects and pools needed for pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ 
+ "pipeline.init_pipeline()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Configuring metrics collection in Graphite\n", + "\n", + "You need to [configure](https://graphite.readthedocs.io/en/stable/install.html) Graphite server before proceeding\n", + "to this section. An easy option might be using official docker container. Selection of user interface is up to you\n", + "(during creation of this example we used [Grafana](https://grafana.com))." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# specify your Graphite instance url and port\n", + "CARBON_ADDRESS = 'localhost'\n", + "CARBON_PORT = 2003\n", + "\n", + "try:\n", + " sock = socket.socket()\n", + " sock.connect((CARBON_ADDRESS, CARBON_PORT))\n", + " sock.close()\n", + "except ConnectionRefusedError:\n", + " raise RuntimeError('Graphite server is unreachable!')\n", + "else:\n", + " print('Congratulations, connected to Graphite server!')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, let's define a callback for handling metrics values. We'll use it to store the data on a Graphite server." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class GraphiteLogger:\n", + " def __init__(self, carbon_address, carbon_port, use_ipv6=False):\n", + " self.carbon_address = carbon_address\n", + " self.carbon_port = carbon_port\n", + " self.use_ipv6 = use_ipv6\n", + " self.logger = logging.getLogger('GraphiteLogger')\n", + "\n", + " def __call__(self, metric_dict):\n", + " if self.use_ipv6:\n", + " s = socket.socket(socket.AF_INET6)\n", + " s.connect((self.carbon_address, self.carbon_port, 0, 0))\n", + " else:\n", + " s = socket.socket()\n", + " s.connect((self.carbon_address, self.carbon_port))\n", + "\n", + " for metric in metric_dict:\n", + " for timestamp, value in metric_dict[metric]:\n", + " s.sendall(\n", + " f'{metric} {value} {timestamp.timestamp()}\\n'.encode()\n", + " )\n", + " self.logger.log(\n", + " logging.INFO,\n", + " f'Logged {metric} {value} {timestamp.timestamp()}'\n", + " )\n", + " s.close()\n", + "\n", + "\n", + "graphite_logger = GraphiteLogger(\n", + " CARBON_ADDRESS, CARBON_PORT,\n", + " # specify use_ipv6=True if your Graphite server is available only via IPv6\n", + " # (this may be the case if you are running Graphite inside docker hosted in MacOS)\n", + " use_ipv6=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For sending metrics to Graphite we have to:\n", + "- Define which metrics we'll collect.\n", + "- Describe what we'll do with these metrics, as a callable functor.\n", + "- Define a TolokaClient for each metric.\n", + "- Asynchronously call `run` for the MetricCollector instance.\n", + "\n", + "For this example we will collect a number of submitted assignments, accepted assignments and total expenses for each pool. All available metrics can be found in the [documentation](https://toloka.ai/en/docs/toloka-kit/reference/toloka.metrics.metrics.BaseMetric)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "metric_collector = MetricCollector(\n", + " [\n", + " # Assignments in pools. We will track submitted assignments and\n", + " # accepted assignments counts for every pool.\n", + " metrics.AssignmentsInPool(\n", + " pipeline.verification_pool.id,\n", + " submitted_name='verification_pool.submitted_assignments',\n", + " accepted_name='verification_pool.accepted_assignments',\n", + " ),\n", + " metrics.AssignmentsInPool(\n", + " pipeline.find_items_pool.id,\n", + " submitted_name='find_items_pool.submitted_assignments',\n", + " accepted_name='find_items_pool.accepted_assignments',\n", + " ),\n", + " metrics.AssignmentsInPool(\n", + " pipeline.sbs_pool.id,\n", + " submitted_name='sbs_pool.submitted_assignments',\n", + " accepted_name='sbs_pool.accepted_assignments',\n", + " ),\n", + " # Budget spent for every pool\n", + " metrics.SpentBudgetOnPool(\n", + " pipeline.verification_pool.id,\n", + " 'verification_pool.expenses'\n", + " ),\n", + " metrics.SpentBudgetOnPool(\n", + " pipeline.find_items_pool.id,\n", + " 'find_items_pool.expenses'\n", + " ),\n", + " metrics.SpentBudgetOnPool(\n", + " pipeline.sbs_pool.id,\n", + " 'sbs_pool.expenses'\n", + " )\n", + " ],\n", + " callback=graphite_logger\n", + ")\n", + "\n", + "# You can specify toloka_client argument in each metric instead of calling\n", + "# bind_client if you want to use different clients for different metrics\n", + "metrics.bind_client(metric_collector.metrics, toloka_client)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running pipeline\n", + "\n", + "Let's try to launch our pipeline and see metrics updated. Metrics will be sent to configured Graphite server.\n", + "\n", + "⚠️ **Be careful**:\n", + "real projects will be created and money will be spent in case of running in production environment! 
⚠️\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Google Colab already runs a global event loop,\n", + "# so to run our pipeline there we apply nest_asyncio, which allows nested event loops\n", + "if 'google.colab' in str(get_ipython()):\n", + " import nest_asyncio, asyncio\n", + " nest_asyncio.apply()\n", + " asyncio.get_event_loop().run_until_complete(asyncio.gather(metric_collector.run(), pipeline.run()))\n", + "else:\n", + " await asyncio.gather(metric_collector.run(), pipeline.run())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example of metrics displayed in Grafana with Graphite as the data source after pipeline completion.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \n", + "
\n", + " Figure 2. Grafana web view.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using Graphite in production\n", + "In normal usage it's better to gather metrics from Toloka once every ten minutes or less often, so you need to prepare your Graphite storage for that interval. Graphite typically already has a `count` aggregation rule that looks like this:\n", + "```\n", + " [count]\n", + " pattern = \\.count$\n", + " xFilesFactor = 0\n", + " aggregationMethod = sum\n", + "```\n", + "\n", + "\n", + "This means that every new metric whose name ends in ```.count``` is aggregated by summing its values whenever Graphite rolls the metric up over an interval.\n", + "\n", + "\n", + "Metrics that cannot be summed, such as a completion percentage, have no suitable rule by default, so you need to add one to ```storage-aggregation.conf```:\n", + "```\n", + " [metric]\n", + " pattern = _metric$\n", + " xFilesFactor = 0\n", + " aggregationMethod = average\n", + "```\n", + "\n", + "This means that a metric whose name ends in ```_metric``` is averaged whenever Graphite aggregates it over an interval." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You also need to set up a suitable retention for this metric in ```storage-schemas.conf```, for example:\n", + "```\n", + " [metric]\n", + " pattern = _metric$\n", + " retentions = 10m:7d,1h:360d\n", + "```" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/examples/4. cases-for-statistics-quality/metrics/img/assignments_in_pool_dash.png b/examples/4. 
cases-for-statistics-quality/metrics/img/assignments_in_pool_dash.png new file mode 100644 index 00000000..7b28fab9 Binary files /dev/null and b/examples/4. cases-for-statistics-quality/metrics/img/assignments_in_pool_dash.png differ diff --git a/examples/4. cases-for-statistics-quality/metrics/img/grafana_metrics.png b/examples/4. cases-for-statistics-quality/metrics/img/grafana_metrics.png new file mode 100644 index 00000000..b11c73be Binary files /dev/null and b/examples/4. cases-for-statistics-quality/metrics/img/grafana_metrics.png differ diff --git a/examples/4. cases-for-statistics-quality/metrics/img/pool_expenses_dash.png b/examples/4. cases-for-statistics-quality/metrics/img/pool_expenses_dash.png new file mode 100644 index 00000000..941ea8a3 Binary files /dev/null and b/examples/4. cases-for-statistics-quality/metrics/img/pool_expenses_dash.png differ diff --git a/examples/4. cases-for-statistics-quality/metrics/jupyter_dashboard.ipynb b/examples/4. cases-for-statistics-quality/metrics/jupyter_dashboard.ipynb new file mode 100644 index 00000000..5513dcb0 --- /dev/null +++ b/examples/4. cases-for-statistics-quality/metrics/jupyter_dashboard.ipynb @@ -0,0 +1,365 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Collect and show metrics in jupyter dashboard" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "In this example we will learn how to collect metrics using Toloka-kit and\n", + "display them right inside this Jupyter notebook."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install toloka-kit==0.1.26\n", + "!pip install crowd-kit==1.0.0\n", + "\n", + "import getpass\n", + "\n", + "import toloka.metrics as metrics\n", + "import toloka.client as toloka\n", + "from toloka.metrics.jupyter_dashboard import Chart, DashBoard\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'\n", + "print(toloka_client.get_requester())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this example we will run pipeline from [Streaming pipeline example](https://github.com/Toloka/toloka-kit/tree/main/examples/6.streaming_pipelines/streaming_pipelines.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/6.streaming_pipelines/streaming_pipelines.ipynb).\n", + "If you are running this jupyter notebook in colab please download necessary script with the following line of code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "!wget --quiet --show-progress \"https://raw.githubusercontent.com/Toloka/toloka-kit/main/examples/metrics/find_items_pipeline.py\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "from find_items_pipeline import FindItemsPipeline\n", + "pipeline = FindItemsPipeline(client=toloka_client)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "Create projects and pools needed for 
pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "pipeline.init_pipeline()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "## Configuring jupyter dashboard\n", + "\n", + "Let's create dashboard for all pools in the pipeline. For this example we will collect\n", + "a number of submitted assignments, accepted assignments and total expenses for each pool. All available metrics can be found in the [documentation](https://toloka.ai/en/docs/toloka-kit/reference/toloka.metrics.metrics.BaseMetric)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "new_dash = DashBoard(\n", + " [\n", + " # Manually configured charts may contain several metrics and draw all their lines in one chart.\n", + " # You must clearly specify chart name\n", + " # Be careful, if you add same metric type with default line names, you get pair of lines with same names.\n", + " Chart(\n", + " 'Assignments in pools',\n", + " # Assignments in pools. 
We will track submitted assignments and accepted assignments counts for every pool.\n", + " [metrics.AssignmentsInPool(\n", + " pipeline.verification_pool.id,\n", + " submitted_name='verification_pool.submitted_assignments',\n", + " accepted_name='verification_pool.accepted_assignments',\n", + " ),\n", + " metrics.AssignmentsInPool(\n", + " pipeline.find_items_pool.id,\n", + " submitted_name='find_items_pool.submitted_assignments',\n", + " accepted_name='find_items_pool.accepted_assignments',\n", + " ),\n", + " metrics.AssignmentsInPool(\n", + " pipeline.sbs_pool.id,\n", + " submitted_name='sbs_pool.submitted_assignments',\n", + " accepted_name='sbs_pool.accepted_assignments',\n", + " )]\n", + " ),\n", + " Chart(\n", + " 'Pools expenses',\n", + " [# Budget spent for every pool\n", + " metrics.SpentBudgetOnPool(\n", + " pipeline.verification_pool.id,\n", + " 'verification_pool.expenses'\n", + " ),\n", + " metrics.SpentBudgetOnPool(\n", + " pipeline.find_items_pool.id,\n", + " 'find_items_pool.expenses'\n", + " ),\n", + " metrics.SpentBudgetOnPool(\n", + " pipeline.sbs_pool.id,\n", + " 'sbs_pool.expenses'\n", + " ),]\n", + " ),\n", + " ],\n", + " update_seconds=2, # just example. In real dashboards it's better to drop this parameter\n", + " header='Find items pipeline dashboard',\n", + ")\n", + "\n", + "metrics.bind_client(new_dash.metrics, toloka_client)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Start dashboard. You can run other cells, while this dashboard is tracking metrics." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "new_dash.run_dash()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "---\n", + "**NOTE**\n", + "- This DashBoard is useful only for fast and pretty online demonstration in jupyter notebooks. 
For a real monitoring system, please use `toloka.metrics.MetricCollector` (see [Graphite example](https://github.com/Toloka/toloka-kit/tree/main/examples/metrics/graphite.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/tree/main/examples/metrics/graphite.ipynb))\n", + "- DashBoard is not saved with the Jupyter notebook. It uses `IPython.lib.display.IFrame`, so if you save the notebook and share the .ipynb file, it will contain no dashboard images.\n", + "- DashBoard works only with the current instance, so you cannot share it.\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running pipeline\n", + "\n", + "Let's try to launch our pipeline and see metrics updated. Metrics will be simultaneously drawn in the above dashboard.\n", + "\n", + "⚠️ **Be careful**:\n", + "real projects will be created and money will be spent if you run this in the production environment! ⚠️" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import asyncio\n", + "# Google Colab already runs a global event loop,\n", + "# so to run our pipeline there we apply nest_asyncio, which allows nested event loops\n", + "if 'google.colab' in str(get_ipython()):\n", + " import nest_asyncio, asyncio\n", + " nest_asyncio.apply()\n", + " asyncio.get_event_loop().run_until_complete(pipeline.run())\n", + "else:\n", + " await pipeline.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you want to stop the dashboard, call this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_dash.stop_dash()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After pipeline completion your dashboard should contain charts similar to these:\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \n", + "
\n", + " Figure 1. Assignments in pools metric chart.\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \n", + "
\n", + " Figure 2. Pools expenses metric chart.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "---\n", + "\n", + "## Tips for using Toloka-kit metrics\n", + "\n", + "- If you need several dashboards, create several instances of DashBoard:\n", + "```Python\n", + " dash_for_requester1 = DashBoard(toloka_client1, [...])\n", + " dash_for_requester2 = DashBoard(toloka_client2, [...])\n", + "```\n", + "Then run these instances in different cells on different ports:

\n", + "First cell:\n", + "```Python\n", + " dash_for_requester1.run_dash(port='8081')\n", + "```\n", + "Second cell:\n", + "```Python\n", + " dash_for_requester2.run_dash(port='8082')\n", + "```\n", + "- You can use the same dashboard for metrics from different clients:\n", + "```Python\n", + " toloka_client_1 = toloka.TolokaClient(, 'PRODUCTION')\n", + " toloka_client_2 = toloka.TolokaClient(, 'PRODUCTION')\n", + "\n", + " my_dash = DashBoard(\n", + " [\n", + " metrics.Balance(toloka_client=toloka_client_1),\n", + " metrics.Balance(toloka_client=toloka_client_2),\n", + " Chart(\n", + " 'Balance for both clients',\n", + " [\n", + " metrics.Balance(balance_name='first client', toloka_client=toloka_client_1),\n", + " metrics.Balance(balance_name='second client', toloka_client=toloka_client_2),\n", + " ]\n", + " )\n", + " ],\n", + " header='Dashboard for several clients',\n", + " )\n", + "```\n", + "You **don't** need to call `bind_client` afterwards because you have already bound a client to each metric.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/README.md b/examples/README.md index 9fce5cb3..11010b71 100644 --- a/examples/README.md +++ b/examples/README.md @@ -1,64 +1,64 @@ -# Toloka-kit usage examples -## _Data collection, markup, aggregation, and other applications_ - -Why it may be usefull: -- Easily reuse projects by just copying and pasting code. No need to configure parameters in the interface over and over again. -- Train your ML models and run your data labeling projects in the same environment. 
-- Take advantage of open-source code that anyone can use and contribute to. - -## Table of content - -| Example | Abstract | Key words | -| ------ | ------ | ------ | -| [Learn the basics](https://github.com/Toloka/toloka-kit/tree/main/examples/0.getting_started/0.learn_the_basics)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb) | The very first example explains the basics of working with Toloka and toloka-kit. Everything is explained by the example of the project on the classification of cats and dogs. |```Getting Started```, ```Classification```| -| | **Computer Vision** | -| [Image collection](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/image_collection)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/image_collection/image_collection.ipynb) | The goal for this project is to collect a dataset of dogs' and cats' images. Performers will be asked to take a photo of their pet and specify its species. |```CV```, ```Classification```, ```Collecting```, ```Dataset```| -| [Image classification](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/image_classification)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/image_classification/image_classification.ipynb) | An example of binary image classification, made on a dataset with cats and dogs. We ask performers to look at the pictures and decide what animal is in the picture. |```CV```, ```Classification```| -| [Object detection](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/object_detection)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/object_detection/object_detection.ipynb) | Example of solving the classic problem of annotating images for training detection algorithms. In real-world tasks, annotation is usually done with a polygon. We chose to use a rectangular outline to simplify the task so that we can reduce costs and speed things up. |```CV```, ```Segmentation```, ```Detection```, ```Bounding boxes```, ```Street```, ```Traffic sign```, ```Verification Project```| -| [HTR image gathering](https://github.com/tardis-forever/Handwriting-gathering-with-Toloka)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tardis-forever/Handwriting-gathering-with-Toloka/blob/main/handwriting-gathering.ipynb) | This is an example of simple handwriting images gathering pipeline. Resulting dataset can be used to train and evaluate HTR models. | ```CV```, ```HTR```, ```Texts```, ```Verification project```, ```Collecting```, ```Dataset```| -| [Blood cells classification](https://github.com/oleg-cat/blood-test/blob/main/blood-test.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oleg-cat/blood-test/blob/main/blood-test.ipynb) | In this project, we will show an image of a blood cell and a brief instruction for Toloka performers. Then, we will ask performers to choose which type of white blood cell they see on this image. | ```CV```, ```Classification```, ```Medicine```, ```Benchmark```| -| [Video collection](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/video_collection)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/video_collection/video_collection.ipynb) | The goal is to collect a set of video recordings where people show certain gestures, similar to popular emojis. There are several emoji combinations and we ask Tolokers to record a video similar to those emojis, meeting certain criteria about recording quality. |```CV```, ```Video```, ```Collecting```, ```Dataset```| -| [Text Recognition](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/text_recognition)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/text_recognition/text_recognition.ipynb) | We have a set of water meter images. We need to get each water meter’s readings. We ask performers to look at the images and write down the digits on each water meter. |```CV```, ```OCR```| -| | **NLP** | -| [Text classification](https://github.com/Toloka/toloka-kit/tree/main/examples/5.nlp/text_classification)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/5.nlp/text_classification/text_classification.ipynb) | We have a set of news article headlines. We need to get these classified according to whether they are clickbait or not. | ```NLP```, ```Classification```, ```Texts```| -| [Question answering on SQuAD](https://github.com/Toloka/toloka-kit/tree/main/examples/SQUAD2.0)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/SQUAD2.0/SQUAD2.0_processing.ipynb) | Solving the problem of question answering on the SQUAD2.0 dataset. Collects and validates answers for questions by human performers. One of the most popular tasks in natural language processing. | ```NLP```, ```Question Answering```, ```Texts```, ```Benchmark```, ```Verification Project```| -| [Sentiment analysis](https://github.com/Toloka/toloka-kit/tree/main/examples/5.nlp/sentiment_analysis)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/5.nlp/sentiment_analysis/sentiment_analysis.ipynb) | We have a set of customer reviews, and we need to classify them as “Positive” or “Negative”. We ask performers to read a review and decide which category it belongs to. | ```NLP```, ```Classification```, ```Text```| -| [Intent classification](https://github.com/Toloka/toloka-kit/tree/main/examples/5.nlp/intent_classification)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/5.nlp/intent_classification/intent_classification.ipynb) | We need to define which class the search query belongs to and distribute the queries between several categories inside the class. There’s a list of queries (related to travel and dining), each with an unknown class and category. | ```NLP```, ```Intent```, ```Classification```, ```Texts```| -| | **Audio analysis** | -| [Audio collection](https://github.com/Toloka/toloka-kit/tree/main/examples/3.audio_analysis/audio_collection)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/3.audio_analysis/audio_collection/audio_collection.ipynb) | We have a set of texts, and we need to get voice recordings of these texts. We ask performers to read the texts aloud and record themselves. Recordings like these are used for training voice assistants. |```ASR```, ```TTS```, ```Collecting```, ```Dataset```| -| [Audio classification](https://github.com/Toloka/toloka-kit/tree/main/examples/3.audio_analysis/audio_classification)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/3.audio_analysis/audio_classification/audio_classification.ipynb) | We have a set of voice recordings from different people. We need to get these classified according to the speaker’s gender. We ask performers to listen to the recordings and decide whether it is a man or a woman speaking. |```ASR```, ```TTS```, ```Classification```| -| [ASR/TTS based on Wikipedia articles](https://github.com/noath/asr-datasets-pipeline)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/noath/asr-datasets-pipeline/blob/main/ASR_pipeline.ipynb) | This example contains full speech data collecting pipeline from extracting raw texts to labeling and validating speech records. | ```ASR```, ```TTS```, ```Texts```, ```Verification project```, ```Audio samples collection```| -| [Audio transcription](https://github.com/Toloka/toloka-kit/tree/main/examples/3.audio_analysis/audio_transcription)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/3.audio_analysis/audio_transcription/audio_transcription.ipynb) | We have a set of audio recordings. We need to obtain a transcription of each recording. We ask performers to listen to the recordings and type what they hear. |```ASR```, ```Transcription```, ```Pipeline```, ```Post-acceptance```| -| | **Ranking** | -| [Side-by-side image comparison](https://github.com/Toloka/toloka-kit/tree/main/examples/4.ranking/side_by_side_image_comparision)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/4.ranking/side_by_side_image_comparision/side_by_side_comparision.ipynb) | We have a set of 6 icons. We need to find out which icon people prefer and determine the top icon out of the set. We show performers two icons each and ask them to choose the one they prefer. Then we aggregate these results to obtain the top icon. |```Ranking```, ```Side-by-side```| -| | **Spatial Crowdsourcing** | -| [Simplest Spatial Crowdsourcing](https://github.com/Toloka/toloka-kit/tree/main/examples/2.spatial_crowdsourcing/0.simplest_example)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/2.spatial_crowdsourcing/0.simplest_example/spatial_crowdsourcing.ipynb) | In this example, we will collect pictures of the Moscow metro entrances. This example also can be reused for production tasks such as monitoring the state of objects, checking the presence of an organization or other physical object. |```Spatial Crowdsourcing```, ```Outdoor monitoring```, ```Collecting```| -| | **Survey** | -| [Simplest survey](https://github.com/Toloka/toloka-kit/tree/main/examples/7.survey/simplest_survey)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/7.survey/simplest_survey/simplest_survey.ipynb) | The goal is to collect some information about how people manage stress and if they are ready to get a meditation app to do that. There is a survey where we ask some questions about stress level and management, meditation practices and users' habits concerning paid apps. |```Survey```, ```Collecting```| -| | **Pipelines** | -| [Simple Toloka+ML pipeline on Prefect](https://github.com/Toloka/toloka-kit/tree/main/examples/9.toloka_and_ml_on_prefect)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/9.toloka_and_ml_on_prefect/example.ipynb) | This example illustrates how crowdsourcing using Toloka can be made easier and cheaper by integrating an ML model. Furthermore, it shows how to run the whole project in the cloud using Prefect, which makes workflow orchestration much simpler. |```Prefect```, ```ML```, ```Autohelper```| -| [Building streaming pipelines in Toloka](https://github.com/Toloka/toloka-kit/tree/main/examples/6.streaming_pipelines)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/6.streaming_pipelines/streaming_pipelines.ipynb) | Let's solve the following task: find goods in an online store by a given image and arrange the found results by relevance. In this example we unite 3 different Toloka projects into one useful Pipeline. |```Pipeline```, ```Collecting```, ```Dataset```| -| | **Relevance** | -| [Search relevance](https://github.com/Toloka/toloka-kit/tree/main/examples/8.search_relevance)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/8.search_relevance/search_relevance.ipynb) | We have a set of search queries and products on a website. We need to determine the extent to which each query is relevant to the corresponding product on the website. We ask performers to look at the search query and the product image from the website and rate the relevance level. |```Relevance```| -| [Ad relevance](https://github.com/oleg-cat/checkadv/blob/main/checkadv.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oleg-cat/checkadv/blob/main/checkadv.ipynb) | In this example we aim to explore webpages containing ads and their descriptions. We will run the project using new Toloka Ready-to-go solutions (App Services). |```Relevance```| -| | **Benchmarks** | -| [Image classification](https://github.com/Toloka/toloka-kit/tree/main/examples/benchmarks/image_classification_cinic10.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/benchmarks/image_classification_cinic10.ipynb) | Image classification on CINIC-10. Minimal configuration to achieve the described levels of quality. Accuracy on Test = 88% |```Benchmark```, ```CV```, ```Classification```| -| [Text classification](https://github.com/Toloka/toloka-kit/tree/main/examples/benchmarks/text_classification_imdb.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/benchmarks/text_classification_imdb.ipynb) | Text classification on IMDB movie reviews. Minimal configuration to achieve the described levels of quality. Accuracy on Test = 89% |```Benchmark```, ```NLP```, ```Classification```| -| | **Metrics** | -| [Jupyter dashboard](https://github.com/Toloka/toloka-kit/tree/main/examples/metrics/jupyter_dashboard.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/metrics/jupyter_dashboard.ipynb) | An example of using jupyter dashboard to collect and display Toloka metrics inside jupyter notebook. | ```Metrics```, ```Visualization``` -| [Graphite](https://github.com/Toloka/toloka-kit/tree/main/examples/metrics/graphite.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/metrics/graphite.ipynb) | `MetricCollector` usage example. In this notebook you will learn how to collect Toloka metrics and send them to Graphite metrics server simultaneously. | ```Metrics```, ```Logging```, ```Graphite``` - - -# Need more examples? -If you have an example of data labeling using toloka-kit, do not hesitate to send it. Add a link to your GitHub repository and a description to this table via a [pull request](https://github.com/Toloka/toloka-kit/pulls). - -Ideally, a great example should contain the following aspects: -- Problem statement; -- How to set up a project; -- Where to get the data for the example; -- What to pay attention to when writing instructions; -- How to set up quality control; -- What is the final quality; -- Visualization of the obtained results; - -You may also ask any question or ask for a new example using [issues](https://github.com/Toloka/toloka-kit/issues) +# Toloka-kit usage examples +## _Data collection, markup, aggregation, and other applications_ + +Why it may be useful: +- Easily reuse projects by just copying and pasting code. No need to configure parameters in the interface over and over again. +- Train your ML models and run your data labeling projects in the same environment. +- Take advantage of open-source code that anyone can use and contribute to. + +## Table of contents + +| Example | Abstract | Key words | +| ------ | ------ | ------ | +| [Learn the basics](https://github.com/Toloka/toloka-kit/tree/main/examples/0.getting_started/0.learn_the_basics)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb) | The very first example explains the basics of working with Toloka and toloka-kit. Everything is explained by the example of the project on the classification of cats and dogs. |```Getting Started```, ```Classification```| +| | **Computer Vision** | +| [Image collection](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/image_collection)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/image_collection/image_collection.ipynb) | The goal for this project is to collect a dataset of dogs' and cats' images. Performers will be asked to take a photo of their pet and specify its species. |```CV```, ```Classification```, ```Collecting```, ```Dataset```| +| [Image classification](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/image_classification)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/image_classification/image_classification.ipynb) | An example of binary image classification, made on a dataset with cats and dogs. We ask performers to look at the pictures and decide what animal is in the picture. |```CV```, ```Classification```| +| [Object detection](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/object_detection)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/object_detection/object_detection.ipynb) | Example of solving the classic problem of annotating images for training detection algorithms. In real-world tasks, annotation is usually done with a polygon. We chose to use a rectangular outline to simplify the task so that we can reduce costs and speed things up. |```CV```, ```Segmentation```, ```Detection```, ```Bounding boxes```, ```Street```, ```Traffic sign```, ```Verification Project```| +| [HTR image gathering](https://github.com/tardis-forever/Handwriting-gathering-with-Toloka)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tardis-forever/Handwriting-gathering-with-Toloka/blob/main/handwriting-gathering.ipynb) | This is an example of simple handwriting images gathering pipeline. Resulting dataset can be used to train and evaluate HTR models. | ```CV```, ```HTR```, ```Texts```, ```Verification project```, ```Collecting```, ```Dataset```| +| [Blood cells classification](https://github.com/oleg-cat/blood-test/blob/main/blood-test.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oleg-cat/blood-test/blob/main/blood-test.ipynb) | In this project, we will show an image of a blood cell and a brief instruction for Toloka performers. Then, we will ask performers to choose which type of white blood cell they see on this image. | ```CV```, ```Classification```, ```Medicine```, ```Benchmark```| +| [Video collection](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/video_collection)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/video_collection/video_collection.ipynb) | The goal is to collect a set of video recordings where people show certain gestures, similar to popular emojis. There are several emoji combinations and we ask Tolokers to record a video similar to those emojis, meeting certain criteria about recording quality. |```CV```, ```Video```, ```Collecting```, ```Dataset```| +| [Text Recognition](https://github.com/Toloka/toloka-kit/tree/main/examples/1.computer_vision/text_recognition)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/1.computer_vision/text_recognition/text_recognition.ipynb) | We have a set of water meter images. We need to get each water meter’s readings. We ask performers to look at the images and write down the digits on each water meter. |```CV```, ```OCR```| +| | **NLP** | +| [Text classification](https://github.com/Toloka/toloka-kit/tree/main/examples/5.nlp/text_classification)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/5.nlp/text_classification/text_classification.ipynb) | We have a set of news article headlines. We need to get these classified according to whether they are clickbait or not. | ```NLP```, ```Classification```, ```Texts```| +| [Question answering on SQuAD](https://github.com/Toloka/toloka-kit/tree/main/examples/SQUAD2.0)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/SQUAD2.0/SQUAD2.0_processing.ipynb) | Solving the problem of question answering on the SQUAD2.0 dataset. Collects and validates answers for questions by human performers. One of the most popular tasks in natural language processing. | ```NLP```, ```Question Answering```, ```Texts```, ```Benchmark```, ```Verification Project```| +| [Sentiment analysis](https://github.com/Toloka/toloka-kit/tree/main/examples/5.nlp/sentiment_analysis)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/5.nlp/sentiment_analysis/sentiment_analysis.ipynb) | We have a set of customer reviews, and we need to classify them as “Positive” or “Negative”. We ask performers to read a review and decide which category it belongs to. | ```NLP```, ```Classification```, ```Text```| +| [Intent classification](https://github.com/Toloka/toloka-kit/tree/main/examples/5.nlp/intent_classification)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/5.nlp/intent_classification/intent_classification.ipynb) | We need to define which class the search query belongs to and distribute the queries between several categories inside the class. There’s a list of queries (related to travel and dining), each with an unknown class and category. | ```NLP```, ```Intent```, ```Classification```, ```Texts```| +| | **Audio analysis** | +| [Audio collection](https://github.com/Toloka/toloka-kit/tree/main/examples/3.audio_analysis/audio_collection)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/3.audio_analysis/audio_collection/audio_collection.ipynb) | We have a set of texts, and we need to get voice recordings of these texts. We ask performers to read the texts aloud and record themselves. Recordings like these are used for training voice assistants. |```ASR```, ```TTS```, ```Collecting```, ```Dataset```| +| [Audio classification](https://github.com/Toloka/toloka-kit/tree/main/examples/3.audio_analysis/audio_classification)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/3.audio_analysis/audio_classification/audio_classification.ipynb) | We have a set of voice recordings from different people. We need to get these classified according to the speaker’s gender. We ask performers to listen to the recordings and decide whether it is a man or a woman speaking. |```ASR```, ```TTS```, ```Classification```| +| [ASR/TTS based on Wikipedia articles](https://github.com/noath/asr-datasets-pipeline)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/noath/asr-datasets-pipeline/blob/main/ASR_pipeline.ipynb) | This example contains full speech data collecting pipeline from extracting raw texts to labeling and validating speech records. | ```ASR```, ```TTS```, ```Texts```, ```Verification project```, ```Audio samples collection```| +| [Audio transcription](https://github.com/Toloka/toloka-kit/tree/main/examples/3.audio_analysis/audio_transcription)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/3.audio_analysis/audio_transcription/audio_transcription.ipynb) | We have a set of audio recordings. We need to obtain a transcription of each recording. We ask performers to listen to the recordings and type what they hear. |```ASR```, ```Transcription```, ```Pipeline```, ```Post-acceptance```| +| | **Ranking** | +| [Side-by-side image comparison](https://github.com/Toloka/toloka-kit/tree/main/examples/4.ranking/side_by_side_image_comparision)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/4.ranking/side_by_side_image_comparision/side_by_side_comparision.ipynb) | We have a set of 6 icons. We need to find out which icon people prefer and determine the top icon out of the set. We show performers two icons each and ask them to choose the one they prefer. Then we aggregate these results to obtain the top icon. |```Ranking```, ```Side-by-side```| +| | **Spatial Crowdsourcing** | +| [Simplest Spatial Crowdsourcing](https://github.com/Toloka/toloka-kit/tree/main/examples/2.spatial_crowdsourcing/0.simplest_example)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/2.spatial_crowdsourcing/0.simplest_example/spatial_crowdsourcing.ipynb) | In this example, we will collect pictures of the Moscow metro entrances. This example also can be reused for production tasks such as monitoring the state of objects, checking the presence of an organization or other physical object. |```Spatial Crowdsourcing```, ```Outdoor monitoring```, ```Collecting```| +| | **Survey** | +| [Simplest survey](https://github.com/Toloka/toloka-kit/tree/main/examples/7.survey/simplest_survey)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/7.survey/simplest_survey/simplest_survey.ipynb) | The goal is to collect some information about how people manage stress and if they are ready to get a meditation app to do that. There is a survey where we ask some questions about stress level and management, meditation practices and users' habits concerning paid apps. |```Survey```, ```Collecting```| +| | **Pipelines** | +| [Simple Toloka+ML pipeline on Prefect](https://github.com/Toloka/toloka-kit/tree/main/examples/9.toloka_and_ml_on_prefect)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/9.toloka_and_ml_on_prefect/example.ipynb) | This example illustrates how crowdsourcing using Toloka can be made easier and cheaper by integrating an ML model. Furthermore, it shows how to run the whole project in the cloud using Prefect, which makes workflow orchestration much simpler. |```Prefect```, ```ML```, ```Autohelper```| +| [Building streaming pipelines in Toloka](https://github.com/Toloka/toloka-kit/tree/main/examples/6.streaming_pipelines)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/6.streaming_pipelines/streaming_pipelines.ipynb) | Let's solve the following task: find goods in an online store by a given image and arrange the found results by relevance. In this example we unite 3 different Toloka projects into one useful Pipeline. |```Pipeline```, ```Collecting```, ```Dataset```| +| | **Relevance** | +| [Search relevance](https://github.com/Toloka/toloka-kit/tree/main/examples/8.search_relevance)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/8.search_relevance/search_relevance.ipynb) | We have a set of search queries and products on a website. We need to determine the extent to which each query is relevant to the corresponding product on the website. We ask performers to look at the search query and the product image from the website and rate the relevance level. |```Relevance```| +| [Ad relevance](https://github.com/oleg-cat/checkadv/blob/main/checkadv.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oleg-cat/checkadv/blob/main/checkadv.ipynb) | In this example we aim to explore webpages containing ads and their descriptions. We will run the project using new Toloka Ready-to-go solutions (App Services). |```Relevance```| +| | **Benchmarks** | +| [Image classification](https://github.com/Toloka/toloka-kit/tree/main/examples/benchmarks/image_classification_cinic10.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/benchmarks/image_classification_cinic10.ipynb) | Image classification on CINIC-10. Minimal configuration to achieve the described levels of quality. Accuracy on Test = 88% |```Benchmark```, ```CV```, ```Classification```| +| [Text classification](https://github.com/Toloka/toloka-kit/tree/main/examples/benchmarks/text_classification_imdb.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/benchmarks/text_classification_imdb.ipynb) | Text classification on IMDB movie reviews. Minimal configuration to achieve the described levels of quality. Accuracy on Test = 89% |```Benchmark```, ```NLP```, ```Classification```| +| | **Metrics** | +| [Jupyter dashboard](https://github.com/Toloka/toloka-kit/tree/main/examples/metrics/jupyter_dashboard.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/metrics/jupyter_dashboard.ipynb) | An example of using jupyter dashboard to collect and display Toloka metrics inside jupyter notebook. | ```Metrics```, ```Visualization``` +| [Graphite](https://github.com/Toloka/toloka-kit/tree/main/examples/metrics/graphite.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/metrics/graphite.ipynb) | `MetricCollector` usage example. In this notebook you will learn how to collect Toloka metrics and send them to Graphite metrics server simultaneously. | ```Metrics```, ```Logging```, ```Graphite``` + + +# Need more examples? +If you have an example of data labeling using toloka-kit, do not hesitate to send it. Add a link to your GitHub repository and a description to this table via a [pull request](https://github.com/Toloka/toloka-kit/pulls). + +Ideally, a great example should contain the following aspects: +- Problem statement; +- How to set up a project; +- Where to get the data for the example; +- What to pay attention to when writing instructions; +- How to set up quality control; +- What is the final quality; +- Visualization of the obtained results; + +You may also ask any question or ask for a new example using [issues](https://github.com/Toloka/toloka-kit/issues) diff --git a/examples/faces_detection/faces_detection.ipynb b/examples/faces_detection/faces_detection.ipynb index c0df8c9c..b2e598e9 100644 --- a/examples/faces_detection/faces_detection.ipynb +++ b/examples/faces_detection/faces_detection.ipynb @@ -1,980 +1,980 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Annotating ground truth for object detection\n", - "\n", - "The goal of this notebook is to annotate images which can later be used for training object detection algorithms.\n", - "\n", - "We will configure and run the annotation project in Toloka from scratch.\n", - "\n", - "Follow the steps in this notebook to get the labeled data at the end.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "To proceed with this notebook:\n", - "- Make sure you are 
[registered](https://toloka.ai/get-started/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-1) in Toloka as a requester.\n", - "- Use the promo code **OBJECT_DETECTION_PROMO** to add $10 to your account on your [profile page](https://platform.toloka.ai/requester/profile/income/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-2). You can use it to pay for data labeling while following this tutorial.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## The challenge\n", - "We have a set of photos with people in them:\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Sample\n", - "
\n", - " Figure 1. Sample photo of people\n", - "
\n", - "\n", - "We need to outline every person's face. Ultimately, we need to get a set of pixelwise boundaries that represent the people's faces in each photo. Here’s what it can look like:\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Example\n", - "
\n", - " Figure 2. An example of how face detection can be performed\n", - "
\n", - "\n", - "This type of annotation often uses polygons. We are using bounding boxes to simplify the task so that we can reduce costs and speed things up. You can still use polygons, just uncomment the related pieces of code." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Detailed task description\n", - "In this notebook, we will implement projects #2 and #3 from [this Toloka tutorial](https://toloka.ai/en/docs/guide/concepts/image-segmentation-overview?utm_source=github&utm_medium=instruction-b&utm_campaign=link-3). Feel free to check them out if you want to learn how to configure these projects in the web interface. Otherwise you can skip it, since this notebook covers all the steps using the Toloka API.\n", - "\n", - "We’ll skip the first project “Does the image contain a specific object?” from the tutorial above because it’s easy to implement using our “verification project” code.\n", - "\n", - "We will implement two projects in Toloka:\n", - "- A detection project called “[Select an object in the image](https://toloka.ai/en/docs/guide/concepts/image-segmentation-project2?utm_source=github&utm_medium=instruction-b&utm_campaign=link-4)”: Tolokers will select image areas that contain faces. Tolokers are people around the world who get paid for completing your tasks. 
\n", - "- A verification project called “[Are the bounding boxes correct?](https://toloka.ai/en/docs/guide/concepts/image-segmentation-project3?utm_source=github&utm_medium=instruction-b&utm_campaign=link-5)”: Tolokers will check the annotated images.\n", - "\n", - "We don’t use [control tasks](https://toloka.ai/en/docs/guide/concepts/goldenset?utm_source=github&utm_medium=instruction-b&utm_campaign=link-6) or [majority vote](https://toloka.ai/en/docs/guide/concepts/mvote?utm_source=github&utm_medium=instruction-b&utm_campaign=link-7) to control quality in the detection project because we can’t expect the image annotations provided by the Tolokers to match each other exactly to the pixel. Instead, we’ll check detection results in the second project, where a different group of Tolokers will decide whether the faces are annotated correctly or not." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Set up the environment" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Prepare the environment and import necessary libraries." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install toloka-kit==1.0.0 # To interact with Toloka API\n", - "!pip install ipyplot # To plot images inside Jupyter Notebooks cells\n", - "!pip install crowd-kit==1.1.0\n", - "\n", - "import os\n", - "import datetime\n", - "import time\n", - "import logging\n", - "import sys\n", - "\n", - "import pandas as pd # To perform data manipulation\n", - "import ipyplot\n", - "\n", - "\n", - "from typing import List\n", - "from toloka.streaming.event import AssignmentEvent\n", - "\n", - "import toloka.client as toloka\n", - "import toloka.client.project.template_builder as tb\n", - "\n", - "from crowdkit.aggregation import MajorityVote\n", - "\n", - "logging.basicConfig(\n", - " format='[%(levelname)s] %(name)s: %(message)s',\n", - " level=logging.INFO,\n", - " stream=sys.stdout,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a toloka-client instance. All API calls will pass through it.\n", - "\n", - "To get an OAuth token, follow these [instructions](https://toloka.ai/en/docs/api/concepts/access/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-8). Enter your token." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "toloka_client = toloka.TolokaClient(input(\"Enter your token:\"), 'PRODUCTION')\n", - "print(toloka_client.get_requester())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Check out Toloka documentation to learn more about [the Toloka API](https://toloka.ai/en/docs/api/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-9) and [Toloka-Kit](https://toloka.ai/en/docs/toloka-kit/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-10)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Review the dataset\n", - "We use the [Human Parts Dataset](https://github.com/xiaojie1017/Human-Parts/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-11), distributed under the MIT license.\n", - "\n", - "Our dataset is just a collection of URLs leading to JPEG images. GIF, PNG, and WEBP formats are also supported.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "!curl https://tlk.s3.yandex.net/dataset/faces_detection.tsv --output dataset.tsv # TODO\n", - "\n", - "# Load the dataset of links to a pd DataFrame\n", - "dataset = pd.read_csv('dataset.tsv', sep='\\t')\n", - "\n", - "# Plot 5 images from dataset to verify data loading\n", - "ipyplot.plot_images(\n", - " [url for url in dataset['image'].sample(n=50)],\n", - " max_images=5,\n", - " img_width=1000\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "---\n", - "## Create a new detection project\n", - "\n", - "In this project, Tolokers select image areas that contain faces.\n", - "\n", - "The first step is to configure how Tolokers will see the tasks:\n", - "- Write instructions.\n", - "- Define the input and output formats.\n", - "\n", - "**Note:** It’s important to write clear instructions with examples to make sure the Tolokers do exactly what we need them to. We also recommend checking the task interface yourself to see if there are any problems with the layout. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# How tolokers will see the task\n", - "project_interface = toloka.project.TemplateBuilderViewSpec(\n", - " view=tb.ImageAnnotationFieldV1( # Component that selects areas in images\n", - " tb.OutputData('result'), # Path for writing output data\n", - " tb.InputData('image'), # Getter for the input image\n", - " shapes={'rectangle': True}, # Allow to select only rectangular areas #polygons\n", - " validation=tb.RequiredConditionV1(hint='Please select an area'), # At least one area should be selected\n", - " full_height=True\n", - " )\n", - ")\n", - "\n", - "# You can write instructions and upload them from a file or enter them later in the web interface\n", - "# In case of polygon markup adjust the instruction describing the rules of putting the polygons instead of bounding boxes\n", - "prepared_instruction = open('./instructions/detection_instruction.html').read().strip()\n", - "\n", - "# Set up the project\n", - "detection_project = toloka.Project(\n", - " public_name='Outline all people faces with bounding boxes',\n", - " public_description='Find and outline all people faces with bounding boxes.',\n", - " public_instructions=prepared_instruction,\n", - " # Set up the task: view, input, and output parameters\n", - " task_spec=toloka.project.task_spec.TaskSpec(\n", - " input_spec={'image': toloka.project.UrlSpec()},\n", - " output_spec={'result': toloka.project.JsonSpec()},\n", - " view_spec=project_interface,\n", - " ),\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Call the API to create a new project." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "detection_project = toloka_client.create_project(detection_project)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Review your project and check the task interface\n", - "\n", - "Visit the project page to make sure the task interface is working correctly.\n", - "\n", - "To do this:\n", - "1. Follow the link you got in the output above.\n", - "2. In the project interface, click **Project actions** on the top right.\n", - "3. Click **Preview** in the menu.\n", - "4. Click **Change input data**.\n", - "5. Insert an image URL into the **Image** field. Here, you can use a link to one of the images you want to label, or [this sample image](https://tlk.s3.yandex.net/dataset/faces_detection/5fc9583cbda646c7de9c6451b7f22a165b6b7d4b.jpg).\n", - "6. Click the **Instructions** button. Make sure the instructions are visible and valid.\n", - "7. Try to select multiple areas with a rectangle using **Box annotation tool**.\n", - "8. Click **Submit** and then click **View responses**.\n", - "\n", - "A window with the results will appear. Check that the results are in the expected format and that the data is being entered correctly.\n", - "\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Task\n", - "
\n", - " Figure 3. What the results window might look like\n", - "
\n", - "\n", - "**Tip:** Do a trial run with a small sample of your data. Make sure that after running a trial for the entire pipeline, you get data with the expected format and quality. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Add custom skills for tolokers\n", - "\n", - "A [skill](https://toloka.ai/en/docs/guide/concepts/nav?utm_source=github&utm_medium=instruction-b&utm_campaign=link-12) is a characteristic of a Toloker. For example, you can record the percentage of correct responses as a skill.\n", - "\n", - "In this project, we’ll create two skills:\n", - "- **Detection skill**: Shows that a Toloker has completed at least one detection task. We’ll later filter out these Tolokers from verification tasks to ensure that the people doing the verification are not the same people who performed the detection.\n", - "- **Verification skill**: How good the current Toloker’s verification results are compared to others. We’ll need this skill later when we aggregate the results of the second project. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "detection_skill = next(toloka_client.get_skills(name='Area selection of people faces'), None)\n", - "if detection_skill:\n", - " print('Detection skill already exists')\n", - "else:\n", - " detection_skill = toloka_client.create_skill(\n", - " name='Area selection of people faces',\n", - " hidden=True,\n", - " public_requester_description={'EN': 'Toloker is annotating people faces'},\n", - " )\n", - "\n", - "verification_skill = next(toloka_client.get_skills(name='People faces detection verification'), None)\n", - "if verification_skill:\n", - " print('Verification skill already exists')\n", - "else:\n", - " verification_skill = toloka_client.create_skill(\n", - " name='People faces detection verification',\n", - " hidden=True,\n", - " public_requester_description={'EN': 'How good a toloker is at verifying detection tasks'},\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Pool creation for a detection project\n", - "A [pool](https://toloka.ai/en/docs/guide/concepts/pool-main?utm_source=github&utm_medium=instruction-b&utm_campaign=link-13) is a set of tasks sent out for Tolokers.\n", - "\n", - "First, create an instance of a pool and set its basic parameters:\n", - "- Payment amount per task.\n", - "- Non-automatic acceptance of results.\n", - "- Number of tasks Tolokers will see per page.\n", - "- Toloker selection filters to control who can access this task." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "detection_pool = toloka.Pool(\n", - " project_id=detection_project.id,\n", - " private_name='Pool 1', # Only you can see this information.\n", - " may_contain_adult_content=False,\n", - " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365), # Pool will automatically close after one year\n", - " reward_per_assignment=0.01, # Set the minimum payment amount for one task page\n", - " auto_accept_solutions=False, # Only pay the toloker for completing the task,\n", - " # based on the verification results of the second project.\n", - "\n", - " auto_accept_period_day=7, # Number of days to determine if we'll pay for task completion by this toloker or not.\n", - " assignment_max_duration_seconds=60*20, # Give tolokers 20 minutes maximum to complete one task page.\n", - " defaults=toloka.pool.Pool.Defaults(\n", - " # We don't need overlapping for detection tasks, so we set it to 1\n", - " default_overlap_for_new_task_suites=1,\n", - " default_overlap_for_new_tasks=1,\n", - " ),\n", - ")\n", - "\n", - "# Set the number of tasks per page\n", - "detection_pool.set_mixer_config(real_tasks_count=1)\n", - "# Please note that the payment amount specified when creating the pool is the amount the toloker receives for completing one page of tasks.\n", - "# If you specify 10 tasks per page above, then reward_per_assignment will be paid for completing 10 tasks." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We’ll only show our tasks to English-speaking users because the description of the task is in English. This means that only people who speak English will be able to accept this task. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "detection_pool.filter = toloka.filter.Languages.in_('EN')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Quality control rules**\n", - "\n", - "Each [quality control rule](https://toloka.ai/en/docs/guide/concepts/control?utm_source=github&utm_medium=instruction-b&utm_campaign=link-14) consists of the following:\n", - "- **Collector**: How to collect statistics and which metrics can be used in this rule.\n", - "- **Condition**: When the rule will be triggered. Under this condition, only parameters that apply to the collector can be used.\n", - "- **Action**: What to do if the condition is true." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# The first rule in this project restricts pool access for tolokers who often make mistakes\n", - "detection_pool.quality_control.add_action(\n", - " collector=toloka.collectors.AcceptanceRate(),\n", - " conditions=[\n", - " # Toloker completed more than 2 tasks\n", - " toloka.conditions.TotalAssignmentsCount > 2,\n", - " # And more than 35% of their responses were rejected\n", - " toloka.conditions.RejectedAssignmentsRate > 35,\n", - " ],\n", - " # This action tells Toloka what to do if the condition above is True\n", - " # In our case, we'll restrict access for 15 days\n", - " # Always leave a comment: it may be useful later on\n", - " action=toloka.actions.RestrictionV2(\n", - " scope='ALL_PROJECTS',\n", - " duration=15,\n", - " duration_unit='DAYS',\n", - " private_comment='Tolokers often make mistakes', # Only you will see this comment\n", - " )\n", - ")\n", - "\n", - "# The second useful rule is \"Fast responses\". 
It allows us to filter out tolokers who respond too quickly.\n", - "detection_pool.quality_control.add_action(\n", - " # Let's monitor fast submissions for the last 5 completed task pages\n", - " # And define ones that take less than 20 seconds as quick responses.\n", - " collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=20),\n", - " # If we see more than one fast response, we ban the toloker from all our projects for 10 days.\n", - " conditions=[toloka.conditions.FastSubmittedCount > 1],\n", - " action=toloka.actions.RestrictionV2(\n", - " scope='ALL_PROJECTS',\n", - " duration=10,\n", - " duration_unit='DAYS',\n", - " private_comment='Fast responses', # Only you will see this comment\n", - " )\n", - ")\n", - "\n", - "# Another rule we use is for automatically updating skills\n", - "# We update the detection skill for tolokers who complete at least one page of tasks from detection pool.\n", - "detection_pool.quality_control.add_action(\n", - " collector=toloka.collectors.AnswerCount(),\n", - " # If toloker completed at least one task, it sets the new skill to 1\n", - " conditions=[toloka.conditions.AssignmentsAcceptedCount > 0],\n", - " action=toloka.actions.SetSkill(skill_id=detection_skill.id, skill_value=1),\n", - ")\n", - "\n", - "# This rule sends rejected assignments (tasks that you rejected) to other tolokers according to specified parameters.\n", - "detection_pool.quality_control.add_action(\n", - " collector=toloka.collectors.AssignmentsAssessment(),\n", - " # Check if a task was rejected\n", - " conditions=[toloka.conditions.AssessmentEvent == 'REJECT'],\n", - " # If the condition is True, add 1 to overlap and open the pool\n", - " action=toloka.actions.ChangeOverlap(delta=1, open_pool=True),\n", - ")\n", - "\n", - "print('Quality rules count:', len(detection_pool.quality_control.configs))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create a pool with all specified 
conditions\n", - "\n", - "Now we call the Toloka API to create a pool in the detection project. Afterwards, you can check the pool in the web interface. You’ll see there aren’t any tasks in it. We’ll add them later. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "detection_pool = toloka_client.create_pool(detection_pool)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "---\n", - "## Create a new project for verification\n", - "In this project, Tolokers will determine whether the faces were outlined correctly or not.\n", - "\n", - "This will be a standard classification project with only two classes: `OK` and `BAD`. We’ll explicitly define these labels as the output values. \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Configure task interface: how tolokers will see the task\n", - "verification_interface = toloka.project.TemplateBuilderViewSpec(\n", - " view=tb.ListViewV1( # List of components that should be positioned from top to bottom in the UI\n", - " [\n", - " tb.ImageAnnotationFieldV1( # Image and selected areas to verify\n", - " tb.InternalData('selection',\n", - " default=tb.InputData('selection')), # Use the input field as default value to display the selected areas\n", - " tb.InputData('image'),\n", - " disabled=True, # Disable adding and deleting areas\n", - " full_height=True\n", - " ),\n", - " tb.RadioGroupFieldV1( # A component for selecting one value out of several options\n", - " tb.OutputData('result'), # Path for writing output data\n", - " [\n", - " tb.GroupFieldOption('OK', 'Yes'),\n", - " tb.GroupFieldOption('BAD', 'No'),\n", - " ],\n", - " label='Are all people faces outlined correctly?', # Label above the options\n", - " validation=tb.RequiredConditionV1() # Requirement to select one of the options\n", - " )\n", - " ]\n", - " ),\n", - " plugins=[\n", - " 
tb.HotkeysPluginV1( # Shortcuts for selecting options using the keyboard\n", - " key_1=tb.SetActionV1(tb.OutputData('result'), 'OK'),\n", - " key_2=tb.SetActionV1(tb.OutputData('result'), 'BAD')\n", - " )\n", - " ]\n", - ")\n", - "\n", - "# You can write instructions and upload them from a file or enter them later in the web interface\n", - "# In case of polygon markup adjust the instruction describing the rules of putting the polygons instead of bounding boxes\n", - "verification_instruction = open('./instructions/verification_instruction.html').read().strip()\n", - "\n", - "# Set up the project\n", - "verification_project = toloka.Project(\n", - " public_name='Are the people faces outlined correctly?',\n", - " public_description='Look at the image and decide whether or not the people faces are outlined correctly',\n", - " public_instructions=verification_instruction,\n", - " # Set up the task: view, input, and output parameters\n", - " task_spec=toloka.project.task_spec.TaskSpec(\n", - " input_spec={\n", - " 'image': toloka.project.UrlSpec(),\n", - " 'selection': toloka.project.JsonSpec(),\n", - " 'assignment_id': toloka.project.StringSpec(),\n", - " },\n", - " # Set allowed_values, we'll use smart mixing to get the results of this project\n", - " output_spec={'result': toloka.project.StringSpec(allowed_values=['OK', 'BAD'])},\n", - " view_spec=verification_interface,\n", - " ),\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Call the API to create a new project." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "verification_project = toloka_client.create_project(verification_project)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Examine the project in the web interface:\n", - "\n", - "1. After running the code above, you’ll get a link to your project. 
Follow this link to check the task interface and instructions.\n", - "\n", - " **Note:** You will see almost the same interface as in the previous project, just without the ability to select areas. It’s important to make sure that the annotation results from the first project are displayed correctly in the second one.\n", - " \n", - " \n", - "2. Open the task **Preview** in the first project.\n", - "\n", - "\n", - "3. Outline the faces and click **Submit**.\n", - "\n", - "\n", - "4. Copy the results.\n", - "\n", - "\n", - "5. Now open the **Preview** of the second project.\n", - "\n", - "\n", - "6. Click **Change input data** and paste the annotation results in the selection field.\n", - "\n", - "\n", - "7. Click **Apply** and make sure the annotation displays correctly. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create and set up a pool in the verification project\n", - "Now we need to add a filter for this pool. Specify Tolokers that don’t have the detection skill (we want to avoid using Tolokers who did the detection tasks). You can combine multiple conditions using the `&` and `|` operators.\n", - "\n", - "**Note:** Add two quality control rules with the same collector but with different conditions and actions. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "verification_pool = toloka.Pool(\n", - " project_id=verification_project.id,\n", - " private_name='Pool 1. 
People faces verification', # Only you can see this information.\n", - " may_contain_adult_content=False,\n", - " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365), # Pool will close automatically after one year\n", - " reward_per_assignment=0.01, # We set the minimum payment amount for one task page\n", - " # By default, auto_accept_solutions is on,\n", - " # so we'll pay for all the tasks without checking results.\n", - " assignment_max_duration_seconds=60*10, # Give tolokers 10 minutes to complete one task page\n", - " defaults=toloka.pool.Pool.Defaults(\n", - " # We need an overlap to compare the tolokers among themselves,\n", - " default_overlap_for_new_task_suites=5,\n", - " ),\n", - ")\n", - "\n", - "# We'll only show our tasks to English-speaking users because the description of the task is in English.\n", - "# We also won't allow our verification tasks to be performed by users who performed detection tasks.\n", - "verification_pool.filter = (\n", - " (toloka.filter.Languages.in_('EN')) &\n", - " (toloka.filter.Skill(detection_skill.id) == None)\n", - ")\n", - "\n", - "# Set up quality control\n", - "# Quality is based on the majority of matching responses from tolokers who completed the same task.\n", - "verification_pool.quality_control.add_action(\n", - " collector=toloka.collectors.MajorityVote(answer_threshold=3),\n", - " # If a toloker has 10 or more responses\n", - " # And the responses are correct in less than 50% of cases,\n", - " conditions=[\n", - " toloka.conditions.TotalAnswersCount > 9,\n", - " toloka.conditions.CorrectAnswersRate < 50,\n", - " ],\n", - " # We ban the toloker from all our projects for 10 days.\n", - " action=toloka.actions.RestrictionV2(\n", - " scope='ALL_PROJECTS',\n", - " duration=10,\n", - " duration_unit='DAYS',\n", - " private_comment=' Doesn\\'t match the majority', # Only you will see this comment\n", - " )\n", - ")\n", - "\n", - "# Set up the new skill value using MajorityVote.\n", - "# Depending 
on the percentage of correct responses, we increase the value of the toloker's skill.\n", - "verification_pool.quality_control.add_action(\n", - " collector=toloka.collectors.MajorityVote(answer_threshold=3, history_size=10),\n", - " conditions=[\n", - " toloka.conditions.TotalAnswersCount > 2,\n", - " ],\n", - " action=toloka.actions.SetSkillFromOutputField(\n", - " skill_id=verification_skill.id,\n", - " from_field='correct_answers_rate',\n", - " ),\n", - ")\n", - "print(f'Quality rule count:{len(verification_pool.quality_control.configs)}')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create a pool" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Set the task count for one page\n", - "verification_pool.set_mixer_config(\n", - " real_tasks_count=10,\n", - " force_last_assignment=True,\n", - ")\n", - "\n", - "verification_pool = toloka_client.create_pool(verification_pool)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "---\n", - "## Add tasks to pools and run the projects\n", - "At this point, we have configured two projects, and now we can upload the real data that we want to annotate." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tasks = [\n", - " toloka.Task(input_values={'image': url}, pool_id=detection_pool.id)\n", - " for url in dataset['image'].values\n", - "]\n", - "# Add tasks to a pool\n", - "toloka_client.create_tasks(tasks, allow_defaults=True,)\n", - "\n", - "detection_pool = toloka_client.open_pool(detection_pool.id)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Visit the pool page in the web interface and make sure the number of tasks is correct and the pool is running. Some tasks may already be completed.\n", - "\n", - "\n", - " \n", - " \n", - "
\n", - " \"Pool\n", - "
\n", - " Figure 4. What a running pool might look like\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Tolokers work really fast, but they still need time to complete their tasks. We’ll use streaming to start the verification project without having to wait until the detection pool closes completely.\n", - "\n", - "Next, [review](https://toloka.ai/en/docs/guide/concepts/offline-accept?utm_source=github&utm_medium=instruction-b&utm_campaign=link-15) the detection results in the web interface." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from toloka.streaming import AssignmentsObserver, Pipeline\n", - "from toloka.streaming.storage import JSONLocalStorage" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# class for handling submissions in the detection pool\n", - "class DetectionSubmittedHandler:\n", - " def __init__(self, client, verification_pool_id):\n", - " self.client = client\n", - " self.verification_pool_id = verification_pool_id\n", - "\n", - " # create new tasks for the verification pool\n", - " def __call__(self, events: List[AssignmentEvent]) -> None:\n", - " verification_tasks = [\n", - " toloka.Task(\n", - " pool_id=self.verification_pool_id,\n", - " input_values={\n", - " 'image': event.assignment.tasks[0].input_values['image'],\n", - " 'selection': event.assignment.solutions[0].output_values['result'],\n", - " 'assignment_id': event.assignment.id,\n", - " }\n", - " )\n", - " for event in events\n", - " ]\n", - " self.client.create_tasks(verification_tasks, allow_defaults=True, open_pool=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# class for handling accepted tasks in the verification pool\n", - "class VerificationDoneHandler:\n", - " def __init__(self, client, verification_skill_id):\n", - " self.microtasks = pd.DataFrame([], columns=['task', 'label', 
'worker'])\n", - " self.client = client\n", - " self.verification_skill_id = verification_skill_id\n", - "\n", - " # filter out tasks that already have enough overlap and aggregate the result\n", - " def __call__(self, events: List[AssignmentEvent]) -> None:\n", - " # Initializing data\n", - " microtasks = pd.concat([self.microtasks, self.as_frame(events)])\n", - " # get user skills for aggregation\n", - " skills = pd.Series({\n", - " skill.user_id: skill.value\n", - " for skill in self.client.get_user_skills(skill_id=self.verification_skill_id)\n", - " })\n", - "\n", - " # Filtering all microtasks that have overlap of 5\n", - " microtasks['overlap'] = microtasks.groupby('task')['task'].transform('count')\n", - " to_aggregate = microtasks[microtasks['overlap'] >= 5]\n", - " microtasks = microtasks[microtasks['overlap'] < 5]\n", - " aggregated = MajorityVote(on_missing_skill='value', default_skill=0).fit_predict(to_aggregate, skills)\n", - " # Accepting or rejecting assignments\n", - " for assignment_id, result in aggregated.items():\n", - " if result == 'OK':\n", - " self.client.accept_assignment(assignment_id, 'Well done!')\n", - " else:\n", - " self.client.reject_assignment(assignment_id, 'The object wasn\\'t selected or was selected incorrectly.')\n", - "\n", - " # Updating microtasks\n", - " self.microtasks = microtasks[['task', 'label', 'worker']]\n", - "\n", - " # get the data necessary for aggregation\n", - " @staticmethod\n", - " def as_frame(events: List[AssignmentEvent]) -> pd.DataFrame:\n", - " microtasks = [\n", - " (task.input_values['assignment_id'], solution.output_values['result'], event.assignment.user_id)\n", - " for event in events\n", - " for task, solution in zip(event.assignment.tasks, event.assignment.solutions)\n", - " ]\n", - " return pd.DataFrame(microtasks, columns=['task', 'label', 'worker'])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We’ll create a pipeline with an observer for each pool.\n", - "\n", - 
"Depending on the number of images in the detection pool and the time of day, the whole process can take from a few minutes to almost an hour. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "detection_observer = AssignmentsObserver(toloka_client, detection_pool.id)\n", - "detection_observer.on_submitted(DetectionSubmittedHandler(toloka_client, verification_pool.id))\n", - "verification_observer = AssignmentsObserver(toloka_client, verification_pool.id)\n", - "verification_observer.on_accepted(VerificationDoneHandler(toloka_client, verification_skill.id))\n", - "\n", - "# Create a local directory that will store pipeline progress and logs.\n", - "# It allows to restart the pipeline without losing data in case of pause or failure.\n", - "storage_path = './storage/'\n", - "if not os.path.exists(storage_path):\n", - " os.makedirs(storage_path)\n", - "storage = JSONLocalStorage(storage_path)\n", - "\n", - "pipeline = Pipeline(storage=storage)\n", - "pipeline.register(detection_observer)\n", - "pipeline.register(verification_observer)\n", - "\n", - "# Google Colab is using a global event pool,\n", - "# so in order to run our pipeline we have to apply nest_asyncio to create an inner pool\n", - "if 'google.colab' in str(get_ipython()):\n", - " import nest_asyncio, asyncio\n", - " nest_asyncio.apply()\n", - " asyncio.get_event_loop().run_until_complete(pipeline.run())\n", - "else:\n", - " await pipeline.run()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "---\n", - "## Get the results\n", - "Now you can download all the accepted tasks from the detection pool and work with them. In this notebook, we’ll only show the detection results. 
The code below will display the results of the bounding boxes markup.\n", - "\n", - "You can also [download](https://toloka.ai/en/docs/guide/concepts/result-of-eval?utm_source=github&utm_medium=instruction-b&utm_campaign=link-16) results as a TSV file from the web interface. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install pillow # To deal with images\n", - "!pip install requests # To make HTTP requests\n", - "from PIL import Image, ImageDraw\n", - "import requests\n", - "\n", - "def get_image(url, selection):\n", - " raw_image = requests.get(url, stream=True).raw\n", - " image = Image.open(raw_image).convert(\"RGBA\")\n", - " regions = Image.new('RGBA', image.size, (255,255,255,0))\n", - " pencil = ImageDraw.Draw(regions)\n", - " for region in selection:\n", - " if region['shape'] != 'rectangle':\n", - " continue\n", - " p1_x = region['left'] * image.size[0]\n", - " p1_y = region['top'] * image.size[1]\n", - " p2_x = (region['left'] + region['width']) * image.size[0]\n", - " p2_y = (region['top'] + region['height']) * image.size[1]\n", - " pencil.rectangle((p1_x, p1_y, p2_x, p2_y), fill =(255, 30, 30, int(255*0.5)))\n", - " image = Image.alpha_composite(image, regions)\n", - " return image\n", - "\n", - "detection_result = {} # We'll store our result here" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "max_images = 2\n", - "images = []\n", - "\n", - "if not detection_result:\n", - "\n", - " for assignment in toloka_client.get_assignments(\n", - " status='ACCEPTED',\n", - " pool_id=detection_pool.id\n", - " ):\n", - " detection_result[assignment.tasks[0].input_values['image']] = assignment.solutions[0].output_values['result']\n", - "\n", - "for i in range(max_images):\n", - " url, selection = detection_result.popitem()\n", - " image = get_image(url, selection)\n", - " images.append(image)\n", - 
"\n", - "ipyplot.plot_images(\n", - " images,\n", - " max_images=max_images,\n", - " img_width=1000\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "---\n", - "## Summary\n", - "\n", - "This project consists of the minimum number of settings that This project has the minimum number of settings that will allow you to collect annotated images for your dataset right from Jupyter Notebook.\n", - "\n", - "For further experiments use the [Toloka-Kit documentation](https://toloka.ai/en/docs/toloka-kit/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-17) and check out other [use cases](https://github.com/Toloka/toloka-kit/tree/main/examples/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-18). \n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.13" - } - }, - "nbformat": 4, - "nbformat_minor": 1 -} +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Annotating ground truth for object detection\n", + "\n", + "The goal of this notebook is to annotate images which can later be used for training object detection algorithms.\n", + "\n", + "We will configure and run the annotation project in Toloka from scratch.\n", + "\n", + "Follow the steps in this notebook to get the labeled data at the end.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "To proceed with this notebook:\n", + "- Make sure you are [registered](https://toloka.ai/get-started/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-1) in Toloka as a requester.\n", + "- Use the promo code **OBJECT_DETECTION_PROMO** to add 
$10 to your account on your [profile page](https://platform.toloka.ai/requester/profile/income/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-2). You can use it to pay for data labeling while following this tutorial.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The challenge\n", + "We have a set of photos with people in them:\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Sample\n", + "
\n", + " Figure 1. Sample photo of people\n", + "
\n", + "\n", + "We need to outline every person's face. Ultimately, we need to get a set of pixelwise boundaries that represent the people's faces in each photo. Here’s what it can look like:\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Example\n", + "
\n", + " Figure 2. An example of how face detection can be performed\n", + "
\n", + "\n", + "This type of annotation often uses polygons. We are using bounding boxes to simplify the task so that we can reduce costs and speed things up. You can still use polygons, just uncomment the related pieces of code." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Detailed task description\n", + "In this notebook, we will implement projects #2 and #3 from [this Toloka tutorial](https://toloka.ai/en/docs/guide/concepts/image-segmentation-overview?utm_source=github&utm_medium=instruction-b&utm_campaign=link-3). Feel free to check them out if you want to learn how to configure these projects in the web interface. Otherwise you can skip it, since this notebook covers all the steps using the Toloka API.\n", + "\n", + "We’ll skip the first project “Does the image contain a specific object?” from the tutorial above because it’s easy to implement using our “verification project” code.\n", + "\n", + "We will implement two projects in Toloka:\n", + "- A detection project called “[Select an object in the image](https://toloka.ai/en/docs/guide/concepts/image-segmentation-project2?utm_source=github&utm_medium=instruction-b&utm_campaign=link-4)”: Tolokers will select image areas that contain faces. Tolokers are people around the world who get paid for completing your tasks. 
\n", + "- A verification project called “[Are the bounding boxes correct?](https://toloka.ai/en/docs/guide/concepts/image-segmentation-project3?utm_source=github&utm_medium=instruction-b&utm_campaign=link-5)”: Tolokers will check the annotated images.\n", + "\n", + "We don’t use [control tasks](https://toloka.ai/en/docs/guide/concepts/goldenset?utm_source=github&utm_medium=instruction-b&utm_campaign=link-6) or [majority vote](https://toloka.ai/en/docs/guide/concepts/mvote?utm_source=github&utm_medium=instruction-b&utm_campaign=link-7) to control quality in the detection project because we can’t expect the image annotations provided by the Tolokers to match each other exactly to the pixel. Instead, we’ll check detection results in the second project, where a different group of Tolokers will decide whether the faces are annotated correctly or not." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set up the environment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Prepare the environment and import necessary libraries." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install toloka-kit==1.0.0 # To interact with Toloka API\n", + "!pip install ipyplot # To plot images inside Jupyter Notebooks cells\n", + "!pip install crowd-kit==1.1.0\n", + "\n", + "import os\n", + "import datetime\n", + "import time\n", + "import logging\n", + "import sys\n", + "\n", + "import pandas as pd # To perform data manipulation\n", + "import ipyplot\n", + "\n", + "\n", + "from typing import List\n", + "from toloka.streaming.event import AssignmentEvent\n", + "\n", + "import toloka.client as toloka\n", + "import toloka.client.project.template_builder as tb\n", + "\n", + "from crowdkit.aggregation import MajorityVote\n", + "\n", + "logging.basicConfig(\n", + " format='[%(levelname)s] %(name)s: %(message)s',\n", + " level=logging.INFO,\n", + " stream=sys.stdout,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a toloka-client instance. All API calls will pass through it.\n", + "\n", + "To get an OAuth token, follow these [instructions](https://toloka.ai/en/docs/api/concepts/access/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-8). Enter your token." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "toloka_client = toloka.TolokaClient(input(\"Enter your token:\"), 'PRODUCTION')\n", + "print(toloka_client.get_requester())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check out Toloka documentation to learn more about [the Toloka API](https://toloka.ai/en/docs/api/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-9) and [Toloka-Kit](https://toloka.ai/en/docs/toloka-kit/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-10)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Review the dataset\n", + "We use the [Human Parts Dataset](https://github.com/xiaojie1017/Human-Parts/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-11), distributed under the MIT license.\n", + "\n", + "Our dataset is just a collection of URLs leading to JPEG images. GIF, PNG, and WEBP formats are also supported.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "!curl https://tlk.s3.yandex.net/dataset/faces_detection.tsv --output dataset.tsv # TODO\n", + "\n", + "# Load the dataset of links to a pd DataFrame\n", + "dataset = pd.read_csv('dataset.tsv', sep='\\t')\n", + "\n", + "# Plot 5 images from dataset to verify data loading\n", + "ipyplot.plot_images(\n", + " [url for url in dataset['image'].sample(n=50)],\n", + " max_images=5,\n", + " img_width=1000\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "---\n", + "## Create a new detection project\n", + "\n", + "In this project, Tolokers select image areas that contain faces.\n", + "\n", + "The first step is to configure how Tolokers will see the tasks:\n", + "- Write instructions.\n", + "- Define the input and output formats.\n", + "\n", + "**Note:** It’s important to write clear instructions with examples to make sure the Tolokers do exactly what we need them to. We also recommend checking the task interface yourself to see if there are any problems with the layout. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# How tolokers will see the task\n", + "project_interface = toloka.project.TemplateBuilderViewSpec(\n", + " view=tb.ImageAnnotationFieldV1( # Component that selects areas in images\n", + " tb.OutputData('result'), # Path for writing output data\n", + " tb.InputData('image'), # Getter for the input image\n", + " shapes={'rectangle': True}, # Allow to select only rectangular areas #polygons\n", + " validation=tb.RequiredConditionV1(hint='Please select an area'), # At least one area should be selected\n", + " full_height=True\n", + " )\n", + ")\n", + "\n", + "# You can write instructions and upload them from a file or enter them later in the web interface\n", + "# In case of polygon markup adjust the instruction describing the rules of putting the polygons instead of bounding boxes\n", + "prepared_instruction = open('./instructions/detection_instruction.html').read().strip()\n", + "\n", + "# Set up the project\n", + "detection_project = toloka.Project(\n", + " public_name='Outline all people faces with bounding boxes',\n", + " public_description='Find and outline all people faces with bounding boxes.',\n", + " public_instructions=prepared_instruction,\n", + " # Set up the task: view, input, and output parameters\n", + " task_spec=toloka.project.task_spec.TaskSpec(\n", + " input_spec={'image': toloka.project.UrlSpec()},\n", + " output_spec={'result': toloka.project.JsonSpec()},\n", + " view_spec=project_interface,\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "Call the API to create a new project." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "detection_project = toloka_client.create_project(detection_project)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Review your project and check the task interface\n", + "\n", + "Visit the project page to make sure the task interface is working correctly.\n", + "\n", + "To do this:\n", + "1. Follow the link you got in the output above.\n", + "2. In the project interface, click **Project actions** on the top right.\n", + "3. Click **Preview** in the menu.\n", + "4. Click **Change input data**.\n", + "5. Insert an image URL into the **Image** field. Here, you can use a link to one of the images you want to label, or [this sample image](https://tlk.s3.yandex.net/dataset/faces_detection/5fc9583cbda646c7de9c6451b7f22a165b6b7d4b.jpg).\n", + "6. Click the **Instructions** button. Make sure the instructions are visible and valid.\n", + "7. Try to select multiple areas with a rectangle using **Box annotation tool**.\n", + "8. Click **Submit** and then click **View responses**.\n", + "\n", + "A window with the results will appear. Check that the results are in the expected format and that the data is being entered correctly.\n", + "\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Task\n", + "
\n", + " Figure 3. What the results window might look like\n", + "
\n", + "\n", + "**Tip:** Do a trial run with a small sample of your data. Make sure that after running a trial for the entire pipeline, you get data with the expected format and quality. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Add custom skills for tolokers\n", + "\n", + "A [skill](https://toloka.ai/en/docs/guide/concepts/nav?utm_source=github&utm_medium=instruction-b&utm_campaign=link-12) is a characteristic of a Toloker. For example, you can record the percentage of correct responses as a skill.\n", + "\n", + "In this project, we’ll create two skills:\n", + "- **Detection skill**: Shows that a Toloker has completed at least one detection task. We’ll later filter out these Tolokers from verification tasks to ensure that the people doing the verification are not the same people who performed the detection.\n", + "- **Verification skill**: How good the current Toloker’s verification results are compared to others. We’ll need this skill later when we aggregate the results of the second project. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "detection_skill = next(toloka_client.get_skills(name='Area selection of people faces'), None)\n", + "if detection_skill:\n", + " print('Detection skill already exists')\n", + "else:\n", + " detection_skill = toloka_client.create_skill(\n", + " name='Area selection of people faces',\n", + " hidden=True,\n", + " public_requester_description={'EN': 'Toloker is annotating people faces'},\n", + " )\n", + "\n", + "verification_skill = next(toloka_client.get_skills(name='People faces detection verification'), None)\n", + "if verification_skill:\n", + " print('Verification skill already exists')\n", + "else:\n", + " verification_skill = toloka_client.create_skill(\n", + " name='People faces detection verification',\n", + " hidden=True,\n", + " public_requester_description={'EN': 'How good a toloker is at verifying detection tasks'},\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pool creation for a detection project\n", + "A [pool](https://toloka.ai/en/docs/guide/concepts/pool-main?utm_source=github&utm_medium=instruction-b&utm_campaign=link-13) is a set of tasks sent out for Tolokers.\n", + "\n", + "First, create an instance of a pool and set its basic parameters:\n", + "- Payment amount per task.\n", + "- Non-automatic acceptance of results.\n", + "- Number of tasks Tolokers will see per page.\n", + "- Toloker selection filters to control who can access this task." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "detection_pool = toloka.Pool(\n", + " project_id=detection_project.id,\n", + " private_name='Pool 1', # Only you can see this information.\n", + " may_contain_adult_content=False,\n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365), # Pool will automatically close after one year\n", + " reward_per_assignment=0.01, # Set the minimum payment amount for one task page\n", + " auto_accept_solutions=False, # Only pay the toloker for completing the task,\n", + " # based on the verification results of the second project.\n", + "\n", + " auto_accept_period_day=7, # Number of days to determine if we'll pay for task completion by this toloker or not.\n", + " assignment_max_duration_seconds=60*20, # Give tolokers 20 minutes maximum to complete one task page.\n", + " defaults=toloka.pool.Pool.Defaults(\n", + " # We don't need overlapping for detection tasks, so we set it to 1\n", + " default_overlap_for_new_task_suites=1,\n", + " default_overlap_for_new_tasks=1,\n", + " ),\n", + ")\n", + "\n", + "# Set the number of tasks per page\n", + "detection_pool.set_mixer_config(real_tasks_count=1)\n", + "# Please note that the payment amount specified when creating the pool is the amount the toloker receives for completing one page of tasks.\n", + "# If you specify 10 tasks per page above, then reward_per_assignment will be paid for completing 10 tasks." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We’ll only show our tasks to English-speaking users because the description of the task is in English. This means that only people who speak English will be able to accept this task. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "detection_pool.filter = toloka.filter.Languages.in_('EN')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Quality control rules**\n", + "\n", + "Each [quality control rule](https://toloka.ai/en/docs/guide/concepts/control?utm_source=github&utm_medium=instruction-b&utm_campaign=link-14) consists of the following:\n", + "- **Collector**: How to collect statistics and which metrics can be used in this rule.\n", + "- **Condition**: When the rule will be triggered. Under this condition, only parameters that apply to the collector can be used.\n", + "- **Action**: What to do if the condition is true." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The first rule in this project restricts pool access for tolokers who often make mistakes\n", + "detection_pool.quality_control.add_action(\n", + " collector=toloka.collectors.AcceptanceRate(),\n", + " conditions=[\n", + " # Toloker completed more than 2 tasks\n", + " toloka.conditions.TotalAssignmentsCount > 2,\n", + " # And more than 35% of their responses were rejected\n", + " toloka.conditions.RejectedAssignmentsRate > 35,\n", + " ],\n", + " # This action tells Toloka what to do if the condition above is True\n", + " # In our case, we'll restrict access for 15 days\n", + " # Always leave a comment: it may be useful later on\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='ALL_PROJECTS',\n", + " duration=15,\n", + " duration_unit='DAYS',\n", + " private_comment='Tolokers often make mistakes', # Only you will see this comment\n", + " )\n", + ")\n", + "\n", + "# The second useful rule is \"Fast responses\". 
It allows us to filter out tolokers who respond too quickly.\n", + "detection_pool.quality_control.add_action(\n", + " # Let's monitor fast submissions for the last 5 completed task pages\n", + " # And define ones that take less than 20 seconds as quick responses.\n", + " collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=20),\n", + " # If we see more than one fast response, we ban the toloker from all our projects for 10 days.\n", + " conditions=[toloka.conditions.FastSubmittedCount > 1],\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='ALL_PROJECTS',\n", + " duration=10,\n", + " duration_unit='DAYS',\n", + " private_comment='Fast responses', # Only you will see this comment\n", + " )\n", + ")\n", + "\n", + "# Another rule we use is for automatically updating skills\n", + "# We update the detection skill for tolokers who complete at least one page of tasks from detection pool.\n", + "detection_pool.quality_control.add_action(\n", + " collector=toloka.collectors.AnswerCount(),\n", + " # If toloker completed at least one task, it sets the new skill to 1\n", + " conditions=[toloka.conditions.AssignmentsAcceptedCount > 0],\n", + " action=toloka.actions.SetSkill(skill_id=detection_skill.id, skill_value=1),\n", + ")\n", + "\n", + "# This rule sends rejected assignments (tasks that you rejected) to other tolokers according to specified parameters.\n", + "detection_pool.quality_control.add_action(\n", + " collector=toloka.collectors.AssignmentsAssessment(),\n", + " # Check if a task was rejected\n", + " conditions=[toloka.conditions.AssessmentEvent == 'REJECT'],\n", + " # If the condition is True, add 1 to overlap and open the pool\n", + " action=toloka.actions.ChangeOverlap(delta=1, open_pool=True),\n", + ")\n", + "\n", + "print('Quality rules count:', len(detection_pool.quality_control.configs))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create a pool with all specified 
conditions\n", + "\n", + "Now we call the Toloka API to create a pool in the detection project. Afterwards, you can check the pool in the web interface. You’ll see there aren’t any tasks in it. We’ll add them later. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "detection_pool = toloka_client.create_pool(detection_pool)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "---\n", + "## Create a new project for verification\n", + "In this project, Tolokers will determine whether the faces were outlined correctly or not.\n", + "\n", + "This will be a standard classification project with only two classes: `OK` and `BAD`. We’ll explicitly define these labels as the output values. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Configure task interface: how tolokers will see the task\n", + "verification_interface = toloka.project.TemplateBuilderViewSpec(\n", + " view=tb.ListViewV1( # List of components that should be positioned from top to bottom in the UI\n", + " [\n", + " tb.ImageAnnotationFieldV1( # Image and selected areas to verify\n", + " tb.InternalData('selection',\n", + " default=tb.InputData('selection')), # Use the input field as default value to display the selected areas\n", + " tb.InputData('image'),\n", + " disabled=True, # Disable adding and deleting areas\n", + " full_height=True\n", + " ),\n", + " tb.RadioGroupFieldV1( # A component for selecting one value out of several options\n", + " tb.OutputData('result'), # Path for writing output data\n", + " [\n", + " tb.GroupFieldOption('OK', 'Yes'),\n", + " tb.GroupFieldOption('BAD', 'No'),\n", + " ],\n", + " label='Are all people faces outlined correctly?', # Label above the options\n", + " validation=tb.RequiredConditionV1() # Requirement to select one of the options\n", + " )\n", + " ]\n", + " ),\n", + " plugins=[\n", + " 
tb.HotkeysPluginV1( # Shortcuts for selecting options using the keyboard\n", + " key_1=tb.SetActionV1(tb.OutputData('result'), 'OK'),\n", + " key_2=tb.SetActionV1(tb.OutputData('result'), 'BAD')\n", + " )\n", + " ]\n", + ")\n", + "\n", + "# You can write instructions and upload them from a file or enter them later in the web interface\n", + "# In case of polygon markup adjust the instruction describing the rules of putting the polygons instead of bounding boxes\n", + "verification_instruction = open('./instructions/verification_instruction.html').read().strip()\n", + "\n", + "# Set up the project\n", + "verification_project = toloka.Project(\n", + " public_name='Are the people faces outlined correctly?',\n", + " public_description='Look at the image and decide whether or not the people faces are outlined correctly',\n", + " public_instructions=verification_instruction,\n", + " # Set up the task: view, input, and output parameters\n", + " task_spec=toloka.project.task_spec.TaskSpec(\n", + " input_spec={\n", + " 'image': toloka.project.UrlSpec(),\n", + " 'selection': toloka.project.JsonSpec(),\n", + " 'assignment_id': toloka.project.StringSpec(),\n", + " },\n", + " # Set allowed_values, we'll use smart mixing to get the results of this project\n", + " output_spec={'result': toloka.project.StringSpec(allowed_values=['OK', 'BAD'])},\n", + " view_spec=verification_interface,\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Call the API to create a new project." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "verification_project = toloka_client.create_project(verification_project)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Examine the project in the web interface:\n", + "\n", + "1. After running the code above, you’ll get a link to your project. 
Follow this link to check the task interface and instructions.\n", + "\n", + " **Note:** You will see almost the same interface as in the previous project, just without the ability to select areas. It’s important to make sure that the annotation results from the first project are displayed correctly in the second one.\n", + " \n", + " \n", + "2. Open the task **Preview** in the first project.\n", + "\n", + "\n", + "3. Outline the faces and click **Submit**.\n", + "\n", + "\n", + "4. Copy the results.\n", + "\n", + "\n", + "5. Now open the **Preview** of the second project.\n", + "\n", + "\n", + "6. Click **Change input data** and paste the annotation results in the selection field.\n", + "\n", + "\n", + "7. Click **Apply** and make sure the annotation displays correctly. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create and set up a pool in the verification project\n", + "Now we need to add a filter for this pool. Specify Tolokers that don’t have the detection skill (we want to avoid using Tolokers who did the detection tasks). You can combine multiple conditions using the `&` and `|` operators.\n", + "\n", + "**Note:** Add two quality control rules with the same collector but with different conditions and actions. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "verification_pool = toloka.Pool(\n", + " project_id=verification_project.id,\n", + " private_name='Pool 1. 
People faces verification', # Only you can see this information.\n", + " may_contain_adult_content=False,\n", + " will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365), # Pool will close automatically after one year\n", + " reward_per_assignment=0.01, # We set the minimum payment amount for one task page\n", + " # By default, auto_accept_solutions is on,\n", + " # so we'll pay for all the tasks without checking results.\n", + " assignment_max_duration_seconds=60*10, # Give tolokers 10 minutes to complete one task page\n", + " defaults=toloka.pool.Pool.Defaults(\n", + " # We need an overlap to compare the tolokers among themselves,\n", + " default_overlap_for_new_task_suites=5,\n", + " ),\n", + ")\n", + "\n", + "# We'll only show our tasks to English-speaking users because the description of the task is in English.\n", + "# We also won't allow our verification tasks to be performed by users who performed detection tasks.\n", + "verification_pool.filter = (\n", + " (toloka.filter.Languages.in_('EN')) &\n", + " (toloka.filter.Skill(detection_skill.id) == None)\n", + ")\n", + "\n", + "# Set up quality control\n", + "# Quality is based on the majority of matching responses from tolokers who completed the same task.\n", + "verification_pool.quality_control.add_action(\n", + " collector=toloka.collectors.MajorityVote(answer_threshold=3),\n", + " # If a toloker has 10 or more responses\n", + " # And the responses are correct in less than 50% of cases,\n", + " conditions=[\n", + " toloka.conditions.TotalAnswersCount > 9,\n", + " toloka.conditions.CorrectAnswersRate < 50,\n", + " ],\n", + " # We ban the toloker from all our projects for 10 days.\n", + " action=toloka.actions.RestrictionV2(\n", + " scope='ALL_PROJECTS',\n", + " duration=10,\n", + " duration_unit='DAYS',\n", + " private_comment=' Doesn\\'t match the majority', # Only you will see this comment\n", + " )\n", + ")\n", + "\n", + "# Set up the new skill value using MajorityVote.\n", + "# Depending 
on the percentage of correct responses, we increase the value of the toloker's skill.\n", + "verification_pool.quality_control.add_action(\n", + " collector=toloka.collectors.MajorityVote(answer_threshold=3, history_size=10),\n", + " conditions=[\n", + " toloka.conditions.TotalAnswersCount > 2,\n", + " ],\n", + " action=toloka.actions.SetSkillFromOutputField(\n", + " skill_id=verification_skill.id,\n", + " from_field='correct_answers_rate',\n", + " ),\n", + ")\n", + "print(f'Quality rule count:{len(verification_pool.quality_control.configs)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create a pool" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Set the task count for one page\n", + "verification_pool.set_mixer_config(\n", + " real_tasks_count=10,\n", + " force_last_assignment=True,\n", + ")\n", + "\n", + "verification_pool = toloka_client.create_pool(verification_pool)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "---\n", + "## Add tasks to pools and run the projects\n", + "At this point, we have configured two projects, and now we can upload the real data that we want to annotate." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tasks = [\n", + " toloka.Task(input_values={'image': url}, pool_id=detection_pool.id)\n", + " for url in dataset['image'].values\n", + "]\n", + "# Add tasks to a pool\n", + "toloka_client.create_tasks(tasks, allow_defaults=True,)\n", + "\n", + "detection_pool = toloka_client.open_pool(detection_pool.id)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Visit the pool page in the web interface and make sure the number of tasks is correct and the pool is running. Some tasks may already be completed.\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \"Pool\n", + "
\n", + " Figure 4. What a running pool might look like\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Tolokers work really fast, but they still need time to complete their tasks. We’ll use streaming to start the verification project without having to wait until the detection pool closes completely.\n", + "\n", + "Next, [review](https://toloka.ai/en/docs/guide/concepts/offline-accept?utm_source=github&utm_medium=instruction-b&utm_campaign=link-15) the detection results in the web interface." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from toloka.streaming import AssignmentsObserver, Pipeline\n", + "from toloka.streaming.storage import JSONLocalStorage" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# class for handling submissions in the detection pool\n", + "class DetectionSubmittedHandler:\n", + " def __init__(self, client, verification_pool_id):\n", + " self.client = client\n", + " self.verification_pool_id = verification_pool_id\n", + "\n", + " # create new tasks for the verification pool\n", + " def __call__(self, events: List[AssignmentEvent]) -> None:\n", + " verification_tasks = [\n", + " toloka.Task(\n", + " pool_id=self.verification_pool_id,\n", + " input_values={\n", + " 'image': event.assignment.tasks[0].input_values['image'],\n", + " 'selection': event.assignment.solutions[0].output_values['result'],\n", + " 'assignment_id': event.assignment.id,\n", + " }\n", + " )\n", + " for event in events\n", + " ]\n", + " self.client.create_tasks(verification_tasks, allow_defaults=True, open_pool=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# class for handling accepted tasks in the verification pool\n", + "class VerificationDoneHandler:\n", + " def __init__(self, client, verification_skill_id):\n", + " self.microtasks = pd.DataFrame([], columns=['task', 'label', 
'worker'])\n", + " self.client = client\n", + " self.verification_skill_id = verification_skill_id\n", + "\n", + " # filter out tasks that already have enough overlap and aggregate the result\n", + " def __call__(self, events: List[AssignmentEvent]) -> None:\n", + " # Initializing data\n", + " microtasks = pd.concat([self.microtasks, self.as_frame(events)])\n", + " # get user skills for aggregation\n", + " skills = pd.Series({\n", + " skill.user_id: skill.value\n", + " for skill in self.client.get_user_skills(skill_id=self.verification_skill_id)\n", + " })\n", + "\n", + " # Filtering all microtasks that have overlap of 5\n", + " microtasks['overlap'] = microtasks.groupby('task')['task'].transform('count')\n", + " to_aggregate = microtasks[microtasks['overlap'] >= 5]\n", + " microtasks = microtasks[microtasks['overlap'] < 5]\n", + " aggregated = MajorityVote(on_missing_skill='value', default_skill=0).fit_predict(to_aggregate, skills)\n", + " # Accepting or rejecting assignments\n", + " for assignment_id, result in aggregated.items():\n", + " if result == 'OK':\n", + " self.client.accept_assignment(assignment_id, 'Well done!')\n", + " else:\n", + " self.client.reject_assignment(assignment_id, 'The object wasn\\'t selected or was selected incorrectly.')\n", + "\n", + " # Updating microtasks: keep only the ones still waiting for full overlap\n", + " self.microtasks = microtasks[['task', 'label', 'worker']]\n", + "\n", + " # get the data necessary for aggregation\n", + " @staticmethod\n", + " def as_frame(events: List[AssignmentEvent]) -> pd.DataFrame:\n", + " microtasks = [\n", + " (task.input_values['assignment_id'], solution.output_values['result'], event.assignment.user_id)\n", + " for event in events\n", + " for task, solution in zip(event.assignment.tasks, event.assignment.solutions)\n", + " ]\n", + " return pd.DataFrame(microtasks, columns=['task', 'label', 'worker'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We’ll create a pipeline with an observer for each pool.\n", + "\n", + 
"Depending on the number of images in the detection pool and the time of day, the whole process can take from a few minutes to almost an hour. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "detection_observer = AssignmentsObserver(toloka_client, detection_pool.id)\n", + "detection_observer.on_submitted(DetectionSubmittedHandler(toloka_client, verification_pool.id))\n", + "verification_observer = AssignmentsObserver(toloka_client, verification_pool.id)\n", + "verification_observer.on_accepted(VerificationDoneHandler(toloka_client, verification_skill.id))\n", + "\n", + "# Create a local directory that will store pipeline progress and logs.\n", + "# It lets you restart the pipeline without losing data after a pause or failure.\n", + "storage_path = './storage/'\n", + "if not os.path.exists(storage_path):\n", + " os.makedirs(storage_path)\n", + "storage = JSONLocalStorage(storage_path)\n", + "\n", + "pipeline = Pipeline(storage=storage)\n", + "pipeline.register(detection_observer)\n", + "pipeline.register(verification_observer)\n", + "\n", + "# Google Colab already runs a global event loop,\n", + "# so to run our pipeline there we have to apply nest_asyncio, which allows a nested loop\n", + "if 'google.colab' in str(get_ipython()):\n", + " import nest_asyncio, asyncio\n", + " nest_asyncio.apply()\n", + " asyncio.get_event_loop().run_until_complete(pipeline.run())\n", + "else:\n", + " await pipeline.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "---\n", + "## Get the results\n", + "Now you can download all the accepted tasks from the detection pool and work with them. In this notebook, we’ll only show the detection results. 
The code below displays the bounding box markup results.\n", + "\n", + "You can also [download](https://toloka.ai/en/docs/guide/concepts/result-of-eval?utm_source=github&utm_medium=instruction-b&utm_campaign=link-16) results as a TSV file from the web interface. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install pillow # To deal with images\n", + "!pip install requests # To make HTTP requests\n", + "from PIL import Image, ImageDraw\n", + "import requests\n", + "\n", + "def get_image(url, selection):\n", + " raw_image = requests.get(url, stream=True).raw\n", + " image = Image.open(raw_image).convert(\"RGBA\")\n", + " regions = Image.new('RGBA', image.size, (255,255,255,0))\n", + " pencil = ImageDraw.Draw(regions)\n", + " for region in selection:\n", + " if region['shape'] != 'rectangle':\n", + " continue\n", + " p1_x = region['left'] * image.size[0]\n", + " p1_y = region['top'] * image.size[1]\n", + " p2_x = (region['left'] + region['width']) * image.size[0]\n", + " p2_y = (region['top'] + region['height']) * image.size[1]\n", + " pencil.rectangle((p1_x, p1_y, p2_x, p2_y), fill=(255, 30, 30, int(255*0.5)))\n", + " image = Image.alpha_composite(image, regions)\n", + " return image\n", + "\n", + "detection_result = {} # We'll store our result here" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "max_images = 2\n", + "images = []\n", + "\n", + "if not detection_result:\n", + "\n", + " for assignment in toloka_client.get_assignments(\n", + " status='ACCEPTED',\n", + " pool_id=detection_pool.id\n", + " ):\n", + " detection_result[assignment.tasks[0].input_values['image']] = assignment.solutions[0].output_values['result']\n", + "\n", + "for _ in range(min(max_images, len(detection_result))):\n", + " url, selection = detection_result.popitem()\n", + " image = get_image(url, selection)\n", + " images.append(image)\n", +
"\n", + "ipyplot.plot_images(\n", + " images,\n", + " max_images=max_images,\n", + " img_width=1000\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "---\n", + "## Summary\n", + "\n", + "This project has the minimum number of settings that will allow you to collect annotated images for your dataset right from Jupyter Notebook.\n", + "\n", + "For further experiments, use the [Toloka-Kit documentation](https://toloka.ai/en/docs/toloka-kit/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-17) and check out other [use cases](https://github.com/Toloka/toloka-kit/tree/main/examples/?utm_source=github&utm_medium=instruction-b&utm_campaign=link-18). \n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.13" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}