Public Data Release 1.0.0
Coveo hosted the 2021 SIGIR eCom Data Challenge and this repository contains utility scripts and the dataset, which is freely available for research purposes (see below): the paper introducing the Challenge is available as a pre-print.
The Data Challenge original README (containing baseline information, design papers, solutions, etc.) is archived in this repository as README_DC_2021.md. Background information about the Challenge, the motivations behind the release and some inspiring submissions can be found in the original paper, the archival section in README_DC_2021.md and the SIGIR presentation.
Note: there have been some issues when downloading the file with Safari; we suggest using Chrome for the download and sign-up process.
The dataset is available for research and educational purposes at
this page.
To obtain the dataset, you are required to fill out a form with information about you and your institution, and agree to the Terms And Conditions for fair usage of the data. For convenience, the Terms And Conditions are also included in plain txt format in this repo: usage of the data implies the acceptance of these Terms And Conditions.
The dataset is provided as three big text files (.csv) - browsing_train.csv, search_train.csv, sku_to_content.csv - inside a zip archive containing an additional copy of the Terms And Conditions. The final dataset contains 36M events,
and it is the first dataset of this kind to be released to the research community: please review the
Data Challenge paper
for a comparison with existing datasets and for the motivations behind the release format.
For your convenience, three sample files are included in the start folder, showcasing the data structure.
Below, you will find a detailed description for each file.
The file browsing_train.csv contains almost 5M anonymized shopping sessions.
The structure of this dataset is similar to our Scientific Reports data release:
each row corresponds to a browsing event in a session, containing session and timestamp information, as well as
(hashed) details on the interaction (was it a purchase or a detail event? Was it a simple pageview or a specific
product action?). All data was collected and processed in an anonymized fashion through our standard SDK:
remember that front-end tracking is by nature imperfect, so small inconsistencies are to be expected.
Field | Type | Description |
---|---|---|
session_id_hash | string | Hashed identifier of the shopping session. A session groups together events that are at most 30 minutes apart: if the same user comes back to the target website after 31 minutes from the last interaction, a new session identifier is assigned. |
event_type | enum | The type of event according to the Google Protocol, one of { pageview, event }; for example, an add event can happen on a page load, or as a stand-alone event. |
product_action | enum | One of { detail, add, purchase, remove }. If the field is empty, the event is a simple page view (e.g. the FAQ page) without associated products. Note that removing a product from the cart might generate several consecutive remove events, and that click events (that is, events generated by clicking on a search page) are included in the search_train.csv file. |
product_sku_hash | string | If the event is a product event, hashed identifier of the product in the event. |
server_timestamp_epoch_ms | int | Epoch time, in milliseconds. As a further anonymization technique, the timestamp has been shifted by an unspecified number of weeks, keeping intact the intra-week patterns. |
hashed_url | string | Hashed url of the current web page. |
Finally, please be aware that a product detail page (PDP) may generate both a detail and a pageview event, and that the order of the events in the file is not strictly chronological (refer to the session identifier and the timestamp information to reconstruct the actual chain of events for a given session).
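Since rows are not strictly chronological, the chain of events for a session can be rebuilt by grouping on session_id_hash and sorting on server_timestamp_epoch_ms. The following is a minimal sketch using only the Python standard library; it assumes the CSV header matches the field names in the table above, and the inline sample rows and the `events_by_session` helper are illustrative, not part of the dataset or of the scripts in this repo.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only (made-up hashes), using the column names from the table above.
SAMPLE = """session_id_hash,event_type,product_action,product_sku_hash,server_timestamp_epoch_ms,hashed_url
s1,pageview,detail,p1,1550000002000,u2
s2,pageview,,,1550000000500,u9
s1,pageview,,,1550000000000,u1
s1,event,add,p1,1550000003000,u2
"""


def events_by_session(csv_file):
    """Group rows by session and sort each group by timestamp."""
    sessions = defaultdict(list)
    for row in csv.DictReader(csv_file):
        row["server_timestamp_epoch_ms"] = int(row["server_timestamp_epoch_ms"])
        sessions[row["session_id_hash"]].append(row)
    for events in sessions.values():
        # list.sort() is stable, so events sharing a timestamp keep their file order.
        events.sort(key=lambda r: r["server_timestamp_epoch_ms"])
    return dict(sessions)


sessions = events_by_session(io.StringIO(SAMPLE))
for event in sessions["s1"]:
    # An empty product_action marks a plain page view without a product.
    print(event["server_timestamp_epoch_ms"], event["event_type"], event["product_action"] or "-")
```

To run this on the real data, replace the `io.StringIO(SAMPLE)` argument with a file opened via `open("browsing_train.csv", newline="")`; keep in mind that the sketch holds every row in memory, which may be impractical for the full file.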
The file search_train.csv contains more than 800k search-based interactions. Each row is a search query event issued by a shopper, which includes an array of (hashed) results returned to the client. We also indicate which result(s) have been clicked from the result set, if any.
By also reporting products seen but not clicked, we hope to inspire clever ways to use negative feedback.
Field | Type | Description |
---|---|---|
session_id_hash | string | Hashed identifier of the shopping session. A session groups together events that are at most 30 minutes apart: if the same user comes back to the target website after 31 minutes from the last interaction, a new session identifier is assigned. |
server_timestamp_epoch_ms | int | Epoch time, in milliseconds. As a further anonymization technique, the timestamp has been shifted by an unspecified number of weeks, keeping intact the intra-week patterns. |
query_vector | vector | A dense representation of the search query, obtained through standard pre-trained modeling and dimensionality reduction techniques. |
product_skus_hash | list | Hashed identifiers of the products in the search response. |
clicked_skus_hash | list | Hashed identifiers of the products clicked after issuing the search query. |
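One simple way to derive negative feedback from a search row is to take the products in product_skus_hash that do not appear in clicked_skus_hash. The sketch below makes a loud assumption: that list-valued cells are serialized as list literals such as `['a', 'b']` and that an empty cell means "no value"; please check the sample files in the start folder before relying on it. The helpers `parse_list` and `seen_not_clicked` are hypothetical names introduced here for illustration.

```python
import ast


def parse_list(cell):
    """Parse a list-valued CSV cell; an empty or missing cell becomes an empty list.

    Assumption: the cell holds a list literal such as "['a', 'b']".
    """
    cell = (cell or "").strip()
    if not cell:
        return []
    value = ast.literal_eval(cell)
    return list(value) if isinstance(value, (list, tuple)) else [value]


def seen_not_clicked(row):
    """Return the SKUs shown in the search response that were not clicked."""
    shown = parse_list(row.get("product_skus_hash"))
    clicked = set(parse_list(row.get("clicked_skus_hash")))
    return [sku for sku in shown if sku not in clicked]


# Illustrative row with made-up hashes.
row = {
    "product_skus_hash": "['p1', 'p2', 'p3']",
    "clicked_skus_hash": "['p2']",
}
negatives = seen_not_clicked(row)
print(negatives)
```

The result preserves the order of the search response, which may be useful if the position of a skipped product matters for your model.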
The file sku_to_content.csv contains a mapping between (hashed) product identifiers (SKUs) and dense representations of textual and image meta-data from the actual catalog, for all the SKUs in the training and the Challenge evaluation dataset (when the information is available).
Field | Type | Description |
---|---|---|
product_sku_hash | string | Hashed identifier of product ID (SKU). |
category_hash | string | The categories are hashed representations of the category hierarchy, /-separated. |
price_bucket | int | The product price, provided as a 10-quantile integer. |
description_vector | vector | A dense representation of textual meta-data, obtained through standard pre-trained modeling and dimensionality reduction techniques. Please note that this representation is compatible with the one in the search file. |
image_vector | vector | A dense representation of image meta-data, obtained through standard pre-trained modeling and dimensionality reduction techniques. |
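Because description_vector is described as compatible with the query representation in the search file, one possible use is to score products against a query. The sketch below uses cosine similarity, which is our choice for illustration and not something the dataset prescribes; it also assumes "compatible" means the two vectors have the same dimension. The vectors shown are tiny made-up stand-ins for the real ones.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up, low-dimensional stand-ins for query_vector and description_vector.
query_vector = [1.0, 0.0, 1.0]
descriptions = {
    "p1": [2.0, 0.0, 2.0],  # same direction as the query
    "p2": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Rank SKUs from most to least similar to the query.
ranked = sorted(
    descriptions,
    key=lambda sku: cosine_similarity(query_vector, descriptions[sku]),
    reverse=True,
)
print(ranked)
```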
Download the zip archive and unzip it on your local machine. To verify that all is well, you can run the simple start/dataset_stats.py script in the folder: the script will parse the three files, show some sample rows and print out some basic stats and counts (if you don't modify the three paths, it will run on the sample csv files).
Please remember that usage of this dataset implies acceptance of the Terms And Conditions: you agree to not use the dataset for any other purpose than what is stated in the Terms and Conditions, nor attempt to reverse engineer or de-anonymise the dataset by explicitly or implicitly linking the data to any person, brand or legal entity.
For questions about the dataset, please reach out to Jacopo Tagliabue.
The authors of the paper and organizers are:
- Jacopo Tagliabue - Coveo AI Labs
- Ciro Greco - Coveo AI Labs
- Jean-Francis Roy - Coveo
- Federico Bianchi - Postdoctoral Researcher at Università Bocconi
- Giovanni Cassani - Tilburg University
- Bingqing Yu - Coveo
- Patrick John Chia - Coveo
The authors wish to thank Coveo's entire legal team for supporting our research and believing in this data sharing initiative; special thanks to Luca Bigon for help in data collection and preparation.
If you use this dataset, please cite our work:
@inproceedings{CoveoSIGIR2021,
author = {Tagliabue, Jacopo and Greco, Ciro and Roy, Jean-Francis and Bianchi, Federico and Cassani, Giovanni and Yu, Bingqing and Chia, Patrick John},
title = {SIGIR 2021 E-Commerce Workshop Data Challenge},
year = {2021},
booktitle = {SIGIR eCom 2021}
}