SIGIR eCOM 2021 Data Challenge Dataset

Public Data Release 1.0.0

Overview

Coveo hosted the 2021 SIGIR eCom Data Challenge. This repository contains utility scripts and the dataset, which is freely available for research purposes (see below). The paper introducing the Challenge is available as a pre-print.

The original Data Challenge README (containing baseline information, design papers, solutions, etc.) is archived in this repository as README_DC_2021.md. Background information about the Challenge, the motivations behind the release and some inspiring submissions can be found in the original paper, the archival section in README_DC_2021.md and the SIGIR presentation.

Note: there have been some issues when downloading the file using Safari; we suggest using Chrome for the download and sign-up process.

License

The dataset is available for research and educational purposes at this page. To obtain the dataset, you are required to fill in a form with information about you and your institution, and agree to the Terms And Conditions for fair usage of the data. For convenience, the Terms And Conditions are also included in plain-text format in this repo: usage of the data implies acceptance of these Terms And Conditions.

Dataset

Data Description

The dataset is provided as three big text files (.csv) - browsing_train.csv, search_train.csv, sku_to_content.csv - inside a zip archive containing an additional copy of the Terms And Conditions. The final dataset contains 36M events, and it is the first dataset of this kind to be released to the research community: please review the Data Challenge paper for a comparison with existing datasets and for the motivations behind the release format. For your convenience, three sample files are included in the start folder, showcasing the data structure. Below, you will find a detailed description for each file.

Browsing Events

The file browsing_train.csv contains almost 5M anonymized shopping sessions. The structure of this dataset is similar to our Scientific Reports data release: each row corresponds to a browsing event in a session, containing session and timestamp information, as well as (hashed) details on the interaction (was it a purchase or a detail event? Was it a simple pageview or a specific product action?). All data was collected and processed in an anonymized fashion through our standard SDK: remember that front-end tracking is by nature imperfect, so small inconsistencies are to be expected.

| Field | Type | Description |
| --- | --- | --- |
| session_id_hash | string | Hashed identifier of the shopping session. A session groups together events that are at most 30 minutes apart: if the same user comes back to the target website after 31 minutes from the last interaction, a new session identifier is assigned. |
| event_type | enum | The type of event according to the Google Protocol, one of { pageview, event }; for example, an add event can happen on a page load, or as a stand-alone event. |
| product_action | enum | One of { detail, add, purchase, remove }. If the field is empty, the event is a simple page view (e.g. the FAQ page) without associated products. Please also note that an action involving removing a product from the cart might lead to several consecutive remove events. Please note that click events (that is, events generated by clicking on a search page) are included in the search_train.csv file. |
| product_sku_hash | string | If the event is a product event, hashed identifier of the product in the event. |
| server_timestamp_epoch_ms | int | Epoch time, in milliseconds. As a further anonymization technique, the timestamp has been shifted by an unspecified amount of weeks, keeping intact the intra-week patterns. |
| hashed_url | string | Hashed url of the current web page. |

Finally, please be aware that a product detail page (PDP) may generate both a detail and a pageview event, and that the order of the events in the file is not strictly chronological: use the session identifier and the timestamp to reconstruct the actual chain of events for a given session.
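Since rows are not strictly chronological, a common first step is to group events by session and sort each group by timestamp. The snippet below is a minimal sketch of that idea using only the standard library. It assumes the CSV has a header row whose column names match the field names in the table above; the `load_sessions` helper and the sample rows are made up for illustration and are not part of the dataset or of the repository scripts.

```python
import csv
import io
from collections import defaultdict


def load_sessions(csv_file):
    """Group browsing events by session and sort each session by timestamp."""
    sessions = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Timestamps are read as strings; convert them so sorting is numeric.
        row["server_timestamp_epoch_ms"] = int(row["server_timestamp_epoch_ms"])
        sessions[row["session_id_hash"]].append(row)
    for events in sessions.values():
        # list.sort is stable: events with equal timestamps keep file order.
        events.sort(key=lambda e: e["server_timestamp_epoch_ms"])
    return dict(sessions)


# Tiny made-up sample (assumed header; hashes and timestamps are illustrative).
SAMPLE = """session_id_hash,event_type,product_action,product_sku_hash,server_timestamp_epoch_ms,hashed_url
s1,pageview,,,1000300,u2
s2,event,add,p9,1000100,u3
s1,pageview,detail,p1,1000000,u1
"""

sessions = load_sessions(io.StringIO(SAMPLE))
for session_id, events in sessions.items():
    # An empty product_action means a plain page view, as described above.
    actions = [e["product_action"] or "pageview" for e in events]
    print(session_id, actions)
```

For the full browsing_train.csv, replace the `io.StringIO` sample with an open file handle, e.g. `open(path, newline="")`. Keeping every session in memory may be too heavy on small machines, in which case a dataframe library or an external sort is likely a better fit.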

Search Events

The file search_train.csv contains more than 800k search-based interactions. Each row is a search query event issued by a shopper, which includes an array of (hashed) results returned to the client. We also indicate which result(s) were clicked from the result set, if any. By also reporting products seen but not clicked, we hope to inspire clever ways to use negative feedback.

| Field | Type | Description |
| --- | --- | --- |
| session_id_hash | string | Hashed identifier of the shopping session. A session groups together events that are at most 30 minutes apart: if the same user comes back to the target website after 31 minutes from the last interaction, a new session identifier is assigned. |
| server_timestamp_epoch_ms | int | Epoch time, in milliseconds. As a further anonymization technique, the timestamp has been shifted by an unspecified amount of weeks, keeping intact the intra-week patterns. |
| query_vector | vector | A dense representation of the search query, obtained through standard pre-trained modeling and dimensionality reduction techniques. |
| product_skus_hash | list | Hashed identifiers of the products in the search response. |
| clicked_skus_hash | list | Hashed identifiers of the products clicked after issuing the search query. |
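
As an illustration of the negative-feedback idea, the sketch below derives the products that were shown but not clicked for a single search row. It assumes that list-valued cells are serialized as Python-style list literals (for example `['a', 'b']`) and that an empty cell means "no items"; please check the sample files for the actual serialization before relying on this. The helper names `parse_list` and `seen_not_clicked` are hypothetical.

```python
import ast


def parse_list(cell):
    """Parse a list-valued CSV cell; empty or missing cells become []."""
    if cell is None or cell.strip() == "":
        return []
    # Assumption: the cell holds a Python-style list literal of strings.
    return list(ast.literal_eval(cell))


def seen_not_clicked(row):
    """Return the SKUs shown in the search response but not clicked."""
    shown = parse_list(row["product_skus_hash"])
    clicked = set(parse_list(row["clicked_skus_hash"]))
    # Preserve the order in which results were returned to the client.
    return [sku for sku in shown if sku not in clicked]


# Made-up row for illustration; real rows contain hashed identifiers.
row = {
    "product_skus_hash": "['a', 'b', 'c']",
    "clicked_skus_hash": "['b']",
}
negatives = seen_not_clicked(row)
print(negatives)
```

Whether an unclicked result is a true negative depends on the task: a shopper may simply not have scrolled far enough to see it, so these items are best treated as weak negative signals.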

Catalog Metadata

The file sku_to_content.csv contains a mapping between (hashed) product identifiers (SKUs) and dense representations of textual and image meta-data from the actual catalog, for all the SKUs in the training and the Challenge evaluation dataset (when the information is available).

| Field | Type | Description |
| --- | --- | --- |
| product_sku_hash | string | Hashed identifier of product ID (SKU). |
| category_hash | string | The categories are hashed representations of the category hierarchy, /-separated. |
| price_bucket | int | The product price, provided as a 10-quantile integer. |
| description_vector | vector | A dense representation of textual meta-data, obtained through standard pre-trained modeling and dimensionality reduction techniques. Please note that this representation is compatible with the one in the search file. |
| image_vector | vector | A dense representation of image meta-data, obtained through standard pre-trained modeling and dimensionality reduction techniques. |
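
Because description_vector is stated to be compatible with query_vector, one plausible use is to compare a query with a product description directly. The sketch below does this with cosine similarity, which is our choice for illustration and not something the dataset prescribes. It assumes vector cells are serialized as list literals of numbers and that missing meta-data shows up as an empty cell; `parse_vector` and `cosine_similarity` are hypothetical helpers, and the two-dimensional vectors are made up.

```python
import ast
import math


def parse_vector(cell):
    """Parse a vector-valued CSV cell; return None when it is empty."""
    if cell is None or cell.strip() == "":
        return None  # meta-data is not available for every SKU
    # Assumption: the cell holds a list literal of numbers.
    return [float(x) for x in ast.literal_eval(cell)]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Made-up catalog row and query vector, for illustration only.
row = {
    "product_sku_hash": "p1",
    "category_hash": "c1/c2/c3",
    "price_bucket": "4",
    "description_vector": "[1.0, 1.0]",
}
query_vector_cell = "[1.0, 0.0]"

# The category hierarchy is /-separated, from the root to the leaf.
category_path = row["category_hash"].split("/")
similarity = cosine_similarity(
    parse_vector(query_vector_cell), parse_vector(row["description_vector"])
)
print(category_path, round(similarity, 4))
```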

How to Start

Download the zip folder and unzip it on your local machine. To verify that all is well, you can run the simple start/dataset_stats.py script in the folder: the script will parse the three files, show some sample rows and print out some basic stats and counts (if you don't modify the three paths, it will run on the sample csv files).
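
If you prefer a quick check that is independent of the provided script, the following sketch reports the header and the number of data rows for each of the three files in a folder. It is not a replacement for start/dataset_stats.py and does not reproduce its output. It assumes each file starts with a header row; the `summarize_csv` and `summarize_dataset` helpers are hypothetical, and the demo writes a tiny made-up file to a temporary folder instead of reading the real dataset.

```python
import csv
import tempfile
from pathlib import Path

DATASET_FILES = ("browsing_train.csv", "search_train.csv", "sku_to_content.csv")


def summarize_csv(path):
    """Return (header, number_of_data_rows) for one CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # assumption: the first row is a header
        n_rows = sum(1 for _ in reader)
    return header, n_rows


def summarize_dataset(folder):
    """Summarize each expected file; missing files map to None."""
    folder = Path(folder)
    summary = {}
    for name in DATASET_FILES:
        path = folder / name
        summary[name] = summarize_csv(path) if path.exists() else None
    return summary


# Demo on a temporary folder holding one tiny, made-up catalog file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sku_to_content.csv").write_text(
        "product_sku_hash,category_hash,price_bucket\np1,c1/c2,3\np2,c1,7\n",
        encoding="utf-8",
    )
    summary = summarize_dataset(tmp)

for name, result in summary.items():
    if result is None:
        print(name, "missing")
    else:
        print(name, f"{result[1]} rows, columns: {result[0]}")
```

To run it on the real data, pass the path of the unzipped folder to `summarize_dataset`.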

Please remember that usage of this dataset implies acceptance of the Terms And Conditions: you agree to not use the dataset for any other purpose than what is stated in the Terms and Conditions, nor attempt to reverse engineer or de-anonymise the dataset by explicitly or implicitly linking the data to any person, brand or legal entity.

Contacts

For questions about the dataset, please reach out to Jacopo Tagliabue.

Acknowledgments

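The authors of the paper and organizers are: Jacopo Tagliabue, Ciro Greco, Jean-Francis Roy, Federico Bianchi, Giovanni Cassani, Bingqing Yu and Patrick John Chia.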

The authors wish to thank Coveo's entire legal team for supporting our research and believing in this data sharing initiative; special thanks to Luca Bigon for help in data collection and preparation.

How to Cite our Work

If you use this dataset, please cite our work:

@inproceedings{CoveoSIGIR2021,
author = {Tagliabue, Jacopo and Greco, Ciro and Roy, Jean-Francis and Bianchi, Federico and Cassani, Giovanni and Yu, Bingqing and Chia, Patrick John},
title = {SIGIR 2021 E-Commerce Workshop Data Challenge},
year = {2021},
booktitle = {SIGIR eCom 2021}
}