Datatón 2022 - Team Latino-Asian Brotherhood

This repo contains terminos_y_condiciones.pdf and the final submission code for reproducing the results.

Environment

To set up the environment:

  • Install Python >= 3.8
  • Install requirements.txt in a fresh Python environment. We use CUDA 11.2; if you use a different CUDA version, change the torch version accordingly.

Main solution

Training

To train a Categoria classifier, run src/recomendador/train.sh

Inference

To predict the news category with the Categoria classifier, run src/recomendador/predict.sh

TLDR

Participacion

  • We first use word matching to find the news articles related to each client in clientes_noticias.csv. This produces matched_news_group.csv.
  • We then use these matches as pseudo-labels to train a SetFit model that classifies the Sector of each news article. SetFit first fine-tunes a Sentence Transformer model (sentence-transformers/paraphrase-multilingual-mpnet-base-v2) on a small number of labeled examples (typically 16 per class), then trains a classifier head on the embeddings generated by the fine-tuned Sentence Transformer.
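
The word-matching step can be sketched roughly as follows. This is a minimal illustration, not the code in this repo: the `normalize` and `match_clients` helpers, the accent-stripping rule, and the client table are all assumptions made for the example, as is the idea that a matched client's sector becomes the pseudo-label. The SetFit training itself is not shown.

```python
import re
import unicodedata
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and strip accents so 'Energía' matches 'energia' (assumed rule)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_clients(news_text: str, clients: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (client, sector) pairs whose name occurs as a whole phrase in the text."""
    haystack = normalize(news_text)
    matches = []
    for name, sector in clients.items():
        pattern = r"\b" + re.escape(normalize(name)) + r"\b"
        if re.search(pattern, haystack):
            matches.append((name, sector))
    return matches


# Hypothetical client table (name -> sector) and news text
clients = {"Acme Energía": "Energia", "Banco Ejemplo": "Financiero"}
news = "La empresa ACME ENERGIA anunció una nueva planta solar."
pseudo_labels = match_clients(news, clients)
```

Here the article matches only the first client, so under the stated assumption its pseudo-label would be that client's sector.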

Categoria

  • We first label 28 random news articles for each category in this file.
  • We train a first-round SetFit model to classify the Category of each news article.
  • We then re-label the news articles that the first-round model misclassified in each category, and save them in this file.
  • We train a second-round SetFit model to classify the Category of each news article.
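
One way to picture the re-labeling step between the two rounds is the sketch below. It assumes the reviewed labels come from a manual check of first-round predictions; the function name, the dict-based data layout, and the category "Innovacion" are hypothetical and only serve the example.

```python
from typing import Dict


def build_round2_training_set(
    round1_train: Dict[str, str],
    round1_predictions: Dict[str, str],
    reviewed_labels: Dict[str, str],
) -> Dict[str, str]:
    """Add every reviewed article that round 1 got wrong to the training set."""
    round2_train = dict(round1_train)
    for text, true_label in reviewed_labels.items():
        # Keep only the corrections, i.e. cases where the model disagreed
        if round1_predictions.get(text) != true_label:
            round2_train[text] = true_label
    return round2_train


# Hypothetical data: text -> category
round1_train = {"noticia a": "Innovacion"}
round1_predictions = {"noticia b": "Innovacion", "noticia c": "Descartable"}
reviewed_labels = {"noticia b": "Descartable", "noticia c": "Descartable"}

round2_train = build_round2_training_set(
    round1_train, round1_predictions, reviewed_labels
)
```

Only "noticia b" is added, because the first-round prediction for "noticia c" already agreed with the reviewer.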

Combine results

Participacion

A (client, news) pair is classified as:

  • Cliente: if the client's name appears in the news and the client and the news have the same Sector
  • Sector: if the client's name does not appear in the news but the client and the news have the same Sector
  • No aplica: otherwise
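
The three rules above can be written as a small function. This is a sketch under the assumption that the name match and the sector comparison have already been computed as booleans; the function and parameter names are illustrative.

```python
def classify_participacion(name_in_news: bool, same_sector: bool) -> str:
    """Apply the Participacion rules to one (client, news) pair."""
    if same_sector:
        # Same sector: the name match decides between Cliente and Sector
        return "Cliente" if name_in_news else "Sector"
    # Different sector, whether or not the name appears
    return "No aplica"


labels = [
    classify_participacion(name_in_news=True, same_sector=True),
    classify_participacion(name_in_news=False, same_sector=True),
    classify_participacion(name_in_news=True, same_sector=False),
]
```

Note that a name match alone is not enough: a pair whose sectors differ falls under "No aplica" even if the client's name appears in the text.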

Categoria

A (client, news) pair is classified as:

  • Otra: if the client and the news have the same Sector and the model classifies the news as "Descartable"
  • Descartable: if the client and the news do not have the same Sector and the model classifies the news as "Descartable"
  • The original category classified by the model: otherwise
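
As a sketch, the Categoria rules reduce to a single override on the model's "Descartable" prediction. The function name and the example category "Innovacion" are assumptions for illustration.

```python
def combine_categoria(predicted_category: str, same_sector: bool) -> str:
    """Apply the Categoria rules to one (client, news) pair."""
    if predicted_category == "Descartable":
        # A discardable article in the client's own sector is kept as "Otra"
        return "Otra" if same_sector else "Descartable"
    # Any other prediction is passed through unchanged
    return predicted_category


results = [
    combine_categoria("Descartable", same_sector=True),
    combine_categoria("Descartable", same_sector=False),
    combine_categoria("Innovacion", same_sector=False),  # hypothetical category
]
```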

Recomendador

The recommendation score is the product of two parts. The first is the source reputation score, which is described in the previous paragraph. The second is the confidence that the sector transformer assigns to the client's sector for a given news article. If the transformer is very confident that an article is about a certain sector, that article gets high priority for clients in that sector.

Recommendation Score = Confidence in client’s sector × Source reputation score
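
A minimal sketch of how this score could be used to rank news for one client is shown below. The function names, the tuple layout, and the numeric values are made up for the example; the repo's actual scoring code may be organized differently.

```python
from typing import List, Tuple


def recommendation_score(sector_confidence: float, source_reputation: float) -> float:
    """Confidence in the client's sector times the source reputation score."""
    return sector_confidence * source_reputation


def rank_news(candidates: List[Tuple[str, float, float]]) -> List[Tuple[str, float]]:
    """Score (news_id, sector_confidence, source_reputation) rows and sort best first."""
    scored = [
        (news_id, recommendation_score(confidence, reputation))
        for news_id, confidence, reputation in candidates
    ]
    return sorted(scored, key=lambda row: row[1], reverse=True)


# Made-up values for one client's sector
candidates = [("n1", 0.9, 0.5), ("n2", 0.6, 1.0), ("n3", 0.2, 0.8)]
ranking = rank_news(candidates)
```

Because the two factors are multiplied, a highly confident sector prediction from a low-reputation source can rank below a less confident prediction from a trusted source, as with "n1" and "n2" here.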