This repo contains terminos_y_condiciones.pdf and the final submission code for reproducing the results.
To set up the environment:
- Install Python >= 3.8
- Install `requirements.txt` in a fresh Python environment. (We use CUDA 11.2; if you use a different CUDA version, change the torch version accordingly.)
To train a Categoria classifier, run src/recomendador/train.sh
To predict news categories with the Categoria classifier, run src/recomendador/predict.sh
- We first use word matching to find news related to each client in `clientes_noticias.csv`. This creates a matched_news_group.csv file.
- We then use these matches as pseudo-labels to train a SetFit model that classifies which Sector each news article belongs to.
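The word-matching step above can be sketched roughly as follows. This is only an illustration, not the repository's actual code: the function name `match_clients` and the choice of case-insensitive whole-phrase matching are assumptions.

```python
import re


def match_clients(news_text, client_names):
    """Return the client names that occur as whole phrases in news_text.

    Hypothetical helper: the matching rules used in the repository
    may differ (for example accent folding or alias lists).
    """
    matched = []
    for name in client_names:
        # Lookarounds require a non-word character (or the text boundary)
        # on both sides, so "Anco" does not match inside "Banco".
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, news_text, flags=re.IGNORECASE):
            matched.append(name)
    return matched
```

Each matched (client, news) pair can then carry the client's Sector as a pseudo-label for the news article.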
SetFit first fine-tunes a Sentence Transformer model (`sentence-transformers/paraphrase-multilingual-mpnet-base-v2`) on a small number of labeled examples (typically 16 per class). It then trains a classifier head on the embeddings generated by the fine-tuned Sentence Transformer.
- We first label 28 random news articles for each category in this file.
- We train a first-round SetFit model to classify which Category each news article belongs to.
- We then re-label the news articles that the previous model misclassified in each category, and save them in this file.
- We train a second-round SetFit model to classify which Category each news article belongs to.
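As a rough sketch of how the two labeling rounds fit together, the corrected examples from the review step can be merged into the first-round labels before retraining. The function names and the dict-based data layout below are assumptions made for illustration; the repository stores these labels in files.

```python
def collect_corrections(predicted, reviewed):
    """Return {news_id: reviewed_label} for articles the model got wrong.

    predicted: dict news_id -> category predicted by the first-round model
    reviewed:  dict news_id -> category assigned by a human reviewer
    (hypothetical layout, for illustration only)
    """
    return {
        news_id: label
        for news_id, label in reviewed.items()
        if predicted.get(news_id) != label
    }


def second_round_labels(first_round, corrections):
    """Merge first-round labels with corrections; corrections take precedence."""
    merged = dict(first_round)
    merged.update(corrections)
    return merged
```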
A (client, news) pair is classified as:
- Cliente: if the client's name appears in the news article and both have the same Sector
- Sector: if the client's name does not appear in the news article but both have the same Sector
- No aplica: otherwise
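The three rules above amount to a small decision function. The sketch below assumes the two conditions have already been computed as booleans; the function name `relation_type` is hypothetical.

```python
def relation_type(name_in_news: bool, same_sector: bool) -> str:
    """Classify a (client, news) pair following the rules listed above."""
    if same_sector:
        # Same Sector: the client's name decides between Cliente and Sector.
        return "Cliente" if name_in_news else "Sector"
    # Different Sector, whether or not the name appears.
    return "No aplica"
```

Note that under these rules a name match alone is not enough: a pair whose Sectors differ is "No aplica" even if the client's name appears in the article.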
The Category of a (client, news) pair is assigned as:
- Otra: if they have the same Sector and the news article is classified as "Descartable"
- Descartable: if they do not have the same Sector and the news article is classified as "Descartable"
- The category originally predicted by the model: otherwise
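These category rules can be sketched the same way. The function name `final_category` is hypothetical, and `model_category` stands for whatever label the Categoria classifier predicted.

```python
def final_category(model_category: str, same_sector: bool) -> str:
    """Adjust the model's category for a (client, news) pair."""
    if model_category == "Descartable":
        # A discardable article is kept as "Otra" when the Sector matches.
        return "Otra" if same_sector else "Descartable"
    # Any other prediction is passed through unchanged.
    return model_category
```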
The recommendation score has two parts. The first is the source reputation score, described in the previous paragraph. The second is the confidence that the sector transformer assigns to the client's sector. If the transformer is very confident that a news article is about a certain sector, that article gets high priority for clients in that sector.
Recommendation Score = Confidence in client's sector × Source reputation score
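A minimal sketch of the score and of ranking articles by it follows. It assumes both inputs are plain floats and that candidates arrive as `(news_id, sector_confidence, source_reputation)` tuples; those names and that layout are illustrative, not taken from the repository.

```python
def recommendation_score(sector_confidence: float, source_reputation: float) -> float:
    """Product of the sector confidence and the source reputation score."""
    return sector_confidence * source_reputation


def rank_news(candidates):
    """Sort (news_id, sector_confidence, source_reputation) tuples by score.

    Returns a list of (news_id, score) pairs, highest score first.
    """
    scored = [
        (news_id, recommendation_score(conf, rep))
        for news_id, conf, rep in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the two parts are multiplied, an article from a low-reputation source ranks low even when the sector confidence is high, and vice versa.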