Commit

Update model cards and readmes with new model info
lizgzil committed Sep 29, 2023
1 parent 7d6d622 commit e8b70e0
Showing 5 changed files with 32 additions and 24 deletions.
11 changes: 6 additions & 5 deletions docs/source/labelling.md
@@ -1,16 +1,17 @@
# Entity Labelling

To extract skills from job adverts we trained a named entity recognition (NER) model to predict which parts of job adverts were skills ("skill entities") and which were experiences ("experience entities").
To extract skills from job adverts we trained a named entity recognition (NER) model to predict which parts of job adverts were skills ("skill entities"), which were experiences ("experience entities") and which were job benefits ("benefit entities").

To train the NER model we needed labelled data. First we created a random sample of job adverts and converted them into the format needed for labelling with [Label Studio](https://labelstud.io/). More about this labelling process can be found in the [`skill_ner` pipeline](https://nestauk.github.io/ojd_daps_skills/pipeline/skill_ner/README.md).
To train the NER model we needed labelled data. First we created a random sample of job adverts and converted them into the format needed for labelling with [Label Studio](https://labelstud.io/) and [Prodigy](https://prodi.gy/). More about this labelling process can be found in the [`skill_ner` pipeline](https://nestauk.github.io/ojd_daps_skills/pipeline/skill_ner/README.md).

There are 3 entity labels in our training data:
There are 4 entity labels in our training data:

1. `SKILL`
2. `MULTISKILL`
3. `EXPERIENCE`
4. `BENEFIT`

The user interface for this labelling task looks like:
The user interface for the labelling task in label-studio looks like:

![](../../outputs/reports/figures/label_studio.png)

@@ -27,4 +28,4 @@ Sometimes there were no entities to label:

### Training dataset

For the current NER model, 5641 entities in 375 job adverts from our dataset were labelled; 354 are multiskill, 4696 are skill, and 608 are experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.
For the current NER model (20230808), 8971 entities in 500 job adverts from our dataset were labelled; 443 are multiskill, 7313 are skill, 852 are experience entities, and 363 are benefit entities. 20% of the labelled entities were held out as a test set to evaluate the models.
25 changes: 13 additions & 12 deletions docs/source/model_card.md
@@ -2,7 +2,7 @@

This page contains information for different parts of the skills extraction and mapping pipeline. We detail the two main parts of the pipeline: the extract skills pipeline and the skills to taxonomy mapping pipeline.

Developed by data scientists in Nesta’s Data Analytics Practice (last updated on 23-11-2022).
Developed by data scientists in Nesta’s Data Analytics Practice (last updated on 29-09-2023).

- [Model Card: Extract Skills](extract_skills_card)
- [Model Card: Skills to Taxonomy Mapping](mapping_card)
@@ -17,38 +17,39 @@ _The extracting skills pipeline._

### Summary

- Train a Named Entity Recognition (NER) spaCy component to extract skills, multiskills and experience entities from job adverts.
- Train a Named Entity Recognition (NER) spaCy component to extract skills, multiskills, experience and benefits entities from job adverts.
- Predict whether a skill entity is a multiskill using scikit-learn's SVM model. The features are the length of the entity, whether 'and' is in the entity, and whether ',' is in the entity.
- Split multiskills, where possible, based on semantic rules.
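
The three features listed for the multiskill classifier can be sketched as below. This is a minimal illustration rather than the repository's code: the function name `multiskill_features` is hypothetical, and it assumes that length is counted in characters and that 'and' is checked as a plain substring, neither of which the summary specifies.

```python
from typing import List


def multiskill_features(entity_text: str) -> List[int]:
    """Build the three features named in the summary for one entity span.

    Assumptions (not confirmed by the source): length is measured in
    characters, and 'and' is matched as a substring of the entity text.
    """
    return [
        len(entity_text),             # length of entity
        int("and" in entity_text),    # 1 if 'and' appears in the entity
        int("," in entity_text),      # 1 if ',' appears in the entity
    ]


# Two illustrative entity spans (made up for this sketch).
examples = ["communication and teamwork", "Python"]
feature_matrix = [multiskill_features(text) for text in examples]
```

Feature vectors like these could then be passed to an SVM classifier such as scikit-learn's `sklearn.svm.SVC`. A substring check also fires on words such as "sand", so a token-level check may be preferable in practice.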

### Training

- For the NER model, 375 job adverts were labelled for skills, multiskills and experience.
- As of 15th November 2022, **5641** entities in 375 job adverts from OJO were labelled;
- **354** are multiskill, **4696** are skill, and **608** are experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.
- For the NER model, 500 job adverts were labelled for skills, multiskills, experience and benefits.
- As of 8th August 2023, **8971** entities in 500 job adverts from OJO were labelled;
- **443** are multiskill, **7313** are skill, **852** are experience entities, and **363** are benefit entities. 20% of the labelled entities were held out as a test set to evaluate the models.

The NER model we trained used [spaCy's](https://spacy.io/) NER neural network architecture. Their NER architecture _"features a sophisticated word embedding strategy using subword features and 'Bloom' embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing"_ - more about this [here](https://spacy.io/universe/project/video-spacys-ner-model).

You can read more about the creation of the labelling data [here](./labelling.md).

### NER Metrics

- The Python library nervaluate ([read more here](https://pypi.org/project/nervaluate/)) was used to calculate F1, precision and recall for the NER and SVM classifier on the held-out test set. As of 15th November 2022, the results are as follows:
- The Python library nervaluate ([read more here](https://pypi.org/project/nervaluate/)) was used to calculate F1, precision and recall for the NER and SVM classifier on the held-out test set. As of 8th August 2023, the results are as follows:

| Entity | F1 | Precision | Recall |
| ---------- | ----- | --------- | ------ |
| Skill | 0.586 | 0.679 | 0.515 |
| Experience | 0.506 | 0.648 | 0.416 |
| All | 0.563 | 0.643 | 0.500 |
| Skill | 0.612 | 0.712 | 0.537 |
| Experience | 0.524 | 0.647 | 0.441 |
| Benefit | 0.531 | 0.708 | 0.425 |
| All | 0.590 | 0.680 | 0.521 |

- These metrics use partial entity matching.
- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20220825/train_details.json`
- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20230808/train_details.json`
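
As a sanity check on the table, F1 is the harmonic mean of precision and recall, so each reported F1 should be recoverable from the other two columns. The sketch below recomputes it for the `20230808` rows. The helper name `harmonic_f1` and the `REPORTED` dictionary are illustrative only, and the check assumes the table values are plain per-label scores rounded to three decimals.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the 20230808 rows of the table.
REPORTED = {
    "Skill": (0.712, 0.537, 0.612),
    "Experience": (0.647, 0.441, 0.524),
    "Benefit": (0.708, 0.425, 0.531),
    "All": (0.680, 0.521, 0.590),
}

# Recompute F1 for each row, rounded to the table's three decimals.
recomputed = {
    label: round(harmonic_f1(p, r), 3) for label, (p, r, _) in REPORTED.items()
}
```

Comparing `recomputed` with the third value of each tuple gives a quick consistency check whenever the table is updated.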

### Multiskill Metrics

- The same training data and held-out test set used for the NER model were used for the SVM model. On the held-out test set, the SVM model achieved 91% accuracy.
- The same training data and held-out test set used for the NER model were used for the SVM model. On the held-out test set, the SVM model achieved 94% accuracy.
- When evaluating the multiskill splitter algorithm rules, 253 multiskill spans were labelled as ‘good’, ‘ok’ or ‘bad’ splits. Of the 253 multiskill spans, 80 were split. Of the splits, 66% were ‘good’, 9% were ‘ok’ and 25% were ‘bad’.
- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20220825/train_details.json`
- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20230808/train_details.json`

### Caveats and Recommendations

2 changes: 1 addition & 1 deletion ojd_daps_skills/analysis/data_overview.py
@@ -62,7 +62,7 @@
"escoe_extension/outputs/labelled_job_adverts/combined_labels_20220824.json"
)
train_details_file = (
"escoe_extension/outputs/models/ner_model/20230808/train_details.json"
"escoe_extension/outputs/models/ner_model/20220825/train_details.json"
)
sample_matches_file_name = "escoe_extension/outputs/data/extract_skills/20220901_ojo_sample_skills_extracted.json"
manually_matches_tagged_file = (
10 changes: 7 additions & 3 deletions ojd_daps_skills/pipeline/skill_ner/README.md
@@ -1,6 +1,6 @@
# Skill NER

## Label data
## Label data using label-studio

### Creating a sample of the OJO data

@@ -79,9 +79,13 @@ For the labelling done at the end of June 2022, we labelled the chunk of 400 job

The outputs of this labelling are stored in `s3://open-jobs-lake/escoe_extension/outputs/skill_span_labels/`.

### Merging labelled files
## Label data using Prodigy

Since multiple people labelled files from different locations, we merge the labelled data using the following command:
We labelled another batch of job adverts using [Prodigy](https://prodi.gy/) to make use of its active learning capabilities. Details of how we labelled job adverts this way are given in [the Prodigy labelling README](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/prodigy/README.md).

## Merging labelled files

Since multiple people labelled files from different locations, and we labelled in both label-studio and Prodigy, we merge the labelled data using the following command:

```
python ojd_daps_skills/pipeline/skill_ner/combine_labels.py
```
8 changes: 5 additions & 3 deletions outputs/reports/skills_extraction.md
@@ -16,11 +16,11 @@ This process means we can extract skills from thousands of job adverts and analy

## Labelling data

To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).
To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). We then labelled a second batch using [Prodigy](https://prodi.gy/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).

![](figures/label_studio.png)

As of 11th July 2022 we have labelled 3400 entities; 404 (12%) are multiskill, 2603 (77%) are skill, and 393 (12%) are experience entities.
As of 8th August 2023 we have labelled 8971 entities; 443 (5%) are multiskill, 7313 (82%) are skill, 852 (10%) are experience entities and 363 (4%) are benefit entities.
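
The label shares quoted for 8th August 2023 can be recomputed from the raw counts. The sketch below is illustrative only: the `LABEL_COUNTS` dictionary and `label_shares` helper are hypothetical names, and the counts are copied from the sentence above.

```python
from typing import Dict

# Entity counts per label as of 8th August 2023, taken from the text.
LABEL_COUNTS = {
    "MULTISKILL": 443,
    "SKILL": 7313,
    "EXPERIENCE": 852,
    "BENEFIT": 363,
}


def label_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each label's share of all labelled entities, in percent."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total_entities = sum(LABEL_COUNTS.values())
shares = label_shares(LABEL_COUNTS)
```

Rounding to one decimal place keeps the shares close to the whole-number percentages in the text while making small rounding differences visible.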

### Multiskill labels

@@ -60,7 +60,7 @@ A summary of the experiments with training the model is below.

| Date (model name) | Base model | Training size | Evaluation size | Number of iterations | Drop out rate | Learning rate | Convert multiskill? | Other info | Skill F1 | Experience F1 | All F1 | Multiskill test score |
| ----------------- | -------------- | --------------- | --------------- | -------------------- | ------------- | ------------- | ------------------- | ------------------------------------------------------------------------------------------------ | -------- | ------------- | ------ | --------------------- |
| 20230808 | en_core_web_lg | 400 (7149 ents) | 100 (1805 ents) | 100 | 0.1 | 0.001 | True | More data, different base model, BENEFIT label data | 0.61 | 0.52 | 0.59 | 0.94 |
| 20230808\*\* | en_core_web_lg | 400 (7149 ents) | 100 (1805 ents) | 100 | 0.1 | 0.001 | True | More data, different base model, BENEFIT label data | 0.61 | 0.52 | 0.59 | 0.94 |
| 20220825 | blank en | 300 (4508 ents) | 75 (1133 ents) | 100 | 0.1 | 0.001 | True | Changed hyperparams, more data | 0.59 | 0.51 | 0.56 | 0.91 |
| 20220729\* | blank en | 196 (2850 ents) | 49 (636 ents) | 50 | 0.3 | 0.001 | True | More data, padding in cleaning but do fix_entity_annotations after fix_all_formatting to sort it | 0.57 | 0.44 | 0.54 | 0.87 |
| 20220729_nopad | blank en | 196 | 49 | 50 | 0.3 | 0.001 | True | No padding in cleaning, more data | 0.52 | 0.33 | 0.45 | 0.87 |
@@ -124,6 +124,8 @@ More in-depth metrics for `20220714`:

\* For model `20220714` we relabelled the MULTISKILL labels in the dataset - we were trying to see whether some of them should actually be single skills, or could be separated into single skills rather than (as we found) labelling a large span as a multiskill. This process increased our number of labelled skill entities (from 2603 to 2887) and decreased the number of multiskill entities (from 404 to 218), resulting in a net increase in entities labelled (from 3400 to 3498).

\*\* For model `20230808` we included BENEFIT labels in some of the labelled data.

### Parameter tuning

For model `20220825` onwards we changed our hyperparameters after some additional experimentation revealed improvements could be made. This experimentation was on a dataset of 375 job adverts in total.