Update model cards and readmes with new model info

nestauk · Sep 29, 2023 · e8b70e0
1 parent 7d6d622

Showing 5 changed files with 32 additions and 24 deletions.
diff --git a/docs/source/labelling.md b/docs/source/labelling.md
@@ -1,16 +1,17 @@
 # Entity Labelling
 
-To extract skills from job adverts we took an approach of training a named entity recognition (NER) model to predict which parts of job adverts were skills ("skill entities") and which were experiences ("experience entities").
+To extract skills from job adverts we took an approach of training a named entity recognition (NER) model to predict which parts of job adverts were skills ("skill entities"), which were experiences ("experience entities") and which were job benefits ("benefit entities").
 
-To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). More about this labelling process can be found in the [`skill_ner` pipeline](https://nestauk.github.io/ojd_daps_skills/pipeline/skill_ner/README.md).
+To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/) and also [Prodigy](https://prodi.gy/). More about this labelling process can be found in the [`skill_ner` pipeline](https://nestauk.github.io/ojd_daps_skills/pipeline/skill_ner/README.md).
 
-There are 3 entity labels in our training data:
+There are 4 entity labels in our training data:
 
 1. `SKILL`
 2. `MULTISKILL`
 3. `EXPERIENCE`
+4. `BENEFIT`
 
-The user interface for this labelling task looks like:
+The user interface for the labelling task in Label Studio looks like:
 
 ![](../../outputs/reports/figures/label_studio.png)
 
@@ -27,4 +28,4 @@ Sometimes there were no entities to label:
 
 ### Training dataset
 
-For the current NER model, 5641 entities in 375 job adverts from our dataset of job adverts were labelled; 354 are multiskill, 4696 are skill, and 608 were experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.
+For the current NER model (20230808), 8971 entities in 500 job adverts from our dataset of job adverts were labelled; 443 are multiskill, 7313 are skill, 852 are experience and 363 are benefit entities. 20% of the labelled entities were held out as a test set to evaluate the models.
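
The entity counts quoted for model `20230808` can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the repository: the dictionary and variable names are illustrative, and only the numbers come from the text above.

```python
# Entity counts quoted for model 20230808 in the labelling docs.
entity_counts = {
    "SKILL": 7313,
    "MULTISKILL": 443,
    "EXPERIENCE": 852,
    "BENEFIT": 363,
}
reported_total = 8971

total = sum(entity_counts.values())
# The per-label counts should add up to the reported total.
assert total == reported_total, f"{total} != {reported_total}"

# Share of each label in the labelled data, as a fraction of the total.
shares = {label: count / total for label, count in entity_counts.items()}
for label, share in shares.items():
    print(f"{label:<10} {entity_counts[label]:>5}  {share:.1%}")
```

The same check is worth rerunning whenever the labelled dataset grows, so the per-label counts and the total stay in step across the docs.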
diff --git a/docs/source/model_card.md b/docs/source/model_card.md
@@ -2,7 +2,7 @@
 
 This page contains information for different parts of the skills extraction and mapping pipeline. We detail the two main parts of the pipeline; the extract skills pipeline and the skills to taxonomy mapping pipeline.
 
-Developed by data scientists in Nesta’s Data Analytics Practice, (last updated on 23-11-2022).
+Developed by data scientists in Nesta’s Data Analytics Practice, (last updated on 29-09-2023).
 
 - [Model Card: Extract Skills](extract_skills_card)
 - [Model Card: Skills to Taxonomy Mapping](mapping_card)
@@ -17,38 +17,39 @@ _The extracting skills pipeline._
 
 ### Summary
 
-- Train a Named Entity Recognition (NER) spaCy component to extract skills, multiskills and experience entities from job adverts.
+- Train a Named Entity Recognition (NER) spaCy component to extract skills, multiskills, experience and benefits entities from job adverts.
 - Predict whether or not a skill is multi-skill or not using scikit learn's SVM model. Features are length of entity; if 'and' in entity; if ',' in entity.
 - Split multiskills, where possible, based on semantic rules.
 
 ### Training
 
-- For the NER model, 375 job adverts were labelled for skills, multiskills and experience.
-- As of 15th November 2022, **5641** entities in 375 job adverts from OJO were labelled;
-- **354** are multiskill, **4696** are skill, and **608** were experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.
+- For the NER model, 500 job adverts were labelled for skills, multiskills, experience and benefits.
+- As of 8th August 2023, **8971** entities in 500 job adverts from OJO were labelled;
+- **443** are multiskill, **7313** are skill, **852** are experience and **363** are benefit entities. 20% of the labelled entities were held out as a test set to evaluate the models.
 
 The NER model we trained used [spaCy's](https://spacy.io/) NER neural network architecture. Their NER architecture _"features a sophisticated word embedding strategy using subword features and 'Bloom' embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing"_ - more about this [here](https://spacy.io/universe/project/video-spacys-ner-model).
 
 You can read more about the creation of the labelling data [here](./labelling.md).
 
 ### NER Metrics
 
-- A metric in the python library nerevaluate ([read more here](https://pypi.org/project/nervaluate/)) was used to calculate F1, precision and recall for the NER and SVM classifier on the held-out test set. As of 15th November 2022, the results are as follows:
+- A metric in the Python library nervaluate ([read more here](https://pypi.org/project/nervaluate/)) was used to calculate F1, precision and recall for the NER model and SVM classifier on the held-out test set. As of 8th August 2023, the results are as follows:
 
 | Entity     | F1    | Precision | Recall |
 | ---------- | ----- | --------- | ------ |
-| Skill      | 0.586 | 0.679     | 0.515  |
-| Experience | 0.506 | 0.648     | 0.416  |
-| All        | 0.563 | 0.643     | 0.500  |
+| Skill      | 0.612 | 0.712     | 0.537  |
+| Experience | 0.524 | 0.647     | 0.441  |
+| Benefit    | 0.531 | 0.708     | 0.425  |
+| All        | 0.590 | 0.680     | 0.521  |
 
 - These metrics use partial entity matching.
-- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20220825/train_details.json`
+- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20230808/train_details.json`
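
As a quick consistency check on the updated NER metrics table, each row's F1 should be the harmonic mean of its precision and recall. The sketch below recomputes it from the rounded values in the table; the function and variable names are illustrative, and how the "All" row aggregates across entity types is not stated in the source, so each row is only checked against its own precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the 20230808 metrics table.
reported = {
    "Skill": (0.712, 0.537, 0.612),
    "Experience": (0.647, 0.441, 0.524),
    "Benefit": (0.708, 0.425, 0.531),
    "All": (0.680, 0.521, 0.590),
}

for entity, (precision, recall, reported_f1) in reported.items():
    recomputed = f1_score(precision, recall)
    # Inputs are rounded to 3 decimal places, so allow a small tolerance.
    assert abs(recomputed - reported_f1) < 0.001, entity
    print(f"{entity:<10} reported={reported_f1:.3f} recomputed={recomputed:.3f}")
```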
 
 ### Multiskill Metrics
 
-- The same training data and held out test set used for the NER model was used to evaluate the SVM model. On a held out test set, the SVM model achieved 91% accuracy.
+- The same training data and held-out test set used for the NER model were used to evaluate the SVM model. On the held-out test set, the SVM model achieved 94% accuracy.
 - When evaluating the multiskill splitter algorithm rules, 253 multiskill spans were labelled as ‘good’, ‘ok’ or ‘bad’ splits. Of the 253 multiskill spans, 80 were split. Of the splits, 66% were ‘good’, 9% were ‘ok’ and 25% were ‘bad’.
-- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20220825/train_details.json`
+- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20230808/train_details.json`
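
The model card names three features for the multiskill classifier: the length of the entity, whether 'and' is in the entity, and whether ',' is in the entity. A minimal sketch of that feature extraction might look like the following. The function name is hypothetical, and two details are assumptions rather than facts from the repository: length is measured in characters, and 'and' is tested as a plain substring (so it would also match words such as "understanding").

```python
def multiskill_features(entity_text: str) -> list[int]:
    """Build the three features named in the model card for one entity span."""
    return [
        len(entity_text),           # entity length (in characters; an assumption)
        int("and" in entity_text),  # 1 if 'and' appears (substring match; an assumption)
        int("," in entity_text),    # 1 if ',' appears in the entity
    ]


# Illustrative entity strings, not taken from the labelled data.
examples = ["communication skills", "Excel, Word and PowerPoint"]
for text in examples:
    print(text, multiskill_features(text))
```

Rows like these could then be passed to a scikit-learn SVM such as `sklearn.svm.SVC`; which SVM class and settings the pipeline actually uses is not shown in this commit.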
 
 ### Caveats and Recommendations
 

diff --git a/ojd_daps_skills/analysis/data_overview.py b/ojd_daps_skills/analysis/data_overview.py
@@ -62,7 +62,7 @@
     "escoe_extension/outputs/labelled_job_adverts/combined_labels_20220824.json"
 )
 train_details_file = (
-    "escoe_extension/outputs/models/ner_model/20230808/train_details.json"
+    "escoe_extension/outputs/models/ner_model/20220825/train_details.json"
 )
 sample_matches_file_name = "escoe_extension/outputs/data/extract_skills/20220901_ojo_sample_skills_extracted.json"
 manually_matches_tagged_file = (

diff --git a/ojd_daps_skills/pipeline/skill_ner/README.md b/ojd_daps_skills/pipeline/skill_ner/README.md
@@ -1,6 +1,6 @@
 # Skill NER
 
-## Label data
+## Label data using Label Studio
 
 ### Creating a sample of the OJO data
 
@@ -79,9 +79,13 @@ For the labelling done at the end of June 2022, we labelled the chunk of 400 job
 
 The outputs of this labelled are stored in `s3://open-jobs-lake/escoe_extension/outputs/skill_span_labels/`.
 
-### Merging labelled files
+## Label data using Prodigy
 
-Since multiple people labelled files from different locations, we merge the labelled data using the following command:
+We labelled another batch of job adverts using [Prodigy](https://prodi.gy/). This was to avail of their active learning capabilities. Details of how we labelled job adverts this way are given in [the Prodigy labelling README](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/prodigy/README.md).
+
+## Merging labelled files
+
+Since multiple people labelled files from different locations, and we labelled in both Label Studio and Prodigy, we merge the labelled data using the following command:
 
 ```
 python ojd_daps_skills/pipeline/skill_ner/combine_labels.py

diff --git a/outputs/reports/skills_extraction.md b/outputs/reports/skills_extraction.md
@@ -16,11 +16,11 @@ This process means we can extract skills from thousands of job adverts and analy
 
 ## Labelling data
 
-To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).
+To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). We then labelled a second batch using [Prodigy](https://prodi.gy/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).
 
 ![](figures/label_studio.png)
 
-As of 11th July 2022 we have labelled 3400 entities; 404 (12%) are multiskill, 2603 (77%) are skill, and 393 (12%) are experience entities.
+As of 8th August 2023 we have labelled 8971 entities; 443 (5%) are multiskill, 7313 (82%) are skill, 852 (9%) are experience entities and 363 (4%) are benefit entities.
 
 ### Multiskill labels
 
@@ -60,7 +60,7 @@ A summary of the experiments with training the model is below.
 
 | Date (model name) | Base model     | Training size   | Evaluation size | Number of iterations | Drop out rate | Learning rate | Convert multiskill? | Other info                                                                                       | Skill F1 | Experience F1 | All F1 | Multiskill test score |
 | ----------------- | -------------- | --------------- | --------------- | -------------------- | ------------- | ------------- | ------------------- | ------------------------------------------------------------------------------------------------ | -------- | ------------- | ------ | --------------------- |
-| 20230808          | en_core_web_lg | 400 (7149 ents) | 100 (1805 ents) | 100                  | 0.1           | 0.001         | True                | More data, different base model, BENEFIT label data                                              | 0.61     | 0.52          | 0.59   | 0.94                  |
+| 20230808\*\*      | en_core_web_lg | 400 (7149 ents) | 100 (1805 ents) | 100                  | 0.1           | 0.001         | True                | More data, different base model, BENEFIT label data                                              | 0.61     | 0.52          | 0.59   | 0.94                  |
 | 20220825          | blank en       | 300 (4508 ents) | 75 (1133 ents)  | 100                  | 0.1           | 0.001         | True                | Changed hyperparams, more data                                                                   | 0.59     | 0.51          | 0.56   | 0.91                  |
 | 20220729\*        | blank en       | 196 (2850 ents) | 49 (636 ents)   | 50                   | 0.3           | 0.001         | True                | More data, padding in cleaning but do fix_entity_annotations after fix_all_formatting to sort it | 0.57     | 0.44          | 0.54   | 0.87                  |
 | 20220729_nopad    | blank en       | 196             | 49              | 50                   | 0.3           | 0.001         | True                | No padding in cleaning, more data                                                                | 0.52     | 0.33          | 0.45   | 0.87                  |
@@ -124,6 +124,8 @@ More in-depth metrics for `20220714`:
 
 \* For model `20220714` we relabelled the MULTISKILL labels in the dataset - we were trying to see whether some of them should actually be single skills, or could be separated into single skills rather than (as we found) labelling a large span as a multiskill. This process increased our number of labelled skill entities (from 2603 to 2887) and decreased the number of multiskill entities (from 404 to 218), resulting in a net increase in entities labelled (from 3400 to 3498).
 
+\*\* For model `20230808` we included BENEFIT labels in some of the labelled data.
+
 ### Parameter tuning
 
 For model `20220825` onwards we changed our hyperparameters after some additional experimentation revealed improvements could be made. This experimentation was on a dataset of 375 job adverts in total.
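
The training and evaluation sizes in the experiment table (400 and 100 adverts for `20230808`, 300 and 75 for `20220825`) correspond to holding out 20% of the labelled adverts. A reproducible split along those lines can be sketched as below. The function name, seed and ID scheme are hypothetical, and the repository's own split logic is not shown in this commit, so this is an illustration of the idea rather than the pipeline's code.

```python
import random


def split_adverts(advert_ids, test_fraction=0.2, seed=42):
    """Shuffle advert IDs reproducibly and hold out a fraction for evaluation."""
    ids = list(advert_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is repeatable
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Hypothetical IDs standing in for the 500 labelled job adverts.
train_ids, test_ids = split_adverts(range(500))
print(len(train_ids), len(test_ids))
```

Splitting at the advert level, rather than the entity level, keeps all entities from one advert on the same side of the split.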