diff --git a/docs/source/labelling.md b/docs/source/labelling.md
index 6d72dc8d..2281e0ba 100644
--- a/docs/source/labelling.md
+++ b/docs/source/labelling.md
@@ -1,16 +1,17 @@
 # Entity Labelling
 
-To extract skills from job adverts we took an approach of training a named entity recognition (NER) model to predict which parts of job adverts were skills ("skill entities") and which were experiences ("experience entities").
+To extract skills from job adverts we took an approach of training a named entity recognition (NER) model to predict which parts of job adverts were skills ("skill entities"), which were experiences ("experience entities") and which were job benefits ("benefit entities").
 
-To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). More about this labelling process can be found in the [`skill_ner` pipeline](https://nestauk.github.io/ojd_daps_skills/pipeline/skill_ner/README.md).
+To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/) and also [Prodigy](https://prodi.gy/). More about this labelling process can be found in the [`skill_ner` pipeline](https://nestauk.github.io/ojd_daps_skills/pipeline/skill_ner/README.md).
 
-There are 3 entity labels in our training data:
+There are 4 entity labels in our training data:
 
 1. `SKILL`
 2. `MULTISKILL`
 3. `EXPERIENCE`
+4. `BENEFIT`
 
-The user interface for this labelling task looks like:
+The user interface for the labelling task in label-studio looks like:
 
 ![](../../outputs/reports/figures/label_studio.png)
 
@@ -27,4 +28,4 @@ Sometimes there were no entities to label:
 
 ### Training dataset
 
-For the current NER model, 5641 entities in 375 job adverts from our dataset of job adverts were labelled; 354 are multiskill, 4696 are skill, and 608 were experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.
+For the current NER model (20230808), 8971 entities in 500 job adverts from our dataset of job adverts were labelled; 443 are multiskill, 7313 are skill, 852 were experience entities, and 363 were benefit entities. 20% of the labelled entities were held out as a test set to evaluate the models.
diff --git a/docs/source/model_card.md b/docs/source/model_card.md
index 7afb5fdd..afe9247f 100644
--- a/docs/source/model_card.md
+++ b/docs/source/model_card.md
@@ -2,7 +2,7 @@
 
 This page contains information for different parts of the skills extraction and mapping pipeline. We detail the two main parts of the pipeline; the extract skills pipeline and the skills to taxonomy mapping pipeline.
 
-Developed by data scientists in Nesta’s Data Analytics Practice, (last updated on 23-11-2022).
+Developed by data scientists in Nesta’s Data Analytics Practice, (last updated on 29-09-2023).
 
 - [Model Card: Extract Skills](extract_skills_card)
 - [Model Card: Skills to Taxonomy Mapping](mapping_card)
@@ -17,15 +17,15 @@ _The extracting skills pipeline._
 
 ### Summary
 
-- Train a Named Entity Recognition (NER) spaCy component to extract skills, multiskills and experience entities from job adverts.
+- Train a Named Entity Recognition (NER) spaCy component to extract skills, multiskills, experience and benefits entities from job adverts.
 - Predict whether or not a skill is multi-skill or not using scikit learn's SVM model. Features are length of entity; if 'and' in entity; if ',' in entity.
 - Split multiskills, where possible, based on semantic rules.
 
 ### Training
 
-- For the NER model, 375 job adverts were labelled for skills, multiskills and experience.
-- As of 15th November 2022, **5641** entities in 375 job adverts from OJO were labelled;
-- **354** are multiskill, **4696** are skill, and **608** were experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.
+- For the NER model, 500 job adverts were labelled for skills, multiskills, experience and benefits.
+- As of 8th August 2023, **8971** entities in 500 job adverts from OJO were labelled;
+- **443** are multiskill, **7313** are skill, **852** were experience entities, and **363** were benefit entities. 20% of the labelled entities were held out as a test set to evaluate the models.
 
 The NER model we trained used [spaCy's](https://spacy.io/) NER neural network architecture. Their NER architecture _"features a sophisticated word embedding strategy using subword features and 'Bloom' embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing"_ - more about this [here](https://spacy.io/universe/project/video-spacys-ner-model).
 
@@ -33,22 +33,23 @@ You can read more about the creation of the labelling data [here](./labelling.md
 
 ### NER Metrics
 
-- A metric in the python library nerevaluate ([read more here](https://pypi.org/project/nervaluate/)) was used to calculate F1, precision and recall for the NER and SVM classifier on the held-out test set. As of 15th November 2022, the results are as follows:
+- A metric in the python library nerevaluate ([read more here](https://pypi.org/project/nervaluate/)) was used to calculate F1, precision and recall for the NER and SVM classifier on the held-out test set. As of 8th August 2023, the results are as follows:
 
 | Entity | F1 | Precision | Recall |
 | ---------- | ----- | --------- | ------ |
-| Skill | 0.586 | 0.679 | 0.515 |
-| Experience | 0.506 | 0.648 | 0.416 |
-| All | 0.563 | 0.643 | 0.500 |
+| Skill | 0.612 | 0.712 | 0.537 |
+| Experience | 0.524 | 0.647 | 0.441 |
+| Benefit | 0.531 | 0.708 | 0.425 |
+| All | 0.590 | 0.680 | 0.521 |
 
 - These metrics use partial entity matching.
-- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20220825/train_details.json`
+- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20230808/train_details.json`
 
 ### Multiskill Metrics
 
-- The same training data and held out test set used for the NER model was used to evaluate the SVM model. On a held out test set, the SVM model achieved 91% accuracy.
+- The same training data and held out test set used for the NER model was used to evaluate the SVM model. On a held out test set, the SVM model achieved 94% accuracy.
 - When evaluating the multiskill splitter algorithm rules, 253 multiskill spans were labelled as ‘good’, ‘ok’ or ‘bad’ splits. Of the 253 multiskill spans, 80 were split. Of the splits, 66% were ‘good’, 9% were ‘ok’ and 25% were ‘bad’.
-- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20220825/train_details.json`
+- More details of the evaluation performance across both the NER model and the SVM model can be found in `outputs/models/ner_model/20230808/train_details.json`
 
 ### Caveats and Recommendations
 
diff --git a/docs/source/pipeline_summary.md b/docs/source/pipeline_summary.md
index 350a0d0e..ae3b1f48 100644
--- a/docs/source/pipeline_summary.md
+++ b/docs/source/pipeline_summary.md
@@ -23,7 +23,7 @@ For further information or feedback please contact Liz Gallagher, India Kerle or
 
 ## Metrics
 
-There is no exact way to evaluate how well our pipeline works; however we have several proxies to better understand how our approach compares.
+There is no exact way to evaluate how well our pipeline works; however we have several proxies to better understand how our approach compares. The analysis in this section was performed using the results of the `20220825` model. We believe the newer `20230808` model will improve these results, but the analysis hasn't been repeated.
 
 ### Comparison 1 - Top skill groups per occupation comparison to ESCO essential skill groups per occupation
 
diff --git a/ojd_daps_skills/config/extract_skills_esco.yaml b/ojd_daps_skills/config/extract_skills_esco.yaml
index e1b1e05e..48f9b174 100644
--- a/ojd_daps_skills/config/extract_skills_esco.yaml
+++ b/ojd_daps_skills/config/extract_skills_esco.yaml
@@ -1,4 +1,4 @@
-ner_model_path: "outputs/models/ner_model/20220825/"
+ner_model_path: "outputs/models/ner_model/20230808/"
 taxonomy_name: "esco"
 taxonomy_path: "outputs/data/skill_ner_mapping/esco_data_formatted.csv"
 clean_job_ads: True
diff --git a/ojd_daps_skills/config/extract_skills_lightcast.yaml b/ojd_daps_skills/config/extract_skills_lightcast.yaml
index ed5e4d88..fd2cbb11 100644
--- a/ojd_daps_skills/config/extract_skills_lightcast.yaml
+++ b/ojd_daps_skills/config/extract_skills_lightcast.yaml
@@ -1,4 +1,4 @@
-ner_model_path: "outputs/models/ner_model/20220825/"
+ner_model_path: "outputs/models/ner_model/20230808/"
 taxonomy_name: "lightcast"
 taxonomy_path: "outputs/data/skill_ner_mapping/lightcast_data_formatted.csv"
 clean_job_ads: True
diff --git a/ojd_daps_skills/config/extract_skills_lightcast_evaluation.yaml b/ojd_daps_skills/config/extract_skills_lightcast_evaluation.yaml
index af25ac58..1a62bf84 100644
--- a/ojd_daps_skills/config/extract_skills_lightcast_evaluation.yaml
+++ b/ojd_daps_skills/config/extract_skills_lightcast_evaluation.yaml
@@ -1,4 +1,4 @@
-ner_model_path: "outputs/models/ner_model/20220825/"
+ner_model_path: "outputs/models/ner_model/20230808/"
 taxonomy_name: "lightcast"
 taxonomy_path: "escoe_extension/outputs/data/skill_ner_mapping/lightcast_data_formatted.csv"
 clean_job_ads: True
diff --git a/ojd_daps_skills/config/extract_skills_template.yaml b/ojd_daps_skills/config/extract_skills_template.yaml
index 4c0c6817..77b831a2 100644
--- a/ojd_daps_skills/config/extract_skills_template.yaml
+++ b/ojd_daps_skills/config/extract_skills_template.yaml
@@ -1,7 +1,7 @@
 #This is a template config file - we have added definitions to parameters that you will need to modify for your own taxonomy
 
 #the relative path to the trained NER model
-ner_model_path: "outputs/models/ner_model/20220825/"
+ner_model_path: "outputs/models/ner_model/20230808/"
 #the relative path to where
 taxonomy_path: "path/to/formatted_taxonomy.csv"
 #the name of your own taxonomy
diff --git a/ojd_daps_skills/config/extract_skills_toy.yaml b/ojd_daps_skills/config/extract_skills_toy.yaml
index 302dde00..2ed51fec 100644
--- a/ojd_daps_skills/config/extract_skills_toy.yaml
+++ b/ojd_daps_skills/config/extract_skills_toy.yaml
@@ -1,4 +1,4 @@
-ner_model_path: "outputs/models/ner_model/20220825/"
+ner_model_path: "outputs/models/ner_model/20230808/"
 taxonomy_name: "toy"
 taxonomy_path: ""
 clean_job_ads: True
diff --git a/ojd_daps_skills/getters/download_public_data.py b/ojd_daps_skills/getters/download_public_data.py
index 9de3e80b..b219e596 100644
--- a/ojd_daps_skills/getters/download_public_data.py
+++ b/ojd_daps_skills/getters/download_public_data.py
@@ -1,4 +1,4 @@
-from ojd_daps_skills import PUBLIC_DATA_FOLDER_NAME, PROJECT_DIR
+from ojd_daps_skills import PUBLIC_DATA_FOLDER_NAME, PROJECT_DIR, logger
 
 import os
 import boto3
@@ -7,6 +7,7 @@
 from botocore.config import Config
 from zipfile import ZipFile
 
+
 def download():
     """Download public data. Expected to run once on first use."""
     s3 = boto3.client(
@@ -25,11 +26,12 @@ def download():
             zip_ref.extractall(PROJECT_DIR)
 
         os.remove(f"{public_data_dir}.zip")
+        logger.info(f"Data folder downloaded from {public_data_dir}")
 
     except ClientError as ce:
-        print(f"Error: {ce}")
+        logger.warning(f"Error: {ce}")
     except FileNotFoundError as fnfe:
-        print(f"Error: {fnfe}")
+        logger.warning(f"Error: {fnfe}")
 
 
 if __name__ == "__main__":
diff --git a/ojd_daps_skills/pipeline/extract_skills/extract_skills.py b/ojd_daps_skills/pipeline/extract_skills/extract_skills.py
index 138a3f06..a8c07fa9 100644
--- a/ojd_daps_skills/pipeline/extract_skills/extract_skills.py
+++ b/ojd_daps_skills/pipeline/extract_skills/extract_skills.py
@@ -64,9 +64,12 @@ def __init__(
                     "Neccessary files are not downloaded. Downloading ~1GB of neccessary files."
                 )
                 download()
+            else:
+                logger.info("Model files found locally")
         else:
             self.base_path = "escoe_extension/"
             self.s3 = True
+            logger.info("Will be downloading data and models directly from S3")
             pass
 
         self.taxonomy_name = self.config["taxonomy_name"]
@@ -146,7 +149,7 @@ def load(
 
         self.nlp = self.job_ner.load_model(self.ner_model_path, s3_download=self.s3)
 
-        self.labels = self.nlp.get_pipe("ner").labels + ("MULTISKILL",)
+        self.labels = ("BENEFIT", "SKILL", "MULTISKILL", "EXPERIENCE")
 
         logger.info(f"Loading '{self.taxonomy_name}' taxonomy information")
         if self.taxonomy_name == "toy":
diff --git a/ojd_daps_skills/pipeline/skill_ner/README.md b/ojd_daps_skills/pipeline/skill_ner/README.md
index 4784ee37..ac540e2b 100644
--- a/ojd_daps_skills/pipeline/skill_ner/README.md
+++ b/ojd_daps_skills/pipeline/skill_ner/README.md
@@ -1,6 +1,6 @@
 # Skill NER
 
-## Label data
+## Label data using label-studio
 
 ### Creating a sample of the OJO data
 
@@ -79,9 +79,13 @@ For the labelling done at the end of June 2022, we labelled the chunk of 400 job
 
 The outputs of this labelled are stored in `s3://open-jobs-lake/escoe_extension/outputs/skill_span_labels/`.
 
-### Merging labelled files
+## Label data using Prodigy
 
-Since multiple people labelled files from different locations, we merge the labelled data using the following command:
+We labelled another batch of job adverts using [Prodigy](https://prodi.gy/). This was to avail of their active learning capabilities. Details of how we labelled job adverts this way are given in [the Prodigy labelling README](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/prodigy/README.md).
+
+## Merging labelled files
+
+Since multiple people labelled files from different locations, and we labelled in both label-studio and Prodigy, we merge the labelled data using the following command:
 
 ```
 python ojd_daps_skills/pipeline/skill_ner/combine_labels.py
diff --git a/ojd_daps_skills/pipeline/skill_ner/get_skills.py b/ojd_daps_skills/pipeline/skill_ner/get_skills.py
index 029ea270..456d9db4 100644
--- a/ojd_daps_skills/pipeline/skill_ner/get_skills.py
+++ b/ojd_daps_skills/pipeline/skill_ner/get_skills.py
@@ -4,7 +4,7 @@
 Running
 
 python ojd_daps_skills/pipeline/skill_ner/get_skills.py
-    --model_path outputs/models/ner_model/20220825/
+    --model_path outputs/models/ner_model/20230808/
     --output_file_dir escoe_extension/outputs/data/skill_ner/skill_predictions/
     --job_adverts_filename escoe_extension/inputs/data/skill_ner/data_sample/20220622_sampled_job_ads.json
 
@@ -40,7 +40,7 @@ def parse_arguments(parser):
     parser.add_argument(
         "--model_path",
         help="The path to the model you want to make predictions with",
-        default="outputs/models/ner_model/20220825/",
+        default="outputs/models/ner_model/20230808/",
     )
 
     parser.add_argument(
diff --git a/ojd_daps_skills/pipeline/skill_ner/ner_spacy.py b/ojd_daps_skills/pipeline/skill_ner/ner_spacy.py
index 97749d3c..18ebf7af 100644
--- a/ojd_daps_skills/pipeline/skill_ner/ner_spacy.py
+++ b/ojd_daps_skills/pipeline/skill_ner/ner_spacy.py
@@ -512,11 +512,12 @@ def load_model(self, model_folder, s3_download=True):
             self.ms_classifier = pickle.load(
                 open(os.path.join(model_folder, "ms_classifier.pkl"), "rb")
             )
+            return self.nlp
         except OSError:
-            logger.info(
+            logger.warning(
                 "Model not found locally - you may need to download it from S3 (set s3_download to True)"
             )
-        return self.nlp
+            return None
 
 
 def parse_arguments(parser):
diff --git a/ojd_daps_skills/tests/test_extract_skills.py b/ojd_daps_skills/tests/test_extract_skills.py
index 6206d4ae..e2c79ef5 100644
--- a/ojd_daps_skills/tests/test_extract_skills.py
+++ b/ojd_daps_skills/tests/test_extract_skills.py
@@ -5,8 +5,6 @@
 from ojd_daps_skills.utils.text_cleaning import short_hash
 from ojd_daps_skills.pipeline.extract_skills.extract_skills import ExtractSkills
 
-es = ExtractSkills(local=True)
-
 job_adverts = [
     "The job involves communication and maths skills",
     "The job involves excel and presenting skills. You need good excel skills",
@@ -15,10 +13,16 @@
 
 
 def test_load():
+    es = ExtractSkills(local=True)
     es.load()
 
     assert isinstance(es.nlp, spacy.lang.en.English)
-    assert es.labels == ("EXPERIENCE", "SKILL", "MULTISKILL")
+    assert all(
+        [
+            label in es.labels
+            for label in ["EXPERIENCE", "SKILL", "MULTISKILL", "BENEFIT"]
+        ]
+    )
     assert es.skill_mapper
     assert (
         len(
@@ -31,6 +35,9 @@ def test_load():
 
 
 def test_get_skills():
+    es = ExtractSkills(local=True)
+    es.load()
+
     predicted_skills = es.get_skills(job_adverts)
 
     # The keys are the labels for every job prediction
@@ -46,6 +53,9 @@ def test_get_skills():
 
 
 def test_map_skills():
+    es = ExtractSkills(local=True)
+    es.load()
+
     predicted_skills = es.get_skills(job_adverts)
     matched_skills = es.map_skills(predicted_skills)
 
@@ -56,13 +66,17 @@ def test_map_skills():
             *[[skill[1][0] for skill in skills["SKILL"]] for skills in matched_skills]
         )
     )
-    assert (
-        set(test_skills).difference(set(es.taxonomy_info["hier_name_mapper"].values()))
-        == set()
+    tax_skills_and_hier_names = set(
+        es.taxonomy_skills["description"].tolist()
+        + list(es.taxonomy_info["hier_name_mapper"].values())
     )
+    assert set(test_skills).difference(tax_skills_and_hier_names) == set()
 
 
 def test_map_no_skills():
+    es = ExtractSkills(local=True)
+    es.load()
+
     job_adverts = ["nothing", "we want excel skills", "we want communication skills"]
     extract_matched_skills = es.extract_skills(job_adverts)
     assert len(job_adverts) == len(extract_matched_skills)
@@ -72,6 +86,8 @@ def test_hardcoded_mapping():
     """
     The mapped results using the algorithm should be the same as the hardcoded results
     """
+    es = ExtractSkills(local=True)
+    es.load()
 
     hard_coded_skills = {
         "3267542715426065": {
diff --git a/ojd_daps_skills/utils/bert_vectorizer.py b/ojd_daps_skills/utils/bert_vectorizer.py
index 0b401bf9..4647a1e6 100644
--- a/ojd_daps_skills/utils/bert_vectorizer.py
+++ b/ojd_daps_skills/utils/bert_vectorizer.py
@@ -2,6 +2,7 @@
 import time
 from ojd_daps_skills import logger
 import logging
+import torch
 
 
 class BertVectorizer:
@@ -13,7 +14,7 @@ class BertVectorizer:
     def __init__(
         self,
         bert_model_name="sentence-transformers/all-MiniLM-L6-v2",
-        multi_process=True,
+        multi_process=False,
         batch_size=32,
         verbose=True,
     ):
@@ -27,7 +28,8 @@ def __init__(
             logger.setLevel(logging.ERROR)
 
     def fit(self, *_):
-        self.bert_model = SentenceTransformer(self.bert_model_name)
+        device = torch.device(f"cuda:0" if torch.cuda.is_available() else "cpu")
+        self.bert_model = SentenceTransformer(self.bert_model_name, device=device)
         self.bert_model.max_seq_length = 512
         return self
 
diff --git a/outputs/reports/skills_extraction.md b/outputs/reports/skills_extraction.md
index bd2147fc..5d71ba25 100644
--- a/outputs/reports/skills_extraction.md
+++ b/outputs/reports/skills_extraction.md
@@ -16,11 +16,11 @@ This process means we can extract skills from thousands of job adverts and analy
 
 ## Labelling data
 
-To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).
+To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/), we then did a second batch of labelled using [Prodigy](https://prodi.gy/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).
 
 ![](figures/label_studio.png)
 
-As of 11th July 2022 we have labelled 3400 entities; 404 (12%) are multiskill, 2603 (77%) are skill, and 393 (12%) are experience entities.
+As of 8th August 2023 we have labelled 8971 entities; 443 (5%) are multiskill, 7313 (82%) are skill, 852 (10%) are experience entities and 363 (4%) are benefit entities.
 
 ### Multiskill labels
 
@@ -60,7 +60,7 @@ A summary of the experiments with training the model is below.
 
 | Date (model name) | Base model | Training size | Evaluation size | Number of iterations | Drop out rate | Learning rate | Convert multiskill? | Other info | Skill F1 | Experience F1 | All F1 | Multiskill test score |
 | ----------------- | -------------- | --------------- | --------------- | -------------------- | ------------- | ------------- | ------------------- | ------------------------------------------------------------------------------------------------ | -------- | ------------- | ------ | --------------------- |
-| 20230808 | en_core_web_lg | 400 (7149 ents) | 100 (1805 ents) | 100 | 0.1 | 0.001 | True | More data, different base model, BENEFIT label data | 0.61 | 0.52 | 0.59 | 0.94 |
+| 20230808\*\* | en_core_web_lg | 400 (7149 ents) | 100 (1805 ents) | 100 | 0.1 | 0.001 | True | More data, different base model, BENEFIT label data | 0.61 | 0.52 | 0.59 | 0.94 |
 | 20220825 | blank en | 300 (4508 ents) | 75 (1133 ents) | 100 | 0.1 | 0.001 | True | Changed hyperparams, more data | 0.59 | 0.51 | 0.56 | 0.91 |
 | 20220729\* | blank en | 196 (2850 ents) | 49 (636 ents) | 50 | 0.3 | 0.001 | True | More data, padding in cleaning but do fix_entity_annotations after fix_all_formatting to sort it | 0.57 | 0.44 | 0.54 | 0.87 |
 | 20220729_nopad | blank en | 196 | 49 | 50 | 0.3 | 0.001 | True | No padding in cleaning, more data | 0.52 | 0.33 | 0.45 | 0.87 |
@@ -124,6 +124,8 @@ More in-depth metrics for `20220714`:
 
 \* For model `20220714` we relabelled the MULTISKILL labels in the dataset - we were trying to see whether some of them should actually be single skills, or could be separated into single skills rather than (as we found) labelling a large span as a multiskill. This process increased our number of labelled skill entities (from 2603 to 2887) and decreased the number of multiskill entities (from 404 to 218), resulting in a net increase in entities labelled (from 3400 to 3498).
 
+\*\* For model `20230808` we included BENEFIT labels in some of the labelled data.
+
 ### Parameter tuning
 
 For model `20220825` onwards we changed our hyperparameters after some additional experimentation revealed improvements could be made. This experimentation was on a dataset of 375 job adverts in total.