Update to latest model #205

lizgzil · 2023-09-29T09:39:37Z

Update references to the old model (20220825) to the new model (20230808)
Added the new model to a public zipped folder. For now this is ojd_daps_skills_data_new.zip, since replacing ojd_daps_skills_data.zip straight away would break things
Add info to model cards
Optimize bertvectorizer

To do:

Rename the old public zipped file to ojd_daps_skills_data_old.zip and change ojd_daps_skills_data_new.zip to ojd_daps_skills_data.zip. DO THIS AFTER THIS PR IS MERGED?
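A minimal sketch of that rename, assuming both zips stay under the same s3://open-jobs-indicators/escoe_extension/ prefix. The `run` and `swap_public_zips` helpers and the `DRY_RUN` switch are hypothetical names for illustration, not part of this repo. By default the commands are only printed.

```shell
# Sketch only: swap the public zips once this PR is merged.
# DRY_RUN=1 (the default here) prints each command instead of running it.
BUCKET_PREFIX="s3://open-jobs-indicators/escoe_extension"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

swap_public_zips() {
  # Keep the old model (20220825) available under a new name
  run aws s3 mv "$BUCKET_PREFIX/ojd_daps_skills_data.zip" "$BUCKET_PREFIX/ojd_daps_skills_data_old.zip"
  # Promote the new model (20230808) zip to the default name
  run aws s3 mv "$BUCKET_PREFIX/ojd_daps_skills_data_new.zip" "$BUCKET_PREFIX/ojd_daps_skills_data.zip"
}

swap_public_zips
```

The old zip is moved first so the default name is free before the new zip takes it. Whether public access carries over to the renamed objects depends on how the bucket grants it, so it is worth checking the download link afterwards.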

Important. I have created a new public S3 zipped file (s3://open-jobs-indicators/escoe_extension/ojd_daps_skills_data_new.zip) which has the new model (20230808) rather than the old one (20220825).

For the time being, I have kept s3://open-jobs-indicators/escoe_extension/ojd_daps_skills_data.zip as it is (i.e. with the old model). This is because, while this code is still in a PR, dev won't work with the new zipped file: it will look for the 20220825 model, which won't exist there.

This is how I updated the zipped file:

# Move over the new model
aws s3 cp --recursive s3://open-jobs-lake/escoe_extension/outputs/models/ner_model/20230808/ s3://open-jobs-indicators/escoe_extension/outputs/models/ner_model/20230808/

# Delete the old model
aws s3 rm s3://open-jobs-indicators/escoe_extension/outputs/models/ner_model/20220825/ --recursive

# Download the whole folder locally
aws s3 cp --recursive s3://open-jobs-indicators/escoe_extension/outputs/ ojd_daps_skills_data/outputs/

# zip
zip -r ojd_daps_skills_data.zip ojd_daps_skills_data/

# Upload the zipped file
aws s3 cp ojd_daps_skills_data.zip s3://open-jobs-indicators/escoe_extension/ojd_daps_skills_data_new.zip
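Before uploading, it may be worth confirming that the zip holds the 20230808 model and not the 20220825 one. The following is a rough sketch, not something run as part of this PR; `check_model_dirs` is a hypothetical helper that reads zip entry names on stdin.

```shell
# Sketch: read zip entry names on stdin and check which model folders are present.
check_model_dirs() {
  listing="$(cat)"
  new_model="outputs/models/ner_model/20230808/"
  old_model="outputs/models/ner_model/20220825/"
  # The new model folder must appear somewhere in the listing
  if ! printf '%s\n' "$listing" | grep -qF "$new_model"; then
    echo "missing new model: $new_model"
    return 1
  fi
  # The old model folder must not appear at all
  if printf '%s\n' "$listing" | grep -qF "$old_model"; then
    echo "old model still present: $old_model"
    return 1
  fi
  echo "ok"
}

# Usage (assumes Info-ZIP unzip is installed; -Z1 prints one entry name per line):
# unzip -Z1 ojd_daps_skills_data.zip | check_model_dirs
```

Reading names from stdin keeps the check independent of how the listing is produced, so the same function could be pointed at `aws s3 ls --recursive` output as well.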

Thanks for contributing to Nesta's Skills Extractor Library 🙏!

If you have suggested changes to code anywhere outside of the ExtractSkills class, please consult the checklist below.

Checklist ✔️🐍:

If you have suggested changes to documentation (and/or the ExtractSkills class), please ALSO consult the checklist below.

Documentation Checklist ✔️📚:

I have run make html in docs
I have manually reviewed the docs/build/*.html files locally to ensure they have formatted correctly
I have pushed both relevant files AND their corresponding docs/build/*.html files

lizgzil · 2023-09-29T10:42:27Z

@india-kerle - no rush on this - but I've tried to update everything with the new model details. Hope I haven't missed something?

india-kerle · 2023-12-08T12:11:20Z

ojd_daps_skills/pipeline/skill_ner/README.md


-Since multiple people labelled files from different locations, we merge the labelled data using the following command:
+We labelled another batch of job adverts using [Prodigy](https://prodi.gy/). This was to avail of their active learning capabilities. Details of how we labelled job adverts this way are given in [the Prodigy labelling README](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/prodigy/README.md).


Suggested change

-We labelled another batch of job adverts using [Prodigy](https://prodi.gy/). This was to avail of their active learning capabilities. Details of how we labelled job adverts this way are given in [the Prodigy labelling README](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/prodigy/README.md).
+We labelled another batch of job adverts using [Prodigy](https://prodi.gy/). This was to make use of their active learning capabilities. Details of how we labelled job adverts this way are given in [the Prodigy labelling README](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/prodigy/README.md).

india-kerle · 2023-12-08T12:11:56Z

outputs/reports/skills_extraction.md

@@ -16,11 +16,11 @@ This process means we can extract skills from thousands of job adverts and analy

 ## Labelling data

-To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).
+To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/), we then did a second batch of labelled using [Prodigy](https://prodi.gy/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).


Suggested change

-To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/), we then did a second batch of labelled using [Prodigy](https://prodi.gy/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).
+To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using [Label Studio](https://labelstud.io/). We then did a second batch of labelled using [Prodigy](https://prodi.gy/). More about this labelling process can be found in the `skill_ner` pipeline [README.md](./ojd_daps_skills/ojd_daps_skills/pipeline/skill_ner/README.md).

india-kerle

Looks great! thanks for these changes :)

lizgzil requested a review from india-kerle September 29, 2023 10:36

lizgzil marked this pull request as ready for review September 29, 2023 10:37

india-kerle reviewed Dec 8, 2023


lizgzil added 4 commits December 8, 2023 12:26

Update configs to latest model

5732bde

Update model cards and readmes with new model info

d40479e

Add disclaimer to pipeline metric summary analysis

3adba4d

Use gpu option in bervectorizer

93a3336

lizgzil force-pushed the new-model-update branch from 5635bc1 to 93a3336 Compare December 8, 2023 12:27

lizgzil added 4 commits December 8, 2023 13:08

Add some model and data folder download messages

925b1b6

Dont use args in tests

4d32e01

use logger in downlon public data error

cb931c7

Update tests, and logs, and only output the 4 entities we care about

535e770

india-kerle approved these changes Dec 8, 2023


lizgzil merged commit ade0887 into dev Dec 8, 2023
4 checks passed

lizgzil deleted the new-model-update branch December 8, 2023 14:37
