Refactor ojd_daps_skills #223
… - e.g. adding both the split and unsplit version of a multiskill entity
@Jack-Vines I've had a look at this PR and I think I've spotted the code issues that needed changing and changed them. I trained a new NER model and Multiskill model (and uploaded them to huggingface) after realising they were trained on an old (and smaller) version of the training data. Still to do:
Deep dive into the results pre and post refactor
I wanted to know whether the results changed between before and after the refactor. So, using the same sample of 1000 OJO job adverts:
TL;DR
Full results
The results are not the same. Generally, more skills are extracted with the new method. Original number of skills: [5, 10, 16] (25%, 50%, 75% percentiles).
The "top ESCO skills" extracted are similar, but not the same. The number of occurrences of each unique ESCO skill is correlated across the 2 methods.
📏 The lengths of the skills are basically the same. In the original, the mean length was 29.4 with an IQR of [14, 23, 38]; in the new version, the mean length is 30.8 with [14, 25, 40].
🎉 The matching is the same though. If the same skill entity is extracted by both methods, it is always matched to the same ESCO skill.
🔬 The models are different. I tested this with the following code:
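The snippet referred to above is not included here. Separately, the comparisons reported in these results (quartiles of skills per advert, correlation of ESCO skill occurrences, and consistency of matches) could be reproduced along the following lines. This is only a sketch: the function names and the toy data are hypothetical and are not part of ojd_daps_skills.

```python
from collections import Counter
from statistics import quantiles


def skill_count_quartiles(skills_per_advert):
    """Return the 25th, 50th and 75th percentiles of skills per advert."""
    counts = [len(skills) for skills in skills_per_advert]
    return quantiles(counts, n=4, method="inclusive")


def occurrence_correlation(old_matches, new_matches):
    """Pearson correlation of per-ESCO-skill occurrence counts across two runs.

    Assumes the counts are not all identical (otherwise the variance is zero).
    """
    old_counts, new_counts = Counter(old_matches), Counter(new_matches)
    skills = sorted(set(old_counts) | set(new_counts))
    xs = [old_counts[s] for s in skills]
    ys = [new_counts[s] for s in skills]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def inconsistent_matches(old_map, new_map):
    """Entities found by both runs that were matched to different ESCO skills."""
    shared = old_map.keys() & new_map.keys()
    return {e: (old_map[e], new_map[e]) for e in shared if old_map[e] != new_map[e]}


# Toy data (made up): extracted skills per advert for each run.
old_run = [["communication", "Excel"], ["Excel"], ["teamwork", "Excel", "communication"]]
new_run = [["communication", "Excel", "teamwork"], ["Excel"], ["teamwork", "Excel"]]

old_quartiles = skill_count_quartiles(old_run)
new_quartiles = skill_count_quartiles(new_run)
```

An empty dictionary from `inconsistent_matches` would correspond to the observation that shared entities are always matched to the same ESCO skill.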
This PR is a major refactor of the current ojd_daps_skills library. It does a number of things:
- poetry for dependency management: hopefully, this will sort out some of the dependency issues that users are currently having.
- Pydantic for more enforceable type hints.
- Pathlib: I've created two Config classes that help in downloading data from s3 and models from huggingface hub, to hopefully avoid the "JobNer does not exist" drama.

In addition to ensuring that the results are similar to last time's, there are additional things outstanding:
- The docs directory has been removed, so it needs to be re-added with updated information on how to use the library and updated model performance (which can be found on huggingface).
- Adding windows-latest to the action.
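To make the Pathlib and Config-class point in the description more concrete, here is a minimal sketch of the general idea: a small config object that resolves where a downloaded model should live and fails with a clear message when it is missing. Everything here is an assumption for illustration. `ModelConfig`, `require_model`, and the directory layout are hypothetical, and a stdlib dataclass is used as a stand-in for the Pydantic models the PR mentions. The actual download from s3 or huggingface hub is not shown.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical config describing where a downloaded model is expected."""

    cache_dir: Path
    model_name: str

    @property
    def model_path(self) -> Path:
        # pathlib joins path segments portably across operating systems.
        return self.cache_dir / self.model_name

    def require_model(self) -> Path:
        """Return the model directory, or raise a descriptive error if absent."""
        if not self.model_path.is_dir():
            raise FileNotFoundError(
                f"Model '{self.model_name}' not found at {self.model_path}; "
                "download it before loading."
            )
        return self.model_path


# Usage with a temporary directory standing in for a real cache.
with tempfile.TemporaryDirectory() as tmp:
    present = ModelConfig(cache_dir=Path(tmp), model_name="ner_model")
    present.model_path.mkdir()
    found_name = present.require_model().name

    missing = ModelConfig(cache_dir=Path(tmp), model_name="not_downloaded")
    try:
        missing.require_model()
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Checking for the model up front like this turns a confusing load-time failure into an explicit "not downloaded" error, which is the kind of problem the Config classes are described as avoiding.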