Merge pull request #5 from nestauk/add-soc-mapper
Add socmapper functionality
Showing 15 changed files with 1,355 additions and 36 deletions.
@@ -1,5 +1,104 @@
# 🖇️ NLP Link

NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word from the reference list `['cats', 'dogs', 'rats', 'birds']` is most similar to 'puppies', nlp-link will return 'dogs'.

This package can also use the same linking methodology to find the [SOC](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](./page1.md).

## 🔨 Usage

Install the package using pip:

```bash
pip install nlp-link
```

### Basic usage

Match two lists in Python:

```python
from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

# List inputs
comparison_data = ['cats', 'dogs', 'rats', 'birds']
input_data = ['owls', 'feline', 'doggies', 'dogs', 'chair']
nlp_link.load(comparison_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
```

Which outputs:

```
  input_id input_text link_id link_text  similarity
0        0       owls       3     birds    0.613577
1        1     feline       0      cats    0.669633
2        2    doggies       1      dogs    0.757443
3        3       dogs       1      dogs    1.000000
4        4      chair       0      cats    0.331178
```
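
The `similarity` column reflects how semantically close each input is to its match; since the linker works on sentence embeddings (see the embedding utilities added in this PR), an unrelated input such as 'chair' still returns its nearest match, just with a low score.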

### Extended usage

Match using dictionary inputs (where the key is a unique ID):

```python
from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

# Dict inputs
comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
nlp_link.load(comparison_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
```

Which outputs:

```
  input_id input_text link_id link_text  similarity
0        x       owls       e     birds    0.613577
1        y     feline       a      cats    0.669633
2        z    doggies       b      dogs    0.757443
3       za       dogs       b      dogs    1.000000
4       zb      chair       a      cats    0.331178
```

To output several of the most similar matches, use the `top_n` argument (`format_output` needs to be set to `False` for this):

```python
from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

comparison_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
input_data = {'x': 'pets', 'y': 'feline'}
nlp_link.load(comparison_data)
matches = nlp_link.link_dataset(input_data, top_n=2, format_output=False)
# Raw output: the top 2 matches per input
print(matches)
# Format the output for ease of reading
print({input_data[k]: [comparison_data[r] for r, _ in v] for k, v in matches.items()})
```

Which will output:

```
{'x': [['b', 0.8171109], ['a', 0.7650396]], 'y': [['a', 0.6696329], ['c', 0.5778763]]}
{'pets': ['dogs', 'cats'], 'feline': ['cats', 'kittens']}
```

The `drop_most_similar` argument can be set to `True` if you don't want to output the most similar match; this is useful when comparing a list with itself, where every item would otherwise trivially match itself. For this you would run `nlp_link.link_dataset(input_data, drop_most_similar=True)`, as sketched below.
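
A minimal sketch of this self-matching case, using the same `NLPLinker` API as in the examples above:

```python
from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

# Compare a list with itself: without drop_most_similar=True, every
# item's best match would be itself with a similarity of 1.0.
data = ['cats', 'dogs', 'kittens', 'birds']
nlp_link.load(data)
matches = nlp_link.link_dataset(data, drop_most_similar=True)
print(matches)
```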
@@ -1 +1,27 @@
# 🗺️ SOC Mapper

The SOC mapper relies on the [SOC coding index](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions) released by the ONS. This dataset contains over 30,000 job titles with their SOC codes.

The `SOCMapper` class in `soc_map.py` maps job title(s) to SOC(s).

## 🔨 Core functionality

```python
from nlp_link.soc_mapper.soc_map import SOCMapper

soc_mapper = SOCMapper()
soc_mapper.load()
job_titles = ["data scientist", "Assistant nurse", "Senior financial consultant - London"]

soc_mapper.get_soc(job_titles, return_soc_name=True)
```

Which will output:

```
[((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
```
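
Each result appears to pair the matched SOC codes with the job title from the coding index that the input was matched to: the 6-digit SOC 2020 sub-unit group code and name, the 4-digit SOC 2020 unit group code and name, and a corresponding SOC 2010 code.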

## 📖 Read more

Read more about the methods and evaluation of the SOCMapper [here](https://github.com/nestauk/nlp-link/soc_mapper/README.md).
@@ -1,3 +1,56 @@
```python
from tqdm import tqdm

import numpy as np
from sentence_transformers import SentenceTransformer
import torch

from wasabi import msg, Printer

msg_print = Printer()


def chunk_list(orig_list, n_chunks):
    # Note: n_chunks is the number of items per chunk, not the number of chunks.
    for i in range(0, len(orig_list), n_chunks):
        yield orig_list[i : i + n_chunks]


def load_bert(bert_model_name="sentence-transformers/all-MiniLM-L6-v2"):
    with msg_print.loading("Loading BERT model"):
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        bert_model = SentenceTransformer(bert_model_name, device=device)
        bert_model.max_seq_length = 512
    msg.good("BERT model loaded")
    return bert_model


def get_embeddings(
    text_list: list,
    bert_model,
    embed_chunk_size: int = 500,
    batch_size: int = 32,
) -> np.ndarray:
    """
    Get embeddings for a list of texts.

    Args:
        text_list (list): A list of texts.
        bert_model: An initialised SentenceTransformer BERT model.
        embed_chunk_size (int): The number of texts per chunk to process.
        batch_size (int): BERT batch_size.
    Returns:
        np.ndarray: The embeddings for the input list of texts.
    """
    # Ceiling division so the chunk count is never reported as 0.
    n_chunks = -(-len(text_list) // embed_chunk_size)
    msg.info(
        f"Finding embeddings for {len(text_list)} texts chunked into {n_chunks} chunks"
    )
    all_embeddings = []
    for batch_texts in tqdm(chunk_list(text_list, embed_chunk_size)):
        all_embeddings.append(
            bert_model.encode(np.array(batch_texts), batch_size=batch_size)
        )
    all_embeddings = np.concatenate(all_embeddings)
    msg.good("Texts embedded.")

    return all_embeddings
```
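
A minimal usage sketch for these helpers (the import path is a guess, since the diff view does not show the file name):

```python
# Hypothetical import path for the utilities above.
from nlp_link.utils import load_bert, get_embeddings

bert_model = load_bert()  # downloads all-MiniLM-L6-v2 on first use
texts = ["data scientist", "assistant nurse", "financial consultant"]
embeddings = get_embeddings(texts, bert_model, embed_chunk_size=2)
print(embeddings.shape)  # (3, 384): all-MiniLM-L6-v2 produces 384-dimensional vectors
```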