Merge pull request #16 from nestauk/roisin-suggestions

Improvements

lizgzil authored Dec 20, 2024
2 parents f4cedb2 + b04da43 commit 4c782b8

Showing 12 changed files with 157 additions and 86 deletions.
32 changes: 25 additions & 7 deletions README.md
@@ -1,10 +1,10 @@
# 🖇️ NLP Link

-NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
+NLP Link finds the most similar word (or sentence) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.

# 🗺️ SOC Mapper

-Another functionality of this package is using the linking methodology to find the [Standard Occupation Classification (SOC)](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](https://github.com/nestauk/nlp-link/blob/main/docs/page1.md).
+Another functionality of this package is using the linking methodology to find the [Standard Occupation Classification (SOC)](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md).

## 🔨 Usage

@@ -16,7 +16,9 @@ pip install nlp-link

### Basic usage

-Match two lists in python:
+> **NOTE:** The first time you import `NLPLinker` in your environment it will take some time (around a minute) to load.
+
+Match two lists of words or sentences in python:

```python

@@ -25,9 +27,9 @@ from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()

# list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+reference_data = ['cats', 'dogs', 'rats', 'birds']
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -37,7 +39,7 @@ print(matches)
Which outputs:

```
-   input_id input_text link_id link_text similarity
+   input_id input_text reference_id reference_text similarity
0 0 owls 3 birds 0.613577
1 1 feline 0 cats 0.669633
2 2 doggies 1 dogs 0.757443
@@ -46,6 +48,10 @@ Which outputs:
```

These results show the most similar word from the `reference_data` list to each word in the `input_data` list. The word 'dogs' was found in both lists, so it had a similarity score of 1; 'doggies' was matched to 'dogs' since these words are very similar. The inputted word 'chair' had no close matches - the most similar was 'cats', with a low similarity score.

+> 🔍 **INFO:** Semantic similarity scores are between 0 and 1, with 0 being very dissimilar and 1 being exactly the same. This value is calculated using [a large model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) trained on datasets of sentence pairs from various websites (including Reddit comments and WikiHow). The model learns the semantic rules which link pairs of sentences - e.g. it will learn synonyms. In the above example, the reason 'chair' matches most closely to 'cats' might be that the model learned that "cats" are mentioned in relation to "chairs" (e.g. sitting on them) more often than dogs, rats, or birds are.
+
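To make these scores concrete, the calculation behind a single score can be sketched in a few lines. This is a minimal illustration that calls the `all-MiniLM-L6-v2` model linked above directly rather than through nlp-link's own wrapper, so exact values may differ slightly between model and library versions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Embed the input word and the reference words, then score each pair
# by the cosine similarity of their embeddings.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

input_embedding = model.encode(["chair"])
reference_embeddings = model.encode(["cats", "dogs", "rats", "birds"])

# One score per reference word; higher means more semantically similar.
scores = cosine_similarity(input_embedding, reference_embeddings)[0]
print(dict(zip(["cats", "dogs", "rats", "birds"], scores.round(3))))
```
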
### SOC Mapping

Match a list of job titles to SOC codes:
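The usage snippet for this example is collapsed in this diff view. As a rough sketch of the call - the import path, `load()` step, and `get_soc` signature follow the SOCMapper README linked below, so treat them as assumptions rather than a definitive API:

```python
from nlp_link.soc_mapper.soc_map import SOCMapper

# Load and embed the SOC coding index; the first run downloads data.
soc_mapper = SOCMapper()
soc_mapper.load()

job_titles = ["data scientist", "Assistant nurse", "Financial consultant"]
matches = soc_mapper.get_soc(job_titles, return_soc_name=True)
print(matches)
```
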
@@ -66,9 +72,13 @@ Which will output
[((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
```

+This nested list gives information about the most similar SOC codes for each of the three inputted job titles. The most similar extended SOC for "data scientist" was 'Statistical data scientists - 2433/04'.
+
+More about this output format is explained in the [SOCMapper page](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md#soc_output).
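
As a hypothetical illustration of that structure, each element appears to pair a job title with three levels of SOC information - the extended SOC 2020 code, the 4-digit SOC 2020 code, and (per the SOCMapper README) the SOC 2010 code - which can be unpacked like this:

```python
# One element of the nested output above (values copied from the example).
match = (
    (("2433/04", "Statistical data scientists"),
     ("2433", "Actuaries, economists and statisticians"),
     "2425"),  # assumed to be the SOC 2010 code
    "Data scientist",
)

(extended_soc, soc_4digit, soc_2010), job_title = match
ext_code, ext_name = extended_soc
print(f"{job_title}: {ext_name} ({ext_code}); SOC 2010: {soc_2010}")
```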

## Contributing

-The instructions here are for those contrbuting to the repo.
+The instructions here are for those contributing to the repo.

### Set-up

@@ -111,3 +121,11 @@ cd docs
<!-- pip install -r docs/requirements.txt -->
mkdocs serve
```

+## References
+
+https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions
36 changes: 25 additions & 11 deletions docs/README.md
@@ -1,6 +1,6 @@
# 🖇️ NLP Link

-NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
+NLP Link finds the most similar word (or sentence) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.

Another functionality of this package is using the linking methodology to find the [SOC](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](./page1.md).

@@ -14,6 +14,8 @@ pip install nlp-link

### Basic usage

+> **NOTE:** The first time you import `NLPLinker` in your environment it will take some time (around a minute) to load.
+
Match two lists in python:

```python
@@ -23,9 +25,9 @@ from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()

# list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+reference_data = ['cats', 'dogs', 'rats', 'birds']
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -35,7 +37,7 @@ print(matches)
Which outputs:

```
-   input_id input_text link_id link_text similarity
+   input_id input_text reference_id reference_text similarity
0 0 owls 3 birds 0.613577
1 1 feline 0 cats 0.669633
2 2 doggies 1 dogs 0.757443
@@ -44,6 +46,10 @@ Which outputs:
```

These results show the most similar word from the `reference_data` list to each word in the `input_data` list. The word 'dogs' was found in both lists, so it had a similarity score of 1; 'doggies' was matched to 'dogs' since these words are very similar. The inputted word 'chair' had no close matches - the most similar was 'cats', with a low similarity score.

+> 🔍 **INFO:** Semantic similarity scores are between 0 and 1, with 0 being very dissimilar and 1 being exactly the same. This value is calculated using [a large model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) trained on datasets of sentence pairs from various websites (including Reddit comments and WikiHow). The model learns the semantic rules which link pairs of sentences - e.g. it will learn synonyms. In the above example, the reason 'chair' matches most closely to 'cats' might be that the model learned that "cats" are mentioned in relation to "chairs" (e.g. sitting on them) more often than dogs, rats, or birds are.
+
### Extended usage

Match using dictionary inputs (where the key is a unique ID):
@@ -55,9 +61,9 @@ from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()

# dict inputs
-comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -67,7 +73,7 @@ print(matches)
Which outputs:

```
-   input_id input_text link_id link_text similarity
+   input_id input_text reference_id reference_text similarity
0 x owls e birds 0.613577
1 y feline a cats 0.669633
2 z doggies b dogs 0.757443
@@ -76,22 +82,22 @@ Which outputs:
```

-Output several most similar matches using the `top_n` argument (`format_output` needs to be set to False for this):
+Output the top n most similar reference word matches using the `top_n` argument (`format_output` needs to be set to False for this):

```python

from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

-comparison_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'pets', 'y': 'feline'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data, top_n=2, format_output=False)
# Top match output
print(matches)
# Format output for ease of reading
-print({input_data[k]: [comparison_data[r] for r, _ in v] for k,v in matches.items()})
+print({input_data[k]: [reference_data[r] for r, _ in v] for k,v in matches.items()})
```

Which will output:
@@ -102,3 +108,11 @@ Which will output:
```

The `drop_most_similar` argument can be set to True if you don't want to output the most similar match - this might be the case if you were comparing a list with itself. For this you would run `nlp_link.link_dataset(input_data, drop_most_similar=True)`.
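
For instance, a small self-matching sketch using the same API as above (illustrative only):

```python
from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

# Match a list against itself; without drop_most_similar every word
# would simply be matched to itself with a similarity score of 1.
animals = ['cats', 'dogs', 'rats', 'birds']
nlp_link.load(animals)
matches = nlp_link.link_dataset(animals, drop_most_similar=True)
print(matches)
```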

+## References
+
+https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions
3 changes: 2 additions & 1 deletion docs/mkdocs.yaml
@@ -39,6 +39,7 @@ theme:
name: Switch to light mode
nav:
- Home: README.md
-  - SOCMapper: page1.md
+  - SOCMapper - Core Usage: page1.md
+  - SOCMapper - Modifications, Methodology and Evaluation: https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md
plugins:
- same-dir
4 changes: 4 additions & 0 deletions docs/page1.md
@@ -22,6 +22,10 @@ Which will output
[((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
```

+This nested list gives information about the most similar SOC codes for each of the three inputted job titles. The most similar extended SOC for "data scientist" was 'Statistical data scientists - 2433/04'.
+
+More about this output format is explained in the [SOCMapper page](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md#soc_output).

## 📖 Read more

Read more about the methods and evaluation of the SOCMapper [here](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md).
60 changes: 30 additions & 30 deletions nlp_link/linker.py
@@ -8,17 +8,17 @@
nlp_link = NLPLinker()
# dict inputs
-comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
# list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
+reference_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -90,22 +90,22 @@ def _process_dataset(

     def load(
         self,
-        comparison_data: Union[list, dict],
+        reference_data: Union[list, dict],
     ):
         """
-        Load the embedding model and embed the comparison dataset
+        Load the embedding model and embed the reference dataset
 
         Args:
-            comparison_data (Union[list, dict]): The comparison texts to find links to.
+            reference_data (Union[list, dict]): The reference texts to find links to.
                 A list of texts or a dictionary of texts where the key is the unique id.
                 If a list is given then a unique id will be assigned with the index order.
         """
         self.bert_model = load_bert()
 
-        self.comparison_data = self._process_dataset(comparison_data)
-        self.comparison_data_texts = list(self.comparison_data.values())
-        self.comparison_data_ids = list(self.comparison_data.keys())
+        self.reference_data = self._process_dataset(reference_data)
+        self.reference_data_texts = list(self.reference_data.values())
+        self.reference_data_ids = list(self.reference_data.keys())
 
-        self.comparison_embeddings = self._get_embeddings(self.comparison_data_texts)
+        self.reference_embeddings = self._get_embeddings(self.reference_data_texts)

def _get_embeddings(self, text_list: list) -> np.array:
"""
@@ -128,8 +128,8 @@ def get_matches(
         self,
         input_data_ids: list,
         input_embeddings: np.array,
-        comparison_data_ids: list,
-        comparison_embeddings: np.array,
+        reference_data_ids: list,
+        reference_embeddings: np.array,
         top_n: int,
         drop_most_similar: bool = False,
     ) -> dict:
@@ -139,8 +139,8 @@ def get_matches(
         Args:
             input_data_ids (list): The ids of the input texts.
             input_embeddings (np.array): Embeddings for the input texts.
-            comparison_data_ids (list): The ids of the comparison texts.
-            comparison_embeddings (np.array): Embeddings for the comparison texts.
+            reference_data_ids (list): The ids of the reference texts.
+            reference_embeddings (np.array): Embeddings for the reference texts.
             top_n (int): The number of top links to return in the output.
             drop_most_similar (bool, default = False): Whether to not output the most similar match, this would be set to True if you are matching a list with itself.
@@ -158,7 +158,7 @@ def get_matches(
else:
start_n = 0

-        # We chunk up comparisons otherwise it can crash
+        # We chunk up the reference list otherwise it can crash
matches_topn = {}
for batch_indices in tqdm(
chunk_list(range(len(input_data_ids)), n_chunks=self.match_chunk_size)
@@ -167,18 +167,18 @@
batch_input_embeddings = [input_embeddings[i] for i in batch_indices]

             batch_similarities = cosine_similarity(
-                batch_input_embeddings, comparison_embeddings
+                batch_input_embeddings, reference_embeddings
             )

# Top links for each input text
for input_ix, similarities in enumerate(batch_similarities):
top_links = []
-            for comparison_ix in np.flip(np.argsort(similarities))[start_n:top_n]:
-                # comparison data id + cosine similarity score
+            for reference_ix in np.flip(np.argsort(similarities))[start_n:top_n]:
+                # reference data id + cosine similarity score
                 top_links.append(
                     [
-                        comparison_data_ids[comparison_ix],
-                        similarities[comparison_ix],
+                        reference_data_ids[reference_ix],
+                        similarities[reference_ix],
                     ]
                 )
matches_topn[batch_input_ids[input_ix]] = top_links
@@ -192,10 +192,10 @@ def link_dataset(
drop_most_similar: bool = False,
) -> dict:
"""
-        Link a dataset to the comparison dataset.
+        Link a dataset to the reference dataset.
 
         Args:
-            input_data (Union[list, dict]): The main dictionary to be linked to texts in the loaded comparison_data.
+            input_data (Union[list, dict]): The main dictionary to be linked to texts in the loaded reference_data.
A list of texts or a dictionary of texts where the key is the unique id.
If a list is given then a unique id will be assigned with the index order.
top_n (int, default = 3): The number of top links to return in the output.
@@ -204,17 +204,17 @@ def link_dataset(
drop_most_similar (bool, default = False): Whether to not output the most similar match, this would be set to True if you are matching a list with itself.
Returns:
dict: The keys are the ids of the input_data and the values are a list of lists of the top_n most similar
-                ids from the comparison_data and a probability score.
+                ids from the reference_data and a similarity score.
e.g. {'x': [['a', 0.75], ['c', 0.7]], 'y': [...]}
"""

try:
msg.info(
f"Comparing {len(input_data)} input texts to {len(self.comparison_embeddings)} comparison texts"
f"Comparing {len(input_data)} input texts to {len(self.reference_embeddings)} reference texts"
)
except:
msg.warning(
"self.comparison_embeddings does not exist - you may have not run load()"
"self.reference_embeddings does not exist - you may have not run load()"
)

input_data = self._process_dataset(input_data)
@@ -226,8 +226,8 @@ def link_dataset(
self.matches_topn = self.get_matches(
input_data_ids,
input_embeddings,
-            self.comparison_data_ids,
-            self.comparison_embeddings,
+            self.reference_data_ids,
+            self.reference_embeddings,
top_n,
drop_most_similar,
)
@@ -239,8 +239,8 @@ def link_dataset(
{
"input_id": input_id,
"input_text": input_data[input_id],
"link_id": link_data[0][0],
"link_text": self.comparison_data[link_data[0][0]],
"reference_id": link_data[0][0],
"reference_text": self.reference_data[link_data[0][0]],
"similarity": link_data[0][1],
}
for input_id, link_data in self.matches_topn.items()
6 changes: 4 additions & 2 deletions nlp_link/linker_utils.py
@@ -1,11 +1,13 @@
 from tqdm import tqdm
 
 import numpy as np
 from sentence_transformers import SentenceTransformer
+import torch
 
 from wasabi import msg, Printer
 
+import os
+
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
 msg_print = Printer()

