refer to the comparison list as the reference list throughout and update docs
lizgzil committed Dec 20, 2024
1 parent 66726db commit 48f7cd3
Showing 8 changed files with 106 additions and 69 deletions.
32 changes: 24 additions & 8 deletions README.md
@@ -1,10 +1,10 @@
# 🖇️ NLP Link

NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
NLP Link finds the most similar word (or sentences) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.

# 🗺️ SOC Mapper

Another functionality of this package is using the linking methodology to find the [Standard Occupation Classification (SOC)](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](https://github.com/nestauk/nlp-link/blob/main/docs/page1.md).
Another functionality of this package is using the linking methodology to find the [Standard Occupation Classification (SOC)](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md).

## 🔨 Usage

@@ -16,9 +16,9 @@ pip install nlp-link

### Basic usage

Note: the first time you import `NLPLinker` it will take some time to load.
> **NOTE:** The first time you import `NLPLinker` in your environment it will take some time (around a minute) to load.
Match two lists in python:
Match two lists of words or sentences in python:

```python

@@ -27,9 +27,9 @@ from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()

# list inputs
comparison_data = ['cats', 'dogs', 'rats', 'birds']
input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
nlp_link.load(comparison_data)
reference_data = ['cats', 'dogs', 'rats', 'birds']
nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -39,7 +39,7 @@ print(matches)
Which outputs:

```
input_id input_text link_id link_text similarity
input_id input_text reference_id reference_text similarity
0 0 owls 3 birds 0.613577
1 1 feline 0 cats 0.669633
2 2 doggies 1 dogs 0.757443
@@ -48,6 +48,10 @@ Which outputs:
```

These results show the most similar word from the `reference_data` list to each word in the `input_data` list. The word 'dogs' was found in both lists, so it had a similarity score of 1; 'doggies' was matched to 'dogs' since these words are very similar. The inputted word 'chair' had no very similar words - the most similar was 'cats', with a low similarity score.

> 🔍 **INFO:** Semantic similarity scores are between 0 and 1, with 0 being very dissimilar and 1 being exactly the same. This value is calculated using [a large model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) trained on datasets of sentence pairs from various websites (including Reddit comments and WikiHow). The model learns the semantic rules which link pairs of sentences - e.g. it will learn synonyms. In the above example, the reason 'chair' matches most closely to 'cats' might be that the model learned that "cats" are mentioned in relation to "chairs" (e.g. sitting on them) more often than dogs, rats, or birds are.
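
For intuition, scores like these can be reproduced directly with the `sentence-transformers` and `scikit-learn` packages. This is an editorial sketch (not part of this commit), assuming the MiniLM model linked above; the printed numbers are illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# The model referenced in the note above
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Embed the input word followed by the reference words
embeddings = model.encode(["chair", "cats", "dogs", "rats", "birds"])

# Cosine similarity of 'chair' against each reference word
scores = cosine_similarity(embeddings[:1], embeddings[1:])
print(scores)  # illustrative: [[0.41 0.30 0.25 0.28]] - all low, 'cats' highest
```
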
### SOC Mapping

Match a list of job titles to SOC codes:
@@ -68,9 +72,13 @@ Which will output
[((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
```

This nested list gives information about the most similar SOC codes for each of the three inputted job titles. The most similar extended SOC for "data scientist" was 'Statistical data scientists - 2433/04'.

More about this output format is explained in the [SOCMapper page](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md#soc_output).
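
As a rough guide to reading that structure, here is an editorial sketch of unpacking one entry. The field names, and the reading of the third element as a SOC 2010 code, are assumptions - the SOCMapper page linked above is the definitive reference.

```python
# Hypothetical unpacking of one entry of the output shown above
match = ((('2433/04', 'Statistical data scientists'),
          ('2433', 'Actuaries, economists and statisticians'),
          '2425'),
         'Data scientist')

(soc_2020_ext, soc_2020_unit, soc_2010_code), job_title = match
ext_code, ext_name = soc_2020_ext     # extended SOC 2020 code and name
unit_code, unit_name = soc_2020_unit  # 4-digit SOC 2020 unit group
print(f"{job_title}: {ext_name} ({ext_code}), "
      f"unit group {unit_name} ({unit_code}), SOC 2010 (assumed): {soc_2010_code}")
```
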

## Contributing

The instructions here are for those contrbuting to the repo.
The instructions here are for those contributing to the repo.

### Set-up

@@ -113,3 +121,11 @@ cd docs
<!-- pip install -r docs/requirements.txt -->
mkdocs serve
```

## References

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc

https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions
34 changes: 24 additions & 10 deletions docs/README.md
@@ -1,6 +1,6 @@
# 🖇️ NLP Link

NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
NLP Link finds the most similar word (or sentences) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.

Another functionality of this package is using the linking methodology to find the [SOC](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](./page1.md).

@@ -14,6 +14,8 @@ pip install nlp-link

### Basic usage

> **NOTE:** The first time you import `NLPLinker` in your environment it will take some time (around a minute) to load.
Match two lists in python:

```python
@@ -23,9 +25,9 @@ from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()

# list inputs
comparison_data = ['cats', 'dogs', 'rats', 'birds']
input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
nlp_link.load(comparison_data)
reference_data = ['cats', 'dogs', 'rats', 'birds']
nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -35,7 +37,7 @@ print(matches)
Which outputs:

```
input_id input_text link_id link_text similarity
input_id input_text reference_id reference_text similarity
0 0 owls 3 birds 0.613577
1 1 feline 0 cats 0.669633
2 2 doggies 1 dogs 0.757443
@@ -44,6 +46,10 @@ Which outputs:
```

These results show the most similar word from the `reference_data` list to each word in the `input_data` list. The word 'dogs' was found in both lists, so it had a similarity score of 1; 'doggies' was matched to 'dogs' since these words are very similar. The inputted word 'chair' had no very similar words - the most similar was 'cats', with a low similarity score.

> 🔍 **INFO:** Semantic similarity scores are between 0 and 1, with 0 being very dissimilar and 1 being exactly the same. This value is calculated using [a large model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) trained on datasets of sentence pairs from various websites (including Reddit comments and WikiHow). The model learns the semantic rules which link pairs of sentences - e.g. it will learn synonyms. In the above example, the reason 'chair' matches most closely to 'cats' might be that the model learned that "cats" are mentioned in relation to "chairs" (e.g. sitting on them) more often than dogs, rats, or birds are.

### Extended usage

Match using dictionary inputs (where the key is a unique ID):
@@ -55,9 +61,9 @@ from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()

# dict inputs
comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
reference_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
nlp_link.load(comparison_data)
nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -67,7 +73,7 @@ print(matches)
Which outputs:

```
input_id input_text link_id link_text similarity
input_id input_text reference_id reference_text similarity
0 x owls e birds 0.613577
1 y feline a cats 0.669633
2 z doggies b dogs 0.757443
@@ -84,14 +90,14 @@ from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

comparison_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
reference_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
input_data = {'x': 'pets', 'y': 'feline'}
nlp_link.load(comparison_data)
nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data, top_n=2, format_output=False)
# Top match output
print(matches)
# Format output for ease of reading
print({input_data[k]: [comparison_data[r] for r, _ in v] for k,v in matches.items()})
print({input_data[k]: [reference_data[r] for r, _ in v] for k,v in matches.items()})
```

Which will output:
@@ -102,3 +108,11 @@ Which will output:
```

The `drop_most_similar` argument can be set to True if you don't want to output the most similar match - this might be the case if you were comparing a list with itself. For this you would run `nlp_link.link_dataset(input_data, drop_most_similar=True)`.
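
For example, a minimal sketch using the same API as above, matching a list against itself:

```python
from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

words = ['cats', 'kittens', 'dogs', 'doggies']
# Load the list as the reference data, then link it to itself
nlp_link.load(words)
# drop_most_similar=True skips the top match, which here is the word itself
matches = nlp_link.link_dataset(words, drop_most_similar=True)
print(matches)  # e.g. 'kittens' should now match 'cats' rather than itself
```
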

## References

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc

https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions
3 changes: 2 additions & 1 deletion docs/mkdocs.yaml
@@ -39,6 +39,7 @@ theme:
name: Switch to light mode
nav:
- Home: README.md
- SOCMapper: page1.md
- SOCMapper - Core Usage: page1.md
- SOCMapper - Modifications, Methodology and Evaluation: https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md
plugins:
- same-dir
4 changes: 4 additions & 0 deletions docs/page1.md
@@ -22,6 +22,10 @@ Which will output
[((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
```

This nested list gives information about the most similar SOC codes for each of the three inputted job titles. The most similar extended SOC for "data scientist" was 'Statistical data scientists - 2433/04'.

More about this output format is explained in the [SOCMapper page](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md#soc_output).

## 📖 Read more

Read more about the methods and evaluation of the SOCMapper [here](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md).
60 changes: 30 additions & 30 deletions nlp_link/linker.py
@@ -8,17 +8,17 @@
nlp_link = NLPLinker()
# dict inputs
comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
reference_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
nlp_link.load(comparison_data)
nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
# list inputs
comparison_data = ['cats', 'dogs', 'rats', 'birds']
reference_data = ['cats', 'dogs', 'rats', 'birds']
input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
nlp_link.load(comparison_data)
nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -90,22 +90,22 @@ def _process_dataset(

def load(
self,
comparison_data: Union[list, dict],
reference_data: Union[list, dict],
):
"""
Load the embedding model and embed the comparison dataset
Load the embedding model and embed the reference dataset
Args:
comparison_data (Union[list, dict]): The comparison texts to find links to.
reference_data (Union[list, dict]): The reference texts to find links to.
A list of texts or a dictionary of texts where the key is the unique id.
If a list is given then a unique id will be assigned with the index order.
"""
self.bert_model = load_bert()

self.comparison_data = self._process_dataset(comparison_data)
self.comparison_data_texts = list(self.comparison_data.values())
self.comparison_data_ids = list(self.comparison_data.keys())
self.reference_data = self._process_dataset(reference_data)
self.reference_data_texts = list(self.reference_data.values())
self.reference_data_ids = list(self.reference_data.keys())

self.comparison_embeddings = self._get_embeddings(self.comparison_data_texts)
self.reference_embeddings = self._get_embeddings(self.reference_data_texts)

def _get_embeddings(self, text_list: list) -> np.array:
"""
@@ -128,8 +128,8 @@ def get_matches(
self,
input_data_ids: list,
input_embeddings: np.array,
comparison_data_ids: list,
comparison_embeddings: np.array,
reference_data_ids: list,
reference_embeddings: np.array,
top_n: int,
drop_most_similar: bool = False,
) -> dict:
@@ -139,8 +139,8 @@ def get_matches(
Args:
input_data_ids (list): The ids of the input texts.
input_embeddings (np.array): Embeddings for the input texts.
comparison_data_ids (list): The ids of the comparison texts.
comparison_embeddings (np.array): Embeddings for the comparison texts.
reference_data_ids (list): The ids of the reference texts.
reference_embeddings (np.array): Embeddings for the reference texts.
top_n (int): The number of top links to return in the output.
drop_most_similar (bool, default = False): Whether to not output the most similar match, this would be set to True if you are matching a list with itself.
@@ -158,7 +158,7 @@ def get_matches(
else:
start_n = 0

# We chunk up comparisons otherwise it can crash
# We chunk up the inputs, otherwise the similarity computation can crash
matches_topn = {}
for batch_indices in tqdm(
chunk_list(range(len(input_data_ids)), n_chunks=self.match_chunk_size)
@@ -167,18 +167,18 @@
batch_input_embeddings = [input_embeddings[i] for i in batch_indices]

batch_similarities = cosine_similarity(
batch_input_embeddings, comparison_embeddings
batch_input_embeddings, reference_embeddings
)

# Top links for each input text
for input_ix, similarities in enumerate(batch_similarities):
top_links = []
for comparison_ix in np.flip(np.argsort(similarities))[start_n:top_n]:
# comparison data id + cosine similarity score
for reference_ix in np.flip(np.argsort(similarities))[start_n:top_n]:
# reference data id + cosine similarity score
top_links.append(
[
comparison_data_ids[comparison_ix],
similarities[comparison_ix],
reference_data_ids[reference_ix],
similarities[reference_ix],
]
)
matches_topn[batch_input_ids[input_ix]] = top_links
@@ -192,10 +192,10 @@ def link_dataset(
drop_most_similar: bool = False,
) -> dict:
"""
Link a dataset to the comparison dataset.
Link a dataset to the reference dataset.
Args:
input_data (Union[list, dict]): The main dictionary to be linked to texts in the loaded comparison_data.
input_data (Union[list, dict]): The main dictionary to be linked to texts in the loaded reference_data.
A list of texts or a dictionary of texts where the key is the unique id.
If a list is given then a unique id will be assigned with the index order.
top_n (int, default = 3): The number of top links to return in the output.
Expand All @@ -204,17 +204,17 @@ def link_dataset(
drop_most_similar (bool, default = False): Whether to not output the most similar match, this would be set to True if you are matching a list with itself.
Returns:
dict: The keys are the ids of the input_data and the values are a list of lists of the top_n most similar
ids from the comparison_data and a probability score.
ids from the reference_data and a probability score.
e.g. {'x': [['a', 0.75], ['c', 0.7]], 'y': [...]}
"""

try:
msg.info(
f"Comparing {len(input_data)} input texts to {len(self.comparison_embeddings)} comparison texts"
f"Comparing {len(input_data)} input texts to {len(self.reference_embeddings)} reference texts"
)
except:
msg.warning(
"self.comparison_embeddings does not exist - you may have not run load()"
"self.reference_embeddings does not exist - you may have not run load()"
)

input_data = self._process_dataset(input_data)
@@ -226,8 +226,8 @@
self.matches_topn = self.get_matches(
input_data_ids,
input_embeddings,
self.comparison_data_ids,
self.comparison_embeddings,
self.reference_data_ids,
self.reference_embeddings,
top_n,
drop_most_similar,
)
@@ -239,8 +239,8 @@
{
"input_id": input_id,
"input_text": input_data[input_id],
"link_id": link_data[0][0],
"link_text": self.comparison_data[link_data[0][0]],
"reference_id": link_data[0][0],
"reference_text": self.reference_data[link_data[0][0]],
"similarity": link_data[0][1],
}
for input_id, link_data in self.matches_topn.items()
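
For readers following the rename through `get_matches`, here is an editorial, standalone sketch of that method's core (the chunking, progress bar, and dictionary bookkeeping of the real method are omitted):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_n_matches(input_embeddings, reference_data_ids, reference_embeddings, top_n=3):
    """Illustrative core of get_matches: for each input embedding, return
    the ids and cosine similarities of the top_n closest reference texts."""
    similarities = cosine_similarity(input_embeddings, reference_embeddings)
    matches = []
    for row in similarities:
        # Reference indices ordered from most to least similar
        order = np.flip(np.argsort(row))[:top_n]
        matches.append([[reference_data_ids[ix], row[ix]] for ix in order])
    return matches
```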