Merge pull request #16 from nestauk/roisin-suggestions

Improvements

lizgzil authored Dec 20, 2024
2 parents f4cedb2 + b04da43 commit 4c782b8

Showing 12 changed files with 157 additions and 86 deletions.
32 changes: 25 additions & 7 deletions README.md
@@ -1,10 +1,10 @@
# 🖇️ NLP Link

-NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
+NLP Link finds the most similar word (or sentence) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.

# 🗺️ SOC Mapper

-Another functionality of this package is using the linking methodology to find the [Standard Occupation Classification (SOC)](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](https://github.com/nestauk/nlp-link/blob/main/docs/page1.md).
+Another functionality of this package is using the linking methodology to find the [Standard Occupation Classification (SOC)](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md).

## 🔨 Usage

@@ -16,7 +16,9 @@ pip install nlp-link

### Basic usage

-Match two lists in python:
+> **NOTE:** The first time you import `NLPLinker` in your environment it will take some time (around a minute) to load.
+
+Match two lists of words or sentences in python:

```python

@@ -25,9 +27,9 @@ from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()

# list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+reference_data = ['cats', 'dogs', 'rats', 'birds']
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -37,7 +39,7 @@ print(matches)
Which outputs:

```
-   input_id input_text link_id link_text similarity
+   input_id input_text reference_id reference_text similarity
0 0 owls 3 birds 0.613577
1 1 feline 0 cats 0.669633
2 2 doggies 1 dogs 0.757443
@@ -46,6 +48,10 @@ Which outputs:
```

These results show the most similar word from the `reference_data` list to each word in the `input_data` list. The word 'dogs' was found in both lists, so it had a similarity score of 1; 'doggies' was matched to 'dogs' since these words are very similar. The inputted word 'chair' had no close matches - the most similar was 'cats', with a low similarity score.

+> 🔍 **INFO:** Semantic similarity scores are between 0 and 1, with 0 being very dissimilar and 1 being exactly the same. This value is calculated using [a large model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) trained on datasets of sentence pairs from various websites (including Reddit comments and WikiHow). The model learns the semantic rules which link pairs of sentences - e.g. it will learn synonyms. In the above example, the reason 'chair' matches most closely to 'cats' might be that the model learned that "cats" are mentioned in relation to "chairs" (e.g. sitting on them) more often than dogs, rats, or birds are.
+
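To make these scores concrete, the calculation behind a single score can be sketched in a few lines. This is a minimal illustration that calls the `all-MiniLM-L6-v2` model linked above directly rather than through nlp-link's own wrapper, so exact values may differ slightly between model and library versions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Embed the input word and the reference words, then score each pair
# by the cosine similarity of their embeddings.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

input_embedding = model.encode(["chair"])
reference_embeddings = model.encode(["cats", "dogs", "rats", "birds"])

# One score per reference word; higher means more semantically similar.
scores = cosine_similarity(input_embedding, reference_embeddings)[0]
print(dict(zip(["cats", "dogs", "rats", "birds"], scores.round(3))))
```
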
### SOC Mapping

Match a list of job titles to SOC codes:
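The usage snippet for this example is collapsed in this diff view. As a rough sketch of the call - the import path, `load()` step, and `get_soc` signature follow the SOCMapper README linked below, so treat them as assumptions rather than a definitive API:

```python
from nlp_link.soc_mapper.soc_map import SOCMapper

# Load and embed the SOC coding index; the first run downloads data.
soc_mapper = SOCMapper()
soc_mapper.load()

job_titles = ["data scientist", "Assistant nurse", "Financial consultant"]
matches = soc_mapper.get_soc(job_titles, return_soc_name=True)
print(matches)
```
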
@@ -66,9 +72,13 @@ Which will output
[((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
```

+This nested list gives information about the most similar SOC codes for each of the three inputted job titles. The most similar extended SOC for "data scientist" was 'Statistical data scientists - 2433/04'.
+
+More about this output format is explained in the [SOCMapper page](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md#soc_output).
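
As a hypothetical illustration of that structure, each element appears to pair a job title with three levels of SOC information - the extended SOC 2020 code, the 4-digit SOC 2020 code, and (per the SOCMapper README) the SOC 2010 code - which can be unpacked like this:

```python
# One element of the nested output above (values copied from the example).
match = (
    (("2433/04", "Statistical data scientists"),
     ("2433", "Actuaries, economists and statisticians"),
     "2425"),  # assumed to be the SOC 2010 code
    "Data scientist",
)

(extended_soc, soc_4digit, soc_2010), job_title = match
ext_code, ext_name = extended_soc
print(f"{job_title}: {ext_name} ({ext_code}); SOC 2010: {soc_2010}")
```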

## Contributing

-The instructions here are for those contrbuting to the repo.
+The instructions here are for those contributing to the repo.

### Set-up

@@ -111,3 +121,11 @@ cd docs
<!-- pip install -r docs/requirements.txt -->
mkdocs serve
```

+## References
+
+https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions
36 changes: 25 additions & 11 deletions docs/README.md
@@ -1,6 +1,6 @@
# 🖇️ NLP Link

-NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
+NLP Link finds the most similar word (or sentence) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.

Another functionality of this package is using the linking methodology to find the [SOC](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](./page1.md).

@@ -14,6 +14,8 @@ pip install nlp-link

### Basic usage

+> **NOTE:** The first time you import `NLPLinker` in your environment it will take some time (around a minute) to load.
+
Match two lists in python:

```python
@@ -23,9 +25,9 @@ from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()

# list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+reference_data = ['cats', 'dogs', 'rats', 'birds']
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -35,7 +37,7 @@ print(matches)
Which outputs:

```
-   input_id input_text link_id link_text similarity
+   input_id input_text reference_id reference_text similarity
0 0 owls 3 birds 0.613577
1 1 feline 0 cats 0.669633
2 2 doggies 1 dogs 0.757443
@@ -44,6 +46,10 @@ Which outputs:
```

These results show the most similar word from the `reference_data` list to each word in the `input_data` list. The word 'dogs' was found in both lists, so it had a similarity score of 1; 'doggies' was matched to 'dogs' since these words are very similar. The inputted word 'chair' had no close matches - the most similar was 'cats', with a low similarity score.

+> 🔍 **INFO:** Semantic similarity scores are between 0 and 1, with 0 being very dissimilar and 1 being exactly the same. This value is calculated using [a large model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) trained on datasets of sentence pairs from various websites (including Reddit comments and WikiHow). The model learns the semantic rules which link pairs of sentences - e.g. it will learn synonyms. In the above example, the reason 'chair' matches most closely to 'cats' might be that the model learned that "cats" are mentioned in relation to "chairs" (e.g. sitting on them) more often than dogs, rats, or birds are.
+
### Extended usage

Match using dictionary inputs (where the key is a unique ID):
@@ -55,9 +61,9 @@ from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()

# dict inputs
-comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -67,7 +73,7 @@ print(matches)
Which outputs:

```
-   input_id input_text link_id link_text similarity
+   input_id input_text reference_id reference_text similarity
0 x owls e birds 0.613577
1 y feline a cats 0.669633
2 z doggies b dogs 0.757443
@@ -76,22 +82,22 @@ Which outputs:
```

-Output several most similar matches using the `top_n` argument (`format_output` needs to be set to False for this):
+Output the top n most similar reference word matches using the `top_n` argument (`format_output` needs to be set to False for this):

```python

from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

-comparison_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'pets', 'y': 'feline'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data, top_n=2, format_output=False)
# Top match output
print(matches)
# Format output for ease of reading
-print({input_data[k]: [comparison_data[r] for r, _ in v] for k,v in matches.items()})
+print({input_data[k]: [reference_data[r] for r, _ in v] for k,v in matches.items()})
```

Which will output:
@@ -102,3 +108,11 @@ Which will output:
```

The `drop_most_similar` argument can be set to True if you don't want to output the most similar match - this might be the case if you were comparing a list with itself. For this you would run `nlp_link.link_dataset(input_data, drop_most_similar=True)`.
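
For instance, a small self-matching sketch using the same API as above (illustrative only):

```python
from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

# Match a list against itself; without drop_most_similar every word
# would simply be matched to itself with a similarity score of 1.
animals = ['cats', 'dogs', 'rats', 'birds']
nlp_link.load(animals)
matches = nlp_link.link_dataset(animals, drop_most_similar=True)
print(matches)
```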

+## References
+
+https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions
3 changes: 2 additions & 1 deletion docs/mkdocs.yaml
@@ -39,6 +39,7 @@ theme:
name: Switch to light mode
nav:
- Home: README.md
-  - SOCMapper: page1.md
+  - SOCMapper - Core Usage: page1.md
+  - SOCMapper - Modifications, Methodology and Evaluation: https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md
plugins:
- same-dir
4 changes: 4 additions & 0 deletions docs/page1.md
@@ -22,6 +22,10 @@ Which will output
[((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
```

+This nested list gives information about the most similar SOC codes for each of the three inputted job titles. The most similar extended SOC for "data scientist" was 'Statistical data scientists - 2433/04'.
+
+More about this output format is explained in the [SOCMapper page](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md#soc_output).

## 📖 Read more

Read more about the methods and evaluation of the SOCMapper [here](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md).
60 changes: 30 additions & 30 deletions nlp_link/linker.py
@@ -8,17 +8,17 @@
nlp_link = NLPLinker()
# dict inputs
-comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
# list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
+reference_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
@@ -90,22 +90,22 @@ def _process_dataset(

     def load(
         self,
-        comparison_data: Union[list, dict],
+        reference_data: Union[list, dict],
     ):
         """
-        Load the embedding model and embed the comparison dataset
+        Load the embedding model and embed the reference dataset
 
         Args:
-            comparison_data (Union[list, dict]): The comparison texts to find links to.
+            reference_data (Union[list, dict]): The reference texts to find links to.
                 A list of texts or a dictionary of texts where the key is the unique id.
                 If a list is given then a unique id will be assigned with the index order.
         """
         self.bert_model = load_bert()
 
-        self.comparison_data = self._process_dataset(comparison_data)
-        self.comparison_data_texts = list(self.comparison_data.values())
-        self.comparison_data_ids = list(self.comparison_data.keys())
+        self.reference_data = self._process_dataset(reference_data)
+        self.reference_data_texts = list(self.reference_data.values())
+        self.reference_data_ids = list(self.reference_data.keys())
 
-        self.comparison_embeddings = self._get_embeddings(self.comparison_data_texts)
+        self.reference_embeddings = self._get_embeddings(self.reference_data_texts)

def _get_embeddings(self, text_list: list) -> np.array:
"""
@@ -128,8 +128,8 @@ def get_matches(
         self,
         input_data_ids: list,
         input_embeddings: np.array,
-        comparison_data_ids: list,
-        comparison_embeddings: np.array,
+        reference_data_ids: list,
+        reference_embeddings: np.array,
         top_n: int,
         drop_most_similar: bool = False,
     ) -> dict:
@@ -139,8 +139,8 @@ def get_matches(
         Args:
             input_data_ids (list): The ids of the input texts.
             input_embeddings (np.array): Embeddings for the input texts.
-            comparison_data_ids (list): The ids of the comparison texts.
-            comparison_embeddings (np.array): Embeddings for the comparison texts.
+            reference_data_ids (list): The ids of the reference texts.
+            reference_embeddings (np.array): Embeddings for the reference texts.
             top_n (int): The number of top links to return in the output.
             drop_most_similar (bool, default = False): Whether to not output the most similar match, this would be set to True if you are matching a list with itself.
@@ -158,7 +158,7 @@ def get_matches(
else:
start_n = 0

-        # We chunk up comparisons otherwise it can crash
+        # We chunk up the reference list otherwise it can crash
matches_topn = {}
for batch_indices in tqdm(
chunk_list(range(len(input_data_ids)), n_chunks=self.match_chunk_size)
@@ -167,18 +167,18 @@
batch_input_embeddings = [input_embeddings[i] for i in batch_indices]

             batch_similarities = cosine_similarity(
-                batch_input_embeddings, comparison_embeddings
+                batch_input_embeddings, reference_embeddings
             )

# Top links for each input text
for input_ix, similarities in enumerate(batch_similarities):
top_links = []
-            for comparison_ix in np.flip(np.argsort(similarities))[start_n:top_n]:
-                # comparison data id + cosine similarity score
+            for reference_ix in np.flip(np.argsort(similarities))[start_n:top_n]:
+                # reference data id + cosine similarity score
                 top_links.append(
                     [
-                        comparison_data_ids[comparison_ix],
-                        similarities[comparison_ix],
+                        reference_data_ids[reference_ix],
+                        similarities[reference_ix],
                     ]
                 )
matches_topn[batch_input_ids[input_ix]] = top_links
@@ -192,10 +192,10 @@ def link_dataset(
drop_most_similar: bool = False,
) -> dict:
"""
-        Link a dataset to the comparison dataset.
+        Link a dataset to the reference dataset.
 
         Args:
-            input_data (Union[list, dict]): The main dictionary to be linked to texts in the loaded comparison_data.
+            input_data (Union[list, dict]): The main dictionary to be linked to texts in the loaded reference_data.
A list of texts or a dictionary of texts where the key is the unique id.
If a list is given then a unique id will be assigned with the index order.
top_n (int, default = 3): The number of top links to return in the output.
@@ -204,17 +204,17 @@ def link_dataset(
drop_most_similar (bool, default = False): Whether to not output the most similar match, this would be set to True if you are matching a list with itself.
Returns:
dict: The keys are the ids of the input_data and the values are a list of lists of the top_n most similar
-                ids from the comparison_data and a probability score.
+                ids from the reference_data and a similarity score.
e.g. {'x': [['a', 0.75], ['c', 0.7]], 'y': [...]}
"""

try:
msg.info(
f"Comparing {len(input_data)} input texts to {len(self.comparison_embeddings)} comparison texts"
f"Comparing {len(input_data)} input texts to {len(self.reference_embeddings)} reference texts"
)
except:
msg.warning(
"self.comparison_embeddings does not exist - you may have not run load()"
"self.reference_embeddings does not exist - you may have not run load()"
)

input_data = self._process_dataset(input_data)
@@ -226,8 +226,8 @@ def link_dataset(
self.matches_topn = self.get_matches(
input_data_ids,
input_embeddings,
-            self.comparison_data_ids,
-            self.comparison_embeddings,
+            self.reference_data_ids,
+            self.reference_embeddings,
top_n,
drop_most_similar,
)
@@ -239,8 +239,8 @@ def link_dataset(
{
"input_id": input_id,
"input_text": input_data[input_id],
"link_id": link_data[0][0],
"link_text": self.comparison_data[link_data[0][0]],
"reference_id": link_data[0][0],
"reference_text": self.reference_data[link_data[0][0]],
"similarity": link_data[0][1],
}
for input_id, link_data in self.matches_topn.items()
6 changes: 4 additions & 2 deletions nlp_link/linker_utils.py
@@ -1,11 +1,13 @@
 from tqdm import tqdm
 
 import numpy as np
 from sentence_transformers import SentenceTransformer
+import torch
 
 from wasabi import msg, Printer
 
+import os
+
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
 msg_print = Printer()

