diff --git a/README.md b/README.md
index 1d210f6..74aba06 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,10 @@
 # πŸ–‡οΈ NLP Link
 
-NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
+NLP Link finds the most similar word (or sentence) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
 
 # πŸ—ΊοΈ SOC Mapper
 
-Another functionality of this package is using the linking methodology to find the [Standard Occupation Classification (SOC)](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](https://github.com/nestauk/nlp-link/blob/main/docs/page1.md).
+Another functionality of this package is using the linking methodology to find the [Standard Occupation Classification (SOC)](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md).
 
 ## πŸ”¨ Usage
 
@@ -16,9 +16,9 @@ pip install nlp-link
 
 ### Basic usage
 
-Note: the first time you import `NLPLinker` it will take some time to load.
+> ⏳ **NOTE:** The first time you import `NLPLinker` in your environment, it will take some time (around a minute) to load.
 
-Match two lists in python:
+Match two lists of words or sentences in Python:
 
 ```python
 
@@ -27,9 +27,9 @@ from nlp_link.linker import NLPLinker
 nlp_link = NLPLinker()
 
 # list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+reference_data = ['cats', 'dogs', 'rats', 'birds']
+nlp_link.load(reference_data)
 matches = nlp_link.link_dataset(input_data)
 # Top match output
 print(matches)
@@ -39,7 +39,7 @@ print(matches)
 Which outputs:
 
 ```
-  input_id input_text link_id link_text  similarity
+  input_id input_text reference_id reference_text  similarity
 0        0       owls       3     birds    0.613577
 1        1     feline       0      cats    0.669633
 2        2    doggies       1      dogs    0.757443
@@ -48,6 +48,10 @@ Which outputs:
 ```
 
+These results show the most similar word from the `reference_data` list to each word in the `input_data` list. The word 'dogs' was found in both lists, so it had a similarity score of 1; 'doggies' was matched to 'dogs' since these words are very similar. The inputted word 'chair' had no close matches - the most similar was 'cats', with a low similarity score.
+
+> πŸ” **INFO:** Semantic similarity scores are between 0 and 1, where 0 means completely dissimilar and 1 means identical. The score is calculated using [a large model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) trained on datasets of sentence pairs from various websites (including Reddit comments and WikiHow). The model learns the semantic rules which link pairs of sentences - e.g. it learns synonyms. In the above example, the reason 'chair' matches most closely to 'cats' might be because the model learned that "cats" are mentioned in relation to "chairs" (e.g. sitting on them) more often than dogs, rats, or birds are.
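+
+> The scores in this table can also be reproduced outside of nlp-link. Below is a minimal illustrative sketch (not part of the nlp-link API), assuming the `sentence-transformers` and `scikit-learn` packages are installed:
+
+```python
+from sentence_transformers import SentenceTransformer
+from sklearn.metrics.pairwise import cosine_similarity
+
+# The sentence embedding model linked above
+model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+
+# Embed the input word and the reference words
+input_embedding = model.encode(["puppies"])
+reference_embeddings = model.encode(["cats", "dogs", "rats", "birds"])
+
+# Cosine similarity of 'puppies' against each reference word;
+# the highest value should be for 'dogs'
+print(cosine_similarity(input_embedding, reference_embeddings))
+```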
+
 ### SOC Mapping
 
 Match a list of job titles to SOC codes:
 
@@ -68,9 +72,13 @@ Which will output
 
 ```
 [((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
 ```
 
+This nested list gives the most similar SOC codes for each of the three inputted job titles. The most similar extended SOC code for "data scientist" was 'Statistical data scientists - 2433/04'.
+
+More about this output format is explained in the [SOCMapper page](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md#soc_output).
+
 ## Contributing
 
-The instructions here are for those contrbuting to the repo.
+The instructions here are for those contributing to the repo.
 
 ### Set-up
 
@@ -113,3 +121,11 @@ cd docs
 
 mkdocs serve
 ```
+
+## References
+
+https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions
diff --git a/docs/README.md b/docs/README.md
index 6398ce2..df3772d 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,6 +1,6 @@
 # πŸ–‡οΈ NLP Link
 
-NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
+NLP Link finds the most similar word (or sentence) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
 
 Another functionality of this package is using the linking methodology to find the [SOC](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](./page1.md).
 
@@ -14,6 +14,8 @@ pip install nlp-link
 
 ### Basic usage
 
+> ⏳ **NOTE:** The first time you import `NLPLinker` in your environment, it will take some time (around a minute) to load.
+
 Match two lists in python:
 
 ```python
 
@@ -23,9 +25,9 @@ from nlp_link.linker import NLPLinker
 nlp_link = NLPLinker()
 
 # list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+reference_data = ['cats', 'dogs', 'rats', 'birds']
+nlp_link.load(reference_data)
 matches = nlp_link.link_dataset(input_data)
 # Top match output
 print(matches)
@@ -35,7 +37,7 @@ print(matches)
 Which outputs:
 
 ```
-  input_id input_text link_id link_text  similarity
+  input_id input_text reference_id reference_text  similarity
 0        0       owls       3     birds    0.613577
 1        1     feline       0      cats    0.669633
 2        2    doggies       1      dogs    0.757443
@@ -44,6 +46,10 @@ Which outputs:
 ```
 
+These results show the most similar word from the `reference_data` list to each word in the `input_data` list. The word 'dogs' was found in both lists, so it had a similarity score of 1; 'doggies' was matched to 'dogs' since these words are very similar. The inputted word 'chair' had no close matches - the most similar was 'cats', with a low similarity score.
+
+> πŸ” **INFO:** Semantic similarity scores are between 0 and 1, where 0 means completely dissimilar and 1 means identical. The score is calculated using [a large model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) trained on datasets of sentence pairs from various websites (including Reddit comments and WikiHow). The model learns the semantic rules which link pairs of sentences - e.g. it learns synonyms. In the above example, the reason 'chair' matches most closely to 'cats' might be because the model learned that "cats" are mentioned in relation to "chairs" (e.g. sitting on them) more often than dogs, rats, or birds are.
+
 ### Extended usage
 
 Match using dictionary inputs (where the key is a unique ID):
 
@@ -55,9 +61,9 @@ from nlp_link.linker import NLPLinker
 nlp_link = NLPLinker()
 
 # dict inputs
-comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
 matches = nlp_link.link_dataset(input_data)
 # Top match output
 print(matches)
@@ -67,7 +73,7 @@ print(matches)
 Which outputs:
 
 ```
-  input_id input_text link_id link_text  similarity
+  input_id input_text reference_id reference_text  similarity
 0        x       owls       e     birds    0.613577
 1        y     feline       a      cats    0.669633
 2        z    doggies       b      dogs    0.757443
@@ -84,14 +90,14 @@ from nlp_link.linker import NLPLinker
 nlp_link = NLPLinker()
 
-comparison_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'pets', 'y': 'feline'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
 matches = nlp_link.link_dataset(input_data, top_n=2, format_output=False)
 # Top match output
 print(matches)
 # Format output for ease of reading
-print({input_data[k]: [comparison_data[r] for r, _ in v] for k,v in matches.items()})
+print({input_data[k]: [reference_data[r] for r, _ in v] for k,v in matches.items()})
 ```
 
 Which will output:
 
@@ -102,3 +108,11 @@ Which will output:
 ```
 
 The `drop_most_similar` argument can be set to True if you don't want to output the most similar match - this might be the case if you were comparing a list with itself. For this you would run `nlp_link.link_dataset(input_data, drop_most_similar=True)`.
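+
+As an illustrative sketch of this self-matching case (the word list here is made up):
+
+```python
+from nlp_link.linker import NLPLinker
+
+nlp_link = NLPLinker()
+
+word_list = ['cats', 'dogs', 'kittens', 'puppies']
+
+# Load the list as the reference data, then match it against itself
+nlp_link.load(word_list)
+
+# Without drop_most_similar=True every word would simply match itself;
+# with it, each word is linked to its closest *other* word
+matches = nlp_link.link_dataset(word_list, drop_most_similar=True)
+print(matches)
+```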
+
+## References
+
+https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions
diff --git a/docs/mkdocs.yaml b/docs/mkdocs.yaml
index 7cc1433..474f699 100644
--- a/docs/mkdocs.yaml
+++ b/docs/mkdocs.yaml
@@ -39,6 +39,7 @@ theme:
       name: Switch to light mode
 nav:
   - Home: README.md
-  - SOCMapper: page1.md
+  - SOCMapper - Core Usage: page1.md
+  - SOCMapper - Modifications, Methodology and Evaluation: https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md
 plugins:
   - same-dir
diff --git a/docs/page1.md b/docs/page1.md
index f1818da..2b67e78 100644
--- a/docs/page1.md
+++ b/docs/page1.md
@@ -22,6 +22,10 @@ Which will output
 
 ```
 [((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
 ```
 
+This nested list gives the most similar SOC codes for each of the three inputted job titles. The most similar extended SOC code for "data scientist" was 'Statistical data scientists - 2433/04'.
+
+More about this output format is explained in the [SOCMapper page](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md#soc_output).
+
 ## πŸ“– Read more
 
 Read more about the methods and evaluation of the SOCMapper [here](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md).
diff --git a/nlp_link/linker.py b/nlp_link/linker.py
index f06f5a9..186c94e 100644
--- a/nlp_link/linker.py
+++ b/nlp_link/linker.py
@@ -8,17 +8,17 @@
 nlp_link = NLPLinker()
 
 # dict inputs
-comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
 matches = nlp_link.link_dataset(input_data)
 # Top match output
 print(matches)
 
 # list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
+reference_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
 matches = nlp_link.link_dataset(input_data)
 # Top match output
 print(matches)
@@ -90,22 +90,22 @@ def _process_dataset(
 
     def load(
         self,
-        comparison_data: Union[list, dict],
+        reference_data: Union[list, dict],
     ):
         """
-        Load the embedding model and embed the comparison dataset
+        Load the embedding model and embed the reference dataset
 
         Args:
-            comparison_data (Union[list, dict]): The comparison texts to find links to.
+            reference_data (Union[list, dict]): The reference texts to find links to.
                 A list of texts or a dictionary of texts where the key is the unique id.
                 If a list is given then a unique id will be assigned with the index order.
""" self.bert_model = load_bert() - self.comparison_data = self._process_dataset(comparison_data) - self.comparison_data_texts = list(self.comparison_data.values()) - self.comparison_data_ids = list(self.comparison_data.keys()) + self.reference_data = self._process_dataset(reference_data) + self.reference_data_texts = list(self.reference_data.values()) + self.reference_data_ids = list(self.reference_data.keys()) - self.comparison_embeddings = self._get_embeddings(self.comparison_data_texts) + self.reference_embeddings = self._get_embeddings(self.reference_data_texts) def _get_embeddings(self, text_list: list) -> np.array: """ @@ -128,8 +128,8 @@ def get_matches( self, input_data_ids: list, input_embeddings: np.array, - comparison_data_ids: list, - comparison_embeddings: np.array, + reference_data_ids: list, + reference_embeddings: np.array, top_n: int, drop_most_similar: bool = False, ) -> dict: @@ -139,8 +139,8 @@ def get_matches( Args: input_data_ids (list): The ids of the input texts. input_embeddings (np.array): Embeddings for the input texts. - comparison_data_ids (list): The ids of the comparison texts. - comparison_embeddings (np.array): Embeddings for the comparison texts. + reference_data_ids (list): The ids of the reference texts. + reference_embeddings (np.array): Embeddings for the reference texts. top_n (int): The number of top links to return in the output. drop_most_similar (bool, default = False): Whether to not output the most similar match, this would be set to True if you are matching a list with itself. @@ -158,7 +158,7 @@ def get_matches( else: start_n = 0 - # We chunk up comparisons otherwise it can crash + # We chunk up reference list otherwise it can crash matches_topn = {} for batch_indices in tqdm( chunk_list(range(len(input_data_ids)), n_chunks=self.match_chunk_size) @@ -167,18 +167,18 @@ def get_matches( batch_input_embeddings = [input_embeddings[i] for i in batch_indices] batch_similarities = cosine_similarity( - batch_input_embeddings, comparison_embeddings + batch_input_embeddings, reference_embeddings ) # Top links for each input text for input_ix, similarities in enumerate(batch_similarities): top_links = [] - for comparison_ix in np.flip(np.argsort(similarities))[start_n:top_n]: - # comparison data id + cosine similarity score + for reference_ix in np.flip(np.argsort(similarities))[start_n:top_n]: + # reference data id + cosine similarity score top_links.append( [ - comparison_data_ids[comparison_ix], - similarities[comparison_ix], + reference_data_ids[reference_ix], + similarities[reference_ix], ] ) matches_topn[batch_input_ids[input_ix]] = top_links @@ -192,10 +192,10 @@ def link_dataset( drop_most_similar: bool = False, ) -> dict: """ - Link a dataset to the comparison dataset. + Link a dataset to the reference dataset. Args: - input_data (Union[list, dict]): The main dictionary to be linked to texts in the loaded comparison_data. + input_data (Union[list, dict]): The main dictionary to be linked to texts in the loaded reference_data. A list of texts or a dictionary of texts where the key is the unique id. If a list is given then a unique id will be assigned with the index order. top_n (int, default = 3): The number of top links to return in the output. @@ -204,17 +204,17 @@ def link_dataset( drop_most_similar (bool, default = False): Whether to not output the most similar match, this would be set to True if you are matching a list with itself. 
         Returns:
             dict: The keys are the ids of the input_data and the values are a list of lists of the top_n most similar
-                ids from the comparison_data and a probability score.
+                ids from the reference_data and a similarity score.
                 e.g. {'x': [['a', 0.75], ['c', 0.7]], 'y': [...]}
         """
 
         try:
             msg.info(
-                f"Comparing {len(input_data)} input texts to {len(self.comparison_embeddings)} comparison texts"
+                f"Comparing {len(input_data)} input texts to {len(self.reference_embeddings)} reference texts"
             )
         except:
             msg.warning(
-                "self.comparison_embeddings does not exist - you may have not run load()"
+                "self.reference_embeddings does not exist - you may not have run load()"
             )
 
         input_data = self._process_dataset(input_data)
@@ -226,8 +226,8 @@ def link_dataset(
         self.matches_topn = self.get_matches(
             input_data_ids,
             input_embeddings,
-            self.comparison_data_ids,
-            self.comparison_embeddings,
+            self.reference_data_ids,
+            self.reference_embeddings,
             top_n,
             drop_most_similar,
         )
@@ -239,8 +239,8 @@ def link_dataset(
                 {
                     "input_id": input_id,
                     "input_text": input_data[input_id],
-                    "link_id": link_data[0][0],
-                    "link_text": self.comparison_data[link_data[0][0]],
+                    "reference_id": link_data[0][0],
+                    "reference_text": self.reference_data[link_data[0][0]],
                     "similarity": link_data[0][1],
                 }
                 for input_id, link_data in self.matches_topn.items()
diff --git a/nlp_link/soc_mapper/README.md b/nlp_link/soc_mapper/README.md
index a3fd793..8f9fa8a 100644
--- a/nlp_link/soc_mapper/README.md
+++ b/nlp_link/soc_mapper/README.md
@@ -2,9 +2,9 @@
 
 Key files and folders in this directory are:
 
-1. [soc_map.py](https://github.com/nestauk/nlp-link/soc_mapper/soc_map.py): The script containing the `SOCMapper` class.
-2. [soc_map_utils.py](https://github.com/nestauk/nlp-link/soc_mapper/soc_map_utils.py): Functions for loading data and cleaning job titles for the `SOCMapper` class.
-3. [config.yaml](ttps://github.com/nestauk/nlp-link/soc_mapper/config.yaml): The default arguments for the `SOCMapper` class.
+1. [soc_map.py](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/soc_map.py): The script containing the `SOCMapper` class.
+2. [soc_map_utils.py](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/soc_map_utils.py): Functions for loading data and cleaning job titles for the `SOCMapper` class.
+3. [config.yaml](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/config.yaml): The default arguments for the `SOCMapper` class.
 
 # πŸ—ΊοΈ SOC Mapper
 
@@ -48,6 +48,8 @@ soc_mapper.load(save_embeds = True)
 
 ```
 
+
+
 ## πŸ“€ Output
 
 The output for one job title is in the format
@@ -59,8 +61,7 @@ The output for one job title is in the format
 for example
 
 ```
-((('2422/02', 'Financial advisors and planners'), ('2422', 'Fi
-nance and investment analysts and advisers'), '3534'), 'financial consultant')
+((('2422/02', 'Financial advisors and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'financial consultant')
 ```
 
 If the names of the SOC codes aren't needed then you can set `return_soc_name=False`. The variables `soc_mapper.soc_2020_6_dict` and `soc_mapper.soc_2020_4_dict` give the names of each SOC 2020 6 and 4 digit codes.
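+
+As an illustrative sketch, the nested output can be unpacked in Python. The labels in the comments are assumptions based on the format description above (the first pair being the SOC 2020 6-digit code and name, the second pair the SOC 2020 4-digit code and name), not part of the SOCMapper API:
+
+```python
+# An example output tuple, as shown above
+match = (
+    (
+        ("2422/02", "Financial advisors and planners"),  # SOC 2020 6-digit code and name
+        ("2422", "Finance and investment analysts and advisers"),  # SOC 2020 4-digit code and name
+        "3534",  # a further SOC code - see the format description above
+    ),
+    "financial consultant",  # the inputted job title
+)
+
+# Unpack the three code elements and the matched job title
+(soc_2020_6, soc_2020_4, other_soc_code), job_title = match
+soc_6_code, soc_6_name = soc_2020_6
+soc_4_code, soc_4_name = soc_2020_4
+
+print(f"'{job_title}' -> {soc_6_code} ({soc_6_name}); {soc_4_code} ({soc_4_name}); {other_soc_code}")
+```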
diff --git a/nlp_link/soc_mapper/soc_map.py b/nlp_link/soc_mapper/soc_map.py
index fb912e0..930fb16 100644
--- a/nlp_link/soc_mapper/soc_map.py
+++ b/nlp_link/soc_mapper/soc_map.py
@@ -242,8 +242,8 @@ def find_most_similar_matches(
         matches_topn_dict = self.nlp_link.get_matches(
             input_data_ids=list(range(len(job_titles))),
             input_embeddings=job_title_embeddings,
-            comparison_data_ids=list(range(len(self.all_soc_embeddings))),
-            comparison_embeddings=self.all_soc_embeddings,
+            reference_data_ids=list(range(len(self.all_soc_embeddings))),
+            reference_embeddings=self.all_soc_embeddings,
             top_n=self.match_top_n,
         )
 
diff --git a/tests/test_linker.py b/tests/test_linker.py
index d7b35cf..6949097 100644
--- a/tests/test_linker.py
+++ b/tests/test_linker.py
@@ -12,7 +12,7 @@ def test_NLPLinker_dict_input():
 
     nlp_link = NLPLinker()
 
-    comparison_data = {"a": "cats", "b": "dogs", "c": "rats", "d": "birds"}
+    reference_data = {"a": "cats", "b": "dogs", "c": "rats", "d": "birds"}
     input_data = {
         "x": "owls",
         "y": "feline",
@@ -20,25 +20,26 @@ def test_NLPLinker_dict_input():
         "za": "dogs",
         "zb": "chair",
     }
-    nlp_link.load(comparison_data)
+    nlp_link.load(reference_data)
     matches = nlp_link.link_dataset(input_data)
 
     assert len(matches) == len(input_data)
-    assert len(set(matches["link_id"]).difference(set(comparison_data.keys()))) == 0
+    assert len(set(matches["reference_id"]).difference(set(reference_data.keys()))) == 0
 
 
 def test_NLPLinker_list_input():
 
     nlp_link = NLPLinker()
 
-    comparison_data = ["cats", "dogs", "rats", "birds"]
+    reference_data = ["cats", "dogs", "rats", "birds"]
     input_data = ["owls", "feline", "doggies", "dogs", "chair"]
-    nlp_link.load(comparison_data)
+    nlp_link.load(reference_data)
     matches = nlp_link.link_dataset(input_data)
 
     assert len(matches) == len(input_data)
     assert (
-        len(set(matches["link_id"]).difference(set(range(len(comparison_data))))) == 0
+        len(set(matches["reference_id"]).difference(set(range(len(reference_data)))))
+        == 0
     )
 
@@ -51,8 +52,8 @@ def test_get_matches():
         input_embeddings=np.array(
             [[0.1, 0.13, 0.14], [0.12, 0.18, 0.15], [0.5, 0.9, 0.91]]
         ),
-        comparison_data_ids=["a", "b"],
-        comparison_embeddings=np.array([[0.51, 0.99, 0.9], [0.1, 0.13, 0.14]]),
+        reference_data_ids=["a", "b"],
+        reference_embeddings=np.array([[0.51, 0.99, 0.9], [0.1, 0.13, 0.14]]),
         top_n=1,
     )
 
@@ -65,13 +66,13 @@ def test_same_input():
 
     nlp_link = NLPLinker()
 
-    comparison_data = {"a": "cats", "b": "dogs", "c": "rats", "d": "birds"}
-    input_data = comparison_data
-    nlp_link.load(comparison_data)
+    reference_data = {"a": "cats", "b": "dogs", "c": "rats", "d": "birds"}
+    input_data = reference_data
+    nlp_link.load(reference_data)
     matches = nlp_link.link_dataset(input_data, drop_most_similar=False)
-    assert all(matches["input_id"] == matches["link_id"])
+    assert all(matches["input_id"] == matches["reference_id"])
 
     matches = nlp_link.link_dataset(input_data, drop_most_similar=True)
-    assert all(matches["input_id"] != matches["link_id"])
+    assert all(matches["input_id"] != matches["reference_id"])