diff --git a/README.md b/README.md
index 5d5fb32..74aba06 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,10 @@
 # πŸ–‡οΈ NLP Link

-NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
+NLP Link finds the most similar word (or sentence) in a reference list to an inputted word or sentence. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.

 # πŸ—ΊοΈ SOC Mapper

-Another functionality of this package is using the linking methodology to find the [Standard Occupation Classification (SOC)](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](https://github.com/nestauk/nlp-link/blob/main/docs/page1.md).
+Another functionality of this package is using the linking methodology to find the [Standard Occupation Classification (SOC)](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md).

 ## πŸ”¨ Usage

@@ -16,7 +16,9 @@ pip install nlp-link

 ### Basic usage

-Match two lists in python:
+> ⏳ **NOTE:** The first time you import `NLPLinker` in your environment it will take some time (around a minute) to load.
+
+Match two lists of words or sentences in python:

 ```python

@@ -25,9 +27,9 @@ from nlp_link.linker import NLPLinker

 nlp_link = NLPLinker()

 # list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+reference_data = ['cats', 'dogs', 'rats', 'birds']
+nlp_link.load(reference_data)
 matches = nlp_link.link_dataset(input_data)
 # Top match output
 print(matches)

@@ -37,7 +39,7 @@ print(matches)
 Which outputs:

 ```
-  input_id input_text link_id link_text  similarity
+  input_id input_text reference_id reference_text  similarity
 0        0       owls            3          birds    0.613577
 1        1     feline            0           cats    0.669633
 2        2    doggies            1           dogs    0.757443
@@ -46,6 +48,10 @@ Which outputs:

 ```

+These results show the most similar word from the `reference_data` list to each word in the `input_data` list. The word 'dogs' was found in both lists, so it has a similarity score of 1; 'doggies' was matched to 'dogs' since these words are very similar. The inputted word 'chair' had no very similar words in the reference list - the most similar was 'cats', with a low similarity score.
+
+> πŸ” **INFO:** Semantic similarity scores are between 0 and 1, with 0 being very dissimilar and 1 being exactly the same. This value is calculated using [a large model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) trained on datasets of sentence pairs from various websites (including Reddit comments and WikiHow). The model learns the semantic rules which link pairs of sentences - e.g. it will learn synonyms. In the above example the reason 'chair' matches most similarly to 'cats' might be because the model learned that "cats" are often mentioned in relation to "chairs" (e.g. sitting on them) compared to dogs, rats, or birds.
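+
+If you want to sanity-check one of these scores yourself, the short sketch below reproduces a single similarity directly with the `sentence-transformers` model referenced above. This is illustrative and not part of the `nlp-link` API, and the exact value may vary slightly between model versions:
+
+```python
+from sentence_transformers import SentenceTransformer
+from sklearn.metrics.pairwise import cosine_similarity
+
+# Encode the two texts with the same model NLP Link uses for its scores
+model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+embeddings = model.encode(["doggies", "dogs"])
+
+# Cosine similarity of the two embeddings - roughly the 0.757443 shown above
+print(cosine_similarity([embeddings[0]], [embeddings[1]])[0][0])
+```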
+
 ### SOC Mapping

 Match a list of job titles to SOC codes:

@@ -66,9 +72,13 @@ Which will output

 [((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
 ```

+This nested list gives information about the most similar SOC codes for each of the three inputted job titles. The most similar extended SOC for "data scientist" was 'Statistical data scientists - 2433/04'.
+
+More about this output format is explained in the [SOCMapper page](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md#soc_output).
+
 ## Contributing

-The instructions here are for those contrbuting to the repo.
+The instructions here are for those contributing to the repo.

 ### Set-up

@@ -111,3 +121,11 @@ cd docs

 mkdocs serve
 ```
+
+## References
+
+https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc
+
+https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions
diff --git a/docs/README.md b/docs/README.md
index 6398ce2..a9411fe 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,6 +1,6 @@
 # πŸ–‡οΈ NLP Link

-NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.
+NLP Link finds the most similar word (or sentence) in a reference list to an inputted word or sentence. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of `['cats', 'dogs', 'rats', 'birds']`, nlp-link will return 'dogs'.

 Another functionality of this package is using the linking methodology to find the [SOC](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an inputted job title. More on this [here](./page1.md).

@@ -14,6 +14,8 @@ pip install nlp-link

 ### Basic usage

+> ⏳ **NOTE:** The first time you import `NLPLinker` in your environment it will take some time (around a minute) to load.
+
 Match two lists in python:

 ```python

@@ -23,9 +25,9 @@ from nlp_link.linker import NLPLinker

 nlp_link = NLPLinker()

 # list inputs
-comparison_data = ['cats', 'dogs', 'rats', 'birds']
 input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
-nlp_link.load(comparison_data)
+reference_data = ['cats', 'dogs', 'rats', 'birds']
+nlp_link.load(reference_data)
 matches = nlp_link.link_dataset(input_data)
 # Top match output
 print(matches)

@@ -35,7 +37,7 @@ print(matches)
 Which outputs:

 ```
-  input_id input_text link_id link_text  similarity
+  input_id input_text reference_id reference_text  similarity
 0        0       owls            3          birds    0.613577
 1        1     feline            0           cats    0.669633
 2        2    doggies            1           dogs    0.757443
@@ -44,6 +46,10 @@ Which outputs:

 ```

+These results show the most similar word from the `reference_data` list to each word in the `input_data` list. The word 'dogs' was found in both lists, so it has a similarity score of 1; 'doggies' was matched to 'dogs' since these words are very similar. The inputted word 'chair' had no very similar words in the reference list - the most similar was 'cats', with a low similarity score.
+
+> πŸ” **INFO:** Semantic similarity scores are between 0 and 1, with 0 being very dissimilar and 1 being exactly the same. This value is calculated using [a large model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) trained on datasets of sentence pairs from various websites (including Reddit comments and WikiHow). The model learns the semantic rules which link pairs of sentences - e.g. it will learn synonyms. In the above example the reason 'chair' matches most similarly to 'cats' might be because the model learned that "cats" are often mentioned in relation to "chairs" (e.g. sitting on them) compared to dogs, rats, or birds.
+
 ### Extended usage

 Match using dictionary inputs (where the key is a unique ID):

 ```python

@@ -55,9 +61,9 @@ from nlp_link.linker import NLPLinker

 nlp_link = NLPLinker()

 # dict inputs
-comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
 matches = nlp_link.link_dataset(input_data)
 # Top match output
 print(matches)

@@ -67,7 +73,7 @@ print(matches)
 Which outputs:

 ```
-  input_id input_text link_id link_text  similarity
+  input_id input_text reference_id reference_text  similarity
 0        x       owls            e          birds    0.613577
 1        y     feline            a           cats    0.669633
 2        z    doggies            b           dogs    0.757443
@@ -76,7 +82,7 @@ Which outputs:

 ```

-Output several most similar matches using the `top_n` argument (`format_output` needs to be set to False for this):
+Output the top `n` most similar reference matches using the `top_n` argument (`format_output` needs to be set to False for this):

 ```python

@@ -84,14 +90,14 @@ from nlp_link.linker import NLPLinker

 nlp_link = NLPLinker()

-comparison_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
+reference_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
 input_data = {'x': 'pets', 'y': 'feline'}
-nlp_link.load(comparison_data)
+nlp_link.load(reference_data)
 matches = nlp_link.link_dataset(input_data, top_n=2, format_output=False)
 # Top match output
 print(matches)
 # Format output for ease of reading
-print({input_data[k]: [comparison_data[r] for r, _ in v] for k,v in matches.items()})
+print({input_data[k]: [reference_data[r] for r, _ in v] for k,v in matches.items()})
 ```

 Which will output:

@@ -102,3 +108,11 @@ Which will output:

 ```

 The `drop_most_similar` argument can be set to True if you don't want to output the most similar match - this might be the case if you were comparing a list with itself. For this you would run `nlp_link.link_dataset(input_data, drop_most_similar=True)`.
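+
+For example, a minimal sketch of matching a list against itself (the exact matches and scores are illustrative):
+
+```python
+from nlp_link.linker import NLPLinker
+
+nlp_link = NLPLinker()
+
+input_data = {'a': 'cats', 'b': 'kittens', 'c': 'dogs'}
+# Load the same data as the reference list, then skip each text's match to itself
+nlp_link.load(input_data)
+matches = nlp_link.link_dataset(input_data, drop_most_similar=True)
+print(matches)  # e.g. 'cats' now links to 'kittens' rather than to itself
+```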
+ +## References + +https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 + +https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc + +https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions diff --git a/docs/mkdocs.yaml b/docs/mkdocs.yaml index 7cc1433..474f699 100644 --- a/docs/mkdocs.yaml +++ b/docs/mkdocs.yaml @@ -39,6 +39,7 @@ theme: name: Switch to light mode nav: - Home: README.md - - SOCMapper: page1.md + - SOCMapper - Core Usage: page1.md + - SOCMapper - Modifications, Methodology and Evaluation: https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md plugins: - same-dir diff --git a/docs/page1.md b/docs/page1.md index f1818da..2b67e78 100644 --- a/docs/page1.md +++ b/docs/page1.md @@ -22,6 +22,10 @@ Which will output [((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')] ``` +This nested list gives information about the most similar SOC codes for each of the three inputted job titles. The most similar extended SOC for "data scientist" was 'Statistical data scientists - 2433/04'. + +More about this output format is explained in the [SOCMapper page](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md#soc_output). + ## πŸ“– Read more Read more about the methods and evaluation of the SOCMapper [here](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/README.md). diff --git a/nlp_link/linker.py b/nlp_link/linker.py index f06f5a9..186c94e 100644 --- a/nlp_link/linker.py +++ b/nlp_link/linker.py @@ -8,17 +8,17 @@ nlp_link = NLPLinker() # dict inputs -comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'} +reference_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'} input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'} -nlp_link.load(comparison_data) +nlp_link.load(reference_data) matches = nlp_link.link_dataset(input_data) # Top match output print(matches) # list inputs -comparison_data = ['cats', 'dogs', 'rats', 'birds'] +reference_data = ['cats', 'dogs', 'rats', 'birds'] input_data = ['owls', 'feline', 'doggies', 'dogs','chair'] -nlp_link.load(comparison_data) +nlp_link.load(reference_data) matches = nlp_link.link_dataset(input_data) # Top match output print(matches) @@ -90,22 +90,22 @@ def _process_dataset( def load( self, - comparison_data: Union[list, dict], + reference_data: Union[list, dict], ): """ - Load the embedding model and embed the comparison dataset + Load the embedding model and embed the reference dataset Args: - comparison_data (Union[list, dict]): The comparison texts to find links to. + reference_data (Union[list, dict]): The reference texts to find links to. A list of texts or a dictionary of texts where the key is the unique id. If a list is given then a unique id will be assigned with the index order. 
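+
+        A small usage example (illustrative; mirrors the README):
+
+        >>> nlp_link = NLPLinker()
+        >>> nlp_link.load(['cats', 'dogs', 'rats', 'birds'])
+        >>> nlp_link.reference_data_texts
+        ['cats', 'dogs', 'rats', 'birds']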
""" self.bert_model = load_bert() - self.comparison_data = self._process_dataset(comparison_data) - self.comparison_data_texts = list(self.comparison_data.values()) - self.comparison_data_ids = list(self.comparison_data.keys()) + self.reference_data = self._process_dataset(reference_data) + self.reference_data_texts = list(self.reference_data.values()) + self.reference_data_ids = list(self.reference_data.keys()) - self.comparison_embeddings = self._get_embeddings(self.comparison_data_texts) + self.reference_embeddings = self._get_embeddings(self.reference_data_texts) def _get_embeddings(self, text_list: list) -> np.array: """ @@ -128,8 +128,8 @@ def get_matches( self, input_data_ids: list, input_embeddings: np.array, - comparison_data_ids: list, - comparison_embeddings: np.array, + reference_data_ids: list, + reference_embeddings: np.array, top_n: int, drop_most_similar: bool = False, ) -> dict: @@ -139,8 +139,8 @@ def get_matches( Args: input_data_ids (list): The ids of the input texts. input_embeddings (np.array): Embeddings for the input texts. - comparison_data_ids (list): The ids of the comparison texts. - comparison_embeddings (np.array): Embeddings for the comparison texts. + reference_data_ids (list): The ids of the reference texts. + reference_embeddings (np.array): Embeddings for the reference texts. top_n (int): The number of top links to return in the output. drop_most_similar (bool, default = False): Whether to not output the most similar match, this would be set to True if you are matching a list with itself. @@ -158,7 +158,7 @@ def get_matches( else: start_n = 0 - # We chunk up comparisons otherwise it can crash + # We chunk up reference list otherwise it can crash matches_topn = {} for batch_indices in tqdm( chunk_list(range(len(input_data_ids)), n_chunks=self.match_chunk_size) @@ -167,18 +167,18 @@ def get_matches( batch_input_embeddings = [input_embeddings[i] for i in batch_indices] batch_similarities = cosine_similarity( - batch_input_embeddings, comparison_embeddings + batch_input_embeddings, reference_embeddings ) # Top links for each input text for input_ix, similarities in enumerate(batch_similarities): top_links = [] - for comparison_ix in np.flip(np.argsort(similarities))[start_n:top_n]: - # comparison data id + cosine similarity score + for reference_ix in np.flip(np.argsort(similarities))[start_n:top_n]: + # reference data id + cosine similarity score top_links.append( [ - comparison_data_ids[comparison_ix], - similarities[comparison_ix], + reference_data_ids[reference_ix], + similarities[reference_ix], ] ) matches_topn[batch_input_ids[input_ix]] = top_links @@ -192,10 +192,10 @@ def link_dataset( drop_most_similar: bool = False, ) -> dict: """ - Link a dataset to the comparison dataset. + Link a dataset to the reference dataset. Args: - input_data (Union[list, dict]): The main dictionary to be linked to texts in the loaded comparison_data. + input_data (Union[list, dict]): The main dictionary to be linked to texts in the loaded reference_data. A list of texts or a dictionary of texts where the key is the unique id. If a list is given then a unique id will be assigned with the index order. top_n (int, default = 3): The number of top links to return in the output. @@ -204,17 +204,17 @@ def link_dataset( drop_most_similar (bool, default = False): Whether to not output the most similar match, this would be set to True if you are matching a list with itself. 
        Returns:
            dict: The keys are the ids of the input_data and the values are a list of lists of the top_n most similar
-                ids from the comparison_data and a probability score.
+                ids from the reference_data and a probability score.
                e.g. {'x': [['a', 0.75], ['c', 0.7]], 'y': [...]}
        """

        try:
            msg.info(
-                f"Comparing {len(input_data)} input texts to {len(self.comparison_embeddings)} comparison texts"
+                f"Comparing {len(input_data)} input texts to {len(self.reference_embeddings)} reference texts"
            )
        except:
            msg.warning(
-                "self.comparison_embeddings does not exist - you may have not run load()"
+                "self.reference_embeddings does not exist - you may not have run load()"
            )

        input_data = self._process_dataset(input_data)
@@ -226,8 +226,8 @@
        self.matches_topn = self.get_matches(
            input_data_ids,
            input_embeddings,
-            self.comparison_data_ids,
-            self.comparison_embeddings,
+            self.reference_data_ids,
+            self.reference_embeddings,
            top_n,
            drop_most_similar,
        )
@@ -239,8 +239,8 @@
                {
                    "input_id": input_id,
                    "input_text": input_data[input_id],
-                    "link_id": link_data[0][0],
-                    "link_text": self.comparison_data[link_data[0][0]],
+                    "reference_id": link_data[0][0],
+                    "reference_text": self.reference_data[link_data[0][0]],
                    "similarity": link_data[0][1],
                }
                for input_id, link_data in self.matches_topn.items()
diff --git a/nlp_link/linker_utils.py b/nlp_link/linker_utils.py
index 1746610..6da2ede 100644
--- a/nlp_link/linker_utils.py
+++ b/nlp_link/linker_utils.py
@@ -1,11 +1,13 @@
 from tqdm import tqdm
-
 import numpy as np
 from sentence_transformers import SentenceTransformer
 import torch
-
 from wasabi import msg, Printer
+import os
+
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
 msg_print = Printer()
diff --git a/nlp_link/soc_mapper/README.md b/nlp_link/soc_mapper/README.md
index a3fd793..8f9fa8a 100644
--- a/nlp_link/soc_mapper/README.md
+++ b/nlp_link/soc_mapper/README.md
@@ -2,9 +2,9 @@
 Key files and folders in this directory are:

-1. [soc_map.py](https://github.com/nestauk/nlp-link/soc_mapper/soc_map.py): The script containing the `SOCMapper` class.
-2. [soc_map_utils.py](https://github.com/nestauk/nlp-link/soc_mapper/soc_map_utils.py): Functions for loading data and cleaning job titles for the `SOCMapper` class.
-3. [config.yaml](ttps://github.com/nestauk/nlp-link/soc_mapper/config.yaml): The default arguments for the `SOCMapper` class.
+1. [soc_map.py](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/soc_map.py): The script containing the `SOCMapper` class.
+2. [soc_map_utils.py](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/soc_map_utils.py): Functions for loading data and cleaning job titles for the `SOCMapper` class.
+3. [config.yaml](https://github.com/nestauk/nlp-link/blob/main/nlp_link/soc_mapper/config.yaml): The default arguments for the `SOCMapper` class.

 # πŸ—ΊοΈ SOC Mapper

@@ -48,6 +48,8 @@ soc_mapper.load(save_embeds = True)

 ```

+
+
 ## πŸ“€ Output

 The output for one job title is in the format

 ```
@@ -59,8 +61,7 @@ The output for one job title is in the format

 for example

 ```
-((('2422/02', 'Financial advisors and planners'), ('2422', 'Fi
-nance and investment analysts and advisers'), '3534'), 'financial consultant')
+((('2422/02', 'Financial advisors and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'financial consultant')
 ```

 If the names of the SOC codes aren't needed then you can set `return_soc_name=False`.
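+
+A short sketch of unpacking one match (the variable names here are illustrative; the tuple structure follows the example above):
+
+```python
+match = ((('2422/02', 'Financial advisors and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'financial consultant')
+
+# ((SOC 2020 extended, SOC 2020 4-digit, SOC 2010 code), matched job title)
+(soc_2020_ext, soc_2020_4digit, soc_2010_code), job_title = match
+ext_code, ext_name = soc_2020_ext  # '2422/02', 'Financial advisors and planners'
+```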
The variables `soc_mapper.soc_2020_6_dict` and `soc_mapper.soc_2020_4_dict` give the names of each SOC 2020 6 and 4 digit codes.
diff --git a/nlp_link/soc_mapper/soc_map.py b/nlp_link/soc_mapper/soc_map.py
index fb912e0..930fb16 100644
--- a/nlp_link/soc_mapper/soc_map.py
+++ b/nlp_link/soc_mapper/soc_map.py
@@ -242,8 +242,8 @@ def find_most_similar_matches(
         matches_topn_dict = self.nlp_link.get_matches(
             input_data_ids=list(range(len(job_titles))),
             input_embeddings=job_title_embeddings,
-            comparison_data_ids=list(range(len(self.all_soc_embeddings))),
-            comparison_embeddings=self.all_soc_embeddings,
+            reference_data_ids=list(range(len(self.all_soc_embeddings))),
+            reference_embeddings=self.all_soc_embeddings,
             top_n=self.match_top_n,
         )
diff --git a/nlp_link/soc_mapper/soc_map_utils.py b/nlp_link/soc_mapper/soc_map_utils.py
index a88b7b5..a983b68 100644
--- a/nlp_link/soc_mapper/soc_map_utils.py
+++ b/nlp_link/soc_mapper/soc_map_utils.py
@@ -1,8 +1,10 @@
 import pandas as pd
 import re
+import os

 from nlp_link import soc_mapper_config
+from nlp_link.utils.utils import get_df_from_excel_s3_path


 def load_job_title_soc(soc_mapper_config: dict = soc_mapper_config) -> pd.DataFrame():
@@ -10,8 +12,15 @@
     Load the ONS dataset which gives SOC codes for thousands of job titles
     """

-    jobtitle_soc_data = pd.read_excel(
-        soc_mapper_config["soc_data"]["soc_dir"],
+    soc_dir = soc_mapper_config["soc_data"]["soc_dir"]
+    dir_split = soc_dir.split("s3://")[1].split("/")
+
+    s3_bucket_name = dir_split[0]
+    s3_key = os.path.join("", *dir_split[1:])
+
+    jobtitle_soc_data = get_df_from_excel_s3_path(
+        bucket_name=s3_bucket_name,
+        key=s3_key,
         sheet_name=soc_mapper_config["soc_data"]["sheet_name"],
         converters={
             soc_mapper_config["soc_data"]["soc_2020_ext_col"]: str,
@@ -81,15 +90,15 @@
         ),
         axis=1,
     )
-    jobtitle_soc_data[f"{col_name_0} and {col_name_1} and {col_name_2}"] = (
-        jobtitle_soc_data.apply(
-            lambda x: (
-                x[f"{col_name_0} and {col_name_1}"] + " " + x[col_name_2]
-                if pd.notnull(x[col_name_2])
-                else x[f"{col_name_0} and {col_name_1}"]
-            ),
-            axis=1,
-        )
+    jobtitle_soc_data[
+        f"{col_name_0} and {col_name_1} and {col_name_2}"
+    ] = jobtitle_soc_data.apply(
+        lambda x: (
+            x[f"{col_name_0} and {col_name_1}"] + " " + x[col_name_2]
+            if pd.notnull(x[col_name_2])
+            else x[f"{col_name_0} and {col_name_1}"]
+        ),
+        axis=1,
    )

     # Try to find a unique job title to SOC 2020 4 or 6 code mapping
diff --git a/nlp_link/utils/utils.py b/nlp_link/utils/utils.py
index 5772f0c..8f64294 100644
--- a/nlp_link/utils/utils.py
+++ b/nlp_link/utils/utils.py
@@ -3,6 +3,9 @@ from fnmatch import fnmatch
 from decimal import Decimal

 import numpy
+import boto3
+from io import BytesIO
+import pandas as pd

 from nlp_link import logger

@@ -91,3 +94,22 @@
         logger.info(f"Saved to {file_name} ...")
     else:
         logger.error(f'{file_name} has wrong file extension! Only supports "*.json"')
+
+
+def get_df_from_excel_s3_path(bucket_name: str, key: str, **kwargs) -> pd.DataFrame:
+    """
+    Get dataframe from Excel file stored in s3 path.
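+
+    A usage sketch (the bucket and key here are illustrative):
+    >>> df = get_df_from_excel_s3_path("my-bucket", "data/socs.xlsx", sheet_name="Sheet1")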
+
+    Args:
+        bucket_name (str): Name of the S3 bucket containing the Excel file.
+        key (str): Key (path within the bucket) of the Excel file.
+        **kwargs: Passed through to pd.read_excel().
+    Returns:
+        pd.DataFrame: dataframe from the Excel file
+    """
+
+    s3 = boto3.client("s3")
+    s3_data = s3.get_object(Bucket=bucket_name, Key=key)
+    contents = s3_data["Body"].read()  # the raw bytes of the Excel file
+
+    df = pd.read_excel(BytesIO(contents), **kwargs)
+    return df
diff --git a/pyproject.toml b/pyproject.toml
index 0438fec..9440f49 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "nlp-link"
-version = "0.1.3"
+version = "0.1.4"
 description = "A python package to semantically link two lists of texts."
 authors = ["Nesta "]
 readme = "README.md"
@@ -17,9 +17,8 @@ tqdm = "^4.66.4"
 numpy = "^1.26.4"
 openpyxl = "^3.1.3"
 wasabi = "^1.1.3"
-s3fs = {extras = ["boto3"], version = ">=2023.12.0"}
-boto3 = "*"
-botocore = "*"
+boto3 = "^1.34.99"
+botocore = "^1.34.99"

 [build-system]
 requires = ["poetry-core"]
diff --git a/tests/test_linker.py b/tests/test_linker.py
index d7b35cf..6949097 100644
--- a/tests/test_linker.py
+++ b/tests/test_linker.py
@@ -12,7 +12,7 @@ def test_NLPLinker_dict_input():

     nlp_link = NLPLinker()

-    comparison_data = {"a": "cats", "b": "dogs", "c": "rats", "d": "birds"}
+    reference_data = {"a": "cats", "b": "dogs", "c": "rats", "d": "birds"}
     input_data = {
         "x": "owls",
         "y": "feline",
@@ -20,25 +20,26 @@
         "za": "dogs",
         "zb": "chair",
     }
-    nlp_link.load(comparison_data)
+    nlp_link.load(reference_data)
     matches = nlp_link.link_dataset(input_data)

     assert len(matches) == len(input_data)
-    assert len(set(matches["link_id"]).difference(set(comparison_data.keys()))) == 0
+    assert len(set(matches["reference_id"]).difference(set(reference_data.keys()))) == 0


 def test_NLPLinker_list_input():

     nlp_link = NLPLinker()

-    comparison_data = ["cats", "dogs", "rats", "birds"]
+    reference_data = ["cats", "dogs", "rats", "birds"]
     input_data = ["owls", "feline", "doggies", "dogs", "chair"]
-    nlp_link.load(comparison_data)
+    nlp_link.load(reference_data)
     matches = nlp_link.link_dataset(input_data)

     assert len(matches) == len(input_data)
     assert (
-        len(set(matches["link_id"]).difference(set(range(len(comparison_data))))) == 0
+        len(set(matches["reference_id"]).difference(set(range(len(reference_data)))))
+        == 0
     )


@@ -51,8 +52,8 @@ def test_get_matches():
         input_embeddings=np.array(
             [[0.1, 0.13, 0.14], [0.12, 0.18, 0.15], [0.5, 0.9, 0.91]]
         ),
-        comparison_data_ids=["a", "b"],
-        comparison_embeddings=np.array([[0.51, 0.99, 0.9], [0.1, 0.13, 0.14]]),
+        reference_data_ids=["a", "b"],
+        reference_embeddings=np.array([[0.51, 0.99, 0.9], [0.1, 0.13, 0.14]]),
         top_n=1,
     )

@@ -65,13 +66,13 @@ def test_same_input():

     nlp_link = NLPLinker()

-    comparison_data = {"a": "cats", "b": "dogs", "c": "rats", "d": "birds"}
-    input_data = comparison_data
-    nlp_link.load(comparison_data)
+    reference_data = {"a": "cats", "b": "dogs", "c": "rats", "d": "birds"}
+    input_data = reference_data
+    nlp_link.load(reference_data)
     matches = nlp_link.link_dataset(input_data, drop_most_similar=False)
-    assert all(matches["input_id"] == matches["link_id"])
+    assert all(matches["input_id"] == matches["reference_id"])

     matches = nlp_link.link_dataset(input_data, drop_most_similar=True)
-    assert all(matches["input_id"] != matches["link_id"])
+    assert all(matches["input_id"] != matches["reference_id"])