Improve docs
lizgzil committed Oct 28, 2024
1 parent a85b766 commit 59dcbc4
Showing 4 changed files with 132 additions and 7 deletions.
4 changes: 2 additions & 2 deletions README.md
Expand Up @@ -63,7 +63,7 @@ Docs for this repo are automatically published to gh-pages branch via. Github ac
However, if you are editing the docs you can test them out locally by running

```
cd guidelines
pip install -r docs/requirements.txt
cd docs
<!-- pip install -r docs/requirements.txt -->
mkdocs serve
```
105 changes: 102 additions & 3 deletions docs/README.md
@@ -1,5 +1,104 @@
# nlp-link
# 🖇️ NLP Link

Documentation for NLP Link
NLP Link finds the word (or words) in a reference list most similar to an input word. For example, if you want to find which word in the reference list `['cats', 'dogs', 'rats', 'birds']` is most similar to 'puppies', nlp-link will return 'dogs'.

- [Page1](./page1.md)
The package can also use the same linking methodology to find the [SOC](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc) code most similar to an input job title. More on this [here](./page1.md).

## 🔨 Usage

Install the package using pip:

```bash
pip install nlp-link
```

### Basic usage

Match two lists in Python:

```python

from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

# list inputs
comparison_data = ['cats', 'dogs', 'rats', 'birds']
input_data = ['owls', 'feline', 'doggies', 'dogs', 'chair']
nlp_link.load(comparison_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)

```

Which outputs:

```
input_id input_text link_id link_text similarity
0 0 owls 3 birds 0.613577
1 1 feline 0 cats 0.669633
2 2 doggies 1 dogs 0.757443
3 3 dogs 1 dogs 1.000000
4 4 chair 0 cats 0.331178
```
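
To give a feel for what "linking" means here, the following is a minimal, self-contained sketch of nearest-match linking. It is not how nlp-link is implemented: character-trigram counts stand in for whatever text representation the library actually uses, and `trigram_vector`, `cosine`, and `link` are hypothetical names introduced only for illustration.

```python
import math
from collections import Counter


def trigram_vector(text):
    """Count character trigrams of a padded, lower-cased string (toy representation)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link(inputs, reference):
    """For each input, return (input_id, input_text, link_id, link_text, similarity)."""
    ref_vecs = [trigram_vector(r) for r in reference]
    rows = []
    for input_id, text in enumerate(inputs):
        vec = trigram_vector(text)
        scores = [cosine(vec, ref_vec) for ref_vec in ref_vecs]
        best = max(range(len(reference)), key=scores.__getitem__)
        rows.append((input_id, text, best, reference[best], round(scores[best], 3)))
    return rows


comparison_data = ['cats', 'dogs', 'rats', 'birds']
rows = link(['doggies', 'dogs'], comparison_data)
for row in rows:
    print(row)
```

Because this toy version only compares spelling, it would not link 'feline' to 'cats'; the similarities in the table above come from the library itself, not from this sketch.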

### Extended usage

Match using dictionary inputs (where the key is a unique ID):

```python

from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

# dict inputs
comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
nlp_link.load(comparison_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)

```

Which outputs:

```
input_id input_text link_id link_text similarity
0 x owls e birds 0.613577
1 y feline a cats 0.669633
2 z doggies b dogs 0.757443
3 za dogs b dogs 1.000000
4 zb chair a cats 0.331178
```

Return the several most similar matches for each input using the `top_n` argument (`format_output` must be set to `False` for this):

```python

from nlp_link.linker import NLPLinker

nlp_link = NLPLinker()

comparison_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
input_data = {'x': 'pets', 'y': 'feline'}
nlp_link.load(comparison_data)
matches = nlp_link.link_dataset(input_data, top_n=2, format_output=False)
# Top 2 matches per input, as [link_id, similarity] pairs
print(matches)
# Format output for ease of reading
print({input_data[k]: [comparison_data[r] for r, _ in v] for k, v in matches.items()})
```

Which will output:

```
{'x': [['b', 0.8171109], ['a', 0.7650396]], 'y': [['a', 0.6696329], ['c', 0.5778763]]}
{'pets': ['dogs', 'cats'], 'feline': ['cats', 'kittens']}
```
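
Every input is linked to something (in the basic example, 'chair' is linked to 'cats' with a similarity of only 0.33), so it can be useful to discard weak matches. The sketch below assumes the unformatted output has the shape printed above and filters it with a cutoff. `filter_matches` and the 0.6 threshold are hypothetical and not part of nlp-link.

```python
# Unformatted output copied from the example above.
matches = {
    'x': [['b', 0.8171109], ['a', 0.7650396]],
    'y': [['a', 0.6696329], ['c', 0.5778763]],
}
comparison_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
input_data = {'x': 'pets', 'y': 'feline'}


def filter_matches(matches, min_similarity):
    """Keep only (link_id, similarity) pairs at or above the cutoff (hypothetical helper)."""
    return {
        input_id: [(link_id, score) for link_id, score in ranked if score >= min_similarity]
        for input_id, ranked in matches.items()
    }


strong = filter_matches(matches, min_similarity=0.6)
readable = {
    input_data[input_id]: [comparison_data[link_id] for link_id, _ in ranked]
    for input_id, ranked in strong.items()
}
print(readable)  # {'pets': ['dogs', 'cats'], 'feline': ['cats']}
```

A sensible cutoff depends on the data, so treat 0.6 as a placeholder rather than a recommendation.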

The `drop_most_similar` argument can be set to `True` if you don't want to output the most similar match. This might be the case if you were comparing a list with itself, where each item's closest match is itself. For this you would run `nlp_link.link_dataset(input_data, drop_most_similar=True)`.
2 changes: 1 addition & 1 deletion docs/mkdocs.yaml
Expand Up @@ -39,6 +39,6 @@ theme:
name: Switch to light mode
nav:
- Home: README.md
- Page 1: page1.md
- SOCMapper: page1.md
plugins:
- same-dir
28 changes: 27 additions & 1 deletion docs/page1.md
@@ -1 +1,27 @@
## Title
# 🗺️ SOC Mapper

The SOC mapper relies on the [SOC coding index](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions) released by the ONS. This dataset contains over 30,000 job titles, each with its SOC code.

The `SOCMapper` class in `soc_map.py` maps job title(s) to SOC(s).

## 🔨 Core functionality

```python
from nlp_link.soc_mapper.soc_map import SOCMapper

soc_mapper = SOCMapper()
soc_mapper.load()  # Load the mapper before requesting matches

job_titles = ["data scientist", "Assistant nurse", "Senior financial consultant - London"]
print(soc_mapper.get_soc(job_titles, return_soc_name=True))
```

Which will output:

```
[((('2433/04', 'Statistical data scientists'), ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'), ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'), ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'), ((('2422/02', 'Financial advisers and planners'), ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant')]
```
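
The nested tuples are easier to work with once flattened into named records. The sketch below does this for the output shown above. The field names are an assumption based on reading that output (a SOC 2020 extension code and name, a SOC 2020 four-digit code and name, a SOC 2010 code, then the matched job title); `SOCMatch` and `flatten` are hypothetical helpers, not part of nlp-link.

```python
from typing import NamedTuple


class SOCMatch(NamedTuple):
    # Field names are assumed from the printed output, not taken from the library.
    soc_2020_ext: str
    soc_2020_ext_name: str
    soc_2020: str
    soc_2020_name: str
    soc_2010: str
    matched_title: str


# Output copied from the example above.
raw = [
    ((('2433/04', 'Statistical data scientists'),
      ('2433', 'Actuaries, economists and statisticians'), '2425'), 'Data scientist'),
    ((('6131/99', 'Nursing auxiliaries and assistants n.e.c.'),
      ('6131', 'Nursing auxiliaries and assistants'), '6141'), 'Assistant nurse'),
    ((('2422/02', 'Financial advisers and planners'),
      ('2422', 'Finance and investment analysts and advisers'), '3534'), 'Financial consultant'),
]


def flatten(result):
    """Unpack one nested result into a flat SOCMatch record."""
    (ext, group, soc_2010), title = result
    return SOCMatch(ext[0], ext[1], group[0], group[1], soc_2010, title)


records = [flatten(result) for result in raw]
for record in records:
    print(f"{record.matched_title}: {record.soc_2020} ({record.soc_2020_name})")
```

Records in this shape can be passed straight to `pandas.DataFrame(records)` if a table is more convenient.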

## 📖 Read more

Read more about the methods and evaluation of the SOCMapper [here](https://github.com/nestauk/nlp-link/soc_mapper/README.md).
