Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process to disambiguate author affiliations using gpt-4 #40

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

dtgupta
Copy link
Contributor

@dtgupta dtgupta commented Jan 23, 2024

This process adds a different dimension to author affiliation disambiguation. It uses NLP to extract affiliations from text. This process produces better results than the currently implemented matching strategy. It has a matching rate of 81.24% as compared to 36.73% of the currently implemented algorithm. It also performs much better to match authors to multiple affiliations. This process adds value to the project as it helps researchers with more accurate affiliation results.

Copy link
Owner

@dspinellis dspinellis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very interesting work; well done! I added some comments as a first approximation to code that can be merged. Please also add tests for each process. Thank you again for the PR.

@@ -0,0 +1,172 @@
""""This process is used to distinguish the affiliations mentioned in the crossref dataset."""
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please start with a license comment, identifying you as the contributor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific approach to writing a license or can I copy the license comment from the other link_aa_base_ror.py file?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't invent new licenses 😃 Nor is it a good practice to mix different ones. So just copy-paste the existing text, replacing your name and setting the year to 2024.

"""
This process is used to distinguish the affiliations mentioned in the crossref dataset. It
uses the GPT-4 model to extract the affiliation and city from the affiliation text and
match it to the ROR database based on levenshtein distance.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Levenschtein

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The library used to calculate this distance has the spelling mentioned in the comment (without the 'c'). Should I still change it?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The c was my mistake, sorry. The comment was for you to capitalize it here, because it's proper name.


def process(database_path):
"""
This process is used to distinguish the affiliations mentioned in the crossref dataset. It
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please model the comment after the existing one in link_aa_base_ror.py. It's not bad to copy-paste in this case. Be very clear regarding which ROR level you're linking to.

Consider changing the existing link_aa comments so as to clarify which method each process is using.

if not mentioned_name:
continue
# Prompt for the GPT-4 model to extract the affiliation and city from the affiliation text
prompt = (
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Place the prompt in a constant at the beginning of the file as a multi-line string.

ensure_table_exists(database, "research_organizations")

select_cursor = database.cursor()
select_cursor_2 = database.cursor()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use more descriptive names.


def find_best_ror(gpt_org, select_cursor_2):
"""
This function is used to find the best affiliation match based on levenshtein distance.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not start your comments with "This function…" Just write in imperative voice what the function does. (In all functions.)

try:
# Extract the affiliation and city from the provided textual affiliation
completion = client.chat.completions.create(
model="gpt-4-1106-preview",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put the model in a constant at the top-level of the file.

) in select_cursor.execute(
"""
SELECT id, mentioned_name, gpt_name, city, ror_id FROM distinct_affiliations
WHERE ror_id != 20233
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this number? Please document it and use it as a constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ror_id 20233 corresponds to the research organization 2B based in Italy. Some organizations cannot be identified by gpt-4 (returns an empty string in response). Levenshtein distance comparison between empty string "" and 2B assigned empty strings to this ror_id. We filter out the organizations in this category to make fewer comparisons.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation. This approach sounds very brittle. Shouldn't we address the general case of empty strings?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants