-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Process to disambiguate author affiliations using gpt-4 #40
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very interesting work; well done! I added some comments as a first approximation to code that can be merged. Please also add tests for each process. Thank you again for the PR.
@@ -0,0 +1,172 @@ | |||
""""This process is used to distinguish the affiliations mentioned in the crossref dataset.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please start with a license comment, identifying you as the contributor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific approach to writing a license or can I copy the license comment from the other link_aa_base_ror.py file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't invent new licenses 😃 Nor is it a good practice to mix different ones. So just copy-paste the existing text, replacing your name and setting the year to 2024.
""" | ||
This process is used to distinguish the affiliations mentioned in the crossref dataset. It | ||
uses the GPT-4 model to extract the affiliation and city from the affiliation text and | ||
match it to the ROR database based on levenshtein distance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Levenschtein
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The library used to calculate this distance has the spelling mentioned in the comment (without the 'c'). Should I still change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The c
was my mistake, sorry. The comment was for you to capitalize it here, because it's proper name.
|
||
def process(database_path): | ||
""" | ||
This process is used to distinguish the affiliations mentioned in the crossref dataset. It |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please model the comment after the existing one in link_aa_base_ror.py
. It's not bad to copy-paste in this case. Be very clear regarding which ROR level you're linking to.
Consider changing the existing link_aa comments so as to clarify which method each process is using.
if not mentioned_name: | ||
continue | ||
# Prompt for the GPT-4 model to extract the affiliation and city from the affiliation text | ||
prompt = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Place the prompt in a constant at the beginning of the file as a multi-line string.
ensure_table_exists(database, "research_organizations") | ||
|
||
select_cursor = database.cursor() | ||
select_cursor_2 = database.cursor() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use more descriptive names.
|
||
def find_best_ror(gpt_org, select_cursor_2): | ||
""" | ||
This function is used to find the best affiliation match based on levenshtein distance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not start your comments with "This function…" Just write in imperative voice what the function does. (In all functions.)
try: | ||
# Extract the affiliation and city from the provided textual affiliation | ||
completion = client.chat.completions.create( | ||
model="gpt-4-1106-preview", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put the model in a constant at the top-level of the file.
) in select_cursor.execute( | ||
""" | ||
SELECT id, mentioned_name, gpt_name, city, ror_id FROM distinct_affiliations | ||
WHERE ror_id != 20233 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this number? Please document it and use it as a constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ror_id 20233
corresponds to the research organization 2B
based in Italy. Some organizations cannot be identified by gpt-4 (returns an empty string in response). Levenshtein distance comparison between empty string ""
and 2B
assigned empty strings to this ror_id. We filter out the organizations in this category to make fewer comparisons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanation. This approach sounds very brittle. Shouldn't we address the general case of empty strings?
b4eb879
to
bd775b6
Compare
5277c3c
to
5163396
Compare
This process adds a different dimension to author affiliation disambiguation. It uses NLP to extract affiliations from text. This process produces better results than the currently implemented matching strategy. It has a matching rate of 81.24% as compared to 36.73% of the currently implemented algorithm. It also performs much better to match authors to multiple affiliations. This process adds value to the project as it helps researchers with more accurate affiliation results.