Run SVD for concept similarity #56

Draft · wants to merge 37 commits into main

Conversation

f-hafner (Owner) commented Oct 2, 2024

work in progress

f-hafner (Owner, Author) commented Oct 2, 2024

@chrished, can you have a look at this? After that, I will run it. My idea is to use valid_paper when we run the fitted SVD model on all papers. Is the year restriction OK?

# ### Calculate reduced-dimension paper concepts
python -m $script_path.link.fit_svd_model \
--start 1980 \
--end 2020 \
chrished (Collaborator) commented on the diff:

can we include 2021 (and 2022)?

chrished (Collaborator) left a review:

looks reasonable, would only increase the upper year limit

f-hafner (Owner, Author) commented Oct 2, 2024

I added the model checks. What's now left to do (a sketch of the fit-then-apply steps follows this list):

  • run SVD on a subsample of all papers published in the time period in the relevant doc types
  • do the analysis above on the explained components
  • fit the model when we're happy and save it
  • run the model on all papers of interest (time period, doc types), save as a new "reduced" field of study
  • run the similarity pipeline in addition on those reduced fields.
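
A minimal sketch of those steps, assuming scikit-learn's TruncatedSVD as the SVD implementation; the matrix X and all sizes here are made-up stand-ins for the paper-by-concept scores, not the repo's actual data:

import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# stand-in for the sparse paper-by-field-of-study score matrix
X = sparse.random(10_000, 5_000, density=0.001, format="csr", random_state=0)

# fit the SVD on a subsample of papers
sample_idx = rng.choice(X.shape[0], size=2_000, replace=False)
svd = TruncatedSVD(n_components=100, random_state=0)
svd.fit(X[sample_idx])

# inspect the explained variance before committing to the model
print(svd.explained_variance_ratio_.cumsum()[-1])

# run the fitted model on all papers of interest -> "reduced" field of study
X_reduced = svd.transform(X)
print(X_reduced.shape)  # (10000, 100)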

We discussed that, after aggregating at the entity level (department-year or author-year), we should appropriately normalize the embedding vectors. But I'm not certain about this anymore, since the embedding values could also be negative, in which case a simple normalization does not make sense. -> we need to think more about this.

chrished (Collaborator) commented Oct 3, 2024

> We discussed that, after aggregating at the entity level (department-year or author-year), we should appropriately normalize the embedding vectors. But I'm not certain about this anymore, since the embedding values could also be negative, in which case a simple normalization does not make sense. -> we need to think more about this.

The normalization is not necessary at that level: cosine similarity already normalizes by the length of the vectors and only takes the angle between them (see cosine similarity).

The discussion was relevant when considering dimension reduction at the author/department level, as there the length of the vector was relevant.
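
For concreteness, a short check that cosine similarity is scale-invariant, so rescaling an aggregated embedding (even one with negative entries) leaves the similarity unchanged:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[0.2, -0.5, 0.8]])  # aggregated embedding; entries may be negative
b = np.array([[0.1, -0.4, 0.9]])

print(cosine_similarity(a, b))      # some value in [-1, 1]
print(cosine_similarity(3 * a, b))  # identical: the vector lengths cancel out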

chrished (Collaborator) commented Oct 6, 2024

The prediction on all papers is running; I'll commit and then update the similarity part when it's done.

chrished (Collaborator) commented Oct 6, 2024

Predicted vectors are way too large: 1024 columns × 260 million papers takes about 2 TB of storage.

We can instead apply the SVD model on the fly when calculating the similarities (see the sketch after this list):

  • create a new similarities script using the svd_model predictions
  • allow picking the number of SVD dimensions in the script
  • add the similarities to the db
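
A minimal sketch of the on-the-fly approach, with random data standing in for the raw 1024-dimensional concept scores and a freshly fitted TruncatedSVD standing in for the saved svd_model:

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)
X_all = rng.uniform(size=(5_000, 1024))  # stand-in for the raw concept scores

# the fitted model (in the pipeline this would be loaded from disk)
svd = TruncatedSVD(n_components=100, random_state=0).fit(X_all[:1_000])
n_dim = 50  # pick how many SVD dimensions the similarity script uses

# transform each chunk on the fly; the full reduced matrix is never stored
chunk_size = 1_000
for start in range(0, X_all.shape[0], chunk_size):
    reduced = svd.transform(X_all[start:start + chunk_size])[:, :n_dim]
    sims = cosine_similarity(reduced, reduced)  # or against a fixed reference set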

chrished (Collaborator) commented:

@f-hafner topics_collaborators_affiliations sometimes has missing FieldofStudyId, but I do not understand why. If you have an idea…

f-hafner (Owner, Author) commented:

> @f-hafner topics_collaborators_affiliations sometimes has missing FieldofStudyId, but I do not understand why. If you have an idea…

I'll have a look!

f-hafner (Owner, Author) commented Oct 20, 2024

@chrished, here is an example for computing similarities by array:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

rng = np.random.default_rng(23535)

# generate random example embeddings
n_students = 100
n_researchers = 1000

emb_students = rng.uniform(size=(n_students, 16))
emb_researchers = rng.uniform(size=(n_researchers, 16))

# one labeled row per person, one column per embedding dimension
d_students = pd.DataFrame(
    emb_students,
    index=pd.Index(30 + np.arange(n_students), name="student_id"),
    columns=[f"emb_{idx}" for idx in range(emb_students.shape[1])],
)
d_researchers = pd.DataFrame(
    emb_researchers,
    index=pd.Index(500 + np.arange(n_researchers), name="researcher_id"),
    columns=[f"emb_{idx}" for idx in range(emb_researchers.shape[1])],
)

# pairwise similarity: rows are students, columns are researchers
similarities = cosine_similarity(d_students, d_researchers)

# convert to a dataframe with labeled axes
a = pd.DataFrame(similarities, index=d_students.index, columns=d_researchers.index)
a.head()

# reshape to long format: one row per (student, researcher) pair
b = a.stack().reset_index().rename(columns={0: "sim"})
b.head()

For the students' own similarity, could you use np.diag(similarities)?

Also, similarity_to_closest_collaborator and similarity_to_closest_collaborator_svd are very similar, and so are similarity_to_faculty and similarity_to_faculty_svd. Would it be possible to put them into a single function and use SVD if it's specified?

Otherwise it looks fine to me, but I did not look through the whole code.

chrished (Collaborator) commented Oct 28, 2024

Combining the functions is a possibility, but I am not sure it is worth it right now.

Open issues:

  • fix index creation for the svd table (a sketch follows the query output below)
  • fix missing rows in the svd table (see the query below)
SELECT *
FROM graduates_similarity_to_self AS gss
LEFT JOIN graduates_similarity_to_self_svd AS gss_svd
  ON gss.AuthorId = gss_svd.AuthorId AND gss.max_level = gss_svd.max_level
WHERE gss_svd.AuthorId IS NULL AND gss.max_level = 2
LIMIT 10;
AuthorId|similarity|max_level|AuthorId|similarity|max_level
31354139|0.148778879513861|2|||
32263467|0.234119255922107|2|||
43860561|0.186600285900166|2|||
96708831|0.442487846995853|2|||
118646155|0.0|2|||
137389383|0.220691473367387|2|||
207844304|0.138700745291053|2|||
221862811|0.211684941713149|2|||
261048268|0.192376176839773|2|||
319527065|0.437573593393941|2|||
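
For the first open issue, a possible shape of the index fix (a sketch only: the database path is hypothetical, and the table and column names are taken from the query above):

import sqlite3

con = sqlite3.connect("path/to/database.db")  # hypothetical path
con.execute(
    "CREATE INDEX IF NOT EXISTS idx_gss_svd "
    "ON graduates_similarity_to_self_svd (AuthorId, max_level)"
)
con.commit()
con.close()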
