Run SVD for concept similarity #56

Draft · wants to merge 37 commits into main

Conversation

f-hafner (Owner) commented Oct 2, 2024

work in progress

f-hafner (Owner, Author) commented Oct 2, 2024

@chrished, can you have a look at this? After that, I will run it. My idea is to use valid_paper when we run the fitted SVD model on all papers. Is the year restriction OK?

# ### Calculate reduced-dimension paper concepts
python -m $script_path.link.fit_svd_model \
--start 1980 \
--end 2020 \
chrished (Collaborator) commented on the diff:

can we include 2021 (and 2022)?

chrished (Collaborator) left a review:

looks reasonable, would only increase the upper year limit

f-hafner (Owner, Author) commented Oct 2, 2024

I added the model checks. What's now left to do (a sketch of the fit-then-apply steps follows this list):

  • run SVD on a subsample of all papers published in the time period in the relevant doc types
  • do the analysis above on the explained components
  • fit the model when we're happy and save it
  • run the model on all papers of interest (time period, doc types), save as a new "reduced" field of study
  • run the similarity pipeline in addition on those reduced fields.
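
A minimal sketch of those steps, assuming scikit-learn's TruncatedSVD as the SVD implementation; the matrix X and all sizes here are made-up stand-ins for the paper-by-concept scores, not the repo's actual data:

import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# stand-in for the sparse paper-by-field-of-study score matrix
X = sparse.random(10_000, 5_000, density=0.001, format="csr", random_state=0)

# fit the SVD on a subsample of papers
sample_idx = rng.choice(X.shape[0], size=2_000, replace=False)
svd = TruncatedSVD(n_components=100, random_state=0)
svd.fit(X[sample_idx])

# inspect the explained variance before committing to the model
print(svd.explained_variance_ratio_.cumsum()[-1])

# run the fitted model on all papers of interest -> "reduced" field of study
X_reduced = svd.transform(X)
print(X_reduced.shape)  # (10000, 100)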

We discussed that, after aggregating at the entity level (department-year or author-year), we should appropriately normalize the embedding vectors. But I'm not certain about this anymore, since the embedding values could also be negative, in which case a simple normalization does not make sense. -> we need to think more about this.

chrished (Collaborator) commented Oct 3, 2024

> We discussed that, after aggregating at the entity level (department-year or author-year), we should appropriately normalize the embedding vectors. But I'm not certain about this anymore, since the embedding values could also be negative, in which case a simple normalization does not make sense. -> we need to think more about this.

The normalization is not necessary at that level: cosine similarity already normalizes by the length of the vectors and only takes the angle between them (see cosine similarity).

The discussion was relevant when considering dimension reduction at the author/department level, as there the length of the vector was relevant.
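
For concreteness, a short check that cosine similarity is scale-invariant, so rescaling an aggregated embedding (even one with negative entries) leaves the similarity unchanged:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[0.2, -0.5, 0.8]])  # aggregated embedding; entries may be negative
b = np.array([[0.1, -0.4, 0.9]])

print(cosine_similarity(a, b))      # some value in [-1, 1]
print(cosine_similarity(3 * a, b))  # identical: the vector lengths cancel out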

chrished (Collaborator) commented Oct 6, 2024

The prediction on all papers is running; I'll commit and then update the similarity part when it's done.

chrished (Collaborator) commented Oct 6, 2024

Predicted vectors are way too large: 1024 columns × 260 million papers takes about 2 TB of storage.

We can instead apply the SVD model on the fly when calculating the similarities (see the sketch after this list):

  • create a new similarities script using the svd_model predictions
  • allow picking the number of SVD dimensions in the script
  • add the similarities to the db
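
A minimal sketch of the on-the-fly approach, with random data standing in for the raw 1024-dimensional concept scores and a freshly fitted TruncatedSVD standing in for the saved svd_model:

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)
X_all = rng.uniform(size=(5_000, 1024))  # stand-in for the raw concept scores

# the fitted model (in the pipeline this would be loaded from disk)
svd = TruncatedSVD(n_components=100, random_state=0).fit(X_all[:1_000])
n_dim = 50  # pick how many SVD dimensions the similarity script uses

# transform each chunk on the fly; the full reduced matrix is never stored
chunk_size = 1_000
for start in range(0, X_all.shape[0], chunk_size):
    reduced = svd.transform(X_all[start:start + chunk_size])[:, :n_dim]
    sims = cosine_similarity(reduced, reduced)  # or against a fixed reference set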

chrished (Collaborator) commented:

@f-hafner topics_collaborators_affiliations sometimes has missing FieldofStudyId, but I do not understand why. If you have an idea…

f-hafner (Owner, Author) commented:

> @f-hafner topics_collaborators_affiliations sometimes has missing FieldofStudyId, but I do not understand why. If you have an idea…

I'll have a look!

f-hafner (Owner, Author) commented Oct 20, 2024

@chrished, here is an example for computing similarities by array:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

rng = np.random.default_rng(23535)

# generate random example embeddings
n_students = 100
n_researchers = 1000

emb_students = rng.uniform(size=(n_students, 16))
emb_researchers = rng.uniform(size=(n_researchers, 16))

# one labeled row per person, one column per embedding dimension
d_students = pd.DataFrame(
    emb_students,
    index=pd.Index(30 + np.arange(n_students), name="student_id"),
    columns=[f"emb_{idx}" for idx in range(emb_students.shape[1])],
)
d_researchers = pd.DataFrame(
    emb_researchers,
    index=pd.Index(500 + np.arange(n_researchers), name="researcher_id"),
    columns=[f"emb_{idx}" for idx in range(emb_researchers.shape[1])],
)

# pairwise similarity: rows are students, columns are researchers
similarities = cosine_similarity(d_students, d_researchers)

# convert to a dataframe with labeled axes
a = pd.DataFrame(similarities, index=d_students.index, columns=d_researchers.index)
a.head()

# reshape to long format: one row per (student, researcher) pair
b = a.stack().reset_index().rename(columns={0: "sim"})
b.head()

For the students' own similarity, could you use np.diag(similarities)?

Also, similarity_to_closest_collaborator and similarity_to_closest_collaborator_svd are very similar, and so are similarity_to_faculty and similarity_to_faculty_svd. Would it be possible to put them into a single function and use SVD if it's specified?

Otherwise it looks fine to me, but I did not look through the whole code.

chrished (Collaborator) commented Oct 28, 2024

Combining the functions is a possibility, but I am not sure it is worth it right now.

Open issues:

  • fix index creation for the svd table (a sketch follows the query output below)
  • fix missing rows in the svd table (see the query below)
SELECT *
FROM graduates_similarity_to_self AS gss
LEFT JOIN graduates_similarity_to_self_svd AS gss_svd
  ON gss.AuthorId = gss_svd.AuthorId AND gss.max_level = gss_svd.max_level
WHERE gss_svd.AuthorId IS NULL AND gss.max_level = 2
LIMIT 10;
AuthorId|similarity|max_level|AuthorId|similarity|max_level
31354139|0.148778879513861|2|||
32263467|0.234119255922107|2|||
43860561|0.186600285900166|2|||
96708831|0.442487846995853|2|||
118646155|0.0|2|||
137389383|0.220691473367387|2|||
207844304|0.138700745291053|2|||
221862811|0.211684941713149|2|||
261048268|0.192376176839773|2|||
319527065|0.437573593393941|2|||
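
For the first open issue, a possible shape of the index fix (a sketch only: the database path is hypothetical, and the table and column names are taken from the query above):

import sqlite3

con = sqlite3.connect("path/to/database.db")  # hypothetical path
con.execute(
    "CREATE INDEX IF NOT EXISTS idx_gss_svd "
    "ON graduates_similarity_to_self_svd (AuthorId, max_level)"
)
con.commit()
con.close()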
