Run SVD for concept similarity #56
@chrished , can you have a look at this? after it, I will run it. my idea is to use |
src/dataprep/pipeline.sh
Outdated
# ### Calculate reduced-dimension paper concepts
python -m $sript_path.link.fit_svd_model \
--start 1980 \
--end 2020 \
can we include 2021 (and 2022)?
looks reasonable, would only increase the upper year limit
I added the model checks. what's now left to do
We discussed that, after aggregating at the entity level (department-year or author-year), we should appropriately normalize the score (embedding) vectors. But I'm no longer certain about this, since the embedding values can also be negative, in which case a simple normalization does not make sense. We need to think more about this.
Normalization is not necessary at that level: cosine similarity already divides by the lengths of the vectors, so it depends only on the angle between them. The discussion was relevant when we considered dimension reduction at the author/department level, since there the length of the vector mattered.
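To make the point about cosine similarity concrete, here is a minimal sketch with made-up vectors (the helper `cosine` and the random data are assumptions for illustration, not code from this PR). It shows that rescaling either vector, or normalizing both to unit length first, leaves the similarity unchanged, even when components are negative.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity of two 1-D vectors: dot product over product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
# Hypothetical aggregated embeddings; normal draws so that some values are negative.
u = rng.normal(size=16)
v = rng.normal(size=16)

sim = cosine(u, v)
# Rescaling the vectors by positive constants does not change the angle.
sim_rescaled = cosine(3.0 * u, 0.5 * v)
# Normalizing to unit length beforehand gives the same value as well.
sim_unit = cosine(u / np.linalg.norm(u), v / np.linalg.norm(v))

print(sim, sim_rescaled, sim_unit)
```

Note that the invariance holds only for positive scale factors; multiplying one vector by a negative constant flips the sign of the similarity.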
The prediction on all papers is running. I'll commit and then update the similarity part when it's done.
381470b to 30e3b0d
The predicted vectors are far too large: 1024 columns × 260 million papers takes about 2 TB of storage. We can instead apply the SVD model on the fly when calculating the similarities.
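A rough sketch of both points, under stated assumptions: the storage figure assumes 8-byte floats (float64), and the on-the-fly step assumes the fitted model behaves like scikit-learn's `TruncatedSVD`. The function names, matrix sizes, and toy data below are hypothetical and are not taken from `fit_svd_model`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity


def dense_storage_bytes(n_rows, n_cols, bytes_per_value=8):
    """Bytes needed to store a dense n_rows x n_cols matrix (float64 by default)."""
    return n_rows * n_cols * bytes_per_value


# 260 million papers x 1024 columns, assuming float64 values.
full_size_tb = dense_storage_bytes(260_000_000, 1024) / 1e12
print(f"{full_size_tb:.2f} TB")

# Toy stand-in for the paper-by-concept matrix (assumption: small dense counts).
rng = np.random.default_rng(0)
n_features, n_components = 50, 8
train = rng.poisson(0.3, size=(500, n_features)).astype(float)
svd = TruncatedSVD(n_components=n_components, random_state=0).fit(train)


def similarity_on_the_fly(svd_model, left, right):
    """Project both inputs with the fitted model, then compare; nothing is stored."""
    return cosine_similarity(svd_model.transform(left), svd_model.transform(right))


students = rng.poisson(0.3, size=(5, n_features)).astype(float)
researchers = rng.poisson(0.3, size=(7, n_features)).astype(float)
sims = similarity_on_the_fly(svd, students, researchers)
print(sims.shape)
```

Only the fitted model and the raw concept rows for the entities being compared need to be in memory, so the reduced vectors for all papers never have to be written to disk.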
…imilarity implemented only
…tions.py. Adjust to load all fields including level 0 in fit_svd
…, fix index accessibility issues, remaining: missing FieldOfStudyId in topics_collaborators_affiliations df
@f-hafner |
I'll have a look! |
@chrished, here is an example for similarities by array:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
rng = np.random.default_rng(23535)
# generate data
n_students = 100
n_researchers = 1000
emb_students = rng.uniform(size=(n_students,16))
emb_researchers = rng.uniform(size=(n_researchers,16))
data = {}
data["student_id"] = 30 + np.arange(n_students)
for idx in range(emb_students.shape[1]):
    col = emb_students[:, idx]
    data[f"emb_{idx}"] = col
d_students = pd.DataFrame(data)
data = {}
data["researcher_id"] = 500 + np.arange(n_researchers)
for idx in range(emb_researchers.shape[1]):
    col = emb_researchers[:, idx]
    data[f"emb_{idx}"] = col
d_researchers = pd.DataFrame(data)
# similarity
d_researchers = d_researchers.set_index("researcher_id")
d_students = d_students.set_index("student_id")
similarities = cosine_similarity(d_students, d_researchers)
# convert to dataframe
a = pd.DataFrame(similarities)
a.columns = d_researchers.index
a.head()
a["student_id"] = d_students.index
a = a.set_index("student_id")
a.head()
# reshape to long
b = a.stack()
b = b.reset_index()
b = b.rename(columns={0: "sim"})
b.head()

for students' own similarity, could you use also, Otherwise it looks fine to me, but I did not look through the whole code.
Combining the functions is a possibility, but I am not sure it is worth it right now.

Open issues:
SELECT *
FROM graduates_similarity_to_self AS gss
LEFT JOIN graduates_similarity_to_self_svd AS gss_svd
ON gss.AuthorId = gss_svd.AuthorId AND gss.max_level = gss_svd.max_level
WHERE gss_svd.AuthorId IS NULL AND gss.max_level = 2
LIMIT 10;
AuthorId|similarity|max_level|AuthorId|similarity|max_level
31354139|0.148778879513861|2|||
32263467|0.234119255922107|2|||
43860561|0.186600285900166|2|||
96708831|0.442487846995853|2|||
118646155|0.0|2|||
137389383|0.220691473367387|2|||
207844304|0.138700745291053|2|||
221862811|0.211684941713149|2|||
261048268|0.192376176839773|2|||
319527065|0.437573593393941|2|||
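The query above is a left anti-join: it lists level-2 rows in `graduates_similarity_to_self` that have no counterpart in `graduates_similarity_to_self_svd`. For checking the same thing on data frames, a minimal pandas sketch could look like this. The toy rows and the variable names `gss`, `gss_svd` and `missing` are made up for illustration and are not values from the database.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; IDs and similarities are invented.
gss = pd.DataFrame(
    {
        "AuthorId": [1, 2, 3, 4],
        "similarity": [0.15, 0.23, 0.0, 0.44],
        "max_level": [2, 2, 2, 1],
    }
)
gss_svd = pd.DataFrame(
    {
        "AuthorId": [1, 4],
        "similarity": [0.60, 0.70],
        "max_level": [2, 1],
    }
)

# Left join on both keys; indicator=True adds a "_merge" column that marks
# rows found only in the left table as "left_only".
merged = gss.merge(
    gss_svd,
    on=["AuthorId", "max_level"],
    how="left",
    suffixes=("", "_svd"),
    indicator=True,
)

# Keep level-2 rows that have no match in the SVD table.
missing = merged[(merged["_merge"] == "left_only") & (merged["max_level"] == 2)]
missing_ids = missing["AuthorId"].tolist()
print(missing_ids)
```

Counting `missing` by `max_level` instead of filtering on a single level would show whether the gap is specific to level 2.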
work in progress