
How to apply to a different ontology/domain? #7

Open
innerop opened this issue Sep 12, 2019 · 6 comments

Comments

@innerop

innerop commented Sep 12, 2019

Very useful and great work.

How do I use a different ontology from a different domain? I can replicate the format used in the current CS ontology, but what about the cached model? Is that a generalized model, or is it specific to the ontology? If the latter, how do I go about constructing one for a different ontology?

Many thanks, and I'm happy to share back the results of my work.

EDIT:

Learning about word2vec... but would love to hear from you anyway, if you have any tips or instructions.

Thank you.

@angelosalatino
Owner

Hi, these are very good questions. I will soon write an article/tutorial/guide on my blog on how to apply this to other domains of science. Stay tuned.

@innerop
Author

innerop commented Sep 18, 2019

@angelosalatino

That would help greatly in adopting and adapting this work.

For now, however, could you please provide the script that generates the token-to-cso-combined file?

The README is clear on what is involved, but looking at the CSO I have no clue what constitutes a "topic"; the "words" (1-, 2-, 3-gram entities) show up in so many places. I also have no idea how to query the CSO properly. Do I use SPARQL? Is this RDF? RDFS? I'm completely new to the format.

Referring to this passage in README.MD:

To generate this file, we collected all the set of words available within the vocabulary of the model. Then iterating on each word, we retrieved its top 10 similar words from the model, and we computed their Levenshtein similarity against all CSO topics. If the similarity was above 0.7, we created a record which stored all CSO topics triggered by the initial word.
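
On the SPARQL question above, this is roughly what I've been experimenting with via rdflib. The file name, the cso: namespace, and the superTopicOf property are my guesses from the CSO download page, so please correct me if the actual schema is different:

```python
# Rough attempt at exploring CSO with rdflib -- the namespace and property
# names are guesses from the docs, not verified against a specific CSO release.
import rdflib

g = rdflib.Graph()
g.parse("CSO.nt", format="nt")  # assumed: the N-Triples dump of CSO

q = """
PREFIX cso:  <http://cso.kmi.open.ac.uk/schema/cso#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?parent ?topic ?label WHERE {
    ?parent cso:superTopicOf ?topic .
    OPTIONAL { ?topic rdfs:label ?label }
}
LIMIT 20
"""
for parent, topic, label in g.query(q):
    print(parent, topic, label)
```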

@angelosalatino
Owner

Hi,
we wrote an article explaining how you can adopt the CSO Classifier in other fields: https://infernusweb.altervista.org/wp/how-to-use-the-cso-classifier-in-other-domains/

Please do let us know if you need further information.

@innerop
Author

innerop commented Sep 23, 2019

Thank you. I'll keep you in the loop on how I'm using it, along with any improvements I can think of or further questions.

I managed to find an older version from before you added the cache, and I could see how you're doing the matching against the ontology with the embeddings, so that was very educational. One note, however: the older version only works on Python 3.6, not 3.7 or later. It throws a StopIteration exception from an NLTK util. That's an issue with Python and NLTK, not your codebase.

Thank you 🙏 .

@innerop
Author

innerop commented Sep 25, 2019

@angelosalatino

I looked at the code for generating the file which you shared in the article.

I'd like to point out the divergence I see with respect to the description given in the article.

The description says:

"To generate this dictionary/file, we collected all the different words available within the vocabulary of the model. Then iterating on each word, we retrieved its top 10 similar words from the model, and we computed their Levenshtein similarity against all CSO topics. If the similarity was above 0.7, we created a record which stored all CSO topics triggered by the initial word."

But I believe the code does this instead:

"To generate this dictionary/file, we collected all the different words available within the vocabulary of the model. Then iterating on each word, we retrieved its top 10 similar words from the model and put them in a list, which we iterated over. If the cosine similarity for a word in the list was equal to or greater than 0.7, and we computed its Levenshtein similarity against all CSO topics and where that was equal to or above 0.94 we added the topic to a record (or created it if it didn't exist) which stored all CSO topics triggered by the initial word from our model."

@angelosalatino
Owner

Hi, yes. Your explanation is very detailed. We left some details out for the sake of the narrative and referred the reader to the code for further details. But definitely, your description fits 100% with the actual process.

Thanks
