Calculating distances #5969
-
What's wrong?

Hi! I am brand new to GitHub as well as Orange3, and I am learning a whole lot at once. It almost feels like cognitive overload, but I will do my best.

I am following along in your document Introduction to Data Mining (GS-GS-6203), September 2019, and looking at Lesson 5: Introduction to Hierarchical Clustering. My question is about the calculation of distances. I was working with grades2.tab and wanted to understand a bit more about the various distance metrics.

I'm not entirely sure what it means to "normalize" a metric, but let me focus on the cosine metric, which I have learned about in my readings on similarity measures. I know that if two vectors are "identical," then one is a scalar multiple of the other. Two vectors are said to be "similar" if the angle between them is "close to zero." If one takes the cosine of an angle near zero, the value will be close to 1. Similarly, two vectors that are nearly orthogonal (an angle close to 90 degrees) have a cosine close to 0.

I used two cases (Bill and Cynthia), changing their respective grades to 0/1 and 1/0. Clearly, these two vectors are orthogonal; hence, the angle between them is 90 degrees, and cos(90°) = 0. But Orange produces a distance matrix that looks like:

Why does it produce 1 instead of 0? Similarly, when I set the "grades" to be identical, ...as in here... one should expect cos(0°) = 1, but I get zero in the distance matrix.

I'll end there instead of posing multiple questions, although I would love to have some clearer insight into the normalization of the metrics you have in the dropdown menu in the Distances widget.

How can we reproduce the problem?

Last, if I use the Euclidean distance metric for my first case, I should get a measure of sqrt(2), which I do... This makes sense, but the normalization process puzzles me. Where does the following come from...

What's your environment?
With thanks,
Replies: 1 comment 1 reply
-
You are referring to cosine similarity, which is indeed the cosine of the angle between the vectors. What the widget shows, however, is cosine distance, which is 1 - cos(angle). Hence the difference.

For normalization, please refer to https://en.wikipedia.org/wiki/Feature_scaling and also to our own documentation, which explains which type of normalization is used: https://orangedatamining.com/widget-catalog/unsupervised/distances/
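To make the difference concrete, here is a minimal sketch in plain Python using the two cases from the question (0/1 and 1/0). The helper names `cosine_similarity` and `cosine_distance` are made up for this illustration, and this is not Orange's actual implementation, which may handle missing values and normalization differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance as described above: 1 - cos(angle)."""
    return 1.0 - cosine_similarity(u, v)


def euclidean_distance(u, v):
    """Plain (non-normalized) Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical grade vectors matching the question's orthogonal case.
bill = [0.0, 1.0]
cynthia = [1.0, 0.0]

print(cosine_similarity(bill, cynthia))   # 0.0 (orthogonal vectors)
print(cosine_distance(bill, cynthia))     # 1.0 (what the distance matrix shows)
print(cosine_distance(bill, bill))        # 0.0 (identical vectors)
print(euclidean_distance(bill, cynthia))  # sqrt(2), about 1.414
```

So orthogonal vectors have similarity 0 and distance 1, while identical vectors have similarity 1 and distance 0, which matches the values in the question.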