Semantic analysis features for fun and elucidation. #435
Comments
Hi David, thanks for your great suggestions 👏🏻
Good catch. I never had this case on my end before, but it should be easily fixable. We could just ignore messages from the future in the stats analysis, or we could even classify them as junk.
This is only possible if Thunderbird provides a function that returns only the actual email content (without quotes, signatures, etc.) via the WebExtension API. I'm afraid this doesn't exist yet.
This is easy to calculate, but depends on the first point. Without an accurate word or sentence count, it would make no sense.
How is this calculated? But again, this depends on having the actual email content.
What value can be retrieved from a word cloud? In the past, I didn't find them very useful.
Hi devmount! Thanks for an awesome project. I didn't realize email content wasn't available :-( Vocabulary Density would count unique words after generating a word count, then compute the ratio of unique words to total words. As I think about examples, it'd be pretty useless for short messages. Lexical Density might be more interesting, but that requires not only access to message contents but also a part-of-speech model. Word clouds... I feel your lack of enthusiasm. I imagine it as being a more useful exploratory tool, comparing the dominant results for one folder or correspondent vs. another. Also, the TF-IDF internal function yields some useful semantic metrics. -David
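To make the vocabulary density idea above concrete, here is a minimal Python sketch. It is only an illustration, not ThirdStats code: the function name and the naive regex tokenizer are assumptions, and it presumes the message text has already been extracted.

```python
import re

def vocabulary_density(text: str) -> float:
    """Ratio of unique words to total words; 0.0 for text without words."""
    # Assumed tokenizer: lowercase runs of letters/digits, allowing one
    # inner apostrophe (e.g. "didn't"). Real tokenization is language-specific.
    words = re.findall(r"[^\W_]+(?:'[^\W_]+)?", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)

print(vocabulary_density("the cat and the hat"))  # 4 unique / 5 total = 0.8
print(vocabulary_density("Thanks!"))              # 1.0
```

The second call shows why the measure says little about short messages: a one-word reply always scores 1.0.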
My pleasure. Yeah, it's unfortunate. I could try to parse the contents myself (e.g. ignoring every line starting with Lexical density sounds interesting, I already worked with parts of speech models in the past and trained some language models myself. But ThirdStats is a multilingual tool, and that means having a model for each supported language, which is definitely out of scope for this project 😅 Sorry, I didn't mean to downplay your idea 🙏🏻 I was actually interested in use cases for word clouds. Just because I didn't find them useful doesn't mean they aren't 🤷🏻♂️ I guess the crucial part is the metric they are based on. Count of occurrences? TF-IDF? Again, thank you for your suggestions. It makes me think more in the language statistics direction, which is nice.
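A rough sketch of what such do-it-yourself parsing could look like, in Python for illustration only. The comment above is cut off before naming the line prefix, so the `>` default here is an assumption (it is the common plain-text quoting convention); the function name is hypothetical. The `-- ` signature separator is the usual convention described in RFC 3676.

```python
def strip_quotes_and_signature(body: str, quote_prefix: str = ">") -> str:
    """Drop quoted lines and everything after the signature separator."""
    kept = []
    for line in body.splitlines():
        # Conventional signature separator: dash, dash, space on its own line.
        if line == "-- ":
            break
        # Assumed quote marker; skip lines that start with it.
        if line.lstrip().startswith(quote_prefix):
            continue
        kept.append(line)
    return "\n".join(kept).strip()

body = "Hi\n> old text\nSounds good\n-- \nDavid"
print(strip_quotes_and_signature(body))
```

A heuristic like this misses HTML mail, attribution lines ("On ... wrote:"), and clients that quote without a prefix, which is why an API-level solution would be more reliable.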
Is your feature request related to a problem? Please describe.
I'd really enjoy additional semantic data, both for fun and for research:
Describe the solution you'd like
One smol bug:
The activity heat map year select has a bug if there are future-dated emails: the current year (2024 as of this writing) includes messages from 2033, which smooshes the heat map. Yes, that should be impossible, but junk mail does that.
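One possible fix is to drop future-dated messages before building the heat map. A minimal Python sketch under that assumption (the function name and the use of timezone-aware `datetime` values are illustrative, not how ThirdStats stores dates):

```python
from datetime import datetime, timezone

def drop_future_messages(dates, now=None):
    """Keep only message dates that are not later than `now`."""
    # `now` is injectable so the behavior can be tested deterministically.
    now = now or datetime.now(timezone.utc)
    return [d for d in dates if d <= now]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
dates = [
    datetime(2023, 12, 24, tzinfo=timezone.utc),
    datetime(2033, 1, 1, tzinfo=timezone.utc),  # future-dated junk mail
]
kept = drop_future_messages(dates, now)
print(sorted({d.year for d in kept}))  # years offered in the select: [2023]
```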
Semantic Visualizations:
There are a few useful/interesting semantic analysis tools I'd like to visualize in addition to raw message count, that is, being able to switch the view from raw message count to:
By changing the statistical basis from raw message count to a semantic analysis measure, the "Most received from" chart would follow the lead, ranking senders by the relevant semantic score.
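As a sketch of how a chart could switch its basis from message count to a score, the Python below ranks senders by a pluggable scoring function. Averaging the score per sender is an assumption (the request does not say how scores are aggregated), and the addresses and names are hypothetical.

```python
from collections import defaultdict

def rank_senders(messages, score):
    """Rank senders by the mean of score(text) over their messages."""
    per_sender = defaultdict(list)
    for sender, text in messages:
        per_sender[sender].append(score(text))
    # Assumed aggregation: arithmetic mean per sender.
    averages = {s: sum(v) / len(v) for s, v in per_sender.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

messages = [
    ("a@example.com", "one two three"),
    ("b@example.com", "one"),
    ("a@example.com", "one"),
]
# Any per-message measure can be passed in; word count is used here.
print(rank_senders(messages, lambda text: len(text.split())))
```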
An additional widget might be a word cloud.
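If a word cloud were weighted by TF-IDF rather than raw counts, words common to every folder or correspondent would fade out and the distinctive ones would stand out. A minimal Python sketch, treating each folder as one document; the function name, the plain `ln(N/df)` variant of IDF, and the sample folders are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_by_group(groups: dict) -> dict:
    """Map each group name to {word: tf * idf}, one 'document' per group."""
    n = len(groups)
    # Document frequency: in how many groups does each word occur?
    df = Counter()
    for words in groups.values():
        df.update(set(words))
    weights = {}
    for name, words in groups.items():
        counts = Counter(words)
        total = len(words)
        weights[name] = {
            w: (c / total) * math.log(n / df[w]) for w, c in counts.items()
        }
    return weights

folders = {
    "work": ["meeting", "report", "the"],
    "family": ["dinner", "the", "the"],
}
weights = tfidf_by_group(folders)
print(weights["family"])  # "the" occurs in both folders, so its weight is 0.0
```

With raw counts, "the" would dominate the "family" cloud; with this weighting, "dinner" does.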