Semantic analysis features for fun and elucidation. #435

Open
gessel opened this issue Mar 13, 2024 · 3 comments

@gessel

gessel commented Mar 13, 2024

Is your feature request related to a problem? Please describe.
I'd really enjoy additional semantic data, both for fun and for research:

Describe the solution you'd like

One smol bug:

The activity heat map year select has a bug if there are future-dated emails: the current year (2024 as of this writing) includes messages from 2033, which smooshes the heat map. Yes, that should be impossible, but junk mail does that.

Semantic Visualizations:

There are a few useful/interesting semantic analysis measures I'd like to visualize, i.e. being able to switch the view from raw message count to:

  • Word count (excluding quoted/reply text and signatures)
  • Flesch reading-ease value
  • Vocabulary density

By changing the statistical basis from raw message count to a semantic analysis measure, the "Most received from" chart would follow suit, ranking senders by the relevant semantic score.

An additional widget might be a word cloud.

@devmount
Owner

Hi David, thanks for your great suggestions 👏🏻

The activity heat map year select has a bug if there are future-dated emails: the current year (2024 as of this writing) includes messages from 2033, which smooshes the heat map. Yes, that should be impossible, but junk mail does that.

Good catch, I've never had this case on my end before, but it should be easily fixable. We could just ignore messages from the future in the stats analysis, or we could even classify them as junk.
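A minimal sketch of that filter, assuming each message exposes a `date` field the way Thunderbird's MessageHeader does (the helper name and grace period are just illustrative, not ThirdStats internals):

```ts
// Hypothetical helper: drop future-dated messages before aggregating them
// into the activity heat map.
interface MessageLike {
  date: Date; // Thunderbird's MessageHeader exposes a Date like this
}

function isPlausiblyDated(message: MessageLike, now: Date = new Date()): boolean {
  // Allow one day of grace for clock skew between sender and receiver.
  const graceMs = 24 * 60 * 60 * 1000;
  return message.date.getTime() <= now.getTime() + graceMs;
}

// Usage: filter before counting.
// const counted = messages.filter((m) => isPlausiblyDated(m));
```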

  • Word count (excluding quoted/reply text and signatures)

This is only possible if Thunderbird provides a function via the WebExtension API that returns only the actual email content (without quotes, signatures etc.). I'm afraid this doesn't exist yet.

  • Flesch reading-ease value

This is easy to calculate, but it depends on the first point. If we don't have an accurate word and sentence count, it would make no sense.
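For reference, a rough sketch of the Flesch reading-ease formula (206.835 − 1.015 × words/sentence − 84.6 × syllables/word); the syllable counter is a crude English-only vowel-group heuristic, so the scores are approximate:

```ts
// Rough Flesch reading-ease sketch. The syllable counter is a simple
// vowel-group heuristic for English, not a dictionary-based count.
function countSyllables(word: string): number {
  const groups = word.toLowerCase().match(/[aeiouy]+/g);
  return Math.max(1, groups ? groups.length : 0);
}

function fleschReadingEase(text: string): number {
  const sentences = text.split(/[.!?]+/).filter((s) => s.trim().length > 0);
  const words = text.match(/[A-Za-z']+/g) ?? [];
  if (sentences.length === 0 || words.length === 0) return 0;
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
  return 206.835
    - 1.015 * (words.length / sentences.length)
    - 84.6 * (syllables / words.length);
}
```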

  • Vocabulary density

How is this calculated? But again, this depends on having the actual email content.

An additional widget might be a word cloud.

What value can be retrieved from a word cloud? In the past, I've found them rather unhelpful.

@gessel
Author

gessel commented Mar 13, 2024

Hi devmount!

Thanks for an awesome project. I didn't realize email content wasn't available :-(
As an edge case, and probably only useful to an audience of approximately 1, it wouldn't be that hard to write a server-side script to append headers with the semantic values.
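Purely as an illustration of that idea (the header name is made up and nothing in ThirdStats would read it), such a script would just prepend an extra header to the raw RFC 5322 message before delivery:

```ts
// Illustrative only: inject a custom header (hypothetical name) carrying a
// precomputed semantic score into a raw RFC 5322 message.
function appendSemanticHeader(rawMessage: string, score: number): string {
  return `X-Semantic-Score: ${score.toFixed(1)}\r\n${rawMessage}`;
}
```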

Vocabulary density would count unique words after generating a word count, then compute the ratio of unique to total words. As I think about examples, it'd be pretty useless for short messages. Lexical density might be more interesting, but that requires not only access to message contents but also a part-of-speech model.
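To make that definition concrete, a small sketch of the unique-to-total ratio (a type-token ratio), assuming plain-text input:

```ts
// Vocabulary density as a type-token ratio: unique words / total words.
// Case-folded; words extracted with a simple Unicode letter regex.
function vocabularyDensity(text: string): number {
  const words = text.toLowerCase().match(/[\p{L}']+/gu) ?? [];
  if (words.length === 0) return 0;
  return new Set(words).size / words.length;
}

// Example: vocabularyDensity("the cat sat on the mat") === 5 / 6
```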

Word clouds... I feel your lack of enthusiasm. I imagine them more as an exploratory tool, for comparing the dominant terms of one folder or correspondent vs. another. Also, the TF-IDF computation behind them yields some useful semantic metrics.

-David

@devmount
Owner

My pleasure. Yeah, it's unfortunate. I could try to parse the contents myself (e.g. ignoring every line starting with > and every line below --), but that might be inefficient and imprecise too. It would most probably increase stats computation time by a multiple.
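A minimal sketch of that heuristic, assuming plain-text bodies (it's line-based, so HTML mail and inline replies would still slip through):

```ts
// Heuristic body cleaner: skip quoted lines (starting with ">") and stop at
// the signature delimiter ("-- " per RFC 3676; trailing space trimmed here).
function stripQuotesAndSignature(body: string): string {
  const kept: string[] = [];
  for (const line of body.split(/\r?\n/)) {
    if (line.trimEnd() === "--") break;             // signature starts here
    if (line.trimStart().startsWith(">")) continue; // quoted reply text
    kept.push(line);
  }
  return kept.join("\n");
}
```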

Lexical density sounds interesting; I've worked with part-of-speech models in the past and trained some language models myself. But ThirdStats is a multilingual tool, which means having a model for each supported language. That's definitely out of scope for this project 😅

Sorry, I didn't mean to downplay your idea 🙏🏻 I was actually interested in use cases for word clouds. Just because I didn't find them useful doesn't mean they really aren't 🤷🏻‍♂️ I guess the crucial part is the metric used as the basis. Count of occurrences? TF-IDF?
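On the metric question, here is a textbook TF-IDF sketch (nothing ThirdStats already implements), treating e.g. each folder or correspondent as one document:

```ts
// Textbook TF-IDF: term frequency within a document, weighted by how rare
// the term is across the whole set of documents (already tokenized here).
function tfIdf(docs: string[][]): Map<string, number>[] {
  const docFrequency = new Map<string, number>();
  for (const doc of docs) {
    for (const term of new Set(doc)) {
      docFrequency.set(term, (docFrequency.get(term) ?? 0) + 1);
    }
  }
  return docs.map((doc) => {
    const scores = new Map<string, number>();
    for (const term of doc) {
      scores.set(term, (scores.get(term) ?? 0) + 1); // raw term count
    }
    for (const [term, count] of scores) {
      const idf = Math.log(docs.length / (docFrequency.get(term) ?? 1));
      scores.set(term, (count / doc.length) * idf);
    }
    return scores;
  });
}

// A word cloud per folder or correspondent could then size words by this
// score instead of by raw occurrence count.
```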

Again, thank you for your suggestions. They make me think more in the direction of language statistics, which is nice.
