add corpus linguistics and data analysis tools #45

Merged
merged 1 commit into dh-tech:main on Nov 18, 2024

adbar
Contributor

@adbar adbar commented Oct 30, 2024

You might want to revise the sections "Corpus Linguistics" and "Data Analysis": I tried to add the tools accordingly, but the distinction is not always clear-cut.

Short pitch

I posted the link to your repo on Mastodon, and this answer makes a good point: further links could be added. You could have a look at the TAPoR list you reference below.

Checklist

  • I have read and understood the contribution guidelines.
  • Table of contents has been updated (if a section is added / removed).
  • Contents have been sorted alphabetically.

@maehr
Collaborator

maehr commented Nov 6, 2024

@adbar Thanks for your pull request. I've looked at it and also at the discussion on Mastodon. We're not trying to provide a comprehensive list, but a useful one. Hence the question for you: how do you use the tools you suggested? Why do they belong on the list? (I don't use any of these tools, stylometry is not my strong suit.)

@adbar
Contributor Author

adbar commented Nov 7, 2024

@maehr I get your point, but the "Corpus Linguistics" part of your list could really be extended.

  • AntConc is used to index text corpora and run queries against them. It is not open-source, but it is widely used (look for mentions of it on Google Scholar if you're interested).
  • TXM is fully open. It offers more functions (multilingual POS-tagging, subcorpora, web interface, etc.) and is more geared towards statistical analysis of texts (with graphical display, for instance).
  • Mallet is widely used for topic modeling and document clustering, i.e. finding which documents belong together along which topics, for data-driven corpus exploration.
  • Stylo is used to extract a series of features for text classification, e.g. in authorship attribution. It can partly be used like Mallet but provides a wider range of methods.

@maehr
Collaborator

maehr commented Nov 8, 2024

Thank you very much for getting back to me so quickly. What is your take on the section "Corpus Linguistics"? Should it be integrated into "Data Analysis"?

@maehr maehr self-requested a review November 8, 2024 09:49
@adbar
Contributor Author

adbar commented Nov 8, 2024

Maybe putting "Data Collection" first and then "Corpus Linguistics" immediately before "Data Analysis" would already help with comparing the tools.

If the list continues to grow I see two options for "Data Analysis":

  1. Add subsections like "Corpus Exploration", "Stylometry" or "Topic Modeling"
  2. Focus on the medium and split it into "Text Analysis", "Image Analysis" etc.

Up to you. I admit it's not an easy task to classify these packages.

@sheiden

sheiden commented Nov 8, 2024

I would have suggested using something like the TaDiRAH terminology and structure (https://tadirah.info/pages/Browser.html) produced by the DH community, but it doesn't contain a 'Text analysis' entry :-(

@diegosiqueir4
Member

Thank you @adbar, @maehr and @sheiden for your contributions.

@diegosiqueir4 diegosiqueir4 merged commit 31ad7dd into dh-tech:main Nov 18, 2024
3 checks passed
Member

@diegosiqueir4 diegosiqueir4 left a comment

it looks good :)
