The files in this repository are work towards an investigation of philosophical content, broadly understood, in early New Zealand newspaper writing, using the National Library of New Zealand's Papers Past Newspaper Open Data Pilot dataset (https://natlib.govt.nz/about-us/open-data/papers-past-metadata/papers-past-newspaper-open-data-pilot).
The directories contain:
- NPOD_Starter: the starter corpus from the National Library of New Zealand.
- classifiers: trained classification models (pickled).
- dictionaries: dictionaries generated with gensim from various subsets of the corpus.
- lda_models: trained LDA topic models.
- pickles: pickles of various subsets of the dataset (see the loading sketch after this list). Note: some pickled corpora are too large for GitHub.
- presentation: LaTeX source for the project presentation.
- report: LaTeX source for the project report.
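Once generated, the artifacts in these directories can be reloaded along the following lines. This is a minimal sketch only: the file names are hypothetical placeholders, not the actual files shipped in each directory.

```python
import pickle
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical file names; substitute the real files from each directory.
with open('classifiers/nb_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)          # a pickled trained classifier

dictionary = Dictionary.load('dictionaries/corpus.dict')  # a gensim dictionary
lda = LdaModel.load('lda_models/corpus.lda')              # a trained LDA model

with open('pickles/corpus_subset.pickle', 'rb') as f:
    corpus_df = pickle.load(f)           # a pickled subset of the dataset
```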
The Jupyter notebooks have the following roles:
- 'Classifying texts.ipynb': code used to assign categorical labels to articles.
- 'Entity Extraction *.ipynb': application of spaCy to extract named entities and proper nouns from corpora (see the first sketch after this list).
- 'NaiveBayes_PhilosoClassification*.ipynb': Naive Bayes classifiers trained on the labelled dataset and then applied to the corpus as a whole (see the second sketch after this list).
- '*_exp.ipynb': use of collocation, cooccurrence, and concordancing to explore candidate corpora (see the third sketch after this list).
- 'starter_topicmodels.ipynb': use of gensim topic modelling to explore the 'Starter kit' of the dataset.
- 'Religion and Evolution in the REL corpus.ipynb': what the filename says.
- 'NZ Content': looking for NZ-specific content in the NB2 corpus.
- 'Relabelling.ipynb': proposals to improve labelling, begun but not completed.
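The entity-extraction notebooks apply spaCy roughly as follows. A minimal sketch, assuming the en_core_web_sm model is installed; the iterable of article texts and the dataframe column name are hypothetical.

```python
import spacy

nlp = spacy.load('en_core_web_sm')  # assumes this spaCy model has been downloaded

def extract_entities(texts):
    """Yield (entity text, entity label) pairs from an iterable of article texts."""
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            yield ent.text, ent.label_

# Proper nouns can be collected similarly, e.g. tokens with tok.pos_ == 'PROPN'.
# e.g.: entities = list(extract_entities(corpus_df['text']))
```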
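The classification notebooks follow the usual supervised pattern: train on the hand-labelled articles, then label the rest of the corpus. A minimal scikit-learn sketch of that pattern (the notebooks may differ in vectorisation and preprocessing; labelled_texts, labels, and all_texts are hypothetical names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# labelled_texts, labels: the hand-labelled training articles (hypothetical names)
model = make_pipeline(TfidfVectorizer(min_df=2), MultinomialNB())
model.fit(labelled_texts, labels)

# apply the trained classifier to the corpus as a whole
predicted_labels = model.predict(all_texts)
```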
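The exploration notebooks lean on collocation and concordance views of candidate corpora. A minimal NLTK sketch of both (NLTK here is an assumption about tooling; tokens and the query term are hypothetical):

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# tokens: a flat list of word tokens from one candidate corpus (hypothetical input)
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)            # ignore rare pairs
print(finder.nbest(measures.pmi, 10))  # top collocations by pointwise mutual information

# keyword-in-context concordance for a hypothetical query term
nltk.Text(tokens).concordance('providence', width=79, lines=10)
```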
Various scripts are also included:
- 'NL_helpers.py': a set of helper functions used in the notebooks above.
- 'NL_topicmodels.py': a corpus class for use with gensim and helpers specifically for the topic modelling side of the project.
- 'generate_corpus_df.py': script to go from the dataset stored in tarballs to a collection of pickled pandas dataframes (see the first sketch after this list).
- 'keywords_from_corpus.py': a script to search for keywords in the complete corpus using the dataframes generated by 'generate_corpus_df.py'.
- 'cooccurrence.py': a script to generate cooccurrence scores for given terms and store the results in a dataframe (see the second sketch after this list). This is particularly useful for the Dash app (in a separate GitHub repository).
- 'add_cooccurrence_terms.py': used to add terms to previously generated cooccurrence dataframes.
- 'generate_*.py': scripts to generate various useful outputs.
- 'corpus2markdown.py': takes a corpus and saves it as a series of Markdown files with links to the Papers Past website.
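For orientation, 'generate_corpus_df.py' does something along these lines. A minimal sketch only: the tarball name, the internal layout (one plain-text file per article), and the column names are all assumptions, and the real dataset may require parsing rather than a straight read.

```python
import tarfile
import pandas as pd

rows = []
# hypothetical tarball name and layout: one text file per article
with tarfile.open('papers_past_subset.tar.gz', 'r:gz') as tar:
    for member in tar:
        if member.isfile():
            text = tar.extractfile(member).read().decode('utf-8')
            rows.append({'filename': member.name, 'text': text})

df = pd.DataFrame(rows)
df.to_pickle('pickles/papers_past_subset.pickle')  # one pickled dataframe per subset
```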
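And 'cooccurrence.py' computes something in the spirit of the sketch below: for each query term, count which words appear within a fixed window of it across the corpus. The function name, window size, and raw-count scoring are assumptions; the actual script may score differently.

```python
from collections import Counter
import pandas as pd

def cooccurrence_counts(tokenised_docs, terms, window=10):
    """Count words within `window` tokens of each query term (hypothetical helper)."""
    counts = {term: Counter() for term in terms}
    for tokens in tokenised_docs:
        for i, tok in enumerate(tokens):
            if tok in counts:
                neighbourhood = tokens[max(0, i - window):i + window + 1]
                counts[tok].update(t for t in neighbourhood if t != tok)
    # rows: co-occurring words; columns: query terms
    return pd.DataFrame(counts).fillna(0)

# e.g.: cooc_df = cooccurrence_counts(docs, ['evolution', 'design'])
```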
This repository contains almost all of the code I have used in the course of the project, but not all of the data (some files are too big for GitHub). Much of the code is in rough-and-ready script form and has not been tidied to the standard that a complete recreation of the project would require.