(Best displayed with markdown formatting on.)
Currently, the following refers to the "coal" subset of the data, although most of the structure is replicated for the other subset and the scripts are the same.

The "civility" data is kept separate, even though some of the records may be the same, because storage space is not an issue and downloading can occur in the background with little extra input. With just two datasets about rather different topics, this is more straightforward than devising a new structure that keeps them together (e.g. by date) but remains callable separately at will. In future this should be reconsidered, probably with a proper query structure (SQL-like).
The subfolders are structured in order of execution:
- `records`: obtained from the APH website as per the search string
- `full_text`: raw downloads as rectangular dataframes by year, including the records
- `processed`: cleaned-up version of `full_text`
- `model_inputs`: generated for the model, includes the combined full texts
- `scan_parameters`: produced in bulk
The first three correspond to scripts in `scripts/download`; the last two correspond to scripts in `scripts/modeling`.
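As a rough illustration of that layout, the sketch below loads the per-year `full_text` tables with pandas and stacks them into one dataframe. The directory path, the CSV format, and the `body` column are assumptions for illustration only; the actual file names and schema are set by the download scripts.

```python
# Minimal sketch of loading the per-year full_text dataframes.
# The path, CSV format, and "body" column are assumptions, not
# the repository's confirmed layout.
from pathlib import Path

import pandas as pd

data_dir = Path("coal/full_text")  # hypothetical path
frames = [pd.read_csv(f) for f in sorted(data_dir.glob("*.csv"))]
full_text = pd.concat(frames, ignore_index=True)
print(full_text.shape)
```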
- `dtm`: contains raw output from Dynamic Topic Modelling
- `scan`: contains raw output from SCAN
- `cleaned`: contains processed output from either model in CSV format, which is used to more easily calculate coherence
- `scan_coherence_*.csv`: coherence calculations

All scripts generating the above are stored under `scripts/modeling`.
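For orientation, here is a minimal sketch of how a coherence score could be computed from the cleaned CSV output, using gensim's `CoherenceModel`. The file names and the column names (`top_words`, `body`) are assumptions rather than the repository's actual schema, and gensim is only one possible choice of coherence library.

```python
# Minimal c_v coherence sketch using gensim; file and column names
# below are assumptions for illustration, not the repository's schema.
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

cleaned = pd.read_csv("coal/cleaned/scan_topics.csv")    # assumed file
corpus = pd.read_csv("coal/model_inputs/combined.csv")   # assumed file

# One row per topic, top terms space-separated (assumed layout).
topics = [words.split() for words in cleaned["top_words"]]
# Whitespace-tokenised documents from the combined full text.
texts = [doc.split() for doc in corpus["body"].dropna()]

dictionary = Dictionary(texts)
cm = CoherenceModel(topics=topics, texts=texts,
                    dictionary=dictionary, coherence="c_v")
print(f"c_v coherence: {cm.get_coherence():.3f}")
```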
Also present are the files needed to run SCAN: the Python `requirements.txt` and the R Project data.
Run `scan.sh` to launch the SCAN modelling pipeline. It is on `$PATH`, so it can be called from anywhere and will run with the settings in the corresponding scripts.
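If the pipeline ever needs to be triggered from Python (for example from a notebook), a one-line subprocess call works, since `scan.sh` is already on `$PATH`; this is just a convenience wrapper around the same entry point.

```python
# Invoke the SCAN pipeline from Python; scan.sh is on $PATH, so no
# explicit path is needed. check=True raises if the pipeline fails.
import subprocess

subprocess.run(["scan.sh"], check=True)
```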