From 5869ef9276763ff313c583397cc7273da958f864 Mon Sep 17 00:00:00 2001
From: Kevin Rose
Date: Wed, 20 Sep 2017 00:43:29 -0500
Subject: [PATCH] update docs and model saving

---
 .gitignore                                  |  1 +
 CONTRIBUTING.md                             | 27 +++++++++++++------
 INSTALLATION.md                             |  4 +--
 README.md                                   | 10 ++++---
 .../bag-of-words-count-stemmed-binary.ipynb |  5 ++--
 5 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/.gitignore b/.gitignore
index 3e8c46d..c3cae0e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,3 +6,4 @@ build/
 dist/
 .eggs/
 .cache/
+.DS_Store
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 29c5d1a..069f621 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,16 +1,30 @@
 # Setup

-Fork this repo, download it, and navigate to it. If you're going to be developing it doesn't necessarily make sense to install as a package, but you'll still need to install the dependencies. Head over to the [README](README.md) for instructions on installing the required dependencies.
+Fork this repo, clone your fork, and navigate to it. If you're going to be developing, it doesn't necessarily make sense to install this as a package, but you'll still need the dependencies. If you have not already installed them, see the instructions in the [README](README.md).

-# Getting Started
+## The Data
+
+Due mainly to the file size, the data is not included in the GitHub repo. Instead, it is available on a USB drive at Chi Hack Night, so you'll need to get it there. Extract the archive on the USB drive and copy the contents into the folder `lib/tagnews/data/`. After this is done, your directory should look something like this:
+
+```bash
+(cjp-at) .../article-tagging/lib/tagnews/data$ ls -l
+total 2117928
+-rw-r--r-- 1 kevin.rose 1049089       6071 Sep 19 23:45 column_names.txt
+-rw-r--r-- 1 kevin.rose 1049089 2156442023 Sep 18 21:02 newsarticles_article.csv
+-rw-r--r-- 1 kevin.rose 1049089       2642 Sep 18 21:02 newsarticles_category.csv
+-rw-r--r-- 1 kevin.rose 1049089   10569986 Sep 18 21:02 newsarticles_usercoding.csv
+-rw-r--r-- 1 kevin.rose 1049089    1726739 Sep 18 21:02 newsarticles_usercoding_categories.csv
+```

-Welcome back.
+# Getting Started

-A good place to start is the [notebooks](./lib/notebooks). Reading through these should help you get up to speed, and running them is a pretty good test to make sure everything is installed correctly. You will need the data to run the notebooks. There is no current cloud-based data sharing solution being used. Instead, it is contained on a [USB drive](https://en.wikipedia.org/wiki/Sneakernet), come to the Chi Hack Night meeting to get it! If this will be a problem for you but you are still interested, contact one of the maintainers.
+A good place to start is the [notebooks](./lib/notebooks). We recommend starting with the explorations notebook -- it should give you a sense of what the data looks like. After reading through that, the bag-of-words-count-stemmed-binary notebook should show you what the NLP model for tagging looks like. Reading through these should help you get up to speed, and running them is a pretty good test that everything is installed correctly.

 # What can I do?

-You can check out the [open issues](https://github.com/chicago-justice-project/article-tagging/issues) and see if there's anything you'd like to tackle there.
+It can take a significant amount of time to get everything installed and working correctly, and to get a handle on everything that's going on. It's normal to be confused and have questions. Once you feel comfortable with things, you can:
+
+Check out the [open issues](https://github.com/chicago-justice-project/article-tagging/issues) and see if there's anything you'd like to tackle there.

 If not, you can try and improve upon the existing model(s), but be warned, measuring performance in a multi-label task is non-trivial. See the `bag-of-words-count-stemmed-binary.ipynb` notebook for an attempt at doing so. Tweaking that notebook and seeing how performance changes might be a good place to start tinkering with the NLP code. You can also read the `tagnews.crimetype.benchmark.py` file to get an idea of how the cross validation is being performed.
@@ -18,9 +32,6 @@ Further yet, you can help improve this very documentation.

 # FAQ

-### Where is this scraped data that you're using and how do I get it?
-The scraped data is NOT housed in a Github repositiory - it's on a flash drive. Come to Chi Hack Night in person and save it onto your computer!
-
 ### Do I have to use a specific language to participate in article-tagging?
 Thusfar, most of the work has been done in Python and R, but there's no reason that always has to be the case. If there is another language that would be perfect for this project or that you have expertise in, that works too. Talk with us and we can figure something out.
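A quick way to confirm the data landed where the new CONTRIBUTING.md section expects it is a check like the following. This is a minimal sketch using only the standard library; it assumes you run it from the repository root, and the filenames come from the directory listing above.

```python
from pathlib import Path

# Filenames taken from the `ls -l` listing in CONTRIBUTING.md above.
data_dir = Path("lib/tagnews/data")
expected = [
    "column_names.txt",
    "newsarticles_article.csv",
    "newsarticles_category.csv",
    "newsarticles_usercoding.csv",
    "newsarticles_usercoding_categories.csv",
]
for name in expected:
    status = "ok" if (data_dir / name).is_file() else "MISSING"
    print(f"{status:8s} {data_dir / name}")
```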
diff --git a/INSTALLATION.md b/INSTALLATION.md
index ffef4fe..67ba302 100644
--- a/INSTALLATION.md
+++ b/INSTALLATION.md
@@ -30,13 +30,13 @@ Pre-trained models are not saved in Git since they are large binary files. Saved
 python -m tagnews.crimetype.models.binary_stemmed_logistic.model
 ```

-which will save two files, `model.pkl` and `vectorizer.pkl` to your current directory. Generating a model locally will require having the data (see below).
+Generating a model locally will require having the data (see below).

 Wherever these two files end up being located (either by downloading or creating locally), you can reference this folder when creating your `Tagger` instance (see simple usage below).

 ## Data

-The data is not stored in the Git repo since it would take up a considerable amount of space. Instead, the data is dumped daily on the server and can be accessed using a SFTP client. The data is only necessary if you wish to create your own model.
+The data is not stored in the Git repo since it would take up a considerable amount of space. Instead, the data is dumped daily on the server and can be accessed using an SFTP client. The data is only necessary if you wish to create your own model. Come to Chi Hack Night to learn more about the data and how to get it.

 # Simple Usage
diff --git a/README.md b/README.md
index f1ace20..c163dc2 100644
--- a/README.md
+++ b/README.md
@@ -22,13 +22,13 @@ POLM 0.059985

 ## Requirements

-To use this code, you will need at least the python packages [nltk](http://www.nltk.org/), [numpy](http://www.numpy.org/), [scikit-learn](http://scikit-learn.org/), and [pandas](http://pandas.pydata.org/). We recommend using [Anaconda](https://www.continuum.io/downloads) to manage python environments:
+To use this code, you will need at least the python packages [nltk](http://www.nltk.org/), [numpy](http://www.numpy.org/) at version 1.13 or higher, [scikit-learn](http://scikit-learn.org/), and [pandas](http://pandas.pydata.org/). We recommend using [Anaconda](https://www.continuum.io/downloads) to manage python environments:

 ```bash
 $ # create a new anaconda environment with required packages
-$ conda create -n cjp-ap nltk numpy scikit-learn pandas pytest
-$ source activate cjp-ap
-(cjp-ap) $ ...
+$ conda create -n cjp-at nltk "numpy>=1.13" scikit-learn pandas pytest
+$ source activate cjp-at
+(cjp-at) $ ...
 ```

 ## Installation
@@ -37,6 +37,8 @@ Now that you've got the requirements resolved, you're ready to install the libra

 ## Usage

+Below are sample usages for when you just want to use this as a library to make predictions.
+
 ### From python

 The main class is `tagnews.crimetype.tag.Tagger`:
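For orientation, here is roughly what using the saved files directly looks like, as a minimal sketch. It assumes the `model.pkl` and `vectorizer.pkl` filenames from INSTALLATION.md exist in the current directory, and it assumes the pickled objects are scikit-learn-style estimators exposing `transform` and `predict_proba`; in normal use the `Tagger` class handles this loading for you.

```python
import pickle

# Assumption: the two saved files described in INSTALLATION.md are in
# the current directory and are scikit-learn-style objects.
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)

article = "Two men were arrested after a shooting on the South Side."
features = vectorizer.transform([article])  # bag-of-words count features
print(clf.predict_proba(features))          # per-category probabilities
```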
diff --git a/lib/notebooks/bag-of-words-count-stemmed-binary.ipynb b/lib/notebooks/bag-of-words-count-stemmed-binary.ipynb
index f0cdd64..a38d7f4 100644
--- a/lib/notebooks/bag-of-words-count-stemmed-binary.ipynb
+++ b/lib/notebooks/bag-of-words-count-stemmed-binary.ipynb
@@ -743,9 +743,10 @@
    "source": [
     "import pickle\n",
     "\n",
-    "with open('model.pkl', 'wb') as f:\n",
+    "curr_time = time.strftime('%Y%m%d-%H%M%S')\n",
+    "with open('model-' + curr_time + '.pkl', 'wb') as f:\n",
     "    pickle.dump(bench_results['clfs'][0], f)\n",
-    "with open('vectorizer.pkl', 'wb') as f:\n",
+    "with open('vectorizer-' + curr_time + '.pkl', 'wb') as f:\n",
     "    pickle.dump(vectorizer, f)"
   ]
  },
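One caveat on the notebook change above: `time.strftime` is used without `time` appearing in this cell's imports, so `time` presumably needs to be imported in an earlier cell. Below is a self-contained sketch of the same save-and-reload pattern, with stand-in objects in place of the notebook's fitted `bench_results['clfs'][0]` and `vectorizer`.

```python
import glob
import pickle
import time

# Stand-ins for the notebook's fitted classifier and vectorizer;
# any picklable objects work for demonstrating the pattern.
clf = {"kind": "classifier"}
vectorizer = {"kind": "vectorizer"}

# Timestamping the filenames keeps repeated runs from overwriting
# earlier models, which is the point of the notebook change above.
curr_time = time.strftime('%Y%m%d-%H%M%S')
with open('model-' + curr_time + '.pkl', 'wb') as f:
    pickle.dump(clf, f)
with open('vectorizer-' + curr_time + '.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# Because '%Y%m%d-%H%M%S' is zero-padded, the names sort
# lexicographically and the newest model is last in sorted order.
latest = sorted(glob.glob('model-*.pkl'))[-1]
with open(latest, 'rb') as f:
    restored = pickle.load(f)
print(latest, restored)
```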