update docs and model saving
kbrose committed Sep 20, 2017
1 parent ce36c41 commit 5869ef9
Showing 5 changed files with 31 additions and 16 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -6,3 +6,4 @@ build/
dist/
.eggs/
.cache/
.DS_Store
27 changes: 19 additions & 8 deletions CONTRIBUTING.md
@@ -1,26 +1,37 @@
# Setup

Fork this repo, download it, and navigate to it. If you're going to be developing it doesn't necessarily make sense to install as a package, but you'll still need to install the dependencies. Head over to the [README](README.md) for instructions on installing the required dependencies.
Fork this repo, clone your fork, and navigate to it. If you're going to be developing it doesn't necessarily make sense to install as a package, but you'll still need to install the dependencies. If you have not already installed the dependencies, then see the instructions in the [README](README.md).

# Getting Started
## The Data

Due mainly to the file size, the data is not included in the GitHub repo. Instead, it is available on a USB drive at Chi Hack Night, so you'll need to get it there. Extract the data from the archive on the USB drive. Copy the contents into the folder `lib/tagnews/data/`. After this is done, your directory should look something like this:

```bash
(cjp-at) .../article-tagging/lib/tagnews/data$ ls -l
total 2117928
-rw-r--r-- 1 kevin.rose 1049089 6071 Sep 19 23:45 column_names.txt
-rw-r--r-- 1 kevin.rose 1049089 2156442023 Sep 18 21:02 newsarticles_article.csv
-rw-r--r-- 1 kevin.rose 1049089 2642 Sep 18 21:02 newsarticles_category.csv
-rw-r--r-- 1 kevin.rose 1049089 10569986 Sep 18 21:02 newsarticles_usercoding.csv
-rw-r--r-- 1 kevin.rose 1049089 1726739 Sep 18 21:02 newsarticles_usercoding_categories.csv
```
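
A quick way to confirm the copy worked is to check for the five files shown in the listing above. The following is only a minimal sketch and is not part of the repo: the helper name `missing_data_files` is hypothetical, and the expected file names are taken directly from the directory listing.

```python
import os

# File names copied from the directory listing above.
EXPECTED_DATA_FILES = [
    "column_names.txt",
    "newsarticles_article.csv",
    "newsarticles_category.csv",
    "newsarticles_usercoding.csv",
    "newsarticles_usercoding_categories.csv",
]


def missing_data_files(data_dir):
    """Return the expected file names that are not present in data_dir.

    Hypothetical helper for illustration; it is not part of tagnews.
    """
    return [
        name
        for name in EXPECTED_DATA_FILES
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


if __name__ == "__main__":
    # Path is relative to the root of your clone.
    missing = missing_data_files(os.path.join("lib", "tagnews", "data"))
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All expected data files are present.")
```

Run it from the root of your clone; an empty result means the notebooks should be able to find the data.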

Welcome back.
# Getting Started

A good place to start is the [notebooks](./lib/notebooks). Reading through these should help you get up to speed, and running them is a pretty good test to make sure everything is installed correctly. You will need the data to run the notebooks. There is no current cloud-based data sharing solution being used. Instead, it is contained on a [USB drive](https://en.wikipedia.org/wiki/Sneakernet), come to the Chi Hack Night meeting to get it! If this will be a problem for you but you are still interested, contact one of the maintainers.
A good place to start is the [notebooks](./lib/notebooks). We recommend starting with the explorations notebook -- it should give you a sense of what the data looks like. After reading through that, the bag-of-words-count-stemmed-binary notebook should give you a sense of what the NLP model for tagging looks like. Reading through these should help you get up to speed, and running them is a pretty good test to make sure everything is installed correctly.

# What can I do?

You can check out the [open issues](https://github.com/chicago-justice-project/article-tagging/issues) and see if there's anything you'd like to tackle there.
It's important to keep in mind that it can take a significant amount of time to make sure everything is installed and working correctly, and to get a handle on everything that's going on. It's normal to be confused and have questions. Once you feel comfortable with things, then you can:

Check out the [open issues](https://github.com/chicago-justice-project/article-tagging/issues) and see if there's anything you'd like to tackle there.

If not, you can try and improve upon the existing model(s), but be warned, measuring performance in a multi-label task is non-trivial. See the `bag-of-words-count-stemmed-binary.ipynb` notebook for an attempt at doing so. Tweaking that notebook and seeing how performance changes might be a good place to start tinkering with the NLP code. You can also read the `tagnews.crimetype.benchmark.py` file to get an idea of how the cross validation is being performed.
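
To illustrate why a single score is hard to settle on in a multi-label task, the sketch below computes three common measures on made-up toy data: per-label precision and recall, Hamming loss, and the exact-match ratio. It is an assumption-laden illustration in plain Python, not the method used in the notebook or in `tagnews.crimetype.benchmark.py`, and all function names here are hypothetical.

```python
def per_label_precision_recall(y_true, y_pred):
    """Return a (precision, recall) pair for each label column.

    y_true and y_pred are lists of rows; each row is a list of 0/1 flags,
    one flag per tag.
    """
    results = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results.append((precision, recall))
    return results


def hamming_loss(y_true, y_pred):
    """Fraction of individual (article, tag) flags that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        1
        for t_row, p_row in zip(y_true, y_pred)
        for t, p in zip(t_row, p_row)
        if t != p
    )
    return wrong / total


def exact_match_ratio(y_true, y_pred):
    """Fraction of articles whose full tag set is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy data: 4 articles, 3 tags (values are invented for illustration).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print(per_label_precision_recall(y_true, y_pred))
print(hamming_loss(y_true, y_pred))
print(exact_match_ratio(y_true, y_pred))
```

On this toy data, 75% of the individual flags are right but only one article in four has its whole tag set right, so which measure you report changes how good the same model looks.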

Further yet, you can help improve this very documentation.

# FAQ

### Where is this scraped data that you're using and how do I get it?
The scraped data is NOT housed in a GitHub repository - it's on a flash drive. Come to Chi Hack Night in person and save it onto your computer!

### Do I have to use a specific language to participate in article-tagging?

Thus far, most of the work has been done in Python and R, but there's no reason that always has to be the case. If there is another language that would be perfect for this project or that you have expertise in, that works too. Talk with us and we can figure something out.
4 changes: 2 additions & 2 deletions INSTALLATION.md
@@ -30,13 +30,13 @@ Pre-trained models are not saved in Git since they are large binary files. Saved
python -m tagnews.crimetype.models.binary_stemmed_logistic.model
```

which will save two files, `model.pkl` and `vectorizer.pkl` to your current directory. Generating a model locally will require having the data (see below).
Generating a model locally will require having the data (see below).

Wherever these two files end up being located (either by downloading or creating locally), you can reference this folder when creating your `Tagger` instance (see simple usage below).

## Data

The data is not stored in the Git repo since it would take up a considerable amount of space. Instead, the data is dumped daily on the server and can be accessed using an SFTP client. The data is only necessary if you wish to create your own model.
The data is not stored in the Git repo since it would take up a considerable amount of space. Instead, the data is dumped daily on the server and can be accessed using an SFTP client. The data is only necessary if you wish to create your own model. Come to ChiHackNight to learn more about the data and how to get it.

# Simple Usage

10 changes: 6 additions & 4 deletions README.md
@@ -22,13 +22,13 @@ POLM 0.059985

## Requirements

To use this code, you will need at least the python packages [nltk](http://www.nltk.org/), [numpy](http://www.numpy.org/), [scikit-learn](http://scikit-learn.org/), and [pandas](http://pandas.pydata.org/). We recommend using [Anaconda](https://www.continuum.io/downloads) to manage python environments:
To use this code, you will need at least the python packages [nltk](http://www.nltk.org/), [numpy](http://www.numpy.org/) at version 1.13 or higher, [scikit-learn](http://scikit-learn.org/), and [pandas](http://pandas.pydata.org/). We recommend using [Anaconda](https://www.continuum.io/downloads) to manage python environments:

```bash
$ # create a new anaconda environment with required packages
$ conda create -n cjp-ap nltk numpy scikit-learn pandas pytest
$ source activate cjp-ap
(cjp-ap) $ ...
$ conda create -n cjp-at nltk "numpy>=1.13" scikit-learn pandas pytest
$ source activate cjp-at
(cjp-at) $ ...
```
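
After activating the environment, it can help to confirm that the packages are importable and that numpy meets the 1.13 minimum. The sketch below is one possible check, not something shipped with the project: `version_at_least` and `check_environment` are hypothetical helper names, and the only assumption beyond the text above is that scikit-learn is imported under the module name `sklearn`.

```python
import importlib.util

# Import names for the required packages; scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["nltk", "numpy", "sklearn", "pandas"]


def version_at_least(version, minimum):
    """Return True if a dotted version string is >= the minimum tuple.

    Only the leading digits of each dotted piece are used, so a string
    such as "1.13.0rc1" is read as (1, 13, 0).
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts) >= tuple(minimum)


def check_environment():
    """Return a list of human-readable problems (empty if none were found)."""
    missing = [m for m in REQUIRED_MODULES if importlib.util.find_spec(m) is None]
    problems = ["missing package: " + m for m in missing]
    if "numpy" not in missing:
        import numpy

        if not version_at_least(numpy.__version__, (1, 13)):
            problems.append("numpy " + numpy.__version__ + " is older than 1.13")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

An empty output means the four packages were found and numpy is recent enough; it does not verify that the packages themselves work.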

## Installation
@@ -37,6 +37,8 @@ Now that you've got the requirements resolved, you're ready to install the libra

## Usage

Below are sample usages when you want to just use this as a library to make predictions.

### From python

The main class is `tagnews.crimetype.tag.Tagger`:
5 changes: 3 additions & 2 deletions lib/notebooks/bag-of-words-count-stemmed-binary.ipynb
@@ -743,9 +743,10 @@
"source": [
"import pickle\n",
"\n",
"with open('model.pkl', 'wb') as f:\n",
"curr_time = time.strftime('%Y%m%d-%H%M%S')\n",
"with open('model-' + curr_time + '.pkl', 'wb') as f:\n",
" pickle.dump(bench_results['clfs'][0], f)\n",
"with open('vectorizer.pkl', 'wb') as f:\n",
"with open('vectorizer-' + curr_time + '.pkl', 'wb') as f:\n",
" pickle.dump(vectorizer, f)"
]
},
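
Note that the new lines in this hunk call `time.strftime`, while the hunk itself shows only `import pickle`; `time` is presumably imported in an earlier notebook cell that is not visible here. The following standalone sketch reproduces the same timestamped-filename pattern with its imports included. The function name `save_model_files` and the placeholder objects are hypothetical; in the notebook the saved objects are `bench_results['clfs'][0]` and `vectorizer`.

```python
import os
import pickle
import time


def save_model_files(model, vectorizer, directory="."):
    """Pickle both objects using one shared timestamp in the file names.

    Files are written as 'model-<YYYYmmdd-HHMMSS>.pkl' and
    'vectorizer-<YYYYmmdd-HHMMSS>.pkl'. Returns a dict of the paths.
    """
    # Compute the timestamp once so the two files can be matched up later.
    curr_time = time.strftime("%Y%m%d-%H%M%S")
    paths = {}
    for prefix, obj in (("model", model), ("vectorizer", vectorizer)):
        path = os.path.join(directory, prefix + "-" + curr_time + ".pkl")
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        paths[prefix] = path
    return paths


if __name__ == "__main__":
    import tempfile

    # Placeholder objects stand in for the trained classifier and vectorizer.
    out_dir = tempfile.mkdtemp()
    print(save_model_files({"kind": "clf"}, {"kind": "vec"}, out_dir))
```

Sharing one timestamp between the two files keeps each model paired with the vectorizer it was trained with, and repeated runs no longer overwrite earlier output.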
