Commit

updating documentation
kbrose committed Aug 29, 2017
1 parent 59816e9 commit 30a7ab2
Showing 2 changed files with 59 additions and 34 deletions.
34 changes: 34 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,34 @@
# Setup

Fork this repo, download it, and navigate to it. If you're going to be developing, it doesn't necessarily make sense to install it as a package, but you'll still need to install the dependencies. See the [README](README.md) for more info.

Once that's done, a good place to start is the [notebooks](./lib/notebooks). Reading through these should help you get up to speed, and running them is a good test that everything is installed correctly. You will need the data to run the notebooks. There is currently no cloud-based data sharing solution; instead, the data is kept on a USB drive, so come to the Chi Hack Night meeting to get it! If this will be a problem for you but you are still interested, contact one of the maintainers.

# What can I do?

You can check out the [open issues](https://github.com/chicago-justice-project/article-tagging/issues) and see if there's anything you'd like to tackle there.

If not, you can try to improve upon the existing model(s), but be warned: measuring performance in a multi-label task is non-trivial. See the `bag-of-words-count-stemmed-binary.ipynb` notebook for an attempt at doing so. Tweaking that notebook and seeing how performance changes might be a good place to start tinkering with the NLP code. You can also read the `tagnews.crimetype.benchmark.py` file to get an idea of how the cross validation is performed.
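To see why multi-label evaluation is subtle, here is a minimal pure-Python sketch (toy data, not the project's actual benchmark code) contrasting two reasonable metrics that can tell very different stories about the same predictions:

```python
# Toy k-hot labels: rows are articles, columns are tags (e.g. DRUG, CPD, VIOL).
y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]

def exact_match_accuracy(y_true, y_pred):
    """Fraction of articles where *every* tag is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual tag decisions that are wrong."""
    errors = sum(ti != pi
                 for t, p in zip(y_true, y_pred)
                 for ti, pi in zip(t, p))
    return errors / (len(y_true) * len(y_true[0]))

print(exact_match_accuracy(y_true, y_pred))  # 1/3: only one article is fully correct
print(hamming_loss(y_true, y_pred))          # 2/9: yet most tag decisions are right
```

The same predictions look poor under exact-match accuracy but good under Hamming loss, which is one reason choosing a performance measure here is non-trivial.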

Finally, you can also help improve this very documentation.

# FAQ

### Where is this scraped data that you're using and how do I get it?
The scraped data is NOT housed in a GitHub repository - it's on a flash drive. Come to Chi Hack Night in person and save it onto your computer!

### Do I have to use a specific language to participate in article-tagging?

Thus far, most of the work has been done in Python and R, but there's no reason that always has to be the case. If there is another language that would be perfect for this project or that you have expertise in, that works too. Talk with us and we can figure something out.

### Are there concepts that will be helpful for me to understand?

Definitely! [This sklearn user guide](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) details a number of the text analysis methodologies this project uses (sklearn is a Python library, but the user guide is great for understanding machine learning text analysis in general). Also, see the section on 'Automated Article Tagging' in the [README](./README.md) for more detailed literature on some of the relevant concepts. Reading the code and looking up concepts you are unfamiliar with is a valid path forward as well!
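As a taste of the "bag of words" idea that the sklearn user guide covers, here is a pure-Python sketch (invented toy documents; the real project uses sklearn's vectorizers rather than hand-rolled code like this):

```python
from collections import Counter

# Each document becomes a vector of word counts over a shared vocabulary.
docs = [
    "police arrest suspect in drug case",
    "drug case goes to court",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted(set(word for doc in tokenized for word in doc))

def bag_of_words(tokens, vocabulary):
    """Count occurrences of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in tokenized]
print(vocabulary)
print(vectors)
```

Once every article is a numeric vector like this, standard classifiers can be trained on them, which is the starting point for most of the text analysis methods linked above.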

### I want to contribute to Chicago Justice Project but I don’t want to work on this NLP stuff. What can I do?

You can help out the [the team scraping articles/maintaining the volunteers' web interface](https://github.com/chicago-justice-project/chicago-justice). If that doesn't sound interesting either, we can always use more [volunteer taggers](http://chicagojustice.org/volunteer-for-cjp/). Or just show up Tuesday nights and ask what you can do!

### How do I productize a model?

You [pickle](https://docs.python.org/3.6/library/pickle.html) it. But working with pickle is difficult. In order to be able to load things sanely, I'm running the python files that pickle the model using the `-m` flag, e.g. `python -m tagnews.crimetype.models.binary_stemmed_logistic.model` will run code that generates the pickles of the model. All modules should be imported in the same way they will exist when unpickling the model from `tagnews.crimetype.tag`.
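The reason import paths matter is that pickle stores *where* a class lives, not the class itself. A minimal sketch of the mechanic (the `Model` class here is hypothetical, not tagnews code):

```python
import io
import pickle

class Model:
    """Stand-in for a trained model (hypothetical; not tagnews code)."""
    def __init__(self, threshold):
        self.threshold = threshold

# pickle.dump records the class's module path (here '__main__.Model'),
# not the class definition itself.
buf = io.BytesIO()
pickle.dump(Model(0.5), buf)

# pickle.load re-imports the class from that recorded path, so the same
# path must be importable at unpickling time.
buf.seek(0)
restored = pickle.load(buf)
print(restored.threshold)  # 0.5
```

If the class had been pickled under a different module path than the one available at load time, `pickle.load` would fail - which is presumably why pickling is run via `python -m ...`: it keeps module paths identical at pickle time and when unpickling from `tagnews.crimetype.tag`.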
59 changes: 25 additions & 34 deletions README.md
@@ -2,7 +2,7 @@

## Requirements

To install this library, you will need at least the python packages [nltk](http://www.nltk.org/), [numpy](http://www.numpy.org/), [scikit-learn](http://scikit-learn.org/), and [pandas](http://pandas.pydata.org/). We recommend using [Anaconda](https://www.continuum.io/downloads):
To install this library, you will need at least the python packages [nltk](http://www.nltk.org/), [numpy](http://www.numpy.org/), [scikit-learn](http://scikit-learn.org/), and [pandas](http://pandas.pydata.org/). We recommend using [Anaconda](https://www.continuum.io/downloads) to manage python environments:

```bash
$ # create a new anaconda environment with required packages
@@ -33,39 +33,48 @@ TODO

## Usage

### Inside python
### From python

The main class is `tagnews.crimetype.tag.Tagger`:

```python
>>> import tagnews
>>> tagger = tagnews.crimetype.tag.Tagger()
>>> article_text = 'This is an article about lots of crimes. Crimes about drugs.'
>>> article_text = 'A short article. About drugs and police.'
>>> tagger.relevant(article_text, prob_thresh=0.1)
True
>>> tagger.tagtext(article_text, prob_thresh=0.5)
['DRUG', 'CPD']
>>> tagger.tagtext_prob(article_text)
<pandas series>
>>> tagger.tagtext_proba(article_text)
DRUG 0.747944
CPD 0.617198
VIOL 0.183003
UNSPC 0.145019
ILSP 0.114254
POLM 0.059985
...
```
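Judging from the example above, `tagtext` appears to be a thresholded view of the probabilities that `tagtext_proba` returns. A plain-Python sketch of that relationship (the numbers are copied from the example output, and `tags_above` is a hypothetical helper, not part of tagnews):

```python
# Toy tag probabilities shaped like tagtext_proba's output.
probs = {"DRUG": 0.747944, "CPD": 0.617198, "VIOL": 0.183003, "UNSPC": 0.145019}

def tags_above(probs, prob_thresh=0.5):
    """Return tags whose probability exceeds the threshold, highest first."""
    kept = [tag for tag, p in probs.items() if p > prob_thresh]
    return sorted(kept, key=lambda tag: probs[tag], reverse=True)

print(tags_above(probs, prob_thresh=0.5))  # ['DRUG', 'CPD']
print(tags_above(probs, prob_thresh=0.1))  # ['DRUG', 'CPD', 'VIOL', 'UNSPC']
```

Lowering `prob_thresh` trades precision for recall: more tags are attached to each article, but with less confidence in each.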

### Command line interface
### From the command line

The installation comes with a command line interface, which without any arguments defaults to reading from stdin.
The installation comes with a *very* rudimentary command line interface, which without any arguments defaults to reading from stdin.

```bash
$ python -m tagnews.crimetype.cli
Go ahead and start typing. Hit ctrl-d when done.
<type here>
```

Or you can provide an article to tag.
Or you can provide a list of articles to tag; for each article, a CSV of tag probabilities is written to `<article name>.tagged`.

```bash
$ python -m tagnews.crimetype.cli sample-article.txt
$ cat sample-article.txt.tagged
GUNV, 0.9877
HOMI, 0.8765
$ python -m tagnews.crimetype.cli sample-article-1.txt sample-article-2.txt
$ cat sample-article-1.txt.tagged
CPD, 0.912382307
UNSPC, 0.051873838
SEXA, 0.031065436
BEAT, 0.023119570
DRUG, 0.017140532
...
```
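If you want to consume a `.tagged` file from Python, its format (inferred from the example output above: one `TAG, probability` pair per line) can be parsed with the standard library. A sketch, using an inline string in place of a real file:

```python
import csv
import io

# Stand-in for open("sample-article-1.txt.tagged") -- same assumed format.
tagged = """\
CPD, 0.912382307
UNSPC, 0.051873838
SEXA, 0.031065436
"""

# skipinitialspace handles the space after each comma.
probs = {}
for tag, prob in csv.reader(io.StringIO(tagged), skipinitialspace=True):
    probs[tag] = float(prob)

print(probs["CPD"])  # 0.912382307
```

From there the probabilities can be thresholded, sorted, or joined against other data however you like.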

@@ -79,6 +88,10 @@ We meet every Tuesday at [Chi Hack Night](https://chihacknight.org/), and you ca

The [Chicago Justice Project](http://chicagojustice.org/) has been scraping RSS feeds of articles written by Chicago area news outlets for several years, allowing them to collect almost 300,000 articles. At the same time, an amazing group of [volunteers](http://chicagojustice.org/volunteer-for-cjp/) have helped them tag these articles. The tags include crime categories like "Gun Violence", "Drugs", "Sexual Assault", but also organizations such as "Cook County State's Attorney's Office", "Illinois State Police", "Chicago Police Department", and other miscellaneous categories such as "LGBTQ", "Immigration". The volunteer UI was also recently updated to allow highlighting of geographic information.

# Contributing

You want to contribute? Great! Check out the [CONTRIBUTING.md](./CONTRIBUTING.md) file for more info.

# Areas of research

## Type-of-Crime Article Tagging
@@ -112,30 +125,8 @@ Things to checkout:

Some articles may discuss multiple crimes. Some crimes may occur in multiple areas, whereas others may not be associated with any geographic information (e.g. some kinds of fraud).

# The Code

Under the `lib` folder you can find the source code.

The `load_data.py` file will load the data from the CSV files (not stored in GitHub). Specifically, look at the `load_data.load_data()` method, which returns a `k`-hot encoded tagging and the article data.
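"`k`-hot" here means each article's tags become a 0/1 vector with one column per possible tag. A toy sketch of the encoding (invented tags and data; see `load_data.load_data()` for the real implementation):

```python
# One column per possible tag, one row per article.
all_tags = ["CPD", "DRUG", "GUNV", "HOMI"]
article_tags = [["DRUG", "CPD"], ["GUNV"], []]  # third article is untagged

k_hot = [[1 if tag in tags else 0 for tag in all_tags]
         for tags in article_tags]
print(k_hot)  # [[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```

Unlike one-hot encoding, several entries per row may be 1, since an article can carry multiple tags at once.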

# How to Contribute FAQ

### How can I stay up to date on what you're doing with article-tagging?
Check this document for updates and subscribe to the #quantifyingjusticenews channel on Chi Hack Night's team on Slack.
### Where is this scraped data that you're using and how do I get it?
The scraped data is NOT housed in a GitHub repository - it's on a flash drive. Come to Chi Hack Night in person and save it onto your computer!
### Do I have to use a specific language to participate in article-tagging?
Thus far, most of the work has been done in Python, but there's no reason that always has to be the case. If there is another language that would be perfect for this project or that you have expertise in, that works too.
### Are there concepts that will be helpful for me to understand?
Definitely! [This sklearn user guide](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) details a number of the text analysis methodologies this project uses (sklearn is a Python library, but the user guide is great for understanding machine learning text analysis in general). Also, see the section above on 'Automated Article Tagging' for more detailed literature on some of the relevant concepts.
### I want to contribute to Chicago Justice Project but I don’t want to work on tagging article subjects OR geolocating articles. What can I do?
Help [the team scraping articles](https://github.com/chicago-justice-project/chicago-justice) (that's where this team gets its data) or help [the team building a front-end](https://github.com/chicago-justice-project/chicago-justice-client) to share this project's insights with Chicago and the world. Or just show up Tuesday nights and ask what you can do!
# See Also

* [Chicago Justice Project](http://chicagojustice.org/)
* [Database Repo](https://github.com/kyaroch/chicago-justice)
* [Chi Hack Night Group Description](https://github.com/chihacknight/breakout-groups/issues/61)

# Saving a new model

Working with pickle is difficult. In order to be able to load things sanely, I'm running the python files that pickle the model using the `-m` flag, e.g. `python -m tagnews.crimetype.models.binary_stemmed_logistic.model`.
