Running this project

Setup

  1. Configure a PostgreSQL database with the KDD Cup 2014 DonorsChoose dataset, loading tables from donations.csv, essays.csv, projects.csv, and resources.csv.

Note: Kaggle hosts a more recent DonorsChoose dataset, which includes data as recent as May 2018. It contains a similar set of variables, but in a different schema.

  2. Create a new Python environment and install the Python prerequisites from requirements.txt:

     `pip install -r requirements.txt`
    
  3. Create a database.yaml file with your credentials.

How to use

  1. Run database_preparation.py. This executes the queries in database_prep_queries against your configured database, improving database performance and generating several time-aggregate features.
  2. Run main.py. This will run the Triage experiment defined in donors-choose-config.yaml.
  3. Run model_selection.ipynb. Be sure to update experiment_id to match the experiment hash generated by step 2.

Introduction

DonorsChoose

DonorsChoose is a nonprofit that addresses the education funding gap through crowdfunding. Since 2000, they have facilitated $970 million in donations to 40 million students in the United States.

However, approximately one third of all projects posted on DonorsChoose do not reach their funding goal within four months of posting.

This project will help DonorsChoose shrink the education funding gap by ensuring that more projects reach their funding goals. We will create an early warning system that identifies newly-posted projects that are likely to fail to meet their funding goals, allowing DonorsChoose to target those projects with an intervention such as a donation matching grant.

The DonorsChoose Database

We use four tables from the DonorsChoose database:

| Name | Description | Primary Key | Used? |
|---|---|---|---|
| projects | Basic metadata including teacher, class, and school information, and project asking price. | projectid | yes |
| resources | Information about the classroom supply or resource to be funded. Product category, per-unit price, quantity requested, etc. | projectid | yes |
| essays | Text of funding request. | projectid | yes |
| donations | Table of donations. Donor information, donation amount, messages from donors. Zero to many rows per project. | donationid | yes |

Initial Processing

We performed some initial processing of the source database to improve database performance and ensure compliance with Triage. The altered tables are stored in a second database schema, leaving the raw data intact.

Renaming projectid to entity_id

Triage expects each feature and label row to be identified by a primary key called entity_id. For convenience, we renamed projectid (our entity primary key) to entity_id.

Integer entity ids

We replaced the source database's string (postgres varchar(32)) projectid key with integer keys. Triage requires integer entity_id values, and integer keys improve performance on joins and group operations.
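
As an illustration of the key replacement, here is a minimal sketch that assigns a dense integer id to each distinct string id. The function name `build_integer_ids` and the sample ids are hypothetical, and the project may well perform this step in SQL; the sketch only shows the mapping itself.

```python
def build_integer_ids(string_ids):
    """Map each distinct string id to an integer id, in order of first appearance."""
    mapping = {}
    for sid in string_ids:
        if sid not in mapping:
            # Start at 1, as a SQL serial column would.
            mapping[sid] = len(mapping) + 1
    return mapping


# Hypothetical 32-character project ids, for illustration only.
raw_ids = ["a" * 32, "b" * 32, "a" * 32]
id_map = build_integer_ids(raw_ids)
entity_ids = [id_map[sid] for sid in raw_ids]
```

The same mapping has to be applied to every table that references a project, so that joins on the new integer key still line up.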

Primary & Foreign Key constraints

We created primary key constraints on projectid in all tables (and a foreign key constraint on donations.projectid). This creates indexes on each of those columns, improving performance in label & feature generation.

Problem Framing

Let's start by stating our qualitative understanding of the problem. Then, we'll translate that into a formal problem framing, using the language of the Triage experiment config file.

Qualitative Framing

Once a DonorsChoose project has been posted, it can receive donations for four months. If it doesn't reach its funding goal by then, it is considered unfunded.

DonorsChoose wants to institute a program where a group of projects at risk of falling short on funding are selected to receive extra support: enrollment in a matching grant program funded by DonorsChoose's corporate partners, and prominent placement on the DonorsChoose project discovery page.

These interventions have limited capacity: funding is limited, and only a few projects at a time can really benefit from extra promotion on the homepage. Each day, DonorsChoose chooses a few newly-posted projects to be enrolled in these interventions, based on information in their application for funding. Each month, 50 projects in total are enrolled in the interventions.

Therefore, our goal is to find a machine learning model that identifies the 50 projects posted each month that are most likely to fail to reach their funding goal.

Triage Framing

Temporal Config

Start & end times

[Figure: project counts by month]

We'll use the earliest available data in feature generation. Historical information on project performance is likely useful in predicting the performance of new projects in the same locations, or under the same teachers.

feature_start_time: '2000-01-01'
feature_end_time: '2013-06-01'

We're most interested in evaluating the performance of our models on data from recent years. We select a dataset starting in mid-2011, after this period of growth, and running through the end of 2013, the last complete year of data.

label_start_time: '2011-09-02'
label_end_time: '2013-06-01'

Starting our label range with September 1, 2011 causes Triage to generate a useless 13th training set which contains a single day's worth of projects. We start our data on September 2, 2011 to avoid this.

Model update frequency

Each month, the previous month's data becomes available for training a new model.

model_update_frequency: '1month'

Test set duration

Our model will make predictions once a month, on the previous month's unlabeled data. Our one month test set length reflects this.

test_durations: ['1month']

Training history

Patterns in the DonorsChoose data can change significantly within a year. We use one-month training sets, ensuring that our models capture trends from recent data.

max_training_histories: ['1month']

As of date frequencies

When the model is in production, DonorsChoose will evaluate new projects daily. We use a one-day as-of-date frequency to simulate the rate at which DonorsChoose will access the model's predictions.

training_as_of_date_frequencies: ['1day']
test_as_of_date_frequencies: ['1day']

Label timespan

A project's label timespan is the amount of time that must pass from when it is posted, to when its label can be determined. In our case, each project has a four month label timespan.

training_label_timespans: ['4month']
test_label_timespans: ['4month']

Here's a timechop diagram representing our temporal config:

[Figure: timechop diagram]

Outcome

Under our framing, each project can have one of two outcomes:

  • Fully funded: Total donations in the four months following posting were equal to or greater than the requested amount
  • Not fully funded: Total donations in the four months following posting were less than the requested amount.

We generate our label with a query that sums total donations to each project, and calculates a binary variable representing whether the project went unfunded (1) or met its goal (0).
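
The labeling rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's label query: the function names, the `(donation_date, amount)` input shape, and the choice of a half-open window (donations on the final day are excluded) are all hypothetical.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Return the date `months` calendar months after `d`, clamping the day to the month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def unfunded_label(date_posted, asking_price, donations, label_months=4):
    """Return 1 if donations inside the label window fall short of the asking price, else 0.

    `donations` is an iterable of (donation_date, amount) pairs.
    """
    window_end = add_months(date_posted, label_months)
    total = sum(
        amount for when, amount in donations if date_posted <= when < window_end
    )
    return int(total < asking_price)


# Hypothetical project: the second donation arrives after the four-month window.
posted = date(2012, 1, 15)
donations = [(date(2012, 2, 1), 100.0), (date(2012, 6, 1), 400.0)]
label = unfunded_label(posted, 500.0, donations)
```

Here the late donation is ignored, so the project counts as unfunded even though it eventually raised its full asking price.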

Metric

Since our intervention is resource-constrained and limited to 50 projects each month, we are concerned with minimizing false positives. We track how our models perform on precision among the 50 projects predicted at highest risk of going unfunded.
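
For reference, precision among the top k predictions can be computed as in the sketch below. The function name `precision_at_k` is hypothetical, and ties are broken by input order here, which may differ from how Triage handles tied scores.

```python
def precision_at_k(scores, labels, k=50):
    """Fraction of the k highest-scored entities whose true label is 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Sort by score, highest first; Python's sort is stable, so ties keep input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        raise ValueError("no predictions to evaluate")
    return sum(label for _, label in top) / len(top)


# Toy example with k=2: one of the two highest-scored projects went unfunded.
example = precision_at_k([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0], k=2)
```

A false positive in this setting is a project placed in the top 50 that would have been funded anyway, which wastes one of the limited intervention slots.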

Feature Generation

We implement two categories of features. The first are features that we read directly from the database, raw, or with basic transformations. These include information like teacher and school demographics, type and price of requested classroom resources, and essay length.

Triage can generate these features directly from our source data, without us performing any manual transforms or aggregations.

The second category of features are temporal aggregations of historical donation information. These answer questions like "how did a posting teacher's previous projects perform?" and "how did previous projects at the originating school perform?"

These aggregations would be too complex to perform with Triage's feature aggregation system. So we generate them manually and store them alongside the source data.
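
To make the idea concrete, here is a minimal sketch of one such aggregate: the share of a teacher's earlier projects that were funded. The function name and the dictionary keys (`teacher_id`, `outcome_known`, `funded`) are hypothetical and do not come from the project's queries. The sketch assumes that only outcomes already known on the as-of date may be used, so that no future information leaks into the feature.

```python
from datetime import date


def prior_funded_rate(projects, teacher_id, as_of_date):
    """Share of a teacher's earlier projects that were funded, among outcomes known by as_of_date.

    Returns None when the teacher has no project with a known outcome.
    """
    known = [
        p["funded"]
        for p in projects
        if p["teacher_id"] == teacher_id and p["outcome_known"] <= as_of_date
    ]
    if not known:
        return None
    return sum(known) / len(known)


# Hypothetical history: the third project's outcome is not yet known on 2012-06-01.
history = [
    {"teacher_id": "t1", "outcome_known": date(2012, 1, 1), "funded": True},
    {"teacher_id": "t1", "outcome_known": date(2012, 3, 1), "funded": False},
    {"teacher_id": "t1", "outcome_known": date(2012, 9, 1), "funded": True},
    {"teacher_id": "t2", "outcome_known": date(2012, 2, 1), "funded": True},
]
rate = prior_funded_rate(history, "t1", date(2012, 6, 1))
```

The same pattern applies to school-level aggregates by grouping on a school identifier instead of the teacher.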

The DDL statements that create these features are stored in precompute_queries.

Note: in donors-choose-config.yaml, we define all feature aggregates over the default interval all. This parameter isn't relevant to this project, because all of our time-aggregate features are calculated outside of Triage.

Modeling

Model Grid

Our model grid includes three model function candidates, and three baseline model specifications.

Model function candidates:

  • sklearn.ensemble.RandomForestClassifier
  • sklearn.linear_model.LogisticRegression
  • sklearn.tree.DecisionTreeClassifier

Baselines:

  • sklearn.tree.DecisionTreeClassifier (max_depth = 2)
  • sklearn.dummy.DummyClassifier (predicting our label's base rate)
  • Triage's PercentileRankOneFeature, which ranks entities based on the value of a single feature (here, project total asking price)
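
The single-feature baseline can be pictured with the sketch below, which scores each project by the percentile rank of its asking price. This is an illustration of the idea, not Triage's implementation; the function name is hypothetical, and it assumes that a higher asking price means a higher risk of going unfunded.

```python
from bisect import bisect_right


def percentile_rank_scores(values):
    """Score each value by the fraction of values less than or equal to it."""
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right counts how many sorted values are <= v.
    return [bisect_right(ordered, v) / n for v in values]


# Hypothetical asking prices for four projects.
asking_prices = [100.0, 500.0, 250.0, 500.0]
scores = percentile_rank_scores(asking_prices)
```

A trained model is only worth its added complexity if it beats this kind of one-feature ranking.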

Model Selection

[Figure: precision@50 over time, all models]

We use Auditioner to manage model selection. Plotting precision@50_abs over time shows that our model groups are generally working well, with most performing better than baselines.

Our logistic regression model groups tend to perform worse than our random forests. The gap in precision (as much as 0.25) is too large to justify choosing logistic regression for its potentially greater interpretability. Plain decision trees also seem to perform consistently worse than random forests.

We use Auditioner to perform some coarse filtering, eliminating the worst-performing model groups:

  • Dropping model groups that achieved precision@50 worse than 0.5 in at least one test set
  • Dropping model groups that had a regret (difference in performance from the best-performing model group) of 0.2 or greater during at least one month
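
The two filtering rules above can be sketched in plain Python as follows. This is not Auditioner's API: the function name, the input shape, and the sample numbers are hypothetical, and the sketch assumes that regret is measured against the best of all model groups before any are dropped.

```python
def filter_model_groups(precision_by_group, min_precision=0.5, max_regret=0.2):
    """Keep model groups that never fall below min_precision and never reach max_regret.

    `precision_by_group` maps a model group name to its precision@50 per test set;
    every list must cover the same test sets in the same order.
    """
    if not precision_by_group:
        return {}
    groups = list(precision_by_group)
    n_periods = len(precision_by_group[groups[0]])
    # Best precision achieved by any group in each test set.
    best = [max(precision_by_group[g][t] for g in groups) for t in range(n_periods)]
    kept = {}
    for g in groups:
        series = precision_by_group[g]
        too_low = any(p < min_precision for p in series)
        too_far = any(b - p >= max_regret for b, p in zip(best, series))
        if not (too_low or too_far):
            kept[g] = series
    return kept


# Made-up precision@50 values for three model groups over three test sets.
results = {
    "random_forest": [0.70, 0.80, 0.75],
    "logistic_regression": [0.55, 0.52, 0.60],
    "decision_tree": [0.60, 0.45, 0.65],
}
kept = filter_model_groups(results)
```

In this toy data the logistic regression group is dropped for its regret in the second test set, and the decision tree group for falling below 0.5.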

Performance in the resulting set of model groups ranges from ~0.5 to 0.8, well above the prior rate of ~0.3. Looking pretty good so far.

[Figure: precision@50 over time]

Building a basic Auditioner model selection grid suggests that variance-penalized average precision (penalty = 0.5) and best current precision minimize regret.

| Criteria | Average regret (precision@50_abs) |
|---|---|
| Best current precision | 0.0905 |
| Most frequently within .1 of best | 0.0942 |
| Most frequently within .03 of best | 0.0996 |
| Random model group | 0.1087 |

[Figure: regret over time]

Best current precision, which selects the best-performing model group to serve as the predictor for the next month, minimizes average regret, and beats a random baseline.
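
As a rough illustration of how this rule's average regret is measured, the sketch below picks, for each period, the group that scored best in the previous period and compares it with the best group in hindsight. The function name, input shape, and sample numbers are hypothetical, and Auditioner's own calculation may differ in its details.

```python
def best_current_regret(precision_by_group):
    """Average regret of choosing, each period, the group that scored best in the previous period."""
    groups = list(precision_by_group)
    if not groups:
        raise ValueError("need at least one model group")
    n_periods = len(precision_by_group[groups[0]])
    if n_periods < 2:
        raise ValueError("need at least two periods")
    regrets = []
    for t in range(1, n_periods):
        # Choose using only the previous period's results.
        chosen = max(groups, key=lambda g: precision_by_group[g][t - 1])
        # Compare against the best any group achieved in the current period.
        best_now = max(precision_by_group[g][t] for g in groups)
        regrets.append(best_now - precision_by_group[chosen][t])
    return sum(regrets) / len(regrets)


# Made-up precision@50 values for two model groups over three periods.
toy = {"a": [0.6, 0.7, 0.5], "b": [0.5, 0.8, 0.75]}
average_regret = best_current_regret(toy)
```

In the toy data the rule picks group "a" for the second period (regret about 0.1) and group "b" for the third (regret 0), for an average of about 0.05.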

This criterion selects three random forest model groups for the next period:

| Model | max_depth | max_features | n_estimators | min_samples_split |
|---|---|---|---|---|
| RandomForestClassifier | 5 | 12 | 1000 | 25 |
| RandomForestClassifier | 10 | 12 | 1000 | 50 |
| RandomForestClassifier | 10 | 12 | 10000 | 50 |