
Original API Draft

Fernando Gutierrez edited this page Jan 4, 2017 · 2 revisions

PyCrunch API draft

  1. Drop rules

Drop rules are used to delete invalid cases: respondents who spent too little time answering the survey ("speeders"), cases with inconsistent data, etc. In Crunch, these are supported using "exclusion filters," which are specified using a logical expression.

For example, suppose client_dataset is the name of a Crunch dataset (assigned to the Python object ds) and disposition is the alias of a variable in this dataset:

import pycrunch as pc
session = pc.login(credentials)
ds = session.datasets.open(name = "client_dataset")
ds.exclude(where = "disposition != 'complete'")

Here, disposition is a categorical variable in Crunch and "complete" is a category label. An equivalent expression uses the .id attribute to reference the numeric code assigned to the "complete" category of disposition:

ds.exclude(where = "disposition.id != 0")

(Here, zero is the id (or code) assigned to completed interviews.)

We can also exclude a list of ids or labels using either:

ds.exclude(where = "disposition.id in (0, 1)")
ds.exclude(where = "disposition in ('complete', 'screenout')")

(Here, parentheses are used to create a tuple in Python.)

We also need to be able to support compound logical expressions, such as:

ds.exclude(where = "disposition.id != 0 or exit_status.id != 1")

Some convenience functions, such as

For numeric variables, we will support the usual arithmetic operations (+, -, *, and /) and comparisons (==, !=, >, >=, <, <=).

Subvariables of an array can be referenced using brackets, e.g. if A2 is an array with subvariable "BBC", we can reference the values of this variable using A2['BBC'] and treat it like an ordinary categorical variable:

ds.exclude(where = "A2['BBC'] in ('One of my favorites', 'Watch frequently')")
ds.exclude(where = "A2['BBC'].id in (1,2)")

We can handle more complicated things later.
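
One way to picture how such where strings could be interpreted is to parse them with Python's standard ast module and walk the resulting tree against a single case. The sketch below is an illustration only, not PyCrunch's implementation: the Category class, the evaluate and is_excluded functions, and the representation of a case as a plain dict are all assumptions made for the example. It assumes Python 3.9 or later.

```python
import ast
import operator

# Comparison and arithmetic operators this sketch understands.
# Any other operator raises KeyError when it is looked up.
COMPARISONS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.In: lambda a, b: a in b,
    ast.NotIn: lambda a, b: a not in b,
}
ARITHMETIC = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}


class Category:
    """Hypothetical categorical value: compares by label, exposes .id."""

    def __init__(self, id, label):
        self.id = id
        self.label = label

    def __eq__(self, other):
        return self.label == getattr(other, "label", other)

    def __hash__(self):
        return hash(self.label)


def evaluate(node, case):
    """Recursively evaluate a parsed expression against one case (a dict)."""
    if isinstance(node, ast.BoolOp):
        results = (evaluate(v, case) for v in node.values)
        return all(results) if isinstance(node.op, ast.And) else any(results)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not evaluate(node.operand, case)
    if isinstance(node, ast.Compare):
        left = evaluate(node.left, case)
        for op, right_node in zip(node.ops, node.comparators):
            right = evaluate(right_node, case)
            if not COMPARISONS[type(op)](left, right):
                return False
            left = right
        return True
    if isinstance(node, ast.BinOp):
        op = ARITHMETIC[type(node.op)]
        return op(evaluate(node.left, case), evaluate(node.right, case))
    if isinstance(node, ast.Name):
        return case[node.id]  # variable alias, e.g. disposition
    if isinstance(node, ast.Attribute) and node.attr == "id":
        return evaluate(node.value, case).id  # e.g. disposition.id
    if isinstance(node, ast.Subscript):
        # Subvariable reference, e.g. A2['BBC']
        return evaluate(node.value, case)[evaluate(node.slice, case)]
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Tuple):
        return tuple(evaluate(e, case) for e in node.elts)
    raise ValueError("unsupported expression: " + ast.dump(node))


def is_excluded(where, case):
    """Return True when the case matches the exclusion expression."""
    tree = ast.parse(where, mode="eval")
    return bool(evaluate(tree.body, case))


# Hypothetical case: a screened-out respondent who watches the BBC.
case = {
    "disposition": Category(1, "screenout"),
    "A2": {"BBC": Category(2, "Watch frequently")},
}
print(is_excluded("disposition != 'complete'", case))
print(is_excluded("A2['BBC'].id in (1,2)", case))
```

Walking the tree explicitly, rather than calling eval(), keeps the accepted syntax limited to the operators listed in this section.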


  2. Recodes

A common operation is to create a new variable out of an existing variable by combining categories. For example, if brandrating is a variable with categories "Very favorable", "Somewhat favorable", "Neutral", "Somewhat unfavorable", "Very unfavorable", "Don't know" (with codes 1,2,3,4,5,9 respectively), we may want to create a new variable brandrating2 using the following:

ds["brandrating"].recode(
    alias = "brandrating2",
    name = "Brand rating (recoded)",
    description = ds.brandrating.description,
    map = {1: (1, 2), 2: 3, 3: (4, 5)},
    else = "missing",
    categories = {1: "Favorable", 2: "Neutral", 3: "Unfavorable"},
    group = ds.brandrating.group
)

The map is a dict with key-value pairs corresponding to new and old values, respectively. The else argument is optional and can be either "missing" (in which case, values missing from the map are made missing) or "copy" (in which case, values missing from the map are unchanged, with the old categories added to the dict of categories).

Most of the arguments to the recode() method can be omitted, with intelligent defaults supplied. For example, the previous recode can be shortened to:

ds["brandrating"].recode(
    alias = "brandrating2",
    map = {1: (1, 2), 2: 3, 3: (4, 5)},
    categories = {1: "Favorable", 2: "Neutral", 3: "Unfavorable"}
)

By default, a recoded variable is inserted into the same group as the source variable, at a position immediately after it; e.g., brandrating and brandrating2 would be in the same group. (See below for more discussion of groups and positions.) Other defaults are to use the name of the source variable (with "(recoded)" appended) and the same description.
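
The semantics of the map and of the "missing"/"copy" fallback can be sketched in plain Python. This is only an illustration of the rule described above, not PyCrunch code: recode_values is a hypothetical helper, missing values are represented as None, and the fallback argument is named default here because else is a reserved word in Python and cannot be used as a keyword argument name.

```python
def recode_values(values, mapping, default="missing"):
    """Apply a {new: old or (old, ...)} map to a list of category codes.

    Codes absent from the map become None when default is "missing",
    or are passed through unchanged when default is "copy".
    """
    if default not in ("missing", "copy"):
        raise ValueError("default must be 'missing' or 'copy'")

    # Invert the map so each old code points at its new code.
    old_to_new = {}
    for new, olds in mapping.items():
        if not isinstance(olds, (tuple, list)):
            olds = (olds,)
        for old in olds:
            if old in old_to_new:
                raise ValueError("old code %r is mapped twice" % (old,))
            old_to_new[old] = new

    recoded = []
    for value in values:
        if value in old_to_new:
            recoded.append(old_to_new[value])
        elif default == "copy":
            recoded.append(value)
        else:
            recoded.append(None)
    return recoded


# brandrating codes 1, 2, 3, 4, 5, 9 from the example above
brandrating = [1, 2, 3, 4, 5, 9]
mapping = {1: (1, 2), 2: 3, 3: (4, 5)}
print(recode_values(brandrating, mapping))          # [1, 1, 2, 3, 3, None]
print(recode_values(brandrating, mapping, "copy"))  # [1, 1, 2, 3, 3, 9]
```

With "copy", an unmapped old code keeps its number, so a real implementation would also need to decide what happens when that number collides with one of the new codes; the sketch does not address this.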

We do not allow you to overwrite an existing variable. You can achieve more or less the same effect by hiding the original variable:

ds["brandrating"].hide()

so that only the recoded variable (brandrating2) is visible. Recoded variables are pure derived variables, so (at least for now) the source variable is never overwritten.

Recodes should work for multiple response and array variables.


  3. Transformations

Transformations create new variables based upon the values of one or more input variables. Initially, we will support only a few basic types, e.g.

ds.create_categorical(
    alias = "newvar",
    name = "New variable name",
    description = "Description of new variable",
    categories = ("First", "Second", "Third"),
    values = (1, 2, 3),
    where = ("x == 1", "x == 2", "true")
)

The where clauses are evaluated in order, as if this were a sequence of if/elif/else statements:

if ds.x == 1:
    newvar = 1
elif ds.x == 2:
    newvar = 2
else:
    newvar = 3

(In this example, newvar would equal 3 when x is missing. Expressions involving missing values evaluate to False, unlike in R.)
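
The first-match rule and the treatment of missing values can be sketched as follows. This is an illustration under stated assumptions rather than the PyCrunch implementation: conditions are written as Python callables instead of expression strings, missing values are represented as None, and the helper names assign_value and false_if_missing are hypothetical.

```python
def false_if_missing(predicate):
    """Wrap a predicate so that a missing input (None) evaluates to False."""
    return lambda x: x is not None and predicate(x)


def assign_value(x, conditions, values):
    """Return the value paired with the first condition that holds for x."""
    for condition, value in zip(conditions, values):
        if condition(x):
            return value
    return None  # no condition matched


conditions = (
    false_if_missing(lambda x: x == 1),
    false_if_missing(lambda x: x == 2),
    lambda x: True,  # plays the role of the final "true" clause
)
values = (1, 2, 3)

xs = [1, 2, 5, None]
newvar = [assign_value(x, conditions, values) for x in xs]
print(newvar)  # [1, 2, 3, 3]
```

The last element shows the behavior noted above: a missing x fails both comparisons and falls through to the catch-all clause.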


  4. Missing Data

Two helper functions make it easy to test whether a set of variables is valid or missing for a case:

ds.exclude(where = "missing(x, y, z)")
ds.exclude(where = "valid(x, y, z)")

The second expression is equivalent to not missing(x,y,z). Any number of arguments may be provided to the valid() and missing() functions.
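
The text does not say whether missing(x, y, z) should be true when any argument is missing or only when all of them are. The sketch below assumes the "any" reading, which makes valid() true only when every argument is present and keeps the stated equivalence with not missing(). Missing values are represented as None; both the reading and the representation are assumptions.

```python
def missing(*values):
    """True when at least one of the values is missing (assumed semantics)."""
    return any(value is None for value in values)


def valid(*values):
    """True when every value is present; equivalent to not missing(...)."""
    return not missing(*values)


print(missing(1, None, 3))  # True
print(valid(1, 2, 3))       # True
```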


  5. Weighting

Raking weights (a.k.a. "rim weights") can be constructed using the following method:

ds.create_weight(
    alias = "weight",
    description = "My weight",
    targets = {
        "age4" : { "18-29" : 20.0, "30-44" : 25.0, "45-64" : 30.0, "Over 65" : 25.0 },
        "gender" : { "male" : 49.0, "female" : 51.0 }
    },
    hide = True,
    default = True
)

The targets are a dict whose keys are aliases of variables in the dataset and whose values are dicts of category/percentage key-value pairs. The keys of each inner dict may be either category names (strings) or ids (integers), but you are not allowed to mix the two: the keys must all be strings or all integers.
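
The key rule for targets could be checked up front along these lines. validate_targets is a hypothetical helper, and the additional check that each variable's percentages sum to 100 is an assumption that the text does not state.

```python
def validate_targets(targets):
    """Raise ValueError if a targets dict breaks the rules sketched here."""
    for alias, shares in targets.items():
        key_types = {type(key) for key in shares}
        # Keys must be all category names (str) or all category ids (int).
        if key_types != {str} and key_types != {int}:
            raise ValueError(
                "targets for %r must use all names or all ids" % alias
            )
        # Assumed rule: percentages for one variable sum to 100.
        if abs(sum(shares.values()) - 100.0) > 1e-6:
            raise ValueError("targets for %r must sum to 100" % alias)


validate_targets({
    "age4": {"18-29": 20.0, "30-44": 25.0, "45-64": 30.0, "Over 65": 25.0},
    "gender": {"male": 49.0, "female": 51.0},
})
```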

We will need to determine some rules for when the weighting code is executed (e.g., after the total number of interviews completed is at least some minimum number, with at least one case in each marginal cell?)
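
For reference, raking is usually computed by iterative proportional fitting: the weights are rescaled so the weighted distribution of one variable matches its target, then the next variable, and so on until the adjustments become negligible. The following is a minimal pure-Python sketch of that general technique, not of how Crunch computes weights. The rake function, its parameters, and the representation of cases as dicts are assumptions made for illustration.

```python
def rake(cases, targets, max_iter=100, tol=1e-8):
    """Return one weight per case so weighted margins match the targets.

    cases   -- list of dicts mapping variable alias to category name
    targets -- {alias: {category: percentage}}, percentages summing to 100
    Assumes every target category has at least one case.
    """
    weights = [1.0] * len(cases)
    for _ in range(max_iter):
        largest_adjustment = 0.0
        for alias, shares in targets.items():
            total = sum(weights)
            # Current weighted total in each category of this variable.
            current = {}
            for case, weight in zip(cases, weights):
                category = case[alias]
                current[category] = current.get(category, 0.0) + weight
            # Scale each category to its target share of the total.
            factors = {
                category: (shares[category] / 100.0 * total) / subtotal
                for category, subtotal in current.items()
            }
            for i, case in enumerate(cases):
                factor = factors[case[alias]]
                weights[i] *= factor
                largest_adjustment = max(largest_adjustment, abs(factor - 1.0))
        if largest_adjustment < tol:
            break  # a full pass changed nothing noticeable
    return weights


# Hypothetical sample: counts of cases per (gender, age4) cell.
counts = {
    ("male", "18-29"): 3, ("male", "30-44"): 2,
    ("male", "45-64"): 2, ("male", "Over 65"): 1,
    ("female", "18-29"): 1, ("female", "30-44"): 2,
    ("female", "45-64"): 3, ("female", "Over 65"): 2,
}
cases = []
for (gender, age4), n in counts.items():
    cases += [{"gender": gender, "age4": age4}] * n

targets = {
    "age4": {"18-29": 20.0, "30-44": 25.0, "45-64": 30.0, "Over 65": 25.0},
    "gender": {"male": 49.0, "female": 51.0},
}
weights = rake(cases, targets)
```

If a target category has no cases, its share cannot be reached and the weights no longer sum to the number of cases, which is one reason a rule such as "at least one case in each marginal cell" matters before the weighting code runs.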

The PyCrunch streaming API is intended to be executed automatically and, for the most part, once. For example, a derived variable is created and then Crunch handles updates as additional data is received. An exception is weighting, where the weights will need to be recomputed periodically.

In RCrunch, we have functions that get and set dataset attributes, such as:

saveVersion(ds), restoreVersion(), versions(), name(ds), description(ds), startDate(ds), endDate(ds), weight(ds), exclusion(ds), share(ds)

Note that you will want to be able to take editorship of the dataset as well; in R this uses unlock(ds).

In RCrunch, we estimate weights on a local machine, and so we use as.vector() to bring the variables we need local, then run some calculations, and finally upload a new variable that will be the weight.

In RCrunch, there are three functions that we use to organize variables into topics: VariableGroup [defines a group], VariableOrder [defines an order of groups], and ordering [sends the VariableOrder up to the server].

Related, we also place datasets in "projects" using very similar functions.

In RCrunch, we append datasets together (usually trackers) using a workflow along these lines:

saveVersion(ds_tracker), forkDataset(ds_tracker)
saveVersion(ds_new), forkDataset(ds_new)
appendDataset(ds_tracker, ds_new)
saveVersion(ds_tracker), saveVersion(ds_new)
mergeDataset(ds_tracker)