Long Ouyang edited this page Nov 17, 2015 · 6 revisions

This page documents common ways of manipulating data in webppl. It should be helpful to users who are familiar with the Hadleyverse way of data manipulation. Where appropriate, we show the corresponding R code (dplyr / tidyr).

Making a "data frame"

After reading a file with the current babyparse-based readCSV function, you will likely want to turn its rows into a data frame with something like:

// use first row as (header) variable names
var dataFrame = function(rawCSV){
    return map(function(row){
        return _.object(_.zip(rawCSV[0], row));
    }, rawCSV.slice(1));
};

var data = utils.readCSV("data.csv").data
var df = dataFrame(data)

Assuming data.csv has its first row as variable names and subsequent rows as variable values, df will be a list of objects, one per row. Each object will look like this (note that every value, including the numeric ones, is a string):

{ workerid: '29',
  rt: '5751',
  condition: 'A',
  trial_num: '1',
  response: '0.49' }

Per the suggestions of #146.
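Because every field comes back as a string, numeric columns usually need converting before they are used in arithmetic. The sketch below is plain JavaScript meant to be run in Node, outside webppl, so it uses native array methods rather than webppl's map. The sample rows and the names toDataFrame and withNumeric are made up for illustration; they are not part of webppl or underscore.

```javascript
// Made-up sample standing in for utils.readCSV("data.csv").data:
// first row is the header, remaining rows are values (all strings).
const rawCSV = [
  ["workerid", "rt", "condition", "trial_num", "response"],
  ["29", "5751", "A", "1", "0.49"],
  ["24", "3120", "B", "1", "0.80"]
];

// Same idea as dataFrame above: pair each header name with the row's value.
const toDataFrame = function (rows) {
  const header = rows[0];
  return rows.slice(1).map(function (row) {
    const obj = {};
    header.forEach(function (name, i) { obj[name] = row[i]; });
    return obj;
  });
};

// Hypothetical helper: return copies of the rows with the named
// columns converted from strings to numbers.
const withNumeric = function (df, keys) {
  return df.map(function (d) {
    const out = Object.assign({}, d);
    keys.forEach(function (k) { out[k] = Number(d[k]); });
    return out;
  });
};

const demoDf = withNumeric(toDataFrame(rawCSV), ["rt", "trial_num", "response"]);
console.log(demoDf[0]);
```

Identifier-like columns such as workerid are left as strings here on purpose, which keeps comparisons like workerid == "24" straightforward.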

Selecting data

In R, you can select a single column with something like df$workerid, which returns all the workerids. In webppl, we would write:

_.pluck(df,"workerid")

Get factor levels

In R, you can get the unique levels of a factor (i.e. a categorical variable) by calling levels(factor(df$workerid)). In webppl, we would write:

_.uniq(_.pluck(df,"workerid"))

This can be useful for later grouping the data.
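To make the behaviour of the two calls concrete, here is a plain-JavaScript sketch (for Node, outside webppl) with small stand-ins for _.pluck and _.uniq. The sample rows and the local pluck and uniq functions are illustrative assumptions, not the underscore implementations.

```javascript
// Made-up sample data frame: a list of row objects.
const sampleDf = [
  { workerid: "29", condition: "A" },
  { workerid: "24", condition: "B" },
  { workerid: "29", condition: "B" }
];

// Stand-in for _.pluck: the value of `key` in every row, in row order.
const pluck = function (df, key) {
  return df.map(function (d) { return d[key]; });
};

// Stand-in for _.uniq: distinct values, kept in first-seen order.
const uniq = function (values) {
  return values.filter(function (v, i) { return values.indexOf(v) === i; });
};

const workerIds = pluck(sampleDf, "workerid"); // one entry per row
const levels = uniq(workerIds);                // one entry per distinct worker
console.log(workerIds, levels);
```

Two caveats. Unlike R's levels(), which sorts by default, this keeps values in the order they first appear. And if the _ in your environment is lodash 4 or later rather than underscore, _.pluck is no longer available; _.map(df, "workerid") is the replacement there.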

filter / group_by

In R + dplyr, you subset data with filter(df, workerid=="24"). Here is a function with very similar syntax, called subset:

var subset = function(df, key, value){
    return filter(function(d){
        return d[key] == value;
    }, df);
};

subset(df, "workerid", "24")
  • Long says: You can accomplish this using _.where, e.g., _.where(listOfPlays, {author: "Shakespeare", year: 1611}) will return all rows with Shakespeare as the author and 1611 as the year. _.groupBy also exists.
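As a rough illustration of what matching on several columns and grouping by a column produce, here is a plain-JavaScript sketch (for Node, outside webppl). The sample rows and the local where and groupBy functions are hypothetical stand-ins written for this example, not the underscore implementations.

```javascript
// Made-up sample data frame; values are strings, as they are after readCSV.
const trials = [
  { workerid: "24", condition: "A", response: "0.49" },
  { workerid: "24", condition: "B", response: "0.71" },
  { workerid: "29", condition: "A", response: "0.20" }
];

// Stand-in for _.where: keep rows whose fields equal every
// key/value pair in `props` (strict equality).
const where = function (df, props) {
  return df.filter(function (d) {
    return Object.keys(props).every(function (k) { return d[k] === props[k]; });
  });
};

// Stand-in for _.groupBy with a column name: an object mapping
// each distinct value of `key` to the list of rows that have it.
const groupBy = function (df, key) {
  return df.reduce(function (groups, d) {
    const g = d[key];
    if (!Object.prototype.hasOwnProperty.call(groups, g)) { groups[g] = []; }
    groups[g].push(d);
    return groups;
  }, {});
};

const worker24A = where(trials, { workerid: "24", condition: "A" });
const byWorker = groupBy(trials, "workerid");
console.log(worker24A.length, Object.keys(byWorker));
```

Because the stand-in compares with ===, a query such as { workerid: 24 } (a number) would match nothing against string-valued CSV fields, whereas the subset function above uses == and would match. It is worth checking which comparison your version of _.where uses before relying on numeric queries.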