Tidying data
This page documents common ways of manipulating data in webppl. It should be helpful to users who are familiar with the Hadleyverse (tidyverse) way of data manipulation. Where appropriate, we show the corresponding R / dplyr / tidyr code.
Using the current babyparse-based readCSV function to read in a data frame, you will likely want to do something like:
// use first row as (header) variable names
var dataFrame = function(rawCSV){
  return map(function(row){
    return _.object(_.zip(rawCSV[0], row))
  }, rawCSV.slice(1))
}
var data = utils.readCSV("data.csv").data
var df = dataFrame(data)
Assuming data.csv has its first row as variable names and subsequent rows as variable values, df will be a list of objects. Each object will look like:
{ workerid: '29',
rt: '5751',
condition: 'A',
trial_num: '1',
response: '0.49'}
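To make the header-zipping step concrete, here is a minimal plain-JavaScript sketch of what dataFrame does, written without WebPPL or underscore so that it runs in Node as-is. The helper name toRecords and the sample rows are made up for illustration. Note that every value stays a string, so numeric columns such as rt need an explicit conversion before doing arithmetic.

```javascript
// Hypothetical stand-in for the parsed CSV: an array of rows, header first.
var rawCSV = [
  ["workerid", "rt", "condition"],
  ["29", "5751", "A"],
  ["24", "4310", "B"]
];

// Pair each header name with the value in the same position of a row.
var toRecords = function(rows) {
  var header = rows[0];
  return rows.slice(1).map(function(row) {
    var record = {};
    header.forEach(function(name, i) {
      record[name] = row[i];
    });
    return record;
  });
};

var records = toRecords(rawCSV);

// CSV values arrive as strings; convert explicitly when a number is needed.
var firstRt = Number(records[0].rt);
```

In WebPPL itself you would keep using the built-in map with the function as its first argument, as in dataFrame above.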
Per the suggestions of #146.
In R, you can select a single column with something like df$workerid and get back all the workerids. In webppl, we would write:
_.pluck(df,"workerid")
In R, you can get the unique levels of a factor (i.e. a categorical variable) by calling levels(factor(df$workerid)). In webppl, we would write:
_.uniq(_.pluck(df,"workerid"))
This can be useful for later grouping the data.
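For readers who want to see what these two underscore calls compute, the following plain-JavaScript sketch reproduces them on a small made-up data frame. The helpers pluck and uniq mirror the underscore functions by name but are local stand-ins written for this example, not the library implementations.

```javascript
// Made-up rows in the same shape as df above.
var rows = [
  { workerid: "29", condition: "A" },
  { workerid: "24", condition: "B" },
  { workerid: "29", condition: "B" }
];

// Collect the value stored under `key` in every row (like _.pluck).
var pluck = function(rows, key) {
  return rows.map(function(row) { return row[key]; });
};

// Keep only the first occurrence of each value (like _.uniq).
var uniq = function(values) {
  return values.filter(function(v, i) { return values.indexOf(v) === i; });
};

var workerids = pluck(rows, "workerid");   // ["29", "24", "29"]
var workers = uniq(workerids);             // ["29", "24"]
```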
In R with dplyr, you subset data with filter(df, workerid=="24"). Here is a function with very similar-looking syntax, called subset:
var subset = function(df, key, value){
  return filter(function(d){
    return (d[key] == value)
  }, df)
}
// pass the data frame df (list of objects), not the raw CSV rows in data
subset(df, "workerid", "24")
- Long says: You can accomplish this using _.where, e.g., _.where(listOfPlays, {author: "Shakespeare", year: 1611}) will return all rows with Shakespeare as the author and 1611 as the year. _.groupBy also exists.
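As a rough illustration of the grouping idea, the sketch below splits a small made-up data frame by workerid and averages rt within each group. It is plain JavaScript with a hand-written groupBy helper standing in for _.groupBy, so it does not depend on WebPPL or underscore. The sample rows and the meanRt helper are assumptions made for the example.

```javascript
// Made-up rows; rt is a string, as it would be after reading a CSV.
var rows = [
  { workerid: "29", rt: "5751" },
  { workerid: "24", rt: "4000" },
  { workerid: "29", rt: "6249" }
];

// Map each distinct value of row[key] to the list of rows that carry it.
var groupBy = function(rows, key) {
  return rows.reduce(function(groups, row) {
    var k = row[key];
    groups[k] = (groups[k] || []).concat([row]);
    return groups;
  }, {});
};

// Average the rt column, converting the CSV strings to numbers first.
var meanRt = function(rows) {
  var total = rows.reduce(function(sum, row) {
    return sum + Number(row.rt);
  }, 0);
  return total / rows.length;
};

var byWorker = groupBy(rows, "workerid");
var meanFor29 = meanRt(byWorker["29"]);   // (5751 + 6249) / 2 = 6000
```

Combined with the unique workerids obtained earlier, this gives one summary value per worker.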