#Getting & Cleaning Data Course Project#
This repo contains work done as part of a course project required for the Getting & Cleaning Data course, part of the Coursera Data Science specialization.
The goal of the project is to take a number of text files, each containing different elements of a study that measured accelerometer and gyroscope data from Samsung Galaxy smartphones, and to combine them into a single tidy dataset, which is then saved as a text file.
This repo contains, in addition to the README, a Codebook for the generated tidy dataset as well as the R script that can be used to reproduce the tidy data, given the raw data text files.
##Background##
An experiment was conducted in which a group of 30 volunteers each performed six activities (walking, walking upstairs, walking downstairs, sitting, standing, laying) while wearing a smartphone (Samsung Galaxy S II) on the waist. The data generated by the embedded accelerometer and gyroscope in the smartphones were recorded.
More information about the study can be found here: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
##Raw Data##
The observations consist of the 3-axial accelerometer and gyroscope measurements obtained from the Samsung Galaxy smartphones used in the experiment described above. As part of the experiment, these measurements were filtered, separated, acted upon and used to calculate a number of derived variables, eventually leading to a total of 561 variables (called features in the study).
The subjects performing the experiment as well as the activities they performed were recorded as separate datasets.
All the data (10299 observations) obtained from the experiment were split into two sets: train (7352 observations) and test (2947 observations).
The raw data relevant to this project is contained in 7 text files:
- The accelerometer and gyroscope measurements as well as the variables calculated from these measurements are recorded in the files X_train.txt and X_test.txt.
- The variable names are recorded in features.txt.
- The subjects (persons) that generated each of the observations: subject_train.txt and subject_test.txt.
- The activities that were performed by the subjects: y_test.txt and y_train.txt.
###1. Downloading data###
The data can be downloaded from https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip.
In order for the R script in this repo to work, the downloaded files must be unzipped to the working directory of the computer that will run the R script, in such a way that the unzipped folder /UCI HAR Dataset is in the working directory.
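The download and unzip step could also be done from within R. The following is a minimal sketch, assuming an internet connection; the helper name `fetch.har.data` and the local file name `har.zip` are illustrative and are not part of the R script in this repo.

```r
# Hypothetical helper (not part of the repo's script): download and unzip the
# archive so that the folder "UCI HAR Dataset" ends up in the working directory.
fetch.har.data <- function(zip.file = "har.zip") {
    url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
    if (!dir.exists("UCI HAR Dataset")) {
        # mode = "wb" keeps the binary zip file intact on Windows
        download.file(url, destfile = zip.file, mode = "wb")
        unzip(zip.file)
    }
    # TRUE when the data folder is in place
    dir.exists("UCI HAR Dataset")
}
```

Calling `fetch.har.data()` once from the intended working directory should be enough; the function skips the download when the folder already exists.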
###2. Combining test and train datasets ###
Each pair of test and train datasets (i.e. the 2 observation files, the 2 subject files and the 2 activity files) is combined by a simple rbind() operation, as follows:
- X_train.txt, a 7352 x 561 dataset, and X_test.txt, a 2947 x 561 dataset, are combined to create x.combined, a 10299 x 561 dataset.
- subject_train.txt (7352 x 1) and subject_test.txt (2947 x 1) are combined to create subject.combined (10299 x 1).
- y_train.txt (7352 x 1) and y_test.txt (2947 x 1) are combined to create activity.combined (10299 x 1).
All these operations follow the same pattern, e.g.:

    x.train <- read.table("./UCI HAR Dataset/train/X_train.txt") # Read X_train.txt file into x.train
    x.test <- read.table("./UCI HAR Dataset/test/X_test.txt") # Read X_test.txt file into x.test
    x.combined <- rbind(x.train, x.test) # Combine x.train and x.test into x.combined

###3. Adding variable names and subject and activity fields ###
1. Generate column names "Subject" and "Activity": For subject.combined and activity.combined, the single column in each dataset is renamed to "Subject" and "Activity" respectively.
2. Add variable names to x.combined:
Since features.txt contains the names of the 561 variables, these names need to be applied to x.combined before the subject and activity fields are added (which would change the number of columns to 563, so that the 561 names no longer line up). Therefore, the next step is to replace the default column names in x.combined with the values in the second column of features.txt, using `names(x.combined) <- features[,2]`.
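Steps 1 and 2 can be illustrated on toy data. This is only a sketch: the small data frames below stand in for the real combined datasets, and the `features` data frame is assumed to have the variable names in its second column, as it would after reading features.txt with read.table().

```r
# Toy stand-ins: 3 observations of 4 features (the real data has 561 features).
x.combined <- data.frame(matrix(1:12, nrow = 3))
features <- data.frame(V1 = 1:4,
                       V2 = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                              "tBodyAcc-mad()-X", "fBodyAcc-meanFreq()-X"),
                       stringsAsFactors = FALSE)
subject.combined  <- data.frame(V1 = c(1, 1, 2))
activity.combined <- data.frame(V1 = c(5, 4, 5))

# Step 1: name the single column of each helper dataset.
names(subject.combined)  <- "Subject"
names(activity.combined) <- "Activity"

# Step 2: the second column of features holds the variable names.
names(x.combined) <- features[, 2]
```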
3. Select relevant variables:
There seems to be some ambiguity regarding exactly which columns to choose, given the assignment's instruction "Extract only the measurements on the mean and standard deviation for each measurement". After studying the features_info.txt file, it seems reasonable to assume that only the columns whose names contain "mean()" and "std()" qualify as measurements on the mean and standard deviation for each measurement, because variables like FrequencyMean and GravityMean are means of other variables rather than means of the measurements themselves.
Judging by the discussion boards and feedback from the teaching assistants, this choice doesn't seem to matter much, though. (Detailed discussion here).
The columns with column names containing either mean() or std() are selected in 2 steps:
- A vector containing the relevant column numbers is obtained using the `grep()` function.
- This vector is then used to subset x.combined, creating a new dataset (x.mean.std.only, with 66 columns) that consists only of the columns whose numbers are in the vector from the previous step.
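The two selection steps might look like the sketch below. The regular expression is an assumption (the exact pattern used in the script is not shown here), and the five column names are an illustrative subset rather than the full list of 561.

```r
# Toy column names in the style of features.txt (illustrative subset).
feature.names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                   "tBodyAcc-mad()-X", "fBodyAcc-meanFreq()-X",
                   "angle(tBodyAccMean,gravity)")
x.combined <- data.frame(matrix(1:10, nrow = 2))
names(x.combined) <- feature.names

# Step 1: column numbers whose names contain the literal "mean()" or "std()".
# The parentheses are escaped so that names like meanFreq() are not matched.
keep.cols <- grep("mean\\(\\)|std\\(\\)", names(x.combined))

# Step 2: subset x.combined by those column numbers.
x.mean.std.only <- x.combined[, keep.cols]
```

Escaping the parentheses is what keeps the selection to the 66 "mean()" and "std()" columns on the real data; an unescaped "mean" would also pick up the meanFreq() and angle() variables.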
4. Adding "Subject" and "Activity" fields:
The "Subject" and "Activity" fields are added to x.mean.std.only using `cbind()`.
5. Adding descriptive activity names:
Since the file activity_labels.txt contains a table mapping the activity numbers to activity names, this table can be used to translate the "Activity" field to descriptive activity names, using the `factor()` function.
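Steps 4 and 5 can be sketched on toy data as follows. The lookup table `activity.labels` is a hypothetical stand-in for the activity mapping file (number in the first column, name in the second), and placing Subject and Activity in front of the measurement columns is an assumption about column order.

```r
# Toy stand-ins for the combined datasets (3 observations).
x.mean.std.only   <- data.frame(a = c(0.1, 0.2, 0.3))
subject.combined  <- data.frame(Subject = c(1, 1, 2))
activity.combined <- data.frame(Activity = c(5, 4, 1))

# Assumed layout of the lookup table: activity number, then activity name.
activity.labels <- data.frame(V1 = 1:6,
                              V2 = c("WALKING", "WALKING_UPSTAIRS",
                                     "WALKING_DOWNSTAIRS", "SITTING",
                                     "STANDING", "LAYING"),
                              stringsAsFactors = FALSE)

# Step 4: bind the Subject and Activity columns to the measurements.
x.mean.std.only <- cbind(subject.combined, activity.combined, x.mean.std.only)

# Step 5: map activity numbers to names via factor levels and labels.
x.mean.std.only$Activity <- factor(x.mean.std.only$Activity,
                                   levels = activity.labels$V1,
                                   labels = activity.labels$V2)
```

Using `levels` and `labels` together means the mapping comes straight from the lookup table, so no activity name has to be typed by hand in the script.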
6. Labeling the data set with descriptive variable names:
(Please note: a complete description of the variables is contained in the Codebook, which is also available in this repo.) All variable names are a combination of:
- The domain (t or f)
- "Body" or "Gravity"
- The axis of the measurement (X, Y or Z)
- The function: mean or standard deviation
- A combination of "Acc" (for acceleration), "Gyro" (for angular velocity), "Jerk" (for jerk signals) and "Mag" (for magnitude).
The principle followed here is not to make the names longer than necessary, but to tidy them up, make them syntactically correct and make them easy to use for analysis. Abbreviations of measurements are therefore NOT changed (i.e. "Acc" is not changed to "Acceleration"), since this would make the variable names unwieldy.
However, in order to make the names easier to read, they are changed in a number of ways. Although using full stops and capital letters in variable names is normally not good practice, in this dataset there don't seem to be many alternatives short of changing the dataset to the narrow form, which I don't want to do (explained below).
Changing the variable names is accomplished in a step by step fashion. To illustrate how this is done, the raw column name tBodyGyroJerk-std()-Y is used as an example. The final form for this variable is t.Body.Gyro.Jerk.Y.StdDev.
- All "()" strings are removed using the `gsub()` function. Leaving a "()" as part of a variable name can cause problems, since it can be confused with a function call. In this step, tBodyGyroJerk-std()-Y is changed to tBodyGyroJerk-std-Y.
- The `make.names()` function is used to change the variable names to names that are syntactically correct in R and remove any duplicate names. This step changes tBodyGyroJerk-std-Y to tBodyGyroJerk.std.Y.
- `gsub()` is used to remove repeated portions of names that don't make sense (i.e. BodyBody).
- "mean" is changed to "Mean" and "std" to "StdDev": tBodyGyroJerk.std.Y changes to tBodyGyroJerk.StdDev.Y.
- Full stops are added between the name portions to make them easier to read, e.g. tBodyGyroJerk.StdDev.Y changes to t.Body.Gyro.Jerk.StdDev.Y.
- Where the "Mean" or "StdDev" portion is not at the end of the name, it is moved to the last position, making it easier to identify these columns at a glance. t.Body.Gyro.Jerk.StdDev.Y now becomes t.Body.Gyro.Jerk.Y.StdDev, which is the final form.
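The renaming steps listed above can be sketched as one chain of substitutions. The regular expressions and the `unique = TRUE` argument are assumptions made for illustration; the script in this repo may implement the same steps differently.

```r
raw.names <- c("tBodyGyroJerk-std()-Y",
               "tBodyAcc-mean()-X",
               "fBodyBodyGyroMag-mean()")

n <- gsub("\\(\\)", "", raw.names)          # remove every "()"
n <- make.names(n, unique = TRUE)           # "-" becomes ".", names made unique
n <- gsub("BodyBody", "Body", n)            # drop the repeated "Body"
n <- gsub("mean", "Mean", n, fixed = TRUE)  # capitalise the function part
n <- gsub("std", "StdDev", n, fixed = TRUE)

# Put a full stop in front of each name portion. Every name starts with
# "t" or "f", so a portion is never the first thing in the name.
n <- gsub("(Body|Gravity|Acc|Gyro|Jerk|Mag)", ".\\1", n)

# Where an axis follows Mean/StdDev, swap them so the function comes last.
n <- sub("\\.(Mean|StdDev)\\.([XYZ])$", ".\\2.\\1", n)

clean.names <- n
```

On these three inputs the chain produces t.Body.Gyro.Jerk.Y.StdDev, t.Body.Acc.X.Mean and f.Body.Gyro.Mag.Mean.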
###4. Creating an independent tidy data set with the average of each variable for each activity and each subject:###
#####The principles of tidy data#####
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
All these principles are already satisfied by the x.mean.std.only dataset: each of the chosen variables has its own column, each of the 10299 observations forms a row, and it contains only one type of observational unit, which means it is OK to keep it as a single table.
#####Wide or narrow?#####
Should a tidy dataset be wide or narrow? A discussion of this topic as it applies to this Course Project specifically can be found here. According to the rubric, either form is acceptable for this assignment. I prefer the wide form, which I think is more user friendly.
#####The final tidy dataset#####
All that is needed to create the final dataset is to average each measurement for every combination of activity and subject.
Since there are 30 subjects that performed 6 activities each, the new dataset consists of 180 observations (rows) and 68 variables (columns) - the 66 variables chosen from the observations data, plus the activity and subject fields.
To do this easily, the R package "data.table" is used. The data frame x.mean.std.only is converted to a data table, which is subsequently grouped by subject and activity, with the averages calculated using `lapply()`.
This creates the tidy.data dataset, with 180 rows and 68 columns.
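The grouping step can be sketched on a toy dataset as shown below. It assumes the data.table package is installed; the intermediate name `dt` and the toy values are illustrative, and the exact call in the repo's script may differ.

```r
library(data.table)

# Toy stand-in: 2 subjects x 2 activities, 2 rows each, 2 measurement columns.
x.mean.std.only <- data.frame(
    Subject  = rep(c(1, 2), each = 4),
    Activity = rep(c("WALKING", "SITTING"), times = 4),
    t.Body.Acc.X.Mean   = c(1, 2, 3, 4, 5, 6, 7, 8),
    t.Body.Acc.X.StdDev = c(2, 4, 6, 8, 10, 12, 14, 16)
)

dt <- data.table(x.mean.std.only)

# .SD holds every non-grouping column; lapply(.SD, mean) averages each of
# them within every Subject/Activity combination.
tidy.data <- dt[, lapply(.SD, mean), by = c("Subject", "Activity")]
```

Here 8 rows collapse to 4 (2 subjects times 2 activities), in the same way that the 10299 real observations collapse to 180 rows.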