data_clean.Rmd

---
title: "Data clean"
output: 
  html_document:
    code_folding: hide
    toc: true
    toc_float: true
    
---

```{r setup, echo = FALSE, message = FALSE}
library(tidyverse)

knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE
)
```

### Import data
```{r}
innocent_raw <- 
  read_csv("./data/Innocent Deaths caused by Police (All time).csv")
```

### Tidy data
```{r}
asp <- 
  c("Asphyxiation/Restrained", "Asphyxiation/Restrain", "Restrain/Asphyxiation")
race_exclude <- 
  c("Middle Eastern", "Race unspecified")

innocent_df <- 
  innocent_raw %>% 
  janitor::clean_names() %>% 
  rename(
    date_of_death = date_of_injury_resulting_in_death_month_day_year,
    city = location_of_death_city,
    highest_force = highest_level_of_force,
    intended_force = intended_use_of_force_developing
  ) %>% 
  select(-c(2, 6, 8, 11:13, 16, 18:24, 26:27)) %>% 
  drop_na(
    gender, city, highest_force, intended_force) %>% 
  filter(gender != "Transgender", 
         (race %in% race_exclude) == FALSE) %>% 
  group_by(gender) %>% 
  mutate(
    age = as.numeric(round(replace_na(age, mean(age, na.rm = TRUE)))),
    age_bin = AMR::age_groups(age, c(15, 25, 35, 55, 85)),
    gender = as.factor(gender),
    race = as.factor(race),
    race = recode(race, "european-American/White" = "European-American/White"),
    date_of_death = lubridate::mdy(date_of_death),
    latitude = replace_na(latitude, 42.167834),
    highest_force = 
      as.factor(ifelse(highest_force %in% asp, "Asphyxiated/Restrained", highest_force)),
    intended_force = as.factor(intended_force)
  ) %>% 
  filter(age >= 1) %>% 
  select(1, 2, 12, everything()) %>% 
  arrange(unique_id)
```
+ **Brief data description**   
  + There are ``r nrow(innocent_df)`` observations and ``r ncol(innocent_df)`` variables,  
    + including `2` character variables, `4` numeric variables, `5` factor variables, and `1` date variable  
  + `unique_id` is a key variable and unique in this dataset  
  + `age`, `age_bin`, `gender`, and `race` represent the subject's demographic characteristics  
  + `date_of_death` represents the date of the subject's death and ranges from ``r min(innocent_df$date_of_death)`` to ``r max(innocent_df$date_of_death)``  
  + `city`, `state`, `latitude`, and `longitude` represent the location of the death    
  + `highest_force` represents the cause of death  
  + `intended_force` represents the intentional use of force by police

+ **Data clean steps**  
  + Use reasonable variables and select several variables    
  + Drop NA values      
  + Do the following mutations:  
    + Replace NA values in the `age` with the mean in each gender group    
    + Remove the comma in a value of `latitude`   
    + Normalize several similar terms in `highest_force`      
    + Convert the data type of some variables and create a new variable `age_bin`    
  + Gender with `Transgender`, race with `Middle Eastern` and `Race unspecified`, and age under one are not considered here      
  + Rearrange data
  
### Output data as `maindata.csv`
```{r}
# write_csv(innocent_df, "./data/maindata.csv")
# only do this once
```