-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.Rmd
220 lines (160 loc) · 11.3 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
title: 'Project 2: Data Mining, Classification, Prediction'
author: "SDS322E"
date: ''
output:
html_document:
toc: yes
toc_float:
collapsed: no
smooth_scroll: yes
pdf_document:
toc: no
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = TRUE, fig.align = "center", warning = F, message = F,
tidy=TRUE, tidy.opts=list(width.cutoff=60), R.options=list(max.print=100))
class_diag <- function(score, truth, positive, cutoff=.5){
pred <- factor(score>cutoff,levels=c("TRUE","FALSE"))
truth <- factor(truth==positive, levels=c("TRUE","FALSE"))
tab<-table(truth, pred)
acc=sum(diag(tab))/sum(tab)
sens=tab[1,1]/rowSums(tab)[1]
spec=tab[2,2]/rowSums(tab)[2]
ppv=tab[1,1]/colSums(tab)[1]
#CALCULATE F1
f1=2*(sens*ppv)/(sens+ppv)
#CALCULATE EXACT AUC
truth<-as.numeric(truth=="TRUE")
ord<-order(score, decreasing=TRUE)
score <- score[ord]; truth <- truth[ord]
TPR=cumsum(truth)/max(1,sum(truth))
FPR=cumsum(!truth)/max(1,sum(!truth))
dup<-c(score[-1]>=score[-length(score)], FALSE)
TPR<-c(0,TPR[!dup],1); FPR<-c(0,FPR[!dup],1)
n <- length(TPR)
auc<- sum( ((TPR[-1]+TPR[-n])/2) * (FPR[-1]-FPR[-n]) )
round(data.frame(acc,sens,spec,ppv,f1,ba=(sens+spec)/2,auc, row.names = "Metrics"),4)
}
```
# Mining, Classification, Prediction
## Kenneth Imphean ksi238
### Introduction
The two datasets I have used before on Project 1. I have chosen County Health Rankings in Texas and COVID-19 vaccinations by county in Texas. The variables contained in the County Health Rankings in Texas come partially from my Biostatistics project: county, food environment index, primary care physician population, and urban/rural classification. The variables contained in COVID-19 vaccinations by county in Texas deal with counties, the population who is eligible for vaccination (people 12 and older), and population of fully vaccinated individuals. The food environment index was calculated by measuring access to healthy foods and the income of counties and primary care physician population was found by looking at ratios of population to primary care physicians as well as including D.O.s. Population data was aquired by the 2019 US Census Bureau Population Estimates and the population who have been fully vaccinated or partially vaccinated was collected through vaccination records submitted by health care providers.
This data is interesting to me because I want to see if there's any sort of relationship to be found within between the two datasets. The food environment index takes into account access to supermarkets and grocery stores, which usually also have pharmacies as well where vaccinations can be given. There are 254 observations total. There are some observations with NA values, when they are removed the total is 232. Of those 254 observations, 172 are rural and 82 are urban! Of those the set with no NA values, there are 156 that are rural and 76 that are urban.
```{R}
library(tidyverse)
countyhealth <- read_csv("~/Data from County Health Rankings and Biostats project.csv")
countyvaccines <- read_csv("~/COVID-19 Vaccine Data by County.csv")
inner_join(countyhealth,countyvaccines, by="county") -> combinedcounty
combinedcounty %>% mutate(percentage.fully.vaccinated = (people.fully.vaccinated/population.12.and.older)*100,percentage.vaccinated.with.at.least.one.dose = (people.vaccinated.with.at.least.one.dose/population.12.and.older)*100) %>% select (-total.doses.allocated, - vaccine.doses.administered, - people.fully.vaccinated, - people.vaccinated.with.at.least.one.dose) -> combinedcountyfinal
combinedcountyfinal %>% mutate(urban = ifelse(urban.rural.classification=="Urban",1,0)) -> combinedcountyfinal
combinedcountyfinalnona <- combinedcountyfinal %>% na.omit()
combinedcountyfinalnona %>% group_by(urban.rural.classification) %>% summarize(count=n())
```
### Cluster Analysis
```{R}
library(cluster)
clust_dat <- combinedcountyfinal %>% dplyr::select(food.environment.index, percentage.fully.vaccinated, primary.care.physician.population) %>% na.omit()
clust_dat <- clust_dat %>% na.omit() %>% scale
sil_width<-vector()
for(i in 2:10){
pam_fit <- pam(clust_dat, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot()+geom_line(aes(x=1:10,y=sil_width))+scale_x_continuous(name="k",breaks=1:10)
final <- clust_dat %>% as.data.frame
pam1 <- final %>% pam(4)
pam1
final <- final %>% mutate(cluster=as.factor(pam1$clustering))
ggplot(final, aes(x=food.environment.index,y=percentage.fully.vaccinated, color=cluster))+geom_point()
final%>%slice(pam1$id.med)
library(plotly)
final%>%plot_ly(x= ~food.environment.index, y = ~percentage.fully.vaccinated, z = ~primary.care.physician.population, color= ~cluster, type = "scatter3d", mode = "markers")
plot(pam1,which=2)
library(GGally)
ggpairs(final, aes(color=cluster))
```
When I scaled and standardized all the data, I thought it was interesting to see that there were negative values within my data. I chose to do 4 clusters as it had the highest silhouette width. Cluster 4 seems to have all high values of food.environment.index, percentage of fully vaccinated people, and primary care physicians. Cluster 3 seems to have really low food environment index, medium level percentage of fully vaccinated people, and somewhat low primary care physicians. Cluster 2 has a medium food environment index, medium percentage of fully vaccinated people, and a low population of primary care physicians. Cluster 1 has low values of food environment index, fully vaccinated people, and primary care physicians. In the end, I don't think the goodness-of-fit was that great, the highest average silhouette width was only .34 which means the structure is weak and may be artificial.
### Dimensionality Reduction with PCA
```{R}
countyfinal <- combinedcountyfinal %>% na.omit() %>% select(-urban) %>% select_if(is.numeric) %>% scale
rownames(countyfinal) <- combinedcountyfinalnona$county
county_pca <- princomp(countyfinal, cor=T)
names(county_pca)
summary(county_pca, loadings=T)
countydf<-data.frame(County=combinedcountyfinalnona$county, PC1=county_pca$scores[, 1],PC2=county_pca$scores[, 2])
ggplot(countydf, aes(PC1, PC2)) + geom_point()
```
I decided to retain PC1 and PC2 as they were enough to explain about 81% of the total variance in the dataset. For PC1, a high score means higher population of people 12 and older, as well as primary care physicians. It also means a larger percentage that are vaccinated with one dose and fully vaccinated. A lower score for PC1 means a lower population of people 12 and older and lower number of primary care physicians. It also means a lower percentage that are vaccinated with one dose and fully vaccinated. Interestingly, the food environment index does not play a big role in PC1.
For PC2 a high score means high food environment index, primary care physician population, population 12 and older, but a lower percentage fully vaccinated and lower percentage vaccinated with at least one dose. A low score for PC2 means lower food environment index, primary care physician, and population of people 12 and older. It also means higher percentage of people vaccinated with at least one dose (and fully vaccinated).
### Linear Classifier
```{R}
library(caret)
fit <- glm(urban~ food.environment.index+primary.care.physician.population+population.12.and.older+percentage.fully.vaccinated
+percentage.vaccinated.with.at.least.one.dose, data=combinedcountyfinalnona, family="binomial")
coef(fit)
probs<-predict(fit,type="response")
probs %>% round(3)
class_diag(probs,combinedcountyfinalnona$urban,positive=1)
table(truth = combinedcountyfinalnona$urban, prediction = probs>.5) %>% addmargins
```
```{R}
cv <- trainControl(method="cv", number = 8, classProbs = T, savePredictions = T)
fit <- train(urban ~food.environment.index+primary.care.physician.population+population.12.and.older+percentage.fully.vaccinated
+percentage.vaccinated.with.at.least.one.dose, data=combinedcountyfinalnona, trControl=cv, method="glm")
class_diag(fit$pred$pred, fit$pred$obs, positive=1)
```
The values is 1 for urban and 0 for rural. The model is pretty fair in predicting the new observations per CV AUC. There is definitely signs of overfitting as the CV AUC is less when compared to the regular AUC. The confusion matrix is interesting in that it mostly predicts urban counties correctly, but utterly fails as it tries to predicts rural counties.
### Non-Parametric Classifier
```{R}
library(caret)
knn_fit<- knn3(urban ~ food.environment.index+primary.care.physician.population+population.12.and.older+percentage.fully.vaccinated
+percentage.vaccinated.with.at.least.one.dose, data=combinedcountyfinalnona,k=8)
y_hat_knn <- predict(knn_fit,combinedcountyfinalnona)
y_hat_knn
class_diag(y_hat_knn[,2],combinedcountyfinalnona$urban, positive=1)
table(truth= factor(combinedcountyfinalnona$urban==1, levels=c("TRUE","FALSE")),
prediction= factor(y_hat_knn[,2]>.5, levels=c("TRUE","FALSE"))) %>% addmargins
```
```{R}
cv <- trainControl(method="cv", number = 8, classProbs = T, savePredictions = T)
fit <- train(urban ~ food.environment.index+primary.care.physician.population+population.12.and.older+percentage.fully.vaccinated
+percentage.vaccinated.with.at.least.one.dose, data=combinedcountyfinalnona, trControl=cv, method="knn")
class_diag(fit$pred$pred, fit$pred$obs, positive=1)
```
The model is good at predicting new observations per CV AUC. There are signs of overfitting as the CV AUC is less than the regular AUC. The nonparametric model is actually better than the linear model in its cross-validation performance, the AUC is higher. The confusion matrix predicted more rural counties correctly than the linear classifier confusion matrix.
### Regression/Numeric Prediction
```{R}
fit <- lm(percentage.fully.vaccinated~primary.care.physician.population+food.environment.index,data=combinedcountyfinalnona)
yhat<-predict(fit)
mean((combinedcountyfinalnona$percentage.fully.vaccinated-yhat)^2)
```
```{R}
k=8
data<-combinedcountyfinalnona[sample(nrow(combinedcountyfinalnona)),]
folds<-cut(seq(1:nrow(combinedcountyfinalnona)),breaks=k,labels=F)
diags<-NULL
for(i in 1:k){
train<-data[folds!=i,]
test<-data[folds==i,]
fit<-lm(percentage.fully.vaccinated ~ primary.care.physician.population+food.environment.index, data=train)
yhat<-predict(fit,newdata=test)
diags<-mean((test$percentage.fully.vaccinated-yhat)^2)
}
mean(diags)
```
The MSE for the overall dataset was 113.1287. There is definitely signs of overfitting as the average MSE from my k testing folds were higher than the MSE for the overall dataset.
### Python
```{R}
library(reticulate)
hi<-"Hello"
combinedcounty %>% mutate(percentage.fully.vaccinated = (people.fully.vaccinated/population.12.and.older)*100,percentage.vaccinated.with.at.least.one.dose = (people.vaccinated.with.at.least.one.dose/population.12.and.older)*100)
```
```{python}
import pandas as pd
(r.combinedcounty.assign(PercentageofFullyVaccinated = lambda x: (r.combinedcounty["people.fully.vaccinated"]/r.combinedcounty["population.12.and.older"])*100)
.assign(PercentageofVaccinatedatleastonedose= lambda x:(r.combinedcounty["people.vaccinated.with.at.least.one.dose"]/r.combinedcounty["population.12.and.older"])*100)
.sort_values(by=('PercentageofFullyVaccinated'), ascending=True))
```
In python, I was able to recalculate my dataset's percentage of people fully vaccinated and percentage of people vaccinated with one dose through Python's version of tidyverse!