---
title: "ClassificationProcessing"
author: "Pat Lorch"
date: "June 6, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Classification processing
*** Not used. I do this in postgres now ***
We need to "jail break" the JSON fields to get data for summarizing and further processing classifications.
Zooniverse was developing an engine for doing this on the database; it was in beta, but they have stopped supporting it. We should probably develop a way to do this in postgres, so we can dump a classification report and maintain a view of a flatter table, plus reports with consensus stats, etc.
## Get data, take test subset, load libraries
```{r classification}
# readr provides read_csv
library(readr)
library(tidyjson)
library(magrittr)
library(ggplot2)
library(dplyr)
#library(jsonlite)

focus_on_wildlife_cleveland_metroparks_classifications_test = read_csv("~/GitHub/FocusOnWildlifeCMPScripts/focus-on-wildlife-cleveland-metroparks-classifications_test.csv")
# Take the first 100 rows as a test subset
fowcmp_class_test100 = focus_on_wildlife_cleveland_metroparks_classifications_test[1:100,]
```
## Testing tidyjson
Details for tidyjson are here:
<https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html>
### Annotation json structure
* We are currently not using answers or filters
* Answers vary with question type, which depends on the choice
Structure:
```
[
  {
    "task": "T1",
    "value": [
      {
        "choice": "DEER",
        "answers": {
          "ADULTANTLERLESS": "1",
          "BEHAVIORCHECKALLTHATAPPLY": [
            "STANDING"
          ]
        },
        "filters": {}
      }
    ]
  }
]
```
```{r annotation}
# Look at the structure of one row's annotations
t_row=fowcmp_class_test100[7,]
t_annotation=t_row$annotations
t_annotation %>% jsonlite::prettify()

# Add the choice and answers to the table from the annotations field
t_json.jb=t_row %>% as.tbl_json(json.column = "annotations") %>%
  gather_array %>%
  spread_values(task=jstring("task")) %>%
  enter_object("value") %>% gather_array %>%
  spread_values(
    choice=jstring("choice"),
    answers=jstring("answers")  # answers is a nested object, so this may come back NA
  )
t_json.jb
attr(t_json.jb,'JSON')
```
### Subject_data json structure
* This one is tougher to parse since it starts with a bare key (the subject ID) rather than a named field
Structure:
```
{
  "5054754": {
    "retired": {
      "id": 3345492,
      "workflow_id": 1432,
      "classifications_count": 7,
      "created_at": "2017-01-01T02:35:37.051Z",
      "updated_at": "2017-01-07T21:07:48.640Z",
      "retired_at": "2017-01-07T21:07:48.638Z",
      "subject_id": 5054754,
      "retirement_reason": "consensus"
    },
    "ID": "422",
    "#Image 1": "7thCheckJuly2016_5_RR1106a_101EK113__20160708__15_44__01261.jpg",
    "#Image 2": "7thCheckJuly2016_5_RR1106a_101EK113__20160708__15_44__01262.jpg",
    "#Image 3": "7thCheckJuly2016_5_RR1106a_101EK113__20160708__15_44__01263.jpg"
  }
}
```
```{r subject_data}
# Look at the structure of one row's subject_data
t_row=fowcmp_class_test100[7,]
t_annotation2=t_row$subject_data
t_annotation2 %>% jsonlite::prettify()

# Add the subject data
t_json.jb2=t_row %>% as.tbl_json(json.column = "subject_data") %>%
  gather_keys %>%
  spread_values(
    image_ID=jstring("ID"),
    Image_1=jstring("#Image 1"),
    Image_2=jstring("#Image 2"),
    Image_3=jstring("#Image 3")) %>%
  enter_object("retired") %>%
  spread_values(
    # these fields are JSON numbers, so use jnumber() rather than jstring()
    retired_id=jnumber("id"),
    anno_workflow_id=jnumber("workflow_id"),
    classifications_count=jnumber("classifications_count"),
    anno_created_at=jstring("created_at"),
    anno_updated_at=jstring("updated_at"),
    retired_at=jstring("retired_at"),
    subject_id=jnumber("subject_id"),
    retirement_reason=jstring("retirement_reason")
  )
t_json.jb2
attr(t_json.jb2,'JSON')
```
### Testing if two as.tbl_json calls can be in one workflow
```{r combine}
t_row=fowcmp_class_test100[7,]
t_json.jb3=t_row %>% as.tbl_json(json.column = "annotations") %>%
gather_array %>%
spread_values(task=jstring("task")) %>%
as.tbl_json(json.column = "subject_data") %>%
gather_keys %>%
spread_values(
image_ID=jstring("ID"))
t_json.jb3
attr(t_json.jb3,'JSON')
```
This fails: it produces a second row instead of adding columns to the first. We need to do this as a two-step process.
## Flatten classification report in two steps
If there are multiple choices, this method makes a separate row for each choice. That is good for not losing a "human" classification when "vehicle" is also selected, but it will not combine classifications where a user marked antlered and antlerless deer separately.
```{r}
focus_on_wildlife_cleveland_metroparks_classifications_wfid1432_v478_99 <- read_csv("~/GitHub/FocusOnWildlifeCMPScripts/focus-on-wildlife-cleveland-metroparks-classifications_wfid1432_v478.99.csv")
# Replace this with the latest classification report
fowcmp_class=focus_on_wildlife_cleveland_metroparks_classifications_wfid1432_v478_99
json.jb.anno=fowcmp_class %>% as.tbl_json(json.column = "annotations") %>%
gather_array %>%
spread_values(task=jstring("task")) %>%
enter_object("value") %>% gather_array %>%
spread_values(
choice=jstring("choice"),
answers=jstring("answers")
)
json.jb.anno.df=as.data.frame(json.jb.anno)
json.jb.flat=json.jb.anno.df %>% as.tbl_json(json.column = "subject_data") %>%
gather_keys %>%
spread_values(
image_ID=jstring("ID"),
Image_1=jstring("#Image 1"),
Image_2=jstring("#Image 2"),
Image_3=jstring("#Image 3")
) %>%
enter_object("retired") %>%
spread_values(
# numeric JSON fields use jnumber(); the rest are strings
retired_id=jnumber("id"),
anno_workflow_id=jnumber("workflow_id"),
classifications_count=jnumber("classifications_count"),
anno_created_at=jstring("created_at"),
anno_updated_at=jstring("updated_at"),
retired_at=jstring("retired_at"),
subject_id=jnumber("subject_id"),
retirement_reason=jstring("retirement_reason")
)
```
## Calculate consensus, percent, and Pielou evenness score for each subject
### Figure out how to deal with double IDs
How many cases are there where the same user identified the same subject multiple times?
* 14,471 cases
* some of these are two different IDs
* 606 were the same choice repeated multiple times
* there is no way to distinguish a subject that really contains two different species from a repeated presentation, except by using the time between IDs
We need to decide how to deal with repeat IDs (probably take the first).
```{r}
sorted=json.jb.flat %>%
  arrange(subject_ids,choice,user_name)
# View() is interactive only; it will not appear in the knitted document
View(sorted[,c("subject_ids","choice","user_name","classifications_count","retirement_reason")])

# Collapse each user's choices per subject into a list column
sorted.list=aggregate(choice~subject_ids+user_name,data=json.jb.flat,c)
# Rows whose choice column prints as c("...", ...) had more than one choice
dups=sorted.list[substr(sorted.list$choice,1,3)=="c(\"",]
t_num=unlist(lapply(as.list(sorted.list$choice),length))
range(t_num)
t_morethanone=which(t_num>1)
# Of the multi-choice rows, keep those where every choice was identical (true repeats)
t_same=sorted.list$choice[t_morethanone][lapply(lapply(sorted.list$choice[t_morethanone],unique),length)==1]
length(t_same)
```
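As a starting point for the consensus stats, here is a sketch that drops the exact repeats (same user, same subject, same choice) and then computes the plurality choice, its percent support, and Pielou's evenness J' = H'/ln(S) per subject. It assumes the `json.jb.flat` table from the flattening step above, with one row per subject/user/choice; everything else is illustrative, not a settled method.

```r
library(dplyr)

# Drop exact repeats: the same user giving the same subject the same choice
# more than once (the 606 cases above). Genuine two-species rows survive.
deduped = json.jb.flat %>%
  distinct(subject_ids, user_name, choice)

consensus = deduped %>%
  group_by(subject_ids, choice) %>%
  summarise(votes = n(), .groups = "drop_last") %>%   # still grouped by subject
  mutate(p = votes / sum(votes)) %>%                  # vote share per choice
  summarise(
    consensus_choice = choice[which.max(votes)],
    consensus_pct    = max(votes) / sum(votes),
    n_choices        = n(),
    # Pielou's evenness: H' / ln(S); define as 0 when everyone agrees
    pielou           = ifelse(n_choices > 1, -sum(p * log(p)) / log(n_choices), 0),
    .groups = "drop"
  )
```

Taking only the first classification per user (rather than dropping exact duplicates) would instead need an `arrange(created_at)` followed by `slice(1)` per subject/user group.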
## Doing this in postgres
### Steps needed
1. break out the json to get the fields we need
2. figure out how to select distinct rows, using something like https://gist.github.com/seanbehan/402ffa37838b8aa5f3035fc7b64ac93f
    a. this will require distinguishing repeat presentations from images where 2 or more things were IDed (see dups above)
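The steps above might look something like this in postgres. The table and column names are hypothetical: assume the classification report has been loaded into a table `classifications` with `annotations` stored as `jsonb`.

```sql
-- Step 1: flatten the annotations json, one row per choice
SELECT c.classification_id,
       c.user_name,
       c.subject_ids,
       c.created_at,
       a.elem ->> 'task'   AS task,
       v.elem ->> 'choice' AS choice
FROM classifications AS c
CROSS JOIN LATERAL jsonb_array_elements(c.annotations) AS a(elem)
CROSS JOIN LATERAL jsonb_array_elements(a.elem -> 'value') AS v(elem);

-- Step 2: with the query above saved as a view "flat", DISTINCT ON keeps
-- the first row per subject/user/choice, ordered by creation time
SELECT DISTINCT ON (subject_ids, user_name, choice) *
FROM flat
ORDER BY subject_ids, user_name, choice, created_at;
```

`DISTINCT ON` handles the exact repeats; distinguishing repeat presentations from genuine multi-species subjects would still need the time-between-IDs logic discussed above.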