-
Notifications
You must be signed in to change notification settings - Fork 36
/
Copy pathexample-names-and-ages.Rmd
281 lines (225 loc) · 8.82 KB
/
example-names-and-ages.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
# Example: Predicting Names from Ages
## Prerequisites {-}
This analysis uses these packages.
```{r message=FALSE}
library("tidyverse")
library("babynames")
```
## Statement of the problem
Suppose I know your sex and name, can I guess you age?
The probability of an age given a name and sex,
$$
P(\text{age} | \text{name}, \text{sex}) \propto P(\text{name} | \text{age}, \text{sex}) P(\text{age} | \text{sex})
$$
## Data Wrangling
The source of our data is the **babynames** package in R.
```{r}
YEAR <- 2015
```
We'll consider a single year in our analysis `r YEAR`, which is the last year in the `babynames` package.
The life tables are only provided for the decades from 1900 to 2010. We need to fill in the full life table for all birth years.
For years between 1900 and 2010, we will linearly impute the probability of being alive at each age for non-decadal birth years.
We'll use 2010 for all birth-years after 2010.
```{r life_table}
life_table <- babynames::lifetables %>%
group_by(year) %>%
mutate(px = lx / 1e+05) %>%
rename(age = x) %>%
select(year, age, sex, p_alive = px) %>%
ungroup() %>%
complete(sex, year = seq(min(year), YEAR, by = 1), age) %>%
rename(birthyear = year) %>%
group_by(sex, age) %>%
arrange(sex, age, birthyear) %>%
mutate(p_alive = zoo::na.approx(p_alive, na.rm = FALSE)) %>%
fill(p_alive, .direction = "down") %>%
ungroup() %>%
arrange(sex, age, birthyear)
```
For this analysis, we only need the age distribution in `r YEAR`.
```{r age_distr}
age_distr <- life_table %>%
mutate(year = birthyear + age,
# convert sex to character to avoid join warning
sex = as.character(sex)) %>%
filter(year == YEAR) %>%
select(-year)
```
```{r}
glimpse(life_table)
```
However, the `p_alive` column of `age_distr` only provides the probability of being alive in `r YEAR` conditional on having been born in a given year (and sex),
```{r fig.cap="Probability"}
ggplot(age_distr,
aes(x = birthyear, y = p_alive, color = sex)) +
geom_point() +
geom_line() +
labs(y = expression(paste("P", group("(", paste(alive, "|", year), ")"))),
x = "Year of Birth")
```
We need the number born each year to be able to calculate the number alive in each year, and the age distribution.
Suppose that the number born in each year was equal, the age distribution would be:
```{r}
age_distr %>%
group_by(sex) %>%
mutate(p = p_alive / sum(p_alive)) %>%
ggplot(aes(x = age, y = p, color = sex)) +
geom_point() +
geom_line() +
labs(x = "Age", y = "P(age)") +
theme(legend.pos = "bottom")
```
As a proxy for the number born each year we'll use the proportion of Social Security applicants each year, provided by the `babynames::applicants`.
Since the baby-name information will also come from the Social Security data this is no less restrictive.
```{r fig.caption="Number of Social Security Applicants per year."}
ggplot(babynames::applicants, aes(x = year, y = n_all / 1e6, color = sex)) +
geom_point() +
geom_line() +
labs(x = "", y = "SSA Applicants (mn)") +
theme(legend.pos = "bottom")
```
Clearly, the number of births is not constant per year.
Join the SSA applicant numbers and calculate the probability of each age by sex in `r YEAR`.
```{r age_distr_2}
age_distr <- left_join(age_distr,
rename(babynames::applicants, n_apps = n_all),
by = c("sex", "birthyear" = "year")) %>%
mutate(n_alive = p_alive * n_apps) %>%
group_by(sex) %>%
mutate(p_age = n_alive / sum(n_alive)) %>%
ungroup() %>%
arrange(sex, age)
```
After accounting for different numbers of births in each year, the
age distribution is different.
```{r}
ggplot(age_distr, aes(x = age, y = p_age, color = sex)) +
geom_point() +
geom_line() +
labs(x = "Age", y = "P(age)") +
theme(legend.pos = "bottom")
```
The `babynames` dataset has the number in each sex born each year with a given name (and registered by the SSA).
```{r}
glimpse(babynames)
```
The column `prop` is the proportion of people of that gender with that name born in each year, $P(\text{name} | \text{age}, \text{sex})$.
```{r babynames}
baby_names <- babynames::babynames
```
Also, since the SSA only releases names with > 5 individuals in a year, add
an additional entry for each year for babies born and given rare names.
```{r babynames_other}
babynames_other <- baby_names %>%
group_by(sex, year) %>%
summarise(p_named = sum(prop), n_named = sum(n)) %>%
mutate(prop = 1 - p_named, n = n_named / p_named * prop,
name = "OTHER") %>%
select(sex, year, prop, name, n, prop) %>%
ungroup()
```
```{r babynames2}
baby_names <- bind_rows(baby_names, babynames_other)
```
## Probability of age given name and sex
Consider someone with the name of "Emma" and sex is "female".
What is the posterior distribution of their age,
$$
p(\text{age} | \text{name} = \text{"Emma"}, \text{sex} = \text{"F"}) =
p(\text{name} = \text{"Emma"} | \text{age}, \text{sex} = \text{"F"}) p(\text{age} | \text{sex} = \text{"F"}) .
$$
```{r}
name <- "Emma"
sex <- "F"
```
Filter `babynames` to only include observations for the name "`r name`" and sex "`r name`":
```{r p_name_age}
p_name_age <- baby_names %>%
filter(name == !!name, sex == !!sex) %>%
select(-sex, -name) %>%
mutate(age = YEAR - year)
```
```{r plot_name_age}
ggplot(p_name_age, aes(x = year, y = prop)) +
geom_line() +
labs(x = "", y = "Proportion births")
```
The popularity of the name Emma first declined, then increased.
However, very few of those born when Emma was first popular are likely to still be alive.
```{r posterior}
posterior <-
left_join(p_name_age,
select(filter(age_distr, sex == !!sex), birthyear, prior = p_age),
by = c(year = "birthyear")) %>%
rename(likelihood = prop) %>%
# fill in missing values with 0
mutate(prior = if_else(is.na(prior), 0, prior),
# postrior
post = likelihood * prior,
# normalize posterior to sum to 1
post = post / sum(post))
```
Let's plot the prior ($P(age)$), likelihood ($P(name | age)$), and posterior ($P(name | age)$).
```{r plot_posterior}
posterior %>%
select(age, post, likelihood, prior) %>%
gather(variable, value, -age) %>%
mutate(variable = recode(variable, post = "P(age | name)",
likelihood = "P(name | age)", prior = "P(age)"),
variable = factor(variable,
levels = c("P(age)", "P(name | age)",
"P(age | name)"))) %>%
ggplot(aes(y = value, x = age)) +
geom_line() +
facet_wrap(~ variable, ncol = 1, scales = "free_y")
```
Alternatively, instead of calculating $p(age | name, sex)$ from Bayes' Theorem,
we can calculate it directly from the joint distribution of name, age, and sex.
$$
p(age | name, sex) = p(age, name, sex) / p(name, sex)
$$
```{r}
baby_names_joint <- baby_names %>%
# add probability that the person is alive
left_join(select(age_distr, age, sex, birthyear, p_alive),
by = c("sex", year = "birthyear")) %>%
# calculate number alive
filter(year >= 1900) %>%
mutate(n_alive = p_alive * n) %>%
# calculate p(sex, age, name)
# number alive with sex, age, name / total estimated to be alive
mutate(prop = n_alive / sum(n_alive))
```
Calculate $p(name, sex)$ by summing the probabilities of all
combinations of age and sex.
```{r}
p_name_sex <- baby_names_joint %>%
group_by(name, sex) %>%
summarise(prop = sum(prop))
```
Calculate,
$$
p(\text{age} | \text{name}, \text{sex}) = \frac{p(\text{age}, \text{name}, \text{sex})}{p(\text{name}, \text{sex})}
$$
```{r}
inner_join(baby_names_joint,
select(p_name_sex, name, sex, p_name_sex = prop),
by = c("name", "sex")) %>%
mutate(p_age = prop / p_name_sex)
```
### Questions
1. Which name provides the most information about a person's age? Which provides the least?
1. Consider a rare name - what is the probability in years in which there is no sample. How does that affect the analysis? What would you do about it?
```{r}
filter(baby_names, sex == "M") %>%
count(name) %>%
filter(nn == 1) %>%
sample_n(1)
```
1. How would you calculate $p(age | name)$ if you don't know the sex of the individual?
### References
This example is derived from [How to Tell Someone’s Age When All You Know Is Her Name](https://fivethirtyeight.com/features/how-to-tell-someones-age-when-all-you-know-is-her-name/).
Inferring characteristics of individuals given their name are common. See these related packages:
- [gender](https://github.com/ropensci/gender): An R package for predicting gender from a name using historical name data.
- [genderizeR](https://github.com/kalimu/genderizeR) Another package for predicting gender from first names.
- [wru](https://github.com/kosukeimai/wru): R package to predict race/ethnicity from surnames and geolocation information, based on [this paper](http://imai.princeton.edu/research/race.html)