---
title: "Activity 14: Area Data VI"
output: html_notebook
---
# Activity 14: Area Data VI
## Practice questions
Answer the following questions:
1. Describe and discuss the possible sources of autocorrelation in the residuals of a model.
2. List possible corrective/remedial actions when residual autocorrelation is detected.
3. Under which situations is a Spatial Error Model an adequate modeling strategy?
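
As a reference point for question 3, the spatial error model is commonly written as shown below. The notation is the standard textbook one and is an assumption here, since this notebook does not define it: $W$ is a spatial weights matrix, $\lambda$ is the spatial autoregressive coefficient of the error term, and $\varepsilon$ is an independent error.

$$
y = X\beta + u, \qquad u = \lambda W u + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)
$$

Solving the second equation for $u$ gives $u = (I - \lambda W)^{-1}\varepsilon$, so in this specification the spatial dependence enters only through the error term and not through the dependent variable or the covariates.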
## Learning objectives
In this activity, you will:
1. Explore a dataset with area data using visualization as appropriate.
2. Discuss a process that might explain any pattern observed from the data.
3. Conduct a modeling exercise using appropriate techniques. Justify your modeling decisions.
## Suggested reading
O'Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.
## Preliminaries
For this activity you will need the following:
* This R markdown notebook.
* A dataset of your choice.
It is good practice to clear the workspace before you begin, so that no extraneous objects remain from earlier work. The R command for this is `rm` (for "remove"), followed by a list of the items to remove. To remove _all_ objects from the workspace, run the following:
```{r}
rm(list = ls())
```
Note that `ls()` lists the objects currently in the workspace.
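The following is a minimal sketch of how `ls()` and `rm()` interact. It works in a scratch environment (the names `scratch`, `tract_counts`, and `.hidden_setting` are made up for illustration), so running it does not clear your own workspace. It also shows one detail worth knowing: by default `ls()` skips objects whose names begin with a dot, so `rm(list = ls())` leaves them in place.

```{r}
# Scratch environment so the demonstration does not touch your own workspace
scratch <- new.env()
assign("tract_counts", c(3, 0, 5), envir = scratch)
assign(".hidden_setting", TRUE, envir = scratch)

# By default ls() omits names that start with a dot
ls(envir = scratch)

# all.names = TRUE lists the dot-prefixed names as well
ls(envir = scratch, all.names = TRUE)

# Remove everything that the default ls() reports
rm(list = ls(envir = scratch), envir = scratch)

# Check what is left in the scratch environment
remaining <- ls(envir = scratch, all.names = TRUE)
remaining
```

If you also want to remove dot-prefixed objects from your workspace, `rm(list = ls(all.names = TRUE))` does so.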
Load the libraries you will use in this activity (load other packages as appropriate).
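```{r message = FALSE, warning = FALSE}
library(geog4ga3)  # course package (assumed source of the datasets below)
library(sf)        # simple features for vector spatial data
library(spatstat)  # spatial point pattern analysis
library(spdep)     # spatial weights and autocorrelation tests
library(tidyverse) # data manipulation and plotting
```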
Choose one of the following datasets.
### New York leukemia data
```{r}
data("nyleukemia")
```
A `SpatialPolygonsDataFrame` that contains the following variables:
* AREANAME: name of census tract
* AREAKEY: unique FIPS code for each tract
* POP8: population size (1980 U.S. Census)
* TRACTCAS: number of cases of leukemia (1978-1982)
* PROPCAS: proportion of cases per tract
* PCTOWNHOME: percentage of people in each tract owning their own home
* PCTAGE65P: percentage of people in each tract aged 65 or more
* Z: transformed proportions
* AVGIDIST: average distance between centroid and TCE sites
* PEXPOSURE: "exposure potential", the inverse distance between each census tract centroid and the nearest TCE site (IDIST), transformed via log(100*IDIST)
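
To make the PEXPOSURE transformation concrete, here is a small worked example. The distances are invented for illustration and are not taken from `nyleukemia`; the distance units and the use of the natural logarithm (the default of `log()` in R) are also assumptions.

```{r}
# Hypothetical distances from three tract centroids to their nearest TCE site
# (made-up values and units; not taken from the nyleukemia data)
dist_nearest_tce <- c(0.5, 2, 10)

# IDIST: inverse distance to the nearest site
idist <- 1 / dist_nearest_tce

# PEXPOSURE as described above; log() is the natural logarithm (assumed)
pexposure <- log(100 * idist)
round(pexposure, 3)
```

Tracts closer to a site receive larger values, and the logarithm compresses the very large inverse distances of tracts that lie almost on top of a site.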
This can be converted to a simple features object as follows:
```{r}
nyleukemia.sf <- st_as_sf(nyleukemia)
```
### Pennsylvania lung cancer
```{r}
data("pennlc")
```
A `SpatialPolygonsDataFrame` that contains the following variables:
* county: Name of the county
* cases: Number of cases of lung cancer
* population: Population by county
* rate: Lung cancer rate by county
* smoking: Smoking rate by county
* cancer_rate: Lung cancer rate by county (%)
* smoking_rate: Smoking rate by county (%)
This can be converted to a simple features object as follows:
```{r}
pennlc.sf <- st_as_sf(pennlc)
```
## Activity
1. Partner with a fellow student to analyze the chosen dataset.
2. Visualize/explore the dataset using appropriate tools.
3. Analyze your dataset by means of regression modeling. Which variable should be the dependent variable? Why?
4. Discuss the results of your analysis, including its limitations and possible ways to improve it (e.g., what additional variables would you like to use?).
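
For step 3, one common check is whether the residuals of a regression are spatially autocorrelated. The sketch below illustrates the idea with simulated data on a regular lattice and base R only, so it runs without the activity datasets. Everything in it is an assumption made for illustration: the lattice, the variable names, the omitted west-east trend, and the helper `moran_residuals()`. With the activity datasets you would more likely build the weights with `poly2nb()` and `nb2listw()` from `spdep` and test the residuals with `lm.morantest()`.

```{r}
set.seed(123)

# Simulated stand-in for area data: a 10 x 10 lattice of zones
n_side <- 10
zones <- expand.grid(col = 1:n_side, row = 1:n_side)
n <- nrow(zones)

# Rook contiguity: two zones are neighbours when they share an edge,
# that is, when their Manhattan distance on the lattice equals 1
adjacency <- (as.matrix(dist(zones, method = "manhattan")) == 1) * 1

# Row-standardise so that each row of W sums to 1
W <- adjacency / rowSums(adjacency)

# One covariate, plus a second variable with a west-east trend
x <- rnorm(n)
trend <- zones$col / n_side
y <- 1 + 0.5 * x + 3 * trend + rnorm(n, sd = 0.5)

# Hypothetical helper: Moran's I of the residuals of a fitted lm,
# I = (n / S0) * (e' W e) / (e' e), where S0 is the sum of all weights.
# Residuals of a model with an intercept already have mean zero.
moran_residuals <- function(model, W) {
  e <- residuals(model)
  n <- length(e)
  as.numeric((n / sum(W)) * (t(e) %*% W %*% e) / sum(e^2))
}

# Model that omits the trending variable, and model that includes it
fit_omitted <- lm(y ~ x)
fit_full <- lm(y ~ x + trend)

moran_omitted <- moran_residuals(fit_omitted, W)
moran_full <- moran_residuals(fit_full, W)
c(omitted = moran_omitted, full = moran_full)
```

In this simulation the residuals of the model that leaves out the trending variable show clearly positive autocorrelation, while those of the fuller model are close to zero. This statistic alone does not give a significance test, and when no suitable covariate is available a spatial error model (for instance `errorsarlm()` in the `spatialreg` package) is one possible remedy.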