-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathlinks-79-pair-doc.R
177 lines (177 loc) · 9.34 KB
/
links-79-pair-doc.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
#' @name Links79Pair
#'
#' @docType data
#'
#' @title Kinship linking file for pairs of relatives in the NLSY79 and NLSY79 Children and Young Adults
#'
#' @description This dataset specifies the relatedness coefficient (ie, '`R`') between
#' subjects in the same extended family. Each row represents a unique
#' relationship pair.
#'
#' NOTE: Two variable names changed in November 2013. `Subject1Tag` and `Subject2Tag` became `SubjectTag_S1` and `SubjectTag_S2`.
#'
#' @format A data frame with 42,773 observations on the following 5 variables.
#' There is one row per unique pair of subjects, irrespective of order.
#'
#' * **ExtendedID** Identity of the extended family of the pair; it corresponds to the HHID in the NLSY79. See References below.
#' * **SubjectTag_S1** Identity of the pair's first subject. See Details below.
#' * **SubjectTag_S2** Identity of the pair's second subject. See Details below.
#' * **R** The pair's Relatedness coefficient. See Details below.
#' * **RelationshipPath** Specifies the relationship category of the pair. This variable is a factor, with levels `Gen1Housemates`=1, `Gen2Siblings`=2, `Gen2Cousins`=3, `ParentChild`=4, `AuntNiece`=5.
#'
#' @details The dataset contains Gen1 and Gen2 subjects. "Gen1" refers to subjects in
#' the original NLSY79 sample (https://www.nlsinfo.org/content/cohorts/nlsy79).
#' "Gen2" subjects are the biological children of the Gen1 females -ie, those
#' in the NLSY79 Children and Young Adults sample
#' (https://www.nlsinfo.org/content/cohorts/nlsy79-children).
#'
#' Subjects will be in the same extended family if either:
#' 1. they are Gen1 housemates,
#' 1. they are Gen2 siblings,
#' 1. they are Gen2 cousins (ie, they have mothers who are Gen1 sisters in the NLSY79,
#' 1. they are mother and child (in Gen1 and Gen2, respectively), or
#' 1. they are aunt|uncle and niece|nephew (in Gen1 and Gen2, respectively).
#'
#' The variables `SubjectTag_S1` and `SubjectTag_S2` uniquely identify
#' subjects. For Gen2 subjects, the SubjectTag is identical to their CID (ie,
#' C00001.00 -the SubjectID assigned in the NLSY79-Children files). However
#' for Gen1 subjects, the SubjectTag is their CaseID (ie, R00001.00), with
#' "00" appended. This manipulation is necessary to identify subjects
#' uniquely in inter-generational datasets. A Gen1 subject with an ID of 43
#' has a `SubjectTag` of 4300. The SubjectTags of her four children
#' remain 4301, 4302, 4303, and 4304.
#'
#' Level 5 of `RelationshipPath` (ie, AuntNiece) is gender neutral. The
#' relationship could be either Aunt-Niece, Aunt-Nephew, Uncle-Niece, or
#' Uncle-Nephew. If there's a widely-accepted gender-neutral term, please
#' tell me.
#'
#' An extended family with \eqn{k} subjects will have
#' \eqn{k}(\eqn{k}-1)/2 rows. Typically, Subject1 is older while Subject2 is
#' younger.
#'
#' MZ twins have *R*=1. DZ twins and full-siblings have *R*=.5.
#' Half-siblings have *R*=.25. Typical first cousins have *R*=.125.
#' Unrelated subjects have *R*=0 (this occasionally happens for
#' `Gen1Housemates`, but never for the other paths).
#' Other *R* coefficients are possible.
#'
#' There are several other uncommon possibilities, such as half-cousins (*R*=.0625) and
#' ambiguous aunt-nieces (*R*=.125, which is an average of 1/4 and 0/4).
#' The variable coding for genetic relatedness,`R`, in [`Links79Pair`] contains
#' only the common values of *R* whose groups are likely to have stable estimates.
#' However the variable `RFull` in [`Links79PairExpanded`] contains all *R* values.
#' We strongly recommend using `R` in this [base::data.frame]. Move to
#' `RFull` (or some combination) only if you have a good reason, and are willing
#' to carefully monitor a variety of validity checks. Some of these
#' excluded groups are too small to be estimated reliably.
#'
#' Furthermore, some of these groups have members who are more strongly genetically related than their
#' items would indicate. For instance, there are 41 Gen1 pairs who explicitly claim they are not biologically related
#' (*ie*, `RExplicit`=0), yet their correlation for Adult Height is *r*=0.24. This is
#' much higher than would be expected for two people sampled randomly; it is nearly identical to
#' the *r*=0.26 we observed among the 268 Gen1 half-sibling pairs who claim they share exactly 1
#' biological parent.
#'
#' The `LinksPair79` dataset contains columns necessary for a
#' basic BG analysis. The `Links79PairExpanded` dataset contains
#' further information that might be useful in more complicated BG analyses.
#'
#' A tutorial that produces a similar dataset is
#' http://www.nlsinfo.org/childya/nlsdocs/tutorials/linking_mothers_and_children/linking_mothers_and_children_tutorial.html.
#' It provides examples in SAS, SPSS, and STATA.
#'
#' `RelationshipPath` variable. Code written using this dataset should
#' NOT assume it contains only Gen2 sibling pairs. See below for an example of
#' filtering the relationship category in the in `Links79Pair`
#' documentation.
#'
#' The specific steps to determine the *R* coefficient will be described
#' in an upcoming publication. The following information may influence the
#' decisions of an applied researcher.
#'
#'
#' A distinction is made between `Explicit` and `Implicit` information.
#' Explicit information comes from survey items that directly address the
#' subject's relationships. For instance in 2006, surveys asked if the
#' sibling pair share the same biological father (eg, Y19940.00 and
#' T00020.00). Implicit information comes from items where the subject
#' typically isn't aware that their responses may be used to determine genetic
#' relatedness. For instance, if two siblings have biological fathers with
#' the same month of death (eg, R37722.00 and R37723.00), it may be reasonable
#' to assume they share the same biological father.
#'
#' `Interpolation` is our lingo when other siblings are used to leverage
#' insight into the current pair. For example, assume Subject 101, 102, and
#' 103 have the same mother. Further assume 101 and 102 report they share a
#' biological father, and that 101 and 103 share one too. Finally, assume
#' that we don't have information about the relationship between 102 and 103.
#' If we are comfortable with our level of uncertainty of these
#' determinations, then we can interpolate/infer that 102 and 103 are
#' full-siblings as well.
#'
#' The math and height scores are duplicated from
#' [ExtraOutcomes79], but are included here to make some examples
#' more concise and accessible.
#'
#' @author Will Beasley
#'
#' @seealso The `LinksPair79` dataset contains columns necessary for a
#' basic BG analysis. The [Links79PairExpanded] dataset contains
#' further information that might be useful in more complicated BG analyses.
#'
#' A tutorial that produces a similar dataset is
#' http://www.nlsinfo.org/childya/nlsdocs/tutorials/linking_mothers_and_children/linking_mothers_and_children_tutorial.html.
#' It provides examples in SAS, SPSS, and STATA.
#'
#' The current dataset (ie, `Links79Pair`) can be saved as a CSV file
#' (comma-separated file) and imported into in other programs and languages.
#' In the R console, type the following two lines of code:
#'
#' `library(NlsyLinks)`
#'
#' `write.csv(Links79Pair, "C:/BGDirectory/Links79Pair.csv")`
#'
#' where `"C:/BGDirectory/"` is replaced by your preferred directory.
#' Remember to use forward slashes instead of backslashes; for instance, the
#' path `"C:\BGDirectory\Links79Pair.csv"` can be misinterpreted.
#'
#'
#' **Download CSV**
#' If you're using the NlsyLinks package in R, the dataset is automatically available.
#' To use it in a different environment,
#' [download the csv](https://github.com/nlsy-links/NlsyLinks/blob/master/outside-data/nlsy-79/links-2017-79.csv?raw=true),
#' which is readable by all statistical software.
#' [links-metadata-2017-79.yml](https://github.com/nlsy-links/NlsyLinks/blob/master/outside-data/nlsy-79/links-metadata-2017-79.yml)
#' documents the dataset version information.
#'
#' @references The NLSY79 variable HHID (ie, R00001.49) is the source for the
#' `ExtendedID` variable. This is discussed at
#' http://www.nlsinfo.org/nlsy79/docs/79html/79text/hhcomp.htm.
#'
#' For more information on *R* (*ie*, the Relatedness coefficient), please see
#' Rodgers, Joseph Lee, & Kohler, Hans-Peter (2005).
#' [Reformulating and simplifying the DF analysis model.](https://pubmed.ncbi.nlm.nih.gov/15685433/)
#' *Behavior Genetics, 35* (2), 211-217.
#'
#' @source Gen1 information comes from the Summer 2013 release of the
#' [NLSY79 sample](https://www.nlsinfo.org/content/cohorts/nlsy79). Gen2 information
#' comes from the Summer 2013 release of the
#' [NLSY79 Children and Young Adults](https://www.nlsinfo.org/content/cohorts/nlsy79-children).
#' Data were extracted with the NLS Investigator
#' (https://www.nlsinfo.org/investigator/).
#'
#' The internal version for the links is `Links2011V84`.
#' @keywords datasets
#'
#' @examples
#' library(NlsyLinks) # Load the package into the current R session.
#' summary(Links79Pair) # Summarize the five variables.
#' hist(Links79Pair$R) # Display a histogram of the Relatedness coefficients.
#' table(Links79Pair$R) # Create a table of the Relatedness coefficients for the whole sample.
#'
#' # Create a dataset of only Gen2 sibs, and display the distribution of R.
#' gen2Siblings <- subset(Links79Pair, RelationshipPath == "Gen2Siblings")
#' table(gen2Siblings$R) # Create a table of the Relatedness coefficients for the Gen2 sibs.
#'
NULL