% !Rnw root = learnR.Rnw
\input{preamble}
\marginnote[8pt]{Data objects and functions are two of several types
of objects (others include model objects, formulae, and expressions)
that are available in R. Users can create and work with such
objects in a user workspace. All can, if the occasion demands,
be treated as data!}
\noindent
\fbox{\parbox{\linewidth}{{\bf Different types of data objects:}\\[6pt]
\begin{tabular}{ll}
Vectors & These collect together elements of the same mode.\\
& (Possible modes are "logical", "integer", "numeric",\\
& "complex", "character" and "raw")\\[6pt]
Factors & Factors identify category levels in categorical data.\\
& Modeling functions know how to represent factors.\\
& (Factors do not quite manage to be vectors! Why?)\\[6pt]
Data & A list of columns -- same length; modes may differ. \\
frame & Data frames are a device for organizing data.\\[6pt]
Lists & Lists group together an arbitrary set of objects \\
& (Lists are recursive; elements of lists are lists.)\\[6pt]
\code{NA}s & Use \code{is.na()} to check for \code{NA}s.
\end{tabular}
}}
\vspace*{8pt}
We start this chapter by noting data objects that may appear as
columns of a data frame.
\section{Column Data Objects -- Vectors and Factors}\label{ss:vecDfM}
Column objects is a convenient name for one-dimensional data
structures that can be included as columns in a data frame.
This includes vectors\footnote{Strictly, the vectors that we
discuss here are \textit{atomic} vectors. Their elements are not,
as happens with lists, wrappers for other language objects.},
factors, and dates.
\subsection{Vectors}\label{ss:vector}
Examples of vectors \marginnote{Common vector modes
are logical, numeric and character. The 4 lines of code create
vectors that are, in order: numeric, numeric, logical, character.} are
\begin{Schunk}
\begin{Sinput}
c(2,3,5,2,7,1)
3:10 # The numbers 3, 4,.., 10
c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
c("fig","mango","apple","prune")
\end{Sinput}
\end{Schunk}
Use \code{mode()} to show the storage mode of an object, thus:
\begin{Schunk}
\begin{Sinput}
x <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
mode(x)
\end{Sinput}
\begin{Soutput}
[1] "logical"
\end{Soutput}
\end{Schunk}
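Because all elements of a vector share one mode, \code{c()} coerces
its arguments to a common mode when they differ. The following is a
minimal sketch; the object name \code{mixed} is made up for the
illustration.
\begin{Schunk}
\begin{Sinput}
mixed <- c(1, "a", TRUE)   # numeric, character and logical together
mixed                      # All elements are coerced to character
mode(mixed)                # "character"
mode(c(1, TRUE))           # Logical is coerced to numeric: "numeric"
\end{Sinput}
\end{Schunk}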
The missing value symbol is \code{NA}. Subsection \ref{ss:NA} will
discuss issues that arise when one or more vector elements is an \code{NA}.
\subsection*{Subsets of Vectors}
There are four common ways to extract subsets of vectors.
1. Specify the subscripts of elements that are to be extracted:
\begin{Schunk}
\begin{Sinput}
x <- c(3,11,8,15,12) # Assign to x the values
x[c(2,4)] # Extract elements 2 and 4
\end{Sinput}
\begin{Soutput}
[1] 11 15
\end{Soutput}
\end{Schunk}
\noindent
Negative numbers may be used to omit elements:\sidenote{Mixing of
positive and negative subscripts is not allowed.}
\begin{Schunk}
\begin{Sinput}
x <- c(3,11,8,15,12)
x[-c(2,3)]
\end{Sinput}
\begin{Soutput}
[1] 3 15 12
\end{Soutput}
\end{Schunk}
2. Specify a vector of logical values. \marginnote{Arithmetic
relations that may be used for extraction of subsets are
\margtt{<}, \margtt{<=}, \margtt{>}, \margtt{>=}, \margtt{==},
\margtt{!=} and \margtt{\%in\%}. The
first four compare magnitudes, \margtt{==} tests for equality,
\margtt{!=} tests for inequality, and \margtt{\%in\%} tests
whether any element matches.} The elements that are
extracted are those for which the logical value is \code{TRUE}.
Thus suppose we want to extract values of x that are greater than
10.
\begin{Schunk}
\begin{Sinput}
x>10 # Values are logical (TRUE or FALSE)
\end{Sinput}
\begin{Soutput}
[1] FALSE TRUE FALSE TRUE TRUE
\end{Soutput}
\begin{Sinput}
x[x > 10]
\end{Sinput}
\begin{Soutput}
[1] 11 15 12
\end{Soutput}
\begin{Sinput}
"John" %in% c("Jeff", "Alan", "John")
\end{Sinput}
\begin{Soutput}
[1] TRUE
\end{Soutput}
\end{Schunk}
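Logical conditions can be combined, and \code{which()} converts a
logical vector into the corresponding subscripts. The following
sketch reuses the values of \code{x} from above; the values on the
right of the final line are made up for the illustration.
\begin{Schunk}
\begin{Sinput}
x <- c(3,11,8,15,12)
x[x > 10 & x < 15]       # Both conditions must hold: 11 12
which(x > 10)            # Positions of the TRUE values: 2 4 5
x[x %in% c(8, 12, 20)]   # Elements of x that match a listed value: 8 12
\end{Sinput}
\end{Schunk}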
3. Where elements have names, these can be used to extract elements:
\begin{Schunk}
\begin{Sinput}
altitude <- c(Cambarville=800, Bellbird=300,
"Allyn River"=300,
"Whian Whian"=400,
Byrangery=200, Conondale=400,
Bulburin=600)
##
## Names can be used to extract elements
altitude[c("Cambarville", "Bellbird")]
\end{Sinput}
\begin{Soutput}
Cambarville Bellbird
800 300
\end{Soutput}
\end{Schunk}
4. Use \code{subset()}, with the vector as the first argument,
and a logical statement that identifies the elements to be
extracted as the second argument. For example:
\begin{Schunk}
\begin{Sinput}
subset(altitude, altitude>400)
\end{Sinput}
\begin{Soutput}
Cambarville Bulburin
800 600
\end{Soutput}
\end{Schunk}
\subsection{Factors}\label{ss:factors}
\marginnote{Factors are an economical way to store vectors of
repetitive text strings. In versions of R before 4.0.0, a vector of
text strings that became a column in a data frame was, by default,
incorporated as a factor.}
Factors are column objects whose elements are integer values 1, 2,
\ldots, $k$, where $k$ is the number of levels. They are
distinguished from integer vectors by having the class \txtt{factor}
and a \txtt{levels} attribute.
For example, create a character vector \code{fruit}, thus:
\begin{Schunk}
\begin{Sinput}
fruit <- c("fig","mango","apple","plum", "fig")
\end{Sinput}
\end{Schunk}
This might equally well be stored as a factor, thus:
\begin{Schunk}
\begin{Sinput}
fruitfac <- factor(fruit)
\end{Sinput}
\end{Schunk}
Internally, the factor is stored\marginnote{Thus 1 is interpreted
as \margtt{"apple"}; 2:\margtt{"fig"};
3:\margtt{"mango"}; 4:\margtt{"plum"}.}
as the integer vector 2, 3, 1, 4, 2. These numbers are
interpreted according to the attributes table:\\[2mm]
\begin{tabular}{|cccc|}
\hline
1 & 2 & 3 & 4\\
\verb!"apple"! & \verb!"fig"! & \verb!"mango"! & \verb!"plum"!\\
\hline
\end{tabular}
\vspace*{8pt}
\noindent
By default, the levels are taken in alphanumeric order.
The function \code{factor()}, with the \code{levels} argument
specified, can be used either to specify the order of levels when the
factor is created, or to make a later change to the
order.\sidenote{Where counts are tabulated by factor level, or
\textit{lattice} or other graphs have one panel per factor level,
these are in order of the levels.} For example, the following
orders levels according to stated glycemic index:
\begin{Schunk}
\begin{Sinput}
glycInd <- c(apple=40, fig=35, mango=55, plum=25)
## Take levels in order of stated glycInd index
fruitfac <- factor(fruit,
levels=names(sort(glycInd)))
levels(fruitfac)
\end{Sinput}
\begin{Soutput}
[1] "plum" "fig" "apple" "mango"
\end{Soutput}
\begin{Sinput}
unclass(fruitfac) # Examine stored values
\end{Sinput}
\begin{Soutput}
[1] 2 4 3 1 2
attr(,"levels")
[1] "plum" "fig" "apple" "mango"
\end{Soutput}
\end{Schunk}
If a level name is misspelled in the \code{levels} argument, the
elements that should have had that level become missing values.
Use the \code{labels} argument if you
wish to change the level names, but be careful to ensure that the
labels are in the same order as the levels.
\begin{marginfigure}[-36pt]
Example of a misspelled level name:
\begin{Schunk}
\begin{Sinput}
trt <- c("A","A","Control")
trtfac <- factor(trt,
levels=c("control","A"))
table(trtfac)
\end{Sinput}
\begin{Soutput}
trtfac
control A
0 2
\end{Soutput}
\end{Schunk}
\end{marginfigure}
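As a sketch of how \code{labels} pairs up with \code{levels}, the
following uses a made-up vector \code{size}. Each label replaces the
level that occupies the same position in the \code{levels} argument.
\begin{Schunk}
\begin{Sinput}
size <- c("lo", "hi", "lo", "med")
## levels gives the existing values, in the order required;
## labels gives the new names, in that same order
sizefac <- factor(size, levels=c("lo","med","hi"),
                  labels=c("Low","Medium","High"))
levels(sizefac)          # "Low" "Medium" "High"
as.character(sizefac)    # "Low" "High" "Low" "Medium"
\end{Sinput}
\end{Schunk}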
In most places where the context seems to demand it, the integer levels
are translated into text strings, thus:
\begin{Schunk}
\begin{Sinput}
fruit <- c("fig","mango","apple", "plum","fig")
fruitfac <- factor(fruit)
fruitfac == "fig"
\end{Sinput}
\begin{Soutput}
[1] TRUE FALSE FALSE FALSE TRUE
\end{Soutput}
\end{Schunk}
Section \ref{ss:facs} has detailed examples of the use of factors
in model formulae.
\subsection*{Ordered factors}
In addition to factors, note the existence of ordered factors, created
using the function \code{ordered()}. For ordered factors, the order
of levels implies a relational ordering. For example:
\begin{Schunk}
\begin{Sinput}
windowTint <- ordered(rep(c("lo","med","hi"), 2),
levels=c("lo","med","hi"))
windowTint
\end{Sinput}
\begin{Soutput}
[1] lo med hi lo med hi
Levels: lo < med < hi
\end{Soutput}
\begin{Sinput}
sum(windowTint > "lo")
\end{Sinput}
\begin{Soutput}
[1] 4
\end{Soutput}
\end{Schunk}
\subsection*{Subsetting of factors}
Consider the factor \code{fruitfac} that was created earlier:
\begin{Schunk}
\begin{Sinput}
fruitfac <- factor(c("fig","mango","apple","plum", "fig"))
\end{Sinput}
\end{Schunk}
We can remove elements with levels \txtt{fig} and \txtt{plum} thus:
\begin{Schunk}
\begin{Sinput}
ff2 <- fruitfac[!fruitfac %in% c("fig","plum")]
ff2
\end{Sinput}
\begin{Soutput}
[1] mango apple
Levels: apple fig mango plum
\end{Soutput}
\begin{Sinput}
table(ff2)
\end{Sinput}
\begin{Soutput}
ff2
apple fig mango plum
1 0 1 0
\end{Soutput}
\end{Schunk}
The levels \txtt{fig} and \txtt{plum} remain, but with
the table showing 0 values for these levels. Use the function
\code{droplevels()} to remove levels that are not present in
the data:
\begin{marginfigure}
Note also:
\begin{Schunk}
\begin{Sinput}
table(droplevels(ff2))
\end{Sinput}
\begin{Soutput}
apple mango
1 1
\end{Soutput}
\end{Schunk}
\end{marginfigure}
\begin{Schunk}
\begin{Sinput}
droplevels(ff2)
\end{Sinput}
\begin{Soutput}
[1] mango apple
Levels: apple mango
\end{Soutput}
\end{Schunk}
\subsection*{Why is a factor not a vector?}
Two factors \marginnote{Vectors can be concatenated (joined). Two or
more factors can be sensibly concatenated only if they have
identical levels vectors.} that
have different levels vectors are different types of object. Thus,
formal concatenation of factors with different levels vectors is
handled by first coercing both factors to integer vectors. The
integer vector that results is not, in most circumstances, meaningful
or useful.
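
One way to sidestep the difficulty, sketched here with made-up
factors \code{f1} and \code{f2}, is to convert each factor to
character before joining, and then to form a new factor from the
result. This does not depend on how \code{c()} treats factors.
\begin{Schunk}
\begin{Sinput}
f1 <- factor(c("fig", "apple"))
f2 <- factor(c("plum", "fig"))
## Convert to character, join, then form a new factor
f12 <- factor(c(as.character(f1), as.character(f2)))
levels(f12)        # "apple" "fig" "plum"
as.integer(f12)    # 2 1 3 2
\end{Sinput}
\end{Schunk}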
\subsection{Missing Values, Infinite Values and NaNs}\label{ss:NA}
Almost any arithmetic or logical operation with \code{NA} generates an
\marginnote{Failure to understand the rules for calculations with
\margtt{NA}s can lead to unwelcome surprises.}
\code{NA}. The consequences are more far-reaching than might be
immediately obvious. Use \code{is.na()} to test for a missing value:
\begin{Schunk}
\begin{Sinput}
is.na(c(1, NA, 3, 0, NA))
\end{Sinput}
\begin{Soutput}
[1] FALSE TRUE FALSE FALSE TRUE
\end{Soutput}
\end{Schunk}
An expression such as \code{c(1, NA, 3, 0, NA) == NA} returns a vector of
\code{NA}s, and cannot be used to test for missing values.
\begin{Schunk}
\begin{Sinput}
c(1, NA, 3, 0, NA) == NA
\end{Sinput}
\begin{Soutput}
[1] NA NA NA NA NA
\end{Soutput}
\end{Schunk}
\noindent
As the value is unknown, it might or might not be equal to 1, or to another
\code{NA}, or to 3, or to 0.
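
The logical vector that \code{is.na()} returns can be used to count,
locate, or remove missing values. A minimal sketch, using the same
values as above:
\begin{Schunk}
\begin{Sinput}
x <- c(1, NA, 3, 0, NA)
sum(is.na(x))      # Number of missing values: 2
which(is.na(x))    # Positions of missing values: 2 5
x[!is.na(x)]       # Values that are not missing: 1 3 0
NA + 1             # Arithmetic with NA gives NA
\end{Sinput}
\end{Schunk}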
Note that different functions handle \code{NA}s in\marginnote{The
modeling function \margtt{lm()} accepts any of the arguments
\margtt{na.action=na.omit} (omit), \margtt{na.action=na.exclude}
(omit \margtt{NA}s when fitting; replace by \margtt{NA}s
when fitted values and residuals are calculated), and
\margtt{na.action=na.fail}.} different ways. Functions such as
\code{mean()} and \code{median()} accept the argument
\code{na.rm=TRUE}, which causes observations that have \code{NA}s to
be ignored. The \code{plot()} function omits \code{NA}s, infinities
and \code{NaN}s. For use of \code{lowess()} to put a smooth curve
through the plot, \code{NA}s must first be removed. By default,
\code{table()} ignores \code{NA}s.
Problems with missing values are a common reason why calculations
fail. Infinite values and \code{NaN}s are a further potential source
of difficulty.
\subsection*{\code{Inf} and \code{NaN}}
The expression \code{1/0} returns \code{Inf}.\marginnote{Note that
\margtt{sqrt(-1+0i)} returns \margtt{0+1i}. R distinguishes between
the real number \margtt{-1} and the complex number \margtt{-1+0i}.}
The expression \code{log(0)} returns \code{-Inf},
i.e., smaller than any real number. The expressions \code{0/0} and
\code{log(-1)} both return \code{NaN}.
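
The functions \code{is.finite()}, \code{is.nan()} and \code{is.na()}
distinguish between these special values. The vector \code{z} in the
following sketch is made up for the illustration.
\begin{Schunk}
\begin{Sinput}
z <- c(1/0, log(0), 0/0, NA, 1)
is.finite(z)   # TRUE only for the final element
is.nan(z)      # TRUE only for 0/0
is.na(z)       # TRUE for both NaN and NA
\end{Sinput}
\end{Schunk}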
\subsection*{\code{NA}s in subscripts?}
When making an assignment, it is best to ensure that \code{NA}s do
not appear in subscript expressions on either side of the
assignment.
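
The following sketch, with a made-up vector \code{x}, shows how an
\code{NA} in a logical subscript carries through to the result, and
one way to build a subscript that is free of \code{NA}s. The name
\code{big} is hypothetical.
\begin{Schunk}
\begin{Sinput}
x <- c(1, NA, 3, 5)
x[x > 2]            # The NA in the subscript carries through: NA 3 5
x[which(x > 2)]     # which() drops NAs from the subscript: 3 5
## For assignment, use a subscript that is free of NAs
big <- !is.na(x) & x > 2
x[big] <- 0
x                   # 1 NA 0 0
\end{Sinput}
\end{Schunk}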
\section{Data Frames, Matrices, Arrays and Lists}\label{sec:dframes}
\marginnote{Data frames with all columns numeric can sometimes be
handled in the same way as matrices. In other cases, a different
syntax may be needed, or conversion from one to the other.
Proceed with care!}
\paragraph{Data frames:} Data frames are lists of column objects.
The requirement that all
of the column objects have the same length gives data frames a row
by column rectangular structure. Different columns can have different
column classes --- commonly numeric or character or factor or logical
or date.
\paragraph{Matrices -- vectors with a Dimension:}
\marginnote[8pt]{Internally, matrices are stored as one long vector
in which the columns are stacked one above the other. The first
element in the dimension attribute gives the number of rows in each
column.}
When printed, matrices appear in a row by column layout in which all
elements have the same mode -- commonly numeric or character or
logical.
\paragraph{Arrays and tables:} Matrices are two-dimensional arrays.
Arrays more generally can have an arbitrary number of dimensions.
Tables have a structure that is identical to that of arrays.
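
As a small sketch of the way that the dimension attribute imposes
structure on an underlying vector, consider the following. The
objects \code{m} and \code{arr} are made up for the illustration.
\begin{Schunk}
\begin{Sinput}
m <- matrix(1:6, nrow=2)   # Filled column by column
dim(m)                     # 2 rows, 3 columns
m[, 2]                     # Second column: 3 4
as.vector(m)               # Underlying vector: 1 2 3 4 5 6
arr <- array(1:24, dim=c(2,3,4))   # A three-dimensional array
dim(arr)                   # 2 3 4
arr[2, 3, 4]               # Final element: 24
\end{Sinput}
\end{Schunk}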
The data frame \code{travelbooks} will feature in the subsequent
discussion. Look back to Section \ref{sec:input} to see how it can be
entered.
\subsection{Data frames versus matrices and tables}\label{ss:df-mat}
\marginnote[12pt]{Computations that can be performed with matrices are
typically much faster than their equivalents with data frames. See
Section \ref{sec:large-dset}.} Modeling functions commonly return
larger numeric objects as matrices rather than data frames. The
principal components function \margtt{prcomp()} returns scores as a
matrix, as does the linear discriminant analysis function
\margtt{MASS::lda()}.
Functions are available to convert data frames into matrices, and vice
versa. For example:
\begin{Schunk}
\begin{Sinput}
travelmat <- as.matrix(travelbooks[, 1:4])
# From data frame to matrix
newtravelbooks <- as.data.frame(travelmat)
# From matrix to data frame
\end{Sinput}
\end{Schunk}
\enlargethispage{24pt}
In comparing data frames with matrices, note that:
\begin{itemizz}
\item Both for data frames and for matrices or two-way tables,
the function \code{dim()} returns number of rows by number of columns,
thus:
\begin{marginfigure}
Alternatively, do:
\begin{Schunk}
\begin{Sinput}
attr(travelmat, "dim")
\end{Sinput}
\begin{Soutput}
[1] 6 4
\end{Soutput}
\end{Schunk}
\end{marginfigure}
\begin{Schunk}
\begin{Sinput}
travelmat <- as.matrix(travelbooks[, 1:4])
dim(travelmat)
\end{Sinput}
\begin{Soutput}
[1] 6 4
\end{Soutput}
\end{Schunk}
\item \marginnote[12pt]{A data frame is a list of columns.
The function \code{length()} returns the list length.}
For a matrix, \code{length()} returns the number of elements.
For a data frame it returns the number of columns.
\begin{Schunk}
\begin{Sinput}
c(dframelgth=length(travelbooks),
matlgth=length(travelmat))
\end{Sinput}
\begin{Soutput}
dframelgth matlgth
6 24
\end{Soutput}
\end{Schunk}
\item The notation that uses single square left and right brackets
to extract subsets of data frames, introduced in Section \ref{sec:df},
works in just the same way with matrices. For example:
\begin{Schunk}
\begin{Sinput}
travelmat[, 4]
travelmat[, "weight"]
travelmat[, 1:3]
travelmat[2,]
\end{Sinput}
\end{Schunk}
Negative indices can be used to omit rows and/or columns.
\item Use of the subscript notation to extract a row from a data
frame returns a data frame, whereas extraction of a column yields a
column vector. Thus:
\begin{itemizz}
\item
\marginnote[12pt]{Use \margtt{unlist(travelbooks[6, ])} to turn a row
from the data frame into a vector. All elements are coerced to a
common mode, in this case numeric. Thus the final element becomes
1.0 (the code that is stored), rather than \margtt{Guide} which was
the first level of the factor \margtt{type}.}
Extraction of a row from a data frame, for example
\code{travelbooks["Canberra - The Guide", ]}
or \code{travelbooks[6, ]}, yields a data frame,
i.e., a special form of list.
\item \verb!travelbooks$volume! (equivalent to \code{travelbooks[,5]}
or \code{travelbooks[,"volume"]}) is a vector.
\end{itemizz}
\item For either a data frame or a matrix, the function
\code{rownames()} can be used to extract row names, and the
function \code{colnames()} to extract column names. For
data frames, \code{row.names()} is an alternative to
\code{rownames()}, while \code{names()} is an alternative to
\code{colnames()}.
\end{itemizz}
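The difference between row and column extraction can be checked
directly. The sketch below uses a small made-up data frame
\code{df}, not \code{travelbooks}. The argument \code{drop=FALSE}
retains the data frame structure when a single column is extracted.
\begin{Schunk}
\begin{Sinput}
## Small made-up data frame, for illustration only
df <- data.frame(len=c(1.5, 2.5, 3.5), wt=c(10, 20, 30))
class(df[2, ])                   # "data.frame"
class(df[, "wt"])                # "numeric"
class(df[, "wt", drop=FALSE])    # "data.frame"
unlist(df[2, ])                  # Named numeric vector: len 2.5, wt 20
\end{Sinput}
\end{Schunk}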
Note also a difference in the mechanisms for adding columns. The
following adds new columns \code{area} (area of page), and
\code{density} (\code{weight} to \code{volume} ratio) to the data
frame \code{travelbooks}:
\begin{Schunk}
\begin{Sinput}
travelbooks$area <- with(travelbooks, width*height)
travelbooks$density <- with(travelbooks,
weight/volume)
names(travelbooks) # Check column names
\end{Sinput}
\begin{Soutput}
[1] "thickness" "width" "height" "weight" "volume" "type"
[7] "area" "density"
\end{Soutput}
\end{Schunk}
Columns are added to the data frame as necessary.
For matrices, use \code{cbind()}, which can also be used for data
frames, to bind in new columns.
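
A minimal sketch of the use of \code{cbind()} to add a column to a
matrix follows. The matrix \code{pagemat} and its values are made up
for the illustration.
\begin{Schunk}
\begin{Sinput}
## Made-up page dimensions, for illustration only
pagemat <- cbind(width=c(11.3, 13.1), height=c(23.9, 18.7))
pagemat <- cbind(pagemat,
                 area=pagemat[, "width"]*pagemat[, "height"])
colnames(pagemat)     # "width" "height" "area"
dim(pagemat)          # 2 3
\end{Sinput}
\end{Schunk}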
\enlargethispage{12pt}
\subsection{Inclusion of character vectors in data frames}
When data frames are created, whether by use of
\txtt{read.table()} or another such function to input data from a
file, or by use of the function \code{data.frame()} to
join columns of data together into a data frame, character vectors are
converted into factors. Thus, the final column (\code{type}) of
\code{travelbooks} became, by default, a factor.\footnote{This assumes
that the global option \code{stringsAsFactors} is \code{TRUE}, which
was the default in versions of R before 4.0.0.
To check, interrogate {\footnotesize \code{options()\$stringsAsFactors}.}}
%$
To prevent such type conversions, specify \code{stringsAsFactors=FALSE}
in the call to \code{read.table()} or \code{data.frame()}.
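
The effect of the argument can be checked by setting it explicitly,
which gives the same result whatever the default. In the following
sketch the names \code{df1} and \code{df2}, and the values in
\code{type}, are made up for the illustration.
\begin{Schunk}
\begin{Sinput}
type <- c("Guide", "Roadmaps", "Guide")
df1 <- data.frame(type=type, stringsAsFactors=TRUE)
df2 <- data.frame(type=type, stringsAsFactors=FALSE)
class(df1$type)    # "factor"
class(df2$type)    # "character"
\end{Sinput}
\end{Schunk}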
\subsection{Factor columns in data frame subsets}
The data frame \code{DAAG::ais} has physical characteristics
of athletes, divided up thus between ten different sports:
\begin{fullwidth}
\begin{minipage}[t]{\linewidth}
\begin{Schunk}
\begin{Sinput}
with(ais, table(sport))
\end{Sinput}
\begin{Soutput}
sport
B_Ball Field Gym Netball Row Swim T_400m T_Sprnt Tennis
25 19 4 23 37 22 29 15 11
W_Polo
17
\end{Soutput}
\end{Schunk}
\end{minipage}
\end{fullwidth}
Figure \ref{fig:lattice-ais} in Subsection \ref{ss:lat-gph} limits the
data to swimmers and rowers. For this, at the same time removing all
levels except \code{Row} and \code{Swim} from the factor \code{sport},
do:
\marginnote[17pt]{If redundant levels were left in place, the graph
would show empty panels for each such level.}
\begin{Schunk}
\begin{Sinput}
rowswim <- with(ais, sport %in% c("Row", "Swim"))
aisRS <- droplevels(subset(ais, rowswim))
xtabs(~sport, data=aisRS)
\end{Sinput}
\begin{Soutput}
sport
Row Swim
37 22
\end{Soutput}
\end{Schunk}
Contrast the above with:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
xtabs(~sport, data=subset(ais, rowswim))
\end{Sinput}
\begin{Soutput}
sport
B_Ball Field Gym Netball Row Swim T_400m T_Sprnt Tennis
0 0 0 0 37 22 0 0 0
W_Polo
0
\end{Soutput}
\end{Schunk}
\end{fullwidth}
\subsection{Handling rows that include missing values}
The function \code{na.omit()} omits rows that contain one
or more missing values. The argument may be a data frame or a
matrix. The function \code{complete.cases()} returns \code{TRUE}
for rows that have no missing values, and \code{FALSE} otherwise. Thus:
\begin{Schunk}
\begin{Sinput}
test.df <- data.frame(x=c(1:2,NA), y=1:3)
test.df
\end{Sinput}
\begin{Soutput}
x y
1 1 1
2 2 2
3 NA 3
\end{Soutput}
\begin{Sinput}
## complete.cases()
complete.cases(test.df)
\end{Sinput}
\begin{Soutput}
[1] TRUE TRUE FALSE
\end{Soutput}
\begin{Sinput}
## na.omit()
na.omit(test.df)
\end{Sinput}
\begin{Soutput}
x y
1 1 1
2 2 2
\end{Soutput}
\end{Schunk}
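The logical vector from \code{complete.cases()} can itself be used
as a row subscript, and \code{is.na()} can be used to count missing
values by column. A minimal sketch, using the same data frame:
\begin{Schunk}
\begin{Sinput}
test.df <- data.frame(x=c(1:2,NA), y=1:3)
colSums(is.na(test.df))               # NAs per column: x 1, y 0
## Row selection that keeps the same rows as na.omit()
test.df[complete.cases(test.df), ]
\end{Sinput}
\end{Schunk}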
\subsection{Arrays --- some further details}
\marginnote[12pt]{Tables, which will be the subject of the next subsection, have
a very similar structure to arrays.}
A matrix is a two-dimensional array. More generally, arrays can
have an arbitrary number of dimensions.
\subsection*{Removal of the dimension attribute}
The dimension attribute of a matrix or array can be changed or
removed, thus:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
travelvec <- as.matrix(travelbooks[, 1:4])
dim(travelvec) <- NULL # Columns of travelmat are stacked into one
# long vector
travelvec
\end{Sinput}
\begin{Soutput}
[1] 1.3 3.9 1.2 2.0 0.6 1.5 11.3 13.1 20.0 21.1
[11] 25.8 13.1 23.9 18.7 27.6 28.5 36.0 23.4 250.0 840.0
[21] 550.0 1360.0 640.0 420.0
\end{Soutput}
\begin{Sinput}
# as(travelmat, "vector") is however preferable
\end{Sinput}
\end{Schunk}
\end{fullwidth}
Note again that the \code{\$} notation, used with data frames
and other list objects to reference the contents of list elements, is
not relevant to matrices.
\subsection{Lists}\label{sec:df-lists}
A list \marginnote{Elements of lists are themselves lists.
Distinguish \margtt{rCBR[4]}, which is a sub-list
and therefore a list, from \margtt{rCBR[[4]]} which extracts the
contents of the fourth list element.} is a collection of arbitrary
objects. As noted above, a data frame is a specialized form of
list. Consider for example the list
\begin{Schunk}
\begin{Sinput}
rCBR <- list(society="ssai", branch="Canberra",
presenter="John",
tutors=c("Emma", "Chris", "Frank"))
\end{Sinput}
\end{Schunk}
First, extract list length and list names:
\begin{Schunk}
\begin{Sinput}
length(rCBR) # rCBR has 4 elements
names(rCBR)
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Soutput}
[1] 4
\end{Soutput}
\begin{Soutput}
[1] "society" "branch" "presenter" "tutors"
\end{Soutput}
\end{Schunk}
The following extracts the 4th list element:
\begin{Schunk}
\begin{Sinput}
rCBR[4] # Also a list, name is 'tutors'
\end{Sinput}
\begin{Soutput}
$tutors
[1] "Emma" "Chris" "Frank"
\end{Soutput}
\end{Schunk}
Alternative ways to extract the contents of the 4$^{th}$ element are:
\marginnote[12pt]{List elements can be accessed by name.
Thus, to extract the contents of the 4th list element, alternatives
to \margtt{rCBR[[4]]} are \margtt{rCBR[["tutors"]]} or
\margtt{rCBR\$tutors}.}
\begin{Schunk}
\begin{Sinput}
rCBR[[4]] # Contents of 4th list element
\end{Sinput}
\begin{Soutput}
[1] "Emma" "Chris" "Frank"
\end{Soutput}
\begin{Sinput}
rCBR$tutors # Equivalent to rCBR[["tutors"]]
\end{Sinput}
\begin{Soutput}
[1] "Emma" "Chris" "Frank"
\end{Soutput}
\end{Schunk}
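The same notation can be used on the left of an assignment, in order
to add or replace list elements. The following is a sketch only; the
list \code{lst} and its contents are made up for the illustration.
\begin{Schunk}
\begin{Sinput}
## Made-up list, for illustration only
lst <- list(id=101, tags=c("a","b","c"))
lst$note <- "added later"      # Add a new element by name
names(lst)                     # "id" "tags" "note"
sapply(lst, length)            # Element lengths: 1 3 1
lst[["tags"]][2]               # Second value in 'tags': "b"
\end{Sinput}
\end{Schunk}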
\subsection*{Model objects are lists}
As noted in Subsection \ref{ss:modobj}, the various R modeling
\marginnote{Recall again, also, that data frames are a specialized
form of list, with the restriction that all columns must have
the same length.} functions all return their own particular type of
model object, either as a list or as an S4 object.
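
As an illustration, the following sketch fits a straight line to the
\code{cars} dataset (\pkg{datasets} package), chosen here only
because it is always available, and checks that the \code{lm} object
can be treated as a list.
\begin{Schunk}
\begin{Sinput}
fit <- lm(dist ~ speed, data=cars)   # cars is in the datasets package
class(fit)                  # "lm"
is.list(fit)                # TRUE
names(fit)[1:2]             # "coefficients" "residuals"
## The extractor function returns the stored list element
identical(coef(fit), fit$coefficients)   # TRUE
\end{Sinput}
\end{Schunk}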
\section{Functions}
\noindent
\fbox{\parbox{\textwidth}{
\textbf{Different Kinds of Functions:}\\[4pt]
\begin{tabular}{ll}
Generic & The 'class' of the function argument determines the\\
& action taken. E.g., \code{print()}, \code{plot()}, \code{summary()}\\[6pt]
Modeling & For example, \code{lm()} fits \textit{linear} models.\\
& Output may be stored in a model object.\\[6pt]
Extractor & These extract information from model objects.\\
& Examples include \code{summary()}, \code{coef()},\\
& \code{resid()}, and \code{fitted()}\\[6pt]
User & Use, e.g., to automate and document computations\\[6pt]
Anonymous & These are user functions that are defined at the\\
& point of use, and do not need a name.
\end{tabular}
}}
\vspace*{6pt}
The above list is intended to include some of the most important
types of function. These categories may overlap.
The language that R implements has many of the features
of\marginnote{Functions for working with dates are discussed in
Section \ref{ss:dates} immediately following.} a
functional language. Functions have accordingly featured throughout
the earlier discussion. This section notes functions that are
commonly useful.
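
The following sketch illustrates three of the kinds of function
listed above. The function name \code{rangewidth} and the data
values are made up for the illustration.
\begin{Schunk}
\begin{Sinput}
## A user function; the name 'rangewidth' is made up for this example
rangewidth <- function(x, na.rm=TRUE){
  diff(range(x, na.rm=na.rm))
}
rangewidth(c(3, 11, NA, 15))          # 12
## An anonymous function, defined at the point of use
sapply(1:3, function(k) k^2)          # 1 4 9
## A generic function: the action depends on the class of the argument
summary(c(1, 2, 3))                   # Numeric summary statistics
summary(factor(c("a", "b", "a")))     # Counts for each level
\end{Sinput}
\end{Schunk}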
\subsection{Built-In Functions}\label{ss:built-in}
\subsection*{Common useful functions}
\begin{fullwidth}
\begin{verbcode}
## Use with any R object as argument
print() # Prints a single R object
length() # Number of elements in a vector or of a list
## Concatenate and print R objects [does less coercion than print()]
cat() # Prints multiple objects, one after the other
## Use with a numeric vector argument
mean() # If argument has NA elements, may want na.rm=TRUE
median() # As for mean(), may want na.rm=TRUE
range() # As for mean(), may want na.rm=TRUE
unique() # Gives the vector of distinct values
diff() # Vector of first differences
# N. B. diff(x) has one less element than x
cumsum() # Cumulative sums; cf. also cumprod()
## Use with an atomic vector object
sort() # Sort elements into order, but omitting NAs
order() # x[order(x)] orders elements of x, with NAs last
rev() # reverse the order of vector elements
any() # Returns TRUE if any element of a logical vector is TRUE
# e.g. any(is.na(x)) checks for missing values
as() # Coerce argument 1 to class given by argument 2
# e.g. as(1:6, "factor")
is() # Is argument 1 of class given by argument 2?
# is(1:6, "factor") returns FALSE
# is(TRUE, "logical") returns TRUE
is.na() # Returns TRUE if the argument is an NA
## Information on an R object
str() # Information on an R object
args() # Information on arguments to a function
mode() # Gives the storage mode of an R object
# (logical, numeric, character, . . ., list)
## Create a vector
numeric() # numeric(5) creates a numeric vector, length 5,
# all elements 0.
# numeric(0) (length 0) is sometimes useful.
character() # Create character vector; cf. also logical()
\end{verbcode}
\end{fullwidth}
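A few of these functions are shown in action in the sketch that
follows, using the values of \code{x} from Subsection
\ref{ss:vector}. The argument to \code{unique()} is made up for the
illustration.
\begin{Schunk}
\begin{Sinput}
x <- c(3, 11, 8, 15, 12)
diff(x)        # 8 -3 7 -3
cumsum(x)      # 3 14 22 37 49
order(x)       # 1 3 2 5 4
x[order(x)]    # 3 8 11 12 15, the same as sort(x)
rev(x)         # 12 15 8 11 3
unique(c(2, 3, 2, 5, 3))   # 2 3 5
\end{Sinput}
\end{Schunk}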
The function \code{mean()}, and a number of other functions, take the
argument \code{na.rm=TRUE}; i.e., remove \code{NA}s, then proceed with
the calculation. For example
\begin{Schunk}
\begin{Sinput}
mean(c(1, NA, 3, 0, NA), na.rm=TRUE)
\end{Sinput}
\begin{Soutput}
[1] 1.333
\end{Soutput}
\end{Schunk}
Note that the function \code{as()} has, at present, no method for
coercing a matrix to a data frame. For this, use
\code{as.data.frame()}.
\subsection*{Functions in different packages with the same name}
For example, as well as the \pkg{lattice} function \code{dotplot()},
the \pkg{graphics} package has a defunct function \code{dotplot()}.
To be sure of getting the \pkg{lattice} function \code{dotplot()},
refer to it as \code{lattice::dotplot()}.
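
The same notation works for any attached or installed package. The
following sketch uses functions from the \pkg{stats} and \pkg{base}
packages, chosen only because they are always available.
\begin{Schunk}
\begin{Sinput}
stats::median(c(3, 1, 2))   # median() from the stats package: 2
base::mean(c(1, 2, 3))      # mean() from the base package: 2
\end{Sinput}
\end{Schunk}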
\subsection{Functions for data summary and/or manipulation}
\marginnote[-12pt]{For
data manipulation, note:
\begin{itemizz}
\item[-] the apply family of functions (Subsection \ref{ss:apply}).
\item[-] data manipulation functions in the \textit{reshape2} and
\textit{plyr} packages (Chapter \ref{ch:manip}).
\end{itemizz}}
\subsection{Functions for creating and working with tables}\label{sec:tab}
\subsection{Tables of Counts}
Use either \code{table()} or \code{xtabs()} to make a table of
counts. Use \code{xtabs()} for cross-tabulation, i.e., to determine
totals of numeric values for each table category.
\subsection*{The \code{table()} function}\label{ss:table}
For use of \code{table()}, specify one vector of values (often a
factor) for each table margin that is required. For example:
\begin{Schunk}
\begin{Sinput}
library(DAAG) # possum is from DAAG
with(possum, table(Pop, sex))
\end{Sinput}
\begin{Soutput}
sex
Pop f m
Vic 24 22
other 19 39
\end{Soutput}
\end{Schunk}
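Individual cells and margins can be extracted from the table that
results. The sketch below uses short made-up vectors \code{pop} and
\code{sex}, not the \code{possum} data.
\begin{Schunk}
\begin{Sinput}
## Made-up vectors, for illustration only
pop <- c("Vic","Vic","other","other","other")
sex <- c("f","m","m","f","m")
tab <- table(pop, sex)
tab["other", "m"]         # Count for one cell: 2
margin.table(tab, 1)      # Row totals: 2 for Vic, 3 for other
addmargins(tab)           # Table with row and column totals added
\end{Sinput}
\end{Schunk}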
\subsection*{NAs in tables}
By default, \code{table()} ignores \code{NA}s. To show information on
\code{NA}s, specify \code{exclude=NULL}, thus:
\begin{Schunk}
\begin{Sinput}
library(DAAG)
table(nswdemo$re74==0, exclude=NULL)
\end{Sinput}
\begin{Soutput}
FALSE TRUE <NA>
119 326 277
\end{Soutput}
\end{Schunk}
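The argument \code{useNA} is an alternative way to request a count
of \code{NA}s. The following sketch uses a made-up vector \code{y}.
\begin{Schunk}
\begin{Sinput}
y <- c(0, 5, NA, 0, NA, 2)
table(y == 0)                   # NAs are ignored: FALSE 2, TRUE 2
table(y == 0, useNA="ifany")    # Adds an <NA> count of 2
\end{Sinput}
\end{Schunk}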
\subsection*{The \code{xtabs()} function}
This more flexible alternative to \code{table()} uses a table
formula to specify the margins of the table:
\begin{Schunk}
\begin{Sinput}
xtabs(~ Pop+sex, data=possum)
\end{Sinput}
\begin{Soutput}
sex
Pop f m
Vic 24 22
other 19 39
\end{Soutput}
\end{Schunk}
\marginnote[12pt]{Manipulations with data frames are in general
conceptually simpler than manipulations with tables. For tables
that are not unreasonably large, it is in general a good strategy
to first convert the table to a data frame and make that the
starting point for further calculations.}
A column of frequencies can be specified on the left hand side of the
table formula. In order to demonstrate this, the three-way table
\code{UCBAdmissions} (\pkg{datasets} package) will be converted into
its data frame equivalent. Margins in the table become columns in
the data frame:
\begin{Schunk}
\begin{Sinput}
UCBdf <- as.data.frame.table(UCBAdmissions)
head(UCBdf, n=3)
\end{Sinput}
\begin{Soutput}
Admit Gender Dept Freq
1 Admitted Male A 512
2 Rejected Male A 313
3 Admitted Female A 89
\end{Soutput}
\end{Schunk}
The following then forms a table of total admissions and rejections
in each department:
\begin{Schunk}
\begin{Sinput}
xtabs(Freq ~ Admit+Dept, data=UCBdf)
\end{Sinput}
\begin{Soutput}
Dept
Admit A B C D E F
Admitted 601 370 322 269 147 46
Rejected 332 215 596 523 437 668
\end{Soutput}
\end{Schunk}
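The same steps can be tried on a much smaller example. In the
following sketch, the data frame \code{counts} and its values are
made up for the illustration.
\begin{Schunk}
\begin{Sinput}
## Made-up counts, for illustration only
counts <- data.frame(grp=c("a","a","b","b"),
                     outcome=c("yes","no","yes","no"),
                     n=c(5, 3, 2, 6))
tab <- xtabs(n ~ grp + outcome, data=counts)
tab["a", "yes"]                # 5
xtabs(n ~ grp, data=counts)    # Totals over outcome: a 8, b 8
as.data.frame(tab)             # Back to a data frame, with column Freq
\end{Sinput}
\end{Schunk}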
\subsection*{Information on data objects}
The function \code{str()} gives basic information on the data object that
is given as argument.
\begin{Schunk}
\begin{Sinput}
library(DAAG)
str(possumsites)
\end{Sinput}
\begin{Soutput}
'data.frame': 7 obs. of 3 variables:
$ Longitude: num 146 149 151 153 153 ...
$ Latitude : num -37.5 -37.6 -32.1 -28.6 -28.6 ...
$ altitude : num 800 300 300 400 200 400 600
\end{Soutput}
\end{Schunk}
\subsection{Utility functions}
\begin{fullwidth}
\begin{verbcode}
dir() # List files in the working or other specified directory