% !Rnw root = learnR.Rnw
\input{preamble}
\section{$^*$Data Input from a File}\label{sec:entry}
\marginnote[12pt]{Most data input functions allow import from a file that is
  on the web --- give the URL when specifying the file. Another possibility
  is to copy the file, or a relevant part of it, to the clipboard. For
  reading from and writing to the clipboard under Windows, see
  \url{http://bit.ly/2sxyOhG}. For MacOS, see \url{http://bit.ly/2t1nX0I}.}
In base R, and in R packages, there is a wide variety of functions
that can be used for data input. These include functions that are
aimed at specific specialized types of data.
Use of the RStudio menu is recommended. It is fast, and allows
a visual check of the data layout before input proceeds. If input
options are incorrectly set, they can be changed as necessary
before proceeding. The code used for input is shown. In the rare
cases where input options are required for which the menu makes
no provision, the command line code can be edited as needed
before it is run. Refer back to Subsection \ref{ss:readEtc} for
further details. Note that input is in all cases to a tibble,
which is a specialized form of data frame: character columns are
not automatically converted to factors, column names are not
converted into valid R identifiers, and row names are not set.
These and other differences between tibbles and data frames
matter for subsequent processing, and users need to note them.
\marginnote[12pt]{Scatterplot matrices are helpful both for
checking variable ranges and for identifying impossible or
unusual combinations of variable values.}
Once data have been entered, it is important to check that the
values appear sensible. At a minimum, check the ranges of variable
values, and the type of each input column (numeric or
factor, or \ldots).
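A minimal sketch of such checks, assuming the data have been read
into a data frame or tibble with the hypothetical name \code{dat}:
\begin{Schunk}
\begin{Sinput}
str(dat)            # Column names, types, and first few values
summary(dat)        # Ranges for numeric columns; counts for factors
sapply(dat, class)  # The class of each column
\end{Sinput}
\end{Schunk}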
\subsection{Input using the \code{read.table()} family of functions}
\marginnote[12pt]{Non-default option settings can, however, severely
  slow data input from very large files.}
There are several aliases for \code{read.table()} that have different
settings for the input defaults. Note in particular \code{read.csv()},
for reading in comma-delimited {\bf .csv} files such as can be output
from Excel spreadsheets. See \code{help(read.table)}. Recall that
\marginnote{For factor columns check that the levels are as
expected.}
\begin{itemizz}
\item[-] Character vectors are by default converted into factors. To
prevent such type conversions, specify
\code{stringsAsFactors=FALSE}.
\item[-] Specify \code{header=TRUE}\sidenote{By default, if the
  first row of the file has one fewer field than later rows, it is
  taken to be a header row. Otherwise, it is taken as
  the first row of data.} to indicate that the first row
  of input has column names. Use \code{header=FALSE}
  to indicate that it holds data.\newline \noindent [If names are not
  given, columns have default names \code{V1}, \code{V2}, \ldots.]
\item[-] Use the parameter \code{row.names}, specifying a column
  number, to nominate a column whose entries will provide the row
  names (as in the example below).
\end{itemizz}
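By way of illustration, the following minimal sketch, with a
hypothetical file name, pulls the above settings together:
\begin{Schunk}
\begin{Sinput}
## Hypothetical comma-delimited file; column names in the first row
possum <- read.csv("possum.csv", header=TRUE,
                   stringsAsFactors=FALSE, row.names=1)
\end{Sinput}
\end{Schunk}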
\subsection*{Issues that may complicate input}
\marginnote{Note also \code{count.fields()}, which counts the number
  of fields in each record --- though the count may differ from the
  fields as detected by the input function.}
Where data input fails, consider using \code{read.table()} with the
argument \code{fill=TRUE}, and carefully check the input data frame.
Blank fields will be implicitly added, as needed, so that all records
have an equal number of identified fields.
Carefully check the parameter settings\sidenote{For
  text with embedded single quotes, set \code{quote = ""}. For text
  with embedded \#'s, change \code{comment.char} suitably.} for the
version of the input command that is in use. It may be necessary to
change the field separator (specify \code{sep}), and/or the missing
value character(s) (specify \code{na.strings}). Embedded quotes and
comment characters can be a source of difficulty (\code{\#} is the
default comment character; anything that follows it on the same line
is ignored).
\marginnote{Among other possibilities, there may be a non-default
missing value symbol (e.g., \margtt{"."}), but without using
\code{na.strings} to indicate this.}
Where a column that should be numeric is converted to a factor, this is
an indication that it has one or more fields that, as numbers, would
be illegal. For example, a ``1'' (one) may have been mistyped as an ``l''
(ell), or a ``0'' (zero) as an ``O'' (oh).
Note options that limit the number of input rows.
For \code{read.table()} and its aliases, set \code{nrows}. For
functions from the \pkg{readr} package, set \code{n\_max}.
For \code{scan()}, discussed in the next subsection, set
\code{nlines}. All these functions accept the argument \code{skip},
which sets the number of lines to skip before input starts.
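For example, to check the first few records before committing to a
full read (the file name is hypothetical):
\begin{Schunk}
\begin{Sinput}
## Read the header row plus the first 5 data rows only
head5 <- read.table("survey.txt", header=TRUE, nrows=5)
## Skip the header row and the first 100 data rows; read the next 50
chunk <- read.table("survey.txt", skip=101, nrows=50)
\end{Sinput}
\end{Schunk}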
\subsection{$^*$The use of \code{scan()} for flexible data input}
Data records may, for example, be spread over several rows. There is
no way for \code{read.table()} to handle this.
The following code demonstrates the use of \code{scan()} to read in
the file \textbf{molclock1.txt}. To place this file in your working
directory, attach the \textit{DAAG} package and type
\code{datafile("molclock1")}.
\marginnote[12pt]{There are two calls to \margtt{scan()}, each taking
  information from the file \textbf{molclock1.txt}. The first, with
  \margtt{nlines=1} and \margtt{what=""}, inputs the column names. The
  second, with \margtt{skip=1} and
  \margtt{what=c(list(""), rep(list(1),5))}, inputs
  the several rows of data.}
\begin{Schunk}
\begin{Sinput}
colnam <- scan("molclock1.txt", nlines=1, what="")
molclock <- scan("molclock1.txt", skip=1,
                 what=c(list(""), rep(list(1),5)))
molclock <- data.frame(molclock, row.names=1)
# Column 1 supplies row names
names(molclock) <- colnam
\end{Sinput}
\end{Schunk}
The
\marginnote{
For repeated use with data files that have a similar format, consider
putting the code into a function, with the \code{what} list as an
argument.}
\code{what} parameter should be a list, with one list element
for each field in a record. The \code{""} in the first list element
indicates that the field is to be input as character. The remaining
five list elements are set to 1, indicating numeric data.
Where records extend over several lines, set \code{multi.line=TRUE}.
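Following the suggestion in the margin note above, a minimal sketch
of such a function (the name \code{readByField()} is hypothetical)
might be:
\begin{Schunk}
\begin{Sinput}
readByField <- function(file, what, row.names=NULL){
    ## First line holds column names; remaining lines hold the data
    colnam <- scan(file, nlines=1, what="")
    dat <- data.frame(scan(file, skip=1, what=what),
                      row.names=row.names)
    names(dat) <- colnam
    dat
}
## For the molclock1.txt file from above:
## molclock <- readByField("molclock1.txt",
##                what=c(list(""), rep(list(1),5)), row.names=1)
\end{Sinput}
\end{Schunk}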
\subsection{The \pkg{memisc} package: input from SPSS and Stata}
\marginnote{Note also the \pkg{haven} package, mentioned above,
and the \pkg{foreign} package. The \pkg{foreign} package has
functions that allow input of various
types of files from Epi Info, Minitab, S-PLUS, SAS, SPSS, Stata,
Systat and Octave. There are abilities for reading and writing some
dBase files. For further information, see the R Data Import/Export
manual.}
The \pkg{memisc} package has effective abilities for
examining and inputting data from various SPSS and Stata formats,
including {\bf .sav}, {\bf .por}, and Stata {\bf .dta} data types.
It allows users to check the contents of the
columns of the dataset before importing part or all of the file.
An initial step is to use an importer function to create an {\em
  importer} object. Currently, {\em importer} functions are:
\code{spss.fixed.file()}, \code{spss.portable.file()} ({\bf .por}
files), \code{spss.system.file()} ({\bf .sav} files), and
\code{Stata.file()} ({\bf .dta} files). The importer object has
information about the variables, including variable labels, value
labels, missing values, and, for an SPSS `fixed' file, the columns
that they occupy. Additionally, it has information from further
processing of the file header and/or the file proper that is
needed in preparation for importing the file.
Functions that can be used with an importer object include:
\begin{itemizz}
\item[-] \code{description()}: column header information;
\item[-] \code{codebook()}: detailed information on each column;
\item[-] \code{as.data.set()}: bring the data into R, as a `data.set' object;
\item[-] \code{subset()}: bring a subset of the data into R, as a `data.set' object.
\end{itemizz}
\marginnote{Use \margtt{as.data.frame()} to coerce data.set objects
into data frames. Information that is not readily retainable in a
data frame format may be lost in the process.}
The functions
\code{as.data.set()} and \code{subset()} yield `data.set' objects.
These have structure that is additional to that of data frames. Most
functions that are available for use with data frames can be used with
data.set class objects.
The vignette \txtt{anes48} that comes with the \pkg{memisc} package
illustrates the use of the above abilities.
\subsection*{Example}
\marginnote{To substitute your own file, store the path to the
  file in \code{path2file}.}
A compressed version of the file \textbf{NES1948.POR} (an SPSS `portable' dataset)
is stored as part of the \pkg{memisc} installation. The following
does the unzipping, places the file in a temporary directory,
and stores the path to the file in the text string \code{path2file}:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
library(memisc)
## Unzip; return path to "NES1948.POR"
path2file <- unzip(system.file("anes/NES1948.ZIP", package="memisc"),
                   "NES1948.POR", exdir=tempfile())
\end{Sinput}
\end{Schunk}
\end{fullwidth}
Now create an `importer' object, and get summary information:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
# Get information about the columns in the file
nes1948imp <- spss.portable.file(path2file)
show(nes1948imp)
\end{Sinput}
\end{Schunk}
\footnotesize
\begin{Schunk}
\begin{Soutput}
SPSS portable file '/var/folders/00/_kpyywm16hnbs2c0dvlf0mwr0000gq/T//Rtmpx1DzSU/file39656cd5edd1/NES1948.POR'
with 67 variables and 662 observations
\end{Soutput}
\end{Schunk}
\end{fullwidth}
There will be a large number of messages that draw attention to
duplicate labels.
\marginnote[12pt]{Use \margtt{labels()} to change labels, or
  \margtt{missing.values()} to set missing value filters, prior to
  data import.} Before importing, it is worth checking details of
what is in the file. The following, which restricts attention to
columns 4 to 9 only, indicates the nature of the information that is
provided.
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
## Get details about the columns (here, columns 4 to 9 only)
description(nes1948imp)[4:9]
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Soutput}
$v480002
[1] "INTERVIEW NUMBER"
$v480003
[1] "POP CLASSIFICATION"
$v480004
[1] "CODER"
$v480005
[1] "NUMBER OF CALLS TO R"
$v480006
[1] "R REMEMBER PREVIOUS INT"
$v480007
[1] "INTR INTERVIEW THIS R"
\end{Soutput}
\end{Schunk}
\end{fullwidth}
\noindent
As there are 67 columns in this instance, it may make sense to look
at the columns in groups of perhaps 10 at a time.
More detailed information is available by using the R function
\code{codebook()}.
The following gives the codebook information for column 5:
\marginnote[12pt]{This is more interesting than what appears for columns 1 to 4.}
\begin{Schunk}
\begin{Sinput}
## Get codebook information for column 5
codebook(nes1948imp[, 5])
\end{Sinput}
\begin{Soutput}
======================================================
nes1948imp[, 5] 'POP CLASSIFICATION'
------------------------------------------------------
Storage mode: double
Measurement: nominal
Values and labels N Percent
1 'METROPOLITAN AREA' 182 27.5 27.5
2 'TOWN OR CITY' 354 53.5 53.5
3 'OPEN COUNTRY' 126 19.0 19.0
\end{Soutput}
\end{Schunk}
The following imports a subset of just four of the columns:
\begin{Schunk}
\begin{Sinput}
vote.socdem.48 <- subset(nes1948imp,
                         select=c(v480018, v480029,
                                  v480030, v480045))
\end{Sinput}
\end{Schunk}
To import all columns, do:
\begin{Schunk}
\begin{Sinput}
socdem.48 <- as.data.set(nes1948imp)
\end{Sinput}
\end{Schunk}
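As the margin note above indicates, the resulting data.set object
can then be coerced to a data frame where that is needed for
subsequent work:
\begin{Schunk}
\begin{Sinput}
## Coerce to a data frame; some attribute information may be lost
socdem48df <- as.data.frame(socdem.48)
\end{Sinput}
\end{Schunk}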
\begin{marginfigure}[24pt]
Look also at the vignette:\\[-3pt]
\begin{Schunk}
\begin{Sinput}
vignette("anes48")
\end{Sinput}
\end{Schunk}
\end{marginfigure}
For more detailed information, type:
\begin{Schunk}
\begin{Sinput}
## Go to help page for 'importers'
help(spss.portable.file)
\end{Sinput}
\end{Schunk}
\section{$^*$Input of Data from a Web Page}
\marginnote{The web page:\\
{\footnotesize \url{http://www.visualizing.org/data/browse/}}
has an extensive list of web data sources. The World Bank Development
Indicators database will feature prominently in the discussion below.}
This section notes some of the alternative ways in which data that
are available from the web can be input into R. The first subsection
below comments on the use of a point-and-click interface to identify
and download data.
A point-and-click interface is often convenient for an initial look.
Rather than downloading the data and then inputting it to R, it may be
better to input it directly from the web page. Direct input into R
has the advantage that the R commands that are used document exactly
what has been done.\sidenote{This may be especially important if a data
download will be repeated from time to time with updated data, or if
data are brought together from a number of different files, or if a
subset is taken from a larger database.}
Note that the functions \code{read.table()}, \code{read.csv()},
\code{scan()}, and other such functions, are able to read data
directly from a file that is available on the web. There is a
limited ability to input only part of a file.
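For example (the URL is hypothetical; substitute the address of an
actual {\bf .csv} file):
\begin{Schunk}
\begin{Sinput}
## Hypothetical web address for a .csv file
webdat <- read.csv("http://example.com/path/data.csv")
## Input the first 100 data rows only
webtop <- read.csv("http://example.com/path/data.csv", nrows=100)
\end{Sinput}
\end{Schunk}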
Suppose however that the demand is to download data for several of a
large number of variables, for a specified range of years, and for a
specified geographical area or set of countries. \marginnote{GML, or
Geography Markup Language, is based on XML.} A number of data
archives now offer data in one or more of several markup formats that
assist selective access. Formats include XML, GML, JSON and JSONP.
\paragraph{A browser interface to World Bank data:}
The web page
\url{http://databank.worldbank.org/data/home.aspx}\sidenote[][-0.5cm]{Click
on \underline{COUNTRY} to modify the choice of countries. To expand
(to 246) countries beyond the 20 that appear by
default, click on \underline{Add more country}. Click on
\underline{SERIES} and \underline{TIME} to modify and/or expand those
choices. Click on \underline{Apply
Changes} to set the choices in place.} gives a point-and-click
interface to, among other possibilities, the World Bank development
indicator database. Clicking on any of 20 country names that are
displayed shows data for these countries for 1991-2010, for 54 of the
1262 series that were available at last check. Depending on the
series, data may be available back to 1964. Once selections have been
made, click on \underline{DOWNLOAD} to download the data. For input
into R, downloading as a {\bf .csv} file is convenient.
Manipulation of these data into a form suitable for a motion
chart display was demonstrated in Subsection \ref{ss:reshape2}.
\paragraph{Australian Bureau of Meteorology data:}
Graphs of area-weighted time series of rainfall and temperature
measures, for various regions of Australia, can be accessed from the
Australian Bureau of Meteorology web page
\url{http://www.bom.gov.au/cgi-bin/climate/change/timeseries.cgidemo}.
Click on \underline{Raw data set}\sidenote{To copy the web address, right click on \underline{Raw
data set} and click on \underline{Copy Link Location} (Firefox) or
\underline{Copy Link Address} (Google Chrome) or \underline{Copy
Link} (Safari).} to download the raw data.
Once the web path to the file that has the data has been found,
the data can alternatively be input directly from the web.
The following gets the annual total rainfall in Eastern Australia,
from 1910 through to the present:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
webroot <- "http://www.bom.gov.au/web01/ncc/www/cli_chg/timeseries/"
rpath <- paste0(webroot, "rain/0112/eaus/", "latest.txt")
totrain <- read.table(rpath)
\end{Sinput}
\end{Schunk}
\end{fullwidth}
\paragraph{A function to download multiple data series:}
The following accesses the latest annual data, for total rainfall
and average temperature, from the command line:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
getbom <- function(suffix=c("AVt","Rain"), loc="eaus"){
    webroot <- "http://www.bom.gov.au/web01/ncc/www/cli_chg/timeseries/"
    midfix <- switch(suffix[1], AVt="tmean/0112/", Rain="rain/0112/")
    webpage <- paste(webroot, midfix, loc, "/latest.txt", sep="")
    print(webpage)
    read.table(webpage)$V2
}
##
## Example of use
offt <- c(seaus=14.7, saus=18.6, eaus=20.5, naus=24.7, swaus=16.3,
          qld=23.2, nsw=17.3, nt=25.2, sa=19.5, tas=10.4, vic=14.1,
          wa=22.5, mdb=17.7, aus=21.8)
z <- list()
for(loc in names(offt)) z[[loc]] <- getbom(suffix="Rain", loc=loc)
bomRain <- as.data.frame(z)
\end{Sinput}
\end{Schunk}
\end{fullwidth}
%$
\noindent
The function can be re-run whenever data that include the most
recent year are required.
\subsection*{$^*$Extraction of data from tables in web pages}
The function \code{readHTMLTable()}, from the \pkg{XML} package,
is very useful for this purpose. It does not work, currently at
least, for pages that are accessed using https:.
\paragraph{Historical air crash data:}
The web page \url{http://www.planecrashinfo.com/database.htm}
has links to tables of aviation accidents, with one table for
each year. The table for years up to and including 1920 is on
the web page \url{http://www.planecrashinfo.com/1920/1920.htm},
that for 1921 on the page \url{http://www.planecrashinfo.com/1921/1921.htm},
and so on through until the most recent year. The following code
inputs the table for years up to and including 1920:
\begin{Schunk}
\begin{Sinput}
library(XML)
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Sinput}
url <- "http://www.planecrashinfo.com/1920/1920.htm"
to1920 <- readHTMLTable(url, header=TRUE)
to1920 <- as.data.frame(to1920)
\end{Sinput}
\end{Schunk}
The following inputs data from 2010 through to 2014:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
url <- paste0("http://www.planecrashinfo.com/",
2010:2014, "/", 2010:2014, ".htm")
tab <- sapply(url, function(x)readHTMLTable(x, header=TRUE))
\end{Sinput}
\end{Schunk}
\end{fullwidth}
{\small
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
## The following less efficient alternative code spells the steps out in more detail
## tab <- vector('list', 5)
## k <- 0
## for(yr in 2010:2014){
## k <- k+1
## url <- paste0("http://www.planecrashinfo.com/", yr, "/", yr, ".htm")
## tab[[k]] <- as.data.frame(readHTMLTable(url, header=TRUE))
## }
\end{Sinput}
\end{Schunk}
\end{fullwidth}
}
Now combine all the tables into one:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
## Combine the separate tables into one
airAccs <- do.call('rbind', tab)
names(airAccs) <- c("Date", "Location/Operator",
"AircraftType/Registration", "Fatalities")
airAccs$Date <- as.Date(airAccs$Date, format="%d %b %Y")
\end{Sinput}
\end{Schunk}
\end{fullwidth}
The help page \margtt{help(readHTMLTable)} gives examples that
demonstrate other possibilities.
\subsection{$^*$Embedded markup --- XML and alternatives}\label{ss:markup}
Data are now widely available, from a number of different web
sites, in one or more of several markup formats. Markup code,
designed to make the file self-describing, is included with the data.
The user does not need to supply details of the data structure to the
software reading the data.
\marginnote[12pt]{For details of
markup use, as they relate to the World Bank Development Indicators
database, see {\small \url{http://data.worldbank.org/node/11}}.}
Markup languages that may be used include XML, GML, JSON and JSONP.
Queries are built into the web address.
Alternatives to setting up the query directly may be:
\begin{itemizz}
\item[-] Use a function such as \code{fromJSON()} in the \pkg{RJSONIO}
  package to set up the link and download the data (see the sketch
  following this list);
\item[-] In a few cases, functions have been provided in R packages
that assist selection and downloading of data.
For the World Bank Development Indicators database, note \code{WDI()}
and other functions in the \pkg{WDI} package.
\end{itemizz}
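The following is a minimal sketch of the first approach, with a
hypothetical JSON endpoint:
\begin{Schunk}
\begin{Sinput}
library(RJSONIO)
## Hypothetical query URL; substitute an actual JSON endpoint
u <- "http://example.com/api/data?format=json"
txt <- paste(readLines(u, warn=FALSE), collapse="")
dat <- fromJSON(txt)  # A list that mirrors the JSON structure
\end{Sinput}
\end{Schunk}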
\paragraph{Download of NZ earthquake data:}
\marginnote[12pt]{WFS is Web Feature Service. OGC is Open Geospatial Consortium.
  GML, the Geography Markup Language, is based on XML.}
Here the GML markup conventions are used, as defined by
the WFS OGC standard.
Details can be found on the website
\url{http://info.geonet.org.nz/display/appdata/Earthquake+Web+Feature+Service}.
The following
\marginnote{The {\bf .csv} format is one of several formats in which
data can be retrieved.}
extracts earthquake data from the New Zealand GeoNet
website. Data are for 1 August 2009 onwards, through until the
current date, for earthquakes of magnitude greater than 4.5.
\begin{Schunk}
\begin{Sinput}
## Input data from internet
from <- paste(c("http://wfs-beta.geonet.org.nz/",
                "geoserver/geonet/ows?service=WFS",
                "&version=1.0.0", "&request=GetFeature",
                "&typeName=geonet:quake", "&outputFormat=csv",
                "&cql_filter=origintime>='2009-08-01'",
                "+AND+magnitude>4.5"),
              collapse="")
quakes <- read.csv(from)
z <- strsplit(as.character(quakes$origintime), split="T")
quakes$Date <- as.Date(sapply(z, function(x)x[1]))
quakes$Time <- sapply(z, function(x)x[2])
\end{Sinput}
\end{Schunk}
\paragraph{World Bank data --- using the \pkg{WDI} package:}
Use the function \code{WDIsearch()} to search for indicators. Thus,
to search for indicators with `co2' or `CO2' in their name, enter
\code{WDIsearch('co2')}. The first 6 rows (out of 38) from
such a search, with the name details in the second column truncated
to 66 characters, are:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
library(WDI)
co2Inds <- WDIsearch('co2')[1:6,]
print(cbind(co2Inds[,1], substring(co2Inds[,2],1,66)),
quote=FALSE)
\end{Sinput}
\begin{Soutput}
[,1] [,2]
[1,] EN.CO2.OTHX.ZS CO2 emissions from other sectors, excluding residential buildings
[2,] EN.CO2.MANF.ZS CO2 emissions from manufacturing industries and construction (% of
[3,] EN.CO2.ETOT.ZS CO2 emissions from electricity and heat production, total (% of to
[4,] EN.CO2.BLDG.ZS CO2 emissions from residential buildings and commercial and public
[5,] EN.CLC.GHGR.MT.CE GHG net emissions/removals by LUCF (Mt of CO2 equivalent)
[6,] EN.ATM.SF6G.KT.CE SF6 gas emissions (thousand metric tons of CO2 equivalent)
\end{Soutput}
\end{Schunk}
\end{fullwidth}
Use the function \code{WDI()} to input indicator data, thus:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
library(WDI)
inds <- c('SP.DYN.TFRT.IN', 'SP.DYN.LE00.IN', 'SP.POP.TOTL',
          'NY.GDP.PCAP.CD', 'SE.ADT.1524.LT.FE.ZS')
indnams <- c("fertility.rate", "life.expectancy", "population",
             "GDP.per.capita.Current.USD", "15.to.25.yr.female.literacy")
names(inds) <- indnams
wdiData <- WDI(country="all", indicator=inds, start=1960, end=2013,
               extra=TRUE)
colnum <- match(inds, names(wdiData))
names(wdiData)[colnum] <- indnams
## Omit rows where "region" is "Aggregates"
WorldBank <- droplevels(subset(wdiData, !region %in% "Aggregates"))
\end{Sinput}
\end{Schunk}
\end{fullwidth}
\marginnote[12pt]{The function \margtt{WDI()} calls the non-visible function
\margtt{wdi.dl()}, which in turn calls the function \margtt{fromJSON()}
from the \pkg{RJSONIO} package. To see the code for \margtt{wdi.dl()},
type \margtt{getAnywhere("wdi.dl")}.}
The effect of \code{extra=TRUE} is to include the additional variables
\code{iso2c} (2-character country code), \code{country}, \code{year},
\code{iso3c} (3-character country code), \code{region},
\code{capital}, \code{longitude}, \code{latitude}, \code{income} and
\code{lending}.
The data frame \code{WorldBank} that results is in a form where it can
be used with the \pkg{googleVis} function \code{gvisMotionChart()},
as described in Section \ref{sec:gvis}.
\section{Creating and Using Databases}\label{ss:dbase}
\marginnote{In addition to the \textit{RSQLite}, note
the \textit{RMySQL} and \textit{ROracle} packages.
All use the interface provided by the \textit{DBI} package.}
The \textit{RSQLite} package makes it possible to create an
SQLite database, or to add new rows to an existing table,
or to add new table(s), within an R session. The SQL query
language can then be used to access tables in the database.
Here is an example. First create the database:
\noindent
\begin{Schunk}
\begin{Sinput}
library(DAAG)
library(RSQLite)
driveLite <- dbDriver("SQLite")
con <- dbConnect(driveLite, dbname="hillracesDB")
dbWriteTable(con, "hills2000", hills2000,
overwrite=TRUE)
dbWriteTable(con, "nihills", nihills,
overwrite=TRUE)
dbListTables(con)
\end{Sinput}
\begin{Soutput}
[1] "hills2000" "nihills"
\end{Soutput}
\end{Schunk}
The database \path{hillracesDB}, if it does not already exist,
is created in the working directory.
Now input rows 16 to 20 from the newly created database:
\begin{Schunk}
\begin{Sinput}
## Get rows 16 to 20 from the table nihills
dbGetQuery(con,
           "select * from nihills limit 5 offset 15")
\end{Sinput}
\begin{Soutput}
dist climb time timef
1 5.5 2790 0.9483 1.2086
2 11.0 3000 1.4569 2.0344
3 4.0 2690 0.6878 0.7992
4 18.9 8775 3.9028 5.9856
5 4.0 1000 0.4347 0.5756
\end{Soutput}
\begin{Sinput}
dbDisconnect(con)
\end{Sinput}
\end{Schunk}
\section{$^*$File Compression}
The functions for data
input in versions 2.10.0 and later of R are able to accept certain
types of compressed files. This extends to \code{scan()} and to
functions such as \code{read.maimages()} in the \pkg{limma}
package, that use the standard R data input functions.
By way of illustration, consider the files \textbf{coral551.spot},
\ldots, \textbf{coral556.spot} that are in the subdirectory
\textbf{doc} of the \textit{DAAGbio} package. In a directory that held
the uncompressed files, they were created by typing, on a Unix or
Unix-like command line: \marginnote{For stronger compression,
  replace\newline \txtt{gzip -9}\newline
  \noindent by\newline \txtt{xz -9e}.}
\begin{Schunk}
\begin{Sinput}
gzip -9 coral55?.spot
\end{Sinput}
\end{Schunk}
\noindent
The {\bf .gz} files thus created were renamed back to
\textbf{*.spot} files.
When saving large objects in image format, specify \code{compress=TRUE}.
Alternatives that may lead to more compact files are \code{compress="bzip2"}
and \code{compress="xz"}.
Note also the R functions \code{gzfile()} and \code{xzfile()} that can
be used to create files in a compressed text format. This might for
example be text that has been input using \code{readLines()}.
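A minimal sketch, with hypothetical file names:
\begin{Schunk}
\begin{Sinput}
## Write lines of text to a gzip-compressed file
txt <- readLines("notes.txt")
con <- gzfile("notes.txt.gz", "w")
writeLines(txt, con)
close(con)
## Standard input functions decompress transparently
chk <- readLines("notes.txt.gz")
\end{Sinput}
\end{Schunk}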
\section{Summary}
\begin{itemize}
\item[] Following input, perform minimal checks that
values in the various columns are as expected.
\item[] With very large files, it can be helpful to read in the
data in chunks (ranges of rows).
\item[] Note mechanisms for direct input of web data. Many data
archives now offer one or more of several markup formats that
facilitate selective access.
\end{itemize}