forked from hadley/ggplot2-book
-
Notifications
You must be signed in to change notification settings - Fork 1
/
introduction.rmd
184 lines (134 loc) · 15.8 KB
/
introduction.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
---
title: introduction
output: oldbookdown::html_chapter
bibliography: references.bib
---
# Introduction {#cha:introduction}
## Welcome to ggplot2
ggplot2 is an R package for producing statistical, or data, graphics, but it is unlike most other graphics packages because it has a deep underlying grammar. This grammar, based on the Grammar of Graphics [@wilkinson:2006], is made up of a set of independent components that can be composed in many different ways. This makes ggplot2 very powerful because you are not limited to a set of pre-specified graphics, but you can create new graphics that are precisely tailored for your problem. This may sound overwhelming, but because there is a simple set of core principles and very few special cases, ggplot2 is also easy to learn (although it may take a little time to forget your preconceptions from other graphics tools).
Practically, ggplot2 provides beautiful, hassle-free plots that take care of fiddly details like drawing legends. The plots can be built up iteratively and edited later. A carefully chosen set of defaults means that most of the time you can produce a publication-quality graphic in seconds, but if you do have special formatting requirements, a comprehensive theming system makes it easy to do what you want. Instead of spending time making your graph look pretty, you can focus on creating a graph that best reveals the messages in your data.
ggplot2 is designed to work iteratively. You can start with a layer showing the raw data then add layers of annotations and statistical summaries. It allows you to produce graphics using the same structured thinking that you use to design an analysis, reducing the distance between a plot in your head and one on the page. It is especially helpful for students who have not yet developed the structured approach to analysis used by experts.
Learning the grammar not only will help you create graphics that you know about now, but will also help you to think about new graphics that would be even better. Without the grammar, there is no underlying theory, so most graphics packages are just a big collection of special cases. For example, in base R, if you design a new graphic, it's composed of raw plot elements like points and lines, and it's hard to design new components that combine with existing plots. In ggplot2, the expressions used to create a new graphic are composed of higher-level elements like representations of the raw data and statistical transformations, and can easily be combined with new datasets and other plots.
This book provides a hands-on introduction to ggplot2 with lots of example code and graphics. It also explains the grammar on which ggplot2 is based. Like other formal systems, ggplot2 is useful even when you don't understand the underlying model. However, the more you learn about it, the more effectively you'll be able to use ggplot2. This book assumes some basic familiarity with R, to the level described in the first chapter of Dalgaard's *Introductory Statistics with R*.
This book will introduce you to ggplot2 as a novice, unfamiliar with the grammar; teach you the basics so that you can re-create plots you are already familiar with; show you how to use the grammar to create new types of graphics; and eventually turn you into an expert who can build new components to extend the grammar.
## What is the grammar of graphics?
@wilkinson:2006 created the grammar of graphics to describe the deep features that underlie all statistical graphics. The grammar of graphics is an answer to a question: what is a statistical graphic? The layered grammar of graphics [@wickham:2007d] builds on Wilkinson's grammar, focussing on the primacy of layers and adapting it for embedding within R. In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Facetting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic.
As the book progresses, the formal grammar will be explained in increasing detail. The first description of the components follows below. It introduces some of the terminology that will be used throughout the book and outlines the basic responsibilities of each component. Don't worry if it doesn't all make sense right away: you will have many more opportunities to learn about the pieces and how they fit together.
All plots are composed of:
- **Data** that you want to visualise and a set of aesthetic **mapping**s
describing how variables in the data are mapped to aesthetic attributes that
you can perceive.
- **Layers** made up of geometric elements and statistical transformation.
Geometric objects, **geom**s for short, represent what you actually see on
the plot: points, lines, polygons, etc. Statistical transformations,
**stat**s for short, summarise data in many useful ways. For example, binning
and counting observations to create a histogram, or summarising a 2d
relationship with a linear model.
- The **scale**s map values in the data space to values in an aesthetic space,
whether it be colour, or size, or shape. Scales draw a legend or axes, which
provide an inverse mapping to make it possible to read the original data
values from the plot.
- A coordinate system, **coord** for short, describes how data coordinates are
mapped to the plane of the graphic. It also provides axes and gridlines to
make it possible to read the graph. We normally use a Cartesian coordinate
system, but a number of others are available, including polar coordinates
and map projections.
- A **facet**ing specification describes how to break up the data into subsets
and how to display those subsets as small multiples. This is also known
as conditioning or latticing/trellising.
- A **theme** which controls the finer points of display, like the font size
and background colour. While the defaults in ggplot2 have been chosen with
care, you may need to consult other references to create an attractive plot.
A good starting place is Tufte's early works [@tufte:1990;@tufte:1997;
@tufte:2001].
It is also important to talk about what the grammar doesn't do:
- It doesn't suggest what graphics you should use to answer the questions you
are interested in. While this book endeavours to promote a sensible process
for producing plots of data, the focus of the book is on how to produce the
plots you want, not knowing what plots to produce. For more advice on this
topic, you may want to consult @robbins:2004, @cleveland:1993, @chambers:1983,
and @tukey:1977.
- It does not describe interactivity: the grammar of graphics describes only
static graphics and there is essentially no benefit to displaying them on a
computer screen as opposed to a piece of paper. ggplot2 can only create
static graphics, so for dynamic and interactive graphics you will have to look
elsewhere (perhaps at ggvis, described below). @cook:2007 provides an
excellent introduction to the interactive graphics package GGobi. GGobi can
be connected to R with the rggobi package [@wickham:2008b].
## How does ggplot2 fit in with other R graphics?
There are a number of other graphics systems available in R: base graphics, grid graphics and trellis/lattice graphics. How does ggplot2 differ from them?
- Base graphics were written by Ross Ihaka based on experience implementing the S
graphics driver and partly looking at @chambers:1983. Base graphics has a pen
on paper model: you can only draw on top of the plot, you cannot modify or
delete existing content. There is no (user accessible) representation of the
graphics, apart from their appearance on the screen. Base graphics includes
both tools for drawing primitives and entire plots. Base graphics functions
are generally fast, but have limited scope. If you've created a single
scatterplot, or histogram, or a set of boxplots in the past, you've probably
used base graphics. \index{Base graphics}
- The development of "grid" graphics, a much richer system of graphical
primitives, started in 2000. Grid is developed by Paul Murrell, growing out of
his PhD work [@murrell:1998]. Grid grobs (graphical objects) can be
represented independently of the plot and modified later. A system of
viewports (each containing its own coordinate system) makes it easier to lay
out complex graphics. Grid provides drawing primitives, but no tools for
producing statistical graphics. \index{grid}
- The lattice package, developed by Deepayan Sarkar, uses grid
graphics to implement the trellis graphics system of @cleveland:1993
and is a considerable improvement over base graphics. You can easily produce
conditioned plots and some plotting details (e.g., legends) are taken care
of automatically. However, lattice graphics lacks a formal model, which can
make it hard to extend. Lattice graphics are explained in depth in
@sarkar:2008. \index{lattice}
- ggplot2, started in 2005, is an attempt to take the good things about base
and lattice graphics and improve on them with a strong underlying model which
supports the production of any kind of statistical graphic, based on
the principles outlined above. The solid underlying model of ggplot2 makes it
easy to describe a wide range of graphics with a compact syntax, and
independent components make extension easy. Like lattice, ggplot2 uses grid
to draw the graphics, which means you can exercise much low-level control over
the appearance of the plot.
- Work on ggvis, the successor to ggplot2, started in 2014. It takes
the foundational ideas of ggplot2 but extends them to the web and interactive
graphics. The syntax is similar, but it's been re-designed from scratch to
take advantage of what I've learned in the 10 years since creating ggplot2.
The most exciting thing about ggvis is that it's interactive and dynamic, so
plots automatically re-draw themselves when the underlying data or plot
specification changes. However, ggvis is work in progress and currently can
create only a fraction of the plots in ggplot2 can. Stay tuned for updates!
\index{ggvis}
- htmlwidgets, <http://www.htmlwidgets.org>, provides a common framework for
accessing web visualisation tools from R. Packages built on top of htmlwidgets
include leaflet (<https://rstudio.github.io/leaflet/>, maps),
dygraph (<http://rstudio.github.io/dygraphs/>, time series) and
networkD3 (<http://christophergandrud.github.io/networkD3/>, networks).
htmlwidgets is to ggvis what the many specialised graphic packages are to
ggplot2: it provides graphics honed for specific purposes.
\index{htmlwidgets}
Many other R packages, such as vcd [@meyer:2006], plotrix [@plotrix] and gplots [@gplots], implement specialist graphics, but no others provide a framework for producing statistical graphics. A comprehensive list of all graphical tools available in other packages can be found in the graphics task view at <http://cran.r-project.org/web/views/Graphics.html>.
## About this book
The first chapter, [getting started with ggplot2](#cha:getting-started), describes how to quickly get started using ggplot2 to make useful graphics. This chapter introduces several important ggplot2 concepts: geoms, aesthetic mappings and facetting. [Toolbox](#cha:toolbox) dives into more details, giving you a toolbox designed to solve a wide range of problems.
[Mastery](#cha:mastery) describes the layered grammar of graphics which underlies ggplot2. The theory is illustrated in [layers](#cha:layers) which demonstrates how to add additional layers to your plot, exercising full control over the geoms and stats used within them.
Understanding how scales work is crucial for fine-tuning the perceptual properties of your plot. Customising scales gives fine control over the exact appearance of the plot and helps to support the story that you are telling. [Scales](#cha:scales) will show you what scales are available, how to adjust their parameters, and how to control the appearance of axes and legends.
Coordinate systems and facetting control the position of elements of the plot. These are described in [position](#cha:position). Facetting is a very powerful graphical tool as it allows you to rapidly compare different subsets of your data. Different coordinate systems are less commonly needed, but are very important for certain types of data.
To polish your plots for publication, you will need to learn about the tools described in [polishing](#cha:polishing). There you will learn about how to control the theming system of ggplot2 and how to save plots to disk.
The book concludes with four chapters that show how to use ggplot2 as part of a data analysis pipeline. ggplot2 works best when your data is tidy, so [Tidying](#cha:data) discusses what that means and how to make your messy data tidy. [Data transformation](#cha:dplyr) teaches you how to use the dplyr package to perform the most common data manipulation operations. [Modelling](#cha:modelling) shows how to integrate visualisation and modelling in two useful ways. Duplicated code is a big inhibitor of flexibility and reduces your ability to respond to changes in requirements. [Programming with ggplot2](#cha:programming) covers useful techniques for reducing duplication in your code.
## Installation {#sec:installation}
\index{Installation}
To use ggplot2, you must first install it. Make sure you have a recent version of R (at least version 3.2.0) from <http://r-project.org> and then run the following code to download and install ggplot2:
```{r install, eval = FALSE}
install.packages("ggplot2")
```
## Other resources {#sec:other-resources}
This book teaches you the elements of ggplot2's grammar and how they fit together, but it does not document every function in complete detail. You will need additional documentation as your use of ggplot2 becomes more complex and varied.
The best resource for specific details of ggplot2 functions and their arguments will always be the built-in documentation. This is accessible online, <http://docs.ggplot2.org/>, and from within R using the usual help syntax. The advantage of the online documentation is that you can see all the example plots and navigate between topics more easily.
If you use ggplot2 regularly, it's a good idea to sign up for the ggplot2 mailing list, <http://groups.google.com/group/ggplot2>. The list has relatively low traffic and is very friendly to new users. Another useful resource is stackoverflow, <http://stackoverflow.com>. There is an active ggplot2 community on stackoverflow, and many common questions have already been asked and answered. In either place, you're much more likely to get help if you create a minimal reproducible example. The [reprex](https://github.com/jennybc/reprex) package by Jenny Bryan provides a convenient way to do this, and also include advice on creating a good example. The more information you provide, the easier it is for the community to help you.
The number of functions in ggplot2 can be overwhelming, but RStudio provides some great cheatsheets to jog your memory at <http://www.rstudio.com/resources/cheatsheets/>.
Finally, the complete source code for the book is available online at <https://github.com/hadley/ggplot2-book>. This contains the complete text for the book, as well as all the code and data needed to recreate all the plots.
## Colophon
This book was written in [R Markdown](http://rmarkdown.rstudio.com/) inside [RStudio](http://www.rstudio.com/ide/). [knitr](http://yihui.name/knitr/) and [pandoc](http://johnmacfarlane.net/pandoc/) converted the raw Rmarkdown to html and pdf. The complete source is available from [github](https://github.com/hadley/ggplot2-book). This version of the book was built with:
```{r colophon}
devtools::session_info(c("ggplot2", "dplyr", "broom"))
getOption("width")
```
## References