
Introduction to ggplot2
Dawn Koffman
Office of Population Research
Princeton University
January 2014

Part 1: Concepts and Terminology

R Package: ggplot2
Used to produce statistical graphics; author Hadley Wickham

"attempt to take the good things about base and lattice graphics and improve on them with a strong, underlying model"

based on The Grammar of Graphics by Leland Wilkinson, 2005

"... describes the meaning of what we do when we construct statistical graphics ... More than a taxonomy ... Computational system based on the underlying mathematics of representing statistical functions of data."
- does not limit developer to a set of pre-specified graphics

adds some concepts to grammar which allow it to work well with R

ggplot2 provides two ways to produce plot objects:

qplot()    # quick plot (not covered in this workshop)
- uses some concepts of The Grammar of Graphics, but doesn't provide full capability
- designed to be very similar to plot() and simple to use
- may make it easy to produce basic graphs, but may delay understanding philosophy of ggplot2

ggplot()   # grammar of graphics plot (focus of this workshop)
- provides fuller implementation of The Grammar of Graphics
- may have steeper learning curve but allows much more flexibility when building graphs

Grammar Defines Components of Graphics

data: in ggplot2, data must be stored as an R data frame

coordinate system: describes 2-D space that data is projected onto
- for example, Cartesian coordinates, polar coordinates, map projections, ...

geoms: describe type of geometric objects that represent data
- for example, points, lines, polygons, ...

aesthetics: describe visual characteristics that represent data
- for example, position, size, color, shape, transparency, fill

scales: for each aesthetic, describe how visual characteristic is converted to display values
- for example, log scales, color scales, size scales, shape scales, ...

stats: describe statistical transformations that typically summarize data
- for example, counts, means, medians, regression lines, ...

facets: describe how data is split into subsets and displayed as multiple small graphs
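As a rough illustration of how these components line up with ggplot2 calls, the sketch below builds one plot object and labels each piece. The data frame d and its values are made up for the example (they are not the workshop data), and the particular geom, stat, scale and facet choices are arbitrary.

```r
library(ggplot2)

# Hypothetical toy data (made-up values), not the workshop data frame.
d <- data.frame(
  x = c(1, 10, 100, 1, 10, 100),
  y = c(2, 4, 5, 3, 3, 6),
  g = c("a", "a", "a", "b", "b", "b")
)

p <- ggplot(data = d, aes(x = x, y = y, color = g)) +  # data and aesthetics
  geom_point() +                                        # geom: points
  stat_smooth(method = "lm", se = FALSE) +              # stat: regression line
  scale_x_log10() +                                     # scale: log scale for x
  coord_cartesian() +                                   # coordinate system
  facet_wrap(~ g)                                       # facets: one panel per g
```

Printing p should draw two panels, each with three points and a fitted line.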

Workshop Data Frame
extract from 2012 World Population Data Sheet produced by Population Reference Bureau
includes 158 countries where mid-2012 population is at least 1 million
for notes, sources and full definitions, see the data sheet

variables:
- area (Africa, Americas, Asia & Oceania, Europe)
- region (Northern Africa, Western Africa, Eastern Africa, Middle Africa, North America, Central America, Caribbean, South America, Western Asia, South Central Asia, Southeast Asia, East Asia, Oceania, Northern Europe, Western Europe, Eastern Europe, Southern Europe)
- country name
- population mid-2012 (millions)
- infant mortality rate*
- total fertility rate*
- life expectancy at birth
- male life expectancy at birth
- female life expectancy at birth

*definitions:
- infant mortality rate: annual number of deaths of infants under age 1 per 1,000 live births
- total fertility rate: average number of children a woman would have assuming that current age-specific birth rates remain constant throughout her childbearing years

ggplot()
creates a plot object that can be assigned to a variable
can specify data frame and aesthetics (visual characteristics that represent data)

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr, color = area))

[excerpt of data frame w: rows for African countries]

- le value is indicated by x-axis position
- tfr value is indicated by y-axis position
- area value is indicated by color

BUT plot object p cannot be displayed without adding at least one layer; at this point, there is nothing to see!

Add a Layer

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr, color = area))
    p + layer(geom = "point", geom_params = list(size = 4))

Layer

purpose:
- display the data: allows viewer to see patterns, overall structure, local structure, outliers, ...
- display statistical summaries of the data: allows viewer to see counts, means, medians, IQRs, model predictions, ...

full specification:

    layer(geom, geom_params, stat, stat_params, data, mapping, position)

- every layer specifies a geom or a stat or both
- data and mapping (aesthetics) may be inherited from ggplot() object or added/changed/dropped using layer()
- position refers to method for adjusting overlapping objects
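The layer() arguments listed above reflect the ggplot2 release used for this workshop. In more recent releases (2.0.0 onward, as far as I know), layer() expects geom, stat and position to be given explicitly and takes a single params list in place of geom_params and stat_params. The sketch below assumes such a release and uses a made-up data frame d rather than the workshop data.

```r
library(ggplot2)

# Hypothetical toy data (made-up values).
d <- data.frame(le = c(58, 62, 75, 70, 80), tfr = c(5.1, 4.4, 2.2, 2.5, 1.5))

p <- ggplot(data = d, aes(x = le, y = tfr))

# Explicit layer: geom, stat and position are all spelled out.
explicit <- p + layer(
  geom = "point", stat = "identity", position = "identity",
  params = list(size = 4)
)

# Shortcut function intended to describe the same layer.
shortcut <- p + geom_point(size = 4)
```

Both plot objects should render the same scatterplot; layer_data(explicit) and layer_data(shortcut) can be compared to check.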

Add a geom Layer

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr, color = area))
    p + layer(geom = "blank")
    p + layer(geom = "line")
    p + layer(geom = "jitter")
    p + layer(geom = "step")

Add a stat Layer

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr))
    p + layer(geom = "point", geom_params = list(shape = 1)) + layer(stat = "smooth")

message printed by the first plot:

    geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

    p + layer(geom = "point", geom_params = list(shape = 1)) +
        layer(stat = "smooth", stat_params = list(method = "lm", se = FALSE))

geom_xxx and stat_xxx Shortcut Functions
can use geom_xxx() and stat_xxx() shortcut functions rather than layer(): much less typing!

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr))
    p + geom_point(shape = 1) + stat_smooth()
    p + geom_point(shape = 1) + stat_smooth(method = "lm", se = FALSE)

Shortcut Functions: Adding a geom Layer

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr, color = area))
    p + geom_blank()
    p + geom_line()
    p + geom_jitter()
    p + geom_step()

Add Layers Using Shortcut Functions

geom_xxx()
- purpose: display the data; allows viewer to see patterns, overall structure, local structure, outliers, ...
- full specification: geom_xxx(mapping, data, stat, position, ...)
- each geom_xxx() has a default stat (statistical transformation) associated with it, but the default statistical transformation may be changed using the stat parameter

stat_xxx()
- purpose: display statistical summaries of the data; allows viewer to see counts, means, medians, IQRs, model predictions, ...
- full specification: stat_xxx(mapping, data, geom, position, ...)
- each stat_xxx() has a default geom (geometric object) associated with it, but the default geometric object may be changed using the geom parameter

for a list of geom_xxx() and stat_xxx(), see http://docs.ggplot2.org/current/
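The pairing of default stats and geoms can be seen in a small sketch. It assumes a recent ggplot2 release, where discrete values are counted by stat_count() (the default stat of geom_bar()) rather than by stat_bin(); the data frame d is made up for the example.

```r
library(ggplot2)

# Hypothetical toy data: one row per country, made-up areas.
d <- data.frame(
  area = c("Africa", "Africa", "Americas", "Europe", "Europe", "Europe")
)

p <- ggplot(data = d, aes(x = area))

bars   <- p + geom_bar()                  # geom first; default stat counts rows
points <- p + stat_count(geom = "point")  # stat first; default geom replaced

# Both layers are drawn from the same computed counts.
layer_data(bars)$count
layer_data(points)$count
```

Starting from the geom or from the stat gives the same summary; only the geometric object used to draw it changes.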

geom_xxx()

    geoms <- help.search("^geom_", package = "ggplot2")
    geoms$matches[, 1:2]

          topic              title
     [1,] "geom_abline"      "Line specified by slope and intercept."
     [2,] "geom_area"        "Area plot."
     [3,] "geom_bar"         "Bars, rectangles with bases on x-axis"
     [4,] "geom_bin2d"       "Add heatmap of 2d bin counts."
     [5,] "geom_blank"       "Blank, draws nothing."
     [6,] "geom_boxplot"     "Box and whiskers plot."
     [7,] "geom_contour"     "Display contours of a 3d surface in 2d."
     [8,] "geom_crossbar"    "Hollow bar with middle indicated by horizontal line."
     [9,] "geom_density"     "Display a smooth density estimate."
    [10,] "geom_density2d"   "Contours from a 2d density estimate."
    [11,] "geom_dotplot"     "Dot plot"
    [12,] "geom_errorbar"    "Error bars."
    [13,] "geom_errorbarh"   "Horizontal error bars"
    [14,] "geom_freqpoly"    "Frequency polygon."
    [15,] "geom_hex"         "Hexagon bining."
    [16,] "geom_histogram"   "Histogram"
    [17,] "geom_hline"       "Horizontal line."
    [18,] "geom_jitter"      "Points, jittered to reduce overplotting."
    [19,] "geom_line"        "Connect observations, ordered by x value."
    [20,] "geom_linerange"   "An interval represented by a vertical line."
    [21,] "geom_map"         "Polygons from a reference map."
    [22,] "geom_path"        "Connect observations in original order"
    [23,] "geom_point"       "Points, as for a scatterplot"
    [24,] "geom_pointrange"  "An interval represented by a vertical line, with a point in the middle."
    [25,] "geom_polygon"     "Polygon, a filled path."
    [26,] "geom_quantile"    "Add quantile lines from a quantile regression."
    [27,] "geom_raster"      "High-performance rectangular tiling."
    [28,] "geom_rect"        "2d rectangles."
    [29,] "geom_ribbon"      "Ribbons, y range with continuous x values."
    [30,] "geom_rug"         "Marginal rug plots."
    [31,] "geom_segment"     "Single line segments."
    [32,] "geom_smooth"      "Add a smoothed conditional mean."
    [33,] "geom_step"        "Connect observations by stairs."
    [34,] "geom_text"        "Textual annotations."
    [35,] "geom_tile"        "Tile plane with rectangles."
    [36,] "geom_violin"      "Violin plot."
    [37,] "geom_vline"       "Line, vertical."

stat_xxx()

    stats <- help.search("^stat_", package = "ggplot2")
    stats$matches[, 1:2]

          topic               title
     [1,] "stat_abline"       "Add a line with slope and intercept."
     [2,] "stat_bin"          "Bin data."
     [3,] "stat_bin2d"        "Count number of observation in rectangular bins."
     [4,] "stat_bindot"       "Bin data for dot plot."
     [5,] "stat_binhex"       "Bin 2d plane into hexagons."
     [6,] "stat_boxplot"      "Calculate components of box and whisker plot."
     [7,] "stat_contour"      "Calculate contours of 3d data."
     [8,] "stat_density"      "1d kernel density estimate."
     [9,] "stat_density2d"    "2d density estimation."
    [10,] "stat_ecdf"         "Empirical Cumulative Density Function"
    [11,] "stat_function"     "Superimpose a function."
    [12,] "stat_hline"        "Add a horizontal line"
    [13,] "stat_identity"     "Identity statistic."
    [14,] "stat_qq"           "Calculation for quantile-quantile plot."
    [15,] "stat_quantile"     "Continuous quantiles."
    [16,] "stat_smooth"       "Add a smoother."
    [17,] "stat_spoke"        "Convert angle and radius to xend and yend."
    [18,] "stat_sum"          "Sum unique values. Useful for overplotting on scatterplots."
    [19,] "stat_summary"      "Summarise y values at every unique x."
    [20,] "stat_summary2d"    "Apply funciton for 2D rectangular bins."
    [21,] "stat_summary_hex"  "Apply funciton for 2D hexagonal bins."
    [22,] "stat_unique"       "Remove duplicates."
    [23,] "stat_vline"        "Add a vertical line"
    [24,] "stat_ydensity"     "1d kernel density estimate along y axis, for violin plot."

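Statistical Transformation

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = area))
    p + stat_bin()

[diagram: the statistical transformation stat_bin() converts the country rows of w, each with an area value, into a data frame with one bin per area (bins 1 to 4)]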

Change Default Geometric Object

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = area)) + ylim(0, 60)
    p + stat_bin()
    p + stat_bin(geom = "point", size = 5)
    p + stat_bin(geom = "bar")
    p + stat_bin(geom = "tile")

Change Default Geometric Object

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le))
    p + stat_bin(binwidth = 1)
    p + stat_bin(geom = "point", binwidth = 1)
    p + stat_bin(geom = "line", binwidth = 1)
    p + stat_bin(geom = "line", binwidth = 1) + stat_bin(geom = "point", binwidth = 1)

Use Variables Created by stat_xxx()
stat_xxx() may create new variables in transformed data frame
aesthetics may be mapped to these new variables

    bin  area          ..count..
    1    Africa        48
    2    Americas      25
    3    Asia/Oceania  49
    4    Europe        36

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = area))
    p + stat_bin(aes(y = ..count../sum(..count..))) + ylab("proportion") + ylim(0, .5)
    p + stat_bin(aes(fill = ..count..))
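In more recent ggplot2 releases (3.3.0 onward, to my knowledge), variables computed by a stat are usually referenced with after_stat() instead of the double-dot notation. The sketch below assumes such a release, uses stat_count() because x is discrete, and works on a made-up data frame d rather than the workshop data.

```r
library(ggplot2)

# Hypothetical toy data: one row per country, made-up areas.
d <- data.frame(
  area = c("Africa", "Africa", "Americas", "Europe", "Europe", "Europe")
)

p <- ggplot(data = d, aes(x = area))

# Map bar height to each area's share of all rows.
prop <- p +
  stat_count(aes(y = after_stat(count / sum(count)))) +
  ylab("proportion")

# Map fill color to the computed count.
filled <- p + stat_count(aes(fill = after_stat(count)))

# Inspect the transformed data frame behind the first plot.
layer_data(prop)[, c("x", "count", "y")]
```

The y column of the transformed data should hold the proportions, which sum to 1 across the bars.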

Already Transformed Data

    wb <- read.csv(file = "WDS2012areabins.csv", head = TRUE, sep = ",")
    wb

      bin         area count
    1   1       Africa    48
    2   2     Americas    25
    3   3 Asia/Oceania    49
    4   4       Europe    36

    p <- ggplot(data = wb, aes(x = area, y = count)) + ylim(0, 60)
    p + geom_bar(stat = "identity")
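One way to produce already-binned data in R itself is to tabulate before plotting. This is only a sketch with a made-up data frame d; the file WDS2012areabins.csv may well have been prepared differently.

```r
library(ggplot2)

# Hypothetical toy data: one row per country, made-up areas.
d <- data.frame(
  area = c("Africa", "Africa", "Americas", "Europe", "Europe", "Europe")
)

# Count rows per area, giving one row per bin with columns area and count.
counts <- as.data.frame(table(area = d$area), responseName = "count")

# stat = "identity" uses the count column as the bar height, unchanged.
pb <- ggplot(data = counts, aes(x = area, y = count)) +
  geom_bar(stat = "identity")
```

In recent ggplot2 releases, geom_col() is a shortcut for geom_bar(stat = "identity").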

Aesthetics
describe visual characteristics that represent data
- for example, x position, y position, size, color (outside), fill (inside), point shape, line type, transparency

each layer inherits default aesthetics from plot object
- within each layer, aesthetics may be added, overwritten, or removed

most layers have some required aesthetics and some optional aesthetics

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr, color = area))
    p + geom_point() + geom_smooth(method = "lm", se = FALSE)

Add or Remove Aesthetic Mapping

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr, color = area))

add aesthetic mapping

    p + geom_point(aes(shape = area)) + geom_smooth(method = "lm", se = FALSE)

remove aesthetic mapping

    p + geom_point(aes(color = NULL)) + geom_smooth(method = "lm", se = FALSE)

Aesthetic Mapping vs. Parameter Setting

aesthetic mapping
- data value determines visual characteristic
- use aes()

setting
- constant value determines visual characteristic
- use layer parameter

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr))

aesthetic mapping

    p + geom_point(aes(color = area))

setting

    p + geom_point(color = "red")
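The difference between mapping and setting shows up in the data that each layer ends up drawing. The sketch below uses a made-up data frame d (not the workshop data) and layer_data() to look at the colour assigned to each point.

```r
library(ggplot2)

# Hypothetical toy data (made-up values).
d <- data.frame(
  area = c("Africa", "Africa", "Americas", "Europe", "Europe"),
  le   = c(58, 62, 75, 80, 78),
  tfr  = c(5.1, 4.4, 2.2, 1.5, 1.7)
)

p <- ggplot(data = d, aes(x = le, y = tfr))

mapped <- p + geom_point(aes(color = area))  # colour depends on the data
fixed  <- p + geom_point(color = "red")      # colour is a constant

# Mapping: one colour per distinct area, chosen by the colour scale.
unique(layer_data(mapped)$colour)

# Setting: every point gets the constant that was supplied.
unique(layer_data(fixed)$colour)
```

Only the mapped version gets a legend, since only there does colour carry information about the data.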

Position

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    w$tfrGT2 <- w$tfr > 2
    p <- ggplot(data = w, aes(x = area, fill = tfrGT2))
    p + geom_bar()
    p + geom_bar(position = "dodge")
    p + geom_bar(position = "stack")
    p + geom_bar(position = "fill")
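What each position adjustment does to the bars can be checked numerically. This sketch uses a made-up data frame d in place of the workshop data and reads the top of each bar (ymax) from layer_data().

```r
library(ggplot2)

# Hypothetical toy data (made-up values).
d <- data.frame(
  area = c("Africa", "Africa", "Africa", "Europe", "Europe"),
  tfr  = c(5.1, 4.4, 1.9, 1.5, 1.7)
)
d$tfrGT2 <- d$tfr > 2

p <- ggplot(data = d, aes(x = area, fill = tfrGT2))

stacked <- p + geom_bar(position = "stack")  # segments piled up (the default)
dodged  <- p + geom_bar(position = "dodge")  # segments placed side by side
filled  <- p + geom_bar(position = "fill")   # stacked, rescaled to height 1

max(layer_data(stacked)$ymax)  # tallest stack: all rows of one area
max(layer_data(dodged)$ymax)   # tallest single segment
max(layer_data(filled)$ymax)   # every stack is normalized to 1
```

With "stack" the bar height is the total count per area, with "dodge" it is the count per group, and with "fill" it is the within-area proportion.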

Bar Width

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = area))
    p + geom_bar()
    p + geom_bar(width = .5)
    p + geom_bar(width = .9)    # default
    p + geom_bar(width = .97)

Position

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr))
    p + geom_point()
    p + geom_point(position = "jitter")

equivalent to

    p + geom_jitter()

Transparency

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(data = w, aes(x = le, y = tfr))
    p + geom_point(size = 3, alpha = 1/2)
    p + geom_jitter(size = 4, alpha = 1/2)

techniques for overplotting: adjusting symbol size, shape, jitter, and transparency

Coordinate System

    w <- read.csv(file = "WDS2012.csv", head = TRUE, sep = ",")
    p <- ggplot(w, aes(x = factor(1), fill = area))
    p + geom_bar()
    p + geom_bar() + coord_flip()
    p + geom_bar() + coord_polar(theta = "y")
    p + geom_bar() + coord_polar(theta = "y", direction = -1)

Data Frame
each plot layer may contain data from a different data frame

    w <- read.csv(file = "WDS2012.csv", head