Chapter 1 Data visualization with ggplot2
Graphs are often the starting point for statistical analysis. One of the main advantages of R is how easy it is for the user to create many different kinds of graphs. We begin this chapter by studying conventional graphs, followed by an examination of some more complex representations. This final (and main) part uses the ggplot2 package.
1.1 Conventional Graphical Functions
To begin with, it may be interesting to examine few examples of graphical representations which can be constructed with R. We use the demo function:
demo(graphics)
1.1.1 The plot function
The plot function is a generic function used to represent all kinds of data. We can use this function for representing a scatterplot for a variable y according to another variable x. For example, to represent the graph of the function \(x\mapsto \sin(2\pi x)\) on \([0,1]\), at regular steps we use the following commands:
<- seq(-2*pi,2*pi,by=0.05)
x <- sin(x)
y plot(x,y) #points (par défaut)
plot(x,y,type="l") #représentation sous forme de ligne
We provide examples of representations for quantitative and qualitative variables. We use the data file ozone.txt:
<- read.table("data/ozone.txt")
ozone summary(ozone)
maxO3 T9 T12 : 42.00 Min. :11.30 Min. :14.00
Min. : 70.75 1st Qu.:16.20 1st Qu.:18.60
1st Qu.: 81.50 Median :17.80 Median :20.55
Median : 90.30 Mean :18.36 Mean :21.53
Mean :106.00 3rd Qu.:19.93 3rd Qu.:23.55
3rd Qu.:166.00 Max. :27.00 Max. :33.50
Max.
T15 Ne9 Ne12 :14.90 Min. :0.000 Min. :0.000
Min. :19.27 1st Qu.:3.000 1st Qu.:4.000
1st Qu.:22.05 Median :6.000 Median :5.000
Median :22.63 Mean :4.929 Mean :5.018
Mean :25.40 3rd Qu.:7.000 3rd Qu.:7.000
3rd Qu.:35.50 Max. :8.000 Max. :8.000
Max.
Ne15 Vx9 Vx12 :0.00 Min. :-7.8785 Min. :-7.878
Min. :3.00 1st Qu.:-3.2765 1st Qu.:-3.565
1st Qu.:5.00 Median :-0.8660 Median :-1.879
Median :4.83 Mean :-1.2143 Mean :-1.611
Mean :7.00 3rd Qu.: 0.6946 3rd Qu.: 0.000
3rd Qu.:8.00 Max. : 5.1962 Max. : 6.578
Max.
Vx15 maxO3v vent :-9.000 Min. : 42.00 Length:112
Min. :-3.939 1st Qu.: 71.00 Class :character
1st Qu.:-1.550 Median : 82.50 Mode :character
Median :-1.691 Mean : 90.57
Mean : 0.000 3rd Qu.:106.00
3rd Qu.: 5.000 Max. :166.00
Max.
pluie :112
Length:character
Class :character
Mode
Let us start by representing two quantitative variables: maximum ozone concentrations maxO3 in term of temperatures T12:
plot(ozone[,"T12"],ozone[,"maxO3"])
As the two variables are contained and named within the same table, a simpler syntax can be used, which automatically inserts the variables as labels for the axes:
plot(maxO3~T12,data=ozone)
We can also use (more complicated and less interesting)
plot(ozone[,"T12"],ozone[,"maxO3"],xlab="T12",ylab="maxO3")
We have specific functions for each kind of charts, for instance
hist(ozone$maxO3,main="Histogram")
barplot(table(ozone$vent)/nrow(ozone),col="blue")
boxplot(maxO3~vent,data=ozone)
1.1.2 Interactive graphs with rAmCharts
We can use this package to obtain dynamic graphs. It is easy, we just have to use the prefix am before the name of the function:
library(rAmCharts)
amHist(ozone$maxO3)
1.1.3 Some exercises
Exercise 1.1 (First chart)
Draw the sine function between 0 and \(2\pi\).
Add the following title: Plot of the sine function, you can use the function title.
Exercise 1.2 (Gaussian density distribution)
Draw the pdf (probability distribution function) of the standard Gaussian distribution between \(-4\) and 4 (use dnorm).
Add a vertical dashed line of equation \(x=0\) (use abline).
On the same graph, draw Student’s \(t\)-distribution to 5 and 30 degrees of freedom (use dt). Use the lines function and a different colour for each density.
Add a legend at the top left to differentiate between each distribution (use legend).
Exercise 1.3 (Drawing sunspots)
Read the file taches_solaires.csv series which details, data by data, the relative number of sunspots.
Using the cut_interval function from the tidyverse package, create a qualitative variable which separates the dates into 8 intervals. We call period this variable.
Use the following levels for period
<- c("yellow", "magenta", "orange", "cyan", mycolors "grey", "red", "green", "blue")
Explain the output of
Visualize the sunspot series with a diffrect color for the 8 periods.
Exercise 1.4 (Layout) We consider the ozone dataset. With the layout function, split the window into two lines with
- the scatter plot maxO3 vs T12 on the first line;
- 2 graphs on the second lines: histogram of T12 et boxplot of maxO3.
1.2 Ggplot2 grammar
ggplot2 is a plotting system for R based on the grammar of graphics (as dplyr to manipulate data). We can find documentation at http://ggplot2.org and https://ggplot2-book.org.
1.2.1 First charts with ggplot2
We consider a subsample of the diamond dataset
from the package ggplot2 (or tidyverse):
library(tidyverse)
set.seed(1234)
<- diamonds[sample(nrow(diamonds),5000),]
diamonds2 summary(diamonds2)
carat cut color clarity :0.2000 Fair : 158 D: 640 SI1 :1189
Min. :0.4000 Good : 455 E: 916 VS2 :1157
1st Qu.:0.7000 Very Good:1094 F: 900 SI2 : 876
Median :0.7969 Premium :1280 G:1018 VS1 : 738
Mean :1.0400 Ideal :2013 H: 775 VVS2 : 470
3rd Qu.:4.1300 I: 481 VVS1 : 326
Max. : 270 (Other): 244
J
depth table price :43.00 Min. :49.00 Min. : 365
Min. :61.10 1st Qu.:56.00 1st Qu.: 945
1st Qu.:61.80 Median :57.00 Median : 2376
Median :61.76 Mean :57.43 Mean : 3917
Mean :62.50 3rd Qu.:59.00 3rd Qu.: 5294
3rd Qu.:71.60 Max. :95.00 Max. :18757
Max.
x y z : 0.000 Min. :3.720 Min. :0.000
Min. : 4.720 1st Qu.:4.720 1st Qu.:2.920
1st Qu.: 5.690 Median :5.700 Median :3.520
Median : 5.728 Mean :5.731 Mean :3.538
Mean : 6.530 3rd Qu.:6.520 3rd Qu.:4.030
3rd Qu.:10.000 Max. :9.850 Max. :6.430
Max.
help(diamonds)
A graph is always defined from many layers. We usually have to specify:
- the data
- the variables we want to plot
- the type of representation (scatterplot, boxplot…).
Ggplot graphs are defined from these layers. We have to specify each layer with a ggplot function :
- the data with ggplot
- the variables with aes (aesthetics)
- the type of representation with geom_
These layers are gathered with the +
operator.
The scatterplot carat vs price is obtained with the plot function with
plot(price~carat,data=diamonds2)
With ggplot, we use
ggplot(diamonds2) #nothing
ggplot(diamonds2)+aes(x=carat,y=price) #nothing
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point() #OK
Exercise 1.5 (Histograms and barplot with ggplot2)
Draw the histogram of carat (use geom_histogram)
Draw the histogram of carat with 10 bins (help(geom_histogram))
Draw the barplot for the variable cut (use geom_bar)
To summarize, ggplot syntax is based on independent elements. These elements define the grammar of ggplot. Main elements of this grammar include:
- Data (
ggplot
): the dataset, it should be a dataframe or a tibble. - Aesthetics (
aes
): to describe the way that variables in the data are mapped. We have to specify all variables used to build the graph in aes. - Geometrics (
geom_...
): to control the type of plot (scaterplot, histogram…) - Statistics (
stat_...
): to describe transformations of the data needed for the chart. - Scales (
scale_...
): to control the mapping from data to aesthetic attributes (change of colors, options for the axis…)
All these elements are combined with a +.
1.2.2 Data et aesthetics
We have two use these elements for all graphs. ggplot verb is used to indicate the dataset. If the syntax is efficient, you will never have to use the name of the dataset to obtain the required graph. aes verb is used to specify the variables we want to visualize. For instance, for a scaterplot price vs carat, the syntax should start with
ggplot(diamonds2)+aes(x=carat,y=price)
aes also possesses options such as color, size, fill. We use these options as soon as a color or a size is defined from a variable of the dataset. For instance
ggplot(diamonds2)+aes(x=carat,y=price,color=cut)
1.2.3 Geometrics
To draw a graph, we need to specify the type of representation. We have to use geometrics to do that. For a scatter plot, we use geom_point:
ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()
Observe that ggplot adds the lengend automatically. Examples of geometrics are described here:
Geom | Description | Aesthetics |
---|---|---|
geom_point() | Scatter plot | x, y, shape, fill |
geom_line() | Line (ordered according to x) | x, y, linetype |
geom_abline() | Line | slope, intercept |
geom_path() | Line (ordered according to the index) | x, y, linetype |
geom_text() | Text | x, y, label, hjust, vjust |
geom_rect() | Rectangle | xmin, xmax, ymin, ymax, fill, linetype |
geom_polygon() | Polygone | x, y, fill, linetype |
geom_segment() | Segment | x, y, xend, yend, fill, linetype |
geom_bar() | Barplot | x, fill, linetype, weight |
geom_histogram() | Histogram | x, fill, linetype, weight |
geom_boxplot() | Boxplots | x, y, fill, weight |
geom_density() | Density | x, y, fill, linetype |
geom_contour() | Contour lines | x, y, fill, linetype |
geom_smooth() | Smoothers (linear or non linear) | x, y, fill, linetype |
All | color, size, group |
Exercise 1.6 (Barplots)
Draw the barplot of cut (with blue bars).
Draw the barplot of cut with one color for each modality of cut. Add a legend.
Draw the barplot of cut with one color of your choice for each modality of cut (and without lengend since it is not useful here).
1.2.4 Statistics
Many graphs need transformations the data to make the representation. It is for example the case for histogram or barplot: we have to calculate the heights for rectangles or bars. These heigths are not explicitely specified in the dataset. Simple transformations can be obtained quickly with ggplot2. For instance we can draw the sine function with
<- data.frame(X=seq(-2*pi,2*pi,by=0.01))
D ggplot(D)+aes(x=X,y=sin(X))+geom_line()
The sine transformation is indicated in aes. For more complex transformations, we have to used statistics. A stat function takes a dataset as input and returns another dataset as output, and so a stat can add new variables to the original dataset. It is then possible to map aesthetics to these new variables. For example, stat_bin function (used to build histograms) creates the following variables:
count
, the number of observations in each bin ;density
, the density of observations in each bin (percentage of total / bar width) ;x
, the center of the bin.
By default geom_histogram represents on the \(y\)-axis the number of observations in each bin (the output count).
ggplot(diamonds2)+aes(x=price)+geom_histogram(bins=40)
If we are interested by another output of stat_bin
, the density for instance, we have to use
ggplot(diamonds2)+aes(x=price,y=..density..)+geom_histogram(bins=40)
ggplot proposes another way to make the representations: we can use stat_ instead of geom_. Formally, each stat_… function has a geom_… and each geom_… has a stat_…. We can draw the same graph as above with
ggplot(diamonds2)+aes(x=price,y=..density..)+stat_bin()
Here are some examples of stat functions :
Stat | Description | Parameters |
---|---|---|
stat_identity() | No transformation | |
stat_bin() | Count | binwidth, origin |
stat_density() | Density | adjust, kernel |
stat_smooth() | Smoother | method, se |
stat_boxplot() | Boxplot | coef |
stat and geom are not always easy to combine. For beginners, we recommend to use geom for beginners.
Exercise 1.7 (A "very simple" barplot...) We consider a color variable \(X\) with probability distribution \[P(X=red)=0.3,\ P(X=blue)=0.2,\ P(X=green)=0.4,\ P(X=black)=0.1\] Draw the barplot of this distribution.
Exercise 1.8 (Smoothing)
Visualize the non linear smoother for the variable
price
in terms ofcarat
. You can usegeom_smooth
orstat_smooth
.Same question but use a dotted line instead of a solid line.
1.2.5 Scales
Scales control the mapping from data to aesthetic attributes (change of colors, sizes…). We generally use this element at the end of the process to refine the graph. Scales are defined as follows:
- begin with
scale_
- add the aesthetics we want to change (
color
,fill
,x_
) - end with the name of the scale (
manual
,identity
…)
For instance,
ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()+
scale_color_manual(values=c("Fair"="black","Good"="yellow",
"Very Good"="blue","Premium"="red","Ideal"="green"))
Here are the main scales:
aes | Discrete | Continuous |
---|---|---|
Color (color et fill) | brewer | gradient |
- | grey | gradient2 |
- | hue | gradientn |
- | identity | |
- | manual | |
Position (x and y) | discrete | continous |
- | date | |
Shape | shape | |
- | identity | |
- | manual | |
Size | identity | size |
- | manual |
We now present some examples on scales:
color in a barplot
<- ggplot(diamonds2)+aes(x=cut)+geom_bar(aes(fill=cut)) p1 p1
We change colors by using the palette Purples :
+scale_fill_brewer(palette="Purples") p1
Gradient color for a scatter plot
:<- ggplot(diamonds2)+aes(x=carat,y=price)+geom_point(aes(color=depth)) p2 p2
We change the gradient color
+scale_color_gradient(low="red",high="yellow") p2
Change on the axis
+scale_x_continuous(breaks=seq(0.5,3,by=0.5))+scale_y_continuous(name="prix")+scale_color_gradient("Profondeur") p2
1.2.6 Group and facets
ggplot allows to make representations for subgroup of individuals. We can proceed in two ways:
- to represent subgroup on the same graph => group in aes
- to represent subgroup on the different graphs => facets
We represent (on the same graph) the smoother price vs carat for each modality of cut with
ggplot(diamonds2)+aes(x=carat,y=price,group=cut)+
geom_smooth(method="loess")
To obtain the representation on many graphs, we use
ggplot(diamonds2)+aes(x=carat,y=price)+
geom_smooth(method="loess")+facet_wrap(~cut)
ggplot(diamonds2)+aes(x=carat,y=price)+
geom_smooth(method="loess")+facet_wrap(~cut,nrow=1)
facet_grid
and facet_wrap
do the same job but split the screen in different ways:
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()+
geom_smooth(method="lm")+facet_grid(color~cut)
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()+
geom_smooth(method="lm")+facet_wrap(color~cut)
1.3 Complements
Syntax for ggplot is defined according to the following process:
ggplot()+aes()+geom_()+scale_()
It is really flexible: for instance aes could also be specified in ggplot or in geom_
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()
ggplot(diamonds2,aes(x=carat,y=price))+geom_point()
ggplot(diamonds2)+geom_point(aes(x=carat,y=price))
It could be very useful when we want to use different aes in the geom_
We can also build a chart from many dataset:
<- seq(-2*pi,2*pi,by=0.001)
X <- cos(X)
Y1 <- sin(X)
Y2 <- data.frame(X,Y1)
donnees1 <- data.frame(X,Y2)
donnees2 ggplot(donnees1)+geom_line(aes(x=X,y=Y1))+
geom_line(data=donnees2,aes(x=X,y=Y2),color="red")
Many other functions are proposed by ggplot:
- ggtitle to add a title.
- ggsave ta save a graph.
- theme_ to change the theme of the graph.
<- ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()
p +theme_bw() p
+theme_classic() p
+theme_grey() p
+theme_bw() p
Many other themes are available in the ggtheme package. The function set_theme
can be used to change the default theme in a markdown document.
1.4 Some exercises
Exercise 1.9 (Cosine and sine functions...)
Draw the sine and cosine functions on the same graph. You first have to use two datasets (one for the sine function, the other for the cosine function).
Do the same with one dataset and 2
geom_line
.Do the same with one dataset and one
geom_line
. Hint: use the pivot_longer function from the tidyverse package.Draw the two functions on two different graphs (use facet_wrap).
Do the same with the function grid.arrange from the package gridExtra.
Exercise 1.10 (Classical representations for a dataset) We consider the mtcars dataset:
data(mtcars)
summary(mtcars)
mpg cyl disp :10.40 Min. :4.000 Min. : 71.1
Min. :15.43 1st Qu.:4.000 1st Qu.:120.8
1st Qu.:19.20 Median :6.000 Median :196.3
Median :20.09 Mean :6.188 Mean :230.7
Mean :22.80 3rd Qu.:8.000 3rd Qu.:326.0
3rd Qu.:33.90 Max. :8.000 Max. :472.0
Max.
hp drat wt : 52.0 Min. :2.760 Min. :1.513
Min. : 96.5 1st Qu.:3.080 1st Qu.:2.581
1st Qu.:123.0 Median :3.695 Median :3.325
Median :146.7 Mean :3.597 Mean :3.217
Mean :180.0 3rd Qu.:3.920 3rd Qu.:3.610
3rd Qu.:335.0 Max. :4.930 Max. :5.424
Max.
qsec vs am :14.50 Min. :0.0000 Min. :0.0000
Min. :16.89 1st Qu.:0.0000 1st Qu.:0.0000
1st Qu.:17.71 Median :0.0000 Median :0.0000
Median :17.85 Mean :0.4375 Mean :0.4062
Mean :18.90 3rd Qu.:1.0000 3rd Qu.:1.0000
3rd Qu.:22.90 Max. :1.0000 Max. :1.0000
Max.
gear carb :3.000 Min. :1.000
Min. :3.000 1st Qu.:2.000
1st Qu.:4.000 Median :2.000
Median :3.688 Mean :2.812
Mean :4.000 3rd Qu.:4.000
3rd Qu.:5.000 Max. :8.000 Max.
Draw the histogram of
mpg
(use many number of bins).Always for the histogram, represent the density on the \(y\)-axis.
Draw the barplot of
cyl
.Draw the scatter plot
disp vs mpg
with one color for each value ofcyl
.Add the linear smoothers, one linear smoother for each value of
cyl
.
Exercise 1.11 (Residuals for a simple regression model)
Simulate a sample \((x_i,y_i),i=1,\dots,100\) according to the linear model \[y_i=3+x_i+\varepsilon_i\] where the \(x_i\)’s are i.i.d. with uniform distribution over \([0,1]\) and \(\varepsilon_i\)’s are gaussian \(N(0,0.2^2)\) (use runif and rnorm)
Draw the scatter plot y vs x and add the linear smoother.
Draw the residuals: add a vertical line from each point to the linear smoother (use geom_segment).
Exercise 1.12 (Challenge) We consider the diamonds dataset.
Draw the following graphs (use coord_flip for the second one).
Add on the third graph a vertical line for the quartiles of the variable carat (for each value of cut).
Add a horizontal boxplot under each graph to obtain the following chart: