Implementing the grammar of graphics using ggplot2
This section introduces ggplot2 and how it is based on the grammar of graphics. We will follow MDSR, section 3.2, so let’s load the mdsr package into the R session.
library(mdsr)
Over time, statisticians have developed standard data graphics for specific use cases. While these data graphics are not always mesmerizing, they are hard to beat for simple effectiveness. Every data scientist should know how to make and interpret these canonical data graphics; they are ignored at your peril.
It is generally useful to understand how a single variable is distributed. If that variable is numeric, then its distribution is commonly summarized graphically using a histogram or density plot. Using the ggplot2 package, we can display either plot for the math variable in the SAT_2010 data frame by binding the math variable to the x aesthetic.
g <- ggplot(data = SAT_2010, aes(x = math))
Then we only need to choose either geom_histogram() or geom_density(). Both convey the same information, but whereas the histogram uses pre-defined bins to create a discrete distribution, a density plot uses a kernel smoother to make a continuous curve.
g + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
g + geom_density()
Note that what is displayed in the histogram is not the raw data. Instead, geom_histogram() computes a new variable, count (the number of points in each bin), through stat_bin() and displays that.
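To see the computed variable for yourself, one option is to pull the binned data out of the plot. The sketch below assumes that mdsr provides SAT_2010 as above and uses layer_data() from ggplot2; the object names p_hist and binned are only illustrative.

```r
library(ggplot2)
library(mdsr)   # assumed to provide the SAT_2010 data frame used above

# Build the histogram object (bins = 30 matches the default reported above)
p_hist <- ggplot(data = SAT_2010, aes(x = math)) +
  geom_histogram(bins = 30)

# layer_data() returns the data frame computed by the layer's stat
binned <- layer_data(p_hist)

# Each row is one bin: its boundaries and the number of observations in it
head(binned[, c("xmin", "xmax", "count")])

# The bin counts add up to the number of non-missing observations
sum(binned$count) == sum(!is.na(SAT_2010$math))
```

The count column is what gets mapped to the height of the bars, not the math values themselves.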
Running geom_histogram() without specifying the bins should have generated a message. Many modern statistical methods are in fact tunable; that is, the final result depends on your input. Later, we will learn how to deal with this. For now, let us play with the histogram and density estimate by setting binwidth = 30 for geom_histogram() and bw = 5 for geom_density(). You will see the role of these options by re-executing with different numbers.
g + geom_histogram(binwidth = 30)
g + geom_density(bw = 5)
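If you prefer not to re-run the commands by hand, a small loop can draw the plots for several settings in one go. This is only a sketch: the candidate values below are arbitrary choices for illustration, not recommended settings.

```r
library(ggplot2)
library(mdsr)   # assumed to provide SAT_2010

g <- ggplot(data = SAT_2010, aes(x = math))

# One histogram per candidate bin width (values chosen arbitrarily)
hist_plots <- lapply(c(10, 30, 60), function(w) {
  g + geom_histogram(binwidth = w) + ggtitle(paste("binwidth =", w))
})

# One density plot per candidate kernel bandwidth (values chosen arbitrarily)
dens_plots <- lapply(c(2, 5, 20), function(b) {
  g + geom_density(bw = b) + ggtitle(paste("bw =", b))
})

# Draw all six plots, one after another
invisible(lapply(c(hist_plots, dens_plots), print))
```

Small values give a jagged picture that follows individual data points, while large values smooth away most of the shape.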
The quantile-quantile plot is very useful for comparing an empirical univariate distribution (supplied through the sample aesthetic) with a theoretical distribution. To visually check whether the math variable follows a normal distribution:
ggplot(data = SAT_2010) +
  geom_qq(aes(sample = math))
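A reference line often makes the plot easier to judge. The sketch below adds geom_qq_line() from ggplot2, which by default draws a line through the first and third quartiles; the name p_qq is illustrative.

```r
library(ggplot2)
library(mdsr)   # assumed to provide SAT_2010

p_qq <- ggplot(data = SAT_2010, aes(sample = math)) +
  geom_qq() +       # sample quantiles against theoretical normal quantiles
  geom_qq_line()    # reference line through the first and third quartiles
p_qq
```

Points that stay close to the line are consistent with a normal distribution; systematic curvature away from it suggests skewness or heavy tails.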
If your variable is categorical, it doesn’t make sense to think about the values as having a continuous density. Instead, we can use bar graphs to display the distribution of a categorical variable. We use the data set mosaicData::HELPrct (“Health Evaluation and Linkage to Primary Care”) as an example. Confirm that the variables homeless, substance, sex, and female are categorical.
The distribution of the categorical variable homeless is given by the counts of each category, which are displayed by geom_bar().
g <- ggplot(data = HELPrct, aes(x = homeless))
g + geom_bar()
The distribution is relatively simple and belongs to a family of distributions called the Bernoulli distribution. For a simple distribution like this, a table is better.
with(HELPrct, table(homeless))
## homeless
## homeless   housed
##      209      244
When there are two categorical variables, displaying a contingency table is your best option, as long as the number of categories is small.
with(HELPrct, table(substance, homeless))
##          homeless
## substance homeless housed
##   alcohol      103     74
##   cocaine       59     93
##   heroin        47     77
But suppose you want to display the distribution using a bar graph. A natural approach would be to expand the coordinate system used in the regular bar plot: use the x coordinate for homeless and the y coordinate for substance, so that the length of the bars (previously mapped to the y coordinate) is now mapped to the z coordinate. This results in a 3D bar plot, a form that statisticians have widely criticized. Instead, we can use a flattened version of this, a tile plot, in which the visual cue of color maps the count of each category.
To create the tile plot above, we first need to manipulate the data; that step is not shown here. For a small data set, the tile plot is not so effective.
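For the curious, here is one possible way to build a tile plot of this kind. It is a sketch and not necessarily how the original figure was produced: it assumes the mosaicData package is installed, counts the combinations with table(), and maps the resulting Freq column to fill. The names counts and p_tile are illustrative.

```r
library(ggplot2)
data("HELPrct", package = "mosaicData")   # assumes mosaicData is installed

# Count each (substance, homeless) combination; as.data.frame() on a
# table yields one row per cell with the count in a column named Freq
counts <- as.data.frame(with(HELPrct, table(substance, homeless)))

# One tile per combination, with color encoding the count
p_tile <- ggplot(counts, aes(x = homeless, y = substance, fill = Freq)) +
  geom_tile()
p_tile
```

With only six cells, the reader has to compare shades of color to read off the counts, which is harder than comparing bar lengths.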
Instead, we will expand the bar plot. You would still use the x coordinate to differentiate the two possible values of the homeless variable, with the count in each category shown by the length of the bar. Let us use color to map the variable substance. This gives a stacked bar plot.
ggplot(data = HELPrct, aes(x = homeless, fill = substance)) +
  geom_bar()
# Why not use `color` in place of `fill`?
Position adjustments determine how to arrange geoms that would otherwise occupy the same space.
g <- ggplot(data = HELPrct, aes(x = homeless, fill = substance)) +
  coord_flip()
g + geom_bar(position = "stack")
g + geom_bar(position = "dodge")
g + geom_bar(position = "fill")
Note that we have used the coord_flip() function to display the bars horizontally instead of vertically.
Notice that by using position = "fill", we are plotting a different quantity, the conditional proportion (an estimate of the conditional probability) of each substance within each housing group. It can be used effectively to answer conditional and comparative questions such as:
What is the proportion (or probability) of alcohol use among the homeless?
Is there a higher chance of using heroin if the person is housed, as opposed to being homeless?
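The proportions behind the position = "fill" plot can also be computed directly, which is a handy check on what you read from the graphic. The sketch below assumes mosaicData is installed and uses base R's prop.table(); the names tab and by_housing are illustrative.

```r
data("HELPrct", package = "mosaicData")   # assumes mosaicData is installed

# Contingency table of primary substance by housing status
tab <- with(HELPrct, table(substance, homeless))

# Condition on housing status: margin = 2 makes each column sum to 1
by_housing <- prop.table(tab, margin = 2)
round(by_housing, 3)

# Proportion with alcohol as primary substance among the homeless
by_housing["alcohol", "homeless"]

# Heroin among the housed versus heroin among the homeless
by_housing["heroin", c("housed", "homeless")]
```

From the contingency table shown earlier, the first proportion is 103/209, and the heroin comparison is 77/244 against 47/209.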
How would you answer the following questions?
What is the proportion of homeless among all people whose primary substance of abuse is heroin?
What is the proportion of homeless among all people whose primary substance of abuse is cocaine?
Are those two proportions noticeably different?
g <- ggplot(data = HELPrct, aes(x = substance, fill = homeless)) +
  coord_flip()
g + geom_bar(position = "stack")
g + geom_bar(position = "dodge")
g + geom_bar(position = "fill")
This method of graphical display enables a more direct comparison of proportions than would be possible using two pie charts. In this case, it is clear that homeless participants were more likely to identify as being involved with alcohol as their primary substance of abuse. However, like pie charts, bar charts are sometimes criticized for having a low data-to-ink ratio. That is, they use a comparatively large amount of ink to depict relatively few data points.
A typical example of a discrete numeric variable is HELPrct$pss_fr, which records the perceived social support by friends (measured at baseline; higher scores indicate more support).
head(HELPrct$pss_fr, 10)
##  [1]  0  1 13 11 10  5  1  4  5  0
# table(HELPrct$pss_fr)
ggplot(data = HELPrct, aes(x = pss_fr)) +
  geom_bar()
The resulting bar plot is now very similar to a histogram (they are indeed the same!), and it now makes sense to discuss the shape and location of the distribution. Many well-known statistical distributions are actually of this type: Bernoulli, Binomial, Geometric, Poisson distributions and so on.
We discussed how a stacked bar plot is used to compare two categorical variables. To compare multiple univariate distributions of a numeric variable, the side-by-side boxplot is the best option for its simplicity.
ggplot(HELPrct, aes(x = homeless, y = pcs)) +
  geom_boxplot() +
  facet_wrap(~ substance)
To display two numeric variables jointly, use the 2D scatterplot.
ggplot(SAT_2010, aes(math, salary)) +
  geom_point()
Note that the density function for a bivariate distribution is graphed as a mountain. The best option to overlay the smoothed density estimate is to use geom_density2d(), which first computes estimates of density heights, then transforms the mountain into sets of contours (points of equal elevation), and draws each contour as a colored line.
ggplot(SAT_2010, aes(math, salary)) +
  geom_point() +
  geom_density2d()
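By default the contour lines are all drawn in a single color. If you want the color to carry the elevation as well, one option is to map the computed level variable to the color aesthetic. This is a sketch that assumes a ggplot2 version providing after_stat() (3.3.0 or later); the name p_contour is illustrative.

```r
library(ggplot2)
library(mdsr)   # assumed to provide SAT_2010

p_contour <- ggplot(SAT_2010, aes(math, salary)) +
  geom_point() +
  # level is computed by the 2D density stat: the estimated density
  # height of each contour; mapping it to color shades the contours
  geom_density_2d(aes(color = after_stat(level)))
p_contour
```

Contours with a higher level mark the regions where the points are most concentrated.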
Note that by using a scatterplot, we are interested in the relation between the two variables.