Where are we?

We will follow MDSR, Section 3.2, so let’s load the mdsr package into the R session.

library(mdsr)

Over time, statisticians have developed standard data graphics for specific use cases. While these data graphics are not always mesmerizing, they are hard to beat for simple effectiveness. Every data scientist should know how to make and interpret these canonical data graphics; ignore them at your peril.

One numeric variable

It is generally useful to understand how a single variable is distributed. If that variable is numeric, then its distribution is commonly summarized graphically using a histogram or density plot. Using the ggplot2 package, we can display either plot for the math variable in the SAT_2010 data frame by binding the math variable to the x aesthetic.

g <- ggplot(data = SAT_2010, aes(x = math))

Then we only need to choose either geom_histogram() or geom_density(). Both convey the same information, but whereas the histogram uses pre-defined bins to create a discrete distribution, a density plot uses a kernel smoother to make a continuous curve.

g + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
g + geom_density()

Note that what is displayed in the histogram is not the raw data. Instead, geom_histogram() computes a new variable, count, which is the number of observations that fall in each bin, and maps it to the height of the bars.
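To make the binning step concrete, here is a minimal sketch in base R. The vector x is simulated stand-in data (an assumption, not the SAT scores), and the break points are chosen by hand. ggplot2 does an analogous computation internally, although its default bin edges may differ.

```r
# Simulated stand-in for a numeric variable (not the real SAT data)
set.seed(1)
x <- rnorm(200, mean = 500, sd = 50)

# Bin edges of width 30 that cover the whole range of x
breaks <- seq(floor(min(x) / 30) * 30, ceiling(max(x) / 30) * 30, by = 30)

# Count how many observations fall into each bin
bin_counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))
bin_counts

# hist() computes the same counts; plot = FALSE suppresses the drawing
h <- hist(x, breaks = breaks, plot = FALSE)
h$counts
```

The two sets of counts agree, and they add up to the number of observations. The bars of a histogram are just these counts drawn as rectangles.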

Running geom_histogram() without arguments should have generated a message about the number of bins. Many modern statistical methods are in fact tunable; that is, the final result depends on parameters that you supply. Later, we will learn how to deal with this. For now, let us play with the histogram and density estimate by setting binwidth = 30 for geom_histogram() and bw = 5 for geom_density(). You will see the role of these options by re-executing the code with different numbers.

g + geom_histogram(binwidth = 30)
g + geom_density(bw = 5)
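The effect of the bandwidth can also be seen outside ggplot2. The sketch below uses stats::density() on simulated data (an assumption, not the SAT scores) and compares a small and a large bandwidth. The values 5 and 50 are chosen only for illustration.

```r
# Simulated stand-in for a numeric variable (not the real SAT data)
set.seed(1)
x <- rnorm(200, mean = 500, sd = 50)

d_narrow <- density(x, bw = 5)    # little smoothing: wiggly curve
d_wide   <- density(x, bw = 50)   # heavy smoothing: flat, smooth curve

# Draw both estimates on the same axes
plot(d_narrow, main = "Effect of the bandwidth")
lines(d_wide, lty = 2)

# Compare peak heights of the two estimates
c(narrow = max(d_narrow$y), wide = max(d_wide$y))
```

A small bandwidth follows the data closely and shows many bumps, while a large bandwidth spreads each observation widely and flattens the curve. Neither choice is correct in general; the bandwidth is a tuning parameter.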

The quantile-quantile plot is very useful for comparing an empirical univariate distribution (supplied through the sample aesthetic) with a theoretical distribution. To check visually whether the math variable follows a normal distribution:

ggplot(data = SAT_2010) + 
  geom_qq(aes(sample = math))
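What the quantile-quantile plot draws can be reproduced by hand. The following sketch uses simulated data (an assumption, not the SAT scores) and pairs the ordered sample values with the matching standard normal quantiles.

```r
# Simulated stand-in for a numeric variable (not the real SAT data)
set.seed(1)
x <- rnorm(50, mean = 500, sd = 50)

n <- length(x)
theoretical <- qnorm(ppoints(n))   # standard normal quantiles
empirical   <- sort(x)             # ordered sample values

# Points close to a straight line suggest an approximately normal sample
plot(theoretical, empirical,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(x)   # reference line through the first and third quartiles
```

In ggplot2, adding geom_qq_line() with the same sample aesthetic draws a comparable reference line, which makes departures from normality easier to judge.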

One categorical variable

If your variable is categorical, it doesn’t make sense to think about the values as having a continuous density. Instead, we can use bar graphs to display the distribution of a categorical variable. We use the data set mosaicData::HELPrct (“Health Evaluation and Linkage to Primary Care”) as an example; if HELPrct is not found in your session, load it with library(mosaicData). Confirm that the variables homeless, substance, sex, and female are categorical.

The distribution of the categorical variable homeless is given by the counts of each category, which is displayed by geom_bar().

g <- ggplot(data = HELPrct, aes(x = homeless))
g + geom_bar()

The distribution is relatively simple: it belongs to the family of Bernoulli distributions. For a distribution this simple, a table is often clearer than a plot.

with(HELPrct, table(homeless))
## homeless
## homeless   housed 
##      209      244
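A Bernoulli distribution is described by a single proportion, so it can help to convert the counts to proportions. The sketch below copies the two counts from the table above rather than reading the data.

```r
# Counts copied from the table above
counts <- c(homeless = 209, housed = 244)

# Convert counts to proportions of the 453 participants
props <- counts / sum(counts)
round(props, 3)
```

About 46% of the participants are homeless. With the data frame loaded, prop.table(table(HELPrct$homeless)) should give the same proportions.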

Two categorical variables

When there are two categorical variables, displaying a contingency table is your best option, as long as the number of categories is small.

with(HELPrct, table(substance,homeless))
##          homeless
## substance homeless housed
##   alcohol      103     74
##   cocaine       59     93
##   heroin        47     77

But suppose you want to display the distribution using a bar graph. A natural thing to do would be to expand the coordinate system used in the regular bar plot: use the x coordinate for homeless and the y coordinate for substance, so that the length of the bars (previously mapped to the y coordinate) is now mapped to the z coordinate. This results in a 3D bar plot, a type of display that statisticians have widely criticized. Instead, we can use a flattened version of it, in which the visual cue of color shows the count of each category.

To create the graphic above, we need to be able to manipulate the data, and that step is not shown here. For a small data set like this one, the tile plot is not so effective.
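One possible way to build such a tile plot is sketched below. This is not necessarily how the original graphic was made: the counts are copied by hand from the contingency table above, and the names tab and cells are introduced here only for illustration.

```r
library(ggplot2)

# Counts copied from the contingency table above
tab <- as.table(matrix(
  c(103, 59, 47,    # homeless column
     74, 93, 77),   # housed column
  nrow = 3,
  dimnames = list(substance = c("alcohol", "cocaine", "heroin"),
                  homeless  = c("homeless", "housed"))
))

# Long format: one row per (substance, homeless) cell, count in Freq
cells <- as.data.frame(tab)

# One tile per cell, with color showing the count
ggplot(cells, aes(x = homeless, y = substance, fill = Freq)) +
  geom_tile()
```

With only six cells, the reader has to compare shades of color to read the counts, which is harder than comparing bar lengths.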

Instead, we will expand the bar plot. We still use the x coordinate to differentiate the two possible values of the homeless variable, and the count is still shown by the visual cue of the length of the bar. Let us use color to map the variable substance. This gives a stacked bar plot.

ggplot(data = HELPrct, aes(x = homeless, fill = substance)) + 
  geom_bar() 

# Why not use `color` in place of `fill`?

Position adjustments determine how to arrange geoms that would otherwise occupy the same space.

g <- ggplot(data = HELPrct, aes(x = homeless, fill = substance)) + 
  coord_flip()
g + geom_bar(position = "stack") 
g + geom_bar(position = "dodge") 
g + geom_bar(position = "fill")

Note that we have used the coord_flip() function to display the bars horizontally instead of vertically.

Notice that by using position = "fill", we are plotting a different quantity: a conditional proportion, which estimates a conditional probability. It can be used effectively to answer conditional and comparative questions.

  1. What is the proportion (or probability) of alcohol use among the homeless?

  2. Is there a higher chance of using heroin if the person is housed, as opposed to being homeless?
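The proportions behind position = "fill" can also be computed directly. The sketch below copies the counts from the contingency table shown earlier and divides each column by its total; the names tab and cond are introduced only for illustration.

```r
# Counts copied from the contingency table shown earlier
tab <- as.table(matrix(
  c(103, 59, 47,    # homeless column
     74, 93, 77),   # housed column
  nrow = 3,
  dimnames = list(substance = c("alcohol", "cocaine", "heroin"),
                  homeless  = c("homeless", "housed"))
))

# Divide each column by its total: proportion of each substance
# within a housing status
cond <- prop.table(tab, margin = 2)
round(cond, 3)
```

Each column of cond sums to 1. About 49% of the homeless participants report alcohol as their primary substance, and heroin is reported by about 32% of the housed participants compared with about 22% of the homeless participants.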

How would you answer the following questions?

  1. What is the proportion of homeless among all people whose primary substance of abuse is heroin?

  2. What is the proportion of homeless among all people whose primary substance of abuse is cocaine?

  3. Are those two proportions noticeably different?

g <- ggplot(data = HELPrct, aes(x = substance, fill = homeless )) + 
  coord_flip()
g + geom_bar(position = "stack") 
g + geom_bar(position = "dodge") 
g + geom_bar(position = "fill")

This method of graphical display enables a more direct comparison of proportions than would be possible using two pie charts. In this case, it is clear that homeless participants were more likely to identify as being involved with alcohol as their primary substance of abuse. However, like pie charts, bar charts are sometimes criticized for having a low data-to-ink ratio. That is, they use a comparatively large amount of ink to depict relatively few data points.

Discrete variables that are numeric and ordinal, but not continuous

A typical example of this type of variable is HELPrct$pss_fr, which records the perceived social support by friends (measured at baseline; higher scores indicate more support).

head(HELPrct$pss_fr,10)
##  [1]  0  1 13 11 10  5  1  4  5  0
# table(HELPrct$pss_fr)
ggplot(data = HELPrct, aes(x = pss_fr)) + 
  geom_bar()

The resulting bar plot is now very similar to a histogram (they are indeed the same!), and it now makes sense to discuss the shape and location of the distribution. Many well-known statistical distributions are of this type: the Bernoulli, binomial, geometric, and Poisson distributions, among others.

Comparing two or more univariate distributions

We discussed how a stacked bar plot is used to compare two categorical variables. To compare multiple univariate distributions of a numeric variable, the side-by-side boxplot is the best option because of its simplicity.

ggplot(HELPrct, aes(x = homeless, y = pcs)) + 
  geom_boxplot() +
  facet_wrap(~ substance)

Two numeric variables

Use the 2D scatterplot.

ggplot(SAT_2010, aes(math, salary)) +
  geom_point()

Note that the density function of a bivariate distribution can be pictured as a mountain over the plane. A convenient way to overlay a smoothed density estimate on the scatterplot is geom_density2d(), which first estimates the density heights and then turns the mountain into a set of contours (curves of equal elevation), drawn as lines on top of the points.

ggplot(SAT_2010, aes(math, salary)) +
  geom_point() + 
  geom_density2d()
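The two steps, estimating heights and then drawing contours, can be sketched in base R graphics with MASS::kde2d(). The data below are simulated (an assumption, not the SAT data), and the grid size of 50 is an arbitrary choice.

```r
# Simulated stand-ins for two related numeric variables
set.seed(1)
x <- rnorm(100, mean = 500, sd = 50)
y <- 0.1 * x + rnorm(100, sd = 5)

# Step 1: estimate the density height on a 50 x 50 grid
dens <- MASS::kde2d(x, y, n = 50)
dim(dens$z)   # one height per grid point

# Step 2: draw the points, then overlay curves of equal density
plot(x, y)
contour(dens, add = TRUE)
```

The contours are tightest where the points are most concentrated, which is the top of the mountain.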

Note that by using a scatterplot, we signal that we are interested in the relationship between the two variables.