STAT 1291: Data Science

Lecture 3 - Data Visualization: What is a good graphic?

Sungkyu Jung

Where are we?

Understanding Univariate Data

Look at the distribution of univariate data

\[F(a) = \mathrm{Prob}(\mathrm{Height} \le a)\]

Distributions

Histograms show: \(F(b)-F(a)\) for several intervals \((a,b]\)

Easier to interpret than cumulative distribution functions
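The link between the histogram and \(F\) can be checked numerically. The sketch below uses simulated heights (an assumption for illustration, not the data behind these slides) and compares the proportions \(F(b)-F(a)\) from the empirical CDF with the bin proportions that `hist()` reports.

```r
# Simulated heights: an assumption for illustration, not the lecture's data
set.seed(1)
height <- rnorm(1000, mean = 70, sd = 3)

# Empirical CDF: Fhat(a) = proportion of observations <= a
Fhat <- ecdf(height)

# Interval endpoints that cover the whole sample
breaks <- seq(floor(min(height)), ceiling(max(height)), by = 1)

# Proportion in each interval (a, b], computed as Fhat(b) - Fhat(a)
prop_from_cdf <- diff(Fhat(breaks))

# The same proportions from hist(); right = TRUE gives intervals (a, b]
h <- hist(height, breaks = breaks, right = TRUE, plot = FALSE)
prop_from_hist <- h$counts / length(height)

# TRUE when the two computations agree
all.equal(prop_from_cdf, prop_from_hist)
```

Calling `plot(h)` draws the histogram, and `plot(Fhat)` draws the cumulative distribution function for comparison.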

Normal Approximation

The distributions of many outcomes in nature are well approximated by the normal distribution:

Normal Approximation

If our data follow the normal distribution, then \(\mu\) and \(\sigma\) are a sufficient summary: they tell us everything about the distribution!

All we need to know is \(\mu\) and \(\sigma\)

       Average SD
Male        70  3
Female      65  3
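As a minimal sketch of why the two numbers suffice, any proportion can be computed from the average and SD alone with `pnorm()`. The cutoffs 73, 62 and 68 are arbitrary choices for illustration, and the heights are assumed to be in inches.

```r
# Summary table from the slide (heights assumed to be in inches)
summary_stats <- data.frame(
  group   = c("Male", "Female"),
  average = c(70, 65),
  sd      = c(3, 3)
)

# Under the normal approximation, F(a) = pnorm(a, mean, sd)

# Proportion of males with height <= 73 (one SD above the average)
p_male_le_73 <- pnorm(73, mean = 70, sd = 3)

# Proportion of females with height in (62, 68] (within one SD of the average)
p_female_62_68 <- pnorm(68, mean = 65, sd = 3) - pnorm(62, mean = 65, sd = 3)

round(c(p_male_le_73, p_female_62_68), 3)
# 0.841 0.683
```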

How good is the normal approximation?

Here are the observed proportions of male heights at or below each value (Real) and the corresponding normal approximations (Approx):

  Height Real Approx
1     63 0.02   0.03
2     65 0.07   0.06
3     67 0.16   0.10
4     68 0.31   0.31
5     70 0.50   0.44
6     71 0.69   0.68
7     73 0.84   0.88
8     75 0.93   0.95
9     76 0.98   0.99
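A table of this kind could be produced along the following lines. This is a sketch on simulated heights, so the numbers will not match the table above; the variable names are hypothetical.

```r
# Simulated male heights: an assumption, not the data behind the table above
set.seed(2)
height <- rnorm(500, mean = 70, sd = 3)

# Same cutoffs as in the table
cutoffs <- c(63, 65, 67, 68, 70, 71, 73, 75, 76)

# "Real": observed proportion of heights <= each cutoff
real <- sapply(cutoffs, function(a) mean(height <= a))

# "Approx": normal CDF using only the sample average and SD
approx_prop <- pnorm(cutoffs, mean = mean(height), sd = sd(height))

comparison <- data.frame(
  Height = cutoffs,
  Real   = round(real, 2),
  Approx = round(approx_prop, 2)
)
comparison
```

The closer the two columns, the better the normal approximation describes the data.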

QQ-plots

Observed versus normal approximation quantiles
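A QQ-plot can be built by hand as sketched below, again on simulated heights (an assumption). Points close to the identity line indicate that the normal approximation fits well.

```r
# Simulated heights: an assumption for illustration
set.seed(3)
height <- rnorm(500, mean = 70, sd = 3)

# Probabilities at which to compare quantiles
p <- seq(0.05, 0.95, by = 0.05)

# Observed quantiles from the data
observed <- quantile(height, probs = p)

# Quantiles predicted by the normal approximation
theoretical <- qnorm(p, mean = mean(height), sd = sd(height))

# QQ-plot with the identity line as a reference
plot(theoretical, observed,
     xlab = "Normal approximation quantiles",
     ylab = "Observed quantiles")
abline(0, 1)
```

Base R also provides `qqnorm(height)` followed by `qqline(height)`, which plots every observation against standard normal quantiles.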

Two variables

Normal approximation for two variables

Many pairs of variables are approximately bivariate normal

Regression line

The regression line is defined by this formula \[ \frac{Y- \mu_Y}{\sigma_Y} =\rho \frac{X - \mu_X}{\sigma_X}\]

For bivariate normal data, five parameters are a sufficient summary: \[ \mu_X , \mu_Y ,\sigma_X, \sigma_Y, \rho \]
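Solving the regression formula for \(Y\) gives slope \(\rho\,\sigma_Y/\sigma_X\) and intercept \(\mu_Y - \text{slope}\cdot\mu_X\). The sketch below checks this against `lm()` on simulated data; the data-generating values (68, 70, 3, 0.5) are assumptions chosen for illustration.

```r
# Simulated pairs (x, y) with correlation about 0.5: an assumption for illustration
set.seed(4)
n <- 1000
x <- rnorm(n, mean = 68, sd = 3)
y <- 70 + 0.5 * (x - 68) + rnorm(n, mean = 0, sd = 3 * sqrt(1 - 0.5^2))

# Sample versions of the five parameters
mu_x <- mean(x); mu_y <- mean(y)
sigma_x <- sd(x); sigma_y <- sd(y)
rho <- cor(x, y)

# Regression line: (Y - mu_Y) / sigma_Y = rho * (X - mu_X) / sigma_X
slope <- rho * sigma_y / sigma_x
intercept <- mu_y - slope * mu_x

# Least-squares fit for comparison
fit <- lm(y ~ x)

# TRUE: the formula reproduces the least-squares coefficients
all.equal(unname(coef(fit)), c(intercept, slope))
```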

Anscombe’s quartet
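R ships the quartet as the built-in data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`). The sketch below shows that the four data sets have nearly identical five-number summaries \((\mu_X, \mu_Y, \sigma_X, \sigma_Y, \rho)\), yet look very different when plotted.

```r
# Summary statistics for each of the four data sets in the quartet
quartet_stats <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y), cor = cor(x, y))
}))
round(quartet_stats, 2)

# Four scatterplots, each with its least-squares line
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Data set", i))
  abline(lm(y ~ x))
}
par(op)
```

The summaries agree to about two decimal places, so only the plots reveal the curvature and outliers.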


Visualization goals

Data

Structured data (a data set) are described by “variables” (columns) and “cases” (rows)

Example: NHANES

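# Names of all variables (columns) in the NHANES data set
varNames <- colnames(NHANES::NHANES)
# First six cases (rows) for seven selected variables
head(NHANES::NHANES[,c(3,5,9,11,14,17,20)])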
##   Gender AgeDecade    Education    HHIncome HomeRooms Weight Height
## 1   male     30-39  High School 25000-34999         6   87.4  164.7
## 2   male     30-39  High School 25000-34999         6   87.4  164.7
## 3   male     30-39  High School 25000-34999         6   87.4  164.7
## 4   male       0-9         <NA> 20000-24999         9   17.0  105.4
## 5 female     40-49 Some College 35000-44999         5   86.7  168.4
## 6   male       0-9         <NA> 75000-99999         6   29.8  133.1

Examples of statistical graphics

Let’s first browse some options in visualizing the data

Some common graphical elements will be identified, and we will revisit those formally

See DataVis-Supp.pdf

What makes a graphic effective?

A classic text “The Visual Display of Quantitative Information” by Edward Tufte answers this question.

Continue with DataVis-Supp.pdf, and read the excerpt at http://cs.unm.edu/~pgk/IVCDs14/minitufte.pdf