STAT 1291: Data Science

Lecture 4 - Data Visualization: Composing/dissecting Data Graphics

Sungkyu Jung

Where are we?

A taxonomy for data graphics

  1. visual cues
  2. coordinate system
  3. scale
  4. context

1. Visual Cues

  1. Position (numerical): where in relation to other things?
  2. Length (numerical): how big (in one dimension)?
  3. Angle (numerical): how wide? parallel to something else?
  4. Direction (numerical): at what slope? In a time series, going up or down?
  5. Shape (categorical): belonging to which group?
  6. Area (numerical): how big (in two dimensions)?
  7. Volume (numerical): how big (in three dimensions)?
  8. Shade and color (color saturation and color hue): to what extent? how severely? Beware of red/green color blindness.

Note: Color can represent both quantitative and categorical variables.
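As a rough illustration of the two uses of color, the sketch below maps a quantitative value to saturation at a fixed hue, and maps categories to distinct hues. It is only one possible scheme. The function names, the default hue of 0.6, and the saturation and brightness constants are assumptions made for this example, not part of the lecture.

```python
import colorsys


def sequential_color(value, hue=0.6):
    """Map a value in [0, 1] to an RGB triple by varying saturation at a fixed hue.

    0 gives white (no saturation); 1 gives the fully saturated hue.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    return colorsys.hsv_to_rgb(hue, value, 1.0)


def qualitative_colors(categories):
    """Give each category its own hue, spaced evenly around the color wheel."""
    n = len(categories)
    return {
        cat: colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        for i, cat in enumerate(categories)
    }


# Quantitative: "to what extent?" is read from how saturated the color is.
print(sequential_color(0.25), sequential_color(0.75))

# Categorical: "which group?" is read from the hue.
print(qualitative_colors(["Democrat", "Republican", "Independent"]))
```

Evenly spaced hues can still place a red next to a green, so in practice a palette designed for color-blind viewers would be a safer choice than this sketch.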

Which visual cues are more effective?

2. Coordinate systems

How are the data points organized? While any number of coordinate systems are possible, three are most common:

An appropriate choice of coordinate system is critical for representing one’s data accurately. For example, displaying spatial data like airline routes on a flat Cartesian plane can lead to gross distortions of reality.
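To get a feel for the size of this distortion, the sketch below compares the great-circle distance between two points (using the haversine formula on a spherical Earth) with the distance obtained by treating latitude and longitude as flat Cartesian coordinates. The function names, the mean Earth radius of 6371 km, and the two points at 60° N are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the sphere is itself an approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def flat_plane_km(lat1, lon1, lat2, lon2):
    """Distance in km if degrees are (wrongly) treated as flat Cartesian units."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    return km_per_degree * math.hypot(lat2 - lat1, lon2 - lon1)


# Two hypothetical points on the 60th parallel, 10 degrees of longitude apart.
true_km = great_circle_km(60, 0, 60, 10)
flat_km = flat_plane_km(60, 0, 60, 10)
print(f"great circle: {true_km:.0f} km, flat plane: {flat_km:.0f} km")
```

At this latitude the flat-plane figure is roughly twice the great-circle distance, because meridians converge toward the poles while a Cartesian grid keeps them parallel.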

3. Scale

Scales translate values into visual cues, and the choice of scale is often crucial. The central question is: how does distance in the data graphic translate into meaningful differences in quantity? Each coordinate axis can have its own scale, for which we have three different choices:

  1. Numeric: A numeric quantity is most commonly set on a linear, logarithmic, or percentage scale.

  2. Categorical: A categorical variable may have no ordering (e.g., Democrat, Republican, or Independent), or it may be ordinal (e.g., never, former, or current smoker).

  3. Time: Time is a numeric quantity that has some special properties. First, because of the calendar, it can be demarcated by a series of different units (e.g., year, month, day, etc.). Second, it can be considered periodically as a wrap-around scale.
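A scale can be thought of as a function from data values to positions on an axis. The sketch below, with hypothetical function names and an axis normalized to the interval [0, 1], contrasts a linear scale with a logarithmic one, and shows one way to treat months as a wrap-around scale by mapping them to angles on a circle.

```python
import math


def linear_scale(value, lo, hi):
    """Axis position in [0, 1]: equal differences in value give equal distances."""
    return (value - lo) / (hi - lo)


def log_scale(value, lo, hi):
    """Axis position in [0, 1]: equal ratios in value give equal distances.

    Requires value, lo, and hi to be positive.
    """
    return (math.log10(value) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))


def month_angle(month):
    """Wrap-around time scale: month number to degrees on a circle (January = 0)."""
    return ((month - 1) % 12) * 30.0


# On an axis running from 1 to 1000, the values 1, 10, 100, 1000 crowd together
# near zero on the linear scale but are evenly spaced on the log scale.
for v in (1, 10, 100, 1000):
    print(v, round(linear_scale(v, 1, 1000), 3), round(log_scale(v, 1, 1000), 3))

# December and January are 11 units apart on a linear month axis,
# but neighbors (30 degrees apart) on the wrap-around scale.
print(month_angle(12), month_angle(1))
```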

Use data transformation (mutation) to choose the most effective scale
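One simple transformation of this kind is converting raw counts to percentages, so that groups of different sizes can be compared on a common scale. The sketch below adds a derived column to a small table held as a list of dictionaries. The function name, column names, and counts are hypothetical and serve only to illustrate the idea.

```python
def add_percent(rows, count_key="count", out_key="percent"):
    """Return new rows with an extra column: each count as a percentage of the total."""
    total = sum(row[count_key] for row in rows)
    return [{**row, out_key: 100.0 * row[count_key] / total} for row in rows]


# Made-up counts for an ordinal smoking-status variable.
smokers = [
    {"status": "never", "count": 120},
    {"status": "former", "count": 50},
    {"status": "current", "count": 30},
]

for row in add_percent(smokers):
    print(row["status"], row["count"], f'{row["percent"]:.1f}%')
```

The original rows are left unchanged. The derived column is what would be mapped to the axis.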

4. Context

The purpose of data graphics is to help the viewer make meaningful comparisons. Context can be added to data graphics in the form of

For multivariate data

Challenging to condense multivariate information into a two-dimensional image. Use

(We will revisit facets and layers while learning A Layered Grammar of Graphics, implemented in ggplot2)

Putting it all together

Exercises

For each of the following data graphics, answer the following:

  1. Which variables are used, and what are the types of variables?
  2. Which visual cue is used?
  3. On which coordinate system, and on which scale?
  4. How is context provided?

Exercise 1.

  1. Two quantitative variables (son’s height and father’s height) are used.
  2. The visual cue is position.
  3. The points are plotted in the Cartesian plane with linear scales.
  4. Context is provided by the axis labels (to show the positive association).

Exercise 2.

The bar graph displays the average score on the math portion of the 1994–1995 SAT (with possible scores ranging from 200 to 800) among states for whom at least two-thirds of the students took the SAT.

Exercise 3.

A time series shows the progression of the world record times in the 100-meter freestyle swimming event for men and women. The time series plot displays the times as a function of the year in which the new record was set.

Exercise 4.

A choropleth map showing the population of Massachusetts by 2010 Census tract.

Homework

  1. Read the excerpt of “The Visual Display of Quantitative Information” by Edward Tufte

  2. Dissecting data graphics. Textbook exercises.

  3. Find a data graphic in the wild, and criticize it. For this you will need to utilize the best of both worlds (of Tufte’s and Yau’s).

Find the detailed instructions on the course webpage.