Sungkyu Jung
What is Data Science?
How do we learn Data Science?
Data visualization: What is a good graphic?
Composing/Dissecting Data Graphics
Nathan Yau (http://flowingdata.com/) provides a systematic way of thinking about how data graphics convey specific pieces of information, and how they could be improved.
A complementary grammar of graphics is implemented by Hadley Wickham in the ggplot2
graphics package in R
Data graphics can be understood in terms of four basic elements:
Note: Colors can represent both quantitative and categorical variables, using the following
Sequential The ordering of the data has only one direction.
Diverging The ordering of the data has two directions.
Qualitative There is no ordering of the data
How are the data points organized? While any number of coordinate systems are possible, three are most common:
An appropriate choice for a coordinate system is critical in representing one’s data accurately, since, for example, displaying spatial data like airline routes on a flat Cartesian plane can lead to gross distortions of reality
Scales translate values into visual cues. The choice of scale is often crucial. The central question is how does distance in the data graphic translate into meaningful differences in quantity? Each coordinate axis can have its own scale, for which we have three different choices:
Numeric A numeric quantity is most commonly set on a linear, logarithmic, or percentage scale.
Categorical A categorical variable may have no ordering (e.g., Democrat, Republican, or Independent), or it may be ordinal (e.g., never, former, or current smoker).
Time Time is a numeric quantity that has some special properties. First, because of the calendar, it can be demarcated by a series of different units (e.g., year, month, day, etc.). Second, it can be considered periodically as a wrap-around scale.
The purpose of data graphics is to help the viewer make meaningful comparisons. Context can be added to data graphics in the form of
Challenging to condense multivariate information into a two-dimensional image. Use
(We will revisit facets and layers while learning A Layered Grammar of Graphics, implemented in ggplot2
)
For each of data graphics, answer the following:
The bar graph displays the average score on the math portion of the 1994–1995 SAT (with possible scores ranging from 200 to 800) among states for whom at least two-thirds of the students took the SAT.
A time series shows the progression of the world record times in the 100-meter freestyle swimming event for men and women. The time series plot displays the times as a function of the year in which the new record was set.
A choropleth map showing the population of Massachusetts by the 2010 Census tracts
Read the excerpt of “The Visual Display of Quantitative Information” by Edward Tufte
Dissecting data graphics. Textbook exercises.
Find a data graphic in the wild, and criticize it. For this you will need to utilize the best of both worlds (of Tufte’s and Yau’s).
Find the detailed instruction on the course webpage.