Sungkyu Jung
What is Data Science?
Course webpage: http://www.stat.pitt.edu/sungkyu/course/pds/
How would you reproduce this study?
Data can be found at Harvard Dataverse
What is big data and what is data science?
Is data science the science of Big Data?
Is data science just an extension of statistics?
From Wikipedia: Data science is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.
“Unstructured data” can include emails, videos, photos, social media, and other user-generated content.
Data science often requires sorting through a great amount of information and writing algorithms to extract insights from this data.
What is Data Science?
How do we learn Data Science? (Course logistics)
Data visualization
Learn data science by doing data science
Use R and RStudio
Two lectures and one recitation (lab) in a week
R is a free software environment for statistical computing and graphics, and is the best data science language. URL https://www.r-project.org/
(https://www.r-bloggers.com/why-you-should-learn-r-first-for-data-science/)
RStudio is open-source, enterprise-ready professional software for R. URL https://www.rstudio.com/
R is a language for data science.
This entire course is about doing data science using R
Friday classes (11 AM or 12 PM) will meet in the STAT LAB (Posvar 1201) whenever possible
We will begin using R this Friday.
Computers in STAT LAB have R and RStudio. You can bring your laptop to the lab.
Visit the course webpage at http://www.stat.pitt.edu/sungkyu/course/pds/
Consider an example dataset that records the sex and height of a group of people
Timestamp Height Sex
1 9/2/2014 13:40:36 75 Male
2 9/2/2014 13:46:59 70 Male
3 9/2/2014 13:59:20 68 Male
4 9/2/2014 14:51:53 74 Male
5 9/2/2014 15:16:15 61 Male
6 9/2/2014 15:16:16 65 Female
Note that some entries are not in inches.
Timestamp Height Sex
127 9/2/2014 15:16:56 5'7" Male
150 9/2/2014 15:17:09 5'3" Female
187 9/2/2014 15:18:00 5'8.11 Male
202 9/2/2014 15:19:48 5'11 Male
236 9/4/2014 0:46:45 5'9'' Male
55 9/2/2014 15:16:37 165cm Female
Fixing this is part of what we call data wrangling.
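As a rough illustration of this kind of wrangling, here is a minimal sketch in Python (the course itself works in R). The function name `height_to_inches` and the rules it applies are assumptions made for this example: entries ending in "cm" are converted at 2.54 cm per inch, feet-and-inches entries such as 5'7" are converted at 12 inches per foot, and bare numbers are taken to be inches already. An entry like 5'8.11 is read as 5 feet 8.11 inches, which may or may not be what the respondent meant.

```python
import re

CM_PER_INCH = 2.54

def height_to_inches(raw):
    """Convert a free-text height entry to inches; return None if unrecognized."""
    text = raw.strip().lower()

    # Centimetres, e.g. "165cm"
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*cm", text)
    if m:
        return float(m.group(1)) / CM_PER_INCH

    # Feet and inches, e.g. 5'7", 5'11, 5'9''
    m = re.fullmatch(r"(\d+)'\s*(\d+(?:\.\d+)?)?\s*(?:\"|'')?", text)
    if m:
        feet = int(m.group(1))
        inches = float(m.group(2)) if m.group(2) else 0.0
        return 12 * feet + inches

    # Plain number: assumed to be inches already
    if re.fullmatch(r"\d+(?:\.\d+)?", text):
        return float(text)

    return None

# Entries copied from the tables above
for entry in ["75", "5'7\"", "5'3\"", "5'8.11", "5'11", "5'9''", "165cm"]:
    print(entry, "->", height_to_inches(entry))
```

Entries that match none of the patterns come back as `None`, so they can be listed and inspected rather than silently guessed.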
After fixing the above issue, there are still some problems:
Timestamp Height Sex
12 9/2/2014 15:16:23 6.00 Male
40 9/2/2014 15:16:32 5.30 Female
66 9/2/2014 15:16:41 511.00 Male
84 9/2/2014 15:16:46 6.00 Male
99 9/2/2014 15:16:50 2.00 Female
126 9/2/2014 15:16:56 9000.00 Male
194 9/2/2014 15:18:14 5.25 Female
231 9/3/2014 21:43:00 5.50 Male
235 9/3/2014 23:55:37 11111.00 Male
241 9/4/2014 5:15:28 6.00 Female
242 9/4/2014 6:31:03 6.50 Male
244 9/4/2014 9:24:41 150.00 Female
We sometimes have to fix these “by hand”
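Before fixing anything by hand, it can help to sort the suspicious entries into groups. The sketch below is in Python (the course itself works in R) and uses the rows from the table above. The plausibility range of 50 to 84 inches and the labels returned by `classify` are assumptions chosen for illustration, not rules from the course.

```python
# Rows copied from the table above: (row id, height as entered, sex)
rows = [
    (12, 6.00, "Male"), (40, 5.30, "Female"), (66, 511.00, "Male"),
    (84, 6.00, "Male"), (99, 2.00, "Female"), (126, 9000.00, "Male"),
    (194, 5.25, "Female"), (231, 5.50, "Male"), (235, 11111.00, "Male"),
    (241, 6.00, "Female"), (242, 6.50, "Male"), (244, 150.00, "Female"),
]

# Assumed range of plausible adult heights in inches (hypothetical cutoffs)
LOW, HIGH = 50, 84

def classify(height):
    """Guess why a reported height looks wrong, using the assumed range."""
    if LOW <= height <= HIGH:
        return "ok"
    if LOW <= height * 12 <= HIGH:
        return "possibly feet"
    if LOW <= height / 2.54 <= HIGH:
        return "possibly cm"
    return "fix by hand"

for row_id, height, sex in rows:
    print(row_id, height, sex, classify(height))
```

Values such as 511, 9000, and 11111 fit none of the unit guesses, so they are the ones left for manual inspection.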
Look at the distribution of univariate data
\[F(a) = \mathrm{Prob}(\mathrm{Height} \le a)\]
Histograms show: \(F(b)-F(a)\) for several intervals \((a,b]\)
Easier to interpret than cumulative distribution functions
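To make the link between \(F\) and a histogram concrete, here is a small sketch in Python (the course itself works in R). It uses only the six heights from the first table, and the break points 60, 65, 70, 75 are an arbitrary choice for illustration. The helper names `ecdf` and `interval_proportion` are made up for this example.

```python
# Heights (inches) from the first six rows of the example data
heights = [75, 70, 68, 74, 61, 65]

def ecdf(data, a):
    """Empirical F(a): the proportion of observations less than or equal to a."""
    return sum(x <= a for x in data) / len(data)

def interval_proportion(data, a, b):
    """Proportion of observations in (a, b], computed as F(b) - F(a)."""
    return ecdf(data, b) - ecdf(data, a)

# Arbitrary break points chosen for illustration
breaks = [60, 65, 70, 75]
for a, b in zip(breaks, breaks[1:]):
    print(f"({a}, {b}]: {interval_proportion(heights, a, b):.2f}")
```

Each printed value is the height of one histogram bar when bars are drawn as proportions.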
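The distributions of many outcomes in nature are approximated by the normal distribution with mean \(\mu\) and standard deviation \(\sigma\):
\[ \mathrm{Prob}(a < \mathrm{Height} \le b) \approx \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) dx \]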
If our data follows the normal distribution then \(\mu\) and \(\sigma\) are a sufficient summary: they tell us everything!
All we need to know is \(\mu\) and \(\sigma\)
Average SD
Male 70 3
Female 65 3
Here are the approximations for males
Height Real Approx
1 63 0.02 0.03
2 65 0.07 0.06
3 67 0.16 0.10
4 68 0.31 0.31
5 70 0.50 0.44
6 71 0.69 0.68
7 73 0.84 0.88
8 75 0.93 0.95
9 76 0.98 0.99
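A normal approximation of this kind can be computed directly from \(\mu\) and \(\sigma\). The sketch below is in Python (the course itself works in R) and uses `statistics.NormalDist` from the standard library with the rounded male summary values, 70 and 3. Because those values are rounded, the output should not be expected to match the table digit for digit.

```python
from statistics import NormalDist

# Rounded summary values for males from the Average/SD table (inches)
male = NormalDist(mu=70, sigma=3)

# Heights listed in the table above
for h in [63, 65, 67, 68, 70, 71, 73, 75, 76]:
    # Normal approximation to Prob(Height <= h)
    print(h, round(male.cdf(h), 2))
```

Comparing these values against the observed proportions is one way to judge how well the normal approximation fits.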
Observed versus normal approximation quantiles
Many pairs of data are bivariate normal
The blue line is the average within each stratum
It is called the regression line
The regression line is defined by this formula \[ \frac{Y- \mu_Y}{\sigma_Y} =\rho \frac{X - \mu_X}{\sigma_X}\]
\(\rho\) is called the correlation coefficient
For fathers' and sons' heights it is 0.5
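Solving the regression formula for \(Y\) gives \(Y = \mu_Y + \rho \, \sigma_Y (X - \mu_X)/\sigma_X\). The sketch below, in Python (the course itself works in R), applies it with \(\rho = 0.5\). The means of 69 inches and standard deviations of 3 inches are hypothetical values chosen only to make the arithmetic easy; they are not taken from the father and son data.

```python
def regression_prediction(x, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Predicted Y for a given X on the regression line."""
    return mu_y + rho * sigma_y * (x - mu_x) / sigma_x

# Hypothetical summary values (NOT from the course data), in inches
MU_FATHER, MU_SON = 69.0, 69.0
SD_FATHER, SD_SON = 3.0, 3.0
RHO = 0.5  # correlation quoted in the text

# A father one SD above average (72 inches)
son = regression_prediction(72.0, MU_FATHER, MU_SON, SD_FATHER, SD_SON, RHO)
print(son)  # 70.5
```

Under these assumed values, a father one standard deviation above the mean is paired with a predicted son only half a standard deviation above the mean, which is the regression effect that \(\rho < 1\) produces.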
For bivariate normal pairs of data these five numbers provide a complete summary:
\[ \mu_X , \mu_Y ,\sigma_X, \sigma_Y, \rho \]
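These five numbers are straightforward to compute from paired data. The sketch below is in Python (the course itself works in R). The function name `bivariate_summary` and the small set of paired heights are made up for illustration and are not real measurements. Standard deviations use the population form (`pstdev`), which is an assumption; the sample form would also be a reasonable choice.

```python
from statistics import mean, pstdev

def bivariate_summary(xs, ys):
    """Return the five summaries: two means, two SDs, and the correlation."""
    mu_x, mu_y = mean(xs), mean(ys)
    sigma_x, sigma_y = pstdev(xs), pstdev(ys)
    # Correlation as the average product of standardized values
    rho = mean([((x - mu_x) / sigma_x) * ((y - mu_y) / sigma_y)
                for x, y in zip(xs, ys)])
    return {"mu_x": mu_x, "mu_y": mu_y,
            "sigma_x": sigma_x, "sigma_y": sigma_y, "rho": rho}

# Made-up paired heights in inches, for illustration only
fathers = [65, 67, 69, 71, 73]
sons = [68, 67, 70, 69, 72]

for name, value in bivariate_summary(fathers, sons).items():
    print(name, round(value, 3))
```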
For example, look at compensation for 199 US CEOs (2000)
The average is $600,000, but 84% of CEOs, not 50%, make less.
The normal approximation is not useful here.