Your final project for this course consists of two parts. Part I is done individually, while Part II is done in a group.

Part I. FiveThirtyEight data graphics

An R package that provides access to the code and data sets published by FiveThirtyEight (https://github.com/fivethirtyeight/data) was just made available to the public. The developers, Albert Kim and his colleagues, maintain a webpage for the package fivethirtyeight: https://rudeboybert.github.io/fivethirtyeight/

The collection of data sets included is massive. You can find a list of them, including the URLs to the original fivethirtyeight.com articles, at https://rudeboybert.github.io/fivethirtyeight/articles/fivethirtyeight.html.
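If you prefer to browse the data sets from within R, a sketch along the following lines may help. The helper name list_538_datasets is hypothetical (it is not part of the package); it assumes the fivethirtyeight package is already installed and otherwise uses only base R's data() function.

```r
# Hypothetical helper: list the data sets shipped with the fivethirtyeight package.
# Assumes the package is installed, e.g. via install.packages("fivethirtyeight").
list_538_datasets <- function() {
  # data(package = ...) returns an object whose $results matrix has
  # the columns "Package", "LibPath", "Item", and "Title".
  info <- data(package = "fivethirtyeight")$results
  data.frame(
    name  = info[, "Item"],
    title = info[, "Title"],
    stringsAsFactors = FALSE
  )
}

# Usage (only works once the package is installed):
# head(list_538_datasets())
```

The function is defined without loading the package, so the call fails only if fivethirtyeight is missing at the time you run it.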

Your final project (Part I) is to choose one of the articles with data graphics, and recreate one or more of the data graphics found in the article. Examples of such reports can be found at https://rudeboybert.github.io/fivethirtyeight/articles/

Your report consists of

  1. A technical discussion of your data wrangling-visualization statements;
  2. A brief paragraph explaining the context of the data graphic you created,

and must be prepared with R Markdown. The instructor will “knit” your .Rmd file, so make sure that your analysis is repeatable. Submit both the .Rmd and .html (or .pdf, .docx) files in a zipped file.

Extra credit policy

Creating some graphics may require functionality that we have not discussed. For example, to create the choropleth map from the antiquities_act dataset, you will need to handle spatial data, which is not discussed in class. (See Chapter 14, MDSR for handling spatial data.) Extra credit will be given to those who appropriately use a new function.

Choosing an article

  • You will be in a group of a few students for Part II of the project. Choose an article (or equivalently, a dataset) so that there is no overlap with your group members.

  • Avoid choosing the datasets corresponding to the articles at https://rudeboybert.github.io/fivethirtyeight/articles/.

Report is due on

Monday, December 11, 2017. Submit through courseweb.

Part II. Retrieve, explore, and analyze

This part of the project is to retrieve, explore, and analyze data in one of the topic areas. You will need to choose one of the three topic areas listed below.

1. American Time Use Survey Data

The ATUS is an annual survey conducted on a sample of individuals across the United States, studying how individuals spend their time over the course of a day. Individual respondents were interviewed about what activities they did, during what times (rounded to 15-minute increments), at what locations, and in the presence of which individuals. The activities are subsequently encoded based on 3 separate tier codes for classification.

All activity codes (other than code ‘50’ for ‘Unable to Code’) were included. The full data can be obtained from http://www.bls.gov/tus/.

The R package atus (https://cran.r-project.org/web/packages/atus/index.html) contains abridged data from the American Time Use Survey (ATUS) for years 2003-2016.
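As a rough illustration of the three-tier coding: ATUS activity codes are six digits long, with two digits per tier. The sketch below uses base R only. The helper name split_tiercode is hypothetical, and it assumes that codes may arrive as numbers with leading zeros dropped, so check how the atus package actually stores them before relying on it.

```r
# Hypothetical helper: split a six-digit ATUS activity code into its three tiers.
# Assumption: codes may be stored as numbers (e.g. 10101 for "010101"),
# so they are zero-padded to six digits before being split.
split_tiercode <- function(code) {
  padded <- sprintf("%06d", as.integer(code))
  data.frame(
    tier1 = substr(padded, 1, 2),  # major category
    tier2 = substr(padded, 3, 4),  # second-tier category
    tier3 = substr(padded, 5, 6),  # third-tier (most detailed) category
    stringsAsFactors = FALSE
  )
}

split_tiercode(c(10101, 120303))
```

Keeping the tiers as zero-padded character strings avoids losing the leading zero of codes in the first major category.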

2. Runners’ times in the Cherry Blossom Race

The Cherry Blossom Ten Mile Run, held in Washington, D.C., started in 1973. It has since grown in popularity. The organizers publish the results at http://cherryblossom.org. These data offer a tremendous resource for learning about the relationship between age and performance, among other things.

Publicly available data include the race results from 1999 to 2013, at http://cherryblossom.org/aboutus/results_list.php. You will need to scrape these data from the web and read them into R.
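Once the results are scraped, one wrangling step you will likely face is turning finishing times into numbers. The sketch below is only an illustration in base R; the helper name time_to_minutes is hypothetical, and it assumes that times appear as text in either "h:mm:ss" or "mm:ss" form, which you should verify against the pages you actually scrape.

```r
# Hypothetical helper: convert race times given as text into minutes.
# Assumption: each time is either "h:mm:ss" or "mm:ss"; anything else becomes NA.
time_to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    if (length(p) == 3) {
      p[1] * 60 + p[2] + p[3] / 60   # hours, minutes, seconds
    } else if (length(p) == 2) {
      p[1] + p[2] / 60               # minutes, seconds
    } else {
      NA_real_                       # unrecognized format
    }
  }, numeric(1))
}

time_to_minutes(c("1:15:30", "45:15"))
```

Returning NA for unrecognized formats, rather than stopping with an error, makes it easier to count and inspect the problematic rows afterwards.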

3. Quantitative Financial Analysis

This topic is to use a new R package, tidyquant (https://cran.r-project.org/web/packages/tidyquant/), to retrieve financial data from remote sources, and to extract knowledge from it.

This project must include retrieving and handling a large dataset, e.g. analyzing all stocks in an index or an exchange. Analyzing each individual stock over time can be part of the project activities, but is not sufficient.
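To give a sense of what "all stocks in an index" might look like in code, here is a minimal sketch. It assumes the tidyquant package is installed and that its tq_index() and tq_get() functions are used to fetch the constituents and their prices. The function names fetch_index_prices and simple_returns, the default index, and the date range are all hypothetical choices for illustration. The download needs a network connection and can take a while.

```r
# Hypothetical wrapper: download daily prices for every stock in an index.
# Assumes tidyquant is installed; the index and dates are placeholders.
fetch_index_prices <- function(index = "SP500",
                               from = "2017-01-01",
                               to = "2017-10-31") {
  constituents <- tidyquant::tq_index(index)   # one row per stock in the index
  tidyquant::tq_get(constituents$symbol,       # prices for all symbols at once
                    get = "stock.prices",
                    from = from, to = to)
}

# Base R helper: simple period-over-period returns from a price vector.
simple_returns <- function(price) {
  diff(price) / head(price, -1)
}

simple_returns(c(100, 110, 99))

# Usage (requires tidyquant and a network connection):
# prices <- fetch_index_prices()
```

From there, grouping by symbol (or by sector) is what turns many individual price series into an index-wide analysis.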

Scope of the work

This is an open-ended project. Your project begins with exploring each of the three topic areas. Spend adequate time to understand each data situation, and think about what kinds of interesting knowledge you can extract from the data. Think about what types of statistical analysis or modeling you would need to extract potentially hidden knowledge or to confirm your hypothesis.

Choose a topic area, then evaluate the goals of your analysis. The goals of your final product may consist of

  1. visualization or tabulation of the data (from either exploring or modeling),
  2. results of statistical tests for your hypothesis,
  3. and modeling and predictions from statistical learning methods.

Examples of your goals are “visualization of sleep patterns for cohorts”, “prediction of race results based on age, sex and hometown”, or “association of stock prices between sectors of industry”. (These are just examples!) Your work in achieving the goals will necessarily include data wrangling and visualization. If the project involves ambitious large-scale visualization, then modeling may be omitted.

You are asked to write a proposal for your analysis (see below). Based on the proposals, the instructors will divide students into several working groups. Each group will then investigate the data together by performing a preliminary data exploration, and refine its goals. Each group is then asked to write a progress report and the final report for the project.

Proposal

Your project proposal is due on Wednesday, Nov. 8.

The proposal includes your name and background (major), your choice of topic, and the goals of the analysis. Prepare your proposal using R Markdown, then print it and submit the paper directly to me in class.

Progress report

The progress report is due on Monday, Nov. 27, and includes the names of group members, the (refined) goals of the analysis, and evidence of preliminary data exploration. Submit your paper directly to me in class.

Final report

The final report is due on Monday, December 11, 2017. Your report must contain

  1. Proposed goals in your progress report,
  2. Analysis (both code chunks and results),
  3. Interpretation,

and as an appendix,

  1. Each individual group member’s contribution,
  2. A list of each individual’s name and the dataset used in Part I of the final project.

Submit a zipped file containing both .Rmd and an output file. Submit through courseweb.

Getting help

You are welcome to ask for advice from both Tim and me, at any stage. The last class, at 11-1 on Friday, December 8, will be reserved for a Q&A session for your final project.

Late submission policy (for both Part I and II)

I will deduct 5% of your points for each hour your submission is late, compounded hourly. So if you submit at 3:30am on Tuesday, then you will get only (1-0.05)^3 = 0.857375 of your earned points.
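The arithmetic of the late penalty can be sketched in R as follows. The function name late_fraction is hypothetical. The sketch assumes that only completed hours count and that the deadline is midnight at the end of Monday; both are inferred from the 3:30am example rather than stated in the policy.

```r
# Hypothetical helper: fraction of earned points kept after a late submission.
# Assumptions (inferred from the 3:30am example, not stated in the policy):
#   - only completed hours count, so 3.5 hours late is treated as 3 hours;
#   - the 5% deduction compounds, i.e. (1 - 0.05) is applied once per hour.
late_fraction <- function(hours_late, rate = 0.05) {
  completed_hours <- floor(pmax(hours_late, 0))  # on-time or early counts as 0
  (1 - rate)^completed_hours
}

late_fraction(3.5)  # 0.95^3 = 0.857375
```

Because the deduction compounds, a report submitted a full day late would still keep roughly 0.95^24, or about 29%, of its earned points under these assumptions.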