Your final project for this course consists of two parts. Part I is done individually, while Part II is done in a group.
An R package that provides access to the code and data sets published by FiveThirtyEight (https://github.com/fivethirtyeight/data) was just made available to the public. The developers, Albert Kim and his colleagues, maintain a webpage for the package fivethirtyeight: https://rudeboybert.github.io/fivethirtyeight/
The package includes a large number of data sets. You can find a list of these, including the URLs to the original fivethirtyeight.com articles, at https://rudeboybert.github.io/fivethirtyeight/articles/fivethirtyeight.html.
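The same list can also be browsed from within R. The following is a minimal sketch, not part of the assignment: list_package_datasets is a hypothetical helper name, and it relies only on the base R function data(). The call on the fivethirtyeight package is guarded so that the code still runs if the package is not installed.

```r
# Hypothetical helper: list the data sets shipped with an installed package.
list_package_datasets <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is not installed.")
  }
  # data(package = ...) returns a matrix with columns "Item" and "Title"
  res <- utils::data(package = pkg)$results
  data.frame(name = res[, "Item"], title = res[, "Title"],
             stringsAsFactors = FALSE)
}

# Only runs if fivethirtyeight is installed:
# install.packages("fivethirtyeight")
if (requireNamespace("fivethirtyeight", quietly = TRUE)) {
  print(head(list_package_datasets("fivethirtyeight")))
}
```

The helper works for any installed package, for example list_package_datasets("datasets").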
Your final project (Part I) is to choose one of the articles with data graphics and recreate one or more of the data graphics found in the article. Examples of such reports can be found at https://rudeboybert.github.io/fivethirtyeight/articles/
Your report consists of
and must be prepared with R Markdown. The instructor will “knit” your .Rmd file, so make sure that your analysis is repeatable. Submit both the .Rmd and .html (or .pdf, .docx) files in a zipped file.
Creating some graphics may require functionality that we have not discussed. For example, to create the choropleth map from the antiquities_act dataset, you will need to handle spatial data, which is not discussed in class. (See Chapter 14, MDSR for handling spatial data.) Extra credit will be given to those who appropriately use a new function.
You will work in a group of a few students for Part II of the project. Choose an article (or equivalently, a dataset) so that there is no overlap with your group members.
Avoid choosing the datasets corresponding to the articles at https://rudeboybert.github.io/fivethirtyeight/articles/.
Monday, December 11, 2017. Submit through courseweb.
This part of the project is to retrieve, explore, and analyze data in one of the topic areas. You will need to choose one of the three topic areas listed below.
The ATUS is an annual survey conducted on a sample of individuals across the United States, studying how individuals spend their time over the course of a day. Individual respondents were interviewed about what activities they did, during what times (rounded to 15-minute increments), at what locations, and in the presence of which individuals. The activities are subsequently encoded based on 3 separate tier codes for classification.
All activity codes (other than code ‘50’ for ‘Unable to Code’) were included. The full data can be obtained from http://www.bls.gov/tus/.
The R package atus (https://cran.r-project.org/web/packages/atus/index.html) contains abridged data from the American Time Use Survey (ATUS) for years 2003-2016.
The Cherry Blossom Ten Mile Run, held in Washington, D.C., started in 1973. It has since grown in popularity. The organizers publish the results at http://cherryblossom.org. These data offer a tremendous resource for learning about the relationship between age and performance, among other things.
Publicly available data include the race results from 1999 to 2013, at http://cherryblossom.org/aboutus/results_list.php. You will need to scrape these data from the web and read them into R.
This topic uses a new R package, tidyquant (https://cran.r-project.org/web/packages/tidyquant/), to retrieve financial data from remote sources and to extract knowledge from it.
This project must include retrieving and handling a large dataset, e.g. analyzing all stocks in an index or an exchange. Analyzing each individual stock over time can be part of the project activities, but is not sufficient.
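As a rough sketch of what retrieving all stocks in an index might look like, the code below uses the tidyquant functions tq_index() and tq_get(). The helper name fetch_index_prices, the choice of the S&P 500, and the date range are illustrative assumptions, not requirements of the project. The download itself needs the tidyquant package and an internet connection, so the call is left commented out.

```r
# Hypothetical helper: download daily prices for every stock in an index.
# The index name and date range below are illustrative assumptions.
fetch_index_prices <- function(index = "SP500",
                               from = "2016-01-01",
                               to = "2016-12-31") {
  if (!requireNamespace("tidyquant", quietly = TRUE)) {
    stop("Install tidyquant first: install.packages(\"tidyquant\")")
  }
  # One row per constituent stock, including a 'symbol' column
  constituents <- tidyquant::tq_index(index)
  # Long-format prices: one row per symbol and trading day
  tidyquant::tq_get(constituents$symbol, get = "stock.prices",
                    from = from, to = to)
}

# Not run here because it needs network access and can take a while:
# prices <- fetch_index_prices("SP500")
```

The result is a single long table covering many stocks, which is the kind of dataset the project asks for, as opposed to one stock at a time.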
This is an open-ended project. Your project begins with exploring each of the three topic areas. Spend adequate time to understand each data situation, and think about what kinds of interesting knowledge you can extract from the data. Think about what types of statistical analysis or modeling you would need to extract potentially hidden knowledge or to confirm your hypothesis.
Choose a topic area, then evaluate the goals of your analysis. The goals or your final product may consist of
Examples of your goals are “visualization of sleep patterns for cohorts”, “prediction of race results based on age, sex and hometown”, or “association of stock prices between sectors of industry”. (These are just examples!) Your work in achieving the goals will necessarily include data wrangling and visualization. If the project involves ambitious large-scale visualization, then modeling may be omitted.
You are asked to write a proposal for your analysis (see below). Based on the proposals, the instructors will divide students into several working groups. Each group will then investigate the data together by performing a preliminary data exploration, and refine its goals. Each group is then asked to write a progress report and the final report for the project.
Submit your project proposal, due on Wednesday, Nov. 8.
The proposal includes your name and background (major), your choice of topic, and the goals of the analysis. Prepare your proposal using R Markdown, print it, and submit the paper copy directly to me in class.
The progress report is due on Monday, Nov. 27, and includes the names of group members, the (refined) goals of the analysis, and evidence of preliminary data exploration. Submit your paper directly to me in class.
The final report is due on Monday, December 11, 2017. Your report must contain
and as an appendix,
Submit a zipped file containing both the .Rmd file and an output file. Submit through courseweb.
You are welcome to ask for advice from both Tim and myself at any stage. The last class, at 11-1 on Friday, December 8, will be reserved for a Q&A session for your final project.
I will deduct 5% of points for each hour your submission is late, compounded hourly. So if you submit at 3:30am on Tuesday, then you will get only (1-0.05)^3 = 0.857375 of your earned points.
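The late-penalty arithmetic can be checked in R. This is a minimal sketch: late_fraction is a hypothetical helper name, and it takes the number of late hours as given, since how partial hours are counted is not spelled out here (the example above counts a 3:30am Tuesday submission as 3 hours).

```r
# Fraction of earned points kept after a late submission.
# hours_late: number of hours counted as late (assumed already determined)
# rate: deduction per hour, 5% as stated above
late_fraction <- function(hours_late, rate = 0.05) {
  stopifnot(hours_late >= 0)
  (1 - rate)^hours_late
}

late_fraction(3)  # 0.857375, matching the example above
```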