DC1: Approved Data Sets

by Mike Gleicher on September 18, 2018

These are the “approved” data sets for Design Challenge 1. Remember, you must use one of these approved data sets. If you want to use a different data set, you must get it approved (and we’ll put it on this list).

Many datasets are available in this Box folder (except for ones you need to grab yourself, and even then a copy might be available in the folder). Everyone who is registered for class should have access to the Box folder. If you are having a problem, let me know – Box doesn’t always cooperate.

Note that to access some datasets, you may be required to sign up for an account.

Government Data

White House Budget Data

The data used in developing the budgets (back in 2016 and 2017), from the White House GitHub. I recommend going to the 2017 branch (which is still from before the November 2016 election) and selecting “download ZIP” (look for the green “clone or download” button). There is good documentation (db_guide.pdf describes each field), and the data is quite rich, giving historical spending in a lot of categories.

In the past, we considered the “receipts” data as small (~250 rows), and the “budgets and outlays” as harder data sets (~4500 rows and ~5000 rows, respectively). Here we’re grouping them together. Files are available as .csv.

Airline On-Time Performance

The Bureau of Transportation Statistics lets you download a lot of data, one month at a time, from this page. We’ve downloaded a few months for you, but even if you use our versions, you might want to refer to this page for explanations of all the fields and for the lookup tables (files that say what the codes mean). Fields include date, carrier, origin/destination information, departure/arrival performance, gate/airport information, flight summaries, cause of delay, and more.

For this data set, you may choose to use the months we downloaded, or download your own (please specify what data you use). You can choose to use just 1 month, or you can pick multiple months to compare (if you want a real challenge). Note that you may filter by geography (select a particular state or all states), year, and monthly periods. Files are available as .csv.

You may want information on the airports (for example, to get location coordinates for each airport code). This data is available from https://www.faa.gov/airports/airport_safety/airportdata_5010/.
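As a rough sketch of that lookup, the snippet below attaches coordinates to each flight by airport code using only the Python standard library. The column names (`ORIGIN`, `DEST`, `LocationID`, `Latitude`, `Longitude`) and the inline sample rows, including the coordinate values, are assumptions made up for illustration; check the actual headers in the files you download.

```python
import csv
import io

# Hypothetical stand-ins for the two files; real column names may differ.
flights_csv = """ORIGIN,DEST,DEP_DELAY
MSN,ORD,12
ORD,MSN,-3
MSN,XXX,5
"""
airports_csv = """LocationID,Latitude,Longitude
MSN,43.14,-89.34
ORD,41.98,-87.90
"""


def load_airport_coords(fileobj):
    """Build a dict mapping airport code -> (latitude, longitude)."""
    return {
        row["LocationID"]: (float(row["Latitude"]), float(row["Longitude"]))
        for row in csv.DictReader(fileobj)
    }


def attach_coords(flights_file, coords):
    """Yield each flight row with origin/destination coordinates added.

    Codes missing from the airport table get None instead of raising,
    so unmatched airports are easy to find afterwards.
    """
    for row in csv.DictReader(flights_file):
        row["origin_coords"] = coords.get(row["ORIGIN"])
        row["dest_coords"] = coords.get(row["DEST"])
        yield row


coords = load_airport_coords(io.StringIO(airports_csv))
flights = list(attach_coords(io.StringIO(flights_csv), coords))
```

Keeping unmatched codes as `None` (rather than dropping the row) makes it easy to report how many flights could not be placed on a map.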

Nationwide Crime Data

One of the functions of the Federal Bureau of Investigation (FBI) is to compile crime statistics within the US and use this information to help local law enforcement curtail crime. Every year, the FBI releases this data along with recommendations for communities to stem violent crime. We have downloaded the 2014 and 2015 datasets of types of crime by area; both are available on Box. There is also data for 2016. To download the .xls files, click on the link under the “Download files from this publication” heading. Note that this dataset contains mostly smaller summary tables of counts of a particular crime by location and other factors.

If you use this dataset, we ask that you resist ranking cities/states or their law enforcement capabilities by their crime, as requested by the FBI. Showing trends and patterns should be your goal here.

Census Data

Census Data By County

Note: this is aggregated census data – which is much less interesting than the IPUMS “raw” (or sampled) data.

You can get census data in all kinds of forms. This page has 4 spreadsheets. Any one of them could tell an interesting story – but you probably want to put together multiple files. The complication is that it’s a long list of counties (you might just pick some, or try to give a sense of the range of what is going on, or identify unusual things, or …). The files are also in the Box.

The files are:

  • Population Estimates – has data 2010-2017 (per year) with inflows and outflows. There is a separate sheet in the excel file that explains the columns.
  • Education – has data from multiple years (1970, 1980, 1990, 2000, and the 2012-2016 5-year average) for different levels of educational attainment.
  • Unemployment – has data from 2007-2017
  • Poverty Estimates – mainly 2016 data, explanations for the columns in a separate sheet.

Detailed Census Data

You can get detailed census data (as in samples of specific people) from the IPUMS website. This data gets huge very fast (you can get millions of people) and requires aggregation and clever ways to handle it efficiently (Tableau does surprisingly well).

When you create a data set, you have to pick which census to sample (e.g., which years), and which variables you want. The tool will create huge CSV files (gigabytes). It also creates documentation files.

In the Box folder, I have a big data grab I got (past 15 years, many variables): there’s the CSV file and the documentation file. There is also a “reduced file” that I created with a processing script, in which I decoded some of the columns and selected a subset of the years. Even this small set is millions of people!
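If a file of this size does not fit in memory, one option is to aggregate while streaming it row by row. The sketch below sums a weight column per year without loading the whole file. The column names `YEAR` and `PERWT` and the sample rows are assumptions for illustration; which variables appear in your extract depends on what you selected.

```python
import csv
import io
from collections import defaultdict


def weighted_count_by(fileobj, key_col, weight_col):
    """Sum weight_col for each distinct value of key_col, one row at a time."""
    totals = defaultdict(float)
    for row in csv.DictReader(fileobj):
        totals[row[key_col]] += float(row[weight_col])
    return dict(totals)


# Tiny made-up sample standing in for a multi-gigabyte extract.
sample = io.StringIO("YEAR,PERWT,AGE\n2015,100,34\n2015,50,61\n2016,80,19\n")
totals = weighted_count_by(sample, "YEAR", "PERWT")
```

With a real extract, you would pass an open file instead of the sample, for example `open("extract.csv", newline="")` (a hypothetical filename). Only one row is held in memory at a time, so the approach scales to files far larger than RAM.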

Other Datasets

Time Usage Survey

The American Time Use Survey (ATUS) tracks how people spend their time. There are corresponding international versions. There are actually lots of different surveys with interesting data available from the IPUMS website.

Getting a data set requires picking from all the options, and you can probably pull together an interesting data set in many ways. I grabbed one from the site. I also checked that, despite the scary agreements I had to agree to, sharing it with a class is legal (see this), so I put a grab of how Americans’ time usage has changed over the years into the DataSets Box folder.

You can find out what the “time use codes” mean on this page.

Interpreting the other codes requires some digging, unfortunately. Some are self-explanatory, but others are not. I tracked down the “FAMINCOME” columns: explanation here. The state codes are here.

Beijing Air Quality Data

Two data sets about air quality in Beijing, joined into a single cohesive table.

From the contributor:

The data comes from two sources:

  1. Air quality data: http://www.stateair.net/web/historical/1/1.html (need to download each .csv separately)
  2. Weather data: https://www.wunderground.com/history/airport/ZBAA (here’s a link to a .csv for 2011)

I first pulled the air quality data (where measurements are taken multiple times a day), and aggregated to be at the daily level. Then I merged the weather data to the air quality data. I have a GitHub repository with the data and R and Python code.

Note: the GitHub repo has the documentation for the data and the data conveniently processed into a CSV file, but it also has code for some basic visualizations. I can’t stop you from looking at the code. But, if you are not the author, you cannot turn in these visualizations.
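To make the processing steps the contributor describes concrete (aggregate sub-daily readings to one value per day, then merge in the weather data by date), here is a minimal standard-library sketch. It is not the contributor's code, and the column names (`date`, `pm25`, `mean_temp_c`) and sample values are invented for illustration; the real files use their own headers and may need extra cleaning.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample rows; the real files use different column names.
air_csv = """date,hour,pm25
2011-01-01,0,40
2011-01-01,12,60
2011-01-02,0,100
"""
weather_csv = """date,mean_temp_c
2011-01-01,-3
2011-01-02,-1
"""


def daily_mean(fileobj, date_col, value_col):
    """Average all readings that share the same date."""
    by_day = defaultdict(list)
    for row in csv.DictReader(fileobj):
        by_day[row[date_col]].append(float(row[value_col]))
    return {day: mean(values) for day, values in by_day.items()}


def merge_on_date(daily, weather_file, date_col):
    """Left-join weather columns onto the daily values.

    Days with no weather row are kept, just without the extra columns.
    """
    weather = {row[date_col]: row for row in csv.DictReader(weather_file)}
    merged = []
    for day in sorted(daily):
        record = {"date": day, "pm25_mean": daily[day]}
        extra = weather.get(day, {})
        record.update({k: v for k, v in extra.items() if k != date_col})
        merged.append(record)
    return merged


daily = daily_mean(io.StringIO(air_csv), "date", "pm25")
merged = merge_on_date(daily, io.StringIO(weather_csv), "date")
```

The mean is only one possible daily summary; a daily maximum or median may tell a different story, so it is worth stating which one you used.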

UN Refugee Data

UN-Link: http://popstats.unhcr.org/en/asylum_seekers_monthly

The UN Refugee Agency (UNHCR) provides data about the number of displaced people over time. The link provided includes years 1999 to 2018 and allows the user to select years, months, country of asylum, and country of origin and export the selected data as .csv or .hxl. Note that * indicates “situations where the figures are being kept confidential to protect the anonymity of persons of concern”, and these figures are not included in totals.

Midlife in the United States Data

Large longitudinal “census like” data. Note: this requires an account login (idus.wisc.edu). Data is available from the Inter-university Consortium for Political and Social Research (ICPSR) here; the ICPSR contains other data which may be of interest. The MIDUS series includes data sets spanning various years in the United States and some in Japan. Each set of data includes documentation and information about the variables. Data is available in various formats depending on the particular sample.

Airbnb Data

Airbnb properties and reviews, from insideairbnb.com (click on the “Get the Data” link in the upper left). Note: this will require you to join across files, as there are separate .csv files for reviews, listings, and other information. Various locations are available, including cities across the United States (Boston, Austin, Los Angeles, …) and other countries (Madrid, Paris, Naples, …). The most recent data seems to be from August 2018, although some archived versions may be accessed.
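One way to approach the join across files is to summarize the reviews per listing first and then attach that summary to the listings table. The sketch below counts reviews per listing. The column names (`id`, `listing_id`, `neighbourhood`) and the sample rows are assumptions for illustration; check the headers of the files for the city you download.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for the listings and reviews files.
listings_csv = """id,name,neighbourhood
10,Loft near the lake,Downtown
20,Quiet room,Eastside
30,New listing,Downtown
"""
reviews_csv = """listing_id,date
10,2018-07-01
10,2018-08-03
20,2018-08-05
"""

# One-to-many side first: count how many reviews point at each listing.
review_counts = Counter(
    row["listing_id"] for row in csv.DictReader(io.StringIO(reviews_csv))
)

# Then attach the count to every listing; listings with no reviews get 0.
listings = [
    {**row, "review_count": review_counts.get(row["id"], 0)}
    for row in csv.DictReader(io.StringIO(listings_csv))
]
```

Aggregating before joining keeps the result at one row per listing, which avoids accidentally double-counting listing attributes such as price once per review.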

IMDB Movie Data

Info on ~5000 movies with columns including director, genre, plot keywords, and more, available in .csv. This data may be noisy. https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset. This data is a little on the small side – but we’re still allowing it.

You can get information for many more movies from IMDB directly: https://datasets.imdbws.com/. This data set is much larger, but doesn’t have as many columns as the Kaggle version. You may need to join different files to get enough data to do interesting things.

Yelp Data

Includes information about ~6 million business reviews, >1 million tips, >1 million reviewers, >150k businesses, and more, each in a separate file with the format of one JSON object per line. Several metropolitan areas are included. This dataset is larger and may require more interesting analysis to uncover numerical data worth making pictures of. https://www.yelp.com/dataset. Note that you will need to provide your information in order to download the dataset.
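Because each file holds one JSON object per line, it can be read line by line instead of being parsed as a single JSON document. The following is a minimal sketch of that pattern. The field names `business_id` and `stars` and the sample records are assumptions for illustration; consult the dataset documentation for the actual fields.

```python
import io
import json
from collections import Counter


def iter_json_lines(fileobj):
    """Yield one parsed object per non-empty line."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample standing in for a reviews file.
sample = io.StringIO(
    '{"business_id": "b1", "stars": 5}\n'
    '{"business_id": "b2", "stars": 3}\n'
    '{"business_id": "b1", "stars": 5}\n'
)

# Tally how many reviews gave each star rating.
star_counts = Counter(review["stars"] for review in iter_json_lines(sample))
```

Since the generator yields one record at a time, the same loop works on a multi-gigabyte file opened with `open(path, encoding="utf-8")` without reading it all into memory.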

Lending Club Loan Data

This includes data for loans issued through selected time periods and loan applications which were denied. Columns include amount requested, amount funded, interest rate, purpose, and more. Files are available as .csv.

Data: https://www.lendingclub.com/info/download-data.action

Note a data dictionary is available near the bottom of the page, which includes information on each attribute.

World Health Organization Data

The World Health Organization (WHO) provides data related to global health. This spans various health indicators across multiple regions. The data may be filtered before being downloaded, or the complete dataset may be downloaded. Various formats are available, including .csv, .xls, and more.

Data: http://www.who.int/gho/en/

 

Approved Student Recommended Data Sets

The IMDB data from IMDB (added to the description for IMDB data above). https://datasets.imdbws.com/

India Socio Economic Data

Link here. The main csv file (elementary_2015_16.csv) contains ~600 rows and 800 columns, where each row denotes a district, and each column denotes an attribute of the district. Other csv files contain information about population, GDP, and housing.

College Scorecard Data

Link here. Each csv file denotes a year of scorecard data. Each row denotes an academic institution, with various columns, including metrics of standardized testing scores (SAT, ACT), admission rate, costs, student body metrics, etc. The documentation is quite thorough. The data may require some cleaning to deal with suppressed values; some values may be unreported or unavailable.
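A common cleaning step for suppressed or unreported values is to convert them to an explicit "missing" value before doing any arithmetic. The sketch below shows one way to do that. The marker strings (`NULL`, `PrivacySuppressed`) and the column names (`INSTNM`, `ADM_RATE`, `SAT_AVG`) are assumptions here, as are the sample rows; confirm them against the documentation for the year you use.

```python
import csv
import io

# Strings treated as "no value". Assumed markers; verify against the data dictionary.
MISSING_MARKERS = {"", "NULL", "PrivacySuppressed"}


def to_float_or_none(raw):
    """Convert a cell to float, or None if it holds a missing-value marker."""
    value = raw.strip()
    if value in MISSING_MARKERS:
        return None
    return float(value)


# Made-up sample rows for illustration.
sample = io.StringIO(
    "INSTNM,ADM_RATE,SAT_AVG\n"
    "College A,0.52,1210\n"
    "College B,NULL,PrivacySuppressed\n"
)

rows = [
    {
        "name": r["INSTNM"],
        "adm_rate": to_float_or_none(r["ADM_RATE"]),
        "sat_avg": to_float_or_none(r["SAT_AVG"]),
    }
    for r in csv.DictReader(sample)
]

# Keep only institutions that actually report an SAT average.
with_sat = [r for r in rows if r["sat_avg"] is not None]
```

Using `None` rather than 0 for missing cells matters: a zero would silently drag down averages, while `None` forces you to decide how to handle the gap.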


Old Datasets that you CANNOT USE

These data sets were suggested in old editions of the class (when we had undergrads as well). They are too simple/small to be interesting. But you can use them for practice.

Basketball Players

This dataset is relatively small. It was used in the past for Alper (who was the TA) to demonstrate how to use Tableau and Excel for doing class projects. It’s in the Box.

Metropolitan Area Population Change

Note: this data set is small / easy. If you pick this one, the expectations for what you will need to do with it are much higher. I really dislike the vis on the Census Bureau website; you should do better (from the visualization, you can link to the data table). But the data is too small, and I’m not sure how many rich stories are to be found in it.
