DC1 Data Sets

Page content

This page lists the data sets that you may use for Design Challenge 1 (DC1): One dataset / Four Stories.

You must pick one of these data sets for your assignment.

For Design Challenge 1, we are providing 4 different data sets. Variants of these data sets have been used successfully by students in the past. All of these data sets are sufficiently large and rich to contain many interesting stories. We may provide more choices, or better versions of the datasets listed. We will not remove things from this list (so if you pick something, it will continue to be OK).

Box Folder with All Data Sets

There are four choices (some with multiple variants):

  1. Census Data
  2. Airline On-Time Performance
  3. Time Usage (how people spend their time)
  4. IMDB movie data

Some of the data sets get quite big. However, Tableau is remarkably efficient at working with 100MB+ data sets - it uses fancy database technologies to perform aggregations quickly.

In each case, you may access the original data from its source. However, we are providing versions where we have done a little of the data wrangling work for you. Thank Young, the TA.

In your assignment, be sure to explain which data set you are using (including which variant). If you got your own version of the data, please say so.

Note: we are working on being able to provide these as part of our Tableau site so you can work with them using Tableau online without having to download/upload them yourself. Stay tuned for an announcement!

Census Data by County

This data set aggregates many different quantities of interest over the counties of the US. In past assignments, we’ve provided even more detailed data sets which required even more aggregation.

The USDA provides this data as 4 separate sheets (on This page). Any one of them could tell an interesting story but together, they provide a very rich and complex data set full of stories.

We have joined the 4 spreadsheets together (joining by the “FIPS Code” column) creating a single file. The four files are also combined (joined by the “FIPS Code” column) into one file, and put in Box. The rows for the states (not counties) are also removed. The data is downloaded and processed on September 13, 2020 from USDA/ERS.

  • CSV File on Box (~4MB, 3196 columns and 339 rows)

The variable descriptions can be found on here, and some of them are replicated in the following table:

Variable Descriptions
Column nameDescription
Births_2019Births in period 7/1/2018 to 6/30/2019
CENSUS_2010_POP4/1/2010 resident Census 2010 population
CI90LB017_201890% confidence interval lower bound of estimate of people age 0-17 in poverty 2018
CI90LB017P_201890% confidence interval lower bound of estimate of percent of people age 0-17 in poverty 2018
CI90LBINC_201890% confidence interval lower bound of estimate of median household income 2018
CI90UB017_201890% confidence interval upper bound of estimate of people age 0-17 in poverty 2018
CI90UB017P_201890% confidence interval upper bound of estimate of percent of people age 0-17 in poverty 2018
CI90UBINC_201890% confidence interval upper bound of estimate of median household income 2018
Civilian_labor_force_2018Civilian labor force annual average, 2018
Deaths_2019Deaths in period 7/1/2018 to 6/30/2019
DOMESTIC_MIG_2019Net domestic migration in period 7/1/2018 to 6/30/2019
Economic_typology_2015County economic types, 2015 edition
Employed_2019Number employed annual average, 2019
ESTIMATES_BASE_20104/1/2010 resident total population estimates base
FIPS_CodeState-County FIPS Code
GQ_ESTIMATES_20197/1/2019 Group Quarters total population estimate
GQ_ESTIMATES_BASE_20104/1/2010 Group Quarters total population estimates base
INTERNATIONAL_MIG_2019Net international migration in period 7/1/2018 to 6/30/2019
Med_HH_Income_Percent_of_State_Total_2019County Household Median Income as a percent of the State Total Median Household Income, 2019
MEDHHINC_2018Estimate of median household income 2018
Median_Household_Income_2019Estimate of Median household Income, 2019
Metro_2013Metro nonmetro dummy 0=Nonmetro 1=Metro (Based on 2013 OMB Metropolitan Area delineation)
N_POP_CHG_2019Numeric Change in resident total population 7/1/2018 to 7/1/2019
NATURAL_INC_2019Natural increase in period 7/1/2018 to 6/30/2019
NET_MIG_2019Net migration in period 7/1/2018 to 6/30/2019
PCTPOV017_2018Estimated percent of people age 0-17 in poverty 2018
POP_ESTIMATE_20197/1/2019 resident total population estimate
POV017_2018Estimate of people age 0-17 in poverty 2018
R_death_2019Death rate in period 7/1/2018 to 6/30/2019
R_DOMESTIC_MIG_2019Net domestic migration rate in period 7/1/2018 to 6/30/2019
R_INTERNATIONAL_MIG_2019Net international migration rate in period 7/1/2018 to 6/30/2019
R_NATURAL_INC_2019Natural increase rate in period 7/1/2018 to 6/30/2019
R_NET_MIG_2019Net migration rate in period 7/1/2018 to 6/30/2019
RESIDUAL_2019Residual for period 7/1/2018 to 6/30/2019
Rural-urban_Continuum_Code_2013Rural-urban Continuum Code, 2013
StateState Abbreviation
Unemployed_2019Number unemployed annual average, 2019
Unemployment_rate_2019Unemployment rate, 2019
Urban_Influence_Code_2013Urban Influence Code, 2013

Airline On-Time Performance

The Bureau of Transportation Statistics lets you download a lot of data, one month at a time from this page. You can download many different attributes of every flight.

We’ve chosen a set of attributes for each flight, which leads to huge files (250MB/month), and downloaded several months (one per season for 3 years).

We’ve downloaded a few months for you and assembled them into easier to use files. We’ve put multiple months together and down-sampled the data (using random sampling) to reduce the data set size. While Tableau seems very able to process even the largest of these files, they do get a little unweildy.

We chose 2019 (not 2020) since it has more “usual” patterns (although, comaparing pre- and post-COVID would be interesting).

The “small” datasets were downsampled to produce file sizes of about 5MB. The January file was downsampled to 2% (that is, each flight has a 2% chance of being included). The 4 month data set and 4 month*3 year data set was downsampled to have a similar total number of flights (e.g., the 4 month set has 2%/4).

The “medium” datasets were downsampled to produce file sizes of about 25MB each. The 4 month data set and 4 month*3 year data set was downsampled to have a similar total number of flights (e.g., the 4 month set has 10%/4).

The “large” data sets include 10% of all the months included. You can get a sense of how big they are from this picture (12 months, 40-60K flights a month)

airlines-count.png

For this data set, you may choose to use the months we downloaded, or download your own (please specify what data you use). You can choose to use just 1 month (version 1), you can pick multiple months to compare (version 2), or multiple years to compare if you want a real challenge (version 3). Note that you may filter by geography (select a particular state or all states), year, and monthly periods. Files are available as .csv.

You may want information on the airports (for example, to get location coordinates for each airport code). This data is available from Federal Aviation Administration. The Airline codes list is useful - especially for Madison where you see airlines like “F9” and “OO” (these are regional commuter companies that fly under another airline’s name).

You might want to refer to Bureau of Transportation Statistics for explanations of all the fields, and look up tables (files that say what the codes mean). Fields include date, carrier, origin/destination information, departure/arrival performance, gate/airport information, flight summaries, cause of delay, and some of them are replicated in the following table:

Variable Descriptions
Column nameDescription
ActualElapsedTimeElapsed Time of Flight, in Minutes
AirTimeFlight Time, in Minutes
ArrDel15Arrival Delay Indicator, 15 Minutes or More (1=Yes)
ArrDelayDifference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
ArrDelayMinutesDifference in minutes between scheduled and actual arrival time. Early arrivals set to 0.
ArrivalDelayGroupsArrival Delay intervals, every (15-minutes from <-15 to >180)
ArrTimeActual Arrival Time (local time: hhmm)
ArrTimeBlkCRS Arrival Time Block, Hourly Intervals
CancellationCodeSpecifies The Reason For Cancellation
CancelledCancelled Flight Indicator (1=Yes)
CarrierDelayCarrier Delay, in Minutes
CRSArrTimeCRS Arrival Time (local time: hhmm)
CRSDepTimeCRS Departure Time (local time: hhmm)
CRSElapsedTimeCRS Elapsed Time of Flight, in Minutes
DayofMonthDay of Month
DayOfWeekDay of Week
DepartureDelayGroupsDeparture Delay intervals, every (15 minutes from <-15 to >180)
DepDel15Departure Delay Indicator, 15 Minutes or More (1=Yes)
DepDelayDifference in minutes between scheduled and actual departure time. Early departures show negative numbers.
DepDelayMinutesDifference in minutes between scheduled and actual departure time. Early departures set to 0.
DepTimeActual Departure Time (local time: hhmm)
DepTimeBlkCRS Departure Time Block, Hourly Intervals
DestDestination Airport
DestAirportIDDestination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
DestAirportSeqIDDestination Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
DestCityMarketIDDestination Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
DestCityNameDestination Airport, City Name
DestStateDestination Airport, State Code
DestStateFipsDestination Airport, State Fips
DestStateNameDestination Airport, State Name
DestWacDestination Airport, World Area Code
DistanceDistance between airports (miles)
DistanceGroupDistance Intervals, every 250 Miles, for Flight Segment
DivActualElapsedTimeElapsed Time of Diverted Flight Reaching Scheduled Destination, in Minutes. The ActualElapsedTime column remains NULL for all diverted flights.
DivAirportLandingsNumber of Diverted Airport Landings
DivArrDelayDifference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights.
DivDistanceDistance between scheduled destination and final diverted airport (miles). Value will be 0 for diverted flight reaching scheduled destination.
DivertedDiverted Flight Indicator (1=Yes)
DivReachedDestDiverted Flight Reaching Scheduled Destination Indicator (1=Yes)
DOT_ID_Reporting_AirlineAn identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
FirstDepTimeFirst Gate Departure Time at Origin Airport
Flight_Number_Reporting_AirlineFlight Number
FlightDateFlight Date (yyyymmdd)
FlightsNumber of Flights
IATA_CODE_Reporting_AirlineCode assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
LateAircraftDelayLate Aircraft Delay, in Minutes
LongestAddGTimeLongest Time Away from Gate for Gate Return or Cancelled Flight
MonthMonth
NASDelayNational Air System Delay, in Minutes
OriginOrigin Airport
OriginAirportIDOrigin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
OriginAirportSeqIDOrigin Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
OriginCityMarketIDOrigin Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
OriginCityNameOrigin Airport, City Name
OriginStateOrigin Airport, State Code
OriginStateFipsOrigin Airport, State Fips
OriginStateNameOrigin Airport, State Name
OriginWacOrigin Airport, World Area Code
QuarterQuarter (1-4)
Reporting_AirlineUnique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
SecurityDelaySecurity Delay, in Minutes
Tail_NumberTail Number
TaxiInTaxi In Time, in Minutes
TaxiOutTaxi Out Time, in Minutes
TotalAddGTimeTotal Ground Time Away from Gate for Gate Return or Cancelled Flight
WeatherDelayWeather Delay, in Minutes
WheelsOffWheels Off Time (local time: hhmm)
WheelsOnWheels On Time (local time: hhmm)
YearYear

Time Usage Survey

2003-2019 all samples, Respondents

The American Time Usage Survey (ATUS) tracks how people spend their time. There are corresponding international versions. There are actually lots of different surveys with interesting data available from the IPUMS website.

We have done a data pull for you. We are allowed to share a data pull (see this). We chose to collect data from all of the years (2003-2019), and selected a wide range of different attributes. The data was downloaded on September 13, 2020 from ATUS.

You may create your own data pull if you’d like to try this with different columns (but be sure to document it in your writeup). Getting a data set requires picking from all the options - there are so many options that picking a good set is pretty time consuming. It is actually an interesting exercise to see how they document their data - they are very careful in documenting everything.

You can find out what the “time use codes” mean on this page.

Interpreting the other codes requires some digging, unfortunately. Some are self-explanatory, but others… I tracked down the “FAMINCOME” columns: explanation here. The state codes are here. Here are some of them are replicated in the following table:

Variable Descriptions
Column nameDescription
RECTYPERecord type
1Household
2Person
3Activity
4Who
5Eldercare
REGIONRegion
1Northeast
2Midwest
3South
4West
STATEFIPFIPS State Code
01Alabama
02Alaska
04Arizona
05Arkansas
06California
08Colorado
09Connecticut
10Delaware
11District of Columbia
12Florida
13Georgia
15Hawaii
16Idaho
17Illinois
18Indiana
19Iowa
20Kansas
21Kentucky
22Louisiana
23Maine
24Maryland
25Massachusetts
26Michigan
27Minnesota
28Mississippi
29Missouri
30Montana
31Nebraska
32Nevada
33New Hampshire
34New Jersey
35New Mexico
36New York
37North Carolina
38North Dakota
39Ohio
40Oklahoma
41Oregon
42Pennsylvania
44Rhode Island
45South Carolina
46South Dakota
47Tennessee
48Texas
49Utah
50Vermont
51Virginia
53Washington
54West Virginia
55Wisconsin
56Wyoming
METROMetropolitan/central city status
01Metropolitan, central city
02Metropolitan, balance of MSA
03Metropolitan, not identified
04Nonmetropolitan
05Not identified
FAMINCOMEFamily income
001Less than $5,000
002$5,000 to $7,499
003$7,500 to $9,999
004$10,000 to $12,499
005$12,500 to $14,999
006$15,000 to $19,999
007$20,000 to $24,999
008$25,000 to $29,999
009$30,000 to $34,999
010$35,000 to $39,999
011$40,000 to $49,999
012$50,000 to $59,999
013$60,000 to $74,999
014$75,000 to $99,999
015$100,000 to $149,999
016$150,000 and over
996Refused
997Don’t know
998Blank
HH_SIZENumber of people in household
0011
0022
0033
0044
0055
0066
0077
0088
0099
01010
01111
01212
01313
01414
01515
01616
999NIU (Not in universe)
PERNUMPerson number (general)
011
022
033
044
055
066
077
088
099
1010
1111
1212
1313
1414
1515
1616
LINENOPerson line number
0011
0022
0033
0044
0055
0066
0077
0088
0099
01010
01111
01212
01313
01414
01515
01616
01717
01818
01919
999NIU (Not in universe)
SEXSex
01Male
02Female
99NIU (Not in universe)
RACERace
0100White only
0110Black only
0120American Indian, Alaskan Native
0130Asian or Pacific Islander
0131Asian only
0132Hawaiian Pacific Islander only
0200White-Black
0201White-American Indian
0202White-Asian
0203White-Hawaiian
0210Black-American Indian
0211Black-Asian
0212Black-Hawaiian
0220American Indian-Asian
0221American Indian-Hawaiian
0230Asian-Hawaiian
0300White-Black-American Indian
0301White-Black-Asian
0302White-Black-Hawaiian
0310White-American Indian-Asian
0311White-American Indian-Hawaiian
0320White-Asian-Hawaiian
0330Black-American Indian-Asian
0331Black-American Indian-Hawaiian
0340Black-Asian-Hawaiian
0350American Indian-Asian-Hawaiian
0398Other 3 race combinations
03992 or 3 races, unspecified
0400White-Black-American Indian-Asian
0401White-Black-American Indian-Hawaiian
0402White-Black-Asian-Hawaiian
0403Black-American Indian-Asian-Hawaiian
0404White-American Indian-Asian-Hawaiian
0500White-Black-American Indian-Asian-Hawaiian
05994 or 5 races, unspecified
9999NIU (Not in universe)
EDUCHighest level of school completed
-Less than HS diploma
010Less than 1st grade
0111st, 2nd, 3rd, or 4th grade
0125th or 6th grade
0137th or 8th grade
0149th grade
01510th grade
01611th grade
01712th grade - no diploma
-HS diploma, no college
020High school graduate - GED
021High school graduate - diploma
-Some college
030Some college but no degree
031Associate degree - occupational vocational
032Associate degree - academic program
-College degree +
040Bachelor’s degree (BA, AB, BS, etc.)
041Master’s degree (MA, MS, MEng, MEd, MSW, etc.)
042Professional school degree (MD, DDS, DVM, etc.)
043Doctoral degree (PhD, EdD, etc.)
999NIU (Not in universe)
EDUCYRSYears of education
100Grades 1-12
101Less than first grade
102First through fourth grade
105Fifth through sixth grade
107Seventh through eighth grade
109Ninth grade
110Tenth grade
111Eleventh grade
112Twelfth grade
200College
213College–one year
214College–two years
215College–three years
216College–four years
217Bachelor’s degree
300Advanced degree
316Master’s degree
317Master’s degree–one year program
318Master’s degree–two year program
319Master’s degree–three+ year program
320Professional degree
321Doctoral degree
999NIU (Not in universe)
EMPSTATLabor force status
01Employed - at work
02Employed - absent
03Unemployed - on layoff
04Unemployed - looking
05Not in labor force
99NIU (Not in universe)

IMDB Movie Data

This is data on approximately 5000 movies with columns including director, genre, plot keywords, and box office statistics. The data comes from Kaggle, and may be noisy.

This data set is small for the design challenge – but we’re still allowing it. The genre field contains texts separated by vertical bars, and we converted them into separate indicator columns and appended to the end of the table.

  • CSV file on Box (~1MB, 5044 columns and 54 rows)

The data was downloaded and processed on September 13, 2020 from Kaggle.

The list of genres are summarized in the following table:

Genre Counts
GenreCount
Action1153
Adventure923
Fantasy610
Sci-Fi616
Thriller1411
Documentary121
Romance1107
Animation242
Comedy1872
Family546
Musical132
Mystery500
Western97
Drama2594
History207
Sport182
Crime889
Horror565
War213
Biography293
Music214
Game-Show1
Reality-TV2
News3
Short5
Film-Noir6