You must pick one of these data sets for your assignment.
For Design Challenge 1, we are providing 4 different data sets. Variants of these data sets have been used successfully by students in the past. All of these data sets are sufficiently large and rich to contain many interesting stories. We may provide more choices, or better versions of the datasets listed. We will not remove things from this list (so if you pick something, it will continue to be OK).
There are four choices (some with multiple variants):
Census Data
Airline On-Time Performance
Time Usage (how people spend their time)
IMDB movie data
Some of the data sets get quite big. However, Tableau is remarkably efficient at working with 100MB+ data sets - it uses fancy database technologies to perform aggregations quickly.
In each case, you may access the original data from its source. However, we are providing versions where we have done a little of the data wrangling work for you. Thank Young, the TA.
In your assignment, be sure to explain which data set you are using (including which variant). If you got your own version of the data, please say so.
Note: we are working on being able to provide these as part of our Tableau site so you can work with them using Tableau online without having to download/upload them yourself. Stay tuned for an announcement!
Census Data by County
This data set aggregates many different quantities of interest over the counties of the US. In past assignments, we’ve provided even more detailed data sets which required even more aggregation.
The USDA provides this data as 4 separate sheets (on This page). Any one of them could tell an interesting story but together, they provide a very rich and complex data set full of stories.
We have joined the 4 spreadsheets together (joining by the “FIPS Code” column) creating a single file. The four files are also combined (joined by the “FIPS Code” column) into one file, and put in Box. The rows for the states (not counties) are also removed. The data is downloaded and processed on September 13, 2020 from USDA/ERS.
The variable descriptions can be found on here, and some of them are replicated in the following table:
Variable Descriptions
Column name
Description
Births_2019
Births in period 7/1/2018 to 6/30/2019
CENSUS_2010_POP
4/1/2010 resident Census 2010 population
CI90LB017_2018
90% confidence interval lower bound of estimate of people age 0-17 in poverty 2018
CI90LB017P_2018
90% confidence interval lower bound of estimate of percent of people age 0-17 in poverty 2018
CI90LBINC_2018
90% confidence interval lower bound of estimate of median household income 2018
CI90UB017_2018
90% confidence interval upper bound of estimate of people age 0-17 in poverty 2018
CI90UB017P_2018
90% confidence interval upper bound of estimate of percent of people age 0-17 in poverty 2018
CI90UBINC_2018
90% confidence interval upper bound of estimate of median household income 2018
Civilian_labor_force_2018
Civilian labor force annual average, 2018
Deaths_2019
Deaths in period 7/1/2018 to 6/30/2019
DOMESTIC_MIG_2019
Net domestic migration in period 7/1/2018 to 6/30/2019
Economic_typology_2015
County economic types, 2015 edition
Employed_2019
Number employed annual average, 2019
ESTIMATES_BASE_2010
4/1/2010 resident total population estimates base
FIPS_Code
State-County FIPS Code
GQ_ESTIMATES_2019
7/1/2019 Group Quarters total population estimate
GQ_ESTIMATES_BASE_2010
4/1/2010 Group Quarters total population estimates base
INTERNATIONAL_MIG_2019
Net international migration in period 7/1/2018 to 6/30/2019
Med_HH_Income_Percent_of_State_Total_2019
County Household Median Income as a percent of the State Total Median Household Income, 2019
MEDHHINC_2018
Estimate of median household income 2018
Median_Household_Income_2019
Estimate of Median household Income, 2019
Metro_2013
Metro nonmetro dummy 0=Nonmetro 1=Metro (Based on 2013 OMB Metropolitan Area delineation)
N_POP_CHG_2019
Numeric Change in resident total population 7/1/2018 to 7/1/2019
NATURAL_INC_2019
Natural increase in period 7/1/2018 to 6/30/2019
NET_MIG_2019
Net migration in period 7/1/2018 to 6/30/2019
PCTPOV017_2018
Estimated percent of people age 0-17 in poverty 2018
POP_ESTIMATE_2019
7/1/2019 resident total population estimate
POV017_2018
Estimate of people age 0-17 in poverty 2018
R_death_2019
Death rate in period 7/1/2018 to 6/30/2019
R_DOMESTIC_MIG_2019
Net domestic migration rate in period 7/1/2018 to 6/30/2019
R_INTERNATIONAL_MIG_2019
Net international migration rate in period 7/1/2018 to 6/30/2019
R_NATURAL_INC_2019
Natural increase rate in period 7/1/2018 to 6/30/2019
R_NET_MIG_2019
Net migration rate in period 7/1/2018 to 6/30/2019
RESIDUAL_2019
Residual for period 7/1/2018 to 6/30/2019
Rural-urban_Continuum_Code_2013
Rural-urban Continuum Code, 2013
State
State Abbreviation
Unemployed_2019
Number unemployed annual average, 2019
Unemployment_rate_2019
Unemployment rate, 2019
Urban_Influence_Code_2013
Urban Influence Code, 2013
Airline On-Time Performance
The Bureau of Transportation Statistics lets you download a lot of data, one month at a time from this page. You can download many different attributes of every flight.
We’ve chosen a set of attributes for each flight, which leads to huge files (250MB/month), and downloaded several months (one per season for 3 years).
We’ve downloaded a few months for you and assembled them into easier to use files. We’ve put multiple months together and down-sampled the data (using random sampling) to reduce the data set size. While Tableau seems very able to process even the largest of these files, they do get a little unweildy.
We chose 2019 (not 2020) since it has more “usual” patterns (although, comaparing pre- and post-COVID would be interesting).
The “small” datasets were downsampled to produce file sizes of about 5MB. The January file was downsampled to 2% (that is, each flight has a 2% chance of being included). The 4 month data set and 4 month*3 year data set was downsampled to have a similar total number of flights (e.g., the 4 month set has 2%/4).
The “medium” datasets were downsampled to produce file sizes of about 25MB each. The 4 month data set and 4 month*3 year data set was downsampled to have a similar total number of flights (e.g., the 4 month set has 10%/4).
The “large” data sets include 10% of all the months included. You can get a sense of how big they are from this picture (12 months, 40-60K flights a month)
For this data set, you may choose to use the months we downloaded, or download your own (please specify what data you use). You can choose to use just 1 month (version 1), you can pick multiple months to compare (version 2), or multiple years to compare if you want a real challenge (version 3). Note that you may filter by geography (select a particular state or all states), year, and monthly periods. Files are available as .csv.
You may want information on the airports (for example, to get location coordinates for each airport code). This data is available from Federal Aviation Administration. The Airline codes list is useful - especially for Madison where you see airlines like “F9” and “OO” (these are regional commuter companies that fly under another airline’s name).
You might want to refer to Bureau of Transportation Statistics for explanations of all the fields, and look up tables (files that say what the codes mean). Fields include date, carrier, origin/destination information, departure/arrival performance, gate/airport information, flight summaries, cause of delay, and some of them are replicated in the following table:
Variable Descriptions
Column name
Description
ActualElapsedTime
Elapsed Time of Flight, in Minutes
AirTime
Flight Time, in Minutes
ArrDel15
Arrival Delay Indicator, 15 Minutes or More (1=Yes)
ArrDelay
Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
ArrDelayMinutes
Difference in minutes between scheduled and actual arrival time. Early arrivals set to 0.
ArrivalDelayGroups
Arrival Delay intervals, every (15-minutes from <-15 to >180)
ArrTime
Actual Arrival Time (local time: hhmm)
ArrTimeBlk
CRS Arrival Time Block, Hourly Intervals
CancellationCode
Specifies The Reason For Cancellation
Cancelled
Cancelled Flight Indicator (1=Yes)
CarrierDelay
Carrier Delay, in Minutes
CRSArrTime
CRS Arrival Time (local time: hhmm)
CRSDepTime
CRS Departure Time (local time: hhmm)
CRSElapsedTime
CRS Elapsed Time of Flight, in Minutes
DayofMonth
Day of Month
DayOfWeek
Day of Week
DepartureDelayGroups
Departure Delay intervals, every (15 minutes from <-15 to >180)
DepDel15
Departure Delay Indicator, 15 Minutes or More (1=Yes)
DepDelay
Difference in minutes between scheduled and actual departure time. Early departures show negative numbers.
DepDelayMinutes
Difference in minutes between scheduled and actual departure time. Early departures set to 0.
DepTime
Actual Departure Time (local time: hhmm)
DepTimeBlk
CRS Departure Time Block, Hourly Intervals
Dest
Destination Airport
DestAirportID
Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
DestAirportSeqID
Destination Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
DestCityMarketID
Destination Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
DestCityName
Destination Airport, City Name
DestState
Destination Airport, State Code
DestStateFips
Destination Airport, State Fips
DestStateName
Destination Airport, State Name
DestWac
Destination Airport, World Area Code
Distance
Distance between airports (miles)
DistanceGroup
Distance Intervals, every 250 Miles, for Flight Segment
DivActualElapsedTime
Elapsed Time of Diverted Flight Reaching Scheduled Destination, in Minutes. The ActualElapsedTime column remains NULL for all diverted flights.
DivAirportLandings
Number of Diverted Airport Landings
DivArrDelay
Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights.
DivDistance
Distance between scheduled destination and final diverted airport (miles). Value will be 0 for diverted flight reaching scheduled destination.
An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
FirstDepTime
First Gate Departure Time at Origin Airport
Flight_Number_Reporting_Airline
Flight Number
FlightDate
Flight Date (yyyymmdd)
Flights
Number of Flights
IATA_CODE_Reporting_Airline
Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
LateAircraftDelay
Late Aircraft Delay, in Minutes
LongestAddGTime
Longest Time Away from Gate for Gate Return or Cancelled Flight
Month
Month
NASDelay
National Air System Delay, in Minutes
Origin
Origin Airport
OriginAirportID
Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
OriginAirportSeqID
Origin Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
OriginCityMarketID
Origin Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
OriginCityName
Origin Airport, City Name
OriginState
Origin Airport, State Code
OriginStateFips
Origin Airport, State Fips
OriginStateName
Origin Airport, State Name
OriginWac
Origin Airport, World Area Code
Quarter
Quarter (1-4)
Reporting_Airline
Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
SecurityDelay
Security Delay, in Minutes
Tail_Number
Tail Number
TaxiIn
Taxi In Time, in Minutes
TaxiOut
Taxi Out Time, in Minutes
TotalAddGTime
Total Ground Time Away from Gate for Gate Return or Cancelled Flight
WeatherDelay
Weather Delay, in Minutes
WheelsOff
Wheels Off Time (local time: hhmm)
WheelsOn
Wheels On Time (local time: hhmm)
Year
Year
Time Usage Survey
2003-2019
all samples, Respondents
The American Time Usage Survey (ATUS) tracks how people spend their time. There are corresponding international versions. There are actually lots of different surveys with interesting data available from the IPUMS website.
We have done a data pull for you. We are allowed to share a data pull (see this). We chose to collect data from all of the years (2003-2019), and selected a wide range of different attributes. The data was downloaded on September 13, 2020 from ATUS.
You may create your own data pull if you’d like to try this with different columns (but be sure to document it in your writeup). Getting a data set requires picking from all the options - there are so many options that picking a good set is pretty time consuming. It is actually an interesting exercise to see how they document their data - they are very careful in documenting everything.
You can find out what the “time use codes” mean on this page.
Interpreting the other codes requires some digging, unfortunately. Some are self-explanatory, but others… I tracked down the “FAMINCOME” columns: explanation here. The state codes are here. Here are some of them are replicated in the following table:
Variable Descriptions
Column name
Description
RECTYPE
Record type
1
Household
2
Person
3
Activity
4
Who
5
Eldercare
REGION
Region
1
Northeast
2
Midwest
3
South
4
West
STATEFIP
FIPS State Code
01
Alabama
02
Alaska
04
Arizona
05
Arkansas
06
California
08
Colorado
09
Connecticut
10
Delaware
11
District of Columbia
12
Florida
13
Georgia
15
Hawaii
16
Idaho
17
Illinois
18
Indiana
19
Iowa
20
Kansas
21
Kentucky
22
Louisiana
23
Maine
24
Maryland
25
Massachusetts
26
Michigan
27
Minnesota
28
Mississippi
29
Missouri
30
Montana
31
Nebraska
32
Nevada
33
New Hampshire
34
New Jersey
35
New Mexico
36
New York
37
North Carolina
38
North Dakota
39
Ohio
40
Oklahoma
41
Oregon
42
Pennsylvania
44
Rhode Island
45
South Carolina
46
South Dakota
47
Tennessee
48
Texas
49
Utah
50
Vermont
51
Virginia
53
Washington
54
West Virginia
55
Wisconsin
56
Wyoming
METRO
Metropolitan/central city status
01
Metropolitan, central city
02
Metropolitan, balance of MSA
03
Metropolitan, not identified
04
Nonmetropolitan
05
Not identified
FAMINCOME
Family income
001
Less than $5,000
002
$5,000 to $7,499
003
$7,500 to $9,999
004
$10,000 to $12,499
005
$12,500 to $14,999
006
$15,000 to $19,999
007
$20,000 to $24,999
008
$25,000 to $29,999
009
$30,000 to $34,999
010
$35,000 to $39,999
011
$40,000 to $49,999
012
$50,000 to $59,999
013
$60,000 to $74,999
014
$75,000 to $99,999
015
$100,000 to $149,999
016
$150,000 and over
996
Refused
997
Don’t know
998
Blank
HH_SIZE
Number of people in household
001
1
002
2
003
3
004
4
005
5
006
6
007
7
008
8
009
9
010
10
011
11
012
12
013
13
014
14
015
15
016
16
999
NIU (Not in universe)
PERNUM
Person number (general)
01
1
02
2
03
3
04
4
05
5
06
6
07
7
08
8
09
9
10
10
11
11
12
12
13
13
14
14
15
15
16
16
LINENO
Person line number
001
1
002
2
003
3
004
4
005
5
006
6
007
7
008
8
009
9
010
10
011
11
012
12
013
13
014
14
015
15
016
16
017
17
018
18
019
19
999
NIU (Not in universe)
SEX
Sex
01
Male
02
Female
99
NIU (Not in universe)
RACE
Race
0100
White only
0110
Black only
0120
American Indian, Alaskan Native
0130
Asian or Pacific Islander
0131
Asian only
0132
Hawaiian Pacific Islander only
0200
White-Black
0201
White-American Indian
0202
White-Asian
0203
White-Hawaiian
0210
Black-American Indian
0211
Black-Asian
0212
Black-Hawaiian
0220
American Indian-Asian
0221
American Indian-Hawaiian
0230
Asian-Hawaiian
0300
White-Black-American Indian
0301
White-Black-Asian
0302
White-Black-Hawaiian
0310
White-American Indian-Asian
0311
White-American Indian-Hawaiian
0320
White-Asian-Hawaiian
0330
Black-American Indian-Asian
0331
Black-American Indian-Hawaiian
0340
Black-Asian-Hawaiian
0350
American Indian-Asian-Hawaiian
0398
Other 3 race combinations
0399
2 or 3 races, unspecified
0400
White-Black-American Indian-Asian
0401
White-Black-American Indian-Hawaiian
0402
White-Black-Asian-Hawaiian
0403
Black-American Indian-Asian-Hawaiian
0404
White-American Indian-Asian-Hawaiian
0500
White-Black-American Indian-Asian-Hawaiian
0599
4 or 5 races, unspecified
9999
NIU (Not in universe)
EDUC
Highest level of school completed
-
Less than HS diploma
010
Less than 1st grade
011
1st, 2nd, 3rd, or 4th grade
012
5th or 6th grade
013
7th or 8th grade
014
9th grade
015
10th grade
016
11th grade
017
12th grade - no diploma
-
HS diploma, no college
020
High school graduate - GED
021
High school graduate - diploma
-
Some college
030
Some college but no degree
031
Associate degree - occupational vocational
032
Associate degree - academic program
-
College degree +
040
Bachelor’s degree (BA, AB, BS, etc.)
041
Master’s degree (MA, MS, MEng, MEd, MSW, etc.)
042
Professional school degree (MD, DDS, DVM, etc.)
043
Doctoral degree (PhD, EdD, etc.)
999
NIU (Not in universe)
EDUCYRS
Years of education
100
Grades 1-12
101
Less than first grade
102
First through fourth grade
105
Fifth through sixth grade
107
Seventh through eighth grade
109
Ninth grade
110
Tenth grade
111
Eleventh grade
112
Twelfth grade
200
College
213
College–one year
214
College–two years
215
College–three years
216
College–four years
217
Bachelor’s degree
300
Advanced degree
316
Master’s degree
317
Master’s degree–one year program
318
Master’s degree–two year program
319
Master’s degree–three+ year program
320
Professional degree
321
Doctoral degree
999
NIU (Not in universe)
EMPSTAT
Labor force status
01
Employed - at work
02
Employed - absent
03
Unemployed - on layoff
04
Unemployed - looking
05
Not in labor force
99
NIU (Not in universe)
IMDB Movie Data
This is data on approximately 5000 movies with columns including director, genre, plot keywords, and box office statistics. The data comes from Kaggle, and may be noisy.
This data set is small for the design challenge – but we’re still allowing it. The genre field contains texts separated by vertical bars, and we converted them into separate indicator columns and appended to the end of the table.