Design Challenge 1 Data Sets

This is a list of the data sets you may use for Design Challenge 1: One Dataset / Four Stories (Design Exercises 3 and 4).

You must choose one data set to create all of the visualizations in a phase. We strongly encourage you to use the same data set for phases 3 and 4.

For each data set, we have put the file in a Canvas file folder (the DC1 Data Sets folder). We will also create a starter workbook in Tableau Online, so you can copy the workbook (rather than uploading it yourself). See this Piazza Posting for instructions.

Note: for each data set, we’ve checked that it is in decent shape, and done some data cleaning. You may want to do more.

Also, for many of the sources, there is more data available. If you’d like to grab a different set from the same source, that is acceptable. You must choose a set that is at least comparable to what we provide. Please describe what you’ve done somewhere when you hand things in.

Also, note that this year’s list differs from prior years’ lists (only two of the data sets are the same).

1. Spotify

Note: this is a different (and bigger) Spotify dataset than we had in Phases 1 and 2.

The SpotifyFeatures.csv data set is a 32MB CSV file with 230K songs. For each song there are approximately a dozen features, as well as some identifying information.

The data set originally comes from Kaggle. There is some documentation for the features there, as well as in the documentation for the smaller dataset. Unfortunately, the Spotify Documentation provides little explanation of the features.

2. College Scorecard

This data comes from the U.S. Department of Education. It is a huge data set with a ton of information about all the degree-granting institutions in the US. Some of the files have over 2600 features for each college. Most of these features are quite detailed and relate to financial aspects.

If you want to get the full data (be warned: it is a 380M+ compressed zip file) and more information about it, you can download it from here. However, Abhay has processed the data to choose a reasonable subset, which should be more than enough to find interesting stories for this project.

Abhay’s files pick 37 features from the huge files, and put together multiple years. For each college, there is an entry for each year. You can get information about what the columns mean from the data dictionary.

There are two versions of the data: College_2012_2_2020.csv has years 2012-2020 in 2 year increments, while College_2000_5_2020.csv covers 2000-2020 in 5 year increments. You can choose either one.

One warning: this dataset has a bunch of “missing data” (empty entries that may show up as zeros).
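
Since blanks and real zeros mean different things, it may help to keep them apart when you read the file. Here is a minimal sketch using Python's standard `csv` module. The column names (`name`, `year`, `admission_rate`) and the sample rows are made up for illustration; they are not the columns in Abhay's files, so check the data dictionary for the real ones.

```python
import csv
import io

# Hypothetical sample data; the real files have different columns and many more rows.
SAMPLE = """name,year,admission_rate
College A,2018,0.52
College B,2018,
College C,2018,0
"""

def to_float_or_none(text):
    """Return a float, or None when the cell is blank (missing data)."""
    text = text.strip()
    return float(text) if text else None

def read_rows(handle, numeric_columns):
    """Read CSV rows, converting the named columns and keeping blanks as None."""
    rows = []
    for row in csv.DictReader(handle):
        for column in numeric_columns:
            row[column] = to_float_or_none(row[column])
        rows.append(row)
    return rows

rows = read_rows(io.StringIO(SAMPLE), ["admission_rate"])

# Blank cells and genuine zeros end up in separate lists.
missing = [r["name"] for r in rows if r["admission_rate"] is None]
true_zeros = [r["name"] for r in rows if r["admission_rate"] == 0]
```

If you work in Tableau instead, the same idea applies: check whether a zero in a chart is a real value or an empty cell before you build a story on it.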

3. Census Data by County

This data set aggregates many different quantities of interest over the counties of the US. In past assignments, we’ve provided even more detailed data sets which required even more aggregation.

The USDA provides this data as 4 separate sheets (on this page). Any one of them could tell an interesting story, but together they provide a very rich and complex data set full of stories.

We have (well, Young, the 765 TA in 2020, has) joined the 4 spreadsheets together (joining by the “FIPS Code” column) into a single file, and put it in Box. The rows for the states (not counties) were also removed. The data was downloaded and processed on September 13, 2020 from USDA/ERS.
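
If you pull a different set of sheets from the same source, you would need to repeat this join yourself. The following is only a rough sketch of the idea, not the script that was actually used. It assumes an inner join (the kind of join used for the provided file is not stated), assumes FIPS codes are kept as 5-character strings, and assumes that state and national summary rows are the ones whose code ends in `000`. The helper names and the tiny tables are hypothetical, and the numbers are placeholders rather than real values.

```python
def join_by_fips(*tables):
    """Inner-join tables (dicts keyed by FIPS code) on the codes they all share."""
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    joined = {}
    for fips in sorted(shared):
        merged = {}
        for table in tables:
            merged.update(table[fips])
        joined[fips] = merged
    return joined

def is_county(fips):
    """Assumption: summary rows (states, whole US) have FIPS codes ending in 000."""
    return not fips.endswith("000")

# Hypothetical miniature versions of two sheets; values are placeholders.
population = {
    "55000": {"POP_ESTIMATE_2019": 1000},
    "55025": {"POP_ESTIMATE_2019": 200},
    "55079": {"POP_ESTIMATE_2019": 300},
}
unemployment = {
    "55000": {"Unemployment_rate_2019": 3.0},
    "55025": {"Unemployment_rate_2019": 2.5},
}

joined = join_by_fips(population, unemployment)
counties = {fips: row for fips, row in joined.items() if is_county(fips)}
```

Keeping the codes as strings matters: if a tool reads them as numbers, leading zeros disappear and the keys from different sheets may no longer match.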

The CSV file census_counties.csv is about 4MB and has 3196 rows and 339 columns.

The variable descriptions can be found here, and some of them are replicated in the following table:

Variable Descriptions

| Column name | Description |
| --- | --- |
| Births_2019 | Births in period 7/1/2018 to 6/30/2019 |
| CENSUS_2010_POP | 4/1/2010 resident Census 2010 population |
| CI90LB017_2018 | 90% confidence interval lower bound of estimate of people age 0-17 in poverty 2018 |
| CI90LB017P_2018 | 90% confidence interval lower bound of estimate of percent of people age 0-17 in poverty 2018 |
| CI90LBINC_2018 | 90% confidence interval lower bound of estimate of median household income 2018 |
| CI90UB017_2018 | 90% confidence interval upper bound of estimate of people age 0-17 in poverty 2018 |
| CI90UB017P_2018 | 90% confidence interval upper bound of estimate of percent of people age 0-17 in poverty 2018 |
| CI90UBINC_2018 | 90% confidence interval upper bound of estimate of median household income 2018 |
| Civilian_labor_force_2018 | Civilian labor force annual average, 2018 |
| Deaths_2019 | Deaths in period 7/1/2018 to 6/30/2019 |
| DOMESTIC_MIG_2019 | Net domestic migration in period 7/1/2018 to 6/30/2019 |
| Economic_typology_2015 | County economic types, 2015 edition |
| Employed_2019 | Number employed annual average, 2019 |
| ESTIMATES_BASE_2010 | 4/1/2010 resident total population estimates base |
| FIPS_Code | State-County FIPS Code |
| GQ_ESTIMATES_2019 | 7/1/2019 Group Quarters total population estimate |
| GQ_ESTIMATES_BASE_2010 | 4/1/2010 Group Quarters total population estimates base |
| INTERNATIONAL_MIG_2019 | Net international migration in period 7/1/2018 to 6/30/2019 |
| Med_HH_Income_Percent_of_State_Total_2019 | County Household Median Income as a percent of the State Total Median Household Income, 2019 |
| MEDHHINC_2018 | Estimate of median household income 2018 |
| Median_Household_Income_2019 | Estimate of Median household Income, 2019 |
| Metro_2013 | Metro nonmetro dummy 0=Nonmetro 1=Metro (Based on 2013 OMB Metropolitan Area delineation) |
| N_POP_CHG_2019 | Numeric Change in resident total population 7/1/2018 to 7/1/2019 |
| NATURAL_INC_2019 | Natural increase in period 7/1/2018 to 6/30/2019 |
| NET_MIG_2019 | Net migration in period 7/1/2018 to 6/30/2019 |
| PCTPOV017_2018 | Estimated percent of people age 0-17 in poverty 2018 |
| POP_ESTIMATE_2019 | 7/1/2019 resident total population estimate |
| POV017_2018 | Estimate of people age 0-17 in poverty 2018 |
| R_death_2019 | Death rate in period 7/1/2018 to 6/30/2019 |
| R_DOMESTIC_MIG_2019 | Net domestic migration rate in period 7/1/2018 to 6/30/2019 |
| R_INTERNATIONAL_MIG_2019 | Net international migration rate in period 7/1/2018 to 6/30/2019 |
| R_NATURAL_INC_2019 | Natural increase rate in period 7/1/2018 to 6/30/2019 |
| R_NET_MIG_2019 | Net migration rate in period 7/1/2018 to 6/30/2019 |
| RESIDUAL_2019 | Residual for period 7/1/2018 to 6/30/2019 |
| Rural-urban_Continuum_Code_2013 | Rural-urban Continuum Code, 2013 |
| State | State Abbreviation |
| Unemployed_2019 | Number unemployed annual average, 2019 |
| Unemployment_rate_2019 | Unemployment rate, 2019 |
| Urban_Influence_Code_2013 | Urban Influence Code, 2013 |

4. Time Use Survey

The American Time Use Survey (ATUS) tracks how people spend their time. There are corresponding international versions. There are actually lots of different surveys with interesting data available from the IPUMS website.

We have done a data pull for you. We are allowed to share a data pull (see this). We chose to collect data from all of the years (2003-2019), and selected a wide range of different attributes. The data was downloaded on September 13, 2020 from ATUS.

Note: this is pre-pandemic data. We may try to create a new pull to get more recent information, and may make a newer data set, or a data set in a different form, available. Or, we may use that in a future assignment.

The CSV file atus_data.csv has 210587 rows and 34 columns.

You may create your own data pull if you’d like to try this with different columns (but be sure to document it in your writeup). Getting a data set requires picking from all the options - there are so many options that picking a good set is pretty time consuming. It is actually an interesting exercise to see how they document their data - they are very careful in documenting everything.

You can find out what the “time use codes” mean on this page.

Interpreting the other codes requires some digging, unfortunately. Some are self-explanatory, but others… I tracked down the “FAMINCOME” columns: explanation here. The state codes are here. Some of the codes are replicated in the following table:

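Variable Descriptions

| Column name | Code | Description |
| --- | --- | --- |
| RECTYPE | | Record type |
| | 1 | Household |
| | 2 | Person |
| | 3 | Activity |
| | 4 | Who |
| | 5 | Eldercare |
| REGION | | Region |
| | 1 | Northeast |
| | 2 | Midwest |
| | 3 | South |
| | 4 | West |
| STATEFIP | | FIPS State Code |
| | 01 | Alabama |
| | 02 | Alaska |
| | 04 | Arizona |
| | 05 | Arkansas |
| | 06 | California |
| | 08 | Colorado |
| | 09 | Connecticut |
| | 10 | Delaware |
| | 11 | District of Columbia |
| | 12 | Florida |
| | 13 | Georgia |
| | 15 | Hawaii |
| | 16 | Idaho |
| | 17 | Illinois |
| | 18 | Indiana |
| | 19 | Iowa |
| | 20 | Kansas |
| | 21 | Kentucky |
| | 22 | Louisiana |
| | 23 | Maine |
| | 24 | Maryland |
| | 25 | Massachusetts |
| | 26 | Michigan |
| | 27 | Minnesota |
| | 28 | Mississippi |
| | 29 | Missouri |
| | 30 | Montana |
| | 31 | Nebraska |
| | 32 | Nevada |
| | 33 | New Hampshire |
| | 34 | New Jersey |
| | 35 | New Mexico |
| | 36 | New York |
| | 37 | North Carolina |
| | 38 | North Dakota |
| | 39 | Ohio |
| | 40 | Oklahoma |
| | 41 | Oregon |
| | 42 | Pennsylvania |
| | 44 | Rhode Island |
| | 45 | South Carolina |
| | 46 | South Dakota |
| | 47 | Tennessee |
| | 48 | Texas |
| | 49 | Utah |
| | 50 | Vermont |
| | 51 | Virginia |
| | 53 | Washington |
| | 54 | West Virginia |
| | 55 | Wisconsin |
| | 56 | Wyoming |
| METRO | | Metropolitan/central city status |
| | 01 | Metropolitan, central city |
| | 02 | Metropolitan, balance of MSA |
| | 03 | Metropolitan, not identified |
| | 04 | Nonmetropolitan |
| | 05 | Not identified |
| FAMINCOME | | Family income |
| | 001 | Less than $5,000 |
| | 002 | $5,000 to $7,499 |
| | 003 | $7,500 to $9,999 |
| | 004 | $10,000 to $12,499 |
| | 005 | $12,500 to $14,999 |
| | 006 | $15,000 to $19,999 |
| | 007 | $20,000 to $24,999 |
| | 008 | $25,000 to $29,999 |
| | 009 | $30,000 to $34,999 |
| | 010 | $35,000 to $39,999 |
| | 011 | $40,000 to $49,999 |
| | 012 | $50,000 to $59,999 |
| | 013 | $60,000 to $74,999 |
| | 014 | $75,000 to $99,999 |
| | 015 | $100,000 to $149,999 |
| | 016 | $150,000 and over |
| | 996 | Refused |
| | 997 | Don’t know |
| | 998 | Blank |
| HH_SIZE | | Number of people in household |
| | 001 | 1 |
| | 002 | 2 |
| | 003 | 3 |
| | 004 | 4 |
| | 005 | 5 |
| | 006 | 6 |
| | 007 | 7 |
| | 008 | 8 |
| | 009 | 9 |
| | 010 | 10 |
| | 011 | 11 |
| | 012 | 12 |
| | 013 | 13 |
| | 014 | 14 |
| | 015 | 15 |
| | 016 | 16 |
| | 999 | NIU (Not in universe) |
| PERNUM | | Person number (general) |
| | 01 | 1 |
| | 02 | 2 |
| | 03 | 3 |
| | 04 | 4 |
| | 05 | 5 |
| | 06 | 6 |
| | 07 | 7 |
| | 08 | 8 |
| | 09 | 9 |
| | 10 | 10 |
| | 11 | 11 |
| | 12 | 12 |
| | 13 | 13 |
| | 14 | 14 |
| | 15 | 15 |
| | 16 | 16 |
| LINENO | | Person line number |
| | 001 | 1 |
| | 002 | 2 |
| | 003 | 3 |
| | 004 | 4 |
| | 005 | 5 |
| | 006 | 6 |
| | 007 | 7 |
| | 008 | 8 |
| | 009 | 9 |
| | 010 | 10 |
| | 011 | 11 |
| | 012 | 12 |
| | 013 | 13 |
| | 014 | 14 |
| | 015 | 15 |
| | 016 | 16 |
| | 017 | 17 |
| | 018 | 18 |
| | 019 | 19 |
| | 999 | NIU (Not in universe) |
| SEX | | Sex |
| | 01 | Male |
| | 02 | Female |
| | 99 | NIU (Not in universe) |
| RACE | | Race |
| | 0100 | White only |
| | 0110 | Black only |
| | 0120 | American Indian, Alaskan Native |
| | 0130 | Asian or Pacific Islander |
| | 0131 | Asian only |
| | 0132 | Hawaiian Pacific Islander only |
| | 0200 | White-Black |
| | 0201 | White-American Indian |
| | 0202 | White-Asian |
| | 0203 | White-Hawaiian |
| | 0210 | Black-American Indian |
| | 0211 | Black-Asian |
| | 0212 | Black-Hawaiian |
| | 0220 | American Indian-Asian |
| | 0221 | American Indian-Hawaiian |
| | 0230 | Asian-Hawaiian |
| | 0300 | White-Black-American Indian |
| | 0301 | White-Black-Asian |
| | 0302 | White-Black-Hawaiian |
| | 0310 | White-American Indian-Asian |
| | 0311 | White-American Indian-Hawaiian |
| | 0320 | White-Asian-Hawaiian |
| | 0330 | Black-American Indian-Asian |
| | 0331 | Black-American Indian-Hawaiian |
| | 0340 | Black-Asian-Hawaiian |
| | 0350 | American Indian-Asian-Hawaiian |
| | 0398 | Other 3 race combinations |
| | 0399 | 2 or 3 races, unspecified |
| | 0400 | White-Black-American Indian-Asian |
| | 0401 | White-Black-American Indian-Hawaiian |
| | 0402 | White-Black-Asian-Hawaiian |
| | 0403 | Black-American Indian-Asian-Hawaiian |
| | 0404 | White-American Indian-Asian-Hawaiian |
| | 0500 | White-Black-American Indian-Asian-Hawaiian |
| | 0599 | 4 or 5 races, unspecified |
| | 9999 | NIU (Not in universe) |
| EDUC | | Highest level of school completed |
| | - | Less than HS diploma |
| | 010 | Less than 1st grade |
| | 011 | 1st, 2nd, 3rd, or 4th grade |
| | 012 | 5th or 6th grade |
| | 013 | 7th or 8th grade |
| | 014 | 9th grade |
| | 015 | 10th grade |
| | 016 | 11th grade |
| | 017 | 12th grade - no diploma |
| | - | HS diploma, no college |
| | 020 | High school graduate - GED |
| | 021 | High school graduate - diploma |
| | - | Some college |
| | 030 | Some college but no degree |
| | 031 | Associate degree - occupational vocational |
| | 032 | Associate degree - academic program |
| | - | College degree + |
| | 040 | Bachelor’s degree (BA, AB, BS, etc.) |
| | 041 | Master’s degree (MA, MS, MEng, MEd, MSW, etc.) |
| | 042 | Professional school degree (MD, DDS, DVM, etc.) |
| | 043 | Doctoral degree (PhD, EdD, etc.) |
| | 999 | NIU (Not in universe) |
| EDUCYRS | | Years of education |
| | 100 | Grades 1-12 |
| | 101 | Less than first grade |
| | 102 | First through fourth grade |
| | 105 | Fifth through sixth grade |
| | 107 | Seventh through eighth grade |
| | 109 | Ninth grade |
| | 110 | Tenth grade |
| | 111 | Eleventh grade |
| | 112 | Twelfth grade |
| | 200 | College |
| | 213 | College–one year |
| | 214 | College–two years |
| | 215 | College–three years |
| | 216 | College–four years |
| | 217 | Bachelor’s degree |
| | 300 | Advanced degree |
| | 316 | Master’s degree |
| | 317 | Master’s degree–one year program |
| | 318 | Master’s degree–two year program |
| | 319 | Master’s degree–three+ year program |
| | 320 | Professional degree |
| | 321 | Doctoral degree |
| | 999 | NIU (Not in universe) |
| EMPSTAT | | Labor force status |
| | 01 | Employed - at work |
| | 02 | Employed - absent |
| | 03 | Unemployed - on layoff |
| | 04 | Unemployed - looking |
| | 05 | Not in labor force |
| | 99 | NIU (Not in universe) |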
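
The coded columns are easier to read in a visualization once the codes are replaced by their labels. As a small sketch of one way to do that, the lookup below uses the FAMINCOME codes listed above and treats 996-998 (Refused, Don’t know, Blank) as missing. The function name `decode_famincome` is hypothetical, and we are assuming the codes may show up in the CSV either as zero-padded text ("011") or as plain numbers (11), so the sketch accepts both.

```python
# Labels copied from the FAMINCOME rows of the table above.
FAMINCOME_LABELS = {
    1: "Less than $5,000",
    2: "$5,000 to $7,499",
    3: "$7,500 to $9,999",
    4: "$10,000 to $12,499",
    5: "$12,500 to $14,999",
    6: "$15,000 to $19,999",
    7: "$20,000 to $24,999",
    8: "$25,000 to $29,999",
    9: "$30,000 to $34,999",
    10: "$35,000 to $39,999",
    11: "$40,000 to $49,999",
    12: "$50,000 to $59,999",
    13: "$60,000 to $74,999",
    14: "$75,000 to $99,999",
    15: "$100,000 to $149,999",
    16: "$150,000 and over",
}

# Codes that carry no income information: Refused, Don't know, Blank.
FAMINCOME_MISSING = {996, 997, 998}

def decode_famincome(raw):
    """Map a raw FAMINCOME code ("011" or 11) to its label, or None if missing.

    Codes not listed in the table raise KeyError so they are not silently mislabeled.
    """
    code = int(raw)
    if code in FAMINCOME_MISSING:
        return None
    return FAMINCOME_LABELS[code]

labels = [decode_famincome(value) for value in ["011", 16, "997"]]
```

The same pattern works for the other coded columns (for example, mapping 99, 999, or 9999 "NIU" codes to missing), but each column needs its own lookup.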