This is a list of the data sets you may use for Design Challenge 1: One Dataset / Four Stories (Design Exercises 3 and 4).
You must choose one data set to create all of the visualizations in a phase. We strongly encourage you to use the same data set for phases 3 and 4.
For each data set, we have put the file in a Canvas file folder (the
DC1 Data Sets folder). We will also create a starter workbook in Tableau Online, so you can copy the workbook (rather than uploading it yourself). See this
Piazza Posting for instructions.
Note: for each data set, we’ve checked that it is in decent shape, and done some data cleaning. You may want to do more.
Also, for many of the sources, there is more data available. If you’d like to grab a different set from the same source, that is acceptable. You must choose a set that is at least comparable to what we provide. Please describe what you’ve done somewhere when you hand things in.
Also, note that this year’s list is different than prior years. (there are two that are the same)
1. Spotify
Note: this is a different (and bigger) Spotify dataset than we had in Phases 1 and 2.
The SpotifyFeatures.csv
data set is a 32MB CSV file with 230K songs. For each song there are approximately a dozen features, as well as some identifying information.
The data set originally comes from
Kaggle. There is some documentation for the features there, as well as in the documentation for the
smaller dataset. Unfortunately, the
Spotify Documentation provides little explanation of the features.
2. College Scorecard
This data comes from the
U.S. Department of Education. This is a huge data set, that has a ton of information about all the degree granting institutions in the US. Some of the files have over 2600 features for each college. Most of these features are quite detailed, and relate to financial aspects.
If you want to get the data (be warned - there’s a 380M+ compressed zip file) and get more information you can download it from
here. But, Abhay has processed the data to choose a reasonable subset. It should be more than enough to find interesting stories for this project.
Abhay’s files pick 37 features from the huge files, and put together multiple years. For each college, there is an entry for each year. You can get information about what the columns mean from
the data dictionary.
There are two versions of the data: College_2012_2_2020.csv
has years 2012-2020 in 2 year increments, while College_2000_5_2020.csv
covers 2000-2020 in 5 year increments. You can choose either one.
One warning: this dataset has a bunch of “missing data” (empty entries that may show up as zeros).
3. Census Data by County
This data set aggregates many different quantities of interest over the counties of the US. In past assignments, we’ve provided even more detailed data sets which required even more aggregation.
The USDA provides this data as 4 separate sheets (on
This page). Any one of them could tell an interesting story but together, they provide a very rich and complex data set full of stories.
We have (well, Young, the 765 TA in 2020 has) joined the 4 spreadsheets together (joining by the “FIPS Code” column) creating a single file. The four files are also combined (joined by the “FIPS Code” column) into one file, and put in
Box. The rows for the states (not counties) are also removed. The data is downloaded and processed on September 13, 2020 from
USDA/ERS.
The CSV File census_counties.csv
is about 4MB and has 3196 rows and 339 columns)
The variable descriptions can be found on
here, and some of them are replicated in the following table:
Variable Descriptions
Column name | Description |
---|
Births_2019 | Births in period 7/1/2018 to 6/30/2019 |
CENSUS_2010_POP | 4/1/2010 resident Census 2010 population |
CI90LB017_2018 | 90% confidence interval lower bound of estimate of people age 0-17 in poverty 2018 |
CI90LB017P_2018 | 90% confidence interval lower bound of estimate of percent of people age 0-17 in poverty 2018 |
CI90LBINC_2018 | 90% confidence interval lower bound of estimate of median household income 2018 |
CI90UB017_2018 | 90% confidence interval upper bound of estimate of people age 0-17 in poverty 2018 |
CI90UB017P_2018 | 90% confidence interval upper bound of estimate of percent of people age 0-17 in poverty 2018 |
CI90UBINC_2018 | 90% confidence interval upper bound of estimate of median household income 2018 |
Civilian_labor_force_2018 | Civilian labor force annual average, 2018 |
Deaths_2019 | Deaths in period 7/1/2018 to 6/30/2019 |
DOMESTIC_MIG_2019 | Net domestic migration in period 7/1/2018 to 6/30/2019 |
Economic_typology_2015 | County economic types, 2015 edition |
Employed_2019 | Number employed annual average, 2019 |
ESTIMATES_BASE_2010 | 4/1/2010 resident total population estimates base |
FIPS_Code | State-County FIPS Code |
GQ_ESTIMATES_2019 | 7/1/2019 Group Quarters total population estimate |
GQ_ESTIMATES_BASE_2010 | 4/1/2010 Group Quarters total population estimates base |
INTERNATIONAL_MIG_2019 | Net international migration in period 7/1/2018 to 6/30/2019 |
Med_HH_Income_Percent_of_State_Total_2019 | County Household Median Income as a percent of the State Total Median Household Income, 2019 |
MEDHHINC_2018 | Estimate of median household income 2018 |
Median_Household_Income_2019 | Estimate of Median household Income, 2019 |
Metro_2013 | Metro nonmetro dummy 0=Nonmetro 1=Metro (Based on 2013 OMB Metropolitan Area delineation) |
N_POP_CHG_2019 | Numeric Change in resident total population 7/1/2018 to 7/1/2019 |
NATURAL_INC_2019 | Natural increase in period 7/1/2018 to 6/30/2019 |
NET_MIG_2019 | Net migration in period 7/1/2018 to 6/30/2019 |
PCTPOV017_2018 | Estimated percent of people age 0-17 in poverty 2018 |
POP_ESTIMATE_2019 | 7/1/2019 resident total population estimate |
POV017_2018 | Estimate of people age 0-17 in poverty 2018 |
R_death_2019 | Death rate in period 7/1/2018 to 6/30/2019 |
R_DOMESTIC_MIG_2019 | Net domestic migration rate in period 7/1/2018 to 6/30/2019 |
R_INTERNATIONAL_MIG_2019 | Net international migration rate in period 7/1/2018 to 6/30/2019 |
R_NATURAL_INC_2019 | Natural increase rate in period 7/1/2018 to 6/30/2019 |
R_NET_MIG_2019 | Net migration rate in period 7/1/2018 to 6/30/2019 |
RESIDUAL_2019 | Residual for period 7/1/2018 to 6/30/2019 |
Rural-urban_Continuum_Code_2013 | Rural-urban Continuum Code, 2013 |
State | State Abbreviation |
Unemployed_2019 | Number unemployed annual average, 2019 |
Unemployment_rate_2019 | Unemployment rate, 2019 |
Urban_Influence_Code_2013 | Urban Influence Code, 2013 |
4. Time Usage Survey
The American Time Usage Survey (ATUS) tracks how people spend their time. There are corresponding international versions. There are actually lots of different surveys with interesting data available from the
IPUMS website.
We have done a data pull for you. We are allowed to share a data pull (
see this). We chose to collect data from all of the years (2003-2019), and selected a wide range of different attributes. The data was downloaded on September 13, 2020 from
ATUS.
Note: this is pre-pandemic data - we may try to create a new pull from the data set to get more updated information. I may make a newer data set, or a data set in a different form, available. Or, we may use that in a future assignment.
The CSV File atus_data.csv
has 210587 rows and 34 columns.
You may create your own data pull if you’d like to try this with different columns (but be sure to document it in your writeup). Getting a data set requires picking from all the options - there are so many options that picking a good set is pretty time consuming. It is actually an interesting exercise to see how they document their data - they are very careful in documenting everything.
You can find out what the “time use codes” mean on
this page.
Interpreting the other codes requires some digging, unfortunately. Some are self-explanatory, but others… I tracked down the “FAMINCOME” columns: explanation
here. The state codes are
here. Here are some of them are replicated in the following table:
Variable Descriptions
Column name | Description |
---|
RECTYPE | Record type |
1 | Household |
2 | Person |
3 | Activity |
4 | Who |
5 | Eldercare |
REGION | Region |
1 | Northeast |
2 | Midwest |
3 | South |
4 | West |
STATEFIP | FIPS State Code |
01 | Alabama |
02 | Alaska |
04 | Arizona |
05 | Arkansas |
06 | California |
08 | Colorado |
09 | Connecticut |
10 | Delaware |
11 | District of Columbia |
12 | Florida |
13 | Georgia |
15 | Hawaii |
16 | Idaho |
17 | Illinois |
18 | Indiana |
19 | Iowa |
20 | Kansas |
21 | Kentucky |
22 | Louisiana |
23 | Maine |
24 | Maryland |
25 | Massachusetts |
26 | Michigan |
27 | Minnesota |
28 | Mississippi |
29 | Missouri |
30 | Montana |
31 | Nebraska |
32 | Nevada |
33 | New Hampshire |
34 | New Jersey |
35 | New Mexico |
36 | New York |
37 | North Carolina |
38 | North Dakota |
39 | Ohio |
40 | Oklahoma |
41 | Oregon |
42 | Pennsylvania |
44 | Rhode Island |
45 | South Carolina |
46 | South Dakota |
47 | Tennessee |
48 | Texas |
49 | Utah |
50 | Vermont |
51 | Virginia |
53 | Washington |
54 | West Virginia |
55 | Wisconsin |
56 | Wyoming |
METRO | Metropolitan/central city status |
01 | Metropolitan, central city |
02 | Metropolitan, balance of MSA |
03 | Metropolitan, not identified |
04 | Nonmetropolitan |
05 | Not identified |
FAMINCOME | Family income |
001 | Less than $5,000 |
002 | $5,000 to $7,499 |
003 | $7,500 to $9,999 |
004 | $10,000 to $12,499 |
005 | $12,500 to $14,999 |
006 | $15,000 to $19,999 |
007 | $20,000 to $24,999 |
008 | $25,000 to $29,999 |
009 | $30,000 to $34,999 |
010 | $35,000 to $39,999 |
011 | $40,000 to $49,999 |
012 | $50,000 to $59,999 |
013 | $60,000 to $74,999 |
014 | $75,000 to $99,999 |
015 | $100,000 to $149,999 |
016 | $150,000 and over |
996 | Refused |
997 | Don’t know |
998 | Blank |
HH_SIZE | Number of people in household |
001 | 1 |
002 | 2 |
003 | 3 |
004 | 4 |
005 | 5 |
006 | 6 |
007 | 7 |
008 | 8 |
009 | 9 |
010 | 10 |
011 | 11 |
012 | 12 |
013 | 13 |
014 | 14 |
015 | 15 |
016 | 16 |
999 | NIU (Not in universe) |
PERNUM | Person number (general) |
01 | 1 |
02 | 2 |
03 | 3 |
04 | 4 |
05 | 5 |
06 | 6 |
07 | 7 |
08 | 8 |
09 | 9 |
10 | 10 |
11 | 11 |
12 | 12 |
13 | 13 |
14 | 14 |
15 | 15 |
16 | 16 |
LINENO | Person line number |
001 | 1 |
002 | 2 |
003 | 3 |
004 | 4 |
005 | 5 |
006 | 6 |
007 | 7 |
008 | 8 |
009 | 9 |
010 | 10 |
011 | 11 |
012 | 12 |
013 | 13 |
014 | 14 |
015 | 15 |
016 | 16 |
017 | 17 |
018 | 18 |
019 | 19 |
999 | NIU (Not in universe) |
SEX | Sex |
01 | Male |
02 | Female |
99 | NIU (Not in universe) |
RACE | Race |
0100 | White only |
0110 | Black only |
0120 | American Indian, Alaskan Native |
0130 | Asian or Pacific Islander |
0131 | Asian only |
0132 | Hawaiian Pacific Islander only |
0200 | White-Black |
0201 | White-American Indian |
0202 | White-Asian |
0203 | White-Hawaiian |
0210 | Black-American Indian |
0211 | Black-Asian |
0212 | Black-Hawaiian |
0220 | American Indian-Asian |
0221 | American Indian-Hawaiian |
0230 | Asian-Hawaiian |
0300 | White-Black-American Indian |
0301 | White-Black-Asian |
0302 | White-Black-Hawaiian |
0310 | White-American Indian-Asian |
0311 | White-American Indian-Hawaiian |
0320 | White-Asian-Hawaiian |
0330 | Black-American Indian-Asian |
0331 | Black-American Indian-Hawaiian |
0340 | Black-Asian-Hawaiian |
0350 | American Indian-Asian-Hawaiian |
0398 | Other 3 race combinations |
0399 | 2 or 3 races, unspecified |
0400 | White-Black-American Indian-Asian |
0401 | White-Black-American Indian-Hawaiian |
0402 | White-Black-Asian-Hawaiian |
0403 | Black-American Indian-Asian-Hawaiian |
0404 | White-American Indian-Asian-Hawaiian |
0500 | White-Black-American Indian-Asian-Hawaiian |
0599 | 4 or 5 races, unspecified |
9999 | NIU (Not in universe) |
EDUC | Highest level of school completed |
- | Less than HS diploma |
010 | Less than 1st grade |
011 | 1st, 2nd, 3rd, or 4th grade |
012 | 5th or 6th grade |
013 | 7th or 8th grade |
014 | 9th grade |
015 | 10th grade |
016 | 11th grade |
017 | 12th grade - no diploma |
- | HS diploma, no college |
020 | High school graduate - GED |
021 | High school graduate - diploma |
- | Some college |
030 | Some college but no degree |
031 | Associate degree - occupational vocational |
032 | Associate degree - academic program |
- | College degree + |
040 | Bachelor’s degree (BA, AB, BS, etc.) |
041 | Master’s degree (MA, MS, MEng, MEd, MSW, etc.) |
042 | Professional school degree (MD, DDS, DVM, etc.) |
043 | Doctoral degree (PhD, EdD, etc.) |
999 | NIU (Not in universe) |
EDUCYRS | Years of education |
100 | Grades 1-12 |
101 | Less than first grade |
102 | First through fourth grade |
105 | Fifth through sixth grade |
107 | Seventh through eighth grade |
109 | Ninth grade |
110 | Tenth grade |
111 | Eleventh grade |
112 | Twelfth grade |
200 | College |
213 | College–one year |
214 | College–two years |
215 | College–three years |
216 | College–four years |
217 | Bachelor’s degree |
300 | Advanced degree |
316 | Master’s degree |
317 | Master’s degree–one year program |
318 | Master’s degree–two year program |
319 | Master’s degree–three+ year program |
320 | Professional degree |
321 | Doctoral degree |
999 | NIU (Not in universe) |
EMPSTAT | Labor force status |
01 | Employed - at work |
02 | Employed - absent |
03 | Unemployed - on layoff |
04 | Unemployed - looking |
05 | Not in labor force |
99 | NIU (Not in universe) |