We will use “the census data set” for many of the design exercises in class. I put it in scare quotes because there will be different versions. This data set was specifically chosen to be “just about right” for class: neither too hard nor too easy to work with, but complex enough to be interesting. We will provide it in “cleaned” form.
The US Department of Agriculture (USDA) provides population data aggregated at the county level. They gather education data, income data, poverty data, and population data. We will also provide the data at the state level. Later in the semester, we might gather more detailed data from other sources.
The USDA provides this data as 4 separate sheets (on USDA Census Data). Any one of them could tell an interesting story, but together they provide a very rich and complex data set full of stories.
To help you get started faster (and focus on visualization, not data cleaning), the TA (Cat Nelson in 2024) has joined the data into one “convenient” large file. We may publish better versions of this data set as the semester goes on.
The county-level CSV file is:
(county_census_2023-24_raw.csv, 4.8 MB).
The state-level CSV file is:
(state_census_2023-24_raw.csv, 0.1 MB). The state-level data is provided in the original files; we assume it was computed correctly.
Beware: the file has redundant columns, missing data, and other artifacts.
We may provide newer versions of the data that are easier to work with.
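Before designing with the file, it can help to count how many cells are empty in each column. The following is a minimal sketch using only Python's standard `csv` module. The inline `SAMPLE` text and its values are made up for illustration (only the column names come from the table below), and `count_missing` is a hypothetical helper, not something shipped with the data.

```python
import csv
import io

# Made-up sample standing in for the real file; the values are illustrative only.
# To read the real file instead, open it with:
#   open("county_census_2023-24_raw.csv", newline="")
SAMPLE = """FIPS_Code,State,POP_ESTIMATE_2019,Median_Household_Income_2019,MEDHHINC_2018
01001,AL,50000,60000,59000
01003,AL,200000,,58000
02013,AK,3000,70000,
"""


def count_missing(rows, columns):
    """Return {column: number of rows whose cell is empty or absent}."""
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            if (row.get(col) or "").strip() == "":
                missing[col] += 1
    return missing


reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)  # every cell is kept as text, so FIPS leading zeros survive
missing = count_missing(rows, reader.fieldnames)

for column, n in missing.items():
    print(f"{column}: {n} missing of {len(rows)}")
```

Because `csv.DictReader` keeps every cell as a string, numeric columns need an explicit conversion before plotting, and that conversion is a natural place to decide how blanks should be handled.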
The data set was pulled from USDA Census Data on September 6, 2024. Cat joined the data based on FIPS code (see below) and removed aggregate regions (state and country level).
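To make the joining step concrete, here is a rough sketch of combining two tables on FIPS code and dropping aggregate rows. This is not Cat's actual script. The tables and values are made up, `join_on_fips` and `is_aggregate` are hypothetical helpers, and the rule that aggregate rows have a county part of `000` is an assumption about how the source sheets encode state and national totals.

```python
# Two made-up tables keyed by FIPS code (values are illustrative, not real data).
population = {
    "00000": {"POP_ESTIMATE_2019": 1000},  # assumed code for the whole country
    "01000": {"POP_ESTIMATE_2019": 600},   # assumed code for a whole state
    "01001": {"POP_ESTIMATE_2019": 400},
    "01003": {"POP_ESTIMATE_2019": 200},
}
unemployment = {
    "01000": {"Unemployment_rate_2019": 3.0},
    "01001": {"Unemployment_rate_2019": 2.5},
    # "01003" is deliberately absent to show how gaps appear after a join
}


def is_aggregate(fips):
    """Assumption: state and national totals use '000' as the county part."""
    return fips.endswith("000")


def join_on_fips(*tables):
    """Outer-join dict-of-dict tables on FIPS code, skipping aggregate rows."""
    joined = {}
    for fips in sorted(set().union(*tables)):
        if is_aggregate(fips):
            continue
        row = {}
        for table in tables:
            row.update(table.get(fips, {}))
        joined[fips] = row
    return joined


joined = join_on_fips(population, unemployment)
for fips, row in joined.items():
    print(fips, row)
```

An outer join like this keeps counties that appear in only one sheet, which is one way missing values can end up in a combined file.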
The variable descriptions can be found here, and some of them are replicated in the following table:
Variable Descriptions
Column name | Description |
---|---|
Births_2019 | Births in period 7/1/2018 to 6/30/2019 |
CENSUS_2010_POP | 4/1/2010 resident Census 2010 population |
CI90LB017_2018 | 90% confidence interval lower bound of estimate of people age 0-17 in poverty 2018 |
CI90LB017P_2018 | 90% confidence interval lower bound of estimate of percent of people age 0-17 in poverty 2018 |
CI90LBINC_2018 | 90% confidence interval lower bound of estimate of median household income 2018 |
CI90UB017_2018 | 90% confidence interval upper bound of estimate of people age 0-17 in poverty 2018 |
CI90UB017P_2018 | 90% confidence interval upper bound of estimate of percent of people age 0-17 in poverty 2018 |
CI90UBINC_2018 | 90% confidence interval upper bound of estimate of median household income 2018 |
Civilian_labor_force_2018 | Civilian labor force annual average, 2018 |
Deaths_2019 | Deaths in period 7/1/2018 to 6/30/2019 |
DOMESTIC_MIG_2019 | Net domestic migration in period 7/1/2018 to 6/30/2019 |
Economic_typology_2015 | County economic types, 2015 edition |
Employed_2019 | Number employed annual average, 2019 |
ESTIMATES_BASE_2010 | 4/1/2010 resident total population estimates base |
FIPS_Code | State-County FIPS Code |
GQ_ESTIMATES_2019 | 7/1/2019 Group Quarters total population estimate |
GQ_ESTIMATES_BASE_2010 | 4/1/2010 Group Quarters total population estimates base |
INTERNATIONAL_MIG_2019 | Net international migration in period 7/1/2018 to 6/30/2019 |
Med_HH_Income_Percent_of_State_Total_2019 | County Household Median Income as a percent of the State Total Median Household Income, 2019 |
MEDHHINC_2018 | Estimate of median household income 2018 |
Median_Household_Income_2019 | Estimate of Median household Income, 2019 |
Metro_2013 | Metro nonmetro dummy 0=Nonmetro 1=Metro (Based on 2013 OMB Metropolitan Area delineation) |
N_POP_CHG_2019 | Numeric Change in resident total population 7/1/2018 to 7/1/2019 |
NATURAL_INC_2019 | Natural increase in period 7/1/2018 to 6/30/2019 |
NET_MIG_2019 | Net migration in period 7/1/2018 to 6/30/2019 |
PCTPOV017_2018 | Estimated percent of people age 0-17 in poverty 2018 |
POP_ESTIMATE_2019 | 7/1/2019 resident total population estimate |
POV017_2018 | Estimate of people age 0-17 in poverty 2018 |
R_death_2019 | Death rate in period 7/1/2018 to 6/30/2019 |
R_DOMESTIC_MIG_2019 | Net domestic migration rate in period 7/1/2018 to 6/30/2019 |
R_INTERNATIONAL_MIG_2019 | Net international migration rate in period 7/1/2018 to 6/30/2019 |
R_NATURAL_INC_2019 | Natural increase rate in period 7/1/2018 to 6/30/2019 |
R_NET_MIG_2019 | Net migration rate in period 7/1/2018 to 6/30/2019 |
RESIDUAL_2019 | Residual for period 7/1/2018 to 6/30/2019 |
Rural-urban_Continuum_Code_2013 | Rural-urban Continuum Code, 2013 |
State | State Abbreviation |
Unemployed_2019 | Number unemployed annual average, 2019 |
Unemployment_rate_2019 | Unemployment rate, 2019 |
Urban_Influence_Code_2013 | Urban Influence Code, 2013 |
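Several of these columns are related to each other, which gives a cheap way to sanity-check a row. The sketch below checks that natural increase equals births minus deaths (which is what natural increase means) and expresses the width of the 90% confidence interval for median household income as a fraction of the estimate. The row values are made up, and `to_number`, `natural_increase_matches`, and `relative_ci_width` are hypothetical helpers for illustration.

```python
# One made-up row, with cells as text the way a CSV reader delivers them.
row = {
    "Births_2019": "120",
    "Deaths_2019": "95",
    "NATURAL_INC_2019": "25",
    "MEDHHINC_2018": "50000",
    "CI90LBINC_2018": "46000",
    "CI90UBINC_2018": "54000",
}


def to_number(text):
    """Parse a CSV cell; return None for blanks so missing data stays visible."""
    text = (text or "").replace(",", "").strip()
    return float(text) if text else None


def natural_increase_matches(row):
    """True if NATURAL_INC_2019 == Births_2019 - Deaths_2019, None if any is missing."""
    births = to_number(row.get("Births_2019"))
    deaths = to_number(row.get("Deaths_2019"))
    natural = to_number(row.get("NATURAL_INC_2019"))
    if births is None or deaths is None or natural is None:
        return None
    return births - deaths == natural


def relative_ci_width(row):
    """Width of the 90% interval for median income, as a fraction of the estimate."""
    low = to_number(row.get("CI90LBINC_2018"))
    high = to_number(row.get("CI90UBINC_2018"))
    estimate = to_number(row.get("MEDHHINC_2018"))
    if low is None or high is None or not estimate:
        return None
    return (high - low) / estimate


print(natural_increase_matches(row))
print(relative_ci_width(row))
```

A wide relative interval is a hint that the estimate for that county is uncertain, which matters when comparing small counties.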
A few things to note…
(especially for those of you new to the US)
Counties vary across the country. Some states have a few big counties, some states have lots of smaller counties.
Counties vary greatly in population and size.
Note: a FIPS (“Federal Information Processing Standards”) code is a 5-digit string (the leading zeros are important!) that the US Government uses to identify counties. Each county has a unique code.
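Tools that guess column types often turn FIPS codes into numbers and drop the leading zeros. A small sketch of repairing that, and of splitting a code into its state part (first two digits) and county part (last three digits), is shown here. The helper names `normalize_fips` and `split_fips` are hypothetical.

```python
def normalize_fips(value):
    """Return a FIPS code (int or string) as a zero-padded 5-character string."""
    return str(value).strip().zfill(5)


def split_fips(code):
    """Split a 5-character FIPS code into (state part, county part)."""
    code = normalize_fips(code)
    return code[:2], code[2:]


print(normalize_fips(1001))   # a code that lost its leading zero
print(split_fips("06037"))
```

If you load the file with pandas, passing `dtype={"FIPS_Code": str}` to `read_csv` keeps the column as text in the first place.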
Some statistical things you might know better than me…
The differences in county sizes/populations make a big difference in the “noise” (random effects), especially for uncommon events. If 1 person wins the lottery in a county with 100 people in it, that county will have an extremely high rate of lottery winners that year, and a huge change from year to year (when the rate goes from very high to very low).
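The lottery example can be illustrated with a small simulation. This is only a sketch under simple assumptions: each person independently has the same chance `p` of the event each year, and the populations (100 and 100,000), the probability, and the function names are all made up for illustration.

```python
import random


def rate_per_100k(events, population):
    """Events per 100,000 residents."""
    return events * 100_000 / population


def simulated_rates(population, p, years, seed=0):
    """Simulate yearly event rates for one county with a fixed per-person chance p."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    rates = []
    for _ in range(years):
        events = sum(rng.random() < p for _ in range(population))
        rates.append(rate_per_100k(events, population))
    return rates


# Same underlying chance (1 in 1,000) in a tiny county and a large one.
small = simulated_rates(population=100, p=0.001, years=10)
large = simulated_rates(population=100_000, p=0.001, years=10)

print("1 winner among 100 people:", rate_per_100k(1, 100), "per 100,000")
print("small county rates:", small)
print("large county rates:", [round(r) for r in large])
```

In the small county a single event moves the rate by 1,000 per 100,000, so the yearly rate can only jump between 0 and very large values, while the large county's rate stays close to the underlying 100 per 100,000.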