When it comes to buying a house, there is a lot more to consider than just the sticker price. Wages and salaries vary widely across the country, including within specific occupations, as does the share of household income local residents are accustomed to dedicating to housing costs each month. At one extreme, some places with low housing costs might appear to be very affordable, but incomes might also be much lower than elsewhere; at the other extreme, some places that appear to be extremely expensive when looking at prices might be more manageable as a result of relatively high wages and salaries.
By combining Zillow data on home values, rents, and historical housing-cost burdens with data on incomes by occupation from the United States Census Bureau's American Community Survey, we can compare housing affordability for specific types of workers in different communities across the country. In this example, we estimate the share of a household's income that goes to a monthly mortgage payment on the median home across the country's metro areas for two types of workers: firefighters and police officers.
This tutorial will guide you through the steps to load and wrangle Zillow and ACS data to compute the share of monthly income required to buy a typical home for firefighters and police officers across the country using the R programming language. To get started quickly, the code for this tutorial can be found in the following GitHub repository. In addition, the R Markdown file is provided so you can knit a version yourself.
Start by clearing your environment, setting your options, and loading the following libraries.
We load the libraries using a helper function, pkgTest, which checks whether a particular package is installed, installs it if necessary, and then loads it.
## Preliminaries
rm(list = ls())

## This function will check if a package is installed, and if not, install it
pkgTest <- function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x, dep = TRUE)
        if (!require(x, character.only = TRUE))
            stop("Package not found")
    }
}

## These lines load the required packages
packages <- c("plyr", "dplyr", "readr", "readxl", "reshape2", "stringi", "datasets",
              "ggplot2", "choroplethr", "choroplethrMaps")
invisible(lapply(packages, pkgTest))

## These lines set several options
options(scipen = 999)  # Do not print scientific notation
options(stringsAsFactors = FALSE)  # Do not load strings as factors
The first step is to load tables from the ACS Summary File FTP. There are several decisions to make: which data year to use, whether to use the 1-year or 5-year estimates, which geographic summary level to analyze, and which table (and which line numbers within it) contain the variables of interest.
Based on the above decisions, set your parameters appropriately.
## Set parameters to load ACS data
year <- 2014                                # ACS data year
summaryFileFolder <- "1_year_seq_by_state"  # 1-year estimates, sequenced by state
geoLevelFile <- "United States"             # folder for national-level geographies
geoSummaryLevel <- 310                      # summary level 310 = metro/micro area (CBSA)
sequence <- 107                             # sequence file containing table B24011
tableNumber <- "B24011"                     # median earnings by occupation
Lines <- c(21, 22)                          # table lines for firefighters (21) and police officers (22)
acsDrive <- file.path("http://www2.census.gov", "programs-surveys", "acs", "summary_file",
as.character(year), "data")
Next we load the ACS and geography tables.
## Load geographies
geos <- read_csv(file.path(acsDrive, summaryFileFolder, gsub(" ", "", geoLevelFile,
fixed = TRUE), paste0("g", as.character(year), 1, "us", ".csv")), col_names = FALSE)
## Load ACS tables
temp <- tempfile()
path <- file.path(acsDrive, summaryFileFolder, gsub(" ", "", geoLevelFile, fixed = TRUE),
paste0(as.character(year), 1, "us", stri_pad_left(sequence, 4, pad = "0"),
"000", ".zip"))
download.file(path, destfile = temp)
acs <- read_csv(unz(temp, paste0("e", as.character(year), "1", "us", stri_pad_left(sequence,
4, pad = "0"), "000", ".txt")), col_names = FALSE)
unlink(temp)
The following command, which appears several times when loading the data, removes spaces from the geoLevelFile name.
gsub(" ", "", geoLevelFile, fixed = TRUE)
Similarly, the function stri_pad_left takes the sequence number and converts it to a character string of a specified length, padding the string with leading characters (zeros in this case).
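For example, applied to the parameter values set above, these two helpers produce the following (a quick illustration, not part of the analysis pipeline):
gsub(" ", "", "United States", fixed = TRUE)  # returns "UnitedStates"
stri_pad_left(107, 4, pad = "0")              # returns "0107"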
These two functions are particularly important when loading data for sub-state geographies, as described in the Appendix.
Then we load the associated variable names.
## Download variable and geography names
temp <- tempfile()
path <- file.path(acsDrive, paste0(as.character(year), "_1yr_Summary_FileTemplates",
".zip"))
download.file(path, temp)
## Load variable and geography names
acsNames <- read_excel(unzip(temp, file.path(paste0(as.character(year), "_1yr_Summary_FileTemplates"),
paste0("Seq", as.character(sequence), ".xls"))), sheet = "E")
geoNames <- read_excel(unzip(temp, paste0(as.character(year), "_", "SFGeoFileTemplate",
".xls")), sheet = 1)
## Close download link
unlink(temp)
Next we reformat the column names and apply them to the data tables.
## Reformat variable and geography names, then apply them to the data tables
acsNames <- data.frame(varName = colnames(acsNames), varDesc = t(acsNames[1, ]))
geoNames <- data.frame(geoName = colnames(geoNames), geoDesc = t(geoNames[1, ]))
colnames(acs) <- acsNames$varName
colnames(geos) <- geoNames$geoName
Because of the table structure in the ACS FTP, we loaded a lot of ACS data that we don’t actually need for this particular analysis, so let’s keep only the variables that we’re planning on using: The geographic identifier(s) and the data variables.
## Keep only variables of interest in ACS data
acsVars <- paste(tableNumber, as.character(stri_pad_left(Lines, 3, "0")), sep = "_")
acs <- acs[, c("STUSAB", "LOGRECNO", acsVars)]
acs$STUSAB <- toupper(acs$STUSAB)
Similarly, we loaded a lot of geographies that we do not plan on using for this analysis. Let’s get rid of that data. We should also extract the CBSA Code (an integer code associated with each metro area) from the GEOID field in the geography data. You can learn more about understanding the different geographic levels embedded in GEOIDs by reading this page and selecting “GEOID Structure for Geographic Areas”.
## Keep only geographies of interest
geos <- subset(geos, SUMLEVEL == geoSummaryLevel)
geos$CBSACode <- as.integer(substr(geos$GEOID, nchar(geos$GEOID) - 4, nchar(geos$GEOID)))
geos <- geos[, c("STUSAB", "LOGRECNO", "CBSACode", "NAME")]
Finally, let’s merge the ACS data and the geographic codes.
## Merge geos and acs data
incomes <- merge(geos, acs[, c("STUSAB", "LOGRECNO", acsVars)], by = "LOGRECNO")
Now we have a dataframe, called “incomes,” that contains the median annual earnings for firefighters and police officers for each metro area in 2014. We have successfully loaded and formatted incomes from the ACS and are ready to start loading Zillow data on home values and mortgage rates.
Zillow publishes monthly median home values for a wide range of geographic levels ranging from the country and U.S. states down to the neighborhood and ZIP code. These and other data are regularly updated on the Zillow Data page. The median home value (ZHVI) for each geographic level reflects the value of the typical home in each geography, regardless of whether the home is currently for sale. It is based on Zillow ‘Zestimates’ of nearly 120 million homes nationwide (as of November 2015). You can read more about how Zillow calculates the median home value and Zestimates here, and more about Zillow’s coverage and accuracy here.
Since the types of homes that transact each month change, median sale prices tend to be highly volatile due to compositional shifts: For instance, if more condos sell in a given month, and in the next month more mansions sell, then the median sale price will appear to have increased more dramatically over the month than the value of any single home. The ZHVI controls for these compositional changes since it reflects the value of the entire housing stock. However, some researchers prefer median sale or list prices, which are also available on the Zillow Data page.
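Here is a toy illustration of that compositional effect, using made-up sale prices rather than real data:
## Hypothetical sale prices: mostly condos transact in month 1, mostly mansions in month 2
month1_sales <- c(150000, 160000, 170000, 450000)
month2_sales <- c(160000, 450000, 500000, 550000)
median(month2_sales) - median(month1_sales)  # a large jump even though no individual home changed value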
First, we load Zillow data on median home values at the metro level. We then reshape the data to allow us to merge them with the ACS data on incomes. Since the ACS data correspond to average incomes in 2014, and the Zillow data are a monthly series, we keep only median home values for months in 2014 and take the average across the 12 months of the year.
## Load Zillow metro median home values
zillow <- read_csv("http://files.zillowstatic.com/research/public/Metro/Metro_Zhvi_AllHomes.csv",
col_names = TRUE)
zillow <- melt(zillow, id = c("RegionID", "RegionName", "SizeRank"))
zillow <- subset(zillow, substr(variable, 1, 4) == "2014")
zillow <- ddply(zillow, .(RegionID, RegionName, SizeRank), summarise, MedianHomeValue = mean(value))
We then load data on mortgage rates. Zillow publishes the average interest rate quoted for a standard 30-year fixed rate mortgage to a borrower with a good credit score (a FICO score of 720 or higher) on a loan with a loan-to-value ratio of 0.8 or less. This is the most common loan product Americans use to buy homes. The data are reported in 15-minute increments throughout the business day (6 am to 5 pm Pacific Time, excluding federal holidays). Interest rates are reported already multiplied by 100 – a mortgage rate of 4.2% is reported as 4.2 rather than 0.042 in the data – so we divide by 100 to facilitate computation.
## Load Zillow data on monthly mortgage rates
mortgageRate <- read_csv("http://files.zillowstatic.com/research/public/MortgageRateConventionalFixed.csv",
col_names = TRUE)
mortgageRate <- subset(mortgageRate, substr(Date, 1, 4) == "2014")
mortgageRate <- ddply(mortgageRate, .(substr(Date, 1, 4)), summarise, avgRate2014 = round(mean(MortgageRateConventionalFixed,
na.rm = TRUE)/100, 3))
The Zillow data include metros with slightly different names than the Census data. A county-level crosswalk between the two naming conventions is available here (metro areas are composed of groups of counties).
For this example, we are only interested in the metro-to-metro crosswalk, not the county-to-metro crosswalk, so we remove county identifiers and remove duplicate rows.
## Load Zillow Metro-CBSA crosswalk
metroXW <- read_csv("http://files.zillowstatic.com/research/public/CountyCrossWalk_Zillow.csv",
col_names = TRUE)
metroXW <- metroXW[, c("MetroRegionID_Zillow", "CBSACode")]
metroXW <- metroXW[!duplicated(metroXW), ]
Finally, we merge the crosswalk onto the Zillow data on median home values.
## Merge Zillow data on median home values with crosswalk
zillow <- merge(zillow, metroXW, by.x = "RegionID", by.y = "MetroRegionID_Zillow")
Merging the Zillow and ACS data is now fairly straightforward:
## Merge Zillow and ACS data
zillow <- merge(zillow, incomes, by = "CBSACode")
We can then use a standard formula for computing a monthly loan payment based on the loan amount, the interest rate, and the loan amortization (how long before a borrower will pay off the loan). The formula we use is: \[M=P\frac{r(1 + r)^n}{(1 + r)^n - 1}\] where \(M\) is the monthly payment, \(P\) is the loan principal (here, 80 percent of the median home value, reflecting a 20 percent down payment), \(r\) is the monthly interest rate (the annual rate divided by 12), and \(n\) is the number of monthly payments (360 for a 30-year loan).
To make our code easier to read, we implement this using several parameters.
## Calculate monthly mortgage payment
nperiods <- 360 # mortgage loan term, in months
compoundingFactor <- (1 + (mortgageRate$avgRate2014/12))^nperiods  # (1 + r)^n
zillow$monthlyPymt <- (0.8 * zillow$MedianHomeValue) * ((mortgageRate$avgRate2014/12) *
    compoundingFactor)/(compoundingFactor - 1)  # 0.8 reflects a 20% down payment
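As a quick sanity check, here is the formula applied to round, illustrative numbers (a hypothetical $200,000 home and a 4.2 percent annual rate, not values taken from the data):
P <- 0.8 * 200000  # loan principal after a 20% down payment
r <- 0.042/12      # monthly interest rate
n <- 360           # number of monthly payments
P * (r * (1 + r)^n)/((1 + r)^n - 1)  # roughly 782 dollars per month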
The share of a household’s income that would go toward a monthly mortgage payment on the median home in a given metro area is then simply the ratio of the monthly mortgage payment and the monthly income (annual income divided by 12).
## Compute monthly payment relative to income
zillow$MortgagePymtToInc_Firefighters <- zillow$monthlyPymt/(zillow$B24011_021/12)
zillow$MortgagePymtToInc_Police <- zillow$monthlyPymt/(zillow$B24011_022/12)
Of course, in these calculations we made several implicit assumptions. For example, we assume that these households are buying a home with only one income. Many households have two incomes (or more in some rare instances). We also assume the household is buying the median home in the metro area. Some buyers may buy a more or less expensive home. There are many different flavors of mortgage affordability that one could compute, but the basic calculation outlined above provides a reasonable basis for comparison.
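For instance, a two-earner variant could be sketched under the purely illustrative assumption of a household with one firefighter and one police officer:
## Illustrative two-earner variant (an assumption, not part of the original analysis)
zillow$MortgagePymtToInc_TwoEarner <- zillow$monthlyPymt/((zillow$B24011_021 +
    zillow$B24011_022)/12)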
A quick examination of the results shows that Kokomo (IN), Lumberton (NC) and Manitowoc (WI) are the most affordable metro areas for firefighters. Under our assumptions, firefighters in these communities would need to spend less than 7% of their monthly income on their mortgage to buy the median home. By contrast, places like Ithaca (NY), Columbia (MO) and Beaver Dam (WI) – as well as larger places like San Jose (CA), San Francisco (CA) and Kahului (HI) – appear to be far less affordable for firefighters, with mortgage-payment-to-income ratios in excess of 1, largely due to very low incomes. This means that even if a firefighter spent their entire income on their mortgage, they could not afford to buy the median home (again, assuming a single-income household).
## View top/bottom 10 metros for mortgage affordability for firefighters
head(zillow[order(zillow$MortgagePymtToInc_Firefighters), c("RegionName", "B24011_021",
"MedianHomeValue", "MortgagePymtToInc_Firefighters")], 10)
##          RegionName B24011_021 MedianHomeValue MortgagePymtToInc_Firefighters
## 190      Kokomo, IN      61834        83325.00                     0.06250931
## 224   Manitowoc, WI      70614       106516.67                     0.06997186
## 286  Pottsville, PA      40571        67058.33                     0.07667146
## 249 Nacogdoches, TX      85319       142308.33                     0.07737155
## 264  Ogdensburg, NY      42364        72141.67                     0.07899251
## 278      Peoria, IL      62250       112258.33                     0.08365192
## 357   Texarkana, TX      51414        94591.67                     0.08534305
## 283 Plattsburgh, NY      60170       116591.67                     0.08988437
## 247    Muskogee, OK      33978        65966.67                     0.09005823
## 112      Elmira, NY      43354        87858.33                     0.09400490
head(zillow[order(-zillow$MortgagePymtToInc_Firefighters), c("RegionName", "B24011_021",
"MedianHomeValue", "MortgagePymtToInc_Firefighters")], 10)
##            RegionName B24011_021 MedianHomeValue MortgagePymtToInc_Firefighters
## 85       Coos Bay, OR          0       152108.33                            Inf
## 90        Cullman, AL          0       110908.33                            Inf
## 141   Greeneville, TN          0        80691.67                            Inf
## 240 Morehead City, NC          0       226100.00                            Inf
## 329       Sebring, FL          0        79733.33                            Inf
## 348    Statesboro, GA          0       135458.33                            Inf
## 351        Sumter, SC          0        81750.00                            Inf
## 358  The Villages, FL          0       227366.67                            Inf
## 167        Ithaca, NY       2499       187358.33                       3.477791
## 78       Columbia, MO       2499       150008.33                       2.784491
For police, Battle Creek and Saginaw, Michigan, along with several communities in upstate New York, appear to be the most affordable communities. Under our assumptions, police officers in these communities would need to spend less than 6% of their monthly income on their mortgage to buy the median home. At the other extreme, Oak Harbor (WA), Honolulu (HI), San Jose (CA) and Santa Fe (NM) appear to be the least affordable places for police officers.
## View top/bottom 10 metros for mortgage affordability for police
head(zillow[order(zillow$MortgagePymtToInc_Police), c("RegionName", "B24011_022",
"MedianHomeValue", "MortgagePymtToInc_Police")], 10)
## RegionName B24011_022 MedianHomeValue MortgagePymtToInc_Police
## 30 Battle Creek, MI 100576 87758.33 0.04047536
## 264 Ogdensburg, NY 70575 72141.67 0.04741677
## 266 Olean, NY 62715 71716.67 0.05304511
## 173 Jamestown, NY 64104 74200.00 0.05369273
## 306 Saginaw, MI 56484 67825.00 0.05570075
## 286 Pottsville, PA 51930 67058.33 0.05990059
## 86 Corning, NY 63750 85275.00 0.06204949
## 351 Sumter, SC 60831 81750.00 0.06233895
## 209 Lima, OH 63375 85833.33 0.06282532
## 246 Muskegon, MI 60585 84175.00 0.06444878
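The least affordable metros mentioned above can be listed the same way by reversing the sort, mirroring the firefighter code (output omitted here):
head(zillow[order(-zillow$MortgagePymtToInc_Police), c("RegionName", "B24011_022",
    "MedianHomeValue", "MortgagePymtToInc_Police")], 10)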
To map mortgage affordability for firefighters and police officers across the United States, we use the choroplethrMaps package, which makes it very easy to map county-level (and other geographic-level) data. Our affordability data in this example are at the metro (CBSA) level, but metros are composed of groups of counties, so we can use county-level maps to visualize the data.
As a first step, we merge our affordability data onto county identifiers.
## Load county identifiers
zillow_county <- read_csv("http://files.zillowstatic.com/research/public/CountyCrossWalk_Zillow.csv",
col_names = TRUE)
## Merge affordability data onto county identifiers
zillow_county <- merge(zillow_county, zillow[, c("CBSACode", "MortgagePymtToInc_Firefighters",
"MortgagePymtToInc_Police")], by = "CBSACode", all.x = TRUE)
ChoroplethrMaps requires county identifiers composed of a concatenated state FIPS code and county FIPS code. County FIPS codes are three digits, so the stri_pad_left function ensures that leading zeros are inserted between the state FIPS code and the county FIPS code when the county FIPS code is less than three digits.
## Create a combined state and county FIPS code
zillow_county$combinedFIPS <- as.integer(paste0(as.character(zillow_county$StateFIPS),
stri_pad_left(as.character(zillow_county$CountyFIPS), 3, "0")))
## Drop Alaska and Hawaii counties
zillow_county <- subset(zillow_county, !is.na(CBSACode) & !(StateName %in% c("Alaska",
"Hawaii")))
The data that feed choroplethrMaps maps must be formatted with two columns: a column for the geographic identifiers and a column with the data to map. We create separate data sets for each of the two occupation groups, drop counties missing affordability data, and rename the columns to meet choroplethrMaps requirements.
## Format the affordability data for mapping
zillow_county_firefighter <- zillow_county[!is.na(zillow_county$MortgagePymtToInc_Firefighters),
c("combinedFIPS", "MortgagePymtToInc_Firefighters")]
zillow_county_firefighter <- plyr::rename(zillow_county_firefighter, c(combinedFIPS = "region",
MortgagePymtToInc_Firefighters = "value"))
zillow_county_police <- zillow_county[!is.na(zillow_county$MortgagePymtToInc_Police),
c("combinedFIPS", "MortgagePymtToInc_Police")]
zillow_county_police <- plyr::rename(zillow_county_police, c(combinedFIPS = "region",
MortgagePymtToInc_Police = "value"))
As a last step before mapping the data, we create a vector with the state names of the contiguous lower 48 U.S. states as an input to the mapping functions.
## Create a vector with the names of the lower 48 states
data(county.regions)
counties <- county.regions$region
continental <- county.regions$region[!county.regions$state.name %in% c("alaska",
"hawaii")]
Finally, we are able to map the data.
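The exact plotting calls are not reproduced here, but a minimal sketch using choroplethr's county_choropleth function (the title and legend text are illustrative choices) might look like this:
## Map mortgage affordability for firefighters across the lower 48 states
county_choropleth(zillow_county_firefighter,
    title = "Mortgage affordability for firefighters",
    legend = "Mortgage payment as a share of income",
    num_colors = 5,
    county_zoom = continental)
## Map mortgage affordability for police officers across the lower 48 states
county_choropleth(zillow_county_police,
    title = "Mortgage affordability for police officers",
    legend = "Mortgage payment as a share of income",
    num_colors = 5,
    county_zoom = continental)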
Some (perhaps) surprising findings are visible in the results. For example, Chicago does very well when it comes to mortgage affordability for police officers, but not as well when it comes to mortgage affordability for firefighters. Affordability for both occupational groups is as bad in Bismarck, North Dakota, as it is in coastal California.
There is obviously a lot more one could explore in these data. However, the code described above should allow you to do it yourself!
Loading data for geographies that are fully contained within state lines requires several modifications to the code example above. Since the ACS FTP file structure uses both full state names (e.g., California) and state abbreviations (e.g., CA), we write a loop that loads each state's file. Of course, if your analysis requires only geographies in a single state, you can bypass the loop.
Here we provide an example of loading city-level median incomes for firefighters and police officers.
The first step is to load U.S. state names and abbreviations, which are conveniently available in the datasets package. However, we manually bind a state name and abbreviation for the District of Columbia, which is not included in the package data.
## Load state names and abbreviations
states <- rbind(data.frame(stateName = state.name,
stateAbrev = state.abb),
c('District of Columbia', 'DC'))
We then set our parameters as before. We set the geoLevelFile parameter to NULL as it will be replaced by state names. We also set the geoSummaryLevel to 160, which corresponds to State-Place. State-place includes cities, boroughs, towns and villages.
## Set parameters to load ACS data, at the city level
year <- 2014
summaryFileFolder <- '1_year_seq_by_state'
geoLevelFile <- NULL
geoSummaryLevel <- 160
sequence <- 107
tableNumber <- 'B24011'
Lines <- c(21, 22)
acsDrive <- file.path('http://www2.census.gov',
'programs-surveys',
'acs',
'summary_file',
as.character(year),
'data')
Next, we write a loop to load the ACS and geography tables for each state.
## Load geographies, for sub-state geographies
geos <- data.frame()
for (i in 1:nrow(states)){
geos_temp <- read_csv(file.path(acsDrive,
summaryFileFolder,
gsub(" ", "", states$stateName[i], fixed = TRUE),
paste0('g', as.character(year), 1,
tolower(states$stateAbrev[i]),
'.csv')),
col_names = FALSE)
geos <- rbind(geos, geos_temp)
rm(geos_temp)
}
rm(i)
## Load ACS tables, for sub-state geographies
acs <- data.frame()
for (i in 1:nrow(states)){
temp <- tempfile()
path <- file.path(acsDrive,
summaryFileFolder,
gsub(" ", "", states$stateName[i], fixed = TRUE),
paste0(as.character(year), 1,
tolower(states$stateAbrev[i]),
stri_pad_left(sequence, 4, pad = '0'),
'000', '.zip'))
download.file(path, destfile = temp)
acs_temp <- read_csv(unz(temp,
paste0('e', as.character(year), '1',
tolower(states$stateAbrev[i]),
stri_pad_left(sequence, 4, pad = '0'),
'000', '.txt')),
col_names = FALSE)
acs <- rbind(acs, acs_temp)
rm(acs_temp)
unlink(temp)
}
rm(i)
The rest of the code should run without any changes except for the following: when merging the geography names and the ACS data for sub-state geographies, merge the data frames on both the STUSAB (state abbreviation) and LOGRECNO (logical record number) fields rather than just the LOGRECNO field.
## Merge geos and acs data, sub-state geographies
incomes <- merge(geos, acs[, c('STUSAB', 'LOGRECNO', acsVars)], by = c('STUSAB', 'LOGRECNO'))