TOUGH CROWD

A DEEP DIVE INTO BUSINESS DYNAMICS

by GUO LUI + DAVE GOODSMITH, DATASCIENCE, INC
edited by JEFF CHEN, COMMERCE DATA SERVICE
SEPTEMBER 2016
As part of the Commerce Data Usability Project, DataScience, Inc., in collaboration with the Commerce Data Service, has created a tutorial that utilizes the Census Bureau's Business Dynamics API to understand business survival rates in US cities. If you have questions, feel free to reach out to the Commerce Data Service at DataUsability@doc.gov.


INTRODUCTION:

Every year, thousands of entrepreneurs launch startups, aiming to make it big. This journey and the perils of failure have been interrogated from many angles, from the risky decision to start the next iconic business to the demands of running your own startup. However, while startup survival has been written about at length, how do survival rates shake out when we look at the empirical evidence? As it turns out, the U.S. Census Bureau collects data on business dynamics that can be used for survival analysis of firms and jobs.

In this tutorial, we build a series of functions in Python to better understand business survival across the United States. Central to this tutorial are Kaplan-Meier curves (KM curves), a product limit estimator that allows us to calculate the survival of a defined cohort of businesses over time. By comparing survival rates across metropolitan statistical areas (MSAs), we find regions that fare far better in business survival than others.

GETTING STARTED:

In order to get started, we're going to first load in a series of Python packages that will allow us to build out a survival analysis:

  • Basics
    • io. Provides the Python interfaces to stream handling.
    • requests. Allows Python to send 'organic, grassfed' HTTP requests.
    • zipfile. Provides tools to create, read, write, append, and list ZIP files.
  • Data
    • plotly. A data visualization library.
    • pandas. A library that allows for easy data manipulation and processing.

Loading up packages follows the usual routine. If you get an error for plotly.offline, make sure you pip install plotly.

In [1]:
import io, requests, zipfile
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode()

Using these packages, we will build a series of convenient functions to:

  • Extract data from the Business Dynamics API
  • Process extracts into product limit estimates of survival
  • Visualize KM Curves
  • Calculate and compare five-year survival rates for MSAs with population > 250,000 people

EXTRACTING DATA VIA THE BUSINESS DYNAMICS API

First things first: we need a function to query the data we are interested in pulling from the API. We want this function to accept:

  • an arbitrary metropolitan statistical area code,
  • a dictionary of variables, and
  • a dictionary of parameters to filter the data.

The result of the function should be beautifully formatted data.

Let's call the dictionary of variables we are interested in var_dict. Each key in var_dict is a variable name accepted by the API, and each value is a more user-friendly name we give to that variable. We also want a dictionary of parameters to filter on, which can be passed to the params argument of the requests.get method.
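
For concreteness, the two dictionaries might look something like this (a minimal sketch using variable names and predicates that appear later in this tutorial):

var_dict = {'firms': 'Firms',            # number of firms
            'estabs': 'Establishments',  # number of establishments
            'emp': 'Employment'}         # number of jobs

para_dict = {'fage4': 'a',   # firm age category ('a' = age 0)
             'time': 2006}   # reference year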

The get_data function is shown below. A deeper dive into the API documentation can be found here.

In [2]:
def get_data(msa_code, var_dict, para_dict):
    r = requests.get('http://api.census.gov/data/bds/firms?get='
                     + ','.join(var_dict.keys()) ## concatenate keys using the str.join() method
                     + '&for=metropolitan+statistical+area:' + str(msa_code),
                    params=para_dict)
    
    ## when the API returns empty content, return None
    if r.content == b'':
        return None
    else:
        ## read in data
        data = pd.DataFrame(r.json())
        
        ## get columns names from first row and map them to the given names in var_dict
        columns = data.iloc[0,:-len(para_dict)].map(var_dict)
        
        ## data we are interested in 
        data = data.iloc[1:,:-len(para_dict)]
        
        ## rename dataframe columns
        data.columns = columns
        
        return data
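
As a quick sanity check, a call might look like the following once you have an MSA code in hand (a sketch; 42660 is assumed here to be the Seattle-Tacoma-Bellevue code, which you can verify against the reference table built below):

## illustrative call -- the MSA code, variables, and year are example values
seattle = get_data(42660,
                   {'firms': 'Firms', 'emp': 'Employment'},
                   {'fage4': 'a', 'time': 2006})
print(seattle if seattle is not None else 'no data returned')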

Notice that the get_data function requires an msa_code, but how can we find that MSA code? It turns out the Census Bureau publishes a reference table (https://www2.census.gov/econ/susb/data/msa_codes_2012_to_2016.txt); the code below uses the 2007-2011 edition of the same table. Again, we use the requests package to get this data and read it as a pandas dataframe. Notice that we use the io package to read it without saving it to disk (i.e., into an in-memory buffer).

After reading in the data, we keep only the metropolitan statistical areas and change each MSA name to its major city and state. Using this basic reference table, we can look up a metro area by name and obtain its MSA code.

In [3]:
## request data via https
r = requests.get('https://www2.census.gov/econ/susb/data/msa_codes_2007_to_2011.txt').content

## read and parse data
msa_code = pd.read_table(io.StringIO(r.decode('utf-8')), header=3, sep='   ') 

## rename columns
msa_code.columns = ['code','name'] 

## get rid of micropolitan statistical areas
msa_code = msa_code[msa_code['name'].str.contains('Metropolitan', case = False)]

## clean up names
msa_code['name'] = msa_code['name'].str.replace(' Metropolitan Statistical Area', '') 

## function to clean up MSA names, keeping only the first city and first state
def name_clean(s):
    s_list = s.split(', ')
    cities = s_list[0]
    states = s_list[-1]
    return ' '.join([cities.split('-')[0], states.split('-')[0]])

## map the name_clean function to all MSA
msa_code['name'] = msa_code['name'].map(name_clean) 
/usr/local/lib/python3.5/site-packages/ipykernel/__main__.py:5: ParserWarning:

Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
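
With the reference table in hand, looking up an MSA code by city name is a one-liner (a quick sketch; the exact rows returned depend on the cleaned names):

## look up the code for the Seattle metropolitan area
msa_code[msa_code['name'].str.contains('Seattle', case=False)]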

CALCULATING + VISUALIZING SURVIVAL RATES (5-year period)


Statistical Primer

Before proceeding, we need to establish a basic framework for analyzing survival. KM curves are a common approach for analyzing 'time to event' data, whether it's an engineering problem with infrastructure failure or a medical problem with mortality. The same concept applies directly to business survival.

In the absence of censoring (the condition in which the value or state of an observation is only partially known, e.g. due to loss to follow-up or attrition), the survival function is defined as:

$S(t) = \frac{n -d}{n}$

where $t$ is the "survival time" (time to failure or event), $n$ is the number of observations in the study cohort, and $d$ is the number of failure or death events. The survival rate over a study period is estimated using a product limit:

$\hat S(t) = \prod\limits_{t_i \le t} \frac{n_i - d_i}{n_i}$

where $\hat S(t)$ is the survival probability at the end of the study period $t$, $n_i$ is the number of individuals or observations at risk just prior to $t_i$, and $d_i$ is the number of failure events at $t_i$. The procedure to derive the KM curve is essentially as follows:

  • Calculate the interval survival probability $\frac{n_i - d_i}{n_i}$ for each equal time interval ending at $t_i$
  • For each time period $t_i$, take the product of all survival probabilities from $t_0$ to $t_i$
  • Plot $\hat S(t)$ against time
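
As a concrete sketch of these three steps, the product limit can be computed directly in pandas with made-up counts (the numbers below are illustrative, not Census data):

import pandas as pd

## illustrative cohort: 100 firms at formation, with the count surviving each year
counts = pd.Series([100, 80, 70, 65, 62, 60], index=range(2006, 2012), name='Firms')

## step 1: interval survival (n_i - d_i) / n_i, with the base period set to 1
interval = counts / counts.shift(1)
interval.iloc[0] = 1

## step 2: the cumulative product gives the Kaplan-Meier estimate
km = interval.cumprod() * 100

## step 3: km is what gets plotted against time -- here 100, 80, 70, 65, 62, 60 (%)
print(km)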

Implementation

With this statistical foundation, we can now build out a basic analysis. From Section 1, we have the MSA codes as well as a nifty get_data function to tap into the Census Bureau API. Now, we'll calculate and visualize survival rates across US cities. To do so, we'll write a new function, survival_rate, that accepts the name of a major US city and a base year that defines a cohort. The result of all the calculations will be an interactive plot of the five-year survival rate. To provide more context, survival will be calculated for firms (individual companies), establishments (locations at which business is conducted), and jobs.

The Census Bureau API limits the number of calls for users without a key. Because we will make many calls to the API, you should register for a key here: http://api.census.gov/data/key_signup.html, and assign its value to the variable api_key:

In [4]:
api_key = 'your-key-goes-here'
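
The key can then ride along with the other filter parameters handed to get_data; the bulk-download loop in the mapping section does exactly this (the values below are illustrative):

## the key is just another entry in the parameter dictionary
para_dict = {'fage4': 'a', 'time': 2008, 'key': api_key}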

The survival_rate function for calculating and visualizing the survival rate is thus:

In [5]:
def survival_rate(name, start_year):
    
    #fage4 is the firm age in a given year. Letters a through f represent years 0 through 5
    fage4_values = ['a', 'b', 'c', 'd', 'e', 'f']
    
    #The data only covers up to 2013 (as of now), so we will limit the fage4_values to ones within 2013 minus start_year
    if 2013 - start_year < 6:
        fage4_values = fage4_values[0:(2013-start_year + 1)]

    #var_dict contains the variables and their labels
    var_dict = {'estabs': 'Establishments',
                'emp': 'Employment',
                'firms': 'Firms'}
    
    #set up empty array to accept results from get_data
    result = []
    
    #grab the msa code based on the 'name' provided in the input parameters
    code = msa_code[msa_code['name'].str.contains(name, case = False)]['code'].values[0]
    
    #Loop from start year to the 5th year (or the cap year if end years exceed 2013)
    for i in range(len(fage4_values)):
        para_dict = {'fage4': fage4_values[i], 'time': start_year+i }
        result.append(get_data(code, var_dict, para_dict))
        
    #Not all returned columns are integer friendly (e.g. there is a phantom column of letters),
    #so keep only the first three (numeric) columns and cast them to int
    df = pd.concat(result).iloc[:, 0:3].astype(int)
    df.index = range(start_year,start_year+len(fage4_values))

    #Calculate point survival rate
    #Step 1: Create denominator dataframe
    #Shift rows up 1
    df2 = df.shift(1)
    
    #Replace the NaN first row (from the shift) with the original first row so the base-period rate is 1
    df2.iloc[[0]] = df.iloc[[0]]
    
    #Step 2: Divide numerator (df) by denominator (df2)
    df3 = df/df2

    #Step 3: Calculate cumulative product on instantaneous survival rate table
    df4 = df3.cumprod()*100

    
    ### start plotting using plotly
    data = []
    for label in df4.columns:
        trace = go.Scatter(
            x = df4.index,
            y = df4[label].values,
            name = label
        )
        data.append(trace)

    layout = dict(title = 'Business Survival Rates, '+ name +' Metropolitan Area, Year: '+str(start_year),
                  yaxis = dict(title = 'survival rate (%)'),
                  xaxis = dict(title = 'year', nticks = len(df4)),
                  )

    fig = dict(data=data, layout=layout)
    iplot(fig)

More on using plotly with python can be found here: https://plot.ly/python/

Using our new survival_rate function, we can effortlessly calculate and visualize the survival rate of any metropolitan statistical area in any year. For example, businesses formed in 2006 in the Seattle metropolitan area experienced gradually declining survival rates, with a sharper decrease from 2008 to 2009, likely due to the financial crisis.

In [6]:
survival_rate('Seattle', 2006)

Or we can look at the New York metropolitan area in 2009. Notice how employment diverges from firms and establishments, indicating that while some firms exited, the remaining firms grew and created more jobs.

Note that we have relaxed a fundamental assumption about KM Curves. In the strictest of implementations, KM Curves should not increase over time, but given that data was collected at the firm level, it is possible for employment to grow among surviving firms.

In [7]:
survival_rate('New York', 2009)

MAPPING SURVIVAL RATES

Business survival rates can go a long way toward informing business siting decisions, among other strategic considerations. But much of the value of survival rates is realized through comparisons between cities. In this next section, we extend the functions from Section 2 to incorporate geographic data.

Geographic coordinate data

To start, we can take advantage of the rich geographic resources provided by the US Census Bureau. The geographic locations (longitude/latitude) of metropolitan statistical areas can be found at https://www.census.gov/geo/maps-data/data/gazetteer2015.html, under "Core Based Statistical Areas". Using the URL for the zipfile (http://www2.census.gov/geo/docs/maps-data/data/gazetteer/2015_Gazetteer/2015_Gaz_cbsa_national.zip), we'll use requests to get the zipfile, zipfile to unzip it in memory, then pandas to read 2015_Gaz_cbsa_national.txt.

In [8]:
## draw down zip file, unzip, read in txt with specified encoding
r = requests.get("http://www2.census.gov/geo/docs/maps-data/data/gazetteer/2015_Gazetteer/2015_Gaz_cbsa_national.zip")
z = zipfile.ZipFile(io.BytesIO(r.content))
geo = pd.read_table(z.open('2015_Gaz_cbsa_national.txt'), encoding = "ISO-8859-1")

Now let's clean the data:

In [9]:
## clean the columns names and change to lower case
geo.columns = [field.rstrip().lower() for field in geo.columns]

## get rid of micropolitan statistical areas and clean the names the same as msa_code 
geo = geo[geo['name'].str.contains('Metro', case = False)]
geo['name'] = geo['name'].str.replace(' Metro Area', '')
geo['name'] = geo['name'].map(name_clean)

Population estimates

The population data can be found at http://www.census.gov/popest/data/metro/totals/2015/files/CBSA-EST2015-alldata.csv. We'll read the .csv file directly using pandas' read_csv, extract the CBSA and POPESTIMATE2015 fields, then relabel them geoid and population, respectively. This puts the data in the right shape for merging with the geography file.

In [10]:
## http://www.census.gov/popest/data/metro/totals/2015/index.html -> "Metropolitan Statistical Area; and for Puerto Rico"

## read in data with specified encoding
pop = pd.read_csv("http://www.census.gov/popest/data/metro/totals/2015/files/CBSA-EST2015-alldata.csv", 
                  encoding = "ISO-8859-1")
pop = pop[pop['LSAD']=='Metropolitan Statistical Area']
pop = pop[['CBSA','POPESTIMATE2015']]
pop.columns = ['geoid', 'population']

Merging

We now want to join the geographic location data, the population data, and the msa_code data. The geoid variable is the primary key that connects the geographic and population information with the MSA names.

In [11]:
## join to geographic location data
geo = geo.merge(pop, how='inner', left_on='geoid', right_on='geoid')

#Merge msa_code to geo
msa_code = msa_code.merge(geo[['name','intptlat','intptlong','population']],how='left', left_on='name', right_on='name')
msa_code = msa_code.dropna()

Calculating survival rates in bulk

Similar to the single-city calculation above, we can now iterate through all MSAs and calculate the survival rates of jobs and firms.

In [12]:
## specify starting year of analysis
start_year = 2008

## letters indicating firm age
fage4_values = ['a', 'b', 'c', 'd', 'e', 'f'] 

## desired variables
var_dict = {'firms': 'firms',
            'emp': 'jobs'} 

## empty dataframe as placeholder
df = pd.DataFrame(columns = ['name', 'population', 'lat', 'lon', '5-year firm survival', 
                             '5-year job survival', 'number of jobs', 'number of firms'])

print('Fetching data for...')
## iterate through every MSA with a population bigger than 250,000
for idx, row in msa_code[msa_code['population']>=2.5e5].iterrows():
    
    ## code and name of current MSA
    code = row['code']
    print('    '+row['name'])
    
    ## place holder for results
    result = []
    
    ## iterate through age 0 - 5
    for i in range(6):
        para_dict = {'fage4': fage4_values[i], 'time': start_year + i, 'key': api_key}
        result.append(get_data(code, var_dict, para_dict))

    ## check for empty results
    if any([d is None for d in result]):
        continue
        
    #Not all returned columns are integer friendly (e.g. there is a phantom column of letters),
    #so keep only the first three (numeric) columns and cast them to int
    df0 = pd.concat(result).iloc[:, 0:3].astype(int)
    df0.index = range(start_year,start_year+len(fage4_values))

    #Calculate point survival rate
    #Step 1: Create denominator dataframe
    #Shift rows up 1
    df1 = df0.shift(1)
    
    #Replace the NaN first row (from the shift) with the original first row so the base-period rate is 1
    df1.iloc[[0]] = df0.iloc[[0]]
    
    #Step 2: Divide numerator (df0) by denominator (df1)
    df2 = df0/df1
    

    #Step 3: Calculate cumulative product on instantaneous survival rate table, keep only year 5
    df3 = df2.cumprod()*100
    
    ## copy the initial number of jobs and firms
    df.loc[code, ['number of jobs', 'number of firms']] = df0.iloc[[0]][['jobs', 'firms']].values

    ## copy the 5-year survival probabilities
    df.loc[code, ['5-year firm survival', '5-year job survival']] = df3.iloc[[5]][[ 'firms','jobs']].values

    ## copy the name, population, and location of the MSA
    df.loc[code, ['name', 'population', 'lat', 'lon']] = row[['name', 'population','intptlat','intptlong']].values 
    
Fetching data for...
    Akron OH
    Albany NY
    Albuquerque NM
    Allentown PA
    Amarillo TX
    Anchorage AK
    Ann Arbor MI
    Asheville NC
    Atlanta GA
    Atlantic City NJ
    Augusta GA
    Austin TX
    Bakersfield CA
    Baltimore MD
    Baton Rouge LA
    Beaumont TX
    Birmingham AL
    Boise City ID
    Boston MA
    Boulder CO
    Bremerton WA
    Bridgeport CT
    Brownsville TX
    Buffalo NY
    Canton OH
    Cape Coral FL
    Cedar Rapids IA
    Charleston SC
    Charlotte NC
    Chattanooga TN
    Chicago IL
    Cincinnati OH
    Clarksville TN
    Cleveland OH
    Colorado Springs CO
    Columbia SC
    Columbus GA
    Columbus OH
    Corpus Christi TX
    Dallas TX
    Davenport IA
    Dayton OH
    Deltona FL
    Denver CO
    Des Moines IA
    Detroit MI
    Duluth MN
    Durham NC
    El Paso TX
    Erie PA
    Eugene OR
    Evansville IN
    Fayetteville NC
    Fayetteville AR
    Flint MI
    Fort Collins CO
    Fort Smith AR
    Fort Wayne IN
    Fresno CA
    Gainesville FL
    Grand Rapids MI
    Greeley CO
    Green Bay WI
    Greensboro NC
    Greenville SC
    Gulfport MS
    Hagerstown MD
    Harrisburg PA
    Hartford CT
    Hickory NC
    Houston TX
    Huntington WV
    Huntsville AL
    Indianapolis IN
    Jackson MS
    Jacksonville FL
    Kalamazoo MI
    Kansas City MO
    Kennewick WA
    Killeen TX
    Kingsport TN
    Knoxville TN
    Lafayette LA
    Lakeland FL
    Lancaster PA
    Lansing MI
    Laredo TX
    Las Vegas NV
    Lexington KY
    Lincoln NE
    Little Rock AR
    Los Angeles CA
    Louisville/Jefferson County KY
    Lubbock TX
    Lynchburg VA
    Madison WI
    Manchester NH
    McAllen TX
    Memphis TN
    Merced CA
    Miami FL
    Milwaukee WI
    Minneapolis MN
    Mobile AL
    Modesto CA
    Montgomery AL
    Myrtle Beach SC
    Naples FL
    Nashville TN
    New Haven CT
    New Orleans LA
    New York NY
    Norwich CT
    Ocala FL
    Ogden UT
    Oklahoma City OK
    Olympia WA
    Omaha NE
    Orlando FL
    Oxnard CA
    Palm Bay FL
    Pensacola FL
    Peoria IL
    Philadelphia PA
    Phoenix AZ
    Pittsburgh PA
    Portland ME
    Portland OR
    Port St. Lucie FL
    Providence RI
    Provo UT
    Raleigh NC
    Reading PA
    Reno NV
    Richmond VA
    Riverside CA
    Roanoke VA
    Rochester NY
    Rockford IL
    Sacramento CA
    St. Louis MO
    Salem OR
    Salinas CA
    Salisbury MD
    Salt Lake City UT
    San Antonio TX
    San Diego CA
    San Francisco CA
    San Jose CA
    San Luis Obispo CA
    Santa Cruz CA
    Santa Rosa CA
    Savannah GA
    Scranton PA
    Seattle WA
    Shreveport LA
    Sioux Falls SD
    South Bend IN
    Spartanburg SC
    Spokane WA
    Springfield MA
    Springfield MO
    Stockton CA
    Syracuse NY
    Tallahassee FL
    Tampa FL
    Toledo OH
    Trenton NJ
    Tucson AZ
    Tulsa OK
    Utica NY
    Vallejo CA
    Virginia Beach VA
    Visalia CA
    Waco TX
    Washington DC
    Wichita KS
    Wilmington NC
    Winston NC
    Worcester MA
    York PA
    Youngstown OH

After we've captured all the MSA-level survival rates in df, we can plot them on a map. But, as we know, maps that communicate clear insights are well designed, with careful consideration of interactive functionality, color schemes, labels, and scale.

We could design a color palette and its associated intervals using fixed thresholds, but the downside of this method is that we have to set the values manually and we don't have a good handle on the sample size in each bucket.

Another way is to use quantiles as thresholds. To do so, we can specify cut-offs at the following percentiles: 0.03, 0.1, 0.5, 0.9, 0.97. This is far more intuitive, as we can more easily spot the MSAs that fall in the highest or lowest 3%.
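
Before wiring this into the map function, pandas' quantile method shows what those cut-offs look like for a column of survival rates (a quick sketch, assuming df has been populated by the loop above):

## survival-rate values at the chosen percentiles
bins = [0, 0.03, 0.1, 0.5, 0.9, 0.97, 1]
print(df['5-year firm survival'].astype(float).quantile(bins))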

In [13]:
def map_plot(size_field, color_field):
    
    ## value to scale down bubble size
    scale = df[size_field].max()/4e3
    
    ## setting quantiles
    bins = [0, 0.03, 0.1, 0.5, 0.9, 0.97, 1]
    
    ## setting colors
    colors = ['#8c510a', '#d8b365', '#f6e8c3', '#c7eae5', '#5ab4ac', '#01665e']
    
    ## place holder for msa traces 
    msas = []
    
    ## text to show when the mouse hovers over a point
    df['text'] = (df['name'] 
        + '<br>population: ' + (df['population']/1e6).map(lambda x: '%2.1f' % x) + ' million'
        + '<br>' + size_field + ': ' + df[size_field].map(lambda x: '{:,}'.format(x))
        + '<br>' + color_field + ': ' + df[color_field].map(lambda x: '%2.2f' % x) + '%')
    
    ## calculate the corresponding values for each quantile
    qs = df[color_field].quantile(bins).values
    
    ## iterate through each group
    for lower, upper, color in zip(qs[:-1], qs[1:], colors):
        
        ## handling lower bound
        if color==colors[0]: 
            df_sub = df[(df[color_field]<upper)]
            
            ## format the value for display
            name = '< {0:.0f}%'.format(upper)
            
        ## handling upper bound
        elif color==colors[-1]: 
            df_sub = df[(df[color_field]>lower)]
            name = '> {0:.0f}%'.format(lower)
        ## other groups    
        else: 
            df_sub = df[(df[color_field]>lower)&(df[color_field]<=upper)]
            name = '{0:.0f}% - {1:.0f}%'.format(lower,upper)
        
        ## put data into a dictionary in plotly format
        msa = dict(
            type = 'scattergeo',
            locationmode = 'USA-states',
            lon = df_sub['lon'],
            lat = df_sub['lat'],
            text = df_sub['text'],
            marker = dict(
                size = df_sub[size_field]/scale,
                color = color,
                line = dict(width=0.5, color='rgb(40,40,40)'),
                sizemode = 'area'
            ),
            name = name )
        
        ## append current data to placeholder
        msas.append(msa)
    
    ## setting figure title and layout
    layout = dict(
        title = size_field+' created in 2008, grouped by '+color_field,
        showlegend = True,
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showland = True,
            landcolor = 'white',
            subunitwidth=0.5,
            countrywidth=0.5,
            subunitcolor="gray",
            countrycolor="gray"
        ),
    )

    fig = dict( data=msas, layout=layout )
    iplot(fig)

We can now use this function to plot our data on a map. First, let's take a look at firms. We can see some interesting patterns here. For example, Southeast MSAs, namely those in Georgia and Florida, have lower-than-average survival rates, while MSAs between the Great Lakes and the Northeast Megalopolis have high survival rates.

In [14]:
map_plot('number of firms', '5-year firm survival')

When we look at the jobs these firms created, we can recognize changes in the pattern. For example, Washington DC and San Jose stand out with high job survival rates in their regions, while Cedar Rapids and Fresno experienced among the lowest employment survival rates.

In [15]:
map_plot('number of jobs', '5-year job survival')