Thursday, April 14, 2016

Exploring 'The Counted' with Folium

This post will explore Folium - an excellent Python package for visualizing geographic data. The data that I will use is from The Counted, by the Guardian. Rather than describe this dataset myself, I'll just copy the description that the Guardian give on their 'About' page:

The Counted is a project by the Guardian – and you – working to count the number of people killed by police and other law enforcement agencies in the United States throughout 2015 and 2016, to monitor their demographics and to tell the stories of how they died. The database will combine Guardian reporting with verified crowdsourced information to build a more comprehensive record of such fatalities. The Counted is the most thorough public accounting for deadly use of force in the US, but it will operate as an imperfect work in progress – and will be updated by Guardian reporters and interactive journalists as frequently and as promptly as possible.

My general idea here is not only to explore some of the capabilities of Folium as a tool, but also to make this post the start of an exploration of the data itself, to see if anything useful can be learned.

The format of the dataset is such that each row corresponds to one police killing. After downloading the data I did a little preprocessing (minor formatting changes) to prepare it to be pulled into a Pandas DataFrame. I also took the location of each death (usually given by a street address) and mapped it to a lat-long. In order to do this quickly from the command line I used the awesome package from Alex Reisner (http://www.rubygeocoder.com/) and the very helpful CLI from Ole Begemann (https://gist.github.com/ole/8583982). I also used this lat-long to determine the county in which the death occurred.
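As an aside, if you wanted to stay entirely in Python for that geocoding step, something along the lines of the following would be one way to do it with the geopy package. This is just a sketch of the idea, not what I actually ran - the input file name, and the 'address'/'state' column names, are assumptions based on the processed data shown further down.

import time
import pandas as pd
from geopy.geocoders import Nominatim

# hypothetical input file - in practice I geocoded via the rubygeocoder CLI
df = pd.read_csv('the_counted_raw.csv')

geolocator = Nominatim(user_agent='the-counted-exploration')

def geocode_row(row):
    # build a query string from the address and state columns
    query = '{}, {}'.format(row['address'], row['state'])
    location = geolocator.geocode(query)
    time.sleep(1)  # be polite to the free Nominatim service
    if location is None:
        return pd.Series({'lat': None, 'long': None})
    return pd.Series({'lat': location.latitude, 'long': location.longitude})

df[['lat', 'long']] = df.apply(geocode_row, axis=1)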

To begin with, I'll set a few things up, load the dataset and take a quick look at the first few rows:

In [1]:
import pandas as pd
import numpy as np
import json
import folium

import warnings
warnings.filterwarnings('ignore')  # this is included to suppress warnings that might otherwise detract from the text

from IPython.display import display
folium.initialize_notebook()
%matplotlib inline

df = pd.read_csv('data_with_county.csv', sep = '\t', index_col = 0)
df.head()
Out[1]:
uid name age gender raceethnicity month day year address state classification lawenforcementagency armed lat long county
0 2 Matthew Ajibade 22 Male Black January 1 2015 1050 Carl Griffin Dr Savannah GA Death in custody Chatham County Sheriff's Office No 32.066691 -81.167881 Chatham County
1 4 Lewis Lembke 47 Male White January 2 2015 4505 SW Masters Loop Aloha OR Gunshot Washington County Sheriff's Office Firearm 45.486451 -122.891256 Washington County
2 7 Tim Elliott 53 Male Asian/Pacific Islander January 2 2015 600 E Island Lake Dr Shelton WA Gunshot Mason County Sheriff's Office Firearm 47.246534 -123.119497 Mason County
3 5 Michael Kocher Jr 19 Male White January 3 2015 2600 Kaumualii Hwy Kaumakani HI Struck by vehicle Kauai Police Department No 21.933291 -159.641888 Kauai County
4 6 John Quintero 23 Male Hispanic/Latino January 3 2015 500 North Oliver Ave Wichita KS Gunshot Wichita Police Department No 37.693797 -97.280530 Sedgwick County

Deaths by state

Clearly there is a lot of information here that we could start to dig into - location, gender, age, race/ethnicity, whether the person that died was armed, and so on. We have to start somewhere, so my first idea is to group the deaths per state - we can then take a look at both the absolute number of police killings per state and the number relative to the population size.

In order to get some state-level information, such as the population, I took data from the U.S. Census Bureau, Population Division, 2015 - it can be found here.

First we can use the pandas 'groupby' functionality to count the frequency of deaths per state. We can then join this to our supplementary information (in the file 'states.csv') and calculate the number of deaths for every 100K people in each state.

In [2]:
# count the number of deaths recorded in each state
df_states = df.groupby(['state'], as_index=False).size()
df_states = df_states.to_frame()

df_states['state'] = df_states.index
df_states.columns = ['count','state']

# load the supplementary state-level census information
states = pd.read_csv('states.csv', sep='\t')

# join the death counts onto the census data by state code
states = pd.merge(states, df_states, left_on='code', right_on='state')
states.drop('state', axis=1, inplace=True)

# the raw population figures contain commas, so strip them before converting
states['population'] = states['population'].str.replace(',','')
states['population'] = states['population'].astype(float)

# deaths per 100K people in each state
states['death_per_100K'] = ( states['count'] / (states['population'] * 1.0) ) * 100000
states.sort('death_per_100K', ascending = False, inplace = True)

states.head(10)
Out[2]:
statename population code count death_per_100K
35 New Mexico 2085109 NM 29 1.390815
47 Alaska 738432 AK 8 1.083377
27 Oklahoma 3911338 OK 41 1.048235
48 DC 672228 DC 7 1.041313
50 Wyoming 586107 WY 6 1.023704
13 Arizona 6828065 AZ 53 0.776208
34 Nevada 2890845 NV 22 0.761023
37 West Virginia 1844128 WV 14 0.759166
24 Louisiana 4670724 LA 35 0.749348
21 Colorado 5456574 CO 40 0.733061

We sorted the above DataFrame by the number of deaths per 100K people in the state.

Now that we have these numbers, let's visualize them. Folium makes this incredibly simple, allowing us to quickly bind a Pandas DataFrame or Series to Geo/TopoJSON geometries. Here we can pass the state geographic information and our state-level DataFrame. The result is a dynamic map with our information overlaid.

In [3]:
state_geo = r'data/us-states.json'

#Let Folium determine the scale
state_map = folium.Map(location=[48, -102], zoom_start=3)
state_map.geo_json(geo_path=state_geo, data=states,
             columns=['code', 'death_per_100K'],
             key_on="feature.id",
             fill_color='YlOrRd',
             fill_opacity=0.7, line_opacity=0.2,
             legend_name='Deaths per 100K')
state_map.create_map(path='us_states_deaths.html')
display(state_map)

Let's now do the same, but at the county level.

In order to get this information, I took more census data (this time at the county level). The data I used can be found here.

The first thing I do here is to group by the county (and the state, as some county names appear in more than one state). We can then do some minor formatting changes and take a look at our data:

In [4]:
df_counties = df.groupby(['county', 'state'], as_index=False).size()
df_counties = df_counties.to_frame()

df_counties.reset_index(inplace = True)
df_counties.columns = ['county','state','death_count']
df_counties = df_counties.sort('death_count', ascending = False)
df_counties['county'] = df_counties['county'].str.replace(' County','')

print(df_counties.head(10))
             county state  death_count
266     Los Angeles    CA           44
281        Maricopa    AZ           20
393  San Bernardino    CA           18
178          Harris    TX           17
394       San Diego    CA           12
297      Miami-Dade    FL           12
224            Kern    CA           12
71            Clark    NV           11
97           Dallas    TX           11
437         Tarrant    TX           10

These numbers don't even start to tell the entire story, of course. Still, let's take a look at what we can do with them.

Thankfully, most counties in the US don't appear in our data, because no police killings were recorded there. The first thing we might want to do is to add all of the missing counties, recording the number of fatalities as zero - this will allow us to visualize them.

In the geographic data ('data/us-counties.json') each county has a unique ID - labelled here as a 'FIPS_Code'. To begin with I collect together all of these identifiers so we can later add in any that are missing.

In [5]:
# I start with a dictionary to collect together counties and their state

county_collection = dict()
for county, state in zip(df_counties.county, df_counties.state):
    if county in county_collection:
        county_collection[county].append(state)
    else:
        county_collection[county] = [state]

# then open up some general county information and strip the ' County' suffix so the names match our data
county_codes = pd.read_csv('data/us_county_data.csv', sep = ',')
county_codes['Area_name'] = county_codes['Area_name'].str.replace(' County','')

# now we need to find the FIPS code for each of the counties in our existing dataset.
# we can do this by comparing what we have in our dictionary vs the full county information that we have in the county_codes DataFrame

df_counties['FIPS_Code'] = np.nan
FIPS_codes = set()

for index, row in county_codes.iterrows():
    if row.Area_name in county_collection:
        for m in range(len(county_collection[row.Area_name])):
            if county_collection[row.Area_name][m] == row.State:
                df_counties.loc[(df_counties['county'] == row.Area_name) & (df_counties['state'] == row.State), 'FIPS_Code'] = str(row.FIPS_Code)
                FIPS_codes.add(str(row.FIPS_Code))

# we can also take a look at the other information we have in our county_codes DataFrame:
county_codes.head()
Out[5]:
FIPS_Code State Area_name Civilian_labor_force_2011 Employed_2011 Unemployed_2011 Unemployment_rate_2011 Median_Household_Income_2011 Med_HH_Income_Percent_of_StateTotal_2011
0 0 US United States 154505871 140674478 13831393 9 50502 100
1 1000 AL Alabama 2190519 1993977 196542 9 41427 100
2 1001 AL Autauga 25930 23854 2076 8 48863 117.9
3 1003 AL Baldwin 85407 78491 6916 8.1 50144 121
4 1005 AL Barbour 9761 8651 1110 11.4 30117 72.7

We now need to add all of the counties that aren't in our main DataFrame of fatalities. To be honest, this should probably be re-written so that we only loop through the county_codes DataFrame once - it is split into two loops here because initially I matched only on county names, not states, and this code is built on what existed before!
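For what it's worth, here is a rough sketch of how a single pass could look - a left merge of the death counts onto the full county list, filling the gaps with zeros. This is untested against the exact files used in this post.

# drop the national and state-level summary rows (their FIPS codes end in 000)
counties_only = county_codes[county_codes['FIPS_Code'] % 1000 != 0]

# left-merge the death counts onto the full county list;
# counties with no recorded killings come through as NaN, which we set to zero
merged = pd.merge(counties_only, df_counties[['county', 'state', 'death_count']],
                  left_on=['Area_name', 'State'], right_on=['county', 'state'],
                  how='left')
merged['death_count'] = merged['death_count'].fillna(0).astype(int)
merged['FIPS_Code'] = merged['FIPS_Code'].astype(str)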

In [6]:
fips2 = set(df_counties.FIPS_Code)

for index, row in county_codes.iterrows():
    if str(row.FIPS_Code) in fips2:
        continue
    if row.Area_name == 'United States':
        continue
    df_temp = pd.DataFrame([[row.Area_name, row.State, 0, str(row.FIPS_Code)]])
    df_temp.columns = ['county', 'state', 'death_count', 'FIPS_Code']
    df_counties = df_counties.append(df_temp)

county_codes.FIPS_Code = county_codes.FIPS_Code.astype(str)
df_counties_full = pd.merge(df_counties, county_codes, left_on='FIPS_Code', right_on='FIPS_Code')

We are now ready to visualize this data. We'll take a similar approach to before, but at county-level granularity, and looking at absolute numbers rather than numbers scaled to population size - the scaling is something we can look at in the next post, when we explore the data in further detail.
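For reference, once a county-level population column is joined in (the Census county estimates have one), the per-capita calculation would mirror the state-level one above. In this quick sketch, 'population' is a hypothetical column that isn't in the files loaded in this post:

# hypothetical: assumes a 'population' column has been merged into df_counties_full
df_counties_full['death_per_100K'] = (df_counties_full['death_count'] / df_counties_full['population']) * 100000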

In [7]:
county_geo = r'data/us-counties.json'

# Death counts per county, with a custom-defined scale
county_map = folium.Map(location=[40, -99], zoom_start=4)
county_map.geo_json(geo_path=county_geo, data=df_counties,
                   columns=['FIPS_Code', 'death_count'],
                   key_on='id',
                   threshold_scale=[0, 5, 7, 9, 11, 13],
                   fill_color='YlGnBu', line_opacity=0.3)
county_map.create_map(path='county_map.html')
display(county_map)
NOTE: County plot removed because Blogger wouldn't render it correctly.

Summary

This is just a quick look at both the power of the Folium package and the data that the Guardian have been recording as part of 'The Counted' project. In further posts we'll build on this start and attempt to find interesting things about police killings in the US.
