Thursday, April 14, 2016

Exploring 'The Counted' with Folium

This post will explore Folium - an excellent Python package for visualizing geographic data. The data that I will use is from The Counted, by the Guardian. Rather than describe this dataset myself, I'll just copy the description that the Guardian give on their 'About' page:

The Counted is a project by the Guardian – and you – working to count the number of people killed by police and other law enforcement agencies in the United States throughout 2015 and 2016, to monitor their demographics and to tell the stories of how they died. The database will combine Guardian reporting with verified crowdsourced information to build a more comprehensive record of such fatalities. The Counted is the most thorough public accounting for deadly use of force in the US, but it will operate as an imperfect work in progress – and will be updated by Guardian reporters and interactive journalists as frequently and as promptly as possible.

My general idea here is not only to explore some of the capabilities of Folium as a tool, but also to make this post the start of an exploration of the data itself, to see if anything useful can be learned.

The format of the dataset is such that each row corresponds to one police killing. After downloading the data I did a little preprocessing (minor formatting changes) to prepare it to be pulled into a Pandas DataFrame. I also took the location of each death (usually given by a street address) and mapped it to a lat-long. In order to do this quickly from the command line I used the awesome package from Alex Reisner (http://www.rubygeocoder.com/) and the very helpful CLI from Ole Begemann (https://gist.github.com/ole/8583982). I also used this lat-long to determine the county in which the death occurred.
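As an aside, if you wanted to stay entirely in Python for that geocoding step, something along the lines of the following would be one way to do it with the geopy package. This is just a sketch of the idea, not what I actually ran - the input file name, and the 'address'/'state' column names, are assumptions based on the processed data shown further down.

import time
import pandas as pd
from geopy.geocoders import Nominatim

# hypothetical input file - in practice I geocoded via the rubygeocoder CLI
df = pd.read_csv('the_counted_raw.csv')

geolocator = Nominatim(user_agent='the-counted-exploration')

def geocode_row(row):
    # build a query string from the address and state columns
    query = '{}, {}'.format(row['address'], row['state'])
    location = geolocator.geocode(query)
    time.sleep(1)  # be polite to the free Nominatim service
    if location is None:
        return pd.Series({'lat': None, 'long': None})
    return pd.Series({'lat': location.latitude, 'long': location.longitude})

df[['lat', 'long']] = df.apply(geocode_row, axis=1)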

To begin with, I'll set a few things up, load the dataset and take a quick look at the first few rows:

In [1]:
import pandas as pd
import numpy as np
import json
import folium

import warnings
warnings.filterwarnings('ignore')  # this is included to suppress warnings that might otherwise detract from the text

from IPython.display import display
folium.initialize_notebook()
%matplotlib inline

df = pd.read_csv('data_with_county.csv', sep = '\t', index_col = 0)
df.head()
Out[1]:
uid name age gender raceethnicity month day year address state classification lawenforcementagency armed lat long county
0 2 Matthew Ajibade 22 Male Black January 1 2015 1050 Carl Griffin Dr Savannah GA Death in custody Chatham County Sheriff's Office No 32.066691 -81.167881 Chatham County
1 4 Lewis Lembke 47 Male White January 2 2015 4505 SW Masters Loop Aloha OR Gunshot Washington County Sheriff's Office Firearm 45.486451 -122.891256 Washington County
2 7 Tim Elliott 53 Male Asian/Pacific Islander January 2 2015 600 E Island Lake Dr Shelton WA Gunshot Mason County Sheriff's Office Firearm 47.246534 -123.119497 Mason County
3 5 Michael Kocher Jr 19 Male White January 3 2015 2600 Kaumualii Hwy Kaumakani HI Struck by vehicle Kauai Police Department No 21.933291 -159.641888 Kauai County
4 6 John Quintero 23 Male Hispanic/Latino January 3 2015 500 North Oliver Ave Wichita KS Gunshot Wichita Police Department No 37.693797 -97.280530 Sedgwick County

Deaths by state

Clearly there is a lot of information here that we could start to dig into - location, gender, age, race/ethnicity, whether the person that died was armed, and so on. We have to start somewhere, so my first idea is to group the deaths per state - we can then take a look at both the absolute number of police killings per state and the number relative to the population size.

In order to get some state-level information, such as the population, I took data from the U.S. Census Bureau, Population Division, 2015 - it can be found here.

First we can use the pandas 'groupby' functionality to count the frequency of deaths per state. We can then join this to our supplementary information (in the file 'states.csv') and calculate the number of deaths for every 100K people in each state.

In [2]:
# count the number of deaths recorded in each state
df_states = df.groupby(['state'], as_index=False).size()
df_states = df_states.to_frame()

df_states['state'] = df_states.index
df_states.columns = ['count','state']

# load the supplementary state-level census information
states = pd.read_csv('states.csv', sep='\t')

# join the death counts onto the census data by state code
states = pd.merge(states, df_states, left_on='code', right_on='state')
states.drop('state', axis=1, inplace=True)

# the raw population figures contain commas, so strip them before converting
states['population'] = states['population'].str.replace(',','')
states['population'] = states['population'].astype(float)

# deaths per 100K people in each state
states['death_per_100K'] = ( states['count'] / (states['population'] * 1.0) ) * 100000
states.sort('death_per_100K', ascending = False, inplace = True)

states.head(10)
Out[2]:
statename population code count death_per_100K
35 New Mexico 2085109 NM 29 1.390815
47 Alaska 738432 AK 8 1.083377
27 Oklahoma 3911338 OK 41 1.048235
48 DC 672228 DC 7 1.041313
50 Wyoming 586107 WY 6 1.023704
13 Arizona 6828065 AZ 53 0.776208
34 Nevada 2890845 NV 22 0.761023
37 West Virginia 1844128 WV 14 0.759166
24 Louisiana 4670724 LA 35 0.749348
21 Colorado 5456574 CO 40 0.733061

We sorted the above DataFrame by the number of deaths per 100K people in the state.

Now that we have these numbers, let's visualize them. Folium makes this incredibly simple, allowing us to quickly bind a Pandas DataFrame or Series to Geo/TopoJSON geometries. Here we can pass the state geographic information and our state-level DataFrame. The result is a dynamic map with our information overlaid.

In [3]:
state_geo = r'data/us-states.json'

#Let Folium determine the scale
state_map = folium.Map(location=[48, -102], zoom_start=3)
state_map.geo_json(geo_path=state_geo, data=states,
             columns=['code', 'death_per_100K'],
             key_on="feature.id",
             fill_color='YlOrRd',
             fill_opacity=0.7, line_opacity=0.2,
             legend_name='Deaths per 100K')
state_map.create_map(path='us_states_deaths.html')
display(state_map)

Let's now do the same, but at the county level.

In order to get this information, I took more census data (this time at the county level). The data I used can be found here.

The first thing I do here is to group by the county (and the state, as some county names appear in more than one state). We can then do some minor formatting changes and take a look at our data:

In [4]:
df_counties = df.groupby(['county', 'state'], as_index=False).size()
df_counties = df_counties.to_frame()

df_counties.reset_index(inplace = True)
df_counties.columns = ['county','state','death_count']
df_counties = df_counties.sort('death_count', ascending = False)
df_counties['county'] = df_counties['county'].str.replace(' County','')

print(df_counties.head(10))
             county state  death_count
266     Los Angeles    CA           44
281        Maricopa    AZ           20
393  San Bernardino    CA           18
178          Harris    TX           17
394       San Diego    CA           12
297      Miami-Dade    FL           12
224            Kern    CA           12
71            Clark    NV           11
97           Dallas    TX           11
437         Tarrant    TX           10

These numbers don't even start to tell the entire story, of course. Still, let's take a look at what we can do with them.

Thankfully, most counties in the US don't appear in our data, because no police killings were recorded there. The first thing we might want to do is to add all of the missing counties, recording the number of fatalities as zero - this will allow us to visualize them.

In the geographic data ('data/us-counties.json') each county has a unique ID - labelled here as a 'FIPS_Code'. To begin with I collect together all of these identifiers so we can later add in any that are missing.

In [5]:
# I start with a dictionary to collect together counties and their state

county_collection = dict()
for county, state in zip(df_counties.county, df_counties.state):
    if county in county_collection:
        county_collection[county].append(state)
    else:
        county_collection[county] = [state]

# then open up some general county information and strip the ' County' suffix so the names match our data
county_codes = pd.read_csv('data/us_county_data.csv', sep = ',')
county_codes['Area_name'] = county_codes['Area_name'].str.replace(' County','')

# now we need to find the FIPS code for each of the counties in our existing dataset.
# we can do this by comparing what we have in our dictionary vs the full county information that we have in the county_codes DataFrame

df_counties['FIPS_Code'] = np.nan
FIPS_codes = set()

for index, row in county_codes.iterrows():
    if row.Area_name in county_collection:
        for m in range(len(county_collection[row.Area_name])):
            if county_collection[row.Area_name][m] == row.State:
                df_counties.loc[(df_counties['county'] == row.Area_name) & (df_counties['state'] == row.State), 'FIPS_Code'] = str(row.FIPS_Code)
                FIPS_codes.add(str(row.FIPS_Code))

# we can also take a look at the other information we have in our county_codes DataFrame:
county_codes.head()
Out[5]:
FIPS_Code State Area_name Civilian_labor_force_2011 Employed_2011 Unemployed_2011 Unemployment_rate_2011 Median_Household_Income_2011 Med_HH_Income_Percent_of_StateTotal_2011
0 0 US United States 154505871 140674478 13831393 9 50502 100
1 1000 AL Alabama 2190519 1993977 196542 9 41427 100
2 1001 AL Autauga 25930 23854 2076 8 48863 117.9
3 1003 AL Baldwin 85407 78491 6916 8.1 50144 121
4 1005 AL Barbour 9761 8651 1110 11.4 30117 72.7

We now need to add all of the counties that aren't in our main DataFrame of fatalities. To be honest, this should probably be re-written so that we only loop through the county_codes DataFrame once - it is split into two loops here because initially I matched only on county names, not states, and this code is built on what existed before!
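For what it's worth, here is a rough sketch of how a single pass could look - a left merge of the death counts onto the full county list, filling the gaps with zeros. This is untested against the exact files used in this post.

# drop the national and state-level summary rows (their FIPS codes end in 000)
counties_only = county_codes[county_codes['FIPS_Code'] % 1000 != 0]

# left-merge the death counts onto the full county list;
# counties with no recorded killings come through as NaN, which we set to zero
merged = pd.merge(counties_only, df_counties[['county', 'state', 'death_count']],
                  left_on=['Area_name', 'State'], right_on=['county', 'state'],
                  how='left')
merged['death_count'] = merged['death_count'].fillna(0).astype(int)
merged['FIPS_Code'] = merged['FIPS_Code'].astype(str)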

In [6]:
fips2 = set(df_counties.FIPS_Code)

for index, row in county_codes.iterrows():
    if str(row.FIPS_Code) in fips2:
        continue
    if row.Area_name == 'United States':
        continue
    df_temp = pd.DataFrame([[row.Area_name, row.State, 0, str(row.FIPS_Code)]])
    df_temp.columns = ['county', 'state', 'death_count', 'FIPS_Code']
    df_counties = df_counties.append(df_temp)

county_codes.FIPS_Code = county_codes.FIPS_Code.astype(str)
df_counties_full = pd.merge(df_counties, county_codes, left_on='FIPS_Code', right_on='FIPS_Code')

We are now ready to visualize this data. We'll take a similar approach to before, but at county-level granularity, and looking at absolute numbers rather than numbers scaled to population size - the scaling is something we can look at in the next post, when we explore the data in further detail.
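For reference, once a county-level population column is joined in (the Census county estimates have one), the per-capita calculation would mirror the state-level one above. In this quick sketch, 'population' is a hypothetical column that isn't in the files loaded in this post:

# hypothetical: assumes a 'population' column has been merged into df_counties_full
df_counties_full['death_per_100K'] = (df_counties_full['death_count'] / df_counties_full['population']) * 100000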

In [7]:
county_geo = r'data/us-counties.json'

# Death counts per county, with a custom-defined scale
county_map = folium.Map(location=[40, -99], zoom_start=4)
county_map.geo_json(geo_path=county_geo, data=df_counties,
                   columns=['FIPS_Code', 'death_count'],
                   key_on='id',
                   threshold_scale=[0, 5, 7, 9, 11, 13],
                   fill_color='YlGnBu', line_opacity=0.3)
county_map.create_map(path='county_map.html')
display(county_map)
NOTE: County plot removed because Blogger wouldn't render it correctly.

Summary

This is just a quick look at both the power of the Folium package and the data that the Guardian have been recording as part of 'The Counted' project. In further posts we'll build on this start and attempt to find interesting things about police killings in the US.
