Exploring 'The Counted' with Folium
This post will explore Folium, an excellent Python package for visualizing geographic data. The data that I will use comes from The Counted, by the Guardian. Rather than describe this dataset myself, I'll just copy the description that the Guardian gives on their 'About' page:
The Counted is a project by the Guardian – and you – working to count the number of people killed by police and other law enforcement agencies in the United States throughout 2015 and 2016, to monitor their demographics and to tell the stories of how they died. The database will combine Guardian reporting with verified crowdsourced information to build a more comprehensive record of such fatalities. The Counted is the most thorough public accounting for deadly use of force in the US, but it will operate as an imperfect work in progress – and will be updated by Guardian reporters and interactive journalists as frequently and as promptly as possible.
My general idea here is to not only explore some of the capabilities of Folium as a tool, but also to make this post the start of an exploration of this data to see if anything useful can be learned.
The format of the dataset is such that each row corresponds to one police killing. After downloading the data I did a little preprocessing (minor formatting changes) to prepare it for loading into a Pandas DataFrame. I also took the location of each death (usually given as a street address) and mapped it to a latitude/longitude pair. In order to do this quickly from the command line I used the awesome Geocoder package from Alex Reisner (http://www.rubygeocoder.com/) and the very helpful CLI from Ole Begemann (https://gist.github.com/ole/8583982). I also used this latitude/longitude to determine the county in which each death occurred.
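If you would rather stay in Python for this step, here is a minimal sketch of the same idea using the geopy package and the free Nominatim service (to be clear, geopy is a substitution on my part; this post itself used the Ruby tools above, and the exact fields in Nominatim's reverse-lookup response aren't guaranteed):
from geopy.geocoders import Nominatim
# Nominatim asks for an identifying user_agent string
geolocator = Nominatim(user_agent='the-counted-demo')
# forward geocode: street address -> latitude/longitude
location = geolocator.geocode('1600 Pennsylvania Ave NW, Washington, DC')
lat, lon = location.latitude, location.longitude
# reverse geocode: the county often appears in the address details,
# though this key is not guaranteed to be present for every location
address = geolocator.reverse((lat, lon)).raw.get('address', {})
county = address.get('county')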
To begin with, I'll set a few things up, load the dataset and take a quick look at the first few rows:
import pandas as pd
import numpy as np
import json
import folium
import warnings
warnings.filterwarnings('ignore') # this is included to suppress warnings that might otherwise detract from the text
from IPython.display import display
folium.initialize_notebook()
%matplotlib inline
df = pd.read_csv('data_with_county.csv', sep = '\t', index_col = 0)
df.head()
Deaths by state
Clearly there is a lot of information here that we could start to dig into: location, gender, age, race/ethnicity, whether the person that died was armed, and so on. We have to start somewhere, so my first idea is to group the deaths by state; we can then look at both the absolute number of police killings per state and the number relative to the population size.
In order to get some state level information, such as the population, I took some information from the U.S. Census Bureau, Population Division, 2015 - it can be found here.
First we can use the Pandas 'groupby' functionality to count the frequency of deaths per state. We can then join this to our supplementary information (in the file 'states.csv') and calculate the number of deaths per 100K people in each state.
# count the number of deaths per state
df_states = df.groupby(['state'], as_index=False).size()
df_states = df_states.to_frame()
df_states['state'] = df_states.index
df_states.columns = ['count', 'state']
# join on the supplementary state-level data and compute deaths per 100K people
states = pd.read_csv('states.csv', sep='\t')
states = pd.merge(states, df_states, left_on='code', right_on='state')
states.drop('state', axis=1, inplace=True)
states['population'] = states['population'].str.replace(',', '')
states['population'] = states['population'].astype(float)
states['death_per_100K'] = (states['count'] / states['population']) * 100000
states.sort('death_per_100K', ascending=False, inplace=True)
states.head(10)
We sorted the above DataFrame by the number of deaths per 100K people in the state.
Now that we have these numbers, let's visualize them. Folium makes this incredibly simple, allowing us to quickly bind a Pandas DataFrame or Series to Geo/TopoJSON geometries. Here we pass in the state geographic information and our state-level DataFrame. The result is a dynamic map with our information overlaid.
state_geo = r'data/us-states.json'
#Let Folium determine the scale
state_map = folium.Map(location=[48, -102], zoom_start=3)
state_map.geo_json(geo_path=state_geo, data=states,
                   columns=['code', 'death_per_100K'],
                   key_on='feature.id',
                   fill_color='YlOrRd',
                   fill_opacity=0.7, line_opacity=0.2,
                   legend_name='Deaths per 100K')
state_map.create_map(path='us_states_deaths.html')
display(state_map)
Let's now do the same, but at the county level
In order to get this information, I took more census data (this time at the county level). The data I used can be found here.
The first thing I do here is to group by the county (and the state, as some county names appear in more than one state). We can then make some minor formatting changes and take a look at our data:
df_counties = df.groupby(['county', 'state'], as_index=False).size()
df_counties = df_counties.to_frame()
df_counties.reset_index(inplace = True)
df_counties.columns = ['county','state','death_count']
df_counties = df_counties.sort('death_count', ascending = False)
df_counties['county'] = df_counties['county'].str.replace(' County','')
print(df_counties.head(10))
These numbers don't even start to tell the entire story, of course. Still, let's take a look at what we can do with them.
Thankfully, most counties in the US have no recorded police killings, and therefore don't appear in our data. The first thing we might want to do is add all of these missing counties, recording their number of fatalities as zero; this will allow us to visualize them.
In the geographic data ('data/us-counties.json') each county has a unique ID, labelled here as a 'FIPS_Code'. To begin with, I collect all of these identifiers so that we can later add in any that are missing.
# I start with a dictionary mapping each county name to the states in which it appears
county_collection = dict()
for county, state in zip(df_counties.county, df_counties.state):
    if county in county_collection:
        county_collection[county].append(state)
    else:
        county_collection[county] = [state]
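# as an aside, the same collection step can be written more compactly with
# collections.defaultdict (an equivalent alternative to the loop above):
from collections import defaultdict
county_collection = defaultdict(list)
for county, state in zip(df_counties.county, df_counties.state):
    county_collection[county].append(state)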
# next, open up some general county information and do a little formatting
county_codes = pd.read_csv('data/us_county_data.csv', sep = ',')
county_codes['Area_name'] = county_codes['Area_name'].str.replace(' County','')
# now we need to find the FIPS code for each of the counties in our existing dataset.
# we can do this by comparing what we have in our dictionary vs the full county information that we have in the county_codes DataFrame
df_counties['FIPS_Code'] = np.nan
FIPS_codes = set()
for index, row in county_codes.iterrows():
    if row.Area_name in county_collection:
        for state in county_collection[row.Area_name]:
            if state == row.State:
                df_counties.loc[(df_counties['county'] == row.Area_name) & (df_counties['state'] == row.State), 'FIPS_Code'] = str(row.FIPS_Code)
                FIPS_codes.add(str(row.FIPS_Code))
# we can also take a look at the other information we have in our county_codes DataFrame:
county_codes.head()
We now need to add all of the counties that aren't in our main DataFrame of fatalities. To be honest, this should probably be rewritten so that we only loop through the county_codes DataFrame once; I'll sketch that alternative after the code below. The reason it is separated here is that initially I matched only on county names, not on states. This code is built on what existed before!
fips2 = set(df_counties.FIPS_Code)
for index, row in county_codes.iterrows():
    if str(row.FIPS_Code) in fips2:
        continue
    if row.Area_name == 'United States':
        continue
    df_temp = pd.DataFrame([[row.Area_name, row.State, 0, str(row.FIPS_Code)]])
    df_temp.columns = ['county', 'state', 'death_count', 'FIPS_Code']
    df_counties = df_counties.append(df_temp)
county_codes.FIPS_Code = county_codes.FIPS_Code.astype(str)
df_counties_full = pd.merge(df_counties, county_codes, left_on='FIPS_Code', right_on='FIPS_Code')
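As promised, here is a rough sketch of how those two loops could collapse into a single merge: left-join the full county list onto the observed deaths and treat counties with no match as having zero fatalities. (This is a sketch against the column names used above and the pre-padded df_counties; I haven't swapped it into the pipeline.)
# single-pass alternative: join every county onto the observed deaths,
# then fill counties with no recorded deaths with a count of zero
alt = pd.merge(county_codes,
               df_counties[['county', 'state', 'death_count']],
               left_on=['Area_name', 'State'],
               right_on=['county', 'state'],
               how='left')
alt['death_count'] = alt['death_count'].fillna(0).astype(int)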
We are now ready to visualize this data. We'll take a similar approach as before, but at county-level granularity and looking at absolute numbers rather than numbers scaled to population size - that is something we can look at in the next post, when we explore the data in further detail.
county_geo = r'data/us-counties.json'
# county-level deaths with a custom-defined scale
county_map = folium.Map(location=[40, -99], zoom_start=4)
county_map.geo_json(geo_path=county_geo, data=df_counties,
                    columns=['FIPS_Code', 'death_count'],
                    key_on='id',
                    threshold_scale=[0, 5, 7, 9, 11, 13],
                    fill_color='YlGnBu', line_opacity=0.3)
county_map.create_map(path='county_map.html')
display(county_map)
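As a small preview of the population-scaled view promised for the next post, the county-level calculation would look something like this, assuming the census file exposes a county population column (the name 'population' below is a hypothetical stand-in; check the actual column names in us_county_data.csv):
# 'population' is a hypothetical stand-in for whatever the census file
# actually calls its county population estimate column
df_counties_full['death_per_100K'] = (
    df_counties_full['death_count'] / df_counties_full['population'] * 100000
)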
Summary
This is just a quick look at both the power of the Folium package and the data that the Guardian has been recording as part of 'The Counted' project. In further posts we'll build on this start and try to find out more about police killings in the US.