Microsoft PowerPoint - the NCRM EPrints Repository

Collaborative Online Social Media Observatory:
Crime Sensing Through Social Media
Matt Williams & Luke Sloan
Pete Burnap, Omer Rana, William Housley, Adam Edwards, Jeffrey Morgan, Vincent Knight, Rob
Procter, Alex Voss
Cardiff University, University of Warwick, University of St Andrews
tweet: @cosmos_project
• COSMOS overview & update
• Project Objectives
• Key Literature
• Research Questions
• Data & Sampling
• Sensing Crime & Disorder
• Exploring Relationships
– Time
– Space
• Modelling Strategy
• Methodological Considerations
What is COSMOS?
Aim to establish a coordinated interdisciplinary UK response to “Big
Social Data”
Collaboration between the universities of Cardiff, Warwick and St Andrews
Additional input from Edinburgh, UCL, Leeds, Manchester and
Brings together social, computer, political, health and mathematical
scientists to study the methodological, theoretical, and empirical
dimensions of Big Data in technical, social and policy contexts
Developing a research programme to address next-generation
research questions that focus upon the challenges posed by big social
data to government, digital economy and civil society
Development of new methodological tools and technical/data solutions
for UK academia
What is COSMOS?
• COSMOS has attracted 17 research grants
amounting to over £1.25M in funding from the
ESRC/EPSRC/AHRC/JISC and £500K from the
public and private sectors (DoH/FSA/HPC Wales).
• A significant proportion of these funds have been
awarded to collect and analyse social media data in
the contexts of tension, hate speech, crime, urban
safety, security and suicide
Selection of research projects
Digital Social Research Tools, Tension Indicators and Safer
Communities: A demonstration of COSMOS (ESRC DSR)
COSMOS: Supporting Empirical Social Scientific Research with a
Virtual Research Environment (JISC)
Small items of research equipment at Cardiff University (EPSRC)
Hate Speech and Social Media: Understanding Users, Networks and
Information Flows (ESRC Google)
Social Media and Prediction: Crime Sensing, Data Integration and
Statistical Modelling (ESRC NCRM)
Understanding the Role of Social Media in the Aftermath of Youth
Suicides (Department of Health)
Scaling the Computational Analysis of “Big Social Data” & Massive
Temporal Social Media Datasets (HPC Wales)
Digital Wildfire: (Mis)information flows, propagation and responsible
governance, (ESRC Global Uncertainties)
Public perceptions of the UK food system: public understanding and
engagement, and the impact of crises and scares (ESRC/FSA)
COSMOS Infrastructure
COSMOS Desktop
• Small local datasets
• Users’ API credentials
• Local analysis
• Sept ‘14 launch
• Scalable storage
• Massive datasets
• Scalable compute
• On-demand nodes
• Fast search & retrieve
• Fast analysis
• Workflow management
• Collaboration support
• 2015 launch
Project Objectives
• To evaluate the utility of crime and disorder related
tweets in predicting patterns of crime in six London
• To develop an automated machine classifier for
identifying tweets containing crime and disorder terms
• To develop statistical models that take into account
temporal and spatial variation
• To compare conventional predictive models of crime
with models containing social media derived data
Literature I
The interoperability afforded by COSMOS through spatial linkage enables
us to identify associations between online and offline phenomena
Social media is already being used as a preferential means of updating the
public about crime in the US and Europe (Johnson 2012, Crookes 2010,
Danef 2012, Philips 2011, Rawlinson 2012)
Allowing the reporting of emergencies on Twitter is being considered in the
A near ten-fold rise in crime related communication in 2012 (Warrell 2012)
Complaints originating from social media make up "at least half" of calls
passed on to front-line officers (College of Policing, 2014)
6,000 officers currently being trained to deal with social media evidence
Literature II
Social and computational researchers have already begun to ‘repurpose’
social media data in their ‘predictive’ efforts
Tumasjan et al. (2010) measured Twitter sentiment in relation to candidates
in the German general election concluding that this source of data was as
accurate at predicting voting patterns as traditional polls
Asur & Huberman (2010) correlated frequency and sentiment related to
movies on Twitter with their revenue, claiming that this method of prediction
was more accurate than the Hollywood Stock Market
Sakaki et al. (2010) found that the analysis of Twitter data produced
estimates of the centres of earthquakes more accurately than conventional
These studies illustrate how social media generates naturally occurring data
that can be used to complement and augment conventional curated and
administrative data
Literature III
• Another notable example is the association of
social media and crime, such as the riots during
August 2011 (Procter et al. 2013a)
• Malleson and Andresen (2014) use Twitter to
estimate changing population densities as an
alternative to the Census for identifying violent
crime hotspots
• Gerber (2014) looks at the relationship in the US
between reported crime and the prevalence of
multiple topics on Twitter
Research Questions
• Can crime and disorder related content on
Twitter enhance our understanding of and our
ability to predict crime patterns?
• If so, is Twitter content a better predictor of
certain major crime types than others?
• Can this form of data be used as an alternative
measure of feelings of insecurity in local
Data & Sampling
• Comparative case study of London and Cardiff
(this presentation focuses on London):
– Recorded crime (lat/long, HO crime category), split by
month Aug 2013 to Aug 2014
– Collecting 100% of geotagged UK Tweets (approx
500k per day)
– Census data including ethnic composition,
educational attainment, employment, income, health,
religiosity (ONS API)
NOTE: COSMOS archive contains all UK tweets since Sept 2011 (not all of
which are geotagged) but potential for identification of higher (mundane)
Sensing Crime & Disorder I
• We need to identify tweets in our sample that relate to
signatures of crime and disorder using key-word
detection of ordinary language
• 500K tweets a day means that it is unfeasible to do
this manually
• Develop machine classifier to identify tweets
referencing crime and disorder
• References to anxiety, environmental deterioration,
anti-social behaviour, night-time establishments etc.
• Use crowd-sourcing and human coders to develop a
lexicon and algorithm…
Sensing Crime & Disorder II
INPUT: all UK geocoded tweets
Reduce sample of UK Tweets to London & Cardiff
Take random subsample (every nth tweet) and send for crowd-sourced human coding
Human coders identify tweets that contain (and do not contain) crime/disorder terms
Use 50% of human-annotated dataset to train classifier through machine learning
Validate classifier using remaining 50% of dataset (test precision and recall)
Run classifier over whole London and Cardiff dataset
OUTPUT: All London and Cardiff tweets with crime/disorder flag
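The steps above can be sketched in a few lines of Python. This is a minimal illustration only: the slides do not name the learning algorithm, so a multinomial naive Bayes classifier is swapped in here as an assumption, and the tweets, labels and function names are all invented for the example.

```python
import math
from collections import Counter

def every_nth(items, n):
    """Systematic subsample: keep every nth item."""
    return items[::n]

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(labelled):
    """Fit word counts per class from (text, flag) pairs; flag True = crime/disorder."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, flag in labelled:
        doc_counts[flag] += 1
        word_counts[flag].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return True if the crime/disorder class has the higher smoothed log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for flag in (True, False):
        total_words = sum(word_counts[flag].values())
        score = math.log((doc_counts[flag] + 1) / (total_docs + 2))
        for word in tokenize(text):
            if word in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[flag][word] + 1)
                                  / (total_words + len(vocab)))
        scores[flag] = score
    return scores[True] > scores[False]

def precision_recall(model, labelled):
    """Precision and recall for the crime/disorder class on held-out data."""
    tp = fp = fn = 0
    for text, flag in labelled:
        predicted = classify(model, text)
        if predicted and flag:
            tp += 1
        elif predicted and not flag:
            fp += 1
        elif flag and not predicted:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented human-annotated sample (True = contains crime/disorder terms).
annotated = [
    ("my bike was stolen last night", True),
    ("lovely coffee this morning", False),
    ("car broken into on our street", True),
    ("great gig at the arena", False),
    ("someone stolen my phone on the street", True),
    ("lovely morning in the park", False),
    ("windows broken and phone stolen", True),
    ("great coffee in the park", False),
]
half = len(annotated) // 2
model = train_naive_bayes(annotated[:half])                  # train on 50%
precision, recall = precision_recall(model, annotated[half:])  # validate on 50%
```

With real data the annotated set would be shuffled before the split, and the validated classifier would then be run over the full London and Cardiff dataset to attach the crime/disorder flag.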
Exploring Relationships
• Simple correlation between tweets about
crime/disorder and occurrence of recorded
crime is too simplistic
• At what spatial and temporal level can social
media be used to inform operational decision
• At what spatial and temporal level do we try
to match tweets and crime?
• How to integrate existing curated data?
Exploring Relationships – Time I
• Certain variables are fixed (e.g. socio-economic
characteristics of areas)
• Crime and tweets are locomotive (by the second!)
• Investigate relationship between tweets and
crime/disorder at different levels of time:
Days of the week?
Time of day?
Seasonal variations in crime type?
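As a rough sketch of how these temporal levels could be derived, the snippet below counts timestamped events (tweets or recorded crimes) by hour of day, day of week and month. The ISO timestamp format and the example values are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def temporal_profile(timestamps):
    """Count events by hour of day, day of week and month."""
    by_hour, by_weekday, by_month = Counter(), Counter(), Counter()
    for stamp in timestamps:
        moment = datetime.fromisoformat(stamp)
        by_hour[moment.hour] += 1
        by_weekday[WEEKDAYS[moment.weekday()]] += 1
        by_month[moment.month] += 1
    return by_hour, by_weekday, by_month

# Invented timestamps for three events.
events = ["2013-06-01T23:15:00", "2013-06-02T23:40:00", "2013-06-03T09:05:00"]
by_hour, by_weekday, by_month = temporal_profile(events)
```

Running the same function over tweets and over recorded crimes gives two profiles that can be compared at each temporal level.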
Exploring Relationships – Time II
Simple frequency of reported crime commencement time varies depending on time of day (June 2013 data)…
Exploring Relationships – Time III
Type of crime, as proportion of all crime, differs by time of day…
Exploring Relationships – Time IV
variability across time of day for some crimes more than others (June 2013 data)…
Exploring Relationships – Time V
• Clearly time of day is important
• More tweets during daytime might mean that we
can more accurately predict daytime crime
• Likely that Twitter data is better for predicting
some crime types than others (explicit and hidden)
• How to account for ‘lag’ e.g. ‘the house down the
road was burgled last night’
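One possible way to explore the lag question is to correlate daily crime counts with tweet counts shifted by a number of days and see which shift fits best. The sketch below assumes daily aggregation and uses invented counts; it is an illustration, not the project's method.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (neither may be constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(crimes, tweets, lag):
    """Correlate crimes on day t with tweets on day t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(crimes, tweets)
    return pearson(crimes[:-lag], tweets[lag:])

# Invented daily counts: tweets echo the previous day's crimes.
crimes = [2, 5, 3, 8, 4, 6, 1]
tweets = [0, 4, 10, 6, 16, 8, 12]
best_lag = max(range(3), key=lambda lag: lagged_correlation(crimes, tweets, lag))
```

In this toy series the one-day lag gives the strongest correlation, mirroring a tweet that reports a burglary the morning after it happened.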
Exploring Relationships – Space I
• Size of London results in huge internal variance in
crime type and rates
• Crime and tweets are point data that can be
located in any geography (from OA to LA)
• Investigate relationship between tweets and
crime/disorder at different levels of space:
– City wide
– Boroughs
– Wards
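Because crime and tweets are point data, the same points can be counted within any set of boundaries. A minimal sketch using a ray-casting point-in-polygon test follows; the two square "wards" and the coordinates are invented, and real work would use proper boundary files and a GIS library.

```python
from collections import Counter

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > lat) != (yj > lat):  # edge straddles the point's latitude
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

def count_by_area(points, areas):
    """Count points falling in each named area; unmatched points are dropped."""
    counts = Counter()
    for lon, lat in points:
        for name, polygon in areas.items():
            if point_in_polygon(lon, lat, polygon):
                counts[name] += 1
                break
    return counts

# Hypothetical areas and points in arbitrary units.
areas = {
    "ward_a": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "ward_b": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = [(0.5, 0.5), (1.5, 0.2), (1.7, 0.9), (3.0, 3.0)]
counts = count_by_area(points, areas)
```

Swapping the `areas` dictionary for borough or output area boundaries re-aggregates the same points at a different spatial level.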
Exploring Relationships – Space II
Borough level geography is too high, variance largely due to density (plus commuter and
Exploring Relationships – Space III
Soho & Covent
& Hyde Park
Exploring Relationships – Space III
Dark Green = <5% ‘never worked’ or ‘long term
Red = >14% ‘never worked’ or ‘long-term
Exploring Relationships – Space IV
• Commuter and tourism patterns matter,
although more people = more crimes = more
• Reduction in social media use for those living
in deprived areas? Less likely to tweet about
crime despite being more likely to know about
• Could go down to OA, but number of tweets
and reported crimes per case/unit is cut
Modelling Strategy I
• A ward-based example:
One month of data
Treated as cross-sectional
Crime and tweets aggregated over month
Single time point allows inc. of ward characteristics
One ward = single case
• Use existing known predictors of crime to specify model,
measure success
• Add tweet data to model and see if ‘prediction’ rate is
significantly higher
• i.e. does social media data [x] enable better explanation of
variance in crime rates [y]?
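The comparison described above can be illustrated with two nested least-squares models, one with a known predictor only and one that adds tweet counts. Everything here is assumed for illustration: the ward values are invented, a single deprivation score stands in for the known predictors, and ordinary least squares with R² stands in for whichever model and success measure the project adopts.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(predictors, y):
    """Fit least squares with an intercept and return the share of variance explained."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Invented data: one row per ward, aggregated over one month.
deprivation = [1, 2, 3, 4, 5, 6]
tweet_counts = [3, 1, 4, 1, 5, 9]
crime = [5, 5, 10, 9, 15, 21]

r2_base = r_squared([[d] for d in deprivation], crime)
r2_full = r_squared([[d, t] for d, t in zip(deprivation, tweet_counts)], crime)
```

A rise in R² alone does not show that the improvement is significant; a formal comparison of the nested models (for example an F-test) would be needed for that.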
Modelling Strategy II
• Simple logit model ignores temporal order and spatial data
• Fixed effects model would account for changes over time (but
fixed factors such as ward demographics would be excluded)
• Random effects model would enable inclusion of non-time
variant predictors (but stringent assumptions)
• Spatial point data allows us to take into account spatial
correlation (kernel density estimation?)
• Multilevel model would account for both ward and borough
level variance
Modelling Strategy III
• Could control for time and space through dummy variables
• p-values and standard errors can be poorly estimated for dummy
variables in single level models (Snijders & Bosker 2012)
• Not feasible to have a dummy variable for every hour of the day
• Suggested way forward:
– Test for spatial variance (MLM)
• Ward and borough level
– Test for temporal variance (FE/RE)
• Time of day, day of week and month
• If amount of spatial and temporal variance is significant then it must
be accounted for in a multi-level longitudinal model (Yu et al. 2010)
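One way to write down the spatial test is a two-level random-intercept model with wards nested in boroughs. The notation below is an assumed textbook form, not taken from the slides: $y_{ij}$ is the crime rate in ward $i$ of borough $j$, $x_{ij}$ a ward-level predictor such as the crime/disorder tweet count, $u_j$ the borough effect and $e_{ij}$ the ward-level residual.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij},
\qquad u_j \sim N(0, \sigma_u^2),
\qquad e_{ij} \sim N(0, \sigma_e^2)

\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
```

Here $\rho$ is the share of total variance lying between boroughs; a value close to zero would suggest that a single-level model loses little, while a substantial value supports the multilevel specification.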
Methodological Considerations
• Asynchronous relationship between
tweeting about crime/disorder and
experiencing/witnessing it?
• Commencement and finish time of a crime
are rarely the same (e.g. events)
• Difference between when something
happened and when it was reported