Crime Data for Download

This Web page provides downloads of crime space and time series data for several crimes in Pittsburgh, Pennsylvania and Rochester, New York. Also available are corresponding contiguity matrices (for use in calculating spatial statistics) and GIS map layers. These are the data sets used in the research reported in Cohen & Gorr (2005), Final Report: Development of Crime Forecasting and Mapping Systems for Use by Police, National Institute of Justice Grant 2001-IJ-CX-0018. By agreement with the Pittsburgh Bureau of Police, the Rochester Police Department, and the Carnegie Mellon University Internal Review Board, we are not allowed to release the individual report point data. Only the aggregate crime space and time series data can be made available.

Click the following links to download documentation:

The data are monthly time series, unadjusted for days per month, for four geographies: 1990 census tracts, police car beats, "beats plus" (aggregations of police car beats that are smaller than precincts), and police precincts (also aggregations of car beats). We refer to an individual tract, car beat, etc. as a district. There are eight multivariate time series tables, stored as comma-separated value (.csv) files, one for each geography and city. Click links below to download corresponding data sets:











where Pgh = Pittsburgh and Roch = Rochester.

Each row is a monthly observation for one district, and the crime columns hold counts for that district and month. Variable names are in the first row of each table and include district, year, month, and several crime types. Crimes are counts of police offense reports, except for C_Drugs and C_Shots in the Pittsburgh data sets, which are counts of 911 calls for service with duplicate calls for the same incident removed.
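Because the series are not adjusted for days per month, users may want to rescale counts to a standard month before comparing months of different lengths. The following Python fragment is a minimal sketch of reading a table in the layout described above and making that adjustment. The sample values, the district code, the 30-day standard month, and the function names are assumptions for illustration only; C_Drugs and C_Shots are the only column names taken from the data description.

```python
import calendar
import csv
import io

# Hypothetical sample in the layout described above: one row per district
# and month. Values and the district code are made up for illustration.
SAMPLE_CSV = """district,year,month,C_Drugs,C_Shots
101,1995,1,31,62
101,1995,2,28,14
101,1996,2,29,29
"""


def days_adjusted(count, year, month, standard_days=30):
    """Rescale a monthly count to a standard-length month (assumed 30 days)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return count * standard_days / days_in_month


def load_adjusted(csv_text, crime_columns):
    """Parse the table and add a days-adjusted value for each crime column."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        year, month = int(record["year"]), int(record["month"])
        row = {"district": record["district"], "year": year, "month": month}
        for name in crime_columns:
            raw = int(record[name])
            row[name] = raw
            row[name + "_adj"] = days_adjusted(raw, year, month)
        rows.append(row)
    return rows


rows = load_adjusted(SAMPLE_CSV, ["C_Drugs", "C_Shots"])
```

To use a downloaded file instead of the sample text, read the file contents into a string and pass the crime column names found in its first row.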

Included are zipped GIS folders for Pittsburgh

Primer on Police Data

  • Computer Aided Dispatch (CAD) or 911 Call-for-Service Data - These data come primarily from citizen complaints about crimes or disturbances; however, some CAD calls are officer initiated, for example, when an officer sees a crime being committed while on patrol. Generally, all police events start with a CAD data record and ID number. A problem with this kind of data is that CAD calls are, for the most part, perceptions from untrained observers, and citizens sometimes distort calls to get faster service (for instance, by reporting a more severe type of incident than actually occurred). So, police often view individual CAD data points as unreliable measures. Nevertheless, CAD data are more representative than offense or arrest records of the volume and extent of crimes that do not have victims, such as drug dealing, prostitution, and gambling. CAD data also provide some excellent leading indicator variables, such as shots fired reports and various loitering and public disturbance calls.


  • Offense Report Data - When an officer believes that a crime has been committed, he/she should write an offense report giving the crime type (or all crime types if multiple offenses were committed), the address, date and time (or date and time interval), and many other variables. Offense data are the best indicator of crimes with victims, such as homicide, robbery, aggravated assault, burglary, larceny, motor vehicle theft, etc. Generally, statistics from offense reports are thought to under-represent the true levels of crime; for example, police might not report crimes with low solvability factors in order to keep case closure rates high. Also, victims sometimes do not report crimes such as rape.


  • Arrest Report Data - When a suspect is arrested, a report is filed relating back to an offense report and giving data on the crimes committed, the arrested person, arrest location, etc.


  • Special Event Data - These are discrete events that are generally known ahead of time, such as sporting events and concerts, and that are associated with increased crime levels. These data are not collected systematically, but they should be. Furthermore, they should be incorporated in forecasts.


Crime Model Variables

  • Dependent Variables - These are crime counts per unit time and observation unit; for example, burglaries per month in a particular car beat, census tract, or grid cell. In areas with few crimes, such variables are often Poisson distributed. In high crime areas, these variables can be treated as continuous. The map below from Chapter 3 of CrimeMapTutorial is an example with 2,000 foot grid cells. (Our research has shown that this grid size is too small, and that 4,000 feet is about as small as grid cells can be for a place like Rochester, NY.)


  • Leading Indicator Variables - Our research has found some success in using CAD data and lesser crimes as leading indicators for serious crimes. These include CAD data on shots fired, public disturbances, and prostitution, and lesser offenses such as simple assaults and trespassing. We have lagged grid counts of such indicators over time by one month and over space from contiguous grid cells. Follow-on research with these variables has used car beats and census tracts instead of grid cells.


  • Causal Variable Fixed Effects - These variables do not vary over time, but only across space. When building multivariate forecast models that include observations across space and over time, the results often exhibit severe spatial heterogeneity. In other words, the fitted model does not pass through the center of the crime count data cloud everywhere; instead, there are areas that are consistently fitted too low or too high. Including fixed effects on crime potential, with variables on socio-economic status and land uses, is a remedy. These include census variables for populations that have low human capital and family status, populations that are of crime-prone age, etc. Another source of data is electronic yellow pages, which list commercial sites of various kinds (for example, bars, check cashing businesses, retail stores, and restaurants) together with street addresses. See Cohen, J., C.K. Durso, and W.L. Gorr, "Estimation of Crime Seasonality: A Cross-Sectional Extension to Time Series Classical Decomposition," Heinz School Working Paper 2003-18, August 2003, for a working paper that develops causal fixed effects variables for crime forecasting.


  • Dummy Variable Fixed Effects - It is possible to include a dummy variable for each grid cell, except one that is suppressed for estimation purposes. These variables do a good, albeit non-insightful, job of reducing spatial heterogeneity.


  • Seasonal Dummy Variables - Seasonal dummies (for example, one for each month except one that is suppressed) are often important components of crime forecast models. These can have multiplicative or additive forms.
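The lagged leading indicators and dummy variables described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in our research: the counts, the district labels, and the neighbor lists are made up; the space lag is assumed to be the sum of the previous month's counts over contiguous districts; and the suppressed month (December) and suppressed district ("A") are arbitrary choices.

```python
# Hypothetical monthly counts of a leading indicator (for example, shots-fired
# calls) for three districts; keys are (district, year, month).
counts = {
    ("A", 2000, 12): 4, ("B", 2000, 12): 1, ("C", 2000, 12): 7,
    ("A", 2001, 1): 2, ("B", 2001, 1): 3, ("C", 2001, 1): 5,
}

# Hypothetical contiguity lists: which districts share a boundary.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}


def previous_month(year, month):
    """Return (year, month) of the month before the given one."""
    return (year - 1, 12) if month == 1 else (year, month - 1)


def time_lag(district, year, month):
    """Indicator count in the same district one month earlier (None if absent)."""
    return counts.get((district, *previous_month(year, month)))


def space_time_lag(district, year, month):
    """Assumed form: sum of last month's counts over contiguous districts."""
    prev = previous_month(year, month)
    return sum(counts.get((n, *prev), 0) for n in neighbors[district])


def seasonal_dummies(month, suppressed=12):
    """Eleven 0/1 month indicators; the suppressed month is the baseline."""
    return [1 if month == m else 0 for m in range(1, 13) if m != suppressed]


def district_dummies(district, suppressed="A"):
    """One 0/1 indicator per district except the suppressed baseline district."""
    return [1 if district == d else 0
            for d in sorted(neighbors) if d != suppressed]


def predictors(district, year, month):
    """Assemble one row of predictor values for a district and month."""
    return ([time_lag(district, year, month),
             space_time_lag(district, year, month)]
            + seasonal_dummies(month)
            + district_dummies(district))


row = predictors("B", 2001, 1)
```

In practice the neighbor lists would come from a contiguity matrix for the chosen geography, and an average rather than a sum over neighbors is an equally plausible definition of the space lag.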


Spatial Data Processing

Preparing crime forecast model variables takes several spatial processing and data aggregation steps; see Chapter 2 of CrimeMapTutorial. For police data, these include:

  1. Preprocessing or cleaning address data - This includes standardizing addresses to eliminate irrelevant text (like apartment numbers or notes like "rear of store"), replacing place names such as "Carnegie Mellon University" with a street address like "5000 Forbes Ave", standardizing connectors for street intersections such as the "&" in "Craig St & Forbes Ave", and various other data cleaning steps. Many of these steps are included as parts of geographic information systems (GIS).


  2. Address matching incident location data - This uses sophisticated matching algorithms in GIS packages with street centerline maps to transform street addresses into map coordinates.


  3. Spatial overlay of incident points - Once data are address matched and exist as mapped points, it is possible to use spatial processing to assign correct area identifiers like zip code, census tract, police car beat, or grid cell number. This step allows data aggregation into crime counts by geographic area and time interval.
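The three steps above can be sketched in Python as follows. This is a simplified illustration, not a substitute for a GIS: the cleaning rules cover only a few cases, a small hypothetical lookup table stands in for address matching against street centerline maps, the coordinates are made-up values assumed to be in feet, and the overlay uses square grid cells (4,000 feet by default) rather than polygons such as census tracts or car beats.

```python
import re
from collections import Counter

# Step 1: a few hypothetical cleaning rules; real cleaners use many more.
PLACE_NAMES = {"CARNEGIE MELLON UNIVERSITY": "5000 FORBES AVE"}


def clean_address(raw):
    """Standardize an address string before address matching."""
    text = raw.upper().strip()
    text = re.sub(r"\(.*?\)", " ", text)                          # notes in parentheses
    text = re.sub(r"\b(?:APT|UNIT)\b\.?\s*#?\s*\w+", " ", text)  # apartment numbers
    text = re.sub(r"\s+(?:AND|@|/)\s+", " & ", text)              # intersection connector
    text = re.sub(r"\s+", " ", text).strip()
    return PLACE_NAMES.get(text, text)


# Step 2: hypothetical lookup standing in for GIS address matching.
# Coordinates are made-up map coordinates, assumed to be in feet.
GEOCODES = {
    "5000 FORBES AVE": (1000.0, 2500.0),
    "CRAIG ST & FORBES AVE": (4200.0, 3900.0),
    "123 MAIN ST": (4100.0, 100.0),
}


def grid_cell(x, y, cell_size=4000.0):
    """Step 3: assign a point to a square grid cell (column, row)."""
    return (int(x // cell_size), int(y // cell_size))


def aggregate(incidents):
    """Count incidents by (grid cell, year, month); collect unmatched addresses."""
    counts = Counter()
    unmatched = []
    for raw, year, month in incidents:
        xy = GEOCODES.get(clean_address(raw))
        if xy is None:
            unmatched.append(raw)
            continue
        counts[(grid_cell(*xy), year, month)] += 1
    return counts, unmatched


incidents = [
    ("Carnegie Mellon University", 2001, 1),
    ("Craig St and Forbes Ave", 2001, 1),
    ("123 Main St Apt 4B (rear of store)", 2001, 1),
    ("Nowhere Rd", 2001, 1),
]
cell_counts, unmatched = aggregate(incidents)
```

Keeping the unmatched addresses in a separate list makes it easy to review them and extend the cleaning rules, since match rates depend heavily on how well addresses are standardized.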