
An impact assessment of machine learning risk forecasts on parole board decisions and recidivism


Abstract

Objectives:

The Pennsylvania Board of Probation and Parole has begun using machine learning forecasts to help inform parole release decisions. In this paper, we evaluate the impact of the forecasts on those decisions and subsequent recidivism.

Methods:

A close approximation to a natural, randomized experiment is used to evaluate the impact of the forecasts on parole release decisions. A generalized regression discontinuity design is used to evaluate the impact of the forecasts on recidivism.

Results:

The forecasts apparently had no effect on the overall parole release rate, but did appear to alter the mix of inmates released. Important distinctions were made between offenders forecasted to be re-arrested for nonviolent crime and offenders forecasted to be re-arrested for violent crime. The balance of evidence indicates that the forecasts led to reductions in re-arrests for both nonviolent and violent crimes.

Conclusions:

Risk assessments based on machine learning forecasts can improve parole release decisions, especially when distinctions are made between re-arrests for violent and nonviolent crime.


Notes

  1. Three different outcomes were forecasted at once: an arrest for a violent crime, an arrest for a crime that was not violent, and no arrest. The crimes that fell into either of the first two categories were determined by the Parole Board. Random forests was applied to training data, and predictive accuracy was evaluated with out-of-bag test data. Cost ratios were determined by parole board members. All other tuning parameters were set to their default values because changing them in sensible ways did not make a meaningful difference. The procedure randomForest in R was used; its code was originally written by Leo Breiman and Adele Cutler, and was ported to R and extended by Andy Liaw and Matthew Wiener. The random forest forecasts were much more accurate than the standard risk scales the board had been using, such as the LSI-R, which very soon won over board members who were initially skeptical. But the key practical advantage was in comparisons to current rates of re-arrest: 27 % versus 58 % is in principle a substantial improvement.
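A minimal sketch, not the authors' code, of how such a forest might be fit. The data frame parole, its outcome column result, and the stratified sample sizes are all hypothetical; building the Board's cost ratios into the fit through stratified sampling follows the general approach described in Berk (2012).

```r
## Hypothetical data frame 'parole' with a factor outcome 'result' taking
## the values "V" (violent arrest), "O" (nonviolent arrest), "N" (no arrest).
library(randomForest)

rf <- randomForest(result ~ ., data = parole,
                   ntree = 500,                  # default-style tuning
                   strata = parole$result,       # stratify sampling on the outcome
                   sampsize = c(300, 600, 900))  # the class mix imposes the cost ratios

## Out-of-bag predictions give an honest estimate of forecasting accuracy.
print(rf$confusion)
```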

  2. This was no surprise. Parole board members were already knowledgeable, experienced professionals who were genuinely committed to the job. A stereotype held by some that parole board members are political hacks did not correspond to the reality.

  3. The possible impact of the Justice Reinvestment Initiative was suggested by an unusually thorough reviewer.

  4. There are good statistical justifications for how the reliability values were grouped, but a discussion would mean going into considerable detail about the random forests algorithm (Breiman 2001). For those already familiar with the algorithm, reliabilities can be computed from the “votes” over trees in the random forest. But matters were complicated by there being three prospective outcomes for each individual. It was possible for the winning vote to be a plurality, but not a majority. For example, a winning vote proportion of .4 could be coupled with vote proportions of .35 and .25. There is a clear favorite by all pairwise comparisons, but a substantial majority vote against the plurality winner.
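For readers familiar with the algorithm, a sketch of how the reliabilities and the three groupings in Appendix A (item 10) can be computed from the vote proportions; rf is the hypothetical forest fitted in the sketch under note 1.

```r
## rf$votes holds, for each inmate, the proportion of trees voting for each
## of the three outcomes; each row sums to 1.
votes <- rf$votes

## The reliability is the winning vote share, which can be a plurality
## rather than a majority (e.g., .40 against .35 and .25).
reliability <- apply(votes, 1, max)

## Group into the three ranges of Appendix A, item 10.
grp <- cut(reliability, breaks = c(0, 0.4, 0.5, 1),
           labels = c("weak", "modest", "strong"))
table(grp)
```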

  5. Many of these variables were required to properly organize the data, but were largely irrelevant to the data analysis itself.

  6. With a large sample, the null hypothesis of no difference between the treatment group and the comparison groups can be rejected even if the covariate imbalance makes no material difference in the estimated treatment effect.

  7. When an inmate’s conviction crime is violent, parole is granted 52 % of the time. When an inmate’s conviction crime is not violent, parole is granted 62 % of the time. The difference in proportions does not adjust for associations with nonviolent predictors, and is probably overstated.

  8. In this instance, it is not clear how one would proceed if adjustments were needed for a larger number of covariates. Even caliper matching would be a challenge because of likely nonlinear relationships and interaction effects. Propensity score matching would be problematic for the same reasons. For example, the design matrix would be very nearly singular even if a case could be made that all important relationships had been captured. And we will see shortly that even conditioning on the one potentially problematic covariate that is badly out of balance makes no material difference.

  9. The number of observations for Table 1 is smaller because of missing data patterns.

  10. A decision to parole an inmate is binary: release or not. There is no earlier, official classification of inmates by risk level even though the machine learning forecasting classified each inmate by one of three risk classes. The forecasted risk classes were just one of many inputs to the Board’s deliberations.

  11. The usual χ² test for a contingency table does not take order into account. When there is order in one or more of the variables (e.g., for the forecasts), the test is conservative because it has less power. If the null hypothesis is rejected nevertheless, it would also have been rejected were the ordering built into the test. There is a χ² test for ordered variables, introduced originally by Cochran (1954) and by Armitage (1955) and available in the R library coin. We applied that test when we could not reject the null using the conventional (unordered) test. In no case was the null hypothesis then rejected. Excellent references are the books by Agresti (2002) and Hollander and Wolfe (1999). It is also possible to approach the same data with logistic regression, but ordered predictors are still a problem, and there are a multitude of possible tests depending on the contrasts specified. After applying logistic regression, it was clear that the tabular results were far more accessible, especially since many high-order interaction effects were necessarily included in the logistic regression.
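A sketch of how the ordered test can be run with the coin library. The 2 × 3 table is invented for illustration and is not the paper's data; lbl_test() is coin's linear-by-linear association test, of which the Cochran–Armitage trend test is a special case.

```r
## Invented counts: parole decision by the ordered forecast category
## (N = no arrest, O = nonviolent arrest, V = violent arrest).
library(coin)

tab <- as.table(matrix(c(120, 80, 40,
                          60, 70, 80),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(decision = c("granted", "denied"),
                                       forecast = c("N", "O", "V"))))

## The linear-by-linear test exploits the ordering of the forecast
## categories; the conventional unordered test would not.
lbl_test(tab)
```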

  12. As noted earlier, the forecasts were not available in time to inform the Board’s decision, but were provided for this analysis.

  13. For all inmates, forecasts and reliabilities were computed, although some were provided too late for the hearing. This means that the predictors included are the same for inmates reviewed with the forecasts and inmates reviewed without the forecasts.

  14. The observations for the burn-in period included a large number of inmates who were not released. Such inmates could not be used to study recidivism.

  15. The effects of history are always a concern with regression discontinuity designs, not just when the assignment covariate is time.

  16. The intervention is even more complicated if, as suggested earlier, an important feature of the intervention was educating Board members about proper risk assessments.

  17. There are a number of technical issues here that are beyond the scope of this paper.

  18. With categorical outcomes, conventional scatter plots that are otherwise useful in RDD analyses do not provide much visual insight. We do not use them in the analyses to follow.

  19. To be clear, date was included as a regressor. There was just no need to clutter the table with its regression coefficients.

  20. The estimates for which the standard errors are computed are for an approximation of the true regression discontinuity relationship; the approximation is the estimation target. We have asymptotically unbiased estimates of the approximation, not the truth, and the standard errors refer to the approximation as well (Buja et al. 2015).

  21. As mentioned above, some of the very earliest observations were dropped because they were very sparsely distributed.

  22. The dates are transformed back to conventional representations from the Julian dates used in the analysis.

  23. In the units of log-odds, the function of date is linear. In odds units, the relationship becomes nonlinear. However, over the empirical ranges of log-odds fitted by the multinomial logistic regression, the amount of nonlinearity introduced is very small and difficult to see because of the range of values that must be covered by the vertical axis. We could have plotted the results in units of probabilities on the vertical axis. But that could have been very misleading because the base rate for violent crimes is so low.
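A sketch, with assumed variable names, of the kind of multinomial logistic regression described here, using the nnet library.

```r
## 'released' is a hypothetical data frame of paroled inmates with a
## three-category re-arrest outcome, the Julian date of the decision, and
## an indicator for decisions made after the forecasts became available.
library(nnet)

fit <- multinom(outcome ~ julian_date + post_forecast,
                data = released, Hess = TRUE)

## Coefficients are on the log-odds scale, in which the date trend is linear.
summary(fit)
```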

  24. We were unable to apply nonparametric smoothers as suggested by Gelman and Imbens (2014). We could find no statistical procedures for smoothers within a multinomial framework. Moreover, had we used smoothers, we risked introducing biases because of model selection when tuning parameters are determined from the data (Berk et al. 2010a).

References

  • Agresti, A. (2002). Categorical data analysis. New York: Wiley.


  • Armitage, P. (1955). Tests for linear trends in proportions and frequencies. Biometrics, 11(3), 375–386.

  • Berk, R. A. (2010). Recent perspectives on the regression discontinuity design. In Piquero, A., & Weisburd, D. (Eds.), Handbook of quantitative criminology. New York: Springer.

  • Berk, R. A. (2012). Criminal justice forecasts of risk: a machine learning approach. New York: Springer.


  • Berk, R. A., Barnes, G., Ahlman, L., & Kurtz, E. (2010b). When second best is good enough: a comparison between a true experiment and a regression discontinuity quasi-experiment. Journal of Experimental Criminology, 6(2), 191–208.

  • Berk, R. A., & Bleich, J. (2013). Statistical procedures for forecasting criminal behavior: a comparative assessment. Criminology & Public Policy, 12(3), 515–544.

  • Berk, R. A., Brown, L., & Zhao, L. (2010a). Statistical inference after model selection. Journal of Quantitative Criminology, 26(2), 217–236.

  • Berk, R. A., Brown, L., Buja, A., Zhang, K., & Zhao, L. (2014a). Valid post-selection inference. Annals of Statistics, 41(2), 802–837.

  • Berk, R. A., Pitkin, E., Brown, L., Buja, A., George, E., & Zhao, L. (2014b). Covariance adjustments for the analysis of randomized field experiments. Evaluation Review, 37, 170–196.

  • Berk, R. A., Brown, L., Buja, A., George, E., Pitkin, E., Zhang, K., & Zhao, L. (2014c). Misspecified mean function regression: making good use of regression models that are wrong. Sociological Methods and Research, 43, 422–451.

  • Berk, R. A., & de Leeuw, J. (1999). An evaluation of California’s inmate classification system using a generalized regression discontinuity design. Journal of the American Statistical Association, 94(448), 1045–1052.

  • Berk, R. A., & Hyatt, J. (2015). Machine learning forecasts of risk to inform sentencing decisions. The Federal Sentencing Reporter, 27(4), 222–228.


  • Berk, R. A., & Rauma, D. (1983). Capitalizing on nonrandom assignment to treatments: a regression discontinuity evaluation of a crime control program. Journal of the American Statistical Association, 78(381), 21–27.

  • Bertanha, M., & Imbens, G. W. (2014). External validity in fuzzy regression discontinuity designs. National Bureau of Economic Research, working paper 20773.

  • Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.

  • Buja, A., Berk, R. A., Brown, L., George, E., Pitkin, E., Traskin, M., Zhao, L., & Zhang, K. (2015). Models as approximations — a conspiracy of random regressors and model violations against classical inference in regression. Working paper, July 23, 2015.

  • Borden, H. G. (1928). Factors predicting parole success. Journal of the American Institute of Criminal Law and Criminology, 19, 328–336.


  • Burgess, E. M. (1928). Factors determining success or failure on parole. In Bruce, A. A., Harno, A. J., Burgess, E. M., & Landesco, J. (Eds.), The workings of the indeterminate sentence law and the parole system in Illinois (pp. 205–249). Springfield: State Board of Parole.

  • Duwe, G. (2014). The development, validity, and reliability of the Minnesota screening tool assessing recidivism risk (MnSTARR). Criminal Justice Policy Review, 25(5), 579–613.

  • Campbell, D. T., & Stanley, J. (1963). Experimental and quasi-experimental designs for research. Independence, KY: Cengage Learning.

  • Chen, M., & Shapiro, J. (2007). Do harsher prison conditions reduce recidivism? A discontinuity-based approach. American Law and Economics Review, 9, 1–29.

  • Cochran, W. G. (1954). Some methods for strengthening the common χ² tests. Biometrics, 10(4), 417–451.

  • Conroy, M. A. (2006). Risk assessments in the Texas criminal justice system. Applied Psychology in Criminal Justice, 2(3), 147–176.


  • Gelman, A., & Imbens, G. (2014). Why high-order polynomials should not be used in regression discontinuity designs. NBER Working Paper No. 20405. Cambridge: National Bureau of Economic Research.

  • Gottfredson, S. D., & Moriarty, L. J. (2006). Statistical risk assessment: old problems and new applications. Crime & Delinquency, 52(1), 178–200.

  • Hamilton, Z., Kigerl, A., & Campagna, M. (2016). Designed to fit: the development and validation of the STRONG-R recidivism risk assessment. Criminal Justice and Behavior, 43(2), 230–263.

  • Hahn, J., Todd, P., & Van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression discontinuity design. Econometrica, 69(1), 201–209.

  • Harcourt, B. (2008). Against prediction: profiling, policing and punishing in an actuarial age. Chicago: University of Chicago Press.


  • Hollander, M., & Wolfe, D. A. (1999). Nonparametric statistical methods, 2nd edn. New York: Wiley.


  • Holsinger, A. M. (2013). Implementation of actuarial risk/need assessment and its effect on community supervision revocations. Justice Research and Policy, 15(1), 95–122.


  • Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society, Series A, 171, 481–502.

  • Imbens, G. W., & Lemieux, T. (2008). Regression discontinuity designs: a guide to practice. Journal of Econometrics, 142(2), 615–635.

  • Imbens, G. W., & Rubin, D. B. (2015). Causal inference for statistics, social and biomedical sciences: an introduction. Cambridge: Cambridge University Press.


  • Jalbert, S. K., Rhodes, W., Flygare, C., & Kane, M. (2010). Testing probation outcomes in an evidence-based practice setting: reduced caseload size and intensive supervision effectiveness. Journal of Offender Rehabilitation, 49, 233–253.


  • LaVigne, N., Bieler, S., Cramer, L., Ho, H., Kotonias, C., Mayer, D., McClure, D., Pacifici, L., Parks, E., Peterson, B., & Samuels, J. (2014). Justice Reinvestment Initiative state assessment report. Washington, D.C.: Bureau of Justice Assistance, U.S. Department of Justice.

  • Loeffler, C. E. (2015). Processed as an adult: a regression discontinuity estimate of crime effects of charging nontransfer juveniles as adults. Journal of Research in Crime and Delinquency, 52(6), 890–922.

  • McCafferty, J. J. (2015). Professional discretion and the predictive validity of a juvenile risk assessment instrument: exploring the overlooked principle of effective correctional classification. Youth Violence and Juvenile Justice, December 28.

  • Miller, J., & Maloney, C. (2013). Practitioner compliance with risk/needs assessment tools: a theoretical and empirical assessment. Criminal Justice and Behavior, 40(7), 716–736.

  • Monahan, J., & Skeem, J. L. (2014). Risk redux: the resurgence of risk assessment in criminal sanctioning. Federal Sentencing Reporter, 26(3), 158–166.

  • Pew (2011). Risk/needs assessment 101: Science reveals new tools to manage offenders. PEW Center on the States, Public Safety Performance Project. www.pewcenteronthestates.org/publicsafety.

  • Rhodes, W., & Jalbert, S. K. (2013). Regression discontinuity design in criminal justice evaluation. Evaluation Review, 37(3-4), 239–273.

  • Starr, S. B. (2014). Evidence-based sentencing and the scientific rationalization of discrimination. Stanford Law Review, 66, 803–872.


  • Thistlethwaite, D. L., & Campbell, D. T. (1960). Regression-discontinuity analysis: an alternative to the ex post facto design. Journal of Educational Psychology, 51, 309–317.

  • Tonry, M. (2014). Legal and ethical issues in the prediction of recidivism. Federal Sentencing Reporter, 26(3), 167–176.

  • Trochim, W. M. K. (2001). Regression discontinuity design. In Smelser, N. J., & Baltes, P. B. (Eds.), International encyclopedia of the social and behavioral sciences (Vol. 19).

  • Viglione, J., Rudes, D. S., & Taxman, F. S. (2014). Misalignment in supervision: implementing risk/needs assessment instruments in probation. Criminal Justice and Behavior, 42(3), 263–285.


Acknowledgments

The entire project would have been impossible without the efforts of Jim Alibrio, Fred Klunk, and their many colleagues at the Pennsylvania Board of Probation and Parole and the Pennsylvania Department of Corrections. Thanks also go to the National Institute of Justice for financial support and to Patrick Clark, who was the project monitor at NIJ. Finally, the paper benefitted from an unusually thorough and constructive set of reviews.


Corresponding author

Correspondence to Richard Berk.

Appendix A: Information available to the board and used as predictors

  1. Violent Indicator – A binary classification coded “Yes” for a violent offense and “No” otherwise, based on an inmate’s offenses that led to his or her current prison sentence.

  2. OVRT – A classification into one of four categories, called the Offender Violent Recidivism Typology, that incorporates criminal history into expectations of future recidivism.

  3. LSI-R Score – A risk assessment score from a Level of Service Inventory-Revised interview that is part of the Pennsylvania Parole Guidelines.

  4. LSI-R Level – A label for the risk level given by the LSI-R assessment.

  5. Sex Offender – A “Yes” or “No” indicator based upon the Pennsylvania Parole Guideline assessment derived from the Static-99 instrument.

  6. Institutional Program Code – A numeric code for prison program participation recorded on the Parole Guideline instrument.

  7. Institutional Behavior Code – A numeric code for the offender’s behavior and prison adjustment.

  8. Guideline Score – A numeric score derived from summing assessment values in the Pennsylvania Parole Guidelines instrument. When the sum exceeds 7, there is a low likelihood of a recommendation for release.

  9. Guideline Recommendation – A threshold value of the likelihood of granting parole, as a summation of the Parole Guideline assessment recommendation.

  10. Degree of Reliability – One of three possible reliability ranges for the random forests forecasts: greater than 0.5 (a strong result), between 0.4 and 0.5 (a modest result), and less than 0.4 (a weak result).

  11. Forecast – The outcome forecasted by random forests: V (violent crime), O (nonviolent crime), or N (no future arrest).

  12. Prior Charges – The total count of arrests reported in the rap sheet from the Pennsylvania State Police.

  13. First Age – The offender’s age at the first arrest reported in the offender’s criminal history.

  14. Arrests – The total number of unique arrest dates in an offender’s criminal history record.

  15. Sex – A binary code for the gender of the offender.

  16. LSI-R Age – The chronological age of the offender at the time of the LSI-R assessment interview conducted immediately before the parole interview.

  17. LSI-R Score – The total score from an interview conducted with an inmate prior to the parole hearing.

  18. LSI-R 29 – The “Yes” or “No” response to question 29 in the LSI-R assessment, pertaining to whether the offender lived in a high-crime neighborhood.

  19. Convictions – A numeric count of the convictions reported on rap sheets, determined manually by parole officers.

  20. Intelligence Rate – A Department of Corrections intelligence score after a year in prison, based upon a group assessment technique.

  21. Program Participation – A Department of Corrections rating of institutional program participation after a year in prison.

  22. Participation Rating – A Department of Corrections rating of offender work participation after a year in prison.

  23. Nominal Length – The computed length of the sentence, based upon the Department of Corrections commitment date and the offender’s sentence maximum date.

  24. Serious Misconduct – A count of the prison misconduct reports with the most serious misconduct category indicated.

  25. Misconduct Counts – A count of the total number of prison misconduct reports found in the offender’s complete record.

  26. Forecast Printed – Whether the forecasts were available to the Board.


Cite this article

Berk, R. An impact assessment of machine learning risk forecasts on parole board decisions and recidivism. J Exp Criminol 13, 193–216 (2017). https://doi.org/10.1007/s11292-017-9286-2
