Comparing the forecasting performance of logistic regression and random forest models in criminal recidivism
Authors
Date
2016Access restrictions
This material has a restricted access due to copyright reasons. It can be read at the workstation at Jyväskylä University Library reserved for the use of archival materials: https://kirjasto.jyu.fi/en/workspaces/facilities.
Rikosseuraamusalalla on viime vuosina kehitetty uusintarikollisuutta ennustavia malleja (Tyni, 2015), jotka perustuvat tyypillisesti rekisteripohjaisiin mittareihin, jotka mittaavat mm. tuomitun sukupuolta, ikää, rikostaustaa ja vankikertaisuutta. Yleensä tällaisten mallien kehityksessä käytetään logistisen regressioanalyysin kaltaisia parametrisia malleja, joissa uusintarikollisuuden todennäköisyyttä mallinnetaan taustamuuttujien lineaarisena funktiona. Näiden mallien rinnalle on viime aikoina kehitetty koneoppimisalgoritmeihin perustuvia vaihtoehtoja, joiden on todettu suoriutuvan käytännön sovelluksissa uusintarikollisuuden ennustamisessa perinteisiä malleja paremmin (Berk & Bleich, 2014). Tällaisten mallien toimivuutta suhteessa perinteisiin malleihin ei ole kuitenkaan testattu suomalaisella datalla. Tutkielman tarkoituksena on tarkastella sitä, kuinka hyvin erilaiset ennustemallit onnistuvat tehtävässään. Tutkielman ensimmäisessä vaiheessa luodaan logistiseen regressioanalyysiin ja koneoppimisalgoritmiin (Random forest) perustuvat uusintarikollisuutta ennustavat mallit Kriminologian ja oikeuspolitiikan instituutin Rikosten ja seuraamusten tutkimusrekisteristä poimitulla aineistolla, joka sisältää referenssituomioita vuosilta 2005-2007. Tuomituille henkilöille on haettu tietoa myös referenssituomiota edeltävästä ja seuraavasta rikoskäyttäytymisestä. Ennustemalli luodaan vuosien 2005–2006 välillä tuomittujen aineistolla, ja ennustemallia testataan vuoden 2007 datalla. Näin simuloidaan tilannetta, jossa havaittuun aineistoon perustuvalla historiallisella toteumatiedolla ennustetaan uuden tuomittujen ryhmän vielä toteutumatonta uusintarikollisuutta. Tutkimuskysymyksenä kysytäänkin, kumpi malleista pystyy luomaan rikoshistoriatiedon perusteella paremman ennustusmallin. Molemmat mallit ennustavat uusinta-rikollisuutta tutkielman asetelmassa verrattain hyvin. Kumpikaan ennustemalli ei kuitenkaan ole toista parempi, sillä menetelmät tuottavat ennustustehokkuudeltaan varsin samantasoiset mallit. Tutkielman tuloksena todetaan, ettei Random forest –koneoppimismenetelmän ja logistisen regressiomallin ennustustehokkuuden välille saada merkittävää eroa tutkielman asetelmalla.
...
During the recent years, predictive models have been created to predict the future criminal behavior (recidivism) of past offenders (e.g. Tyni, 2015). Predictive models are often created by using register-based indicators, e.g. offender’s gender, age, criminal background, or prior imprisonments. Usually, these predictive models are created by using parametric models, where the likelihood of recidivating is modelled as a linear function of independent variables. Lately, machine learning algorithms have been introduced as alternatives to these more traditional models. In a recent American study, machine learning algorithms were stated to be more accurate predictors of recidivism than the more traditional logistic regression model (Berk & Bleich, 2014). However, these machine learning algorithms have not been tested for criminal recidivism prediction utilizing Finnish data. The aim of this thesis is to examine the comparative effectiveness of different risk prediction models in a Finnish setting. In this thesis, two predictive models for recidivism are created, one being a logistic regression model, and the other a machine learning algorithm-based model called Random forest. Research data was gathered from the RST (Rikosten ja seuraamusten tutkimusrekisteri, which translates to “the research register of crimes and sanctions”) database of Institute of Criminology and Legal Policy, and includes all offenders convicted to several common crime type offenses in Finland from 2005 to 2007. Data also includes information on past and future criminal behavior for those offenders. Predictive models are developed with data from the years 2005 and 2006. The model testing is done with the remaining 2007 data, in order to simulate a situation where predictive models are used to predict recidivism yet to be actualized. The research question asks which of these models perform better in forecasting the criminal recidivism of a previous offender. The results of this study show that both logistic regression and Random forest algorithm create decent predictive models, but neither model outperforms the other on chosen performance metrics. The outcome, and the answer to the research question is, that neither model is better than the other in predicting recidivism among convicted offenders in Finland.
...
Keywords
Metadata
Show full item recordCollections
- Pro gradu -tutkielmat [29556]
Related items
Showing items with similar title or keywords.
-
Predicting Children's Myopia Risk : A Monte Carlo Approach to Compare the Performance of Machine Learning Models
Artiemjew, Piotr; Cybulski, Radosław; Emamian, Mohammad; Grzybowski, Andrzej; Jankowski, Andrzej; Lanca, Carla; Mehravaran, Shiva; Młyński, Marcin; Morawski, Cezary; Nordhausen, Klaus; Pärssinen, Olavi; Ropiak, Krzysztof (SCITEPRESS Science and Technology Publications, 2024)This study presents the initial results of the Myopia Risk Calculator (MRC) Consortium, introducing an innovative approach to predict myopia risk by using trustworthy machine-learning models. The dataset included approximately ... -
Effect of variable selection strategy on the predictive models for adverse pregnancy outcomes of pre-eclampsia : A retrospective study
Zheng, Dongying; Hao, Xinyu; Khan, Muhanmmad; Kang, Fuli; Li, Fan; Hämäläinen, Timo; Wang, Lixia (Scholar Media Publishing Company, 2024)Objectives: The improvement of prediction for adverse pregnancy outcomes is quite essential to the women suffering from pre-eclampsia, while the collection of predictive indicators is the prerequisite. The traditional ... -
Comparison of machine learning and logistic regression as predictive models for adverse maternal and neonatal outcomes of preeclampsia : A retrospective study
Zheng, Dongying; Hao, Xinyu; Khan, Muhanmmad; Wang, Lixia; Li, Fan; Xiang, Ning; Kang, Fuli; Hamalainen, Timo; Cong, Fengyu; Song, Kedong; Qiao, Chong (Frontiers Media SA, 2022)Introduction: Preeclampsia, one of the leading causes of maternal and fetal morbidity and mortality, demands accurate predictive models for the lack of effective treatment. Predictive models based on machine learning ... -
Do Randomized Algorithms Improve the Efficiency of Minimal Learning Machine?
Linja, Joakim; Hämäläinen, Joonas; Nieminen, Paavo; Kärkkäinen, Tommi (MDPI AG, 2020)Minimal Learning Machine (MLM) is a recently popularized supervised learning method, which is composed of distance-regression and multilateration steps. The computational complexity of MLM is dominated by the solution of ... -
Intelligent solutions for real-life data-driven applications
Ivannikova, Elena (University of Jyväskylä, 2017)The subject of this thesis belongs to the topic of machine learning or, specifically, to the development of advanced methods for regression analysis, clustering, and anomaly detection. Industry is constantly seeking ...