Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques

Jormakka, Ossi

dc.contributor.advisor	Costin, Andrei
dc.contributor.author	Jormakka, Ossi
dc.date.accessioned	2019-11-06T06:51:08Z
dc.date.available	2019-11-06T06:51:08Z
dc.date.issued	2019
dc.identifier.uri	https://jyx.jyu.fi/handle/123456789/66196
dc.description.abstract	Automatisoitu haavoittuvuuksien etsiminen ja haavoittuvuuksien yksityiskohtien ennustaminen voi auttaa asiantuntijoita priorisoimaan ohjelmistovirheitä, mikä voi johtaa nopeampaan virheenkorjaukseen. Tässä työssä käytettiin National Vulnerability Database -tietokantaa tutkittaessa kuinka haavoittuvuuskuvauksien perusteella voidaan havaita haavoittuvuuksia mistä tahansa tekstistä sekä ennustaa haavoittuvuuksien vakavuus ja haavoittuvuustyyppi. Common Vulnerability Scoring System -järjestelmä tarjoaa tavan mitata haavoittuvuuksien vakavuuksia. Common Weakness Enumeration -järjestelmä tarjoaa hierarkkisen luokittelun yleisiin haavoittuvuustyyppeihin. Olemassa olevat tutkimukset haavoittuvuuksien tekstiluokittelussa usein rajoittuvat kapeaan alueeseen, esimerkiksi vain johonkin Common Vulnerability Scoring System -järjestelmän versioon. Tämä työ antaa yleiskuvan virheraporttien luokittelusta sekä vakavuuden ja haavoittuvuustyypin ennustamisesta. Työssä pyrittiin käyttämään laajasti tunnettuja tekstin esikäsittelymenetelmiä sekä monia muita Scikit-learn -kirjaston tarjoamia luonnollisen tekstin käsittelyn vaihtoehtoja ja koneoppimismenetelmiä. Tulokset osoittavat 2-grammin avainsanapohjaisen menetelmän olevan yhtä tehokas kuin yhden luokan tukivektorikone kun esikäsittelynä käytetään Term Frequency – Inverse Document Frequency -painotusta ja sanojen taivutusmuotojen muuttamista perusmuotoon (lemmatizing). Haavoittuvuuksien vakavuuden ennustamisessa saadaan parempia tuloksia Common Vulnerability Scoring System -järjestelmän versiolle 2 kuin järjestelmän versiolle 3. Lineaarinen tukivektorikone saavutti korkeimman F1-tuloksen haavoittuvuuksien vakavuuden ja haavoittuvuustyypin luokittelussa. Lisäksi tässä työssä on yhteenveto uusimpaan National Vulnerability Database -tietokannan tietoon.	fi
dc.description.abstract	Automated vulnerability detection and prediction of vulnerability details may help security specialists prioritize bug reports and obtain earlier fixes for security-related software defects. This thesis studies how to find vulnerability-like descriptions in arbitrary text and how to classify vulnerability severities and weakness types. Vulnerability severities are measured using the Common Vulnerability Scoring System. The Common Weakness Enumeration is a hierarchical list of weakness types into which each vulnerability can be classified. The scoring and weakness type information for known vulnerabilities is available in the National Vulnerability Database. Much existing research on text-only vulnerability classification is limited to a narrow area, for example a specific version of the Common Vulnerability Scoring System. This thesis gives an overview of classifying bug reports by severity and weakness type together. The Scikit-learn library's interfaces were used extensively to implement text preprocessing, machine learning classification, and experiment validation. Experiments include stemming, lemmatization, and numerous text vectorization options and algorithms provided by the library. The results show that, in vulnerability detection, a keyword-based classifier using word 2-grams works as well as a One-class Support Vector Machine with lemmatization and Term Frequency–Inverse Document Frequency preprocessing. Vulnerability severities can be predicted better for Common Vulnerability Scoring System version 2 than for version 3. The Linear Support Vector Machine classifier achieved the highest F1-score in predicting both Common Vulnerability Scoring System severities and Common Weakness Enumeration types. This thesis also presents a summary of the latest data available in the National Vulnerability Database data feeds.	en
dc.format.extent	62
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.subject.other	common vulnerability scoring system
dc.subject.other	common weakness enumeration
dc.subject.other	Scikit-learn
dc.title	Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques
dc.identifier.urn	URN:NBN:fi:jyu-201911064740
dc.type.ontasot	Pro gradu -tutkielma	fi
dc.type.ontasot	Master’s thesis	en
dc.contributor.tiedekunta	Informaatioteknologian tiedekunta	fi
dc.contributor.tiedekunta	Faculty of Information Technology	en
dc.contributor.laitos	Informaatioteknologia	fi
dc.contributor.laitos	Information Technology	en
dc.contributor.yliopisto	Jyväskylän yliopisto	fi
dc.contributor.yliopisto	University of Jyväskylä	en
dc.contributor.oppiaine	Tietojenkäsittelytiede	fi
dc.contributor.oppiaine	Computer Science	en
dc.rights.copyright	Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.	fi
dc.rights.copyright	This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.	en
dc.type.publication	masterThesis
dc.contributor.oppiainekoodi	601
dc.subject.yso	koneoppiminen
dc.subject.yso	luokitus (toiminta)
dc.subject.yso	haavoittuvuus
dc.subject.yso	datatiede
dc.subject.yso	machine learning
dc.subject.yso	classification
dc.subject.yso	vulnerability
dc.subject.yso	data science
dc.format.content	fulltext
dc.type.okm	G2

Files in this item

Name: URN:NBN:fi:jyu-201911064740.pdf
Size: 2.566Mb
Format: PDF

This item appears in the following collections

Master's theses (Pro gradu -tutkielmat) [29561]

Showing items with a similar title or keywords.

Updating strategies for distance based classification model with recursive least squares

Raita-Hakola, Anna-Maria; Pölönen, Ilkka (Copernicus Publications, 2022)

The idea is to create a self-learning Minimal Learning Machine (MLM) model that is computationally efficient, easy to implement and performs with high accuracy. The study has two hypotheses. Experiment A examines the ...

The Datafication of Hate : Expectations and Challenges in Automated Hate Speech Monitoring

Laaksonen, Salla-Maaria; Haapoja, Jesse; Kinnunen, Teemu; Nelimarkka, Matti; Pöyhtäri, Reeta (Frontiers Media, 2020)

Hate speech has been identified as a pressing problem in society and several automated approaches have been designed to detect and prevent it. This paper reports and reflects upon an action research setting consisting of ...

Automatic image‐based identification and biomass estimation of invertebrates

Ärje, Johanna; Melvad, Claus; Jeppesen, Mads Rosenhøj; Madsen, Sigurd Agerskov; Raitoharju, Jenni; Rasmussen, Maria Strandgård; Iosifidis, Alexandros; Tirronen, Ville; Gabbouj, Moncef; Meissner, Kristian; Høye, Toke Thomas (Wiley, 2020)

Understanding how biological communities respond to environmental changes is a key challenge in ecology and ecosystem management. The apparent decline of insect populations necessitates more biomonitoring but the time-consuming ...

The Truth is Out There : Focusing on Smaller to Guess Bigger in Image Classification

Terziyan, Vagan; Kaikova, Olena; Malyk, Diana; Branytskyi, Vladyslav (Elsevier, 2023)

In Artificial Intelligence (AI) in general and in Machine Learning (ML) in particular, which are important and integral components of modern Industry 4.0, we often deal with uncertainty, e.g., lack of complete information ...

Description of movement sensor dataset for dog behavior classification

Vehkaoja, Antti; Somppi, Sanni; Törnqvist, Heini; Valldeoriola Cardó, Anna; Kumpulainen, Pekka; Väätäjä, Heli; Majaranta, Päivi; Surakka, Veikko; Kujala, Miiamaaria V.; Vainio, Outi (Elsevier, 2022)

Movement sensor data from seven static and dynamic dog behaviors (sitting, standing, lying down, trotting, walking, playing, and (treat) searching i.e. sniffing) was collected from 45 middle to large sized dogs with six ...
