University of Jyväskylä | JYX Digital Repository


Architecture-independent matching of stripped binary code files using BERT and a Siamese neural network

File size: 3.1 Mb

Authors
Lampinen, Kenneth
Date
2020
Discipline
Computer Science
Access restrictions


The author has not given permission to make the work publicly available electronically. Therefore the material can be read only at the archival workstation at Jyväskylä University Library reserved for the use of archival materials.
You can request a copy of this thesis here
Copyright
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

 
The proliferation of IoT devices brings many cyber security challenges. One of them is identifying executable code with known vulnerabilities, which remains difficult even though open source code is commonly used in IoT firmware. Contributing factors include the widespread use of heterogeneous architectures and of non-standard toolsets and compilers in IoT firmware development. To address this issue, this work examines the latest research in binary code matching. It concludes that the research does not adequately address the current cyber security issues posed by IoT devices and proposes a new method of binary code matching based on techniques commonly used in Natural Language Processing (NLP). An artefact using Google’s BERT and a custom bi-directional LSTM Siamese network is developed and tested to demonstrate the viability of this new method. The BERT model was pre-trained on the code sections of binary executables compiled for the ARM architecture. It achieved scores of 89.1% and 98.0% on the key metrics masked_lm_accuracy and next_sentence_accuracy, respectively. This pre-trained BERT model was used to extract embeddings from the binary files’ code sections in order to train and validate the Siamese network. The Siamese network achieved an average matching accuracy of approximately 80% on the task of matching the stripped code sections of binary files compiled from two separate open source projects. This compares favorably to the 0% accuracy achieved by the fuzzy hashing algorithms SSDEEP and SDHASH. ...
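
The comparison step described in the abstract can be illustrated with a toy sketch. This is not the thesis's implementation: a fixed mean-pooling function stands in for the trained bi-directional LSTM branch, and the embedding values, the function names (`encode`, `cosine_similarity`, `is_match`) and the 0.9 threshold are all hypothetical. It only shows the general Siamese idea of passing both inputs through the same encoder and scoring the pair by similarity.

```python
import math
from typing import List, Sequence

Vector = List[float]


def encode(token_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Shared encoder applied to both inputs.

    Assumption: mean-pooling replaces the trained bi-directional LSTM
    purely for illustration; it has no learned parameters.
    """
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(tok[i] for tok in token_embeddings) / count for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_match(left: Sequence[Sequence[float]],
             right: Sequence[Sequence[float]],
             threshold: float = 0.9) -> bool:
    """Encode both inputs with the same function and compare the results.

    The threshold is a hypothetical value, not one taken from the thesis.
    """
    return cosine_similarity(encode(left), encode(right)) >= threshold


# Made-up 2-dimensional "embeddings" for three code sections.
section_a = [[1.0, 0.0], [0.8, 0.2]]
section_b = [[0.9, 0.1], [0.9, 0.1]]
section_c = [[0.0, 1.0], [0.1, 0.9]]

same_pair = is_match(section_a, section_b)
different_pair = is_match(section_a, section_c)
```

In a trained Siamese network the encoder weights are learned so that matching pairs score high and non-matching pairs score low; here the separation comes only from the hand-picked example values.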
Keywords
binary file matching; deep learning; Natural Language Processing; NLP; BERT; transformer; LSTM; Siamese network; similarity detection; SSDEEP; SDHASH; cyber security; machine learning; Internet of things
URI

http://urn.fi/URN:NBN:fi:jyu-202012287374

Collections
  • Pro gradu -tutkielmat [24518]

Related items

Showing items with similar title or keywords.

  • ISAdetect : Usable Automated Detection of CPU Architecture and Endianness for Executable Binary Files and Object Code 

    Kairajärvi, Sami; Costin, Andrei; Hämäläinen, Timo (ACM, 2020)
    Static and dynamic binary analysis techniques are actively used to reverse engineer software's behavior and to detect its vulnerabilities, even when only the binary code is available for analysis. To avoid analysis errors ...
  • Adversarial Attack’s Impact on Machine Learning Model in Cyber-Physical Systems 

    Vähäkainu, Petri; Lehto, Martti; Kariluoto, Antti (Peregrine Technical Solutions, 2020)
    Deficiency of correctly implemented and robust defence leaves Internet of Things devices vulnerable to cyber threats, such as adversarial attacks. A perpetrator can utilize adversarial examples when attacking Machine ...
  • Node co-activations as a means of error detection : Towards fault-tolerant neural networks 

    Myllyaho, Lalli; Nurminen, Jukka K.; Mikkonen, Tommi (Elsevier, 2022)
    Context: Machine learning has proved an efficient tool, but the systems need tools to mitigate risks during runtime. One approach is fault tolerance: detecting and handling errors before they cause harm. Objective: This ...
  • Taxonomy of generative adversarial networks for digital immunity of Industry 4.0 systems 

    Terziyan, Vagan; Gryshko, Svitlana; Golovianko, Mariia (Elsevier, 2021)
  • On data mining applications in mobile networking and network security 

    Zolotukhin, Mikhail (University of Jyväskylä, 2014)
Unless otherwise specified, publicly available JYX metadata (excluding abstracts) may be freely reused under the CC0 waiver.