ISLAMIC UNIVERSITY OF TECHNOLOGY (IUT)

Malware Detection Using Machine Learning Classifiers

Authors:

Ahmed Camara: 170041070
Aanmar Abdou Salam: 170041075
Mefire Abdallah: 170041078

Supervisor:

Shohel Ahmed
Assistant Professor

CSE Department, IUT

A thesis submitted in partial fulfillment of the requirements for the degree of B.Sc.
Engineering in Computer Science and Engineering

Academic Year 2020-21
Department of Computer Science and Engineering (CSE)

Islamic University of Technology (IUT)
A Subsidiary Organ of Organization of Islamic Cooperation (OIC)

Board Bazar, Gazipur-1704, Bangladesh.
May, 2022


Declaration of Authorship

This is to certify that the work presented in this thesis was carried out by the
students listed below under the supervision of Mr. Ahmed Shohel, Assistant
Professor at the Department of Computer Science and Engineering (CSE), Islamic
University of Technology (IUT). This document is the result of the students'
thesis work for the degree of B.Sc. Engineering in Computer Science and
Engineering.

Author: Ahmed Camara
ID: 170041070
E-mail: camaraahmed@iut-dhaka.edu

…………………..
Date and Signature

Author: Aanmar Abdou Salam
ID: 170041075
E-mail: abdousalam@iut-dhaka.edu

…………………..
Date and Signature

Author: Mefire Abdallah
ID: 170041078
E-mail: mefireabdallah@iut-dhaka.edu

…………………..
Date and Signature


Malware Detection Using Machine Learning Classifiers

Approved By:

Supervisor:

Shohel Ahmed

Assistant Professor

Department of Computer Science and Engineering (CSE)

Islamic University of Technology (IUT), OIC


Acknowledgment

We would like to express our sincere gratitude to the Computer Science and
Engineering Faculty for allowing us to complete this thesis, and to our
supervisor, Mr. Ahmed Shohel, Assistant Professor. His explanations and ideas
were invaluable to this thesis, and without his guidance none of this would have
been possible. From the initial phases of the work and the selection of the topic
to the implementation and finalization of the project, he offered his opinions,
time, and input, which helped us complete our thesis work properly. We greatly
appreciate his suggestions and comments on our work.


Abstract

With the growth of technology and the exponentially increasing amount of data
being generated, a central challenge is protecting this data from unauthorized
access. Over the last few years, researchers have struggled to find an effective
solution to this problem. Signature-based detection has been the standard method
for detecting malware. Regrettably, such traditional techniques are no longer
capable of providing adequate protection. In this work, we propose a protection
system in which machine learning models are trained on malicious and benign files
so that they can classify previously unseen files. We trained three classifiers,
Random Forest, Decision Tree, and K-Nearest Neighbors, on the data. Random Forest
gave the best result, with an FPR of 0.0208 and an accuracy of 98.5%.

Keywords: Malware, Machine Learning, Random Forest, Decision Tree, K-NN,
Sampling


Contents

1. Introduction
    1.1 Overview
    1.2 Problem Statement
2. Literature Review
3. Proposed Methodology
    3.1 Data Collection and Description
    3.2 Data Analysis
    3.3 Feature Engineering and Selection
    3.4 Models
        3.4.1 Random Forest and Decision Tree
            A. Random Forest Model
            B. Decision Tree Model
        3.4.2 K-Nearest Neighbors
4. Results Analysis
5. Conclusion and Future Work
References


List of Figures

3.1: Experiment methodology
3.3.1: Matrix correlation between features
3.3.2: Distribution of data
3.4.1.1: Construction of decision tree
3.4.2.1: KNN


1. Introduction

1.1 Overview

We live in a world where computing devices have become essential to daily life.
Computers are used by everyone, from ordinary users to engineers, which leads to
an exponentially growing amount of generated data. Securing this data has
therefore been one of the biggest challenges for engineers over the past decades.
Many techniques have been proposed, ranging from traditional solutions to more
advanced ones. According to the independent IT security institute AV-TEST [1],
the amount of malware has increased sharply over the last few years, going from
182.90M in 2013 to 1347.63M in 2022. The term “malware” [2] stands for malicious
software, which cybercriminals use to gain unauthorized access to data. There are
different types of malware, such as ransomware, trojans, and spyware [3]. The
standard and most popular technique for detecting malware has been
signature-based detection [4]. But as the amount of malware keeps increasing,
this method has become insufficient, because signature-based detection can only
detect malware that has already been seen, not new, unseen samples. Other
techniques for detecting malware include static analysis [5], which consists of
analyzing a malware sample without running it, and dynamic analysis [6], which
consists of running the malware in a virtual environment to analyze its behavior.
Many efforts have been made to implement more advanced techniques for detecting
malware.

1.2 Problem Statement

In signature-based detection, when new malware is released, the anti-malware
vendor has to analyze it, assign a unique string identifier (the signature) to
it, and update the anti-malware database to allow future detection. Between the
release of the malware and the deployment of the updated signatures, thousands or
millions of computers might get infected. This makes clear how inadequate
signature-based detection has become. We need a technique that can overcome this
problem. This is where Machine Learning classifiers come into play. Machine
Learning allows us to train a model that learns to distinguish malware from
benign files and predicts whether a file newly introduced into a system is
legitimate or not.

2. Literature Review

Zane Markel and Michael Bilzor [7] proposed a binary classification approach to
detect malware. They extracted a set of features from the PE header sections of
legitimate and malicious files and used Logistic Regression, Decision Tree, and
Naive Bayes. Decision Tree gave the best result, with an F1-score of 0.97.

Junho Choi, Hayoung Kim, Chang Choi, and Pankoo Kim [8] proposed a method for
classifying executables as malware or benign by extracting features from
malicious code using N-gram analysis. They used Support Vector Machines (SVM) and
K-Nearest Neighbors (KNN) in this experiment. KNN outperformed SVM, with an FPR
of 0.009 compared to 0.015 for SVM.

Zhongru Wang, Peixin Cong, and Weiqiang Yu [9] made three contributions. First,
they proposed methods for extracting malicious code metadata and verifying its
validity. Second, building on the big data analysis framework Spark, they made
the extraction and learning of PE file metadata more efficient, so that
large-scale PE files can be processed in a distributed manner. Third, they
proposed and implemented PE-Classifier, a malicious code detection prototype for
fast, distributed detection with high accuracy compared to traditional AV
detection. The evaluation of PE-Classifier showed a TPR of 0.96.

Ivan Firdausi, Charles Lim, Alva Erwin, and Anto Satriyo Nugroho [10] extracted a
set of features from Windows Portable Executable (PE) files, using a total of 220
unique malware samples. The benign samples were taken from system files located
in the “System32” directory of a clean installation of Windows XP Professional
32-bit, with a total of 250 unique benign software samples. They trained several
classifiers, including KNN, Naive Bayes, SVM, and J48 Decision Tree. The
experiments were conducted on four data sets. J48 Decision Tree performed best,
with an accuracy of 95.9% and an FPR of 0.024.

Athiq Reheman Mohammed, G. Sai Viswanath, K. Sai babu, and T. Anuradha [11] also
proposed a technique in which they extract a set of features from the PE file
header to construct a dataset and then train several models on this data to make
predictions. To test the model, they created a static website where a file can be
uploaded; on the back end, the trained machine learning model classifies whether
the uploaded file is malicious or not. They obtained the best accuracy, 99.4%,
with Random Forest. In a similar approach, Nur Syuhada Selamat and Fakariah Hani
Mohd Ali [12] analyzed malware using static analysis and extracted features from
the samples using the PEView tool. The malware files were taken at random from
various sources, and the legitimate files were taken from the Windows and Program
Files folders.

Yuval Elovic, Chanan Gleze, and Robert Moskovitch [13] extracted two types of
static features: N-grams and features from the Portable Executable header of
Win32 executables. They implemented a tool that extracts 5-grams from the binary
representation of a file. In the same manner, Mozammel Chowdhury, Azizur Rahman,
and Md Rafiqul Islam [14] proposed an approach based on data mining. They also
extracted features by analyzing executable files using both static and dynamic
analysis. They extracted two types of features from the executable files, N-gram
features and Windows API calls, on which classifiers were trained.


3. Proposed Methodology

There are some critical steps that need to be taken into consideration when solving
a Machine Learning problem. Hence, in this thesis work, we have considered four
major steps to implement our framework.

Figure 3.1: Experiment methodology


3.1 Data Collection and Description

The dataset used in this work was taken from Kaggle [15]. It is part of a
competition hosted by Microsoft in 2018. In the dataset, each row corresponds to a
machine uniquely identified by a Machine Identifier. The dataset consists of 8M
records and 83 columns. To fit the data into our computer’s memory, we considered
only a subset of the data. The final data used to train the models consisted of
567,730 records.
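As an illustration of working with only part of a large CSV file, the sketch below reads at most a fixed number of records using the Python standard library. It is a minimal example under stated assumptions, not the loading code used in this work: the helper name load_subset, the in-memory sample, and the label column name HasDetections are assumptions made for the example.

```python
import csv
import io
from itertools import islice


def load_subset(csv_file, max_rows):
    """Read at most max_rows records from an open CSV file as dictionaries."""
    reader = csv.DictReader(csv_file)
    return list(islice(reader, max_rows))


# Tiny in-memory stand-in for the real file; the values are made up and the
# HasDetections column name is an assumption about the label column.
sample = io.StringIO(
    "MachineIdentifier,IsSxsPassiveMode,HasDetections\n"
    "a1,0,1\n"
    "b2,0,0\n"
    "c3,1,1\n"
)
subset = load_subset(sample, max_rows=2)
print(len(subset), subset[0]["MachineIdentifier"])
```

Because the reader is consumed lazily, rows beyond the limit are never parsed or held in memory.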

3.2 Data Analysis

In Machine Learning, the first critical step before building the models is to
understand the data. We can do that by performing some statistical analysis and
by visualizing the distribution of the data for different attributes. We found
that the dataset had serious quality problems. Columns like ‘PuaMode’ and
‘Census_ProcessorClass’ have more than 90% missing values. We therefore set a
threshold of 30%: all columns with more than 30% missing values were deleted,
since they would not be useful to our classifiers. We also have some columns with
a different value in each row, such as the machine identifier. Keeping these
columns would only confuse our models, so we removed them from the dataset. In
the end, we are left with columns that have only a few missing values, which are
handled later.
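The column filtering described above can be sketched in plain Python as follows. This is an illustrative example only: the helper names and the toy rows are hypothetical, and a missing value is assumed to be represented by None or an empty string.

```python
def is_missing(value):
    """Treat None and the empty string as missing (an assumption of this sketch)."""
    return value is None or value == ""


def columns_to_drop(rows, threshold=0.30):
    """Return columns that exceed the missing-value threshold or that hold a
    distinct value in every row."""
    drop = []
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        present = [v for v in values if not is_missing(v)]
        missing_ratio = 1 - len(present) / len(rows)
        all_unique = len(present) == len(rows) and len(set(present)) == len(rows)
        if missing_ratio > threshold or all_unique:
            drop.append(column)
    return drop


# Toy rows with made-up values.
rows = [
    {"MachineIdentifier": "a1", "PuaMode": None, "IsSxsPassiveMode": 0},
    {"MachineIdentifier": "b2", "PuaMode": None, "IsSxsPassiveMode": 0},
    {"MachineIdentifier": "c3", "PuaMode": "on", "IsSxsPassiveMode": 1},
    {"MachineIdentifier": "d4", "PuaMode": None, "IsSxsPassiveMode": 0},
]
print(columns_to_drop(rows))
```

In this toy input, PuaMode is dropped because 75% of its values are missing, and MachineIdentifier is dropped because every row has a different value.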

3.3 Feature Engineering and Selection

After getting a basic understanding of the data, our next goal is to perform
feature engineering and select relevant features to build the models. We have to
improve the quality of the data. The dataset consists of 567,730 records, with
482,571 malware and 85,159 legitimate files. The data is therefore highly
imbalanced. Training a classifier on this kind of data might lead to a model that
is biased toward the majority class. In Machine Learning, there are many ways to
deal with this kind of problem. In our experiment, we used over-sampling.
RandomOverSampler [16] is a class available in Python, in the imbalanced-learn
library, that helps deal with the imbalance problem. It randomly selects samples
from the minority class and duplicates them to balance the data. We ended up with
perfectly balanced data.

The next step consisted of selecting relevant features. We plotted the
correlation between features using the heatmap [17] function of the seaborn
library to understand how strongly the features were correlated with each other.
The dataset consists of 83 columns, so plotting all of them at the same time
would make the visualization difficult to read. We therefore first plotted only a
subset of the columns, from the 1st to the 15th column.
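To make the idea of random over-sampling concrete, the sketch below duplicates randomly chosen minority-class samples until both classes have the same size. It is a plain-Python stand-in written for illustration, not the RandomOverSampler implementation; the function name random_oversample and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: 1 = malware (majority), 0 = legitimate (minority).
X = [[0.1], [0.4], [0.5], [0.9], [0.7], [0.2]]
y = [1, 1, 1, 1, 0, 0]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

The added rows are exact copies of existing minority samples, so no new information is created; the class proportions are simply evened out.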

Figure 3.3.1: Matrix correlation between features


For example, the correlation between ‘IsSxsPassiveMode’ and ‘RtpStateBitfield’ is
high. When two columns are highly correlated, we can delete one and keep the
other. We deleted the column RtpStateBitfield and kept IsSxsPassiveMode.
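The selection rule can be illustrated with a small sketch that computes the Pearson correlation between columns and lists the pairs that are strongly correlated. The threshold of 0.9, the helper names, and the column values are assumptions made for this example; the selection in this work was done by inspecting the heatmap.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return pairs of column names whose absolute correlation exceeds threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > threshold:
                pairs.append((a, b))
    return pairs


# Made-up values: the second column is fully determined by the first.
columns = {
    "IsSxsPassiveMode": [0, 0, 1, 0, 1, 0],
    "RtpStateBitfield": [7, 7, 0, 7, 0, 7],
    "OtherFeature": [3, 1, 2, 5, 4, 6],
}
print(correlated_pairs(columns))
```

For each reported pair, one of the two columns can then be removed. The absolute value is used because a strong negative correlation is just as redundant as a strong positive one.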

We have to be very careful when deciding what column to delete. For that, we have
to understand what each column contains. We can do that by visualizing each
column.

Figure 3.3.2: Distribution of data

The column RtpStateBitfield contains six distinct values. From the figure above,
it is clear that the values in this column are not equally distributed. We
therefore deleted it and kept the IsSxsPassiveMode column, which has two distinct
values. For the next phase, we plotted the 15th to the 30th column and repeated
the same process until all the relevant features had been selected. In the end,
we were left with 58 of the 83 features, none of which were strongly correlated
with each other. We standardized the data using the StandardScaler class, which
puts all features on the same scale before training.
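The transformation applied to each column by this kind of scaling is z = (x - mean) / standard deviation. A minimal plain-Python sketch is shown below; the function name standardize and the sample values are illustrative, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled)
```

After this step every column has mean 0 and standard deviation 1, which matters for distance-based classifiers such as KNN, where a feature with a large numeric range would otherwise dominate the distance.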


3.4 Models

Once we have improved the quality of the data, scaled it, and selected relevant
features, we have to build the models. In this work, we built three classifiers.
We split the data into training and testing sets, taking 80% of the data for
training and 20% for testing. Finally, we trained our models and evaluated them
on the test data. Before building the models, let us look at how each classifier
works behind the scenes to make predictions.
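An 80/20 split of this kind can be sketched as follows. This is a minimal illustration, assuming a simple shuffled split; the function name split_train_test, the seed, and the toy data are hypothetical and not the exact procedure used in this work.

```python
import random


def split_train_test(samples, labels, test_ratio=0.2, seed=42):
    """Shuffle the indices and return X_train, X_test, y_train, y_test."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx, data):
        return [data[i] for i in idx]

    return (pick(train_idx, samples), pick(test_idx, samples),
            pick(train_idx, labels), pick(test_idx, labels))


X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))
```

Shuffling before splitting avoids a test set that only contains the last records of the file, and fixing the seed makes the split reproducible.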

3.4.1 Random Forest and Decision Tree

Random Forest [18] and Decision Tree [19] are two types of supervised learning
approaches that can be used in both classification and regression problems.
Random Forest is a tree-based classifier that uses the technique of ensemble
learning, which consists of combining multiple classifiers into one to solve more
complex problems. During training, Random Forest builds multiple decision trees,
each of which gives an output. From these outputs, Random Forest chooses the
final decision by majority voting.
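Majority voting itself is simple to express. The sketch below takes the class predicted by each tree for one file and returns the most frequent one; the function name and the votes are made up for illustration. Note that library implementations may combine trees differently, for example scikit-learn averages the class probabilities predicted by the trees rather than counting hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical outputs of five trees for a single file.
votes = ["malware", "legitimate", "malware", "malware", "legitimate"]
print(majority_vote(votes))  # malware
```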

Figure 3.4.1.1: Construction of decision tree


When constructing an individual tree, Random Forest does not consider all
features; it works with a random subset of them, so the trees differ from one
another. This reduces the feature space that each tree has to search.

In the Decision Tree, each node denotes a test based on an attribute, each branch
represents the result of the test, and each leaf contains the class label.

Decision Tree classifies an instance by sorting it down the tree from the root to
a leaf node, which provides the classification of the instance. At each node, the
test is applied and the instance moves along the matching branch; this process
repeats recursively for the subtree rooted at the new node.
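The traversal described above can be sketched with a hand-built tree stored as nested dictionaries. The tree structure, the thresholds, and the feature name AVProductsInstalled are hypothetical and chosen only to illustrate how an instance moves from the root to a leaf; they are not taken from the trained model.

```python
def classify(node, instance):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:
        # Internal node: test one attribute and follow the matching branch.
        if instance[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


# Hand-built example tree (not learned from data).
tree = {
    "feature": "IsSxsPassiveMode", "threshold": 0.5,
    "left": {
        "feature": "AVProductsInstalled", "threshold": 1.5,
        "left": {"label": "malware"},
        "right": {"label": "legitimate"},
    },
    "right": {"label": "legitimate"},
}
print(classify(tree, {"IsSxsPassiveMode": 0, "AVProductsInstalled": 2}))
```

Each loop iteration corresponds to descending one level, so the cost of a prediction is proportional to the depth of the tree.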

The two algorithms have many parameters that can be tuned to improve the
accuracy of a classification problem.

A. Random Forest Model

As stated above, Random Forest has many parameters that can be used to train the
model to allow better accuracy. In this experiment, we tried a set of values for each
parameter and considered only the parameters that gave us the best results.

The training phase was done as follows:

We first trained the model with the default value for each parameter and got an
accuracy of 98.4%, an FPR of 0.0233, and an F1-score of 0.98 for both malware and
legitimate files. Then we tuned the parameters to see if we could get better
results. The parameters we tuned are n_estimators, max_features, max_depth,
min_samples_split, min_samples_leaf, and bootstrap.
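One way to organize such a search is to enumerate every combination of candidate values and train one model per combination. The sketch below only generates the combinations; the candidate values are taken from the phases reported in Section 4, but arranging them into a full grid, as well as the helper name iter_settings, is an assumption made for illustration and not necessarily how the tuning was carried out.

```python
from itertools import product

# Candidate values per parameter (illustrative grid).
param_grid = {
    "n_estimators": [100, 250, 400],
    "max_depth": [None, 20, 60],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
}


def iter_settings(grid):
    """Yield one dictionary per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


settings = list(iter_settings(param_grid))
print(len(settings))  # 72
```

Each dictionary could then be passed to the classifier constructor as keyword arguments, keeping the setting with the lowest FPR on held-out data.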

In the end, we were able to get a better result than the previous one. We got the
optimal result with the values shown below:

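from sklearn.ensemble import RandomForestClassifier

# max_features was 'auto' in the original run; for classifiers it is
# equivalent to 'sqrt', the name accepted by recent scikit-learn versions.
rf_model = RandomForestClassifier(n_estimators=400, max_features='sqrt',
                                  max_depth=60, min_samples_split=2,
                                  min_samples_leaf=1, bootstrap=True)
rf_model.fit(X_train, y_train)

# Mean accuracy on the training and test sets
print(f'Train Accuracy - : {rf_model.score(X_train, y_train):.3f}')
print(f'Test Accuracy - : {rf_model.score(X_test, y_test):.3f}')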


The n_estimators parameter defines the number of trees built before taking the
majority vote. The max_features parameter defines the maximum number of features
Random Forest considers when looking for the best split. The max_depth parameter
controls how deep the trees are allowed to grow. The min_samples_split and
min_samples_leaf parameters respectively define the minimum number of samples
required to split an internal node and the minimum number of samples required at
a leaf node. The bootstrap parameter determines whether each tree is trained on a
bootstrap sample of the data or on the whole dataset.

With these values, we got an accuracy of 98.5%, an FPR of 0.0208, and an F1-score
of 0.99 for both malware and legitimate files.

B. Decision Tree Model

We did the same thing during training for the Decision Tree as well. After training
the model with the default values, we used different values for each parameter to
increase the accuracy. After the experiment, we noticed that the results achieved
with the default values were the best ones.

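from sklearn.tree import DecisionTreeClassifier

# Default hyperparameters gave the best results in our experiments
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

# Mean accuracy on the training and test sets
print(f'Train Accuracy - : {dt_model.score(X_train, y_train):.3f}')
print(f'Test Accuracy - : {dt_model.score(X_test, y_test):.3f}')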

We got an accuracy of 0.906, an FPR of 0.177, and an F1-score of 0.91 for both
malware and legitimate files.

3.4.2 K-Nearest Neighbors

KNN [20], like Random Forest and Decision Tree, is a supervised learning
approach. KNN predicts the label of a data point by calculating the distance
between that point and all the training data points. Many distance metrics can be
used, such as Euclidean distance and Manhattan distance.

Figure 3.4.2.1: KNN

The K value gives the number of nearest points to consider for the new data
point. After calculating the distance between the new data point and all the
training data points, we assign it to the class that is most common among its K
nearest neighbors. For KNN, we tuned only the K parameter. After testing
different values of K, we found that the default value (K = 5) gave us the best
result.
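The prediction rule can be sketched in a few lines of plain Python using the Euclidean distance. This is an illustrative brute-force version with made-up points, not the KNeighborsClassifier implementation; the function name knn_predict is hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=5):
    """Label the query with the most common class among its k nearest
    training points (Euclidean distance)."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: dist(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D training set with two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["legitimate", "legitimate", "legitimate", "malware", "malware", "malware"]
print(knn_predict(points, labels, query=(0.95, 1.0), k=3))
```

Since every prediction compares the query with all training points, the cost grows with the size of the training set, which is one reason KNN can be slow on large datasets.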

knn_model = KNeighborsClassifier()

knn_model.fit(X_train,y_train)

print (f'Train Accuracy - : {knn_model.score(X_train,y_train):.3f}')

print (f'Test Accuracy - : {knn_model.score(X_test,y_test):.3f}')


4. Results Analysis

We ran a separate experiment for each model with different parameter values. In
this section, we show only a few parameters with their values, along with their
respective FPR values. We trained the models in many phases. For the sake of
simplicity, we report only three phases per model.

Random Forest

Phase    n_estimators  max_features  max_depth  min_samples_split  min_samples_leaf  bootstrap  FPR
Phase 1  100           auto          None       2                  1                 True       0.0233
Phase 2  400           auto          60         2                  1                 True       0.0208
Phase 3  250           sqrt          20         5                  2                 False      0.277

Decision Tree

Phase    max_depth  max_features  criterion  min_samples_split  min_samples_leaf  splitter  FPR
Phase 1  None       None          gini       2                  1                 best      0.177
Phase 2  4          sqrt          entropy    4                  10                random    0.607
Phase 3  8          log2          entropy    8                  10                best      0.434

KNN

Phase    K   FPR
Phase 1  5   0.395
Phase 2  10  0.374
Phase 3  20  0.409

At the end of the entire experiment, we only kept the parameters that gave us the
best result for each model. The table below shows the summary of the final result.


Model                TPR    FPR     Accuracy  AUC
Decision Tree        0.993  0.177   0.907     0.908
Random Forest        0.992  0.0208  0.985     0.997
K-Nearest Neighbors  0.916  0.395   0.759     0.846

In this work, our goal was to achieve an FPR as close to 0 as possible. From the
values above, we can clearly see that Random Forest gives the best result, with
an FPR of 0.0208 and an F1-score of 0.99 for both malware and legitimate files.
Precision is the ratio of correctly classified positive observations to all
observations classified as positive. Recall is the ratio of correctly classified
positive observations to all observations that actually belong to the positive
class. Finally, the F1-score is the harmonic mean of precision and recall, so it
takes both into account and gives a better intuition about how good our
classifiers are.
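These metrics all follow from the four entries of a confusion matrix. The sketch below computes them for one class treated as positive; the function name and the counts are invented for illustration and are not results from our experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # also called the true positive rate (TPR)
    fpr = fp / (fp + tn)                 # false positive rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fpr": fpr,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts: 100 positive and 100 negative test samples.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A low FPR means that few legitimate files are flagged as malware, while a high recall means that few malware files are missed.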


5. Conclusion and Future Work

Security is one of the biggest challenges in the world of IT systems. Over the
years, researchers have proposed many techniques. Machine Learning can be used to
overcome the limitations of traditional anti-virus systems. In this experiment,
we used three classifiers to detect malware, and we got the best results with
Random Forest. In the near future, we plan to introduce more complex
classification algorithms. Collecting more recent samples of malware and
legitimate files might help to get better results, as might more intensive
feature engineering to improve the quality of the data and select the most
important features. Finally, more extensive hyperparameter tuning might lead to
better results.


References

[1] AV-TEST Institute, https://www.av-test.org/en/statistics/malware/

[2] Simon Kramer and Julian C. Bradfield, “A general definition of malware”.

[3] https://www.malwarebytes.com/malware.

[4] James Scott, “Signature Based Malware Detection is Dead”, 2017.

[5] Andreas Moser, Christopher Kruegel, and Engin Kirda, “Limits of Static Analysis for
Malware Detection”.

[6] Amir Afianian, Salman Niksefat, and Babak Sadeghiyan, “Malware Dynamic
Analysis Evasion Techniques: A Survey”, 2019.

[7] Zane Markel and Michael Bilzor, “Building a Machine Learning Classifier for
Malware Detection”, 2014.

[8] Junho Choi, Hayoung Kim, Chang Choi, and Pankoo Kim, “Efficient Malicious Code
Detection Using N-Gram Analysis and SVM”.

[9] Zhongru Wang, Peixin Cong, and Weiqiang Yu, “Malicious Code Detection and
Technology Based on Metadata Machine Learning”, 2020.

[10] Ivan Firdausi, Charles Lim, Alva Erwin, and Anto Satriyo Nugroho, “Analysis of
Machine Learning used in behavior-based Malware Detection”.

[11] Athiq Reheman Mohammed, G. Sai Viswanath, K. Sai babu, and T. Anuradha,
“Malware Detection in Executable Files Using Machine Learning”.

[12] Nur Syuhada Selamat and Fakariah Hani Mohd Ali, “Comparison of malware detection
techniques using machine learning algorithm”.

[13] Yuval Elovic, Chanan Gleze, and Robert Moskovitch, “Applying Machine Learning
Techniques for Detection of Malicious code in network traffic”.

[14] Mozammel Chowdhury, Azizur Rahman, and Md Rafiqul Islam, “Malware Analysis and
Detection using Data Mining and Machine Learning Classification”.

[15] https://www.kaggle.com/c/microsoft-malware-prediction/overview

[16] Roweida Mohammed, Jumanah Rawashdeh, and Malak Abdullah, “Machine
Learning with Oversampling and Undersampling Techniques: Overview study and
Experimental Results”, 2020.

[17] https://seaborn.pydata.org/generated/seaborn.heatmap.html

[18] Yanli Liu, Yourong Wang, and Jian Zhang, “New Machine Learning Algorithm:
Random Forest”, 2012.

[19] Arundhati Navada, Aamir Nizam Ansari, Siddharth Patil, Balwant A. Sonkamble,
“Overview of use of Decision Tree algorithms in Machine Learning”, 2011.

[20] Lishan Wang, “Research and Implementation of Machine Learning Classifier based
on KNN”, 2019.
