Malware Classification

This diploma thesis applies Artificial Intelligence and Machine Learning (ML) algorithms to detect malware in TLS traffic. TLS is used extensively for client-server communication over the Web and encrypts packets to protect messages, transactions and other activities. The novelty of this work is that packets are not decrypted to check whether they contain malware. Instead, ML algorithms are trained on selected features, certificates and metadata to predict with high accuracy whether a flow is malware or not. As a result, users' personal data, such as passwords, is not exposed, since packets are never decrypted. Delays in client-server communication are also minimized, because decrypting each packet separately would add overhead.

Credits

Mergecap by Scott Renfro: tool for merging PCAP files
Mercury by Cisco: tool for extracting TLS metadata
Benign dataset by Stratosphere Lab
Malware dataset by Malware Traffic Analysis

Demo

Merge benign datasets:

geobour98@archlinux:~/thesis-malware-ml/benign_dataset$ mergecap -F pcap -w ben_output.pcap *.pcap

Merge malware datasets:

geobour98@archlinux:~/thesis-malware-ml/malware_dataset$ mergecap -F pcap -w mal_output.pcap *.pcap

Extract TLS metadata from benign datasets:

geobour98@archlinux:~/thesis-malware-ml/benign_dataset$ mercury -r ben_output.pcap --select=tls --metadata --certs-json > ben_output.json

Extract TLS metadata from malware datasets:

geobour98@archlinux:~/thesis-malware-ml/malware_dataset$ mercury -r mal_output.pcap --select=tls --metadata --certs-json > mal_output.json

Modify benign datasets to valid JSON:

geobour98@archlinux:~/thesis-malware-ml/benign_dataset$ python3 final_json.py

Modify malware datasets to valid JSON:

geobour98@archlinux:~/thesis-malware-ml/malware_dataset$ python3 final_json.py
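
The contents of final_json.py are not shown here. As a rough illustration of what "modify to valid JSON" can involve, the sketch below assumes the extracted file holds one JSON object per line and wraps those objects into a single JSON array. The function name `ndjson_to_array` and the sample records are hypothetical, and the real script may work differently.

```python
import json


def ndjson_to_array(lines):
    """Parse one JSON object per non-empty line and return them as a list.

    Assumption: the input is newline-delimited JSON. This is an
    illustrative helper, not the repository's final_json.py.
    """
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report which line is malformed instead of failing silently
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
    return records


# Hypothetical sample records, not real Mercury output
sample = ['{"id": 1}', '', '{"id": 2}']
print(json.dumps(ndjson_to_array(sample)))  # a single JSON array
```

To apply the same idea to a file, pass an open file object as `lines` and write the result with `json.dump`.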

Perform the classification:

geobour98@archlinux:~/thesis-malware-ml$ python3 classify.py

Number of Benign flows:  56691
Number of Malware flows:  30122

Random Forests: 

Precision: 1.000
Recall: 1.000
F1-score: 1.000
Accuracy: 1.000 

KNN: 

Precision: 0.990
Recall: 0.968
F1-score: 0.979
Accuracy: 0.983 

SVM: 

Precision: 0.835
Recall: 0.927
F1-score: 0.879
Accuracy: 0.895
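
The reported metrics can be sanity-checked with a few lines of Python. The sketch below recomputes each F1-score from the printed precision and recall using the standard definition F1 = 2PR / (P + R), and derives the accuracy of always predicting the majority class from the flow counts above. It is a standalone check, not part of classify.py, and because the inputs are already rounded to three decimals the recomputed values could in principle differ in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the output above
reported = {
    "Random Forests": (1.000, 1.000, 1.000),
    "KNN": (0.990, 0.968, 0.979),
    "SVM": (0.835, 0.927, 0.879),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: reported F1={f1:.3f}, recomputed F1={recomputed:.3f}")

# Flow counts copied from the output above
benign_flows, malware_flows = 56691, 30122
majority_baseline = max(benign_flows, malware_flows) / (benign_flows + malware_flows)
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

The baseline gives context for the accuracy figures: a classifier that labels every flow as benign would already be correct for roughly 65% of the flows in this dataset.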

Warning

The code needs several improvements, mostly due to hardcoded values and repeated code. There are no plans to update the code base.
