To construct an email spam detection model using the Naive Bayes algorithm and to evaluate its efficiency as explicit factors are varied.
- import the CountVectorizer tool from the scikit-learn library. CountVectorizer provides a straightforward way to tokenize a collection of documents and to build a vocabulary from the known words it encounters.
- import the names corpora from the corpus data collection of nltk. The names corpora are a large collection of person names recorded in the nltk corpus collection.
- import the WordNetLemmatizer tool from the nltk module. WordNetLemmatizer is an open-source lexical analysis tool that reduces words to their base (lemma) forms.
- import glob and os for data loading and traversal. The glob library is used to traverse the user's file system and to return the files that match the pattern defined in the glob expression. The os module provides a portable interface to operating-system functionality, which allows the program to run on other systems and platforms.
- import the numpy library to carry out array functionalities. The numpy library provides efficient array handling for data processing. A consolidated sketch of these imports is given below.
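Taken together, the imports described in this list might be written as follows (a sketch; it assumes the nltk names and wordnet corpora have already been downloaded via nltk.download):

```python
import glob
import os

import numpy as np
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
```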
The directory used is the Enron directory, which contains the Enron-Spam datasets. The spam filtering performed on the Enron directory follows the Naive Bayes algorithm. The subdirectories, enron-ham and enron-spam, contain user-generated messages sent to their destinations and spam messages, respectively. The Enron dataset has been cleaned with the help of Pandas.
- Data loading
Emails are imported and stored in the emails list, and the corresponding classes are recorded in labels.
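A minimal sketch of how this loading step is commonly done with glob and os; the exact directory layout and the ISO-8859-1 encoding are assumptions, not the report's verbatim code:

```python
import glob
import os

emails, labels = [], []

# Assumed layout: an 'enron' directory with 'enron-spam' and 'enron-ham'
# subdirectories, each holding one email per .txt file.
for label, subdir in [(1, 'enron-spam'), (0, 'enron-ham')]:
    for filename in glob.glob(os.path.join('enron', subdir, '*.txt')):
        with open(filename, 'r', encoding='ISO-8859-1') as infile:
            emails.append(infile.read())
            labels.append(label)  # 1 = spam, 0 = ham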
Figure: 1.1.1 – emails and label data
The helper functions letters_only and clean_text are created to clean the email data extracted from the Enron directory. The WordNetLemmatizer tool and CountVectorizer are then initialized.
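The bodies of letters_only and clean_text are shown in Figure 1.1.1; the following is an assumed reconstruction of how such helpers typically look in this setup, reusing the emails list from the loading sketch above:

```python
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

all_names = set(names.words())   # known person names to filter out
lemmatizer = WordNetLemmatizer()

def letters_only(word):
    # keep only purely alphabetic tokens
    return word.isalpha()

def clean_text(docs):
    # lowercase, drop names and non-letter tokens, lemmatize the rest
    cleaned = []
    for doc in docs:
        cleaned.append(' '.join(
            lemmatizer.lemmatize(word.lower())
            for word in doc.split()
            if letters_only(word) and word not in all_names))
    return cleaned

cleaned_emails = clean_text(emails)
cv = CountVectorizer(stop_words='english', max_features=500)
```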
Result:
The data is loaded into the emails dataset, and a unique id corresponding to each email's class is assigned and stored in labels. The tools and methods needed to process the data have been defined.
- Created tools required to compute the training and testing data
Created the get_prior, get_likelihood and get_posterior methods, along with get_label_index, which takes labels as its argument and records the sample indices belonging to each label.
Figure: 2.1.1 – get prior function
Figure: 2.1.2 – get likelihood function
Figure: 2.1.3 – get posterior function
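The exact function bodies are in Figures 2.1.1–2.1.3; the following is a minimal sketch of how such functions are commonly written, assuming a scipy sparse term-document matrix as input:

```python
from collections import defaultdict
import numpy as np

def get_label_index(labels):
    # group sample indices by their label
    label_index = defaultdict(list)
    for index, label in enumerate(labels):
        label_index[label].append(index)
    return label_index

def get_prior(label_index):
    # class prior: fraction of training samples per label
    prior = {label: len(index) for label, index in label_index.items()}
    total = sum(prior.values())
    return {label: count / total for label, count in prior.items()}

def get_likelihood(term_matrix, label_index, smoothing=1):
    # per-class word likelihoods with additive (Laplace) smoothing
    likelihood = {}
    for label, index in label_index.items():
        counts = np.asarray(term_matrix[index, :].sum(axis=0)).ravel() + smoothing
        likelihood[label] = counts / counts.sum()
    return likelihood

def get_posterior(term_matrix, prior, likelihood):
    # posterior per document, computed in log space to avoid underflow
    posteriors = []
    for i in range(term_matrix.shape[0]):
        row = term_matrix.getrow(i)
        log_post = {label: np.log(p) for label, p in prior.items()}
        for label in log_post:
            log_post[label] += (np.log(likelihood[label][row.indices]) * row.data).sum()
        shift = max(log_post.values())
        probs = {label: np.exp(lp - shift) for label, lp in log_post.items()}
        norm = sum(probs.values())
        posteriors.append({label: p / norm for label, p in probs.items()})
    return posteriors
```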
Result:
The prior, likelihood and posterior methods are defined; they will be used to generate the prediction report of the memory-based model.
- Created training models, calculated accuracy for MultinomialNB and the default memory-based model, and generated a metric report for the Bayesian classifier
The accuracy of the memory-based model is calculated as follows:
Figure: 3.1.1 – accuracy memory based
The accuracy of MultinomialNB is calculated as follows:
Figure: 3.1.2 – accuracy percentage MNNB
A report is generated for the metric scores of the classifier; its support column shows the number of samples belonging to each of the classes 0 and 1.
Figure: 3.1.3 – metric report
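A sketch of how the MultinomialNB accuracy and the metric report might be produced with scikit-learn; the split parameters and the 500-feature vectorizer carry over from the earlier sketches and are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# split the cleaned emails, then vectorize train and test separately
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    cleaned_emails, labels, test_size=0.33, random_state=42)
cv = CountVectorizer(stop_words='english', max_features=500)
term_docs_train = cv.fit_transform(X_train_raw)
term_docs_test = cv.transform(X_test_raw)

clf = MultinomialNB(alpha=1.0, fit_prior=True)
clf.fit(term_docs_train, y_train)

# accuracy of the scikit-learn model on held-out data
print('Accuracy: {0:.1f}%'.format(clf.score(term_docs_test, y_test) * 100))

# precision, recall, f1-score and support per class (0 = ham, 1 = spam)
prediction = clf.predict(term_docs_test)
print(classification_report(y_test, prediction))
```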
Result:
True positive and false positive counts are obtained for the classifier by analysing the report, and from these the true positive and false positive rates are computed. These rates are used to plot a ROC curve to measure the model's efficiency.
Figure: 3.1.4 – true and false positive rate
Figure: 3.1.5 – plot curve
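A sketch of the ROC computation using scikit-learn and matplotlib; the variable names carry over from the sketch above and are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# probability of the positive (spam) class for each test email
pos_prob = clf.predict_proba(term_docs_test)[:, 1]

# true and false positive rates across decision thresholds
fpr, tpr, thresholds = roc_curve(y_test, pos_prob)
print('AUC: {0:.4f}'.format(roc_auc_score(y_test, pos_prob)))

plt.plot(fpr, tpr, color='darkorange', label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()
```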
The roc_auc_score generated for the classifier model is 0.9652, or about 96%; the efficiency of the model is therefore shown to be very high.
- Developed a model to train and test the data over multiple values of max features, smoothing factor and fit prior. Generated a report, auc_record
The model developed for the process is shown below.
Figure: 4.1.1 – final model
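Figure 4.1.1 shows the report's final model; the following is a hedged reconstruction of that kind of sweep, using stratified k-fold cross-validation over max features, smoothing factor (alpha) and fit_prior, with the candidate values below as assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

k_fold = StratifiedKFold(n_splits=10)
cleaned = np.array(cleaned_emails)   # from the data-loading sketch
y = np.array(labels)

auc_record = {}
for train_idx, test_idx in k_fold.split(cleaned, y):
    X_train, X_test = cleaned[train_idx], cleaned[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    for max_features in (2000, 4000, 8000):
        cv = CountVectorizer(stop_words='english', max_features=max_features)
        term_docs_train = cv.fit_transform(X_train)
        term_docs_test = cv.transform(X_test)
        for alpha in (0.5, 1.0, 1.5):
            for fit_prior in (True, False):
                clf = MultinomialNB(alpha=alpha, fit_prior=fit_prior)
                clf.fit(term_docs_train, y_train)
                pos_prob = clf.predict_proba(term_docs_test)[:, 1]
                key = (max_features, alpha, fit_prior)
                # accumulate AUC per factor combination across folds
                auc_record[key] = auc_record.get(key, 0.0) + roc_auc_score(y_test, pos_prob)
```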
Result:
auc_record is the report dataset that stores the AUC generated by the model for each explicitly defined combination of factors.
Figure: 4.1.2 – auc_record
- Displayed the generated report in a tabulated form
Figure: 5.1.1 – final report of model
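One way to tabulate auc_record as in Figure 5.1.1 (a sketch; the column layout is an assumption, and the division by 10 averages the AUC accumulated over the 10 folds):

```python
print('max features  smoothing  fit prior  auc')
for (max_features, alpha, fit_prior), total_auc in sorted(auc_record.items()):
    print('{0: >12}  {1: >9}  {2: >9}  {3:.4f}'.format(
        max_features, alpha, str(fit_prior), total_auc / 10))
```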
Result:
The efficiency of the model deployed at varying max features, smoothing parameters and fit prior values has been computed, recorded and displayed successfully. The results clearly show that the Naive Bayes model is more efficient than the memory-based model and can correctly store and detect vocabularies of 2000 words and above, supporting the highly efficient nature of the Bayesian classifier as stated by Sahami in his research.
- A sample test array containing random emails is checked for spam using the Bayesian classifier
The sample test array is initialized, cleaned using the clean_text and letters_only methods, and then stored as the sparse matrix term_docs_test.
Figure: 6.1.1 – term docs test
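A sketch of this sample test run end-to-end; the sample emails below are placeholders, not the report's actual test array, and cv, clean_text, term_docs_train and y_train carry over from the earlier sketches:

```python
from sklearn.naive_bayes import MultinomialNB

# hypothetical sample emails standing in for the report's test array
sample_emails = [
    "Congratulations! You have won a free prize, claim now",
    "Hi team, please find the meeting notes attached",
]

cleaned_test = clean_text(sample_emails)      # from the cleaning sketch
term_docs_test = cv.transform(cleaned_test)   # sparse test matrix

clf = MultinomialNB(alpha=1.0, fit_prior=True)
clf.fit(term_docs_train, y_train)             # training data from above

prediction = clf.predict(term_docs_test)      # 1 = spam, 0 = ham
print(prediction)
```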
The Bayesian classifier is fitted on the training data to set its prediction probabilities. The sample test matrix term_docs_test is passed as a parameter, and a prediction list is generated giving the class of each email read by the classifier.
Result:
Figure: 6.1.2 – prediction array
The prediction array returns 1 for spam and 0 for not spam. The prediction array generated for term_docs_test is correct to a large extent when compared with the true nature of the test emails. Hence, the Bayesian classifier is validated and verified for its prediction capacity on random test data loaded onto the classifier.