
🪐 6 - Social Buzz: Academic project for fake news detection using machine learning algorithms applied to Portuguese-language data, focused on testing and comparing supervised models.


Mindful-AI-Assistants/6-social-buzz-ai-fake-news-detection-ml-br


[🇧🇷 Português] [🇺🇸 English]









🎬 Video: Fake News Detection - Machine Learning in the Fight Against Disinformation (.mov)





Course: Humanistic AI & Data Science (4th Semester)
Institution: PUC-SP
Professor: Erick Bacconi





Tip

This repository, 6-social-buzz-ai-fake-news-detection-ml-br, is part of the main project 1-social-buzz-ai-main. To explore all related materials, analyses, and notebooks, visit the main repository.





  • Fake news is false information, spread mainly on social networks, that can cause serious political, social, and public health harm.
  • This study aims to apply Machine Learning (ML) algorithms to automatically detect fake news, offering a technological alternative to address this issue.




  • Test and compare different ML algorithms for detecting fake news.
  • Assess each model's performance in terms of accuracy, sensitivity, and specificity.
  • Propose an automated, replicable, and useful solution for society.




3.1. Dataset


Python + libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, NLTK



import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords

# Load fake and true datasets
fake = pd.read_csv('Fake.csv')
true = pd.read_csv('True.csv')
fake['target'] = 1
true['target'] = 0

# Concatenate and shuffle records
data = pd.concat([fake, true], ignore_index=True)
data = data.sample(frac=1).reset_index(drop=True)

# Remove title and date columns
data.drop(['title', 'date'], axis=1, inplace=True)

# Clean text (lowercase, no punctuation, no stopwords)
nltk.download('stopwords')
stop_words = set(stopwords.words('portuguese'))

def clean_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

data['text'] = data['text'].apply(clean_text)
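To see what clean_text does on a concrete sentence, here is a minimal, self-contained sketch. The small inline stopword set is a stand-in for NLTK's full Portuguese list (so this runs without downloading the corpus); the example sentence is invented for illustration.

```python
import string

# Small stand-in stopword set; the project uses NLTK's full Portuguese list
stop_words = {"de", "a", "o", "que", "e", "é", "em", "um", "uma"}

def clean_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return " ".join(w for w in text.split() if w not in stop_words)

# Lowercased, punctuation stripped, stopwords dropped
print(clean_text("O governo anunciou, em nota, que a medida é falsa!"))
# → governo anunciou nota medida falsa
```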


from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(width=800, height=400, background_color='white').generate(' '.join(data['text']))
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()


from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X = data['text']
y = data['target']

vectorizer = TfidfVectorizer(max_features=5000)
X_vect = vectorizer.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_vect, y, test_size=0.2, random_state=42)
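Note that the snippet above fits the TF-IDF vectorizer on the full corpus before splitting, which lets test-set vocabulary and document frequencies leak into the features. A common alternative, sketched here on an invented toy corpus, is to split the raw text first and fit the vectorizer on the training fold only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus standing in for data['text'] / data['target']
texts = ["governo anuncia medida", "celebridade morre boato",
         "economia cresce ano", "vacina causa boato falso"]
labels = [0, 1, 0, 1]

# Split raw text first, then fit TF-IDF on the training fold only
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42)

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_train_txt)   # fit + transform on train
X_test = vectorizer.transform(X_test_txt)         # transform only on test
print(X_train.shape, X_test.shape)
```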




Five supervised models were trained and evaluated:


| Model | Accuracy | Remarks |
|---|---|---|
| Logistic Regression | 98.92% | High precision; confusion matrix shows a low error rate. |
| Decision Tree | 99.6% | Best overall performance and lowest error. |
| Random Forest | 98.74% | Good performance, consistent confusion matrix. |
| Support Vector Machine | 99.5% | Excellent accuracy and precision; robust text model. |
| K-Nearest Neighbors (KNN) | 60.84% | Low performance; high number of false negatives. |

Metrics were computed from each model's confusion matrix (TP, TN, FP, FN), yielding per-model precision and sensitivity values.




Below are examples of models tested and evaluated.



from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
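KNN's weak result may be partly due to the default Euclidean metric, which behaves poorly on sparse high-dimensional TF-IDF vectors; cosine distance is often a better fit for text. A minimal sketch on an invented toy corpus (not the project's data or reported configuration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy TF-IDF matrix; cosine distance often suits sparse text better than
# the default Euclidean metric used above
texts = ["boato falso espalha", "governo anuncia medida",
         "noticia falsa viral", "economia cresce ano"]
labels = np.array([1, 0, 1, 0])

X = TfidfVectorizer().fit_transform(texts)
knn = KNeighborsClassifier(n_neighbors=1, metric='cosine')
knn.fit(X, labels)
print(knn.predict(X))
```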


import seaborn as sns

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
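The five fit/predict/report blocks above can also be collapsed into a single comparison loop. This sketch uses a synthetic dataset from make_classification as a stand-in for the TF-IDF features, just to show the loop shape; the scores it prints are not the project's reported results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the TF-IDF feature matrix
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.4f}")
```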




  • Accuracy: Proportion of all predictions (fake and real) that are correct.
  • Precision: Proportion of news flagged as fake that is actually fake.
  • Sensitivity (recall): Proportion of actual fake news correctly identified.
  • Specificity: Proportion of actual real news correctly identified.
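These metrics follow directly from the confusion matrix counts. A minimal sketch with invented toy labels (fake = 1 as the positive class, matching the target encoding used earlier):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = fake (positive class), 0 = real
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# sklearn orders the matrix as [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision   = tp / (tp + fp)   # of predicted fake, how many are fake
sensitivity = tp / (tp + fn)   # of actual fake, how many were caught (recall)
specificity = tn / (tn + fp)   # of actual real, how many were recognized

print(precision, sensitivity, specificity)
# → 0.8 0.8 0.8
```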


| Model | Precision | Sensitivity | Specificity |
|---|---|---|---|
| Logistic Regression | 98% | 99% | 98% |
| Decision Tree | 98.5% | 99% | 99% |
| Random Forest | 98.5% | 99% | 98% |
| SVM | 99% | 99% | 99% |
| KNN | 99% | 57% | 19% |




  • Four of the five models achieved accuracy above 98%.
  • KNN performed poorly, mainly due to a high number of false negatives (57% sensitivity).
  • Decision Tree and SVM stood out as the most effective models.
  • Data preprocessing and feature selection were key to the models' success.




  • Study limitations: Difficulty finding standardized datasets (especially in Portuguese) and the scarcity of systems applied to the Brazilian context.
  • Future directions: Test additional algorithms (Naive Bayes, Boosting, K-means, Gradient Descent), apply other validation techniques, expand Portuguese-language datasets, include cross-validation (K-fold, leave-one-out), and develop web applications for public use.
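Two of those directions, K-fold cross-validation and Naive Bayes, can be combined in a few lines. This is a sketch on an invented toy corpus, not part of the study; wrapping the vectorizer in a Pipeline keeps TF-IDF fitting inside each training fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus; 1 = fake, 0 = real
texts = ["noticia falsa boato", "governo anuncia medida",
         "celebridade morre boato", "economia cresce ano",
         "vacina causa boato", "chuva atinge cidade"]
labels = [1, 0, 1, 0, 1, 0]

# Pipeline re-fits TF-IDF on each fold's training split, avoiding leakage
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv)
print(scores.mean())
```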




  • Machine Learning has proven powerful for fake news detection and is crucial for protecting society from the impact of false information.
  • Ongoing research is especially necessary in the Brazilian context.











Monteiro Bastos & Monteiro de Lima (2023). Fake News Detection Using Decision Tree, Support Vector Machine, and K-Nearest Neighbors Algorithms. Revista de Estudos Multidisciplinares XV Encontro Científico da UNDB.



  • For notebook files, detailed tutorials, or enhanced visualizations, please reach out.
  • Interested in Python notebooks simulating these dynamics or advanced Humanistic AI models? Just ask!

🛸 My Contacts Hub



────────────── 🔭⋆ ──────────────

➣➢➤ Back to Top


Copyright 2025 Mindful-AI-Assistants. Code released under the MIT license.
