Breast cancer remains one of the most prevalent and deadly diseases affecting women globally. Despite advances in treatment, early detection remains crucial for improving survival rates. Traditional diagnostic methods like biopsies and fine-needle aspirations are effective but invasive, time-consuming, and reliant on human judgment. This project explores the use of machine learning to automate breast tumor classification using digitized cell nuclei features. By applying models such as Random Forest Classifiers, the system achieves high accuracy in distinguishing between benign and malignant tumors. The work demonstrates a complete ML pipeline—from data preprocessing to deployment—and highlights its potential to enhance clinical decision-making, reduce delays, and support scalable healthcare solutions. Ultimately, it bridges medicine and technology in the fight against breast cancer.
Breast cancer originates from faulty cells that grow uncontrollably, often within milk-producing lobules or ducts. Early detection is critical for improving patient survival. Traditional diagnostic methods rely on invasive procedures and expert judgment, which can be time-consuming and error-prone. This project focuses on building and refining a machine learning model to classify breast tumors as benign or malignant using cell nuclei features from digitized fine-needle aspirate images. Key objectives include dataset selection, data preprocessing, exploratory analysis, model development, testing, and fine-tuning.
To train our breast cancer classification model, we used a publicly available dataset from Kaggle, containing 569 instances derived from digitized fine-needle aspirate images of breast masses. Each sample includes 32 columns: an ID, a diagnosis label (benign or malignant), and 30 numerical features describing cell nuclei characteristics.
These features are grouped into three statistical forms:
Mean values
Standard errors
Worst (maximum) values
The core features include:
Radius
Texture
Perimeter
Area
Smoothness
Compactness
Concavity
Concave points
Symmetry
Fractal dimension
This structured dataset enables robust exploratory analysis and supports accurate machine learning classification.
Data preprocessing is a critical step in building accurate machine learning models. It involves cleaning the dataset, handling missing values, encoding categorical variables, and removing irrelevant or noisy features. This ensures the model learns from meaningful patterns and avoids issues like target leakage—where future information mistakenly influences predictions.
For this project:
The dataset was imported using Pandas and visualized with Matplotlib and Seaborn.
Initial inspection revealed the 30 diagnostic feature columns plus an empty trailing column (“Unnamed: 32”), which was dropped along with the non-informative “id” column.
A t-test was conducted using Scipy to identify features with low statistical significance (p > 0.01), which were removed to reduce noise.
Diagnosis labels were mapped: ‘M’ to 1 (malignant) and ‘B’ to 0 (benign).
The data was split into features (X) and labels (y), and visualized using multi-line plots and pair plots to explore distributions and correlations.
This process ensured a cleaner, more focused dataset for training robust and interpretable models.
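The preprocessing steps above can be sketched as follows. To keep the example self-contained, scikit-learn's bundled copy of the same Wisconsin dataset stands in for the Kaggle CSV (an assumption; the bundled copy already lacks the “id” and “Unnamed: 32” columns), and scipy's ttest_ind implements the significance filter:

```python
import pandas as pd
from scipy import stats
from sklearn.datasets import load_breast_cancer

# The Kaggle CSV mirrors scikit-learn's bundled Wisconsin dataset,
# so the bundled copy is used here to keep the sketch self-contained.
data = load_breast_cancer(as_frame=True)
df = data.frame.rename(columns={"target": "diagnosis"})

# scikit-learn encodes 0 = malignant, 1 = benign; re-map so that
# malignant -> 1 and benign -> 0, matching the project's labels.
df["diagnosis"] = df["diagnosis"].map({0: 1, 1: 0})

# Two-sample t-test per feature; drop features with p > 0.01 to reduce noise
malignant = df[df["diagnosis"] == 1]
benign = df[df["diagnosis"] == 0]
weak = [
    col for col in df.columns.drop("diagnosis")
    if stats.ttest_ind(malignant[col], benign[col], equal_var=False).pvalue > 0.01
]
df = df.drop(columns=weak)

# Split into features (X) and labels (y)
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]
print(X.shape)
```

The pair plots and multi-line plots mentioned above would then be drawn from `X` and `y` with Seaborn and Matplotlib.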
Choosing the right machine learning model is essential for accurate predictions. In supervised learning, regression models (e.g., Linear Regression, Ridge Regression) are used for numerical outputs, while classification models (e.g., Logistic Regression, Decision Tree Classifier, Random Forest) are used for categorical outputs.
For this project, four models were trained and evaluated:
Logistic Regression: Chosen for its simplicity, interpretability, probabilistic output, and efficiency—ideal as a baseline model.
Random Forest: Offers high accuracy, handles complex feature interactions, resists overfitting, and provides feature importance insights.
Random Forest with Cross-Validation: Adds robustness and generalizability by averaging multiple trees and validating performance across folds.
XGBoost: Known for superior accuracy, speed, and flexibility in handling structured data and missing values.
Each model was selected to balance interpretability, performance, and scalability for breast cancer classification.
This project trained and evaluated four machine learning models to classify breast tumors using digitized cell nuclei features. Each model was implemented with careful preprocessing, tuning, and validation.
6.1 Logistic Regression
Dataset loaded and cleaned using Pandas.
Diagnosis labels mapped: ‘M’ → 1, ‘B’ → 0.
Unwanted columns dropped; features (X) and target (y) defined.
StandardScaler applied to normalize feature ranges.
Data split into training and test sets (80/20) with random_state=42.
Logistic Regression model trained and evaluated using accuracy, confusion matrix, and classification report.
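The steps above can be sketched as follows; scikit-learn's bundled Wisconsin dataset stands in for the cleaned Kaggle CSV (an assumption), and the scaler is fit on the training split only to avoid leakage:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled Wisconsin dataset stands in for the cleaned Kaggle CSV
X, y = load_breast_cancer(return_X_y=True)

# 80/20 split with the same random_state used in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Normalize feature ranges; fit the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the baseline model and evaluate it
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```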
6.2 Random Forest
Data split into training and validation sets.
Pipeline created with a SimpleImputer and RandomForestClassifier.
Model trained and predictions made.
Decision trees visualized to understand model structure.
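A minimal version of this pipeline might look like the following (the bundled Wisconsin dataset and the hyperparameter values are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Bundled Wisconsin dataset stands in for the cleaned Kaggle CSV
X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pipeline: impute any missing values, then fit the forest
rf_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
rf_pipeline.fit(X_train, y_train)
preds = rf_pipeline.predict(X_valid)
print(f"Validation accuracy: {accuracy_score(y_valid, preds):.4f}")

# A single tree can be inspected with sklearn.tree.plot_tree, e.g.
# plot_tree(rf_pipeline.named_steps["model"].estimators_[0], max_depth=2)
```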
6.3 Random Forest with Cross-Validation
Cross-validation applied using fold values from 2 to 9.
Best performance achieved with 7 folds (MAE = 0.02981).
Function get_score(n_estimators) used to test different tree counts.
Optimal number of trees found to be 140 (balancing accuracy and efficiency).
Final model trained and evaluated using heatmaps and accuracy metrics.
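A sketch of the `get_score(n_estimators)` sweep, assuming the same imputer-plus-forest pipeline and using 0/1 labels so that mean absolute error equals the misclassification rate (the swept tree counts here are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Bundled Wisconsin dataset stands in for the cleaned Kaggle CSV
X, y = load_breast_cancer(return_X_y=True)

def get_score(n_estimators, cv=7):
    """Cross-validated mean absolute error for a given number of trees."""
    pipeline = Pipeline([
        ("imputer", SimpleImputer()),
        ("model", RandomForestClassifier(n_estimators=n_estimators,
                                         random_state=0)),
    ])
    scores = cross_val_score(pipeline, X, y, cv=cv,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()

# Sweep a few tree counts; the project settled on 140 trees with 7 folds
results = {n: get_score(n) for n in (60, 100, 140)}
print(results)
```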
6.4 Gradient Boosting (XGBoost)
XGBClassifier implemented via the scikit-learn API.
Pipeline created and model trained.
Achieved high accuracy (0.97368) and low MAE (0.02631).
Confusion matrix showed only 3 misclassifications out of 114 samples.
Precision, recall, and F1-scores were excellent for both classes.
Model tuned using n_estimators, learning_rate, and early_stopping_rounds.
Each model demonstrated strong performance, with Random Forest (cross-validated) and XGBoost showing the highest accuracy and lowest error rates. These results validate the effectiveness of ensemble methods in medical classification tasks.
Four classification models were trained and evaluated on the breast cancer diagnosis dataset. Each model was assessed using accuracy, mean absolute error (MAE), confusion matrix, and classification reports.
- Logistic Regression
Accuracy: 97.36%
MAE: Not specified
Confusion Matrix: 3 misclassifications out of 114 samples
Conclusion: High precision and recall for both classes. No signs of overfitting (training accuracy: 98%, test accuracy: 97%).
- Random Forest (Default)
Accuracy: 96.49%
MAE: 0.03508
Confusion Matrix: 4 errors out of 114 samples
Conclusion: Strong performance with good class balance. Slightly lower accuracy than other models.
- Random Forest with Cross-Validation
Accuracy: 97.37%
MAE: 0.02628
Confusion Matrix: 15 errors out of 569 samples
Conclusion: Best overall performance. Cross-validation improved generalization and reduced overfitting. Selected as the final model.
- Gradient Boosting (XGBoost)
Accuracy: 97.36%
MAE: 0.02631
Confusion Matrix: 3 errors out of 114 samples
Conclusion: Excellent performance, nearly matching the cross-validated Random Forest. Slightly higher MAE.
🏆 Final Model Selection
The Random Forest Classifier with Cross-Validation was chosen as the most suitable model due to:
Highest accuracy (97.37%)
Lowest MAE (0.02628)
Strong robustness against overfitting, verified across folds
Reliable and interpretable predictions
Scalability for real-world medical applications
This model offers trustworthy diagnostic support and demonstrates how machine learning can contribute meaningfully to healthcare.
Deployment is the final step in making a machine learning model accessible to users and applications. It involves packaging, serving, and hosting the model to ensure reliability, scalability, and usability.
- Model Packaging
The trained model is serialized using pickle or joblib and saved as a .pkl file.
A requirements.txt file is created to preserve the training environment and dependencies.
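The packaging step can be sketched as follows (joblib, the file name `model.pkl`, and the stand-in training data are assumptions):

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Train (or load) the final model; the bundled Wisconsin dataset stands in
# for the project's cleaned data
X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=140, random_state=42).fit(X, y)

# Serialize to disk; the API restores this file at server startup
joblib.dump(model, "model.pkl")
restored = joblib.load("model.pkl")
print((restored.predict(X) == model.predict(X)).all())
```

The dependency snapshot is typically captured with `pip freeze > requirements.txt`.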
- API Development
An API is built using Flask or FastAPI to handle user requests, process input data, and return predictions.
The model is loaded once during server startup for efficient performance.
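A minimal Flask sketch of such an API; the `/predict` route, the JSON shape, and the on-the-spot model creation (used here only to keep the example self-contained) are assumptions:

```python
from pathlib import Path

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# For this self-contained sketch, create model.pkl on the spot if missing;
# in the real deployment the file comes from the packaging step.
if not Path("model.pkl").exists():
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    # re-map labels so 1 = malignant, matching the project's encoding
    joblib.dump(RandomForestClassifier(random_state=42).fit(X, 1 - y),
                "model.pkl")

# Load the model once at server startup, not on every request
model = joblib.load("model.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    # Expected JSON body (an assumption): {"features": [30 numeric values]}
    features = [request.get_json()["features"]]
    label = int(model.predict(features)[0])
    return jsonify({"diagnosis": "malignant" if label == 1 else "benign"})

# Run with, e.g.: flask --app app run --host=0.0.0.0 --port=5000
```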
- Web Application
A simple, user-friendly interface is developed for lab assistants to input data and view predictions.
The UI allows easy interaction with the model’s output.
- Containerization
Docker is used to package the application, dependencies, and configurations into a portable container.
A Dockerfile defines the build instructions for creating the Docker image.
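A minimal Dockerfile for this step might look like the following (the base image, file layout, port, and gunicorn entry point are assumptions; gunicorn would need to appear in requirements.txt):

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install pinned dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and the serialized model
COPY . .

EXPOSE 5000
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]
```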
- Hosting and Scaling
The containerized application is hosted on AWS EC2 for public access.
AWS was chosen for its reliability, scalability, cost-effectiveness, and strong security features.
Breast cancer is a leading cause of mortality worldwide, and early detection is critical for improving survival rates. Traditional diagnostic methods, while effective, are invasive, costly, and reliant on human expertise. This project explores the use of machine learning to automate breast tumor classification, offering faster and scalable clinical decision support.
Using a publicly available Kaggle dataset of 569 samples derived from fine-needle aspirate images, the data was cleaned, encoded, and scaled. Statistical tests and visualizations helped identify the most relevant features for diagnosis.
Several models were trained:
Logistic Regression served as a baseline.
Random Forest emerged as the best-performing model, achieving 96.49% accuracy initially and 97.37% after cross-validation.
Gradient Boosting (XGBoost) performed comparably (97.36% accuracy) but with a slightly higher MAE than the cross-validated Random Forest.
Deployment strategies included:
Serializing the trained model
Building APIs with Flask or FastAPI
Containerizing with Docker
Hosting on AWS EC2 for scalability
This end-to-end project—from data preprocessing to deployment—demonstrated how machine learning can complement traditional medical diagnostics. While no model replaces expert judgment, ML tools can enhance speed, accuracy, and consistency in healthcare when tested on larger, diverse datasets.