Contributors: Batuhan Avci, Kirill Zhukovsky
This project implements machine learning models that classify breast tumors as malignant or benign using the Breast Cancer Wisconsin (Diagnostic) Dataset. The dataset contains measurements from digitized images of fine needle aspirates (FNA) of breast masses; the features describe characteristics of the cell nuclei visible in each image.
- Source: Breast Cancer Wisconsin (Diagnostic) Dataset
- Size: 569 samples
- Classes: Binary (Malignant/Benign)
- Features: 30 real-valued features computed from cell nuclei images
- Feature Categories:
  - Mean values
  - Standard error values
  - "Worst" values (mean of the three largest values)
Ten characteristics are measured for each cell nucleus, and each one is reported in all three categories above (10 × 3 = 30 features):
- Radius
- Texture
- Perimeter
- Area
- Smoothness
- Compactness
- Concavity
- Concave points
- Symmetry
- Fractal dimension
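To make the dataset layout concrete, here is a minimal sketch that loads a copy of the data and checks the counts listed above. It assumes scikit-learn's bundled `load_breast_cancer` copy is an acceptable stand-in; the project itself may read the data from a different file.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for whatever file
# the project actually loads.
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = list(data.feature_names)

print(X.shape)  # (number of samples, number of features)

# scikit-learn names the three categories "mean ...", "... error", "worst ...".
mean_cols = [n for n in feature_names if n.startswith("mean ")]
error_cols = [n for n in feature_names if n.endswith(" error")]
worst_cols = [n for n in feature_names if n.startswith("worst ")]
print(len(mean_cols), len(error_cols), len(worst_cols))
```

Note that the bundled copy encodes malignant as 0 and benign as 1, which is the reverse of the M=1, B=0 mapping used in this project.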
- Loading and Cleaning: Removing unnecessary columns and handling missing values.
- Mapping Diagnosis Labels: Converting labels (M=1, B=0) for binary classification.
- Feature Scaling: Standardizing data using `StandardScaler` to normalize feature distributions.
- Feature Selection: Removing highly correlated features to prevent redundancy.
- Splitting Data: Using a 70-30 train-test split, followed by 5-fold cross-validation on the training set.
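The preprocessing steps above could be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: the data comes from scikit-learn's bundled copy, the correlation cut-off `CORR_THRESHOLD = 0.9` and `random_state=42` are hypothetical values, and the scaler is fitted after the split (on the training part only) to avoid leaking test statistics, which may differ from the order the notebook uses.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

CORR_THRESHOLD = 0.9  # hypothetical cut-off; the project's value is not stated

data = load_breast_cancer(as_frame=True)
X = data.data.copy()
# The bundled copy encodes malignant as 0 and benign as 1;
# flip it to match the M=1, B=0 mapping described above.
y = 1 - data.target

# Cleaning: drop fully empty columns, then rows with missing values.
X = X.dropna(axis=1, how="all").dropna(axis=0)
y = y.loc[X.index]

# Feature selection: for each highly correlated pair, drop the later column.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > CORR_THRESHOLD).any()]
X_reduced = X.drop(columns=to_drop)

# 70-30 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"dropped {len(to_drop)} correlated features, kept {X_reduced.shape[1]}")
print(f"train: {len(X_train)} samples, test: {len(X_test)} samples")
```

The 5-fold cross-validation would then run on `X_train_scaled` and `y_train` only, leaving the test part untouched until the final evaluation.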
Logistic Regression:
- A widely used statistical model for binary classification.
- Fitted by minimizing the logistic (log) loss.
- Provides interpretable results through its feature coefficients.
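A minimal sketch of the logistic regression model listed in the results, assuming scikit-learn's `LogisticRegression` with default regularization, the bundled dataset copy, all 30 features, and a hypothetical `random_state=42`. It shows how the logistic loss and the coefficients can be inspected; the numbers it prints are not the ones reported in this README.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, 1 - data.target  # flip so that malignant = 1, benign = 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scaling inside the pipeline keeps the coefficients comparable across features.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Logistic loss on held-out data (training minimizes this loss plus,
# by default, an L2 regularization term).
proba = model.predict_proba(X_test)
print("test log loss:", log_loss(y_test, proba))

# Coefficients on standardized features: the sign gives the direction,
# the magnitude gives the strength of each feature's contribution.
coefs = model.named_steps["logisticregression"].coef_[0]
top = np.argsort(np.abs(coefs))[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} {coefs[i]:+.3f}")
```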
Random Forest:
- An ensemble learning method that constructs multiple decision trees.
- Uses Gini impurity as the criterion for split quality.
- More robust to outliers and able to capture non-linear relationships.
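To illustrate the random forest points above, the sketch below first computes Gini impurity by hand for a tiny label list, then fits scikit-learn's `RandomForestClassifier` with `criterion="gini"`. The helper `gini_impurity`, the tree count `n_estimators=100`, and `random_state=42` are assumptions made for illustration, not the project's settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


# A perfectly mixed node has impurity 0.5; a pure node has impurity 0.
print(gini_impurity([0, 0, 1, 1]), gini_impurity([1, 1, 1]))

data = load_breast_cancer()
X, y = data.data, 1 - data.target  # flip so that malignant = 1, benign = 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Trees need no feature scaling, so the raw measurements are used directly.
forest = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=42)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# Impurity-based importances sum to 1 across all features.
ranked = sorted(zip(forest.feature_importances_, data.feature_names), reverse=True)
for score, name in ranked[:5]:
    print(f"{name:<25} {score:.3f}")
```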
The project is implemented in Python using the following libraries:
- `pandas` - Data manipulation and analysis
- `scikit-learn` - Machine learning algorithms
- `numpy` - Numerical computations
- `matplotlib`/`seaborn` - Data visualization
Both models performed well in classifying breast cancer samples.
| Model | Test Accuracy | Cross-Validation Accuracy | False Positives | False Negatives |
|---|---|---|---|---|
| Logistic Regression | 96.5% | 96.7% | 5 | 1 |
| Random Forest | 95.9% | 95.5% | 3 | 4 |
Confusion matrix for Logistic Regression (test set):

| Actual \ Predicted | Benign (0) | Malignant (1) |
|---|---|---|
| Benign (0) | 103 | 5 |
| Malignant (1) | 1 | 62 |

Confusion matrix for Random Forest (test set):

| Actual \ Predicted | Benign (0) | Malignant (1) |
|---|---|---|
| Benign (0) | 105 | 3 |
| Malignant (1) | 4 | 59 |
- Logistic Regression had fewer false negatives (1 vs. 4), making it more reliable for detecting malignant cases.
- Random Forest had fewer false positives (3 vs. 5), meaning it reduced unnecessary alarms for benign cases.
- The close match between cross-validation accuracy and test accuracy suggests good generalization without overfitting.
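The figures in the results table follow directly from the two confusion matrices. As a worked check, the sketch below recomputes accuracy and derives recall, precision, and specificity from the counts reported above; the helper name `summarize` is chosen here for illustration.

```python
# Rows = actual, columns = predicted, order = [benign (0), malignant (1)].
confusion = {
    "Logistic Regression": [[103, 5], [1, 62]],
    "Random Forest": [[105, 3], [4, 59]],
}


def summarize(matrix):
    """Derive standard metrics from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),        # share of malignant cases caught
        "precision": tp / (tp + fp),     # share of malignant calls that are right
        "specificity": tn / (tn + fp),   # share of benign cases cleared
    }


metrics = {name: summarize(m) for name, m in confusion.items()}
for name, m in metrics.items():
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With these counts, recall on malignant cases works out to about 98.4% (62/63) for Logistic Regression and 93.7% (59/63) for Random Forest, which is the gap behind the false-negative comparison above.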
Ensure you have the following dependencies installed:
- Python 3.x
- pandas
- scikit-learn
- numpy
- matplotlib
- seaborn
Install all dependencies using `pip install -r requirements.txt`.
- Clone the repository: `git clone https://github.com/your-repo/breast-cancer-detection.git`, then `cd breast-cancer-detection`
- Install dependencies: `pip install -r requirements.txt`
- Run the Jupyter Notebook: `jupyter notebook code/main.ipynb`
- Implement additional machine learning algorithms (e.g., SVM, Neural Networks).
- Conduct hyperparameter tuning for improved accuracy.
- Apply cross-validation with multiple metrics to refine evaluation.
- Develop a web interface for real-time breast cancer diagnosis.
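As one possible starting point for the multi-metric evaluation item, the sketch below uses scikit-learn's `cross_validate` with several scorers at once. The choice of metrics, the use of `StratifiedKFold` with shuffling, and `random_state=42` are assumptions; this is not part of the current project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, 1 - data.target  # flip so that malignant = 1, benign = 0

# Scaling lives inside the pipeline so it is refitted on each training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

results = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)
for metric in scoring:
    scores = results[f"test_{metric}"]
    print(f"{metric:<10} mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting recall alongside accuracy matters here because a false negative (a missed malignant case) is usually the costlier error.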
Contributions are welcome! If you wish to improve the project:
- Fork the repository.
- Create a feature branch.
- Submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
- UCI Machine Learning Repository for the Breast Cancer dataset.
- scikit-learn documentation and community.
- Open source contributors who maintain Python ML libraries.