Feature Engineering and Feature Selection are critical data preparation tasks in machine learning that significantly impact model performance.
Feature Engineering is the process of creating new features from existing data.
Feature Selection is the process of selecting a subset of features from a dataset.
These techniques strengthen key aspects of your ML pipeline:
- Improve Performance: Create features that are more relevant to the target variable.
- Reduce Overfitting: Limit the number of redundant features.
- Enhance Interpretability: Simplify models, making them easier to understand.
Roadmap:
- Feature Engineering Techniques
- Feature Selection Methods
- Visualizations
- Code Snippets & Examples
- Model Selection & Evaluation
- Contributing
- License
- Connect with Me
Common strategies include:
- Combination: Creating new features by combining existing ones (e.g., summing two columns).
- Transformation: Applying mathematical transformations (e.g., square root, log, Box-Cox).
- Discretization: Converting continuous variables into bins or buckets.
- Polynomial Features: Powers of existing features (e.g., $age^2$).
- Interaction Features: Products of features (e.g., $age \times gender$).
- Time Series Features: Extracting time-based components (e.g., day of week, lag features).
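A minimal sketch of a few of these strategies, using pandas and scikit-learn on a small hypothetical DataFrame (the column names and values are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical dataset
df = pd.DataFrame({
    'age': [23, 35, 47, 52, 61],
    'income': [28000, 52000, 61000, 75000, 90000],
    'rooms': [2, 3, 4, 4, 5],
})

# Combination: derive a new feature from two existing columns
df['income_per_room'] = df['income'] / df['rooms']

# Transformation: log transform to reduce skew
df['log_income'] = np.log1p(df['income'])

# Discretization: bucket a continuous variable into bins
df['age_bin'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['young', 'middle', 'senior'])

# Polynomial and interaction features: age^2, age*income, income^2, ...
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['age', 'income']])
print(poly.get_feature_names_out(['age', 'income']))
print(df.head())
```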
| Class/Function | Description |
|---|---|
| SelectKBest | Selects the top K features based on a scoring function. |
| chi2 | Chi-squared stats of non-negative features for classification tasks. |
| SelectPercentile | Selects the top percentile of features based on a scoring function. |
| SelectFromModel | Meta-transformer for selecting features based on importance weights. |
| RFE | Recursive Feature Elimination; removes the weakest features iteratively. |
| RFECV | RFE with cross-validation to select the best number of features. |
| SequentialFeatureSelector | Transformer that performs sequential feature selection (forward or backward). |
| mutual_info_regression | Estimates mutual information for a continuous target variable. |
| mutual_info_classif | Estimates mutual information for a discrete target variable. |
| f_regression | Univariate linear regression tests returning F-values and p-values. |
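These scoring functions can also be used on their own to rank features; a minimal sketch with mutual_info_classif on the iris dataset (chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Estimate mutual information between each feature and the class label
mi_scores = mutual_info_classif(X, y, random_state=42)
for name, score in zip(load_iris().feature_names, mi_scores):
    print(f"{name}: {score:.3f}")
```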
| Method Type | Description | Scikit-learn Class Examples |
|---|---|---|
| Filter Methods | Select features based on statistical measures (e.g., correlation, chi-square). Fast and generic. | SelectKBest, SelectPercentile, chi2, f_regression, mutual_info_classif |
| Wrapper Methods | Evaluate subsets of features by training a model. Computationally expensive but accurate. | RFECV, SequentialFeatureSelector |
| Embedded Methods | Perform feature selection during model training (e.g., regularization). | SelectFromModel, LassoCV, RandomForestClassifier, XGBRegressor |
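Filter and wrapper methods are demonstrated in the code snippets below; as a sketch of the embedded approach, SelectFromModel keeps only the features to which a regularized model assigns non-negligible weight (the breast cancer dataset and C value here are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# L1-regularized logistic regression drives some coefficients to zero;
# SelectFromModel keeps only the features with non-zero importance.
lasso_logit = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(lasso_logit)
X_selected = selector.fit_transform(X, y)

print("Original feature count:", X.shape[1])
print("Selected feature count:", X_selected.shape[1])
```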
Charts that assist in deciding which features to keep or engineer:
| Chart | Purpose |
|---|---|
| Correlation Heatmap | Identify redundant features by visualizing correlations. |
| Box Plot | Detect outliers and understand feature distribution. |
| Scatter Plot Matrix | Visualize relationships between multiple features and the target. |
| Decision Tree Viz | See which features the model prioritizes for splitting. |
| PCA Plot | Understand high-dimensional data structure and clusters. |
| Feature Importance Plot | Bar chart showing the relative importance scores of features. |
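A minimal sketch of two of these charts, assuming seaborn and matplotlib are available (the dataset choice is illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Correlation heatmap: spot highly correlated (redundant) features
sns.heatmap(X.corr(), cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()

# Feature importance plot: relative importance scores from a tree ensemble
model = RandomForestClassifier(random_state=42).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values()
importances.plot(kind='barh', title='Feature Importance')
plt.show()
```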
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Apply SelectKBest feature selection
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)

# Print selected feature indices
print(selector.get_support(indices=True))
```
```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2

X, y = datasets.load_wine(return_X_y=True)

# chi2 on its own returns the chi-squared statistics and p-values for each feature
chi2_stats, p_values = chi2(X, y)

# In practice, SelectKBest usually wraps chi2 to perform the selection:
selector = SelectKBest(chi2, k=5).fit(X, y)
print(selector.get_support(indices=True))
# If X were a DataFrame, the selected names would be X.columns[selector.get_support()]
```
```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()

# Select the top 5 features (X_train, y_train assumed from an earlier train/test split)
rfe = RFE(rf, n_features_to_select=5)
rfe.fit(X_train, y_train)

print("Selected Features:", rfe.support_)
print("Ranking:", rfe.ranking_)
```
```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

selector = RFECV(estimator=LogisticRegression(), step=1, cv=5, scoring='accuracy')
selector.fit(X, y)

print("Optimal number of features: %d" % selector.n_features_)
print("Selected features indices:", selector.get_support(indices=True))
```
```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)

# Forward selection (the default direction) of 2 features using a KNN classifier
sfs = SequentialFeatureSelector(knn, n_features_to_select=2)
X_new = sfs.fit_transform(X, y)

print(sfs.get_support(indices=True))
```
```python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

model = RandomForestClassifier()
model.fit(X, y)

# Combine feature names and their importance scores (X assumed to be a DataFrame)
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
print(feature_importances.sort_values(ascending=False).head(5))
```

The sklearn.model_selection module is essential for validating your feature engineering efforts.
| Function | Description |
|---|---|
| train_test_split | Split arrays or matrices into random train and test subsets. |
| cross_val_score | Evaluate a score by cross-validation. |
| cross_validate | Evaluate multiple metrics by cross-validation. |
| GridSearchCV | Exhaustive search over specified parameter values for an estimator. |
| RandomizedSearchCV | Randomized search over hyperparameters. |
| KFold / StratifiedKFold | K-Folds cross-validator (Stratified preserves class percentages). |
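Before the exhaustive grid search shown next, the basic validation workflow combines train_test_split with cross_val_score; a minimal sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set for a final, unbiased evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training set
scores = cross_val_score(RandomForestClassifier(random_state=42), X_train, y_train, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```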
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate dummy data
X, y = make_classification(n_samples=1000, n_features=20)

# Define model and parameter grid
rf = RandomForestClassifier()
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Fit and print best results
grid_search.fit(X, y)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_:.4f}")
```

Beyond Scikit-learn, several specialized libraries excel at specific feature engineering tasks.
The category_encoders library provides advanced encoding methods for categorical variables, often superior to standard OneHot or Label encoding.
| Encoder | Description |
|---|---|
| BinaryEncoder | Encodes categorical features into binary codes (fewer columns than OneHot). |
| TargetEncoder | Encodes categories based on the mean of the target variable (great for high cardinality). |
| WOEEncoder | Weight of Evidence encoding; measures the "strength" of a grouping for separating good and bad outcomes. |
| HashingEncoder | Feature hashing of categories; useful for high cardinality and online learning. |
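A minimal sketch of BinaryEncoder on a hypothetical column (the city values are illustrative); the TargetEncoder example follows it:

```python
import category_encoders as ce
import pandas as pd

# Hypothetical high-cardinality categorical column
X = pd.DataFrame({'city': ['New York', 'Paris', 'Tokyo', 'Paris', 'London', 'Berlin']})

# BinaryEncoder represents each category with a handful of binary columns
encoder = ce.BinaryEncoder(cols=['city'])
X_encoded = encoder.fit_transform(X)
print(X_encoded)
```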
```python
import category_encoders as ce
import pandas as pd

X = pd.DataFrame({'city': ['New York', 'Paris', 'Tokyo', 'Paris', 'New York'], 'price': [200, 150, 300, 160, 210]})
y = [1, 0, 1, 0, 1]  # Target

# Target Encoding Example
encoder = ce.TargetEncoder(cols=['city'])
X_encoded = encoder.fit_transform(X, y)
print(X_encoded)
```

Feature-engine is a library specifically designed for feature engineering with a scikit-learn compatible API.
| Module | Description |
|---|---|
| imputation | Methods like MeanMedianImputer and EndTailImputer for handling missing data. |
| encoding | RareLabelEncoder (groups rare categories), MeanEncoder, CountFrequencyEncoder. |
| discretisation | EqualFrequencyDiscretiser, DecisionTreeDiscretiser. |
| outliers | Winsorizer to cap outliers at given quantiles. |
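A minimal sketch of the outliers module's Winsorizer (the income values are purely illustrative, assuming a recent feature_engine version); the RareLabelEncoder example follows:

```python
from feature_engine.outliers import Winsorizer
import pandas as pd

# Illustrative data with an extreme outlier
X = pd.DataFrame({'income': [25.0, 27.0, 30.0, 32.0, 35.0, 38.0, 40.0, 42.0, 45.0, 500.0]})

# Cap values below the 10th and above the 90th percentile
capper = Winsorizer(capping_method='quantiles', tail='both', fold=0.1, variables=['income'])
X_capped = capper.fit_transform(X)
print(X_capped['income'].max())
```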
```python
from feature_engine.encoding import RareLabelEncoder
import pandas as pd

# Data with rare category 'E'
X = pd.DataFrame({'category': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E']})

# Group categories occurring in less than 20% of rows into 'Rare'
encoder = RareLabelEncoder(tol=0.2, n_categories=1, replace_with='Rare')
X_encoded = encoder.fit_transform(X)
print(X_encoded['category'].value_counts())
```

While imbalanced-learn (imblearn) is primarily a sampling library, it is crucial for feature preparation on imbalanced datasets.
| Method | Description |
|---|---|
| SMOTE | Synthetic Minority Over-sampling Technique; generates synthetic samples. |
| NearMiss | Under-sampling based on k-nearest neighbors. |
| ADASYN | Adaptive Synthetic sampling; focuses on "hard to learn" examples. |
```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Create an imbalanced binary classification dataset
X, y = make_classification(n_classes=2, weights=[0.9, 0.1], n_samples=1000)
print(f"Original class distribution: {Counter(y)}")

# Resample using SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print(f"Resampled class distribution: {Counter(y_res)}")
```

Although SHAP is primarily an interpretability tool, SHAP values are powerful for feature selection because they reveal each feature's true contribution to predictions.
```python
import shap
import xgboost as xgb

# Train a model (X, y assumed from the earlier examples)
model = xgb.XGBClassifier().fit(X, y)

# Explain predictions
explainer = shap.Explainer(model)
shap_values = explainer(X)

# Visualize feature importance (beeswarm plot)
shap.plots.beeswarm(shap_values)
```

This repository is open source and contributions are welcome! If you have any ideas, hacks, tips, or find errors:
- Fork the repository.
- Create a branch.
- Submit a Pull Request.
- Or simply open an Issue.
This project is licensed under the MIT License - see the LICENSE file for details.


