- In this project scenario, I take on the role of a data scientist at a bank, working with comprehensive data on the bank's customers over the preceding six months.
- The dataset covers information such as transaction frequency, amounts, and tenure, among other relevant details.
- The bank's marketing team wants to harness AI/ML to launch a targeted advertising campaign tailored to distinct customer groups.
- The success of this campaign hinges on effectively categorizing customers into a minimum of three distinct groups, a practice commonly referred to as marketing segmentation.
- This segmentation is pivotal for optimizing the conversion rates of marketing campaigns.
- Exploratory Data Analysis
- Data Visualization
- Feature Engineering
- Feature Selection (Lasso CV Feature Importances)
- Clustering (Hierarchical Clustering)
- Principal Component Analysis (PCA)
The dataset, sourced from Kaggle here, provides insights into the usage behavior of approximately 9000 active credit card holders over the past six months. Organized at a customer level, the dataset encompasses 18 behavioral variables that capture diverse aspects of credit card utilization.
- CUST_ID: Identification of the credit card holder
- BALANCE: Balance amount left in customer's account to make purchases
- BALANCE_FREQUENCY: How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES: Amount of purchases made from account
- ONEOFF_PURCHASES: Maximum purchase amount done in one go
- INSTALLMENTS_PURCHASES: Amount of purchase done in installment
- CASH_ADVANCE: Cash advance amount taken by the user
- PURCHASES_FREQUENCY: How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFF_PURCHASES_FREQUENCY: How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASES_INSTALLMENTS_FREQUENCY: How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- CASH_ADVANCE_FREQUENCY: How frequently cash advances are being taken, score between 0 and 1
- CASH_ADVANCE_TRX: Number of Transactions made with "Cash in Advance"
- PURCHASES_TRX: Number of purchase transactions made
- CREDIT_LIMIT: Limit of Credit Card for user
- PAYMENTS: Amount of Payment done by user
- MINIMUM_PAYMENTS: Minimum amount of payments made by user
- PRC_FULL_PAYMENT: Percent of full payment paid by user
- TENURE: Tenure of credit card service for user
I'll start by addressing the dataset's cleanliness: identifying and handling null values, dealing with outliers, and ensuring the consistency of the data.
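The insights below are descriptive statistics that can be reproduced with a standard describe() call; a minimal sketch, assuming the Kaggle file is named CC GENERAL.csv (the loading step is not shown in this section):

```python
import pandas as pd

# Load the Kaggle credit-card dataset (file name is an assumption) and
# print the summary statistics behind the insights listed below.
creditcard_df = pd.read_csv("CC GENERAL.csv")
print(creditcard_df.describe())
```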
Insights
- Mean balance is about $1,564
- Balance frequency is high on average (~0.9), i.e. balances are updated frequently
- Average purchases amount is about $1,000
- Average one-off purchase amount is about $600
- Average purchases frequency is around 0.5
- Average ONEOFF_PURCHASES_FREQUENCY, PURCHASES_INSTALLMENTS_FREQUENCY, and CASH_ADVANCE_FREQUENCY are generally low
- Average credit limit is about $4,500
- Average percent of full payment (PRC_FULL_PAYMENT) is about 15%
- Average tenure is about 11 years
# Plotting missing values
plt.figure(figsize=(10, 5))
sns.barplot(x=creditcard_df.columns, y=creditcard_df.isnull().sum(), palette='Blues')
plt.xticks(rotation=45, ha='right')
plt.title('Missing Data Visualization')
plt.show()
The missing values are in the MINIMUM_PAYMENTS attribute. I therefore decide to impute them with a KNN imputer, where each sample's missing values are filled using the mean value from the n_neighbors nearest neighbors found in the training set.
Using the Inter-Quartile Range (IQR), I follow the approach below to find outliers:
- Calculate the first and third quartiles (Q1 and Q3).
- Compute the interquartile range, IQR = Q3 - Q1.
- Estimate the lower bound: lower bound = Q1 - 1.5 * IQR.
- Estimate the upper bound: upper bound = Q3 + 1.5 * IQR.
- Data points that lie outside the lower and upper bounds are outliers.
def outlier_percent(data):
    # IQR rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are outliers
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    minimum = Q1 - (1.5 * IQR)
    maximum = Q3 + (1.5 * IQR)
    num_outliers = np.sum((data < minimum) | (data > maximum))
    num_total = data.count()
    # Share of outlying observations in the column, as a percentage
    return (num_outliers / num_total) * 100
non_categorical_data = creditcard_df.drop(['CUST_ID'], axis=1)
for column in non_categorical_data.columns:
data = non_categorical_data[column]
percent = round(outlier_percent(data), 2)
print(f'Outliers in "{column}": {percent}%')
First, I set all outliers to NaN so that they are taken care of in the next stage, where I impute the missing values; a sketch of this step is shown below.
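A minimal sketch of that masking step, reusing the same IQR bounds as outlier_percent (the exact code is not shown in the original):

```python
import numpy as np

# Mask IQR outliers as NaN so the KNN imputer in the next step fills them
# together with the genuinely missing values.
for column in non_categorical_data.columns:
    data = non_categorical_data[column]
    Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    non_categorical_data.loc[(data < lower) | (data > upper), column] = np.nan
```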
# imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer()
imp_data = pd.DataFrame(imputer.fit_transform(non_categorical_data), columns=non_categorical_data.columns)
imp_data.isna().sum()
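Since the later EDA and feature-engineering steps operate on creditcard_df, the imputed values presumably need to be written back; a hedged sketch of that step (it is not shown in the original):

```python
# Copy the imputed numeric columns back onto the main dataframe so that
# subsequent plots and engineered features use the cleaned values.
creditcard_df[non_categorical_data.columns] = imp_data.values
```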
# Distribution of each numeric feature (CUST_ID is excluded)
plt.figure(figsize=(20, 50))
for i, column in enumerate(non_categorical_data.columns):
    plt.subplot(len(non_categorical_data.columns), 1, i + 1)
    displot = sns.distplot(creditcard_df[column],
                           kde_kws={"color": "b", "lw": 3, "label": "KDE"},
                           hist_kws={"color": "g"})
    plt.title(column)
plt.tight_layout()
displot.get_figure().savefig("Images/Distplot.png")
Insights
- Mean balance is around $1,500
- 'BALANCE_FREQUENCY' for most customers is close to 1, i.e. the balance is updated frequently
- For 'PURCHASES_FREQUENCY', there are two distinct groups of customers
- For 'ONEOFF_PURCHASES_FREQUENCY' and 'PURCHASES_INSTALLMENTS_FREQUENCY', most users don't make one-off or installment purchases frequently
- A very small number of customers pay their balance in full ('PRC_FULL_PAYMENT' ~0)
- Average credit limit is around $4,500
- Most customers have a tenure of ~11 years
correlations = creditcard_df.corr(numeric_only=True)
f, ax = plt.subplots(figsize = (20, 8))
heatmap = sns.heatmap(correlations, annot = True)
plt.show()
Insights
- 'PURCHASES' is highly correlated with one-off purchases, installment purchases, purchase transactions, credit limit, and payments.
- There is a strong positive correlation between 'PURCHASES_FREQUENCY' and 'PURCHASES_INSTALLMENTS_FREQUENCY'.
# Feature engineering: interaction and ratio features.
# Note: the ratio features can produce inf/NaN when the denominator is zero
# (e.g. customers with no purchases or no cash-advance transactions).
creditcard_df["new_BALANCE_BALANCE_FREQUENCY"] = creditcard_df["BALANCE"] * creditcard_df["BALANCE_FREQUENCY"]
creditcard_df["new_ONEOFF_PURCHASES_PURCHASES"] = creditcard_df["ONEOFF_PURCHASES"] / creditcard_df["PURCHASES"]
creditcard_df["new_INSTALLMENTS_PURCHASES_PURCHASES"] = creditcard_df["INSTALLMENTS_PURCHASES"] / creditcard_df["PURCHASES"]
creditcard_df["new_CASH_ADVANCE_CASH_ADVANCE_FREQUENCY"] = creditcard_df["CASH_ADVANCE"] * creditcard_df["CASH_ADVANCE_FREQUENCY"]
creditcard_df["new_PURCHASES_PURCHASES_FREQUENCY"] = creditcard_df["PURCHASES"] * creditcard_df["PURCHASES_FREQUENCY"]
creditcard_df["new_PURCHASES_ONEOFF_PURCHASES_FREQUENCY"] = creditcard_df["PURCHASES"] * creditcard_df["ONEOFF_PURCHASES_FREQUENCY"]
creditcard_df["new_PURCHASES_PURCHASES_TRX"] = creditcard_df["PURCHASES"] / creditcard_df["PURCHASES_TRX"]
creditcard_df["new_CASH_ADVANCE_CASH_ADVANCE_TRX"] = creditcard_df["CASH_ADVANCE"] / creditcard_df["CASH_ADVANCE_TRX"]
creditcard_df["new_BALANCE_CREDIT_LIMIT"] = creditcard_df["BALANCE"] / creditcard_df["CREDIT_LIMIT"]
creditcard_df["new_PAYMENTS_MINIMUM_PAYMENTS"] = creditcard_df["PAYMENTS"] / creditcard_df["MINIMUM_PAYMENTS"]
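Several of these ratios divide by quantities that can be zero (e.g. PURCHASES, PURCHASES_TRX, CASH_ADVANCE_TRX), which produces inf/NaN values; a small sketch of one way to handle that, assuming a zero ratio is an acceptable default for customers with no such activity:

```python
import numpy as np

# Replace infinities from zero denominators with NaN, then fill with 0
# (assumption: no purchases / no cash-advance transactions => ratio of 0).
new_cols = [c for c in creditcard_df.columns if c.startswith("new_")]
creditcard_df[new_cols] = (creditcard_df[new_cols]
                           .replace([np.inf, -np.inf], np.nan)
                           .fillna(0))
```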
def outlier_thresholds(dataframe, variable):
    # Use the 1st and 99th percentiles as a wide "interquantile" range and
    # derive clipping limits from it
    quartile1 = dataframe[variable].quantile(0.01)
    quartile3 = dataframe[variable].quantile(0.99)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def replace_with_thresholds(dataframe, variable):
    # Clip values outside the limits back to the limits (winsorization)
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

# Apply the clipping to every numeric column
for col in creditcard_df.select_dtypes(include="number").columns:
    replace_with_thresholds(creditcard_df, col)
plt.figure(figsize=(10,5))
sns.boxplot(data=creditcard_df)
plt.xticks(rotation=90)
plt.show()
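The feature-selection, clustering, and PCA steps below operate on data_scaled, which is not constructed in the snippets above; a minimal sketch using StandardScaler (the exact scaling method is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize the engineered, outlier-clipped features so that no single
# large-scale column dominates Lasso, clustering, or PCA; CUST_ID is dropped.
features = creditcard_df.drop(columns=["CUST_ID"], errors="ignore")
data_scaled = pd.DataFrame(StandardScaler().fit_transform(features),
                           columns=features.columns)
```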
from sklearn.linear_model import LassoCV

X = data_scaled.drop(["BALANCE", "new_BALANCE_BALANCE_FREQUENCY", "new_BALANCE_CREDIT_LIMIT", "BALANCE_FREQUENCY"], axis=1)  # Feature matrix
y = data_scaled["BALANCE"]  # Target variable

reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" % reg.score(X, y))

coef = pd.Series(reg.coef_, index=X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +
      str(sum(coef == 0)) + " variables")

imp_coef = coef.sort_values()
lasso_FE = imp_coef.plot(kind="barh")
plt.title("Feature importance using Lasso Model")
plt.show()
Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
- Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
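The dendrogram and clustering code below use X_principal, a two-component projection of the data that is not constructed above; a minimal sketch, assuming it is the PCA projection of data_scaled with columns named P1 and P2 (as referenced by the plotting code):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Project the scaled features onto two principal components for the
# dendrogram, silhouette analysis, and cluster visualization below.
X_principal = pd.DataFrame(PCA(n_components=2).fit_transform(data_scaled),
                           columns=["P1", "P2"])
```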
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(6, 6))
plt.title('Visualising the data')
Dendrogram = shc.dendrogram(shc.linkage(X_principal, method='ward'))
plt.show()
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Compare silhouette scores for different numbers of clusters
silhouette_scores = []
for n_cluster in range(2, 8):
    silhouette_scores.append(
        silhouette_score(X_principal,
                         AgglomerativeClustering(n_clusters=n_cluster).fit_predict(X_principal)))

# Plotting a bar graph to compare the results
k = [2, 3, 4, 5, 6, 7]
plt.bar(k, silhouette_scores)
plt.xlabel('Number of clusters', fontsize=10)
plt.ylabel('Silhouette Score', fontsize=10)
plt.show()
Insights
- Therefore, the optimal number of clusters for this particular dataset is 3 or 4. Let us now build and visualize the clustering model for k = 3.
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X_principal)

# Visualizing the clustering, colored by the fitted cluster labels
plt.scatter(X_principal['P1'], X_principal['P2'], c=labels, cmap=plt.cm.winter)
plt.show()
- PCA is an unsupervised ML algorithm.
- It performs dimensionality reduction while attempting to preserve as much of the original information (variance) as possible.
- It works by finding a new set of features called components.
- Components are uncorrelated composites of the given input features.
pca = PCA(n_components = 2)
principal_comp = pca.fit_transform(data_scaled)
principal_comp
# Create a dataframe with the two components
pca_df = pd.DataFrame(data = principal_comp, columns =['pca1','pca2'])
pca_df.head()
# Concatenate the clusters labels to the dataframe
pca_df = pd.concat([pca_df, pd.DataFrame({'cluster':labels})], axis = 1)
pca_df.head()
plt.figure(figsize=(20, 8))
ax = sns.scatterplot(x="pca1", y="pca2", hue="cluster", data=pca_df, palette=['red', 'green', 'blue'])
plt.show()
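To turn these clusters into actionable marketing segments, the cluster labels can be joined back to the original behavioral features and profiled; a hedged sketch (this step is not shown above, and the column selection is illustrative):

```python
# Average behavior per cluster, using the agglomerative labels from above.
profile_cols = ["BALANCE", "PURCHASES", "CASH_ADVANCE",
                "CREDIT_LIMIT", "PAYMENTS", "PRC_FULL_PAYMENT"]
cluster_profile = (creditcard_df.assign(cluster=labels)
                   .groupby("cluster")[profile_cols].mean())
print(cluster_profile.round(2))
```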