- In this project scenario, I take on the role of a data scientist at a bank, working with comprehensive data on the bank's customers over the preceding six months.
- The dataset covers information such as transaction frequency, amounts, and tenure, among other relevant details.
- The bank's marketing team wants to harness AI/ML to launch a targeted advertising campaign tailored to distinct customer groups.
- The success of this campaign hinges on effectively categorizing customers into a minimum of three distinct groups, a practice commonly referred to as marketing segmentation.
- This segmentation is pivotal for optimizing the conversion rates of marketing campaigns.
- Exploratory Data Analysis
- Data Visualization
- Feature Engineering
- Feature Selection (Lasso CV Feature Importances)
- Clustering (Hierarchical Clustering)
- Principal Component Analysis (PCA)
The dataset, sourced from Kaggle here, provides insights into the usage behavior of approximately 9000 active credit card holders over the past six months. Organized at a customer level, the dataset encompasses 18 behavioral variables that capture diverse aspects of credit card utilization.
- CUST_ID: Identification of the credit card holder
- BALANCE: Balance amount left in customer's account to make purchases
- BALANCE_FREQUENCY: How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES: Amount of purchases made from account
- ONEOFF_PURCHASES: Maximum purchase amount done in one go
- INSTALLMENTS_PURCHASES: Amount of purchase done in installment
- CASH_ADVANCE: Cash advance amount taken by the user
- PURCHASES_FREQUENCY: How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFF_PURCHASES_FREQUENCY: How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASES_INSTALLMENTS_FREQUENCY: How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- CASH_ADVANCE_FREQUENCY: How frequently cash advances are being taken, score between 0 and 1
- CASH_ADVANCE_TRX: Number of Transactions made with "Cash in Advance"
- PURCHASES_TRX: Number of purchase transactions made
- CREDIT_LIMIT: Limit of Credit Card for user
- PAYMENTS: Amount of Payment done by user
- MINIMUM_PAYMENTS: Minimum amount of payments made by user
- PRC_FULL_PAYMENT: Percent of full payment paid by user
- TENURE: Tenure of credit card service for user
I'll start by addressing the dataset's cleanliness: identifying and handling null values, dealing with outliers, and ensuring the consistency of the data.
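The insights below are descriptive statistics that can be reproduced with a standard describe() call; a minimal sketch, assuming the Kaggle file is named CC GENERAL.csv (the loading step is not shown in this section):

```python
import pandas as pd

# Load the Kaggle credit-card dataset (file name is an assumption) and
# print the summary statistics behind the insights listed below.
creditcard_df = pd.read_csv("CC GENERAL.csv")
print(creditcard_df.describe())
```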
Insights
- Mean balance is about $1,564
- Balance frequency is high on average (~0.9), i.e. balances are updated frequently
- Average purchases amount is about $1,000
- Average one-off purchase amount is about $600
- Average purchases frequency is around 0.5
- Average ONEOFF_PURCHASES_FREQUENCY, PURCHASES_INSTALLMENTS_FREQUENCY, and CASH_ADVANCE_FREQUENCY are generally low
- Average credit limit is about $4,500
- Average percent of full payment (PRC_FULL_PAYMENT) is about 15%
- Average tenure is about 11 years
# Plotting missing values
plt.figure(figsize=(10, 5))
sns.barplot(x=creditcard_df.columns, y=creditcard_df.isnull().sum(), palette='Blues')
plt.xticks(rotation=45, ha='right')
plt.title('Missing Data Visualization')
plt.show()
The missing values are in the MINIMUM_PAYMENTS attribute. I therefore decide to impute them with a KNN imputer, where each sample's missing values are filled using the mean value from the n_neighbors nearest neighbors found in the training set.
Using the Inter-Quartile Range (IQR), I follow the approach below to find outliers:
- Calculate the first and third quartiles (Q1 and Q3).
- Compute the interquartile range, IQR = Q3 - Q1.
- Estimate the lower bound: lower bound = Q1 - 1.5 * IQR.
- Estimate the upper bound: upper bound = Q3 + 1.5 * IQR.
- Data points that lie outside the lower and upper bounds are outliers.
def outlier_percent(data):
    # IQR rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are outliers
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    minimum = Q1 - (1.5 * IQR)
    maximum = Q3 + (1.5 * IQR)
    num_outliers = np.sum((data < minimum) | (data > maximum))
    num_total = data.count()
    # Share of outlying observations in the column, as a percentage
    return (num_outliers / num_total) * 100
non_categorical_data = creditcard_df.drop(['CUST_ID'], axis=1)
for column in non_categorical_data.columns:
data = non_categorical_data[column]
percent = round(outlier_percent(data), 2)
print(f'Outliers in "{column}": {percent}%')
First, I set all outliers to NaN so that they are taken care of in the next stage, where I impute the missing values; a sketch of this step is shown below.
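A minimal sketch of that masking step, reusing the same IQR bounds as outlier_percent (the exact code is not shown in the original):

```python
import numpy as np

# Mask IQR outliers as NaN so the KNN imputer in the next step fills them
# together with the genuinely missing values.
for column in non_categorical_data.columns:
    data = non_categorical_data[column]
    Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    non_categorical_data.loc[(data < lower) | (data > upper), column] = np.nan
```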
# imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer()
imp_data = pd.DataFrame(imputer.fit_transform(non_categorical_data), columns=non_categorical_data.columns)
imp_data.isna().sum()
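Since the later EDA and feature-engineering steps operate on creditcard_df, the imputed values presumably need to be written back; a hedged sketch of that step (it is not shown in the original):

```python
# Copy the imputed numeric columns back onto the main dataframe so that
# subsequent plots and engineered features use the cleaned values.
creditcard_df[non_categorical_data.columns] = imp_data.values
```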
# Distribution of each numeric feature (CUST_ID is excluded)
plt.figure(figsize=(20, 50))
for i, column in enumerate(non_categorical_data.columns):
    plt.subplot(len(non_categorical_data.columns), 1, i + 1)
    displot = sns.distplot(creditcard_df[column],
                           kde_kws={"color": "b", "lw": 3, "label": "KDE"},
                           hist_kws={"color": "g"})
    plt.title(column)
plt.tight_layout()
displot.get_figure().savefig("Images/Distplot.png")
Insights
- Mean balance is around $1,500
- 'BALANCE_FREQUENCY' for most customers is close to 1, i.e. the balance is updated frequently
- For 'PURCHASES_FREQUENCY', there are two distinct groups of customers
- For 'ONEOFF_PURCHASES_FREQUENCY' and 'PURCHASES_INSTALLMENTS_FREQUENCY', most users don't make one-off or installment purchases frequently
- A very small number of customers pay their balance in full ('PRC_FULL_PAYMENT' ~0)
- Average credit limit is around $4,500
- Most customers have a tenure of ~11 years
correlations = creditcard_df.corr(numeric_only=True)
f, ax = plt.subplots(figsize = (20, 8))
heatmap = sns.heatmap(correlations, annot = True)
plt.show()
Insights
- 'PURCHASES' is highly correlated with one-off purchases, installment purchases, purchase transactions, credit limit, and payments.
- There is a strong positive correlation between 'PURCHASES_FREQUENCY' and 'PURCHASES_INSTALLMENTS_FREQUENCY'.
# Feature engineering: interaction and ratio features.
# Note: the ratio features can produce inf/NaN when the denominator is zero
# (e.g. customers with no purchases or no cash-advance transactions).
creditcard_df["new_BALANCE_BALANCE_FREQUENCY"] = creditcard_df["BALANCE"] * creditcard_df["BALANCE_FREQUENCY"]
creditcard_df["new_ONEOFF_PURCHASES_PURCHASES"] = creditcard_df["ONEOFF_PURCHASES"] / creditcard_df["PURCHASES"]
creditcard_df["new_INSTALLMENTS_PURCHASES_PURCHASES"] = creditcard_df["INSTALLMENTS_PURCHASES"] / creditcard_df["PURCHASES"]
creditcard_df["new_CASH_ADVANCE_CASH_ADVANCE_FREQUENCY"] = creditcard_df["CASH_ADVANCE"] * creditcard_df["CASH_ADVANCE_FREQUENCY"]
creditcard_df["new_PURCHASES_PURCHASES_FREQUENCY"] = creditcard_df["PURCHASES"] * creditcard_df["PURCHASES_FREQUENCY"]
creditcard_df["new_PURCHASES_ONEOFF_PURCHASES_FREQUENCY"] = creditcard_df["PURCHASES"] * creditcard_df["ONEOFF_PURCHASES_FREQUENCY"]
creditcard_df["new_PURCHASES_PURCHASES_TRX"] = creditcard_df["PURCHASES"] / creditcard_df["PURCHASES_TRX"]
creditcard_df["new_CASH_ADVANCE_CASH_ADVANCE_TRX"] = creditcard_df["CASH_ADVANCE"] / creditcard_df["CASH_ADVANCE_TRX"]
creditcard_df["new_BALANCE_CREDIT_LIMIT"] = creditcard_df["BALANCE"] / creditcard_df["CREDIT_LIMIT"]
creditcard_df["new_PAYMENTS_MINIMUM_PAYMENTS"] = creditcard_df["PAYMENTS"] / creditcard_df["MINIMUM_PAYMENTS"]
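Several of these ratios divide by quantities that can be zero (e.g. PURCHASES, PURCHASES_TRX, CASH_ADVANCE_TRX), which produces inf/NaN values; a small sketch of one way to handle that, assuming a zero ratio is an acceptable default for customers with no such activity:

```python
import numpy as np

# Replace infinities from zero denominators with NaN, then fill with 0
# (assumption: no purchases / no cash-advance transactions => ratio of 0).
new_cols = [c for c in creditcard_df.columns if c.startswith("new_")]
creditcard_df[new_cols] = (creditcard_df[new_cols]
                           .replace([np.inf, -np.inf], np.nan)
                           .fillna(0))
```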
def outlier_thresholds(dataframe, variable):
    # Use the 1st and 99th percentiles as a wide "interquantile" range and
    # derive clipping limits from it
    quartile1 = dataframe[variable].quantile(0.01)
    quartile3 = dataframe[variable].quantile(0.99)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def replace_with_thresholds(dataframe, variable):
    # Clip values outside the limits back to the limits (winsorization)
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

# Apply the clipping to every numeric column
for col in creditcard_df.select_dtypes(include="number").columns:
    replace_with_thresholds(creditcard_df, col)
plt.figure(figsize=(10,5))
sns.boxplot(data=creditcard_df)
plt.xticks(rotation=90)
plt.show()
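The feature-selection, clustering, and PCA steps below operate on data_scaled, which is not constructed in the snippets above; a minimal sketch using StandardScaler (the exact scaling method is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize the engineered, outlier-clipped features so that no single
# large-scale column dominates Lasso, clustering, or PCA; CUST_ID is dropped.
features = creditcard_df.drop(columns=["CUST_ID"], errors="ignore")
data_scaled = pd.DataFrame(StandardScaler().fit_transform(features),
                           columns=features.columns)
```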
from sklearn.linear_model import LassoCV

X = data_scaled.drop(["BALANCE", "new_BALANCE_BALANCE_FREQUENCY", "new_BALANCE_CREDIT_LIMIT", "BALANCE_FREQUENCY"], axis=1)  # Feature matrix
y = data_scaled["BALANCE"]  # Target variable

reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" % reg.score(X, y))

coef = pd.Series(reg.coef_, index=X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +
      str(sum(coef == 0)) + " variables")

imp_coef = coef.sort_values()
lasso_FE = imp_coef.plot(kind="barh")
plt.title("Feature importance using Lasso Model")
plt.show()
Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
- Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
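The dendrogram and clustering code below use X_principal, a two-component projection of the data that is not constructed above; a minimal sketch, assuming it is the PCA projection of data_scaled with columns named P1 and P2 (as referenced by the plotting code):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Project the scaled features onto two principal components for the
# dendrogram, silhouette analysis, and cluster visualization below.
X_principal = pd.DataFrame(PCA(n_components=2).fit_transform(data_scaled),
                           columns=["P1", "P2"])
```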
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(6, 6))
plt.title('Visualising the data')
Dendrogram = shc.dendrogram(shc.linkage(X_principal, method='ward'))
plt.show()
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Compare silhouette scores for different numbers of clusters
silhouette_scores = []
for n_cluster in range(2, 8):
    silhouette_scores.append(
        silhouette_score(X_principal,
                         AgglomerativeClustering(n_clusters=n_cluster).fit_predict(X_principal)))

# Plotting a bar graph to compare the results
k = [2, 3, 4, 5, 6, 7]
plt.bar(k, silhouette_scores)
plt.xlabel('Number of clusters', fontsize=10)
plt.ylabel('Silhouette Score', fontsize=10)
plt.show()
Insights
- Therefore, the optimal number of clusters for this particular dataset is 3 or 4. Let us now build and visualize the clustering model for k = 3.
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X_principal)

# Visualizing the clustering, colored by the fitted cluster labels
plt.scatter(X_principal['P1'], X_principal['P2'], c=labels, cmap=plt.cm.winter)
plt.show()
- PCA is an unsupervised ML algorithm.
- It performs dimensionality reduction while attempting to preserve as much of the original information (variance) as possible.
- It works by finding a new set of features called components.
- Components are uncorrelated composites of the given input features.
pca = PCA(n_components = 2)
principal_comp = pca.fit_transform(data_scaled)
principal_comp
# Create a dataframe with the two components
pca_df = pd.DataFrame(data = principal_comp, columns =['pca1','pca2'])
pca_df.head()
# Concatenate the clusters labels to the dataframe
pca_df = pd.concat([pca_df, pd.DataFrame({'cluster':labels})], axis = 1)
pca_df.head()
plt.figure(figsize=(20, 8))
ax = sns.scatterplot(x="pca1", y="pca2", hue="cluster", data=pca_df, palette=['red', 'green', 'blue'])
plt.show()
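To turn these clusters into actionable marketing segments, the cluster labels can be joined back to the original behavioral features and profiled; a hedged sketch (this step is not shown above, and the column selection is illustrative):

```python
# Average behavior per cluster, using the agglomerative labels from above.
profile_cols = ["BALANCE", "PURCHASES", "CASH_ADVANCE",
                "CREDIT_LIMIT", "PAYMENTS", "PRC_FULL_PAYMENT"]
cluster_profile = (creditcard_df.assign(cluster=labels)
                   .groupby("cluster")[profile_cols].mean())
print(cluster_profile.round(2))
```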