German Bank's Credit Customer Segmentation using Clustering Analysis in Python

Shripathy Suresh
Apr 4, 2024
8 min read

Updated: Dec 14, 2024

Introduction

Have you ever wondered what factors influence who gets approved for a loan? Delve into the world of creditworthiness with us! We will crack open the German Credit Data, a treasure trove of information about loan applicants in Germany. Through data analysis and visualization, we will uncover the stories hidden within this dataset, revealing the characteristics lenders might consider when making loan decisions. So, buckle up and get ready to explore the fascinating world of credit analysis!

Executive Summary

This blog post examines the German Credit Data, a dataset containing information on loan applicants in Germany. We analyse key attributes like age, sex, job, housing situation, and savings accounts to understand the applicant landscape. Through data visualisation with bar charts, we reveal patterns in the data distribution. We delve deeper by exploring potential relationships between factors using correlations and statistical tests. Finally, we introduce the concept of k-means clustering, a technique that can group applicants with similar characteristics, allowing for a more nuanced understanding of loan applications.

Demystifying the Data: A Look at the Applicant Landscape

The German Credit Data provides a snapshot of loan applicants in Germany. It encompasses a variety of attributes that paint a picture of each individual's financial situation. This part covers data cleaning and processing. Let's get into some key data points:

Age: This attribute reveals the age distribution of loan applicants. Understanding the age range helps lenders assess risk tolerance and repayment potential. Younger applicants might have less established careers and credit history, while older applicants might have higher financial commitments.
Sex: While not a direct indicator of creditworthiness, analysing the distribution by sex can uncover any potential biases in the loan approval process. Ideally, loan decisions should be based solely on financial standing, not gender.
Job: The type of job an applicant holds can influence their income stability and repayment capacity. Applicants with stable, well-paying jobs are generally considered lower risks.
Housing: Whether an applicant owns their home, rents, or lives rent-free can provide insights into their financial obligations and overall stability. Homeownership often indicates a level of financial commitment and potentially lower debt.
Savings Account: The balance in an applicant's savings account reflects their ability to handle unexpected expenses and potentially make down payments on loans. Higher savings balances suggest a stronger financial cushion.
Checking Account: The status of an applicant's checking account (little, moderate, rich) can offer cues about their spending habits and potential for managing monthly loan repayments.
Credit Amount: This attribute reveals the amount of money the applicant is requesting as a loan. Higher loan amounts naturally translate to higher risk for the lender.
Duration: The loan duration, measured in months, indicates the timeframe over which the applicant needs to repay the loan. Longer durations spread out the repayment but can also accrue more interest.

CODE:

from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

gb = pd.read_csv(r"C:\Users\Shripathy Suresh\Desktop\SHRIPATHY\ASB\T3\DARP\german_credit_data.csv")
print(gb.head())

   Unnamed: 0  Age     Sex  Job Housing Saving accounts Checking account  \
0           0   67    male    2     own             NaN           little   
1           1   22  female    2     own          little         moderate   
2           2   49    male    1     own          little              NaN   
3           3   45    male    2    free          little           little   
4           4   53    male    2    free          little           little   

   Credit amount  Duration              Purpose  
0           1169         6             radio/TV  
1           5951        48             radio/TV  
2           2096        12            education  
3           7882        42  furniture/equipment  
4           4870        24                  car

import pandas as pd

# Load the dataset
gb = pd.read_csv(r"C:\Users\Shripathy Suresh\Desktop\SHRIPATHY\ASB\T3\DARP\german_credit_data.csv")

# Encode the categorical columns as integer codes
gb['Sex'] = gb['Sex'].map({'male': 0, 'female': 1})
gb['Housing'] = gb['Housing'].map({'own': 0, 'rent': 1, 'free': 2})
# Missing values (NaN) stay NaN after mapping; they are dropped later with dropna()
gb['Saving accounts'] = gb['Saving accounts'].map({'little': 1, 'moderate': 2, 'rich': 3, 'quite rich': 4})

# Print the modified dataset
print(gb)

     Unnamed: 0  Age  Sex  Job  Housing  Saving accounts Checking account  \
0             0   67    0    2        0              NaN           little   
1             1   22    1    2        0              1.0         moderate   
2             2   49    0    1        0              1.0              NaN   
3             3   45    0    2        2              1.0           little   
4             4   53    0    2        2              1.0           little   
..          ...  ...  ...  ...      ...              ...              ...   
995         995   31    1    1        0              1.0              NaN   
996         996   40    0    3        0              1.0           little   
997         997   38    0    2        0              1.0              NaN   
998         998   23    0    2        2              1.0           little   
999         999   27    0    2        0              2.0         moderate   

     Credit amount  Duration              Purpose  
0             1169         6             radio/TV  
1             5951        48             radio/TV  
2             2096        12            education  
3             7882        42  furniture/equipment  
4             4870        24                  car  
..             ...       ...                  ...  
995           1736        12  furniture/equipment  
996           3857        30                  car  
997            804        12             radio/TV  
998           1845        45             radio/TV  
999           4576        45                  car  

[1000 rows x 10 columns]
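
The output above shows NaN entries in the savings and checking account columns. As a minimal sketch of how those gaps could be counted and handled, the snippet below uses a small hypothetical DataFrame (`toy`, not the real dataset) and compares two options: dropping incomplete rows, which is what this post does later with `dropna()`, and keeping every row by filling the gaps with an explicit "unknown" label, which is an assumption added here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the structure of the German Credit Data
toy = pd.DataFrame({
    "Age": [67, 22, 49, 45],
    "Saving accounts": [np.nan, "little", "little", "quite rich"],
    "Checking account": ["little", "moderate", np.nan, "little"],
})

# Count missing values per column before deciding how to handle them
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2 (illustrative alternative): keep all rows and label the gaps
filled = toy.fillna({"Saving accounts": "unknown", "Checking account": "unknown"})

print(len(toy), len(dropped), len(filled))
```

Dropping rows is simpler but shrinks the sample, while an explicit label keeps every applicant at the cost of an extra category to encode.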

Unveiling Patterns: Visualizing the Applicant Pool

Data visualization plays a crucial role in transforming raw data into understandable insights. Here's how charts can help us grasp the German Credit Data:

Age Distribution: A bar chart can illustrate the number of applicants within each age group. This helps identify if there's a skew towards younger or older applicants.
Job Distribution: A bar chart can showcase the prevalence of different job types among applicants. This can reveal if certain professions are more likely to seek loans.
Savings and Checking Account Distribution: Categorical bar charts can depict the distribution of applicants across different savings and checking account categories (little, moderate, rich and quite rich). This provides insights into the overall financial health of the applicant pool.

This part is about Exploratory Data Analysis (EDA).

These visualizations not only present the data in a clear and concise way but also allow us to identify potential relationships between factors. For example, a high concentration of applicants in specific age groups with lower average savings balances might suggest a correlation between age and financial preparedness.

CODE:

sex_counts = gb['Sex'].value_counts()
job_counts = gb['Job'].value_counts()
housing_counts = gb['Housing'].value_counts()
saving_counts = gb['Saving accounts'].value_counts()
checking_counts = gb['Checking account'].value_counts()

value_counts_data  = [sex_counts,job_counts,housing_counts,saving_counts,checking_counts]
for count_data in value_counts_data:
    count_data.plot(kind='bar',cmap='plasma')
    plt.show()

df = gb[['Age','Job', 'Sex', 'Housing', 'Credit amount', 'Saving accounts', 'Duration']]
df = df.dropna()
df

	Age	Job	Sex	Housing	Credit amount	Saving accounts	Duration
1	22	2	1	0	5951	1.0	48
2	49	1	0	0	2096	1.0	12
3	45	2	0	2	7882	1.0	42
4	53	2	0	2	4870	1.0	24
6	53	2	0	0	2835	4.0	24
...	...	...	...	...	...	...	...
995	31	1	1	0	1736	1.0	12
996	40	3	0	0	3857	1.0	30
997	38	2	0	0	804	1.0	12
998	23	2	0	2	1845	1.0	45
999	27	2	0	0	4576	2.0	45

817 rows × 7 columns
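
The idea of comparing age groups against average savings, mentioned earlier in this section, can be sketched with `pd.cut` and `groupby`. The rows below are hypothetical and only reuse the numeric savings encoding from this post; the bin edges and labels are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data using the same numeric savings encoding as the post
toy = pd.DataFrame({
    "Age": [22, 27, 31, 38, 45, 53, 61, 67],
    "Saving accounts": [1, 1, 2, 1, 2, 4, 3, 2],
})

# Bin ages into groups; edges and labels are an illustrative choice
bins = [18, 30, 45, 60, 80]
labels = ["18-30", "31-45", "46-60", "61-80"]
toy["Age group"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Number of applicants and mean savings code per age group
summary = toy.groupby("Age group", observed=True)["Saving accounts"].agg(["count", "mean"])
print(summary)
```

On the real data, the same two lines of binning and grouping could be applied to `df` to check whether any age group stands out.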

Beyond the Surface: Exploring Relationships within the Data

While the initial analysis provides a general understanding of the applicant pool, more advanced techniques can reveal deeper connections within the data. Here are some potential areas for further exploration:

Correlations: We can calculate correlation coefficients to assess the strength and direction of relationships between attributes. For instance, a positive correlation between age and savings account balance might suggest that older applicants tend to have more accumulated savings.
Statistical Tests: Statistical tests can help determine if observed relationships are statistically significant or simply due to random chance. This strengthens the validity of any conclusions drawn from the data.

This part is about Exploratory Factor Analysis (EFA).

By employing these techniques, we can move beyond a basic description of the data and uncover hidden patterns that might influence loan approval decisions.

CODE:

sns.histplot(data=df,x='Age',kde=True)
plt.title('Histogram of Age')
plt.show()

sns.histplot(data=df,x='Job',discrete=True,kde=True)  # Job is an integer code, so use one bin per value
plt.title('Histogram of Job')
plt.show()

sns.histplot(data=df,x='Saving accounts',bins=4,kde=True)
plt.title('Histogram of Savings Account')
plt.show()

sns.histplot(data=df,x='Credit amount',kde=True)
plt.title('Histogram of Credit Amount')
plt.show()
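
As a sketch of the correlation and significance checks described in this section, the snippet below computes a Pearson correlation matrix with pandas and a p-value with `scipy.stats.pearsonr`. The data is synthetic and deliberately constructed so that credit amount grows with duration, so the numbers say nothing about the real German Credit Data; the column names simply mirror the ones used in this post.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: credit amount is built to increase with duration, plus noise
rng = np.random.default_rng(42)
n = 200
duration = rng.integers(6, 61, size=n)
credit = duration * 100 + rng.normal(0, 800, size=n)
age = rng.integers(19, 75, size=n)
toy = pd.DataFrame({"Age": age, "Credit amount": credit, "Duration": duration})

# Pairwise Pearson correlation coefficients between all numeric columns
corr_matrix = toy.corr(method="pearson")
print(corr_matrix.round(2))

# Correlation coefficient and p-value for one pair of columns
r, p_value = stats.pearsonr(toy["Credit amount"], toy["Duration"])
print(f"r = {r:.2f}, p = {p_value:.3g}")
```

A small p-value suggests the correlation is unlikely to be due to chance alone, but it does not by itself say the relationship is strong or causal.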

Grouping Applicants: Unveiling Similarities and Differences

The German Credit Data holds valuable information about individual applicants. However, grouping applicants with similar characteristics can yield even more insights. Here's where K-Means clustering comes into play:

K-Means Clustering: This technique identifies groups (clusters) within the data where the members of each cluster share similar characteristics. In the context of the German Credit Data, K-Means clustering could group applicants with similar age ranges, job types and financial profiles.
Benefits of Clustering: By analysing these clusters, we can identify patterns in loan applications. For example, one cluster might comprise young applicants with lower savings balances but stable jobs, while another might contain older applicants with substantial savings.

Understanding these groupings can help lenders tailor their approach to different applicant profiles and improve risk assessment. By knowing the characteristics of each cluster, lenders can create targeted loan options with interest rates and repayment terms suitable for the specific risk profile.
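
To make the grouping idea concrete, here is a minimal NumPy sketch of the basic k-means loop (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The function name `simple_kmeans` and the toy points are hypothetical, and the sketch uses plain random initialisation rather than the `k-means++` initialisation used with scikit-learn in this post.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means sketch: random initial centroids, then assign/update."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated hypothetical groups of points
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[10, 10], scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

labels, centroids = simple_kmeans(points, k=2)
print(centroids.round(1))
```

In practice the scikit-learn `KMeans` class is the better choice, since it handles initialisation and repeated runs for you.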

CODE:

x = df.iloc[:,[1,2]].values #NUMPY ARRAY OF THE 2ND AND 3RD COLUMNS OF df (Job AND Sex)
print(x) #TO CHECK AND VERIFY THE DATA IS PROPERLY LOADED

[[2 1]
 [1 0]
 [2 0]
 ...
 [2 0]
 [2 0]
 [2 0]]

#NOW WE ARE INVOKING SKLEARN WHICH HAS THE K-MEANS
from sklearn.cluster import KMeans
wcss = [] #CREATING AN EMPTY LIST FOR WITHIN CLUSTER SUM OF SQUARES
for i in range(1,11): # ITERATE THE PROCESS FOR K VALUES 1 TO 10
    kmeans = KMeans(n_clusters = i, init = 'k-means++',random_state = 42) #SETTING UP A K-MEANS OBJECT WITH THE NUMBER OF CLUSTERS i
    kmeans.fit(x)
    wcss.append(kmeans.inertia_) #INERTIA IS THE WCSS FOR THIS VALUE OF K
plt.plot(range(1,11),wcss)
plt.title("Elbow-Method")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.show()

#FIT K-MEANS WITH THE CHOSEN NUMBER OF CLUSTERS AND ASSIGN EACH ROW TO ITS CLUSTER; K-MEANS++ INITIALISATION IS USED FOR BETTER CONVERGENCE
k_cluster = 3
kmeans = KMeans(n_clusters = k_cluster, init = 'k-means++',random_state = 42)
y_kmeans = kmeans.fit_predict(x)
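
Reading the elbow off a plot can be subjective. One complementary check, not used in the original analysis, is the silhouette score from `sklearn.metrics`, which is higher when clusters are compact and well separated. The sketch below runs on synthetic points drawn around three hypothetical centres, so the chosen `k` here is a property of that toy data, not of the German Credit Data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic points around three hypothetical, well-separated centres
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [8, 8], [0, 8]])
toy = np.vstack([rng.normal(c, 0.6, size=(60, 2)) for c in centers])

# Silhouette score needs at least 2 clusters, so start the range at 2
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    cluster_ids = model.fit_predict(toy)
    scores[k] = silhouette_score(toy, cluster_ids)

# Pick the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

If the elbow plot and the silhouette score point to different values of k, that is a hint that the cluster structure is not clear-cut.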

kmeans = KMeans(n_clusters=4)
kmeans.fit(df)

KMeans(n_clusters=4)

centroids = kmeans.cluster_centers_
centroids

array([[3.42911392e+01, 1.95358650e+00, 2.99578059e-01, 4.17721519e-01,
        3.52973418e+03, 1.50210970e+00, 2.35443038e+01],
       [3.52384106e+01, 1.76158940e+00, 3.48785872e-01, 3.11258278e-01,
        1.44883664e+03, 1.49448124e+00, 1.48609272e+01],
       [3.63750000e+01, 2.34375000e+00, 3.12500000e-01, 6.56250000e-01,
        1.25564687e+04, 1.43750000e+00, 4.10312500e+01],
       [3.62631579e+01, 2.23157895e+00, 2.31578947e-01, 6.10526316e-01,
        6.96534737e+03, 1.32631579e+00, 3.28526316e+01]])
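
The centroid array above follows the column order of `df` (Age, Job, Sex, Housing, Credit amount, Saving accounts, Duration), and the four centroids differ mostly in credit amount while age stays between roughly 34 and 36. That is expected when k-means runs on unscaled data, because Euclidean distance is dominated by the feature with the largest numeric range. A minimal sketch on hypothetical random data (`toy`, not the real dataset) shows how to label the centroids with column names and see this effect:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical unscaled data with very different numeric ranges per column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(19, 75, size=300),
    "Credit amount": rng.integers(250, 18000, size=300),
    "Duration": rng.integers(6, 61, size=300),
})

model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(toy)

# Attach the column names so each centroid coordinate is readable
centroid_table = pd.DataFrame(model.cluster_centers_, columns=toy.columns).round(1)
print(centroid_table.sort_values("Credit amount"))

# Spread of each feature across the centroids: the large-range column dominates
print(centroid_table.std())
```

This is one reason to standardise the features before clustering when they are measured on very different scales.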

import numpy as np
import math as math
import scipy.stats as stat
import statsmodels as sm
from sklearn.cluster import AgglomerativeClustering
groups = AgglomerativeClustering(n_clusters=4,metric='euclidean', linkage='single')
groups.fit_predict(df)
plt.scatter(df['Credit amount'], df['Duration'], c = groups.labels_, cmap='cool')

# Load and preprocess data (assuming 'df' is your DataFrame)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['Credit amount', 'Duration']])  # Scale only relevant features

# Perform KMeans clustering
kmeans = KMeans(n_clusters=4, random_state=42)  # Set random state for reproducibility
kmeans.fit(df_scaled)

# Get cluster labels and centroids
cluster_labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Create scatterplot with separate colors for clusters
plt.figure(figsize=(8, 6))  # Adjust figure size as needed

plt.scatter(df_scaled[:, 0], df_scaled[:, 1], c=cluster_labels, cmap='coolwarm', alpha=0.7, label='Data Points')

# Plot centroids with larger markers and a distinct color
plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', s=150, c='black', linewidths=2, label='Centroids')

# Add labels, title, and legend
plt.xlabel('Credit amount (standardized)')
plt.ylabel('Duration (standardized)')
plt.title('Clusters of Credit Data with Centroids (KMeans)')
plt.legend()

# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()

Interpretation

The plot above shows the result of k-means clustering on credit amount and duration from the German Credit Data, with n_clusters = 4. The big black dots are the centroids, which mark the average credit amount and duration of each cluster. Because both features were standardized before clustering, the axes are in standard-deviation units rather than in currency and months. The centroids give both lenders (financial institutions) and borrowers (applicants) a reference for which combinations of credit amount and duration are typical within each group.
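
Since the clustering was fitted on standardized values, the centroids are easier to interpret once they are converted back to the original units. A minimal sketch, assuming the same `StandardScaler` object that scaled the data is still available, uses its `inverse_transform` method. The data here is hypothetical random data with the same two column names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data with the two columns used for the final clustering
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Credit amount": rng.integers(250, 18000, size=300),
    "Duration": rng.integers(6, 61, size=300),
})

# Standardize, then cluster on the scaled values
scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(toy_scaled)

# Convert the centroids from standardized units back to the original units
centroids_original = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_),
    columns=toy.columns,
)
print(centroids_original.round(0))
```

Each row can then be read directly as a typical credit amount and a typical duration in months for that cluster.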

Conclusion

The German Credit Data offers a fascinating glimpse into the world of loan applicants. By analysing and visualizing this data, we gain valuable insights into the factors that might influence creditworthiness. From age distribution to savings account balances, we paint a picture of the applicant pool. Moving beyond basic observations, we explore correlations and statistical tests to uncover deeper connections within the data. Finally, K-Means clustering allows us to group applicants with similar characteristics, revealing patterns that might not be readily apparent. This deeper understanding can be instrumental for lenders in tailoring their approach to different applicant profiles and ultimately making informed loan decisions.

ABOUT:

I am Shripathy S, from Amrita School of Business, Coimbatore. This project was given to our batch to choose a dataset, get to know it, and analyse it using K-Means clustering in Python.

References:

Materials given by our DARP (Data Analysis using R and Python) Faculty, Dr. Prashoban: Basics of Python and R; Data reading, cleaning and processing; EDA and EFA; Clustering Analysis: K-Means Clustering

Source: https://www.kaggle.com/code/nitishviraktamath/german-bank-customer-segmentation/input

LinkedIn: http://www.linkedin.com/in/shripathy-s-6b945819a