Introduction
In the competitive landscape of banking, customers have more choices than ever before. From traditional banks to online banks, there is plenty to choose from, which makes customer retention difficult. With so many options available, customers readily switch to a different bank when they do not get the service and experience they desire.
Churn, the rate at which customers leave for other banks, is therefore a serious threat to a bank's profitability and growth.
To combat this problem, banks resort to churn analysis, which involves analysing the behaviour of customers who have left a bank to identify patterns and reasons for their departure. The findings can then be applied to reduce attrition and improve customer retention in the future.
Churn Analysis: A Real-life Scenario
A large Indian bank was facing high customer turnover. To stem the outflow, it wanted to classify its customers as opportunistic (likely to switch to a rival bank) or loyal (likely to remain with the bank).
The first step in churn analysis is to ingest the data into a single landing zone.
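As a minimal sketch of this step (the file names and the customer_id join key are assumptions for illustration), raw records from multiple sources can be pulled into one Pandas DataFrame:

```python
import pandas as pd

# Hypothetical source files; actual names and schemas will differ per bank.
accounts = pd.read_csv("accounts.csv")
transactions = pd.read_csv("transactions.csv")

# Combine the sources into a single landing-zone table keyed on customer ID.
raw_data = accounts.merge(transactions, on="customer_id", how="left")
print(raw_data.shape)
```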
Often, the data contains empty values, NaN values, and duplicate records, which lead to erroneous results. Data cleansing removes or corrects such values so that later stages produce reliable results.
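A minimal Pandas sketch of such cleansing, continuing from the table above (the "churned" label column is an assumed name):

```python
# Drop exact duplicate rows.
clean = raw_data.drop_duplicates()

# Drop rows where the target label is missing.
clean = clean.dropna(subset=["churned"]).copy()

# Impute remaining numeric gaps with per-column medians.
numeric_cols = clean.select_dtypes(include="number").columns
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())
```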
The next important step in churn analysis is Exploratory Data Analysis (EDA): analysing the data to identify general patterns. Python libraries such as Matplotlib, NumPy, Pandas, and seaborn are used for univariate, bivariate, and multivariate analysis to gain a deeper understanding of the data.
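For instance, a univariate look at the class balance and a multivariate correlation heatmap (column names again assumed) might look like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: distribution of the churn label.
sns.countplot(x="churned", data=clean)
plt.show()

# Multivariate: correlations among the numeric features.
sns.heatmap(clean.select_dtypes(include="number").corr(), annot=True)
plt.show()
```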
After this step, the data is split into distinct training and test datasets so that a machine learning model can be trained and then evaluated on unseen data. In this case, scikit-learn was used for this purpose.
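A sketch of such a split with scikit-learn, assuming the features have already been encoded as numeric values; stratifying on the label keeps the churn/loyal ratio consistent across both sets:

```python
from sklearn.model_selection import train_test_split

X = clean.drop(columns=["churned"])
y = clean["churned"]

# Stratify on the label so both sets preserve the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```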
This is a binary classification problem, with a positive class (opportunistic customers) and a negative class (loyal customers).
There are various model building techniques, such as logistic regression, KNN, SVM, Decision Tree, Random Forest, and XGBoost. It is therefore important to pick the right technique and train the model accordingly so that it can eventually be deployed.
To select the right model, we evaluate various metrics, such as accuracy, precision, and ROC-AUC. We selected Random Forest, with data imbalance handled using stratified cross-validation, for the best overall ROC score (0.839) and accuracy (0.805).
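One way to compare candidate models on these metrics is a simple loop like the sketch below (default hyperparameters throughout; SVM and XGBoost are omitted for brevity, and the feature/target names carry over from the earlier snippets):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
          f"precision={precision_score(y_test, preds):.3f}, "
          f"ROC-AUC={roc_auc_score(y_test, probs):.3f}")
```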
Once the model was trained on the training dataset, it was cross-validated on the test/validation data to prevent overfitting. There are various types of cross-validation, such as K-fold and stratified cross-validation, and we chose the model with the best ROC value.
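A minimal sketch of stratified cross-validation with ROC-AUC scoring, along the lines described (fold count and random seed are arbitrary choices here):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds keep the churn/loyal ratio in every fold,
# which helps when the classes are imbalanced.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(random_state=42), X, y, cv=cv, scoring="roc_auc"
)
print(f"Mean ROC-AUC across folds: {scores.mean():.3f}")
```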