Author: Haoyuan Huang
Class: cmps 3160
Date: 10/24/2021
This project is a comprehensive data analysis project on Customer Personality Analysis. The workflow of my project is divided into the following steps:
1. Introduction
2. Data: Extraction, Transform and Load
3. Exploratory Data Analysis
4. Model Construction and Evaluation
5. Conclusion & Suggestions
Since I am a graduate student, I will be working on this project alone.
The goal of the project is to explore the personalities of different customers, understand customer behavior, and find patterns to better target different customers.
A company has two general ways to get and keep more customers: attracting new customers and retaining existing ones.
From my perspective, working on these two things will help a firm understand its customers' personalities.
Firstly, I would like to talk about my motivation for completing this project with this dataset. I am a master's student studying business analytics. I have rich experience in the business world, and I am passionate about understanding how a business firm or entity can maximize its profits and why some firms make more money than others.
Secondly, as a business person, the most important thing is to take care of our customers. A truly successful business does not only look after its short-term profit; it also keeps a good relationship with its customers so they become long-term customers. I would love to analyze customer behavior and understand how we could target a customer depending on his or her personal needs.
For this project, I will provide a data science analysis of customer personality and customer churn data. The project aims to use exploratory data analysis to find historical patterns, evaluate the causes of different purchasing behaviors, and help a business identify different customer groups so it can better target its customers and identify which factors lead to a higher churn rate.
I have two datasets. The customer personality data will help me segment customers. For example, I plan to use a customer's average product spend, total amount spent, and the time the customer has been an active user as features for a k-means clustering analysis. Since this is an unlabeled dataset, I will also create a label based on purchase frequency, and I plan to use features like age, income, total number of purchases, purchase channel, and more to build a decision tree model that distinguishes high-frequency from low-frequency customers.
I will use the second dataset to predict customer churn. For example, the data contains variables such as whether a customer is a senior citizen, the customer's age, tenure (how long the customer has stayed with the company), and more, so I will be able to build a model that evaluates the positive or negative impact each variable has on the churn rate.
I will use two datasets for this project. I am not planning to merge them; instead, I will use each dataset to solve a different business problem. To attract more new customers, my business question is how we can better target our customers, and I will use the customer personality data to answer it. To keep existing customers, my business question is how we can reduce the customer churn rate, and I will use the customer churn data to answer it.
After considering several potential datasets, I chose and obtained the datasets from Kaggle. Kaggle is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Customer Personality Dataset Information: The data offers customer demographic information, the products they bought in the past, the places where they bought those products, and discount and promotion information for the company's products.
Customer Personality Dataset Limitations: Limitations of the data include that the dataset does not contain different companies or brand names. That could be another interesting and important business problem to analyze; for example, we could study the factors that contribute to brand retention or improve brand reputation. However, for this particular dataset I will focus on grouping customers and differentiating their behaviors. Another limitation is that the dataset is not up to date.
The final dataset can be downloaded here
Customer Churn Dataset Information: The data is labeled, so we have information on whether each customer has churned or not. It contains customer demographic information, their daily behaviors, their account information with the company, their phone and internet services, and a lot more.
Customer Churn Dataset Limitation: The data is not linked to the customer personality data, which is a limitation for my analysis because I cannot merge the two datasets together.
The final dataset can be downloaded here
General Information:
# Required packages
# Load Numpy
import numpy as np
# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
# Load Seaborn
import seaborn as sns
# Load Plotly
import plotly.express as px
import plotly.io as pio
# Load Pandas
import pandas as pd
# Plot Missing Values
import missingno as msno
# Stats
from scipy.stats import skew
import statsmodels.api as sm
import pylab as py
from scipy.stats import norm
# RE
import re
# This lets us show plots inline and also save PDF plots if we want them
%matplotlib inline
from matplotlib.backends.backend_pdf import PdfPages
matplotlib.style.use('seaborn')
# These two things are for Pandas, it widens the notebook and lets us display data easily.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
# Show a ludicrous number of rows and columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Import machine learning packages
from scipy import stats
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Import warnings package to ignore warnings
import warnings
warnings.filterwarnings('ignore')
This dataset was extracted from Here. The site provides clearly stated data documentation, including a variable dictionary that explains each variable. In addition, the site provides a first look at the dataset, so I can see its dimensions. The data is in CSV (comma-separated values) format, so I can load it directly into my notebook.
I will use pandas to load the data, along with the other packages needed for my analysis. To load the data, I will use pandas read_csv() with the separator set to a tab and display the data without any cleaning.
df = pd.read_csv('marketing_campaign.csv', sep = '\t')
df.head()
Looking through the dataset, there are a couple of things I would like to examine:
# Check missing value
display(df.isna().sum())
display(df[df['Income'].isnull()])
There are 24 missing values in the Income column, so I will fill them in. I will use describe() to look at the column's statistics and decide which method to use to fill the missing values.
display(df['Income'].describe())
# Fill NAs with the column mean
df['Income'] = df['Income'].fillna(df['Income'].mean())
display(df[df['Income'].isnull()])
I will combine the categories of some features into a binary variable and create a new column. I will use value_counts() to inspect each variable's values, then use an apply/replace function to turn it into a binary variable assigned to a new column.
display(df['Education'].unique())
display(df['Education'].value_counts())
df['Education_level'] = df['Education'].apply(lambda x: 'Undergraduate' if x == 'Basic' else 'Graduate')
df.head()
Similarly, the "Marital_Status" column holds similar categorical information, so I will reduce it to a binary variable. I will create a new column that replaces the original values to simplify them.
display(df['Marital_Status'].unique())
display(df['Marital_Status'].value_counts())
df['Marital_Status_all'] = df['Marital_Status'].apply(lambda x: 'Relationship' if (x == 'Married' or x == 'Together') else "Single")
df.head()
For example, Kidhome is the number of children in the customer's household and Teenhome is the number of teenagers, so I will combine them into a total. Columns starting with "Mnt" hold the amount spent on a product in the last 2 years, so I will combine them into a total spend. Columns starting with "Num...Purchases" hold the number of purchases made through a given channel, so I will combine them into a total number of purchases. Columns starting with "Accepted" indicate whether the customer accepted the offer in a given campaign, and "Response" is 1 if the customer accepted the offer in the last campaign, so I will combine these into a total number of accepted offers.
# Combine the component columns into totals (vectorized instead of iterating row by row)
df['Number_of_Kids'] = df['Kidhome'] + df['Teenhome']
df['Total_Product_Spent_per_month'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
df['Total_Purchase_per_month'] = df['NumDealsPurchases'] + df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases']
df['Total_Accept_Offer'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5'] + df['Response']
df.head()
It is also important to compute the average spend per purchase, as well as the percentage of purchases made with a discount.
df['Average_Spend_per_product'] = round((df['Total_Product_Spent_per_month'] / df['Total_Purchase_per_month']), 1)
df['Deals_Purchases'] = round((df['NumDealsPurchases'] / df['Total_Purchase_per_month']) * 100, 1)
df.head()
Z_CostContact and Z_Revenue contain only a single value and are not needed for the project, so I will drop them, along with the ID column. I have already combined Kidhome and Teenhome, so I will drop those as well. I will compute the customer's age from Year_Birth later and then drop the birth year.
df.nunique()
df.drop(columns = ['ID', 'Kidhome', 'Teenhome', 'Z_CostContact', 'Z_Revenue'], axis =1, inplace = True)
df.head()
This lets me know how long a customer has been with the company. The dataset contains customer information from 2012 to 2015, so I will assume the latest date is 01-01-2016.
# create a new column named duration
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])
# I assume today's date is 01-01-2016
df['today'] = '01-01-2016'
df['today'] = pd.to_datetime(df['today'])
# Compute the number of days since each customer registered (vectorized, no string splitting needed)
df['Days_since_client'] = (df['today'] - df['Dt_Customer']).dt.days
df.head()
Now we have information on customer engagement and the duration since they first registered their account with the company. However, I think it would be more useful to know this in years instead of days.
# Create a new column with the year information
df['Days_since_client_year'] = (df['Days_since_client'].astype('int') / 365).astype('int')
df.head()
# Drop unnecessary columns
df.drop(['today'], axis =1, inplace = True)
df.head()
It is also a good idea to know the customer's age. Year_Birth contains the birth year, so I will compute the age from it and drop the original column.
df['Age'] = 2016 - df['Year_Birth']
df.drop(columns = 'Year_Birth', axis =1, inplace = True)
df.dropna(inplace = True)
df.isna().sum()
display(df.head())
display(df.dtypes)
# Change column Types
change_cols = [col for col in df.columns if col in ['Income', 'Number_of_Kids','Total_Product_Spent_per_month','Total_Purchase_per_month','Total_Accept_Offer','Age','Average_Spend_per_product','Deals_Purchases']]
for col in change_cols:
df[col] = df[col].astype('int')
display(df.dtypes)
df['Days_since_client'] = df['Days_since_client'].astype('int')
# Reference: https://stackoverflow.com/questions/56486165/changing-data-types-of-multiple-columns-at-once-in-python-pandas
I am also interested in creating a column that indicates whether a customer shows high-frequency or low-frequency purchase behavior. I will use the average Total_Purchase_per_month as my threshold.
# Calculate the average and use it as the threshold
avg_purchases = df['Total_Purchase_per_month'].mean()
print("Average Total purchase per month is:", avg_purchases)
# Create the new variable (1 = above-average purchase frequency)
df['High_freq'] = df['Total_Purchase_per_month'].apply(lambda x: 1 if x > avg_purchases else 0)
print("Variable distribution is:", df['High_freq'].value_counts())
df.head()
This dataset was extracted from Here. The site provides clearly stated data documentation, including a variable dictionary that explains each variable. In addition, the site provides a first look at the dataset, so I can see its dimensions. The data is in CSV (comma-separated values) format, so I can load it directly into my notebook.
I will use pandas to load the data, along with the other packages needed for my analysis. I will use pandas read_csv() with the default comma separator and display the data without any cleaning.
df_churn = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df_churn.head()
Looking through the dataset, there are a couple of things I would like to examine:
I will use dtypes to check the variable data types. TotalCharges is currently an object, so I will use pd.to_numeric() to convert its data type.
display(df_churn.dtypes)
display(df_churn.head())
# Converting Total Charges to a numerical data type.
df_churn['TotalCharges'] = pd.to_numeric(df_churn['TotalCharges'], errors='coerce')
Next, I will use isna() to check for missing values in the data and replace any missing values with the column's average.
df_churn.isna().any()
# Fill missing TotalCharges with the column mean
df_churn['TotalCharges'] = df_churn['TotalCharges'].fillna(df_churn['TotalCharges'].mean())
df_churn.isna().any()
Next, I will examine some categorical variables and use a lambda function to replace Yes and No with 1 and 0. I will leave some columns as they are because I want to use them in my exploratory analysis.
# Convert all the Yes/No categorical variables into 1/0.
df_churn['Partner'] = df_churn['Partner'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['PhoneService'] = df_churn['PhoneService'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['OnlineSecurity'] = df_churn['OnlineSecurity'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['OnlineBackup'] = df_churn['OnlineBackup'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['DeviceProtection'] = df_churn['DeviceProtection'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['TechSupport'] = df_churn['TechSupport'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['StreamingTV'] = df_churn['StreamingTV'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['StreamingMovies'] = df_churn['StreamingMovies'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['PaperlessBilling'] = df_churn['PaperlessBilling'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['Churn'] = df_churn['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['Dependents'] = df_churn['Dependents'].apply(lambda x: 1 if x == 'Yes' else 0)
df_churn['gender'] = df_churn['gender'].apply(lambda x: 1 if x == 'Male' else 0)
df_churn.head()
I will need to remove the CustomerID since it's not needed.
df_churn.drop(columns = 'customerID', axis =1, inplace = True)
There are duplicated values in the data. For example, inside MultipleLines there are three unique values (No, No phone service, and Yes), but both No and No phone service indicate no multiple lines, so I will merge the duplicated values.
df_churn['MultipleLines'] = df_churn['MultipleLines'].str.split(expand = True)[0]
df_churn['MultipleLines'].value_counts()
Next, I will check the data types again to make sure the data types are clear.
display(df_churn.dtypes)
display(df_churn.head())
Great! Everything looks good and I am ready to move on to EDA.
However, I will do more data transformation in the later modeling section to transform some variables.
df_churn.head()
The goal of this section is to help me understand the data better and use statistical methods to find patterns in the data. It's also an important step to take before building machine learning models in the later section. The structure of this section is:
All graphs reference:
plt.figure(figsize=(10,7))
sns.set_theme(style="whitegrid")
plt.xlim(1,200000) # Set x-axis scale
ax = sns.boxplot(x=df["Income"],palette="Spectral")
plt.title('Customer Income Distribution', fontsize=15, weight = 'bold')# Add title
plt.show()
df['Education'] = df['Education'].replace("Graduation", 'Master')
# Create Education plot
plt.figure(figsize=(10,7)) # Reset the plot showing size
plt.subplot(2, 1, 1) # make two plots together
sns.boxplot(x="Education", y="Income",data=df, dodge=True,palette="Spectral")
plt.ylim(1,125000) # Set y-axis scale
plt.title('Education vs Income', fontsize=15, weight = 'bold')
plt.show()
# Create Marriage plot
plt.figure(figsize=(10,7))
plt.subplot(2, 1, 2)
sns.boxplot(x="Marital_Status", y="Income",data=df, dodge=True,palette="Spectral")
plt.ylim(1,125000)
plt.title('Marital Status vs Income', fontsize=15, weight = 'bold')
plt.show()
In the graph, my hypothesis that customers with a higher education background have a higher income holds. The PhD class has a slightly higher median income than the other education classes, plus a few very high incomes (a couple of outliers). Basic has the lowest median income level, as I expected. The 2n Cycle class refers to customers who return to school after a few years of work, and their income is close to the Master class. Income is an important factor in determining a customer's purchasing behavior, so we will see it again in the modeling section.
My initial hypothesis that marital status has a positive impact on income does not really hold. People who got married and then divorced have a slightly higher income than people who are together or married. People in the 'Absurd' category have the highest income! This is an interesting finding, and there is no clear positive or negative correlation between marital status and income.
df1 = df.groupby(['Education_level','Marital_Status_all'])['Average_Spend_per_product'].mean().reset_index().sort_values('Average_Spend_per_product', ascending = False)
plt.figure(figsize=(10,7))
ax = sns.barplot(x = "Education_level", y = "Average_Spend_per_product", hue = "Marital_Status_all", data = df1,palette='Spectral')
plt.xticks()
for container in ax.containers: #Add number on the top of the plot
ax.bar_label(container)
ax.set_title('Average Spend per Product by Education Level and Marital Status',weight='bold', size = 15)
plt.show()
Average spend per product and income are positively correlated. As I expected, a higher education level (averaging more than 20 dollars per purchase) goes with a higher income level and a higher average spend. However, marital status is an interesting factor: people in a relationship with higher education spend less, while people in a relationship with lower education spend more. My educated guess is that people with higher education are more likely to be in a long-term relationship like marriage, so their average spend is lower, while people in college are more likely to be in a short-term relationship or dating, so the cost is higher.
df2 = df[['Education_level','Marital_Status_all','Income','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds']]
id_vars = ["Education_level",
"Marital_Status_all",
"Income"
]
f_df = pd.melt(df2,
id_vars = id_vars,
var_name="Category",
value_name="freq")
f_df.head()
result = f_df.groupby('Category')['freq'].sum().reset_index(name = 'Total_amount_spent').sort_values(by = 'Total_amount_spent')
fig = px.pie(result, values = 'Total_amount_spent', names = 'Category', width = 600, height = 600, template='seaborn')
# Add texts into each section and update the line to separate them
fig.update_traces(textposition = 'inside',
textinfo = 'percent + label',
hole = 0.8,
marker = dict(line = dict(color = 'white', width = 4)),
hovertemplate = 'Clients: %{value}')
# Add plot title
fig.update_layout(annotations = [dict(text = 'Percentage Spent <br>by Category',
x = 0.5, y = 0.5, font_size = 28, showarrow = False,
font_family = 'monospace',
font_color = 'Black')],
showlegend = False)
fig.show()
One of the most essential ways to understand customer personality is to see where customers spend their money. The graph validated my hypothesis: people spent 50% of their money on wines and 28% on meat. Customers spend the majority of their money on daily goods and groceries.
df3 = df[['Education_level','Marital_Status_all','Income','Number_of_Kids','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases']]
id_vars = ["Education_level",
"Marital_Status_all",
"Income",
"Number_of_Kids"
]
f_df = pd.melt(df3,
id_vars = id_vars,
var_name="Purchase_place",
value_name="Number_of_purchases")
f_df.head()
df4 = f_df.groupby(['Purchase_place','Number_of_Kids'])['Number_of_purchases'].sum().reset_index()
plt.figure(figsize=(10,7))
fig = sns.barplot(x = "Purchase_place", y = "Number_of_purchases", hue = "Number_of_Kids", data = df4,palette='Spectral')
plt.xticks(rotation=30)# Rotate the x-axis
fig.set_title('Distribution of Number of Purchases for Each Place Type and Number of Kids',weight='bold', size = 15)
plt.show()
From the graph, customers who have kids do have a higher purchase frequency, so purchase frequency and whether a customer has kids are positively correlated. However, the correlation does not simply add up: increasing the number of kids does not directly increase the number of purchases a customer makes.
Purchases made using a catalogue are the only channel where people without kids make more purchases than people with kids. Purchases made directly in stores are the most popular channel, and web purchases come second. The data was collected before Covid, so I assume this has changed by now; comparing pre-Covid and post-Covid behavior would be a good topic for future analysis. Purchases made with a discount are slightly lower than the other purchase channels for a couple of reasons. First, the time window of a discount is limited. Second, necessities are mandatory, so customers make those purchases regardless of a discount.
# Create the scatter plot
ax = sns.scatterplot(
data=df, x="Income", y="Total_Product_Spent_per_month", hue="High_freq", size="High_freq",
sizes=(20, 200), legend="full", palette='Spectral'
)
ax.set_xlim(1,200000)
ax.set_title('Income vs Total Product Spent per Month by Purchase Frequency',weight='bold', size = 15)
plt.show()
My hypothesis is that high-frequency purchasers have a higher income level and a higher total product spend per month. From the graph, we can see that the green dots represent high purchase frequency customers and their distribution sits slightly further to the right than the orange dots (low purchase frequency customers), so their income level is higher as a group. On the other hand, the majority of the green dots are in the upper section, which means a higher total product spend per month. Of course, there are some outliers: a few people earn little but spend a lot, and a few people earn more but spend very little.
# Melt data from wide to long format
df4 = df[['Education_level','Marital_Status_all','Income','Age','Response','AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4','AcceptedCmp5']]
id_vars = ["Education_level",
"Marital_Status_all",
"Income",
"Age"
]
f_df = pd.melt(df4,
id_vars = id_vars,
var_name="Campaign_num",
value_name="Accepted_offer")
# Map to change variable names
f_df['Campaign_num'] = f_df['Campaign_num'].map({
'Response':'AcceptedCmp6',
'AcceptedCmp1': 'AcceptedCmp1',
'AcceptedCmp2':'AcceptedCmp2',
'AcceptedCmp3':'AcceptedCmp3',
'AcceptedCmp4':'AcceptedCmp4',
'AcceptedCmp5':'AcceptedCmp5'
})
f_df.head()
def age_buckets(x):
if x < 30:
return '18-29'
elif x < 40:
return '30-39'
elif x < 50:
return '40-49'
elif x < 60:
return '50-59'
elif x < 70:
return '60-69'
elif x >=70:
return '70+'
else:
return 'other'
f_df['Age_range'] = f_df['Age'].apply(age_buckets)
df5 = f_df.groupby(['Campaign_num','Age_range'])['Accepted_offer'].sum().reset_index()
df5
plt.figure(figsize=(10,7))
fig = sns.barplot(x = "Campaign_num", y = "Accepted_offer", hue = "Age_range", data = df5,palette='Spectral')
fig.set_title('Distribution of Number of Accepted Offer for Each Campaign and Age Range',weight='bold', size = 15)
plt.show()
From the graph, at a high level there is a positive correlation between the campaign number and the number of accepted offers, so I will not reject my null hypothesis. AcceptedCmp6 is 1 if the customer accepted the offer in the last campaign, and more customers accepted the last campaign than the first campaign. The second campaign has the lowest acceptance, and different campaigns perform differently, but there is an increasing trend overall. Another interesting factor is the age of the customer. People aged between 40 and 49 have the highest rate of accepting an offer from a campaign, while customers older than 70 have the lowest rate. Older people have less interest in purchasing many products, so they are less likely to accept an offer from a marketing campaign. Customers between 30 and 50 have higher demand and try to save money, so their offer acceptance rate is higher than others'.
# Create the lmplot
sns.lmplot(x='Total_Product_Spent_per_month',y='Days_since_client',fit_reg=True,scatter_kws={"color":"orange","alpha":0.5
,"s":30},data=df)
plt.title('The correlation between Total Spend per Month and Days Since Becoming a Customer', fontsize=15,weight='bold')
plt.xlabel('The total product spent per month',size=15)
plt.ylabel('Days since a customer',size=15)
plt.show()
There is no obvious relationship between the total product spend per month and the number of days since becoming a customer, so I will reject my null hypothesis.
print('The correlation between the total product spend per month and Days since a customer is: '+ str(df['Days_since_client'].corr(df['Total_Product_Spent_per_month'])))
It turns out the correlation is positive but relatively low.
df['Income'] = np.where(df['Income'] > 120000, 120000, df['Income']) # Remove outlier
# Create the lmlplot
sns.lmplot(x='Total_Product_Spent_per_month',y='Income',fit_reg=True,scatter_kws={"color":"orange","alpha":0.5
,"s":30},data=df)
plt.title('The correlation between Total Spend per Month and Income Level', fontsize=15,weight='bold')
plt.xlabel('The total product spent per month',size=15)
plt.ylabel('Income',size=15)
plt.show()
The graph shows some extent of correlation, so I want to compute the actual correlation coefficient.
print('The correlation between the total product spend per month and Income is: ' + str(df['Income'].corr(df['Total_Product_Spent_per_month'])))
It turns out that the total product spend per month and income are positively correlated.
customer_churned = df_churn.loc[df_churn['Churn'] == 1]
customer_not_churned = df_churn.loc[df_churn['Churn'] == 0]
# Create density plot
plt.figure(figsize=(10,7))
ax = sns.kdeplot(customer_churned['MonthlyCharges'],
shade = True,palette='Spectral')
ax = sns.kdeplot(customer_not_churned['MonthlyCharges'],
ax =ax,palette='Spectral', shade= True)
ax.legend(["Churn","Not Churn"],loc='upper right') # Show legend (labels follow the plotting order: churned was plotted first)
ax.set_ylabel('Density')
ax.set_xlabel('Monthly Charges')
ax.set_title('Distribution of monthly charges by churn',weight='bold', size = 15)
plt.show()
The graph supports my hypothesis that the customer churn rate is heavily impacted by monthly charges. The churned customers' distribution is concentrated at higher monthly charges (roughly 60-100 dollars), while the non-churned distribution has a large peak at low monthly charges, around 20 dollars.
In summary, the higher the amount charged to the customer monthly, the higher the chance the customer appears to churn; monthly charges and churn look positively related.
# Create the kde plot for total charges
customer_churned = df_churn.loc[df_churn['Churn'] == 1]
customer_not_churned = df_churn.loc[df_churn['Churn'] == 0]
# Create density plot
plt.figure(figsize=(10,7))
ax = sns.kdeplot(customer_churned['TotalCharges'],
shade = True,palette='Spectral')
ax = sns.kdeplot(customer_not_churned['TotalCharges'],
ax =ax,palette='Spectral', shade= True)
ax.legend(["Churn","Not Churn"],loc='upper right') # Show legend (labels follow the plotting order: churned was plotted first)
ax.set_ylabel('Density')
ax.set_xlabel('Total Charges')
ax.set_title('Distribution of Total charges by churn',weight='bold', size = 15)
plt.show()
The graph supports my hypothesis that the customer churn rate is impacted by total charges. Both distributions are right-skewed and their patterns are similar to each other.
In summary, it seems that churn is higher when the total charges are lower.
churn = df_churn.groupby(['Contract', 'Churn']).size().reset_index(name = 'Number of Customers')
# set plot style: grey grid in the background:
sns.set(style="darkgrid")
# set the figure size
plt.figure(figsize=(10, 7))
fig = sns.barplot(x = "Contract", y = "Number of Customers", hue = "Churn", data = churn,palette='Spectral')
plt.xticks()
for container in fig.containers: #Add number on the top of the plot
fig.bar_label(container)
fig.set_title('Distribution of Churn by Contract Type',weight='bold', size = 15)
plt.show()
From the graph, the churn rate is highest when the customer's contract term is month-to-month and lowest when the contract term is two years. There is a strong negative correlation between the contract term and the customer churn rate: the longer the contract term, the lower the churn rate. This supports my hypothesis and matches my expectation: customers are less likely to churn when they have a longer contract versus a shorter one.
df6 = df_churn.groupby(['PaymentMethod','Churn']).count()['gender'].reset_index(name = 'Number of Customers')
churned = df6.loc[df6.Churn == 1].sort_values(by = 'Number of Customers')
# Create the bar chart
plt.figure(figsize=(10,7))
fig = sns.barplot(x = "Churn", y = "Number of Customers", hue = "PaymentMethod", data = churned,palette='Spectral')
fig.set_title('Customer Churn Rate by Different Payment Types',weight='bold', size = 15)
plt.xticks()
for container in fig.containers: #Add number on the top of the plot
fig.bar_label(container)
plt.show()
From the graph, the customer churn rate is heavily affected by the payment method. Among churned customers, automatic payments are associated with a significantly lower churn rate compared to non-automatic payments. This matches my expectation: a customer who sets up an automatic payment is more likely to keep paying for their subscription or other services and will not forget to pay their bills. Customers who pay with an electronic check have a significantly higher churn rate. Thus, the payment method is an important variable affecting the churn rate, and it is also important to include it in my model in the later section.
churn = df_churn[df_churn['Churn'] == 1].count()[0]
Not_churn = df_churn[df_churn['Churn'] == 0].count()[0]
# Set size
plt.figure(figsize=(10,7))
# Create pie chart
labels = ['Churn','Not Churn']
colors = ['coral','khaki']
explode = [0.1,0]
plt.pie([churn,Not_churn],labels = labels,autopct = '%.2f %%',pctdistance = 0.8,explode = explode, colors = colors)
plt.title('Customer Churn Rate Distribution',weight='bold', size = 15)
#draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.show()
The graph validated my hypothesis. In the data, about 73% of the customers do not churn and only about 27% churned, so the label is imbalanced. Since most customers do not churn, a model trained on this data will tend to predict the majority class (for example, always predicting "no churn" would already be about 73% accurate), and there would be more false negatives (Type II errors) in our predictions. It is therefore important to deal with this imbalance before building the model.
import matplotlib.ticker as mtick # For specifying the axes tick format
churn = df_churn.groupby(['SeniorCitizen','Churn']).size().unstack()
# Create the stacked bar plot
colors = ['#EEDC82','#EE7942']
ax = (churn.T*100.0 / churn.T.sum()).T.plot(kind='bar',
width = 0.2,
stacked = True,
rot = 0,
figsize = (10,7),
color = colors)
ax.yaxis.set_major_formatter(mtick.PercentFormatter()) # Set the y-axis to be XX%
ax.legend(loc='center',prop={'size':14},title = 'Churn') # Set the legend in the middle
ax.set_ylabel('% Customers')
ax.set_title('Churn by Seniority Level',size = 15, weight = 'bold')
# add the percentage labels on the stacked bar chart
for p in ax.patches:
width, height = p.get_width(), p.get_height()
x, y = p.get_xy()
ax.annotate('{:.0f}%'.format(height), (p.get_x()+.25*width, p.get_y()+.4*height),
color = 'white',
weight = 'bold',size =14)
# Reference to show the percentage on the bar: https://stackoverflow.com/questions/57453959/how-to-show-percentages-on-the-stacked-bar-chart-in-python-3-x
First, the SeniorCitizen variable indicates whether the customer is a senior citizen or not. The graph validated my hypothesis: senior citizens have almost double the churn rate of the younger population. I think this is a very interesting finding, and this variable could be an important factor influencing the customer churn rate.
The goal of this section is to use machine learning models to predict customer behavior and help us better target our customers in a more efficient and accurate way. I am planning to build three models: K-means clustering on the customer personality data, and logistic regression and random forest on the customer churn data.
Why a K-means model? I chose K-means for the customer personality data because the dataset is unlabeled and I would like to cluster different customers together to see their common personalities. K-means is less computationally expensive than other clustering techniques, requires less storage, and runs faster.
Why logistic regression and random forest? I chose LR and RF for the customer churn data because the dataset is labeled and I would like to predict customer churn (a binary variable, 0 or 1). LR and RF work well for classification, and they are straightforward and easy to implement.
In order to better target a customer, there are two general approaches, and the models map onto them as follows.
K-means clustering is able to segment customers into different clusters. The model will identify unique characteristics for different groups of customers, so I will be able to answer my opening question of how a company can differentiate customers and tailor different marketing strategies to sell its products. Thus, the K-means model will help me understand how to better target a customer.
LR and RF are able to predict the customer churn rate, so I will be able to understand which factors influence a customer to leave a product and become a churned customer. Thus, these two models will help me understand how a company can keep the retention rate high and reduce the churn rate.
K-means clustering: I am planning to use a customer's Education, Marital_Status, Income, Number_of_Kids, Total_Product_Spent_per_month, Total_Accept_Offer, Total_Purchase_per_month, Age, Average_Spend_per_product, and Deals_Purchases as my variables (they all come from my EDA analysis) for the clustering analysis. Clustering is an unsupervised machine learning method, so I do not need a y-variable.
Logistic Regression and Random Forest: Both models are supervised machine learning models. I will use churn as my dependent variable (y) and the rest of the variables as my independent variables. I am planning to use GridSearchCV from sklearn to tune the models and help reduce the number of features.
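As a rough sketch of how I might set up the random forest with GridSearchCV (only a sketch under assumptions: it reuses the over-sampled churn dataframe df_churn_final_over built in the resampling step later in this notebook, and the hyperparameter grid values are illustrative, not tuned):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumed input: the over-sampled, scaled churn dataframe with a binary 'Churn' column (created below)
X = df_churn_final_over.drop(columns=['Churn'])
y = df_churn_final_over['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# Illustrative hyperparameter grid; the real grid would be chosen during modeling
param_grid = {'n_estimators': [100, 300], 'max_depth': [5, 10, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Test accuracy:', grid.score(X_test, y_test))
The logistic regression walked through later in this section uses the same kind of train/test split.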
Reference:
# Variable Selection
selected_cols = ['Education_level', 'Marital_Status_all','Number_of_Kids',
'Total_Product_Spent_per_month','Total_Purchase_per_month', 'Total_Accept_Offer','Deals_Purchases',
'Age','Income','Average_Spend_per_product','Complain', ]
personality_df = df[selected_cols]
personality_df.head()
# Correlation Matrix
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
corr = personality_df.corr(method='pearson')
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
colors = ["#45b5aa","white", "#ff6f61"]
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)
# Create the heatmap
sns.heatmap(corr,
square=True,
mask=mask,
linewidth=2.5,
vmax=0.4, vmin=0,
cmap=cmap,
cbar=False,
ax=ax,annot=True)
ax.set_yticklabels(ax.get_xticklabels(), fontfamily='serif', rotation = 0, fontsize=11)
ax.set_xticklabels(ax.get_xticklabels(), fontfamily='serif', rotation=90, fontsize=11)
ax.spines['top'].set_visible(True)
# Add Fig Text
fig.text(0.97, 1.02, 'Correlation Analysis', fontweight='bold', fontfamily='serif', fontsize=18, ha='right')
fig.text(0.97, 0.912, 'Dataset : Customer Personality', fontweight='light', fontfamily='serif', fontsize=14, ha='right')
fig.text(0.97, 0.91, '''
Find out the correlation between variables;
there are eight factors in total.
Before building the model, I want to
find out whether some variables are highly correlated.''',
fontweight='light', fontfamily='serif', fontsize=14, va='top', ha='right')
plt.tight_layout()
plt.show()
From the graph, the results meet my expectation. A common rule of thumb is that a correlation of 0.75 (75%) or higher means two variables are highly correlated. As a result, we have two strongly positively correlated pairs: total purchases per month with total product spend per month, and income with total product spend per month. Intuitively, a customer with a higher income has higher purchasing power, which can lead to higher product spend, and more purchases naturally lead to a higher total product spend per month as well.
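For reference, the Pearson correlation shown in the heatmap above is
$$r_{XY} = \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\sqrt{\sum_i (y_i-\bar{y})^2}},$$
which ranges from -1 to 1; the 0.75 cutoff above is just a common rule of thumb, not a hard threshold.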
# Eigenvalues (and eigenvectors) of the correlation matrix
multicollinearity, V=np.linalg.eig(corr)
multicollinearity
From the eigenvalues, none of them are close to 0, which means there is no severe multicollinearity in the data.
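The reasoning behind this check: if the correlation matrix $R$ had an eigenvalue $\lambda_j \approx 0$ with eigenvector $v_j$, then the variance of the corresponding combination of standardized variables would be $v_j^{\top} R\, v_j = \lambda_j \approx 0$, i.e. that combination would be nearly constant and the variables nearly linearly dependent. Since none of the eigenvalues above is close to zero, no such near-dependence appears in this data.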
# Transform the categorical vars (select non-numerical columns)
categorical_cols = []
for i in personality_df.columns:
if personality_df[i].dtypes == 'object':
categorical_cols.append(i)
print(categorical_cols)
label_encoder = LabelEncoder()
for i in categorical_cols:
personality_df[i] = personality_df[[i]].apply(label_encoder.fit_transform)
personality_df.head()
There are three popular ways to scale data:
I will use standardization, so my results will contain negative values; I decided to use StandardScaler from scikit-learn.
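For reference, the standardization (z-score) applied by StandardScaler is
$$z = \frac{x - \mu}{\sigma},$$
which centers each feature at 0 with unit variance (hence the negative values), in contrast to min-max scaling, $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, which maps values into $[0, 1]$.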
scaled_features = StandardScaler().fit_transform(personality_df.values)
scaled_personality_df = pd.DataFrame(scaled_features, index=personality_df.index, columns=personality_df.columns)
scaled_personality_df[:5]
# # personality_df_scaled = (personality_df.
# sub(personality_df.mean(), axis=1).
# divide(personality_df.std(), axis=1))
# personality_df_scaled.head()
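The elbow method below plots the within-cluster sum of squares (WCSS, reported by scikit-learn as inertia_) for each candidate k. Formally, K-means tries to minimize
$$\text{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$
where $\mu_j$ is the centroid of cluster $C_j$; the "elbow" is the k after which adding more clusters only slightly reduces this quantity.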
# Apply Elbow method
wcss = []
for i in range(1, 11):
km = KMeans(n_clusters = i, random_state = 10)
km.fit(scaled_personality_df)
wcss.append(km.inertia_)
# Set up the title
plt.figure(figsize = (12, 8))
plt.title('The Elbow Method', size = 25, y = 1.03, fontname = 'monospace')
plt.grid(color = 'gray', linestyle = ':', axis = 'y', alpha = 0.8, zorder = 0, dashes = (1,7))
# Create the plot
a = sns.lineplot(x = range(1, 11), y = wcss, color = '#336b87', linewidth = 3)
sns.scatterplot(x = range(1, 11), y = wcss, color = '#336b87', s = 60, edgecolor = 'black', zorder = 5)
plt.ylabel('WCSS', size = 14, fontname = 'monospace')
plt.xlabel('Number of clusters', size = 14, fontname = 'monospace')
plt.xticks(size = 12, fontname = 'monospace')
plt.yticks(size = 12, fontname = 'monospace')
# Create the arrow
plt.annotate('''2 Could be an Optimal number
of clusters''', xy = (2.1, 18000), xytext = (2.5, 20000),
arrowprops = dict(facecolor = 'steelblue', arrowstyle = "->",color = 'black'),
fontsize = 13, fontfamily = 'monospace', ha = 'center', color = 'black')
plt.annotate('''3 Could be an Optimal number
of clusters''', xy = (2.9, 15000), xytext = (1.8, 10000),
arrowprops = dict(facecolor = 'steelblue', arrowstyle = "->",color = 'black'),
fontsize = 13, fontfamily = 'monospace', ha = 'center', color = 'black')
# Create the vline
plt.axvline(3, 1, 0, color = 'green')
plt.axvline(2, 1, 0)
plt.show()
From the elbow method, I could choose either 3 or 2 as my optimal number of clusters because I can see two "elbow points". In this situation, I will use a second method, the Silhouette Score, to help me determine the optimal K.
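The silhouette score for a point $i$ compares its mean distance to the other points in its own cluster, $a(i)$, with its mean distance to the points in the nearest other cluster, $b(i)$:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \in [-1, 1],$$
so values closer to 1 indicate well-separated clusters; the score plotted below is the average of $s(i)$ over all points for each candidate k.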
# Apply the silhouette score method
silhouette_scores = []
for i in range(2,10):
    SS = KMeans(n_clusters=i, random_state=0)
    c = SS.fit_predict(scaled_personality_df)
    silhouette_scores.append(silhouette_score(scaled_personality_df, c))
# Set up the title
plt.figure(figsize = (12, 8))
plt.title('The Silhouette Score Method', size = 25, y = 1.03, fontname = 'monospace')
plt.grid(color = 'gray', linestyle = ':', axis = 'y', alpha = 0.8, zorder = 0, dashes = (1,7))
# Create the plot
plt.bar(range(2,10), silhouette_scores,edgecolor = 'black', zorder = 5, color = '#336b87')
plt.xlabel('Number of clusters', fontsize = 12,fontname = 'monospace')
plt.ylabel('S(i)', fontsize = 12,fontname = 'monospace')
plt.xticks(size = 12, fontname = 'monospace')
plt.yticks(size = 12, fontname = 'monospace')
plt.show()
silhouette_scores
According to both methods, I decided to use 3 as my optimal number of clusters.
# K-Means
kmeans = KMeans(n_clusters=3, random_state=0).fit(scaled_personality_df)
pred = kmeans.predict(scaled_personality_df)
# Appending cluster values into main df
personality_df['Cluster'] = pred + 1
personality_df.head()
Next, I want to assign different names to the clusters based on their purchase frequency to indicate whether a customer is very active or not.
display(personality_df[personality_df['Cluster'] == 1]['Total_Purchase_per_month'].mean())
display(personality_df[personality_df['Cluster'] == 2]['Total_Purchase_per_month'].mean())
display(personality_df[personality_df['Cluster'] == 3]['Total_Purchase_per_month'].mean())
# Use map to rename cluster names
rename_clust = {
1: 'Normal Customer',
2: 'Active Customer',
3: 'Prospect Customer'
}
personality_df['Cluster'] = personality_df['Cluster'].map(rename_clust)
personality_df.head()
Active_client = personality_df[personality_df['Cluster'] == 'Active Customer'].count()[0]
Normal_client = personality_df[personality_df['Cluster'] == 'Normal Customer'].count()[0]
Prospect_client = personality_df[personality_df['Cluster'] == 'Prospect Customer'].count()[0]
# change the default font family
plt.rcParams.update({'font.family':'monospace'})
# Set size
plt.figure(figsize=(12,8))
# Create pie chart
labels = ['Active_Customer','Normal_Customer', 'Prospect_Customer']
colors = ['#FFD700','#336b87','#45b5aa']
explode = [0.02,0.02,0.02]
plt.pie([Active_client,Normal_client,Prospect_client],labels = labels,autopct = '%.2f %%',pctdistance = 0.8,
explode = explode, colors = colors, textprops={'fontsize': 12, 'weight': 'bold'})
plt.title('Number of Customers by Cluster Types',weight='bold', size = 15,fontname = 'monospace')
#draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.show()
After running the k-means clustering on the customer personality data, we have three clusters that differentiate the types of customers: about 42% are active customers based on their purchase frequency, about 56% are normal customers, and about 3% are prospect customers for the future.
# Create the bar chart
fig = plt.figure(figsize = (13, 6))
plt.title('Which clients take part in the promotions the most?', size = 25, x = 0.4, y = 1.1, fontname = 'monospace',fontweight='bold')
fig = sns.barplot(data = personality_df.groupby(['Cluster']).agg({'Total_Accept_Offer': 'sum'}).reset_index(),
x = 'Total_Accept_Offer', y = 'Cluster', color = 'gold')
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 16, color = 'black')
plt.xlabel('Number of Customers',fontname = 'monospace',fontsize =14)
plt.ylabel('Clusters',fontname = 'monospace',fontsize =14)
# new helper method to auto-label bars
fig.bar_label(fig.containers[0],fontweight='bold', fontsize =15,fontname = 'monospace')
# Assign color to different bars
for i in fig.patches:
if i.get_width() == 239:
i.set_color('#336b87')
elif i.get_width() == 754:
i.set_color('#FFD700')
else:
i.set_color('#45b5aa')
From the graph, we can see that the majority of accepted campaign offers come from active clients, so active clients are the most willing to take part in promotions. At a high level, our data confirms the importance of marketing campaigns (including running multiple campaigns). For a company, marketing campaigns can make customers more active and potentially lead to more purchases. For a detailed marketing strategy, a company needs to use its specific customer data, industry data, and product data to tailor and customize its campaign strategy.
# Create the bar chart
fig = plt.figure(figsize = (13, 6))
plt.title('Which clients take part in the Complains the most?', size = 25, x = 0.4, y = 1.1, fontname = 'monospace',fontweight='bold')
fig = sns.barplot(data = personality_df.groupby('Cluster').count()['Complain'].reset_index(name ='count'),
x = 'count', y = 'Cluster', color = 'gold')
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 16, color = 'black')
plt.xlabel('Number of Customers',fontname = 'monospace',fontsize =14)
plt.ylabel('Clusters',fontname = 'monospace',fontsize =14)
# new helper method to auto-label bars
fig.bar_label(fig.containers[0],fontweight='bold', fontsize =15,fontname = 'monospace')
# Assign color to different bars
for i in fig.patches:
if i.get_width() == 1243:
i.set_color('#336b87')
elif i.get_width() == 939:
i.set_color('#FFD700')
else:
i.set_color('#45b5aa')
From the graph, we can see that normal clients actually have more complaints than active clients. In this data, a complaint means the customer left negative comments on the products or returned a purchased product. At a high level, our data confirms the importance of customer service. The majority of complaints come from normal clients, so addressing their questions and acting on their complaints is key to a company's success.
sns.set(style="ticks")
fig, axs = plt.subplots(1,3)
sns.catplot(x='Marital_Status_all', hue='Marital_Status_all', col='Cluster',
data=personality_df, kind='count',palette='Paired',ax=axs[0])
sns.catplot(x='Number_of_Kids', hue='Number_of_Kids', col='Cluster',
data=personality_df, kind='count',palette='Paired',ax=axs[1])
sns.catplot(x='Education_level', hue='Education_level', col='Cluster',
data=personality_df, kind='count',palette='Paired',ax=axs[2])
plt.close(0)
plt.close(1)
#https://stackoverflow.com/questions/56937029/plot-two-catplots-in-same-figure-using-python
In the graphs above, I am trying to understand the differences between the three clusters based on three characteristics (education, number of kids, and marital status). I will put more detailed information in the later summary section.
# Create the kde plot for total purchases per month by cluster
Normal_client = personality_df.loc[personality_df['Cluster'] == 'Normal Customer']
Active_client = personality_df.loc[personality_df['Cluster'] == 'Active Customer']
Prospect_client = personality_df.loc[personality_df['Cluster'] == 'Prospect Customer']
# Create density plot
plt.figure(figsize=(12,8))
ax = sns.kdeplot(Normal_client['Total_Purchase_per_month'],
shade = True,palette='Spectral')
ax = sns.kdeplot(Active_client['Total_Purchase_per_month'],
ax =ax,palette='Spectral', shade= True)
ax = sns.kdeplot(Prospect_client['Total_Purchase_per_month'],
ax =ax,palette='Spectral', shade= True)
ax.legend(["Normal Client","Active Client","Prospect Client"],loc='upper right') # Show legend
ax.set_ylabel('Density',fontname = 'monospace')
ax.set_xlabel('Total Purchase per Month',fontname = 'monospace')
ax.set_title('Distribution of Number of Purchase per Month by Clusters',weight='bold', size = 15,fontname = 'monospace')
plt.show()
In the graph, we can see there is a big difference in the number of purchases per month across the three clusters. The distributions for normal and prospect clients are very similar to each other; however, active clients have a much higher purchase frequency. Again, as mentioned before, our data shows a big gap between active and normal clients. A company needs to study its active customers to tailor its marketing strategy and better target them.
# Create Income-by-cluster boxplot
plt.figure(figsize=(10,7)) # Reset the plot showing size
ax = sns.boxplot(x="Cluster", y="Income",data=personality_df, dodge=True,palette="Spectral")
ax.set_ylabel('Income',fontname = 'monospace',weight = 'bold')
ax.set_xlabel('Clusters',fontname = 'monospace',weight = 'bold')
plt.title('Income by Clusters', fontsize=15, weight = 'bold',fontname = 'monospace')
plt.show()
From the graph, active clients have a much higher income level compared to the other two clusters. Income level is one of the most important factors determining customer behavior. Therefore, I think it's very important to have a specific marketing strategy targeting high-income people, because their income level gives them the potential to become more active clients.
def age_buckets(x):
if x < 30:
return '18-29'
elif x < 40:
return '30-39'
elif x < 50:
return '40-49'
elif x < 60:
return '50-59'
elif x < 70:
return '60-69'
elif x >=70:
return '70+'
else:
return 'other'
personality_df['Age_range'] = personality_df['Age'].apply(age_buckets)
df5 = personality_df.groupby(['Cluster','Age_range'])['Age'].count().reset_index()
df5
plt.figure(figsize=(10,7))
fig = sns.barplot(x = "Cluster", y = "Age", hue = "Age_range", data = df5,palette='Spectral')
fig.set_title('Distribution of Ages for Clusters', size = 15,weight = 'bold',fontname = 'monospace')
fig.set_ylabel('Age',fontname = 'monospace',weight = 'bold')
fig.set_xlabel('Clusters',fontname = 'monospace',weight = 'bold')
plt.show()
This graph shows the distribution of ages for the different clusters. Again, I will put the details in my summary section.
Characteristics of Active Customers:
In terms of Education: highly active customers come from an undergraduate background.
In terms of Marital_Status: the number of people in a relationship is slightly lower than the number of single people.
In terms of Income: the income of highly active customers is much higher than that of other customers.
In terms of Kids: active customers have fewer children than normal customers (most have no children).
In terms of Expenses: active customers spend more than normal customers.
In terms of Age: the most common age range is between 35 and 60.
Characteristics of Normal Customers:
In terms of Education: normal customers also mostly come from an undergraduate background.
In terms of Marital_Status: the number of people in a relationship is nearly two times lower than the number of single people.
In terms of Income: the income of normal customers is lower than that of active customers.
In terms of Kids: normal customers have more children than other customers (an average of 1 child).
In terms of Expenses: normal customers spend less than active customers.
In terms of Age: the most common age range is between 40 and 50.
Characteristics of Prospect Customers:
In terms of Education: prospect customers come from a higher education background (PhD or Master).
In terms of Marital_Status: the number of people in a relationship is about equal to the number of single people.
In terms of Income: the income of prospect customers is very low.
In terms of Kids: only a few of these customers have a child.
In terms of Expenses: prospect customers spend very little.
In terms of Age: the most common age range is between 15 and 30.
The basic idea of logistic regression is to estimate the probability of belonging to each class. In the binary case, we get an estimate of P(y=1), the probability of belonging to class 1, and then set a cut-off point to assign classes.
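Concretely, the model estimates
$$P(y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}},$$
and a customer is classified as churned when this probability exceeds the chosen cut-off (0.5 by default in scikit-learn).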
Reference:
df_churn_copy = df_churn.copy()
df_churn_copy[:5]
df_churn_copy['MultipleLines'].value_counts()
df_churn_copy['MultipleLines'] = df_churn_copy['MultipleLines'].apply(lambda x: 0 if x == 'No' else 1)
df_churn_copy['MultipleLines'].value_counts()
#Make dummy variables
columns = ['InternetService','Contract','PaymentMethod']
df_churn_copy_trans = pd.get_dummies(df_churn_copy, columns = columns, drop_first=True)
df_churn_copy_trans.head()
# Using min-max method to scale data so the data result will be between 0 and 1
def min_max_scaling(series):
return (series - series.min()) / (series.max() - series.min())
df_churn_final = min_max_scaling(df_churn_copy_trans)
df_churn_final.head()
Balance the Dataset
I would like to check my dependent variable because I want to avoid an unbalanced target variable.
Unbalanced classes: in a dataset with highly unbalanced classes, a classifier can effectively "predict" the most common class without learning anything from the features, which produces a deceptively high accuracy score and a misleading confusion matrix.
churn_count = df_churn_final['Churn'].value_counts()
print('Class 0:', churn_count[0])
print('Class 1:', churn_count[1])
print('Proportion:', round(churn_count[0] / churn_count[1], 2), ': 1')
churn_count.plot(kind='bar', title='Count (Churn Rate)');
Resampling
Over-sampling: adding more examples from the minority class. Random over-sampling duplicates records from the minority class, which can cause overfitting.
Under-sampling: removing examples from the majority class. Random under-sampling drops records from the majority class, which can cause loss of information.
In this case, I decided to use over-sampling: rather than only chasing the highest accuracy score, I would like to keep as much information in the model as possible (under-sampling would throw data away).
# Separate classes
count_class_0, count_class_1 = df_churn_final['Churn'].value_counts()
# Divide by class
df_target_0 = df_churn_final[df_churn_final['Churn'] == 0]
df_target_1 = df_churn_final[df_churn_final['Churn'] == 1]
# Random Over-sampling
df_class_1_over = df_target_1.sample(count_class_0, replace=True)
df_churn_final_over = pd.concat([df_target_0, df_class_1_over], axis=0)
print('Random Over-sampling:')
print(df_churn_final_over['Churn'].value_counts())
df_churn_final_over['Churn'].value_counts().plot(kind='bar', title='Count (Churn Rate)');
plt.show()
# Random Under-sampling
# df_class_0_under = df_target_0.sample(count_class_1)
# df_test_under = pd.concat([df_class_0_under, df_target_1], axis=0)
From the graph, the data is now balanced for our target variable after we performed the random over-sampling.
Logistic Regression
y = df_churn_final_over['Churn'].values
X = df_churn_final_over.drop(columns = ['Churn'])
# Create Train & Test Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# Running logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
result = model.fit(X_train, y_train)
# Using model to predict
from sklearn import metrics
prediction_test = model.predict(X_test)
# Print the prediction accuracy
print(metrics.accuracy_score(y_test, prediction_test))
# Add constant to see model stats
from statsmodels.tools import add_constant as add_constant
df_churn_final_constant = add_constant(df_churn_final_over)
df_churn_final_constant.head()
# Change the order of columns
col_order = ["const","gender", "SeniorCitizen","Partner","Dependents",'InternetService_Fiber optic','InternetService_No',
'Contract_One year','Contract_Two year','PaymentMethod_Credit card (automatic)','PaymentMethod_Electronic check',
'PaymentMethod_Mailed check','tenure','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup','DeviceProtection',
'TechSupport','StreamingTV','StreamingMovies','PaperlessBilling','MonthlyCharges','TotalCharges','Churn']
df_churn_final_constant = df_churn_final_constant.reindex(columns=col_order)
df_churn_final_constant.head()
# Plot the stats table
import scipy.stats as st
import statsmodels.api as sm
st.chisqprob = lambda chisq, df: st.chi2.sf(chisq, df)
cols=df_churn_final_constant.columns[:-1]
model=sm.Logit(df_churn_final_constant.Churn,df_churn_final_constant[cols])
result=model.fit()
result.summary()
The table shows that some variables have P-values higher than the preferred alpha (5%), indicating a weak statistically significant relationship with the probability of churn. I would like to remove those variables to make the model easier to interpret.
I will use the Backward Elimination approach: remove the attribute with the highest P-value, re-run the regression, and repeat until all remaining variables have P-values below 0.05.
def back_feature_elem(data_frame, dep_var, col_list):
    """Takes in the dataframe, the y-variable and a list of column names (except y),
    runs the regression repeatedly, eliminating the feature with the highest
    P-value above 0.05 one at a time, and returns the regression summary
    with all p-values below alpha."""
    while len(col_list) > 0:
        model = sm.Logit(dep_var, data_frame[col_list])
        result = model.fit(disp=0)
        largest_pvalue = round(result.pvalues, 3).nlargest(1)
        if largest_pvalue.iloc[0] < 0.05:
            return result
        else:
            col_list = col_list.drop(largest_pvalue.index)
result=back_feature_elem(df_churn_final_constant,df_churn_final_constant.Churn,cols)
result.summary()
# Showing more stats information
# params = np.exp(result.params)
# conf = np.exp(result.conf_int())
# conf['OR'] = params
# pvalue=round(result.pvalues,3)
# conf['pvalue']=pvalue
# conf.columns = ['CI 95%(2.5%)', 'CI 95%(97.5%)', 'Odds Ratio','pvalue']
# print ((conf))
After the backward elimination, only the variables that are statistically significant (P-value less than 5%) remain.
We can see that some variables have a negative relation to our predicted variable (Churn), while others have a positive relation. A negative relation means that the likelihood of churn decreases with that variable, and vice versa for a positive relation.
Variable interpretation example (contract term) for logistic regression:
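Here is a minimal sketch of how I read an individual coefficient: exponentiating a logistic regression coefficient gives an odds ratio relative to the baseline category (for Contract, the baseline is the month-to-month level dropped by get_dummies). The sign discussed in the comment is only an assumption for illustration, and if the variable was dropped during backward elimination the lookup below simply returns None.
import numpy as np
# Odds ratios: exponentiate the fitted logistic regression coefficients
odds_ratios = np.exp(result.params)
# Example reading: if the coefficient on 'Contract_Two year' is negative,
# its odds ratio is below 1, i.e. two-year customers have lower odds of
# churning than month-to-month customers, holding other variables fixed.
print(odds_ratios.get('Contract_Two year'))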
Takeaways from the model:
In summary, a customer who has a longer tenure (number of months the customer has stayed with the company), has the Online Security and Tech Support services, has lower monthly charges, is on a longer contract term, and uses automatic payment methods (Bank transfer (automatic), Credit card (automatic)) is less likely to churn; those factors decrease the probability of churn.
On the opposite side, a customer who has higher total charges, uses paperless billing, has multiple lines (multiple phone service lines), pays with traditional payment methods (Electronic check, Mailed check), or is a senior citizen is more likely to churn; those factors increase the probability of churn.
One particularly interesting factor is that fiber optic internet service is associated with a higher churn rate. This is interesting because, although fiber optic service is faster, customers who have it are still more likely to churn.
The basic idea behind the random forest is to use bootstrapping to create many training subsets, fit a single decision tree on each subset, and then let the trees vote: the majority vote across all decision trees determines the final classification.
y = df_churn_final_over['Churn'].values
X = df_churn_final_over.drop(columns = ['Churn'])
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
model_rf = RandomForestClassifier(n_estimators=1000 , oob_score = True, n_jobs = -1,
random_state =50, max_features = "auto",
max_leaf_nodes = 30)
model_rf.fit(X_train, y_train)
# Make predictions
prediction_result= model_rf.predict(X_test)
print (metrics.accuracy_score(y_test, prediction_result))
importances = model_rf.feature_importances_
weights = pd.Series(importances,
index=X.columns.values)
weights.sort_values()[-10:].plot(kind = 'barh')
After running the random forest, I can see that it gives higher prediction accuracy (78%) than the logistic regression (76%). One of the most important reasons I chose a random forest is that the model can tell us which features matter most. From the graph, all of these are important features that influence whether a customer will churn. The top 3 features (variables) are tenure, contract term, and total charges, which means the number of months a customer has stayed with the company, the customer's contract term, and the total amount the customer has been charged are the most important factors influencing churn. In general, encouraging customers onto longer contract terms, keeping their total charges reasonable, and retaining them long enough to build tenure should have a positive impact on retention.
A company can grow its customer base in two ways: attracting new customers and keeping existing ones.
I would like to keep my conclusions high-level, covering not only this particular dataset but also my findings about the industry and general patterns.
In order to achieve the first business goal, a company has to better target its customers and understand the differences between customer groups. Since resources and money are limited, I built a k-means clustering analysis to group customers based on their characteristics. At a high level, we can divide customers into three groups: active, normal, and future prospect clients. I want to keep my recommendations general rather than specific to this dataset. At a high level, there are four conclusions for attracting more customers:
Customers can be divided into three to four groups: active, normal, and future customers. A company needs to separate them and develop different marketing strategies, targeting each group based on its personality traits and historical purchase patterns.
Marketing campaigns are important. Of course, there is more analysis to do within marketing campaigns: a company needs to use the data collected from its own products and customers to better target those customers and measure campaign performance.
Customer service is important. From our data, we can see that more complaints come from normal clients than from active clients. My conclusion is that if a company can provide excellent customer service, it is more likely to satisfy customers at a high level, turn normal customers into active customers, and attract more future customers as well.
In general, an active customer is likely to have a high income level, a good education background, and a willingness to respond to promotions from marketing campaigns, and is unlikely to have a big family (usually one or no kids).
In order to keep existing customers, a company needs to reduce its churn rate as much as possible. There are many internal and external factors that can cause a customer to churn. My approach was to build a logistic regression model and a random forest model to predict churn behavior, retain customers, and identify factors that could potentially reduce the churn rate. Based on the model results, the conclusions are below:
The top 3 features (variables) driving customer churn are tenure, contract term, and total charges, i.e., the number of months the customer has stayed with the company, the customer's contract term, and the total amount the customer has been charged.
In general, a firm that encourages customers to sign longer contract terms, keeps their total charges reasonable, and retains customers long enough to build tenure should see a positive impact on customer retention (a lower churn rate).
In summary, a customer who has a longer tenure (number of months the customer has stayed with the company), has the Online Security and Tech Support services, has lower monthly charges, is on a longer contract term, and uses automatic payment methods (Bank transfer (automatic), Credit card (automatic)) is less likely to churn; those factors decrease the probability of churn.
On the opposite side, a customer who has higher total charges, uses paperless billing, has multiple lines (multiple phone service lines), pays with traditional payment methods (Electronic check, Mailed check), or is a senior citizen is more likely to churn; those factors increase the probability of churn.
Customer personality analysis is a very hard and broad topic nowadays. My project mainly focuses on providing high-level suggestions and findings, because I firmly believe each company has different customer segments and a different market position in its industry, and different industries vary even more. A good customer personality analysis for a specific company needs to analyze its customer groups, product performance and reviews, future customer acquisition, marketing performance, etc. In summary, this is a very broad and complicated topic, but I think this is the "Real Money," because combining modern technology and data lets us drive our firms forward and better target customers.
I found some interesting resources in this field; feel free to check them out!