• Predicting Customer Churn in Telecom

    Classification & Clustering in R

  • DATASET INFO

    The dataset contains a sample of 3333 observations, sourced from the portfolio of customers of a telecommunications company.


    The dependent variable of the model is the Churn outcome observed during the previous period. Churn is a binary categorical variable with two possible outcomes: 0 = the customer left the company, 1 = the customer stayed with the company.


    The dataset also contains 20 independent variables that describe usage during the previous period (behavioral data) along with some demographics.

    THEMES

    R, Machine Learning, Decision Tree, Random Forest, XGBoost, AdaBoost, SVM, KNN and Naive Bayes

    PROJECT DESCRIPTION

    In this study, we set out to build a predictive model that identifies whether a customer is likely to switch telecommunications providers (Churn) or stay with the company. We started with a Logistic Regression classifier and then moved on to Decision Tree, Random Forest, XGBoost, AdaBoost, SVM, KNN and Naive Bayes. The best predictive model we found was XGBoost, which correctly identifies almost all the non-churners and the vast majority of the churners. The Decision Tree model trailed closely and is easier to interpret and to apply to real business problems.
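    The evaluation idea described above (how many churners and non-churners a classifier identifies correctly) can be sketched with a Logistic Regression baseline in base R. This is only an illustration under stated assumptions: the data are simulated, the column names `day_minutes`, `service_calls` and `stayed` are hypothetical, and the 70/30 split and 0.5 cutoff are arbitrary choices rather than the settings used in the project.

```r
# Simulated stand-in for the churn data; column names are hypothetical.
set.seed(42)
n <- 1000
sim <- data.frame(
  day_minutes   = rnorm(n, mean = 180, sd = 50),
  service_calls = rpois(n, lambda = 1.5)
)

# Coding follows the dataset description: 0 = customer left, 1 = customer stayed.
log_odds_stay <- 4 - 0.01 * sim$day_minutes - 0.6 * sim$service_calls
sim$stayed <- rbinom(n, size = 1, prob = plogis(log_odds_stay))

# Hold out 30% of the rows for evaluation (assumed split).
test_idx <- sample(n, size = round(0.3 * n))
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Logistic Regression baseline.
fit <- glm(stayed ~ day_minutes + service_calls, data = train, family = binomial)

# Predicted probability of staying, turned into a class with a 0.5 cutoff.
prob_stay <- predict(fit, newdata = test, type = "response")
pred   <- factor(as.integer(prob_stay > 0.5), levels = c(0, 1))
actual <- factor(test$stayed, levels = c(0, 1))

# Confusion matrix: rows are actual classes, columns are predictions.
conf_mat <- table(actual = actual, predicted = pred)

# Share of churners (0) and of non-churners (1) identified correctly.
churner_recall <- conf_mat["0", "0"] / sum(conf_mat["0", ])
stayer_recall  <- conf_mat["1", "1"] / sum(conf_mat["1", ])

print(conf_mat)
```

    Reporting the two per-class rates separately, rather than overall accuracy alone, matters when churners are the minority class: a model can score well overall while missing most of them.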


    Cluster Analysis, on the other hand, was more challenging. The Hierarchical Clustering methods we used were not very effective: using the Mahalanobis distance and the Gower distance, we produced 2 clusterings with Silhouette values equal to 0.2, which points to weak cluster structure. The K-Means method gave slightly better results, especially when applied to Principal Components with 4 clusters.
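    As a rough sketch of the K-Means step, the snippet below runs K-Means with 4 clusters on the first principal components and computes the average Silhouette value. It assumes simulated numeric data, the choice of two components is arbitrary, and `silhouette()` comes from the `cluster` package (a recommended package included in standard R installations); none of this is taken from the project's actual code.

```r
library(cluster)  # provides silhouette()

set.seed(1)

# Simulated stand-in data: four loose groups in two variables
# plus three pure-noise variables (all hypothetical).
centers <- matrix(c(0, 0, 4, 0, 0, 4, 4, 4), ncol = 2, byrow = TRUE)
grp <- rep(1:4, each = 50)
x <- cbind(
  centers[grp, ] + matrix(rnorm(400), ncol = 2),
  matrix(rnorm(200 * 3), ncol = 3)
)

# Principal components on standardized variables; keep the first two (assumed).
pca <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# K-Means with 4 clusters and several random restarts.
km <- kmeans(scores, centers = 4, nstart = 25)

# Silhouette width per observation, then the average over all observations.
sil <- silhouette(km$cluster, dist(scores))
avg_sil <- mean(sil[, "sil_width"])

print(avg_sil)
```

    The average Silhouette value lies between -1 and 1, so it gives a common scale for comparing different cluster counts or distance measures on the same data.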

    CODE

    Code in R is available here

  • SHORT PRESENTATION