Churn prediction is one of the classic problems in predictive analytics: using historical data, one tries to predict whether a person is going to churn. The problem has been approached in various ways and in various contexts. Two common contexts are employee churn prediction and subscriber churn prediction in telecom. In the former, one tries to predict whether a particular employee is going to resign. In the latter, one tries to find out whether a subscriber is going to leave the network. In the telecom sector the problem is becoming increasingly significant, because the statistics reveal unpleasantly high numbers of people churning. Churn prediction in the literature has been dominated by logistic regression and, more recently, SVMs (Support Vector Machines).
Here, it is approached as a binary classification problem. Wouldn’t it be nice to be able to tell after how many days someone is going to resign? But I am not sure about attempting that problem, or whether we even have data of the quality needed to answer such questions.
Therefore, let us stick to the original binary problem: will a person churn or not?
Churn prediction is a heavily class-imbalanced problem. In general, there are few churners and a large number of non-churners. To counter this, an intelligent sampling scheme such as SMOTE or stratified down-sampling is required. The sampling methodologies are not discussed in detail here.
A typical flow for any classification is illustrated in the self-explanatory figure below.

Feature Selection:
Feature selection is one of the most important steps in determining the quality of the learnt model. Hence, more time should be dedicated to feature selection than to tuning parameters in search of better models. But the fundamental question is how to do feature selection. Initially, one needs to start heuristically and make good use of the literature. Then, one needs to determine statistically whether any of the features are correlated with each other. A high positive or negative correlation between two features is undesirable because the two are largely redundant, so one feature of each such pair can be dropped. Secondly, one could attempt PCA or ICA to identify strong and weak features, eliminate the weak ones accordingly and add further features from the data. Another interesting way to determine whether the feature set is sufficient is to follow these steps:
- Divide the sampled data into train and test sets
- Add half of the misclassified test samples to the training data
- Train the model on this new training set with the same set of parameters
- Test the model on the rest of the test data to check for improvements
If the predictions do not improve noticeably, then this is about the best that can be achieved with this feature set, and more discriminating features are required.
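The steps above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB used later in this post; the nearest-mean classifier and the tiny one-feature data set are hypothetical stand-ins, and `train` and `predict` are placeholders for whatever learner is actually used.

```python
def accuracy(model, samples, predict):
    """Fraction of (x, label) samples that the model classifies correctly."""
    return sum(predict(model, x) == y for x, y in samples) / len(samples)

def sufficiency_check(train_set, test_set, train, predict):
    """Return accuracy on the untouched test samples before and after
    half of the misclassified test samples are moved into training."""
    model = train(train_set)
    wrong_idx = [i for i, (x, y) in enumerate(test_set) if predict(model, x) != y]
    moved_idx = set(wrong_idx[: len(wrong_idx) // 2])
    moved = [test_set[i] for i in sorted(moved_idx)]
    rest = [s for i, s in enumerate(test_set) if i not in moved_idx]
    before = accuracy(model, rest, predict)
    retrained = train(train_set + moved)  # same learner, same parameters
    after = accuracy(retrained, rest, predict)
    return before, after

# Hypothetical toy learner: assign the label whose class mean is nearest.
def train_nearest_mean(samples):
    labels = {y for _, y in samples}
    return {lab: sum(x for x, y in samples if y == lab) /
                 sum(1 for _, y in samples if y == lab) for lab in labels}

def predict_nearest_mean(means, x):
    return min(means, key=lambda lab: abs(x - means[lab]))

train_set = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
test_set = [(2.0, 1), (2.2, 1), (0.5, 0), (4.0, 1), (1.0, 0), (2.4, 1)]
before, after = sufficiency_check(train_set, test_set,
                                  train_nearest_mean, predict_nearest_mean)
print(before, after)
```

If `after` is not clearly higher than `before`, the extra training samples did not help, which points to the features rather than the amount of data as the limiting factor.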
A combination of dimensionality reduction, correlation analysis and the above methodology could be used for effective feature selection.
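As an illustration of the correlation step, here is a small Python sketch that greedily drops any feature whose absolute Pearson correlation with an already kept feature reaches a threshold. The feature names, values and the 0.9 threshold are made up for the example and are not taken from the actual data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already kept feature is below
    the threshold. `columns` maps feature name -> list of values."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical features: "minutes" is almost proportional to "calls".
features = {
    "calls": [10, 20, 30, 40, 50],
    "minutes": [21, 39, 62, 80, 101],
    "complaints": [3, 0, 4, 1, 2],
}
print(drop_correlated(features))  # ['calls', 'complaints']
```

Which member of a correlated pair survives depends on the order of the columns, so in practice one would put the features believed to be more reliable first.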
Sampled Data:
The data has been randomly down-sampled in a stratified manner. All the data points with label “1”, called positive samples, are kept in the sample. The negative samples, i.e. those with label “0”, are down-sampled to counter the huge imbalance. Three datasets are created, namely:
- D1: It contains an equal number of positive and negative samples
- D4: Negative samples are 4 times the positive samples
- D8: Negative samples are 8 times the positive samples
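A minimal Python sketch of how such datasets could be built is shown below. The rows are synthetic placeholders (the real data is confidential), and the function name, the fixed seed and the final shuffle are choices made for this example only.

```python
import random

def downsample(rows, ratio, seed=0):
    """Keep every positive row and `ratio` times as many random negatives.

    `rows` is a list of (features, label) pairs with label 1 or 0.
    """
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    wanted = min(len(negatives), ratio * len(positives))
    rng = random.Random(seed)
    sample = positives + rng.sample(negatives, wanted)
    rng.shuffle(sample)  # so positives are not all at the start of the file
    return sample

# Synthetic stand-in data: 10 churners and 200 non-churners.
rows = [([i], 1) for i in range(10)] + [([i], 0) for i in range(200)]
datasets = {f"D{k}": downsample(rows, k) for k in (1, 4, 8)}
for name, data in datasets.items():
    n_pos = sum(label for _, label in data)
    print(name, n_pos, len(data) - n_pos)
# D1 10 10
# D4 10 40
# D8 10 80
```

The shuffle matters if the file is later split into train and test sets by row position, since otherwise one of the two parts could end up with almost no positives.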
Experiments:
All the experiments here are carried out using LIBSVM, chosen for the quality of its results and for its licensing. Experiments comparing SVM with logistic regression were also tried out but are not explained in this post. The logistic regression experiments used Mahout SGD, but the results were inferior to SVM and so this approach was not considered further. The SVM models are built with LIBSVM on the sampled data described above.
Here, the RBF kernel is used without loss of generality. C > 0 is the penalty parameter of the error term, and Gamma > 0 is a kernel parameter.
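To make the role of Gamma concrete, the RBF kernel computes exp(-Gamma * ||x - y||^2) for two feature vectors x and y. The short Python sketch below only illustrates that formula; it is not how LIBSVM evaluates the kernel internally, and the sample vectors are made up.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

x = [1.0, 2.0]
y = [2.0, 4.0]
for gamma in (0.125, 1.0, 8.0):
    # A larger gamma makes the similarity fall off faster with distance.
    print(gamma, rbf_kernel(x, y, gamma))
```

A small Gamma therefore gives a smooth decision boundary, while a large Gamma lets each support vector influence only its immediate neighbourhood, which is why Gamma has to be tuned together with C.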
Ideally, an iterative grid search should be run to find the best values of C and Gamma. The code for a single pass of such a grid search is given below:
clear; clc;
addpath('/home/rahulkm/libsvm-3.14/matlab');
% Prepare sparse representation of data
data = csvread('sampleD4.csv');
labels = data(:,13);       % column 13 holds the class label
features = data(:,2:12);   % columns 2 to 12 hold the features
features_sparse = sparse(features);
% Write an 80-20 train/test split in LIBSVM format
splitParam = round(0.8*size(labels,1));
libsvmwrite('D4.train', labels(1:splitParam), features_sparse(1:splitParam,:));
libsvmwrite('D4.test', labels(splitParam+1:end), features_sparse(splitParam+1:end,:));
% Scale the training data and apply the same scaling to the test data.
% svm-scale is a command-line tool, so it is called through system().
system('/home/rahulkm/libsvm-3.14/svm-scale -l 0 -s range1 D4.train > D4.tr.scale');
system('/home/rahulkm/libsvm-3.14/svm-scale -r range1 D4.test > D4.tt.scale');
% Read training and test data
[tr_lbl, tr_f] = libsvmread('D4.tr.scale');
[tt_lbl, tt_f] = libsvmread('D4.tt.scale');
% Find optimal C and Gamma
% The below ideally needs to be done again and again to narrow down C and Gamma.
% Here it is done only once for illustration.
log2c = -3:8;
log2g = -3:8;
bestAcc = 0;
accuracy = nan(numel(log2c), numel(log2g));
for i = 1:numel(log2c)
    c = 2^log2c(i);
    for j = 1:numel(log2g)
        g = 2^log2g(j);
        cmd = ['-v 10 -s 0 -t 2 -h 0 -w0 1 -w1 4 -g ', num2str(g), ' -c ', num2str(c)];
        % With -v, svmtrain returns the cross-validation accuracy, not a model
        cvAcc = svmtrain(tr_lbl, tr_f, cmd);
        if cvAcc > bestAcc
            bestAcc = cvAcc; bestC = c; bestG = g;
        end
        accuracy(i,j) = cvAcc;
    end
end
% Now build the model using the best C and Gamma,
% with the same class weights as in the cross-validation runs
cmd = ['-s 0 -t 2 -w0 1 -w1 4 -g ', num2str(bestG), ' -c ', num2str(bestC)];
newmodel = svmtrain(tr_lbl, tr_f, cmd);
[p_label, acc, dec] = svmpredict(tt_lbl, tt_f, newmodel);
confusionmat(tt_lbl, p_label)
Here, we have experimented with a few values of C and Gamma and selected the best model accordingly. Where varying Gamma and C brings no significant benefit, the grid is not explored to the fullest and an ad hoc search is used instead.
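The idea of repeating the grid search to narrow down C and Gamma can be sketched as a coarse-to-fine search in log2 space. The Python sketch below is only an outline under stated assumptions: `score` stands for any function that returns a quality measure (for example, a cross-validated score) for a given C and Gamma, and `toy_score` is an artificial function with a known optimum, used here so the example runs without LIBSVM.

```python
import math

def coarse_to_fine(score, center=(0.0, 0.0), half_width=8.0, points=5, rounds=3):
    """Search a points x points grid of (log2 C, log2 Gamma) values around
    `center`, then repeat on a finer grid centred on the best point so far.
    Returns (best_score, best_log2c, best_log2g)."""
    log2c0, log2g0 = center
    best = None
    for _ in range(rounds):
        step = 2 * half_width / (points - 1)
        for i in range(points):
            for j in range(points):
                log2c = log2c0 - half_width + i * step
                log2g = log2g0 - half_width + j * step
                s = score(2.0 ** log2c, 2.0 ** log2g)
                if best is None or s > best[0]:
                    best = (s, log2c, log2g)
        _, log2c0, log2g0 = best
        half_width = step  # next round spans one coarse cell on each side
    return best

def toy_score(c, g):
    """Artificial score that peaks at C = 2^3 and Gamma = 2^-5."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(g) + 5) ** 2)

best_score, best_log2c, best_log2g = coarse_to_fine(toy_score)
print(best_log2c, best_log2g)  # 3.0 -5.0
```

With a real model, each call to `score` is a full cross-validation run, so the number of points per round and the number of rounds trade search quality against training time.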
The above code also has a basic flaw: accuracy is not a very good evaluation measure for skewed classes. For example, if 95% of the samples belong to class 0 and 5% to class 1, a trivial model that predicts class 0 for everything reaches an accuracy of 95% on the training data, yet it is not a good model at all because it never identifies a single churner. Therefore, the code needs to take into account measures like the F1-score.
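The 95%/5% example can be checked with a few lines of Python. The function below computes precision, recall and F1 for the positive class from plain label lists; it is a standalone sketch and not part of the LIBSVM workflow above.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-score for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 5 churners, 95 non-churners, and a model that always predicts "no churn".
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision_recall_f1(y_true, y_pred))  # 0.95 (0.0, 0.0, 0.0)
```

The accuracy looks excellent while the F1-score is zero, which is exactly why the selection criterion in a grid search over skewed data should not be accuracy alone.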
Another important observation is the use of the parameters “-w1 4” and “-w0 1”. In LIBSVM, “-wi” multiplies the penalty C for class i by the given weight. For this D4 dataset, it means that misclassifying a churner as a non-churner is penalized 4 times as heavily as misclassifying a non-churner as a churner. Since D4 contains 4 times as many negative samples as positive ones, this roughly balances the problem. These parameters correspond to the following SVM problem formulation:
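The original figure with the formulation is not reproduced here. As a sketch based on the standard class-weighted C-SVM described in the LIBSVM documentation, and assuming churners (label 1) are treated as y_i = +1 and non-churners (label 0) as y_i = -1, the problem being solved has the form:

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\, w^{T} w
  \;+\; C^{+} \sum_{i:\, y_i = +1} \xi_i
  \;+\; C^{-} \sum_{i:\, y_i = -1} \xi_i \\
\text{subject to}\quad & y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \qquad i = 1, \dots, l
\end{aligned}
```

With “-w1 4” and “-w0 1” the two penalties become C^+ = 4C and C^- = C, so a slack on a churner costs 4 times as much as the same slack on a non-churner.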
Data and results are not provided because the client data has to be kept confidential. The sequence of data flow and the overall model learning and prediction are as shown below:
Learn and save model:
Predict:
Additionally, Java code for trying out churn prediction is provided at https://github.com/rahulkmishra/ToyLibsvm

