## An approach towards Churn Prediction

Churn prediction is one of the classic problems in predictive analytics: one tries to predict whether a person is going to churn based on historical data. The problem has been approached in various ways and in various contexts. Two of the most common contexts are employee churn prediction and subscriber churn prediction in telecom. In the former, one tries to predict whether a particular employee is going to resign; in the latter, whether a subscriber is going to leave the network. In the telecom sector the problem is of increasingly high significance, as the statistics reveal unpleasant numbers of subscribers churning. Churn prediction in the literature has been dominated by logistic regression and, more recently, SVMs (Support Vector Machines).
Here, it is approached as a binary classification problem. Wouldn’t it be nice if one were able to tell after how many days someone is going to resign? But I am not sure about attempting that problem, or whether we have data of sufficient quality to answer such questions.
Therefore, let us stick to the original binary problem of predicting churn or no churn.

Churn prediction is a heavily class-imbalanced problem: in general there are few churners and a large number of non-churners. To counter this, intelligent sampling strategies such as SMOTE or stratified down-sampling are required. The sampling methodologies themselves are not discussed in detail here.

A typical flow for any classification is illustrated in the self-explanatory figure below.

### Feature Selection:

Feature selection is one of the most important steps in determining the quality of the learnt model. Hence, more time should be dedicated to feature selection than to tuning parameters for getting better models. But the fundamental question is: how does one do feature selection? Initially, one needs to start heuristically and use the literature to good effect. Then, one needs to statistically determine any correlation between the features; a high positive/negative correlation between two features means one of them is redundant and can be dropped. Secondly, one could attempt PCA or ICA to identify strong and weak features, eliminate the weak ones, and add further features from the data. Another interesting way to determine feature sufficiency is to follow these steps:

• Divide the sampled data into train and test sets
• Include half of the wrongly predicted test points in the training data
• Train the model on this new training set with the same set of parameters
• Test the model on the rest of the test data and check for improvements

If the predictions are still not encouraging, then this is about the best that can be achieved with this feature set, and we would therefore require more discriminating features. A sketch of this procedure is given below.
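A minimal sketch of this check, assuming an already-sampled feature matrix `X` and label vector `y`, using the LIBSVM MATLAB interface introduced in the experiments below (the split fraction and SVM parameters are illustrative):

```matlab
% Feature-sufficiency check: feed half of the wrongly predicted test
% points back into the training set and re-test on the remainder.
n = size(X,1);
idx = randperm(n);
nTrain = round(0.8*n);
tr = idx(1:nTrain); te = idx(nTrain+1:end);

opts = '-s 0 -t 2 -c 1 -g 0.5';               % illustrative parameters
model = svmtrain(y(tr), sparse(X(tr,:)), opts);
pred  = svmpredict(y(te), sparse(X(te,:)), model);

wrong = te(pred ~= y(te));                    % misclassified test points
half  = wrong(1:floor(numel(wrong)/2));       % half of the wrong predictions
tr2   = [tr, half];                           % augmented training set
te2   = setdiff(te, half);                    % remaining test data

% Retrain with the same parameters and check for improvement on the rest
model2 = svmtrain(y(tr2), sparse(X(tr2,:)), opts);
[~, acc2, ~] = svmpredict(y(te2), sparse(X(te2,:)), model2);
```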

A combination of dimensionality reduction, correlation analysis and the above methodology can be used for effective feature selection.
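For the correlation step, a minimal sketch in MATLAB (the 0.9 cut-off and the feature matrix `X` are illustrative assumptions):

```matlab
% Drop one feature out of every highly correlated pair.
R = corrcoef(X);                  % pairwise Pearson correlations
cutoff = 0.9;                     % illustrative threshold
d = size(X,2);
keep = true(1,d);
for i = 1:d
    for j = i+1:d
        if keep(i) && keep(j) && abs(R(i,j)) > cutoff
            keep(j) = false;      % drop the later feature of the pair
        end
    end
end
Xreduced = X(:, keep);
```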

### Sampled Data

The data has been randomly down-sampled in a stratified manner. All data points with label “1”, called positive samples, are kept in the sample. The negative samples, i.e. those with label “0”, are down-sampled to counter the huge imbalance (a sketch of this follows the list below). Three datasets are created, namely:

• D1: contains an equal number of positive and negative samples
• D4: negative samples are 4 times the positive samples
• D8: negative samples are 8 times the positive samples
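The following is a minimal sketch of this stratified down-sampling, assuming the label sits in column 13 as in the experiment code further below; the input file name and the ratio value are illustrative:

```matlab
% Stratified down-sampling: keep all churners, sub-sample non-churners.
data = csvread('churn_raw.csv');          % hypothetical raw data file
pos  = data(data(:,13) == 1, :);          % all positive samples (churners)
neg  = data(data(:,13) == 0, :);          % all negative samples

ratio = 4;                                % 1 -> D1, 4 -> D4, 8 -> D8
k = min(ratio * size(pos,1), size(neg,1));
negSub = neg(randperm(size(neg,1), k), :);% random subset of negatives

sampled = [pos; negSub];
sampled = sampled(randperm(size(sampled,1)), :);   % shuffle the rows
csvwrite('sampleD4.csv', sampled);
```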

### Experiments:

The entire experiment here is carried out using LIBSVM, owing to its performance and its permissive licensing. Experiments to illustrate the superiority of SVM over logistic regression were carried out but are not described in this post: logistic regression experiments using Mahout's SGD gave results inferior to SVM and were therefore not considered further. The experiments create SVM models from the sampled data using LIBSVM.
Here, the RBF kernel is used, without loss of generality. C > 0 is the penalty parameter of the error term, and gamma > 0 is a kernel parameter.
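For reference, the RBF kernel computes

$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$

so gamma controls how fast the similarity between two points decays with their distance.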
Ideally, an iterative grid search needs to be performed to find the best C and gamma values. The code for doing a grid search over C and gamma is given below:

```matlab
clear; clc;
addpath('/home/rahulkm/libsvm-3.14/matlab');

% Prepare a sparse representation of the data
data = csvread('sampleD4.csv');
labels   = data(:,13);
features = data(:,2:12);
features_sparse = sparse(features);

% Dump into an 80-20 train/test split
splitParam = round(0.8*size(labels,1));
libsvmwrite('D4.train', labels(1:splitParam), features_sparse(1:splitParam,:));
libsvmwrite('D4.test',  labels(splitParam+1:end), features_sparse(splitParam+1:end,:));

% Scale the training data and apply the same scaling to the test data
system('/home/rahulkm/libsvm-3.14/svm-scale -l 0 -s range1 D4.train > D4.tr.scale');
system('/home/rahulkm/libsvm-3.14/svm-scale -r range1 D4.test > D4.tt.scale');

% Read the scaled training and test data back in
[tr_lbl, tr_f] = libsvmread('D4.tr.scale');
[tt_lbl, tt_f] = libsvmread('D4.tt.scale');

% Find the optimal C and gamma with 10-fold cross-validation.
% Ideally this is repeated on progressively finer grids to narrow down
% C and gamma; here it is done only once for illustration.
log2c = -3:8;
log2g = -3:8;
bestAcc = 0;
accuracy = nan(numel(log2c), numel(log2g));
for i = 1:numel(log2c)
    c = 2^log2c(i);
    for j = 1:numel(log2g)
        g = 2^log2g(j);
        % -v 10: 10-fold CV, -s 0: C-SVC, -t 2: RBF kernel, -h 0: no shrinking,
        % -w0/-w1: per-class penalty weights (discussed below)
        cmd = ['-v 10 -s 0 -t 2 -h 0 -w0 1 -w1 4 -g ', num2str(g), ' -c ', num2str(c)];
        cv_acc = svmtrain(tr_lbl, tr_f, cmd);   % with -v, returns CV accuracy
        if (cv_acc > bestAcc)
            bestAcc = cv_acc;
            bestC = c;
            bestG = g;
        end
        accuracy(i,j) = cv_acc;
    end
end

% Now build the final model using the best C and gamma
% (keeping the same class weights used during cross-validation)
cmd = ['-s 0 -t 2 -w0 1 -w1 4 -g ', num2str(bestG), ' -c ', num2str(bestC)];
newmodel = svmtrain(tr_lbl, tr_f, cmd);
[p_label, acc, dec] = svmpredict(tt_lbl, tt_f, newmodel);
confusionmat(tt_lbl, p_label)
```
Here, we have experimented with a few values of C and gamma, and the best model is selected accordingly. Where varying C and gamma showed no significant benefit, the grid was not explored exhaustively and an ad-hoc narrowing of the search was used instead.

The above code also has a basic flaw: accuracy is not a good evaluation metric for skewed classes. For example, with 95% of class 0 and 5% of class 1, a trivial model that predicts class 0 for everything achieves 95% accuracy on training, yet it simply isn't a good model at all. The code therefore needs to take into account measures like the F1-score.
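As a sketch, the precision, recall and F1-score for the churner class can be computed directly from the labels predicted above (assuming label 1 for churners and 0 for non-churners, as in the sampled data):

```matlab
% Precision, recall and F1-score for the positive (churner) class.
tp = sum(p_label == 1 & tt_lbl == 1);     % churners caught
fp = sum(p_label == 1 & tt_lbl == 0);     % false alarms
fn = sum(p_label == 0 & tt_lbl == 1);     % churners missed
precision = tp / (tp + fp);
recall    = tp / (tp + fn);
f1 = 2 * precision * recall / (precision + recall);
fprintf('Precision %.3f, Recall %.3f, F1 %.3f\n', precision, recall, f1);
```

This F1-score, rather than raw accuracy, could then drive the model selection in the grid search.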
Another important observation is the use of the parameters "-w1 4" and "-w0 1". For this D4 dataset, they mean that misclassifying a churner as a non-churner is penalized four times as heavily as misclassifying a non-churner as a churner, which effectively re-balances the problem. These parameters correspond to the following SVM problem formulation:

$\min\limits_{w,b,\xi} \frac{1}{2}{\bf w^Tw} + C^+\sum\limits_{y_i=1} \xi_i + C^-\sum\limits_{y_i=-1} \xi_i$
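
Here the minimization is subject to the usual soft-margin constraints:

$y_i({\bf w}^T\phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$

In LIBSVM's parametrization, the -wi option sets the penalty of class i to weight times C, so "-w0 1 -w1 4" corresponds to $C^- = C$ and $C^+ = 4C$.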

Data and results are not provided, in order to maintain the confidentiality of the client's data. The sequence of data flow and the overall model learning and prediction are as shown below:

Learn and save model:

Predict:
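
In code, the same two phases can be reproduced with the LIBSVM command-line tools (invoked here from MATLAB as in the script above; the file names and hyperparameter values are illustrative):

```matlab
% Learn and save the model (svm-train writes the model file itself)
system(['/home/rahulkm/libsvm-3.14/svm-train -s 0 -t 2 -w0 1 -w1 4 ' ...
        '-g 0.5 -c 8 D4.tr.scale D4.model']);

% Predict on unseen data using the saved model
system('/home/rahulkm/libsvm-3.14/svm-predict D4.tt.scale D4.model D4.out');
```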

Additionally, Java code for trying out churn prediction is provided at https://github.com/rahulkmishra/ToyLibsvm

## Machine Learning Guidance For Beginners

With the deluge of machine learning resources both online and offline, a newcomer to this field can easily get awestruck and stranded by indecisiveness. Some people are good at spotting what to read and follow and what to skip. This post is particularly for ML enthusiasts who have always wanted to get their hands wet but have not been able to find a good way to understand and use ML.

[Hilary Mason's video on ML for Hackers] gives a great introductory feel for the ML area in 30 minutes.

People who think a rigorous background in stochastic processes, optimization and linear algebra is absolutely necessary to start with are not always correct. The most important thing is to get started; the mathematical fundamentals can be learnt on the fly, though some prior knowledge certainly helps. A person cannot learn swimming without diving into the water, no matter how much they have read about it, and the same analogy applies here. One should nevertheless be cautious in one's approach: I have seen many people run away from ML for reasons like "it's just statistics" or "too much maths", and some learn the material but do not know where to use it. These factors can kill their enthusiasm.

Therefore, a good balance between theory and practice is necessary. One should try to apply the various ML techniques learnt, and once people start applying ML there are never-ending "WOWs".
So, where does one start?

I would recommend beginning with the advanced track of Andrew Ng's online ML course on Coursera. It is fairly broad and thorough, with a good balance between learning and application: it not only strengthens the basics but also makes you program and apply them.
The Stanford CS229 course, also by Andrew Ng, offers more depth and is much better for understanding the internals of ML.
Along with Andrew Ng's course, one also needs to work a bit on algebra and probability to take a bigger leap.

Another great set of video lectures is by Prof. Yaser S. Abu-Mostafa of Caltech, titled Learning from Data. I personally consider this course superior to Andrew Ng's, both for its content and for the professor's approach towards ML.

(Mathematicalmonk’s channel on youtube) is another comprehensive resource on Machine learning. Along with the probability primer lectures, this really becomes very helpful in covering a broad range of topics with good mathematical fundamentals.

Other than these video resources, there are quite a few good introductory books on ML:

1. One of my favorites is [PRML book by Bishop]
2. [Tom Mitchell's book] is another widely accepted book.
3. More mathematical but a nice read is [Pattern Classification by Duda and Hart]

Now that one has gathered good fundamentals in ML and is aware of the various terminology and jargon, one can explore different areas based on one's own interest.
But at this point one needs to decide whether one merely wants to use existing ML algorithms and tools, or wants to code the algorithms oneself. Neither is inferior to the other, but people deciding to write new or existing ML algorithms need to be aware of the internals behind the curtain. This is where Andrew Ng's course lacks immensely: it is more of a tool gatherer's approach, in many ways good for ML enthusiasts but not desirable for all.
The tool-gatherer category also needs to evolve towards large-scale machine learning because of its relevance in the current era. For this, Programming Collective Intelligence by Toby Segaran is a great resource. A good tool to start experimenting with large-scale data is Mahout (mahout.apache.org).
Machine Learning for Hackers is another great practical book.

I ♥ data:

Machine learning is a kind of decision making, and hence more thorough knowledge of related fields like optimization and game theory needs to be developed. A strong mathematical background in algebra and stochastic processes should also be acquired, along with exploring statistical learning theory to its limits. A background in information theory is helpful as well.

A list of literature surveys, reviews and tutorials on machine learning and related topics such as computational biology and NLP has been compiled, along with links to papers, at www.mlsurveys.com. Deep learning, SVMs, SGD, Bayesian statistics, recommender engines and MapReduce for ML are some of the key hot topics in ML at the moment.

## ML Dependencies and Interplay

Fig. 1: Machine learning prerequisites and interplay