Business Analytics is a fast-growing field of study built upon several disciplines, including Statistics, Machine Learning, Artificial Intelligence, Finance and Computer Science. It uses the data generated and gathered by business processes to provide insights into business operations and strategy.
Business Analytics is becoming an indispensable tool for managers and senior management to drive the business, maximize profitability and ensure the long-term survival of the organization. It also provides companies with strategies to meet competition and outperform rivals.
Nowadays, with the ever-increasing size of data and the growing number of data collection methods, the amount of data is growing exponentially. With such a downpour of data from various sources, the use of automated systems and computing platforms is inevitable. We live in an information age in which the capabilities of manual data manipulation have long been surpassed. Business Analytics aims to address issues that bear directly on the interests of the business.
Different types of problems call for different solutions. Sometimes a problem can be addressed using more than one technique; in such cases we can apply several methods, compare their outputs and choose the best option. To select the best solution, the problem at hand must be carefully analyzed: its business application and context must be studied, and the scope of the solution specified. Other factors also determine the selection of a particular solution over others, for example the volume of the data, the velocity of the data, software and hardware constraints, economic constraints and the longevity of the solution. The solution must be scalable for future needs and easy to improve upon.
The objective of this project is to investigate Business Analytics problems and to develop solutions using industry-standard tools and techniques: to understand the problems faced by almost every company and to provide feasible solutions for them. This project aims to understand and develop solutions based on Statistical Analysis and Machine Learning.
Analytics problems are data driven. A lot of thought must be given to the two ends, namely the business problem and the solution. Analytics projects are generally headed by experienced professionals from both ends: since a manager or business expert lacks knowledge of analytical and statistical methods and their intricacies, he cannot tackle the problem alone. There are various ways to solve the same problem, each with its merits and demerits, so expert advice is also required.
This project addresses the technical end of Analytics, i.e. the analysis and problem-solving part. It applies well-established Supervised Learning techniques to real-world business problems. The most effective techniques are used and discussed, namely Logistic Regression, Linear Regression, Time Series Analysis and Decision Trees. All the methods are applied and their results checked for effectiveness and validity.
Another dimension of the project is to gain experience with R, a rapidly developing statistical and data mining tool. R is an open-source technology and one of the fastest-growing analytics frameworks. It is now widely used in industry and runs across platforms. Contributors from many universities continuously develop R packages for a spectrum of statistical, data mining and analytics problems.
The purpose of the project is to gain insight into the application of data analysis and data mining in Business Analytics. To fulfill this purpose, the most important techniques and the most frequently encountered problems in Business Analytics are addressed and solved. Case studies are carried out on real business problems with real data.
To give a comprehensive account of the problem-solving process, three case studies are presented in this report, directly addressing the real-world problems of customer attrition, up-selling and cross-selling. These are prime problems faced by companies of every size in almost all spheres of business. The approach is data driven and employs Supervised Learning solutions. Regression is one of the most versatile statistical techniques; many variants exist today to address different types of problems in various fields, but the core idea remains the same. In this project the two most important types of regression are used, namely Linear Regression and Logistic Regression. While linear regression is used for predictive modeling with continuous variables, logistic regression deals with qualitative (categorical) outcomes.
Besides regression, this project also considers another important analytics tool: Decision Trees. Decision trees are among the most successful data mining tools and, after regression analysis, among the most widely used.
Another very broad field of Analytics is forecasting and time series modeling, since sometimes the solution to a problem is needed for a particular interval of time. Weather forecasting is an apt example, as is a company estimating its sales for the next quarter of the year; both examples are time bound. This is an extensive field of study in its own right, and the treatment given here is only for completeness in understanding both predictive and forecasting methods in Business Analytics.
2. Analytics Project Life Cycle
Analytics projects typically consist of these major stages from conception to completion:
1. Understanding the Problem
2. Data Preparation
3. Model Building
4. Evaluation of the Model
5. Deployment
1) Understanding the Problem: This is the first and most critical step: we approach the problem, try to understand it, figure out what data we will need and what kind of problem we are solving, and think about which solution will fit best.
2) Data Preparation: All supervised machine learning models are built on historical data, and the quality of the solution provided by any model depends directly on the quality of the data. Before model building can start, we must carry out a data audit and cleaning. Data preparation is one of the most time-consuming phases of the project life cycle. It typically involves tasks such as data selection, data correction, missing value treatment and outlier treatment.
3) Model Building: Model building is the stage where the actual analysis starts; we build models and test them. We split the data into two sets, a training dataset and a test dataset. A typical training/test ratio is 70/30 or 60/40, as suits the purpose.
4) Evaluation of the Model: After the model is built, it needs to be evaluated for the accuracy of its results: is it performing well or not? If the model is not up to the mark, we go back to the previous step and try to improve it; sometimes we may need to scrap the whole model and start over with a different approach.
5) Deployment: Once the model is built and tested, it is ready to be deployed, which is the responsibility of the IT engineers. The deployment could be on an online platform or an offline one.
All the case studies undertaken in this project conform to the development life cycle described above. The deployment phase is not covered in this report, as it is not the concern of the analyst or data scientist to deploy the project.
3. Case Study 1: Attrition Rate
3.1 Statement of the Problem:
Business Objective: John is the Customer Services and Relations head of a multi-brand retail store. He analyzed a couple of reports and became worried about losing customers over time. He thought about the different customer segments presented in the report and concluded that not all customers were worth retaining. He identified the loyal and profitable customer segment and planned to develop a churn model to gauge this segment's propensity for attrition.
He plans to revise promotions and schemes for these customers based on the significant factors contributing to their attrition. John will use the churn model primarily to identify the customers next in line to attrite, and will then plan promotions and strategies to retain them.
John's model preparation will start with the identification of responders and non-responders. He will pull past customer data from his CRM and set a timeline to classify customers into two groups.
Group 1: Customers associated with the company for more than 36 months who have not made any transaction in the past 12 months
Group 2: Customers associated with the company for more than 36 months who have made a transaction in the past 12 months
The assumption is that loyal customers who have made no transaction in the past 12 months are a case of churn, and those who are still transacting are a case of retention. Since we are trying to identify the propensity of customers to leave, we tag the churned customers as responders (case 1) and the retained customers as non-responders (case 0).
Historical observation is that we lose close to 40% of customers every year, and we will use this information to set the proportion of responders and non-responders in our data sample.
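Under these group definitions, the tagging rule can be sketched in a few lines. This is a minimal sketch only; the field names and toy records below are hypothetical, not taken from John's actual CRM.

```python
# Sketch: tagging responders (churn) vs non-responders from CRM history.
# Follows the Group 1 / Group 2 rule above: tenure > 36 months,
# churn if no transaction in the past 12 months.

def churn_label(tenure_months, months_since_last_txn):
    """Return 1 (responder/churn), 0 (non-responder), or None if out of scope."""
    if tenure_months <= 36:
        return None                      # not in the loyal segment studied
    return 1 if months_since_last_txn >= 12 else 0

customers = [
    {"id": "C01", "tenure_months": 48, "months_since_last_txn": 14},
    {"id": "C02", "tenure_months": 40, "months_since_last_txn": 2},
    {"id": "C03", "tenure_months": 20, "months_since_last_txn": 14},
]
labels = [churn_label(c["tenure_months"], c["months_since_last_txn"])
          for c in customers]
print(labels)   # [1, 0, None]
```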
The model will identify the important aspects of churn, known as the "drivers of churn", and give the propensity of churn for each customer.
Find as many attributes in the CRM data as you can and build a dataset from them. The data should capture demographic details, transaction details, customer satisfaction, customer experience and other information where available.
List of attributes considered for this case:
The attributes have been denoted by labels for ease of programming and future reference.
Data Audit: The data audit report is the initial report prepared to understand the data well. It consists of descriptive statistics for all the variables in the dataset and also helps in identifying missing values and the levels of the categorical predictor variables. The data audit report serves as the basis for assessing the quality of the data extracted and obtained from the client. Based on this report we can request additional data that seems important for our analysis, and it also helps in dropping some insignificant variables. Refer to the data audit report fields below for better understanding.
Data Profiling: Bivariate profiling helps in finding the frequency of each categorical variable with respect to the response variable. This facilitates binning/grouping categories that have the same response rate, so that the effect of each such category can be captured in the model.
We prepare a report with the following fields to work on further.
The first stage of any statistical modeling consists of data treatment activities. Approximately 80% of the total modeling time is consumed by data treatment. Here we check the hygiene of our independent variables and try to make the data as exploitable as possible.
Before data treatment, one has to examine the correlations between variables: the relationship of the predictors with the response variable, and the inter-correlations among the predictors. From this analysis we can exclude predictors that are not important for model building, based on the significance of the correlation values. The first step of variable reduction happens at this stage; the next is based on the multicollinearity check. The variables selected in this step then undergo further data treatment such as missing value treatment, extreme value treatment and the multicollinearity check.
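A first-pass correlation screen of this kind can be sketched as follows. The predictor names, values and the 0.1 cut-off are purely illustrative assumptions; a real analysis would also test statistical significance and inspect predictor-predictor correlations.

```python
# Sketch: drop predictors whose Pearson correlation with the response
# is negligible (toy data; the 0.1 threshold is arbitrary).

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

response = [0, 0, 1, 1, 1]
predictors = {
    "spend":  [10, 12, 30, 28, 35],   # tracks the response closely
    "visits": [5, 1, 3, 1, 5],        # essentially noise here
}
kept = {name: xs for name, xs in predictors.items()
        if abs(pearson(xs, response)) >= 0.1}
print(sorted(kept))   # ['spend']
```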
Missing value treatment: –
We should check that the independent variables have sufficient information to establish a significant relation with the dependent variable.
Some of the basic rules for Missing value treatment are as below:
1. If an independent variable has a large proportion of missing values (more than 40%-50%), we drop it from the analysis, since no reliable relation can be established between it and the dependent variable.
2. If the percentage of missing values lies between 10% and 40%, we try to establish a separate relation between the dependent and independent variables to uncover any hidden pattern.
3. If a predictor is a categorical variable, we can treat the missing values as a separate category; however, if that category does not come out significant in the model, the information is lost, so it is often better to impute the missing values with the most frequent category of the variable.
4. For a quantitative independent variable, impute the missing values with a measure of central tendency such as the mean, median or mode of that variable.
5. Various other methods, such as regression-based imputation, have also been used to treat missing values.
Note: In SAS, missing values are represented by a dot ('.') for numerical variables and a blank for categorical variables.
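The numeric rules above can be sketched as a small helper. This is a sketch only: the 45% drop threshold splits the 40%-50% band mentioned in rule 1, `None` stands in for a missing value, and the columns are toy data.

```python
# Sketch: drop a variable whose missing share exceeds ~45%,
# otherwise impute missing entries with the median (rule 4).
from statistics import median

def treat_missing(values, drop_threshold=0.45):
    """Return None to drop the variable, else the median-imputed column."""
    miss = sum(v is None for v in values) / len(values)
    if miss > drop_threshold:
        return None
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

print(treat_missing([1, None, 3, None, None, None]))  # dropped -> None
print(treat_missing([1, None, 3, 5, 7, 9]))           # [1, 5, 3, 5, 7, 9]
```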
Extreme value treatment: – This step is done to understand the distribution of the independent variables. The presence of an abnormal value in one of the independent variables can affect the entire analysis (a leverage point). Extreme values are treated by capping. Capping is required because a variable may sometimes contain extreme values for some observations that are unlikely to exist in reality. Such values may result from wrongly keyed data or may represent default values. There may also be negative values that are logically incorrect for a given variable. Such values need to be capped because they can drastically affect the mean of the variable.
Some basic rules for capping are as below:
• Don't cap a value unless it is unrealistic.
• Cap it to the next highest/lowest (realistic) value.
• Cap at a value so that the continuity is maintained.
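A simple percentile-based capping routine consistent with these rules might look like the following. The 95th-percentile cut-off and the toy amounts are illustrative assumptions, not the project's actual rule; capping at the next highest realistic value keeps the continuity of the distribution.

```python
# Sketch: cap extreme values at the largest value below a chosen quantile.

def cap_extremes(values, pct=0.95):
    """Cap every value above the pct-quantile at that quantile's value."""
    cut = sorted(values)[int(pct * (len(values) - 1))]
    return [min(v, cut) for v in values]

amounts = [120, 90, 110, 100, 95, 105, 9_000_000]   # last entry mis-keyed
print(cap_extremes(amounts))   # 9,000,000 is capped to 120
```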
Multicollinearity – When two or more independent variables are related to each other, we say they exhibit multicollinearity. In technical terms, one variable can be explained as a linear combination of the others. Multicollinearity does not allow the independent variables to explain their individual impact on the dependent variable optimally, because of the high internal overlap. Keeping collinear variables in the model makes it unstable; in such a scenario we drop one of the variables. Multicollinearity among variables indicates that these explanatory variables carry a common set of information in explaining the dependent variable.
Detection of multicollinearity:
Variance Inflation Factor: We generally test for the presence of multicollinearity using the Variance Inflation Factor (VIF). The VIF of an independent variable, say X, is obtained by regressing X on the remaining independent variables (say Y and Z) and checking how much of X is explained by them. Hence VIF = 1 / (1 - R²).
If the VIF value is greater than 2, we drop the variable from the model.
Steps in multicollinearity check:
• Add all the independent variables to the model to explain the response
• Check which variable has the highest VIF
• Keeping that variable in mind, go to the correlation table.
• Identify which other variable the highest-VIF variable is most correlated with.
• Drop the variable with the higher VIF and repeat the procedure until every VIF < 2
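The VIF computation can be sketched for the simplest case of two predictors, where the R² of one regressed on the other is just their squared Pearson correlation. The variable names and toy values below are hypothetical, and a real check would regress each predictor on all the others.

```python
# Sketch: pairwise VIF, using VIF = 1 / (1 - r^2) for two predictors.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def vif_pairwise(x, other):
    """VIF of predictor x given one other predictor."""
    r2 = pearson(x, other) ** 2
    return 1.0 / (1.0 - r2)

tenure = [1, 2, 3, 4, 5]                 # months (toy values)
spend  = [2.1, 3.9, 6.2, 8.0, 10.1]      # almost a multiple of tenure
visits = [5, 1, 4, 2, 3]                 # weakly related to tenure

print(vif_pairwise(spend, tenure))   # far above 2 -> drop one of the pair
print(vif_pairwise(visits, tenure))  # close to 1 -> no collinearity issue
```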
Once the data treatment is over, we go ahead with model building.
Development (Training) and Validation (Testing) Sample: Before building the model, divide the entire data in the ratio 70:30 into a development sample and a validation sample. The development sample is used to develop the model, whereas the validation sample is used to check the validity of the model. Build the model on the development sample and use the estimates obtained from it to score the validation sample. If the response captured on the validation sample is nearly equal to the response captured on the development sample, we can say the model is robust in predicting responses for future datasets.
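A reproducible 70:30 split can be sketched as follows; the customer ids are placeholders for real records, and the fixed seed is an assumption made so the split can be repeated.

```python
# Sketch: a seeded 70:30 development/validation split.
import random

def dev_val_split(rows, dev_frac=0.7, seed=42):
    """Shuffle a copy of the rows and split into development and validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * dev_frac)
    return shuffled[:cut], shuffled[cut:]

customers = list(range(100))      # 100 toy customer ids
dev, val = dev_val_split(customers)
print(len(dev), len(val))         # 70 30
```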
Logistic Regression Model Building: The logistic regression technique is used to assess the impact of the independent variables on the probability of the event of interest. The approach is explained in the following steps.
Create dummies and slope dummies for the categorical independent variables if desired. Most statistical analysis tools, such as E-Guide and E-Miner, accept string variables directly; in such cases there is no need to create dummies, and one of the categories is taken as the base category for comparison.
Model Fit Criteria:
1. Use the Deviance or Hosmer-Lemeshow test statistic to check the validity of the model. The higher the p-value, the better the fit. Proceed to the next steps only if the p-value is high.
2. Test the null hypothesis for the independent variables, i.e. all β = 0. The p-value should be significant (p < 0.05) to reject the null hypothesis and conclude that the β values are not all equal to 0.
3. Check the concordance and ties. The rule-of-thumb test is that (Concordance + ½ Ties) should be greater than 60%.
4. Check the significance of the estimate of each variable. If any estimate is not significant, the variable with the highest p-value is dropped and the preceding steps are repeated with the new set of variables. This process continues until all the variables in the model have significant estimates.
5. Frame the equation with the significant variables. The odds ratio and probability value for each profile are estimated.
6. The specificity and sensitivity of the model are assessed and the ROC (Receiver Operating Characteristic) curve is plotted. The area under the ROC curve indicates how well the model classifies the goods as good and the bads as bad.
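At a fixed probability cut-off, sensitivity and specificity can be computed directly from the classification counts. The labels and scores below are toy values, and the 0.5 threshold is an assumption for illustration.

```python
# Sketch: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)
# at a chosen probability threshold.

def sens_spec(y_true, y_prob, threshold=0.5):
    """Classify scores at `threshold` and return (sensitivity, specificity)."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]
print(sens_spec(y_true, y_prob))   # (0.75, 0.75)
```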
7. Coefficient Stability: Coefficient stability is checked across the development and validation samples. Once the model performs satisfactorily on the development sample, we use the same set of variables to model the validation sample. A robust model should perform equally well on the validation sample too; hence the coefficients should be in a close range and of the same sign.
8. Concordance: Consider a set of 100 individuals, of which 10 are responders (denoted by 1) and 90 are non-responders (denoted by 0). We construct a pair for each responder with every non-responder, giving 900 pairs (10 × 90 = 900). Using the model under development, we calculate the predicted response probability for the responder and non-responder in every pair. If the responder's predicted probability is greater than the non-responder's, the pair is concordant; if it is the other way round, the pair is discordant; and if both are equal, the pair is tied. For a good model, the percentage of concordant pairs lies above 65%.
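The pairwise comparison just described can be sketched directly; the labels and predicted probabilities below are toy values standing in for model output.

```python
# Sketch: concordant / discordant / tied fractions over all
# responder vs non-responder pairs.

def concordance(y_true, y_prob):
    """Compare predicted probabilities over every responder/non-responder pair."""
    resp = [p for y, p in zip(y_true, y_prob) if y == 1]
    nonresp = [p for y, p in zip(y_true, y_prob) if y == 0]
    conc = disc = tied = 0
    for r in resp:
        for n in nonresp:
            if r > n:
                conc += 1
            elif r < n:
                disc += 1
            else:
                tied += 1
    total = len(resp) * len(nonresp)
    return conc / total, disc / total, tied / total

y_true = [1, 1, 0, 0, 0]
y_prob = [0.8, 0.4, 0.3, 0.4, 0.9]
print(concordance(y_true, y_prob))   # conc=0.5, disc~0.33, tied~0.17
```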
9. Gini Coefficient: The Gini coefficient is used to test the model's accuracy. It is calculated with the following formula; for a good model the Gini coefficient should lie in the range of 40-60%.
Gini = 2C - 1, where C = area under the curve (i.e. Concordance + ½ Ties)
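The formula translates to a one-line computation; the 0.75 and 0.10 input fractions are made-up illustrative values.

```python
# Sketch: Gini coefficient from concordant and tied pair fractions.

def gini(concordant, tied):
    """Gini = 2C - 1 with C = concordant fraction + half the tied fraction."""
    c = concordant + 0.5 * tied
    return 2 * c - 1

print(gini(0.75, 0.10))   # ~0.6, i.e. 60%
```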
10. Scoring: We are satisfied with the model when it performs well in terms of rank ordering, coefficient stability, goodness of fit, concordance and response capture on both the development and validation samples.
Now take the coefficients of the variables obtained from the model run on the development sample and use them to predict the response rate of the validation sample. This is known as scoring the model. Scoring gives a good idea of how the model will perform when applied to another dataset. Here, we are concerned with how many responders are captured in, say, the first 40% of the population.
The model is then used to predict the response rate for a new dataset taken from a different time frame, to test the validity of the rules it suggests. The model will be applicable to profiles similar to the ones already present in the sample data used for model development. Model validation is performed by taking the optimum threshold level of probability.
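Scoring a validation or out-of-time sample with development-sample estimates amounts to evaluating the fitted logistic equation row by row. The intercept, coefficients and rows below are hypothetical placeholders for real model output.

```python
# Sketch: scoring new rows with coefficients from the development model,
# p = 1 / (1 + exp(-(b0 + sum(b_i * x_i)))).
import math

def score(row, intercept, coefs):
    """Predicted churn probability from a fitted logistic equation."""
    xb = intercept + sum(b * x for b, x in zip(coefs, row))
    return 1.0 / (1.0 + math.exp(-xb))

intercept = -1.2
coefs = [0.8, -0.5]              # e.g. recency, satisfaction (toy values)
validation = [[2.0, 1.0], [0.5, 3.0]]

probs = [score(r, intercept, coefs) for r in validation]
print(probs)                      # higher score -> more likely to churn
```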