
What is the best model for your predictive analytics project?


Johannesburg, 21 Feb 2014

Modern data mining workbenches offer the adventurous analyst an impressive array of models for tackling classification or numerical prediction.

The IBM SPSS Modeler product offers the following: two types of neural network models, time series capability, four different types of decision trees (C5.0, CHAID, C&R Tree, QUEST), a decision list modelling node, two different linear regression nodes, discriminant analysis, logistic regression (binomial or multinomial) analysis, general linear models, generalised linear mixed models, Cox regression, support vector machines, self-learning response models, Bayes Nets and a K Nearest Neighbour node.

IBM SPSS Modeler even offers 'omnibus' nodes that compare the performance of all applicable models: the Auto Classifier node for classification problems and the Auto Numeric node for numerical estimation problems.

The pragmatic choice of a preferred model should, however, be based on more than the statistical performance considerations summarised by these omnibus nodes. Organisations are being asked to make substantial investments in the development of predictive analytical solutions, and confidence building is key to securing support for such investments. If decision-makers cannot understand the models, or ground-truth their results against their own common sense and experience, such investment decisions are likely to be shelved.

Logistic regression has gained acceptance in the credit risk industry, and is the basis for statements like "the chance of having a heart attack is five times greater for smokers than for non-smokers", which often appear in the popular media. Decision trees are easy to understand and lend themselves to live development in a workshop setting, guided, for example, by input from experienced business users and decision-makers.
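As a rough illustration of why logistic regression lends itself to such statements, the sketch below (synthetic data, variable names invented for the example) fits a logistic model and exponentiates the smoking coefficient to obtain an odds ratio, which popular reporting typically phrases as "x times the chance":

```python
# A minimal sketch (synthetic data, hypothetical variable names): how a
# logistic regression coefficient becomes a statement like "smokers have
# five times the odds of a heart attack".
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
smoker = rng.integers(0, 2, n)                   # 0 = non-smoker, 1 = smoker
# Assumed true model: smoking multiplies the odds of a heart attack by ~5
logit_p = -3.0 + np.log(5.0) * smoker
p = 1 / (1 + np.exp(-logit_p))
heart_attack = rng.binomial(1, p)

X = sm.add_constant(smoker.astype(float))
fit = sm.Logit(heart_attack, X).fit(disp=0)
odds_ratio = np.exp(fit.params[1])               # exp(coefficient) = odds ratio
print(f"Estimated odds ratio for smokers: {odds_ratio:.2f}")
```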

Linear regression is widely accepted and understood. Beyond this lies a slippery slope of growing complexity. Neural networks and support vector machines, for example, are effectively black boxes for all but seasoned senior analysts with a rich theoretical and analytical background. One has to ask whether the interpretability sacrificed with these kinds of models is compensated for by the gains in accuracy and precision.

And what about model robustness? Robustness is revealed when a measure of model performance, such as the Gini coefficient or the R-squared, is contrasted between a training dataset and an unseen testing dataset. Large shifts indicate that the model does not generalise, and is thus not sufficiently robust for predictive purposes. Ultimately, robustness needs to be re-evaluated on merit for each new project and for all model types, assuming that the 'black boxiness' of 'new age' methods can be tolerated. Across a range of projects, this writer has found that models like logistic and linear regression often offer superior robustness to 'new age' methods.
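A minimal sketch of this train-versus-test comparison, on synthetic data rather than any of the projects mentioned, contrasting R-squared for a linear regression and a neural network:

```python
# A minimal sketch (synthetic data): gauging robustness by contrasting a
# performance measure between training data and unseen testing data.
# A large drop from train to test suggests the model does not generalise.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

for name, model in [("linear regression", LinearRegression()),
                    ("neural network", MLPRegressor(hidden_layer_sizes=(100,),
                                                    max_iter=2000, random_state=1))]:
    model.fit(X_tr, y_tr)
    print(f"{name}: train R2 = {r2_score(y_tr, model.predict(X_tr)):.2f}, "
          f"test R2 = {r2_score(y_te, model.predict(X_te)):.2f}")
```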

Exactly how the predictive analytics model is to be deployed will also impact on model selection. For example, if the model is to be taken offline into a document and deployed manually, as is sometimes done with credit risk models, then model simplicity and interpretability are paramount considerations. Where models are deployed by a computer, perhaps in real-time, more complex models are feasible, bearing in mind the earlier caveats made here about the general importance of the 'understandability' of the models.

Data quality can also impact model choice. For example, IBM SPSS Modeler silently implements list-wise deletion when it encounters missing values and classical regression techniques (logistic or linear) are employed. List-wise deletion involves discarding an entire record if any one of the predictors or the target value is system missing. The losses compound quickly: it would not be unusual for an incidence of 10% missing-ness to result, via list-wise deletion, in the loss of 90% of all records (for example, with roughly 20 predictors each independently missing 10% of their values, only 0.9^20, or about 12%, of records are fully complete). Decision trees implement a workaround to prevent the loss of records via list-wise deletion, involving fractionation or the use of surrogates. For regression techniques, one way to prevent list-wise deletion is to create a 'missing-ness' flag indicator variable accompanying each original variable, and/or to convert the system missing indicator to a value like 999, as sketched below.
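A minimal sketch, using pandas and invented column names, of the flag-and-sentinel workaround described above:

```python
# A minimal sketch (hypothetical column names): add a missing-ness flag for
# each predictor and replace the missing value itself with a sentinel such
# as 999, so regression models keep the record instead of dropping it.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":  [52000, np.nan, 31000, 47000],
    "age":     [34, 51, np.nan, 29],
    "churned": [0, 1, 0, 1],                            # target
})

predictors = ["income", "age"]
for col in predictors:
    df[col + "_missing"] = df[col].isna().astype(int)   # flag indicator variable
    df[col] = df[col].fillna(999)                       # sentinel replaces system missing

print(df)
```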

A further consideration for model selection is the nominal nature of variables in the data set. If the data contain many nominal variables, and there are large numbers of unique values for each variable, then neural network models are likely to perform poorly. This situation is typical of medical insurance claim data. Logistic regression, GLM techniques (ie, the equivalent of linear regression but with nominal predictor variables) and decision trees should be considered in these circumstances, depending on whether classification or numerical predictions are being attempted. However, in assessing the merits of GLMs versus decision trees, bear in mind that GLMs deal with interactions between variables rather poorly, while decision trees are very good at picking up on relevant interactions. For example, when a GLM is instructed to estimate all possible two-way interactions, the number of terms to be estimated can explode: 40 nominal predictors give 40 x 39 / 2 = 780 possible two-way interactions, and each interaction between two many-valued nominal variables requires a large block of parameters of its own. Unless laboriously set up to avoid this, all 780 interaction terms will be estimated, with negative impacts on run-time performance, interpretability and robustness.
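The combinatorics can be checked directly; the figures below are the illustrative ones used above (40 predictors, an assumed 40 levels each), not numbers from any particular data set:

```python
# A minimal sketch (hypothetical numbers): how quickly two-way interactions
# between nominal variables inflate a GLM.
from math import comb

n_predictors = 40
n_levels = 40                                   # assumed levels per nominal variable

pairwise_interactions = comb(n_predictors, 2)   # 40 * 39 / 2 = 780
params_per_interaction = (n_levels - 1) ** 2    # dummy-coded parameters for one interaction

print(f"Two-way interactions among {n_predictors} predictors: {pairwise_interactions}")
print(f"Parameters for one {n_levels}x{n_levels} interaction: {params_per_interaction}")
```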

By comparison, decision trees do not need to cater for all possible interactions, and can home in automatically on just the most important ones. Some decision tree models can also simplify matters by automatically grouping nominal values that are similar in their predictive role. Thus, when there are high-dimensional nominal variables in the set of predictors, and it makes sense to explore interactions, the preferable approach is a decision tree, as the sketch below illustrates.
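A minimal sketch on synthetic data of a decision tree discovering an interaction on its own: the target below depends only on the combination of two binary predictors (neither has a main effect by itself), yet the fitted tree recovers the pattern without the interaction being specified in advance.

```python
# A minimal sketch (synthetic data): a decision tree picking up a pure
# interaction. The target is the XOR of two binary variables, so no single
# variable predicts it on its own, but splitting on one then the other does.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
a = rng.integers(0, 2, 1000)
b = rng.integers(0, 2, 1000)
y = a ^ b                                        # pure interaction, no main effects

X = np.column_stack([a, b])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["a", "b"]))
```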

Lastly, there are some situations where the nature of the problem effectively dictates which model should be employed. The most obvious example is the use of a survival model to analyse censored data when there is a single critical event. An example of such an event is churn, where the time to churn, and the censored 'survival' of customers who had not yet churned when the data snapshot was taken, are dealt with in a satisfactory manner by the Cox regression modelling node offered in IBM SPSS Modeler. Another example is time series modelling, where the standard approach is the suite of classical forecasting techniques based on exponential smoothing and ARIMA models, functionality made available by the Time Series node.
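For readers outside the IBM SPSS Modeler environment, the same survival-analysis idea can be sketched with the open-source lifelines library on invented churn data; this is an illustrative stand-in, not the Modeler node itself.

```python
# A minimal sketch (synthetic churn data, lifelines rather than IBM SPSS
# Modeler): Cox regression on time to churn, with customers who had not
# churned by the snapshot date treated as censored.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 200
on_contract = rng.integers(0, 2, n)
monthly_spend = rng.uniform(10, 100, n)

# Assumed relationship: contracts and higher spend lengthen time to churn
time_to_churn = rng.exponential(scale=12, size=n) * np.exp(0.8 * on_contract + 0.01 * monthly_spend)

snapshot = 24.0                                      # months covered by the data snapshot
tenure = np.minimum(time_to_churn, snapshot)         # observed tenure
churned = (time_to_churn <= snapshot).astype(int)    # 0 = censored (still a customer)

df = pd.DataFrame({
    "tenure_months": tenure,
    "churned": churned,
    "on_contract": on_contract,
    "monthly_spend": monthly_spend,
})

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure_months", event_col="churned")
cph.print_summary()                                  # hazard ratios for the two covariates
```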

To summarise, in many situations it may well be that models which are simple and relatively easy to understand - desirable properties from a business perspective - also have good accuracy, precision and robustness. Although more complex and less transparent modelling approaches may at times offer gains in terms of accuracy and precision, the additional model complexity presents problems from a business and implementation perspective. Each situation should be judged on its merits, with these comments as a broad background and guide.


Editorial contacts

Mike Bergh
OLRAC SPS
mike@olsps.com