Engineering & Technology
May 30, 2024

Predictive Analytics in the world of big data with application for targeting decisions

Predictive Analytics (PA) models are increasingly important tools for predicting future events in big data applications, based on past observations for which the response values are known. One of the most popular applications is targeting decisions, which are common in many domains, such as marketing and banking. The leading prediction models still belong to the realm of regression. Professor Jacob Zahavi from the Coller School of Management at Tel Aviv University, Israel, investigated the processes involved in building and validating PA models for big data applications based on logistic regression, and developed criteria for assessing the prediction accuracy of these models. This analysis provides a roadmap for developing high-quality, accurate PA models for use in finance, healthcare, and other service industries.

Companies rely on accurately forecasting customer behaviour to build effective and efficient business strategies. Analysts working in marketing, insurance, telecommunications, healthcare, banking, and other domains aim to discover patterns in data in order to predict future outcomes, although prediction accuracy varies between modelling approaches and data characteristics.

Explanatory vs prediction models

According to Professor Jacob Zahavi from the Coller School of Management, Tel Aviv University in Israel, traditional regression models have been used retrospectively, to explain phenomena based on current and past data. Explanatory models are usually built on causal or statistical relationships between variables, with the aim of finding the model that best fits the data used in model-building according to a predefined goodness-of-fit criterion.

Figure 1. Explanatory versus prediction models
Figure credit: Based on Shmueli, G, (2010) To Explain or to Predict? Statist Sci, 25(3), 289–310.

Conversely, predictive models are based on associative relationships between measurable variables. They are proactive tools designed to predict events that have not yet been observed, and the aim is to minimise the total error of prediction. The differences between explanatory and predictive models are summarised in Figure 1.

Predicting the future

Predictive Analytics (PA) uses past observations for which response values are known to predict future events. The new PA approach is designed to accurately tackle industry problems involving big data – datasets too large and complex to handle using traditional data-processing software.

In PA models, the output (dependent) variable represents the predicted event (eg, whether a customer will respond to an offer to purchase a product or a service), while the input (explanatory) variables are variables from the dataset, or their transformations, which may explain the event. PA models for big data are both complex and high-dimensional, which complicates model-building and can lead to overfitting (where a model works well on the test dataset but badly on new and unseen data) and instability.

Figure 2. Gain charts: the gains in the percentage of the responsive customers ‘captured’ by the model as a function of the percentage of the targeted customers (prospects). The top curve, in grey, represents the perfect-model curve.
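
As a rough illustration of how a gain chart like Figure 2 is built, the short Python sketch below ranks a scored customer list and accumulates the share of responders ‘captured’. The scores and responses here are synthetic, purely for illustration, and are not taken from the study.

# Synthetic sketch of a gain-chart calculation: rank customers by model score,
# then track the cumulative share of actual responders captured.
import numpy as np

rng = np.random.default_rng(0)
p_scores = rng.random(1000)                              # model-predicted response probabilities
responded = (rng.random(1000) < p_scores).astype(int)    # actual 0/1 responses (synthetic)

order = np.argsort(-p_scores)                            # best prospects first
cum_captured = np.cumsum(responded[order]) / responded.sum()
pct_targeted = np.arange(1, len(p_scores) + 1) / len(p_scores)
# Plotting cum_captured against pct_targeted traces the model's gain curve.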

Binary classification is the most common type of prediction model and widely used in targeting decisions. These models divide observations into two predefined classes for addressing ‘Yes/No’ binary questions such as ‘Will the customer pay their debt in full?’, or ‘Will they click on an advertisement?’. The likelihood of a particular outcome is then represented using either probabilities or rankings.

Although approaches using machine learning (algorithms that learn from training data and generalise to unseen data) and data mining for prediction (forecasting future events by discovering patterns within large datasets) have come to the forefront in recent years, the leading predictive models still employ regression, which relates an output variable to a group of input variables. For two-way classification, the leading regression model is logistic regression, in which the dependent variable is a binary 0/1 variable (0 for ‘No’, 1 for ‘Yes’).
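
A minimal sketch of such a response model, assuming scikit-learn and hypothetical file and column names (‘past_campaign.csv’, ‘recency’, ‘frequency’, ‘monetary’, ‘responded’) rather than anything used in the study, fits the model on a past campaign and then scores new prospects:

# Minimal sketch of a binary (0/1) logistic regression response model.
# File and column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression

campaign = pd.read_csv("past_campaign.csv")             # past campaign with known outcomes
X = campaign[["recency", "frequency", "monetary"]]      # input (explanatory) variables
y = campaign["responded"]                               # output variable: 1 = Yes, 0 = No

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

prospects = pd.read_csv("prospects.csv")                # new, unseen customers
p_respond = model.predict_proba(prospects[["recency", "frequency", "monetary"]])[:, 1]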

Stages of predictive analytics

The PA process involves two main stages: fitting a model to a dataset, and using that model to predict responses for new observations and unseen data. The goal is to predict outputs even for events that lie beyond the range of the original inputs. Developers of large-scale PA models aim to address model overfitting and instability, and to strike a good balance between accuracy and prediction errors.

Figure 3. Actual and predicted expected profits. It is worth targeting only the top four deciles of the customers in the list. Targeting customers beyond the fourth decile results in negative profits.

PA models are used to forecast outputs as a function of the predictors that made it to the final model after going through the feature selection process (the process of selecting the most influential variables affecting response). In marketing, models are based on previous campaigns for the same or similar products. If new products are involved, for which no previous campaigns exist, the models are built based on a random sample of customers drawn from the database. Targeting decisions are made using the customers’ purchase probabilities predicted by the model.
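
The article does not detail which feature-selection procedure is applied; as one common illustration, forward stepwise selection with scikit-learn keeps only the candidate predictors that most improve a chosen score. Here X_all (a wide table of candidate predictors) and y (the 0/1 response) are hypothetical stand-ins.

# Illustrative only: forward stepwise feature selection for a logistic model.
# X_all and y are hypothetical; the selected predictors form the final model.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,       # arbitrary choice of final model size
    direction="forward",
    scoring="roc_auc",
    cv=5,
)
selector.fit(X_all, y)
selected = X_all.columns[selector.get_support()]        # predictors that made it to the final model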

Predictive Analytics (PA) uses past observations for which response values are known to predict future events.

Zahavi reviewed the way researchers build and validate large-scale predictive analytics processes in the context of big data. He determined key criteria that could be used to assess the ability of PA to accurately predict outcomes, and proposed measures to determine the model accuracy and prediction errors of binary logistic regression-based classification models.

Model accuracy and sampling variance

Zahavi presents several performance measures for assessing prediction quality in classification problems involving logistic regression-based PA models. A common strategy for assessing prediction errors, which applies not only to logistic regression but to all other prediction models (such as neural networks), is to split the testing dataset into two mutually exclusive and exhaustive datasets – a training dataset used for model-building and a separate validation dataset for validating the model results. Since responses are also known for the validation dataset, model developers can compare predicted and actual values to judge the quality of the prediction process.
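
A minimal sketch of this split-and-compare step, continuing the hypothetical X and y from the earlier snippet and using scikit-learn’s off-the-shelf metrics, might look like:

# Split into mutually exclusive training and validation sets, fit on the first,
# then compare goodness-of-fit on both to spot overfitting.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, Xs, ys in [("training", X_train, y_train), ("validation", X_valid, y_valid)]:
    p = model.predict_proba(Xs)[:, 1]
    print(f"{name}: AUC={roc_auc_score(ys, p):.3f}, log-loss={log_loss(ys, p):.3f}")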

Bias-variance trade-off

There are two main sources of error in PA models – the bias, introduced by the differences between the ‘true’ (and unknown) model parameters and the estimated model parameters, and the sampling variance, introduced by the fact that the model parameters are estimated from a random data sample drawn from the population. Both of these factors affect prediction quality. A good model should minimise the total prediction error (sampling variance plus bias), but also be robust enough to work well with unseen data and new observations.
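
For readers who want the textbook formalisation (the article itself stays non-technical), the standard squared-error decomposition behind these two error sources is

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{sampling variance}} \;+\; \sigma^2,
\]

where f is the true (unknown) model, \hat{f} is the model estimated from the sample, and \sigma^2 is the irreducible noise.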

The major problem afflicting prediction accuracy is overfitting – yielding good predictions on the dataset used to build the model but bad results when applying the model on new observations and new data. Zahavi found this problem to be particularly apparent in big data applications that employ a large number of input variables.

Figure 4. The bias and sampling variance trade-off. Figure credit: Geman, S, et al, (1992) Neural Networks and the Bias/Variance Dilemma. Neural Comput, 4(1), 1–58.

As bias and variance errors cannot be reduced simultaneously, model developers must find the right bias-variance trade-off as a function of model complexity (measured by the number of predictors in the final model) to ensure prediction quality. Zahavi found that it is essential to optimise model complexity to minimise total error while reducing bias and variance as much as possible (demonstrated graphically by Figure 4).
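
One rough way to picture this trade-off in code, reusing the hypothetical train/validation split from the earlier snippet and adding candidate predictors one at a time in an arbitrary order, is to watch for the point where validation error stops improving:

# Rough sketch: grow model complexity one predictor at a time and track
# training versus validation error; the minimum of the validation curve
# approximates the right level of complexity (cf Figure 4).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

candidates = list(X_train.columns)                      # candidate predictors, arbitrary order
for k in range(1, len(candidates) + 1):
    cols = candidates[:k]
    m = LogisticRegression(max_iter=1000).fit(X_train[cols], y_train)
    err_train = log_loss(y_train, m.predict_proba(X_train[cols])[:, 1])
    err_valid = log_loss(y_valid, m.predict_proba(X_valid[cols])[:, 1])
    print(f"{k} predictors: train={err_train:.4f}, validation={err_valid:.4f}")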

Evaluating predictive models

Zahavi developed a series of principles to aid in building good predictive models. He suggests that model building is essentially an iterative process, as model developers cannot know from the start exactly which input variables will result in a good and stable model that is sufficiently generalised to yield good predictions for new observations.

To assess model prediction accuracy, Zahavi and his team employed a range of metrics in a flexible approach, varying model criteria, parameters, and configurations to build a sufficiently accurate model that fits the data well, yet possesses good predictive capacity for unseen and new data.

The accuracy and prediction error performance measures developed by Zahavi provide a roadmap for PA model creators.

The researchers also developed performance measures to compare model fit with prediction accuracy. As model accuracy is reflected in the variance of the output predictions for the test dataset, one way to determine model quality is to compare the goodness-of-fit between the training and validation datasets. A model for which these are close enough can be considered a candidate for a good model.

Targeting customers

An accurate, high-quality, and stable predictive model with no overfitting can be used for decision-making in targeting applications. Zahavi focused on the use of logistic regression-based binary (Yes/No) classification PA models for typical targeting applications in marketing. The aim was to only approach those customers most likely to respond to an offer to buy products or services, which would reduce marketing costs.

The major challenge facing the research team was to automate the process of building a large-scale regression model, regardless of test dataset size and number of input variables employed. When using binary logistic regression models to predict customer purchasing probabilities, the higher the probability, the higher the likelihood the customer will respond to an offer to purchase a product. Sorting customers according to these predicted probabilities in descending order places the highest responding customers at the top of the list. The decision problem then becomes one of how many customers down the list to include in a marketing campaign while still remaining profitable.

In essence, this is a question of dividing non-targets from targets. Zahavi found that a good way forward was to select customers for whom the expected profit from purchase was higher than or equal to the cost of approaching them. In this way, a PA model can be used to effectively determine who to contact, thus saving company resources. The modelling results are presented graphically in Figures 2 and 3.
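
A compact sketch of this targeting rule, continuing the hypothetical prospects and p_respond from the earlier snippet and using invented margin and contact-cost figures, sorts the list, applies the profit cut-off, and summarises expected profit by decile in the spirit of Figure 3:

# Sketch of the targeting rule: contact a customer only if the expected profit
# from a purchase at least covers the cost of approaching them.
import pandas as pd

margin = 50.0            # profit if the customer buys (invented figure)
contact_cost = 2.0       # cost of one contact (invented figure)

scored = pd.DataFrame({"p_buy": p_respond}, index=prospects.index)
scored["expected_profit"] = scored["p_buy"] * margin - contact_cost

scored = scored.sort_values("p_buy", ascending=False)   # best prospects at the top of the list
targets = scored[scored["expected_profit"] >= 0]        # customers worth contacting

# Decile view, as in Figure 3: total expected profit per tenth of the ranked list
ranks = scored["p_buy"].rank(ascending=False, method="first")
scored["decile"] = pd.qcut(ranks, 10, labels=range(1, 11))
print(scored.groupby("decile", observed=True)["expected_profit"].sum())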

Enhancing business efficiency

Zahavi’s extensive survey of PA models shows that these are often complex and high-dimensional, which significantly complicates the model-building process. Building good-quality, large-scale PA models could offset the issues of overfitting and instability.

The accuracy and prediction error performance measures developed by Zahavi provide a roadmap for PA model creators, enabling them to develop more precise models for big data applications. In a competitive climate where companies must carefully navigate future events, accurate predictive modelling could provide businesses with a crucial competitive edge.

Personal Response

Which emerging theories and technologies could benefit PA in the future?
Most of the machine-learning algorithms in business require the use of well-defined datasets, often needing pre-processing by data and domain experts to prepare them for modelling. But modern prediction problems are often too complex for this, consisting of several types of data – structured and unstructured, audio and video, semantic and alphabetic, etc. These types of problems could benefit from the emerging technology of deep learning, which is capable of analysing even unstructured and messy datasets. A famous application is ChatGPT, which is basically a prediction problem of choosing the next word to include in a sentence or a paragraph. ChatGPT was trained on a vocabulary of roughly 50,000 English words using deep-learning technology.
This feature article was created with the approval of the research team featured. This is a collaborative production, supported by those featured to aid free of charge, global distribution.
