Key Driver (Relative Importance) Analysis
Key Driver Analysis, or Relative Importance Analysis, is the generic name given to a number of regression/correlation-based techniques used to discover which of a set of independent variables have the greatest influence on a given dependent variable, i.e., which of them matter most in determining its value. As an example, in market research surveys the dependent variable could be a measure of overall satisfaction, whilst the independent variables are measures of other aspects of satisfaction, e.g., efficiency, value for money, customer service etc. The independents in this example are then often called Drivers of Satisfaction. By applying a suitable Key Driver technique and ordering these variables by a measure of importance, a researcher can better understand where a company should focus its attention if it wants to see the greatest impact.
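As a minimal sketch of the idea, the snippet below regresses an overall satisfaction score on a handful of standardised driver ratings and ranks the drivers by the size of their coefficients. The file and column names are purely illustrative, and this naive approach is exactly what the techniques discussed later improve upon.

```python
# Naive key driver analysis: regress overall satisfaction on the driver
# ratings and rank the drivers by the size of their standardised coefficients.
# File and column names here are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("survey.csv")
drivers = ["efficiency", "value_for_money", "customer_service"]

X = (df[drivers] - df[drivers].mean()) / df[drivers].std()   # standardise the drivers
y = df["overall"]

model = sm.OLS(y, sm.add_constant(X)).fit()
importance = model.params[drivers].abs().sort_values(ascending=False)
print(importance)   # the largest coefficients point to the strongest drivers
```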
Key Driver Analysis should perhaps be renamed marginal resource allocation analysis, in that blindly reallocating all of one's resources based on these results would often lead to problems. Take, for example, airline satisfaction. No airline would ever consider cutting safety standards to improve its food, although in a Key Driver Analysis the standard of food would almost invariably come out as more important. This is because, with respect to air travel, safety is assumed, and hence safety standards do not generally influence people's choice. However, the standard of food does vary between airlines and thus has a much greater influence on people's satisfaction – and thus their choice of airline. Hence Key Driver Analysis shows where one should allocate extra resources if one wishes to see the greatest impact on satisfaction scores.
One of the main problems
One of the main problems with analysing satisfaction and other similar data is that the independent variables are often highly correlated with one another. This is called multicollinearity, and it can result in importance values derived from simple regression/correlation analysis being inaccurate and potentially highly misleading.
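One quick, standard way to check for multicollinearity before trusting any regression-based importance scores is to compute variance inflation factors (VIFs); values well above roughly 5–10 are usually taken as a warning sign. A small sketch, reusing the illustrative survey columns from above:

```python
# Variance inflation factors: a standard diagnostic for multicollinearity.
# High VIFs mean the ordinary regression coefficients (and importances derived
# from them) are likely to be unstable. Column names are illustrative.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.read_csv("survey.csv")
drivers = ["efficiency", "value_for_money", "customer_service"]

X = add_constant(df[drivers])
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)
```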
There are various methods to overcome this problem. Three of the most well-respected statistical techniques are:
- Shapley Value Analysis (see, for example, Ulrike Grömping, The American Statistician, 2007);
- Kruskal's Relative Importance Analysis (William Kruskal, The American Statistician, 1987);
- Ridge Regression (also known as Tikhonov regularization; see for example https://en.wikipedia.org/wiki/Tikhonov_regularization).
Shapley Value Analysis and Kruskal's Relative Importance Analysis are fairly similar in concept: for each independent variable, we derive a measure of the strength of the correlation between it and the dependent after we have "stripped out" its correlations with the other independent variables. The final measure of importance for each variable is the mean of these derived correlations taken over all possible regression models between the dependent and the different possible subsets of the independents.
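To make the averaging-over-subsets idea concrete, here is a brute-force sketch of the Shapley version: for each driver, average the gain in R-squared from adding it to every possible subset of the other drivers, using the usual Shapley weights. It is exponential in the number of drivers and is intended only as an illustration, not as JumpData's production algorithm.

```python
# Brute-force Shapley value importance: for each driver, average its
# R-squared contribution over all subsets of the other drivers using the
# standard Shapley weights. Exponential in the number of drivers -
# illustration only, not an efficient implementation.
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.linear_model import LinearRegression

def r_squared(X, y, cols):
    if not cols:
        return 0.0
    return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)

def shapley_importance(X, y):
    n = X.shape[1]
    scores = np.zeros(n)
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                cols = list(subset)
                scores[j] += weight * (r_squared(X, y, cols + [j]) - r_squared(X, y, cols))
    return 100 * scores / scores.sum()   # percentages, as the tool reports them
```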
The difference between the techniques
The difference between the techniques comes in the measure of correlation used in the procedure: semi-partial correlations in the case of Shapley, partial correlations in the case of Kruskal's. This simple difference leads to a huge difference in computation speed: Shapley's can be computed fairly easily, whereas Kruskal's is more time-consuming. However, since we believe Kruskal's is the more theoretically sound of the two, we have developed an original algorithm that allows both methods to be computed simultaneously at no extra time cost.
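For comparison, a sketch of the Kruskal-style calculation under the same illustrative assumptions: for each ordering of the drivers, take the squared partial correlation between the dependent and each driver given the drivers that precede it, then average over all orderings. Again this is brute force and exponential; it is not the algorithm used in the tool.

```python
# Brute-force Kruskal-style importance: average squared partial correlations
# over all orderings of the drivers. A partial correlation is obtained by
# correlating the residuals of y and of driver j after regressing both on the
# drivers that come earlier in the ordering. Illustration only.
from itertools import permutations

import numpy as np
from sklearn.linear_model import LinearRegression

def residual(v, X, cols):
    if not cols:
        return v - v.mean()
    fit = LinearRegression().fit(X[:, cols], v)
    return v - fit.predict(X[:, cols])

def kruskal_importance(X, y):
    n = X.shape[1]
    scores = np.zeros(n)
    orderings = list(permutations(range(n)))
    for order in orderings:
        for pos, j in enumerate(order):
            prior = list(order[:pos])
            ry = residual(y, X, prior)         # y with earlier drivers partialled out
            rx = residual(X[:, j], X, prior)   # driver j with earlier drivers partialled out
            scores[j] += np.corrcoef(ry, rx)[0, 1] ** 2
    scores /= len(orderings)
    return 100 * scores / scores.sum()
```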
Ridge Regression, on the other hand, in effect penalises the size of the regression coefficients (and hence the importance values derived from them) in an attempt to neutralise the effect of multicollinearity. In this case, the importance values returned depend on the penalty factor, so it is usual to compute a set of importance values and then choose the most appropriate one. A common way to help with this choice is to plot the importance values of the independents against the penalty factor and choose the penalty factor at the point where the graph appears to "flatten" (similar to reading a scree plot in factor analysis). Note that a penalty value of 0 is equivalent to ordinary linear regression.
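A sketch of that procedure, assuming scikit-learn's Ridge and the illustrative survey data from earlier: fit the model over a grid of penalty factors and plot the standardised coefficients against the penalty, looking for where the curves level off.

```python
# Ridge trace: plot standardised coefficients against the penalty factor and
# look for where the curves flatten. A penalty of 0 reproduces ordinary
# linear regression. Data and column names are illustrative.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

df = pd.read_csv("survey.csv")
drivers = ["efficiency", "value_for_money", "customer_service"]
X = (df[drivers] - df[drivers].mean()) / df[drivers].std()
y = df["overall"]

penalties = np.linspace(0, 50, 101)
coefs = [Ridge(alpha=a).fit(X, y).coef_ for a in penalties]

plt.plot(penalties, coefs)
plt.xlabel("Penalty factor")
plt.ylabel("Standardised coefficient")
plt.legend(drivers)
plt.show()
```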
JumpData Key Driver Tool
JumpData have developed an easy-to-use, web-based tool to conduct Key Driver analysis using the three techniques above. It allows the user to import either a .csv or Excel file and to run Ridge Regression followed by Shapley's and Kruskal's analyses (the latter two computed simultaneously). Since standardisation of the independent variables is recommended, the tool automatically standardises the data upon upload. For ease of comparison and interpretation, the importance scores from all three methods are reported as percentage values.
When conducting Key Driver analysis, we recommend running Ridge Regression first, in order to provide an immediate set of results. An appropriate penalty factor is chosen by inspecting where the scree plot flattens. Once the penalty factor has been chosen, the more computationally expensive Shapley and Kruskal's analyses are performed, allowing the results of all three methods to be compared. The importance scores from ordinary linear regression are also output for completeness.
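Putting the pieces together, a comparison along these lines might look like the sketch below, reusing the illustrative data and the shapley_importance / kruskal_importance functions defined above; the chosen penalty of 10 is simply a placeholder for whatever value the ridge trace suggests.

```python
# Side-by-side comparison of the four sets of importance scores, each rescaled
# to percentages. Assumes X, y, drivers, shapley_importance and
# kruskal_importance from the earlier sketches; the penalty is a placeholder.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge

def as_percent(scores):
    scores = np.abs(np.asarray(scores, dtype=float))
    return 100 * scores / scores.sum()

chosen_penalty = 10.0   # read off the ridge trace plot

results = pd.DataFrame({
    "OLS":     as_percent(LinearRegression().fit(X, y).coef_),
    "Ridge":   as_percent(Ridge(alpha=chosen_penalty).fit(X, y).coef_),
    "Shapley": shapley_importance(X.values, y.values),
    "Kruskal": kruskal_importance(X.values, y.values),
}, index=drivers)
print(results.round(1))
```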
The results of the three Key Driver techniques are likely to be fairly similar, with Kruskal's output favoured by JumpData as the most theoretically sound way of dealing with multicollinearity in the dataset. We would recommend using Ridge Regression only as a sense check on the other two techniques.
Having said that, Kruskal's technique is the most computationally expensive of the three: because the number of possible subsets of the independents (and hence of regression models to evaluate) doubles with each extra variable, the time taken increases by a factor of just over two for each additional independent. So, with 16 independent variables it may take around 7 seconds to execute, increasing to 35 seconds for 18 variables, 160 seconds for 20 variables, and 800 seconds – close to 13 and a half minutes – for 22 variables. (These times were recorded when the program was run locally, so they may be slower on other machines.) However, if you are conducting analysis with such a large number of independents, you may wish to consider whether all of them are actually going to be reported on: in reality you will usually have four or five important factors, a similar number of slightly important ones, and very few of dominant importance. We recommend keeping the number of independents below 20 for ease of interpretation of the results (and indeed we have limited the tool to a maximum of 22 independent variables).
If you do have more than 20 independents, an alternative is to pre-analyse the data using Ridge Regression, look at the base regression importance scores, and pick out the (20 or fewer) most important independents from this. You can then run both Shapley's and Kruskal's on this reduced set of "more important" independents.
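A possible way to script that pre-screening step, again assuming the illustrative names and the chosen ridge penalty from the earlier sketches:

```python
# Pre-screening with ridge: rank the drivers by the size of their ridge
# coefficients and keep only the strongest, then run the (much slower)
# Shapley and Kruskal analyses on the reduced set. Illustrative only.
import numpy as np
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=chosen_penalty).fit(X, y)
ranked = np.argsort(-np.abs(ridge.coef_))   # most important first
keep = [drivers[i] for i in ranked[:20]]    # keep at most 20 drivers

X_reduced = X[keep]   # then run Shapley's and Kruskal's on X_reduced only
```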
For examples of how to use the tool / interpret the results, download this document.