Understanding the Limitations and Pitfalls of SHAP in Personal Auto Insurance
As a consulting actuary working with insurers on predictive modeling, I have noticed that while SHAP (SHapley Additive exPlanations) values can help explain complex models, they can also lead to some expensive mistakes if used carelessly.
In two previous blog posts about SHAP, I introduced its basics and explained how it breaks down individual predictions with local explanations. I then explored global SHAP, which measures how important each feature in a predictive model is across an entire portfolio. I highlighted SHAP as a powerful, mathematically grounded approach to making sense of black-box models, like those often used in auto insurance pricing.
But SHAP is not a panacea. In personal auto insurance, where mispriced policies can erode profitability, understanding SHAP’s pitfalls is critical. In this blog post, I simulate an auto insurance dataset and use it to highlight five key limitations of SHAP that can obscure model interpretations.
The Auto Insurance Dataset
To set the stage for this illustration, I have simulated a personal auto insurance dataset that resembles what one typically sees in real-world auto insurance data. The dataset includes common rating factors such as driver age, years of driving experience, insurance score, annual mileage, vehicle age, prior accidents and traffic violations, vehicle value, and an urban indicator. I have included one moderate correlation and one variable interaction, as follows:
- Moderate correlation: Insurance score and prior claims history correlate at approximately 0.55
- Variable interaction: Young drivers (<25) with prior accidents are charged an additional $50 on their premium to reflect the multiplicative effect of these two variables
The table below summarizes the dataset’s variables and their actuarial relevance:
| Feature | Description | Actuarial Relevance |
| --- | --- | --- |
| Driver Age | Age of driver (18-80) | Signals maturity, risk appetite |
| Driving Experience | Years since licensing | Correlates with age |
| Insurance Score | Numeric score (e.g., credit-based) | Predictive of claim propensity |
| Prior Claims | Count of past claims | Correlates with insurance score (approx. 0.55) |
| Annual Mileage | Miles driven per year | Indicator of exposure |
| Vehicle Age | Age of vehicle (0-15 years) | Impacts repair costs, risk profile |
| Violations | Count of traffic violations | Signals risky driving behavior |
| Vehicle Value | Market value of vehicle | Correlates with repair/replacement costs |
| Urban Indicator | Binary (1 = urban, 0 = non-urban) | Captures environmental risk factors |
I trained a Random Forest model on this dataset to predict premiums and calculated SHAP values to explain the predictions. Below, I outline five key limitations exposed by this setup, showing where SHAP can mislead and why it should never be used in isolation.
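For readers who want to reproduce the setup, here is a minimal sketch in Python. The coefficients, distributions, and variable names (e.g., `insurance_score`, `prior_claims`) are illustrative assumptions, not the exact parameters behind the figures quoted below:

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 10_000

# Correlated pair: a shared latent factor links insurance score and
# prior claims (target correlation roughly 0.55).
z = rng.standard_normal(n)
insurance_score = 650 + 100 * z
prior_claims = rng.poisson(np.exp(-0.55 * z - 0.5))  # lower score -> more claims

df = pd.DataFrame({
    "driver_age": rng.integers(18, 81, n),
    "insurance_score": insurance_score,
    "prior_claims": prior_claims,
    "annual_mileage": rng.normal(12_000, 4_000, n).clip(1_000),
    "vehicle_age": rng.integers(0, 16, n),
    "violations": rng.poisson(0.3, n),
    "vehicle_value": rng.normal(25_000, 10_000, n).clip(3_000),
    "urban": rng.integers(0, 2, n),
})
# Driving experience tracks age with some noise, so the pair is
# correlated but not redundant.
df["driving_experience"] = (df["driver_age"] - 18 - rng.integers(0, 8, n)).clip(lower=0)

# Premium: linear effects plus the young-driver-with-accidents interaction.
premium = (
    500
    - 0.4 * (df["insurance_score"] - 650)
    + 120 * df["prior_claims"]
    + 0.01 * df["annual_mileage"]
    + 4 * (60 - df["driver_age"]).clip(lower=0)  # younger drivers pay more
    + 80 * df["violations"]
    + 0.004 * df["vehicle_value"]
    + 100 * df["urban"]
    + 50 * ((df["driver_age"] < 25) & (df["prior_claims"] > 0))  # interaction
    + rng.normal(0, 25, n)  # noise
)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(df, premium)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(df)  # one row per policy, one column per feature
```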
Limitation 1: The Correlation Conundrum
Moderately correlated features, such as insurance score and prior claims history, pose a challenge for SHAP. Unlike highly correlated pairs, for which dropping one variable is the obvious solution, moderate correlations are ambiguous - the two correlated variables capture overlapping, but also distinct, signals. Such multicollinearity often results in unstable feature importance rankings. SHAP assumes feature independence, so it splits variable importance between correlated variables somewhat arbitrarily.
Using my simulated dataset, a Random Forest trained with one random seed might attribute the highest importance to insurance score; rerun with a different seed, and prior claims history suddenly becomes dominant. Across repeated runs, we might see 40 units of importance assigned to insurance score and 30 to prior claims history in one run, with the next run reversing the split. Neither variable is redundant, but SHAP cannot reliably distinguish their contributions and treats both as separate effects.
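A quick way to observe this instability, sketched below using the setup code from earlier (the specific importance split will vary by seed):

```python
# Refit with different seeds and watch the importance split between the
# correlated pair move around, even though the data never changes.
for seed in (0, 1, 2):
    m = RandomForestRegressor(n_estimators=300, random_state=seed).fit(df, premium)
    sv = shap.TreeExplainer(m).shap_values(df)
    imp = pd.Series(np.abs(sv).mean(axis=0), index=df.columns)
    print(seed, imp["insurance_score"].round(1), imp["prior_claims"].round(1))
```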
The fundamental issue is that SHAP calculations assume the model's variables are independent. But real insurance data is heavily correlated - insurance score correlates with prior claims, vehicle value correlates with repair costs, and urban drivers have different mileage patterns. SHAP's elegant mathematics breaks down when these independence assumptions are violated.
When correlations are moderate, actuaries cannot simply throw out a single variable. Cross-checking SHAP with univariate analyses and other tools, such as partial dependence plots or feature selection techniques, is important to confirm whether the variables represent overlapping signals.
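As one example of such a cross-check, the sketch below uses scikit-learn's partial dependence utilities on the setup above; if a feature's SHAP importance is high but its partial dependence curve is nearly flat, the importance may be borrowed from its correlated partner:

```python
from sklearn.inspection import partial_dependence

# Compare the model's partial dependence range for each correlated feature.
for feat in ("insurance_score", "prior_claims"):
    result = partial_dependence(model, df, features=[feat])
    curve = result["average"][0]
    print(feat, "partial-dependence range:", round(curve.max() - curve.min(), 1))
```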
Limitation 2: The Background Dataset Trap
SHAP values should be interpreted relative to a reference population, which means that the choice of background population dramatically affects how we explain the model's predictions.
In my simulated example, explaining the premium of a 40-year-old driver using young drivers (<30) as the reference population might yield a -$80 SHAP value for age (lower premium relative to young drivers). On the other hand, using older drivers (>60) as the reference population, the same driver’s age might contribute +$60 (higher premium relative to older drivers). Both explanations are “correct,” yet seemingly inconsistent.
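The sketch below illustrates this, reusing the earlier setup; the interventional form of `shap.TreeExplainer` accepts an explicit background dataset, and the exact dollar figures here are illustrative:

```python
# Explain the same 40-year-old driver against two reference populations.
driver = df[df["driver_age"] == 40].iloc[[0]]

young = df[df["driver_age"] < 30].sample(100, random_state=0)
older = df[df["driver_age"] > 60].sample(100, random_state=0)

sv_young = shap.TreeExplainer(
    model, data=young, feature_perturbation="interventional"
).shap_values(driver)
sv_older = shap.TreeExplainer(
    model, data=older, feature_perturbation="interventional"
).shap_values(driver)

age = df.columns.get_loc("driver_age")
# Expect a negative age contribution vs. the young reference and a
# positive one vs. the older reference - for the same driver.
print(sv_young[0, age].round(1), sv_older[0, age].round(1))
```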
For consistent interpretation of SHAP values, it helps to standardize the analysis by using the same background dataset - for example, the full training data - for every SHAP calculation. Otherwise, different analyses using varied reference subsets produce conflicting SHAP explanations for the same predictions.
Limitation 3: Missing the Interaction Forest for the Trees
As we know, insurance risks are rarely additive. Youth and accident history are not just risky on their own - a young driver with accidents is multiplicatively riskier. I built this interaction into my simulated dataset by including a $50 surcharge for the combination of young driver and prior accidents, but SHAP failed to reveal it clearly.
When you examine the results, you will see that SHAP attributes some importance to age and some to accidents separately, but obscures their combined effect and fails to show that these variables multiply rather than add. For example, SHAP might show that accidents add approximately $200 to premiums and young age adds approximately $100, but it may miss that young drivers with accidents actually need an additional $50 for the interaction.
The young driver with two accidents is not just twice as risky as a driver with one accident - they might be three or four times riskier. SHAP will not make this multiplicative effect obvious, potentially resulting in underpriced policies in high-risk segments. Actuaries should complement SHAP with interaction plots from GLM diagnostics or SHAP interaction values (though the latter increases computational cost) to ensure that pricing reflects true risk interactions.
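For the SHAP interaction values route, `shap.TreeExplainer` exposes them directly for tree models; a sketch on the earlier setup, restricted to a subsample because the computation is expensive:

```python
# Interaction values form an (n_samples, n_features, n_features) array;
# off-diagonal entries capture pairwise interaction effects.
sample = df.sample(500, random_state=0)
inter = shap.TreeExplainer(model).shap_interaction_values(sample)

i = df.columns.get_loc("driver_age")
j = df.columns.get_loc("prior_claims")

# Mean absolute age x prior-claims interaction (doubled: the matrix is symmetric).
print("age x prior_claims:", round(2 * np.abs(inter[:, i, j]).mean(), 2))
```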
Limitation 4: The Computational Reality Check
Another major roadblock is the computational power required to calculate SHAP across large portfolios. Calculating SHAP over 100 policies might take only 5 to 10 seconds, but over one million policies it might take more than 14 hours. Such constraints often necessitate sampling strategies, although great care is needed to avoid drawing inaccurate conclusions.
If computational power is an issue, instead of relying on exact methods you can apply faster approximations, such as Kernel SHAP or FastSHAP, but actuaries should validate their accuracy to avoid compounding errors across large books of business. So the question becomes: how do we validate that an approximation method is accurate enough?
One possibility is to benchmark the approximation against exact SHAP values calculated on a subsample of more manageable size. Using a sample of 500 to 1000 randomly selected policies, calculate the SHAP values using the exact method and the proposed approximation. For each variable, calculate the average value and the maximum value of the absolute differences between the two methods. If the estimation error consistently exceeds 5-10% of the average premium, the approximation is likely too volatile to be used for pricing decisions.
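A minimal sketch of this benchmark, using exact TreeSHAP as the reference and Kernel SHAP (with a k-means-summarized background) as the approximation under test; thresholds and sample sizes follow the rule of thumb above:

```python
# Benchmark an approximate explainer against exact TreeSHAP on a subsample.
bench = df.sample(500, random_state=1)
exact = shap.TreeExplainer(model).shap_values(bench)

background = shap.kmeans(df, 25)  # summarized background for Kernel SHAP
approx = shap.KernelExplainer(model.predict, background).shap_values(
    bench, nsamples=200
)

diff = np.abs(exact - approx)
report = pd.DataFrame(
    {"mean_abs_diff": diff.mean(axis=0), "max_abs_diff": diff.max(axis=0)},
    index=df.columns,
)
print(report)
print("5% of average premium:", round(0.05 * premium.mean(), 1))
```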
Another useful check is to test how the SHAP approximation varies with different sample sizes. Plotting the resulting SHAP value distributions across key customer segments can help determine whether the SHAP values stabilize as the sample size grows. If values continue to vary dramatically even at large sample sizes, an even larger sample or a fundamentally different approximation method might be needed.
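A sketch of the sample-size check, reusing the Kernel SHAP background from the benchmark above and printing top features rather than plotting full distributions:

```python
# Do approximate SHAP importances stabilize as the sample size grows?
for size in (100, 250, 500):
    sample = df.sample(size, random_state=2)
    sv = shap.KernelExplainer(model.predict, background).shap_values(
        sample, nsamples=200
    )
    imp = pd.Series(np.abs(sv).mean(axis=0), index=df.columns)
    print(size, imp.sort_values(ascending=False).head(3).round(1).to_dict())
```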
Yet another possible check is out-of-sample validation on a holdout dataset of policies that were not used in any prior SHAP calculations. Compute SHAP values for these policies using the original approximation, then predict the premiums again by summing the base value and the SHAP contributions. Compare these new premiums with what the model actually predicts, and if the average difference is more than 2-3%, the approximation might be introducing too much noise. From SHAP theory, we know that SHAP contributions should sum to the prediction difference, and so approximations that violate this rule by large margins are not trustworthy.
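A sketch of this additivity check with the same Kernel SHAP approximation; `expected_value` is the explainer's base value:

```python
# Local accuracy check: base value + SHAP contributions should recover
# the model's prediction; large gaps mean the approximation is too noisy.
holdout = df.sample(500, random_state=3)  # not used in earlier SHAP runs

kexp = shap.KernelExplainer(model.predict, background)
sv = kexp.shap_values(holdout, nsamples=200)

reconstructed = kexp.expected_value + sv.sum(axis=1)
actual = model.predict(holdout)

rel_err = np.abs(reconstructed - actual) / actual
print("mean relative error:", round(rel_err.mean(), 4))  # flag above ~2-3%
```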
Finally, stability can be tested by running the approximation multiple times with different random seeds. For each run, this involves ranking the model's features by their absolute average SHAP value, and if the ranking bounces around - say, insurance score is #2 in one run but #7 in another - the method is probably too unreliable. Such unstable rankings suggest that the approximation is essentially guessing. What we would like to see is approximations where the top five features rank consistently across repeated runs.
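A sketch of the rank-stability test; here the background sample is reseeded each run, one plausible source of run-to-run variation for Kernel SHAP:

```python
# Rerun the approximation with different seeds and compare feature
# rankings by mean |SHAP|; the top five should stay consistent.
sample = df.sample(300, random_state=4)
rankings = []
for seed in (0, 1, 2):
    bg = shap.kmeans(df.sample(1_000, random_state=seed), 25)
    sv = shap.KernelExplainer(model.predict, bg).shap_values(sample, nsamples=200)
    imp = pd.Series(np.abs(sv).mean(axis=0), index=df.columns)
    rankings.append(imp.rank(ascending=False))

print(pd.concat(rankings, axis=1))  # one ranking column per run
```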
Limitation 5: The Instability Problem
Even small changes in the inputs can cause significant SHAP value shifts, even for unrelated features. For example, adding 10 miles to a driver's annual mileage (less than a 0.1% change) might alter not only the SHAP value of the mileage variable but also that of the insurance score variable by $10-20. This occurs because when any single input feature changes, SHAP recalculates the contributions of all features, since it accounts for how features work together. SHAP considers all possible feature combinations when computing contributions, and approximation methods introduce additional variance between calculations.
Unfortunately, unlike the issues observed with correlated features, there is no straightforward solution; this is an inherent property of SHAP's combinatorial calculations. One practical option is to run the explanation multiple times and report averaged or median SHAP values rather than single-run outputs. Additionally, one can set materiality thresholds based on the standard deviation of SHAP values across multiple runs - if a feature's SHAP value fluctuates within one standard deviation of its mean across runs, treat the variation as noise rather than signal. Both ideas are sketched below.
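A sketch combining both ideas under the same Kernel SHAP assumptions as before; the materiality rule here (mean magnitude must exceed run-to-run standard deviation) is one way to implement the threshold:

```python
# Average SHAP values across repeated runs and apply a materiality threshold.
policies = df.sample(50, random_state=5)
runs = []
for seed in range(5):
    bg = shap.kmeans(df.sample(1_000, random_state=seed), 25)
    sv = shap.KernelExplainer(model.predict, bg).shap_values(policies, nsamples=200)
    runs.append(sv)

runs = np.stack(runs)          # (n_runs, n_policies, n_features)
mean_sv = runs.mean(axis=0)    # report these instead of single-run outputs
std_sv = runs.std(axis=0)

# Flag contributions whose mean magnitude exceeds their run-to-run
# standard deviation; treat everything smaller as noise, not signal.
material = np.abs(mean_sv) > std_sv
print("share of material contributions:", round(material.mean(), 2))
```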
After reviewing its limitations, one might wonder if SHAP is worth using. The answer is yes - but only as part of a broader interpretability strategy. SHAP is a powerful tool, but it is not a one-size-fits-all solution.
- Do not put all your eggs in one SHAP basket when interpreting models. Mixing SHAP with more traditional actuarial tools like univariate analyses, rate relativity tables, or partial dependence plots can corroborate findings. For instance, if SHAP indicates that the insurance score variable is very important, but a univariate analysis shows only slight differences in the rate relativities, that is a red flag worth investigating more deeply.
- Check your reference populations, how you are handling correlations, whether interactions are present, and whether your results are consistent across different samples.
- Make sure to use alternative tools, such as permutation importance for correlated features, Local Interpretable Model-agnostic Explanations (LIME) for intuitive local explanations, or Accumulated Local Effects (ALE) plots for robust effect estimation.
In my next blog post, I will explore these and other alternative tools to show how they can address SHAP's weaknesses and ensure that analytics deliver value without the pitfalls.
Key Takeaways
- SHAP is valuable but limited by correlations, reference dependence, interaction masking, computational cost, and instability.
- Never rely on SHAP alone - robust models require a comprehensive interpretability framework, blending SHAP with traditional actuarial tools, other complementary tools, and business logic to ensure pricing accuracy and reliable insights.