The Ultimate Guide to Feature Selection in Machine Learning (With Real Data Examples)
A hands-on comparison of five machine learning feature selection methods (Tree Importance, SHAP, RFE, Boruta, Permutation) applied to a credit risk dataset.
If you’ve ever trained a machine learning model, you’ve probably wrestled with feature selection. Drop too much, and you lose signal. Keep everything, and you risk noise, instability, or a model you can’t explain. But here’s the twist: even the most popular feature selection techniques don’t agree with each other. So how do you know who to trust? I ran five common methods on the same dataset to find out - and the results were eye-opening.
The Setup
I worked with a Kaggle credit risk dataset. The target was binary: risky vs. not risky customers. Predictors included both behaviours (balances, overdue counts, payment patterns) and demographics.
This wasn’t a clean textbook dataset. Some features overlapped heavily (highest_balance and new_balance were ~75% correlated). Some were categorical with many possible values. Others were numeric but skewed. In other words: a realistic mix that makes feature selection both important and tricky.
To keep comparisons fair, I locked preprocessing and splits. Every method saw the same folds with the same encodings. Then I asked: how do five common feature selection methods interpret this dataset, and why do their answers differ?
If you want the purely statistical side of the story - Pearson’s correlation, Chi-Squared test, F-test, Information Value, Kolmogorov-Smirnov statistic, Lasso regularisation - I’ve covered that in a separate guide. This piece stands alone, but together they form a comprehensive toolkit for understanding how to decide which features matter, and why.
1. Tree-Based Feature Importance
How it works
Decision trees split the dataset into groups, choosing features that reduce impurity (mix of risky vs. safe) the most. Each time a feature is chosen for a split, it earns credit. Across the full tree, or an ensemble of trees, those credits add up to its importance score.
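To make that concrete, here’s a minimal sketch of how impurity-based importances come out of a random forest. The feature matrix X, target y, and the hyperparameters are illustrative assumptions, not the exact configuration from my run.

```python
# Minimal sketch: impurity-based feature importance from a random forest.
# Assumes X is a pandas DataFrame of already-encoded predictors and y is the
# binary risky/not-risky target; hyperparameters are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X, y)

# Each feature's share of total impurity reduction, summed over every split
# in every tree of the ensemble.
tree_importance = (
    pd.Series(rf.feature_importances_, index=X.columns)
      .sort_values(ascending=False)
)
print(tree_importance.head(10))
```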
What it values (and biases)
Favours features that make clean, early splits.
Inflates features with many thresholds or categories.
When features are correlated, it usually crowns one favourite while the other looks unimportant.
What I found
Top features: fea_2, fea_4, fea_10, fea_8, fea_11, highest_balance. Together they explained about 73% of the importance. Most were categorical, which shows the tree found neat category splits.
Two quirks stood out:
highest_balance beat new_balance despite being highly correlated - classic “correlation roulette.”
What struck me was how much the “easy splitters” dominated. Impurity scores have this well-known bias: they inflate features with many possible cut points — usually continuous numerics or high-cardinality categoricals. With one-hot encodings, the bias just shifts into giving lots of small dummies more chances to be picked. In my run, several categoricals topped the list. That could mean they truly were useful, but it could also mean their importance was overstated by the way trees measure splits. That’s why I wanted to check them against other methods before trusting the story.
Takeaway
Tree-based importance highlights features that slice data cleanly, not necessarily those that consistently drive predictions.
2. SHAP Values
How it works
SHAP borrows from game theory’s Shapley values, which are a way to fairly divide credit in team games. For a given prediction:
It considers all possible orders in which features could be added to the model.
Each time a feature joins, it asks: how much did this feature change the prediction compared to before it was added?
It averages that contribution across all possible team orders.
The result is a fair, per-feature contribution for that specific prediction. Positive SHAP values push the prediction upward (toward “risky”), negative values push it downward (toward “safe”).
When you repeat this across thousands of predictions and take the average magnitude (mean |SHAP|), you get global importance — a ranking of which features consistently influence outcomes.
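As a rough sketch of how that global ranking comes together in code - assuming the fitted forest rf and DataFrame X from the previous snippet, plus the shap package - it looks something like this:

```python
# Minimal sketch: global SHAP importance for a tree ensemble. Assumes the
# fitted `rf` and DataFrame `X` from the earlier snippet (pip install shap).
import numpy as np
import pandas as pd
import shap

explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X)

# Depending on the shap version, a binary classifier returns either a list of
# two arrays or a single 3-D array; keep the slice for the "risky" class.
if isinstance(sv, list):
    sv = sv[1]
elif sv.ndim == 3:
    sv = sv[:, :, 1]

# Mean absolute SHAP value per feature = global importance ranking
mean_abs_shap = (
    pd.Series(np.abs(sv).mean(axis=0), index=X.columns)
      .sort_values(ascending=False)
)
print(mean_abs_shap.head(10))
```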
What it values (and biases)
Consistency: Features that steadily push predictions in the same direction across many customers get high SHAP scores.
Fairness in correlation: If two features overlap, SHAP splits credit more evenly instead of arbitrarily favouring one (like trees often do).
Sensitive to assumptions: If you don’t choose the background distribution carefully (conditional vs. independent assumptions), SHAP can over- or under-credit proxies and correlated features.
What I found
SHAP agreed with the trees on the backbone - fea_2 and fea_4 were still at the top. But it also pulled some quiet players into the spotlight. fea_1 and new_balance rarely got picked for splits, yet SHAP showed they were steadily nudging predictions across many customers. On the flip side, variables like highest_balance and prod_code looked weaker here than the trees had suggested. That was my first hint that trees sometimes crown “split darlings” that don’t actually carry steady predictive weight.
Takeaway
SHAP surfaces “quiet influencers” that trees overlook and provides a fairer view when features overlap.
3. Recursive Feature Elimination (RFE)
How it works
RFE starts with all features, trains a model, and drops the least important one(s). It repeats the process until it reaches a target number. If nested cross-validation is used (as I did), it also checks performance at each stage.
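A simplified sketch of that loop, using scikit-learn’s RFECV (cross-validated elimination standing in for the full nested setup), with the same assumed X and y as before:

```python
# Minimal sketch: recursive feature elimination with cross-validation.
# A simplified stand-in for the nested-CV setup described above; X, y and the
# estimator settings are illustrative assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rfe = RFECV(
    estimator=RandomForestClassifier(n_estimators=300, random_state=42),
    step=1,                                             # drop one feature per round
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring="roc_auc",                                  # score each subset by AUC
)
rfe.fit(X, y)

print("Optimal number of features:", rfe.n_features_)
print("Selected:", list(X.columns[rfe.support_]))
```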
What it values (and biases)
Favours subsets that maximise predictive performance for the specific model used.
Keeps features that add incremental lift in combination, even if weak alone.
Can be unstable if not cross-validated properly.
What I found
The final set included fea_3, fea_5, fea_7, OVD_t1, OVD_t2, OVD_sum, pay_normal, highest_balance, fea_4, fea_8, fea_10, fea_11.
Notably, RFE kept OVD_t1, OVD_t2, OVD_sum, fea_5 — features that SHAP had ranked low. Their value came from interactions with others, not solo strength.
Takeaway
RFE is pragmatic: if a feature helps the chosen model perform better, it stays — even if other methods dismiss it.
4. Boruta
How it works
Boruta builds random forests, then adds “shadow” features - shuffled copies of the real ones. If a real feature consistently outperforms its shadow across many iterations, it’s kept. Otherwise, it’s rejected.
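A minimal sketch with the BorutaPy implementation (pip install Boruta), again assuming the X and y used earlier; the forest settings and iteration count are my own illustrative choices:

```python
# Minimal sketch: Boruta with shadow features via the BorutaPy package.
# X and y are the same DataFrame/Series assumed earlier; BorutaPy expects
# plain numpy arrays, hence the .values calls.
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_jobs=-1, class_weight="balanced",
                                max_depth=5, random_state=42)
boruta = BorutaPy(forest, n_estimators="auto", max_iter=100, random_state=42)
boruta.fit(X.values, y.values)

confirmed = X.columns[boruta.support_]        # consistently beat their shadows
tentative = X.columns[boruta.support_weak_]   # borderline, neither kept nor rejected
print("Confirmed:", list(confirmed))
print("Tentative:", list(tentative))
```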
What it values (and biases)
Identifies features that are robustly better than random noise.
Conservative - often rejects marginal but potentially useful interaction features.
Dependent on tree parameters and number of iterations.
What I found
Boruta confirmed a smaller set: fea_2, fea_10, fea_11, fea_4, fea_8, highest_balance.
Takeaway
Boruta is strict. It keeps the all-weather core but may drop subtle helpers.
5. Permutation Importance
How it works
After training, permutation importance shuffles one feature’s values and measures how much model performance drops. If accuracy falls sharply, the feature mattered. If accuracy barely changes - or improves - the feature was unhelpful or harmful.
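In code, a minimal sketch with scikit-learn’s permutation_importance on a held-out split (the split, scorer, and repeat count are illustrative assumptions):

```python
# Minimal sketch: permutation importance on a held-out split.
# Shuffling one column and re-scoring measures how much the model relied on it.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=500, random_state=42).fit(X_train, y_train)

result = permutation_importance(
    model, X_test, y_test,
    scoring="roc_auc", n_repeats=20, random_state=42,
)
perm = pd.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False)
print(perm)  # negative values: the model actually did better without that feature
```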
What it values (and biases)
Captures true contribution to performance after training.
Penalises features that only add noise.
Struggles with correlated features (shuffling one can be masked by another).
What I found
fea_2 was the standout workhorse.
fea_1 added meaningful lift.
fea_4 was borderline.
fea_3 and fea_0 were harmful — the model did better without them.
Takeaway
Permutation importance is blunt but honest. It shows who truly carried weight in the final model.
Cross-Technique Synthesis: What the Contradictions Really Mean
When I stepped back from the five methods, two things became clear. First, there is a reliable backbone of features that every technique gravitates toward. Second, the disagreements are not noise - they’re signals in their own right, showing how each method “sees” the data through a different lens.
The Backbone (Convergence)
Across trees, SHAP, RFE, Boruta, and permutation, a cluster of features (fea_4, fea_8, fea_10, fea_11) kept reappearing. No matter whether the method cared about neat splits, marginal contributions, noise-robustness, or post-hoc performance, these features came through as consistently valuable.
This is what I would call the “all-weather drivers.”
They’re the safest bet when explaining the model to stakeholders, because they’re not an artifact of a single method.
The Contest for Balance Variables
The pair highest_balance and new_balance highlighted how methods diverge in handling correlation:
Trees loved highest_balance - because it gave sharp, early splits.
SHAP suggested that new_balance was the one making steady, consistent contributions across many customers, while highest_balance barely registered.
Permutation treated them cautiously, since shuffling one could be masked by the other. Why? Because when you shuffle highest_balance, the model can lean on new_balance (correlated ~0.75) and vice versa. The redundancy masks their individual contributions, so their permutation importances look smaller than what tree impurity or SHAP suggested.
RFE kept highest_balance to the end but dropped new_balance, showing that the model got more value from one than the other, not necessarily both.
So how do you decide? If I had to pick just one, I’d trust new_balance. It’s the steadier predictor, more interpretable, and less likely to collapse outside the training folds. If I cared only about squeezing out in-sample performance in a tree model, I might let highest_balance stay. And if model complexity wasn’t a concern, I’d keep both, but watch their stability carefully on out-of-time data.
In the end, new_balance feels like the safer bet - the feature I’d want to explain to stakeholders and rely on in production.
The “Quiet Influencers” vs. “Split Darlings”
fea_1 and pay_normal were rarely chosen by trees, but SHAP and permutation revealed them as steady contributors.
Conversely, prod_code and highest_balance ranked high for trees but looked weaker under SHAP and permutation.
This contrast shows why combining methods is critical: some features matter because they slice cleanly (split darlings), others because they quietly improve predictions in case after case (quiet influencers). If you only trusted trees, you’d keep split darlings and miss the quiet but dependable drivers.
The Synergy Features
RFE retained features like OVD_t1, OVD_t2, OVD_sum, fea_5 — variables other methods mostly ignored. These features didn’t shine alone, but together they added complementary signals.
This is the essence of interaction effects: one overdue variable might not matter, but the trio tells a stronger story.
It also explains why regulators often like to see RFE-style analyses - it ensures the feature set isn’t just the “loudest” variables but also the subtle helpers.
The Negative Contributors
Permutation flagged fea_3 and fea_0 as actively harmful. The model did better without them.
This finding is a reminder that irrelevant features are not always neutral - some introduce noise, destabilise calibration, or encourage overfitting.
None of the other methods directly exposed this; they just ranked these features low. Only permutation made the cost explicit.
Putting It Together
So what do we learn from the disagreements?
Agreement means robustness. If multiple methods highlight the same feature, you can trust it across models, data shifts, and encodings.
Disagreement means insight. A feature praised by one method but dismissed by another is a clue about how it interacts with the model. Is it a split darling? A quiet influencer? A synergy partner? A false friend?
This is why I see feature selection not as a filter, but as a conversation. Each method is a voice, pointing to a different way of understanding the data. The art is not to pick a winner, but to integrate the voices into a coherent picture.
The Playbook
Here’s the process I’d use going forward:
Start wide: run trees for quick triage.
Explain: run SHAP to reveal consistent influencers and check direction.
Lock the core: use Boruta to confirm the strongest features.
Add synergy carefully: run RFE with nested CV to capture interaction effects.
Final check: run permutation importance to see who actually carries weight, both in-fold and out-of-time. Drop unstable or harmful features.
Always report stability and narrative, not just rankings - for example, by tabulating every method’s verdict side by side, as sketched below.
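Here’s a rough sketch of that synthesis step, pulling the outputs of the earlier snippets into one comparison table. The object names (tree_importance, mean_abs_shap, perm, rfe, boruta) are assumptions carried over from those sketches.

```python
# Minimal sketch of the synthesis step: put every method's verdict in one table
# so agreements and disagreements are visible at a glance. The series and
# selector objects below are assumed to come from the earlier snippets.
import pandas as pd

summary = pd.DataFrame({
    "tree_importance": tree_importance,
    "mean_abs_shap": mean_abs_shap,
    "permutation": perm,
    "kept_by_rfe": pd.Series(rfe.support_, index=X.columns),
    "kept_by_boruta": pd.Series(boruta.support_, index=X.columns),
})

# Ranks make the score-based methods comparable despite their different scales.
score_cols = ["tree_importance", "mean_abs_shap", "permutation"]
summary["median_rank"] = summary[score_cols].rank(ascending=False).median(axis=1)
print(summary.sort_values("median_rank"))
```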
Closing
Feature selection isn’t a box-ticking step. It’s the moment you learn how your model actually thinks.
Trees show you what slices data neatly.
SHAP shows you who quietly shifts predictions.
RFE shows you which combinations make the model stronger.
Boruta shows you who survives the test against pure noise.
Permutation shows you who really carries weight when the rubber meets the road.
When all five lenses point to the same features, you’ve found your backbone. When they diverge, I wouldn’t see it as confusion - I’d see it as X-ray vision. You’re learning not just which features matter, but why different algorithms lean on them differently.
That, to me, is the real payoff. Feature selection isn’t just about dropping columns - it’s about learning how your model actually thinks. By looking at the same data through five different lenses, I don’t just get a cleaner dataset. I get confidence. Confidence that the model stands on solid ground. Confidence that I know where it might wobble. And confidence that the story I tell about the data isn’t guesswork - it’s a synthesis of perspectives solid enough to earn trust from both me and my stakeholders.