What Makes a Feature Useful? A Hands-On Exploration of Selection Techniques (1a)
How Do Models Choose Their Features? A Deep Dive into Feature Selection (Part 1a)
This blog is inspired by the paper written by Emmanuel Ileberi, Yanxia Sun and Zenghui Wang, which explored the development of ML-based classification models for creditworthiness decisions.
For the past eight years, I’ve worked with traditional credit risk models, mostly focused on estimating the probability of default. But when it comes to machine learning, my experience is primarily theoretical — more reading papers than writing code. Still, I’ve always been drawn to how different algorithms select features. Do they gravitate toward the same variables? Do their internal logics pick up on patterns I’d miss? And most importantly - how do different techniques differ from each other?
This series is my personal lab: a place to learn, test, and reflect as I explore the mechanics behind how different models choose their variables. Over this 3-part series, I’ll be exploring:
Feature selection techniques (this post + next)
Different model types and how they behave
Model evaluation
The Dataset
The data I'm working with is a customer transaction and demographics dataset sourced from Kaggle. The dataset labels risky and non-risky customers for different banking products, and includes both numeric and categorical variables. These cover payment behaviour (number of times overdue, number of times of normal payment, credit limit, current balance, etc.) as well as customer-specific demographic data.
Two Notes Before We Start
1. Not every algorithm requires you to select features. Some algorithms - especially tree-based models - can automatically handle irrelevant or redundant features. But in practice, feature selection can still offer benefits like faster training, improved interpretability, and in some cases, better performance.
2. I’ll be standardising the numerical variables before applying feature selection. While tree-based models don’t need this, many others - like logistic regression, Lasso, and distance-based algorithms - are sensitive to feature scale. Standardisation ensures fair comparisons and stable model behaviour across techniques.
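As a quick illustration, here's a minimal sketch of that standardisation step with scikit-learn; df and num_cols are hypothetical names for the raw data and its numerical columns:

```python
from sklearn.preprocessing import StandardScaler

# df: raw DataFrame, num_cols: list of numerical column names (both assumed)
scaler = StandardScaler()
# Rescale each numerical column to zero mean and unit variance
df[num_cols] = scaler.fit_transform(df[num_cols])
```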
Techniques I’m Exploring
I’ve chosen a mix of classic statistical techniques and modern ML-based methods for this series. Some of these are well-established in the banking industry, while others are newer or more commonly used in data science circles. Here’s the list:
Pearson’s Correlation
Chi-Squared Test
F-Test
Information Value (IV)
Kolmogorov-Smirnov (KS) Statistic
Lasso Regularisation
(Part 1b will cover Tree-Based Importance, SHAP, RFE, Boruta, Permutation Importance, Correlation Filtering, and Clustering)
1. Pearson's correlation: Pearson’s correlation is often one of the first filters applied, especially when dealing with numerical variables. A common threshold is an absolute correlation > 0.3, meaning any variable whose correlation with the target is stronger than this is considered potentially useful. The formula for Pearson's correlation is:
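r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\;\sqrt{\sum_i (y_i - \bar{y})^2}}

where x_i are the feature values, y_i the target values, and \bar{x}, \bar{y} their respective means.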
I calculated Pearson's correlation between each numerical variable at our disposal and the target variable, and noted that none of them crossed the 0.3 threshold (see below).
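A minimal sketch of how this check could look, assuming the standardised numerical features sit in a pandas DataFrame X_num and the binary target in a Series y (both hypothetical names):

```python
# X_num: DataFrame of standardised numerical features, y: binary target (assumed names)
pearson_corr = X_num.apply(lambda col: col.corr(y))  # Series.corr defaults to Pearson

# Keep features whose absolute correlation with the target clears the 0.3 threshold
selected = pearson_corr[pearson_corr.abs() > 0.3].index.tolist()
print(pearson_corr.sort_values(key=abs, ascending=False))
```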
Does this mean none of our variables are statistically related to the target? Not quite. The key limitation here is that Pearson’s correlation assumes both the feature and the target are continuous and normally distributed. But in our case, the target is binary (0/1 for non-risky vs. risky), which breaks that assumption. So, this isn’t the right tool for the job.
2. Chi Square: Unlike Pearson’s correlation, which is designed for continuous variables, the Chi-Square (χ²) test is better suited for evaluating relationships between categorical features and a categorical target. It checks whether there’s a statistically significant association by comparing observed frequencies with those expected under the assumption of independence. Here’s the formula:
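\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

where O_i is the observed frequency in cell i of the feature-by-target contingency table and E_i is the frequency expected under independence.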
A high χ² score (with low p-value) implies the feature and target are not independent, and the feature is useful for classification. The result of this test for our data is reproduced below:
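A minimal sketch of how this test could be run per categorical feature, using scipy's chi-square test of independence on a contingency table; df, the 'label' column, and the list of categorical columns are assumptions:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# df holds the raw data; 'label' is the binary target; the categorical list is illustrative
cat_features = ['fea_1', 'fea_3', 'fea_5', 'fea_6', 'fea_7', 'fea_9', 'prod_code']

rows = []
for col in cat_features:
    # Observed frequencies: categories of the feature vs. the two target classes
    contingency = pd.crosstab(df[col], df['label'])
    chi2_stat, p_value, dof, expected = chi2_contingency(contingency)
    rows.append({'feature': col, 'chi2': chi2_stat, 'p_value': p_value})

print(pd.DataFrame(rows).sort_values('chi2', ascending=False))
```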
As can be noted, based on this measure, fea_1, fea_3 and fea_6 are significant at the 5% level of significance, i.e. these features are informative. Specifically, the observed frequencies of their categories across the two target classes differ significantly from the frequencies we would expect if the feature and the target were independent.
Note, however, that this test only flags an association; it does not describe how the feature's distribution shifts across the target classes, or how strong that shift is.
3. ANOVA F Test: The ANOVA F-test is used to assess the linear dependency between each numerical feature and the binary target (label ∈ {0, 1}). Think of the ANOVA F-test as the numeric-feature sibling of the Chi-Square test: both check if a feature is statistically associated with a binary target, just for different data types. The result of this test for our data is reproduced below:
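A minimal sketch of how these F-statistics could be obtained with scikit-learn (X_num and y are the same hypothetical objects as before):

```python
import pandas as pd
from sklearn.feature_selection import f_classif

# f_classif computes the ANOVA F-statistic and p-value for each column of X_num against y
f_stats, p_values = f_classif(X_num, y)

anova_results = pd.DataFrame(
    {'feature': X_num.columns, 'f_statistic': f_stats, 'p_value': p_values}
).sort_values('f_statistic', ascending=False)
print(anova_results)
```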
fea_4 comes out as the most predictive feature by a wide margin, based on the ANOVA F-statistic and p-values. Several other features (OVD_t2, pay_normal, fea_2, highest_balance, OVD_t1, OVD_sum, OVD_t3, and new_balance) also exhibit statistically significant predictive power.
I do note that some of these variables are expected to exhibit high collinearity (e.g. highest_balance and new_balance, and the OVD_t* family). Tree-based models are robust to collinearity, but removing multicollinearity still improves stability and interpretability in linear and regularised models. Even in tree-based models, addressing multicollinearity will simplify the model and speed up training, although possibly at a marginal cost in accuracy.
Note that both the Chi-Square test and the ANOVA F-test ask essentially the same question: does the frequency of categories (Chi-Square) or the mean of the variable (ANOVA) change with the target variable?
4. Information Value: IV measures the predictive power of a feature in separating binary classes - typically “good” vs “bad” (e.g., non-risky vs risky customers). It quantifies how differently the distribution of a feature behaves for each class. The formula has been reproduced below:
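IV = \sum_i \left( \%\text{Good}_i - \%\text{Bad}_i \right) \times \ln\!\left( \frac{\%\text{Good}_i}{\%\text{Bad}_i} \right)

where the sum runs over the bins of the feature, %Good_i and %Bad_i are the shares of non-risky and risky customers falling into bin i, and the log term is the bin's Weight of Evidence (WOE).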
Imagine you’re checking how useful 'credit card limit' is for predicting whether someone will default. If most defaults happen in low-limit bins and most non-defaults in high-limit bins, IV will be high - meaning this feature is a good separator. The IV results in our data are reproduced below:
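A minimal sketch of how IV could be computed for a numeric feature using quantile bins; the binning scheme, the small 0.5 smoothing constant, and the object names (df, X_num, 'label') are assumptions:

```python
import numpy as np
import pandas as pd

def information_value(feature, target, bins=10):
    """Bin a numeric feature and compute its Information Value against a binary target."""
    tmp = pd.DataFrame({'x': feature, 'y': target})
    tmp['bin'] = pd.qcut(tmp['x'], q=bins, duplicates='drop')

    grouped = tmp.groupby('bin', observed=True)['y'].agg(total='count', bad='sum')
    grouped['good'] = grouped['total'] - grouped['bad']

    # Share of goods/bads per bin, with a small constant to avoid division by zero
    dist_good = (grouped['good'] + 0.5) / grouped['good'].sum()
    dist_bad = (grouped['bad'] + 0.5) / grouped['bad'].sum()

    woe = np.log(dist_good / dist_bad)           # Weight of Evidence per bin
    return ((dist_good - dist_bad) * woe).sum()  # IV is the WOE-weighted sum of the gaps

ivs = {col: information_value(df[col], df['label']) for col in X_num.columns}
print(pd.Series(ivs).sort_values(ascending=False))
```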
As noted, fea_4 and fea_2 are consistently strong across both metrics so far (the F-test and IV). This implies these variables have different average values for risky and non-risky customers, and that the difference is not by chance. They're likely useful for helping the model classify correctly.
On the other hand, features like pay_normal and the OVD_t* group showed strong results in the ANOVA test but weak Information Value scores. This suggests that while they exhibit differences in average values across classes (i.e., they separate means), they don't effectively capture the underlying distributional differences needed to distinguish between risky and non-risky customers. In other words, the mean differences aren't strongly aligned with class separation.
highest_balance and new_balance are comparatively weak across the metrics so far; we can consider dropping one or both unless they're needed for business context.
5. Kolmogorov-Smirnov (KS) statistic: The KS statistic measures the maximum distance between the cumulative distributions of a feature across two classes - typically “good” (non-risky) and “bad” (risky) outcomes. In essence, it tells us how well a feature or model distinguishes between the classes by identifying the point where the distributions diverge the most.
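Formally, for a feature x the statistic is the largest vertical gap between the two class-conditional cumulative distribution functions:

KS = \max_x \left| F_{\text{good}}(x) - F_{\text{bad}}(x) \right|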
Since both KS and Information Value (IV) aim to assess class separation, it’s fair to ask: Do we really need both?
The answer is yes - because they focus on different aspects:
IV gives a holistic view of how well a feature separates classes across all bins
KS zooms in on the single most separated point between distributions
So you might encounter situations like:
High IV, low KS: the feature contributes steadily across bins, but has no sharp split
High KS, low IV: there’s one strong separation point, but the rest of the distribution overlaps
Together, IV and KS give a more rounded perspective on feature usefulness and can help catch what the other might miss.
The KS results for our data are shown below:
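A minimal sketch of how the per-feature KS statistic could be computed with scipy's two-sample KS test (df, X_num and 'label' are the same assumed names as before):

```python
import pandas as pd
from scipy.stats import ks_2samp

rows = []
for col in X_num.columns:
    good = df.loc[df['label'] == 0, col]  # non-risky customers
    bad = df.loc[df['label'] == 1, col]   # risky customers
    # ks_2samp returns the maximum distance between the two empirical CDFs
    ks_stat, p_value = ks_2samp(good, bad)
    rows.append({'feature': col, 'ks': ks_stat, 'p_value': p_value})

print(pd.DataFrame(rows).sort_values('ks', ascending=False))
```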
Luckily, fea_4 and fea_2 also show high KS scores, confirming they’re doing a great job at separating the two classes. They’ve consistently performed well across ANOVA, IV, and now KS — which means they’re not just different in average values, but also in how the entire distribution looks across classes. These would be solid picks for the model.
While fea_11, pay_normal, new_balance, fea_8, OVD_sum, and prod_code show some separation between the two classes based on their KS values, the distance between their distributions (CDFs) isn't very wide - meaning the overlap between risky and non-risky customers remains fairly high. This is observed in their IV values as well, which are below 0.3.
Note that even if these features aren’t strong individually, they can still be useful in combination with others. Certain patterns or interactions (like pay_normal * prod_code, or OVD_sum interacting with new_balance) might capture non-obvious relationships that a model - especially a nonlinear one - can learn from. So even if these features don’t shine solo, they might still matter - and techniques like Tree-Based Importance, SHAP, Boruta, and Permutation Importance will help us catch those hidden effects. Worth keeping an eye on.
6. Lasso Regularisation: Lasso (Least Absolute Shrinkage and Selection Operator) is a regularisation technique used in regression and classification models that adds a penalty on the absolute size of the model coefficients. This encourages the model to shrink less important feature coefficients all the way to zero, effectively performing feature selection. The formula has been reproduced below:
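\text{Loss} = \text{Loss}_{\text{original}} + \lambda \sum_{j} \lvert \beta_j \rvert

where β_j are the model coefficients.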
This may look like a complicated formula, but the important thing to note is that the first term (Loss_original) captures the predictive performance of the model, whereas the second term (the L1 penalty) penalises the absolute size of the coefficients. The regularisation parameter λ (or equivalently C = 1/λ) controls the trade-off between fitting the data and keeping the model sparse.
The Lasso results in our data are reproduced below:
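A minimal sketch of how the Lasso selection could be set up: because the target is binary, this uses an L1-penalised logistic regression from scikit-learn, and the choice of C as well as the object names (X_scaled, y) are assumptions:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# X_scaled: standardised feature matrix (DataFrame), y: binary target (assumed names)
# C = 1/λ, so a smaller C means a stronger penalty and more coefficients shrunk to zero
lasso_clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, max_iter=1000)
lasso_clf.fit(X_scaled, y)

coefs = pd.Series(lasso_clf.coef_[0], index=X_scaled.columns)
print("Retained features:\n", coefs[coefs != 0].sort_values(key=abs, ascending=False))
print("Dropped features:", coefs[coefs == 0].index.tolist())
```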
There are three important themes that jump out here:
One particularly interesting outcome from the Lasso results: new_balance was dropped entirely, while highest_balance was retained. At first glance, this might seem surprising — especially since new_balance had slightly better IV and KS scores. But on closer inspection, this makes sense.
As mentioned earlier, new_balance and highest_balance are highly correlated — with a correlation of around 75.16%. In other words, they’re telling the model a very similar story. And Lasso, being a linear and sparsity-focused technique, doesn’t like redundancy. It evaluates each feature based on its individual linear contribution, and when two features overlap, it keeps the one with slightly stronger predictive power and quietly drops the other.
In this case, highest_balance came out ahead, supported by a stronger ANOVA F-statistic (25.49 vs 10.49 for new_balance) and a meaningful Lasso coefficient.
So why did new_balance score better on IV and KS, yet still get cut? That comes down to methodological differences. IV and KS are non-parametric — they evaluate class separation without assuming linearity. Lasso, on the other hand, is looking strictly for linear, additive relationships. That disconnect explains why Lasso downplayed new_balance, even though other metrics showed potential.
Still, this doesn’t mean we’re done with new_balance. It may perform well in non-linear models, so we’ll be watching for that in upcoming tests.
Meanwhile, fea_4 remains the standout. It performs consistently well across every evaluation method — high F-statistic, strong IV and KS scores, and a large non-zero Lasso coefficient. It’s a clear signal that fea_4 is a feature to keep.
fea_2, on the other hand, had a low Lasso coefficient despite consistently strong IV and KS values. This likely points to a non-linear or non-monotonic relationship with the target — something that IV and KS can capture, but Lasso can’t. Because Lasso is limited to linear effects, it may be underestimating fea_2’s true value.
Given its strong non-parametric performance, fea_2 deserves to stay in the mix — especially as we move toward models that can capture more complex patterns.
Coming Up in Part 1(b)...
We’ve now worked through six techniques — from basic correlations and statistical tests to the more selective and opinionated Lasso. Each method gave us a different lens into which features matter, and why.
But we’re not done yet.
In Part 1(b), we’ll dive into the remaining seven feature selection techniques:
Tree-Based Feature Importance
SHAP
Recursive Feature Elimination (RFE)
Boruta Algorithm
Permutation Importance
Correlation-Based Filtering
Clustering for Redundancy Reduction
These techniques bring richer modeling assumptions, often capturing nonlinearities, interactions, and redundancies in ways the current methods can’t.
If you found this useful — stick around. The next post is where things start to get even more interesting.