Feature Selection in Practice: Improving Both Model Performance and Interpretability

Lucas.Kim 2026. 1. 1. 00:28

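1. What Is Feature Selection?

Feature Selection is the process of picking, from the many features available to a model, only those that are meaningful for training and prediction.

Feature Selection matters for the following reasons.

The more unnecessary features there are, the more model performance can actually degrade.
It makes it possible to explain which criteria the model used for its predictions.
The more features there are, the higher the risk of overfitting.
Both training speed and inference speed can be improved.

The following factors are usually considered together during Feature Selection.

The distribution of each feature's values
Whether missing values (null) are present
High correlation between features
Whether the feature is independent of the target
Importance derived from a trained model (Feature Importance)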
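The first few checks in the list above can be sketched without any model at all. The helper below is hypothetical (the name `screen_features` and the thresholds `null_max` and `corr_max` are assumptions, not part of any library), and it uses plain Pearson correlation, which only captures linear relationships.

```python
import numpy as np

def screen_features(X, y, names, null_max=0.5, corr_max=0.95):
    """Hypothetical helper: per-feature null ratio, target correlation, and flags."""
    # Fraction of missing values per column
    null_ratio = np.isnan(X).mean(axis=0)
    # Fill NaNs with the column mean only so that correlations can be computed
    X_filled = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    # Pairwise feature correlation and correlation with the target
    corr = np.corrcoef(X_filled, rowvar=False)
    target_corr = [np.corrcoef(X_filled[:, j], y)[0, 1] for j in range(X.shape[1])]
    report = {}
    for j, name in enumerate(names):
        notes = []
        if null_ratio[j] > null_max:
            notes.append("many nulls")
        # Flag the later column of each highly correlated pair
        if any(abs(corr[j, k]) > corr_max for k in range(j)):
            notes.append("redundant")
        report[name] = {
            "null_ratio": float(null_ratio[j]),
            "target_corr": float(target_corr[j]),
            "notes": notes,
        }
    return report

# Toy data: `b` is nearly a copy of `a`, and `c` is 60% missing
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a * 2.0 + rng.normal(scale=0.01, size=200)
c = rng.normal(size=200)
c[:120] = np.nan
y = a + rng.normal(scale=0.1, size=200)
X = np.column_stack([a, b, c])

report = screen_features(X, y, ["a", "b", "c"])
for name, info in report.items():
    print(name, info)
```

A screen like this is only a first pass; a feature flagged here may still deserve a second look with domain knowledge before it is dropped.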

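2. Overview of Model-Based Feature Selection

Model-based Feature Selection uses the importance that a trained model assigns to each feature.

A representative method is described below.

2-1. RFE / RFECV (Recursive Feature Elimination)

Train the model, then remove the least important features one at a time
Repeat remove → retrain → evaluate to search for the optimal number of features
Accurate, but computationally expensive and slow, because the model is retrained after every elimination
Practical mainly when the number of samples and the number of features are small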
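The remove → retrain → evaluate loop described above can be sketched by hand. This is a minimal illustration, not how scikit-learn implements RFE: it assumes a least-squares linear model, features on comparable scales (so that the absolute coefficient is a usable importance), and a single validation split. The function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    # Ordinary least squares with an intercept column appended
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(X, y, coef):
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((A @ coef - y) ** 2))

def recursive_elimination(X_tr, y_tr, X_va, y_va):
    remaining = list(range(X_tr.shape[1]))
    history = []  # (feature indices, validation MSE) for every subset size
    while remaining:
        coef = fit_linear(X_tr[:, remaining], y_tr)                      # retrain
        history.append((list(remaining), mse(X_va[:, remaining], y_va, coef)))  # evaluate
        weakest = int(np.argmin(np.abs(coef[:-1])))                      # ignore the intercept
        remaining.pop(weakest)                                           # remove
    best = min(history, key=lambda item: item[1])
    return best, history

# Toy data: only features 0 and 1 drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

(best_features, best_mse), history = recursive_elimination(
    X[:300], y[:300], X[300:], y[300:]
)
print("best subset:", best_features, "validation MSE:", round(best_mse, 4))
```

The loop fits one model per subset size, which is why the cost grows quickly with the number of features.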

3. RFE + RFECV in Practice

The example below uses SVC (linear kernel) as the base model.

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV, RFE
from sklearn.datasets import make_classification

3-1. Generating the Data

# Generate 1000 samples with 25 features for classification
X, y = make_classification(
    n_samples=1000,
    n_features=25,
    n_informative=3,
    n_redundant=2,
    n_repeated=0,
    n_classes=8,
    n_clusters_per_class=1,
    random_state=0
)

Only 3 features in this dataset are actually informative.
The rest are redundant features (2) or noise (20).

3-2. Applying RFECV

svc = SVC(kernel='linear')

rfecv = RFECV(
    estimator=svc,
    step=1,
    cv=StratifiedKFold(2),
    scoring='accuracy',
    verbose=2
)

rfecv.fit(X, y)
print(f'Optimal number of features : {rfecv.n_features_}')

Optimal number of features : 3

→ This matches the actual number of informative features exactly.

3-3. Visualizing the CV Score

plt.figure()
plt.xlabel('Number of features selected')
plt.ylabel('Cross-validation accuracy')
plt.plot(
    range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
    rfecv.cv_results_['mean_test_score']
)
plt.show()

The plot shows how accuracy changes as the number of selected features changes.
RFECV is very useful for searching for the optimal number of features automatically.
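When the target number of features is already known, plain RFE can be used instead of RFECV. The sketch below regenerates the same synthetic data so that it runs on its own; fixing `n_features_to_select=3` is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Same synthetic data as above: 3 informative features out of 25
X, y = make_classification(
    n_samples=1000,
    n_features=25,
    n_informative=3,
    n_redundant=2,
    n_repeated=0,
    n_classes=8,
    n_clusters_per_class=1,
    random_state=0,
)

# Eliminate one feature per round until exactly 3 remain (no cross-validation)
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=3, step=1)
rfe.fit(X, y)

print("selected mask:", rfe.support_)   # True for the kept features
print("ranking:", rfe.ranking_)         # 1 = kept, larger = eliminated earlier

# Reduce the matrix to the selected columns
X_reduced = rfe.transform(X)
print(X_reduced.shape)
```

`ranking_` is handy for inspecting the order in which features were dropped, which RFECV exposes as well.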

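4. Permutation Importance

4-1. Concept

Permutation Importance measures a feature's importance by how much model performance drops when that feature's values are randomly shuffled.

Its key characteristics are as follows.

Uses the trained model as-is, with no retraining
Computed on test (or validation) data
Shuffles one feature at a time and measures the drop in performance
Features whose shuffling causes the largest average drop are judged the most important

It can compensate for the bias of tree-based Feature Importance.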
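The procedure above is simple enough to write out by hand. The sketch below is a minimal illustration under stated assumptions (a least-squares linear model, R² as the score, and the hypothetical function name `permutation_importance_manual`); it is not scikit-learn's implementation.

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance_manual(predict, X_val, y_val, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    baseline = r2(y_val, predict(X_val))          # score of the untouched model
    drops = np.zeros((X_val.shape[1], n_repeats))
    for j in range(X_val.shape[1]):
        for r in range(n_repeats):
            X_perm = X_val.copy()
            # Shuffle one column only, breaking its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - r2(y_val, predict(X_perm))
    return drops.mean(axis=1), drops.std(axis=1)

# Toy data: feature 0 matters most, feature 1 a little, feature 2 not at all
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 4 * X[:, 0] + 1 * X[:, 1] + rng.normal(scale=0.5, size=600)
X_train, X_val, y_train, y_val = X[:400], X[400:], y[:400], y[400:]

# Fit a linear model once; it is never retrained during the permutations
A = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict(X_new):
    return np.column_stack([X_new, np.ones(len(X_new))]) @ coef

means, stds = permutation_importance_manual(predict, X_val, y_val)
for j in range(3):
    print(f"feature {j}: {means[j]:.4f} +/- {stds[j]:.4f}")
```

Because only predictions are needed, the same loop works with any fitted model that exposes a predict function.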

5. Permutation Importance in Practice (Regression)

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

5-1. Training the Model

diabetes = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(
    diabetes.data, diabetes.target, random_state=0
)

model = Ridge(alpha=1e-2).fit(X_train, y_train)
y_pred = model.predict(X_val)
print('r2 score:', r2_score(y_val, y_pred))

r2 score: 0.3566675322939423

5-2. Computing Permutation Importance

import numpy as np  # needed for np.round when printing the results
from sklearn.inspection import permutation_importance

r = permutation_importance(
    model,
    X_val,
    y_val,
    n_repeats=30,
    random_state=0
)

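Printing the important features:

for i in r.importances_mean.argsort()[::-1]:
    # Keep only features whose mean importance is more than two standard deviations above zero
    if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
        print(
            diabetes.feature_names[i],
            np.round(r.importances_mean[i], 4),
            "+/-",
            np.round(r.importances_std[i], 5)
        )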

s5   0.2042 +/- 0.04964
bmi  0.1758 +/- 0.0484
bp   0.0884 +/- 0.03284
sex  0.0559 +/- 0.02319

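6. SelectFromModel

6-1. Concept

SelectFromModel automatically selects only the features whose importance, as learned by the model, is at or above a threshold.

The threshold can be the mean, the median, or a user-specified value
Much faster than RFE
One of the most widely used approaches in practice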
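The three threshold styles listed above can be compared side by side. The sketch below is an illustration only: the choice of `RandomForestRegressor` as the estimator and the numeric threshold `0.05` are assumptions, not something the original example uses.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X, y = load_diabetes(return_X_y=True)

n_selected = {}
for threshold in ("mean", "median", 0.05):
    selector = SelectFromModel(
        RandomForestRegressor(n_estimators=100, random_state=0),
        threshold=threshold,
    )
    # Fit the estimator, then keep the columns whose importance >= threshold
    X_selected = selector.fit_transform(X, y)
    n_selected[str(threshold)] = X_selected.shape[1]
    print(threshold, "->", round(float(selector.threshold_), 4), X_selected.shape)
```

Unlike RFE, the estimator is fitted only once per selector, which is where the speed advantage comes from.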

6-2. Lasso-Based Feature Selection

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
import numpy as np
import matplotlib.pyplot as plt

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

lasso = LassoCV().fit(X, y)
# Use the absolute value of each coefficient as its importance
importance = np.abs(lasso.coef_)
feature_names = np.array(diabetes.feature_names)

Visualizing Feature Importance:

plt.bar(height=importance, x=feature_names)
plt.title('Feature importance via Lasso')
plt.show()

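6-3. Applying SelectFromModel

from sklearn.feature_selection import SelectFromModel

# Set the threshold just above the third-largest importance so that at most the top two features pass
threshold = np.sort(importance)[-3] + 0.01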
print(f'threshold : {threshold}')

sfm = SelectFromModel(lasso, threshold=threshold).fit(X, y)
print(
    f'Feature Selected by SelectFromModel : {feature_names[sfm.get_support()]}'
)

Feature Selected by SelectFromModel : ['s1' 's5']

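7. Limitations of Feature Importance

Tree-based Feature Importance has the following limitations.

A feature can receive a high importance even when it has no direct relationship with the target
It is biased toward high-cardinality features (features with many unique values)
It is computed on the training data, so it may look different on test data

It should therefore not be treated as an absolute criterion; it is safer to use it together with Permutation Importance.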
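One way to see these limitations is to put the two scores next to each other on data where the noise columns are known. The sketch below uses synthetic data (an assumption, chosen so the example needs no download); with `shuffle=False`, `make_classification` places the informative columns first, so the last three columns are pure noise. The column names are made up for readability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# 2 informative columns followed by 3 continuous (high-cardinality) noise columns
X, y = make_classification(
    n_samples=600,
    n_features=5,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
names = ["info_0", "info_1", "noise_0", "noise_1", "noise_2"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importance: derived from the training data only
impurity = rf.feature_importances_

# Permutation importance: measured on held-out data
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

for name, imp, pm in zip(names, impurity, perm.importances_mean):
    print(f"{name:8s} impurity={imp:.3f} permutation={pm:.3f}")
```

Impurity-based scores are normalized to sum to 1, so noise columns that the trees happen to split on still receive a share, while their held-out permutation scores typically stay close to zero.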

8. Permutation Importance vs RandomForest Importance

We compare the two methods on the Titanic dataset.

from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

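8-1. Preparing and Preprocessing the Data

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

# Add two random features that carry no information about the target
rng = np.random.RandomState(seed=42)
X['random_cat'] = rng.randint(3, size=X.shape[0])
X['random_num'] = rng.randn(X.shape[0])

categorical_columns = ['pclass', 'sex', 'embarked', 'random_cat']
numerical_columns = ['age', 'sibsp', 'parch', 'fare', 'random_num']
X = X[categorical_columns + numerical_columns]

# Assumed reconstruction: the split and the `preprocessing` step used below are not shown in the original
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
preprocessing = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
    ('num', SimpleImputer(strategy='mean'), numerical_columns),
])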

8-2. Training the Model

rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RandomForestClassifier(random_state=42))
])

rf.fit(X_train, y_train)

8-3. Visualizing Permutation Importance

result = permutation_importance(
    rf,
    X_test,
    y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=2
)

sorted_idx = result.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(
    result.importances[sorted_idx].T,
    vert=False,
    labels=X_test.columns[sorted_idx]
)
ax.set_title("Permutation Importances (test set)")
plt.show()

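→ The randomly added features are rated low in importance.

9. Summary

Feature Selection can improve both performance and interpretability.
Don't rely on a single method; combine several criteria.
In practice, combining Feature Importance, Permutation Importance, and domain knowledge tends to be the most reliable approach.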
