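Customer Churn Prediction Project: Building a Machine Learning Pipeline Used in Practice

Introduction

In the previous part we built a foundation in EDA with the Titanic dataset. This time we go a step further and work through a customer churn prediction project, one of the most common use cases in practice. In telecom, subscription services, finance, and many other industries, churn is a key metric tied directly to revenue, and it is a topic every data analyst should work through at least once.

In this part we use Kaggle's Telco Customer Churn dataset to build a full machine learning pipeline: preprocessing → feature engineering → model training → evaluation.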

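Setting Project Goals

Real-world projects start from a clear business goal. The goals for this project are:

Business goal: identify customers who are likely to churn ahead of time, in order to optimize retention marketing spend
Technical goal: precision of at least 70% and recall of at least 60%
Deliverables: a churn probability score per customer + a report analyzing the main churn drivers

“Business impact matters more than model accuracy. For a churn prediction model, the key is to reduce false negatives (churning customers that the model misses).”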
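To make the technical goal concrete, here is a minimal sketch of how precision and recall follow from confusion-matrix counts, and how the 70% / 60% targets can be checked. The counts and the helper names (`precision_recall`, `meets_targets`) are made up for illustration and are not results from the Telco data.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive (churn) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def meets_targets(precision, recall, min_precision=0.70, min_recall=0.60):
    """Check the project's technical goal: precision >= 70%, recall >= 60%."""
    return precision >= min_precision and recall >= min_recall


# Hypothetical counts: 130 churners caught, 50 false alarms, 70 churners missed.
tp, fp, fn = 130, 50, 70
precision, recall = precision_recall(tp, fp, fn)

print(f"precision={precision:.3f}, recall={recall:.3f}")
print("targets met:", meets_targets(precision, recall))
```

False negatives only appear in the recall denominator, which is why a focus on missed churners translates into a recall target.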

Data Preprocessing Pipeline

1. Basic preprocessing

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load the data
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Handle blanks in TotalCharges (values that fail numeric conversion become NaN)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

# Encode the target variable
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

# Drop customerID (an identifier is not useful for prediction)
df = df.drop('customerID', axis=1)
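Before imputing, it is worth checking how many rows the conversion actually affects. The sketch below reproduces the `TotalCharges` handling on a three-row toy frame; the values are made up, and only the column names follow the dataset.

```python
import pandas as pd

# Toy frame: the blank string mimics a row whose TotalCharges cannot be parsed.
toy = pd.DataFrame({
    "tenure": [0, 2, 10],
    "TotalCharges": [" ", "100.5", "560.0"],
})

# errors="coerce" turns unparseable strings into NaN instead of raising.
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(converted.isna().sum())

# median() skips NaN by default, so it is computed from the valid rows only.
median_value = converted.median()
toy["TotalCharges"] = converted.fillna(median_value)

print(f"rows with unparseable TotalCharges: {n_missing}")
print(f"median used for imputation: {median_value}")
print(toy)
```

If the affected rows turn out to share a pattern (for example, very new customers), a domain-specific fill value may be a better choice than the median.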

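2. Strategy for categorical variables

In practice, One-Hot Encoding and Label Encoding are used together:

Variable type	Method	Rationale
Binary variables (gender, Partner)	Label Encoding	Avoids adding dimensions
Multi-category (Contract, PaymentMethod)	One-Hot Encoding	No ordering between categories
Ordinal (binned tenure)	Ordinal Encoding	Preserves the meaning of the order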

# Binary variables: encode each column as 0/1
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
for col in binary_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# One-Hot Encoding for multi-category variables
df = pd.get_dummies(df, columns=['InternetService', 'Contract', 'PaymentMethod'], 
                    drop_first=True)  # avoid multicollinearity; the first level becomes the baseline
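To see what `drop_first=True` does to the resulting columns, here is a small sketch on a made-up `Contract` column (a toy frame for illustration, not the Telco data).

```python
import pandas as pd

toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

# Full one-hot encoding: one column per level.
full = pd.get_dummies(toy, columns=["Contract"])

# drop_first=True removes the first level in sorted order.
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The first level in sorted order is the one that is dropped, so with this setting `Contract_Month-to-month` does not exist as a column. It becomes the baseline that the remaining columns are compared against, which matters later when reading feature importances.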

Feature Engineering: Applying Practical Insight

Creating derived variables from domain knowledge

# 1. Average monthly charge (value-for-money indicator); +1 avoids division by zero when tenure is 0
df['AvgMonthlyCharge'] = df['TotalCharges'] / (df['tenure'] + 1)

# 2. Number of subscribed services (loyalty indicator)
service_cols = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 
                'TechSupport', 'StreamingTV', 'StreamingMovies']
df['ServiceCount'] = (df[service_cols] == 'Yes').sum(axis=1)

# 3. Tenure bins (early / growing / mature customers); include_lowest keeps tenure 0 in the first bin
df['TenureGroup'] = pd.cut(df['tenure'], bins=[0, 12, 36, 72], 
                            labels=['0-1yr', '1-3yr', '3yr+'], include_lowest=True)

# 4. High-spender flag (top 25% of customers by monthly charge)
df['HighSpender'] = (df['MonthlyCharges'] > df['MonthlyCharges'].quantile(0.75)).astype(int)

Derived variables like these express business meaning more directly than the raw columns, and they can help improve model performance.
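One edge case in the tenure binning deserves a closer look: `pd.cut` intervals are right-closed by default, so a value equal to the lowest bin edge falls outside every bin unless `include_lowest=True` is passed. The sketch below shows this on toy tenure values and also derives order-preserving integer codes, in the spirit of the ordinal encoding row in the table above. The toy values and English labels are assumptions for illustration.

```python
import pandas as pd

tenure = pd.Series([0, 1, 12, 13, 36, 37, 72])
labels = ["0-1yr", "1-3yr", "3yr+"]
bins = [0, 12, 36, 72]

# Default: intervals are (0, 12], (12, 36], (36, 72], so tenure == 0 becomes NaN.
default_bins = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge.
inclusive_bins = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

# pd.cut returns an ordered categorical, so the codes follow the bin order.
ordinal_codes = inclusive_bins.cat.codes

print("NaN count with default settings:", int(default_bins.isna().sum()))
print(pd.DataFrame({"tenure": tenure, "group": inclusive_bins, "code": ordinal_codes}))
```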

Model Training and Hyperparameter Tuning

Handling imbalanced data

Churn data is typically imbalanced, at roughly churned:retained = 2:8. Two common ways to handle this:

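from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_class_weight

# Assumption: X_train / y_train are not defined earlier, so a stratified 80/20 split is used here.
# Remaining string/categorical columns are one-hot encoded first.
X = pd.get_dummies(df.drop('Churn', axis=1), drop_first=True)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Option 1: SMOTE (generates synthetic minority samples); apply to the training set only
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Option 2: class weight adjustment (recommended)
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
weight_dict = {0: class_weights[0], 1: class_weights[1]}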
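For intuition about the numbers the 'balanced' option produces, the heuristic is `n_samples / (n_classes * count_of_class)`. The sketch below applies it by hand to a toy 2:8 label vector; `balanced_class_weights` is a hypothetical helper written for this example, not a library function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}


# Toy labels with the 2:8 churned:retained ratio mentioned above.
y_toy = [0] * 800 + [1] * 200
weights = balanced_class_weights(y_toy)
print(weights)  # {0: 0.625, 1: 2.5}

# For comparison, the negative/positive ratio commonly used for
# XGBoost's scale_pos_weight on the same toy labels.
neg_pos_ratio = y_toy.count(0) / y_toy.count(1)
print(neg_pos_ratio)  # 4.0
```

The two conventions differ in scale: class weights rescale both classes, while `scale_pos_weight` only multiplies the positive class, so the two numbers are not interchangeable.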

Model comparison and selection

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score

models = {
    'RandomForest': RandomForestClassifier(class_weight='balanced', random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
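    # scale_pos_weight: negative/positive ratio of the training labels instead of a hard-coded 2.5
    'XGBoost': XGBClassifier(scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(), random_state=42)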
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n{name}")
    print(classification_report(y_test, y_pred))
    print(f"ROC-AUC: {roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]):.3f}")
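The reports above use the default 0.5 cutoff, but the probabilities from `predict_proba` can be thresholded anywhere. As a rough sketch of how one might look for cutoffs that satisfy the precision ≥ 70% and recall ≥ 60% goal, the code below sweeps a grid of thresholds over toy labels and scores. The data and both helper functions are made up for illustration; in the real project the inputs would be `y_test` and the predicted churn probabilities.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall for the positive class at a given probability cutoff."""
    tp = fp = fn = 0
    for label, score in zip(y_true, y_score):
        predicted_churn = score >= threshold
        if predicted_churn and label == 1:
            tp += 1
        elif predicted_churn and label == 0:
            fp += 1
        elif not predicted_churn and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def thresholds_meeting_targets(y_true, y_score, min_precision=0.70, min_recall=0.60):
    """Return (threshold, precision, recall) for every grid cutoff that meets both targets."""
    hits = []
    for i in range(1, 20):  # cutoffs 0.05, 0.10, ..., 0.95
        threshold = i / 20
        precision, recall = precision_recall_at(y_true, y_score, threshold)
        if precision >= min_precision and recall >= min_recall:
            hits.append((threshold, precision, recall))
    return hits


# Toy data: 5 churners (1) and 5 retained customers (0) with made-up scores.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.92, 0.81, 0.73, 0.41, 0.22, 0.63, 0.33, 0.31, 0.12, 0.06]

hits = thresholds_meeting_targets(y_true, y_score)
for threshold, precision, recall in hits:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Choosing the cutoff on the test set would leak information, so in practice this sweep belongs on a validation split or inside cross-validation.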

Finding the best hyperparameters with GridSearchCV

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(RandomForestClassifier(class_weight='balanced', random_state=42),
                           param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best ROC-AUC: {grid_search.best_score_:.3f}")
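Grid search cost grows multiplicatively with every parameter added, so it helps to count the model fits before launching it. The sketch below does that arithmetic for the grid used here, using only the standard library.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}
cv_folds = 5

# Every combination of values is one candidate model.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)   # 2 * 3 * 2 * 2 = 24
n_fits = n_candidates * cv_folds  # 24 * 5 = 120

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, GridSearchCV also refits the best candidate once on the full training set, which is what makes `best_estimator_` available afterwards.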

Model Interpretation and Business Insights

Feature Importance analysis

import matplotlib.pyplot as plt

# Take the ten most important features of the tuned model
best_model = grid_search.best_estimator_
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False).head(10)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.gca().invert_yaxis()  # show the most important feature at the top
plt.xlabel('Feature Importance')
plt.title('Top 10 Churn Drivers')
plt.show()

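Practical tip: connect the Feature Importance results to business actions. For example, if the ranking comes out as follows:

Contract_Month-to-month ranks 1st → strengthen promotions that encourage long-term contracts
tenure ranks 2nd → focus on managing new customers (0-6 months)
MonthlyCharges ranks 3rd → offer discounts to customers on high-priced plans

Note that with drop_first=True in the encoding step, Month-to-month is the dropped baseline level of Contract and has no column of its own. One-hot encode without drop_first if you want it to appear in the ranking.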

Saving the Pipeline for Deployment

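import joblib
from sklearn.pipeline import Pipeline

# Bundle scaling + model into a single pipeline
# (the encoding and feature engineering steps are not part of it and must be applied to new data first)
full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', best_model)
])

full_pipeline.fit(X_train, y_train)
joblib.dump(full_pipeline, 'churn_model_pipeline.pkl')

# At inference time (new_data must have the same columns as X_train)
loaded_pipeline = joblib.load('churn_model_pipeline.pkl')
predictions = loaded_pipeline.predict_proba(new_data)[:, 1]  # churn probability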
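The saved pipeline covers scaling and the model, so raw customer records still need the same encoding applied before they can be scored. One possible alternative, sketched here under assumptions rather than taken from the project code, is to move the encoding into the pipeline with `ColumnTransformer` so that raw columns can be passed in directly. The toy rows, the three-column subset, and the name `raw_pipeline` are all made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy raw data with a few Telco-style columns (values are made up).
raw = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 60],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65, 89.10, 29.75, 104.80],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Two year", "Month-to-month", "Two year"],
})
y = pd.Series([1, 0, 1, 0, 1, 0, 1, 0])

# Scale numeric columns and one-hot encode categorical ones inside the pipeline.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

raw_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                     random_state=42)),
])
raw_pipeline.fit(raw, y)

# A new customer in raw form: no manual encoding needed before scoring.
new_customers = pd.DataFrame({
    "tenure": [3],
    "MonthlyCharges": [75.0],
    "Contract": ["Month-to-month"],
})
churn_proba = raw_pipeline.predict_proba(new_customers)[:, 1]
print(churn_proba)
```

`handle_unknown="ignore"` keeps inference from failing when a category shows up that was not present during training. The trade-off is that such a row is encoded as all zeros for that variable.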

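Wrapping Up

In this project we went beyond EDA and walked through the full flow of a practical machine learning pipeline. The key points were:

Connecting business goals to evaluation metrics: balancing Precision and Recall
Domain-knowledge-driven feature engineering: average monthly charge, number of subscribed services, and so on
Handling imbalanced data: class weight adjustment is often the simplest effective option in practice
The importance of model interpretation: Feature Importance → business actions

In the next part we move on to sales forecasting with time series data and cover trend analysis techniques using Prophet and ARIMA. Try building the churn prediction model yourself, and push the code to GitHub to grow your portfolio!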

A hands-on data analysis project series using Kaggle data that resembles real work, for job seekers who want to become data analysts (part 3 of 6)
