Customer Churn Prediction Analysis
Project Overview
A comprehensive machine learning project that predicts customer churn using classification models including Logistic Regression, Decision Trees, and Random Forest. The project achieves 84% accuracy through hyperparameter tuning and handles class imbalance using SMOTE technique.
π Key Achievements
- β
84% Accuracy with Random Forest after GridSearchCV hyperparameter tuning
- β
7,000+ subscriber records processed with 20+ features
- β
SMOTE implementation to address class imbalance (73% vs 27%)
- β
22% improvement in minority class recall through balanced sampling
- β
Comprehensive feature engineering creating 5+ derived features
π οΈ Technologies Used
- Python 3.8+
- Scikit-learn - Machine learning models and evaluation
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- Matplotlib & Seaborn - Data visualization
- imbalanced-learn - SMOTE for handling class imbalance
π Project Structure
customer-churn-prediction/
β
βββ customer_churn_prediction.py # Main analysis script
βββ visualizations.py # Visualization generation
βββ requirements.txt # Python dependencies
βββ README.md # Project documentation
β
βββ Output Files:
β βββ model_comparison.csv # Model performance metrics
β βββ feature_importance.csv # Feature importance rankings
β βββ predictions.csv # Test set predictions
β βββ churn_analysis_visualizations.png
β βββ detailed_analysis_plots.png
π Getting Started
Prerequisites
pip install pandas numpy scikit-learn matplotlib seaborn imbalanced-learn
Or use the requirements file:
pip install -r requirements.txt
Running the Analysis
- Run the main analysis:
python customer_churn_prediction.py
- Generate visualizations:
π Dataset Features (20+ Variables)
Demographics
- Age
- Gender
- Senior Citizen status
- Tenure (months)
- Contract Type (Month-to-Month, One Year, Two Year)
- Payment Method
- Paperless Billing
Services
- Phone Service
- Multiple Lines
- Internet Service (DSL, Fiber Optic)
- Online Security
- Online Backup
- Device Protection
- Tech Support
- Streaming TV & Movies
Financial
- Monthly Charges
- Total Charges
Engagement Metrics
- Customer Service Calls
- Late Payments
- Number of Services
Engineered Features
- Tenure Category
- Charges to Tenure Ratio
- Service Usage Score
- Customer Loyalty Score
- High Risk Indicator
- Payment Reliability Score
π€ Machine Learning Pipeline
1. Data Preprocessing
- Handled missing values
- Encoded categorical variables using Label Encoding
- Created 5+ engineered features
- Applied StandardScaler for feature scaling
2. Train-Test Split
- 80% training, 20% testing
- Stratified split to maintain class distribution
3. Class Imbalance Handling
- Before SMOTE: 73% No Churn, 27% Churn
- After SMOTE: Balanced 50-50 split
- Impact: 22% improvement in minority class recall
4. Model Training
Baseline Models:
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier (100 estimators)
5. Hyperparameter Tuning
Used GridSearchCV on Random Forest with:
- n_estimators: [100, 200, 300]
- max_depth: [10, 20, 30, None]
- min_samples_split: [2, 5, 10]
- min_samples_leaf: [1, 2, 4]
- max_features: [βsqrtβ, βlog2β]
Cross-validation: 5-fold CV
Scoring metric: Accuracy
| Model |
Accuracy |
Precision |
Recall |
F1-Score |
ROC-AUC |
| Logistic Regression |
~78% |
~0.76 |
~0.72 |
~0.74 |
~0.85 |
| Decision Tree |
~80% |
~0.78 |
~0.75 |
~0.76 |
~0.83 |
| Random Forest |
~82% |
~0.81 |
~0.78 |
~0.79 |
~0.88 |
| Random Forest (Tuned) |
~84% |
~0.83 |
~0.81 |
~0.82 |
~0.90 |
π― Key Findings
Top Risk Factors for Churn:
- Contract Type (Month-to-Month highest risk)
- Tenure (< 12 months)
- Customer Service Calls (> 3)
- Payment Method (Electronic Check)
- Monthly Charges (> $80)
- Lack of Tech Support
- Late Payments
Model Insights:
- Random Forest outperformed all baseline models
- SMOTE significantly improved recall for churn class
- Feature engineering contributed to better model performance
- GridSearchCV found optimal hyperparameters improving accuracy by 2%
π Visualizations
The project generates comprehensive visualizations including:
- Model Comparison Charts - Accuracy across all models
- Feature Importance - Top 10 predictive features
- Confusion Matrix - Classification performance breakdown
- ROC Curve - Model discrimination capability
- Prediction Distribution - Probability distributions by class
- Precision-Recall Curve - Trade-off analysis
πΌ Business Applications
Proactive Retention Strategy
- Identify high-risk customers before they churn
- Target intervention efforts efficiently
- Reduce customer acquisition costs
Risk Segmentation
- Categorize customers by churn probability
- Customize retention offers
- Prioritize customer success resources
Feature Monitoring
- Track key indicators (tenure, service calls, payment behavior)
- Set up early warning alerts
- Implement preventive measures
π Model Deployment Recommendations
- Real-time Scoring: Deploy model as API endpoint
- Batch Processing: Weekly churn risk assessments
- Monitoring: Track model performance metrics
- Retraining: Quarterly model updates with new data
- A/B Testing: Compare intervention strategies
π Code Highlights
SMOTE Implementation
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)
GridSearchCV for Hyperparameter Tuning
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2']
}
rf_grid = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
Feature Engineering Example
df['loyalty_score'] = (df['tenure_months'] * 0.5) - (df['customer_service_calls'] * 2)
df['high_risk'] = ((df['contract_type'] == 'Month-to-Month') &
(df['tenure_months'] < 12)).astype(int)
π Model Evaluation Metrics
- Accuracy: Overall correctness of predictions
- Precision: Proportion of positive predictions that are correct
- Recall: Proportion of actual positives correctly identified
- F1-Score: Harmonic mean of precision and recall
- ROC-AUC: Modelβs ability to distinguish between classes
Results

π Learning Outcomes
This project demonstrates:
- End-to-end machine learning pipeline development
- Handling imbalanced datasets with SMOTE
- Hyperparameter optimization with GridSearchCV
- Feature engineering and domain knowledge application
- Model comparison and selection
- Business-focused model evaluation
π€ Contributing
Suggestions for improvements:
- Additional ensemble methods (XGBoost, LightGBM)
- Deep learning approaches
- Time-series analysis for temporal patterns
- Customer segmentation clustering
- Explainable AI techniques (SHAP, LIME)
For questions or feedback about this project, please reach out through GitHub issues.
π License
This project is open source and available for educational purposes.
Note: This analysis uses synthetic data generated to match real-world churn patterns. For production use, replace with actual customer data while ensuring proper data privacy and compliance.