
Airfare Price Prediction System

Project Overview

An end-to-end machine learning pipeline that predicts flight costs using XGBoost, SVM, and ensemble techniques. The system achieves 88% prediction accuracy through comprehensive data cleaning, feature engineering, and hyperparameter optimization with GridSearchCV and k-fold cross-validation.

📊 Key Achievements

πŸ› οΈ Technologies Used

πŸ“ Project Structure

airfare-price-prediction/
│
├── README.md                          # Project documentation
├── PROJECT_SUMMARY.md                 # Executive summary
├── requirements.txt                   # Python dependencies
│
├── src/
│   ├── airfare_price_prediction.py   # Main ML pipeline
│   └── visualizations_airfare.py     # Visualization suite
│
├── data/
│   └── cleaned_flight_data.csv       # Processed dataset
│
└── results/
    ├── model_comparison.csv          # Model performance metrics
    ├── feature_importance.csv        # Feature rankings
    ├── predictions.csv               # Test predictions
    ├── airfare_visualizations.png    # Main dashboard
    └── detailed_analysis.png         # Additional insights

🚀 Getting Started

Prerequisites

pip install pandas numpy scikit-learn xgboost matplotlib seaborn scipy

Or use the requirements file:

pip install -r requirements.txt

Running the Analysis

  1. Run the main analysis:
    python src/airfare_price_prediction.py
    
  2. Generate visualizations:
    python src/visualizations_airfare.py
    

📋 Dataset Features

Original Features

Engineered Features

🤖 Machine Learning Pipeline

1. Data Cleaning

Missing Values Handling (12% of dataset):

Outlier Detection (IQR Method):

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

2. Feature Engineering

Created 12 new features from base attributes:
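As a hedged illustration (the project's actual 12 features aren't listed here, and column names like Date_of_Journey, Dep_Time, and Days_left are assumptions), this kind of feature engineering might look like:

```python
import pandas as pd

# Hypothetical raw columns standing in for the real dataset
df = pd.DataFrame({
    "Date_of_Journey": ["2024-06-15", "2024-09-01"],
    "Dep_Time": ["06:30", "22:10"],
    "Days_left": [3, 45],
})

# Derive calendar features from the journey date
df["Date_of_Journey"] = pd.to_datetime(df["Date_of_Journey"])
df["Journey_month"] = df["Date_of_Journey"].dt.month
df["Journey_weekday"] = df["Date_of_Journey"].dt.dayofweek
df["Is_weekend"] = (df["Journey_weekday"] >= 5).astype(int)

# Departure hour from the "HH:MM" time string
df["Dep_hour"] = df["Dep_Time"].str.split(":").str[0].astype(int)

# Flag last-minute bookings (7-day cutoff is an assumption)
df["Is_last_minute"] = (df["Days_left"] <= 7).astype(int)
```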

3. Encoding & Normalization

Label Encoding:
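
A brief sketch of label encoding with scikit-learn (the Airline column and its values are illustrative assumptions, not the project's actual categories):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values for an Airline column
airlines = ["Vistara", "IndiGo", "Air India", "IndiGo"]

encoder = LabelEncoder()
codes = encoder.fit_transform(airlines)  # integer code per category
# encoder.classes_ holds the categories in sorted order
```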

StandardScaler Normalization:

X_scaled = (X - μ) / σ

4. Model Training

Baseline Models:

  1. Linear Regression - Simple baseline
  2. Support Vector Regression (SVM) - RBF kernel
  3. Random Forest - 100 estimators
  4. XGBoost/Gradient Boosting - Advanced ensemble

Ensemble Technique:

A Voting Regressor that averages the predictions of all four base models
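
A minimal sketch of this ensemble with scikit-learn's VotingRegressor, fit on synthetic data (the estimator settings mirror the list above, but the exact configuration is an assumption):

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Synthetic regression data standing in for the flight dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Voting Regressor averaging the four base models
ensemble = VotingRegressor([
    ("lr", LinearRegression()),
    ("svr", SVR(kernel="rbf")),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
    ("gb", GradientBoostingRegressor(random_state=42)),
])
ensemble.fit(X, y)
score = ensemble.score(X, y)  # R² of the averaged predictions
```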

5. Hyperparameter Tuning (GridSearchCV)

XGBoost/Gradient Boosting Parameters:

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 7, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0]
}

Configuration:

6. K-Fold Cross-Validation

5-Fold Strategy:

📈 Model Performance

| Model             | R² Score | RMSE    | MAE   | Accuracy |
|-------------------|----------|---------|-------|----------|
| Linear Regression | ~0.75    | ~₹1,200 | ~₹950 | ~75%     |
| SVM               | ~0.82    | ~₹1,000 | ~₹780 | ~82%     |
| Random Forest     | ~0.85    | ~₹900   | ~₹700 | ~85%     |
| XGBoost/GB        | ~0.87    | ~₹850   | ~₹650 | ~87%     |
| XGBoost (Tuned)   | ~0.88    | ~₹800   | ~₹620 | ~88%     |
| Ensemble          | ~0.86    | ~₹870   | ~₹680 | ~86%     |

Validation Metrics (Adapted for Regression)
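
Computing these regression metrics with scikit-learn can be sketched as follows; the within-10% "accuracy" definition is an assumption for illustration, since the project's exact adaptation isn't shown here:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy true vs. predicted fares (₹)
y_true = np.array([5000.0, 7200.0, 6100.0, 8800.0])
y_pred = np.array([5150.0, 7000.0, 6050.0, 9100.0])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

# One possible "accuracy" for regression (an assumption, not the project's
# stated definition): fraction of predictions within 10% of the true price
within_10pct = np.mean(np.abs(y_pred - y_true) / y_true <= 0.10)
```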

🎯 Key Findings

Top 5 Price Drivers

  1. Class - Business class 2.5x premium
  2. Duration - Longer flights cost more
  3. Days_left - Last-minute bookings +50% premium
  4. Airline - Vistara/Jet Airways charge premium
  5. Season - Summer travel +30% over monsoon

Business Insights

📊 Visualizations

The project generates comprehensive visualizations including:

  1. Model Comparison Charts
    • Accuracy comparison across all models
    • RΒ² score rankings
    • RMSE and MAE benchmarks
  2. Prediction Analysis
    • Actual vs. Predicted scatter plots
    • Residual distributions
    • Error histograms
  3. Feature Importance
    • Top 10 most influential features
    • Cumulative importance analysis
  4. Performance Metrics
    • Heatmaps of model performance
    • Error distribution by price range
    • Box plots of prediction errors

💼 Business Applications

Revenue Management

Customer Experience

Operational Insights

🔄 Model Deployment Recommendations

  1. Real-time API: Flask/FastAPI endpoint for live predictions
  2. Batch Processing: Daily price updates for all routes
  3. Model Monitoring: Track prediction drift over time
  4. A/B Testing: Compare pricing strategies
  5. Regular Retraining: Monthly updates with new data
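
Of these, prediction-drift monitoring (item 3) can be sketched in a few lines; the 1.5x threshold and the ~₹800 baseline RMSE are assumptions:

```python
import numpy as np

def detect_drift(baseline_rmse, y_true, y_pred, tolerance=1.5):
    """Flag drift when live RMSE exceeds the baseline RMSE by `tolerance`x."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    live_rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return live_rmse, live_rmse > tolerance * baseline_rmse

# Baseline RMSE of ~₹800 from validation; live errors have clearly grown
rmse, drifted = detect_drift(800.0, [5000, 6000, 7000], [6500, 7400, 8600])
```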

πŸ“ Code Highlights

Missing Value Imputation

# Median for numerical features (inplace fillna is deprecated in pandas 2.x)
df['Duration_minutes'] = df['Duration_minutes'].fillna(df['Duration_minutes'].median())

# Mode for categorical features
df['Total_Stops'] = df['Total_Stops'].fillna(df['Total_Stops'].mode()[0])

IQR Outlier Detection

Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Cap outliers
df.loc[df['Price'] < lower_bound, 'Price'] = lower_bound
df.loc[df['Price'] > upper_bound, 'Price'] = upper_bound

StandardScaler Normalization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics

XGBoost with GridSearchCV

from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 7, 9],
    'learning_rate': [0.01, 0.1, 0.2]
}

grid_search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='r2'
)
grid_search.fit(X_train, y_train)

K-Fold Cross-Validation

from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=kfold, scoring='r2')

Results

(Output figures: results/airfare_visualizations.png and results/detailed_analysis.png.)

🎓 Learning Outcomes

This project demonstrates:

🤝 Contributing

Potential enhancements:

📞 Contact

For questions or feedback about this project, please reach out through GitHub issues.

📄 License

This project is open source and available for educational purposes.


Note: This analysis uses synthetic data generated to match real-world flight pricing patterns. For production deployment, integrate with actual airline pricing APIs and booking systems.
