Airline Passenger Satisfaction

Bryan Conn
Mar 4, 2021

Introduction:

Today we will conduct a customer analysis on a dataset from Kaggle.com, in which airline passengers were surveyed on a number of on- and off-flight services and asked whether they were satisfied with their flight or had a neutral/dissatisfied experience. With this analysis, we can use various machine learning algorithms to predict whether future passengers will be satisfied with their trip.

Loading Data and Feature Engineering:

The data was already split into two datasets, Train and Test, an 80/20 split of the total data. Each set contained 24 features, ranging from identifying information about the passenger (age, class, type of travel, etc.) to ratings of flight services such as Inflight wifi service, Ease of Online Booking, Leg Room Service, and Cleanliness, each rated on a 1–5 scale.
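
A minimal loading sketch, assuming the Kaggle files are named train.csv and test.csv:

```python
import pandas as pd

# Hypothetical file names; adjust to match the Kaggle download
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print(train.shape, test.shape)
```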

With some data wrangling we (a sketch of these steps follows the list):

  • assigned ‘Satisfaction’ a numeric value (0 = neutral or dissatisfied, 1 = satisfied)
  • filled missing ‘Arrival Delay in Minutes’ values with the column mean
  • assigned label groups to age to remove high cardinality and dropped ‘Age’
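
A sketch of those wrangling steps, assuming the Kaggle column names (‘satisfaction’, ‘Arrival Delay in Minutes’, ‘Age’); the age bins shown are illustrative, not the exact groups used:

```python
import pandas as pd

def wrangle(df):
    df = df.copy()
    # Map the target labels to numbers
    df['satisfaction'] = df['satisfaction'].map(
        {'neutral or dissatisfied': 0, 'satisfied': 1})
    # Fill missing arrival delays with the column mean
    df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].fillna(
        df['Arrival Delay in Minutes'].mean())
    # Bin 'Age' into labeled groups, then drop the raw column
    df['Age Group'] = pd.cut(df['Age'],
                             bins=[0, 25, 40, 60, 120],
                             labels=['young', 'adult', 'middle age', 'senior'])
    return df.drop(columns='Age')

train = wrangle(train)
test = wrangle(test)
```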

Data Information

Looking at our data, we can see that the target is slightly imbalanced (0 = neutral or dissatisfied, 1 = satisfied).
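
A quick way to check that balance (target column name assumed to be ‘satisfaction’):

```python
# Share of each class in the target
print(train['satisfaction'].value_counts(normalize=True))
```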

Going through the data, we also see that:

  • older age groups are more satisfied with their flights
  • longer flights received more satisfied passengers
  • longer flights also showed higher satisfaction in the upper classes, presumably because the service and comfort were better
  • business travel was more satisfying, likely because those passengers tend to fly in business class
  • the overall ratings of the top services were mostly in the 4–5 satisfaction range

We then looked at the correlation between each feature and ‘Satisfaction’.

Because Gate Location, Departure Delay, Arrival Delay, and Time Convenience have very low correlation with ‘Satisfaction’, we drop these columns from our datasets.
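
A sketch of the correlation check and the drop; the exact column names below are assumed from the Kaggle dataset:

```python
# Correlation of each numeric feature with the target
print(train.corr(numeric_only=True)['satisfaction'].sort_values())

# Drop the weakly correlated features from both sets
low_corr = ['Gate location', 'Departure Delay in Minutes',
            'Arrival Delay in Minutes', 'Departure/Arrival time convenient']
train = train.drop(columns=low_corr)
test = test.drop(columns=low_corr)
```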

Split Data and Establish Target

Because the data is already split, we build X/y train from the train dataset and X/y test from the test dataset, with ‘Satisfaction’ as the target.
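
A minimal sketch of that split:

```python
target = 'satisfaction'

X_train = train.drop(columns=target)
y_train = train[target]
X_test = test.drop(columns=target)
y_test = test[target]
```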

Modeling

After establishing a baseline, I applied several modeling algorithms to improve our accuracy:

  • Logistic Regression
  • Random Forest Classifier
  • Gradient Boosting Classifier
  • XGBoost Classifier

Baseline:

Using our training data, we establish a baseline of 56% accuracy on predicting customer satisfaction on a flight.
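
That figure corresponds to a majority-class baseline (always predicting the most frequent class), which can be sketched as:

```python
# Accuracy of always predicting the most frequent class
baseline_acc = y_train.value_counts(normalize=True).max()
print(f'Baseline accuracy: {baseline_acc:.2%}')
```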

Logistic Regression:

Our training accuracy: 83.16%. With a roughly 17% error rate, this is not good enough to predict customer satisfaction.
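
A minimal sketch of a logistic regression pipeline, assuming one-hot encoding for the categorical columns and scaling for the numeric ones (the exact preprocessing used may differ):

```python
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One-hot encode the categorical columns, scale the numeric ones
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns
num_cols = X_train.select_dtypes(include='number').columns

preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), cat_cols),
    (StandardScaler(), num_cols))

logreg = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
print('Training accuracy:', logreg.score(X_train, y_train))
```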

Random Forest Classifier:

With an Ordinal Encoder and a Simple Imputer (strategy = mean), I hyperparameter-tuned a Random Forest Classifier in a grid search over our data to find the best results. **Note: a Decision Tree Classifier was also tried and reached 99.5% accuracy, but that work was lost.**

Our training accuracy was 97.5%, a roughly 14-point reduction in error compared to logistic regression.
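
A sketch of the described pipeline, using the category_encoders OrdinalEncoder and scikit-learn’s SimpleImputer; the grid values here are illustrative, not the ones actually searched:

```python
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Ordinal-encode categoricals, mean-impute, then fit a random forest
rf = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(random_state=42, n_jobs=-1))

# Illustrative hyperparameter grid
params = {
    'randomforestclassifier__n_estimators': [100, 300],
    'randomforestclassifier__max_depth': [None, 10, 20]}

search = GridSearchCV(rf, params, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print('Training accuracy:', search.score(X_train, y_train))
```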

Boosting:

The Gradient Boosting Classifier and XGBoost Classifier models performed very similarly and offered no improvement over the Random Forest Classifier, with training accuracies of 94.8% and 96.2% respectively.
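
An XGBoost sketch using the same preprocessing; the hyperparameters shown are illustrative defaults, not the tuned values:

```python
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# Same preprocessing as before, with an XGBoost classifier on the end
xgb = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    XGBClassifier(n_estimators=300, learning_rate=0.1,
                  eval_metric='logloss', n_jobs=-1))
xgb.fit(X_train, y_train)
print('Training accuracy:', xgb.score(X_train, y_train))
```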

Final Results

Applying our test data to the Random Forest Classifier model, we correctly predicted whether a passenger was satisfied with the flight 96.2% of the time.
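
A sketch of that final evaluation, reusing the grid-searched pipeline from the Random Forest sketch above:

```python
# Score the grid-searched random forest pipeline on the held-out test set
print(f'Test accuracy: {search.score(X_test, y_test):.1%}')
```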
