• Exploring US Flights data using Python

    Using Python, pandas, Matplotlib and NumPy

  • PROJECT DESCRIPTION

    In this project, we investigate the Airline On-Time Performance Data to find out which airlines and airports give travellers the best and the worst experience. The analysis covers the following steps:

    • Loading and cleaning the dataset
    • Taking care of outliers or influential points
    • Creating a "misery index" for airports, based on the number of flights that were delayed, as well as the median and average delay.
    • Calculating the probability that a flight will be delayed based on the airport of origin.
    • Creating a "misery index" for airlines, based on the number of flights that were delayed, as well as the median and average delay.
    • Calculating the probability that a flight will be delayed based on the airline.
    • Visualising the distribution of departures for airports.
    • Analysing the temporal distribution of delays, by plotting the total number of flights and the number of delayed flights for each month of the year.
    • Creating a table that shows, for each possible origin and destination, which airline has the best performance, in terms of mean departure delay. With this table at hand, we can determine the best airline for a particular pair of origin and destination airports.
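
    The "misery index", delay probability and best-airline table described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (airline, origin, dest, dep_delay), the 15-minute delay threshold and the rank-averaging formula for the index are all assumptions, and the sample rows are made up.

```python
import pandas as pd

# Assumption: a flight counts as "delayed" when it departs 15+ minutes late.
DELAY_THRESHOLD_MIN = 15


def delay_summary(flights: pd.DataFrame, key: str) -> pd.DataFrame:
    """Delay statistics per group, where key is e.g. "origin" or "airline"."""
    # Rows without a departure delay (e.g. cancelled flights) are skipped.
    df = flights.dropna(subset=["dep_delay"]).copy()
    df["delayed"] = df["dep_delay"] >= DELAY_THRESHOLD_MIN

    summary = df.groupby(key).agg(
        n_flights=("dep_delay", "size"),
        n_delayed=("delayed", "sum"),
        median_delay=("dep_delay", "median"),
        mean_delay=("dep_delay", "mean"),
    )
    # Empirical probability that a flight in this group is delayed.
    summary["p_delayed"] = summary["n_delayed"] / summary["n_flights"]

    # Hypothetical misery index: the average rank of the three metrics,
    # so a higher value means a worse experience.
    ranks = summary[["n_delayed", "median_delay", "mean_delay"]].rank()
    summary["misery_index"] = ranks.mean(axis=1)
    return summary.sort_values("misery_index", ascending=False)


def best_airline_per_route(flights: pd.DataFrame) -> pd.DataFrame:
    """For each (origin, dest) pair, the airline with the lowest mean delay."""
    mean_delay = (
        flights.dropna(subset=["dep_delay"])
        .groupby(["origin", "dest", "airline"])["dep_delay"]
        .mean()
        .reset_index(name="mean_dep_delay")
    )
    # idxmin returns the row label of the smallest mean delay on each route.
    best_rows = mean_delay.groupby(["origin", "dest"])["mean_dep_delay"].idxmin()
    return mean_delay.loc[best_rows].set_index(["origin", "dest"])


# Made-up sample rows, only to show the shape of the input.
flights = pd.DataFrame({
    "airline": ["AA", "AA", "DL", "DL", "UA", "UA"],
    "origin": ["JFK", "JFK", "JFK", "ATL", "ATL", "JFK"],
    "dest": ["LAX", "LAX", "LAX", "JFK", "JFK", "LAX"],
    "dep_delay": [30.0, 10.0, 5.0, 20.0, -2.0, None],
})

by_origin = delay_summary(flights, "origin")
by_airline = delay_summary(flights, "airline")
best = best_airline_per_route(flights)
print(by_origin)
print(best)
```

    Averaging ranks keeps the three metrics on a comparable scale, but it is only one possible choice; a weighted sum of normalised values would work as well. Looking up best.loc[("JFK", "LAX")] then returns the best-performing airline for that route.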

    THEMES

    Python, pandas, NumPy, Matplotlib

    DATASET INFO

    Our dataset comes from the US Bureau of Transportation Statistics and contains the On-Time Performance records for domestic US flights in 2017. For each flight it records the date and flight number, the airline, the origin and destination airports and cities, the departure and arrival times, and the reason for any delay or whether the flight was cancelled.

    The dataset has 5,674,621 flights and 20 variables.

    A subset of the dataset can be found here.

    CODE

    The Python code (with Markdown commentary) is available here.