Flight Data

For upcoming assignments (Modules 5 and 6), we will work with a data set about airline flight delays provided by the Bureau of Transportation Statistics. This data set is quite big (there are thousands of flights each day). It is interesting for class because it can be viewed simply as tabular data, or as set or graph data.

The Data

The Bureau of Transportation Statistics has a nice website where you can download the data a month at a time. You can choose many different attributes of each flight to download, and each one is well documented.

I have downloaded some data (and simplified it to make it easier to work with - as described below). You are allowed (encouraged) to work with my data.

The data is a table where each row is a flight. There are all kinds of information about each flight: which day it was on, what time it left (scheduled and actual), when it arrived (scheduled and actual), how long it was (in miles and time), how late it was (relative to the schedule), the causes of the delays, whether it was cancelled or diverted, etc.

In my experimenting, I learned that air travel delays aren’t (or, I should say, weren’t in the months I looked at) that bad. The median flight delay - in almost any way I sliced things - was zero or less (most flights arrive on time or early).
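If you want to check the median yourself, here is a minimal sketch using only the Python standard library. The column name `ArrDelay` (arrival delay in minutes, negative meaning early) is an assumption about how the field is named in your download, and the four sample rows are made up for illustration. Blank values are skipped, since a cancelled flight has no arrival delay.

```python
import csv
import io
import statistics


def median_delay(rows, column="ArrDelay"):
    """Median of a delay column (in minutes), skipping rows where it is blank."""
    delays = [
        float(row[column])
        for row in rows
        if (row.get(column) or "").strip() != ""
    ]
    return statistics.median(delays)


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "Origin,Dest,ArrDelay\n"
    "MSN,ORD,-5\n"
    "ORD,MSN,12\n"
    "MSN,DEN,\n"  # e.g. a cancelled flight: no arrival delay recorded
    "DEN,MSN,-2\n"
)
print(median_delay(csv.DictReader(sample)))  # prints -2.0
```

Filtering the rows before calling the function (by airline, airport, or day of week) is one way to "slice" the data and compare medians.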

The data we provide

I pulled data for January, April, July and October of 2024 (the first month of each quarter). For the class assignments, we recommend working with the 4 months of data combined (to cover the different seasons), but 1 month is “big enough”. Students are welcome to pull more data (e.g., to have all 12 months, or different years).

I selected the fields I thought were sufficient. For example, I didn’t get the detailed causes of different delays, or the different ways of measuring departure times. I got the simple version of the airline codes and airport codes. All of the fields are well documented on the BTS web page.

Each month has between 547K and 634K flights - there are 2.37 million flights in the combined data set. The simplifications (explained below) reduce the data by around 60% (the simplified combined set has about 920K flights).
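As a quick check of the "around 60%" figure, the arithmetic can be worked out from the two rounded counts quoted above. This is only a sketch with those rounded numbers, not the exact row counts in the files.

```python
# Rounded figures quoted in the text.
full_count = 2_370_000       # flights in the four original months combined
simplified_count = 920_000   # flights left after the simplifications

reduction = 1 - simplified_count / full_count
print(f"{reduction:.1%}")    # prints 61.2%
```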

Note: you don’t need to use my provided data files. However, you are expected to work with (at least) an equivalent amount of data.

Getting The Data

The data (original, simplified, and combined) as well as my processing script are available at: https://github.com/uwgraphics/FlightData25/

If you want the combined-reduced.csv file you can access it directly or go to the repository page and press the “Download” button (recommended for reliability). The CSV file is 93.9MB.
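Once the file is downloaded, one way to read it without loading all of it into memory is to stream it row by row. This is a minimal sketch using the standard library. It assumes the file sits in your working directory and is UTF-8 encoded, and the helper names `iter_flights` and `count_flights` are made up for this example.

```python
import csv
from pathlib import Path


def iter_flights(path):
    """Yield one dict per flight, reading the file a row at a time."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)


def count_flights(path):
    """Count the data rows (flights) in the CSV file."""
    return sum(1 for _ in iter_flights(path))


if __name__ == "__main__":
    path = Path("combined-reduced.csv")  # assumed location of the download
    if path.exists():
        print(count_flights(path))
```

If the download worked, the count should be roughly the 920K flights mentioned earlier.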

Adding Data

It is OK to add data - but you don’t need to. Try to focus on the data that you have. There are enough interesting things to find in it.

If you are using Tableau, it will add useful data for you. If you tell it that the airport codes (Origin/Dest) should be interpreted as geographic information (airport code is a choice), it will get the latitude/longitude for you (and place it on a map).

I had GitHub Copilot write Python code to get latitude/longitude for me to use in my experiments.

Simplifying the Data

To make the data easier to work with, I simplified the data as follows (you are allowed to do the same if you process the data yourself - or you don’t have to).

  1. I limited the data to the first 28 days of each month. This is important because…
    • all months have the same number of days
    • all months have 4 of each day of the week (no month has more or fewer weekend days)
  2. I limited the data to the top 105 airports (both the origin and destination need to be in the top 105). I picked 105 because I wanted to make sure MSN always appears, and it tends to rank around 100.

This reduces the data by about 60% - but it is still pretty big.
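If you process the raw data yourself, the two steps above could be sketched roughly as follows. This is not my actual processing script (that one is in the repository). The column name `DayofMonth` is an assumption, "top" is assumed to mean the most flights counting both departures and arrivals, and the ranking here is computed before the day filter is applied. Each of those choices may differ from what the script does.

```python
from collections import Counter


def top_airports(flights, n=105):
    """Return the n airport codes that appear in the most flights."""
    counts = Counter()
    for flight in flights:
        counts[flight["Origin"]] += 1
        counts[flight["Dest"]] += 1
    return {code for code, _ in counts.most_common(n)}


def simplify(flights, n_airports=105, max_day=28):
    """Keep flights on days 1..max_day whose origin and destination are both top airports."""
    flights = list(flights)  # we need two passes: one to rank, one to filter
    keep = top_airports(flights, n_airports)
    return [
        flight
        for flight in flights
        if int(flight["DayofMonth"]) <= max_day
        and flight["Origin"] in keep
        and flight["Dest"] in keep
    ]
```

Here `flights` can be any iterable of dictionaries, such as the rows produced by `csv.DictReader`.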

Ways to Think of the Data

The airline route system is graph (network) data - there are nodes (cities) and edges (routes), and the connections can be complex. There is also set information (e.g., the set of cities that can be reached from a given city). However, we will cover graphs and sets later in the class.
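To make the graph and set views concrete, here is a small sketch that turns the flight table into a route graph. Airports stand in for cities, each airport maps to the set of airports it has a direct flight to, and the sample rows are made up. The function name `build_route_graph` is hypothetical.

```python
from collections import defaultdict


def build_route_graph(flights):
    """Map each origin airport to the set of destinations it has a direct flight to."""
    graph = defaultdict(set)
    for flight in flights:
        graph[flight["Origin"]].add(flight["Dest"])
    return graph


# Made-up sample rows; many flights on one route collapse into a single edge.
sample = [
    {"Origin": "MSN", "Dest": "ORD"},
    {"Origin": "MSN", "Dest": "ORD"},
    {"Origin": "MSN", "Dest": "DEN"},
    {"Origin": "ORD", "Dest": "DEN"},
    {"Origin": "ORD", "Dest": "MSN"},
]
graph = build_route_graph(sample)

# Set view: where can you fly directly from MSN?
print(sorted(graph["MSN"]))                # prints ['DEN', 'ORD']
# Set operation: destinations served from both MSN and ORD.
print(sorted(graph["MSN"] & graph["ORD"]))  # prints ['DEN']
```

The same rows are a table (one row per flight), a graph (the dictionary of edges), and a collection of sets (each dictionary value), which is the point of the distinction.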

This distinction (table/set/graph) is important for class because each of the assignments asks about a different

GenAI Disclosure:

I used GitHub Copilot (with either GPT-5 mini or Claude 4.5 models) to write the scripts I used to process the data.