All vehicles 🚗 reported by Transportation Network Providers (sometimes called rideshare companies) to the City of Chicago as part of routine reporting required by ordinance. Inclusion of a vehicle in a monthly report indicates that the vehicle was eligible for trips in Chicago in that month for at least one day, regardless of whether it actually provided any rides. If a vehicle is eligible in multiple months, which is common, it will have records in each of these reporting months.

Source: Chicago Data Portal Transportation Network Providers - Vehicles


Goal of this Notebook

🚀 This notebook will make the dataset more usable by perfoming these tasks:

  • Remove rows with missing values in MAKE, MODEL, COLOR, YEAR columns
  • Only filter vehicles with 100 or more trips
  • Filter rows with valid State 2-letter codes
  • Extract Year/Month Reported
  • Remove make & model combinations that are extremely rare
In [1]:
import pandas as pd
import numpy as np

Load Data

The dataset can be downloaded at Chicago Data Portal Transportation Network Providers - Vehicles (~350 MB). The dataset contains information about registered ridesharing vehicles in Chicago as of October 30, 2020.

In [2]:
df = pd.read_csv('Transportation_Network_Providers_-_Vehicles.csv')
In [3]:
df.head(3)
Out[3]:
MONTH_REPORTED STATE MAKE MODEL COLOR YEAR LAST_INSPECTION_MONTH NUMBER_OF_TRIPS MULTIPLE_TNPS
0 2017-09 IL Nissan Pathfinder Red 2016.0 2017-06 NaN True
1 2016-10 IL Nissan Altima NaN 2016.0 NaN NaN True
2 2018-01 IL Nissan Altima Gray 2014.0 2017-10 NaN True

Data Overview

Number of rows and columns

In [4]:
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.')
There are 8466571 rows and 9 columns.

Data types

In [5]:
df.dtypes
Out[5]:
MONTH_REPORTED            object
STATE                     object
MAKE                      object
MODEL                     object
COLOR                     object
YEAR                     float64
LAST_INSPECTION_MONTH     object
NUMBER_OF_TRIPS          float64
MULTIPLE_TNPS               bool
dtype: object

YEAR and NUMBER_OF_TRIPS are float64s since those columns contain missing values.

Missing values

In [6]:
df.isna().sum()
Out[6]:
MONTH_REPORTED                 0
STATE                      14210
MAKE                        5035
MODEL                       5053
COLOR                     875308
YEAR                        5031
LAST_INSPECTION_MONTH    2957715
NUMBER_OF_TRIPS            14531
MULTIPLE_TNPS                  0
dtype: int64

Removing Unwanted Rows and Columns

Filtering rows with missing values in columns of interest

In [7]:
df.dropna(subset=['MAKE', 'MODEL', 'COLOR', 'YEAR'], inplace=True)

Only select vehicles that had 100 or more trips

The filtered dataset will be used to analyze popular cars used by Uber and Lyft drivers. 🚙 We only select rows with 100 or more trips on the corresponding given month.

In [8]:
df = df[df['NUMBER_OF_TRIPS'] >= 100].copy()

Only filter valid States

Count the number of vehicles by State

In [9]:
df['STATE'].value_counts()
Out[9]:
IL            1652182
IN              24070
FL              11257
TX               4574
CA               4226
MI               4053
WI               3990
OH               2348
GA               2294
MO               1647
IA               1611
NY               1607
PA               1581
NJ               1383
VA               1357
CO               1191
TN               1108
LA               1072
AZ               1009
KY                909
WA                831
MD                819
MN                768
ID                704
MA                683
OK                668
SC                627
NC                579
AL                462
MS                438
NV                374
CT                373
AR                292
KS                257
OR                235
NE                225
NM                171
UT                156
ND                154
DE                143
SD                129
WV                113
AK                 99
RI                 89
VT                 37
MT                 35
ME                 35
DC                 30
NH                 23
HI                 20
WY                 12
0                  12
CALIFORNIA          6
LEXUS               4
S                   3
VIRGINIA            2
AS                  1
Name: STATE, dtype: int64

While most values look valid, there are some rows where the 2-letter codes are invalid. Replace full state names (e.g., "CALIFORNIA", "VIRGINIA) with the 2 letter codes.

In [10]:
df.loc[df['STATE'] == 'CALIFORNIA', 'STATE'] = 'CA'
df.loc[df['STATE'] == 'VIRGINIA', 'STATE'] = 'VA'

Drop the other ones.

In [11]:
# Filter valid states only
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
df = df[df['STATE'].isin(states)].copy()

Drop LAST_INSPECTION_MONTH

In [12]:
# Drop LAST_INSPECTION_MONTH column
df.drop(columns=['LAST_INSPECTION_MONTH'], inplace=True)

Data Types

Convert YEAR and NUMBER_OF_TRIPS to integers

In [13]:
df['YEAR'] = df['YEAR'].astype(np.int64)
df['NUMBER_OF_TRIPS'] = df['NUMBER_OF_TRIPS'].astype(np.int64)
In [14]:
df.dtypes
Out[14]:
MONTH_REPORTED     object
STATE              object
MAKE               object
MODEL              object
COLOR              object
YEAR                int64
NUMBER_OF_TRIPS     int64
MULTIPLE_TNPS        bool
dtype: object

Make & Model

Rename YEAR to MODEL_YEAR

In [15]:
df.rename(columns={'YEAR': 'MODEL_YEAR'}, inplace=True)

Number of registrations by make & model

In [16]:
df_count_by_vehicle = df.groupby(['MAKE', 'MODEL'], as_index=False).agg({
    'MODEL_YEAR': 'count'
}).rename(columns={
    'MODEL_YEAR': 'COUNT'
}).sort_values('COUNT')

df_count_by_vehicle.head(5)
Out[16]:
MAKE MODEL COUNT
0 Acura 3.2 Tl 1
750 Lincoln Camry Hybrid 1
749 Lincoln Camry 1
746 Lincoln Altima 1
744 Lexus Ux Hybrid 1

Remove rows with make & model registrations < 20

In [17]:
df_count_by_vehicle[df_count_by_vehicle['COUNT'] >= 20].head(10)
Out[17]:
MAKE MODEL COUNT
1249 Toyota Rav 20
432 Honda Fit Ev 20
859 Mercedes Benz Metris 20
1082 Saab 42252 20
689 Lexus Es300h 20
842 Mercedes Benz A Class 20
1153 Suzuki Verona 20
1090 Saab 9 5 20
741 Lexus Rx350 20
1225 Toyota Land Cruiser 21

Find make & model combinations that were registered at least 15 times. The threshold number 15 is purely subjective.

In [18]:
df_threshold = df_count_by_vehicle[df_count_by_vehicle['COUNT'] >= 20]
df_threshold.shape
Out[18]:
(677, 3)

Use an inner merge to remove rows where the make & model combinations were too rare.

In [19]:
df = df.merge(df_threshold[['MAKE', 'MODEL']], how='inner', on=['MAKE', 'MODEL'])
df.head(5)
Out[19]:
MONTH_REPORTED STATE MAKE MODEL COLOR MODEL_YEAR NUMBER_OF_TRIPS MULTIPLE_TNPS
0 2018-01 IL Toyota Corolla Gold 2013 151 False
1 2017-07 IL Toyota Corolla Dark Red 2009 400 False
2 2017-08 IL Toyota Corolla Black 2016 362 True
3 2017-05 IL Toyota Corolla Silver 2016 159 True
4 2015-08 IL Toyota Corolla Black 2015 113 True

Extract Year/Month Reported

In [20]:
df['REPORTED_YEAR'] = df['MONTH_REPORTED'].str[:4].astype(np.int64)
df['REPORTED_MONTH'] = df['MONTH_REPORTED'].str[-2:].astype(np.int64)
df.drop(columns=['MONTH_REPORTED'], inplace=True)
In [21]:
df
Out[21]:
STATE MAKE MODEL COLOR MODEL_YEAR NUMBER_OF_TRIPS MULTIPLE_TNPS REPORTED_YEAR REPORTED_MONTH
0 IL Toyota Corolla Gold 2013 151 False 2018 1
1 IL Toyota Corolla Dark Red 2009 400 False 2017 7
2 IL Toyota Corolla Black 2016 362 True 2017 8
3 IL Toyota Corolla Silver 2016 159 True 2017 5
4 IL Toyota Corolla Black 2015 113 True 2015 8
... ... ... ... ... ... ... ... ... ...
1729905 NY Kia E-Niro Black 2022 338 False 2022 7
1729906 IL Kia E-Niro Black 2022 179 False 2022 7
1729907 IL Kia E-Niro Black 2022 179 False 2022 7
1729908 NY Kia E-Niro Black 2022 247 False 2022 7
1729909 NY Kia E-Niro Black 2022 281 False 2022 7

1729910 rows × 9 columns

Rearrange columns

In [22]:
df = df[['REPORTED_YEAR', 'REPORTED_MONTH', 'STATE', 'MAKE', 'MODEL', 'COLOR', 'MODEL_YEAR', 'NUMBER_OF_TRIPS', 'MULTIPLE_TNPS']]
df.head(2)
Out[22]:
REPORTED_YEAR REPORTED_MONTH STATE MAKE MODEL COLOR MODEL_YEAR NUMBER_OF_TRIPS MULTIPLE_TNPS
0 2018 1 IL Toyota Corolla Gold 2013 151 False
1 2017 7 IL Toyota Corolla Dark Red 2009 400 False

Shuffle Rows and Sort By Reported Month

As a final step, sort the dataset by reported year and month.

In [23]:
df = df.sample(frac=1).sort_values(['REPORTED_YEAR', 'REPORTED_MONTH']).reset_index(drop=True)
In [24]:
df
Out[24]:
REPORTED_YEAR REPORTED_MONTH STATE MAKE MODEL COLOR MODEL_YEAR NUMBER_OF_TRIPS MULTIPLE_TNPS
0 2015 3 IL Ford Explorer Black 2014 160 False
1 2015 3 IL Ford Escape Green 2005 255 False
2 2015 3 IL Toyota Highlander Hybrid White 2011 324 False
3 2015 3 IL Honda Pilot Black 2011 135 False
4 2015 3 IL Bmw 5-Series Black 2008 558 False
... ... ... ... ... ... ... ... ... ...
1729905 2022 7 IL Buick Encore White 2019 134 False
1729906 2022 7 IL Toyota Highlander White 2012 366 False
1729907 2022 7 IL Chevrolet Impala Black 2017 157 False
1729908 2022 7 IL Honda Accord Silver 2018 124 False
1729909 2022 7 IL Lexus Rx White 2014 105 False

1729910 rows × 9 columns

In [25]:
df.to_csv('chicago-ridesharing-vehicles.csv', index=None)