All vehicles 🚗 reported by Transportation Network Providers (sometimes called rideshare companies) to the City of Chicago as part of routine reporting required by ordinance. Inclusion of a vehicle in a monthly report indicates that the vehicle was eligible for trips in Chicago in that month for at least one day, regardless of whether it actually provided any rides. If a vehicle is eligible in multiple months, which is common, it will have records in each of these reporting months.

Source: Chicago Data Portal Transportation Network Providers - Vehicles


Goal of this Notebook

🚀 This notebook will make the dataset more usable by perfoming these tasks:

  • Remove rows with missing values in MAKE, MODEL, COLOR, YEAR columns
  • Only filter vehicles with 100 or more trips
  • Filter rows with valid State 2-letter codes
  • Extract Year/Month Reported
  • Remove make & model combinations that are extremely rare
In [1]:
import pandas as pd
import numpy as np

Load Data

The dataset can be downloaded at Chicago Data Portal Transportation Network Providers - Vehicles (~350 MB). The dataset contains information about registered ridesharing vehicles in Chicago as of October 30, 2020.

In [2]:
df = pd.read_csv('Transportation_Network_Providers_-_Vehicles.csv')
In [3]:
df.head(3)
Out[3]:
MONTH_REPORTED STATE MAKE MODEL COLOR YEAR LAST_INSPECTION_MONTH NUMBER_OF_TRIPS MULTIPLE_TNPS
0 2017-09 IL Nissan Pathfinder Red 2016.0 2017-06 NaN True
1 2016-10 IL Nissan Altima NaN 2016.0 NaN NaN True
2 2018-01 IL Nissan Altima Gray 2014.0 2017-10 NaN True

Data Overview

Number of rows and columns

In [4]:
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.')
There are 7156604 rows and 9 columns.

Data types

In [5]:
df.dtypes
Out[5]:
MONTH_REPORTED            object
STATE                     object
MAKE                      object
MODEL                     object
COLOR                     object
YEAR                     float64
LAST_INSPECTION_MONTH     object
NUMBER_OF_TRIPS          float64
MULTIPLE_TNPS               bool
dtype: object

YEAR and NUMBER_OF_TRIPS are float64s since those columns contain missing values.

Missing values

In [6]:
df.isna().sum()
Out[6]:
MONTH_REPORTED                 0
STATE                      14210
MAKE                        5035
MODEL                       5053
COLOR                     870992
YEAR                        5031
LAST_INSPECTION_MONTH    2939655
NUMBER_OF_TRIPS            12129
MULTIPLE_TNPS                  0
dtype: int64

Removing Unwanted Rows and Columns

Filtering rows with missing values in columns of interest

In [7]:
df.dropna(subset=['MAKE', 'MODEL', 'COLOR', 'YEAR'], inplace=True)

Only select vehicles that had 100 or more trips

The filtered dataset will be used to analyze popular cars used by Uber and Lyft drivers. 🚙 We only select rows with 100 or more trips on the corresponding given month.

In [8]:
df = df[df['NUMBER_OF_TRIPS'] >= 100].copy()

Only filter valid States

Count the number of vehicles by State

In [9]:
df['STATE'].value_counts()
Out[9]:
IL            1324779
IN              19637
FL               5542
WI               3081
TX               2794
CA               2616
MI               2057
GA               1660
OH               1632
MO               1251
IA               1131
PA               1082
NY                921
VA                764
CO                729
TN                724
AZ                683
LA                676
KY                636
NJ                604
ID                595
MN                500
MD                499
WA                485
MA                468
OK                462
SC                461
NC                425
MS                353
AL                339
NV                252
AR                247
KS                231
CT                192
OR                168
NE                163
NM                127
SD                114
UT                113
ND                 99
WV                 96
AK                 86
DE                 82
RI                 76
VT                 28
MT                 28
DC                 27
ME                 19
NH                 17
HI                 15
0                  12
WY                 10
CALIFORNIA          6
LEXUS               4
S                   3
VIRGINIA            2
AS                  1
Name: STATE, dtype: int64

While most values look valid, there are some rows where the 2-letter codes are invalid. Replace full state names (e.g., "CALIFORNIA", "VIRGINIA) with the 2 letter codes.

In [10]:
df.loc[df['STATE'] == 'CALIFORNIA', 'STATE'] = 'CA'
df.loc[df['STATE'] == 'VIRGINIA', 'STATE'] = 'VA'

Drop the other ones.

In [11]:
# Filter valid states only
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
df = df[df['STATE'].isin(states)].copy()

Drop LAST_INSPECTION_MONTH

In [12]:
# Drop LAST_INSPECTION_MONTH column
df.drop(columns=['LAST_INSPECTION_MONTH'], inplace=True)

Data Types

Convert YEAR and NUMBER_OF_TRIPS to integers

In [13]:
df['YEAR'] = df['YEAR'].astype(np.int64)
df['NUMBER_OF_TRIPS'] = df['NUMBER_OF_TRIPS'].astype(np.int64)
In [14]:
df.dtypes
Out[14]:
MONTH_REPORTED     object
STATE              object
MAKE               object
MODEL              object
COLOR              object
YEAR                int64
NUMBER_OF_TRIPS     int64
MULTIPLE_TNPS        bool
dtype: object

Make & Model

Rename YEAR to MODEL_YEAR

In [15]:
df.rename(columns={'YEAR': 'MODEL_YEAR'}, inplace=True)

Number of registrations by make & model

In [16]:
df_count_by_vehicle = df.groupby(['MAKE', 'MODEL'], as_index=False).agg({
    'MODEL_YEAR': 'count'
}).rename(columns={
    'MODEL_YEAR': 'COUNT'
}).sort_values('COUNT')

df_count_by_vehicle.head(5)
Out[16]:
MAKE MODEL COUNT
0 Acura 3.2 Tl 1
354 Gmc Mv1 1
935 Nissan Wingroad 1
339 Ford Transit Van 1
336 Ford Transit 1

Remove rows with make & model registrations <= 15

In [17]:
df_count_by_vehicle[df_count_by_vehicle['COUNT'] >= 15].head(10)
Out[17]:
MAKE MODEL COUNT
250 Dodge Ram 15
385 Honda Cr 15
123 Cadillac Xt4 15
1172 Volkswagen Jetta Hybrid 15
554 Jeep Wrangler Jk 15
462 Hyundai Xg300 15
845 Mercury Milan Hybrid 15
625 Lexus Es Series 15
661 Lexus Nx 200t 16
131 Chevrolet Astra 16

Find make & model combinations that were registered at least 15 times. The threshold number 15 is purely subjective.

In [18]:
df_threshold = df_count_by_vehicle[df_count_by_vehicle['COUNT'] >= 15]
df_threshold.shape
Out[18]:
(659, 3)

Use an inner merge to remove rows where the make & model combinations were too rare.

In [19]:
df = df.merge(df_threshold[['MAKE', 'MODEL']], how='inner', on=['MAKE', 'MODEL'])
df.head(5)
Out[19]:
MONTH_REPORTED STATE MAKE MODEL COLOR MODEL_YEAR NUMBER_OF_TRIPS MULTIPLE_TNPS
0 2018-01 IL Toyota Corolla Gold 2013 151 False
1 2017-07 IL Toyota Corolla Dark Red 2009 400 False
2 2017-08 IL Toyota Corolla Black 2016 362 True
3 2017-05 IL Toyota Corolla Silver 2016 159 True
4 2015-08 IL Toyota Corolla Black 2015 113 True

Extract Year/Month Reported

In [20]:
df['REPORTED_YEAR'] = df['MONTH_REPORTED'].str[:4].astype(np.int64)
df['REPORTED_MONTH'] = df['MONTH_REPORTED'].str[-2:].astype(np.int64)
df.drop(columns=['MONTH_REPORTED'], inplace=True)
In [21]:
df
Out[21]:
STATE MAKE MODEL COLOR MODEL_YEAR NUMBER_OF_TRIPS MULTIPLE_TNPS REPORTED_YEAR REPORTED_MONTH
0 IL Toyota Corolla Gold 2013 151 False 2018 1
1 IL Toyota Corolla Dark Red 2009 400 False 2017 7
2 IL Toyota Corolla Black 2016 362 True 2017 8
3 IL Toyota Corolla Silver 2016 159 True 2017 5
4 IL Toyota Corolla Black 2015 113 True 2015 8
... ... ... ... ... ... ... ... ... ...
1377578 IL Kia Rio 5 Door Black 2018 309 False 2020 8
1377579 IL Kia Rio 5 Door Silver 2020 197 False 2020 8
1377580 IL Kia Rio 5 Door Black 2020 180 False 2020 8
1377581 IL Kia Rio 5 Door Red 2012 194 False 2020 8
1377582 IL Kia Rio 5 Door Silver 2020 189 False 2020 9

1377583 rows × 9 columns

Rearrange columns

In [22]:
df = df[['REPORTED_YEAR', 'REPORTED_MONTH', 'STATE', 'MAKE', 'MODEL', 'COLOR', 'MODEL_YEAR', 'NUMBER_OF_TRIPS', 'MULTIPLE_TNPS']]
df.head(2)
Out[22]:
REPORTED_YEAR REPORTED_MONTH STATE MAKE MODEL COLOR MODEL_YEAR NUMBER_OF_TRIPS MULTIPLE_TNPS
0 2018 1 IL Toyota Corolla Gold 2013 151 False
1 2017 7 IL Toyota Corolla Dark Red 2009 400 False

Shuffle Rows and Sort By Reported Month

As a final step, sort the dataset by reported year and month.

In [23]:
df = df.sample(frac=1).sort_values(['REPORTED_YEAR', 'REPORTED_MONTH']).reset_index(drop=True)
In [24]:
df
Out[24]:
REPORTED_YEAR REPORTED_MONTH STATE MAKE MODEL COLOR MODEL_YEAR NUMBER_OF_TRIPS MULTIPLE_TNPS
0 2015 3 IL Toyota Highlander Black 2010 263 False
1 2015 3 IL Toyota Camry Black 2008 204 False
2 2015 3 IL Toyota Camry Black 2009 122 False
3 2015 3 IL Mazda Mazda3 Silver 2012 150 False
4 2015 3 IL Chrysler Pt Cruiser White/Pearl 2008 208 False
... ... ... ... ... ... ... ... ... ...
1377578 2020 9 IL Buick Regal Black 2016 182 False
1377579 2020 9 IL Dodge Journey Silver 2019 312 True
1377580 2020 9 IL Ford Ecosport Blue 2018 143 False
1377581 2020 9 IL Toyota Corolla Black 2017 112 True
1377582 2020 9 IL Toyota Rav4 Black 2018 192 False

1377583 rows × 9 columns

In [25]:
df.to_csv('chicago-ridesharing-vehicles.csv', index=None)