All vehicles 🚗 reported by Transportation Network Providers (sometimes called rideshare companies) to the City of Chicago as part of routine reporting required by ordinance. Inclusion of a vehicle in a monthly report indicates that the vehicle was eligible for trips in Chicago in that month for at least one day, regardless of whether it actually provided any rides. If a vehicle is eligible in multiple months, which is common, it will have records in each of these reporting months.

Source: Chicago Data Portal Transportation Network Providers - Vehicles


Goal of this Notebook

🚀 This notebook will make the dataset more usable by perfoming these tasks:

  • Remove rows with missing values in MAKE, MODEL, COLOR, YEAR columns
  • Only filter vehicles with 100 or more trips
  • Filter rows with valid State 2-letter codes
  • Extract Year/Month Reported
  • Remove make & model combinations that are extremely rare
In [1]:
import pandas as pd
import numpy as np

Load Data

The dataset can be downloaded at Chicago Data Portal Transportation Network Providers - Vehicles (~350 MB). The dataset contains information about registered ridesharing vehicles in Chicago as of October 30, 2020.

In [2]:
df = pd.read_csv('Transportation_Network_Providers_-_Vehicles.csv')
In [3]:
df.head(3)
Out[3]:
MONTH_REPORTEDSTATEMAKEMODELCOLORYEARLAST_INSPECTION_MONTHNUMBER_OF_TRIPSMULTIPLE_TNPS
02017-09ILNissanPathfinderRed2016.02017-06NaNTrue
12016-10ILNissanAltimaNaN2016.0NaNNaNTrue
22018-01ILNissanAltimaGray2014.02017-10NaNTrue

Data Overview

Number of rows and columns

In [4]:
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.')
There are 8466571 rows and 9 columns.

Data types

In [5]:
df.dtypes
Out[5]:
MONTH_REPORTED            object
STATE                     object
MAKE                      object
MODEL                     object
COLOR                     object
YEAR                     float64
LAST_INSPECTION_MONTH     object
NUMBER_OF_TRIPS          float64
MULTIPLE_TNPS               bool
dtype: object

YEAR and NUMBER_OF_TRIPS are float64s since those columns contain missing values.

Missing values

In [6]:
df.isna().sum()
Out[6]:
MONTH_REPORTED                 0
STATE                      14210
MAKE                        5035
MODEL                       5053
COLOR                     875308
YEAR                        5031
LAST_INSPECTION_MONTH    2957715
NUMBER_OF_TRIPS            14531
MULTIPLE_TNPS                  0
dtype: int64

Removing Unwanted Rows and Columns

Filtering rows with missing values in columns of interest

In [7]:
df.dropna(subset=['MAKE', 'MODEL', 'COLOR', 'YEAR'], inplace=True)

Only select vehicles that had 100 or more trips

The filtered dataset will be used to analyze popular cars used by Uber and Lyft drivers. 🚙 We only select rows with 100 or more trips on the corresponding given month.

In [8]:
df = df[df['NUMBER_OF_TRIPS'] >= 100].copy()

Only filter valid States

Count the number of vehicles by State

In [9]:
df['STATE'].value_counts()
Out[9]:
IL            1652182
IN              24070
FL              11257
TX               4574
CA               4226
MI               4053
WI               3990
OH               2348
GA               2294
MO               1647
IA               1611
NY               1607
PA               1581
NJ               1383
VA               1357
CO               1191
TN               1108
LA               1072
AZ               1009
KY                909
WA                831
MD                819
MN                768
ID                704
MA                683
OK                668
SC                627
NC                579
AL                462
MS                438
NV                374
CT                373
AR                292
KS                257
OR                235
NE                225
NM                171
UT                156
ND                154
DE                143
SD                129
WV                113
AK                 99
RI                 89
VT                 37
MT                 35
ME                 35
DC                 30
NH                 23
HI                 20
WY                 12
0                  12
CALIFORNIA          6
LEXUS               4
S                   3
VIRGINIA            2
AS                  1
Name: STATE, dtype: int64

While most values look valid, there are some rows where the 2-letter codes are invalid. Replace full state names (e.g., "CALIFORNIA", "VIRGINIA) with the 2 letter codes.

In [10]:
df.loc[df['STATE'] == 'CALIFORNIA', 'STATE'] = 'CA'
df.loc[df['STATE'] == 'VIRGINIA', 'STATE'] = 'VA'

Drop the other ones.

In [11]:
# Filter valid states only
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
df = df[df['STATE'].isin(states)].copy()

Drop LAST_INSPECTION_MONTH

In [12]:
# Drop LAST_INSPECTION_MONTH column
df.drop(columns=['LAST_INSPECTION_MONTH'], inplace=True)

Data Types

Convert YEAR and NUMBER_OF_TRIPS to integers

In [13]:
df['YEAR'] = df['YEAR'].astype(np.int64)
df['NUMBER_OF_TRIPS'] = df['NUMBER_OF_TRIPS'].astype(np.int64)
In [14]:
df.dtypes
Out[14]:
MONTH_REPORTED     object
STATE              object
MAKE               object
MODEL              object
COLOR              object
YEAR                int64
NUMBER_OF_TRIPS     int64
MULTIPLE_TNPS        bool
dtype: object

Make & Model

Rename YEAR to MODEL_YEAR

In [15]:
df.rename(columns={'YEAR': 'MODEL_YEAR'}, inplace=True)

Number of registrations by make & model

In [16]:
df_count_by_vehicle = df.groupby(['MAKE', 'MODEL'], as_index=False).agg({
    'MODEL_YEAR': 'count'
}).rename(columns={
    'MODEL_YEAR': 'COUNT'
}).sort_values('COUNT')

df_count_by_vehicle.head(5)
Out[16]:
MAKEMODELCOUNT
0Acura3.2 Tl1
750LincolnCamry Hybrid1
749LincolnCamry1
746LincolnAltima1
744LexusUx Hybrid1

Remove rows with make & model registrations < 20

In [17]:
df_count_by_vehicle[df_count_by_vehicle['COUNT'] >= 20].head(10)
Out[17]:
MAKEMODELCOUNT
1249ToyotaRav20
432HondaFit Ev20
859Mercedes BenzMetris20
1082Saab4225220
689LexusEs300h20
842Mercedes BenzA Class20
1153SuzukiVerona20
1090Saab9 520
741LexusRx35020
1225ToyotaLand Cruiser21

Find make & model combinations that were registered at least 15 times. The threshold number 15 is purely subjective.

In [18]:
df_threshold = df_count_by_vehicle[df_count_by_vehicle['COUNT'] >= 20]
df_threshold.shape
Out[18]:
(677, 3)

Use an inner merge to remove rows where the make & model combinations were too rare.

In [19]:
df = df.merge(df_threshold[['MAKE', 'MODEL']], how='inner', on=['MAKE', 'MODEL'])
df.head(5)
Out[19]:
MONTH_REPORTEDSTATEMAKEMODELCOLORMODEL_YEARNUMBER_OF_TRIPSMULTIPLE_TNPS
02018-01ILToyotaCorollaGold2013151False
12017-07ILToyotaCorollaDark Red2009400False
22017-08ILToyotaCorollaBlack2016362True
32017-05ILToyotaCorollaSilver2016159True
42015-08ILToyotaCorollaBlack2015113True

Extract Year/Month Reported

In [20]:
df['REPORTED_YEAR'] = df['MONTH_REPORTED'].str[:4].astype(np.int64)
df['REPORTED_MONTH'] = df['MONTH_REPORTED'].str[-2:].astype(np.int64)
df.drop(columns=['MONTH_REPORTED'], inplace=True)
In [21]:
df
Out[21]:
STATEMAKEMODELCOLORMODEL_YEARNUMBER_OF_TRIPSMULTIPLE_TNPSREPORTED_YEARREPORTED_MONTH
0ILToyotaCorollaGold2013151False20181
1ILToyotaCorollaDark Red2009400False20177
2ILToyotaCorollaBlack2016362True20178
3ILToyotaCorollaSilver2016159True20175
4ILToyotaCorollaBlack2015113True20158
..............................
1729905NYKiaE-NiroBlack2022338False20227
1729906ILKiaE-NiroBlack2022179False20227
1729907ILKiaE-NiroBlack2022179False20227
1729908NYKiaE-NiroBlack2022247False20227
1729909NYKiaE-NiroBlack2022281False20227

1729910 rows × 9 columns

Rearrange columns

In [22]:
df = df[['REPORTED_YEAR', 'REPORTED_MONTH', 'STATE', 'MAKE', 'MODEL', 'COLOR', 'MODEL_YEAR', 'NUMBER_OF_TRIPS', 'MULTIPLE_TNPS']]
df.head(2)
Out[22]:
REPORTED_YEARREPORTED_MONTHSTATEMAKEMODELCOLORMODEL_YEARNUMBER_OF_TRIPSMULTIPLE_TNPS
020181ILToyotaCorollaGold2013151False
120177ILToyotaCorollaDark Red2009400False

Shuffle Rows and Sort By Reported Month

As a final step, sort the dataset by reported year and month.

In [23]:
df = df.sample(frac=1).sort_values(['REPORTED_YEAR', 'REPORTED_MONTH']).reset_index(drop=True)
In [24]:
df
Out[24]:
REPORTED_YEARREPORTED_MONTHSTATEMAKEMODELCOLORMODEL_YEARNUMBER_OF_TRIPSMULTIPLE_TNPS
020153ILFordExplorerBlack2014160False
120153ILFordEscapeGreen2005255False
220153ILToyotaHighlander HybridWhite2011324False
320153ILHondaPilotBlack2011135False
420153ILBmw5-SeriesBlack2008558False
..............................
172990520227ILBuickEncoreWhite2019134False
172990620227ILToyotaHighlanderWhite2012366False
172990720227ILChevroletImpalaBlack2017157False
172990820227ILHondaAccordSilver2018124False
172990920227ILLexusRxWhite2014105False

1729910 rows × 9 columns

In [25]:
df.to_csv('chicago-ridesharing-vehicles.csv', index=None)