All vehicles 🚗 reported by Transportation Network Providers (sometimes called rideshare companies) to the City of Chicago as part of routine reporting required by ordinance. Inclusion of a vehicle in a monthly report indicates that the vehicle was eligible for trips in Chicago in that month for at least one day, regardless of whether it actually provided any rides. If a vehicle is eligible in multiple months, which is common, it will have records in each of these reporting months.
Source: Chicago Data Portal Transportation Network Providers - Vehicles
Goal of this Notebook¶
🚀 This notebook will make the dataset more usable by perfoming these tasks:
- Remove rows with missing values in
MAKE
,MODEL
,COLOR
,YEAR
columns - Only filter vehicles with 100 or more trips
- Filter rows with valid State 2-letter codes
- Extract Year/Month Reported
- Remove make & model combinations that are extremely rare
import pandas as pd
import numpy as np
Load Data¶
The dataset can be downloaded at Chicago Data Portal Transportation Network Providers - Vehicles (~350 MB). The dataset contains information about registered ridesharing vehicles in Chicago as of October 30, 2020.
df = pd.read_csv('Transportation_Network_Providers_-_Vehicles.csv')
df.head(3)
Data Overview¶
Number of rows and columns¶
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.')
Data types¶
df.dtypes
YEAR
and NUMBER_OF_TRIPS
are float64
s since those columns contain missing values.
Missing values¶
df.isna().sum()
Removing Unwanted Rows and Columns¶
Filtering rows with missing values in columns of interest¶
df.dropna(subset=['MAKE', 'MODEL', 'COLOR', 'YEAR'], inplace=True)
Only select vehicles that had 100 or more trips¶
The filtered dataset will be used to analyze popular cars used by Uber and Lyft drivers. 🚙 We only select rows with 100 or more trips on the corresponding given month.
df = df[df['NUMBER_OF_TRIPS'] >= 100].copy()
Only filter valid States¶
Count the number of vehicles by State
df['STATE'].value_counts()
While most values look valid, there are some rows where the 2-letter codes are invalid. Replace full state names (e.g., "CALIFORNIA", "VIRGINIA) with the 2 letter codes.
df.loc[df['STATE'] == 'CALIFORNIA', 'STATE'] = 'CA'
df.loc[df['STATE'] == 'VIRGINIA', 'STATE'] = 'VA'
Drop the other ones.
# Filter valid states only
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
df = df[df['STATE'].isin(states)].copy()
Drop LAST_INSPECTION_MONTH
¶
# Drop LAST_INSPECTION_MONTH column
df.drop(columns=['LAST_INSPECTION_MONTH'], inplace=True)
Data Types¶
Convert YEAR
and NUMBER_OF_TRIPS
to integers¶
df['YEAR'] = df['YEAR'].astype(np.int64)
df['NUMBER_OF_TRIPS'] = df['NUMBER_OF_TRIPS'].astype(np.int64)
df.dtypes
Make & Model¶
Rename YEAR
to MODEL_YEAR
¶
df.rename(columns={'YEAR': 'MODEL_YEAR'}, inplace=True)
Number of registrations by make & model¶
df_count_by_vehicle = df.groupby(['MAKE', 'MODEL'], as_index=False).agg({
'MODEL_YEAR': 'count'
}).rename(columns={
'MODEL_YEAR': 'COUNT'
}).sort_values('COUNT')
df_count_by_vehicle.head(5)
Remove rows with make & model registrations < 20¶
df_count_by_vehicle[df_count_by_vehicle['COUNT'] >= 20].head(10)
Find make & model combinations that were registered at least 15 times. The threshold number 15 is purely subjective.
df_threshold = df_count_by_vehicle[df_count_by_vehicle['COUNT'] >= 20]
df_threshold.shape
Use an inner merge to remove rows where the make & model combinations were too rare.
df = df.merge(df_threshold[['MAKE', 'MODEL']], how='inner', on=['MAKE', 'MODEL'])
df.head(5)
Extract Year/Month Reported¶
df['REPORTED_YEAR'] = df['MONTH_REPORTED'].str[:4].astype(np.int64)
df['REPORTED_MONTH'] = df['MONTH_REPORTED'].str[-2:].astype(np.int64)
df.drop(columns=['MONTH_REPORTED'], inplace=True)
df
Rearrange columns¶
df = df[['REPORTED_YEAR', 'REPORTED_MONTH', 'STATE', 'MAKE', 'MODEL', 'COLOR', 'MODEL_YEAR', 'NUMBER_OF_TRIPS', 'MULTIPLE_TNPS']]
df.head(2)
Shuffle Rows and Sort By Reported Month¶
As a final step, sort the dataset by reported year and month.
df = df.sample(frac=1).sort_values(['REPORTED_YEAR', 'REPORTED_MONTH']).reset_index(drop=True)
df
df.to_csv('chicago-ridesharing-vehicles.csv', index=None)