✨ Background¶
I first moved into an apartment at UIUC during the summer of 2007. The apartment was Gregory Place, managed by a company called JSM. Here is a photo of the apartment from 2007; the construction site shows the second building being built.
Back in 2007, JSM was regarded as one of the better management companies, along with Royse Brinkmeyer and Roland. Almost two decades later, JSM is still highly regarded. In recent years, prospective tenants lining up in front of JSM's office from the early morning to secure a lease was a sight to see.
Photo by Jacob Slabosz @ Daily Illini
After talking to hundreds of students over the years about their landlords in Champaign-Urbana, I came to the conclusion that many of the landlords here must be among the worst. From The University Group's notorious attempts to hijack tenants' security deposits to enduring cold showers in the middle of winter at Seven07, finding a good landlord or a headache-free apartment sounds impossible.
But this got me thinking - are the landlords in Champaign-Urbana really the worst compared to other similar college towns?
♟️ About the Dataset¶
In this post, I compare the property management companies in the Champaign-Urbana area to those of two other college towns - Provo, UT (Brigham Young University) and State College, PA (Penn State University).
Data collection and preparation¶
- 142 property management companies across the three college towns have been hand-selected.
- The companies' information on Google Maps were scraped using Outscraper (a cloud-based scraping service).
- About 17,000 reviews for the property management companies were scraped using Outscraper.
- The review texts were tokenized using spaCy. The tokenization and additional preprocessing steps are not included in this post; a minimal sketch of the general approach follows this list.
- The post uses sentiment analysis results of the reviews using DistilBERT (distilbert-base-uncased-finetuned-sst-2-english).
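The preprocessing code itself is not part of this post, but the general approach can be sketched. The snippet below is an illustrative sketch only: it assumes spaCy's en_core_web_sm model and the Hugging Face transformers pipeline with the same DistilBERT checkpoint, and the review text is made up; the actual pipeline (batching, cleaning, token filtering) may differ.
import spacy
from transformers import pipeline

# Illustrative sketch only - not the exact preprocessing used to build the dataset.
nlp = spacy.load('en_core_web_sm')  # assumes the small English model is installed
sentiment_model = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english'
)

# Hypothetical review text for demonstration purposes
review_text = "Maintenance never showed up and we did not get our deposit back."

# One row per token with its surface form, lemma, and part-of-speech explanation,
# mirroring the "text", "lemma", and "explain" columns of the tokens table.
tokens = [
    {'text': tok.text, 'lemma': tok.lemma_, 'explain': spacy.explain(tok.pos_)}
    for tok in nlp(review_text)
]

# One sentiment label (POSITIVE/NEGATIVE) with a confidence score per review.
sentiment = sentiment_model(review_text)[0]

print(tokens[:3])
print(sentiment)
In the actual dataset, each token row keeps its review_id so the tokens can later be joined back to the reviews and sentiments.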
Import packages.
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly
print(f'pandas version: {pd.__version__}')
print(f'plotly version: {plotly.__version__}')
Set the maximum number of displayed columns to 50.
pd.set_option('display.max_columns', 50)
Load and Preprocess Data¶
There are four tables.
- df_b: List of property management companies or apartments
- df_r: Google reviews of the property management companies or apartments
- df_tokens: Tokenized result of reviews using spaCy
- df_sentiments: Sentiment analysis result using DistilBERT
df_b = pd.read_csv('https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/businesses.csv')
df_r = pd.read_csv(
'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/reviews.csv',
parse_dates = ['review_datetime_utc', 'owner_answer_timestamp_datetime_utc']
)
df_tokens = pd.read_csv('https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/tokens.csv.gz')
df_sentiments = pd.read_csv('https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/sentiments.csv')
Businesses dataset¶
Display the businesses (property management companies) table.
df_b.head()
print(f"There are {df_b.shape[0]} rows and {df_b.shape[1]} columns in the businesses table.")
Print out the DataFrame's information.
df_b.info()
Check for missing values.
df_b.isna().sum()
The "site" and "borough" columns have missing values.
Reviews dataset¶
Display the reviews table.
df_r.head()
print(f"There are {df_r.shape[0]} rows and {df_r.shape[1]} columns in the reviews table.")
Check for missing values.
df_r.isna().sum()
Columns "review_text", "review_img_url", "owner_answer", and "owner_answer_timestamp_datetime_utc" have missing values.
Sentiments¶
Display the sentiments table.
df_sentiments.head()
Check for missing values.
df_sentiments.isna().sum()
There are no missing values in the sentiments table.
Merge sentiments into reviews¶
# merge only if the sentiment column is missing, so re-running the cell does not duplicate it
if 'sentiment' not in df_r.columns:
    df_r = pd.merge(
        left=df_r,
        right=df_sentiments,
        on='review_id',
        how='left'
    )
df_r.head()
Tokens¶
Display the list of tokens in their original form ("text"), lemmatized form ("lemma"), and their part-of-speech explanations ("explain").
df_tokens
Merge reviews and businesses¶
df_m = pd.merge(
left=df_r,
right=df_b,
on='place_id',
how='inner'
)
df_m.head(2)
Metadata for visualizations¶
campus_acronyms_map = {
'University of Illinois Urbana-Champaign': 'UIUC',
'Brigham Young University': 'BYU',
'Penn State University Park': 'PSU',
}
👉 Number of apartments listed by campus¶
df_b['campus'] \
.value_counts() \
.to_frame() \
.reset_index()
👉 Calculate the time owner took to respond to a review¶
td = df_m['owner_answer_timestamp_datetime_utc'] - df_m['review_datetime_utc']
response_time_in_days = td / pd.Timedelta(days=1)
# round the response time if it is a positive number of days;
# if the response time is negative, the review was updated after the
# owner's reply, so the true response time cannot be determined
df_m['response_time_in_days'] = np.where(
response_time_in_days > 0,
round(response_time_in_days, 0),
response_time_in_days
)
df_m[['review_id', 'owner_answer_timestamp_datetime_utc', 'review_datetime_utc', 'response_time_in_days']].sample(2)
Add a categorical variable based on the 'response_time_in_days' variable.
df_m['response_time'] = np.select(
[
df_m['response_time_in_days'].between(-10000, 0, inclusive='left'),
df_m['response_time_in_days'].between(0, 1, inclusive='left'),
df_m['response_time_in_days'].between(1, 7, inclusive='left'),
df_m['response_time_in_days'].between(7, 30, inclusive='left'),
df_m['response_time_in_days'].between(30, 10000, inclusive='left'),
],
[
'Unknown (review updated)',
'Within a day',
'Within a week',
'Within a month',
'After a month'
],
default='No response'
)
df_m.head(2)
👉 Owner's response time (pie chart)¶
df_response_time_counts = df_m['response_time'].value_counts().to_frame().reset_index()
df_response_time_counts
fig = px.pie(
df_response_time_counts,
names='response_time',
values='count',
title='<b>Owner response times</b><br><span style="color: #aaa;">Owners do not respond 40% of the time</span>',
height=500,
template='simple_white',
color='response_time',
color_discrete_map={
"Within a day": "#689F38",
"No response": "#FDD835",
"Within a week": "#9CCC65",
"Within a month": "#EF9A9A",
"After a month": "#EF5350",
"Unknown (review updated)": "#ccc"
},
labels={
'response_time': 'Response Time'
},
)
fig.update_traces(
textinfo='percent+label',
textposition='outside',
showlegend=False
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.show()
👉 Owner's response time (histogram)¶
fig = px.histogram(
df_m[(df_m['response_time_in_days'] >= 0) & (df_m['response_time_in_days'] < 30)],
x='response_time_in_days',
template='simple_white',
title='<b>Owner\'s response time in days (<30 days)</b><br><span style="color: #aaa">Most owners who respond do so within a week</span>',
height=500,
labels={
'response_time_in_days': 'Response time in days'
},
color_discrete_sequence=['black']
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
yaxis_title_text='Count'
)
fig.add_vrect(
x0=-0.5,
x1=7.5,
annotation_text="Within a week",
annotation_position="top right",
annotation=dict(
font_color='#7CB342',
font_size=11
),
fillcolor="#4CAF50",
opacity=0.07,
line_width=0
)
fig.add_vline(x=7.5, line_width=1, line_dash="solid", line_color="#8bc34a")
fig.show()
df_response_sentiment_count = df_m \
.groupby(['response_time', 'sentiment'], as_index=False) \
.agg({'review_id': 'count'}) \
.rename(columns={'review_id': 'count'})
df_response_sentiment_count['sentiment'] = df_response_sentiment_count['sentiment'].str.capitalize()
df_response_sentiment_count['percentage'] = df_response_sentiment_count['count'] / df_response_sentiment_count.groupby('response_time')['count'].transform('sum')
df_response_sentiment_count
👉 Owner's response time vs sentiment¶
fig = px.bar(
df_response_sentiment_count,
x='percentage',
y='response_time',
category_orders={
'response_time': [
'Within a day',
'Within a week',
'Within a month',
'After a month',
'No response',
'Unknown (review updated)'
],
'sentiment': ['Positive', 'Negative']
},
template='simple_white',
color='sentiment',
color_discrete_map={
'Positive': '#8bc34a',
'Negative': '#ef5350'
},
labels={'sentiment': 'Sentiment'},
title='<b>Owner\'s response time</b><br><span style="color: #aaa;">Negative reviews are associated with slower owner responses</span>',
text=df_response_sentiment_count.apply(lambda r: f"{'👍' if r['sentiment'] == 'Positive' else '👎'} {'{0:.1f}%'.format(r['percentage'] * 100)}", axis=1),
height=500
)
fig.update_layout(
xaxis_title_text='Percentage',
yaxis_title_text='Owner\'s response time',
font_family='Helvetica, Inter, Arial, sans-serif',
xaxis_tickformat=',.0%',
)
fig.for_each_trace(lambda t: t.update(textfont_color='white'))
fig.show()
👉 Review rating breakdown by campus¶
df_ratings_breakdown = df_m.groupby(
['campus', 'review_rating'],
as_index=False
).agg({
'review_id': 'count'
}).rename(columns={
'review_id': 'num_reviews'
})
df_ratings_breakdown['campus'] = df_ratings_breakdown['campus'].map({
'Brigham Young University': 'BYU',
'Penn State University Park': 'PSU',
'University of Illinois Urbana-Champaign': 'UIUC'
})
df_ratings_breakdown['review_rating'] = df_ratings_breakdown['review_rating'].astype(str)
df_ratings_breakdown['percentage'] = df_ratings_breakdown['num_reviews'] / df_ratings_breakdown.groupby('campus')['num_reviews'].transform('sum')
df_ratings_breakdown
fig = px.bar(
df_ratings_breakdown,
x='campus',
y='percentage',
color='review_rating',
color_discrete_map={
"1": "#EF5350",
"2": "#EF9A9A",
"3": "#FDD835",
"4": "#9CCC65",
"5": "#689F38"
},
labels={
'review_rating': 'Review Rating',
'campus': 'Campus',
'percentage': 'Percentage'
},
title='<b>Review rating breakdown by campus</b><br><span style="color: #aaa">UIUC has the highest proportion of 5 star reviews</span>',
text=df_ratings_breakdown.apply(lambda r: f"{'⭐' * int(r['review_rating'])} {'{0:.1f}%'.format(r['percentage'] * 100)}", axis=1),
template='simple_white',
height=650
)
fig.update_layout(
yaxis_tickformat=',.0%',
uniformtext_minsize=10,
uniformtext_mode='hide',
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.for_each_trace(lambda t: t.update(textfont_color='white'))
fig.show()
👉 Total number of reviews by campus¶
fig = px.bar(
df_ratings_breakdown,
x='num_reviews',
y='campus',
color='review_rating',
color_discrete_map={
"1": "#EF5350",
"2": "#EF9A9A",
"3": "#FDD835",
"4": "#9CCC65",
"5": "#689F38"
},
labels={
'review_rating': 'Review Rating',
'campus': 'Campus',
'percentage': 'Percentage',
'num_reviews': 'Number of reviews'
},
title='<b>Total number of reviews by campus</b><br><span style="color: #ccc;">UIUC has the highest number of total reviews</span>',
template='simple_white',
height=500
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.update_yaxes(categoryorder='total ascending')
fig.show()
👉 Review ratings over time¶
df_dt = df_m.copy()
df_dt['month'] = df_dt['review_datetime_utc'].dt.strftime('%Y-%m')
df_dt = df_dt[df_dt['review_datetime_utc'].dt.year >= 2015]
df_summary_by_year_month = df_dt.groupby(['month', 'campus'], as_index=False) \
.agg({
'review_id': 'count',
'review_rating': 'mean',
}) \
.rename(columns={
'review_id': 'num_reviews'
})
df_summary_by_year_month
df_dt = df_m.copy()
df_dt['year'] = df_dt['review_datetime_utc'].dt.year
df_dt = df_dt[df_dt['review_datetime_utc'].dt.year >= 2015]
df_summary_by_year = df_dt.groupby(['year', 'campus'], as_index=False) \
.agg({
'review_id': 'count',
'review_rating': 'mean',
}) \
.rename(columns={
'review_id': 'num_reviews'
})
df_summary_by_year['campus'] = df_summary_by_year['campus'].map({
'Brigham Young University': 'BYU',
'Penn State University Park': 'PSU',
'University of Illinois Urbana-Champaign': 'UIUC'
})
df_summary_by_year.head(3)
fig = px.line(
df_summary_by_year,
x='year',
y='review_rating',
color='campus',
template='simple_white',
labels={
'review_rating': 'Average Review Rating',
'year': 'Year',
'campus': 'Campus',
},
title='<b>Average review rating change over time by campus</b><br><span style="color: #aaa;">The sharp decline in review ratings of Provo, UT (BYU) is alarming</span>',
color_discrete_map={
'BYU': '#1FB3D1',
'PSU': '#001E44',
'UIUC': '#FF5F05'
},
height=500,
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
legend=dict(
font=dict(
size=11,
),
yanchor="top",
y=1,
xanchor="left",
x=0.02
),
showlegend=False,
)
annotations = [
{'bgcolor': '#1FB3D1'},
{'bgcolor': '#001E44'},
{'bgcolor': '#FF5F05'},
]
for i in range(df_summary_by_year['campus'].nunique()):
    annotation_year = df_summary_by_year['year'].max()
    campus = df_summary_by_year['campus'].unique()[i]
    y = df_summary_by_year.query(f"(year == {annotation_year}) & (campus == '{campus}')")['review_rating'].iloc[0]
    fig.add_annotation(
        x=annotation_year,
        y=y,
        text=campus,
        font=dict(
            color='white',
            size=11
        ),
        showarrow=True,
        arrowwidth=1.5,
        arrowhead=1,
        arrowsize=1,
        arrowcolor=annotations[i]['bgcolor'],
        bgcolor=annotations[i]['bgcolor'],
        xshift=5,
        ax=40,
        ay=0,
    )
fig.show()
👉 Positive/negative sentiments by campus¶
df_sentiment_by_campus = df_m.groupby(['campus', 'sentiment'], as_index=False) \
.agg({'review_id': 'count'})
df_sentiment_by_campus
fig = make_subplots(
rows=1,
cols=3,
specs=[[{'type': 'domain'}, {'type': 'domain'}, {'type': 'domain'}]],
subplot_titles=['BYU', 'PSU', 'UIUC']
)
pie_labels = ['Negative', 'Positive']
marker_colors = ['#ef5350', '#8bc34a']
def add_sentiment_trace(fig, col, campus):
    fig.add_trace(
        go.Pie(
            labels=pie_labels,
            values=df_m[df_m['campus'] == campus]['sentiment'].value_counts().sort_index(),
            text=pie_labels,
            marker_colors=marker_colors,
            textposition='inside',
            textfont=dict(
                size=10,
                color='white'
            ),
            insidetextorientation='horizontal',
            sort=False,
        ), 1, col
    )
add_sentiment_trace(fig, 1, 'Brigham Young University')
add_sentiment_trace(fig, 2, 'Penn State University Park')
add_sentiment_trace(fig, 3, 'University of Illinois Urbana-Champaign')
fig.update_layout(
title=dict(
text='<b>DistilBERT Sentiments by Campus</b><br><span style="color: #aaa;">Leasing companies around Brigham Young University (Provo, UT) have the worst sentiment</span>',
x=0,
y=0.9,
xanchor='left',
yanchor='top',
),
font_family='Helvetica, Arial, Inter, sans-serif',
showlegend=False,
margin=dict(
l=0,
r=0,
t=125,
b=50,
),
height=450
)
df_sentiment_by_campus
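The pie charts show the split visually; the exact negative-review share per campus, cited again in the closing thoughts, can be derived from the table above. A minimal sketch, assuming the sentiment labels are case variants of 'positive'/'negative' as the earlier .str.capitalize() call suggests:
# Negative-review share by campus, derived from df_sentiment_by_campus
# (columns from the cell above: campus, sentiment, and a review_id count).
df_negative_share = df_sentiment_by_campus.copy()
df_negative_share['share'] = (
    df_negative_share['review_id']
    / df_negative_share.groupby('campus')['review_id'].transform('sum')
)
df_negative_share[df_negative_share['sentiment'].str.lower() == 'negative'] \
    .sort_values('share', ascending=False)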
👉 Review length distribution by rating¶
df_review_len = df_m.copy()
df_review_len['review_length'] = df_review_len['review_text'].str.len()
df_review_len['review_rating'] = df_review_len['review_rating'].astype(str)
# keep only reviews shorter than 2,000 characters to remove outliers
df_review_len = df_review_len[df_review_len['review_length'] < 2000]
fig = px.box(
df_review_len,
x='review_length',
y='review_rating',
color='review_rating',
color_discrete_map={
"1": "#EF5350",
"2": "#EF9A9A",
"3": "#FDD835",
"4": "#9CCC65",
"5": "#689F38"
},
labels={
'review_rating': 'Review Rating',
'review_length': 'Review Length'
},
template='simple_white',
title='<b>Review length (number of characters) distribution by rating</b><br><span style="color: #aaa">Unsatisfied tenants have more to say</span>',
height=600
)
fig.update_yaxes(categoryorder='category ascending')
fig.update_layout(
showlegend=False,
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.show()
👉 Correlation between review length and number of likes¶
fig = px.scatter(
df_review_len[df_review_len['review_likes'] > 0],
x='review_length',
y='review_likes',
labels={
'review_likes': 'Review Likes',
'review_length': 'Review Length'
},
template='simple_white',
title='<b>Review length vs number of likes</b><br>\
<span style="color: #aaa">Longer reviews receive more likes on average</span>',
trendline="ols",
trendline_color_override="#D32F2F",
height=600
)
fig.update_layout(
showlegend=False,
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.update_traces(
marker=dict(
color='#CFD8DC',
size=3,
),
selector=dict(mode='markers')
)
# calculate y coordinate to add annotation
r = px.get_trendline_results(fig).iloc[0]['px_fit_results'].params
annotation_x = 1500
annotation_y = r[0] + r[1] * annotation_x
fig.add_annotation(
x=annotation_x,
y=annotation_y,
text=f"A weak positive correlation (R²={round(px.get_trendline_results(fig).iloc[0]['px_fit_results'].rsquared, 2)})",
font=dict(
size=14
),
showarrow=True,
arrowhead=2,
bgcolor='white',
)
fig.show()
✏️ Keyword Analysis¶
👉 Frequencies¶
Convert tokens to lowercase and find frequencies.
df_tokens_lower = df_tokens.copy()
df_tokens_lower['lemma'] = df_tokens_lower['lemma'].str.lower()
df_tokens_lower.rename(columns={'lemma': 'token'}, inplace=True)
df_tokens_lower.head(3)
df_token_counts = df_tokens_lower[['review_id', 'token']] \
.drop_duplicates()['token'] \
.value_counts() \
.to_frame() \
.reset_index()
df_token_counts
interesting_keywords = [
'deposit', 'parking', 'gym', 'internet',
'laundry', 'washer', 'leak', 'pool',
'package', 'expensive', 'maintenance', 'construction',
'noise', 'pet', 'bug', 'elevator',
]
df_interesting_keywords_count = df_token_counts[
df_token_counts['token'].isin(interesting_keywords)
]
df_interesting_keywords_count
fig = px.bar(
df_interesting_keywords_count,
x='count',
y='token',
labels={
'token': 'Keyword',
'count': 'Frequency',
},
title='<b>Review keyword frequencies</b><br><span style="color: #ccc;">Maintenance is the most frequently discussed keyword in reviews</span>',
template='simple_white',
height=700
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.update_traces(marker_color='black')
fig.update_yaxes(categoryorder='total ascending')
fig.show()
👉 Positive/negative associations with tokens¶
Keep only the interesting keywords, then attach each review's sentiment by joining on review_id.
df_interesting_tokens = df_tokens_lower[df_tokens_lower['token'].isin(interesting_keywords)]
df_interesting_token_sentiments = pd.merge(
left=df_r[['place_id', 'review_id', 'sentiment']],
right=df_interesting_tokens[['review_id', 'token']],
how='inner',
on='review_id'
)
df_interesting_token_sentiments
df_interesting_token_sentiment_counts = df_interesting_token_sentiments \
.groupby(['token', 'sentiment'], as_index=False) \
.agg({'review_id': 'count'}) \
.rename(columns={'review_id': 'count'})
df_interesting_token_sentiment_counts['percentage'] = df_interesting_token_sentiment_counts['count'] / df_interesting_token_sentiment_counts.groupby('token')['count'].transform('sum')
df_interesting_token_sentiment_counts['sentiment'] = df_interesting_token_sentiment_counts['sentiment'].str.capitalize()
df_interesting_token_sentiment_counts.sort_values(
['sentiment', 'percentage'],
ascending=[True, True],
inplace=True
)
df_interesting_token_sentiment_counts.head(10)
fig = px.bar(
df_interesting_token_sentiment_counts,
x='percentage',
y='token',
template='simple_white',
color='sentiment',
color_discrete_map={
'Positive': '#8bc34a',
'Negative': '#ef5350'
},
labels={'token': 'Keyword', 'sentiment': 'Sentiment', 'percentage': 'Percentage'},
title='<b>Keyword vs review sentiment associations</b><br><span style="color: #aaa;">Who doesn\'t hate leaks, security deposit scams, and broken elevators?</span>',
text=df_interesting_token_sentiment_counts.apply(
lambda r: f"{'👍' if r['sentiment'] == 'Positive' else '👎'} \
{'{0:.1f}%'.format(r['percentage'] * 100)}", axis=1),
height=800
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
xaxis_tickformat=',.0%',
showlegend=False,
)
fig.for_each_trace(lambda t: t.update(textfont_color='white'))
fig.show()
Add management company names and campus information to df_interesting_token_sentiments.
df_tokens_places = pd.merge(
left=df_interesting_token_sentiments.drop_duplicates()[['place_id', 'sentiment', 'token']],
right=df_b[['place_id', 'name', 'campus']],
on='place_id'
)
df_tokens_places
Get the number of reviews by each business.
df_place_review_counts = df_r[df_r['review_text'].notna()].groupby('place_id', as_index=False)['review_id'].count() \
.rename(columns={'review_id': 'num_reviews'})
df_place_review_counts
For each company, compute the share of text reviews that mention each keyword, and keep only the company-keyword pairs where at least 10% of the reviews mention the keyword.
df_tokens_places_summary = df_tokens_places.groupby(['place_id', 'name', 'campus', 'token'], as_index=False) \
.agg({'sentiment': 'size'}) \
.rename(columns={'sentiment': 'unique_frequency'}) \
.merge(
right=df_place_review_counts,
on='place_id'
)
df_tokens_places_summary['campus'] = df_tokens_places_summary['campus'].map(campus_acronyms_map)
df_tokens_places_summary['percentage'] = df_tokens_places_summary['unique_frequency'] / df_tokens_places_summary['num_reviews']
df_tokens_places_summary = df_tokens_places_summary[df_tokens_places_summary['percentage'] >= 0.1]
df_tokens_places_summary.sort_values('percentage', ascending=False)
# a function to create a bar chart of the businesses with the
# highest relative frequency of reviews mentioning a specific keyword
def generate_token_percentages_bar_chart(
    df_tokens_places_summary,
    keyword,
    chart_subtitle
):
    df_keyword_top5_places = df_tokens_places_summary[df_tokens_places_summary['token'] == keyword] \
        .sort_values('percentage', ascending=False) \
        .head(5)

    # highlight the top business in red and the rest in gray
    color_discrete_sequence = ['#cfd8dc'] * df_keyword_top5_places.shape[0]
    color_discrete_sequence[0] = '#d32f2f'

    fig = px.bar(
        df_keyword_top5_places,
        x='name',
        y='percentage',
        labels={
            'name': 'Business',
            'percentage': 'Percentage',
            'campus': 'Campus',
            'num_reviews': 'Total number of reviews',
        },
        title=f'<b>Percentage of text reviews with the keyword "{keyword}"</b><br><span style="color: #ccc;">{chart_subtitle}</span>',
        template='simple_white',
        color='name',
        color_discrete_sequence=color_discrete_sequence,
        hover_name='name',
        hover_data=['campus', 'num_reviews'],
        text='campus',
        height=550
    )
    fig.update_traces(textposition='inside')
    fig.update_layout(
        yaxis_tickformat=',.0%',
        font_family='Helvetica, Inter, Arial, sans-serif',
        showlegend=False,
    )
    return fig
fig = generate_token_percentages_bar_chart(
df_tokens_places_summary,
'deposit',
'The University Group is notorious for not returning security deposits'
)
fig.show()
fig = generate_token_percentages_bar_chart(
df_tokens_places_summary,
'elevator',
'309 Green only has 2 elevators for a 24-story building'
)
fig.show()
fig = generate_token_percentages_bar_chart(
df_tokens_places_summary,
'laundry',
'Provo seems to have more apartments with shared laundry facilities'
)
fig.show()
fig = generate_token_percentages_bar_chart(
df_tokens_places_summary,
'parking',
'Parking seems more limited in Provo'
)
fig.show()
fig = generate_token_percentages_bar_chart(
df_tokens_places_summary,
'expensive',
'Apartments in UIUC surprisingly do not make the list here'
)
fig.show()
👉 Percentage of tokens by campus¶
Get the number of reviews with review text by each campus.
df_campus_review_counts = df_m[df_m['review_text'].notna()].groupby('campus', as_index=False)['review_id'].count() \
.rename(columns={'review_id': 'num_reviews'})
df_campus_review_counts
df_tokens_campus_summary = df_tokens_places.groupby(['campus', 'token'], as_index=False) \
.agg({'sentiment': 'size'}) \
.rename(columns={'sentiment': 'unique_frequency'}) \
.merge(
right=df_campus_review_counts,
on='campus'
)
df_tokens_campus_summary['percentage'] = df_tokens_campus_summary['unique_frequency'] / df_tokens_campus_summary['num_reviews']
df_tokens_campus_summary['campus'] = df_tokens_campus_summary['campus'].map({
'Brigham Young University': 'BYU',
'Penn State University Park': 'PSU',
'University of Illinois Urbana-Champaign': 'UIUC'
})
df_tokens_campus_summary.sort_values(['campus', 'token'], inplace=True)
df_tokens_campus_summary.head()
df_campus_positive_keywords = df_tokens_campus_summary[df_tokens_campus_summary['token'] \
.isin(['gym', 'pool', 'pet'])] \
.sort_values(['campus', 'token'])
df_campus_positive_keywords.head(3)
df_campus_negative_keywords = df_tokens_campus_summary[df_tokens_campus_summary['token'] \
.isin(['leak', 'deposit', 'elevator'])] \
.sort_values(['campus', 'token'])
df_campus_negative_keywords.head(3)
df_campus_misc_keywords = df_tokens_campus_summary[df_tokens_campus_summary['token'] \
.isin(['parking', 'expensive', 'construction', 'package'])] \
.sort_values(['campus', 'token'])
df_campus_misc_keywords.head(3)
fig = px.bar(
df_campus_positive_keywords,
x='token',
y='percentage',
color='campus',
color_discrete_map={
'BYU': '#1FB3D1',
'PSU': '#001E44',
'UIUC': '#FF5F05'
},
title='<b>Percentage of reviews that include specific positive keywords</b><br>\
<span style="color: #ccc;">Provo, UT\'s summer is hotter than State College, PA and Champaign, IL</span>',
labels={
'percentage': 'Percentage',
'token': 'Keyword',
'campus': 'Campus'
},
height=500,
template='simple_white',
barmode='group'
)
fig.update_layout(
yaxis_tickformat=',.0%',
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.show()
fig = px.bar(
df_campus_negative_keywords,
x='token',
y='percentage',
color='campus',
color_discrete_map={
'BYU': '#1FB3D1',
'PSU': '#001E44',
'UIUC': '#FF5F05'
},
title='<b>Percentage of reviews that include specific negative keywords</b><br>\
<span style="color: #ccc;">BYU surprisingly has the highest proportion of reviews related to security deposits</span>',
labels={
'percentage': 'Percentage',
'token': 'Keyword',
'campus': 'Campus'
},
height=500,
template='simple_white',
barmode='group'
)
fig.update_layout(
yaxis_tickformat=',.0%',
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.show()
fig = px.bar(
df_campus_misc_keywords,
x='token',
y='percentage',
color='campus',
color_discrete_map={
'BYU': '#1FB3D1',
'PSU': '#001E44',
'UIUC': '#FF5F05'
},
title='<b>Percentage of reviews that include specific miscellaneous keywords</b><br>\
<span style="color: #ccc;">Parking is important to tenants at Provo, UT</span>',
labels={
'percentage': 'Percentage',
'token': 'Keyword',
'campus': 'Campus'
},
height=500,
template='simple_white',
barmode='group'
)
fig.update_layout(
yaxis_tickformat=',.0%',
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.show()
⚖️ Closing Thoughts¶
- What a surprise! UIUC has the highest proportion of 5-star reviews and the best tenant sentiment of the three college towns.
- BYU has the worst tenant sentiment.
- 48% of reviews around BYU (Provo, UT) are negative. This far exceeds UIUC and PSU, which range from about 34% to 36%.
- The University Group @ UIUC leads in complaints about security deposits that were not properly returned.
- "Gym", "Pool", and "Pet" are often associated with positive reviews.
- Common keywords in negative reviews are "leak", "deposit", "elevator", "construction", "bug", "parking", and "noise".
- "Pool" and "Parking" are more relevant to tenants at BYU.
- The average review ratings in UIUC and PSU have been improving.
- On the other hand, this is not a good look for management companies at BYU.
- Landlords with higher average ratings respond faster on average than landlords with lower average ratings; a quick sketch of how to check this directly follows below.
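The earlier charts relate response time to review sentiment rather than to a company's average rating, so here is one way the last point could be checked. This is a minimal sketch, assuming the df_m columns built above (review_rating, response_time_in_days) and considering only reviews that actually received a response:
# Per-company average rating vs. average response time (responded reviews only).
df_company = df_m[df_m['response_time_in_days'] > 0] \
    .groupby(['place_id', 'name'], as_index=False) \
    .agg({'review_rating': 'mean', 'response_time_in_days': 'mean'})

# A negative correlation here would support the claim that higher-rated
# landlords respond faster on average.
df_company[['review_rating', 'response_time_in_days']].corr()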