✨ Background¶
I first moved into an apartment at UIUC during the summer of 2007. The apartment was Gregory Place, managed by a company called JSM. Here is a photo of the apartment from 2007; the construction site shows the second building being built.
Back in 2007, JSM was regarded as one of the better management companies, along with Royse Brinkmeyer and Roland. Almost two decades later, JSM is still highly regarded. In recent years, prospective tenants lining up in front of JSM's office from the early morning to secure a lease was a sight to see.
Photo by Jacob Slabosz @ Daily Illini
After talking to hundreds of students over the years about their landlords in Champaign-Urbana, I came to the conclusion that many of the landlords here must be among the worst. From The University Group's notorious attempts to hijack tenants' security deposits to enduring cold showers in the middle of winter at Seven07, finding a good landlord or a headache-free apartment sounds impossible.
But this got me thinking - are the landlords in Champaign-Urbana really the worst compared to other similar college towns?
♟️ About the Dataset¶
In this post, I compare the property management companies in the Champaign-Urbana area to those of two other college towns - Provo, UT (Brigham Young University) and State College, PA (Penn State University).
Data collection and preparation¶
- 142 property management companies across the three college towns have been hand-selected.
- The companies' information on Google Maps were scraped using Outscraper (a cloud-based scraping service).
- About 17,000 reviews for the property management companies were scraped using Outscraper.
- The review texts were tokenized using spaCy. The tokenization and additional preprocessing steps are not included in this post; a minimal sketch of the general approach follows this list.
- The post uses sentiment analysis results of the reviews using DistilBERT (distilbert-base-uncased-finetuned-sst-2-english).
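The preprocessing code itself is not part of this post, but the general approach can be sketched. The snippet below is an illustrative sketch only: it assumes spaCy's en_core_web_sm model and the Hugging Face transformers pipeline with the same DistilBERT checkpoint, and the review text is made up; the actual pipeline (batching, cleaning, token filtering) may differ.
import spacy
from transformers import pipeline

# Illustrative sketch only - not the exact preprocessing used to build the dataset.
nlp = spacy.load('en_core_web_sm')  # assumes the small English model is installed
sentiment_model = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english'
)

# Hypothetical review text for demonstration purposes
review_text = "Maintenance never showed up and we did not get our deposit back."

# One row per token with its surface form, lemma, and part-of-speech explanation,
# mirroring the "text", "lemma", and "explain" columns of the tokens table.
tokens = [
    {'text': tok.text, 'lemma': tok.lemma_, 'explain': spacy.explain(tok.pos_)}
    for tok in nlp(review_text)
]

# One sentiment label (POSITIVE/NEGATIVE) with a confidence score per review.
sentiment = sentiment_model(review_text)[0]

print(tokens[:3])
print(sentiment)
In the actual dataset, each token row keeps its review_id so the tokens can later be joined back to the reviews and sentiments.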
Import packages.
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly
print(f'pandas version: {pd.__version__}')
print(f'plotly version: {plotly.__version__}')
Set the maximum number of displayed columns to 50.
pd.set_option('display.max_columns', 50)
Load and Preprocess Data¶
There are four tables.
- df_b: List of property management companies or apartments
- df_r: Google reviews of the property management companies or apartments
- df_tokens: Tokenized result of reviews using spaCy
- df_sentiments: Sentiment analysis result using DistilBERT
df_b = pd.read_csv('https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/businesses.csv')
df_r = pd.read_csv(
'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/reviews.csv',
parse_dates = ['review_datetime_utc', 'owner_answer_timestamp_datetime_utc']
)
df_tokens = pd.read_csv('https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/tokens.csv.gz')
df_sentiments = pd.read_csv('https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/sentiments.csv')
Businesses dataset¶
Display the businesses (property management companies) table.
df_b.head()
print(f"There are {df_b.shape[0]} rows and {df_b.shape[1]} columns in the businesses table.")
Print out the DataFrame's information.
df_b.info()
Check for missing values.
df_b.isna().sum()
The "site" and "borough" columns have missing values.
Reviews dataset¶
Display the reviews table.
df_r.head()
print(f"There are {df_r.shape[0]} rows and {df_r.shape[1]} columns in the reviews table.")
Check for missing values.
df_r.isna().sum()
Columns "review_text", "review_img_url", "owner_answer", and "owner_answer_timestamp_datetime_utc" have missing values.
Sentiments¶
Display the sentiments table.
df_sentiments.head()
Check for missing values.
df_sentiments.isna().sum()
There are no missing values in the sentiments table.
Merge sentiments into reviews¶
# merge only if the sentiment column is missing, so re-running the cell does not duplicate it
if 'sentiment' not in df_r.columns:
    df_r = pd.merge(
        left=df_r,
        right=df_sentiments,
        on='review_id',
        how='left'
    )
df_r.head()
Tokens¶
Display the list of tokens in their original form ("text"), lemmatized form ("lemma"), and their part-of-speech explanations ("explain").
df_tokens
Merge reviews and businesses¶
df_m = pd.merge(
left=df_r,
right=df_b,
on='place_id',
how='inner'
)
df_m.head(2)
Metadata for visualizations¶
campus_acronyms_map = {
'University of Illinois Urbana-Champaign': 'UIUC',
'Brigham Young University': 'BYU',
'Penn State University Park': 'PSU',
}
👉 Number of apartments listed by campus¶
df_b['campus'] \
.value_counts() \
.to_frame() \
.reset_index()
👉 Calculate the time owner took to respond to a review¶
td = df_m['owner_answer_timestamp_datetime_utc'] - df_m['review_datetime_utc']
response_time_in_days = td / pd.Timedelta(days=1)
# round the response time if it is a positive number of days;
# if the response time is negative, the review was updated after the
# owner's reply, so the true response time cannot be determined
df_m['response_time_in_days'] = np.where(
response_time_in_days > 0,
round(response_time_in_days, 0),
response_time_in_days
)
df_m[['review_id', 'owner_answer_timestamp_datetime_utc', 'review_datetime_utc', 'response_time_in_days']].sample(2)
Add a categorical variable based on the 'response_time_in_days' variable.
df_m['response_time'] = np.select(
[
df_m['response_time_in_days'].between(-10000, 0, inclusive='left'),
df_m['response_time_in_days'].between(0, 1, inclusive='left'),
df_m['response_time_in_days'].between(1, 7, inclusive='left'),
df_m['response_time_in_days'].between(7, 30, inclusive='left'),
df_m['response_time_in_days'].between(30, 10000, inclusive='left'),
],
[
'Unknown (review updated)',
'Within a day',
'Within a week',
'Within a month',
'After a month'
],
default='No response'
)
df_m.head(2)
👉 Owner's response time (pie chart)¶
df_response_time_counts = df_m['response_time'].value_counts().to_frame().reset_index()
df_response_time_counts
fig = px.pie(
df_response_time_counts,
names='response_time',
values='count',
title='<b>Owner response times</b><br><span style="color: #aaa;">Owners do not respond 40% of the time</span>',
height=500,
template='simple_white',
color='response_time',
color_discrete_map={
"Within a day": "#689F38",
"No response": "#FDD835",
"Within a week": "#9CCC65",
"Within a month": "#EF9A9A",
"After a month": "#EF5350",
"Unknown (review updated)": "#ccc"
},
labels={
'response_time': 'Response Time'
},
)
fig.update_traces(
textinfo='percent+label',
textposition='outside',
showlegend=False
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.show()
👉 Owner's response time (histogram)¶
fig = px.histogram(
df_m[(df_m['response_time_in_days'] >= 0) & (df_m['response_time_in_days'] < 30)],
x='response_time_in_days',
template='simple_white',
title='<b>Owner\'s response time in days (<30 days)</b><br><span style="color: #aaa">Most owners who respond do so within a week</span>',
height=500,
labels={
'response_time_in_days': 'Response time in days'
},
color_discrete_sequence=['black']
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
yaxis_title_text='Count'
)
fig.add_vrect(
x0=-0.5,
x1=7.5,
annotation_text="Within a week",
annotation_position="top right",
annotation=dict(
font_color='#7CB342',
font_size=11
),
fillcolor="#4CAF50",
opacity=0.07,
line_width=0
)
fig.add_vline(x=7.5, line_width=1, line_dash="solid", line_color="#8bc34a")
fig.show()
df_response_sentiment_count = df_m \
.groupby(['response_time', 'sentiment'], as_index=False) \
.agg({'review_id': 'count'}) \
.rename(columns={'review_id': 'count'})
df_response_sentiment_count['sentiment'] = df_response_sentiment_count['sentiment'].str.capitalize()
df_response_sentiment_count['percentage'] = df_response_sentiment_count['count'] / df_response_sentiment_count.groupby('response_time')['count'].transform('sum')
df_response_sentiment_count
👉 Owner's response time vs sentiment¶
fig = px.bar(
df_response_sentiment_count,
x='percentage',
y='response_time',
category_orders={
'response_time': [
'Within a day',
'Within a week',
'Within a month',
'After a month',
'No response',
'Unknown (review updated)'
],
'sentiment': ['Positive', 'Negative']
},
template='simple_white',
color='sentiment',
color_discrete_map={
'Positive': '#8bc34a',
'Negative': '#ef5350'
},
labels={'sentiment': 'Sentiment'},
title='<b>Owner\'s response time</b><br><span style="color: #aaa;">Negative reviews are associated with slower owner responses</span>',
text=df_response_sentiment_count.apply(lambda r: f"{'👍' if r['sentiment'] == 'Positive' else '👎'} {'{0:.1f}%'.format(r['percentage'] * 100)}", axis=1),
height=500
)
fig.update_layout(
xaxis_title_text='Percentage',
yaxis_title_text='Owner\'s response time',
font_family='Helvetica, Inter, Arial, sans-serif',
xaxis_tickformat=',.0%',
)
fig.for_each_trace(lambda t: t.update(textfont_color='white'))
fig.show()
👉 Review rating breakdown by campus¶
df_ratings_breakdown = df_m.groupby(
['campus', 'review_rating'],
as_index=False
).agg({
'review_id': 'count'
}).rename(columns={
'review_id': 'num_reviews'
})
df_ratings_breakdown['campus'] = df_ratings_breakdown['campus'].map({
'Brigham Young University': 'BYU',
'Penn State University Park': 'PSU',
'University of Illinois Urbana-Champaign': 'UIUC'
})
df_ratings_breakdown['review_rating'] = df_ratings_breakdown['review_rating'].astype(str)
df_ratings_breakdown['percentage'] = df_ratings_breakdown['num_reviews'] / df_ratings_breakdown.groupby('campus')['num_reviews'].transform('sum')
df_ratings_breakdown
fig = px.bar(
df_ratings_breakdown,
x='campus',
y='percentage',
color='review_rating',
color_discrete_map={
"1": "#EF5350",
"2": "#EF9A9A",
"3": "#FDD835",
"4": "#9CCC65",
"5": "#689F38"
},
labels={
'review_rating': 'Review Rating',
'campus': 'Campus',
'percentage': 'Percentage'
},
title='<b>Review rating breakdown by campus</b><br><span style="color: #aaa">UIUC has the highest proportion of 5 star reviews</span>',
text=df_ratings_breakdown.apply(lambda r: f"{'⭐' * int(r['review_rating'])} {'{0:.1f}%'.format(r['percentage'] * 100)}", axis=1),
template='simple_white',
height=650
)
fig.update_layout(
yaxis_tickformat=',.0%',
uniformtext_minsize=10,
uniformtext_mode='hide',
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.for_each_trace(lambda t: t.update(textfont_color='white'))
fig.show()
👉 Total number of reviews by campus¶
fig = px.bar(
df_ratings_breakdown,
x='num_reviews',
y='campus',
color='review_rating',
color_discrete_map={
"1": "#EF5350",
"2": "#EF9A9A",
"3": "#FDD835",
"4": "#9CCC65",
"5": "#689F38"
},
labels={
'review_rating': 'Review Rating',
'campus': 'Campus',
'percentage': 'Percentage',
'num_reviews': 'Number of reviews'
},
title='<b>Total number of reviews by campus</b><br><span style="color: #ccc;">UIUC has the highest number of total reviews</span>',
template='simple_white',
height=500
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.update_yaxes(categoryorder='total ascending')
fig.show()
👉 Review ratings over time¶
df_dt = df_m.copy()
df_dt['month'] = df_dt['review_datetime_utc'].dt.strftime('%Y-%m')
df_dt = df_dt[df_dt['review_datetime_utc'].dt.year >= 2015]
df_summary_by_year_month = df_dt.groupby(['month', 'campus'], as_index=False) \
.agg({
'review_id': 'count',
'review_rating': 'mean',
}) \
.rename(columns={
'review_id': 'num_reviews'
})
df_summary_by_year_month
df_dt = df_m.copy()
df_dt['year'] = df_dt['review_datetime_utc'].dt.year
df_dt = df_dt[df_dt['review_datetime_utc'].dt.year >= 2015]
df_summary_by_year = df_dt.groupby(['year', 'campus'], as_index=False) \
.agg({
'review_id': 'count',
'review_rating': 'mean',
}) \
.rename(columns={
'review_id': 'num_reviews'
})
df_summary_by_year['campus'] = df_summary_by_year['campus'].map({
'Brigham Young University': 'BYU',
'Penn State University Park': 'PSU',
'University of Illinois Urbana-Champaign': 'UIUC'
})
df_summary_by_year.head(3)
fig = px.line(
df_summary_by_year,
x='year',
y='review_rating',
color='campus',
template='simple_white',
labels={
'review_rating': 'Average Review Rating',
'year': 'Year',
'campus': 'Campus',
},
title='<b>Average review rating change over time by campus</b><br><span style="color: #aaa;">The sharp decline in review ratings of Provo, UT (BYU) is alarming</span>',
color_discrete_map={
'BYU': '#1FB3D1',
'PSU': '#001E44',
'UIUC': '#FF5F05'
},
height=500,
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
legend=dict(
font=dict(
size=11,
),
yanchor="top",
y=1,
xanchor="left",
x=0.02
),
showlegend=False,
)
annotations = [
{'bgcolor': '#1FB3D1'},
{'bgcolor': '#001E44'},
{'bgcolor': '#FF5F05'},
]
for i in range(df_summary_by_year['campus'].nunique()):
    annotation_year = df_summary_by_year['year'].max()
    campus = df_summary_by_year['campus'].unique()[i]
    y = df_summary_by_year.query(f"(year == {annotation_year}) & (campus == '{campus}')")['review_rating'].iloc[0]
    fig.add_annotation(
        x=annotation_year,
        y=y,
        text=campus,
        font=dict(
            color='white',
            size=11
        ),
        showarrow=True,
        arrowwidth=1.5,
        arrowhead=1,
        arrowsize=1,
        arrowcolor=annotations[i]['bgcolor'],
        bgcolor=annotations[i]['bgcolor'],
        xshift=5,
        ax=40,
        ay=0,
    )
fig.show()
👉 Positive/negative sentiments by campus¶
df_sentiment_by_campus = df_m.groupby(['campus', 'sentiment'], as_index=False) \
.agg({'review_id': 'count'})
df_sentiment_by_campus
fig = make_subplots(
rows=1,
cols=3,
specs=[[{'type': 'domain'}, {'type': 'domain'}, {'type': 'domain'}]],
subplot_titles=['BYU', 'PSU', 'UIUC']
)
pie_labels = ['Negative', 'Positive']
marker_colors = ['#ef5350', '#8bc34a']
def add_sentiment_trace(fig, col, campus):
    fig.add_trace(
        go.Pie(
            labels=pie_labels,
            values=df_m[df_m['campus'] == campus]['sentiment'].value_counts().sort_index(),
            text=pie_labels,
            marker_colors=marker_colors,
            textposition='inside',
            textfont=dict(
                size=10,
                color='white'
            ),
            insidetextorientation='horizontal',
            sort=False,
        ), 1, col
    )
add_sentiment_trace(fig, 1, 'Brigham Young University')
add_sentiment_trace(fig, 2, 'Penn State University Park')
add_sentiment_trace(fig, 3, 'University of Illinois Urbana-Champaign')
fig.update_layout(
title=dict(
text='<b>DistilBERT Sentiments by Campus</b><br><span style="color: #aaa;">Leasing companies around Brigham Young University (Provo, UT) have the worst sentiment</span>',
x=0,
y=0.9,
xanchor='left',
yanchor='top',
),
font_family='Helvetica, Arial, Inter, sans-serif',
showlegend=False,
margin=dict(
l=0,
r=0,
t=125,
b=50,
),
height=450
)
df_sentiment_by_campus
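The pie charts show the split visually; the exact negative-review share per campus, cited again in the closing thoughts, can be derived from the table above. A minimal sketch, assuming the sentiment labels are case variants of 'positive'/'negative' as the earlier .str.capitalize() call suggests:
# Negative-review share by campus, derived from df_sentiment_by_campus
# (columns from the cell above: campus, sentiment, and a review_id count).
df_negative_share = df_sentiment_by_campus.copy()
df_negative_share['share'] = (
    df_negative_share['review_id']
    / df_negative_share.groupby('campus')['review_id'].transform('sum')
)
df_negative_share[df_negative_share['sentiment'].str.lower() == 'negative'] \
    .sort_values('share', ascending=False)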
👉 Review length distribution by rating¶
df_review_len = df_m.copy()
df_review_len['review_length'] = df_review_len['review_text'].str.len()
df_review_len['review_rating'] = df_review_len['review_rating'].astype(str)
# keep only reviews shorter than 2,000 characters to remove outliers
df_review_len = df_review_len[df_review_len['review_length'] < 2000]
fig = px.box(
df_review_len,
x='review_length',
y='review_rating',
color='review_rating',
color_discrete_map={
"1": "#EF5350",
"2": "#EF9A9A",
"3": "#FDD835",
"4": "#9CCC65",
"5": "#689F38"
},
labels={
'review_rating': 'Review Rating',
'review_length': 'Review Length'
},
template='simple_white',
title='<b>Review length (number of characters) distribution by rating</b><br><span style="color: #aaa">Unsatisfied tenants have more to say</span>',
height=600
)
fig.update_yaxes(categoryorder='category ascending')
fig.update_layout(
showlegend=False,
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.show()
👉 Correlation between review length and number of likes¶
fig = px.scatter(
df_review_len[df_review_len['review_likes'] > 0],
x='review_length',
y='review_likes',
labels={
'review_likes': 'Review Likes',
'review_length': 'Review Length'
},
template='simple_white',
title='<b>Review length vs number of likes</b><br>\
<span style="color: #aaa">Longer reviews receive more likes on average</span>',
trendline="ols",
trendline_color_override="#D32F2F",
height=600
)
fig.update_layout(
showlegend=False,
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.update_traces(
marker=dict(
color='#CFD8DC',
size=3,
),
selector=dict(mode='markers')
)
# calculate y coordinate to add annotation
r = px.get_trendline_results(fig).iloc[0]['px_fit_results'].params
annotation_x = 1500
annotation_y = r[0] + r[1] * annotation_x
fig.add_annotation(
x=annotation_x,
y=annotation_y,
text=f"A weak positive correlation (R²={round(px.get_trendline_results(fig).iloc[0]['px_fit_results'].rsquared, 2)})",
font=dict(
size=14
),
showarrow=True,
arrowhead=2,
bgcolor='white',
)
fig.show()
✏️ Keyword Analysis¶
👉 Frequencies¶
Convert tokens to lowercase and find frequencies.
df_tokens_lower = df_tokens.copy()
df_tokens_lower['lemma'] = df_tokens_lower['lemma'].str.lower()
df_tokens_lower.rename(columns={'lemma': 'token'}, inplace=True)
df_tokens_lower.head(3)
df_token_counts = df_tokens_lower[['review_id', 'token']] \
.drop_duplicates()['token'] \
.value_counts() \
.to_frame() \
.reset_index()
df_token_counts
interesting_keywords = [
'deposit', 'parking', 'gym', 'internet',
'laundry', 'washer', 'leak', 'pool',
'package', 'expensive', 'maintenance', 'construction',
'noise', 'pet', 'bug', 'elevator',
]
df_interesting_keywords_count = df_token_counts[
df_token_counts['token'].isin(interesting_keywords)
]
df_interesting_keywords_count
fig = px.bar(
df_interesting_keywords_count,
x='count',
y='token',
labels={
'token': 'Keyword',
'count': 'Frequency',
},
title='<b>Review keyword frequencies</b><br><span style="color: #ccc;">Maintenance is the most frequently discussed keyword in reviews</span>',
template='simple_white',
height=700
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.update_traces(marker_color='black')
fig.update_yaxes(categoryorder='total ascending')
fig.show()
👉 Positive/negative associations with tokens¶
Keep only the interesting keywords, then attach each review's sentiment by joining on review_id.
df_interesting_tokens = df_tokens_lower[df_tokens_lower['token'].isin(interesting_keywords)]
df_interesting_token_sentiments = pd.merge(
left=df_r[['place_id', 'review_id', 'sentiment']],
right=df_interesting_tokens[['review_id', 'token']],
how='inner',
on='review_id'
)
df_interesting_token_sentiments
df_interesting_token_sentiment_counts = df_interesting_token_sentiments \
.groupby(['token', 'sentiment'], as_index=False) \
.agg({'review_id': 'count'}) \
.rename(columns={'review_id': 'count'})
df_interesting_token_sentiment_counts['percentage'] = df_interesting_token_sentiment_counts['count'] / df_interesting_token_sentiment_counts.groupby('token')['count'].transform('sum')
df_interesting_token_sentiment_counts['sentiment'] = df_interesting_token_sentiment_counts['sentiment'].str.capitalize()
df_interesting_token_sentiment_counts.sort_values(
['sentiment', 'percentage'],
ascending=[True, True],
inplace=True
)
df_interesting_token_sentiment_counts.head(10)
fig = px.bar(
df_interesting_token_sentiment_counts,
x='percentage',
y='token',
template='simple_white',
color='sentiment',
color_discrete_map={
'Positive': '#8bc34a',
'Negative': '#ef5350'
},
labels={'token': 'Keyword', 'sentiment': 'Sentiment', 'percentage': 'Percentage'},
title='<b>Keyword vs review sentiment associations</b><br><span style="color: #aaa;">Who doesn\'t hate leaks, security deposit scams, and broken elevators?</span>',
text=df_interesting_token_sentiment_counts.apply(
lambda r: f"{'👍' if r['sentiment'] == 'Positive' else '👎'} \
{'{0:.1f}%'.format(r['percentage'] * 100)}", axis=1),
height=800
)
fig.update_layout(
font_family='Helvetica, Inter, Arial, sans-serif',
xaxis_tickformat=',.0%',
showlegend=False,
)
fig.for_each_trace(lambda t: t.update(textfont_color='white'))
fig.show()
Add management company names and campus information to df_interesting_token_sentiments.
df_tokens_places = pd.merge(
left=df_interesting_token_sentiments.drop_duplicates()[['place_id', 'sentiment', 'token']],
right=df_b[['place_id', 'name', 'campus']],
on='place_id'
)
df_tokens_places
Get the number of reviews by each business.
df_place_review_counts = df_r[df_r['review_text'].notna()].groupby('place_id', as_index=False)['review_id'].count() \
.rename(columns={'review_id': 'num_reviews'})
df_place_review_counts
For each company, compute the share of text reviews that mention each keyword, and keep only the company-keyword pairs where at least 10% of the reviews mention the keyword.
df_tokens_places_summary = df_tokens_places.groupby(['place_id', 'name', 'campus', 'token'], as_index=False) \
.agg({'sentiment': 'size'}) \
.rename(columns={'sentiment': 'unique_frequency'}) \
.merge(
right=df_place_review_counts,
on='place_id'
)
df_tokens_places_summary['campus'] = df_tokens_places_summary['campus'].map(campus_acronyms_map)
df_tokens_places_summary['percentage'] = df_tokens_places_summary['unique_frequency'] / df_tokens_places_summary['num_reviews']
df_tokens_places_summary = df_tokens_places_summary[df_tokens_places_summary['percentage'] >= 0.1]
df_tokens_places_summary.sort_values('percentage', ascending=False)
# a function to create a bar chart of the businesses with the
# highest relative frequency of reviews mentioning a specific keyword
def generate_token_percentages_bar_chart(
    df_tokens_places_summary,
    keyword,
    chart_subtitle
):
    df_keyword_top5_places = df_tokens_places_summary[df_tokens_places_summary['token'] == keyword] \
        .sort_values('percentage', ascending=False) \
        .head(5)

    # highlight the top business in red and the rest in gray
    color_discrete_sequence = ['#cfd8dc'] * df_keyword_top5_places.shape[0]
    color_discrete_sequence[0] = '#d32f2f'

    fig = px.bar(
        df_keyword_top5_places,
        x='name',
        y='percentage',
        labels={
            'name': 'Business',
            'percentage': 'Percentage',
            'campus': 'Campus',
            'num_reviews': 'Total number of reviews',
        },
        title=f'<b>Percentage of text reviews with the keyword "{keyword}"</b><br><span style="color: #ccc;">{chart_subtitle}</span>',
        template='simple_white',
        color='name',
        color_discrete_sequence=color_discrete_sequence,
        hover_name='name',
        hover_data=['campus', 'num_reviews'],
        text='campus',
        height=550
    )
    fig.update_traces(textposition='inside')
    fig.update_layout(
        yaxis_tickformat=',.0%',
        font_family='Helvetica, Inter, Arial, sans-serif',
        showlegend=False,
    )
    return fig
fig = generate_token_percentages_bar_chart(
df_tokens_places_summary,
'deposit',
'The University Group is notorious for not returning security deposits'
)
fig.show()
fig = generate_token_percentages_bar_chart(
df_tokens_places_summary,
'elevator',
'309 Green only has 2 elevators for a 24-story building'
)
fig.show()
fig = generate_token_percentages_bar_chart(
df_tokens_places_summary,
'laundry',
'Provo seems to have more apartments with shared laundry facilities'
)
fig.show()
fig = generate_token_percentages_bar_chart(
df_tokens_places_summary,
'parking',
'Parking seems more limited in Provo'
)
fig.show()
fig = generate_token_percentages_bar_chart(
df_tokens_places_summary,
'expensive',
'Apartments in UIUC surprisingly do not make the list here'
)
fig.show()
👉 Percentage of tokens by campus¶
Get the number of reviews with review text by each campus.
df_campus_review_counts = df_m[df_m['review_text'].notna()].groupby('campus', as_index=False)['review_id'].count() \
.rename(columns={'review_id': 'num_reviews'})
df_campus_review_counts
df_tokens_campus_summary = df_tokens_places.groupby(['campus', 'token'], as_index=False) \
.agg({'sentiment': 'size'}) \
.rename(columns={'sentiment': 'unique_frequency'}) \
.merge(
right=df_campus_review_counts,
on='campus'
)
df_tokens_campus_summary['percentage'] = df_tokens_campus_summary['unique_frequency'] / df_tokens_campus_summary['num_reviews']
df_tokens_campus_summary['campus'] = df_tokens_campus_summary['campus'].map({
'Brigham Young University': 'BYU',
'Penn State University Park': 'PSU',
'University of Illinois Urbana-Champaign': 'UIUC'
})
df_tokens_campus_summary.sort_values(['campus', 'token'], inplace=True)
df_tokens_campus_summary.head()
df_campus_positive_keywords = df_tokens_campus_summary[df_tokens_campus_summary['token'] \
.isin(['gym', 'pool', 'pet'])] \
.sort_values(['campus', 'token'])
df_campus_positive_keywords.head(3)
df_campus_negative_keywords = df_tokens_campus_summary[df_tokens_campus_summary['token'] \
.isin(['leak', 'deposit', 'elevator'])] \
.sort_values(['campus', 'token'])
df_campus_negative_keywords.head(3)
df_campus_misc_keywords = df_tokens_campus_summary[df_tokens_campus_summary['token'] \
.isin(['parking', 'expensive', 'construction', 'package'])] \
.sort_values(['campus', 'token'])
df_campus_misc_keywords.head(3)
fig = px.bar(
df_campus_positive_keywords,
x='token',
y='percentage',
color='campus',
color_discrete_map={
'BYU': '#1FB3D1',
'PSU': '#001E44',
'UIUC': '#FF5F05'
},
title='<b>Percentage of reviews that include specific positive keywords</b><br>\
<span style="color: #ccc;">Provo, UT\'s summer is hotter than State College, PA and Champaign, IL</span>',
labels={
'percentage': 'Percentage',
'token': 'Keyword',
'campus': 'Campus'
},
height=500,
template='simple_white',
barmode='group'
)
fig.update_layout(
yaxis_tickformat=',.0%',
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.show()
fig = px.bar(
df_campus_negative_keywords,
x='token',
y='percentage',
color='campus',
color_discrete_map={
'BYU': '#1FB3D1',
'PSU': '#001E44',
'UIUC': '#FF5F05'
},
title='<b>Percentage of reviews that include specific negative keywords</b><br>\
<span style="color: #ccc;">BYU surprisingly has the highest proportion of reviews related to security deposits</span>',
labels={
'percentage': 'Percentage',
'token': 'Keyword',
'campus': 'Campus'
},
height=500,
template='simple_white',
barmode='group'
)
fig.update_layout(
yaxis_tickformat=',.0%',
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.show()
fig = px.bar(
df_campus_misc_keywords,
x='token',
y='percentage',
color='campus',
color_discrete_map={
'BYU': '#1FB3D1',
'PSU': '#001E44',
'UIUC': '#FF5F05'
},
title='<b>Percentage of reviews that include specific miscellaneous keywords</b><br>\
<span style="color: #ccc;">Parking is important to tenants at Provo, UT</span>',
labels={
'percentage': 'Percentage',
'token': 'Keyword',
'campus': 'Campus'
},
height=500,
template='simple_white',
barmode='group'
)
fig.update_layout(
yaxis_tickformat=',.0%',
font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.show()
⚖️ Closing Thoughts¶
- What a surprise! UIUC has the highest proportion of 5-star reviews and the best tenant sentiment of the three college towns.
- BYU has the worst tenant sentiment.
- 48% of reviews around BYU (Provo, UT) are negative. This far exceeds UIUC and PSU, which range from about 34% to 36%.
- The University Group @ UIUC leads in complaints about security deposits that were not properly returned.
- "Gym", "Pool", and "Pet" are often associated with positive reviews.
- Common keywords in negative reviews are "leak", "deposit", "elevator", "construction", "bug", "parking", and "noise".
- "Pool" and "Parking" are more relevant to tenants at BYU.
- The average review ratings in UIUC and PSU have been improving.
- On the other hand, this is not a good look for management companies at BYU.
- Landlords with higher average ratings respond faster on average than landlords with lower average ratings; a quick sketch of how to check this directly follows below.
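The earlier charts relate response time to review sentiment rather than to a company's average rating, so here is one way the last point could be checked. This is a minimal sketch, assuming the df_m columns built above (review_rating, response_time_in_days) and considering only reviews that actually received a response:
# Per-company average rating vs. average response time (responded reviews only).
df_company = df_m[df_m['response_time_in_days'] > 0] \
    .groupby(['place_id', 'name'], as_index=False) \
    .agg({'review_rating': 'mean', 'response_time_in_days': 'mean'})

# A negative correlation here would support the claim that higher-rated
# landlords respond faster on average.
df_company[['review_rating', 'response_time_in_days']].corr()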