✨ Background

I first moved into an apartment at UIUC during the summer of 2007. The apartment was Gregory Place under a management company called JSM. Here is a photo of the apartment from 2007. The construction site shows the second building being built.

Gregory Place 2007

Back in 2007, JSM was regarded as one of the better management companies, along with Royse Brinkmeyer and Roland. Almost two decades later, JSM is still highly regraded. Prospective tenants lining up from the early morning in front of JSM's office to secure a lease was a sight to see in recent years.

JSM Morning Lineup 2023 by Jacob Slabosz Photo by Jacob Slabosz @ Daily illini

After talking to hundreds of students over the years about their landlords in Champaign-Urbana, I came to a conclusion that many of the landlords in Champaign-Urbana must be the worst. From the notorious attempts to hijack tenants' security deposits by The University Group to having to endure cold showers during the winter in Seven07, finding a good landlord or a headache-free apartment sounds impossible.

But this got me thinking - are the landlords in Champaign-Urbana really the worst compared to other similar college towns?

♟️ About the Dataset

In this post, I will compare how the property management companies in Champaign-Urbana area compare to those of two other college towns - Provo, UT (Brigham Young University) and State College, PA (Penn State University).

Data collection and preparation

  • 142 property management companies across the three college towns have been hand-selected.
  • The companies' information on Google Maps were scraped using Outscraper (a cloud-based scraping service).
  • About 17,000 reviews for the property management companies were scraped using Outscraper.
  • The review texts were tokenized using spaCy. The tokenization and additional preprocessing steps are not included in this post.
  • The post uses sentiment analysis results of the reviews using DistilBERT (distilbert-base-uncased-finetuned-sst-2-english).

Import packages.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
In [2]:
import plotly

print(f'pandas version: {pd.__version__}')
print(f'plotly version: {plotly.__version__}')
pandas version: 2.2.3
plotly version: 5.24.1

Set the nubmer of maximum number of displayed columns to 50.

In [3]:
pd.set_option('display.max_columns', 50)

Load and Preprocess Data

There are four tables.

  • df_b: List of property management companies or apartments
  • df_r: Google reviews of the property management companies or apartments
  • df_tokens: Tokenized result of reviews using spaCy
  • df_sentiments: Sentiment analysis result using DistilBERT
In [4]:
df_b = pd.read_csv('https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/businesses.csv')
df_r = pd.read_csv(
    'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/reviews.csv',
    parse_dates = ['review_datetime_utc', 'owner_answer_timestamp_datetime_utc']
)
df_tokens = pd.read_csv('https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/tokens.csv.gz')
df_sentiments = pd.read_csv('https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/results/sentiments.csv')

Businesses dataset

Display the businesses (property management companies) table.

In [5]:
df_b.head()
Out[5]:
campusplace_idnamesitecategoryboroughstreetcitypostal_codestatelatitudelongitudeverified
0Brigham Young UniversityChIJqyXyo6GQTYcRXGfgeIpqc_IAlpine Villagehttps://myalpinevillage.com/Student housing centerCarterville1378 Freedom Blvd 200 WProvo84604Utah40.252607-111.661247True
1Brigham Young UniversityChIJGbnTbamQTYcRtZfBrL52jhsPalladium Apartmentshttps://www.palladiumprovo.com/Housing complexNorth Park538 N Freedom Blvd UnitProvo84601Utah40.240917-111.661714True
2Brigham Young UniversityChIJaURIj6GQTYcRCY56AtxjTI4Glenwood Apartmentshttps://glenwoodapt.com/Student housing centerCarterville1565 N University AveProvo84604Utah40.254606-111.659193True
3Brigham Young UniversityChIJU6EhynuRTYcRv5tkFZYZ8k0Campus Edgehttps://campusprovo.com/campus-edge/Student housing centerCarterville1305 N Canyon RdProvo84604Utah40.251654-111.656401True
4Brigham Young UniversityChIJaURIj6GQTYcRkndi2c5RtdwCambridge Apartmentshttps://www.cambridgecourtapt.com/?switch_cls[...Apartment complexCarterville1425 N University AveProvo84604Utah40.253127-111.659800True
In [6]:
print(f"There are {df_b.shape[0]} rows and {df_b.shape[1]} columns in the businesses table.")
There are 142 rows and 13 columns in the businesses table.

Print out the DataFrame's information.

In [7]:
df_b.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   campus       142 non-null    object 
 1   place_id     142 non-null    object 
 2   name         142 non-null    object 
 3   site         140 non-null    object 
 4   category     142 non-null    object 
 5   borough      71 non-null     object 
 6   street       142 non-null    object 
 7   city         142 non-null    object 
 8   postal_code  142 non-null    int64  
 9   state        142 non-null    object 
 10  latitude     142 non-null    float64
 11  longitude    142 non-null    float64
 12  verified     142 non-null    bool   
dtypes: bool(1), float64(2), int64(1), object(9)
memory usage: 13.6+ KB

Check for missing values.

In [8]:
df_b.isna().sum()
Out[8]:
campus          0
place_id        0
name            0
site            2
category        0
borough        71
street          0
city            0
postal_code     0
state           0
latitude        0
longitude       0
verified        0
dtype: int64

The "site" and "borough" columns have missing values.

Reviews dataset

Display the reviews table.

In [9]:
df_r.head()
Out[9]:
place_idreview_idauthor_idauthor_titlereview_textreview_ratingreview_img_urlreview_datetime_utcowner_answerowner_answer_timestamp_datetime_utcreview_likes
0ChIJqyXyo6GQTYcRXGfgeIpqc_IChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRAB100512067005706825621Laura BrownService requests take a very long time to be r...3NaN2023-10-24 12:46:43+00:00Dear Laura, \n\n We hear you loud and clear, t...2023-10-24 14:56:01+00:000
1ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSUNwOUstOVpBEAE102936107836751613731Pete HaraguchiI was thinking about giving two stars but I le...1https://lh5.googleusercontent.com/p/AF1QipPeIy...2023-07-29 20:30:26+00:00Pete,\n\n Thank you for letting us know about ...2023-07-29 22:46:48+00:0012
2ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSURwbGZXTUJ3EAE103835714740882872235Jack RadfordDO NOT LIVE HERE. This was the WORST experienc...1NaN2023-08-26 03:31:42+00:00Jack, \n\n I’m so sorry — it sounds like we re...2023-08-26 04:56:04+00:003
3ChIJqyXyo6GQTYcRXGfgeIpqc_IChdDSUhNMG9nS0VJQ0FnSURwdzd1Q3B3RRAB107352000345615085703Daniel MerrittIf you have any other choice, take it over Alp...1NaN2023-08-28 21:54:09+00:00Daniel,\n\n Even though it was a negative expe...2023-08-29 00:00:30+00:004
4ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSURKdnFxeWVnEAE109184252255204802361Carman HansenHonestly, I debated whether or not I should le...1NaN2023-07-17 21:47:22+00:00Dear Carman, \n\n Thank you for bringing this ...2023-07-17 22:59:16+00:009
In [10]:
print(f"There are {df_r.shape[0]} rows and {df_r.shape[1]} columns in the reviews table.")
There are 16848 rows and 11 columns in the reviews table.

Check for missing values.

In [11]:
df_r.isna().sum()
Out[11]:
place_id                                   0
review_id                                  0
author_id                                  0
author_title                               0
review_text                             1969
review_rating                              0
review_img_url                         16478
review_datetime_utc                        0
owner_answer                            6951
owner_answer_timestamp_datetime_utc     6951
review_likes                               0
dtype: int64

Columns "review_text", "review_img_url", "owner_answer", and "owner_answer_timestamp_datetime_utc" have missing values.

Sentiments

Display the sentiments table.

In [12]:
df_sentiments.head()
Out[12]:
review_idsentimentscore
0ChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRABPOSITIVE0.909106
1ChZDSUhNMG9nS0VJQ0FnSUNwOUstOVpBEAENEGATIVE0.999216
2ChZDSUhNMG9nS0VJQ0FnSURwbGZXTUJ3EAENEGATIVE0.998690
3ChdDSUhNMG9nS0VJQ0FnSURwdzd1Q3B3RRABNEGATIVE0.999131
4ChZDSUhNMG9nS0VJQ0FnSURKdnFxeWVnEAENEGATIVE0.999204

Check for missing values.

In [13]:
df_sentiments.isna().sum()
Out[13]:
review_id    0
sentiment    0
score        0
dtype: int64

There are no missing values in the sentiments table.

Merge sentiments into reviews

In [14]:
if 'sentiment' not in df_r.columns:
    df_r = pd.merge(
        left=df_r,
        right=df_sentiments,
        on='review_id',
        how='left'
    )

df_r.head()
Out[14]:
place_idreview_idauthor_idauthor_titlereview_textreview_ratingreview_img_urlreview_datetime_utcowner_answerowner_answer_timestamp_datetime_utcreview_likessentimentscore
0ChIJqyXyo6GQTYcRXGfgeIpqc_IChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRAB100512067005706825621Laura BrownService requests take a very long time to be r...3NaN2023-10-24 12:46:43+00:00Dear Laura, \n\n We hear you loud and clear, t...2023-10-24 14:56:01+00:000POSITIVE0.909106
1ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSUNwOUstOVpBEAE102936107836751613731Pete HaraguchiI was thinking about giving two stars but I le...1https://lh5.googleusercontent.com/p/AF1QipPeIy...2023-07-29 20:30:26+00:00Pete,\n\n Thank you for letting us know about ...2023-07-29 22:46:48+00:0012NEGATIVE0.999216
2ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSURwbGZXTUJ3EAE103835714740882872235Jack RadfordDO NOT LIVE HERE. This was the WORST experienc...1NaN2023-08-26 03:31:42+00:00Jack, \n\n I’m so sorry — it sounds like we re...2023-08-26 04:56:04+00:003NEGATIVE0.998690
3ChIJqyXyo6GQTYcRXGfgeIpqc_IChdDSUhNMG9nS0VJQ0FnSURwdzd1Q3B3RRAB107352000345615085703Daniel MerrittIf you have any other choice, take it over Alp...1NaN2023-08-28 21:54:09+00:00Daniel,\n\n Even though it was a negative expe...2023-08-29 00:00:30+00:004NEGATIVE0.999131
4ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSURKdnFxeWVnEAE109184252255204802361Carman HansenHonestly, I debated whether or not I should le...1NaN2023-07-17 21:47:22+00:00Dear Carman, \n\n Thank you for bringing this ...2023-07-17 22:59:16+00:009NEGATIVE0.999204

Tokens

Display the list of tokens in its original form ("text"), lemmatized form ("lemma"), and their part of speech explanations ("explain").

In [15]:
df_tokens
Out[15]:
review_idtextlemmaexplain
0ChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRABServiceservicenoun
1ChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRABrequestsrequestnoun
2ChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRABlonglongadjective
3ChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRABtimetimenoun
4ChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRABrespondedrespondverb
...............
533659ChZDSUhNMG9nS0VJQ0FnSURBdHNiMWRnEAElookinglookverb
533660ChZDSUhNMG9nS0VJQ0FnSURBdHNiMWRnEAEapartmentsapartmentnoun
533661ChZDSUhNMG9nS0VJQ0FnSURBdHNiMWRnEAEansweredanswerverb
533662ChZDSUhNMG9nS0VJQ0FnSURBdHNiMWRnEAEquestionsquestionnoun
533663ChZDSUhNMG9nS0VJQ0FnSURBdHNiMWRnEAEpatientpatientadjective

533664 rows × 4 columns

Merge reviews and businesses

In [16]:
df_m = pd.merge(
    left=df_r,
    right=df_b,
    on='place_id',
    how='inner'
)

df_m.head(2)
Out[16]:
place_idreview_idauthor_idauthor_titlereview_textreview_ratingreview_img_urlreview_datetime_utcowner_answerowner_answer_timestamp_datetime_utcreview_likessentimentscorecampusnamesitecategoryboroughstreetcitypostal_codestatelatitudelongitudeverified
0ChIJqyXyo6GQTYcRXGfgeIpqc_IChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRAB100512067005706825621Laura BrownService requests take a very long time to be r...3NaN2023-10-24 12:46:43+00:00Dear Laura, \n\n We hear you loud and clear, t...2023-10-24 14:56:01+00:000POSITIVE0.909106Brigham Young UniversityAlpine Villagehttps://myalpinevillage.com/Student housing centerCarterville1378 Freedom Blvd 200 WProvo84604Utah40.252607-111.661247True
1ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSUNwOUstOVpBEAE102936107836751613731Pete HaraguchiI was thinking about giving two stars but I le...1https://lh5.googleusercontent.com/p/AF1QipPeIy...2023-07-29 20:30:26+00:00Pete,\n\n Thank you for letting us know about ...2023-07-29 22:46:48+00:0012NEGATIVE0.999216Brigham Young UniversityAlpine Villagehttps://myalpinevillage.com/Student housing centerCarterville1378 Freedom Blvd 200 WProvo84604Utah40.252607-111.661247True

Metadata for visualizations

In [78]:
campus_acronyms_map = {
    'University of Illinois Urbana-Champaign': 'UIUC',
    'Brigham Young University': 'BYU',
    'Penn State University Park': 'PSU',
}

👉 Number of apartments listed by campus

In [17]:
df_b['campus'] \
    .value_counts() \
    .to_frame() \
    .reset_index()
Out[17]:
campuscount
0Brigham Young University53
1Penn State University Park47
2University of Illinois Urbana-Champaign42

👉 Calculate the time owner took to respond to a review

In [18]:
td = df_m['owner_answer_timestamp_datetime_utc'] - df_m['review_datetime_utc']
response_time_in_days = td.apply(lambda x: x / pd.Timedelta(days=1))

# round the response time if a positive number
# if the resposne time is negative, there is no way of knowing
# when the author has 
df_m['response_time_in_days'] = np.where(
    response_time_in_days > 0,
    round(response_time_in_days, 0),
    response_time_in_days
)

df_m[['review_id', 'owner_answer_timestamp_datetime_utc', 'review_datetime_utc', 'response_time_in_days']].sample(2)
Out[18]:
review_idowner_answer_timestamp_datetime_utcreview_datetime_utcresponse_time_in_days
12864ChZDSUhNMG9nS0VJQ0FnSUNBNHItRU1REAE2018-08-17 18:15:27+00:002018-08-17 15:38:43+00:000.0
5859ChZDSUhNMG9nS0VJQ0FnSURjMXR1dUpREAENaT2020-06-11 01:56:32+00:00NaN

Add a categorical variable based on the 'response_time_in_days' variable.

In [19]:
df_m['response_time'] = np.select(
    [
        df_m['response_time_in_days'].between(-10000, 0, inclusive='left'),
        df_m['response_time_in_days'].between(0, 1, inclusive='left'),
        df_m['response_time_in_days'].between(1, 7, inclusive='left'),
        df_m['response_time_in_days'].between(7, 30, inclusive='left'),
        df_m['response_time_in_days'].between(31, 10000, inclusive='left'),
    ],
    [
        'Unknown (review updated)',
        'Within a day',
        'Within a week',
        'Within a month',
        'After a month'
    ],
    default='No response'
)

df_m.head(2)
Out[19]:
place_idreview_idauthor_idauthor_titlereview_textreview_ratingreview_img_urlreview_datetime_utcowner_answerowner_answer_timestamp_datetime_utcreview_likessentimentscorecampusnamesitecategoryboroughstreetcitypostal_codestatelatitudelongitudeverifiedresponse_time_in_daysresponse_time
0ChIJqyXyo6GQTYcRXGfgeIpqc_IChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRAB100512067005706825621Laura BrownService requests take a very long time to be r...3NaN2023-10-24 12:46:43+00:00Dear Laura, \n\n We hear you loud and clear, t...2023-10-24 14:56:01+00:000POSITIVE0.909106Brigham Young UniversityAlpine Villagehttps://myalpinevillage.com/Student housing centerCarterville1378 Freedom Blvd 200 WProvo84604Utah40.252607-111.661247True0.0Within a day
1ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSUNwOUstOVpBEAE102936107836751613731Pete HaraguchiI was thinking about giving two stars but I le...1https://lh5.googleusercontent.com/p/AF1QipPeIy...2023-07-29 20:30:26+00:00Pete,\n\n Thank you for letting us know about ...2023-07-29 22:46:48+00:0012NEGATIVE0.999216Brigham Young UniversityAlpine Villagehttps://myalpinevillage.com/Student housing centerCarterville1378 Freedom Blvd 200 WProvo84604Utah40.252607-111.661247True0.0Within a day

👉 Owner's response time (pie chart)

In [20]:
df_response_time_counts = df_m['response_time'].value_counts().to_frame().reset_index()
df_response_time_counts
Out[20]:
response_timecount
0No response6963
1Within a week4258
2Within a day2055
3Within a month1374
4After a month1302
5Unknown (review updated)898
In [21]:
fig = px.pie(
    df_response_time_counts,
    names='response_time',
    values='count',
    title='<b>Owner response times</b><br><span style="color: #aaa;">Owners do not respond 40% of the times</span>',
    height=500,
    template='simple_white',
    color='response_time',
    color_discrete_map={
        "Within a day": "#689F38",
        "No response": "#FDD835",
        "Within a week": "#9CCC65",
        "Within a month": "#EF9A9A",
        "After a month": "#EF5350",
        "Unknown (review updated)": "#ccc"
    },
    labels={
        'response_time': 'Response Time'
    },
)

fig.update_traces(
    textinfo='percent+label',
    textposition='outside',
    showlegend=False
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
)

fig.show()
No response41.3%Within a week25.3%Within a day12.2%Within a month8.15%After a month7.73%Unknown (review updated)5.33%
Owner response timesOwners do not respond 40% of the times

👉 Owner's response time (histogram)

In [22]:
fig = px.histogram(
    df_m[(df_m['response_time_in_days'] >= 0) & (df_m['response_time_in_days'] < 30)],
    x='response_time_in_days',
    template='simple_white',
    title='<b>Owner\'s response time in days (<30 days)</b><br><span style="color: #aaa">The majority of the owners who responds replies within a week</span>',
    height=500,
    labels={
        'response_time_in_days': 'Response time in days'
    },
    color_discrete_sequence=['black']
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    yaxis_title_text='Count'
)

fig.add_vrect(
    x0=-0.5,
    x1=7.5, 
    annotation_text="Within a week",
    annotation_position="top right",
    annotation=dict(
        font_color='#7CB342',
        font_size=11
    ),
    fillcolor="#4CAF50",
    opacity=0.07,
    line_width=0
)
fig.add_vline(x=7.5, line_width=1, line_dash="solid", line_color="#8bc34a")

fig.show()
05101520250500100015002000
Owner's response time in days (<30 days)The majority of the owners who responds replies within a weekResponse time in daysCountWithin a week
In [23]:
df_response_sentiment_count =df_m \
    .groupby(['response_time', 'sentiment'], as_index=False) \
    .agg({'review_id': 'count'}) \
    .rename(columns={'review_id': 'count'})

df_response_sentiment_count['sentiment'] = df_response_sentiment_count['sentiment'].str.capitalize()
df_response_sentiment_count['percentage'] = df_response_sentiment_count['count'] / df_response_sentiment_count.groupby('response_time')['count'].transform('sum')




df_response_sentiment_count
Out[23]:
response_timesentimentcountpercentage
0After a monthNegative5070.447090
1After a monthPositive6270.552910
2No responseNegative25230.430326
3No responsePositive33400.569674
4Unknown (review updated)Negative4870.570926
5Unknown (review updated)Positive3660.429074
6Within a dayNegative5740.312807
7Within a dayPositive12610.687193
8Within a monthNegative5190.411252
9Within a monthPositive7430.588748
10Within a weekNegative12750.324098
11Within a weekPositive26590.675902

👉 Owner's response time vs sentiment

In [24]:
fig = px.bar(
    df_response_sentiment_count,
    x='percentage',
    y='response_time',
    category_orders={
        'response_time': [
            'Within a day',
            'Within a week',
            'Within a month',
            'After a month',
            'No response',
            'Unknown (review updated)'
        ],
        'sentiment': ['Positive', 'Negative']
    },
    template='simple_white',
    color='sentiment',
    color_discrete_map={
        'Positive': '#8bc34a',
        'Negative': '#ef5350'
    },
    labels={'sentiment': 'Sentiment'},
    title='<b>Owner\'s response time</b><br><span style="color: #aaa;">Companies who receive negative reviews are correlated with slower response times</span>',
    text=df_response_sentiment_count.apply(lambda r: f"{'👍' if r['sentiment'] == 'Positive' else '👎'} {'{0:.1f}%'.format(r['percentage'] * 100)}", axis=1),
    height=500
)

fig.update_layout(
    xaxis_title_text='Percentage',
    yaxis_title_text='Owner\'s response time',
    font_family='Helvetica, Inter, Arial, sans-serif',
    xaxis_tickformat=',.0%',
)

fig.for_each_trace(lambda t: t.update(textfont_color='white'))

fig.show()
👍 55.3%👍 57.0%👍 42.9%👍 68.7%👍 58.9%👍 67.6%👎 44.7%👎 43.0%👎 57.1%👎 31.3%👎 41.1%👎 32.4%0%20%40%60%80%100%Unknown (review updated)No responseAfter a monthWithin a monthWithin a weekWithin a day
SentimentPositiveNegativeOwner's response timeCompanies who receive negative reviews are correlated with slower response timesPercentageOwner's response time

👉 Review rating breakdown by campus

In [25]:
df_ratings_breakdown = df_m.groupby(
        ['campus', 'review_rating'],
        as_index=False
    ).agg({
        'review_id': 'count'
    }).rename(columns={
        'review_id': 'num_reviews'
    })

df_ratings_breakdown['campus'] = df_ratings_breakdown['campus'].map({
    'Brigham Young University': 'BYU',
    'Penn State University Park': 'PSU',
    'University of Illinois Urbana-Champaign': 'UIUC'
})

df_ratings_breakdown['review_rating'] = df_ratings_breakdown['review_rating'].astype(str)

df_ratings_breakdown['percentage'] = df_ratings_breakdown['num_reviews'] / df_ratings_breakdown.groupby('campus')['num_reviews'].transform('sum')

df_ratings_breakdown
Out[25]:
campusreview_ratingnum_reviewspercentage
0BYU118340.323971
1BYU24360.077018
2BYU34120.072779
3BYU410200.180180
4BYU519590.346052
5PSU113610.263606
6PSU22200.042611
7PSU32360.045710
8PSU46010.116405
9PSU527450.531668
10UIUC115220.252572
11UIUC22070.034351
12UIUC31760.029207
13UIUC45390.089446
14UIUC535820.594424
In [26]:
fig = px.bar(
    df_ratings_breakdown,
    x='campus',
    y='percentage',
    color='review_rating',
    color_discrete_map={
        "1": "#EF5350",
        "2": "#EF9A9A",
        "3": "#FDD835",
        "4": "#9CCC65",
        "5": "#689F38"
    },
    labels={
        'review_rating': 'Review Rating',
        'campus': 'Campus',
        'percentage': 'Percentage'
    },
    title='<b>Review rating breakdown by campus</b><br><span style="color: #aaa">UIUC has the highest proportion of 5 star reviews</span>',
    text=df_ratings_breakdown.apply(lambda r: f"{'⭐' * int(r['review_rating'])} {'{0:.1f}%'.format(r['percentage'] * 100)}", axis=1),
    template='simple_white',
    height=650
)

fig.update_layout(
    yaxis_tickformat=',.0%',
    uniformtext_minsize=10,
    uniformtext_mode='hide',
    font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.for_each_trace(lambda t: t.update(textfont_color='white'))

fig.show()
⭐ 32.4%⭐ 26.4%⭐ 25.3%⭐⭐ 7.7%⭐⭐ 4.3%⭐⭐⭐ 7.3%⭐⭐⭐ 4.6%⭐⭐⭐⭐ 18.0%⭐⭐⭐⭐ 11.6%⭐⭐⭐⭐ 8.9%⭐⭐⭐⭐⭐ 34.6%⭐⭐⭐⭐⭐ 53.2%⭐⭐⭐⭐⭐ 59.4%BYUPSUUIUC0%20%40%60%80%100%
Review Rating12345Review rating breakdown by campusUIUC has the highest proportion of 5 star reviewsCampusPercentage

👉 Total number of reviews by campus

In [27]:
fig = px.bar(
    df_ratings_breakdown,
    x='num_reviews',
    y='campus',
    color='review_rating',
    color_discrete_map={
        "1": "#EF5350",
        "2": "#EF9A9A",
        "3": "#FDD835",
        "4": "#9CCC65",
        "5": "#689F38"
    },
    labels={
        'review_rating': 'Review Rating',
        'campus': 'Campus',
        'percentage': 'Percentage',
        'num_reviews': 'Number of reviews'
    },
    title='<b>Total number of reviews by campus</b><br><span style="color: #ccc;">UIUC has the highest number of total reviews</span>',
    template='simple_white',
    height=500
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.update_yaxes(categoryorder='total ascending')

fig.show()
0100020003000400050006000PSUBYUUIUC
Review Rating12345Total number of reviews by campusUIUC has the highest number of total reviewsNumber of reviewsCampus

👉 Review ratings over time

In [28]:
df_dt = df_m.copy()

df_dt['month'] = df_dt['review_datetime_utc'].dt.strftime('%Y-%m')
df_dt = df_dt[df_dt['review_datetime_utc'].dt.year >= 2015]

df_summary_by_year_month = df_dt.groupby(['month', 'campus'], as_index=False) \
    .agg({
        'review_id': 'count',
        'review_rating': 'mean',
    }) \
    .rename(columns={
        'review_id': 'num_reviews'
    })

df_summary_by_year_month
Out[28]:
monthcampusnum_reviewsreview_rating
02015-01Brigham Young University273.222222
12015-01Penn State University Park62.833333
22015-01University of Illinois Urbana-Champaign73.285714
32015-02Brigham Young University213.428571
42015-02Penn State University Park82.250000
...............
3162023-10Penn State University Park1173.487179
3172023-10University of Illinois Urbana-Champaign1003.070000
3182023-11Brigham Young University533.735849
3192023-11Penn State University Park753.986667
3202023-11University of Illinois Urbana-Champaign963.916667

321 rows × 4 columns

In [29]:
df_dt = df_m.copy()

df_dt['year'] = df_dt['review_datetime_utc'].dt.year
df_dt = df_dt[df_dt['review_datetime_utc'].dt.year >= 2015]

df_summary_by_year = df_dt.groupby(['year', 'campus'], as_index=False) \
    .agg({
        'review_id': 'count',
        'review_rating': 'mean',
    }) \
    .rename(columns={
        'review_id': 'num_reviews'
    })

df_summary_by_year['campus'] = df_summary_by_year['campus'].map({
    'Brigham Young University': 'BYU',
    'Penn State University Park': 'PSU',
    'University of Illinois Urbana-Champaign': 'UIUC'
})

df_summary_by_year.head(3)
Out[29]:
yearcampusnum_reviewsreview_rating
02015BYU3433.352770
12015PSU1332.729323
22015UIUC1193.672269
In [30]:
fig = px.line(
    df_summary_by_year,
    x='year',
    y='review_rating',
    color='campus',
    template='simple_white',
    labels={
        'review_rating': 'Average Review Rating',
        'year': 'Year',
        'campus': 'Campus',
    },
    title='<b>Average review rating change over time by campus</b><br><span style="color: #aaa;">The sharp decline in review ratings of Provo, UT (BYU) is alarming</span>',
    color_discrete_map={
        'BYU': '#1FB3D1',
        'PSU': '#001E44',
        'UIUC': '#FF5F05'
    },
    height=500,
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    legend=dict(
        font=dict(
            size=11,
        ),
        yanchor="top",
        y=1,
        xanchor="left",
        x=0.02
    ),
    showlegend=False,
)

annotations = [
    {'bgcolor': '#1FB3D1'},
    {'bgcolor': '#001E44'},
    {'bgcolor': '#FF5F05'},
]

for i in range(df_summary_by_year['campus'].nunique()):
    annotation_year = df_summary_by_year['year'].max()
    campus = df_summary_by_year['campus'].unique()[i]
    y = df_summary_by_year.query(f"(year == {annotation_year}) & (campus == '{campus}')")['review_rating'].iloc[0]
    
    fig.add_annotation(
        x=annotation_year,
        y=y,
        text=campus,
        font=dict(
            color='white',
            size=11
        ),
        showarrow=True,
        arrowwidth=1.5,
        arrowhead=1,
        arrowsize=1,
        arrowcolor=annotations[i]['bgcolor'],
        bgcolor=annotations[i]['bgcolor'],
        xshift=5,
        ax=40,
        ay=0,
    )


fig.show()
201620182020202220242.833.23.43.63.84
Average review rating change over time by campusThe sharp decline in review ratings of Provo, UT (BYU) is alarmingYearAverage Review RatingBYUPSUUIUC

👉 Positive/negative sentiments by campus

In [31]:
df_sentiment_by_campus = df_m.groupby(['campus', 'sentiment'], as_index=False) \
    .agg({'review_id': 'count'})

df_sentiment_by_campus
Out[31]:
campussentimentreview_id
0Brigham Young UniversityNEGATIVE2437
1Brigham Young UniversityPOSITIVE2648
2Penn State University ParkNEGATIVE1626
3Penn State University ParkPOSITIVE2855
4University of Illinois Urbana-ChampaignNEGATIVE1822
5University of Illinois Urbana-ChampaignPOSITIVE3493
In [32]:
fig = make_subplots(
    rows=1,
    cols=3,
    specs=[[{'type': 'domain'}, {'type': 'domain'}, {'type': 'domain'}]],
    subplot_titles=['BYU', 'PSU', 'UIUC']
)
pie_labels = ['Negative', 'Positive']
marker_colors = ['#ef5350', '#8bc34a']

def add_sentiment_trace(fig, col, campus):
    fig.add_trace(
        go.Pie(
            labels=pie_labels,
            values=df_m[df_m['campus'] == campus]['sentiment'].value_counts().sort_index(),
            text=pie_labels,
            marker_colors=marker_colors,
            textposition='inside',
            textfont=dict(
                size=10,
                color='white'
            ),
            insidetextorientation='horizontal',
            sort=False,
        ), 1, col
    )

add_sentiment_trace(fig, 1, 'Brigham Young University')
add_sentiment_trace(fig, 2, 'Penn State University Park')
add_sentiment_trace(fig, 3, 'University of Illinois Urbana-Champaign')

fig.update_layout(
    title=dict(
        text='<b>DistilBERT Sentiments by Campus</b><br><span style="color: #aaa;">Leasing companies around Brigham Young University (Provo, UT) have the worst sentiment</span>',
        x=0,
        y=0.9,
        xanchor='left',
        yanchor='top',
    ),
    font_family='Helvetica, Arial, Inter, sans-serif',
    showlegend=False,
    margin=dict(
        l=0,
        r=0,
        t=125,
        b=50,
    ),
    height=450
)
Negative47.9%Positive52.1%Negative36.3%Positive63.7%Negative34.3%Positive65.7%
DistilBERT Sentiments by CampusLeasing companies around Brigham Young University (Provo, UT) have the worst sentimentBYUPSUUIUC
In [33]:
df_sentiment_by_campus
Out[33]:
campussentimentreview_id
0Brigham Young UniversityNEGATIVE2437
1Brigham Young UniversityPOSITIVE2648
2Penn State University ParkNEGATIVE1626
3Penn State University ParkPOSITIVE2855
4University of Illinois Urbana-ChampaignNEGATIVE1822
5University of Illinois Urbana-ChampaignPOSITIVE3493

👉 Review length distribution by rating

In [34]:
df_review_len = df_m.copy()
df_review_len['review_length'] = df_review_len['review_text'].str.len()
df_review_len['review_rating'] = df_review_len['review_rating'].astype(str)

# only filter review <2000 chars to remove outliers
df_review_len = df_review_len[df_review_len['review_length'] < 2000]

fig = px.box(
    df_review_len,
    x='review_length',
    y='review_rating',
    color='review_rating',
    color_discrete_map={
        "1": "#EF5350",
        "2": "#EF9A9A",
        "3": "#FDD835",
        "4": "#9CCC65",
        "5": "#689F38"
    },
    labels={
        'review_rating': 'Review Rating',
        'review_length': 'Review Length'
    },
    template='simple_white',
    title='<b>Review length (number of characters) distribution by rating</b><br><span style="color: #aaa">Unsatisfied tenants have more to say</span>',
    height=600
)

fig.update_yaxes(categoryorder='category ascending')
fig.update_layout(
    showlegend=False,
    font_family='Helvetica, Inter, Arial, sans-serif',
)

fig.show()
050010001500200012345
Review length (number of characters) distribution by ratingUnsatisfied tenants have more to sayReview LengthReview Rating

👉 Correlation between review length and number of likes

In [36]:
fig = px.scatter(
    df_review_len[df_review_len['review_likes'] > 0],
    x='review_length',
    y='review_likes',
    labels={
        'review_likes': 'Review Likes',
        'review_length': 'Review Length'
    },
    template='simple_white',
    title='<b>Review length vs number of likes</b><br>\
<span style="color: #aaa">Longer reviews receive more likes on average</span>',
    trendline="ols",
    trendline_color_override="#D32F2F",
    height=600
)

fig.update_layout(
    showlegend=False,
    font_family='Helvetica, Inter, Arial, sans-serif',
)

fig.update_traces(
    marker=dict(
        color='#CFD8DC',
        size=3,
    ),
    selector=dict(mode='markers')
)

# calculate y coordinate to add annotation
r = px.get_trendline_results(fig).iloc[0]['px_fit_results'].params
annotation_x = 1500
annotation_y = r[0] + r[1] * annotation_x

fig.add_annotation(
    x=annotation_x,
    y=annotation_y,
    text=f"A weak positive correlation (R²={round(px.get_trendline_results(fig).iloc[0]['px_fit_results'].rsquared, 2)})",
    font=dict(
        size=14
    ),
    showarrow=True,
    arrowhead=2,
    bgcolor='white',
)

fig.show()
0500100015002000010203040
Review length vs number of likesLonger reviews receive more likes on averageReview LengthReview LikesA weak positive correlation (R²=0.09)

✏️ Keyword Analysis

👉 Frequencies

Convert tokens to lowercase and find frequencies.

In [37]:
df_tokens_lower = df_tokens.copy()
df_tokens_lower['lemma'] = df_tokens_lower['lemma'].str.lower()
df_tokens_lower.rename(columns={'lemma': 'token'}, inplace=True)

df_tokens_lower.head(3)
Out[37]:
review_idtexttokenexplain
0ChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRABServiceservicenoun
1ChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRABrequestsrequestnoun
2ChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRABlonglongadjective
In [136]:
df_token_counts = df_tokens_lower[['review_id', 'token']] \
    .drop_duplicates()['token'] \
    .value_counts() \
    .to_frame() \
    .reset_index()

df_token_counts
Out[136]:
tokencount
0live6326
1apartment6088
2place4784
3great4206
4good3410
.........
15269butcher1
15270sonic1
152719401
15272relegate1
15273distinctly1

15274 rows × 2 columns

In [39]:
interesting_keywords = [
    'deposit', 'parking', 'gym', 'internet', 
    'laundry', 'washer', 'leak', 'pool',
    'package', 'expensive', 'maintenance', 'construction',
    'noise', 'pet', 'bug', 'elevator',
]

df_interesting_keywords_count = df_token_counts[
    df_token_counts['token'].isin(interesting_keywords)
]

df_interesting_keywords_count
Out[39]:
tokencount
7maintenance3119
43parking1479
80pool980
94deposit857
126gym701
195internet471
216laundry423
243expensive368
296leak311
298washer309
344package269
347elevator266
418noise216
494pet175
544construction154
557bug148
In [40]:
fig = px.bar(
    df_interesting_keywords_count,
    x='count',
    y='token',
    labels={
        'token': 'Keyword',
        'count': 'Frequency',
    },
    title='<b>Review keyword frequencies</b><br><span style="color: #ccc;">Maintenance is the most frequently discussed keyword in reviews</span>',
    template='simple_white',
    height=700
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
)
fig.update_traces(marker_color='black')
fig.update_yaxes(categoryorder='total ascending')

fig.show()
050010001500200025003000bugconstructionpetnoiseelevatorpackagewasherleakexpensivelaundryinternetgymdepositpoolparkingmaintenance
Review keyword frequenciesMaintenance is the most frequently discussed keyword in reviewsFrequencyKeyword

👉 Positive/negative associations with tokens

Convert tokens to lowercase and find frequencies.

In [41]:
df_interesting_tokens = df_tokens_lower[df_tokens_lower['token'].isin(interesting_keywords)]

df_interesting_token_sentiments = pd.merge(
    left=df_r[['place_id', 'review_id', 'sentiment']],
    right=df_interesting_tokens[['review_id', 'token']],
    how='inner',
    on='review_id'
)

df_interesting_token_sentiments
Out[41]:
place_idreview_idsentimenttoken
0ChIJqyXyo6GQTYcRXGfgeIpqc_IChdDSUhNMG9nS0VJQ0FnSUQ1OXRHZnNnRRABPOSITIVEpool
1ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSUNwOUstOVpBEAENEGATIVEelevator
2ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSUNwOUstOVpBEAENEGATIVEelevator
3ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSUNwOUstOVpBEAENEGATIVEgym
4ChIJqyXyo6GQTYcRXGfgeIpqc_IChZDSUhNMG9nS0VJQ0FnSUNwOUstOVpBEAENEGATIVEparking
...............
13535ChIJl3GbQTjXDIgRUIdCvHgdOY4ChZDSUhNMG9nS0VJQ0FnSUNneXJ6aEJ3EAENEGATIVEmaintenance
13536ChIJl3GbQTjXDIgRUIdCvHgdOY4ChZDSUhNMG9nS0VJQ0FnSURBcXJXekJ3EAEPOSITIVEdeposit
13537ChIJl3GbQTjXDIgRUIdCvHgdOY4ChZDSUhNMG9nS0VJQ0FnSUMwMklTakh3EAEPOSITIVEmaintenance
13538ChIJl3GbQTjXDIgRUIdCvHgdOY4ChZDSUhNMG9nS0VJQ0FnSURJeVplTGZ3EAEPOSITIVEmaintenance
13539ChIJl3GbQTjXDIgRUIdCvHgdOY4ChdDSUhNMG9nS0VJQ0FnSUNzdktudTV3RRABPOSITIVEleak

13540 rows × 4 columns

In [42]:
df_interesting_token_sentiment_counts = df_interesting_token_sentiments \
    .groupby(['token', 'sentiment'], as_index=False) \
    .agg({'review_id': 'count'}) \
    .rename(columns={'review_id': 'count'})

df_interesting_token_sentiment_counts['percentage'] = df_interesting_token_sentiment_counts['count'] / df_interesting_token_sentiment_counts.groupby('token')['count'].transform('sum')
df_interesting_token_sentiment_counts['sentiment'] = df_interesting_token_sentiment_counts['sentiment'].str.capitalize()
df_interesting_token_sentiment_counts.sort_values(
    ['sentiment', 'percentage'],
    ascending=[True, True],
    inplace=True
)

df_interesting_token_sentiment_counts.head(10)
Out[42]:
tokensentimentcountpercentage
10gymNegative2310.304348
28poolNegative4690.421384
26petNegative1010.431624
18maintenanceNegative19010.478600
30washerNegative2250.616438
14laundryNegative3230.625969
8expensiveNegative2460.629156
22packageNegative2610.659091
24parkingNegative16070.667914
20noiseNegative2050.729537
In [44]:
fig = px.bar(
    df_interesting_token_sentiment_counts,
    x='percentage',
    y='token',
    template='simple_white',
    color='sentiment',
    color_discrete_map={
        'Positive': '#8bc34a',
        'Negative': '#ef5350'
    },
    labels={'token': 'Keyword', 'sentiment': 'Sentiment', 'percentage': 'Percentage'},
    title='<b>Keyword vs review sentiment associations</b><br><span style="color: #aaa;">Who doesn\'t hate leaks, security deposit scams, and broken elevators?</span>',
    text=df_interesting_token_sentiment_counts.apply(
        lambda r: f"{'👍' if r['sentiment'] == 'Positive' else '👎'} \
        {'{0:.1f}%'.format(r['percentage'] * 100)}", axis=1),
    height=800
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    xaxis_tickformat=',.0%',
    showlegend=False,
)

fig.for_each_trace(lambda t: t.update(textfont_color='white'))

fig.show()
👎 30.4%👎 42.1%👎 43.2%👎 47.9%👎 61.6%👎 62.6%👎 62.9%👎 65.9%👎 66.8%👎 73.0%👎 78.8%👎 85.4%👎 85.9%👎 86.1%👎 89.0%👎 90.3%👍 9.7%👍 11.0%👍 13.9%👍 14.1%👍 14.6%👍 21.2%👍 27.0%👍 33.2%👍 34.1%👍 37.1%👍 37.4%👍 38.4%👍 52.1%👍 56.8%👍 57.9%👍 69.6%0%20%40%60%80%100%gympoolpetmaintenancewasherlaundryexpensivepackageparkingnoiseinternetbugconstructionelevatordepositleak
Keyword vs review sentiment associationsWho doesn't hate leaks, security deposit scams, and broken elevators?PercentageKeyword

Add management company names and campus information to df_interesting_token_sentiments.

In [45]:
df_tokens_places = pd.merge(
    left=df_interesting_token_sentiments.drop_duplicates()[['place_id', 'sentiment', 'token']],
    right=df_b[['place_id', 'name', 'campus']]
)

df_tokens_places
Out[45]:
place_idsentimenttokennamecampus
0ChIJqyXyo6GQTYcRXGfgeIpqc_IPOSITIVEpoolAlpine VillageBrigham Young University
1ChIJqyXyo6GQTYcRXGfgeIpqc_INEGATIVEelevatorAlpine VillageBrigham Young University
2ChIJqyXyo6GQTYcRXGfgeIpqc_INEGATIVEgymAlpine VillageBrigham Young University
3ChIJqyXyo6GQTYcRXGfgeIpqc_INEGATIVEparkingAlpine VillageBrigham Young University
4ChIJqyXyo6GQTYcRXGfgeIpqc_INEGATIVEwasherAlpine VillageBrigham Young University
..................
10241ChIJl3GbQTjXDIgRUIdCvHgdOY4NEGATIVEmaintenanceCommunity Property ManagementUniversity of Illinois Urbana-Champaign
10242ChIJl3GbQTjXDIgRUIdCvHgdOY4POSITIVEdepositCommunity Property ManagementUniversity of Illinois Urbana-Champaign
10243ChIJl3GbQTjXDIgRUIdCvHgdOY4POSITIVEmaintenanceCommunity Property ManagementUniversity of Illinois Urbana-Champaign
10244ChIJl3GbQTjXDIgRUIdCvHgdOY4POSITIVEmaintenanceCommunity Property ManagementUniversity of Illinois Urbana-Champaign
10245ChIJl3GbQTjXDIgRUIdCvHgdOY4POSITIVEleakCommunity Property ManagementUniversity of Illinois Urbana-Champaign

10246 rows × 5 columns

Get the number of reviews by each business.

In [46]:
df_place_review_counts = df_r[df_r['review_text'].notna()].groupby('place_id', as_index=False)['review_id'].count() \
    .rename(columns={'review_id': 'num_reviews'})

df_place_review_counts
Out[46]:
place_idnum_reviews
0ChIJ06hDonPXDIgRwvaC5IF3CaA35
1ChIJ08JvZlGXTYcRYhauVfCCuJk78
2ChIJ0yehl6XXDIgRtqPdtmcuKeU17
3ChIJ28QMEbyozokRvv0bOBv6TUc161
4ChIJ2ZFgr_GlzokRlEdDJIv3WFU26
.........
137ChIJvzfjLA2XTYcR66z_4bzQOvY48
138ChIJw3E2MbKQTYcRswRvHD7LBaU200
139ChIJy6xaEQapzokRGunpHg5WWxk200
140ChIJywqAqaCozokRRoRR_ZhhcQ830
141ChIJzZZ2J46ozokRTALobuh-_JA34

142 rows × 2 columns

Filter places by interesting keywords based on frequency.

In [79]:
df_tokens_places_summary = df_tokens_places.groupby(['place_id', 'name', 'campus', 'token'], as_index=False) \
    .agg({'sentiment': 'size'}) \
    .rename(columns={'sentiment': 'unique_frequency'}) \
    .merge(
        right=df_place_review_counts,
        on='place_id'
    )

df_tokens_places_summary['campus'] = df_tokens_places_summary['campus'].map(campus_acronyms_map)
df_tokens_places_summary['percentage'] = df_tokens_places_summary['unique_frequency'] / df_tokens_places_summary['num_reviews']
df_tokens_places_summary = df_tokens_places_summary[df_tokens_places_summary['percentage'] >= 0.1]
df_tokens_places_summary.sort_values('percentage', ascending=False)
Out[79]:
place_idnamecampustokenunique_frequencynum_reviewspercentage
756ChIJX_L4rUnXDIgR9TAeTC_hgXAUniversity GroupUIUCmaintenance1152000.575000
1227ChIJn4wFUb-ozokRBIjeH4SwQl0Lions Gate ApartmentsPSUmaintenance12210.571429
555ChIJLS5X17OnzokRqWLyrt6auaMToftrees Apartments - Turtle Creek ApartmentsPSUmaintenance13240.541667
391ChIJFws_9NCnzokRWuhhIW9H3roThe Bryn ApartmentsPSUmaintenance872000.435000
785ChIJZ1Wg5rDXDIgRwsRBDwZ_PUwParkside ApartmentsUIUCmaintenance19450.422222
........................
967ChIJbeoNXnGaTYcRP-R0abW2glIThe CrestwoodBYUpool202000.100000
1081ChIJhVL1izfXDIgRk7TCCsdAoJgLatitude ApartmentsUIUCpool202000.100000
1209ChIJm0H0xraQTYcR3_G-7MOct8cCampus Plaza ApartmentsBYUpool5500.100000
1241ChIJoUKqgmHXDIgRrZiClSE_GkAThe LincUIUCgym202000.100000
1269ChIJpbSzGKyQTYcRh9jrNYBP-fgFCS RentalsBYUmaintenance9900.100000

294 rows × 7 columns

In [123]:
# a function to create a bar chart of businesses with the
# highest relative frequency of reviews with a specific keyword
def generate_token_percentages_bar_chart(
    df_token_places_summary, 
    keyword,
    chart_subtitle
):
    df_keyword_top5_places = df_tokens_places_summary[df_tokens_places_summary['token'] == keyword] \
        .sort_values('percentage', ascending=False) \
        .head(5)
        
    color_discrete_sequence = ['#cfd8dc'] * df_keyword_top5_places.shape[0]
    color_discrete_sequence[0] = '#d32f2f'
    
    fig = px.bar(
        df_keyword_top5_places,
        x='name',
        y='percentage',
        labels={
            'name': 'Business',
            'percentage': 'Percentage',
            'campus': 'Campus',
            'num_reviews': 'Total number of reviews',
        },
        title=f'<b>Percentage of text reviews with the keyword "{keyword}"</b><br><span style="color: #ccc;">{chart_subtitle}</span>',
        template='simple_white',
        color='name',
        color_discrete_sequence=color_discrete_sequence,
        hover_name='name',
        hover_data=['campus', 'num_reviews'],
        text='campus',
        height=550
    )
    
    fig.update_traces(textposition='inside')
    
    fig.update_layout(
        yaxis_tickformat=',.0%',
        font_family='Helvetica, Inter, Arial, sans-serif',
        showlegend=False,
    )
    
    return fig
In [109]:
fig = generate_token_percentages_bar_chart(
    df_tokens_places_summary,
    'deposit',
    'The Univesity Group is notorious for not returning security deposits'
)

fig.show()
UIUCPSUPSUBYUPSUUniversity GroupWestside Village ApartmentsUniversity Terrace ApartmentsCarriage Cove ApartmentsThe Apartment Store0%5%10%15%20%25%30%35%40%
Percentage of text reviews with the keyword "deposit"The Univesity Group is notorious for not returning security depositsBusinessPercentage
In [110]:
fig = generate_token_percentages_bar_chart(
    df_tokens_places_summary,
    'elevator',
    '309 Green only has 2 elevators for a 24-story building'
)

fig.show()
UIUCPSUPSUPSUUIUC309 GreenThe Standard at State CollegeThe Metropolitan State CollegeParkway Plaza Student ApartmentsThe Tower at Third0%5%10%15%
Percentage of text reviews with the keyword "elevator"309 Green only has 2 elevators for a 24-story buildingBusinessPercentage
In [134]:
fig = generate_token_percentages_bar_chart(
    df_tokens_places_summary,
    'laundry',
    'Provo seems to have more apartments that share the laundry facility'
)

fig.show()
BYUBYUBYUBYUBYUMonaco Court ApartmentsUnion Square ApartmentsLiberty SquareBrittany ApartmentsThe Branbury0%5%10%15%20%
Percentage of text reviews with the keyword "laundry"Provo seems to have more apartments that share the laundry facilityBusinessPercentage
In [135]:
fig = generate_token_percentages_bar_chart(
    df_tokens_places_summary,
    'parking',
    'Parking seems more limited in Provo'
)

fig.show()
BYUBYUBYUBYUBYUBrittany ApartmentsUnion Square ApartmentsLiberty on FreedomThe BranburyCollegePlace Woodland0%5%10%15%20%25%30%35%40%
Percentage of text reviews with the keyword "parking"Parking seems more limited in ProvoBusinessPercentage
In [104]:
fig = generate_token_percentages_bar_chart(
    df_tokens_places_summary,
    'expensive',
    'Apartments in UIUC surprisingly do not make the list here'
)

fig.show()
PSUBYUBYUBYUWhite Course ApartmentsLiberty on EighthCenter Pointe ApartmentsMoon Apartments0%2%4%6%8%10%12%14%
Percentage of text reviews with the keyword "expensive"Apartments in UIUC surprisingly do not make the list hereBusinessPercentage

👉 Percentage of tokens by campus

Get the number of reviews with review text by each campus.

In [51]:
df_campus_review_counts = df_m[df_m['review_text'].notna()].groupby('campus', as_index=False)['review_id'].count() \
    .rename(columns={'review_id': 'num_reviews'})

df_campus_review_counts
Out[51]:
campusnum_reviews
0Brigham Young University5085
1Penn State University Park4481
2University of Illinois Urbana-Champaign5315
In [52]:
df_tokens_campus_summary = df_tokens_places.groupby(['campus', 'token'], as_index=False) \
    .agg({'sentiment': 'size'}) \
    .rename(columns={'sentiment': 'unique_frequency'}) \
    .merge(
        right=df_campus_review_counts,
        on='campus'
    )

df_tokens_campus_summary['percentage'] = df_tokens_campus_summary['unique_frequency'] / df_tokens_campus_summary['num_reviews']
df_tokens_campus_summary['campus'] = df_tokens_campus_summary['campus'].map({
    'Brigham Young University': 'BYU',
    'Penn State University Park': 'PSU',
    'University of Illinois Urbana-Champaign': 'UIUC'
})
df_tokens_campus_summary.sort_values(['campus', 'token'], inplace=True)
df_tokens_campus_summary.head()
Out[52]:
campustokenunique_frequencynum_reviewspercentage
0BYUbug5950850.011603
1BYUconstruction4850850.009440
2BYUdeposit36450850.071583
3BYUelevator3650850.007080
4BYUexpensive22850850.044838
In [53]:
df_campus_positive_keywords = df_tokens_campus_summary[df_tokens_campus_summary['token'] \
    .isin(['gym', 'pool', 'pet'])] \
    .sort_values(['campus', 'token'])

df_campus_positive_keywords.head(3)
Out[53]:
campustokenunique_frequencynum_reviewspercentage
5BYUgym19150850.037561
13BYUpet3550850.006883
14BYUpool52650850.103441
In [54]:
df_campus_negative_keywords = df_tokens_campus_summary[df_tokens_campus_summary['token'] \
    .isin(['leak', 'deposit', 'elevator'])] \
    .sort_values(['campus', 'token'])

df_campus_negative_keywords.head(3)
Out[54]:
campustokenunique_frequencynum_reviewspercentage
2BYUdeposit36450850.071583
3BYUelevator3650850.007080
8BYUleak12550850.024582
In [55]:
df_campus_misc_keywords = df_tokens_campus_summary[df_tokens_campus_summary['token'] \
    .isin(['parking', 'expensive', 'construction', 'package'])] \
    .sort_values(['campus', 'token'])

df_campus_misc_keywords.head(3)
Out[55]:
campustokenunique_frequencynum_reviewspercentage
1BYUconstruction4850850.009440
4BYUexpensive22850850.044838
11BYUpackage6150850.011996
In [56]:
fig = px.bar(
    df_campus_positive_keywords,
    x='token',
    y='percentage',
    color='campus',
    color_discrete_map={
        'BYU': '#1FB3D1',
        'PSU': '#001E44',
        'UIUC': '#FF5F05'
    },
    title='<b>Percentage of reviews that includes specific positive keywords</b><br>\
<span style="color: #ccc;">Provo, UT\'s summer is hotter than State College, PA and Champaign, IL</span>',
    labels={
        'percentage': 'Percentage',
        'token': 'Keyword',
        'campus': 'Campus'
    },
    height=500,
    template='simple_white',
    barmode='group'
)

fig.update_layout(
    yaxis_tickformat=',.0%',
    font_family='Helvetica, Inter, Arial, sans-serif',
)

fig.show()
gympetpool0%2%4%6%8%10%
CampusBYUPSUUIUCPercentage of reviews that includes specific positive keywordsProvo, UT's summer is hotter than State College, PA and Champaign, ILKeywordPercentage
In [57]:
fig = px.bar(
    df_campus_negative_keywords,
    x='token',
    y='percentage',
    color='campus',
    color_discrete_map={
        'BYU': '#1FB3D1',
        'PSU': '#001E44',
        'UIUC': '#FF5F05'
    },
    title='<b>Percentage of reviews that includes specific negative keywords</b><br>\
<span style="color: #ccc;">BYU surprisingly has the highest proportion of reviews related to security deposits</span>',
    labels={
        'percentage': 'Percentage',
        'token': 'Keyword',
        'campus': 'Campus'
    },
    height=500,
    template='simple_white',
    barmode='group'
)

fig.update_layout(
    yaxis_tickformat=',.0%',
    font_family='Helvetica, Inter, Arial, sans-serif',
)

fig.show()
depositelevatorleak0%1%2%3%4%5%6%7%
CampusBYUPSUUIUCPercentage of reviews that includes specific negative keywordsBYU surprisingly has the highest proportion of reviews related to security depositsKeywordPercentage
In [60]:
fig = px.bar(
    df_campus_misc_keywords,
    x='token',
    y='percentage',
    color='campus',
    color_discrete_map={
        'BYU': '#1FB3D1',
        'PSU': '#001E44',
        'UIUC': '#FF5F05'
    },
    title='<b>Percentage of reviews that includes specific miscellaneous keywords</b><br>\
<span style="color: #ccc;">Parking is important to tenants at Provo, UT</span>',
    labels={
        'percentage': 'Percentage',
        'token': 'Keyword',
        'campus': 'Campus'
    },
    height=500,
    template='simple_white',
    barmode='group'
)

fig.update_layout(
    yaxis_tickformat=',.0%',
    font_family='Helvetica, Inter, Arial, sans-serif',
)

fig.show()
constructionexpensivepackageparking0%5%10%15%20%
CampusBYUPSUUIUCPercentage of reviews that includes specific miscellaneous keywordsParking is important to tenants at Provo, UTKeywordPercentage

⚖️ Closing Thoughts

  • What a surprise! UIUC has the highest proportion of 5-star reviews, and the best tenant sentiments out of the three college towns.
  • BYU has the worst tenant sentiments.
  • 48% of reviews in BYU (Provo, UT) are negative. This percentage far exceeds those of UIUC and PSU, which ranges about 34~36%.
  • The University Group @ UIUC is leading in not properly returning security deposits.
  • "Gym", "Pool", and "Pet" are often associated with positive reviews.
  • Common keywords in negative reviews are "leak", "deposit", "elevator", "construction", "bug", "parking", and "noise".
  • "Pool" and "Parking" are more relevant to tenants at BYU.
  • The average review ratings in UIUC and PSU have been improving.
  • On the other hand, this is not a good look for management companies at BYU.
  • Landlords with higher average ratings respond faster on average, compared to landlords with lower average ratings.