Natural language processing (NLP) aims to give computers the ability to understand, process, and even generate human language. This notebook introduces common preprocessing steps and demonstrates how to use a widely used transformer model (distilbert-base-uncased-finetuned-sst-2-english) to perform sentiment analysis. 😀😦🙁

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
In [2]:
pd.set_option('display.max_columns', 100)

🗃️ Load data

This exercise uses a small dataset that contains reviews of two apartment complexes at Indiana University Bloomington.

  1. State On Campus Bloomington
  2. The Standard at Bloomington
In [3]:
df_b = pd.read_csv(
    'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/businesses-sample.csv'
)
df_r = pd.read_csv(
    'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/reviews-sample.csv',
    parse_dates = ['review_datetime_utc', 'owner_answer_timestamp_datetime_utc']
)
In [4]:
# businesses
display(df_b.head(2))
   campus                         | place_id                    | name                        | site                                               | category               | borough | street           | city        | postal_code | state   | latitude  | longitude  | verified
0  Indiana University Bloomington | ChIJY1yB5NJmbIgRZn7E2oF5gVQ | State On Campus Bloomington | https://stateoncampus.com/bloomington/?utm_sou...  | Apartment complex      | NaN     | 2036 N Walnut St | Bloomington | 47404       | Indiana | 39.184846 | -86.532875 | True
1  Indiana University Bloomington | ChIJPb8SbdpnbIgR82bkOLSKtZM | The Standard at Bloomington | https://www.thestandardbloomington.com/?utm_so... | Student housing center | NaN     | 250 E 14th St    | Bloomington | 47408       | Indiana | 39.175974 | -86.531609 | True
In [5]:
# reviews table
display(df_r.head(2))
   place_id                    | review_id                           | author_id             | author_title | review_text                                        | review_rating | review_img_url                                     | review_datetime_utc       | owner_answer                                       | owner_answer_timestamp_datetime_utc | review_likes
0  ChIJY1yB5NJmbIgRZn7E2oF5gVQ | ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | 109839330111495228413 | Aziz Bohra   | This place has my heart! Spacious rooms, quali...  | 5             | https://lh5.googleusercontent.com/p/AF1QipM0Jm...  | 2023-04-01 03:19:15+00:00 | Thanks for your feedback. We are grateful tha...   | 2022-11-01 18:54:45+00:00           | 0
1  ChIJY1yB5NJmbIgRZn7E2oF5gVQ | ChZDSUhNMG9nS0VJQ0FnSUNaOC15NkhBEAE | 102607480175477014087 | Nessa Bacher | I’ll start with the positives of living at Sta...  | 4             | NaN                                                | 2023-09-25 00:17:21+00:00 | We are so pleased to hear that you enjoy livin...  | 2023-09-18 14:20:59+00:00           | 0

Print the info() of the two DataFrames.

In [6]:
df_b.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   campus       2 non-null      object 
 1   place_id     2 non-null      object 
 2   name         2 non-null      object 
 3   site         2 non-null      object 
 4   category     2 non-null      object 
 5   borough      0 non-null      float64
 6   street       2 non-null      object 
 7   city         2 non-null      object 
 8   postal_code  2 non-null      int64  
 9   state        2 non-null      object 
 10  latitude     2 non-null      float64
 11  longitude    2 non-null      float64
 12  verified     2 non-null      bool   
dtypes: bool(1), float64(3), int64(1), object(8)
memory usage: 322.0+ bytes
In [7]:
df_r.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312 entries, 0 to 311
Data columns (total 11 columns):
 #   Column                               Non-Null Count  Dtype              
---  ------                               --------------  -----              
 0   place_id                             312 non-null    object             
 1   review_id                            312 non-null    object             
 2   author_id                            312 non-null    object             
 3   author_title                         312 non-null    object             
 4   review_text                          236 non-null    object             
 5   review_rating                        312 non-null    int64              
 6   review_img_url                       6 non-null      object             
 7   review_datetime_utc                  312 non-null    datetime64[ns, UTC]
 8   owner_answer                         245 non-null    object             
 9   owner_answer_timestamp_datetime_utc  245 non-null    datetime64[ns, UTC]
 10  review_likes                         312 non-null    int64              
dtypes: datetime64[ns, UTC](2), int64(2), object(7)
memory usage: 26.9+ KB

The dataset has 312 reviews, 236 of which include review text.

🪓 Preprocess review text using spaCy

spaCy is a powerful, open-source library for advanced Natural Language Processing (NLP) in Python and Cython. Designed specifically for production use, spaCy helps developers build applications that process and understand large volumes of text data efficiently.

spaCy is particularly useful for:

  • Information extraction
  • Natural language understanding systems
  • Text pre-processing for deep learning
  • Building chatbots and language-based applications
  • Analyzing large volumes of unstructured text data
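
If spaCy and its small English pipeline are not already installed in your environment (an assumption about the setup), both can be installed from a notebook cell:

! pip install spacy
! python -m spacy download en_core_web_sm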
In [8]:
import spacy

Trained pipelines

Trained pipelines are models that enable spaCy to predict linguistic attributes in context:

  • Part-of-speech tags
  • Syntactic dependencies
  • Named entities

'en_core_web_sm' is an English pipeline optimized for CPU.

Components:

  • tok2vec
  • tagger
  • parser
  • senter
  • ner
  • attribute_ruler
  • lemmatizer
In [9]:
nlp = spacy.load('en_core_web_sm')

spacy.load() returns a Language object containing all components and data needed to process text. Calling the returned object on a string of text will return a processed Doc.

Source: https://spacy.io/usage/spacy-101
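
As a quick sanity check, you can list the components of the loaded pipeline (a small sketch; the exact component list depends on the installed pipeline version):

# names of the processing components in the loaded pipeline
print(nlp.pipe_names)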

In [10]:
text = 'I love this apartment'
doc = nlp(text)

for token in doc:
    print('------------------')
    print(f'text: {token.text}')
    print(f'lemma: {token.lemma_}')
    print(f'pos: {token.pos_}') # pos_ stands for part-of-speech
    print(f'explain: {spacy.explain(token.pos_)}')
    print(f'is_stop: {token.is_stop}')
------------------
text: I
lemma: I
pos: PRON
explain: pronoun
is_stop: True
------------------
text: love
lemma: love
pos: VERB
explain: verb
is_stop: False
------------------
text: this
lemma: this
pos: DET
explain: determiner
is_stop: True
------------------
text: apartment
lemma: apartment
pos: NOUN
explain: noun
is_stop: False

Visualize the dependency parse using displacy.render().

In [11]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)
[displacy dependency parse of "I love this apartment": nsubj(love, I), det(apartment, this), dobj(love, apartment)]

Tokenization and lemmatization

Tokenization takes a piece of text and breaks it down into meaningful units called "tokens." These tokens can be individual words, punctuation marks, numbers, or even phrases depending on the task and chosen method.

Lemmatization goes a step further, focusing on the "base form" or "dictionary form" of words. It groups together different grammatical variations of the same word (like "playing," "plays," "played") and reduces them to their core form ("play"). This helps capture the meaning of the text regardless of how the words are inflected, as illustrated below.
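
As a small illustration of lemmatization (a sketch reusing the nlp pipeline loaded above), different inflections of "play" reduce to the same lemma:

# each inflected form of "play" maps back to the lemma "play"
for token in nlp('She plays, he played, they were playing'):
    print(token.text, '->', token.lemma_)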

In [12]:
cols = ["text", "lemma", "pos", "explain", "is_stop"]
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df_tokens = pd.DataFrame(rows, columns=cols)
df_tokens
Out[12]:
   text      | lemma     | pos  | explain    | is_stop
0  I         | I         | PRON | pronoun    | True
1  love      | love      | VERB | verb       | False
2  this      | this      | DET  | determiner | True
3  apartment | apartment | NOUN | noun       | False
In [13]:
cols = ["review_id", "text", "lemma", "pos", "explain", "is_stop"]
rows = []

for index, row in df_r[df_r['review_text'].notna()].iterrows():
    doc = nlp(row['review_text'])
    for t in doc:
        new_row = [row['review_id'], t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
        rows.append(new_row)

df_tokens = pd.DataFrame(rows, columns=cols)
df_tokens
Out[13]:
       review_id                           | text   | lemma  | pos   | explain     | is_stop
0      ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | This   | this   | DET   | determiner  | True
1      ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | place  | place  | NOUN  | noun        | False
2      ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | has    | have   | VERB  | verb        | True
3      ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | my     | my     | PRON  | pronoun     | True
4      ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | heart  | heart  | NOUN  | noun        | False
...
18255  ChZDSUhNMG9nS0VJQ0FnSUNSdUpEV2FREAE | !      | !      | PUNCT | punctuation | False
18256  ChZDSUhNMG9nS0VJQ0FnSUNSeUp1dUVBEAE | Hooray | Hooray | PROPN | proper noun | False
18257  ChZDSUhNMG9nS0VJQ0FnSUNSeUp1dUVBEAE | !      | !      | PUNCT | punctuation | False
18258  ChZDSUhNMG9nS0VJQ0FnSUR4aDZQckl3EAE | wowie  | wowie  | VERB  | verb        | False
18259  ChZDSUhNMG9nS0VJQ0FnSUR4aDZQckl3EAE | zowa   | zowa   | PROPN | proper noun | False

18260 rows × 6 columns
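
The loop above calls nlp() once per review inside iterrows(). spaCy's nlp.pipe() streams many texts through the pipeline in batches and is usually noticeably faster; a rough equivalent (a sketch, not benchmarked here):

# process all non-missing review texts in batches with nlp.pipe()
texts = df_r.loc[df_r['review_text'].notna(), ['review_id', 'review_text']]

rows = []
for review_id, doc in zip(texts['review_id'], nlp.pipe(texts['review_text'])):
    for t in doc:
        rows.append([review_id, t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop])

# should produce the same table as df_tokens above
df_tokens_piped = pd.DataFrame(rows, columns=cols)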

Remove stop words

Stop words, as you might guess from the name, are a set of commonly used words in a language that are often filtered out before processing text in Natural Language Processing (NLP) tasks. These words, like "the," "a," "is," "and," "on," etc., are considered to carry little independent meaning and contribute minimally to the overall understanding of the text.

We remove the stop words here for two reasons:

  1. Reduce noise: By removing commonly used words, we focus on the content-rich keywords that convey the core meaning of the text.
  2. Improve efficiency: Removing stop words reduces the overall size of the data, making NLP tasks faster and less computationally expensive.
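
spaCy ships with a built-in English stop-word list; a quick way to inspect it before filtering (a small sketch using the pipeline loaded above):

# spaCy's built-in English stop-word set
stop_words = nlp.Defaults.stop_words
print(len(stop_words))           # roughly a few hundred entries
print(sorted(stop_words)[:10])   # a peek at the first few, alphabetically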
In [14]:
# keep only non-stop words (filter out stop words)
df_tokens_filtered = df_tokens[~df_tokens['is_stop']]

# remove tokens whose lemma is shorter than 4 characters
df_tokens_filtered = df_tokens_filtered[df_tokens_filtered['lemma'].str.len() >= 4]

df_tokens_filtered
Out[14]:
       review_id                           | text     | lemma    | pos   | explain     | is_stop
1      ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | place    | place    | NOUN  | noun        | False
4      ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | heart    | heart    | NOUN  | noun        | False
6      ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | Spacious | spacious | ADJ   | adjective   | False
7      ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | rooms    | room     | NOUN  | noun        | False
9      ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE | quality  | quality  | NOUN  | noun        | False
...
18252  ChZDSUhNMG9nS0VJQ0FnSUNSdUpEV2FREAE | staff    | staff    | NOUN  | noun        | False
18254  ChZDSUhNMG9nS0VJQ0FnSUNSdUpEV2FREAE | awesome  | awesome  | ADJ   | adjective   | False
18256  ChZDSUhNMG9nS0VJQ0FnSUNSeUp1dUVBEAE | Hooray   | Hooray   | PROPN | proper noun | False
18258  ChZDSUhNMG9nS0VJQ0FnSUR4aDZQckl3EAE | wowie    | wowie    | VERB  | verb        | False
18259  ChZDSUhNMG9nS0VJQ0FnSUR4aDZQckl3EAE | zowa     | zowa     | PROPN | proper noun | False

5948 rows × 6 columns

Display value counts.

In [15]:
df_tokens_filtered['lemma'].value_counts()
Out[15]:
lemma
apartment     125
live          119
place          98
staff          73
helpful        62
             ... 
respectful      1
exit            1
hook            1
climbing        1
rock            1
Name: count, Length: 1470, dtype: int64
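
Since plotly.express was imported at the top of the notebook, a quick bar chart of the most frequent lemmas makes these counts easier to scan (a sketch; the 'lemma' and 'count' column names match the value_counts() output above):

# bar chart of the 20 most frequent lemmas
top_lemmas = df_tokens_filtered['lemma'].value_counts().head(20).reset_index()
fig = px.bar(top_lemmas, x='lemma', y='count', title='Top 20 lemmas in review text')
fig.show()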

🧪 Sentiment analysis using DistilBERT

DistilBERT is a lightweight, efficient version of the BERT (Bidirectional Encoder Representations from Transformers) language model, designed for faster training and inference while maintaining competitive performance in natural language processing (NLP) tasks. Developed by HuggingFace, DistilBERT is a distilled version of BERT that retains 97% of its language understanding capabilities while being 40% smaller and 60% faster.

The transformer architecture is like BERT's brain.

From Hugging Face's Documentation:

Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as:

📝 Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.

What is BERT?

Here's a simple explanation of BERT and the transformer architecture for a five-year-old (with the help of perplexity.ai):

Imagine you're playing with a super-smart toy robot that can read and understand stories. This robot is called BERT. When BERT reads a story, it doesn't just look at one word at a time like some other robots. Instead, it looks at all the words in a sentence together, kind of like how you look at a whole picture instead of just one tiny part.

  1. Think of transformers as a team of helper robots working together to understand language.
  2. These helper robots have special "attention" powers. When they read a sentence, they can focus on different words at the same time, just like how you can look at different toys in your room all at once.
  3. The helpers talk to each other and share what they've learned about each word. This helps them understand the whole sentence better, like how you understand a story better when you and your friends talk about it together.
  4. These helper robots can learn from lots and lots of stories, so they become very good at understanding language, just like how you get better at reading the more books you read.
  5. After they've learned from many stories, they can help with all sorts of language tasks, like answering questions or figuring out if someone is happy or sad in a story.

So, BERT is like a super-smart reading buddy that uses these helper robots (transformers) to understand language in a way that's similar to how you understand stories – by looking at everything together and sharing information.

Transformers

Transformers, provided by Hugging Face, offers APIs to quickly download and use thousands of pretrained models to perform tasks on text, images, and audio.

Install the transformers package.

In [16]:
! pip install transformers
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: transformers in /home/jupyter-subwaymatch/.local/lib/python3.10/site-packages (4.48.0)
Requirement already satisfied: filelock in /home/jupyter-subwaymatch/.local/lib/python3.10/site-packages (from transformers) (3.16.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /home/jupyter-subwaymatch/.local/lib/python3.10/site-packages (from transformers) (0.27.1)
Requirement already satisfied: numpy>=1.17 in /home/jupyter-subwaymatch/.local/lib/python3.10/site-packages (from transformers) (2.1.2)
Requirement already satisfied: packaging>=20.0 in /opt/tljh/user/lib/python3.10/site-packages (from transformers) (24.1)
Requirement already satisfied: pyyaml>=5.1 in /opt/tljh/user/lib/python3.10/site-packages (from transformers) (6.0.2)
Requirement already satisfied: regex!=2019.12.17 in /home/jupyter-subwaymatch/.local/lib/python3.10/site-packages (from transformers) (2024.11.6)
Requirement already satisfied: requests in /opt/tljh/user/lib/python3.10/site-packages (from transformers) (2.32.3)
Requirement already satisfied: tokenizers<0.22,>=0.21 in /home/jupyter-subwaymatch/.local/lib/python3.10/site-packages (from transformers) (0.21.0)
Requirement already satisfied: safetensors>=0.4.1 in /home/jupyter-subwaymatch/.local/lib/python3.10/site-packages (from transformers) (0.5.2)
Requirement already satisfied: tqdm>=4.27 in /opt/tljh/user/lib/python3.10/site-packages (from transformers) (4.65.0)
Requirement already satisfied: fsspec>=2023.5.0 in /home/jupyter-subwaymatch/.local/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (2024.12.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/tljh/user/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/tljh/user/lib/python3.10/site-packages (from requests->transformers) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /opt/tljh/user/lib/python3.10/site-packages (from requests->transformers) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/tljh/user/lib/python3.10/site-packages (from requests->transformers) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /opt/tljh/user/lib/python3.10/site-packages (from requests->transformers) (2022.12.7)

Pipelines

Pipelines are objects that abstract complex code from the Hugging Face library into simple APIs for inference tasks.

The "sentiment-analysis" pipeline uses the default model for sentiment analysis (distilbert/distilbert-base-uncased-finetuned-sst-2-english).

In [17]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
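
The warning above can be avoided by pinning the model (and, optionally, the revision) explicitly, for example (a sketch using the model id and revision printed in the warning):

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    revision="714eb0f",
)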

Run the sentiment classifier

The distilbert-base-uncased-finetuned-sst-2-english model classifies an input text as 'POSITIVE' or 'NEGATIVE' and returns a confidence score for the predicted label.

In [18]:
classifier("We are very happy to show you the 🤗 Transformers library.")
Out[18]:
[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

👆 For the 'POSITIVE' label: The score of 0.9997795224189758 indicates a very high confidence (about 99.98%) that the input text expresses a positive sentiment.
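
If you want the scores for both labels rather than only the top one, the text-classification pipeline accepts a top_k argument (a sketch; exact behavior depends on your transformers version):

# top_k=None returns one dict per label; the two scores sum to 1
classifier("We are very happy to show you the 🤗 Transformers library.", top_k=None)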

In [19]:
classifier("These thieves tried to steal my security deposit.")
Out[19]:
[{'label': 'NEGATIVE', 'score': 0.996752142906189}]

👆 For the 'NEGATIVE' label: The score of 0.996752142906189 indicates a very high confidence (about 99.68%) that the input text expresses a negative sentiment.

You can supply multiple inputs to the pipeline as a list.

In [20]:
my_inputs = [
    "You're the best!",
    "You're the worst!"
]

classifier(my_inputs)
Out[20]:
[{'label': 'POSITIVE', 'score': 0.9998639822006226},
 {'label': 'NEGATIVE', 'score': 0.9997650980949402}]

Sample 30 rows

Although the distilbert-base-uncased-finetuned-sst-2-english model is pretrained and distilled (40% smaller than the original BERT model), running it over the entire dataset on a CPU would still take a while.

For this demo, sample only 30 rows where

  1. review_text is not missing, and
  2. review_rating is less than or equal to 4 out of 5 stars.
In [21]:
df_sample = df_r[df_r['review_text'].notna() & 
    (df_r['review_rating'] <= 4)].sample(30)

df_sample[['review_rating', 'review_text']]
Out[21]:
     review_rating | review_text
60   1 | Office staff extremely rude. Have laundry mach...
5    1 | NEVER LIVE HERE!! They don't charge a security...
68   1 | I lived here for a year and it was FAR too lon...
193  1 | Avoid living here at all costs. The other revi...
49   1 | Very poor management. They seem nice and trea...
65   1 | They have maintenance come without any warning...
194  1 | This is one of the most pricey places to live ...
212  1 | Bribed students to write a review to be entere...
88   3 | Great place to live! But when it gets snowy it...
207  2 | ** The reason there are so many positive revie...
62   1 | Edited to respond to comments from management:...
41   3 | Honestly 2.5 stars is more appropriate but I'm...
76   1 | DO NOT LIVE HERE. Spaces are smaller than what...
53   1 | DO NOT LIVE HERE!!!!\nManagement is horrible a...
87   1 | This complex is a joke. They don’t answer your...
47   1 | Extremely upset after finding out that “the co...
54   1 | Management here is not that great. I understan...
35   1 | My daughter rented at this horrible complex. P...
31   1 | Staff is deplorable. Maintenance let my dog ou...
90   4 | After a minor dispute on move out charges the ...
93   1 | The pipe burst this apartment upstairs office ...
97   1 | DON'T LIVE HERE! They faked on the advertiseme...
26   1 | This place is the absolute worse. They get mon...
203  1 | Visited the Standard to take a tour a few days...
221  1 | I am extremely disappointed in the Standard. H...
63   1 | This is the most horrible apartment complex I ...
100  4 | Clean space with cool atmosphere. Place can be...
130  1 | Over priced, people around you are loud and r...
17   1 | I stayed in this terrible apartment a few year...
196  1 | There is a big lack of communication around ma...

Run the classifier

truncation=True enables truncation of input sequences that exceed the maximum length accepted by the model. This prevents errors that would occur if the input text is too long for the model to process.

max_length=512 sets the maximum number of tokens that each input sequence can have after tokenization. If an input sequence is longer than this, it will be truncated to fit within this limit. The value 512 is commonly used, as it's the maximum sequence length for many BERT-based models.

padding=True enables padding for input sequences that are shorter than the longest sequence in the batch. This ensures that all input sequences in a batch have the same length, which is necessary for efficient processing by the model. Shorter sequences are padded with a special padding token.

In [22]:
classified_result = classifier(
    df_sample['review_text'].tolist(),
    truncation=True,
    max_length=512,
    padding=True,
)

classified_result
Out[22]:
[{'label': 'NEGATIVE', 'score': 0.999226450920105},
 {'label': 'NEGATIVE', 'score': 0.9996181726455688},
 {'label': 'NEGATIVE', 'score': 0.9997121691703796},
 {'label': 'NEGATIVE', 'score': 0.9996076226234436},
 {'label': 'NEGATIVE', 'score': 0.9990652203559875},
 {'label': 'NEGATIVE', 'score': 0.9995266199111938},
 {'label': 'NEGATIVE', 'score': 0.9992621541023254},
 {'label': 'NEGATIVE', 'score': 0.9993144273757935},
 {'label': 'POSITIVE', 'score': 0.9229395985603333},
 {'label': 'NEGATIVE', 'score': 0.9996793270111084},
 {'label': 'NEGATIVE', 'score': 0.9989628791809082},
 {'label': 'NEGATIVE', 'score': 0.9937936067581177},
 {'label': 'NEGATIVE', 'score': 0.998755693435669},
 {'label': 'NEGATIVE', 'score': 0.9998036026954651},
 {'label': 'NEGATIVE', 'score': 0.9996813535690308},
 {'label': 'NEGATIVE', 'score': 0.9992544054985046},
 {'label': 'NEGATIVE', 'score': 0.9995424747467041},
 {'label': 'NEGATIVE', 'score': 0.9993053674697876},
 {'label': 'NEGATIVE', 'score': 0.9993048906326294},
 {'label': 'POSITIVE', 'score': 0.9937560558319092},
 {'label': 'NEGATIVE', 'score': 0.999752938747406},
 {'label': 'NEGATIVE', 'score': 0.9948915243148804},
 {'label': 'NEGATIVE', 'score': 0.9997307658195496},
 {'label': 'NEGATIVE', 'score': 0.9974057078361511},
 {'label': 'NEGATIVE', 'score': 0.9996538162231445},
 {'label': 'NEGATIVE', 'score': 0.9995594620704651},
 {'label': 'POSITIVE', 'score': 0.8697509765625},
 {'label': 'NEGATIVE', 'score': 0.9998201727867126},
 {'label': 'NEGATIVE', 'score': 0.9997301697731018},
 {'label': 'NEGATIVE', 'score': 0.9996767044067383}]

Check the result.

In [23]:
df_sample['sentiment'] = list(map(lambda f: f['label'], classified_result))
df_sample['score'] = list(map(lambda f: f['score'], classified_result))

df_sample[['review_text', 'sentiment', 'score']]
Out[23]:
     review_text                                        | sentiment | score
60   Office staff extremely rude. Have laundry mach...  | NEGATIVE  | 0.999226
5    NEVER LIVE HERE!! They don't charge a security...  | NEGATIVE  | 0.999618
68   I lived here for a year and it was FAR too lon...  | NEGATIVE  | 0.999712
193  Avoid living here at all costs. The other revi...  | NEGATIVE  | 0.999608
49   Very poor management. They seem nice and trea...   | NEGATIVE  | 0.999065
65   They have maintenance come without any warning...  | NEGATIVE  | 0.999527
194  This is one of the most pricey places to live ...  | NEGATIVE  | 0.999262
212  Bribed students to write a review to be entere...  | NEGATIVE  | 0.999314
88   Great place to live! But when it gets snowy it...  | POSITIVE  | 0.922940
207  ** The reason there are so many positive revie...  | NEGATIVE  | 0.999679
62   Edited to respond to comments from management:...  | NEGATIVE  | 0.998963
41   Honestly 2.5 stars is more appropriate but I'm...  | NEGATIVE  | 0.993794
76   DO NOT LIVE HERE. Spaces are smaller than what...  | NEGATIVE  | 0.998756
53   DO NOT LIVE HERE!!!!\nManagement is horrible a...  | NEGATIVE  | 0.999804
87   This complex is a joke. They don’t answer your...  | NEGATIVE  | 0.999681
47   Extremely upset after finding out that “the co...  | NEGATIVE  | 0.999254
54   Management here is not that great. I understan...  | NEGATIVE  | 0.999542
35   My daughter rented at this horrible complex. P...   | NEGATIVE  | 0.999305
31   Staff is deplorable. Maintenance let my dog ou...  | NEGATIVE  | 0.999305
90   After a minor dispute on move out charges the ...  | POSITIVE  | 0.993756
93   The pipe burst this apartment upstairs office ...  | NEGATIVE  | 0.999753
97   DON'T LIVE HERE! They faked on the advertiseme...  | NEGATIVE  | 0.994892
26   This place is the absolute worse. They get mon...  | NEGATIVE  | 0.999731
203  Visited the Standard to take a tour a few days...  | NEGATIVE  | 0.997406
221  I am extremely disappointed in the Standard. H...  | NEGATIVE  | 0.999654
63   This is the most horrible apartment complex I ...  | NEGATIVE  | 0.999559
100  Clean space with cool atmosphere. Place can be...  | POSITIVE  | 0.869751
130  Over priced, people around you are loud and r...   | NEGATIVE  | 0.999820
17   I stayed in this terrible apartment a few year...  | NEGATIVE  | 0.999730
196  There is a big lack of communication around ma...  | NEGATIVE  | 0.999677

Alternatively, use a for loop to display progress

Passing a long list of texts in a single call can be time-consuming, and you get no feedback until it finishes. If you prefer tracking progress while the pipeline is running, use a for loop to run the classifier on each row and print the progress in each iteration.

In [24]:
# create new columns to store classified result
df_sample['sentiment'] = np.nan
df_sample['score'] = np.nan

# set the sentiment column to a string dtype
df_sample['sentiment'] = df_sample['sentiment'].astype(str)
In [25]:
num_rows = df_sample.shape[0]

for i in range(num_rows):
  # store review text to a variable
  review_text = df_sample['review_text'].iloc[i]

  if pd.notna(review_text):
    result = classifier(
        review_text,
        truncation=True,
        padding=True,
        max_length=512
    )
    
    df_sample.iloc[i, df_sample.columns.get_loc('sentiment')] = result[0]['label']
    df_sample.iloc[i, df_sample.columns.get_loc('score')] = result[0]['score']

  # display progress
  progress_percentage = round((i + 1) / num_rows * 100, 2)
  print(f'{i + 1}/{num_rows} ({progress_percentage}%)', end=' ')

  if (i + 1) % 10 == 0:
    print('')

print('====================')
print('Complete')
1/30 (3.33%) 2/30 (6.67%) 3/30 (10.0%) 4/30 (13.33%) 5/30 (16.67%) 6/30 (20.0%) 7/30 (23.33%) 8/30 (26.67%) 9/30 (30.0%) 10/30 (33.33%) 
11/30 (36.67%) 12/30 (40.0%) 13/30 (43.33%) 14/30 (46.67%) 15/30 (50.0%) 16/30 (53.33%) 17/30 (56.67%) 18/30 (60.0%) 19/30 (63.33%) 20/30 (66.67%) 
21/30 (70.0%) 22/30 (73.33%) 23/30 (76.67%) 24/30 (80.0%) 25/30 (83.33%) 26/30 (86.67%) 27/30 (90.0%) 28/30 (93.33%) 29/30 (96.67%) 30/30 (100.0%) 
====================
Complete
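
If tqdm is available in your environment (an assumption about the setup), the same loop can show a progress bar with less manual bookkeeping:

from tqdm import tqdm

sentiments, scores = [], []

# df_sample already excludes rows with missing review_text
for review_text in tqdm(df_sample['review_text'], total=len(df_sample)):
    result = classifier(review_text, truncation=True, padding=True, max_length=512)
    sentiments.append(result[0]['label'])
    scores.append(result[0]['score'])

df_sample['sentiment'] = sentiments
df_sample['score'] = scores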

✨ Conclusion

Rule-based models like VADER struggle with nuanced language comprehension and context-dependent sentiments. BERT-based sentiment analysis models generally outperform rule-based models like VADER in terms of accuracy and nuanced understanding of context. BERT's bidirectional training allows it to grasp context from both directions, making it more effective in understanding complex sentiments.
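
For a concrete comparison, a rule-based scorer can be run on the same kind of text; a minimal sketch, assuming the vaderSentiment package is installed:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# 'compound' ranges from -1 (most negative) to +1 (most positive)
analyzer.polarity_scores("These thieves tried to steal my security deposit.")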

ModernBERT was recently introduced as a drop-in replacement for BERT-like models. It is an improvement over its predecessors in both speed and accuracy, and Hugging Face expects ModernBERT to become the new standard for applications where encoder-only models are now deployed.