Natural language processing (NLP) aims to give computers the ability to understand, process, and even generate human language. This notebook introduces common preprocessing steps and demonstrates how to use a widely used transformer model (distilbert-base-uncased-finetuned-sst-2-english) to perform sentiment analysis. 😀😦🙁
import pandas as pd
import numpy as np
import plotly.express as px
pd.set_option('display.max_columns', 100)
🗃️ Load data¶
This exercise uses a small dataset that contains reviews of two apartments at Indiana University Bloomington.
df_b = pd.read_csv(
'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/businesses-sample.csv'
)
df_r = pd.read_csv(
'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/reviews-sample.csv',
parse_dates = ['review_datetime_utc', 'owner_answer_timestamp_datetime_utc']
)
# businesses
display(df_b.head(2))
# reviews table
display(df_r.head(2))
Print the info() of the two DataFrames.
df_b.info()
df_r.info()
The dataset has 312 reviews.
🪓 Preprocess review text using spaCy¶
spaCy is a powerful, open-source library for advanced Natural Language Processing (NLP) in Python and Cython. Designed specifically for production use, spaCy helps developers build applications that process and understand large volumes of text data efficiently.
spaCy is particularly useful for:
- Information extraction
- Natural language understanding systems
- Text pre-processing for deep learning
- Building chatbots and language-based applications
- Analyzing large volumes of unstructured text data
import spacy
Trained pipelines¶
Trained pipelines are models that enable spaCy to predict linguistic attributes in context:
- Part-of-speech tags
- Syntactic dependencies
- Named entities
'en_core_web_sm' is an English pipeline optimized for CPU.
Components:
- tok2vec
- tagger
- parser
- senter
- ner
- attribute_ruler
- lemmatizer
nlp = spacy.load('en_core_web_sm')
spacy.load() returns a Language object containing all components and data needed to process text. Calling the returned object on a string of text returns a processed Doc.
Source: https://spacy.io/usage/spacy-101
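To confirm the components listed above, you can inspect the loaded pipeline directly (a quick check; the exact component names may vary slightly across spaCy versions):
# list the processing components in the loaded pipeline
print(nlp.pipe_names)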
text = 'I love this apartment'
doc = nlp(text)
for token in doc:
print('------------------')
print(f'text: {token.text}')
print(f'lemma: {token.lemma_}')
print(f'pos: {token.pos_}') # pos_ stands for part-of-speech
print(f'explain: {spacy.explain(token.pos_)}')
print(f'is_stop: {token.is_stop}')
Visualize the dependency parse using displacy.render().
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)
Tokenization and lemmatization¶
Tokenization takes a piece of text and breaks it down into meaningful units called "tokens." These tokens can be individual words, punctuation marks, numbers, or even phrases depending on the task and chosen method.
Lemmatization goes a step further, focusing on the "base form" or "dictionary form" of words. It groups together different grammatical variations of the same word (like "playing," "plays," "played") and reduces them to their core meaning ("play"). This helps capture the true meaning of the text regardless of how they are used.
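As a quick illustration of the example above, the pipeline loaded earlier can be applied to a few inflected forms directly (a minimal check):
# the lemmatizer maps inflected forms such as "playing", "plays", and "played" to a shared base form
print([token.lemma_ for token in nlp('playing plays played')])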
cols = ["text", "lemma", "pos", "explain", "is_stop"]
rows = []
for t in doc:
row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
rows.append(row)
df_tokens = pd.DataFrame(rows, columns=cols)
df_tokens
cols = ["review_id", "text", "lemma", "pos", "explain", "is_stop"]
rows = []
for index, row in df_r[df_r['review_text'].notna()].iterrows():
doc = nlp(row['review_text'])
for t in doc:
new_row = [row['review_id'], t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
rows.append(new_row)
df_tokens = pd.DataFrame(rows, columns=cols)
df_tokens
Remove stop words¶
Stop words, as you might guess from the name, are a set of commonly used words in a language that are often filtered out before processing text in Natural Language Processing (NLP) tasks. These words, like "the," "a," "is," "and," "on," etc., are considered to carry little independent meaning and contribute minimally to the overall understanding of the text.
We remove the stop words here for two reasons:
- Reduce noise: By removing commonly used words, we focus on the content-rich keywords that convey the core meaning of the text.
- Improve efficiency: Removing stop words reduces the overall size of the data, making NLP tasks faster and less computationally expensive.
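spaCy ships with a built-in English stop-word list, which is what sets the is_stop flag used in the filtering step below (a quick peek at the list from the loaded pipeline):
# size of spaCy's English stop-word list and a few sample entries
print(len(nlp.Defaults.stop_words))
print(sorted(nlp.Defaults.stop_words)[:10])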
# keep only tokens that are not stop words
df_tokens_filtered = df_tokens[~df_tokens['is_stop']]
# remove lemmas shorter than 4 characters
df_tokens_filtered = df_tokens_filtered[df_tokens_filtered['lemma'].str.len() >= 4]
df_tokens_filtered
Display value counts.
df_tokens_filtered['lemma'].value_counts()
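Optionally, visualize the most frequent lemmas with plotly (a small sketch using the px import from the top of the notebook; the exact counts depend on your data):
# bar chart of the 15 most frequent lemmas after filtering
top_lemmas = df_tokens_filtered['lemma'].value_counts().head(15).reset_index()
top_lemmas.columns = ['lemma', 'count']
fig = px.bar(top_lemmas, x='lemma', y='count', title='Most frequent lemmas (stop words removed)')
fig.show()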
🧪 Sentiment analysis using DistilBERT¶
DistilBERT is a lightweight, efficient version of the BERT (Bidirectional Encoder Representations from Transformers) language model, designed for faster training and inference while maintaining competitive performance in natural language processing (NLP) tasks. Developed by HuggingFace, DistilBERT is a distilled version of BERT that retains 97% of its language understanding capabilities while being 40% smaller and 60% faster.
BERT is built on the transformer architecture; you can think of transformers as BERT's brain.
From Hugging Face's Documentation:
Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as:
📝 Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
What is BERT?¶
Here's a simple explanation of BERT and the transformer architecture for a five-year-old (with the help of perplexity.ai):
Imagine you're playing with a super-smart toy robot that can read and understand stories. This robot is called BERT. When BERT reads a story, it doesn't just look at one word at a time like some other robots. Instead, it looks at all the words in a sentence together, kind of like how you look at a whole picture instead of just one tiny part.
- Think of transformers as a team of helper robots working together to understand language.
- These helper robots have special "attention" powers. When they read a sentence, they can focus on different words at the same time, just like how you can look at different toys in your room all at once.
- The helpers talk to each other and share what they've learned about each word. This helps them understand the whole sentence better, like how you understand a story better when you and your friends talk about it together.
- These helper robots can learn from lots and lots of stories, so they become very good at understanding language, just like how you get better at reading the more books you read.
- After they've learned from many stories, they can help with all sorts of language tasks, like answering questions or figuring out if someone is happy or sad in a story.
So, BERT is like a super-smart reading buddy that uses these helper robots (transformers) to understand language in a way that's similar to how you understand stories – by looking at everything together and sharing information.
Transformers¶
Transformers, provided by Hugging Face, offers APIs to quickly download and use thousands of pretrained models for tasks on text, images, and audio.
Install the transformers package.
! pip install transformers
Pipelines¶
Pipelines are objects that abstract complex code from the Hugging Face library into simple APIs for inference tasks.
The "sentiment-analysis"
pipeline uses the default model
for sentiment analysis
(distilbert/distilbert-base-uncased-finetuned-sst-2-english
).
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
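If you prefer to pin the exact model instead of relying on the task default (and avoid the warning about an unpinned default model), you can name it explicitly; a minimal sketch:
# equivalent pipeline with the model named explicitly
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)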
Run the sentiment classifier¶
The distilbert-base-uncased-finetuned-sst-2-english model classifies an input text as either 'POSITIVE' or 'NEGATIVE' and returns a confidence score for the predicted label.
classifier("We are very happy to show you the 🤗 Transformers library.")
👆 For the 'POSITIVE' label: The score of 0.9997795224189758 indicates a very high confidence (about 99.98%) that the input text expresses a positive sentiment.
classifier("These thieves tried to steal my security deposit.")
👆 For the 'NEGATIVE' label: The score of 0.996752142906189 indicates a very high confidence (about 99.68%) that the input text expresses a negative sentiment.
You can supply multiple inputs to the pipeline as a list.
my_inputs = [
"You're the best!",
"You're the worst!"
]
classifier(my_inputs)
Sample 30 rows¶
Although the distilbert-base-uncased-finetuned-sst-2-english model is pre-trained and distilled (40% smaller than the original BERT model), running it over the entire dataset would still be slow.
For this demo, only sample 30 rows where
- review_text is not missing, and
- review_rating is less than or equal to 4 out of 5 stars.
df_sample = df_r[df_r['review_text'].notna() &
(df_r['review_rating'] <= 4)].sample(30)
df_sample[['review_rating', 'review_text']]
Run the classifier¶
truncation=True enables truncation of input sequences that exceed the maximum length accepted by the model. This prevents errors that would occur if the input text is too long for the model to process.
max_length=512 sets the maximum number of tokens that each input sequence can have after tokenization. If an input sequence is longer than this, it is truncated to fit within this limit. The value 512 is commonly used, as it is the maximum sequence length for many BERT-based models.
padding=True enables padding for input sequences that are shorter than the longest sequence in a batch. Shorter sequences are padded with a special padding token so that all sequences in the batch have the same length, which is necessary for efficient processing by the model.
classified_result = classifier(
df_sample['review_text'].tolist(),
truncation=True,
max_length=512,
padding=True,
)
classified_result
Check the result.
df_sample['sentiment'] = list(map(lambda f: f['label'], classified_result))
df_sample['score'] = list(map(lambda f: f['score'], classified_result))
df_sample[['review_text', 'sentiment', 'score']]
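As a rough sanity check, cross-tabulate the predicted labels against the star ratings (a small sketch; the exact counts depend on the random sample drawn above):
# compare predicted sentiment with review ratings
pd.crosstab(df_sample['review_rating'], df_sample['sentiment'])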
Alternatively, use a for loop to display progress¶
Passing a long list of text can be time-consuming. If you prefer tracking progress while the pipeline is running, use a for loop to run the classifier on each row. Print progress in each iteration.
# create new columns to store classified result
df_sample['sentiment'] = np.nan
df_sample['score'] = np.nan
# set the sentiment column to a string dtype
df_sample['sentiment'] = df_sample['sentiment'].astype(str)
num_rows = df_sample.shape[0]
for i in range(num_rows):
# store review text to a variable
review_text = df_sample['review_text'].iloc[i]
if pd.notna(review_text):
result = classifier(
review_text,
truncation=True,
padding=True,
max_length=512
)
df_sample.iloc[i, df_sample.columns.get_loc('sentiment')] = result[0]['label']
df_sample.iloc[i, df_sample.columns.get_loc('score')] = result[0]['score']
# display progress
progress_percentage = round((i + 1) / num_rows * 100, 2)
print(f'{i + 1}/{num_rows} ({progress_percentage}%)', end=' ')
if (i + 1) % 10 == 0:
print('')
print('====================')
print('Complete')
✨ Conclusion¶
Rule-based models like VADER struggle with nuanced language comprehension and context-dependent sentiments. BERT-based sentiment analysis models generally outperform rule-based models like VADER in terms of accuracy and nuanced understanding of context. BERT's bidirectional training allows it to grasp context from both directions, making it more effective in understanding complex sentiments.
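For a hands-on comparison, you can run the same review through VADER and the DistilBERT classifier side by side (a minimal sketch, assuming NLTK is installed and its VADER lexicon can be downloaded):
# compare a rule-based (VADER) and a transformer-based (DistilBERT) classifier on one review
import nltk
nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
from nltk.sentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()
sample_text = "These thieves tried to steal my security deposit."
print(vader.polarity_scores(sample_text))  # neg/neu/pos/compound scores from VADER
print(classifier(sample_text))             # label and confidence from the DistilBERT pipeline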
ModernBERT was recently introduced as a slot-in replacement for any BERT-like model. It is an improvement over BERT and its younger siblings in both speed and accuracy. Hugging Face expects ModernBERT to become the new standard for applications where encoder-only models are currently deployed.