Natural language processing (NLP) aims to give computers the ability to understand, process, and even generate human language. This notebook introduces common preprocessing steps and demonstrates how to use a widely used transformer model (distilbert-base-uncased-finetuned-sst-2-english) to perform sentiment analysis. 😀😦🙁
import pandas as pd
import numpy as np
import plotly.express as px
pd.set_option('display.max_columns', 100)
🗃️ Load data¶
This exercise uses a small dataset that contains reviews of two apartments at Indiana University Bloomington.
df_b = pd.read_csv(
'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/businesses-sample.csv'
)
df_r = pd.read_csv(
'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/reviews-sample.csv',
parse_dates = ['review_datetime_utc', 'owner_answer_timestamp_datetime_utc']
)
# businesses
display(df_b.head(2))
# reviews table
display(df_r.head(2))
Print the info() of the two DataFrames.
df_b.info()
df_r.info()
The dataset has 312 reviews.
🪓 Preprocess review text using spaCy¶
spaCy is a powerful, open-source library for advanced Natural Language Processing (NLP) in Python and Cython. Designed specifically for production use, spaCy helps developers build applications that process and understand large volumes of text data efficiently.
spaCy is particularly useful for:
- Information extraction
- Natural language understanding systems
- Text pre-processing for deep learning
- Building chatbots and language-based applications
- Analyzing large volumes of unstructured text data
import spacy
Trained pipelines¶
Trained pipelines are models that enable spaCy to predict linguistic attributes in context:
- Part-of-speech tags
- Syntactic dependencies
- Named entities
'en_core_web_sm' is an English pipeline optimized for CPU.
Components:
- tok2vec
- tagger
- parser
- senter
- ner
- attribute_ruler
- lemmatizer
nlp = spacy.load('en_core_web_sm')
spacy.load() returns a Language object containing all components and data needed to process text. Calling the returned object on a string of text returns a processed Doc.
Source: https://spacy.io/usage/spacy-101
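To confirm the components listed above, you can inspect the loaded pipeline directly (a quick check; the exact component names may vary slightly across spaCy versions):
# list the processing components in the loaded pipeline
print(nlp.pipe_names)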
text = 'I love this apartment'
doc = nlp(text)
for token in doc:
print('------------------')
print(f'text: {token.text}')
print(f'lemma: {token.lemma_}')
print(f'pos: {token.pos_}') # pos_ stands for part-of-speech
print(f'explain: {spacy.explain(token.pos_)}')
print(f'is_stop: {token.is_stop}')
Visualize the dependency parse using displacy.render().
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)
Tokenization and lemmatization¶
Tokenization takes a piece of text and breaks it down into meaningful units called "tokens." These tokens can be individual words, punctuation marks, numbers, or even phrases depending on the task and chosen method.
Lemmatization goes a step further, focusing on the "base form" or "dictionary form" of words. It groups together different grammatical variations of the same word (like "playing," "plays," "played") and reduces them to their core meaning ("play"). This helps capture the true meaning of the text regardless of how they are used.
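As a quick illustration of the example above, the pipeline loaded earlier can be applied to a few inflected forms directly (a minimal check):
# the lemmatizer maps inflected forms such as "playing", "plays", and "played" to a shared base form
print([token.lemma_ for token in nlp('playing plays played')])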
cols = ["text", "lemma", "pos", "explain", "is_stop"]
rows = []
for t in doc:
row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
rows.append(row)
df_tokens = pd.DataFrame(rows, columns=cols)
df_tokens
cols = ["review_id", "text", "lemma", "pos", "explain", "is_stop"]
rows = []
for index, row in df_r[df_r['review_text'].notna()].iterrows():
doc = nlp(row['review_text'])
for t in doc:
new_row = [row['review_id'], t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
rows.append(new_row)
df_tokens = pd.DataFrame(rows, columns=cols)
df_tokens
Remove stop words¶
Stop words, as you might guess from the name, are a set of commonly used words in a language that are often filtered out before processing text in Natural Language Processing (NLP) tasks. These words, like "the," "a," "is," "and," "on," etc., are considered to carry little independent meaning and contribute minimally to the overall understanding of the text.
We remove the stop words here for two reasons:
- Reduce noise: By removing commonly used words, we focus on the content-rich keywords that convey the core meaning of the text.
- Improve efficiency: Removing stop words reduces the overall size of the data, making NLP tasks faster and less computationally expensive.
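spaCy ships with a built-in English stop-word list, which is what sets the is_stop flag used in the filtering step below (a quick peek at the list from the loaded pipeline):
# size of spaCy's English stop-word list and a few sample entries
print(len(nlp.Defaults.stop_words))
print(sorted(nlp.Defaults.stop_words)[:10])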
# keep only tokens that are not stop words
df_tokens_filtered = df_tokens[~df_tokens['is_stop']]
# remove lemmas shorter than 4 characters
df_tokens_filtered = df_tokens_filtered[df_tokens_filtered['lemma'].str.len() >= 4]
df_tokens_filtered
Display value counts.
df_tokens_filtered['lemma'].value_counts()
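Optionally, visualize the most frequent lemmas with plotly (a small sketch using the px import from the top of the notebook; the exact counts depend on your data):
# bar chart of the 15 most frequent lemmas after filtering
top_lemmas = df_tokens_filtered['lemma'].value_counts().head(15).reset_index()
top_lemmas.columns = ['lemma', 'count']
fig = px.bar(top_lemmas, x='lemma', y='count', title='Most frequent lemmas (stop words removed)')
fig.show()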
🧪 Sentiment analysis using DistilBERT¶
DistilBERT is a lightweight, efficient version of the BERT (Bidirectional Encoder Representations from Transformers) language model, designed for faster training and inference while maintaining competitive performance in natural language processing (NLP) tasks. Developed by HuggingFace, DistilBERT is a distilled version of BERT that retains 97% of its language understanding capabilities while being 40% smaller and 60% faster.
BERT is built on the transformer architecture; you can think of transformers as BERT's brain.
From Hugging Face's Documentation:
Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as:
📝 Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
What is BERT?¶
Here's a simple explanation of BERT and the transformer architecture for a five-year-old (with the help of perplexity.ai):
Imagine you're playing with a super-smart toy robot that can read and understand stories. This robot is called BERT. When BERT reads a story, it doesn't just look at one word at a time like some other robots. Instead, it looks at all the words in a sentence together, kind of like how you look at a whole picture instead of just one tiny part.
- Think of transformers as a team of helper robots working together to understand language.
- These helper robots have special "attention" powers. When they read a sentence, they can focus on different words at the same time, just like how you can look at different toys in your room all at once.
- The helpers talk to each other and share what they've learned about each word. This helps them understand the whole sentence better, like how you understand a story better when you and your friends talk about it together.
- These helper robots can learn from lots and lots of stories, so they become very good at understanding language, just like how you get better at reading the more books you read.
- After they've learned from many stories, they can help with all sorts of language tasks, like answering questions or figuring out if someone is happy or sad in a story.
So, BERT is like a super-smart reading buddy that uses these helper robots (transformers) to understand language in a way that's similar to how you understand stories – by looking at everything together and sharing information.
Transformers¶
Transformers, provided by Hugging Face, offers APIs to quickly download and use thousands of pretrained models for tasks on text, images, and audio.
Install the transformers package.
! pip install transformers
Pipelines¶
Pipelines are objects that abstract complex code from the Hugging Face library into simple APIs for inference tasks.
The "sentiment-analysis"
pipeline uses the default model
for sentiment analysis
(distilbert/distilbert-base-uncased-finetuned-sst-2-english
).
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
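If you prefer to pin the exact model instead of relying on the task default (and avoid the warning about an unpinned default model), you can name it explicitly; a minimal sketch:
# equivalent pipeline with the model named explicitly
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)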
Run the sentiment classifier¶
The distilbert-base-uncased-finetuned-sst-2-english model classifies an input text as either 'POSITIVE' or 'NEGATIVE' and returns a confidence score for the predicted label.
classifier("We are very happy to show you the 🤗 Transformers library.")
👆 For the 'POSITIVE' label: The score of 0.9997795224189758 indicates a very high confidence (about 99.98%) that the input text expresses a positive sentiment.
classifier("These thieves tried to steal my security deposit.")
👆 For the 'NEGATIVE' label: The score of 0.996752142906189 indicates a very high confidence (about 99.68%) that the input text expresses a negative sentiment.
You can supply multiple inputs to the pipeline as a list.
my_inputs = [
"You're the best!",
"You're the worst!"
]
classifier(my_inputs)
Sample 30 rows¶
Although the distilbert-base-uncased-finetuned-sst-2-english model is pre-trained and distilled (40% smaller than the original BERT model), running it over the entire dataset would still be slow.
For this demo, only sample 30 rows where
- review_text is not missing, and
- review_rating is less than or equal to 4 out of 5 stars.
df_sample = df_r[df_r['review_text'].notna() &
(df_r['review_rating'] <= 4)].sample(30)
df_sample[['review_rating', 'review_text']]
Run the classifier¶
truncation=True enables truncation of input sequences that exceed the maximum length accepted by the model. This prevents errors that would occur if the input text is too long for the model to process.
max_length=512 sets the maximum number of tokens that each input sequence can have after tokenization. If an input sequence is longer than this, it is truncated to fit within this limit. The value 512 is commonly used, as it is the maximum sequence length for many BERT-based models.
padding=True enables padding for input sequences that are shorter than the longest sequence in a batch. Shorter sequences are padded with a special padding token so that all sequences in the batch have the same length, which is necessary for efficient processing by the model.
classified_result = classifier(
df_sample['review_text'].tolist(),
truncation=True,
max_length=512,
padding=True,
)
classified_result
Check the result.
df_sample['sentiment'] = list(map(lambda f: f['label'], classified_result))
df_sample['score'] = list(map(lambda f: f['score'], classified_result))
df_sample[['review_text', 'sentiment', 'score']]
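As a rough sanity check, cross-tabulate the predicted labels against the star ratings (a small sketch; the exact counts depend on the random sample drawn above):
# compare predicted sentiment with review ratings
pd.crosstab(df_sample['review_rating'], df_sample['sentiment'])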
Alternatively, use a for loop to display progress¶
Passing a long list of text can be time-consuming. If you prefer tracking progress while the pipeline is running, use a for loop to run the classifier on each row. Print progress in each iteration.
# create new columns to store classified result
df_sample['sentiment'] = np.nan
df_sample['score'] = np.nan
# set the sentiment column to a string dtype
df_sample['sentiment'] = df_sample['sentiment'].astype(str)
num_rows = df_sample.shape[0]
for i in range(num_rows):
# store review text to a variable
review_text = df_sample['review_text'].iloc[i]
if pd.notna(review_text):
result = classifier(
review_text,
truncation=True,
padding=True,
max_length=512
)
df_sample.iloc[i, df_sample.columns.get_loc('sentiment')] = result[0]['label']
df_sample.iloc[i, df_sample.columns.get_loc('score')] = result[0]['score']
# display progress
progress_percentage = round((i + 1) / num_rows * 100, 2)
print(f'{i + 1}/{num_rows} ({progress_percentage}%)', end=' ')
if (i + 1) % 10 == 0:
print('')
print('====================')
print('Complete')
✨ Conclusion¶
Rule-based models like VADER struggle with nuanced language comprehension and context-dependent sentiments. BERT-based sentiment analysis models generally outperform rule-based models like VADER in terms of accuracy and nuanced understanding of context. BERT's bidirectional training allows it to grasp context from both directions, making it more effective in understanding complex sentiments.
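For a hands-on comparison, you can run the same review through VADER and the DistilBERT classifier side by side (a minimal sketch, assuming NLTK is installed and its VADER lexicon can be downloaded):
# compare a rule-based (VADER) and a transformer-based (DistilBERT) classifier on one review
import nltk
nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
from nltk.sentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()
sample_text = "These thieves tried to steal my security deposit."
print(vader.polarity_scores(sample_text))  # neg/neu/pos/compound scores from VADER
print(classifier(sample_text))             # label and confidence from the DistilBERT pipeline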
ModernBERT was recently introduced as a slot-in replacement for any BERT-like model. It is an improvement over BERT and its younger siblings in both speed and accuracy. Hugging Face expects ModernBERT to become the new standard for applications where encoder-only models are currently deployed.