Beyond Words: Exploring Sentiments in Cancer Narratives with ML💬🎗️ | by Saraswataroy | Jun, 2024

In today’s digital age, where discussions and narratives surrounding cancer are abundant across various platforms, understanding the sentiments expressed in these conversations is crucial. Sentiment analysis, powered by advanced techniques in data analysis and machine learning, offers valuable insights into the emotions, concerns, and experiences shared by individuals affected by cancer.

In this blog, we embark on a journey to explore the emotional landscape of cancer conversations using data-driven approaches. By leveraging sentiment analysis techniques, we aim to decode the sentiments expressed in cancer-related discussions across social media, support forums, and other channels. Our focus is not only on understanding the emotional undertones but also on how these insights can contribute to better support systems, patient care, and awareness campaigns.

Join us as we delve into the heartfelt stories, concerns, hopes, and experiences shared by individuals impacted by cancer, and discover how data analysis can offer empathy-driven insights into their journey.

For this exploration, we use a dataset curated by researchers Irin Hoque Orchi, Nafisa Tabassum, Jaeemul Hossain, Sabrina Tajrin, and Iftekhar Alam. The dataset comprises 10,392 social media posts shared by cancer patients and their caregivers across platforms such as Reddit, Daily Strength, and the Health Board. These posts capture genuine experiences, emotions, and conversations surrounding cancer journeys.

The dataset encompasses discussions related to five types of cancer: brain, colon, liver, leukemia, and lung cancer, offering a diverse insight into various aspects of the disease. From personal stories to queries, support-seeking posts to moments of triumph, these entries provide a holistic view of the cancer experience.

Each post has been tagged with a sentiment score from -1 to 1, where -1 represents negative emotions or grief, 1 signifies positive or happy emotions, and neutral posts receive a score of 0. This three-level scoring lets us map the emotional spectrum expressed within these conversations.

Link to data:

Firstly, we’ll check the dataset for any missing values. Subsequently, we’ll perform basic pandas operations to remove rows containing missing values if they represent only a negligible portion of the total dataset.
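The missing-value check described above could look roughly like the sketch below. It runs on a small made-up DataFrame rather than the real dataset; the `labels` column name and the 5% cut-off for "negligible" are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 39 complete rows and 1 row
# with a missing post. The "labels" column name is an assumption.
df = pd.DataFrame({
    "posts": [f"post {i}" for i in range(39)] + [None],
    "labels": [0] * 40,
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Share of rows that contain at least one missing value.
missing_share = df.isnull().any(axis=1).mean()
print(f"Rows with missing values: {missing_share:.1%}")

# Assumed threshold: treat anything below 5% as negligible and drop it.
NEGLIGIBLE_SHARE = 0.05
if missing_share < NEGLIGIBLE_SHARE:
    df_clean = df.dropna().reset_index(drop=True)
else:
    df_clean = df.copy()  # too much missing data to drop blindly

print(len(df), "->", len(df_clean))
```

If the share of incomplete rows turned out to be large, imputing or investigating the gaps would be safer than dropping them.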

Next, we’ll check if the available data is imbalanced.

We observe that the classes are imbalanced: the data predominantly consists of negative and neutral sentiment posts. Therefore, when we split the data, we need to stratify by label so that the training and test sets both preserve the overall class distribution.
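A minimal sketch of the imbalance check and a stratified split is shown here. The data is a toy stand-in with made-up class proportions, and the column names, the 80/20 split, and the random seed are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced stand-in for the labelled posts (proportions are made up).
df = pd.DataFrame({
    "posts": [f"post {i}" for i in range(100)],
    "labels": [-1] * 50 + [0] * 40 + [1] * 10,
})

# Relative frequency of each sentiment class.
print(df["labels"].value_counts(normalize=True))

# stratify= keeps the class proportions the same in both splits.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["labels"], random_state=42
)

# The test set mirrors the 50/40/10 distribution of the full data.
print(test_df["labels"].value_counts(normalize=True))
```

The same `train_test_split` call also accepts a feature matrix and a label array, so it can be applied after vectorization instead of on the raw DataFrame.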

Now, let’s proceed to preprocess the data to prepare it for feeding into the model.

Preprocessing text data involves several steps to clean and prepare the text for feeding into a machine learning model. Here’s a general outline of preprocessing steps:

  1. Lowercasing: Convert all text to lowercase to ensure consistency (optional depending on the case sensitivity of the model).
  2. Tokenization: Split the text into individual words or tokens.
  3. Removing Punctuation: Remove any punctuation marks as they usually don’t add significant meaning for most NLP tasks.
  4. Removing Stopwords: Remove common words (stopwords) like “and”, “the”, “is”, etc., which don’t contribute much to the meaning of the text.
  5. Stemming or Lemmatization: Reduce words to their base form. Stemming cuts off prefixes or suffixes, while lemmatization reduces words to their dictionary form.
  6. Handling Numbers: Decide whether to keep numbers as they are, replace them with a placeholder, or remove them.
  7. Handling Special Characters: Handle special characters or symbols appropriately based on the task.
  8. Vectorization: Convert text into numerical representations like TF-IDF vectors or word embeddings.

Let’s proceed with the code:

# tqdm wraps any iterable and shows a progress bar while looping over it
from tqdm import tqdm

# Expands contractions to their full forms, e.g. can't -> cannot
import contractions

# Beautiful Soup is a Python library used for web scraping tasks. It provides tools to extract data from HTML and XML files
from bs4 import BeautifulSoup

# Lemmatize using WordNet's built-in morphy function. Returns the input word unchanged if it cannot be found in WordNet
from nltk.stem import WordNetLemmatizer

# Regular expressions
import re

# Natural Language Toolkit
import nltk

# Download the stopword list and the WordNet data used by the lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

wordnet = WordNetLemmatizer()
# A set makes the stopword lookup fast
stopwords = set(nltk.corpus.stopwords.words('english'))

prepocessed_posts = []

for sentence in tqdm(df1['posts'].values):
    # Expand contractions in the text
    sentence = contractions.fix(sentence)
    # Remove links starting with http or https
    sentence = re.sub(r"http\S+", "", sentence)
    # Remove any token that contains a digit
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()
    # Keep only alphabetical characters and whitespace
    sentence = re.sub(r"[^a-zA-Z\s]+", " ", sentence)
    tokens = sentence.split()
    # Drop stop words and lemmatize the remaining tokens
    tokens = [wordnet.lemmatize(word) for word in tokens if word.lower() not in stopwords]
    cleaned_sentence = ' '.join(tokens).lower()
    # Store the cleaned post
    prepocessed_posts.append(cleaned_sentence)

The above code can be used to normalize any text data to be made fit for Natural Language Processing.

Now that we have the normalized text data, we can proceed to convert it into numerical representations that machine learning algorithms can understand and process. For that we have two basic approaches.

Bag of Words (BoW)

It is a common technique used in Natural Language Processing (NLP) to convert text data into numerical representations. It involves the following steps:

  1. Tokenization: The text is split into individual words or tokens.
  2. Vocabulary Creation: A vocabulary (a set of unique words) is created from the entire corpus of text.
  3. Vectorization: Each document is represented as a vector of word frequencies. The length of the vector is equal to the size of the vocabulary, and each element in the vector corresponds to the frequency of a specific word in the document.


Consider the following three documents:

  1. “The cat sat on the mat.”
  2. “The dog sat on the log.”
  3. “The cat and the dog played together.”

The BoW representation involves:

  1. Tokenization:
  • Document 1: [“the”, “cat”, “sat”, “on”, “the”, “mat”]
  • Document 2: [“the”, “dog”, “sat”, “on”, “the”, “log”]
  • Document 3: [“the”, “cat”, “and”, “the”, “dog”, “played”, “together”]
  2. Vocabulary Creation:
  • [“the”, “cat”, “sat”, “on”, “mat”, “dog”, “log”, “and”, “played”, “together”]
  3. Vectorization:
  • Document 1: [2, 1, 1, 1, 1, 0, 0, 0, 0, 0]
  • Document 2: [2, 0, 1, 1, 0, 1, 1, 0, 0, 0]
  • Document 3: [2, 1, 0, 0, 0, 1, 0, 1, 1, 1]

Here, each vector represents the frequency of words in the corresponding document.
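To make the three steps concrete, here is a small pure-Python sketch that reproduces the vectors above. The `tokenize` helper is a hypothetical name introduced for this illustration, and the vocabulary is kept in order of first appearance so that it lines up with the example.

```python
import string

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog played together.",
]

def tokenize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Step 1: tokenization.
tokenized = [tokenize(doc) for doc in documents]

# Step 2: vocabulary creation, in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Step 3: vectorization, one count per vocabulary word for each document.
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

In practice a library class such as scikit-learn's `CountVectorizer` does the same job, although it orders the vocabulary alphabetically, so the columns would appear in a different order.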

TF-IDF (Term Frequency-Inverse Document Frequency):

The Bag of Words (BoW) model represents text data by counting word frequencies, but it treats all words equally, so common words dominate and the importance of a term across documents is ignored. TF-IDF (Term Frequency-Inverse Document Frequency) addresses these limitations by weighting terms based on their frequency in individual documents and their rarity across the entire corpus, down-weighting common words and highlighting distinctive, informative terms. This tends to produce more discriminative features, which can improve the performance of machine learning models in tasks like text classification and information retrieval.

Key Components of TF-IDF:

  1. Term Frequency (TF): Measures how frequently a term appears in a document.
  2. Inverse Document Frequency (IDF): Measures how rare a term is across all documents in the corpus.
  3. TF-IDF Calculation: Combines TF and IDF to give a weighted score for each term in each document.
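The three components can be sketched in a few lines of plain Python. This uses the textbook definitions (TF as a relative count, IDF as the log of total documents over documents containing the term); the function names are hypothetical, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF plus L2 normalization by default, so its numbers will differ from these.

```python
import math

tokenized_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "and", "the", "dog", "played", "together"],
]

def term_frequency(term, doc):
    """Fraction of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    """log(N / df): zero for terms in every document, larger for rare terms."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # term never occurs; avoid dividing by zero
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """Weight of `term` in `doc`, given the whole corpus `docs`."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

first_doc = tokenized_docs[0]
# "the" occurs in every document, so its IDF (and TF-IDF) is 0.
print("the:", tf_idf("the", first_doc, tokenized_docs))
# "mat" occurs only in the first document, so it gets the highest weight.
print("mat:", round(tf_idf("mat", first_doc, tokenized_docs), 3))
# "cat" occurs in two of three documents, so it lands in between.
print("cat:", round(tf_idf("cat", first_doc, tokenized_docs), 3))
```

The ordering of the three scores is the point: the word shared by all documents contributes nothing, while the word unique to one document stands out.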

We’ll use TF-IDF here, since down-weighting very common words usually gives a classifier more informative features than raw counts.

from sklearn.feature_extraction.text import TfidfVectorizer

# Create the TfidfVectorizer instance
tf_idf_vect = TfidfVectorizer()
# Fit the vectorizer on the documents to learn the vocabulary and IDF weights
tf_idf_vect.fit(prepocessed_posts)
# Transform the documents to the corresponding vectors
final_counts_tfidf = tf_idf_vect.transform(prepocessed_posts).toarray()


Here, 35001 unique tokens were discovered, and the value in each cell is the TF-IDF score of the corresponding term in the corresponding document.

Phew! Alright, folks! Data preprocessing? Done! Now, let’s roll up our sleeves and dive into the exciting part — building models and testing our results!

We have a plethora of algorithms at our disposal, from powerful tree-based models like Random Forests, Gradient Boosting, and XGBoost to large language models like LLaMA and GPT.
For our analysis, let’s harness the power of XGBoost, a robust and efficient gradient boosting algorithm that excels in both speed and performance. XGBoost is renowned for its ability to handle large datasets and complex patterns, making it an ideal choice for our task.

from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score

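# x_train, x_test, y_train, y_test come from the stratified train/test split of the TF-IDF features
xgboost = XGBClassifier()
xgboost.fit(x_train, y_train)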

y_pred_train = xgboost.predict(x_train)
y_pred_test = xgboost.predict(x_test)

print("Training Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Test Accuracy: ", accuracy_score(y_test, y_pred_test))

Training Accuracy: 0.9789461020211742

Test Accuracy: 0.721019721019721
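Because the classes are imbalanced, a single accuracy figure can hide a class the model rarely gets right. The sketch below shows how a per-class report exposes that; the labels and predictions are made up for illustration and are not the model's actual output.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Hypothetical true labels and predictions on an imbalanced test set:
# six negative (-1), three neutral (0), and one positive (1) post.
y_true = [-1, -1, -1, -1, -1, -1, 0, 0, 0, 1]
y_pred = [-1, -1, -1, -1, -1, -1, 0, 0, -1, -1]

# Overall accuracy looks decent ...
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

# ... but per-class recall shows the positive class is never recovered.
recall_per_class = recall_score(
    y_true, y_pred, labels=[-1, 0, 1], average=None, zero_division=0
)
print("Recall per class:", recall_per_class)

# Precision, recall, and F1 for every class in one table.
print(classification_report(y_true, y_pred, zero_division=0))
```

Running the same report on `y_test` and `y_pred_test` would show which sentiment classes account for the misclassified posts.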

The model in its current form gives us a promising accuracy of 72% on the test set, which is quite good given the data at hand. The much higher training accuracy of about 98% shows that the model is overfitting, so there is room to generalize better. To improve the test accuracy, we can explore advanced tokenization techniques to better capture the structure of the text, or run a grid search or random search to find better hyperparameters for our model. Experimenting with other training algorithms will also help us identify the approach that works best for this dataset.
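As a rough illustration of the grid search idea, the sketch below tunes two hyperparameters with cross-validation. To keep it dependent on scikit-learn only, it swaps in `GradientBoostingClassifier` and synthetic data as stand-ins; `XGBClassifier` follows the scikit-learn estimator interface, so it could be passed to `GridSearchCV` in the same way. The grid values are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic three-class data standing in for the TF-IDF features and labels.
X, y = make_classification(
    n_samples=150, n_features=10, n_informative=4, n_classes=3, random_state=0
)

# Candidate values to try; every combination is evaluated.
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [2, 3],
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=cv,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Shallower trees and fewer boosting rounds are also a common way to narrow a large gap between training and test accuracy, so these two parameters are a reasonable first pair to search over.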

In this blog, we’ve journeyed through the intricate process of training a model and preprocessing text data to detect cancer-related sentiments using XGBoost. Starting with a rich dataset of posts/narratives from cancer patients and caregivers, we meticulously cleaned and normalized the text, transforming it into a format suitable for machine learning.

We leveraged the powerful TF-IDF technique to convert text data into numerical representations, enabling our model to understand and analyze the sentiments expressed in the posts. By implementing XGBoost, a robust and efficient gradient boosting algorithm, we aimed to achieve high accuracy in sentiment detection.

Despite achieving a commendable 72% accuracy on our test set, we recognize that further enhancements are possible. 🔧 We can explore additional modifications such as advanced feature engineering, data augmentation, and hyperparameter tuning to improve the model’s performance. Moreover, experimenting with various preprocessing techniques and training algorithms will help us refine our approach and achieve even better results.

Ultimately, this project underscores the importance of thorough data preprocessing and the strategic use of machine learning algorithms in tackling complex NLP tasks. With continuous iteration and optimization, we can develop highly accurate models that provide valuable insights into the sentiments of cancer patients and caregivers, contributing to better support and understanding in the healthcare community. 💡✨

That brings us to the end of this blog. I hope you enjoyed this article and found it informative and engaging. You can follow me, Saraswata Roy, for more such articles.

Project Link —
