Disastrous Tweets Classification using BERT

Aug 30, 2021

Overview

In this article I classify whether a tweet is disastrous or not using a state-of-the-art model. I use BERT to classify the tweets through the Keras API, with the help of KTrain. If you have used TensorFlow before, you probably know that Keras is a high-level wrapper around TensorFlow; in the same way, KTrain is a wrapper around Keras and TensorFlow. KTrain ships with various preprocessing modules that make NLP tasks easier. If you are curious about KTrain, don't forget to visit this link. The dataset consists of tweets that have been classified into two labels: label 1 means the tweet is disastrous and label 0 means it is not.

Step 1:- Importing necessary libraries

import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding,Dense,SpatialDropout1D,Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import seaborn as sns
import matplotlib.pyplot as plt

Step 2:- Data Understanding

In this step we get a clear picture of different aspects of the dataset, such as null values, the number of classes (binary or multiclass), the shape of the dataset etc.
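
The snippets in this step work on a DataFrame called data, but the line that creates it is not shown. A minimal sketch of that step could look like the following. The file name train.csv is an assumption, and the two rows used here are made up so that the snippet runs on its own.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real file. With the actual dataset
# this would be something like: data = pd.read_csv("train.csv")
# (the file name is an assumption, it is not given in the article).
sample_csv = io.StringIO(
    "id,keyword,location,text,target\n"
    "1,,,Forest fire near the town,1\n"
    "2,music,London,Loving this new song,0\n"
)

data = pd.read_csv(sample_csv)
print(data.columns.tolist())
print(data.shape)
```

Empty fields in the CSV are read as missing values, which is why keyword and location show fewer non-null entries than the other columns later on.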

Printing all the columns present in dataframe

data.columns
[out]>> Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')

Shape of the dataset, i.e. how many rows and columns are present in the dataset

data.shape 
[out]>> (7613, 5)  #dataset has 7613 rows and 5 columns

Getting information about the dataset, such as how many non-null values each column has and what the dtype of each column is.

data.info()
[out]>> <class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        7613 non-null   int64
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
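
The non-null counts above translate directly into missing-value counts. A small worked calculation, using only the numbers printed by the info output (on the real DataFrame, data.isnull().sum() should report the same counts):

```python
# Non-null counts copied from the info output above.
total_rows = 7613
non_null = {"id": 7613, "keyword": 7552, "location": 5080, "text": 7613, "target": 7613}

# Missing count and missing share (in percent) per column.
missing = {col: total_rows - n for col, n in non_null.items()}
missing_pct = {col: round(100 * m / total_rows, 2) for col, m in missing.items()}

for col in non_null:
    print(f"{col:<9} missing={missing[col]:>5}  ({missing_pct[col]}%)")
```

Only keyword and location have gaps, and about a third of the location values are missing. The text and target columns used for modelling are complete.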

Statistical description of numerical and categorical data. By using data.describe() we get a statistical description of the numerical columns: count, mean, standard deviation, minimum, maximum and percentiles (the 50th percentile is the median).

data.describe() #description of numerical data

cat_data=data.describe(include=['object']) #count, unique, top and freq of the text columns
cat_data #description of categorical features

Counting the number of disastrous (1) and non-disastrous (0) tweets.

data['target'].value_counts()
[out]>> 0    4342
        1    3271
        Name: target, dtype: int64

Step 3:- Exploratory Data Analysis

Plotting the distribution of Disastrous and Non-Disastrous tweets.

plt.figure(figsize=(8,6))
sns.set_style(style='darkgrid')
sns.countplot(x=data['target']) #pass the column by keyword, recent seaborn versions require it
plt.title('Disastrous and Non-Disastrous Tweets')
plt.show()

Let’s see the percentage share of disastrous and non-disastrous tweets using a pie chart.

plt.figure(figsize=(6,8))
sns.set_style("darkgrid")
data['target'].value_counts().plot.pie(autopct='%0.2f%%')
plt.title("Percentage Contribution")
plt.xlabel("percent contribution")
plt.ylabel("target")
plt.show()

Let’s look at the distribution of the number of characters in the tweets. Before doing this we need to preprocess the tweets: we will remove unwanted text, i.e. URLs and special symbols like @, !, # etc. We will also calculate the word count, character count, average word length etc. For this I have imported the preprocessing file here.

import preprocess_kgptalkie as akhil
df=akhil.get_basic_features(data)
df.head()
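
To make the feature columns used in the plots below less of a black box, here is a rough standard-library sketch of how such per-tweet features could be computed. The function name basic_features, the sample tweet and the exact definitions (for example, counting characters without whitespace) are assumptions made for illustration. This is not the implementation inside preprocess_kgptalkie.

```python
def basic_features(tweet):
    """Hypothetical stand-in: a few simple per-tweet text features."""
    words = tweet.split()
    # Characters are counted without whitespace (a choice made for this sketch).
    char_count = sum(len(word) for word in words)
    return {
        "char_counts": char_count,
        "word_counts": len(words),
        "avg_wordlength": char_count / len(words) if words else 0.0,
        "hashtag_counts": sum(1 for word in words if word.startswith("#")),
    }


features = basic_features("Flood warning issued for #Houston #weather")
print(features)
```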

Let’s do some more plotting to understand the nature of the text in the data frame.

sns.kdeplot(df['char_counts'],fill=True,color='green') #fill replaces the deprecated shade argument
plt.show()

Let’s see how the length of a tweet varies with its class, i.e. we compare the character counts of disastrous and non-disastrous tweets to see which class tends to have the longer tweets.

plt.figure(figsize=(6,8))
#red = disastrous tweets, green = non-disastrous tweets
sns.kdeplot(df[df['target']==1]['char_counts'],color='red',fill=True)
sns.kdeplot(df[df['target']==0]['char_counts'],color='green',fill=True)
plt.show()

Disastrous tweets tend to be longer than non-disastrous tweets. So if a tweet is lengthy, it is somewhat more likely to be related to a complaint or dissatisfaction.
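
A density plot shows the trend, but a single number per class is easier to compare. The sketch below computes the mean tweet length per class with the standard library only. The four tweets are made up for illustration; on the real data frame something like df.groupby('target')['char_counts'].mean() would give the equivalent figures.

```python
from statistics import mean

# Made-up (text, target) pairs for illustration only.
tweets = [
    ("Massive wildfire forces thousands to evacuate the valley tonight", 1),
    ("Earthquake of magnitude 6 hits the coast, buildings damaged", 1),
    ("Great coffee this morning", 0),
    ("This movie was the bomb", 0),
]

# Collect the character length of every tweet under its class label.
lengths_by_class = {0: [], 1: []}
for text, target in tweets:
    lengths_by_class[target].append(len(text))

mean_length = {label: mean(values) for label, values in lengths_by_class.items()}
print(mean_length)
```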

#Distribution of stop words on both the classes
sns.boxplot(x=data['target'],y=data['stopwords_counts'])
plt.show()


#let's see how the hashtags are distributed in both classes
sns.violinplot(x=data['target'],y=data['hashtag_counts'])
plt.show()

In the figure above we can observe that hashtags are used slightly more often in disastrous tweets.

Let’s see which words are used most and least often in the dataset. These commonly used words give us an idea about the overall nature of the tweets.

freq_occuring=akhil.get_word_freqs(data,'text')
top_20=freq_occuring[:20]
sns.barplot(x=top_20.index,y=top_20.values)
plt.xticks(rotation=70)
plt.show()

#least 20 occurring words
least_20=freq_occuring[-20:] #last 20 entries, assuming the frequencies are sorted in descending order
sns.barplot(x=least_20.index,y=least_20.values)
plt.xticks(rotation=70)
plt.show()
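
The idea behind these two plots can be reproduced with the standard library. The sketch below counts words with collections.Counter on three made-up tweets. It illustrates word-frequency ranking in general and is not the code behind get_word_freqs.

```python
from collections import Counter

# Made-up tweets for illustration only.
tweets = [
    "fire in the forest",
    "the fire spread to the town",
    "rain in the town",
]

# One Counter over every whitespace-separated word in every tweet.
word_counts = Counter(word for tweet in tweets for word in tweet.split())

most_common = word_counts.most_common(2)       # highest counts first
least_common = word_counts.most_common()[-2:]  # tail of the same ranking
print(most_common)
print(least_common)
```

Note that the most frequent word here is a stop word, which is often the case when stop words are not removed before counting.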

Step 4:- Data Cleaning

As we know, tweets contain hashtags, special symbols (@, #, $, %, ^, &, *, !), URLs, numbers etc. We need to clean all that unwanted text. Hence in this section we will clean the text so that it is in a human-readable format without any special characters.

Importing necessary text cleaning libraries

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()
nltk.download('stopwords')
stop_words=set(stopwords.words('english')) #separate name so the imported stopwords module is not overwritten

Let’s clean the data…

def cleaner(text):
    cleaned=text.replace("//"," ").replace("."," ")
    cleaned=cleaned.lower() #converting into lower case words
    cleaned=re.sub(r'\w+\d+'," ",cleaned) #remove alphanumeric words (must run before the digits are stripped)
    cleaned=re.sub(r'[^a-z]'," ",cleaned) #keep letters only
    #drop words of one or two characters and stem every remaining word
    words=[ps.stem(word) for word in cleaned.split() if len(word)>2]
    return " ".join(words)
data['text']=data['text'].apply(cleaner)
#let's check for some text
data['text'][0:10]
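
To see what the regular-expression steps do to a single tweet, here is a standard-library trace of the same kind of cleaning. Stemming is left out so that the snippet does not need NLTK, and the sample tweet is made up.

```python
import re

tweet = "Fire at Main St!! http://t.co/abc123 #Breaking"

lowered = tweet.lower()
# Drop words made of word characters followed by digits (here "abc123").
no_alnum = re.sub(r"\w+\d+", " ", lowered)
# Replace everything that is not a lowercase letter with a space.
letters_only = re.sub(r"[^a-z]", " ", no_alnum)
# Keep words longer than two characters and rebuild the string.
cleaned = " ".join(word for word in letters_only.split() if len(word) > 2)
print(cleaned)
```

The word "http" is still present in the result, which shows that these steps only break a URL into pieces. Removing URLs completely would need a dedicated URL pattern applied before the other steps.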

Now we will do the most sought-after visualization, i.e. a word cloud for class 1 and for class 0.

from wordcloud import WordCloud,STOPWORDS
dataset=akhil.get_word_freqs(data[data['target']==1],'text')
print(dataset.index)
dataset=" ".join(dataset.index)
word_cloud=WordCloud(max_font_size=60,background_color='white').generate(dataset)
plt.imshow(word_cloud)
plt.axis('off')
plt.show()

from wordcloud import WordCloud,STOPWORDS
dataset=akhil.get_word_freqs(data[data['target']==0],'text')
print(dataset.index)
dataset=" ".join(dataset.index)
word_cloud=WordCloud(max_font_size=60,background_color='white').generate(dataset)
plt.imshow(word_cloud)
plt.axis('off')
plt.show()

Step 5:- Fine tuning BERT model

import ktrain
from ktrain import text
(x_train,y_train),(x_test,y_test),preprocess=text.texts_from_df(data,
                                 text_column='text',
                                 label_columns='target',
                                 maxlen=50,
                                 preprocess_mode='bert')

Let’s understand the above code piece by piece…

texts_from_df: ktrain preprocesses the data from the dataframe and returns three objects: the training pair (x_train, y_train), the validation pair (x_test, y_test) and the preprocessor preprocess. The arguments of texts_from_df are:

data: the dataset that has been taken for the operation.
text_column: the text column present in the dataframe/dataset.
label_columns: the target/output column present in the dataset.
maxlen: the maximum number of tokens kept per tweet. In the case of BERT the maximum length is 512; a sequence length beyond 512 will give an error.
preprocess_mode: how the preprocessing is done. In my case the textual data is preprocessed the way BERT expects it ('bert').
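
The effect of maxlen can be pictured with a toy example: sequences longer than maxlen are cut, shorter ones are padded. The helper name pad_or_truncate and the "[PAD]" placeholder are assumptions for this sketch. Real BERT preprocessing works on sub-word tokens and adds special tokens, so this only illustrates the idea, not what ktrain does internally.

```python
def pad_or_truncate(tokens, maxlen, pad_token="[PAD]"):
    """Illustrative only: force a token list to exactly `maxlen` items."""
    if len(tokens) >= maxlen:
        return tokens[:maxlen]  # too long: cut the tail
    return tokens + [pad_token] * (maxlen - len(tokens))  # too short: pad


tokens = "forest fire near the town".split()
short = pad_or_truncate(tokens, maxlen=3)
padded = pad_or_truncate(tokens, maxlen=7)
print(short)
print(padded)
```

A small maxlen such as 50 keeps training fast, at the cost of dropping the end of unusually long tweets.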

model=text.text_classifier('bert',train_data=(x_train,y_train) 
                                             ,preproc=preprocess)

Now we will use get_learner. It wraps the model and the data in a learner object, which is then used for training and, later, for the final prediction.

learner=ktrain.get_learner(model,train_data=
         (x_train,y_train),val_data=(x_test,y_test),batch_size=64)

We give the model a learning rate and a number of epochs and train it to get the best possible results.

learner.fit_onecycle(lr=1e-5,epochs=4)
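
The one-cycle call above varies the learning rate during training instead of keeping it fixed. A simplified sketch of such a schedule is shown below: the rate rises linearly to the maximum during the first half of training and falls back during the second half. The function name one_cycle_lr, the starting factor of 10 and the purely linear shape are assumptions for illustration; the exact schedule used by ktrain may differ.

```python
def one_cycle_lr(step, total_steps, max_lr, start_div=10.0):
    """Simplified triangular one-cycle schedule (illustrative assumptions)."""
    min_lr = max_lr / start_div
    half = total_steps / 2
    if step <= half:
        frac = step / half  # 0 -> 1 during warm-up
    else:
        frac = (total_steps - step) / half  # 1 -> 0 during cool-down
    return min_lr + frac * (max_lr - min_lr)


total_steps = 100
schedule = [one_cycle_lr(s, total_steps, max_lr=1e-5) for s in range(total_steps + 1)]
print(schedule[0], schedule[50], schedule[100])
```

The lr argument plays the role of max_lr here: it is the peak of the cycle, not a constant rate.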

I am getting an accuracy of around 58 percent. I might need to focus more on the preprocessing and cleaning aspects to improve the model accuracy. Let’s predict the result on a trending tweet.
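
Before the prediction step, it helps to put the accuracy figure in context by comparing it with a model that always predicts the most frequent class. The calculation below uses the class counts from Step 2. It is computed on the full dataset, so the share in the validation split may differ slightly.

```python
# Class counts copied from the value_counts() output in Step 2.
class_counts = {0: 4342, 1: 3271}
reported_accuracy = 0.58  # "around 58 percent", as stated above

total = sum(class_counts.values())
# Accuracy of a model that always predicts the most frequent class.
majority_baseline = max(class_counts.values()) / total

print(f"majority baseline: {majority_baseline:.4f}")
print(f"gain over baseline: {reported_accuracy - majority_baseline:.4f}")
```

The baseline comes out at roughly 57 percent, so the fine-tuned model is only about one percentage point better than always answering "not disastrous". This supports the point that the preprocessing and training setup need more work.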

prediction=ktrain.get_predictor(learner.model,preprocess)
data=["US did this! 😭😭"]
prediction.predict(data)
[out]>> 'not-Disastrous'

Conclusion:-

This is a basic introductory approach to solving an NLP problem. Please hang tight, I will be back with more blogs on BERT fine-tuning using TensorFlow and PyTorch. Please share your valuable opinion in the comment box.


Written by akhil anand