Disastrous Tweets Classification using BERT
Overview
In this article I classify whether a tweet is disastrous or not using a state-of-the-art model. I will be using BERT to classify the tweets with the help of the Keras API through KTrain. If you have used TensorFlow before, you probably know that Keras is the high-level API on top of TensorFlow; in the same way, KTrain is a wrapper around Keras and TensorFlow. KTrain includes various preprocessing modules that make NLP tasks easier. If you are curious about KTrain, don't forget to visit this link. The dataset I have taken consists of tweets that have been classified into two labels: label 1 means the tweet is disastrous and label 0 means the tweet is not disastrous.
Step 1 : Importing necessary Libraries
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding,Dense,SpatialDropout1D,Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import seaborn as sns
import matplotlib.pyplot as plt
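The snippets in the following steps use a DataFrame named data, which has to be loaded first. Below is a minimal sketch, assuming the tweets are stored in a CSV file with the columns id, keyword, location, text and target. The helper name load_tweets, the file name train.csv and the two sample rows are illustrative assumptions, not taken from the original notebook.

```python
import io
import pandas as pd

EXPECTED_COLUMNS = ["id", "keyword", "location", "text", "target"]

def load_tweets(source):
    """Read the tweet CSV (a path or file-like object) and check its columns."""
    frame = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return frame

# Made-up rows so the sketch runs without the real file.
sample_csv = io.StringIO(
    "id,keyword,location,text,target\n"
    "1,,,Forest fire near the village,1\n"
    "2,,,I love this sunny day,0\n"
)
data = load_tweets(sample_csv)
print(data.shape)
# With a local copy of the dataset (file name assumed): data = load_tweets("train.csv")
```

Checking the column names right after loading makes later errors, such as a missing target column, easier to trace.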
Step 2:- Data Understanding
In this step we will get a clear picture of different aspects of the dataset, such as null values, the number of classes (binary or multiclass), the shape of the dataset, etc.
Printing all the columns present in dataframe
data.columns
[out]>> Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')
Shape of the dataset, i.e. how many rows and columns are present in the dataset
data.shape
[out]>> (7613, 5) #dataset has 7613 rows and 5 columns
Getting all the information related to the dataset, such as how many null values are present and what the dtype of each column is.
data.info()
[out]>> <class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        7613 non-null   int64
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
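The non-null counts can be turned into missing-value counts by subtracting them from the total number of rows. The sketch below only uses the numbers printed by data.info() above; on the DataFrame itself, data.isnull().sum() gives the same counts.

```python
# Non-null counts copied from the data.info() output above.
total_rows = 7613
non_null = {"id": 7613, "keyword": 7552, "location": 5080, "text": 7613, "target": 7613}

# Missing values per column, as a count and as a percentage of all rows.
missing = {col: total_rows - count for col, count in non_null.items()}
missing_pct = {col: round(100 * n / total_rows, 2) for col, n in missing.items()}

for col in non_null:
    print(f"{col:<10}{missing[col]:>6}{missing_pct[col]:>8}%")
```

Only keyword and location have gaps, and location is missing for roughly a third of the tweets, which is worth keeping in mind before using it as a feature.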
Statistical description of numerical and categorical data. By using data.describe() we get a statistical description of the dataset such as count, mean, standard deviation, median and other percentiles.
data.describe() #description of numerical data
cat_data=data.describe(include='object')
cat_data #description of categorical features (count, unique, top, freq)
Counting the number of disastrous (1) and non-disastrous (0) tweets.
data['target'].value_counts()
[out]>> 0 4342
1 3271
Name: target, dtype: int64
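The counts show a moderate class imbalance. Here is a small sketch of the arithmetic, using only the two counts printed above; on the DataFrame, data['target'].value_counts(normalize=True) gives the same proportions. The majority-class share is a useful reference point, since a model that always predicts the majority class would already reach that accuracy.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}
total = sum(counts.values())

# Share of each class in percent, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

print(shares)
print(f"always predicting class {majority_label} gives {shares[majority_label]}% accuracy")
```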
Step 3:- Exploratory Data Analysis
Plotting the distribution of Disastrous and Non-Disastrous tweets.
plt.figure(figsize=(8,6))
sns.set_style(style='darkgrid')
sns.countplot(x=data['target']) #pass the column by keyword; recent seaborn versions do not accept it positionally
plt.title('Disastrous and Non-Disastrous Tweets')
plt.show()
Let’s see the percentage share of disastrous and non-disastrous tweets using a pie chart.
plt.figure(figsize=(6,8))
sns.set_style("darkgrid")
data['target'].value_counts().plot.pie(autopct='%0.2f%%')
plt.title("Percentage Contribution")
plt.xlabel("percent contribution")
plt.ylabel("target")
plt.show()
Let’s look at the distribution of the number of characters in the tweets. Before doing this we need to preprocess the tweets: we will remove unwanted text such as URLs and special symbols like @, !, # etc. We will also calculate the word count, character count, average word length etc. For this I have imported the preprocessing file here.
import preprocess_kgptalkie as akhil
df=akhil.get_basic_features(data)
df.head()
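To make it clearer what kind of columns such a feature helper adds, here is a rough pandas-only sketch. It is not the implementation used by preprocess_kgptalkie: the function name basic_features and the exact definitions (for instance, characters counted without spaces) are assumptions made for illustration.

```python
import pandas as pd

def basic_features(frame, text_column="text"):
    """Return a copy of frame with simple per-tweet text statistics added."""
    out = frame.copy()
    words = out[text_column].str.split()
    out["word_counts"] = words.str.len()
    # Characters are counted without whitespace in this sketch.
    out["char_counts"] = words.str.join("").str.len()
    out["avg_wordlength"] = out["char_counts"] / out["word_counts"]
    # A hashtag is taken to be '#' followed by at least one word character.
    out["hashtag_counts"] = out[text_column].str.count(r"#\w+")
    return out

toy = pd.DataFrame({"text": ["Fire in the #forest", "so #happy #today"]})
print(basic_features(toy))
```

Working on a copy keeps the original DataFrame unchanged, which makes it easier to rerun the cell while experimenting.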
Let’s do some more plotting to understand the nature of the text in the data frame.
sns.kdeplot(df['char_counts'],fill=True,color='green') #fill replaces the deprecated shade argument
plt.show()
Let’s see how the length of a tweet varies with its nature, i.e. we count the characters in each tweet and compare the distributions for disastrous and non-disastrous tweets to see which class tends to have longer tweets.
plt.figure(figsize=(6,8))
sns.kdeplot(df[df['target']==1]['char_counts'],color='red',fill=True) #disastrous tweets
sns.kdeplot(df[df['target']==0]['char_counts'],color='green',fill=True) #non-disastrous tweets
plt.show()
Disastrous tweets tend to be longer than non-disastrous tweets. So if a tweet is lengthy, it might be more likely to relate to a complaint or dissatisfaction.
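A density plot is easier to read alongside a numeric summary. Here is a minimal sketch, assuming a DataFrame with the target and char_counts columns created above; the four rows are invented purely so the snippet runs on its own.

```python
import pandas as pd

# Invented rows; replace toy with the real df returned by get_basic_features.
toy = pd.DataFrame({
    "target": [1, 1, 0, 0],
    "char_counts": [110, 90, 60, 80],
})

# Mean and median tweet length for each class.
summary = toy.groupby("target")["char_counts"].agg(["mean", "median"])
print(summary)
```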
#Distribution of stop words in both classes
sns.boxplot(x=df['target'],y=df['stopwords_counts'])
plt.show()
#let's see how hashtags have been used in both situations
sns.violinplot(x=df['target'],y=df['hashtag_counts'])
plt.show()
In the figure above we can observe that slightly more hashtags are used in disastrous tweets.
Let’s see which words are used most and least commonly in the dataset. Based on these commonly used words we can get an idea about the overall nature of the tweets.
freq_occuring=akhil.get_word_freqs(data,'text')
top_20=freq_occuring[:20]
sns.barplot(x=top_20.index,y=top_20.values)
plt.xticks(rotation=70)
plt.show()
#least 20 occurring words (taken from the end of the sorted frequencies)
least_20=freq_occuring[-20:]
sns.barplot(x=least_20.index,y=least_20.values)
plt.xticks(rotation=70)
plt.show()
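For readers without the helper package, word frequencies can be approximated with plain pandas. This is a sketch under the assumption that words are separated by whitespace; the function name word_freqs is hypothetical and it is not claimed to match get_word_freqs exactly.

```python
import pandas as pd

def word_freqs(frame, column):
    """Count how often each whitespace-separated word occurs in a text column."""
    words = frame[column].str.lower().str.split().explode()
    return words.value_counts()  # sorted from most to least frequent

toy = pd.DataFrame({"text": ["fire fire flood", "Fire flood quake"]})
freqs = word_freqs(toy, "text")
print(freqs.head(2))   # most frequent words
print(freqs.tail(2))   # least frequent words
```

Because the result is sorted in descending order, the most common words come from the start of the Series and the rarest ones from the end.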
Step 4:- Data Cleaning
As we know, tweets contain hashtags, special symbols (@,#,$,%,^,&,*,!), URLs, numbers etc. We need to clean all of this unwanted text. Hence in this section we will clean the text so that it is in a human-readable format without any special characters.
Importing necessary text cleaning libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()
nltk.download('stopwords')
stop_words=set(stopwords.words('english')) #separate name so the imported stopwords module is not overwritten
Let’s clean the data…
def cleaner(text):
    cleaned=text.replace("//"," ").replace("."," ")
    cleaned=re.sub(r'\w+\d+'," ",cleaned) #remove alphanumeric words while the digits are still present
    cleaned=re.sub(r'[^a-zA-Z]'," ",cleaned) #keep letters only
    cleaned=cleaned.strip() #removing whitespace
    cleaned=cleaned.lower() #converting into lower case words
    cleaned=[ps.stem(word) for word in cleaned.split() if len(word)>2] #drop short words and stem the rest
    cleaned=" ".join(cleaned)
    return cleaned
data['text']=data['text'].apply(lambda text:cleaner(text))
#let's check for some text
data['text'][0:10]
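To see what the regular-expression steps do, the sketch below applies them one at a time to a single made-up tweet. Stemming is left out so the example needs only the standard library, and the URL pattern is an extra assumption that is not part of the cleaner function above.

```python
import re

# Made-up tweet used only for illustration.
tweet = "Flood warning!! 3rd alert for #Houston http://t.co/abc123 @user"

no_urls = re.sub(r"https?://\S+", " ", tweet)                 # drop links
letters = re.sub(r"[^a-zA-Z]", " ", no_urls)                  # keep letters only
tokens = [w for w in letters.lower().split() if len(w) > 2]   # drop very short words
cleaned = " ".join(tokens)
print(cleaned)
```

Removing the link before stripping non-letters matters here: otherwise fragments such as "http" and "abc" would survive as ordinary words.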
Now we will do the most sought-after visualization, i.e. a word cloud for class 0 and class 1.
from wordcloud import WordCloud,STOPWORDS
dataset=akhil.get_word_freqs(data[data['target']==1],'text')
print(dataset.index)
dataset=" ".join(dataset.index)
word_cloud=WordCloud(max_font_size=60,background_color='white').generate(dataset)
plt.imshow(word_cloud)
plt.axis('off')
plt.show()
from wordcloud import WordCloud,STOPWORDS
dataset=akhil.get_word_freqs(data[data['target']==0],'text')
print(dataset.index)
dataset=" ".join(dataset.index)
word_cloud=WordCloud(max_font_size=60,background_color='white').generate(dataset)
plt.imshow(word_cloud)
plt.axis('off')
plt.show()
Step 5:- Fine tuning BERT model
import ktrain
from ktrain import text
(x_train,y_train),(x_test,y_test),preprocess=text.texts_from_df(data,
    text_column='text',
    label_columns='target',
    maxlen=50,
    preprocess_mode='bert')
Let’s understand the above code piece by piece…
texts_from_df: ktrain preprocesses the data from the dataframe and returns five variables: (x_train,y_train), (x_test,y_test) and preprocess. The arguments passed to texts_from_df are:
data: the dataset that has been taken for the operation.
text_column: the text column present in the dataframe/dataset.
label_columns: the target/output column present in the dataset.
maxlen: the maximum number of tokens that can be present in a sentence. In the case of BERT the maximum length is 512; if we take a sentence length beyond 512 it will give an error.
preprocess_mode: this says how the preprocessing is done; in my case I have preprocessed the textual data using BERT.
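The effect of maxlen can be illustrated without BERT itself. The sketch below is a simplification, not ktrain's internal code: it splits on whitespace instead of using BERT's WordPiece tokenizer, and the helper name to_fixed_length is hypothetical. It shows the general idea that each text is wrapped in the special [CLS] and [SEP] tokens, cut down to maxlen, and padded so that all inputs have the same length.

```python
def to_fixed_length(text, maxlen):
    """Wrap tokens in [CLS]/[SEP], then truncate or pad to exactly maxlen."""
    tokens = text.split()[: maxlen - 2]          # leave room for the two special tokens
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    padding = ["[PAD]"] * (maxlen - len(tokens))
    return tokens + padding

print(to_fixed_length("forest fire near the village", maxlen=6))
print(to_fixed_length("stay safe", maxlen=6))
```

Since tweets are short, a small value such as the maxlen=50 used above keeps the inputs compact, which usually reduces memory use and training time.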
model=text.text_classifier('bert',train_data=(x_train,y_train),preproc=preprocess)
Now we will use get_learner, which wraps the model and the data; the resulting learner is then used for training and for the final prediction.
learner=ktrain.get_learner(model,train_data=(x_train,y_train),val_data=(x_test,y_test),batch_size=64)
We will give the model a learning rate and a number of epochs and train it to get the best possible results.
learner.fit_onecycle(lr=1e-5,epochs=4)
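The one-cycle policy varies the learning rate during training instead of keeping it fixed: it rises to the maximum value given as lr and then falls again. The function below is a simplified, triangular illustration of that idea. The name one_cycle_lr and the starting ratio of one tenth are assumptions for the sketch and do not reproduce ktrain's exact schedule.

```python
def one_cycle_lr(step, total_steps, max_lr, start_ratio=0.1):
    """Triangular schedule: ramp up to max_lr at the midpoint, then back down."""
    min_lr = max_lr * start_ratio
    half = total_steps / 2
    # Distance from the midpoint, scaled to the range 0 (peak) .. 1 (both ends).
    distance = abs(step - half) / half
    return max_lr - (max_lr - min_lr) * distance

total_steps = 8
schedule = [one_cycle_lr(s, total_steps, max_lr=1e-5) for s in range(total_steps + 1)]
for step, lr in enumerate(schedule):
    print(step, f"{lr:.2e}")
```

The small learning rate at the start and end is gentler on the pretrained BERT weights, while the higher rate in the middle lets the model move faster.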
I am getting an accuracy of around 58 percent. I might need to focus more on the preprocessing and cleaning to improve the model accuracy. Let’s predict the result on trending tweets.
prediction=ktrain.get_predictor(learner.model,preprocess)
data=["US did this! 😭😭"]
prediction.predict(data)
[out]>> 'not-Disastrous'
Conclusion:-
This is a basic introductory approach to solving an NLP problem. Please hang tight; I will be back with more blogs on BERT fine-tuning using TensorFlow and PyTorch. Please share your valuable opinion in the comment box.