Classifying Twitter Disaster Response Messages Through NLP

Isak Kabir · Published in Analytics Vidhya · Apr 26, 2020 · 4 min read


Woman sitting on brown concrete ground beside brown wicker basket [1]

We are facing a global health crisis due to COVID-19 — one that is killing people, spreading human suffering, and upending people’s lives. But this is much more than a health crisis. It is a human, economic and social crisis [2].

During a pandemic or a natural disaster, it is crucial to respond quickly to people's needs, which are expressed in messages sent across various channels. Machine learning algorithms using NLP can help categorise these messages so that they reach the appropriate disaster relief agencies, the ones that take care of medical aid, water, shelter, food, logistics and so on. This article uses a dataset of real Twitter messages sent during disaster events, provided by Figure Eight, to build a model for an API that classifies disaster messages.

The code is divided into three sections:

  • ETL Pipeline: reads in the data, cleans it, and stores it in a SQLite database. The script merges the messages and categories datasets, splits the categories column into separate category columns, converts the values to binary, and drops duplicates.
  • Machine Learning Pipeline: transforms the data with natural language processing and trains a machine learning model, a RandomForest classifier tuned with GridSearchCV, to classify each message into 36 categories.
  • Web development: a Flask app and user interface used to predict results and display them (a minimal sketch of such an endpoint follows this list).
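
The article leaves the web code to the repository, but here is a minimal sketch of what such a classification endpoint could look like (the route name, model path, module name and JSON response format are all my assumptions, not taken from the original project):

from flask import Flask, request, jsonify
import pickle
# the pickled pipeline references the custom tokenize() function defined later
# in this article, so that function must be importable here (hypothetical module name)
from train_classifier import tokenize

app = Flask(__name__)

# hypothetical filename; the trained pipeline is pickled at the end of the article
with open('classifier.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/classify')
def classify():
    # run the submitted message through the trained pipeline
    message = request.args.get('message', '')
    labels = model.predict([message])[0]
    return jsonify({'message': message, 'labels': labels.tolist()})

if __name__ == '__main__':
    app.run(debug=True)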

ETL Pipeline

Loading and merging data

# import libraries
import pandas as pd
from sqlalchemy import create_engine
# load messages dataset
messages = pd.read_csv('messages.csv')
# load categories dataset
categories = pd.read_csv('categories.csv')
# merge datasets
df = messages.merge(categories, on='id')
# Split categories into separate category columns.
# create a dataframe of the 36 individual category columns
categories_split = df['categories'].str.split(pat=';', expand=True)
categories_split.head()
The merged dataset, before splitting categories into separate columns

Data cleaning

# rename the columns of 'categories'
# select the first row of the categories dataframe
row = categories_split.iloc[0]
# use this row to extract a list of new column names for categories
# (each cell looks like 'related-1', so keep the part before the dash)
category_colnames = row.apply(lambda x: x.split('-')[0])
categories_split.columns = category_colnames
# convert category values to 0 and 1
for column in categories_split:
    # set each value to be the last character of the string
    categories_split[column] = categories_split[column].str[-1]
    # convert column from string to numeric
    categories_split[column] = pd.to_numeric(categories_split[column], errors='coerce')
# drop the original categories column from `df`
df.drop(['categories'], axis=1, inplace=True)
# concatenate the original dataframe with the new category columns
df = pd.concat([df, categories_split], axis=1, sort=False)
# drop duplicates
df.drop_duplicates(inplace=True)
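
The overview above promises binary values; in this dataset the related column also contains a handful of 2s, so a common extra cleaning step (my addition, not shown in the original code) is:

# map the stray 2s in 'related' to 1 so every category is strictly 0/1
df['related'] = df['related'].replace(2, 1)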

Save the dataframe to a SQLite database

engine = create_engine('sqlite:///InsertDatabaseName.db')
df.to_sql('InsertTableName', engine, index=False)

Machine Learning Pipeline

Import libraries and load data from database.

# import libraries
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
import pandas as pd
from sqlalchemy import create_engine
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import multioutput
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support
import re
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
import pickle
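
The heading above mentions loading data from the database, but that step is not shown; here is a minimal sketch, assuming the database and table names from the ETL section and the dataset's usual column layout (id, message, original, genre, then the 36 categories):

# read the table written by the ETL pipeline
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('InsertTableName', engine)
# features are the raw message text; targets are the 36 category columns
X = df['message']
y = df.iloc[:, 4:]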

Tokenization function to process text data

def tokenize(text):
    # normalize text: lowercase and replace anything non-alphanumeric with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    # stopword list
    stop_words = stopwords.words("english")

    # tokenize
    words = word_tokenize(text)

    # stemming
    stemmed = [PorterStemmer().stem(w) for w in words]

    # lemmatizing, with stopwords filtered out
    words_lemmed = [WordNetLemmatizer().lemmatize(w) for w in stemmed if w not in stop_words]

    return words_lemmed
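
A quick check of the tokenizer on a made-up message (both the sentence and the output shown are my own illustration, not from the article):

tokenize("We need water and medical supplies urgently!")
# -> ['need', 'water', 'medic', 'suppli', 'urgent']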

Build a machine learning pipeline

This ML pipeline takes a message as input and outputs classification results for all 36 categories.

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', multioutput.MultiOutputClassifier(RandomForestClassifier()))
])
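
The training code below fits a model called new_pipeline, and the imports include GridSearchCV; here is a plausible sketch of how that tuned pipeline could be built (the parameter grid is an assumption, not taken from the original):

# hypothetical parameter grid -- the exact values tuned in the original
# project are not shown in the article
parameters = {
    'vect__max_df': [0.75, 1.0],
    'clf__estimator__n_estimators': [50, 100],
}
# GridSearchCV wraps the pipeline, so fitting new_pipeline below both tunes
# the hyperparameters and trains the final model
new_pipeline = GridSearchCV(pipeline, param_grid=parameters, cv=3, verbose=2)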

Train pipeline

Split data into train and test sets and train pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)
# train classifier
new_pipeline.fit(X_train, y_train)

Test your model

Report the F1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's precision_recall_fscore_support on each.

y_pred = new_pipeline.predict(X_test)

# Compute precision, recall and F-score for every category
# and collect them in a dataframe.
def get_results(y_test, y_pred):
    results = pd.DataFrame(columns=['Category', 'f_score', 'precision', 'recall'])
    for num, cat in enumerate(y_test.columns):
        precision, recall, f_score, support = precision_recall_fscore_support(
            y_test[cat], y_pred[:, num], average='weighted')
        # DataFrame.set_value has been removed from pandas; .loc does the same job
        results.loc[num + 1] = [cat, f_score, precision, recall]
    print('Aggregated f_score:', results['f_score'].mean())
    print('Aggregated precision:', results['precision'].mean())
    print('Aggregated recall:', results['recall'].mean())
    return results
#---------------------------------------------------------
results_tuned = get_results(y_test, y_pred)
results_tuned

Aggregated f_score: 0.940272096531
Aggregated precision: 0.94022097111
Aggregated recall: 0.947979181501
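
The imports include pickle, though the article does not show the saving step; here is a minimal sketch (the filename is an assumption) that produces the file loaded by the Flask sketch earlier:

# persist the trained model so the web app can load it
with open('classifier.pkl', 'wb') as f:
    pickle.dump(new_pipeline, f)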

Code can be found in this repository

References:

  1. https://unsplash.com/photos/iZ2v4FwtMLc
  2. https://www.un.org/development/desa/dspd/2020/04/social-impact-of-covid-19/
