Classifying Twitter Disaster Response Messages Through NLP
We are facing a global health crisis due to COVID-19 — one that is killing people, spreading human suffering, and upending people’s lives. But this is much more than a health crisis. It is a human, economic and social crisis [2].
During a pandemic or a natural disaster, it is crucial to respond quickly to people’s needs, which are expressed in messages sent across various channels. Machine learning algorithms using NLP can help categorise messages so that they can be routed to the appropriate disaster relief agencies that take care of medical aid, water, shelter, food, logistics, etc. This article uses a data set, provided by Figure Eight, of real Twitter messages sent during disaster events to build a model for an API that classifies disaster messages.
The code is divided into three sections:
- ETL Pipeline: Reads in the data, cleans it, and stores it in a SQL database. The script merges the messages and categories datasets, splits the categories column into separate columns, converts the values to binary, and drops duplicates.
- Machine Learning Pipeline: Transforms the data using natural language processing and trains a machine learning model, using GridSearchCV and RandomForest, to classify the message behind each tweet across 36 categories.
- Web development: A Flask app and user interface used to predict results and display them.
ETL Pipeline
Loading and merging data
# import libraries
import pandas as pd
from sqlalchemy import create_engine
# load messages dataset
messages = pd.read_csv('messages.csv')
# load categories dataset
categories = pd.read_csv('categories.csv')
# merge datasets
df = messages.merge(categories, on='id')
# Split categories into separate category columns.
# create a dataframe of the 36 individual category columns
categories_split = df['categories'].str.split(pat=';', expand=True)
categories_split.head()
Data cleaning
# rename the columns of 'categories'
# select the first row of the categories dataframe
row = categories_split.iloc[0]
# use this row to extract a list of new column names for categories.
category_colnames = row.apply(lambda x: x.rstrip('- 0 1'))
categories_split.columns = category_colnames
# convert category values to 0 and 1
for column in categories_split:
    # set each value to be the last character of the string
    categories_split[column] = categories_split[column].str[-1]
    # convert column from string to numeric
    categories_split[column] = pd.to_numeric(categories_split[column], errors='coerce')
# drop the original categories column from `df`
df.drop(['categories'], axis=1, inplace=True)
# concatenate the original dataframe with the new `categories` dataframe
df = pd.concat([df, categories_split], axis=1, sort=False)
# drop duplicates
df.drop_duplicates(inplace=True)
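To make the splitting and cleaning steps concrete, here is a minimal sketch on a tiny toy frame. The ids, messages, and the three category names are made up for illustration (the real dataset has 36 categories), and the column names are built with a list comprehension as one possible alternative to `rstrip`.

```python
import pandas as pd

# Toy stand-ins for messages.csv and categories.csv (hypothetical rows).
# The categories frame contains one duplicated row on purpose.
messages = pd.DataFrame({
    "id": [1, 2],
    "message": ["We need water", "Road is blocked"],
})
categories = pd.DataFrame({
    "id": [1, 2, 2],
    "categories": [
        "related-1;water-1;shelter-0",
        "related-1;water-0;shelter-0",
        "related-1;water-0;shelter-0",
    ],
})

df = messages.merge(categories, on="id")

# One column per category, each cell still looks like "water-1"
split = df["categories"].str.split(pat=";", expand=True)

# Column names come from the first row with the "-0"/"-1" suffix removed
split.columns = [value.rsplit("-", 1)[0] for value in split.iloc[0]]

# Keep only the trailing digit and make it numeric
for column in split:
    split[column] = pd.to_numeric(split[column].str[-1])

# Swap the raw string column for the numeric ones, then drop the duplicate row
df = pd.concat([df.drop(columns="categories"), split], axis=1).drop_duplicates()
print(df)
```

The duplicated category row produces a duplicated merged row, which `drop_duplicates` removes at the end.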
Save dataframe in SQL database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df.to_sql('InsertTableName', engine, index=False)
Machine Learning Pipeline
Import libraries and load data from database.
# import libraries
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
import pandas as pd
from sqlalchemy import create_engine
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import multioutput
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support
import re
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
import pickle
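The training code further down expects a feature series `X` and a label frame `y`. A minimal sketch of how they could be loaded, under stated assumptions: the table holds a `message` text column plus one numeric column per category, and everything except `id` and `message` is a label. The column names and rows below are hypothetical, and the standard-library `sqlite3` connection stands in for the SQLAlchemy engine so that the example is self-contained.

```python
import sqlite3

import pandas as pd

# In-memory database standing in for the file written by the ETL step
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "id": [1, 2],
    "message": ["We need water", "Road is blocked"],
    "related": [1, 1],
    "water": [1, 0],
}).to_sql("InsertTableName", conn, index=False)

# Read the table back and separate the text from the category labels
df = pd.read_sql("SELECT * FROM InsertTableName", conn)
X = df["message"]                        # assumed name of the text column
y = df.drop(columns=["id", "message"])   # assumed: all remaining columns are labels
conn.close()

print(X.shape, y.shape)
```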
Tokenization function to process text data
def tokenize(text):
    # normalize text
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # stopword list
    stop_words = stopwords.words("english")
    # tokenize
    words = word_tokenize(text)
    # stemming
    stemmed = [PorterStemmer().stem(w) for w in words]
    # lemmatizing
    words_lemmed = [WordNetLemmatizer().lemmatize(w) for w in stemmed if w not in stop_words]
    return words_lemmed
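The normalization step at the top of `tokenize` can be checked without any NLTK downloads. The sketch below applies the same regular expression to a made-up message; the `normalize` helper is hypothetical, and the whitespace split is only a simplified stand-in for `word_tokenize`.

```python
import re

def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

sample = "We need WATER, food & shelter!!"   # hypothetical message
tokens = normalize(sample).split()            # whitespace split as a stand-in tokenizer
print(tokens)
```

Punctuation and the ampersand are replaced by spaces, so only lowercase word tokens reach the stemming and lemmatizing steps.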
Build a machine learning pipeline
This ML pipeline takes a message as input and outputs classification results for the 36 categories.
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', multioutput.MultiOutputClassifier(RandomForestClassifier()))
])
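The article mentions tuning with GridSearchCV, and the later code refers to a fitted `new_pipeline`. One way such a tuned pipeline could be built is sketched below. The toy messages, labels, parameter grid, and the use of the default tokenizer are all assumptions for illustration, not the settings used in the original project.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: six messages, two categories
X = pd.Series([
    "we need water", "roof collapsed need shelter", "no water here",
    "looking for shelter", "send water please", "shelter destroyed",
])
y = pd.DataFrame({
    "water":   [1, 0, 1, 0, 1, 0],
    "shelter": [0, 1, 0, 1, 0, 1],
})

pipeline = Pipeline([
    ("vect", CountVectorizer()),   # default tokenizer instead of tokenize()
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Parameter names follow the step__param convention; the classifier wrapped
# by MultiOutputClassifier is reached through its "estimator" attribute
param_grid = {"clf__estimator__n_estimators": [5, 10]}

# cv=2 only because the toy data is tiny
new_pipeline = GridSearchCV(pipeline, param_grid=param_grid, cv=2)
new_pipeline.fit(X, y)

print(new_pipeline.best_params_)
print(new_pipeline.predict(["need water"]))
```

After fitting, the grid search object exposes `predict` like any estimator, so it can be used wherever a plain fitted pipeline would be.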
Train pipeline
Split the data into train and test sets and train the pipeline.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 22)
# train classifier
new_pipeline.fit(X_train, y_train)
Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn’s classification_report on each; the code below uses precision_recall_fscore_support to collect the same metrics in a dataframe.
y_pred = new_pipeline.predict(X_test)
# Get results and add them to a dataframe.
def get_results(y_test, y_pred):
    rows = []
    for num, cat in enumerate(y_test.columns):
        precision, recall, f_score, support = precision_recall_fscore_support(y_test[cat], y_pred[:, num], average='weighted')
        rows.append({'Category': cat, 'f_score': f_score, 'precision': precision, 'recall': recall})
    # DataFrame.set_value was removed in pandas 1.0, so the rows are collected first
    results = pd.DataFrame(rows, columns=['Category', 'f_score', 'precision', 'recall'])
    # keep the 1-based row labels
    results.index += 1
    print('Aggregated f_score:', results['f_score'].mean())
    print('Aggregated precision:', results['precision'].mean())
    print('Aggregated recall:', results['recall'].mean())
    return results
y_pred_tuned_ada = new_pipeline.predict(X_test)
results_tuned = get_results(y_test, y_pred_tuned_ada)
results_tuned
Aggregated f_score: 0.940272096531
Aggregated precision: 0.94022097111
Aggregated recall: 0.947979181501
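The per-category report mentioned earlier can also be produced directly with `classification_report`. A minimal sketch on made-up labels and predictions for two hypothetical categories:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions for two categories
y_test = pd.DataFrame({"water": [1, 0, 1, 1], "shelter": [0, 0, 1, 0]})
y_pred = np.array([[1, 0], [0, 0], [0, 1], [1, 1]])

reports = {}
for i, cat in enumerate(y_test.columns):
    # output_dict=True returns the metrics as a nested dict instead of text
    reports[cat] = classification_report(
        y_test[cat], y_pred[:, i], output_dict=True, zero_division=0
    )
    print(cat)
    print(classification_report(y_test[cat], y_pred[:, i], zero_division=0))
```

Unlike a single weighted average, the report lists precision and recall for the positive and negative class separately. This matters for rare categories, where a weighted score can stay high even if few positive messages are found.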
Code can be found in this repository
References: