Data Preprocessing Techniques¶
(1) Data Cleaning: The following function cleans the text data. As the parameter we pass the entire dataframe, which is expected to have a column named 'text'. After cleaning, the function simply returns the dataframe for further processing. To install the required package, run 'pip install neattext' before importing the library.
import neattext.functions as nfx

def get_clean_text(df):
    df['text'] = df['text'].apply(nfx.remove_userhandles)
    df['text'] = df['text'].apply(nfx.remove_punctuations)
    df['text'] = df['text'].apply(nfx.remove_emojis)
    df['text'] = df['text'].apply(nfx.remove_hashtags)
    df['text'] = df['text'].apply(nfx.remove_html_tags)
    df['text'] = df['text'].apply(nfx.remove_stopwords)
    df['text'] = df['text'].apply(nfx.remove_urls)
    df['text'] = df['text'].apply(nfx.remove_phone_numbers)
    return df
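For illustration, a minimal usage sketch is shown below; the sample dataframe and its contents are hypothetical and only demonstrate the expected input format (a dataframe with a 'text' column).

import pandas as pd

# Hypothetical sample data with a 'text' column (any such dataframe works).
sample_df = pd.DataFrame({"text": [
    "@john Loved this product!!! 😍 #awesome https://example.com",
    "Call me at 555-123-4567 <b>today</b>"
]})

cleaned_df = get_clean_text(sample_df)
print(cleaned_df['text'].tolist())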
(2) Tokenization: Before feeding text data into the model we need to tokenize the text. We can install the transformers package, from which the AutoTokenizer class is loaded. We have to define the path of the folder where the 'special_tokens_map.json', 'tokenizer_config.json', 'tokenizer.json', and 'vocab.txt' files exist. As a parameter, the following function takes only a single text/sentence, so you can call it as many times as you have sentences. It returns a dictionary of tensors.
# Tokenize a sentence and map the tokens to their word IDs.
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("path of tokenizer", do_lower_case=True)

def tokenize(comment):
    input_ids = []
    attention_masks = []
    token_type_ids = []
    encoded_dict = tokenizer.encode_plus(
        str(comment),                 # Sentence to encode.
        add_special_tokens=True,      # Add '[CLS]' and '[SEP]'.
        max_length=512,               # Pad & truncate all sentences.
        padding='max_length',
        return_attention_mask=True,   # Construct attention masks.
        return_tensors='pt',          # Return PyTorch tensors.
        truncation=True
    )
    # Add the encoded sentence to the list.
    input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])
    token_type_ids.append(encoded_dict['token_type_ids'])
    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    token_type_ids = torch.cat(token_type_ids, dim=0)
    return {"input_ids": input_ids, "mask": attention_masks, "token_type_ids": token_type_ids}
Model Initialization¶
The following class defines the text classification model. The class object is initialized with three values: the number of classes we are predicting, the dropout value, and the pretrained model name; the 'model_ckpt' value can be replaced with any transformer model available on Hugging Face. In the forward pass we take the pooled output, connect it to a dropout layer, and then pass it to the final output layer, which produces one score (logit) per class.
from transformers import AutoModel
import torch.nn as nn

class TextClassification(nn.Module):
    def __init__(self, n_classes, dropout, model_ckpt):
        super(TextClassification, self).__init__()
        self.bert = AutoModel.from_pretrained(model_ckpt)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(768, n_classes)

    def forward(self, ids, mask, token_type_ids):
        # Index 1 of the BERT output is the pooled [CLS] representation.
        pooledOut = self.bert(ids, attention_mask=mask, token_type_ids=token_type_ids)
        dropOut = self.dropout(pooledOut[1])
        output = self.out(dropOut)
        return output

model = TextClassification(n_classes=5, dropout=0.9, model_ckpt="bert-base-uncased")
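To confirm the model is wired up correctly, we can run a dummy forward pass on a random batch; the token ids below are arbitrary placeholders, used only to show that the output has shape (batch_size, n_classes).

dummy_ids = torch.randint(0, 1000, (2, 512))             # arbitrary token ids
dummy_mask = torch.ones(2, 512, dtype=torch.long)        # no padding positions
dummy_type_ids = torch.zeros(2, 512, dtype=torch.long)   # single-segment input

with torch.no_grad():
    logits = model(dummy_ids, dummy_mask, dummy_type_ids)
print(logits.shape)  # torch.Size([2, 5])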
The following function is used to predict all target classes. As parameters, it takes the 'input ids', 'attention mask', and 'token type ids', as well as the device (cpu or gpu). Because the outputs are tensors, we first move the data to the cpu in order to convert it to numpy, optionally convert it further to a list, and then return the result.
def Classification(ids, mask, type_ids, device):
    ids = ids.to(device, dtype=torch.long)
    mask = mask.to(device, dtype=torch.long)
    token_type_ids = type_ids.to(device, dtype=torch.long)
    outputs = model(ids=ids, mask=mask, token_type_ids=token_type_ids)
    # Sigmoid turns the raw logits into per-class scores.
    result = torch.sigmoid(outputs)
    result = result.detach().cpu().numpy()
    scores = result.tolist()
    return scores
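Putting the pieces together, an end-to-end inference sketch might look like the following; the example sentence is made up, and we assume the tokenizer and model defined above are already loaded.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

encoded = tokenize("The delivery was late and the package arrived damaged.")
scores = Classification(encoded["input_ids"], encoded["mask"], encoded["token_type_ids"], device)
print(scores)  # one sigmoid score per class, e.g. [[0.12, 0.55, 0.08, 0.31, 0.44]]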