Transfer learning with transformers for minority language NER
1. Overview: The problem and solution
The Problem
- For many e-commerce companies (e.g., Amazon), address-related information can be very informative and can be harnessed to build more accurate geocoding, which in turn enables faster and more efficient shipping. Here I work with real-world data provided by Shopee, the leading online shopping platform in Southeast Asia. They are interested in extracting the POI (point of interest) and street name from each customer address, but the address-related information they receive is usually unstructured, free-form text. An example of such a raw address appears in the Data Preprocessing section below (note that we are working with Indonesian here).
The Solution
- To solve this problem, I fine-tuned an Indonesian BERT model with the Hugging Face transformers library to perform Named Entity Recognition (i.e., token classification). The major steps are summarized below, and each step is elaborated in the following sections.
- Used the IOBES annotation scheme to label each word.
- Tokenized the text inputs and aligned the labels using a pre-trained BERT tokenizer for Indonesian.
- Added a token classification head and fine-tuned both the body and the head (a minimal setup sketch follows this list).
- Made predictions on unlabeled, unstructured addresses and reconstructed words from tokens.
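As a rough sketch of this setup, the snippet below loads a pre-trained Indonesian BERT checkpoint and attaches a token classification head sized for the IOBES tag set. The checkpoint name `cahya/bert-base-indonesian-522M` is an assumption for illustration; any Indonesian BERT checkpoint on the Hugging Face Hub would play the same role.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# IOBES tags for the two entity types (POI and STR), plus O for non-entities
LABELS = ["O",
          "B-POI", "I-POI", "E-POI", "S-POI",
          "B-STR", "I-STR", "E-STR", "S-STR"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

MODEL_NAME = "cahya/bert-base-indonesian-522M"  # assumed checkpoint choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    id2label=id2label,
    label2id=label2id,
)
```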
2. Data Preprocessing and Model Inputs
- Given a raw, unstructured address: `jalan tipar cakung no 26 depan rusun albo garasi dumtruk`
- Tokens: using the pre-trained tokenizer, the raw address was split into sub-word tokens. `[CLS]` and `[SEP]` are also automatically added to the start and end of the sequence.
- `Tokens_ID`: integers that map each token to the vocabulary of the pre-trained model.
- Word IDs: specify which word each token belongs to. For example, both the token `tip` and the token `##ar` have word ID 1, indicating that these two tokens come from the same word, the second word in the sequence.
- Labels: specify the named entity of each token. For example, the token `jalan` has the label `B-STR`, indicating that it is the beginning of a street entity; the token `##o` has the label `E-POI`, indicating that it is the end of a POI entity.
- `Labels_id`: the integer category coding for all labels. (A sketch of this tokenize-and-align step is shown below.)
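This alignment can be done with a fast tokenizer's `word_ids()` mapping. A minimal sketch, building on the `tokenizer` and `label2id` defined earlier (the helper name and its exact behavior are illustrative assumptions):

```python
def tokenize_and_align_labels(words, word_labels):
    """Tokenize a pre-split address and align word-level IOBES labels
    with the resulting sub-word tokens.

    words       -- e.g. ["jalan", "tipar", "cakung", ...]
    word_labels -- one IOBES tag per word, e.g. ["B-STR", "I-STR", ...]
    """
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels = []
    for word_id in enc.word_ids():
        if word_id is None:
            # Special tokens such as [CLS]/[SEP]: -100 is ignored by the loss
            labels.append(-100)
        else:
            # Every sub-token inherits its word's label (as in the ##o -> E-POI
            # example above); labeling only the first sub-token and masking the
            # rest with -100 is a common alternative.
            labels.append(label2id[word_labels[word_id]])
    enc["labels"] = labels
    return enc
```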
- The model was trained with mini-batch gradient descent. Each mini-batch was prefetched and padded to the longest sequence in the batch using a data collator, sketched below.
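The transformers library ships a collator for exactly this case; a minimal sketch:

```python
from transformers import DataCollatorForTokenClassification

# Pads input_ids / attention_mask to the longest sequence in each batch
# and pads the label column with -100 so padded positions are ignored.
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
```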
- The inputs of the model are `Tokens_ID` and `Labels_id`, along with the `attention_mask` for each sequence.
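Putting the pieces together, a single-example forward pass might look like the following. The word-level tags here are hypothetical, shown only to make the example runnable; the real tags come from the labeled training data.

```python
import torch

# The example address from above, pre-split into words, with hypothetical
# word-level IOBES tags for illustration only.
words = ["jalan", "tipar", "cakung", "no", "26",
         "depan", "rusun", "albo", "garasi", "dumtruk"]
word_labels = ["B-STR", "I-STR", "E-STR", "O", "O",
               "O", "B-POI", "E-POI", "O", "O"]

batch = data_collator([tokenize_and_align_labels(words, word_labels)])
with torch.no_grad():
    out = model(**batch)
print(out.loss, out.logits.shape)  # scalar loss, (1, seq_len, num_labels)
```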
3. Model Training and Evaluation
- The model was trained with the Adam optimizer and learning rate decay.
- With more training epochs, the training loss kept dropping, but the validation loss eventually started rising, indicating overfitting. The model was therefore restored to the weights saved after the third epoch. A training sketch is shown below.
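A minimal sketch of this setup with the transformers `Trainer`, which uses AdamW with linear learning-rate decay by default. The hyperparameter values, output path, and the `train_dataset` / `val_dataset` objects are assumptions for illustration:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="indonesian-address-ner",   # hypothetical output path
    learning_rate=2e-5,                    # illustrative value
    lr_scheduler_type="linear",            # learning-rate decay
    num_train_epochs=10,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,           # restore the best checkpoint,
    metric_for_best_model="eval_loss",     # e.g., the weights after epoch 3
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,           # assumed: tokenized and aligned
    eval_dataset=val_dataset,              # assumed: tokenized and aligned
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```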
- The final model was evaluated on the validation set, and the F1 score was computed for each tag category (POI and STR) using seqeval, as sketched below. The results show that the model classified tokens in both categories with high F1 scores.
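seqeval scores predictions at the entity level from tag sequences, so the label IDs first have to be mapped back to IOBES strings, skipping the positions masked with `-100`. A sketch, continuing from the objects above:

```python
from seqeval.metrics import classification_report, f1_score

def to_tag_sequences(pred_ids, label_ids):
    """Map ID sequences back to IOBES tag strings, dropping masked positions."""
    preds, refs = [], []
    for p_seq, l_seq in zip(pred_ids, label_ids):
        preds.append([id2label[p] for p, l in zip(p_seq, l_seq) if l != -100])
        refs.append([id2label[l] for p, l in zip(p_seq, l_seq) if l != -100])
    return preds, refs

output = trainer.predict(val_dataset)
pred_ids = output.predictions.argmax(-1)
preds, refs = to_tag_sequences(pred_ids, output.label_ids)
print(classification_report(refs, preds))  # per-category (POI, STR) P/R/F1
print("overall F1:", f1_score(refs, preds))
```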
4. Model for Prediction
- Before fine-tuning, the model makes essentially random predictions for the token labels, with a high loss. After fine-tuning, it accurately predicts each token's label. Note that the loss is computed from each token's logits, so even when every predicted label is correct, the loss may still be non-zero. A prediction sketch, including reconstructing words from sub-word tokens, is shown below.
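As a sketch of the prediction step (the helper name `predict_address` is hypothetical; it builds on the `tokenizer`, `model`, and `id2label` defined earlier), one common way to reconstruct word-level tags is to keep the prediction of each word's first sub-token:

```python
import torch

def predict_address(text):
    """Predict word-level IOBES tags for a raw, whitespace-separated address."""
    words = text.split()
    enc = tokenizer(words, is_split_into_words=True,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]
    pred_ids = logits.argmax(-1).tolist()

    # Keep one tag per word: the prediction of its first sub-token,
    # skipping special tokens ([CLS]/[SEP]), whose word ID is None.
    tags, prev = [], None
    for idx, word_id in enumerate(enc.word_ids()):
        if word_id is not None and word_id != prev:
            tags.append(id2label[pred_ids[idx]])
        prev = word_id
    return list(zip(words, tags))

print(predict_address("jalan tipar cakung no 26 depan rusun albo garasi dumtruk"))
```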