For many e-commerce companies (e.g., Amazon), address-related information can be very informative and can be harnessed to build more accurate geocoding, which in turn enables faster and more efficient shipping. Here I work with real-world data provided by Shopee, the leading online shopping platform in Southeast Asia. Shopee wants to extract the POI (point of interest) and the street name from each customer's address, but the address information it receives is usually unstructured, free-form text. Here is an example (note that we are working with Indonesian here):
The Solution
To solve this problem, I fine-tuned an Indonesian BERT model with the Hugging Face transformers library to perform Named-Entity Recognition (i.e., token classification). The major steps are summarized below; each is elaborated in the following sections.
Add a token-classification head and fine-tune both the body and the head (a loading sketch follows this list).
Make predictions on unlabeled, unstructured addresses, and reconstruct words from the predicted tokens.
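Here is a minimal sketch of that setup; the checkpoint name and the exact tag set are assumptions on my part (any Indonesian BERT checkpoint and a BIE-style scheme inferred from the label examples below would work similarly):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Label set inferred from the B-STR / E-POI examples below; the exact
# tagging scheme used in the project is an assumption.
LABELS = ["O", "B-POI", "I-POI", "E-POI", "B-STR", "I-STR", "E-STR"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

# Checkpoint name is an assumption; any Indonesian BERT checkpoint would do.
CHECKPOINT = "indobenchmark/indobert-base-p1"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
# Loads the pre-trained BERT body and adds a randomly initialized
# token-classification head; both are updated during fine-tuning.
model = AutoModelForTokenClassification.from_pretrained(
    CHECKPOINT, num_labels=len(LABELS), id2label=id2label, label2id=label2id
)
```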
Data Preprocessing and Model Inputs
Given a raw, unstructured address: jalan tipar cakung no 26 depan rusun albo garasi dumtruk
Tokens: using the pre-trained tokenizer, the raw address is split into sub-word tokens. [CLS] and [SEP] are automatically added to the start and end of the sequence.
Token IDs: integers that map each token to the pre-trained model's vocabulary.
Word IDs: specify which word each token belongs to. For example, the tokens tip and ##ar both have word ID 1, indicating that they come from the same word, which is the second word in the sequence.
Labels: specify the named entity of each token. For example, the token jalan has the label B-STR, indicating it is the beginning of a street entity; the token ##o has the label E-POI, indicating it is the end of a POI entity.
Label IDs: integer encodings of the labels (see the alignment sketch below).
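A sketch of how these pieces are produced with the transformers tokenizer (the tokenizer from the sketch above); the word-level labels here are hypothetical, for illustration only:

```python
# Tokenize one pre-split address and align word-level labels to tokens.
address = "jalan tipar cakung no 26 depan rusun albo garasi dumtruk".split()
# Hypothetical word-level labels, for illustration only.
word_labels = ["B-STR", "I-STR", "E-STR", "O", "O",
               "O", "B-POI", "E-POI", "O", "O"]

encoding = tokenizer(address, is_split_into_words=True)
tokens = encoding.tokens()      # ['[CLS]', 'jalan', 'tip', '##ar', ...]
word_ids = encoding.word_ids()  # [None, 0, 1, 1, ...]; None marks [CLS]/[SEP]

# Each sub-word token inherits its word's label, as in the ##o -> E-POI
# example above; -100 marks special tokens so the loss ignores them.
labels_id = [
    -100 if wid is None else label2id[word_labels[wid]]
    for wid in word_ids
]
```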
The model was trained with mini-batch gradient descent. Each mini-batch was prefetched and padded to the longest sequence in the batch using a data collator.
The inputs to the model are the token IDs and label IDs, along with the attention mask for each sequence.
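This per-batch padding is what DataCollatorForTokenClassification provides; a minimal sketch, assuming a train_dataset that yields dicts of input_ids, attention_mask, and labels (the dataset name and batch size are placeholders):

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorForTokenClassification

# Pads input_ids, attention_mask, and labels to the longest sequence in
# each mini-batch; labels are padded with -100 so the loss skips padding.
collator = DataCollatorForTokenClassification(tokenizer)

# train_dataset is assumed to yield dicts with input_ids, attention_mask,
# and labels; the batch size is a placeholder.
train_loader = DataLoader(train_dataset, batch_size=32,
                          shuffle=True, collate_fn=collator)
```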
Model Training and Evaluation
The model was trained with Adam and a learning-rate decay schedule.
With more training epochs, the training loss kept dropping, but the validation loss eventually started to rise, indicating overfitting. Thus, the model was restored to the weights from the end of the third epoch.
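A sketch of this training recipe with the transformers Trainer, which defaults to AdamW with a decaying learning-rate schedule and can restore the best checkpoint by validation loss; all hyperparameter values below are placeholders, not the project's settings:

```python
from transformers import Trainer, TrainingArguments

# All hyperparameter values are placeholders, not the project's settings.
args = TrainingArguments(
    output_dir="address-ner",
    learning_rate=5e-5,            # Trainer uses AdamW by default
    lr_scheduler_type="linear",    # linear learning-rate decay
    num_train_epochs=10,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # restore the checkpoint with the
    metric_for_best_model="eval_loss",  # lowest validation loss
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=collator,
    tokenizer=tokenizer,
)
trainer.train()  # weights from the best epoch are loaded at the end
```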
The final model was evaluated on the validation set, and the F1 score was computed for each entity category (POI and STR) using seqeval. The results show that the model classifies tokens with strong performance on both categories.
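A sketch of the seqeval call; y_true and y_pred are assumed to already hold the gold and predicted tag sequences, with special tokens filtered out:

```python
from seqeval.metrics import classification_report, f1_score

# y_true / y_pred are assumed to be lists of per-sequence tag lists, e.g.
# [["B-STR", "I-STR", "E-STR", "O", ...], ...], with special tokens
# (word ID None / label -100) already filtered out.
print(f1_score(y_true, y_pred))               # overall entity-level F1
print(classification_report(y_true, y_pred))  # per-entity (POI, STR) scores
```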
Model For Prediction
As shown below, before fine-tuning, the model makes essentially random predictions for the token labels, with a correspondingly high loss. After fine-tuning, it accurately predicts each token's label. Note that the loss is computed from each token's logits, so even when every predicted label is correct, the loss may still be nonzero.
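A sketch of the prediction and word-reconstruction step, reusing the tokenizer and model from the earlier sketches; predict_entities is a hypothetical helper, not the project's code:

```python
import torch

def predict_entities(raw_address):
    """Predict a label per token, then merge sub-word tokens back into
    words, keeping one label per word (illustrative helper only)."""
    enc = tokenizer(raw_address, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]        # (seq_len, num_labels)
    pred_ids = logits.argmax(dim=-1).tolist()

    words, labels = [], []
    for token_id, wid, pid in zip(enc["input_ids"][0].tolist(),
                                  enc.word_ids(), pred_ids):
        if wid is None:                        # skip [CLS] / [SEP]
            continue
        token = tokenizer.convert_ids_to_tokens(token_id)
        if wid == len(words):                  # first token of a new word
            words.append(token)
            labels.append(id2label[pid])
        else:                                  # continuation: strip the "##"
            words[-1] += token.removeprefix("##")
    return list(zip(words, labels))

print(predict_entities("jalan tipar cakung no 26 depan rusun albo garasi dumtruk"))
```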