Pre-trained transformer-based language models for Sundanese

Published in Journal of Big Data, 2022

Recommended citation: Wongso, W., Lucky, H. & Suhartono, D. "Pre-trained transformer-based language models for Sundanese." J Big Data 9, 39 (2022). https://doi.org/10.1186/s40537-022-00590-7

Abstract

The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefit from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the size of the Sundanese pre-training corpus and did not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.
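
Since the models were released publicly, the snippet below sketches how one of them could be loaded with the Hugging Face transformers library. The repository identifier w11wo/sundanese-roberta-base and the example sentence are illustrative assumptions, not details confirmed in this abstract; substitute the actual model name from the release.

from transformers import pipeline

# Build a fill-mask pipeline for a masked language model, e.g. a
# Sundanese RoBERTa. The repository id is an assumed placeholder.
fill_mask = pipeline("fill-mask", model="w11wo/sundanese-roberta-base")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
# Sundanese: "Abdi keur diajar basa <mask>." ("I am studying the <mask> language.")
for prediction in fill_mask("Abdi keur diajar basa <mask>."):
    print(prediction["token_str"], prediction["score"])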

BibTeX Citation

@article{wongso2022pre,
  author = {Wongso, Wilson and Lucky, Henry and Suhartono, Derwin},
  date = {2022-04-13},
  doi = {10.1186/s40537-022-00590-7},
  issn = {2196-1115},
  journal = {Journal of Big Data},
  number = {1},
  pages = {39},
  title = {Pre-trained transformer-based language models for Sundanese},
  url = {https://doi.org/10.1186/s40537-022-00590-7},
  volume = {9},
  year = {2022}
}