Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Pages

Hello! My name is Wilson Wongso, a Machine Learning Engineer!

Posts

Predicting Phonemes with BERT

25 minute read

Published:

Our team at Bookbot is currently developing a grapheme-to-phoneme Python package for Bahasa Indonesia. The package is heavily inspired by its English counterpart, g2p. Much of our design and many of our methods are borrowed from that library, most notably the steps to predict phonemes. The English g2p used the following algorithm (cf. g2p’s README):

My HuggingFace JAX Community Week Experience

13 minute read

Published:

On June 23, the HuggingFace team announced that they are planning to host a community week together with the people from the Google Cloud team. The main gist of this event was getting everyone to learn and use HuggingFace’s newly integrated JAX framework. But aside from just learning from tutorials, we were equipped with blazing fast TPUs thanks to the amazing Google Cloud team 🤯.

Pneumonia Chest X-Ray Classification

7 minute read

Published:

The dataset used for this task is from a Kaggle dataset by Paul Mooney. It consists of two kinds of chest x-rays: those infected by pneumonia and those that are normal. Our main goal is to distinguish which chest x-rays show pneumonia and which do not. Note that the dataset is highly imbalanced, as many medical image datasets are.

Text Generation using minGPT and fast.ai

13 minute read

Published:

Andrej Karpathy, Tesla’s AI Director, released minGPT, a mini version of OpenAI’s GPT. Normally a GPT would have billions of parameters and would take hours to train. Karpathy’s approach is to provide a smaller version of GPT, hence the name minGPT.

MNIST Classification with Quantum Neural Network

19 minute read

Published:

TensorFlow is one of the most used deep learning frameworks today, bundled with many features for end-to-end deep learning processes. Recently, the TensorFlow team announced a new library built on top of TensorFlow, called TensorFlow Quantum. TensorFlow Quantum integrates with Cirq, which provides quantum computing algorithms, and the two work well together for tasks involving Quantum Machine Learning.

MNIST Classification with Hybrid Quantum-Classical Neural Network

14 minute read

Published:

Qiskit is IBM’s open-source framework for quantum computing, which gives users access to both simulators and real quantum computers. Today’s quantum computers are still in the Noisy Intermediate-Scale Quantum (NISQ) era and are highly sensitive to any form of interference. Unlike real quantum computers, the simulators provided by Qiskit aren’t noisy and are great for prototyping.

Handwritten Javanese Script Classification

6 minute read

Published:

Aksara Jawa, or the Javanese Script, is the core of writing the Javanese language and has influenced various other regional languages such as Sundanese and Madurese. The script is now rarely used on a daily basis, but is sometimes taught in local schools in certain provinces of Indonesia.

Doubly Linked List in C

6 minute read

Published:

After learning how to implement Singly Linked List, we’re going to implement Doubly Linked List, which is similar to Singly Linked List, but with the addition of a prev pointer which points to the node before it.
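As a rough illustration of the prev pointer described above, here is a minimal sketch. The post itself works in C; this sketch uses Python for brevity, and the names `Node`, `DoublyLinkedList`, `append`, `forward`, and `backward` are hypothetical, not taken from the post.

```python
class Node:
    """A list node with links to both of its neighbours."""

    def __init__(self, value):
        self.value = value
        self.prev = None  # node before this one (None at the head)
        self.next = None  # node after this one (None at the tail)


class DoublyLinkedList:
    """Hypothetical minimal list that tracks both ends."""

    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, value):
        """Attach a new node after the current tail."""
        node = Node(value)
        if self.tail is None:
            # Empty list: the new node is both head and tail.
            self.head = self.tail = node
        else:
            node.prev = self.tail  # the extra link a singly linked list lacks
            self.tail.next = node
            self.tail = node
        return node

    def forward(self):
        """Collect values by following next pointers from the head."""
        values, node = [], self.head
        while node is not None:
            values.append(node.value)
            node = node.next
        return values

    def backward(self):
        """Collect values by following prev pointers from the tail."""
        values, node = [], self.tail
        while node is not None:
            values.append(node.value)
            node = node.prev
        return values
```

Because every node keeps a prev link, the list can be walked from the tail back to the head without re-traversing from the front.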

Discrete and Continuous Optimization Algorithms

11 minute read

Published:

Optimization is a key process in machine learning, from which we can approach inference and learning. It allows us to decouple the mathematical specification of what we want to compute from the algorithms for how to compute it.

Singly Linked List in C

7 minute read

Published:

According to Wikipedia, a linked list is a linear collection of data elements, whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes which together represent a sequence.

Automatic Differentiation

7 minute read

Published:

Automatic Differentiation (AD) is a vital process in Deep Learning. Many deep learning techniques, such as backpropagation, rely heavily on AD. There are multiple ways to implement AD, one of which is utilizing Dual Numbers.
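To make the dual-number idea concrete, here is a minimal sketch of forward-mode AD. It assumes a dual number a + bε with ε² = 0, where a carries the value and b carries the derivative. The names `Dual`, `_lift`, `sin`, and `derivative` are hypothetical and not taken from the post.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A dual number real + dual * eps, where eps**2 == 0."""
    real: float
    dual: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.real + other.real, self.dual + other.dual)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps**2 == 0
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)

    __rmul__ = __mul__


def _lift(x):
    """Treat a plain number as a constant (zero derivative part)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def sin(x):
    """Sine extended to dual numbers using the chain rule."""
    return Dual(math.sin(x.real), math.cos(x.real) * x.dual)


def derivative(f, x):
    """Evaluate f'(x) by seeding the dual part with 1."""
    return f(Dual(x, 1.0)).dual
```

For example, `derivative(lambda x: x * x * x + 2 * x, 3.0)` evaluates the derivative 3x² + 2 at x = 3 in a single pass, without symbolic manipulation or finite differences.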

portfolio

publications

Causal and Masked Language Modeling of Javanese Language using Transformer-based Architectures

Published in 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2021

Most natural language understanding breakthroughs occur in popularly spoken languages, while low-resource languages are rarely examined. We pre-trained as well as compared different Transformer-based architectures on the Javanese language. They were trained on causal and masked language modeling tasks, with Javanese Wikipedia documents as corpus, and could then be fine-tuned to downstream natural language understanding tasks. To speed up pre-training, we transferred English word-embeddings, utilized gradual unfreezing of layers, and applied discriminative fine-tuning. We further fine-tuned our models to classify binary movie reviews and find that they were on par with multilingual/cross-lingual Transformers. We release our pre-trained models for others to use, in hopes of encouraging other researchers to work on low-resource languages like Javanese.

Recommended citation: W. Wongso, D. S. Setiawan and D. Suhartono, "Causal and Masked Language Modeling of Javanese Language using Transformer-based Architectures," 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Depok, Indonesia, 2021, pp. 1-7, doi: 10.1109/ICACSIS53237.2021.9631331.
Download Paper

Pre-trained transformer-based language models for Sundanese

Published in Journal of Big Data, 2022

The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.

Recommended citation: Wongso, W., Lucky, H. & Suhartono, D. "Pre-trained transformer-based language models for Sundanese." J Big Data 9, 39 (2022). https://doi.org/10.1186/s40537-022-00590-7
Download Paper

Many-to-Many Multilingual Translation Model for Languages of Indonesia

Published in IEEE Access, 2023

Indonesia is home to over 700 languages and most people speak their respective regional languages aside from the lingua franca. In this paper, we focus on the task of multilingual machine translation for 45 regional Indonesian languages and introduced Indo-T5 which leveraged the mT5 sequence-to-sequence language model as a baseline. Performances of bilingual and multilingual fine-tuning methods were also compared, in which we found that our models have outperformed current state-of-the-art translation models. We also investigate the use of religious texts from the Bible as an intermediate mid-resource translation domain for low-resource translation domain specialization. Our findings suggest that this two-step fine-tuning approach is highly effective in improving the quality of translations for low-resource text domains. Our results show an increase in SacreBLEU scores when evaluated on the low-resource NusaX dataset. We release our translation models for other researchers to leverage.

Recommended citation: Wongso, W., Joyoadikusumo, A., Buana, B. S., & Suhartono, D. (2023). Many-to-Many Multilingual Translation Model for Languages of Indonesia. IEEE Access.
Download Paper

NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural

Published in arXiv, 2024

Indonesia’s linguistic landscape is remarkably diverse, encompassing over 700 languages and dialects, making it one of the world’s most linguistically rich nations. This diversity, coupled with the widespread practice of code-switching and the presence of low-resource regional languages, presents unique challenges for modern pre-trained language models. In response to these challenges, we developed NusaBERT, building upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects. Through rigorous evaluation across a range of benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks involving multiple languages of Indonesia, paving the way for future natural language understanding research for under-represented languages.

Recommended citation: Wongso, W., Setiawan, D. S., Limcorn, S., & Joyoadikusumo, A. (2024). NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural. arXiv [Cs.CL]. Retrieved from https://arxiv.org/abs/2403.01817
Download Paper

IdSarcasm: Benchmarking and Evaluating Language Models for Indonesian Sarcasm Detection

Published in IEEE Access, 2024

Sarcasm detection in the Indonesian language poses a unique set of challenges due to the linguistic nuances and cultural specificities of the Indonesian social media landscape. Understanding the dynamics of sarcasm in this context requires a deep dive into language patterns and the socio-cultural background that shapes the use of sarcasm as a form of criticism and expression. In this study, we developed the first publicly available Indonesian sarcasm detection benchmark datasets from social media texts. We extensively investigated the results of classical machine learning algorithms, pre-trained language models, and recent large language models (LLMs). Our findings show that fine-tuning pre-trained language models is still superior to other techniques, achieving F1 scores of 62.74% and 76.92% on the Reddit and Twitter subsets respectively. Further, we show that recent LLMs fail to perform zero-shot classification for sarcasm detection and that tackling data imbalance requires a more sophisticated data augmentation approach than our basic methods.

Recommended citation: Suhartono, D., Wongso, W., & Handoyo, A. T. (2024). IdSarcasm: Benchmarking and Evaluating Language Models for Indonesian Sarcasm Detection. IEEE Access.
Download Paper

talks

teaching
