Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Study Notes
Works in the Wild
Posts
Predicting Phonemes with BERT
Published:
Our team at Bookbot is currently developing a grapheme-to-phoneme Python package for Bahasa Indonesia. The package is highly inspired by its English counterpart, g2p. A lot of our design and methods are borrowed from that library, most notably the steps to predict phonemes. The English g2p uses the following algorithm (cf. g2p's README):
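The algorithm itself is elided from this excerpt; as a rough, illustrative sketch of the dictionary-lookup-with-model-fallback approach such g2p libraries typically take (the lexicon entries and function names below are hypothetical, not the package's actual API):

```python
# Hypothetical sketch of a typical g2p pipeline: check a pronunciation
# dictionary first, fall back to a trained model for out-of-vocabulary words.
LEXICON = {"makan": "m a k a n", "buku": "b u k u"}  # toy lexicon

def predict_with_model(word: str) -> str:
    # Placeholder for a seq2seq/neural fallback; here, naive letter-to-sound.
    return " ".join(word)

def g2p(word: str) -> str:
    word = word.lower()
    if word in LEXICON:                  # 1. dictionary lookup
        return LEXICON[word]
    return predict_with_model(word)      # 2. model fallback for OOV words

print(g2p("makan"))  # m a k a n
print(g2p("nasi"))   # falls back to the model: n a s i
```

The key design point is that the expensive model only runs for words missing from the lexicon.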
My HuggingFace JAX Community Week Experience
Published:
On June 23, the HuggingFace team announced that they were planning to host a community week together with people from the Google Cloud team. The main gist of this event was getting everyone to learn and use HuggingFace's newly integrated JAX framework. But aside from just learning from tutorials, we were equipped with blazing fast TPUs thanks to the amazing Google Cloud team 🤯.
Pneumonia Chest X-Ray Classification
Published:
The dataset used for this task is from a Kaggle dataset by Paul Mooney. It consists of two kinds of chest x-rays: those infected by pneumonia, and normal ones. Our main goal is to distinguish the pneumonia-infected x-rays from the normal ones. Note that the dataset is highly imbalanced, as many medical image datasets are.
Text Generation using minGPT and fast.ai
Published:
Andrej Karpathy, Tesla's AI Director, released minGPT, a mini version of OpenAI's GPT. Normally a GPT would have billions of parameters and would take hours to train. Karpathy's approach is to provide a smaller version of GPT, hence the name minGPT.
MNIST Classification with Quantum Neural Network
Published:
TensorFlow is one of the most used deep learning frameworks today, bundled with many features for end-to-end deep learning processes. Recently, they announced a new library on top of TensorFlow, called TensorFlow Quantum. TensorFlow Quantum integrates with Cirq, which provides quantum computing algorithms, and the two work well together for tasks involving quantum machine learning.
MNIST Classification with Hybrid Quantum-Classical Neural Network
Published:
Qiskit is IBM's open-source quantum computing framework, which provides users access to both simulators and real quantum computers. Today, available quantum computers are still in the Noisy Intermediate-Scale Quantum (NISQ) era and are very sensitive to any form of interference. Unlike real quantum computers, the simulators provided by Qiskit aren't noisy and are great for prototyping.
Color Restoration with Generative Adversarial Network
Published:
Fast.ai has a two-part Deep Learning Course, the first being Practical Deep Learning for Coders, and the second being Deep Learning from the Foundations, both having different approaches and intended for different audiences. In the 7th lecture of Part 1, Jeremy Howard taught a lot about modern architectures such as Residual Network (ResNet), U-Net, and Generative Adversarial Network (GAN).
Handwritten Javanese Script Classification
Published:
Aksara Jawa, or the Javanese script, is the core of writing the Javanese language and has influenced various other regional languages such as Sundanese, Madurese, etc. The script is now rarely used on a daily basis, but is sometimes taught in local schools in certain provinces of Indonesia.
Hash Tables, Collisions, and Separate Chaining
Published:
According to Wikipedia, a hash table, sometimes called a hash map, is a data structure that implements an associative array abstract data type, a structure that can map keys to values.
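The separate chaining named in the title can be sketched minimally: each bucket holds a list of key-value pairs, so colliding keys simply chain onto the same bucket. This is an illustrative toy in Python, not the post's actual implementation:

```python
class ChainedHashTable:
    """Toy hash table resolving collisions via separate chaining:
    each bucket is a list of (key, value) pairs."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key exists: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: chain onto the bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable(size=2)  # tiny table to force collisions
t.put("a", 1); t.put("b", 2); t.put("a", 3)
print(t.get("a"))  # 3
```

With only two buckets, several keys are guaranteed to land in the same bucket, so lookups walk the chain rather than failing on collision.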
Doubly Linked List in C
Published:
After learning how to implement Singly Linked List, we're going to implement Doubly Linked List, which is similar to Singly Linked List, but with the addition of a prev pointer which points to the node before it.
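The post implements this in C; a minimal Python sketch of the same idea, wiring the prev pointer on append, might look like:

```python
class Node:
    """Doubly linked list node: next points forward, prev points
    to the node before it."""
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None

def append(head, data):
    """Append a node at the tail, wiring both next and prev pointers."""
    node = Node(data)
    if head is None:
        return node
    tail = head
    while tail.next:
        tail = tail.next
    tail.next = node
    node.prev = tail
    return head

head = None
for x in (1, 2, 3):
    head = append(head, x)
print(head.next.prev.data)  # 1: following next then prev returns to head
```

The prev pointer is what allows backward traversal, which a singly linked list cannot do without re-walking the list from the head.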
Discrete and Continuous Optimization Algorithms
Published:
Optimization is a key process in machine learning, from which we can approach inference and learning. It allows us to decouple the mathematical specification of what we want to compute from the algorithms for how to compute it.
Singly Linked List in C
Published:
According to Wikipedia, a linked list is a linear collection of data elements, whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes which together represent a sequence.
Automatic Differentiation
Published:
Automatic Differentiation (AD) is a vital process in Deep Learning. Many of deep learning's techniques, like backpropagation, rely heavily on AD. There are multiple ways to implement AD, one of which is utilizing Dual Numbers.
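A minimal illustrative sketch of forward-mode AD with dual numbers (the linked post's own implementation may differ): a dual number a + b·ε with ε² = 0 carries the derivative in its ε coefficient through ordinary arithmetic.

```python
class Dual:
    """Dual number val + dot*eps with eps**2 == 0; the eps coefficient
    carries the derivative through arithmetic (forward-mode AD)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (a + a'e)(b + b'e) = ab + (a'b + ab')e
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def f(x):
    return x * x + x * 3  # f(x) = x^2 + 3x, so f'(x) = 2x + 3

x = Dual(2.0, 1.0)  # seed dot=1 to differentiate with respect to x
y = f(x)
print(y.val, y.dot)  # 10.0 7.0, matching f(2) = 10 and f'(2) = 7
```

No symbolic manipulation happens: evaluating f once on a dual input yields both the value and the exact derivative.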
Portfolio
Portfolio item number 1
Short description of portfolio item number 1
Portfolio item number 2
Short description of portfolio item number 2
Publications
Causal and Masked Language Modeling of Javanese Language using Transformer-based Architectures
Published in 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2021
Most natural language understanding breakthroughs occur in popularly spoken languages, while low-resource languages are rarely examined. We pre-trained as well as compared different Transformer-based architectures on the Javanese language. They were trained on causal and masked language modeling tasks, with Javanese Wikipedia documents as corpus, and could then be fine-tuned to downstream natural language understanding tasks. To speed up pre-training, we transferred English word-embeddings, utilized gradual unfreezing of layers, and applied discriminative fine-tuning. We further fine-tuned our models to classify binary movie reviews and find that they were on par with multilingual/cross-lingual Transformers. We release our pre-trained models for others to use, in hopes of encouraging other researchers to work on low-resource languages like Javanese.
Recommended citation: W. Wongso, D. S. Setiawan and D. Suhartono, "Causal and Masked Language Modeling of Javanese Language using Transformer-based Architectures," 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Depok, Indonesia, 2021, pp. 1-7, doi: 10.1109/ICACSIS53237.2021.9631331.
Download Paper
Pre-trained transformer-based language models for Sundanese
Published in Journal of Big Data, 2022
The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.
Recommended citation: Wongso, W., Lucky, H. & Suhartono, D. "Pre-trained transformer-based language models for Sundanese." J Big Data 9, 39 (2022). https://doi.org/10.1186/s40537-022-00590-7
Download Paper
Many-to-Many Multilingual Translation Model for Languages of Indonesia
Published in IEEE Access, 2023
Indonesia is home to over 700 languages and most people speak their respective regional languages aside from the lingua franca. In this paper, we focus on the task of multilingual machine translation for 45 regional Indonesian languages and introduced Indo-T5 which leveraged the mT5 sequence-to-sequence language model as a baseline. Performances of bilingual and multilingual fine-tuning methods were also compared, in which we found that our models have outperformed current state-of-the-art translation models. We also investigate the use of religious texts from the Bible as an intermediate mid-resource translation domain for low-resource translation domain specialization. Our findings suggest that this two-step fine-tuning approach is highly effective in improving the quality of translations for low-resource text domains. Our results show an increase in SacreBLEU scores when evaluated on the low-resource NusaX dataset. We release our translation models for other researchers to leverage.
Recommended citation: Wongso, W., Joyoadikusumo, A., Buana, B. S., & Suhartono, D. (2023). Many-to-Many Multilingual Translation Model for Languages of Indonesia. IEEE Access.
Download Paper
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
Published in arXiv, 2024
Indonesia’s linguistic landscape is remarkably diverse, encompassing over 700 languages and dialects, making it one of the world’s most linguistically rich nations. This diversity, coupled with the widespread practice of code-switching and the presence of low-resource regional languages, presents unique challenges for modern pre-trained language models. In response to these challenges, we developed NusaBERT, building upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects. Through rigorous evaluation across a range of benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks involving multiple languages of Indonesia, paving the way for future natural language understanding research for under-represented languages.
Recommended citation: Wongso, W., Setiawan, D. S., Limcorn, S., & Joyoadikusumo, A. (2024). NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural. arXiv [Cs.CL]. Retrieved from https://arxiv.org/abs/2403.01817
Download Paper
IdSarcasm: Benchmarking and Evaluating Language Models for Indonesian Sarcasm Detection
Published in IEEE Access, 2024
Sarcasm detection in the Indonesian language poses a unique set of challenges due to the linguistic nuances and cultural specificities of the Indonesian social media landscape. Understanding the dynamics of sarcasm in this context requires a deep dive into language patterns and the socio-cultural background that shapes the use of sarcasm as a form of criticism and expression. In this study, we developed the first publicly available Indonesian sarcasm detection benchmark datasets from social media texts. We extensively investigated the results of classical machine learning algorithms, pre-trained language models, and recent large language models (LLMs). Our findings show that fine-tuning pre-trained language models is still superior to other techniques, achieving F1 scores of 62.74% and 76.92% on the Reddit and Twitter subsets respectively. Further, we show that recent LLMs fail to perform zero-shot classification for sarcasm detection and that tackling data imbalance requires a more sophisticated data augmentation approach than our basic methods.
Recommended citation: Suhartono, D., Wongso, W., & Handoyo, A. T. (2024). IdSarcasm: Benchmarking and Evaluating Language Models for Indonesian Sarcasm Detection. IEEE Access.
Download Paper
GenUP: Generative User Profilers as In-Context Learners for Next POI Recommender Systems
Published in arXiv, 2024
Traditional POI recommendation systems often lack transparency, interpretability, and scrutability due to their reliance on dense vector-based user embeddings. Furthermore, the cold-start problem – where systems have insufficient data for new users – limits their ability to generate accurate recommendations. Existing methods often address this by leveraging similar trajectories from other users, but this approach can be computationally expensive and increases the context length for LLM-based methods, making them difficult to scale. To address these limitations, we propose a method that generates natural language (NL) user profiles from large-scale, location-based social network (LBSN) check-ins, utilizing robust personality assessments and behavioral theories. These NL profiles capture user preferences, routines, and behaviors, improving POI prediction accuracy while offering enhanced transparency. By incorporating NL profiles as system prompts to LLMs, our approach reduces reliance on extensive historical data, while remaining flexible, easily updated, and computationally efficient. Our method is not only competitive with other LLM-based and complex agentic frameworks but is also more scalable for real-world scenarios and on-device POI recommendations. Results demonstrate that our approach consistently outperforms baseline methods, offering a more interpretable and resource-efficient solution for POI recommendation systems.
Recommended citation: Wongso, W., Xue, H., & Salim, F. D. (2024). GenUP: Generative User Profilers as In-Context Learners for Next POI Recommender Systems. arXiv preprint arXiv:2410.20643.
Download Paper
Talks
Talk 1 on Relevant Topic in Your Field
Published:
This is a description of your talk, which is a markdown file that can be all markdown-ified like any other post. Yay markdown!
Conference Proceeding talk 3 on Relevant Topic in Your Field
Published:
This is a description of your conference proceedings talk, note the different field in type. You can put anything in this field.
Teaching
Teaching experience 1
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post.
Teaching experience 2
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post.