Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Study Notes
Works in the Wild
Posts
Predicting Phonemes with BERT
Published:
Our team at Bookbot is currently developing a grapheme-to-phoneme Python package for Bahasa Indonesia. The package is highly inspired by its English counterpart, g2p. A lot of our design and methods are borrowed from that library, most notably the steps to predict phonemes. The English g2p uses the following algorithm (cf. g2p's README):
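The algorithm itself is elided from this excerpt; as a rough, illustrative sketch of the dictionary-lookup-with-model-fallback approach such g2p libraries typically take (the lexicon entries and function names below are hypothetical, not the package's actual API):

```python
# Hypothetical sketch of a typical g2p pipeline: check a pronunciation
# dictionary first, fall back to a trained model for out-of-vocabulary words.
LEXICON = {"makan": "m a k a n", "buku": "b u k u"}  # toy lexicon

def predict_with_model(word: str) -> str:
    # Placeholder for a seq2seq/neural fallback; here, naive letter-to-sound.
    return " ".join(word)

def g2p(word: str) -> str:
    word = word.lower()
    if word in LEXICON:                  # 1. dictionary lookup
        return LEXICON[word]
    return predict_with_model(word)      # 2. model fallback for OOV words

print(g2p("makan"))  # m a k a n
print(g2p("nasi"))   # falls back to the model: n a s i
```

The key design point is that the expensive model only runs for words missing from the lexicon.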
My HuggingFace JAX Community Week Experience
Published:
On June 23, the HuggingFace team announced that they were planning to host a community week together with people from the Google Cloud team. The main gist of this event was getting everyone to learn and use HuggingFace's newly integrated JAX framework. But aside from just learning from tutorials, we were equipped with blazing fast TPUs thanks to the amazing Google Cloud team 🤯.
Pneumonia Chest X-Ray Classification
Published:
The dataset used for this task is from a Kaggle dataset by Paul Mooney. It consists of two kinds of chest x-rays: those infected by pneumonia, and normal ones. Our main goal is to distinguish the pneumonia-infected x-rays from the normal ones. Note that the dataset is highly imbalanced, as many medical image datasets are.
Text Generation using minGPT and fast.ai
Published:
Andrej Karpathy, Tesla's AI Director, released minGPT, a mini version of OpenAI's GPT. Normally a GPT would have billions of parameters and would take hours to train. Karpathy's approach is to provide a smaller version of GPT, hence the name minGPT.
MNIST Classification with Quantum Neural Network
Published:
TensorFlow is one of the most used deep learning frameworks today, bundled with many features for end-to-end deep learning processes. Recently, they announced a new library on top of TensorFlow, called TensorFlow Quantum. TensorFlow Quantum integrates with Cirq, which provides quantum computing algorithms, and the two work well together for tasks involving quantum machine learning.
MNIST Classification with Hybrid Quantum-Classical Neural Network
Published:
Qiskit is IBM's open-source quantum computing framework, which provides users access to both simulators and real quantum computers. Today, available quantum computers are still in the Noisy Intermediate-Scale Quantum (NISQ) era and are very sensitive to any form of interference. Unlike real quantum computers, the simulators provided by Qiskit aren't noisy and are great for prototyping.
Color Restoration with Generative Adversarial Network
Published:
Fast.ai has a two-part Deep Learning Course, the first being Practical Deep Learning for Coders, and the second being Deep Learning from the Foundations, both having different approaches and intended for different audiences. In the 7th lecture of Part 1, Jeremy Howard taught a lot about modern architectures such as Residual Network (ResNet), U-Net, and Generative Adversarial Network (GAN).
Handwritten Javanese Script Classification
Published:
Aksara Jawa, or the Javanese script, is the core of writing the Javanese language and has influenced various other regional languages such as Sundanese, Madurese, etc. The script is now rarely used on a daily basis, but is sometimes taught in local schools in certain provinces of Indonesia.
Hash Tables, Collisions, and Separate Chaining
Published:
According to Wikipedia, a hash table, sometimes called a hash map, is a data structure that implements an associative array abstract data type, a structure that can map keys to values.
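The separate chaining named in the title can be sketched minimally: each bucket holds a list of key-value pairs, so colliding keys simply chain onto the same bucket. This is an illustrative toy in Python, not the post's actual implementation:

```python
class ChainedHashTable:
    """Toy hash table resolving collisions via separate chaining:
    each bucket is a list of (key, value) pairs."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key exists: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: chain onto the bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable(size=2)  # tiny table to force collisions
t.put("a", 1); t.put("b", 2); t.put("a", 3)
print(t.get("a"))  # 3
```

With only two buckets, several keys are guaranteed to land in the same bucket, so lookups walk the chain rather than failing on collision.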
Doubly Linked List in C
Published:
After learning how to implement Singly Linked List, we're going to implement Doubly Linked List, which is similar to Singly Linked List, but with the addition of a prev pointer which points to the node before it.
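The post implements this in C; a minimal Python sketch of the same idea, wiring the prev pointer on append, might look like:

```python
class Node:
    """Doubly linked list node: next points forward, prev points
    to the node before it."""
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None

def append(head, data):
    """Append a node at the tail, wiring both next and prev pointers."""
    node = Node(data)
    if head is None:
        return node
    tail = head
    while tail.next:
        tail = tail.next
    tail.next = node
    node.prev = tail
    return head

head = None
for x in (1, 2, 3):
    head = append(head, x)
print(head.next.prev.data)  # 1: following next then prev returns to head
```

The prev pointer is what allows backward traversal, which a singly linked list cannot do without re-walking the list from the head.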
Discrete and Continuous Optimization Algorithms
Published:
Optimization is a key process in machine learning, from which we can approach inference and learning. It allows us to decouple the mathematical specification of what we want to compute from the algorithms for how to compute it.
Singly Linked List in C
Published:
According to Wikipedia, a linked list is a linear collection of data elements, whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes which together represent a sequence.
Automatic Differentiation
Published:
Automatic Differentiation (AD) is a vital process in Deep Learning. Many of deep learning's techniques, like backpropagation, rely heavily on AD. There are multiple ways to implement AD, one of which is utilizing Dual Numbers.
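A minimal illustrative sketch of forward-mode AD with dual numbers (the linked post's own implementation may differ): a dual number a + b·ε with ε² = 0 carries the derivative in its ε coefficient through ordinary arithmetic.

```python
class Dual:
    """Dual number val + dot*eps with eps**2 == 0; the eps coefficient
    carries the derivative through arithmetic (forward-mode AD)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (a + a'e)(b + b'e) = ab + (a'b + ab')e
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def f(x):
    return x * x + x * 3  # f(x) = x^2 + 3x, so f'(x) = 2x + 3

x = Dual(2.0, 1.0)  # seed dot=1 to differentiate with respect to x
y = f(x)
print(y.val, y.dot)  # 10.0 7.0, matching f(2) = 10 and f'(2) = 7
```

No symbolic manipulation happens: evaluating f once on a dual input yields both the value and the exact derivative.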
Portfolio
Portfolio item number 1
Short description of portfolio item number 1
Portfolio item number 2
Short description of portfolio item number 2
Publications
Causal and Masked Language Modeling of Javanese Language using Transformer-based Architectures
Published in 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2021
Most natural language understanding breakthroughs occur in popularly spoken languages, while low-resource languages are rarely examined. We pre-trained as well as compared different Transformer-based architectures on the Javanese language. They were trained on causal and masked language modeling tasks, with Javanese Wikipedia documents as corpus, and could then be fine-tuned to downstream natural language understanding tasks. To speed up pre-training, we transferred English word-embeddings, utilized gradual unfreezing of layers, and applied discriminative fine-tuning. We further fine-tuned our models to classify binary movie reviews and find that they were on par with multilingual/cross-lingual Transformers. We release our pre-trained models for others to use, in hopes of encouraging other researchers to work on low-resource languages like Javanese.
Recommended citation: W. Wongso, D. S. Setiawan and D. Suhartono, "Causal and Masked Language Modeling of Javanese Language using Transformer-based Architectures," 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Depok, Indonesia, 2021, pp. 1-7, doi: 10.1109/ICACSIS53237.2021.9631331.
Download Paper
Pre-trained transformer-based language models for Sundanese
Published in Journal of Big Data, 2022
The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.
Recommended citation: Wongso, W., Lucky, H. & Suhartono, D. "Pre-trained transformer-based language models for Sundanese." J Big Data 9, 39 (2022). https://doi.org/10.1186/s40537-022-00590-7
Download Paper
Many-to-Many Multilingual Translation Model for Languages of Indonesia
Published in IEEE Access, 2023
Indonesia is home to over 700 languages and most people speak their respective regional languages aside from the lingua franca. In this paper, we focus on the task of multilingual machine translation for 45 regional Indonesian languages and introduced Indo-T5 which leveraged the mT5 sequence-to-sequence language model as a baseline. Performances of bilingual and multilingual fine-tuning methods were also compared, in which we found that our models have outperformed current state-of-the-art translation models. We also investigate the use of religious texts from the Bible as an intermediate mid-resource translation domain for low-resource translation domain specialization. Our findings suggest that this two-step fine-tuning approach is highly effective in improving the quality of translations for low-resource text domains. Our results show an increase in SacreBLEU scores when evaluated on the low-resource NusaX dataset. We release our translation models for other researchers to leverage.
Recommended citation: Wongso, W., Joyoadikusumo, A., Buana, B. S., & Suhartono, D. (2023). Many-to-Many Multilingual Translation Model for Languages of Indonesia. IEEE Access.
Download Paper
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
Published in arXiv, 2024
Indonesia’s linguistic landscape is remarkably diverse, encompassing over 700 languages and dialects, making it one of the world’s most linguistically rich nations. This diversity, coupled with the widespread practice of code-switching and the presence of low-resource regional languages, presents unique challenges for modern pre-trained language models. In response to these challenges, we developed NusaBERT, building upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects. Through rigorous evaluation across a range of benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks involving multiple languages of Indonesia, paving the way for future natural language understanding research for under-represented languages.
Recommended citation: Wongso, W., Setiawan, D. S., Limcorn, S., & Joyoadikusumo, A. (2024). NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural. arXiv [Cs.CL]. Retrieved from https://arxiv.org/abs/2403.01817
Download Paper
IdSarcasm: Benchmarking and Evaluating Language Models for Indonesian Sarcasm Detection
Published in IEEE Access, 2024
Sarcasm detection in the Indonesian language poses a unique set of challenges due to the linguistic nuances and cultural specificities of the Indonesian social media landscape. Understanding the dynamics of sarcasm in this context requires a deep dive into language patterns and the socio-cultural background that shapes the use of sarcasm as a form of criticism and expression. In this study, we developed the first publicly available Indonesian sarcasm detection benchmark datasets from social media texts. We extensively investigated the results of classical machine learning algorithms, pre-trained language models, and recent large language models (LLMs). Our findings show that fine-tuning pre-trained language models is still superior to other techniques, achieving F1 scores of 62.74% and 76.92% on the Reddit and Twitter subsets respectively. Further, we show that recent LLMs fail to perform zero-shot classification for sarcasm detection and that tackling data imbalance requires a more sophisticated data augmentation approach than our basic methods.
Recommended citation: Suhartono, D., Wongso, W., & Handoyo, A. T. (2024). IdSarcasm: Benchmarking and Evaluating Language Models for Indonesian Sarcasm Detection. IEEE Access.
Download Paper
GenUP: Generative User Profilers as In-Context Learners for Next POI Recommender Systems
Published in arXiv, 2024
Traditional POI recommendation systems often lack transparency, interpretability, and scrutability due to their reliance on dense vector-based user embeddings. Furthermore, the cold-start problem – where systems have insufficient data for new users – limits their ability to generate accurate recommendations. Existing methods often address this by leveraging similar trajectories from other users, but this approach can be computationally expensive and increases the context length for LLM-based methods, making them difficult to scale. To address these limitations, we propose a method that generates natural language (NL) user profiles from large-scale, location-based social network (LBSN) check-ins, utilizing robust personality assessments and behavioral theories. These NL profiles capture user preferences, routines, and behaviors, improving POI prediction accuracy while offering enhanced transparency. By incorporating NL profiles as system prompts to LLMs, our approach reduces reliance on extensive historical data, while remaining flexible, easily updated, and computationally efficient. Our method is not only competitive with other LLM-based and complex agentic frameworks but is also more scalable for real-world scenarios and on-device POI recommendations. Results demonstrate that our approach consistently outperforms baseline methods, offering a more interpretable and resource-efficient solution for POI recommendation systems.
Recommended citation: Wongso, W., Xue, H., & Salim, F. D. (2024). GenUP: Generative User Profilers as In-Context Learners for Next POI Recommender Systems. arXiv preprint arXiv:2410.20643.
Download Paper
Talks
Talk 1 on Relevant Topic in Your Field
Published:
This is a description of your talk, which is a markdown file that can be all markdown-ified like any other post. Yay markdown!
Conference Proceeding talk 3 on Relevant Topic in Your Field
Published:
This is a description of your conference proceedings talk, note the different field in type. You can put anything in this field.
Teaching
Teaching experience 1
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post.
Teaching experience 2
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post.