Conference Paper/Proceeding/Abstract

Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus

Deshan Sumanathilaka, Nicholas Micallef, Ruvan Weerasinghe

2024 4th International Conference on Advanced Research in Computing (ICARC)

Swansea University Authors: Deshan Sumanathilaka, Nicholas Micallef

  • Swa-Bhasha D updated.pdf

    PDF | Accepted Manuscript

    Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).

DOI (Published version): 10.1109/icarc61713.2024.10499771

Published in: 2024 4th International Conference on Advanced Research in Computing (ICARC)
ISBN: 979-8-3503-8487-1, 979-8-3503-8486-4
Published: Belihuloya, Sri Lanka: IEEE, 2024
URI: https://cronfa.swan.ac.uk/Record/cronfa65621
Abstract: In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing to the widespread use of informal shorthand-based typing and internet acronyms for quicker communication. However, due to the limited availability of resources, linguistic support for these languages is limited, making them low-resource languages. To address this resource deficit, this study proposes the development of a rule-based transliteration tool that can annotate Sinhala words into Romanized Sinhala, accommodating the diverse ad hoc typing patterns used by the community. The research approach involved a comprehensive survey employing a stratified sampling method, considering variables such as age, gender, and language proficiency. 215 participants were presented with an online survey comprising 12 Sinhala sentences to capture various transliteration patterns related to Sinhala characters which are necessary for the annotation process. Analysis of the survey responses led to the formulation of 92 general rules and 26 special rules, encapsulating ad-hoc Romanized Sinhala typing patterns. Using these rules, Sinhala dictionaries were annotated, building a large corpus of data which consists of Sinhala words and their Romanized Sinhala patterns. The annotated dataset was validated using a back transliteration tool, achieving a word-level accuracy of 84%. This innovative transliteration annotator can be used to mitigate the resource constraints associated with Sinhala to Romanized Sinhala transliteration. GitHub link: https://github.com/Sumanathilaka/Swa-Bhasha-Sinhala-Singlish-Dataset
Keywords: Annotation, Dataset Creation, Romanized Sinhala, Transliteration, survey
College: Faculty of Science and Engineering
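
The abstract describes annotating each Sinhala word with several ad hoc Romanized spellings by applying rules, then checking the result with a word-level accuracy figure. The following is a minimal sketch of that general idea, not the authors' tool: the rule table `TOY_RULES`, the greedy longest-match segmentation, and the function names are all assumptions made for illustration, and the three toy rules are not taken from the paper's 92 general and 26 special rules.

```python
from itertools import product

# Hypothetical toy rules (NOT the paper's rule set): each Sinhala grapheme
# cluster maps to the Romanized spellings a typist might plausibly use.
TOY_RULES = {
    "අ": ["a"],
    "ම්": ["m"],
    "මා": ["ma", "maa"],
}


def segment(word, rules):
    """Split a word into rule keys using greedy longest-match."""
    longest = max(len(key) for key in rules)
    units, i = [], 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                units.append(chunk)
                i += size
                break
        else:
            # No rule key starts at this position.
            raise ValueError(f"no rule covers {word[i]!r} at index {i}")
    return units


def romanized_variants(word, rules=TOY_RULES):
    """Return every Romanized spelling obtainable from the rule table."""
    options = [rules[unit] for unit in segment(word, rules)]
    return ["".join(parts) for parts in product(*options)]


def word_accuracy(predicted, expected):
    """Fraction of positions where the predicted word equals the expected word."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


if __name__ == "__main__":
    print(romanized_variants("අම්මා"))  # ['amma', 'ammaa']
```

In a validation step like the one the abstract mentions, `predicted` would hold the Sinhala words recovered by some back transliteration tool and `expected` the original dictionary words; the tool itself is outside the scope of this sketch.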