Conference Paper/Proceeding/Abstract 312 views 84 downloads

Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus

Deshan Sumanathilaka, Nicholas Micallef, Ruvan Weerasinghe

2024 4th International Conference on Advanced Research in Computing (ICARC), Volume: 2024, Pages: 189 - 194

Swansea University Authors: Deshan Sumanathilaka, Nicholas Micallef

  • Swa-Bhasha D updated.pdf

    PDF | Accepted Manuscript

    Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).

    Download (404.47KB)

DOI (Published version): 10.1109/icarc61713.2024.10499771

Abstract

In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing...

Published in: 2024 4th International Conference on Advanced Research in Computing (ICARC)
ISBN: 979-8-3503-8487-1 (print), 979-8-3503-8486-4 (electronic)
Published: Belihuloya, Sri Lanka: IEEE, 2024
URI: https://cronfa.swan.ac.uk/Record/cronfa65621
first_indexed 2024-02-09T09:37:09Z
last_indexed 2024-02-09T09:37:09Z
id cronfa65621
recordtype SURis
fullrecord <?xml version="1.0" encoding="utf-8"?><rfc1807 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><bib-version>v2</bib-version><id>65621</id><entry>2024-02-09</entry><title>Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus</title><swanseaauthors><author><sid>2fe44f0c1e7d845dc21bb6b00d5b2085</sid><ORCID>0009-0005-8933-6559</ORCID><firstname>Deshan</firstname><surname>Sumanathilaka</surname><name>Deshan Sumanathilaka</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>1cc4c84582d665b7ee08fb16f5454671</sid><ORCID>0000-0002-2683-8042</ORCID><firstname>Nicholas</firstname><surname>Micallef</surname><name>Nicholas Micallef</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2024-02-09</date><deptcode>MACS</deptcode><abstract>In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing to the widespread use of informal shorthand-based typing and internet acronyms for quicker communication. However, due to the limited availability of resources, linguistic support for these languages is limited, making them low-resource languages. To address this resource deficit, this study proposes the development of a rule-based transliteration tool that can annotate Sinhala words into Romanized Sinhala, accommodating the diverse ad hoc typing patterns used by the community. The research approach involved a comprehensive survey employing a stratified sampling method, considering variables such as age, gender, and language proficiency. 215 participants were presented with an online survey comprising 12 Sinhala sentences to capture various transliteration patterns related to Sinhala characters which are necessary for the annotation process. Analysis of the survey responses led to the formulation of 92 general rules and 26 special rules, encapsulating ad hoc Romanized Sinhala typing patterns. Using these rules, Sinhala dictionaries were annotated, building a large corpus of data which consists of Sinhala and its Romanized Sinhala patterns. The annotated dataset was validated using a back transliteration tool, achieving an 84% word-level accuracy rate. This innovative transliteration annotator can be used to mitigate the resource constraints associated with Sinhala to Romanized Sinhala transliteration. GitHub link: https://github.com/Sumanathilaka/Swa-Bhasha-Sinhala-Singlish-Dataset</abstract><type>Conference Paper/Proceeding/Abstract</type><journal>2024 4th International Conference on Advanced Research in Computing (ICARC)</journal><volume>2024</volume><journalNumber/><paginationStart>189</paginationStart><paginationEnd>194</paginationEnd><publisher>IEEE</publisher><placeOfPublication>Belihuloya, Sri Lanka</placeOfPublication><isbnPrint>979-8-3503-8487-1</isbnPrint><isbnElectronic>979-8-3503-8486-4</isbnElectronic><issnPrint/><issnElectronic/><keywords>Surveys, dictionaries, social networking (online), annotations, buildings, instant messaging, linguistics, annotation, dataset creation, Romanized Sinhala, transliteration, survey</keywords><publishedDay>22</publishedDay><publishedMonth>4</publishedMonth><publishedYear>2024</publishedYear><publishedDate>2024-04-22</publishedDate><doi>10.1109/icarc61713.2024.10499771</doi><url/><notes/><college>COLLEGE NAME</college><department>Mathematics and Computer Science School</department><CollegeCode>COLLEGE CODE</CollegeCode><DepartmentCode>MACS</DepartmentCode><institution>Swansea University</institution><apcterm>Not Required</apcterm><funders/><projectreference/><lastEdited>2024-10-15T10:48:52.9120920</lastEdited><Created>2024-02-09T09:24:31.0913577</Created><path><level id="1">Faculty of Science and Engineering</level><level id="2">School of Mathematics and Computer Science - Computer Science</level></path><authors><author><firstname>Deshan</firstname><surname>Sumanathilaka</surname><orcid>0009-0005-8933-6559</orcid><order>1</order></author><author><firstname>Nicholas</firstname><surname>Micallef</surname><orcid>0000-0002-2683-8042</orcid><order>2</order></author><author><firstname>Ruvan</firstname><surname>Weerasinghe</surname><order>3</order></author></authors><documents><document><filename>65621__29532__55fe07d44ea649178a984747f4c382c0.pdf</filename><originalFilename>Swa-Bhasha D updated.pdf</originalFilename><uploaded>2024-02-09T09:33:28.5988311</uploaded><type>Output</type><contentLength>414177</contentLength><contentType>application/pdf</contentType><version>Accepted Manuscript</version><cronfaStatus>true</cronfaStatus><documentNotes>Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).</documentNotes><copyrightCorrect>true</copyrightCorrect><language>eng</language><licence>https://creativecommons.org/licenses/by/4.0/deed.en</licence></document></documents><OutputDurs/></rfc1807>
title Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus
author_id_str_mv 2fe44f0c1e7d845dc21bb6b00d5b2085
1cc4c84582d665b7ee08fb16f5454671
author_id_fullname_str_mv 2fe44f0c1e7d845dc21bb6b00d5b2085_***_Deshan Sumanathilaka
1cc4c84582d665b7ee08fb16f5454671_***_Nicholas Micallef
author Deshan Sumanathilaka
Nicholas Micallef
author2 Deshan Sumanathilaka
Nicholas Micallef
Ruvan Weerasinghe
format Conference Paper/Proceeding/Abstract
container_title 2024 4th International Conference on Advanced Research in Computing (ICARC)
container_volume 2024
container_start_page 189
publishDate 2024
institution Swansea University
isbn 979-8-3503-8487-1
979-8-3503-8486-4
doi_str_mv 10.1109/icarc61713.2024.10499771
publisher IEEE
college_str Faculty of Science and Engineering
hierarchytype
hierarchy_top_id facultyofscienceandengineering
hierarchy_top_title Faculty of Science and Engineering
hierarchy_parent_id facultyofscienceandengineering
hierarchy_parent_title Faculty of Science and Engineering
department_str School of Mathematics and Computer Science - Computer Science{{{_:::_}}}Faculty of Science and Engineering{{{_:::_}}}School of Mathematics and Computer Science - Computer Science
document_store_str 1
active_str 0
description In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing to the widespread use of informal shorthand-based typing and internet acronyms for quicker communication. However, due to the limited availability of resources, linguistic support for these languages is limited, making them low-resource languages. To address this resource deficit, this study proposes the development of a rule-based transliteration tool that can annotate Sinhala words into Romanized Sinhala, accommodating the diverse ad hoc typing patterns used by the community. The research approach involved a comprehensive survey employing a stratified sampling method, considering variables such as age, gender, and language proficiency. 215 participants were presented with an online survey comprising 12 Sinhala sentences to capture various transliteration patterns related to Sinhala characters which are necessary for the annotation process. Analysis of the survey responses led to the formulation of 92 general rules and 26 special rules, encapsulating ad hoc Romanized Sinhala typing patterns. Using these rules, Sinhala dictionaries were annotated, building a large corpus of data which consists of Sinhala and its Romanized Sinhala patterns. The annotated dataset was validated using a back transliteration tool, achieving an 84% word-level accuracy rate. This innovative transliteration annotator can be used to mitigate the resource constraints associated with Sinhala to Romanized Sinhala transliteration. GitHub link: https://github.com/Sumanathilaka/Swa-Bhasha-Sinhala-Singlish-Dataset
published_date 2024-04-22T10:48:51Z
_version_ 1812972942578941952
score 11.035634
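
The abstract reports that the annotated dataset was validated with a back transliteration tool and scored by word accuracy. The record does not say how that score was computed, so the following is only a minimal sketch, assuming word accuracy means the share of Romanized entries that are converted back to exactly the original Sinhala word. The names `word_accuracy`, `toy_back_transliterate`, and `TOY_LOOKUP`, and the four sample pairs, are hypothetical illustrations and are not taken from the paper or its dataset.

```python
from typing import Callable, Iterable, Tuple

# Illustrative Sinhala words (assumed examples, not drawn from the dataset).
AMMA = "අම්මා"    # "mother"
GEDARA = "ගෙදර"  # "home"
MAMA = "මම"      # "I"

# Hypothetical stand-in for a back-transliteration tool: a plain lookup table.
TOY_LOOKUP = {"amma": AMMA, "gedara": GEDARA, "mama": MAMA}


def toy_back_transliterate(romanized: str) -> str:
    """Return the Sinhala word for a Romanized form, or "" if unknown."""
    return TOY_LOOKUP.get(romanized.lower(), "")


def word_accuracy(
    pairs: Iterable[Tuple[str, str]],
    back_transliterate: Callable[[str], str],
) -> float:
    """Fraction of (sinhala, romanized) pairs recovered exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(
        1 for sinhala, romanized in pairs
        if back_transliterate(romanized) == sinhala
    )
    return correct / len(pairs)


# Each pair is (original Sinhala word, one ad hoc Romanized spelling).
sample_pairs = [
    (AMMA, "amma"),
    (AMMA, "ammaa"),    # variant missing from the toy lookup, so it counts as an error
    (GEDARA, "gedara"),
    (MAMA, "mama"),
]

accuracy = word_accuracy(sample_pairs, toy_back_transliterate)
print(f"word accuracy: {accuracy:.0%}")
```

Passing the back-transliteration step in as a function keeps the scoring logic independent of whichever tool is actually used, so the toy lookup can be swapped for a real one without touching `word_accuracy`.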