
Conference Paper/Proceeding/Abstract

Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus

Deshan Sumanathilaka, Nicholas Micallef, Ruvan Weerasinghe

2024 4th International Conference on Advanced Research in Computing (ICARC)

Swansea University Authors: Deshan Sumanathilaka, Nicholas Micallef

  • Swa-Bhasha D updated.pdf

    PDF | Accepted Manuscript

    Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).

    Download (404.47KB)

DOI (Published version): 10.1109/icarc61713.2024.10499771

Abstract

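In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing to the widespread use of informal shorthand-based typing and internet acronyms for quicker communication. However, due to the limited availability of resources, linguistic support for these languages is limited, making them low-resource languages. To address this resource deficit, this study proposes the development of a rule-based transliteration tool that can annotate Sinhala words into Romanized Sinhala, accommodating the diverse ad hoc typing patterns used by the community. The research approach involved a comprehensive survey employing a stratified sampling method, considering variables such as age, gender, and language proficiency. 215 participants were presented with an online survey comprising 12 Sinhala sentences to capture various transliteration patterns related to Sinhala characters which are necessary for the annotation process. Analysis of the survey responses led to the formulation of 92 general rules and 26 special rules, encapsulating ad hoc Romanized Sinhala typing patterns. Using these rules, Sinhala dictionaries were annotated, building a large corpus of data which consists of Sinhala and its Romanized Sinhala patterns. The annotated dataset was validated using a back transliteration tool, achieving an 84% word accuracy rate. This innovative transliteration annotator can be used to mitigate the resource constraints associated with Sinhala to Romanized Sinhala transliteration. GitHub link: https://github.com/Sumanathilaka/Swa-Bhasha-Sinhala-Singlish-Dataset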
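The record's abstract describes annotating Sinhala words with ad hoc Romanized Sinhala spellings through a rule table, then validating the result by word accuracy. The sketch below illustrates that general idea only. The rule table, the function names `segment`, `romanized_variants`, and `word_accuracy`, the greedy longest-match segmentation, and the exact-match accuracy definition are all assumptions made for illustration; they are not the 92 general and 26 special rules from the paper, nor its back transliteration tool.

```python
from itertools import product

# Hypothetical rule table (NOT the paper's rules): each Sinhala unit maps to
# the Romanized spellings a typist might plausibly use for it.
RULES = {
    "\u0d85": ["a"],                # අ  (independent vowel a)
    "\u0db8\u0dca": ["m"],          # ම් (consonant ma + hal kirima, no vowel)
    "\u0db8\u0dcf": ["ma", "maa"],  # මා (ma + long-a vowel sign)
    "\u0db8": ["ma"],               # ම  (ma with inherent vowel)
}


def segment(word, rules=RULES):
    """Split a word into rule units using greedy longest match."""
    longest = max(len(key) for key in rules)
    units, i = [], 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                units.append(chunk)
                i += size
                break
        else:
            raise ValueError(f"no rule covers position {i} of {word!r}")
    return units


def romanized_variants(word, rules=RULES):
    """Return every Romanized spelling the rule table allows for a word."""
    options = [rules[unit] for unit in segment(word, rules)]
    return ["".join(parts) for parts in product(*options)]


def word_accuracy(predicted, reference):
    """Fraction of words that match the reference exactly, position by position."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# අම්මා ("mother") yields one spelling per combination of rule alternatives.
print(romanized_variants("\u0d85\u0db8\u0dca\u0db8\u0dcf"))  # ['amma', 'ammaa']
```

Keeping the alternatives as lists per unit means one Sinhala word expands into several Romanized patterns, which matches the abstract's description of a corpus pairing Sinhala words with their Romanized Sinhala patterns. A real rule set would also need context-sensitive handling that this sketch omits.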

Published in: 2024 4th International Conference on Advanced Research in Computing (ICARC)
ISBN: 979-8-3503-8487-1 979-8-3503-8486-4
Published: Belihuloya, Sri Lanka IEEE 2024
URI: https://cronfa.swan.ac.uk/Record/cronfa65621
first_indexed 2024-02-09T09:37:09Z
last_indexed 2024-02-09T09:37:09Z
id cronfa65621
recordtype SURis
fullrecord <?xml version="1.0" encoding="utf-8"?><rfc1807 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><bib-version>v2</bib-version><id>65621</id><entry>2024-02-09</entry><title>Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus</title><swanseaauthors><author><sid>2fe44f0c1e7d845dc21bb6b00d5b2085</sid><firstname>Deshan</firstname><surname>Sumanathilaka</surname><name>Deshan Sumanathilaka</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>1cc4c84582d665b7ee08fb16f5454671</sid><ORCID>0000-0002-2683-8042</ORCID><firstname>Nicholas</firstname><surname>Micallef</surname><name>Nicholas Micallef</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2024-02-09</date><deptcode>SCS</deptcode><abstract>In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing to the widespread use of informal shorthand-based typing and internet acronyms for quicker communication. However, due to the limited availability of resources, linguistic support for these languages is limited, making them low-resource languages. To address this resource deficit, this study proposes the development of a rule-based transliteration tool that can annotate Sinhala words into Romanized Sinhala, accommodating the diverse ad hoc typing patterns used by the community. The research approach involved a comprehensive survey employing a stratified sampling method, considering variables such as age, gender, and language proficiency. 
215 participants were presented with an online survey comprising 12 Sinhala sentences to capture various transliteration patterns related to Sinhala characters which are necessary for the annotation process. Analysis of the survey responses led to the formulation of 92 general rules and 26 special rules, encapsulating ad-hoc Romanized Sinhala typing patterns. Using these rules, Sinhala dictionaries were annotated, building a large corpus of data which consists of Sinhala and its Romanized Sinhala patterns. The annotated dataset was validated using a back transliteration tool, achieving an 84%-word accuracy rate. This innovative transliteration annotator can be used to mitigate the resource constraints associated with Sinhala to Romanized Sinhala transliteration. GitHub link:https://github.com/Sumanathilaka/Swa-Bhasha-Sinhala-Singlish-Dataset</abstract><type>Conference Paper/Proceeding/Abstract</type><journal>2024 4th International Conference on Advanced Research in Computing (ICARC)</journal><volume>0</volume><journalNumber/><paginationStart/><paginationEnd/><publisher>IEEE</publisher><placeOfPublication>Belihuloya, Sri Lanka</placeOfPublication><isbnPrint>979-8-3503-8487-1</isbnPrint><isbnElectronic>979-8-3503-8486-4</isbnElectronic><issnPrint/><issnElectronic/><keywords>Annotation, Dataset Creation, Romanized Sinhala, Transliteration, survey</keywords><publishedDay>22</publishedDay><publishedMonth>4</publishedMonth><publishedYear>2024</publishedYear><publishedDate>2024-04-22</publishedDate><doi>10.1109/icarc61713.2024.10499771</doi><url/><notes/><college>COLLEGE NANME</college><department>Computer Science</department><CollegeCode>COLLEGE CODE</CollegeCode><DepartmentCode>SCS</DepartmentCode><institution>Swansea University</institution><apcterm>Not Required</apcterm><funders/><projectreference/><lastEdited>2024-04-30T14:03:57.2275648</lastEdited><Created>2024-02-09T09:24:31.0913577</Created><path><level id="1">Faculty of Science and Engineering</level><level 
id="2">School of Mathematics and Computer Science - Computer Science</level></path><authors><author><firstname>Deshan</firstname><surname>Sumanathilaka</surname><order>1</order></author><author><firstname>Nicholas</firstname><surname>Micallef</surname><orcid>0000-0002-2683-8042</orcid><order>2</order></author><author><firstname>Ruvan</firstname><surname>Weerasinghe</surname><order>3</order></author></authors><documents><document><filename>65621__29532__55fe07d44ea649178a984747f4c382c0.pdf</filename><originalFilename>Swa-Bhasha D updated.pdf</originalFilename><uploaded>2024-02-09T09:33:28.5988311</uploaded><type>Output</type><contentLength>414177</contentLength><contentType>application/pdf</contentType><version>Accepted Manuscript</version><cronfaStatus>true</cronfaStatus><documentNotes>Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).</documentNotes><copyrightCorrect>true</copyrightCorrect><language>eng</language><licence>https://creativecommons.org/licenses/by/4.0/deed.en</licence></document></documents><OutputDurs/></rfc1807>
title Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus
author_id_str_mv 2fe44f0c1e7d845dc21bb6b00d5b2085
1cc4c84582d665b7ee08fb16f5454671
author_id_fullname_str_mv 2fe44f0c1e7d845dc21bb6b00d5b2085_***_Deshan Sumanathilaka
1cc4c84582d665b7ee08fb16f5454671_***_Nicholas Micallef
author Deshan Sumanathilaka
Nicholas Micallef
author2 Deshan Sumanathilaka
Nicholas Micallef
Ruvan Weerasinghe
format Conference Paper/Proceeding/Abstract
container_title 2024 4th International Conference on Advanced Research in Computing (ICARC)
container_volume 0
publishDate 2024
institution Swansea University
isbn 979-8-3503-8487-1
979-8-3503-8486-4
doi_str_mv 10.1109/icarc61713.2024.10499771
publisher IEEE
college_str Faculty of Science and Engineering
hierarchytype
hierarchy_top_id facultyofscienceandengineering
hierarchy_top_title Faculty of Science and Engineering
hierarchy_parent_id facultyofscienceandengineering
hierarchy_parent_title Faculty of Science and Engineering
department_str School of Mathematics and Computer Science - Computer Science; Faculty of Science and Engineering
document_store_str 1
active_str 0
description In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing to the widespread use of informal shorthand-based typing and internet acronyms for quicker communication. However, due to the limited availability of resources, linguistic support for these languages is limited, making them low-resource languages. To address this resource deficit, this study proposes the development of a rule-based transliteration tool that can annotate Sinhala words into Romanized Sinhala, accommodating the diverse ad hoc typing patterns used by the community. The research approach involved a comprehensive survey employing a stratified sampling method, considering variables such as age, gender, and language proficiency. 215 participants were presented with an online survey comprising 12 Sinhala sentences to capture various transliteration patterns related to Sinhala characters which are necessary for the annotation process. Analysis of the survey responses led to the formulation of 92 general rules and 26 special rules, encapsulating ad hoc Romanized Sinhala typing patterns. Using these rules, Sinhala dictionaries were annotated, building a large corpus of data which consists of Sinhala and its Romanized Sinhala patterns. The annotated dataset was validated using a back transliteration tool, achieving an 84% word accuracy rate. This innovative transliteration annotator can be used to mitigate the resource constraints associated with Sinhala to Romanized Sinhala transliteration. GitHub link: https://github.com/Sumanathilaka/Swa-Bhasha-Sinhala-Singlish-Dataset
published_date 2024-04-22T14:03:56Z