Conference Paper/Proceeding/Abstract
Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus
2024 4th International Conference on Advanced Research in Computing (ICARC), Volume: 2024, Pages: 189 - 194
Swansea University Authors: Deshan Sumanathilaka, Nicholas Micallef
PDF | Accepted Manuscript
Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).
DOI (Published version): 10.1109/icarc61713.2024.10499771
Abstract
In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing...
Published in: | 2024 4th International Conference on Advanced Research in Computing (ICARC) |
---|---|
ISBN: | 979-8-3503-8487-1 979-8-3503-8486-4 |
Published: | Belihuloya, Sri Lanka: IEEE, 2024 |
URI: | https://cronfa.swan.ac.uk/Record/cronfa65621 |
first_indexed |
2024-02-09T09:37:09Z |
---|---|
last_indexed |
2024-02-09T09:37:09Z |
id |
cronfa65621 |
recordtype |
SURis |
fullrecord |
<?xml version="1.0" encoding="utf-8"?><rfc1807 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><bib-version>v2</bib-version><id>65621</id><entry>2024-02-09</entry><title>Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus</title><swanseaauthors><author><sid>2fe44f0c1e7d845dc21bb6b00d5b2085</sid><ORCID>0009-0005-8933-6559</ORCID><firstname>Deshan</firstname><surname>Sumanathilaka</surname><name>Deshan Sumanathilaka</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>1cc4c84582d665b7ee08fb16f5454671</sid><ORCID>0000-0002-2683-8042</ORCID><firstname>Nicholas</firstname><surname>Micallef</surname><name>Nicholas Micallef</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2024-02-09</date><deptcode>MACS</deptcode><abstract>In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing to the widespread use of informal shorthand-based typing and internet acronyms for quicker communication. However, due to the limited availability of resources, linguistic support for these languages is limited, making them low-resource languages. To address this resource deficit, this study proposes the development of a rule-based transliteration tool that can annotate Sinhala words into Romanized Sinhala, accommodating the diverse ad hoc typing patterns used by the community. The research approach involved a comprehensive survey employing a stratified sampling method, considering variables such as age, gender, and language proficiency. 
215 participants were presented with an online survey comprising 12 Sinhala sentences to capture various transliteration patterns related to Sinhala characters which are necessary for the annotation process. Analysis of the survey responses led to the formulation of 92 general rules and 26 special rules, encapsulating ad-hoc Romanized Sinhala typing patterns. Using these rules, Sinhala dictionaries were annotated, building a large corpus of data which consists of Sinhala and its Romanized Sinhala patterns. The annotated dataset was validated using a back transliteration tool, achieving an 84%-word accuracy rate. This innovative transliteration annotator can be used to mitigate the resource constraints associated with Sinhala to Romanized Sinhala transliteration. GitHub link:https://github.com/Sumanathilaka/Swa-Bhasha-Sinhala-Singlish-Dataset</abstract><type>Conference Paper/Proceeding/Abstract</type><journal>2024 4th International Conference on Advanced Research in Computing (ICARC)</journal><volume>2024</volume><journalNumber/><paginationStart>189</paginationStart><paginationEnd>194</paginationEnd><publisher>IEEE</publisher><placeOfPublication>Belihuloya, Sri Lanka</placeOfPublication><isbnPrint>979-8-3503-8487-1</isbnPrint><isbnElectronic>979-8-3503-8486-4</isbnElectronic><issnPrint/><issnElectronic/><keywords>Surveys, dictionaries, social networking (online), annotations, buildings, instant messaging, linguistics, annotation, dataset creation, Romanized Sinhala, transliteration, survey</keywords><publishedDay>22</publishedDay><publishedMonth>4</publishedMonth><publishedYear>2024</publishedYear><publishedDate>2024-04-22</publishedDate><doi>10.1109/icarc61713.2024.10499771</doi><url/><notes/><college>COLLEGE NANME</college><department>Mathematics and Computer Science School</department><CollegeCode>COLLEGE CODE</CollegeCode><DepartmentCode>MACS</DepartmentCode><institution>Swansea University</institution><apcterm>Not 
Required</apcterm><funders/><projectreference/><lastEdited>2024-10-15T10:48:52.9120920</lastEdited><Created>2024-02-09T09:24:31.0913577</Created><path><level id="1">Faculty of Science and Engineering</level><level id="2">School of Mathematics and Computer Science - Computer Science</level></path><authors><author><firstname>Deshan</firstname><surname>Sumanathilaka</surname><orcid>0009-0005-8933-6559</orcid><order>1</order></author><author><firstname>Nicholas</firstname><surname>Micallef</surname><orcid>0000-0002-2683-8042</orcid><order>2</order></author><author><firstname>Ruvan</firstname><surname>Weerasinghe</surname><order>3</order></author></authors><documents><document><filename>65621__29532__55fe07d44ea649178a984747f4c382c0.pdf</filename><originalFilename>Swa-Bhasha D updated.pdf</originalFilename><uploaded>2024-02-09T09:33:28.5988311</uploaded><type>Output</type><contentLength>414177</contentLength><contentType>application/pdf</contentType><version>Accepted Manuscript</version><cronfaStatus>true</cronfaStatus><documentNotes>Author accepted manuscript document released under the terms of a Creative Commons CC-BY licence using the Swansea University Research Publications Policy (rights retention).</documentNotes><copyrightCorrect>true</copyrightCorrect><language>eng</language><licence>https://creativecommons.org/licenses/by/4.0/deed.en</licence></document></documents><OutputDurs/></rfc1807> |
title |
Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus |
author_id_str_mv |
2fe44f0c1e7d845dc21bb6b00d5b2085 1cc4c84582d665b7ee08fb16f5454671 |
author_id_fullname_str_mv |
2fe44f0c1e7d845dc21bb6b00d5b2085_***_Deshan Sumanathilaka 1cc4c84582d665b7ee08fb16f5454671_***_Nicholas Micallef |
author |
Deshan Sumanathilaka Nicholas Micallef |
author2 |
Deshan Sumanathilaka Nicholas Micallef Ruvan Weerasinghe |
format |
Conference Paper/Proceeding/Abstract |
container_title |
2024 4th International Conference on Advanced Research in Computing (ICARC) |
container_volume |
2024 |
container_start_page |
189 |
publishDate |
2024 |
institution |
Swansea University |
isbn |
979-8-3503-8487-1 979-8-3503-8486-4 |
doi_str_mv |
10.1109/icarc61713.2024.10499771 |
publisher |
IEEE |
college_str |
Faculty of Science and Engineering |
hierarchytype |
|
hierarchy_top_id |
facultyofscienceandengineering |
hierarchy_top_title |
Faculty of Science and Engineering |
hierarchy_parent_id |
facultyofscienceandengineering |
hierarchy_parent_title |
Faculty of Science and Engineering |
department_str |
School of Mathematics and Computer Science - Computer Science{{{_:::_}}}Faculty of Science and Engineering{{{_:::_}}}School of Mathematics and Computer Science - Computer Science |
document_store_str |
1 |
active_str |
0 |
description |
In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing to the widespread use of informal shorthand-based typing and internet acronyms for quicker communication. However, due to the limited availability of resources, linguistic support for these languages is limited, making them low-resource languages. To address this resource deficit, this study proposes the development of a rule-based transliteration tool that can annotate Sinhala words into Romanized Sinhala, accommodating the diverse ad hoc typing patterns used by the community. The research approach involved a comprehensive survey employing a stratified sampling method, considering variables such as age, gender, and language proficiency. 215 participants were presented with an online survey comprising 12 Sinhala sentences to capture various transliteration patterns related to Sinhala characters which are necessary for the annotation process. Analysis of the survey responses led to the formulation of 92 general rules and 26 special rules, encapsulating ad hoc Romanized Sinhala typing patterns. Using these rules, Sinhala dictionaries were annotated, building a large corpus of data which consists of Sinhala and its Romanized Sinhala patterns. The annotated dataset was validated using a back transliteration tool, achieving an 84% word accuracy rate. This innovative transliteration annotator can be used to mitigate the resource constraints associated with Sinhala to Romanized Sinhala transliteration. GitHub link: https://github.com/Sumanathilaka/Swa-Bhasha-Sinhala-Singlish-Dataset |
published_date |
2024-04-22T10:48:51Z |
_version_ |
1812972942578941952 |
score |
11.035634 |