Ang Paggamit ng Trigram Ranking Bilang Panukat sa Pagkakahalintulad at Pagkakapangkat ng mga Wika [The Use of Trigram Ranking as a Measure of the Similarity and Grouping of Languages].
In: Malay, vol. 26 (2014-04-01), no. 2, pp. 53-68
academicJournal
A trigram is a three-letter sequence within a word. For example, the trigrams that can be generated from the word "tatlo" are tat, atl, and tlo. This research presents trigram ranking, a metric for language similarity. It involves [1] collecting large amounts of text as training data, [2] generating trigram profiles from the training data, and [3] computing language similarity from the trigrams. Also presented is the use of k-means clustering to group languages based on their trigram ranking. In this study, the Internet was mined for texts by automatic means: [1] an XML-to-text converter was used to gather English and Filipino Wikipedia articles; [2] a web crawler was used to collect online news articles; [3] a Twitter API was used to collect tweets; and [4] a bot was used to collect chat logs from Ragnarok, an online game. Documents from a parallel corpus and documents from an online corpus were also collected. The following languages were used as a test bed: Bikol, Cebuano, Hiligaynon, Iloko, Pampanga, Pangasinan, Tagalog, and Waray. Based on the results, language pairs with trigram rankings close to each other come from the same subfamily of languages: [1] Bikol, Cebuano, Hiligaynon, Tagalog, and Waray come from one subgroup; [2] Iloko and Pangasinan come from one subgroup; and [3] Pampanga comes from another subgroup. Trigram ranking can thus be used to measure which Philippine languages are closely related. [ABSTRACT FROM AUTHOR]
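The trigram steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how rankings are compared, so the rank-based "out-of-place" distance used here (in the style of Cavnar and Trenkle's n-gram profiles) is an assumption, not the authors' formula. The function names `trigrams`, `trigram_profile`, and `out_of_place_distance` and the `top_n` cutoff are hypothetical.

```python
from collections import Counter


def trigrams(word):
    """Return every consecutive 3-letter sequence in a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def trigram_profile(text, top_n=300):
    """Map each of the top_n most frequent trigrams to its rank (0 = most frequent).

    Assumption: words are split on whitespace and lowercased.
    """
    counts = Counter()
    for word in text.lower().split():
        counts.update(trigrams(word))
    ranked = [tri for tri, _ in counts.most_common(top_n)]
    return {tri: rank for rank, tri in enumerate(ranked)}


def out_of_place_distance(profile_a, profile_b):
    """Sum of rank differences between two profiles (assumed metric).

    A trigram missing from profile_b receives a fixed maximum penalty.
    """
    max_penalty = len(profile_b)
    total = 0
    for tri, rank in profile_a.items():
        if tri in profile_b:
            total += abs(rank - profile_b[tri])
        else:
            total += max_penalty
    return total


print(trigrams("tatlo"))  # ['tat', 'atl', 'tlo']
```

Under this assumed metric, a smaller distance between two language profiles would indicate more similar trigram rankings; the resulting pairwise distances could then be passed to a clustering method such as k-means.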
Title: | Ang Paggamit ng Trigram Ranking Bilang Panukat sa Pagkakahalintulad at Pagkakapangkat ng mga Wika. |
---|---|
Author / Contributor: | Oco, Nathaniel ; Sison-Buban, Raquel ; Syliongka, Leif Romeritch ; Roxas, Rachel Edita ; Ilao, Joel |
Journal: | Malay, vol. 26 (2014-04-01), no. 2, pp. 53-68 |
Publication: | 2014 |
Media type: | academicJournal |
ISSN: | 0115-6195 (print) |