Abstract:
In this paper, we propose a method for identifying
font scripts of the Myanmar language. Because no
nationwide standardized encoding scheme exists for
Myanmar font scripts, knowledge written in the
Myanmar language is scattered across internet
pages. A font script identifier is essential for merging
this scattered knowledge into one collection for NLP
applications such as text categorization, information
retrieval and text summarization. Our proposed
method uses N-gram based text categorization. A
sample of text in each of 11 font scripts is taken for training.
TF-IDF (Term Frequency-Inverse Document
Frequency) weights of character N-grams for each
font script are computed and stored as a profile for
that particular font script. When a new text document
is given for testing, its TF-IDF weights are computed
and cosine similarity is measured
between the test and trained profiles. The font script
with the highest similarity score is taken as the
result. An accuracy of 100% is obtained when testing
11 different font scripts with the TF-IDF
approach. Therefore, this method works well for
Myanmar font script identification.
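
The pipeline summarized above (character N-gram TF-IDF profiles compared
by cosine similarity) might be sketched as follows. This is a minimal
illustration, not the authors' implementation: the function names, the
default N-gram length of 2, the smoothed IDF formula, and the choice to
treat each font script's training text as a single document are all
assumptions made here for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profiles(training_texts, n=2):
    """Build one TF-IDF weighted n-gram profile per font script.

    training_texts maps a (hypothetical) font script name to its
    training text; each script's text is treated as one document.
    """
    counts = {name: Counter(char_ngrams(text, n))
              for name, text in training_texts.items()}
    num_docs = len(counts)

    # Document frequency: in how many scripts does each n-gram occur?
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())

    # Smoothed IDF (an assumption): never zero, so n-grams shared by
    # every script still contribute a small weight.
    idf = {g: math.log((1 + num_docs) / (1 + df)) + 1.0
           for g, df in doc_freq.items()}

    profiles = {}
    for name, c in counts.items():
        total = sum(c.values())
        profiles[name] = {g: (k / total) * idf[g] for g, k in c.items()}
    return profiles, idf


def cosine(a, b):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(text, profiles, idf, n=2):
    """Return the name of the profile most similar to text."""
    c = Counter(char_ngrams(text, n))
    total = sum(c.values())
    # N-grams never seen in training have no IDF and are dropped.
    test = {g: (k / total) * idf[g] for g, k in c.items() if g in idf}
    return max(profiles, key=lambda name: cosine(test, profiles[name]))
```

In use, build_profiles would be called once on a dictionary holding one
training text per font script, and identify would then be called on each
new document with the returned profiles and IDF table.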