Sunday, November 11, 2007

Updated News about Myanmar Lexicon and Corpus (II)


I have read your new post. It is a good job, and I encourage you.
I have been working on the same thing for more than 3 years.
Right now we have 8 lakh (800,000) words in inflected forms.
Our words are stored in Wininnwa ASCII. As you know, they can easily be converted to Unicode as and when required.

Our words are generated from available internet sources and PDFs. They form a general word list across different domains (medicine, computer science, ...). Our assumption is that Myanmar words are made up of one or more syllables. We have built up syllable collocations (in computing terms, N-grams): စား (1-gram), စားပါ (2-gram), စားပါသည္ (3-gram). In this way, if you know the domain, you can generate words easily.
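The syllable N-gram idea described above can be sketched in a few lines of Python. This is only an illustration, not the letter writer's actual tool: it assumes the text has already been split into syllables (the segmentation step itself is not shown), and the function names `syllable_ngrams` and `count_candidates` are made up for this example.

```python
from collections import Counter


def syllable_ngrams(syllables, n):
    """Join every run of n consecutive syllables into one candidate word."""
    return ["".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def count_candidates(segmented_texts, max_n=3):
    """Count candidate words of 1..max_n syllables across all inputs.

    segmented_texts is a list of syllable lists. Splitting raw text into
    syllables is assumed to have been done elsewhere.
    """
    counts = Counter()
    for syllables in segmented_texts:
        for n in range(1, max_n + 1):
            counts.update(syllable_ngrams(syllables, n))
    return counts


# Hypothetical input: the example from the letter, already split into syllables.
sample = [["စား", "ပါ", "သည္"]]
for candidate, freq in count_candidates(sample).items():
    print(candidate, freq)
```

The output is a list of candidates, not of confirmed words. Each candidate still has to be checked for validity, which is the manual or machine-assisted step the letter goes on to describe.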

Because the words are generated, as I said, we have to put effort into finding the valid ones. For some domains, though, that task does not require a high level of education. The manual checking can also be reduced if you use machine learning techniques.

We also have 2.1 million Myanmar sentences from newspapers, novels, etc. The data have not yet been released to the public.

I have mentioned all this to let you know that you can generate Myanmar words for your own domains using these techniques. Hopefully, you will see it in a paper soon.

I have learned a great deal from you too. Thank you.

With best wishes,


This is a constructive letter from a Myanmar scholar living abroad. Reading it was satisfying, and it left me happy and encouraged. When someone speaks this politely and with such precise facts, I am glad to learn that, contrary to what I had assumed, there are still people working on this, and I am grateful to the person who told me. By thanking him here, I also let all of you know about it properly. Thank you, XXX. :)
