What about Traditional Chinese segmentation?

Sep 7, 2012 at 3:48 AM

I'd like to know how to do Traditional Chinese segmentation using this awesome tool. Do I need to modify the dictionary or retrain the model to achieve this?

Kindly give me some help.

Oct 14, 2012 at 3:05 PM

Yes. The code is language independent; however, the lexical dictionary and the CRF model are not. If you want to apply it to Traditional Chinese segmentation, there are three possible solutions:

1. Prepare a lexical dictionary and a CRF model training corpus in Traditional Chinese.

2. Add a post-processing step to the existing code logic to convert Simplified Chinese to Traditional Chinese.

3. If the current lexical dictionary and model are suitable for your task, I'd be happy to share them with you, and you can convert them to Traditional Chinese.
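
As a rough sketch of the post-processing idea in option 2, the tokens returned by the segmenter could be mapped character by character. This is Python rather than the project's own language, and the `S2T` table and the `to_traditional` helper are made up for illustration; they are not part of wordseg. The table holds only a handful of characters, so a real conversion would need a complete mapping table or a dedicated conversion library.

```python
# Hypothetical, deliberately tiny Simplified -> Traditional table.
# A real table would contain thousands of entries.
S2T = str.maketrans({
    "汉": "漢",
    "语": "語",
    "词": "詞",
    "这": "這",
    "个": "個",
})


def to_traditional(tokens):
    """Convert each already-segmented token, keeping token boundaries intact."""
    return [token.translate(S2T) for token in tokens]


# Stand-in for the segmenter output (assumed here to be a list of strings).
segmented = ["这", "个", "汉语", "词"]
print(to_traditional(segmented))
```

Characters missing from the table pass through unchanged. Note that some Simplified characters correspond to more than one Traditional character (for example 发 can be 發 or 髮), so a plain one-to-one table like this one cannot always pick the right form.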

Apr 16, 2013 at 6:01 AM
Sorry for the late reply. I lost my mailbox password for a long time, so I couldn't receive notifications anymore.
I'd like to take your lexical dictionary and model and try converting them to TC for some testing.
I also want to ask some stupid questions.
  1. What is the lexical dictionary format?
  2. How do I train a CRF model? Which tool should I use?
Please kindly point me in the right direction.
By the way, how can I contact you directly, if possible?
Apr 16, 2013 at 9:54 AM
For 1, the lexical dictionary format is very simple. It is a raw text file with one term per line. You can open it with any text editor and modify it easily.
For 2, please see the CRFSharp project on CodePlex. The wordseg model is based on CRFSharp. You can find out how to train a CRF model on that project's homepage.
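
Because the dictionary from point 1 is plain text with one term per line, converting it can be done line by line. The following is only a sketch in Python: the `S2T` table and `convert_dictionary` are hypothetical names, the table is far too small for real use, and the demo reads from an in-memory string instead of the actual dictionary file.

```python
import io

# Hypothetical, deliberately tiny Simplified -> Traditional table.
S2T = str.maketrans({"汉": "漢", "语": "語", "词": "詞"})


def convert_dictionary(src, dst):
    """Copy a one-term-per-line dictionary from src to dst, converting each term.

    Blank lines are skipped. Returns the number of terms written.
    """
    count = 0
    for line in src:
        term = line.rstrip("\r\n")
        if not term:
            continue
        dst.write(term.translate(S2T) + "\n")
        count += 1
    return count


# In-memory demo; with real files, open them with the encoding the
# dictionary actually uses (UTF-8 is only an assumption).
src = io.StringIO("汉语\n词典\n\n分词\n")
dst = io.StringIO()
written = convert_dictionary(src, dst)
print(written)
print(dst.getvalue())
```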

For any issues, please feel free to contact me.
Apr 23, 2013 at 3:07 AM
Okay, thanks for your reply.
If I need to apply this to Traditional Chinese, which lexical files should I change?
I've tried the wordseg demo and saw those .txt files (they seem to be the lexical files?).

If I edit these files and change their content into TC, what's next?
Retrain the model? Using CRFSharp?

Please kindly point me in the right direction.
Many thanks.
Apr 26, 2013 at 6:07 AM
To apply wordseg to Traditional Chinese, there are two solutions:
  1. Convert the lexical dictionary and the CRF model's training corpus into Traditional Chinese, and retrain the model.
  2. Leave wordseg unchanged. Use the wordseg API in your application, and after you get the results from wordseg, convert them to Traditional Chinese.
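
For solution 1, the corpus conversion might look like the Python sketch below. It assumes a column-style corpus with one token per line, tab-separated columns, the text in the first column, and a blank line between sentences. That layout is an assumption, not something confirmed for the wordseg corpus, so please check the real file first. The `S2T` table, the `convert_corpus_line` helper, and the B/E/S tags are hypothetical.

```python
# Hypothetical, deliberately tiny Simplified -> Traditional table.
S2T = str.maketrans({"汉": "漢", "语": "語", "词": "詞"})


def convert_corpus_line(line, sep="\t"):
    """Convert only the first column of a corpus line; keep the other columns.

    An empty line (assumed to mark a sentence boundary) is returned as-is.
    """
    line = line.rstrip("\r\n")
    if not line:
        return ""
    columns = line.split(sep)
    columns[0] = columns[0].translate(S2T)
    return sep.join(columns)


# Made-up sample in the assumed format: text column, then a tag column.
sample = ["汉\tB", "语\tE", "", "词\tS"]
converted = [convert_corpus_line(line) for line in sample]
print(converted)
```

A character-for-character mapping keeps every token the same length and every line in the same place, so the existing tags stay aligned with the converted text and the corpus can go straight into retraining.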
Sep 18, 2013 at 3:51 AM
Sorry, for some reason I lost track of this project for months.
Recently I picked up this topic again.
Could you please provide me with the original lexical dictionary? I'd like to convert it to TC.
I'm very new to this field, so please point me in the right direction on how to "train" a CRF model once I have the lexical dictionary. I know that I have to edit some "template" that the CRF tool uses to train the model.
Many thanks for your kind support and guidance.
Sep 18, 2013 at 2:35 PM
  1. The demo package contains the raw lexical dictionary (ChineseDictionary.txt); you can download it from the [DOWNLOADS] section.
  2. To train a CRF model, please visit the CRFSharp homepage.