Project Description
The wordseg project is a word segmentation module implemented in C#. It segments text into tokens and labels each token's attributes according to its context and semantics, using forward maximum matching and CRF algorithms (http://crfsharp.codeplex.com/). wordseg is flexible and customizable: for different languages and tasks, you can supply the corresponding lexical dictionary and CRF model files.
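As a rough illustration of the forward maximum matching part, the sketch below greedily matches the longest dictionary word at each position. It is only a minimal sketch that assumes a hypothetical in-memory HashSet dictionary (and "using System;" plus "using System.Collections.Generic;"), not wordseg's actual implementation, which also applies the CRF model:

//Minimal sketch of forward (front) maximum matching; not the wordseg internals.
//"dict" is a hypothetical in-memory word set instead of wordseg's dictionary file.
static List<string> ForwardMaximumMatch(string text, HashSet<string> dict, int maxWordLen)
{
    List<string> tokens = new List<string>();
    int pos = 0;
    while (pos < text.Length)
    {
        //Try the longest span first and shrink it until it matches a dictionary entry;
        //fall back to a single character when nothing matches
        int len = Math.Min(maxWordLen, text.Length - pos);
        while (len > 1 && !dict.Contains(text.Substring(pos, len)))
        {
            len--;
        }
        tokens.Add(text.Substring(pos, len));
        pos += len;
    }
    return tokens;
}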

The following five Chinese sentences are examples:
张晓晨和付仲恺一起坐在家里的沙发上看江苏的非诚勿扰。
今天在公司遇到一位实习生,他从百度过来。
luckycat84和monkeyfu一起打三国杀。
年龄越大,心里越沉静,希望自己能够多做一些真善美的事情,而不是追逐世俗的利益。
百度公司的名字源于“众里寻他千百度”这诗句。

The segmentation results are as follows:
张晓晨[PER] 和 付仲恺[PER] 一起 坐 在 家 里 的 沙发[PDT] 上 看 江苏[LOC] 的 非 诚 勿扰 。
今天 在 公司 遇到 一位 实习生[JOB] , 他 从 百度 过来 。
luckycat 84 和 monkeyfu 一起 打 三国杀 。
年龄 越 大 , 心里 越 沉静 , 希望 自己 能够 多 做 一些 真善美 的 事情 , 而 不是 追逐 世俗 的 利益 。
百度公司[ORG] 的 名字 源于 “ 众 里 寻 他 千百度 ” 这 诗句 。

In the output above, tokens are separated by a TAB character, and each token's attribute is shown in "[ ]" after the token. In this example, the attribute strings are defined as follows (a short sketch after the list illustrates this layout):
PER: person name
LOC: location name
PDT: product name
ORG: organization name
JOB: job title.
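To make this layout concrete, here is a small hypothetical helper (assuming "using System;", "using System.Collections.Generic;" and "using System.Text;") that renders a list of term/tag pairs in the TAB-separated, bracketed form shown above. It only illustrates the output layout and is not part of the wordseg API:

//Hypothetical illustration of the output layout only: terms separated by TAB,
//with the attribute (if any) in "[ ]" right after the term
static string FormatTokens(IList<Tuple<string, string>> termsAndTags)
{
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < termsAndTags.Count; i++)
    {
        if (i > 0)
        {
            sb.Append('\t');
        }
        sb.Append(termsAndTags[i].Item1);
        if (!string.IsNullOrEmpty(termsAndTags[i].Item2))
        {
            sb.Append('[').Append(termsAndTags[i].Item2).Append(']');
        }
    }
    return sb.ToString();
}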

Because wordseg segments text according to its context and semantics, the same text may be segmented and labeled differently in different contexts. For example, in the fifth example sentence, the first 百度 is the brand name of the Baidu company, so it is labeled as ORG together with 公司. In contrast, the last 百度 is part of a line of poetry, so it is treated as a common word.

In addition, the attributes are defined in the lexical dictionary and the CRF model. You can define any attributes you need by building a custom lexical dictionary or CRF model.

To use wordseg in your program, add a reference to wordseg.dll to your project and add the following code snippets to your code.

Global initialization:
//Initialize the word segmentation instance
WordSeg.WordSeg wordseg = new WordSeg.WordSeg();
//Load the lexical dictionary in raw text format (args[0] is the dictionary file path)
wordseg.LoadLexicalDict(args[0], true);
//Load the CRF model with the default feature generator (args[1] is the model file path)
wordseg.LoadModelFile(args[1], null);

Per-thread initialization:
//Create a Tokens instance and set the maximum word segmentation length to 1024
WordSeg.Tokens tokens = wordseg.CreateTokens(1024);

Segmenting a given text in each thread:
//Segment the text with both the lexical dictionary and the CRF model
wordseg.Segment(strText, tokens, true);
//Walk through each segmented token
StringBuilder sb = new StringBuilder();
for (int i = 0; i < tokens.tokenList.Count; i++)
{
    //The token's surface string
    sb.Append(tokens.tokenList[i].strTerm);
    //The token's length and offset fields
    int len = tokens.tokenList[i].len;
    int offset = tokens.tokenList[i].offset;
    //The token's attribute tags (e.g. PER, LOC, ORG)
    List<string> attributeList = tokens.tokenList[i].strTagList;
}
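Putting these snippets together, a minimal single-threaded console program could look like the sketch below. Only the wordseg calls shown above are taken from the library; the command-line argument handling and the output formatting are illustrative assumptions.

using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        //args[0]: lexical dictionary file, args[1]: CRF model file, args[2]: text to segment
        WordSeg.WordSeg wordseg = new WordSeg.WordSeg();
        wordseg.LoadLexicalDict(args[0], true);
        wordseg.LoadModelFile(args[1], null);

        //One Tokens instance per thread; this example runs on a single thread
        WordSeg.Tokens tokens = wordseg.CreateTokens(1024);
        wordseg.Segment(args[2], tokens, true);

        //Print each term on its own line, followed by its attribute tags if any
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.tokenList.Count; i++)
        {
            sb.Append(tokens.tokenList[i].strTerm);
            List<string> attributeList = tokens.tokenList[i].strTagList;
            if (attributeList != null && attributeList.Count > 0)
            {
                sb.Append('[').Append(string.Join(",", attributeList)).Append(']');
            }
            sb.AppendLine();
        }
        Console.Write(sb.ToString());
    }
}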
