A Custom Analyzer for PyLucene: Chinese Word Segmentation
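Saturday, May 3rd, 2008
Yesterday I decided to use PyLucene for the indexer, mainly for efficiency and because it has to work together with my other robots, so I dropped the idea of using the Lucene implementation in Zend Framework. A quick search online turned up nothing specific about a Chinese Analyzer for PyLucene, but fortunately PyLucene's Readme.txt describes how to write a custom analyzer. It turns out to be quite simple.
Your custom Analyzer only has to implement a tokenStream method that returns an instance of a _tokenStream class you define yourself. That _tokenStream class in turn has to implement a next() method that returns a Token. With that in place the indexer can use your Analyzer, calling next() repeatedly to get one Token after another until next() returns None.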
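Today I found some time to try building the reverse maximum matching algorithm for Chinese word segmentation into a custom Analyzer, using the Sogou dictionary to start with.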
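For readers who have not met reverse maximum matching before, here is a minimal sketch of the idea in plain Python. It is not the code that went into the Analyzer. The function name, the `max_len` limit and the toy dictionary are all made up for illustration, and a real word list such as the Sogou one would be far larger.

```python
def reverse_max_match(text, dictionary, max_len=4):
    """Segment text from right to left, preferring the longest dictionary word.

    `dictionary` is any container supporting `in` (a set is assumed here);
    `max_len` is an assumed upper bound on word length.
    """
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    # Words were collected back to front, so restore reading order.
    words.reverse()
    return words


# Toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(reverse_max_match("研究生命起源", toy_dictionary))
# → ['研究', '生命', '起源']
```

Scanning from the end is what keeps "生命" together here; a left-to-right longest match would take "研究生" first and leave "命" stranded.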
The original text reads as follows:
Technically, the PyLucene programmer is not providing an ‘extension’
but a Python implementation of a set of methods encapsulated by a
Python class whose instances are wrapped by the Java proxies provided
by PyLucene.
For example, the code below, extracted from a PyLucene unit test,
defines a custom analyzer using a custom token stream that returns the
tokens '1', '2', '3', '4', '5' for any document it is called on.
All that is needed in order to provide a custom analyzer in Python is
defining a class that implements a method called ‘tokenStream’. The
presence of the ‘tokenStream’ method is detected by the corresponding
SWIG type handler and the python instance passed in is wrapped by a new
Java PythonAnalyzer instance that extends Lucene’s abstract Analyzer
class.
In other words, SWIG in reverse.
    class _analyzer(object):
        def tokenStream(self, fieldName, reader):
            class _tokenStream(object):
                def __init__(self):
                    self.tokens = ['1', '2', '3', '4', '5']
                    self.increments = [1, 2, 1, 0, 1]
                    self.i = 0
                def next(self):
                    if self.i == len(self.tokens):
                        return None
                    t = Token(self.tokens[self.i], self.i, self.i)
                    t.setPositionIncrement(self.increments[self.i])
                    self.i += 1
                    return t
            return _tokenStream()

    analyzer = _analyzer()
    store = RAMDirectory()
    writer = IndexWriter(store, analyzer, True)
    d = Document()
    d.add(Field.Text("field", "bogus"))
    writer.addDocument(d)
    writer.optimize()
    writer.close()
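To show how segmented words could be fed through the same tokenStream/next() shape, here is a rough sketch that runs without PyLucene. Everything in it is an assumption made for illustration: `SketchToken` is a stand-in for PyLucene's Token, `WordTokenStream` and `SegmentingAnalyzer` are invented names, and the reader is assumed to be a file-like object with a read() method. The offsets are only right if the segmenter returns words that, joined together, reproduce the input text exactly.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for PyLucene's Token(text, start, end).
SketchToken = namedtuple("SketchToken", ["text", "start", "end"])


class WordTokenStream(object):
    """Hands out one token per next() call and None once exhausted."""

    def __init__(self, words):
        self.words = words
        self.i = 0
        self.offset = 0  # character offset of the next word in the text

    def next(self):
        if self.i == len(self.words):
            return None
        word = self.words[self.i]
        token = SketchToken(word, self.offset, self.offset + len(word))
        self.offset += len(word)
        self.i += 1
        return token


class SegmentingAnalyzer(object):
    """Follows the tokenStream(fieldName, reader) shape of the README example."""

    def __init__(self, segment):
        self.segment = segment  # any callable mapping text to a list of words

    def tokenStream(self, fieldName, reader):
        # Assumes reader is file-like; real PyLucene readers may differ.
        return WordTokenStream(self.segment(reader.read()))


# Usage: `list` acts as a trivial one-character-per-word segmenter.
stream = SegmentingAnalyzer(list).tokenStream("field", io.StringIO("中文分词"))
tokens = []
token = stream.next()
while token is not None:
    tokens.append(token)
    token = stream.next()
print([(t.text, t.start, t.end) for t in tokens])
# → [('中', 0, 1), ('文', 1, 2), ('分', 2, 3), ('词', 3, 4)]
```

Swapping `list` for a dictionary-based segmenter is the only change needed to get word-level tokens. Inside PyLucene the stand-in tuple would be replaced by a real Token.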
