Archive for the ‘Lucene Technology’ Category

A custom Analyzer in PyLucene: Chinese word segmentation

Saturday, May 3rd, 2008

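Yesterday I decided to use PyLucene for the indexer, mainly for efficiency and because it has to cooperate with my other robots, so I gave up on using the Lucene implementation in Zend Framework. A quick search of the web turned up nothing specific about a Chinese Analyzer for PyLucene, but fortunately PyLucene's Readme.txt describes how to write a custom analyzer. It is actually quite simple: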

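Your custom Analyzer only has to implement a tokenStream method that returns an instance of a _tokenStream class you define. That _tokenStream class in turn implements a next() method, which returns a Token each time it is called and None once the tokens run out. With that in place the indexer can use your Analyzer, calling next() repeatedly to pull the Tokens one by one.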

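Today I will find some time to wrap the reverse maximum matching algorithm for Chinese word segmentation in a custom Analyzer and try it out, using the sogou dictionary to start with.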
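For readers who have not met reverse maximum matching, here is a minimal sketch of the idea, not the code I ended up with. It scans the text from the end and, at each step, takes the longest dictionary word that ends at the current position, falling back to a single character. The function name, the `max_len` parameter and the toy dictionary are made up for illustration, and the sketch assumes Python 3 strings (on the Python 2.5 build mentioned on this page you would need unicode objects).

```python
def reverse_max_match(text, dictionary, max_len=4):
    """Segment text right to left, preferring the longest dictionary word."""
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


# Toy dictionary for illustration only; a real one would be loaded from a file.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(reverse_max_match("研究生命起源", toy_dict))
# → ['研究', '生命', '起源']
```

A forward scan over the same dictionary would grab 研究生 first and leave 命 stranded, which is the usual argument for scanning backwards.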

The relevant passage from the Readme reads:

Technically, the PyLucene programmer is not providing an ‘extension’
but a Python implementation of a set of methods encapsulated by a
Python class whose instances are wrapped by the Java proxies provided
by PyLucene.

For example, the code below, extracted from a PyLucene unit test,
defines a custom analyzer using a custom token stream that returns the
tokens ‘1’, ‘2’, ‘3’, ‘4’, ‘5’ for any document it is called on.

All that is needed in order to provide a custom analyzer in Python is
defining a class that implements a method called ‘tokenStream’. The
presence of the ‘tokenStream’ method is detected by the corresponding
SWIG type handler and the python instance passed in is wrapped by a new
Java PythonAnalyzer instance that extends Lucene’s abstract Analyzer
class.

In other words, SWIG in reverse.


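# Import added for completeness; it assumes the GCJ-era `PyLucene` module.
from PyLucene import Token, RAMDirectory, IndexWriter, Document, Field

class _analyzer(object):
    def tokenStream(self, fieldName, reader):
        class _tokenStream(object):
            def __init__(self):
                self.tokens = ['1', '2', '3', '4', '5']
                self.increments = [1, 2, 1, 0, 1]
                self.i = 0
            def next(self):
                # Returning None tells the indexer the stream is exhausted.
                if self.i == len(self.tokens):
                    return None
                t = Token(self.tokens[self.i], self.i, self.i)
                t.setPositionIncrement(self.increments[self.i])
                self.i += 1
                return t
        return _tokenStream()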

analyzer = _analyzer()

store = RAMDirectory()
writer = IndexWriter(store, analyzer, True)

d = Document()
d.add(Field.Text("field", "bogus"))
writer.addDocument(d)
writer.optimize()
writer.close()
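To connect this with the segmentation plan, here is a rough sketch of a token stream that hands out segmented words through the same next() protocol. It runs without PyLucene: `SimpleToken` is a hypothetical stand-in for Lucene's Token, and `SegmentTokenStream` is a name I made up. It also assumes the words, joined together, reproduce the source text exactly, so that offsets can be computed by adding up word lengths.

```python
from collections import namedtuple

# Hypothetical stand-in for Lucene's Token: term text plus character offsets.
SimpleToken = namedtuple("SimpleToken", ["text", "start", "end"])


class SegmentTokenStream(object):
    """Hands out one token per segmented word via next()."""

    def __init__(self, words):
        self.words = words
        self.i = 0
        self.offset = 0  # character position of the next word in the source

    def next(self):
        if self.i == len(self.words):
            return None  # signals the end of the stream
        word = self.words[self.i]
        token = SimpleToken(word, self.offset, self.offset + len(word))
        self.offset += len(word)
        self.i += 1
        return token


# Drain the stream the way an indexer would: call next() until it returns None.
stream = SegmentTokenStream(["研究", "生命", "起源"])
tokens = []
token = stream.next()
while token is not None:
    tokens.append(token)
    token = stream.next()
print(tokens)
```

In a real Analyzer the `SimpleToken(...)` line would build a PyLucene Token instead, and the word list would come from the segmenter run over the contents of `reader`.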

Recommending a good article on installing PyLucene

Wednesday, April 30th, 2008

I recommend it because it is written in great detail and, more importantly, by following it I quickly got searchfiles.py and indexfiles.py from the PyLucene samples working. :)

btw: I am using PyLucene-2.1.0-2-gcj346-py25-win32

I almost forgot the article's address: Installing PyLucene