Posts Tagged ‘python’

[python] Sharing some Python books and source code

Monday, May 26th, 2008

Books, plus the code that accompanies them

1. ebook-chm-python-oreilly-programming-python-2nd-edition

2. python-programming-an-introduction-to-computer-science

3. ebook-thinking-in-python-html-code

4. 2000-python-programming-on-win32-code-oreilly practical-python-source-code

5. python-cookbookoreilly2005code

6. python_programming_on_win32_sourcecode

7. ebook-program-core-python-programming

8. oreillypythoninanutshell2003-chm

9. oreilly-programming-python-2nd-edition-with-source-code

[python] A simple spider

Thursday, May 22nd, 2008

I just installed a plugin, Google Syntax Highlighter for WordPress, which highlights code in various languages. So I wrote a very simple spider in Python to try it out. :)

# #############################
# Spider
#        get the page content of the URI
#

import random
import time
import urllib2

class Spider:
    def __init__(self):
        """Set up the statistics counters and the pool of User-Agent strings."""
        self._count = 0
        self._data = 0
        self._cost = 0
        self._agents = []
        self._agents.append('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14')
        self._agents.append('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)')
        self._agents.append('Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13')
        self._agents.append('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
        self._agents.append('Mozilla/5.0 (compatible; YodaoBot/1.0; http://www.yodao.com/help/webmaster/spider/; )')
        self._agents.append('Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)')
        self._failed = 0
        self._name = "G1029"
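    def get(self, u):
        """Fetch URL u and return the response body ('' on failure)."""
        s = ""
        c = time.time()
        try:
            # use a randomly chosen User-Agent for each request
            request = urllib2.Request(url=u)
            request.add_header('User-Agent', random.choice(self._agents))
            f = urllib2.urlopen(request)
            try:
                s = f.read()
            finally:
                f.close()
        except Exception:
            self._failed += 1
            print "ERROR: " + u
        # update statistics: request count, elapsed seconds, bytes received
        self._count += 1
        self._cost += (time.time() - c)
        self._data += len(s)
        return s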
    def set(self, item, value):
        """Set a named option (only 'name' is supported); return 1 on success, None otherwise."""
        if item == 'name':
            self._name = value
        else:
            return None
        return 1
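    def count(self):
        """Return the number of requests made so far."""
        return self._count
    def failed(self):
        """Return the number of failed requests."""
        return self._failed
    def cost(self):
        """Return the total time spent fetching, in seconds."""
        return self._cost
    def data(self):
        """Return the total number of bytes received."""
        return self._data
    def dump(self):
        """Print a one-line summary of the statistics."""
        print 'Spider Name: %s Count: %d Failed: %d Cost: %.2f Data: %d' % (self._name, self._count, self._failed, self._cost, self._data)
        return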



Seems to work fine, heh. I'll go with this for now.
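For readers on Python 3, where urllib2 was merged into urllib.request, the User-Agent rotation used by the spider might look roughly like the sketch below. The names build_request and fetch are made up for this illustration, the agent list is shortened, and error handling is reduced to returning empty bytes.

```python
import random
import urllib.request

# Shortened pool for illustration; the strings come from the spider listing.
AGENTS = [
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
]


def build_request(url, agents=AGENTS):
    """Return a Request for url carrying a randomly chosen User-Agent."""
    request = urllib.request.Request(url)
    request.add_header('User-Agent', random.choice(agents))
    return request


def fetch(url, timeout=10):
    """Return the response body as bytes, or b'' if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as f:
            return f.read()
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return b''
```

Splitting request construction from the actual fetch keeps the header logic easy to check without touching the network.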

[python] Simple Watch Dog

Thursday, May 15th, 2008


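To give the spider process some basic protection, I wrote a simple watch dog. The idea is simple: the guarded process feeds the dog; if the dog notices that it has not been fed in time, it kills the guarded process and then restarts it.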
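The feed-timeout idea can be sketched on its own, without sockets or threads. FeedTable is a hypothetical helper written for this illustration (in Python 3 syntax), not a class from the download; it folds registration into the first feed and takes an injectable clock so the timeout can be exercised without waiting.

```python
import time


class FeedTable(object):
    """Hypothetical helper: remembers when each process was last fed."""

    def __init__(self, feed_interval=120, clock=time.time):
        self._feed_interval = feed_interval  # seconds a process may go unfed
        self._clock = clock
        self._last_fed = {}                  # pid -> time of the last feed

    def feed(self, pid):
        """Record that pid has just fed the dog."""
        self._last_fed[pid] = self._clock()

    def expired(self):
        """Remove and return the pids that were not fed within the interval."""
        now = self._clock()
        late = [pid for pid, fed_at in self._last_fed.items()
                if now - fed_at > self._feed_interval]
        for pid in late:
            del self._last_fed[pid]
        return late
```

A watchdog loop would call expired() every few seconds and kill and restart whatever it returns.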

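There are two main classes:

  • watchdog: guards the processes
  • feeder: does the feeding

Usage is simple too:

  • Start the watchdog
  • Add a feeder to the process you want to guard

A few notes:

  • I was short on time, so some parts are still quite basic and there is no real exception handling; please bear with me :)
  • I wrote the feeder for use in my Python spider; for other languages you will have to port it yourself

Download the code (watchdog)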

import os
import sys
import time
import copy
import socket
import threading

class Watch_Dog (object):
    def __init__(self):
        self._exit = 0
        self._sock = None
        self._processes = {}
        self._processes_lock = threading.Lock()
        self._feed_interval = 120
        self._check_interval = 30
        self._addr = None
        return
    def open(self, addr):
        # record address
        self._addr = addr
        # create sock and bind to addr
        try:
            self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            self._sock.bind(self._addr)
            self._sock.setblocking(0)
        except socket.error:
            print 'create or bind socket failed'
            return 0
        # create worker thread
        h = threading.Thread(target=self._worker, args=())
        h.start()
        return 1
    def close(self):
        # close thread
        self._exit = 1
        #close socket
        try:
            self._sock.close()
        except:
            pass
        return
    def process(self):
        last_check_time = time.time()
        while not self._exit:
            if time.time() - last_check_time > self._check_interval :
                last_check_time = time.time()
                self._check()
            time.sleep(1)
        return
    def _worker(self):
        while not self._exit:
            try:
                recved, addr = self._sock.recvfrom(1024)
                if len(recved) > 0:
                    ss = recved.split('\n')
                    if len(ss) == 3 and ss[0] == 'register' :
                        self._register(ss[1], ss[2])
                    elif len(ss) == 2 and ss[0] == 'feed' :
                        self._feed(ss[1])
            except:
                pass
            time.sleep(0.1)
        return
    def _register(self, id, path):
        print 'register', id, path
        key = hash(id)
        self._processes_lock.acquire()
        if not self._processes.has_key(key) :
            self._processes[key] = [id, path, time.time()]
        self._processes_lock.release()
        return
    def _feed(self, id):
        print 'feed', id
        key = hash(id)
        self._processes_lock.acquire()
        if self._processes.has_key(key) :
            self._processes[key][2] = time.time()
        self._processes_lock.release()
        return
    def _check(self):
        print time.time(), 'check process'
        # make copy
        self._processes_lock.acquire()
        processes = copy.copy(self._processes)
        self._processes_lock.release()
        # check each robot
        for k in processes.keys():
            id = processes[k][0]
            pt = processes[k][1]
            tm = processes[k][2]
            if time.time() - tm > self._feed_interval :
                self._remove(id)
                pid = int(id)
                h = threading.Thread(target=self._restart, args=(pid, pt))
                h.start()
        return
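    def _remove(self, id):
        key = hash(id)
        self._processes_lock.acquire()
        try:
            # pop() tolerates a missing key, and finally guarantees the
            # lock is released even if something goes wrong
            self._processes.pop(key, None)
        finally:
            self._processes_lock.release()
        return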
    def _restart(self, pid, path):
        print 'thread >> start kill',pid,'restart',path
        if os.name == 'nt' :
            os.popen('taskkill /F /PID '+str(pid))
            os.startfile(path)
        elif os.name == 'posix' :
            os.popen('kill -9 '+str(pid))
            os.spawnvp(os.P_NOWAIT, 'python', ['python', path])
        else :
            print 'this OS is not supported yet'
        print 'thread >> finish'
        return

class Feeder (object) :
    def __init__(self):
        self._sock = None
        self._pid = None
        self._exit = 0
        self._feed_interval = 10
        self._addr = None
        self._path = None
        return
    def open(self, addr, path):
        self._addr = addr
        self._path = path
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._pid = os.getpid()
        self._register()
        h = threading.Thread(target=self._worker, args=())
        h.start()
        return
    def close(self):
        self._exit = 1
        self._sock.close()
        return
    def _worker(self):
        last_feed_time = time.time()
        while not self._exit :
            if time.time() - last_feed_time > self._feed_interval :
                last_feed_time = time.time()
                self._feed()
            time.sleep(1)
        return
    def _register(self):
        self._sock.sendto('register\n'+str(self._pid)+'\n'+self._path, self._addr)
        return
    def _feed(self):
        self._sock.sendto('feed\n'+str(self._pid), self._addr)
        return

if __name__ == '__main__':

    print 'Current OS :', os.name

    addr = ('localhost', 15028)

    wd = Watch_Dog()

    if not wd.open(addr) : sys.exit(-1)

    wd.process()

    wd.close()
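The datagrams that Feeder sends and Watch_Dog parses are plain newline-separated fields: 'register', the pid and the script path, or 'feed' and the pid. The helper names below are invented for this sketch, which uses Python 3 syntax (so the payload is explicitly encoded to bytes) and, like the listing, keeps the pid as a string after parsing.

```python
def encode_register(pid, path):
    """Build the datagram a feeder sends once at startup."""
    return ('register\n%d\n%s' % (pid, path)).encode('utf-8')


def encode_feed(pid):
    """Build the periodic heartbeat datagram."""
    return ('feed\n%d' % pid).encode('utf-8')


def parse_message(data):
    """Return ('register', pid, path), ('feed', pid), or None if malformed."""
    fields = data.decode('utf-8', 'replace').split('\n')
    if len(fields) == 3 and fields[0] == 'register':
        return ('register', fields[1], fields[2])
    if len(fields) == 2 and fields[0] == 'feed':
        return ('feed', fields[1])
    return None
```

Because the fields are split on newlines, a script path containing a newline would be rejected as malformed; that limitation is shared with the listing.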

Custom Analyzer for PyLucene: Chinese word segmentation

Saturday, May 3rd, 2008

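Yesterday I decided to use PyLucene for the indexer, mainly for efficiency and for easier cooperation with my other robots, so I gave up on using the Lucene that ships with Zend Framework. A quick search online turned up nothing specific about Chinese Analyzers for PyLucene, but fortunately PyLucene's Readme.txt describes how to write a custom analyzer. It is actually quite simple:

Implement a tokenStream method in your custom Analyzer and have it return an object of a _tokenStream class that you define. That _tokenStream class must implement a next() method that returns a Token. With that in place, the indexer can use your Analyzer and pull the Tokens out one at a time by calling next().

Today I will find some time to put a reverse maximum matching Chinese word segmentation algorithm into a custom Analyzer and try it out, using the Sogou dictionary for now.

The relevant passage from the Readme reads: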

Technically, the PyLucene programmer is not providing an ‘extension’
but a Python implementation of a set of methods encapsulated by a
Python class whose instances are wrapped by the Java proxies provided
by PyLucene.

For example, the code below, extracted from a PyLucene unit test,
defines a custom analyzer using a custom token stream that returns the
tokens ‘1’, ‘2’, ‘3’, ‘4’, ‘5’ for any document it is called on.

All that is needed in order to provide a custom analyzer in Python is
defining a class that implements a method called ‘tokenStream’. The
presence of the ‘tokenStream’ method is detected by the corresponding
SWIG type handler and the python instance passed in is wrapped by a new
Java PythonAnalyzer instance that extends Lucene’s abstract Analyzer
class.

In other words, SWIG in reverse.


class _analyzer(object):
    def tokenStream(self, fieldName, reader):
        class _tokenStream(object):
            def __init__(self):
                self.tokens = ['1', '2', '3', '4', '5']
                self.increments = [1, 2, 1, 0, 1]
                self.i = 0
            def next(self):
                if self.i == len(self.tokens):
                    return None
                t = Token(self.tokens[self.i], self.i, self.i)
                t.setPositionIncrement(self.increments[self.i])
                self.i += 1
                return t
        return _tokenStream()

analyzer = _analyzer()

store = RAMDirectory()
writer = IndexWriter(store, analyzer, True)

d = Document()
d.add(Field.Text("field", "bogus"))
writer.addDocument(d)
writer.optimize()
writer.close()
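The reverse maximum matching segmentation mentioned in this post can be sketched independently of PyLucene. This is a minimal illustration in Python 3 syntax, not the Analyzer itself: the function name, the max_len default and the tiny dictionary are assumptions, and a real dictionary such as Sogou's would be loaded from a file. Characters that match no dictionary word are emitted as single-character tokens.

```python
def backward_max_match(text, dictionary, max_len=4):
    """Segment text by scanning from the right, taking the longest known word."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            word = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or word in dictionary:
                tokens.append(word)
                end -= size
                break
    tokens.reverse()  # words were collected right to left
    return tokens


words = {'研究', '研究生', '生命', '起源'}
print(backward_max_match('研究生命起源', words))
# prints ['研究', '生命', '起源']
```

Scanning from the left with the same dictionary would grab 研究生 first and leave 命 stranded, which is the usual argument for matching from the right. Each resulting word could then be wrapped in a Token inside a custom next() method.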

Python's Eight Honors and Eight Shames

Wednesday, April 30th, 2008
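Take pride in hands-on practice, and shame in reading without practicing;
Take pride in printing logs, and shame in single-step debugging;
Take pride in indenting with spaces, and shame in indenting with tabs;
Take pride in unit testing, and shame in manual testing;

Take pride in reusing modules, and shame in copy and paste;
Take pride in using polymorphism, and shame in branching on conditions;
Take pride in being Pythonic, and shame in redundant, long-winded code;
Take pride in summarizing and sharing, and shame in begging for answers;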

                  -- from 华蟒 (CPyUG)

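Everything from the p2p programs I wrote in the past to the spider and indexer I have been writing recently is in Python.

When network I/O is the bottleneck, Python is a pretty good choice. I came across these eight honors and eight shames on 华蟒 and thought they were very well put :), so I am taking them as my motto.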