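I have recently been doing some research on information extraction, but was held back by the lack of a Chinese corpus. That led me to Baidu Baike: the text of its entries can be turned into a corpus. What I need is unstructured data, so I crawl the descriptive sentences of each entry rather than the tags that are already structured.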

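Overall approach:

Download pages from the link store, then extract both information and links from them
Store the extracted links, and store the extracted data as well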
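
The two steps above amount to a round-by-round (breadth-first) crawl. Below is a minimal sketch of that loop, not the author's actual driver code: `crawl`, `fetch_and_parse`, and `max_rounds` are hypothetical names, and an in-memory list and set stand in for the two databases.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A parsed page: (extracted data or None, {title: url} of newly found links).
ParseResult = Tuple[Optional[dict], Dict[str, str]]


def crawl(seed_urls: Iterable[str],
          fetch_and_parse: Callable[[str], Optional[ParseResult]],
          max_rounds: int) -> List[dict]:
    """Breadth-first crawl: each round processes the links found in the previous one."""
    frontier = list(dict.fromkeys(seed_urls))  # URLs to process in the current round
    seen = set(frontier)                       # every URL ever queued
    records: List[dict] = []                   # stands in for the data store

    for _ in range(max_rounds):
        next_frontier = []
        for url in frontier:
            result = fetch_and_parse(url)
            if result is None:                 # download or parsing failed: skip
                continue
            content, links = result
            if content is not None:
                records.append(content)
            for link in links.values():
                if link not in seen:           # queue each URL only once
                    seen.add(link)
                    next_frontier.append(link)
        if not next_frontier:                  # nothing new was discovered
            break
        frontier = next_frontier
    return records
```

In the real crawler, `fetch_and_parse` would combine the download and parsing functions described in the following sections, and the frontier of each round would be read back from the link table.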

Page loading

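This part uses urllib and bs4 to load and parse pages. Some checks are needed along the way to filter out pages that are of no use at all, such as the "百科达人" (Baike expert) pages.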

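This task targets the person entries in Baidu Baike, and I happened to have a dictionary of personal names, mostly famous historical figures. To keep things simple, I just check whether an entry's title appears in this dictionary, and crawl the entry if it does.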
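
The dictionary check is called later as `is_people_name`, but its body is not shown. One possible sketch, assuming the dictionary is available as one name per line; `load_names`, `PEOPLE_NAMES`, and the sample names are hypothetical:

```python
from typing import Iterable, Set


def load_names(lines: Iterable[str]) -> Set[str]:
    """Build the lookup set from an iterable of lines, one name per line."""
    return {line.strip() for line in lines if line.strip()}


# Hypothetical sample; in practice the lines would come from the dictionary file,
# e.g. load_names(open(path, encoding="utf-8")).
PEOPLE_NAMES = load_names(["李白\n", "杜甫\n", "\n", " 苏轼 \n"])


def is_people_name(name: str) -> bool:
    """Return True when the entry title is a name from the dictionary."""
    return name.strip() in PEOPLE_NAMES
```

A set keeps each lookup at constant time on average, which matters because the check runs once for every link on every page.
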
downloader downloads a page:

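import urllib.request
from urllib.error import HTTPError, URLError


def downloader(url: str, header: dict):
    """Fetch a page; return the response object, or None if the request fails."""
    try:
        # requests would work just as well:
        # response = requests.get(url, headers=header)
        # response.encoding = "utf-8"
        request = urllib.request.Request(url, headers=header)
        response = urllib.request.urlopen(request)
        return response
    except (HTTPError, URLError):
        return None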

get_new_url collects new links and filters out those that do not meet the requirements:

import re


def get_new_url(page) -> dict:
    # page is the BeautifulSoup object of a downloaded page
    links = page.find_all('a', target='_blank', href=re.compile("/item/"))
    new_url = {}
    # link texts containing any of these are not entry titles
    words = ['达人', '秒懂', '本人', '义项', '义词', '百科', '\n', ' ']
    for link in links:
        name = link.get_text()
        url = link["href"]
        # False when the link text contains a blocked word
        flag = not any(word in name for word in words)
        if name not in new_url and name != '' and flag and is_people_name(name):
            new_url[name] = 'https://baike.baidu.com' + url
    return new_url

get_new_content extracts the content:

def get_new_content(page) -> dict:
    """Extract title, summary and info-box fields; return None if an element is missing."""
    new_content = {}
    try:
        # To crawl the main text instead, use
        # {"class": re.compile("para(-title)*"), "label-module": re.compile("para(-title)*")}
        # Crawl the summary
        title = page.find("h1").get_text()
        subtitle = page.find("h2").get_text()
        new_content['title'] = title
        new_content['complete_title'] = title + subtitle
        para = page.find("div", {"class": "lemma-summary", "label-module": "lemmaSummary"})
        introduction = para.get_text()
        # strip bracketed markers such as [1], non-breaking spaces and newlines
        introduction = re.sub('\\[.*?]|\xa0|\n', '', introduction)
        new_content['introduction'] = introduction

        # Crawl the info box: dt elements hold field names, dd elements hold values
        items = page.find("div", {"class": "basic-info cmn-clearfix"})
        names = items.find_all("dt", {"class": "basicInfo-item name"})
        values = items.find_all("dd", {"class": "basicInfo-item value"})
        for name, value in zip(names, values):
            name = name.get_text()
            name = re.sub('\\[.*?]|\xa0|\n', '', name)
            value = value.get_text()
            value = re.sub('\\[.*?]|\xa0|\n', '', value)
            new_content[name] = value
        return new_content

    except AttributeError:
        # find() returned None because an expected element is missing
        return None
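
To make the cleaning step concrete, here is a small worked example of the same regular expression applied to a string. `NOISE`, `clean_text`, and the sample text are illustrative names and data, not part of the original crawler.

```python
import re

# Same pattern as in get_new_content: a bracketed marker such as "[1]",
# a non-breaking space, or a newline.
NOISE = re.compile(r'\[.*?]|\xa0|\n')


def clean_text(text: str) -> str:
    """Remove bracketed markers, non-breaking spaces and newlines."""
    return NOISE.sub('', text)


# Hypothetical sample text, not taken from a real page.
raw = "李白[1]\xa0，字太白[2-3]，\n唐代诗人。"
cleaned = clean_text(raw)  # "李白，字太白，唐代诗人。"
```

The non-greedy `.*?` stops at the first closing bracket, so two separate markers on one line are removed individually instead of swallowing the text between them.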

Database interaction

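The second part is the database work, which has three pieces: writing links, writing data, and fetching links.
MongoDB is used to store the data, because each record already has a dictionary structure.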

def writer(content: dict):
    # collection is a MongoDB collection handle created beforehand
    if content is not None:
        collection.insert_one(content)

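The original plan was to store the data in PostgreSQL (pg). I later switched to MongoDB for the data, but the links are still stored in pg. Each link is stored together with a variable recording which round it was found in, so that once a round is finished, the links added during that round can be fetched for the next one.
To avoid duplicate links, the URL is the primary key. The table, of course, has to be created beforehand.
The inserts are not particularly optimized; execute_batch or copy_from would be faster. (In psycopg2, executemany is currently no faster than calling execute in a loop.)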

import psycopg2


def manager(urls: dict, batch: int) -> None:
    # database, host, port, user and password are defined elsewhere
    conn = psycopg2.connect(database=database, host=host,
                            port=port, user=user, password=password)
    cur = conn.cursor()
    for title, url in urls.items():
        # the primary key on url makes a repeated link a no-op
        cur.execute("INSERT INTO people(url, title, batch) VALUES(%s,%s,%s) ON CONFLICT(url) DO NOTHING", (url, title, batch))
    conn.commit()
    cur.close()
    conn.close()
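
The de-duplication and round-tagging scheme can be tried out without a PostgreSQL server. The sketch below swaps in Python's built-in sqlite3 purely so that it runs anywhere; the column types, the helper names `store_links` and `links_of_batch`, and the URLs are assumptions for illustration, since the original table definition is not shown.

```python
import sqlite3

# In-memory SQLite database, used here only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
# Assumed schema: url is the primary key, batch is the round number.
conn.execute("CREATE TABLE people (url TEXT PRIMARY KEY, title TEXT, batch INTEGER)")


def store_links(urls: dict, batch: int) -> None:
    """Insert {title: url} links; URLs already in the table are left untouched."""
    rows = [(url, title, batch) for title, url in urls.items()]
    # INSERT OR IGNORE is SQLite's counterpart of ON CONFLICT(url) DO NOTHING.
    conn.executemany(
        "INSERT OR IGNORE INTO people(url, title, batch) VALUES (?, ?, ?)", rows)
    conn.commit()


def links_of_batch(batch: int) -> list:
    """Return the URLs first seen in the given round, in insertion order."""
    cur = conn.execute(
        "SELECT url FROM people WHERE batch = ? ORDER BY rowid", (batch,))
    return [row[0] for row in cur.fetchall()]


# Hypothetical URLs for illustration.
store_links({"李白": "https://baike.baidu.com/item/a",
             "杜甫": "https://baike.baidu.com/item/b"}, 1)
# Round 2 finds one known link again and one new link.
store_links({"李白": "https://baike.baidu.com/item/a",
             "苏轼": "https://baike.baidu.com/item/c"}, 2)
```

Because the repeated URL keeps its original round number, querying round 2 returns only the link that was genuinely new in that round.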

Fetching links is simpler: a plain query is enough.

def get_url(batch: int) -> list:
    conn = psycopg2.connect(database=database, host=host,
                            port=port, user=user, password=password)
    cur = conn.cursor()
    # query parameters must be passed as a sequence, hence the one-element tuple
    cur.execute("SELECT url FROM people WHERE batch=%s", (batch,))
    url_new = cur.fetchall()  # list of one-element tuples: [(url,), ...]
    cur.close()
    conn.close()

    return url_new

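What I had not expected is that a single Baidu Baike entry can have several URLs, so a lot of duplicate content still ends up in the data. The only option is to write another program to filter it.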
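
The post does not show how that filtering was done. One simple possibility, sketched here under the assumption that two records with the same `complete_title` and `introduction` describe the same entry; `drop_duplicates` is a hypothetical helper:

```python
from typing import Iterable, List


def drop_duplicates(records: Iterable[dict]) -> List[dict]:
    """Keep the first record for each (complete_title, introduction) pair."""
    seen = set()
    unique = []
    for record in records:
        # Assumed identity of an entry: full title plus summary text.
        key = (record.get("complete_title"), record.get("introduction"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```

Using the full title rather than the bare title keeps different people who share a name apart, as long as their subtitles or summaries differ.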
