Python NLP in Practice: Using the Stanford Chinese Word Segmenter in NLTK | 我愛自然語言處理

看見就非常 2015-04-24


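The Stanford Natural Language Processing Group is a world-renowned NLP research group. It provides a suite of open-source Java text analysis tools, including a word segmenter (Word Segmenter), a part-of-speech tagger (Part-Of-Speech Tagger), a named entity recognizer (Named Entity Recognizer), and a syntactic parser (Parser). Better still, the group has trained Chinese models for these tools, so they support Chinese text processing. While using NLTK, I found that the current version already ships interfaces to the Stanford tools for part-of-speech tagging, named entity recognition, and parsing, but unfortunately not for the word segmenter. After searching Google without success and reading the relevant code, I decided to follow the existing interfaces and write a Stanford Chinese word segmenter interface for NLTK, so that the Stanford text processing tools can be called conveniently from Python.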

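Some preparation is needed first. Step one is of course to install NLTK; see the introduction in our gensim-related article 《如何計算兩個文檔的相似度》 (How to Compute the Similarity of Two Documents). Here, however, I recommend checking out the latest NLTK source code from GitHub and installing it with "python setup.py install": https://github.com/nltk/nltk. This version adds an interface to the Stanford parser that some older versions lack, and we may come back to it in a later article. The Stanford segmenter interface was also written against this version, so other versions may show minor problems. Next, install a Java runtime environment. On Ubuntu 12.04, for example, this takes only two steps: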

sudo apt-get install default-jre
sudo apt-get install default-jdk

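Finally, and most importantly, you need to download the Stanford segmenter files, including the source code, model files, and dictionary files. Note that the Stanford segmenter supports not only Chinese but also Arabic segmentation. The zip archive to download is: Download Stanford Word Segmenter version 2014-08-27. Unzip it after downloading.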
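Before moving on, it can help to confirm that the pieces are in place. The following is a minimal sketch, written for Python 3, that checks for a java executable and for the jar, model, and dictionary files this article uses later. The helper names (missing_segmenter_files, java_available) are made up for illustration, and the file list assumes the 2014-08-27 release layout referred to in this article.

```python
import os
import shutil

# File names as used later in this article (2014-08-27 release).
# Adjust the list if your download is laid out differently.
EXPECTED_FILES = [
    "stanford-segmenter-3.4.1.jar",
    os.path.join("data", "pku.gz"),
    os.path.join("data", "ctb.gz"),
    os.path.join("data", "dict-chris6.ser.gz"),
]


def missing_segmenter_files(root):
    """Return the expected files that are not present under root."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(root, name))]


def java_available():
    """Return True if a java executable can be found on PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    root = "stanford-segmenter-2014-08-27"  # the unzipped directory
    print("java on PATH:", java_available())
    print("missing files:", missing_segmenter_files(root))
```

An empty "missing files" list and a True for java suggest the environment is ready for the steps that follow.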

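With the preparation done, the first question is where in the NLTK source tree to add the interface file. In the NLTK source package, the interface file for the Stanford part-of-speech tagger and named entity recognizer is nltk/tag/stanford.py, and the one for the parser is nltk/parse/stanford.py. There is a stanford.py file under nltk/tokenize/, but it only provides an interface to PTBTokenizer, a tokenizer for English, and nothing for the Stanford segmenter. I therefore decided to add a stanford_segmenter.py file under nltk/tokenize as NLTK's interface to the Stanford Chinese word segmenter. These NLTK interfaces rely on the pipe (PIPE) mechanism under Linux and the subprocess module. The source code is pasted directly below; interested readers can read through it: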

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Natural Language Toolkit: Interface to the Stanford Chinese Segmenter
#
# Copyright (C) 2001-2014 NLTK Project
# Author: 52nlp <52nlpcn@gmail.com>
#
# URL: <http:///>
# For license information, see LICENSE.TXT

from __future__ import unicode_literals, print_function

import tempfile
import os
import json
from subprocess import PIPE

from nltk import compat
from nltk.internals import find_jar, config_java, java, _java_options

from nltk.tokenize.api import TokenizerI

class StanfordSegmenter(TokenizerI):
    r"""
    Interface to the Stanford Segmenter

    >>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
    >>> segmenter = StanfordSegmenter(path_to_jar="stanford-segmenter-3.4.1.jar", path_to_sihan_corpora_dict="./data", path_to_model="./data/pku.gz", path_to_dict="./data/dict-chris6.ser.gz")
    >>> sentence = u"这是斯坦福中文分词器测试"
    >>> segmenter.segment(sentence)
    u'\u8fd9 \u662f \u65af\u5766\u798f \u4e2d\u6587 \u5206\u8bcd\u5668 \u6d4b\u8bd5\n'
    >>> segmenter.segment_file("test.simp.utf8")
    u'\u9762\u5bf9 \u65b0 \u4e16\u7eaa \uff0c \u4e16\u754c \u5404\u56fd ...
    """

    _JAR = 'stanford-segmenter.jar'

    def __init__(self, path_to_jar=None,
                 path_to_sihan_corpora_dict=None,
                 path_to_model=None, path_to_dict=None,
                 encoding='UTF-8', options=None,
                 verbose=False, java_options='-mx2g'):
        # Locate the segmenter jar from the explicit path or the
        # STANFORD_SEGMENTER environment variable.
        self._stanford_jar = find_jar(
            self._JAR, path_to_jar,
            env_vars=('STANFORD_SEGMENTER',),
            searchpath=(),
            verbose=verbose
        )
        self._sihan_corpora_dict = path_to_sihan_corpora_dict
        self._model = path_to_model
        self._dict = path_to_dict

        self._encoding = encoding
        self.java_options = java_options
        options = {} if options is None else options
        self._options_cmd = ','.join('{0}={1}'.format(key, json.dumps(val)) for key, val in options.items())

    def segment_file(self, input_file_path):
        """
        Segment the text file at input_file_path and return the output as one string.
        """
        cmd = [
            'edu.stanford.nlp.ie.crf.CRFClassifier',
            '-sighanCorporaDict', self._sihan_corpora_dict,
            '-textFile', input_file_path,
            '-sighanPostProcessing', 'true',
            '-keepAllWhitespaces', 'false',
            '-loadClassifier', self._model,
            '-serDictionary', self._dict
        ]

        stdout = self._execute(cmd)

        return stdout

    def segment(self, tokens):
        return self.segment_sents([tokens])

    def segment_sents(self, sentences):
        """
        Write the sentences to a temporary file, one per line, and segment that file.
        """
        encoding = self._encoding
        # Create a temporary input file
        _input_fh, self._input_file_path = tempfile.mkstemp(text=True)

        # Write the actual sentences to the temporary input file.
        # Each sentence is joined with spaces; when a sentence is a plain
        # string, this puts a space between its characters.
        _input_fh = os.fdopen(_input_fh, 'wb')
        _input = '\n'.join((' '.join(x) for x in sentences))
        if isinstance(_input, compat.text_type) and encoding:
            _input = _input.encode(encoding)
        _input_fh.write(_input)
        _input_fh.close()

        cmd = [
            'edu.stanford.nlp.ie.crf.CRFClassifier',
            '-sighanCorporaDict', self._sihan_corpora_dict,
            '-textFile', self._input_file_path,
            '-sighanPostProcessing', 'true',
            '-keepAllWhitespaces', 'false',
            '-loadClassifier', self._model,
            '-serDictionary', self._dict
        ]

        stdout = self._execute(cmd)

        # Delete the temporary file
        os.unlink(self._input_file_path)

        return stdout

    def _execute(self, cmd, verbose=False):
        encoding = self._encoding
        cmd.extend(['-inputEncoding', encoding])
        _options_cmd = self._options_cmd
        if _options_cmd:
            cmd.extend(['-options', self._options_cmd])

        default_options = ' '.join(_java_options)

        # Configure java.
        config_java(options=self.java_options, verbose=verbose)

        stdout, _stderr = java(cmd, classpath=self._stanford_jar, stdout=PIPE, stderr=PIPE)
        stdout = stdout.decode(encoding)

        # Return java configurations to their default values.
        config_java(options=default_options, verbose=False)

        return stdout

def setup_module(module):
    from nose import SkipTest

    try:
        StanfordSegmenter()
    except LookupError:
        raise SkipTest('doctests from nltk.tokenize.stanford_segmenter are skipped because the stanford segmenter jar doesn\'t exist')
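To make the pipe mechanism concrete, here is a rough sketch of what the interface boils down to: assemble a java command line, start it as a child process, and read its standard output through a pipe. This is an illustration and not NLTK's actual implementation; the helper names (build_segmenter_command, run_with_pipes) are invented here, and the exact argument order NLTK produces may differ. The segmenter arguments are the ones used in the module above.

```python
import subprocess
import sys


def build_segmenter_command(jar, sighan_dir, model, dictionary, text_file,
                            encoding="UTF-8", java_options="-mx2g"):
    """Assemble a java command line with the same segmenter arguments
    that the interface passes to CRFClassifier."""
    return [
        "java", java_options, "-cp", jar,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-sighanCorporaDict", sighan_dir,
        "-textFile", text_file,
        "-sighanPostProcessing", "true",
        "-keepAllWhitespaces", "false",
        "-loadClassifier", model,
        "-serDictionary", dictionary,
        "-inputEncoding", encoding,
    ]


def run_with_pipes(cmd, encoding="UTF-8"):
    """Run cmd as a child process and return its decoded standard output."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() reads both pipes and waits until the child has exited.
    stdout, _stderr = proc.communicate()
    return stdout.decode(encoding)


if __name__ == "__main__":
    # A Python child process stands in for java so the sketch runs anywhere.
    print(run_with_pipes([sys.executable, "-c", "print('ok')"]).strip())
    print(" ".join(build_segmenter_command(
        "stanford-segmenter-3.4.1.jar", "./data", "./data/pku.gz",
        "./data/dict-chris6.ser.gz", "input.txt")))
```

Because communicate() only returns once the child process has terminated, every call made this way is a complete start-to-exit run of the JVM.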

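I forked the latest NLTK on GitHub and added stanford_segmenter.py to that fork. Interested readers can download this file, put it in the nltk/tokenize/ directory, and reinstall NLTK: sudo python setup.py install. Alternatively, clone our NLTK fork directly; once it is installed, the Stanford Chinese word segmenter is ready to use.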

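The Stanford Chinese segmentation interface can now be called from Python NLTK. For convenience, first change into the unzipped Stanford segmenter directory: cd stanford-segmenter-2014-08-27, then start ipython in that directory (the default python interpreter works too):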

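# Initialize the Stanford Chinese word segmenter
In [1]: from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# Note: the segmentation model, dictionary, and other resources live in the data
# directory and are given as relative paths, so this only works when run from
# the stanford-segmenter-2014-08-27 directory.
# Note also that the Stanford Chinese segmenter ships two Chinese models:
# ctb.gz is trained on the Penn Chinese Treebank
# pku.gz is trained on the People's Daily corpus that Peking University provided for Bakeoff 2005
# pku.gz is used here to make the final test convenient
In [2]: segmenter = StanfordSegmenter(path_to_jar="stanford-segmenter-3.4.1.jar", path_to_sihan_corpora_dict="./data", path_to_model="./data/pku.gz", path_to_dict="./data/dict-chris6.ser.gz")

# Test a Chinese sentence; note the u prefix
In [3]: sentence = u"这是斯坦福中文分词器测试"

# Call the segment method to split the sentence. There is a hidden problem here, explained at the end
In [4]: segmenter.segment(sentence)
Out[4]: u'\u8fd9 \u662f \u65af\u5766\u798f \u4e2d\u6587 \u5206\u8bcd\u5668 \u6d4b\u8bd5\n'

# The segmented result is displayed as escaped Unicode code points, so we write it to a file.
# Does anyone know a way to display Chinese directly in the Python interpreter?
In [5]: outfile = open('outfile', 'w')

In [6]: result = segmenter.segment(sentence)

# Note that the result must be encoded as UTF-8 when writing it to the file
In [7]: outfile.write(result.encode('UTF-8'))

In [8]: outfile.close()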

Open the outfile file:

这 是 斯坦福 中文 分词器 测试
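On the question of showing Chinese in the interpreter: in Python 2 the interactive echo displays repr(), which escapes non-ASCII characters, whereas print writes the characters themselves (provided the terminal encoding can represent them). The sketch below reuses the value shown as Out[4] in the session above. The helper names to_words and save_utf8 are made up for illustration.

```python
import io
import os
import tempfile

# The value shown as Out[4] in the session above.
result = u'\u8fd9 \u662f \u65af\u5766\u798f \u4e2d\u6587 \u5206\u8bcd\u5668 \u6d4b\u8bd5\n'


def to_words(segmented):
    """Split space-separated segmenter output into a list of words."""
    return segmented.split()


def save_utf8(path, text):
    """Write text as UTF-8; io.open handles the encoding step itself."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


if __name__ == "__main__":
    # print shows readable characters instead of \uXXXX escapes.
    print(result)
    print(u" / ".join(to_words(result)))

    path = os.path.join(tempfile.mkdtemp(), "outfile")
    save_utf8(path, result)
```

Using io.open with an explicit encoding also removes the need to call .encode('UTF-8') by hand before each write.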

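A segment_file method is also provided for segmenting a file directly. Let's test it on pku_test.utf8, the Bakeoff 2005 test set introduced in 《中文分詞入門之資源》 (Getting Started with Chinese Word Segmentation: Resources), and see how the Stanford segmenter performs: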

In [9]: result = segmenter.segment_file('pku_test.utf8')

In [10]: outfile = open('pku_outfile', 'w')

In [11]: outfile.write(result.encode('UTF-8'))

In [12]: outfile.close()

Open the result file pku_outfile:

共同創造美好的新世紀 ——二○○一年新年賀詞
（二○○○年十二月三十一日）（附圖片 1 張）
女士們，先生們，同志們，朋友們：
2001年新年鐘聲即將敲響。人類社會前進的航船就要駛入 21 世紀的新航程。中國人民進入了向現代化建設 第三步戰略目標邁進的新征程。
在這個激動人心的時刻，我很高興通過中國國際廣播電臺、中央人民廣播電臺和中央電視臺，向全國各族人民，向香港特別行政區 同胞、澳門特別行政區 同胞和臺灣同胞、海外僑胞，向世界各國的朋友們，致以新世紀第一個新年的祝賀！
….

Let's use the Bakeoff 2005 scoring script to measure how the Stanford Chinese segmenter does on this test corpus:

./icwb2-data/scripts/score ./icwb2-data/gold/pku_training_words.utf8 ./icwb2-data/gold/pku_test_gold.utf8 pku_outfile > stanford_pku_test.score

The results are as follows:
=== SUMMARY:
=== TOTAL INSERTIONS: 1479
=== TOTAL DELETIONS: 1974
=== TOTAL SUBSTITUTIONS: 3638
=== TOTAL NCHANGE: 7091
=== TOTAL TRUE WORD COUNT: 104372
=== TOTAL TEST WORD COUNT: 103877
=== TOTAL TRUE WORDS RECALL: 0.946
=== TOTAL TEST WORDS PRECISION: 0.951
=== F MEASURE: 0.948
=== OOV Rate: 0.058
=== OOV Recall Rate: 0.769
=== IV Recall Rate: 0.957
### pku_outfile 1479 1974 3638 7091 104372 103877 0.946 0.951 0.948 0.058 0.769 0.957

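Precision is 95.1%, recall is 94.6%, and the F-measure is 94.8%, which is quite good. Interested readers can try other test sets, or run the same test with the Penn Chinese Treebank model.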
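The three headline figures can be rechecked from the raw counts in the summary. The sketch below assumes that the number of correctly segmented words equals the true word count minus deletions and substitutions; that assumption is not taken from the scoring script itself, but it reproduces the reported values. The function name segmentation_scores is made up for illustration.

```python
def segmentation_scores(true_words, test_words, deletions, substitutions):
    """Compute precision, recall, and F-measure from word-level counts."""
    # Assumption: a gold word is correct unless it was deleted or substituted.
    correct = true_words - deletions - substitutions
    recall = correct / float(true_words)
    precision = correct / float(test_words)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts copied from the SUMMARY block above.
p, r, f = segmentation_scores(true_words=104372, test_words=103877,
                              deletions=1974, substitutions=3638)
print("precision=%.3f recall=%.3f F=%.3f" % (p, r, f))
# prints: precision=0.951 recall=0.946 F=0.948
```

The same count of correct words (98760) also falls out of the test side: 103877 test words minus 1479 insertions and 3638 substitutions.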

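Finally, a word on the problem with this interface. Because it uses the Linux pipe (PIPE) mode to call the Stanford Chinese segmenter, which amounts to running the corresponding Java command from Python, every segmentation call loads the required model and dictionary again. This is not much of an issue when processing a file (segment_file), but it is a serious problem when segmenting individual sentences (segment): reloading the data on every call makes the interface essentially unusable in a real product. I did find that the Stanford segmenter offers a -readStdin option for reading from standard input, and I tried loading the files first through Python's subprocess and then feeding standard input with the communicate method, but that did not solve the problem: each call still runs once and then exits. This has puzzled me for a long time, and a lot of searching on Google has not turned up a solution, so anyone who knows how is welcome to tell me or to fix it. Thanks in advance. That's all!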
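This does not fix the underlying issue, but one workaround is to amortize the loading cost: collect many sentences, write them to one file with one sentence per line, and call segment_file once. The sketch below assumes the segmenter emits one output line per input line (the trailing newline in Out[4] suggests this, but it is not verified here) and that no sentence contains a newline. segment_batch and FakeSegmenter are hypothetical names; FakeSegmenter is only a stand-in so the sketch runs without Java.

```python
import io
import os
import tempfile


def segment_batch(segmenter, sentences, encoding="utf-8"):
    """Segment many sentences with a single segment_file call.

    Returns one list of words per output line.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # One sentence per line; closing the handle also closes fd.
        with io.open(fd, "w", encoding=encoding) as handle:
            handle.write(u"\n".join(sentences) + u"\n")
        output = segmenter.segment_file(path)
    finally:
        os.unlink(path)
    return [line.split() for line in output.splitlines()]


class FakeSegmenter(object):
    """Hypothetical stand-in for StanfordSegmenter: splits every character."""

    def segment_file(self, path):
        with io.open(path, encoding="utf-8") as handle:
            return u"".join(u" ".join(line.strip()) + u"\n" for line in handle)


if __name__ == "__main__":
    print(segment_batch(FakeSegmenter(), [u"ab", u"cde"]))
```

With a real StanfordSegmenter instance in place of FakeSegmenter, the model and dictionary would be loaded once per batch instead of once per sentence.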

Note: this is an original article. When reposting, please credit the source "我愛自然語言處理": www.

Permalink: http://www./python自然語言處理實踐-在nltk中使用斯坦福中文分詞器
