Chinese Full-Text Search in PostgreSQL

bubbi7 2017-02-14


http://my.oschina.net/Kenyon/blog/82305

http://www./_linux_/postgresql-bamboo-lucene-part2.html

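The previous post (http://my.oschina.net/Kenyon/blog/80904) introduced the PostgreSQL full-text search environment and some examples, all based on its built-in configurations. The current version does not support Chinese full-text search by default, but real-world use will certainly call for Chinese search. Fortunately, thanks to strong community support, Chinese full-text search in PG can be set up fairly simply with third-party tools.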

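Chinese full-text search in PG also takes three main steps:
1. Segment the Chinese text into tokens
2. Convert the tokens, dropping meaningless ones
3. Sort in a defined order and build an index to speed up queries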

I. Test environment and tools
VMWARE 6.0
PostgreSQL 9.1.2
CRF++-0.57, download: http://crfpp./svn/trunk/doc/index.html
nlpbamboo-1.1.2, download: http://code.google.com/p/nlpbamboo/downloads/list
index.tar.bz2, download: http://code.google.com/p/nlpbamboo/downloads/list

II. Deployment (as root)
1. Install CRF++ first

cd CRF++-0.57
./configure
make
make install

2. Install nlpbamboo

cd nlpbamboo
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=release
make all
make install

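3. Download the segmentation dictionary files
Download index.tar.bz2 and extract it to /opt/bamboo/index

4. Verify
After installation, check the default install locations to confirm that everything is in place. The main default paths are: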
/usr/lib
/usr/include/
/opt/bamboo/

[postgres@localhost ~]$ cd /usr/local/lib
[postgres@localhost lib]$ ll
total 788
-rw-r--r--. 1 root root 516882 Sep  3 19:57 libcrfpp.a
-rwxr-xr-x. 1 root root    952 Sep  3 19:57 
lrwxrwxrwx. 1 root root     17 Sep  3 19:57  -> .0.0.0
lrwxrwxrwx. 1 root root     17 Sep  3 19:57 .0 -> .0.0.0
-rwxr-xr-x. 1 root root 280760 Sep  3 19:57 .0.0.0

[postgres@localhost lib]$ cd /usr/lib
[postgres@localhost lib]$ ll lib*
-rw-r--r--. 1 root root 1027044 Sep  3 20:02 libbamboo.a
lrwxrwxrwx. 1 root root      14 Sep  3 20:03  -> .2
-rwxr-xr-x. 1 root root  250140 Sep  3 20:02 .2
lrwxrwxrwx. 1 root root      25 Sep  3 23:56 libcrfpp.a -> /usr/local/lib/libcrfpp.a
lrwxrwxrwx. 1 root root      26 Sep  3 23:56  -> /usr/local/lib/
lrwxrwxrwx. 1 root root      28 Sep  3 23:56 .0 -> /usr/local/lib/.0

[postgres@localhost bamboo]$ cd /opt/bamboo/
[postgres@localhost bamboo]$ ll
total 17412
drwxr-xr-x. 2 postgres postgres     4096 Sep  3 20:03 bin
drwxr-xr-x. 2 postgres postgres     4096 Aug 15 01:52 etc
drwxr-xr-x. 4 postgres postgres     4096 Aug 15 01:52 exts
drwxr-sr-x. 2 postgres postgres     4096 Apr  1  2009 index
-rw-r--r--. 1 postgres postgres 17804377 Sep  3 23:52 index.tar.bz2
drwxr-xr-x. 2 postgres postgres     4096 Sep  3 20:03 processor
drwxr-xr-x. 2 postgres postgres     4096 Aug 15 01:52 template

5. Edit the Chinese stop-word list
The stop-word list keeps meaningless words such as 'a', '的', '得' out of the search results.

[postgres@localhost tsearch_data]$ touch /home/postgres/share/tsearch_data/chinese_utf8.stop

[postgres@localhost tsearch_data]$ pwd
/home/postgres/share/tsearch_data
[postgres@localhost tsearch_data]$ more chinese_utf8.stop 
的
我
我們

6. Build the PostgreSQL extensions

cd /opt/bamboo/exts/postgres/pg_tokenize
make
make install
cd /opt/bamboo/exts/postgres/chinese_parser
make
make install

7. Import the tokenizer function and the parser module

[postgres@localhost ~]$ psql
postgres=# \i /home/postgres/share/contrib/pg_tokenize.sql
SET
CREATE FUNCTION
postgres=#  \i /home/postgres/share/contrib/chinese_parser.sql
SET
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE TEXT SEARCH PARSER
CREATE TEXT SEARCH CONFIGURATION
CREATE TEXT SEARCH DICTIONARY
ALTER TEXT SEARCH CONFIGURATION

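8. Graphical view after installation (screenshot not preserved)

9. Installing Chinese segmentation in other databases
If one machine hosts several databases, it is enough to import the tokenizer function and the parser scripts into each new database.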
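
As a sketch of the step above, the two scripts from step 7 could be loaded into a second database from within psql. The database name `another_db` is hypothetical, and the script paths are simply the ones used in step 7, so they may differ on other installations. The final query reads the `pg_ts_config` catalog, where `chinesecfg` should appear once the scripts have run.

```sql
-- another_db is a placeholder name for the additional database
\c another_db
\i /home/postgres/share/contrib/pg_tokenize.sql
\i /home/postgres/share/contrib/chinese_parser.sql

-- List the text search configurations visible in this database
SELECT cfgname FROM pg_ts_config ORDER BY cfgname;
```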

III. Usage
1. Test Chinese segmentation with tokenize

[postgres@localhost ~]$ psql -p 5432
psql (9.1.2)
Type "help" for help.

postgres=# select tokenize('中文詞分浙江人民海量的人');
            tokenize            
---------------------------------
中文詞 分 浙江 人民 海量 的 人
(1 row)

postgres=# SELECT to_tsvector('chinesecfg', '我愛北京天安門');
            to_tsvector           
-----------------------------------
'北京':3 '天安門':4 '我':1 '愛':2
(1 row)

postgres=# select tokenize('南京市長江大橋');
     tokenize     
-------------------
南京市 長江 大橋
(1 row)

postgres=# select tokenize('南京市長');
  tokenize 
------------
南京 市長
(1 row)

The segmentation is fairly good. Most notably, 南京市長江大橋 was not split into something like 南京, 市長, 江大橋.
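
To look at more than the final token list, PostgreSQL's built-in `ts_debug` function shows, for each token, which dictionaries were consulted and which lexemes resulted. The query below is a minimal sketch that assumes the `chinesecfg` configuration from step 7 is installed. Its output depends on the installation and is not shown here.

```sql
-- Inspect how each token of the sample sentence is processed
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('chinesecfg', '我愛北京天安門');
```

A token dropped as a stop word appears with an empty lexeme array. This is one way to check whether chinese_utf8.stop is being picked up. In the to_tsvector output above, 我 is still indexed even though it is listed in that file.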

2. Take an ordinary test table and add a tsvector column to hold the segmented data

ALTER TABLE t_store_adv add column index_col_ts tsvector;
UPDATE t_store_adv SET index_col_ts =
to_tsvector('chinesecfg', coalesce(adv_title,'') || ' ' || coalesce(adv_content,''));
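
The UPDATE above fills the column only once, so rows inserted or changed later would not be reflected in it. One possible way to keep it current is the built-in `tsvector_update_trigger` function, sketched here. The trigger name is made up for illustration. The schema-qualified name `public.chinesecfg` is an assumption about where the configuration was created, so adjust it if the configuration lives in another schema. The function expects the source columns to be of type text, varchar or char.

```sql
-- Hypothetical trigger name; public.chinesecfg is an assumed schema-qualified name
CREATE TRIGGER t_store_adv_tsv_update
BEFORE INSERT OR UPDATE ON t_store_adv
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(index_col_ts, 'public.chinesecfg', adv_title, adv_content);
```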

3. Create the index

CREATE INDEX t_store_adv_idx ON t_store_adv USING gin(index_col_ts);

4. Query

[postgres@localhost ~]$ psql  -p 5432
psql (9.1.2)
Type "help" for help.

postgres=# select count(1) from t_store_adv;
 count 
-------
  38803
(1 row)

postgres=# SELECT count(1) FROM t_store_adv WHERE index_col_ts @@ to_tsquery('南京');
 count 
-------
    16
(1 row)

postgres=# explain SELECT count(1) FROM t_store_adv WHERE index_col_ts @@ to_tsquery('南京');
                                      QUERY PLAN                                      
--------------------------------------------------------------------------------------
 Aggregate  (cost=108.61..108.62 rows=1 width=0)
   ->  Bitmap Heap Scan on t_store_adv  (cost=12.21..108.55 rows=27 width=0)
         Recheck Cond: (index_col_ts @@ to_tsquery('南京'::text))
         ->  Bitmap Index Scan on t_store_adv_idx  (cost=0.00..12.21 rows=27 width=0)
               Index Cond: (index_col_ts @@ to_tsquery('南京'::text))
(5 rows)

-- plain text search with LIKE
postgres=# select count(1) from t_store_adv where (adv_content like '%南京%' or adv_title like '%南京%');
 count 
-------
    17
(1 row)

postgres=# explain select count(1) from t_store_adv where (adv_content like '%南京%' or adv_title like '%南京%');
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=1348.05..1348.06 rows=1 width=0)
   ->  Seq Scan on t_store_adv  (cost=0.00..1348.05 rows=1 width=0)
         Filter: (((adv_content)::text ~~ '%南京%'::text) OR ((adv_title)::text ~~ '%南京%'::text))
(3 rows)

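The data set in this test is not large, but the execution plans already give a hint: the full-text query is far cheaper (estimated total cost 108.62 versus 1348.06 for the sequential scan), although it does use somewhat more storage. The two counts differ (16 versus 17) because LIKE matches any substring while the full-text query matches whole tokens; a row in which 南京 occurs only inside a longer token such as 南京市 would presumably be found by LIKE but not by to_tsquery('南京'). With larger data sets the index-based search is also much more efficient; for a concrete example see: http://www.oschina.net/question/96003_19020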
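
The queries above only count matches, and the single-argument `to_tsquery` relies on whatever `default_text_search_config` is set to. As a sketch, the matching rows could instead be returned in relevance order with the built-in `ts_rank` function, passing the configuration explicitly. The LIMIT of 10 is an arbitrary choice for illustration.

```sql
-- Return the best-matching rows first, using the Chinese configuration explicitly
SELECT adv_title,
       ts_rank(index_col_ts, q) AS rank
FROM t_store_adv,
     to_tsquery('chinesecfg', '南京') AS q
WHERE index_col_ts @@ q
ORDER BY rank DESC
LIMIT 10;
```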

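IV. Summary
The examples leave out the trigger that keeps the tsvector column up to date. Chinese full-text search can substantially speed up Chinese text queries. It is not built in yet, so third-party tools have to be installed by hand. Several segmentation options exist, so pick whichever fits best.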

V. References
http://www.cnblogs.com/shuaixf/archive/2011/09/10/2173260.html
http://www./_linux_/postgresql-bamboo-lucene-part2.html
