[Original] Study notes on the Lucene full-text search engine - "Programming and Design" board - 青韶論壇湘...

chanvy 2008-12-12

Source: http://www./bbs/viewthread.php?tid=810119

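I downloaded the latest release, Lucene 2.0.0, from Apache and started learning Lucene:

First, set up the runtime environment: JDK, Tomcat, and the downloaded Lucene. (The Lucene documentation says to download Ant and JavaCC as well. Ant is used to build Lucene, but the downloaded Lucene package is already built, and JavaCC is optional.)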

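Next, test the demo that ships with Lucene:
Following the documentation, add the Lucene jars lucene-core-2.0.0.jar and lucene-demo-2.0.0.jar to the classpath. For convenience I let Eclipse do this instead: create a new project and import the two jars. Then run IndexHTML from the demo directly. As I understand it, this class builds an index of HTML files; the same package also contains IndexFiles, which I guess builds an index of ordinary files.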
Usage: IndexHTML [-create] [-index <index>] <root_directory>
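-index <index> is the folder where the index is stored; root_directory is the directory of files to be indexed.
Here I used the Lucene API HTML files as root_directory and created an index directory to hold the index.
Run IndexHTML. If it succeeds, three files appear under the index directory: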
segments
deletable
_?.cfs

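Once the index files are built, you can start querying.
To use the JSP application bundled with Lucene, put Luceneweb.war into the tomcat\webapps directory, restart Tomcat, and set the indexLocation parameter in configuration.jsp to the index directory created above.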

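The JSP application bundled with Lucene has an error; my guess is that Apache forgot to update the demo when Lucene itself was updated. results.jsp contains the line Query query = QueryParser.parse(...), which fails when the page runs because the static parse method is obsolete. The fix is to create a QueryParser instance and call its parse method: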
QueryParser qp = new QueryParser("contents", analyzer);
query = qp.parse(queryString);

After that, the web application can be run in a browser.

You can also check the index with a standalone program:
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class Search {
    public static void main(String[] args) throws Exception {
        String indexPath = args[0], queryString = args[1];
        // Searcher pointing at the index directory
        Searcher searcher = new IndexSearcher(indexPath);
        // Query parser: use the same analyzer that was used to build the index
        QueryParser qp = new QueryParser("contents", new SimpleAnalyzer());
        Query query = qp.parse(queryString);
        // Search results are held in a Hits object
        Hits hits = searcher.search(query);
        // Through hits you can read the stored fields and the score of each match
        System.out.println(hits.length());
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
        }
        searcher.close();
    }
}

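Some thoughts on search engines, Lucene, and the Lucene API:
Searching is done on top of an index that has already been built. Database indexes are not suited to full-text indexing (very costly, with poor results), which is why full-text search engines such as Lucene exist. If you need full-text search over database content, for example post bodies stored in a database, you can first have the database produce index files, then let the search engine build its own index on top of those files, and search that.
Lucene:
To use Lucene as a search engine, you first have to build the index files. For the details I still need to read how IndexHTML is implemented. To search on top of the index, first create a Searcher pointing at the index directory, then create a QueryParser, passing it the field to search and the Analyzer. Call QueryParser's parse method to get a Query instance and pass that instance to the Searcher's search() method, which returns a result set, Hits. After that it is just a matter of iterating over the result set.
Note that the Hits object has a score() method, which returns a weight for how well that result matches the query. Results can be ranked by this weight to get the best matches first; as far as I can tell, search(query) already returns Hits in that order.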
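
To make the "build an index first, then search it and rank by weight" idea concrete, here is a minimal sketch in plain Java. It is not how Lucene is implemented: TinyIndex and its methods are hypothetical names, the "analyzer" is just a lowercase split on non-word characters, and the score is simply the number of times the term occurs in a document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TinyIndex {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Indexing step: split the text into lowercase terms and record where they occur.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Search step: look the term up in the index and return matching document ids,
    // highest occurrence count first (a crude stand-in for a relevance score).
    public List<Integer> search(String term) {
        Map<Integer, Integer> hits =
                postings.getOrDefault(term.toLowerCase(), new HashMap<>());
        List<Integer> ids = new ArrayList<>(hits.keySet());
        ids.sort((a, b) -> {
            int byCount = hits.get(b) - hits.get(a);
            return byCount != 0 ? byCount : a - b; // ties: lower id first
        });
        return ids;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "Search, search and search again");
        index.add(3, "Nothing relevant here");
        System.out.println(index.search("search")); // prints [2, 1]
    }
}
```

The point of the sketch is only that the query never scans the documents themselves: it reads the term's entry in the index, which is why the index has to exist before any search can run.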

大餅先生 2006-9-8 14:50

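Chinese search in Lucene

It turns out to be simple: Lucene 2.0.0 provides Chinese search out of the box.
Add Lucene's analyzers extension package; it contains ChineseAnalyzer and CJKAnalyzer, which segment Chinese text directly.
When building the index, IndexWriter writer = new IndexWriter(INDEX_DIR, new ChineseAnalyzer(), true) creates a Lucene index that supports Chinese search.
To search a Chinese index, set the analyzer argument of the QueryParser constructor to ChineseAnalyzer, and convert the query string to "GBK" accordingly.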
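
A rough illustration of what "segmenting Chinese text" can mean, in plain Java. To my understanding ChineseAnalyzer emits one token per Chinese character while CJKAnalyzer emits overlapping two-character pairs; that is an assumption about their behavior, and the class and method names below are hypothetical, not taken from the analyzers package. The sketch ignores punctuation, Latin text, and characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkSplitSketch {
    // One token per character.
    public static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(text.substring(i, i + 1));
        }
        return out;
    }

    // Overlapping two-character tokens; a single character is kept as it is.
    public static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        if (text.length() == 1) {
            out.add(text);
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(unigrams("全文搜索")); // [全, 文, 搜, 索]
        System.out.println(bigrams("全文搜索"));  // [全文, 文搜, 搜索]
    }
}
```

Either way, the query has to be split with the same rule as the indexed text, which is why the same analyzer is passed to both IndexWriter and QueryParser.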

大餅先生 2006-9-9 12:42

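When Chinese is submitted from an HTML page, the parameter comes back decoded as ISO8859-1!
So the encoding has to be converted before querying:
queryString = new String(request.getParameter("query").getBytes("iso8859_1"),"GB2312");
Also set the analyzer argument of the QueryParser constructor to ChineseAnalyzer.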
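
Why the conversion above works can be checked without a servlet container. The sketch below assumes the browser sent GB2312 bytes and the container decoded them as ISO-8859-1; RecodeSketch and recode are hypothetical names. Because ISO-8859-1 maps every byte to exactly one character, the original bytes can be recovered and decoded again with the right charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RecodeSketch {
    // Undo a wrong ISO-8859-1 decode: recover the raw bytes,
    // then decode them with the charset the sender actually used.
    public static String recode(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        // Assumption: the page is submitted as GB2312, as in the post above.
        Charset gb2312 = Charset.forName("GB2312");
        String typed = "搜索";
        // Simulate the container decoding the submitted bytes as ISO-8859-1.
        String received = new String(typed.getBytes(gb2312), StandardCharsets.ISO_8859_1);
        System.out.println(received.equals(typed));                 // false
        System.out.println(recode(received, gb2312).equals(typed)); // true
    }
}
```

The trick only works when the wrong decode was ISO-8859-1; a wrong decode through a charset that drops or replaces bytes cannot be undone this way.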

大餅先生 2006-9-10 15:51

How to highlight keywords

Lucene includes a highlight package for keyword highlighting and similar features. Usage:
Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<font color=red><B>","</B></font>"), new QueryScorer(query));
................

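// doc.get only returns text if the "contents" field was stored in the index
String text = doc.get("contents");
// The first argument of tokenStream is the field name
TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(text));
// Pick the fragments that best match the query: at most 3, joined with "...."
String result = highlighter.getBestFragments(tokenStream, text, 3, "....");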

Printing result gives:
the three best-matching fragments of that result, separated by "....", with the terms of the query you entered highlighted inside the fragments!
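
For anyone curious what the "wrap the matched term in tags" step amounts to, here is a minimal sketch in plain Java. It is not the Highlighter's code: it does no analysis, scoring, or fragment selection, it only wraps literal, case-insensitive occurrences of one term. HighlightSketch and highlight are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HighlightSketch {
    // Wrap every case-insensitive occurrence of term in text with pre/post markup.
    public static String highlight(String text, String term, String pre, String post) {
        if (term.isEmpty()) {
            return text; // nothing to highlight
        }
        Pattern p = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the markup or match literal
            m.appendReplacement(sb, Matcher.quoteReplacement(pre + m.group() + post));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String out = highlight("Lucene is a search library", "search",
                "<font color=red><B>", "</B></font>");
        System.out.println(out);
        // Lucene is a <font color=red><B>search</B></font> library
    }
}
```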

大餅先生 2007-3-21 14:28

A simple application; it builds and runs correctly with JDK 1.5 and Lucene 2.0.
There are 3 files in total:
Constants.java holds the constants
LuceneIndex.java builds the index
LuceneSearch.java runs the search

package testlucene;

public class Constants {
    // Directory containing the files to be indexed
    public final static String INDEX_FILE_PATH = "c:\\test";

    // Directory where the index is stored
    public final static String INDEX_STORE_PATH = "c:\\index";
}

package testlucene;

import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;

public class LuceneIndex {
    private IndexWriter writer = null;

    public LuceneIndex() {
        try {
            // true: create a new index, replacing any existing one
            writer = new IndexWriter(Constants.INDEX_STORE_PATH,
                    new StandardAnalyzer(), true);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Build a Document for one file: "contents" is indexed from a Reader
    // (not stored), "path" is stored so it can be shown in the results.
    private Document getDocument(File f) throws Exception {
        Document doc = new Document();
        FileInputStream is = new FileInputStream(f);
        // Note: the file is decoded with the platform default charset
        Reader reader = new BufferedReader(new InputStreamReader(is));
        doc.add(new Field("contents", reader));
        doc.add(new Field("path", f.getAbsolutePath(), Field.Store.YES, Field.Index.TOKENIZED));
        return doc;
    }

    // Add every regular file directly under INDEX_FILE_PATH to the index
    public void writeToIndex() throws Exception {
        File folder = new File(Constants.INDEX_FILE_PATH);
        if (folder.isDirectory()) {
            String[] files = folder.list();
            for (int i = 0; i < files.length; i++) {
                File file = new File(folder, files[i]);
                if (!file.isFile()) {
                    continue; // skip subdirectories; FileInputStream cannot open them
                }
                Document doc = getDocument(file);
                System.out.println("Indexing: " + file + " ");
                writer.addDocument(doc);
            }
        }
    }

    public void close() throws Exception {
        writer.close();
    }

    public static void main(String[] args) throws Exception {
        LuceneIndex indexer = new LuceneIndex();
        Date start = new Date();
        indexer.writeToIndex();
        Date end = new Date();
        System.out.println("Indexing took " + (end.getTime() - start.getTime()) + " ms");
        indexer.close();
    }
}

package testlucene;

import java.util.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;

public class LuceneSearch {
    private IndexSearcher searcher = null;
    private Query query = null;

    public LuceneSearch() {
        try {
            searcher = new IndexSearcher(IndexReader.open(Constants.INDEX_STORE_PATH));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Run the keyword against the "contents" field; returns null if the search fails
    public final Hits search(String keyword) {
        System.out.println("Searching for keyword " + keyword);
        try {
            // Same analyzer as the one used in LuceneIndex
            query = new QueryParser("contents", new StandardAnalyzer()).parse(keyword);
            Date start = new Date();
            Hits hits = searcher.search(query);
            Date end = new Date();
            System.out.println("Search finished in " + (end.getTime() - start.getTime()) + " ms");
            return hits;
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    public void printResult(Hits h) {
        // h is null when search() failed
        if (h == null || h.length() == 0) {
            System.out.println("Sorry, no results were found.");
        } else {
            for (int i = 0; i < h.length(); i++) {
                try {
                    Document doc = h.doc(i);
                    System.out.print("Result " + (i + 1) + ", file: ");
                    System.out.println(doc.get("path"));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
        System.out.println("---------------------------");
    }

    public static void main(String[] args) throws Exception {
        LuceneSearch test = new LuceneSearch();
        Hits h = null;
        h = test.search("測試");
        test.printResult(h);

        h = test.search("搜索");
        test.printResult(h);

        h = test.search("引擎");
        test.printResult(h);
    }
}