Questions about Lucene (4): Four Ways to Influence How Lucene Scores Documents

CevenCheng 2012-05-13


Set Document Boost and Field Boost at index time; they are stored in the (.nrm) file.

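If you want certain documents or certain fields to matter more than others, so that they score higher when they contain the query term, you can set a document boost and a field boost at index time.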

These values are written into the index at index time and stored in the normalization factor (.nrm) file. Once set, they cannot be changed unless the document is deleted.

If they are not set, Document Boost and Field Boost both default to 1.

Document Boost and Field Boost are set as follows:

Document doc = new Document();

Field f = new Field("contents", "hello world", Field.Store.NO, Field.Index.ANALYZED);

f.setBoost(100);

doc.add(f);

doc.setBoost(100);

How do the two affect Lucene's document scoring?

Let us first look at Lucene's scoring formula:

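$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)$$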

Document Boost and Field Boost affect norm(t, d), whose formula is:

$$\mathrm{norm}(t,d) = doc.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(field) \cdot \prod_{\text{field } f \in d \text{ named as } t} f.\mathrm{getBoost}()$$

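It has three components:

Document boost: the larger this value, the more important the document.
Field boost: the larger this value, the more important the field.
lengthNorm(field) = (1.0 / Math.sqrt(numTerms)): the more terms a field contains, i.e. the longer the document, the smaller this value; the shorter the document, the larger it is.

The third component can be changed through a custom Similarity, as discussed below.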
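The norm formula can be tried out in isolation. The sketch below is plain Java with no Lucene dependency. The class `NormSketch` and its method names are hypothetical, and `encode`/`decode` are an illustrative re-implementation of a one-byte float format (3 mantissa bits, 5 exponent bits) of the kind used for norms, not Lucene's own code. It shows what happens to a three-term field with and without a document boost of 100 once the norm is squeezed into a single byte.

```java
final class NormSketch {
    /** Default length normalization: 1 / sqrt(number of terms in the field). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** norm(t,d) = document boost * field boost * lengthNorm, for a single field instance. */
    static float norm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    /** Assumed encoding: keep the exponent and the top 3 mantissa bits of a positive float. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;             // drop the low 21 mantissa bits
        int zero = (63 - 15) << 3;          // offset of the smallest representable exponent
        if (small <= zero) {
            return (byte) (bits <= 0 ? 0 : 1);  // zero, negative, or underflow
        }
        if (small >= zero + 0x100) {
            return (byte) -1;                   // overflow: clamp to the largest value
        }
        return (byte) (small - zero);
    }

    /** Inverse of encode: rebuild a float from the stored byte. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float plain = norm(1f, 1f, 3);      // three terms, no boost
        float boosted = norm(100f, 1f, 3);  // same field, document boost 100
        System.out.println(plain + " is stored as " + decode(encode(plain)));
        System.out.println(boosted + " is stored as " + decode(encode(boosted)));
    }
}
```

Under this assumed encoding the two norms decode to 0.5 and 56.0 rather than 0.577 and 57.7. That is consistent with experiment 1 in this article, where the boosted score 39.889805 is about 56 times the unboosted score 0.71231794.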

Alternatively, when adding a Field you can specify Field.Index.ANALYZED_NO_NORMS or Field.Index.NOT_ANALYZED_NO_NORMS to omit norms entirely and save space.

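According to Lucene's documentation: "No norms means that index-time field and document boosting and field length normalization are disabled. The benefit is less memory usage as norms take up one byte of RAM per indexed field for every document in the index, during searching. Note that once you index a given field with norms enabled, disabling norms will have no effect."

In other words, without norms there is no index-time document boost, field boost, or length normalization, and in exchange the searcher no longer needs one byte of memory per field per document. Norms must be disabled consistently, however: as soon as one document indexes a field with norms enabled, the documents that disabled norms for it still get a default norm value stored. The reason is lookup speed. Lucene locates a document's norm by multiplying the document number by the size of each document's norms entry, so if one document's entry were missing, the offset could not be computed. Norms are therefore stored either for all documents or for none.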
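The addressing argument can be made concrete with a small sketch. It is plain Java, not Lucene's file reader; `NormsLayoutSketch`, `offsetOf`, and `totalBytes` are hypothetical names, and the one-byte entry size is taken from the quoted documentation.

```java
final class NormsLayoutSketch {
    /** Offset of a document's norm inside a block of fixed-width entries. */
    static long offsetOf(int docId, int bytesPerDoc) {
        return (long) docId * bytesPerDoc;
    }

    /** Bytes needed when every document must own a slot, whether it uses norms or not. */
    static long totalBytes(int maxDoc, int bytesPerDoc) {
        return (long) maxDoc * bytesPerDoc;
    }

    public static void main(String[] args) {
        int maxDoc = 10001;   // one document with norms plus 10000 without
        int bytesPerDoc = 1;  // one byte per field per document
        System.out.println("offset of doc 7: " + offsetOf(7, bytesPerDoc));
        System.out.println("block size: " + totalBytes(maxDoc, bytesPerDoc) + " bytes");
    }
}
```

Because the offset is a pure function of the document number, no slot can be skipped. With 10001 documents and one byte each the block is 10001 bytes, in line with the roughly 10K .nrm file reported in experiment 4.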

The following experiments verify the effect of norms:

Experiment 1: the effect of Document Boost

public void testNormsDocBoost() throws Exception {
File indexDir = new File("testNormsDocBoost");
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
writer.setUseCompoundFile(false);
Document doc1 = new Document();
Field f1 = new Field("contents", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
doc1.add(f1);
doc1.setBoost(100);
writer.addDocument(doc1);
Document doc2 = new Document();
Field f2 = new Field("contents", "common common hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
doc2.add(f2);
writer.addDocument(doc2);
Document doc3 = new Document();
Field f3 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
doc3.add(f3);
writer.addDocument(doc3);
writer.close();

IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(new TermQuery(new Term("contents", "common")), 10);
for (ScoreDoc doc : docs.scoreDocs) {
System.out.println("docid : " + doc.doc + " score : " + doc.score);
}
}

If field f1 of the first document is also set to Field.Index.ANALYZED_NO_NORMS, the ranking is:

docid : 2 score : 1.2337708
docid : 1 score : 1.0073696
docid : 0 score : 0.71231794

If field f1 of the first document is set to Field.Index.ANALYZED, the ranking is:

docid : 0 score : 39.889805
docid : 2 score : 0.6168854
docid : 1 score : 0.5036848

Experiment 2: the effect of Field Boost

If we consider title more important than contents, we can set things up as follows.

public void testNormsFieldBoost() throws Exception {
File indexDir = new File("testNormsFieldBoost");
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
writer.setUseCompoundFile(false);
Document doc1 = new Document();
Field f1 = new Field("title", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
f1.setBoost(100);
doc1.add(f1);
writer.addDocument(doc1);
Document doc2 = new Document();
Field f2 = new Field("contents", "common common hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
doc2.add(f2);
writer.addDocument(doc2);
writer.close();

IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", new StandardAnalyzer(Version.LUCENE_CURRENT));
Query query = parser.parse("title:common contents:common");
TopDocs docs = searcher.search(query, 10);
for (ScoreDoc doc : docs.scoreDocs) {
System.out.println("docid : " + doc.doc + " score : " + doc.score);
}
}

If field f1 of the first document is also set to Field.Index.ANALYZED_NO_NORMS, the ranking is:

docid : 1 score : 0.49999997
docid : 0 score : 0.35355338

If field f1 of the first document is set to Field.Index.ANALYZED, the ranking is:

docid : 0 score : 19.79899
docid : 1 score : 0.49999997

Experiment 3: the effect of document length in norms on scoring

public void testNormsLength() throws Exception {
File indexDir = new File("testNormsLength");
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
writer.setUseCompoundFile(false);
Document doc1 = new Document();
Field f1 = new Field("contents", "common hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
doc1.add(f1);
writer.addDocument(doc1);
Document doc2 = new Document();
Field f2 = new Field("contents", "common common hello hello hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
doc2.add(f2);
writer.addDocument(doc2);
writer.close();

IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", new StandardAnalyzer(Version.LUCENE_CURRENT));
Query query = parser.parse("title:common contents:common");
TopDocs docs = searcher.search(query, 10);
for (ScoreDoc doc : docs.scoreDocs) {
System.out.println("docid : " + doc.doc + " score : " + doc.score);
}
}

With norms disabled, the second document, which contains common twice, scores higher:

docid : 1 score : 0.13928263
docid : 0 score : 0.09848769

With norms enabled, the second document scores lower because it is longer, even though it contains common twice:

docid : 0 score : 0.09848769
docid : 1 score : 0.052230984

Experiment 4: norms are stored either for all documents or for none

public void testOmitNorms() throws Exception {
File indexDir = new File("testOmitNorms");
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
writer.setUseCompoundFile(false);
Document doc1 = new Document();
Field f1 = new Field("title", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
doc1.add(f1);
writer.addDocument(doc1);
for (int i = 0; i < 10000; i++) {
 Document doc2 = new Document();
 Field f2 = new Field("contents", "common common hello hello hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
 doc2.add(f2);
 writer.addDocument(doc2);
}
writer.close();
}

When we add 10001 documents and all of them use Field.Index.ANALYZED_NO_NORMS, the .nrm file in the index is only 1K: apart from some format information it holds no data.

When the first document uses Field.Index.ANALYZED and the other 10000 use Field.Index.ANALYZED_NO_NORMS, the .nrm file grows to 10K, which means norms are stored for every document and not just for the first.

Set Query Boost in the query string.

At search time we can indicate that certain terms matter more to us by setting a boost on the term:

common^4 hello

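This makes documents containing common score higher than documents containing hello.

Because a Term in Lucene is defined as Field:Term, the same mechanism can weight different fields differently: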

title:common^4 content:common

This makes documents whose title contains common score higher than documents whose content contains common.

Example:

public void testQueryBoost() throws Exception {
File indexDir = new File("TestQueryBoost");
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
Document doc1 = new Document();
Field f1 = new Field("contents", "common1 hello hello", Field.Store.NO, Field.Index.ANALYZED);
doc1.add(f1);
writer.addDocument(doc1);
Document doc2 = new Document();
Field f2 = new Field("contents", "common2 common2 hello", Field.Store.NO, Field.Index.ANALYZED);
doc2.add(f2);
writer.addDocument(doc2);
writer.close();

IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", new StandardAnalyzer(Version.LUCENE_CURRENT));
Query query = parser.parse("common1 common2");
TopDocs docs = searcher.search(query, 10);
for (ScoreDoc doc : docs.scoreDocs) {
System.out.println("docid : " + doc.doc + " score : " + doc.score);
}
}

By tf/idf, the second document, which contains common2 twice, scores higher:

docid : 1 score : 0.24999999
docid : 0 score : 0.17677669

If the query string is "common1^100 common2", the first document scores higher:

docid : 0 score : 0.2499875
docid : 1 score : 0.0035353568

So how does Query Boost affect the score?

In Lucene's scoring formula it enters through the t.getBoost() factor:

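$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)$$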

Note: queryNorm also involves q.getBoost(), but only as part of normalizing the query vector (see "The Vector Space Model and Lucene's Scoring Mechanism" [http://forfuture1978./blog/588721]).
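The four scores printed in the Query Boost example can be reproduced by hand from the formula. The sketch below is plain Java without Lucene; `QueryBoostSketch` and its methods are hypothetical names. It assumes that queryNorm is 1 / sqrt(sum of (idf · boost)²), that both terms have idf = ln(2/2) + 1 = 1 in the two-document index, and that the stored norm of a three-term field decodes to 0.5 after one-byte quantization.

```java
final class QueryBoostSketch {
    /** Assumed query normalization: 1 / sqrt(sum over query terms of (idf * boost)^2). */
    static double queryNorm(double[] idf, double[] boost) {
        double sum = 0.0;
        for (int i = 0; i < idf.length; i++) {
            double weight = idf[i] * boost[i];
            sum += weight * weight;
        }
        return 1.0 / Math.sqrt(sum);
    }

    /** Score of a document that matches exactly one query term (index `matched`). */
    static double score(int matched, int freq, double norm, double[] idf, double[] boost) {
        double coord = 1.0 / idf.length;  // one matching term out of all query terms
        double tf = Math.sqrt(freq);      // default tf
        return coord * queryNorm(idf, boost)
                * tf * idf[matched] * idf[matched] * boost[matched] * norm;
    }

    public static void main(String[] args) {
        double[] idf = {1.0, 1.0};   // common1 and common2 each occur in 1 of 2 documents
        double norm = 0.5;           // assumed decoded norm of a three-term field
        double[] plain = {1.0, 1.0};
        double[] boosted = {100.0, 1.0};
        System.out.println("common1 common2      doc0: " + score(0, 1, norm, idf, plain));
        System.out.println("common1 common2      doc1: " + score(1, 2, norm, idf, plain));
        System.out.println("common1^100 common2  doc0: " + score(0, 1, norm, idf, boosted));
        System.out.println("common1^100 common2  doc1: " + score(1, 2, norm, idf, boosted));
    }
}
```

The boost of 100 inflates the sum inside queryNorm, which is why the boosted document ends up near 0.25 instead of near 25, while the unboosted term is pushed down to about 0.0035. The ranking flips even though no absolute score grows by a factor of 100.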

Extend Similarity with your own implementation

Similarity is the central class for score computation in Lucene. Overriding its methods lets you intervene in the scoring process.

(1) float computeNorm(String field, FieldInvertState state)

(2) float lengthNorm(String fieldName, int numTokens)

(3) float queryNorm(float sumOfSquaredWeights)

(4) float tf(float freq)

(5) float idf(int docFreq, int numDocs)

(6) float coord(int overlap, int maxOverlap)

(7) float scorePayload(int docId, String fieldName, int start, int end, byte [] payload, int offset, int length)

They affect the following parts of Lucene's score computation:

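$$\mathrm{score}(q,d) = (6)\,\mathrm{coord}(q,d) \cdot (3)\,\mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( (4)\,\mathrm{tf}(t \in d) \cdot (5)\,\mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot (1)\,\mathrm{norm}(t,d) \Big)$$

$$\mathrm{norm}(t,d) = doc.\mathrm{getBoost}() \cdot (2)\,\mathrm{lengthNorm}(field) \cdot \prod_{\text{field } f \in d \text{ named as } t} f.\mathrm{getBoost}()$$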

Each is explained below:

(1) float computeNorm(String field, FieldInvertState state)

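Affects the computation of the normalization factor. As noted above, it has three parts: document boost, field boost, and document length normalization. This function generally follows the norm(t, d) formula given above.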

(2) float lengthNorm(String fieldName, int numTokens)

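Computes the document length normalization. The default is 1.0 / Math.sqrt(numTerms).

Documents in an index differ in length. For any given term, tf tends to be much larger in a long document and the score correspondingly higher, which is unfair to short documents. To take an extreme example: in a tome of 10 million words, the word "lucene" appears 11 times, while in a short document of 12 words it appears 10 times. If length is ignored, the tome scores higher, yet the short document is clearly the one that is really about "lucene".

This is why the score is divided by the square root of the document length here: it reduces the unfairness caused by length.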
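The extreme example can be worked through numerically with the two default formulas named in this article, tf = sqrt(freq) and lengthNorm = 1 / sqrt(numTerms). The class `LengthNormExample` and its `weight` helper are hypothetical names; idf, boosts, and norm quantization are deliberately left out.

```java
final class LengthNormExample {
    /** Default term frequency factor. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Default length normalization. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** The part of the score that depends on frequency and length only. */
    static double weight(int freq, int numTerms) {
        return tf(freq) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Tome: 11 occurrences in 10 million words. Short document: 10 occurrences in 12 words.
        System.out.println("tf only      tome=" + tf(11) + " short=" + tf(10));
        System.out.println("tf * length  tome=" + weight(11, 10000000) + " short=" + weight(10, 12));
    }
}
```

On tf alone the tome wins narrowly (sqrt(11) against sqrt(10)). Once the length factor is applied, the short document's weight is several hundred times larger than the tome's.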

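However, this formula is biased towards returning short documents first, which in practice can make search results look poor.

So in practice the lengthNorm formula should be rewritten to suit the project and the search domain. Suppose, for example, that I am building a search system for economics papers and, after some research, find that most economics papers are 8000 to 10000 words long. The lengthNorm formula should then be shaped like an inverted parabola: papers of 8000 to 10000 words get the highest value, and shorter or longer ones get lower values, so that users are given the best results.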
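One possible shape for such a function is sketched below. Everything in it is an assumption made for illustration: the class name `PaperLengthNorm`, the flat top between 8000 and 10000 tokens, the quadratic falloff, and the floor of 0.1 are not from Lucene or from the original article.

```java
final class PaperLengthNorm {
    static final int LOW = 8000;      // lower edge of the preferred length range
    static final int HIGH = 10000;    // upper edge of the preferred length range
    static final float FLOOR = 0.1f;  // never let the factor reach zero

    /** Returns 1.0 inside [LOW, HIGH] and falls off quadratically outside it. */
    static float lengthNorm(int numTokens) {
        if (numTokens >= LOW && numTokens <= HIGH) {
            return 1.0f;
        }
        // Relative distance from the nearest edge of the preferred range
        float distance = numTokens < LOW
                ? (LOW - numTokens) / (float) LOW
                : (numTokens - HIGH) / (float) HIGH;
        return Math.max(FLOOR, 1.0f - distance * distance);
    }

    public static void main(String[] args) {
        int[] lengths = {500, 4000, 9000, 15000, 100000};
        for (int length : lengths) {
            System.out.println(length + " tokens -> " + lengthNorm(length));
        }
    }
}
```

In Lucene 3.0 a function like this could be returned from an override of lengthNorm(String fieldName, int numTokens) in a Similarity subclass. Since norms are stored in one byte, nearby values may collapse to the same stored number, so a smooth curve buys less resolution than it appears to.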

(3) float queryNorm(float sumOfSquaredWeights)

This normalizes the query vector according to the vector space model. It does not affect ranking; it only makes scores from different queries comparable.

(4) float tf(float freq)

freq is the number of times a term occurs in a document. tf is the score derived from that count; the default is Math.sqrt(freq), so this factor does not grow linearly with the number of occurrences.

(5) float idf(int docFreq, int numDocs)

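idf is the score computed from the number of documents containing a term and the total number of documents. The default is (Math.log(numDocs/(double)(docFreq+1)) + 1.0).

Because this factor depends on both the total number of documents and the number of documents containing the term, it needs global document statistics, which complicates searching across indexes.

The following example shows that searching two indexes together with MultiSearcher gives scores that differ considerably from searching each index separately with IndexSearcher.

The reason is that MultiSearcher's docFreq(Term term) counts the documents containing the term across both indexes, while IndexSearcher counts only those in its own index. When the two indexes hold very different numbers of documents, the scores are not comparable.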

public void testMultiIndex() throws Exception{
MultiIndexSimilarity sim = new MultiIndexSimilarity();
File indexDir01 = new File("TestMultiIndex/TestMultiIndex01");
File indexDir02 = new File("TestMultiIndex/TestMultiIndex02");
IndexReader reader01 = IndexReader.open(FSDirectory.open(indexDir01));
IndexReader reader02 = IndexReader.open(FSDirectory.open(indexDir02));
IndexSearcher searcher01 = new IndexSearcher(reader01);
searcher01.setSimilarity(sim);
IndexSearcher searcher02 = new IndexSearcher(reader02);
searcher02.setSimilarity(sim);
MultiSearcher multiseacher = new MultiSearcher(searcher01, searcher02);
multiseacher.setSimilarity(sim);
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", new StandardAnalyzer(Version.LUCENE_CURRENT));
Query query = parser.parse("common");
TopDocs docs = searcher01.search(query, 10);
System.out.println("----------------------------------------------");
for (ScoreDoc doc : docs.scoreDocs) {
    System.out.println("docid : " + doc.doc + " score : " + doc.score);
}
System.out.println("----------------------------------------------");
docs = searcher02.search(query, 10);
for (ScoreDoc doc : docs.scoreDocs) {
    System.out.println("docid : " + doc.doc + " score : " + doc.score);
}
System.out.println("----------------------------------------------");
docs = multiseacher.search(query, 20);
for (ScoreDoc doc : docs.scoreDocs) {
    System.out.println("docid : " + doc.doc + " score : " + doc.score);
}
}

The results are:

-------------------------------
docid : 0 score : 0.49317428
docid : 1 score : 0.49317428
docid : 2 score : 0.49317428
docid : 3 score : 0.49317428
docid : 4 score : 0.49317428
docid : 5 score : 0.49317428
docid : 6 score : 0.49317428
docid : 7 score : 0.49317428
-------------------------------
docid : 0 score : 0.45709616
docid : 1 score : 0.45709616
docid : 2 score : 0.45709616
docid : 3 score : 0.45709616
docid : 4 score : 0.45709616
-------------------------------
docid : 0 score : 0.5175894
docid : 1 score : 0.5175894
docid : 2 score : 0.5175894
docid : 3 score : 0.5175894
docid : 4 score : 0.5175894
docid : 5 score : 0.5175894
docid : 6 score : 0.5175894
docid : 7 score : 0.5175894
docid : 8 score : 0.5175894
docid : 9 score : 0.5175894
docid : 10 score : 0.5175894
docid : 11 score : 0.5175894
docid : 12 score : 0.5175894

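If the indexes are all on one machine, MultiSearcher or MultiReader solves the problem. Sometimes, however, the indexes are spread over several machines. Lucene does offer RMI, and the indexes can also be kept on NFS, but efficiency and parallelism remain a problem.

One approach worth trying is to have idf return 1 in Similarity, search the indexes on the different machines in parallel, and fold idf back in on the machine that merges the results.

As the following example shows, once idf returns 1 the scores become comparable: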

// Extends DefaultSimilarity so that only idf needs to be overridden
class MultiIndexSimilarity extends DefaultSimilarity {

@Override
public float idf(int docFreq, int numDocs) {
return 1.0f;
}

}

-----------------------------
docid : 0 score : 0.559017
docid : 1 score : 0.559017
docid : 2 score : 0.559017
docid : 3 score : 0.559017
docid : 4 score : 0.559017
docid : 5 score : 0.559017
docid : 6 score : 0.559017
docid : 7 score : 0.559017
-----------------------------
docid : 0 score : 0.559017
docid : 1 score : 0.559017
docid : 2 score : 0.559017
docid : 3 score : 0.559017
docid : 4 score : 0.559017
-----------------------------
docid : 0 score : 0.559017
docid : 1 score : 0.559017
docid : 2 score : 0.559017
docid : 3 score : 0.559017
docid : 4 score : 0.559017
docid : 5 score : 0.559017
docid : 6 score : 0.559017
docid : 7 score : 0.559017
docid : 8 score : 0.559017
docid : 9 score : 0.559017
docid : 10 score : 0.559017
docid : 11 score : 0.559017
docid : 12 score : 0.559017
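The three score levels in the first listing can be reproduced from the idf = 1 score above. For a single-term query, queryNorm is 1/idf under the default formulas, so the score reduces to idf · tf · norm. The sketch below assumes that the two indexes hold 8 and 5 documents and that every document contains common; this is inferred from the hit counts and is not stated in the article. `IdfSketch` and its methods are hypothetical names.

```java
final class IdfSketch {
    /** Default idf as given in this article. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Single-term query: queryNorm cancels one idf, leaving idf * (tf * norm). */
    static double score(int docFreq, int numDocs, double tfTimesNorm) {
        return idf(docFreq, numDocs) * tfTimesNorm;
    }

    public static void main(String[] args) {
        double tfTimesNorm = 0.559017;  // the score printed when idf is forced to 1
        System.out.println("index 1 alone (8 of 8 docs):  " + score(8, 8, tfTimesNorm));
        System.out.println("index 2 alone (5 of 5 docs):  " + score(5, 5, tfTimesNorm));
        System.out.println("both together (13 of 13):     " + score(13, 13, tfTimesNorm));
    }
}
```

The only thing that changes between the three runs is the (docFreq, numDocs) pair fed to idf. A merging node that knows the summed docFreq and numDocs of all shards could apply the same multiplication to idf-free scores, at least for single-term queries like this one.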

(6) float coord(int overlap, int maxOverlap)

A search may contain several query terms, and a document may contain several of them. This factor means that the more query terms a document contains, the higher it scores.

public void TestCoord() throws Exception {
MySimilarity sim = new MySimilarity();
File indexDir = new File("TestCoord");
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
Document doc1 = new Document();
Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
doc1.add(f1);
writer.addDocument(doc1);
Document doc2 = new Document();
Field f2 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED);
doc2.add(f2);
writer.addDocument(doc2);
for(int i = 0; i < 10; i++){
 Document doc3 = new Document();
 Field f3 = new Field("contents", "world", Field.Store.NO, Field.Index.ANALYZED);
 doc3.add(f3);
 writer.addDocument(doc3);
}
writer.close();

IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(sim);
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", new StandardAnalyzer(Version.LUCENE_CURRENT));
Query query = parser.parse("common world");
TopDocs docs = searcher.search(query, 2);
for (ScoreDoc doc : docs.scoreDocs) {
System.out.println("docid : " + doc.doc + " score : " + doc.score);
}
}

class MySimilarity extends DefaultSimilarity {

@Override
public float coord(int overlap, int maxOverlap) {
return 1;
}

}

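As in the example above, when coord returns 1 and therefore has no effect, document 1 contains both query terms, common and world, but world occurs in too many documents while document 2 contains common more often, so document 2 scores higher: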

docid : 1 score : 1.9059997
docid : 0 score : 1.2936771

When coord takes effect, document 1 scores higher because it contains both query terms:

class MySimilarity extends DefaultSimilarity {

@Override
public float coord(int overlap, int maxOverlap) {
return overlap / (float)maxOverlap;
}

}

docid : 0 score : 1.2936771
docid : 1 score : 0.95299983
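The change between the two listings is just the coord factor applied to the scores from the coord = 1 run. A minimal check in plain Java follows; `CoordSketch` is a hypothetical name and the input scores are copied from the output above.

```java
final class CoordSketch {
    /** Fraction of the query terms that the document matches. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        float doc0 = 1.2936771f;  // matches both "common" and "world"
        float doc1 = 1.9059997f;  // matches only "common"
        System.out.println("doc0: " + doc0 * coord(2, 2));
        System.out.println("doc1: " + doc1 * coord(1, 2));
    }
}
```

Document 0 keeps its score because it matches 2 of 2 terms, while document 1 is halved for matching 1 of 2, which is enough to reverse the order.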

(7) float scorePayload(int docId, String fieldName, int start, int end, byte [] payload, int offset, int length)

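Since Lucene introduced payloads, you can store your own information in the index and use it to influence Lucene's scoring.

Definition of payload

As we know, an index is stored as inverted lists: for each term there is a list of the documents that contain it, and to speed up queries this list is usually stored as a skip list.

Payload information is stored in the inverted list together with the document number, and is mostly used for information related to each document. The same information could also be kept in a stored Field, and functionally the two are much the same. When there is a lot of information to store, however, keeping it in the inverted list and exploiting the skip list can greatly speed up search.

The figure showing how payloads are stored is not reproduced here.

From this definition we can see that a payload can hold information tied not only to a document but also to a query term. For example, if a term is special in a particular document, payload information can be stored after the position information for that term in that document, so that the document gets a higher score when that term is searched.

To use payloads to influence a query, the following steps are needed. In the example below, a term marked with <b></b> stores 1 in its payload and every other term stores 0.

First, implement your own Analyzer so that payload information is put into each Token: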

class BoldAnalyzer extends Analyzer {

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new BoldFilter(result);
    return result;
}

}

class BoldFilter extends TokenFilter {
public static int IS_NOT_BOLD = 0;
public static int IS_BOLD = 1;

private TermAttribute termAtt;
private PayloadAttribute payloadAtt;

protected BoldFilter(TokenStream input) {
    super(input);
    termAtt = addAttribute(TermAttribute.class);
    payloadAtt = addAttribute(PayloadAttribute.class);
}

@Override
public boolean incrementToken() throws IOException {
if (input.incrementToken()) {
    final char[] buffer = termAtt.termBuffer();
    final int length = termAtt.termLength();

    String tokenstring = new String(buffer, 0, length);
    // A term wrapped in <b>...</b> is bold: strip the tags and flag it in the payload
    if (tokenstring.startsWith("<b>") && tokenstring.endsWith("</b>")) {
        tokenstring = tokenstring.replace("<b>", "");
        tokenstring = tokenstring.replace("</b>", "");
        termAtt.setTermBuffer(tokenstring);
        payloadAtt.setPayload(new Payload(int2bytes(IS_BOLD)));
    } else {
        payloadAtt.setPayload(new Payload(int2bytes(IS_NOT_BOLD)));
    }
    return true;
} else {
    return false;
}
}

public static int bytes2int(byte[] b) {
 int mask = 0xff;
 int temp = 0;
 int res = 0;
 for (int i = 0; i < 4; i++) {
 res <<= 8;
 temp = b[i] & mask;
 res |= temp;
 }
 return res;
}

public static byte[] int2bytes(int num) {
 byte[] b = new byte[4];
 for (int i = 0; i < 4; i++) {
 b[i] = (byte) (num >>> (24 - i * 8));
 }
 return b;
}

}
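The per-token decision made by the filter can be looked at without any Lucene classes. The sketch below assumes that bold terms are written as `<b>term</b>` in the text and are split on whitespace; `BoldMarkupSketch`, `Tagged`, and `tag` are hypothetical names used only for illustration.

```java
final class BoldMarkupSketch {
    static final int IS_NOT_BOLD = 0;
    static final int IS_BOLD = 1;

    /** A cleaned term together with the flag that would go into its payload. */
    static final class Tagged {
        final String term;
        final int flag;

        Tagged(String term, int flag) {
            this.term = term;
            this.flag = flag;
        }
    }

    /** Strips a surrounding <b>...</b> pair and reports whether it was present. */
    static Tagged tag(String token) {
        boolean wrapped = token.length() >= 7
                && token.startsWith("<b>") && token.endsWith("</b>");
        if (wrapped) {
            return new Tagged(token.substring(3, token.length() - 4), IS_BOLD);
        }
        return new Tagged(token, IS_NOT_BOLD);
    }

    public static void main(String[] args) {
        for (String token : "common <b>hello</b> world".split(" ")) {
            Tagged t = tag(token);
            System.out.println(t.term + " payload=" + t.flag);
        }
    }
}
```

The indexed term is the cleaned word, so a query for hello matches both the plain and the bold occurrence. Only the payload distinguishes them at scoring time.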

Then implement your own Similarity, which reads the information from the payload and scores accordingly.

class PayloadSimilarity extends DefaultSimilarity {

@Override
public float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length) {
    int isbold = BoldFilter.bytes2int(payload);
    if(isbold == BoldFilter.IS_BOLD){
      System.out.println("It is a bold char.");
    } else {
      System.out.println("It is not a bold char.");
    }
    return 1;
}
}

Finally, at query time you must use a PayloadXXXQuery (here PayloadTermQuery; in Lucene 2.4.1, BoostingTermQuery), otherwise scorePayload has no effect.

public void testPayloadScore() throws Exception {
PayloadSimilarity sim = new PayloadSimilarity();
File indexDir = new File("TestPayloadScore");
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new BoldAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
Document doc1 = new Document();
Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
doc1.add(f1);
writer.addDocument(doc1);
Document doc2 = new Document();
Field f2 = new Field("contents", "common <b>hello</b> world", Field.Store.NO, Field.Index.ANALYZED);
doc2.add(f2);
writer.addDocument(doc2);
writer.close();

IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(sim);
PayloadTermQuery query = new PayloadTermQuery(new Term("contents", "hello"), new MaxPayloadFunction());
TopDocs docs = searcher.search(query, 10);
for (ScoreDoc doc : docs.scoreDocs) {
System.out.println("docid : " + doc.doc + " score : " + doc.score);
}
}

If scorePayload always returns 1, the result is as follows and the payload has no effect.

It is not a bold char.
It is a bold char.
docid : 0 score : 0.2101998
docid : 1 score : 0.2101998

If scorePayload is written as follows:

class PayloadSimilarity extends DefaultSimilarity {

@Override
public float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length) {
    int isbold = BoldFilter.bytes2int(payload);
    if(isbold == BoldFilter.IS_BOLD){
      System.out.println("It is a bold char.");
      return 10;
    } else {
      System.out.println("It is not a bold char.");
      return 1;
    }
}
}

then the result is as follows. Both documents contain hello, but the one in which it is bold scores higher:

It is not a bold char.
It is a bold char.
docid : 1 score : 2.101998
docid : 0 score : 0.2101998

Extend Collector with your own implementation

The methods above cover every variable in Lucene's score formula. If that still does not meet your needs, you can also implement your own collector.

In Lucene 2.4, HitCollector has the method public abstract void collect(int doc, float score), which collects the search results.

TopDocCollector implements it as follows:

public void collect(int doc, float score) {
if (score > 0.0f) {
    totalHits++;
    if (reusableSD == null) {
      reusableSD = new ScoreDoc(doc, score);
    } else if (score >= reusableSD.score) {
      reusableSD.doc = doc;
      reusableSD.score = score;
    } else {
      return;
    }
    reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);
}
}

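This method inserts the docid and score into a PriorityQueue so that the highest-scoring documents are returned first.

We can extend HitCollector and modify the score in this method before inserting it into the PriorityQueue, or insert it into a data structure of our own.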
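The idea of keeping only the best N hits in a priority queue can be shown with the JDK's own `java.util.PriorityQueue`. This is a standalone sketch, not Lucene's collector or its internal queue class; `TopNSketch` and `Hit` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class TopNSketch {
    /** One collected hit. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int n;
    // Min-heap on score: the weakest hit that is still kept sits on top
    private final PriorityQueue<Hit> queue;

    TopNSketch(int n) {
        this.n = n;
        this.queue = new PriorityQueue<>(n, (a, b) -> Float.compare(a.score, b.score));
    }

    /** Called once per matching document; a custom collector could adjust score here first. */
    void collect(int doc, float score) {
        if (score <= 0.0f) {
            return;
        }
        if (queue.size() < n) {
            queue.add(new Hit(doc, score));
        } else if (score > queue.peek().score) {
            queue.poll();                    // evict the current weakest hit
            queue.add(new Hit(doc, score));
        }
    }

    /** Returns the kept hits, best first. */
    List<Hit> topDocs() {
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }

    public static void main(String[] args) {
        TopNSketch collector = new TopNSketch(2);
        collector.collect(0, 0.3f);
        collector.collect(1, 0.9f);
        collector.collect(2, 0.5f);
        for (Hit hit : collector.topDocs()) {
            System.out.println("docid : " + hit.doc + " score : " + hit.score);
        }
    }
}
```

Because the queue only ever sees the score passed to `collect`, any adjustment made before the insert changes which documents survive, not just the order in which they are printed.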

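For example, suppose we keep a mapping from docid to document creation time somewhere else, and we want documents created within the last day to score highest, those within the last week next, and those older than a month very low.

We can modify the method like this: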

public static long millisecondsOneDay = 24L * 3600L * 1000L;

public static long millisecondsOneWeek = 7L * 24L * 3600L * 1000L;

public static long millisecondsOneMonth = 30L * 24L * 3600L * 1000L;

public void collect(int doc, float score) {
if (score > 0.0f) {

// getTimeByDocId looks up the creation time (in milliseconds) kept outside the index
long age = System.currentTimeMillis() - getTimeByDocId(doc);

if (age < millisecondsOneDay) {
score = score * 1.0f;
} else if (age < millisecondsOneWeek) {
score = score * 0.8f;
} else if (age < millisecondsOneMonth) {
score = score * 0.3f;
} else {
score = score * 0.1f;
}

    totalHits++;
    if (reusableSD == null) {
      reusableSD = new ScoreDoc(doc, score);
    } else if (score >= reusableSD.score) {
      reusableSD.doc = doc;
      reusableSD.score = score;
    } else {
      return;
    }
    reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);
}
}

In Lucene 3.0, the Collector interface is void collect(int doc), and TopScoreDocCollector implements it as follows:

public void collect(int doc) throws IOException {
float score = scorer.score();
totalHits++;
if (score <= pqTop.score) {
return;
}
pqTop.doc = doc + docBase;
pqTop.score = score;
pqTop = pq.updateTop();
}

The score can be influenced in the same way as above.
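Since the 3.0-style collect(int doc) obtains the score itself, the age-based adjustment is easiest to reuse if it lives in a small helper that takes the raw score and returns the adjusted one. The sketch below is plain Java; `RecencySketch`, `factor`, and `rescore` are hypothetical names, and the thresholds and multipliers are the ones from the example in this article.

```java
final class RecencySketch {
    static final long DAY = 24L * 3600L * 1000L;
    static final long WEEK = 7L * DAY;
    static final long MONTH = 30L * DAY;

    /** Multiplier for a document of the given age in milliseconds. */
    static float factor(long ageMillis) {
        if (ageMillis < DAY) {
            return 1.0f;
        } else if (ageMillis < WEEK) {
            return 0.8f;
        } else if (ageMillis < MONTH) {
            return 0.3f;
        }
        return 0.1f;
    }

    /** Adjusts a raw score using the document's creation time and the current time. */
    static float rescore(float score, long createdAtMillis, long nowMillis) {
        return score * factor(nowMillis - createdAtMillis);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        float oldButRelevant = rescore(0.9f, now - 40L * DAY, now);
        float freshButWeaker = rescore(0.5f, now - 1000L, now);
        System.out.println("40 days old, raw 0.9 -> " + oldButRelevant);
        System.out.println("1 second old, raw 0.5 -> " + freshButWeaker);
    }
}
```

Passing the current time in as a parameter keeps the helper deterministic, which makes it easy to check the thresholds independently of the collector.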

Category: Lucene Principles and Code Analysis
