String Similarity Algorithm

看見就非常 2014-10-01

Original: http://blog.csdn.net/guffey/article/details/6750494

2011-09-05 17:30

String Similarity Algorithm (Levenshtein Distance)

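Problem: a string can be turned into another string by inserting one character, deleting one character, or replacing one character. Suppose we call the minimum number of these three operations needed to convert string A into string B the AB similarity degree (that is, the edit distance between A and B).
For example:  abc adc  degree 1
   ababababa babababab  degree 2
   abcd acdb  degree 2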

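String similarity can be computed with the Levenshtein Distance algorithm (also called the edit distance algorithm), which was proposed by the Russian scientist Levenshtein. Its steps are: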

Step	Description
1	Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix d containing 0..n rows and 0..m columns.
2	Initialize the first column to 0..n. Initialize the first row to 0..m.
3	Examine each character of s (i from 1 to n).
4	Examine each character of t (j from 1 to m).
5	If s[i] equals t[j], the cost is 0. If s[i] doesn't equal t[j], the cost is 1.
6	Set cell d[i,j] of the matrix equal to the minimum of: a. The cell immediately above plus 1: d[i-1,j] + 1. b. The cell immediately to the left plus 1: d[i,j-1] + 1. c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7	After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
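
To make the table-filling steps easier to follow, here is a minimal sketch that returns the whole matrix instead of only the final cell, so each intermediate value can be inspected. The function name levenshteinTable is hypothetical and not part of the original article; rows are indexed by s and columns by t, matching the d[i,j] notation used in step 6.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not from the original article): fills the complete
// table described in steps 1-7 and returns it for inspection.
std::vector<std::vector<int>> levenshteinTable(const std::string& s,
                                               const std::string& t) {
    const std::size_t n = s.size();
    const std::size_t m = t.size();

    // (n+1) x (m+1) table; d[i][j] is the distance between the first
    // i characters of s and the first j characters of t.
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));

    // Step 2: converting to or from the empty prefix costs its length.
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = static_cast<int>(j);

    // Steps 3-6: each cell is the cheapest of deletion, insertion,
    // and substitution (or a free match).
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,        // cell above
                                d[i][j - 1] + 1,        // cell to the left
                                d[i - 1][j - 1] + cost  // diagonal cell
                               });
        }
    }
    return d;  // Step 7: the distance is d[n][m].
}
```

For s = "abc" and t = "adc" the rows of the table come out as 0 1 2 3, 1 0 1 2, 2 1 1 2 and 3 2 2 1, so the bottom-right cell gives the distance 1 from the first example above.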

A C++ implementation follows:

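#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

// Levenshtein distance: the minimum number of single-character insertions,
// deletions and substitutions needed to turn source into target.
int ldistance(const string& source, const string& target)
{
    // Step 1: if either string is empty, the distance is the other's length.
    const int n = static_cast<int>(source.length());
    const int m = static_cast<int>(target.length());
    if (m == 0) return n;
    if (n == 0) return m;

    // Construct an (n+1) x (m+1) matrix, zero-initialized.
    typedef vector< vector<int> > Tmatrix;
    Tmatrix matrix(n + 1, vector<int>(m + 1, 0));

    // Step 2: initialize the first column to 0..n and the first row to 0..m.
    for (int i = 1; i <= n; i++) matrix[i][0] = i;
    for (int j = 1; j <= m; j++) matrix[0][j] = j;

    // Step 3: examine each character of source.
    for (int i = 1; i <= n; i++)
    {
        const char si = source[i - 1];

        // Step 4: examine each character of target.
        for (int j = 1; j <= m; j++)
        {
            const char tj = target[j - 1];

            // Step 5: matching characters cost nothing, a substitution costs 1.
            const int cost = (si == tj) ? 0 : 1;

            // Step 6: take the cheapest of the three neighbouring cells.
            const int above = matrix[i - 1][j] + 1;
            const int left = matrix[i][j - 1] + 1;
            const int diag = matrix[i - 1][j - 1] + cost;
            matrix[i][j] = min(above, min(left, diag));
        }
    }

    // Step 7: the distance is in the bottom-right cell.
    return matrix[n][m];
}

int main()
{
    string s;
    string t;

    cout << "source=";
    cin >> s;
    cout << "target=";
    cin >> t;

    const int dist = ldistance(s, t);
    cout << "dist=" << dist << endl;
    return 0;
}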

A Java implementation of the string edit distance algorithm:

public static int getLevenshteinDistance (String s, String t) {
  if (s == null || t == null) {
    throw new IllegalArgumentException("Strings must not be null");
  }
		
  /*
    The difference between this impl. and the previous is that, rather 
     than creating and retaining a matrix of size s.length()+1 by t.length()+1, 
     we maintain two single-dimensional arrays of length s.length()+1.  The first, d,
     is the 'current working' distance array that maintains the newest distance cost
     counts as we iterate through the characters of String s.  Each time we increment
     the index of String t we are comparing, d is copied to p, the second int[].  Doing so
     allows us to retain the previous cost counts as required by the algorithm (taking 
     the minimum of the cost count to the left, up one, and diagonally up and to the left
     of the current cost count being calculated).  (Note that the arrays aren't really 
     copied anymore, just switched...this is clearly much better than cloning an array 
     or doing a System.arraycopy() each time  through the outer loop.)

     Effectively, the difference between the two implementations is this one does not 
     cause an out of memory condition when calculating the LD over two very large strings.  		
  */		
		
  int n = s.length(); // length of s
  int m = t.length(); // length of t
		
  if (n == 0) {
    return m;
  } else if (m == 0) {
    return n;
  }

  int p[] = new int[n+1]; //'previous' cost array, horizontally
  int d[] = new int[n+1]; // cost array, horizontally
  int _d[]; //placeholder to assist in swapping p and d

  // indexes into strings s and t
  int i; // iterates through s
  int j; // iterates through t

  char t_j; // jth character of t

  int cost; // cost

  for (i = 0; i<=n; i++) {
     p[i] = i;
  }
		
  for (j = 1; j<=m; j++) {
     t_j = t.charAt(j-1);
     d[0] = j;
		
     for (i=1; i<=n; i++) {
        cost = s.charAt(i-1)==t_j ? 0 : 1;
        // minimum of cell to the left+1, to the top+1, diagonally left and up +cost				
        d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);  
     }

     // copy current distance counts to 'previous row' distance counts
     _d = p;
     p = d;
     d = _d;
  } 
		
  // our last action in the above loop was to switch d and p, so p now 
  // actually has the most recent cost counts
  return p[n];
}

String similarity = 1 - edit distance / MAX(length of string 1, length of string 2)
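
The formula above can be sketched in C++ as follows. This is an illustration, not code from the original article: the names editDistance and similarity are hypothetical, the distance uses the same two-row idea as the Java version above, and returning 1.0 for two empty strings is an assumption made here to avoid dividing by zero.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Two-row Levenshtein distance: only the previous and current rows are kept.
int editDistance(const std::string& s, const std::string& t) {
    const std::size_t n = s.size();
    const std::size_t m = t.size();
    std::vector<int> prev(n + 1), curr(n + 1);

    // Row for the empty prefix of t: distance equals the prefix length of s.
    for (std::size_t i = 0; i <= n; ++i) prev[i] = static_cast<int>(i);

    for (std::size_t j = 1; j <= m; ++j) {
        curr[0] = static_cast<int>(j);
        for (std::size_t i = 1; i <= n; ++i) {
            const int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            curr[i] = std::min({curr[i - 1] + 1, prev[i] + 1, prev[i - 1] + cost});
        }
        std::swap(prev, curr);  // the newest row becomes the previous row
    }
    return prev[n];
}

// similarity = 1 - distance / max(len1, len2), a value in [0, 1].
double similarity(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;  // assumption: two empty strings are identical
    return 1.0 - static_cast<double>(editDistance(a, b)) /
                     static_cast<double>(longest);
}
```

With the earlier example pair abcd and acdb, the distance is 2 and the longer length is 4, so the similarity is 1 - 2/4 = 0.5.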

Oracle 11 provides functions that compute string edit distance and similarity:

See http:///reference/utl_match.html

Oracle UTL_MATCH

Version 11.1

General Information
The four functions included in the package use different methods to compare a source string and destination string, and return an assessment of what it would take to turn the source string into the destination string.
Source	$ORACLE_HOME/rdbms/admin/utlmatch.sql

EDIT_DISTANCE
Returns the number of changes required to turn the source string into the destination string using the Levenshtein Distance algorithm.	utl_match.edit_distance(s1 IN VARCHAR2, s2 IN VARCHAR2) RETURN PLS_INTEGER;
	SELECT utl_match.edit_distance('expresso', 'espresso') DIST FROM dual;

EDIT_DISTANCE_SIMILARITY
Returns an integer between 0 and 100, where 0 indicates no similarity at all and 100 indicates a perfect match.	utl_match.edit_distance_similarity( s1 IN VARCHAR2, s2 IN VARCHAR2) RETURN PLS_INTEGER;
	SELECT utl_match.edit_distance_similarity('expresso', 'espresso') SIM FROM dual;

JARO_WINKLER
Instead of simply calculating the number of steps required to change the source string to the destination string, determines how closely the two strings agree with each other and tries to take into account the possibility of a data entry error.	utl_match.jaro_winkler(s1 IN VARCHAR2, s2 IN VARCHAR2) RETURN BINARY_DOUBLE;
	SELECT utl_match.jaro_winkler('expresso', 'espresso') DIST FROM dual;

JARO_WINKLER_SIMILARITY
Returns an integer between 0 and 100, where 0 indicates no similarity at all and 100 indicates a perfect match but tries to take into account possible data entry errors.	utl_match.jaro_winkler_similarity( s1 IN VARCHAR2, s2 IN VARCHAR2) RETURN PLS_INTEGER;
	SELECT utl_match.jaro_winkler_similarity('expresso', 'expresso') SIM FROM dual;
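
For readers who want to see what a Jaro-Winkler score measures, here is a minimal sketch of the textbook formulation in C++. It is not Oracle's implementation, whose exact details are not given here, and its results may differ from UTL_MATCH. The function names jaro and jaroWinkler are hypothetical, and the prefix cap of 4 and scaling factor of 0.1 are the commonly used Winkler values, taken here as assumptions.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Textbook Jaro similarity in [0, 1].
double jaro(const std::string& a, const std::string& b) {
    const int la = static_cast<int>(a.size());
    const int lb = static_cast<int>(b.size());
    if (la == 0 && lb == 0) return 1.0;
    if (la == 0 || lb == 0) return 0.0;

    // Characters match only if equal and at most `window` positions apart.
    const int window = std::max(0, std::max(la, lb) / 2 - 1);
    std::vector<bool> usedA(la, false), usedB(lb, false);
    int matches = 0;
    for (int i = 0; i < la; ++i) {
        const int lo = std::max(0, i - window);
        const int hi = std::min(lb - 1, i + window);
        for (int j = lo; j <= hi; ++j) {
            if (!usedB[j] && a[i] == b[j]) {
                usedA[i] = true;
                usedB[j] = true;
                ++matches;
                break;
            }
        }
    }
    if (matches == 0) return 0.0;

    // Count matched characters that appear in a different order;
    // each transposition involves two such characters.
    int outOfOrder = 0;
    for (int i = 0, j = 0; i < la; ++i) {
        if (!usedA[i]) continue;
        while (!usedB[j]) ++j;
        if (a[i] != b[j]) ++outOfOrder;
        ++j;
    }
    const double m = matches;
    const double t = outOfOrder / 2.0;
    return (m / la + m / lb + (m - t) / m) / 3.0;
}

// Jaro-Winkler: boosts the Jaro score for a shared prefix (up to 4 chars).
double jaroWinkler(const std::string& a, const std::string& b) {
    const double j = jaro(a, b);
    const std::size_t limit =
        std::min<std::size_t>(4, std::min(a.size(), b.size()));
    std::size_t prefix = 0;
    while (prefix < limit && a[prefix] == b[prefix]) ++prefix;
    return j + static_cast<double>(prefix) * 0.1 * (1.0 - j);
}
```

Unlike edit distance, this score rewards strings that agree at the start, which is why it is often used for names and other short fields where data entry errors tend to occur later in the string.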
