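regexp: A Java Package for Regular Expressions
Although Apache considers Jakarta ORO the more complete regular expression package, regexp is also very widely used, probably because it is so simple. These are my study notes on regexp.
1. Download and installation
Download the source: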
cvs -d :pserver:anoncvs@cvs.:/home/cvspublic login
password: anoncvs
cvs -d :pserver:anoncvs@cvs.:/home/cvspublic checkout jakarta-regexp
Or download a prebuilt package:
wget http://apache./dist/jakarta/regexp/binaries/jakarta-regexp-1.3.tar.gz
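2. Overview
1) Regexp is a 100% pure Java regular expression package that Jonathan Locke donated to the Apache Software Foundation. He originally wrote it in 1996, and it has stood the test of time well. It comes with complete Javadoc documentation and a simple applet for visual debugging and compatibility testing.
2) RE is one of the most important classes in the regexp package: an efficient, lightweight regular expression evaluator/matcher. RE is short for "regular expression". A regular expression is a pattern that can perform complex string matching, and when a string matches a pattern you can extract the parts that matched, which is very useful for text parsing. The regular expression syntax is described below.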
To compile a regular expression, simply construct an RE matcher object with the pattern as its argument. You can then call any of the RE.match methods to test a string against it; they return true if the match succeeds and false if it fails. For example:
    RE r = new RE("a*b");
    boolean matched = r.match("aaaab");
RE.getParen retrieves the matched character sequence, or one part of it when the pattern contains the corresponding parentheses. Companion methods return the start, end, and length of each part. For example:
    RE r = new RE("(a*)b");                     // Compile expression
    boolean matched = r.match("xaaaab");        // Match against "xaaaab"
    String wholeExpr = r.getParen(0);           // wholeExpr will be "aaaab"
    String insideParens = r.getParen(1);        // insideParens will be "aaaa"
    int startWholeExpr = r.getParenStart(0);    // startWholeExpr will be index 1
    int endWholeExpr = r.getParenEnd(0);        // endWholeExpr will be index 6
    int lenWholeExpr = r.getParenLength(0);     // lenWholeExpr will be 5
    int startInside = r.getParenStart(1);       // startInside will be index 1
    int endInside = r.getParenEnd(1);           // endInside will be index 5
    int lenInside = r.getParenLength(1);        // lenInside will be 4
RE supports backreferences in regular expressions. For example:
    ([0-9]+)=\1
matches strings of the form n=n (such as 0=0 or 2=2).
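To make the backreference behaviour concrete, here is a minimal sketch of the same pattern. It deliberately uses the JDK's java.util.regex rather than org.apache.regexp, so that it runs without the Jakarta jar on the classpath; the class name BackrefDemo and the helper isSameNumber are made up for illustration. One difference to keep in mind: Matcher.matches() tests the whole input, while the notes above only show RE.match being used to find a match inside a string.

```java
import java.util.regex.Pattern;

// Sketch only: java.util.regex is used here as a stand-in for org.apache.regexp.RE.
final class BackrefDemo {
    // Same pattern as in the text: one or more digits, '=', then the same digits again.
    private static final Pattern SAME_NUMBER = Pattern.compile("([0-9]+)=\\1");

    // Returns true only if the entire input has the form n=n.
    static boolean isSameNumber(String input) {
        return SAME_NUMBER.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSameNumber("0=0"));   // true
        System.out.println(isSameNumber("12=12")); // true
        System.out.println(isSameNumber("1=2"));   // false
    }
}
```

The \1 refers back to whatever text the first parenthesized group actually captured, not to the group's pattern, which is why 1=2 is rejected even though both sides are digits.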
3) The regular expression syntax supported by RE is as follows:
Characters
| unicodeChar | Matches any identical unicode character |
| \ | Used to quote a meta-character (like '*') |
| \\ | Matches a single '\' character |
| \0nnn | Matches a given octal character |
| \xhh | Matches a given 8-bit hexadecimal character |
| \uhhhh | Matches a given 16-bit hexadecimal character |
| \t | Matches an ASCII tab character |
| \n | Matches an ASCII newline character |
| \r | Matches an ASCII return character |
| \f | Matches an ASCII form feed character |
Character classes
| [abc] | Simple character class |
| [a-zA-Z] | Character class with ranges |
| [^abc] | Negated character class |
Standard POSIX character classes
| [:alnum:] | Alphanumeric characters. |
| [:alpha:] | Alphabetic characters. |
| [:blank:] | Space and tab characters. |
| [:cntrl:] | Control characters. |
| [:digit:] | Numeric characters. |
| [:graph:] | Characters that are printable and are also visible. (A space is printable, but not visible, while an 'a' is both.) |
| [:lower:] | Lower-case alphabetic characters. |
| [:print:] | Printable characters (characters that are not control characters). |
| [:punct:] | Punctuation characters (characters that are not letters, digits, control characters, or space characters). |
| [:space:] | Space characters (such as space, tab, and formfeed, to name a few). |
| [:upper:] | Upper-case alphabetic characters. |
| [:xdigit:] | Characters that are hexadecimal digits. |
Non-standard POSIX-style character classes
| [:javastart:] | Start of a Java identifier |
| [:javapart:] | Part of a Java identifier |
Predefined character classes
| . | Matches any character other than newline |
| \w | Matches a "word" character (alphanumeric plus "_") |
| \W | Matches a non-word character |
| \s | Matches a whitespace character |
| \S | Matches a non-whitespace character |
| \d | Matches a digit character |
| \D | Matches a non-digit character |
Boundary matchers
| ^ | Matches only at the beginning of a line |
| $ | Matches only at the end of a line |
| \b | Matches only at a word boundary |
| \B | Matches only at a non-word boundary |
Greedy quantifiers
| A* | Matches A 0 or more times (greedy) |
| A+ | Matches A 1 or more times (greedy) |
| A? | Matches A 1 or 0 times (greedy) |
| A{n} | Matches A exactly n times (greedy) |
| A{n,} | Matches A at least n times (greedy) |
Reluctant (non-greedy) quantifiers
| A*? | Matches A 0 or more times (reluctant) |
| A+? | Matches A 1 or more times (reluctant) |
| A?? | Matches A 0 or 1 times (reluctant) |
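The difference between the greedy and reluctant forms is easiest to see side by side. The sketch below again uses the JDK's java.util.regex as a stand-in for the regexp package (the + and +? quantifiers are written the same way in both); the class QuantifierDemo, the helper firstMatch, and the sample input are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex stands in for org.apache.regexp here.
final class QuantifierDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b>";
        // Greedy: .+ takes as much as possible, so the match runs to the last '>'.
        System.out.println(firstMatch("<.+>", input));  // <b>bold</b>
        // Reluctant: .+? takes as little as possible, so the match stops at the first '>'.
        System.out.println(firstMatch("<.+?>", input)); // <b>
    }
}
```

Both patterns match the same input; the quantifier only decides which of the possible matches is returned.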
Logical operators
| AB | Matches A followed by B |
| A|B | Matches either A or B |
| (A) | Used for subexpression grouping |
| (?:A) | Used for subexpression clustering (just like grouping but no backrefs) |
Backreferences
| \1 | Backreference to 1st parenthesized subexpression |
| \2 | Backreference to 2nd parenthesized subexpression |
| \3 | Backreference to 3rd parenthesized subexpression |
| \4 | Backreference to 4th parenthesized subexpression |
| \5 | Backreference to 5th parenthesized subexpression |
| \6 | Backreference to 6th parenthesized subexpression |
| \7 | Backreference to 7th parenthesized subexpression |
| \8 | Backreference to 8th parenthesized subexpression |
| \9 | Backreference to 9th parenthesized subexpression |
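RE runs programs that have first been compiled by the RECompiler class. For efficiency reasons, the RE matcher itself does not include the regular expression compiler. In fact, if you want to precompile one or more regular expressions, you can run the 'recompile' class from the command line, for example:
    java org.apache.regexp.recompile a*b
This produces compiled output like the following (the last line is not part of the output):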
// Pre-compiled regular expression "a*b"
char[] re1Instructions =
{
0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
0x0000,
};
REProgram re1 = new REProgram(re1Instructions);
RE r = new RE(re1);
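Constructing the RE matcher object from the precompiled re1 avoids the cost of compiling the expression at runtime.
If you need to build regular expressions dynamically, you can create a single RECompiler object and use it to compile each expression. Note that neither RE nor RECompiler is thread-safe (for efficiency reasons), so in a multithreaded program you need to create a separate compiler and matcher for each thread.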
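One common way to follow the "one matcher per thread" rule is a ThreadLocal. The sketch below shows the shape of that approach with the JDK's java.util.regex.Matcher, which is likewise not safe to share between threads; it is a stand-in, not the regexp package's own API. With Jakarta Regexp the ThreadLocal would presumably hold an RE (and, for dynamic patterns, an RECompiler) instead. The class name PerThreadMatcher and the thread names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: java.util.regex.Matcher stands in for a non-thread-safe RE matcher.
final class PerThreadMatcher {
    private static final Pattern PATTERN = Pattern.compile("a*b");

    // Each thread lazily gets its own Matcher; instances are never shared across threads.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> PATTERN.matcher(""));

    // Points the calling thread's matcher at the new input and tests the whole input.
    static boolean matches(String input) {
        return MATCHER.get().reset(input).matches();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () ->
                System.out.println(Thread.currentThread().getName() + ": " + matches("aaaab"));
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```

Reusing one matcher per thread keeps the per-call cost low while ensuring that no two threads ever touch the same mutable matcher state.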
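3. Examples
1) The regexp package includes a small applet-based demo, which can be run as follows:
    java org.apache.regexp.REDemo
2) Jeffrey Hunter has written a sample program that can be downloaded.
3) The test program that ships with regexp is also a useful reference. It keeps all the regular expressions, the strings they are tested against, and the expected results in a single file, $REGEXPHOME/docs/RETest.txt. The test program must also be run from the $REGEXPHOME directory:
    cd $REGEXPHOME
    java org.apache.regexp.RETest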
References
1. Jeffrey Hunter's README_regular_expressions.txt
http://www./topics/topics.cgi?LEVEL=programming
2. The Jakarta Site – CVS Repository
http://jakarta./site/cvsindex.html