|
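1. Purpose
When typing with Wubi, whenever I hit a character I can't decompose, I have to switch back to Pinyin. That treats the symptom rather than the cause, so I went looking online for a Wubi reverse-lookup tool. I eventually found a good website: it gives not only the Wubi code for each character but also its radical diagram. The catch is that it is a website; in other words, every lookup means going online. The natural next step is to save the site's Wubi codes and the matching radical diagrams locally, then write a lookup program to turn it into a local version >_<
2. Preparation: analyzing the pages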

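The site (http://www./wbcx) offers two ways to look things up: type in the character you want, or browse page by page. Since I couldn't be bothered to find a character list, I chose the second. In that mode the URL of the first page is http://www./wbcx/index5.asp?page=1, the second is http://www./wbcx/index5.asp?page=2, and the third is http://www./wbcx/index5.asp?page=3. From these three URLs it is reasonable to assume that page X is at http://www./wbcx/index5.asp?page=X.

With the URLs settled, the next question is how to pull the needed data out of a single page. In the source of the first page, the string "86五筆編碼" (86-edition Wubi code) appears exactly once, and the Wubi code comes right after it. So once the server's response arrives, locating "86五筆編碼" is enough to find the code. The URL of the radical diagram appears after the Wubi code and always starts with "http://www./GIF-82". So the first URL starting with "http://www./GIF-82" after "86五筆編碼" is the image address we want.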
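The two observations above (the page URL pattern and the two markers in the page source) can be sketched in isolation as follows. This is only an illustration, not the crawler itself: the class name `WubiPageParser`, the method names, and the HTML fragment in `main` are made up for the example, the host name stays elided exactly as in the URLs above, and the single separator character after the label is an assumption about the page layout.

```java
final class WubiPageParser {
    // Taken verbatim from the text above; the host part is elided there too.
    static final String PAGE_PREFIX = "http://www./wbcx/index5.asp?page=";
    static final String CODE_LABEL = "86五筆編碼";
    static final String IMG_PREFIX = "http://www./GIF-82";

    /** URL of page X, following the pattern seen on the first three pages. */
    static String pageUrl(int page) {
        return PAGE_PREFIX + page;
    }

    /**
     * Returns {wubiCode, imageUrl}, or null when a marker is missing.
     * Assumption: exactly one separator character follows the label and
     * the code runs up to the next '<'.
     */
    static String[] extract(String html) {
        int label = html.indexOf(CODE_LABEL);
        if (label < 0) {
            return null;
        }
        int codeStart = label + CODE_LABEL.length() + 1;
        int codeEnd = html.indexOf('<', codeStart);
        // Only look for the image URL after the label, as described above.
        int imgStart = html.indexOf(IMG_PREFIX, codeStart);
        if (codeEnd < 0 || imgStart < 0) {
            return null;
        }
        int imgEnd = html.indexOf('"', imgStart);
        if (imgEnd < 0) {
            return null;
        }
        return new String[] {
            html.substring(codeStart, codeEnd).trim(),
            html.substring(imgStart, imgEnd)
        };
    }

    public static void main(String[] args) {
        // Made-up fragment that only mimics the structure described above.
        String html = "<td>" + CODE_LABEL + ":abcd</td>"
                + "<img src=\"" + IMG_PREFIX + "/x.gif\">";
        String[] result = extract(html);
        System.out.println(pageUrl(3));
        System.out.println(result[0] + " " + result[1]);
    }
}
```

Returning null for a missing marker lets the caller record the page as failed instead of dying on an out-of-range index.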
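3. Algorithm
for (first page to last page) {
    fetch the source of this page
    extract the Wubi code and the radical-diagram URL from the source
    download the radical diagram
}
4. Source code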

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.LinkedList;

import javax.imageio.ImageIO;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class Clawler {
    private static final int END_PAGE = 6764;
    private static final String PREFIX = "http://www./wbcx/index5.asp?page=";
    private static final String CODE_SAVE_PATH = "D:\\Wubi\\WubiCode.txt";
    private static final String IMG_SAVE_PATH_PREFIX = "D:\\Wubi\\img\\";
    // Numbers of the pages whose code or image could not be fetched.
    private static LinkedList<Integer> queue = new LinkedList<Integer>();
    // Image URL found by the most recent getWubiCode call.
    private static String m_imguri;

    public static void main(String[] args) throws IOException {
        HttpClient httpClient = new DefaultHttpClient();
        FileWriter fw = new FileWriter(CODE_SAVE_PATH);
        try {
            for (int i = 1; i <= END_PAGE; ++i) {
                HttpUriRequest request = new HttpGet(PREFIX + i);
                try {
                    HttpResponse response = httpClient.execute(request);
                    HttpEntity entity = response.getEntity();
                    if (entity != null) {
                        // Decode the whole body in one go. Appending new String(tmp)
                        // for each read would also append stale bytes from the
                        // buffer and could split multi-byte characters. The platform
                        // default charset is only a fallback for responses that
                        // declare none.
                        String page = EntityUtils.toString(entity, Charset.defaultCharset().name());
                        fw.write(getWubiCode(page, i));
                        downloadImg(m_imguri, IMG_SAVE_PATH_PREFIX + i + ".gif", i);
                    }
                } catch (Exception e) {
                    // Also reached when a marker is missing and substring() throws.
                    queue.addLast(i);
                    e.printStackTrace();
                }
                if (i % 100 == 0) {
                    fw.flush();
                }
            }
        } finally {
            fw.close();
            httpClient.getConnectionManager().shutdown();
        }
        System.out.println("\n missing Code");
        while (!queue.isEmpty()) {
            // Pages that failed to download
            System.out.println(queue.removeFirst());
        }
        System.out.print("all done");
    }

    // Extracts the Wubi code and the radical-diagram URL from one page.
    // Returns the line "<character> <code>" and stores the URL in m_imguri.
    public static String getWubiCode(String page, int number) {
        StringBuilder save = new StringBuilder();
        page = page.substring(page.indexOf("86五筆編碼"));
        // Skip the 6-character label and the one character after it.
        int index = 7;
        while (page.charAt(index) != '<')
            save.append(page.charAt(index++));
        save.append(System.getProperty("line.separator"));
        index = 0;
        StringBuilder imgpath = new StringBuilder();
        page = page.substring(page.indexOf("http://www./GIF-82"));
        while (page.charAt(index) != '"')
            imgpath.append(page.charAt(index++));
        m_imguri = imgpath.toString();
        // Prefix the line with the character that sits just before the
        // 4-character extension of the image URL (presumably the character
        // being looked up).
        save.insert(0, imgpath.charAt(imgpath.length() - 5));
        save.insert(1, ' ');
        return save.toString();
    }

    // Downloads one radical diagram and saves it as a GIF.
    public static void downloadImg(String url, String path, int number) {
        try {
            File out = new File(path);
            BufferedImage buffer = ImageIO.read(new URL(url));
            if (buffer == null) {
                queue.addLast(number);
                System.out.println(number + " " + url);
            } else {
                ImageIO.write(buffer, "gif", out);
            }
        } catch (IOException e) {
            queue.addLast(number);
            System.out.println(url);
            System.out.println(e.getMessage());
        }
    }
}
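The stated goal was a local lookup program, which the crawler above does not include. A minimal sketch of the lookup side might look like this, assuming only the file layout that getWubiCode produces (one line per page: the character, a space, then its code). The class name `WubiLookup`, the `parse` method, and the sample lines are hypothetical; the sample codes are placeholders, not real Wubi codes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WubiLookup {
    /** Builds a character-to-code map from lines shaped like "<character> <code>". */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> codes = new HashMap<String, String>();
        for (String line : lines) {
            int space = line.indexOf(' ');
            if (space <= 0) {
                continue; // blank or malformed line
            }
            codes.put(line.substring(0, space), line.substring(space + 1).trim());
        }
        return codes;
    }

    public static void main(String[] args) {
        // Placeholder data. A real run would read the lines of
        // D:\Wubi\WubiCode.txt instead, for example with
        // java.nio.file.Files.readAllLines and the charset the file was written in.
        List<String> lines = Arrays.asList("X abcd", "Y efgh");
        Map<String, String> codes = parse(lines);
        System.out.println(codes.get("X"));
    }
}
```

One caveat: the images are saved under the page number, which WubiCode.txt does not record, and line order only matches page order if no page failed. A lookup that should also show the radical diagram would need the page number stored alongside each code.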

5. References
a. HttpClient 4.1 getting-started tutorial (Chinese edition) http://wenku.baidu.com/view/0a027c5e804d2b160b4ec029.html
b. One way to implement a forum image crawler http://www./topic/1044289
c. The simplest approach to a search engine http://www./topic/1055424
|