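Today's crawlers are no longer as simple as fetching static pages and parsing HTML text. Single-page applications, asynchronous requests, and JavaScript rendering at page initialization mean that the Document obtained from a traditional plain HTTP request cannot be used directly. While crawling data from an e-commerce platform for a real business need, I found that specific parts of its pages were rendered by JavaScript. Hence this article, which, based on actual code testing, looks at three popular JS rendering tools: HtmlUnit, Selenium, and PhantomJs.

- HtmlUnit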
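- Selenium
   1) Selenium 1 + WebDriver = Selenium
   2) Drives a locally installed browser, so the browser has to be opened
   3) Requires the matching WebDriver, with the webdriver path parameter configured correctly
   4) Clearly unsuitable for crawling large numbers of JS-rendered pages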
- PhantomJs
   1) A great tool, small and capable
   2) Can run locally or as a server
   3) Based on the WebKit engine, with good performance and rendering
   4) Parses the vast majority of pages correctly
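Based on the test results, running PhantomJs as a server is recommended when crawling a large number of tasks. Examples of both local use and remote-server use follow (see also the examples on the official site).

Local use
Write the JS file that should be executed, then call PhantomJS from the command line.
Example:
On Windows:
PhantomJs.exe target.js param1
The corresponding local target.js can follow this example: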
    "use strict";
    var page = require('webpage').create();
    var system = require('system');

    // Expect exactly one argument: the URL of the page to render.
    if (system.args.length !== 2) {
        console.log('Usage: target.js <some URL>');
        phantom.exit(1);
    } else {
        var url = system.args[1];
        page.open(url, function (status) {
            // Print the rendered HTML to stdout, then quit.
            console.log(page.content);
            phantom.exit();
        });
    }
In a Java program, run the command through the console:

    // Requires java.io.BufferedReader, java.io.InputStream, java.io.InputStreamReader.
    // url is the address of the page to render; it is passed to target.js.
    Runtime runtime = Runtime.getRuntime();
    Process p = runtime.exec("D:/phantomjs.exe target.js " + url);
    InputStream is = p.getInputStream();
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    StringBuilder sb = new StringBuilder();
    String tmp;
    // Collect everything PhantomJS prints to stdout (the rendered HTML).
    while ((tmp = br.readLine()) != null) {
        sb.append(tmp);
    }
    return sb.toString();
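When many pages are rendered this way, a slightly more defensive variant of the call above may be useful. The following is only a sketch, not the method used in the tests: the class name `PhantomLocalRunner` and its method names are made up for illustration, and it assumes PhantomJS writes its output as UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PhantomJS or of the original code.
final class PhantomLocalRunner {

    // Builds the argument list: executable, script, then the page URL.
    // Separate arguments avoid quoting problems when a value contains spaces.
    static List<String> buildCommand(String phantomPath, String scriptPath, String pageUrl) {
        return Arrays.asList(phantomPath, scriptPath, pageUrl);
    }

    // Reads a stream to the end, keeping line breaks.
    // Assumption: the bytes are UTF-8 text.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Runs PhantomJS once and returns everything it printed.
    static String render(String phantomPath, String scriptPath, String pageUrl) {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(phantomPath, scriptPath, pageUrl));
        pb.redirectErrorStream(true); // merge stderr into stdout so one reader drains both
        try {
            Process p = pb.start();
            String output = readAll(p.getInputStream());
            p.waitFor(); // wait for the process to exit after its output is drained
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for PhantomJS", e);
        }
    }
}
```

A call would look like `PhantomLocalRunner.render("D:/phantomjs.exe", "target.js", url)`. Unlike the snippet above, this version keeps the line breaks of the rendered HTML, which tends to make the result easier to inspect.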
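Setting up a remote server
Make sure the chosen port is open on the remote server.
Example:
Opening a port, such as 3003, on Alibaba Cloud ECS:
Open the console and add a custom TCP rule to the security group, set the allowed IP range to 0.0.0.0/0, and configure both the inbound and the outbound port.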
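Whether the port is actually reachable can be checked from the local machine with a plain TCP connect. The sketch below is an illustration only; `PortCheck` and `isPortOpen` are hypothetical names. Note that the connect succeeds only if the security group allows the traffic and some process is already listening on the port, so it is most useful once the service on the remote machine has been started.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for checking that a TCP port accepts connections.
final class PortCheck {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unreachable: treat all as "not open".
            return false;
        }
    }
}
```

For the example above, the call would be `PortCheck.isPortOpen(remoteIp, 3003, 3000)`, where `remoteIp` stands for the address of the ECS instance.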
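Steps
1) Download the exe file from the official site to a chosen location (the same applies on Linux)
2) Create a new server.js file
3) Run PhantomJS server.js 3003 on the command line to start the service (server.js takes the port as its only argument)
4) From the local machine, send an HTTP request from a browser or from Java code to get the response. The URL is https://<remote server IP>:<port>/https://<custom URL>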
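The URL layout in step 4 is simple string concatenation: the address of the render server, a slash, then the full URL of the page to render. A minimal sketch of both directions follows; `RenderUrl` and its methods are hypothetical names, and `extractTarget` only mirrors in Java the leading-slash stripping that server.js performs on `request.url`.

```java
// Hypothetical helper showing how the request URL is put together and taken apart.
final class RenderUrl {

    // Client side: serverBase is the render server address (scheme, IP, port),
    // targetUrl is the page that PhantomJS should open.
    static String build(String serverBase, String targetUrl) {
        String base = serverBase.endsWith("/")
                ? serverBase.substring(0, serverBase.length() - 1)
                : serverBase;
        return base + "/" + targetUrl;
    }

    // Server side equivalent: the request path is "/" + targetUrl,
    // so dropping the first character recovers the target URL.
    static String extractTarget(String requestPath) {
        return requestPath.substring(1);
    }
}
```

For example, `RenderUrl.build("http://127.0.0.1:3003", "https://example.com/item?id=1")` yields `http://127.0.0.1:3003/https://example.com/item?id=1`; the address and the target page here are placeholders.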
server.js is the key piece here: it sets the port the server listens on and the logic for answering requests.
Sample server.js code:
    "use strict";
    var page = require('webpage').create();
    var server = require('webserver').create();
    var system = require('system');
    var host, port;

    if (system.args.length !== 2) {
        console.log('Usage: server.js <some port>');
        phantom.exit(1);
    } else {
        port = system.args[1];
        var listening = server.listen(port, function (request, response) {
            console.log("GOT HTTP REQUEST");
            console.log(JSON.stringify(request, null, 4));

            // we set the headers here
            response.statusCode = 200;
            response.headers = {"Cache": "no-cache", "Content-Type": "text/html"};
            // this is also possible:
            response.setHeader("databee", "databee");

            // now we write the body
            // note: the headers above will now be sent implicitly
            //response.write("<title></title>");
            // note: writeBody can be called multiple times

            // request.url arrives in an odd form ("/" followed by the target URL);
            // strip the leading slash to turn it back into a valid URL
            var url = request.url;
            url = url.substring(1);

            page.open(url, function (status) {
                if (status !== 'success') {
                    response.statusCode = 403;
                    response.headers = {
                        'Cache': 'no-cache',
                        'Content-Type': 'text/html'
                    };
                    response.write("FAIL");
                    response.close();
                    console.log('FAIL to load the address');
                } else {
                    response.statusCode = 200;
                    response.headers = {
                        'Cache': 'no-cache',
                        'Content-Type': 'text/html'
                    };
                    //console.log(page.content)
                    response.write(page.content);
                    response.close(); // marks the end of the response; required
                    console.log('Send success');
                }
            });
            //response.close();
        });
        if (!listening) {
            console.log("could not create web server listening on port " + port);
            phantom.exit(); // exit phantom
        }
    }
Java sample for sending the request from the local machine:

    // Requires java.io.BufferedReader, java.io.InputStream, java.io.InputStreamReader,
    // java.net.HttpURLConnection, java.net.URL.
    URL url = new URL(finalUrl); // finalUrl is the URL of the GET request
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    InputStream is = null;
    BufferedReader br = null;
    // On 200 read the body; otherwise read the error stream (the "FAIL" response).
    if (conn.getResponseCode() == 200) {
        is = conn.getInputStream();
    } else {
        is = conn.getErrorStream();
    }
    br = new BufferedReader(new InputStreamReader(is));
    String line = "";
    StringBuilder sb = new StringBuilder();
    while ((line = br.readLine()) != null) {
        sb.append(line);
    }
    return sb.toString();
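For large crawl jobs, a request that never returns can stall a worker, so it may be worth adding timeouts and closing the streams explicitly. The following is a sketch under assumptions rather than the code used in the tests: `RenderClient` and `fetch` are hypothetical names, the timeout value is left to the caller, and the response is assumed to be UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical client for the PhantomJS render server.
final class RenderClient {

    // Sends a GET request to finalUrl and returns the response body as text.
    // For non-200 responses the error body is returned instead.
    static String fetch(String finalUrl, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(finalUrl).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis); // limit on establishing the connection
            conn.setReadTimeout(timeoutMillis);    // limit on waiting for data
            InputStream is = conn.getResponseCode() == 200
                    ? conn.getInputStream()
                    : conn.getErrorStream();
            if (is == null) {
                return ""; // no body was sent with the error response
            }
            StringBuilder sb = new StringBuilder();
            // Assumption: the server sends UTF-8 text.
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(is, StandardCharsets.UTF_8))) {
                String line;
                while ((line = br.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

A call such as `RenderClient.fetch(finalUrl, 30000)` allows 30 seconds each for connecting and for reading; the right value depends on how long the target pages take to render.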
From: bylele > "Web Crawling"