Overview
This section describes the motivation behind Web-Harvest and the concepts it uses.
Rationale
The World Wide Web, though by far the largest knowledge base, is rarely regarded as a database in the traditional sense, that is, as a source of information used for further computing. Web-Harvest is inspired by the practical need to have the right data at the right time. Very often, the Web is the only source that publicly provides the wanted information.
Basic concept
The main goal behind Web-Harvest is to make already existing extraction technologies easier to use. Its purpose is not to propose a new method, but to provide a way to easily use and combine the existing ones. Web-Harvest offers a set of processors for data handling and control flow. Each processor can be regarded as a function: it has zero or more input parameters and gives a result after execution. Processors can be combined in a pipeline, forming a chain of execution. For easier manipulation and data reuse, Web-Harvest provides a variable context where named variables are stored. The following diagram describes one pipeline execution:
[Diagram: one pipeline execution]
The result of an extraction is available in the files created during execution or, if Web-Harvest is used programmatically, from the variable context.
Configuration language
Every extraction process is defined in one or more configuration files, using a simple XML-based language. Each processor is described by a specific XML element or a structure of XML elements. To illustrate, here is an example configuration file:
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <var-def name="urlList">
        <xpath expression="//img/@src">
            <html-to-xml>
                <http url="http://news."/>
            </html-to-xml>
        </xpath>
    </var-def>
    <loop item="link" index="i" filter="unique">
        <list>
            <var name="urlList"/>
        </list>
        <body>
            <file action="write" type="binary" path="images/${i}.gif">
                <http url="${sys.fullUrl('http://news.', link)}"/>
            </file>
        </body>
    </loop>
</config>
This configuration contains two pipelines. The first pipeline performs the following steps:
1. The content of the site at http://news. is downloaded,
2. The downloaded HTML is cleaned up,
3. An XPath expression is used to find the sequence of image URLs in the page,
4. A new variable named urlList is defined, containing the sequence of image URLs.
The second pipeline uses the result of the previous execution in order to collect all page images:
1. The loop processor iterates over the sequence of URLs and, for each item:
2. Downloads the image at the current URL,
3. Saves the image to the file system.
This example illustrates some procedural-language elements of Web-Harvest, such as variable definition and list iteration, a few data management processors (file and http) and a couple of HTML/XML processing instructions (the html-to-xml and xpath processors).
For a slightly more complex example of image download, where some other features of Web-Harvest are used, see the Examples page. For technical coverage of the supported processors, see the User manual.
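The same two-pipeline pattern can also be sketched for text instead of images. The fragment below is a minimal, hypothetical sketch rather than an example from the Web-Harvest documentation: the URL http://www.example.com/, the variable name linkList and the output path links.txt are assumptions made for illustration. It reuses only processors that appear in the example above (var-def, xpath, html-to-xml, http, file and var).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- Pipeline 1: download the page, clean it, collect link targets.
         The URL is a placeholder, not a real data source. -->
    <var-def name="linkList">
        <xpath expression="//a/@href">
            <html-to-xml>
                <http url="http://www.example.com/"/>
            </html-to-xml>
        </xpath>
    </var-def>

    <!-- Pipeline 2: write the collected values to a file.
         No type attribute is given, so the content is handled as text. -->
    <file action="write" path="links.txt">
        <var name="linkList"/>
    </file>
</config>
```

Because the file element carries no type attribute here, the content is written as text, which is the file processor's default data type; the image example above needs type="binary" for the opposite reason.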
Data values
All data produced and consumed during an extraction process in Web-Harvest has three representations: text, binary and list. There is also a special data value, empty, whose text representation is an empty string, whose binary representation is an empty byte array, and whose list representation is a zero-length list. Which form is used depends on the processor that consumes the data. In the previous configuration, the html-to-xml processor uses the downloaded content as text in order to transform it to XML, the loop processor uses the variable urlList as a list in order to iterate over it, and the file processor treats the downloaded images as binary data when saving them to files. In most cases Web-Harvest chooses the proper representation of the data. In some situations, however, it must be stated explicitly which one to use. One example is the file processor, where the default data type is text and binary content must be explicitly specified with type="binary".
Variables
Web-Harvest provides a variable context for storing and using variables. Unlike most programming languages, it imposes no special convention for naming variables, so names like arr[1], 100 or #$& are valid. However, if such variables were used in scripts or templates (see the next section), where expressions are dynamically evaluated, an exception would be thrown. It is therefore recommended to follow the usual programming-language naming rules in order to avoid any difficulties.
When Web-Harvest is used programmatically (from Java code, not from the command line), the user may initialize the variable context in order to add custom values and functionality. Similarly, after execution, the variable context is available for reading variables from it.
When user-defined functions are called (see the User manual), a separate local variable context is created (as in many programming languages, including Java). The valid way to exchange data between the caller and the called function is through the function parameters.
Scripting and templating
Before Web-Harvest 0.5, the templating mechanism was based on OGNL (Object-Graph Navigation Language). As of version 0.5, OGNL is replaced by BeanShell, and starting from version 1.0 multiple scripting languages are supported, giving developers the freedom to choose their favourite.
Besides the set of powerful text and XML manipulation processors, Web-Harvest supports real scripting languages whose code can be easily integrated within scraper configurations. The languages currently supported are BeanShell, Groovy and Javascript. BeanShell is probably the closest to Java in syntax and power, but Groovy and Javascript have other advantages. It is up to the developer to use a preferred language or even to mix different languages in a single configuration.
Templating allows the evaluation of marked parts of text (text "islands" surrounded by ${ and }). Evaluation is performed using the chosen scripting language. In Web-Harvest all element attributes are implicitly passed to the templating engine. In the configuration above, there are two places where the templater does its job:
1. path="images/${i}.gif" in the file processor, which names each output file after the loop index,
2. url="${sys.fullUrl('http://news.', link)}" in the http processor, which computes the full URL of each image.
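To show how the variable context, user-defined functions and templated attributes might fit together, here is a minimal sketch. It assumes the function, return, call and call-param elements that the User manual covers; the function name fetch-links, the parameter name pageUrl, the variable name links and the URL http://www.example.com/ are all hypothetical and chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- Hypothetical function: its body runs in a separate local
         variable context and receives data only through parameters. -->
    <function name="fetch-links">
        <return>
            <xpath expression="//a/@href">
                <html-to-xml>
                    <!-- The url attribute is a template: ${pageUrl} is
                         evaluated before the http processor runs. -->
                    <http url="${pageUrl}"/>
                </html-to-xml>
            </xpath>
        </return>
    </function>

    <!-- Caller: passes the page address as a parameter and stores the
         function result in its own variable context. -->
    <var-def name="links">
        <call name="fetch-links">
            <call-param name="pageUrl">http://www.example.com/</call-param>
        </call>
    </var-def>
</config>
```

The function body can use pageUrl because it is passed as a parameter; variables defined in the caller's context should not be relied on inside the function, which is why the result comes back through return and is stored by the caller with var-def.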