A Pile of Information Extraction Resources

hustlg 2006-03-28


http://FullSearch.Com Chinese Full-Text Search Network 2005-11-25 14:19:09 sigz

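A "pile" means it is unsorted, just heaped together. I did not write these items; I collected them.
I will keep adding to it here, and it will still be a "pile". Take a look if you are interested; if not, just leave it alone.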

If anyone has good papers, bring them out and share them with everyone.

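1.An Overview of Web Information Extraction Technology (download)
Original by Line Eikvil (July 1999); translated by Chen Hongbiao (March 2003)
Information extraction (IE) turns the information contained in text into a structured, table-like form. The input to an IE system is raw text; the output is information points in a fixed format. Information points are extracted from all kinds of documents and then integrated in a uniform form. This is the main task of information extraction...
Chapter 1: Introduction
Chapter 2: A brief introduction to information extraction technology
Chapter 3: The development of web page wrappers
Chapter 4: Web site information extraction systems that have already been developed
Chapter 5: The application areas of information extraction technology and the first systems to enter commercial operation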
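
The "raw text in, fixed-format information points out" idea in item 1 can be illustrated with a minimal sketch. The pattern, the field names (person, organization, position), and the sample sentences are all hypothetical choices made for illustration; they are not taken from Eikvil's survey, and real systems use far richer analysis than a single regular expression.

```python
import re
from typing import Dict, List

# Hypothetical pattern for statements such as
# "Alice Chen joined Acme Corp as director."
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) joined "
    r"(?P<organization>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) as "
    r"(?P<position>[a-z]+(?: [a-z]+)*)"
)


def extract_records(text: str) -> List[Dict[str, str]]:
    """Return one fixed-format record per matching statement in the raw text."""
    return [match.groupdict() for match in PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = (
        "Alice Chen joined Acme Corp as director. "
        "Later, Bob Li joined Beta Labs as chief engineer."
    )
    for record in extract_records(sample):
        print(record)
```

Every record has the same three fields, so records taken from different documents can be collected into one table.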

2.Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence
Silviu Cucerzan, David Yarowsky
A language-independent method for named entity recognition.

3.A Survey of Information Extraction Research
Wang Jianhui's research on improving automatic summarization algorithms.

4.A Survey of Information Extraction
A report introducing information extraction (IE), covering MUC, Web extraction, and related topics.

5.FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text
This paper describes FASTUS, a system for extracting information from natural-language text; the extracted information is entered into a database or used for other purposes.

6.MUC-7 Information Extraction Task Definition
Definition of the MUC-7 information extraction task.

7.OVERVIEW OF MUC-7/MET-2
This paper gives a brief overview of the MUC-7/MET-2 tasks.

8.Information Extraction: Techniques and Challenges
This paper introduces information extraction (IE) technology (18 pages).

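9.A Survey of Information Extraction Research
Li Baoli, Chen Yuzhong, Yu Shiwen
Abstract: Research on information extraction aims to give people more powerful tools for obtaining information, in order to cope with the serious challenge of the information explosion. Unlike information retrieval, information extraction pulls factual information directly out of natural-language text. Over the past ten-plus years, information extraction has gradually grown into an important branch of natural language processing. Its distinctive course of development, in which systematic, large-scale quantitative evaluation drives research forward, together with certain lessons from its successes, such as the effectiveness of partial parsing and the need for rapid NLP system development, has greatly advanced NLP research and brought NLP research and applications closer together. Reviewing the history of information extraction research and summarizing its current state will help this line of work move forward.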

10.Class-based Language Modeling for Named Entity Identification (Draft)
Jian Sun, Ming Zhou, Jianfeng Gao

(Accepted by special issue "Word Formation and Chinese Language processing" of the International Journal of Computational Linguistics and Chinese Language Processing) Abstract: We address in this paper the problem of Chinese named entity (NE) identification using class-based language models (LM). This study is concentrated on three kinds of NEs that are most commonly used, namely, personal name (PER), location name (LOC) and organization name (ORG). Our main contributions are three-fold: (1) In our research, Chinese word segmentation and NE identification have been integrated into a unified framework. It consists of several sub-models, each of which in turn may include other sub-models, leads to the overall model a hierarchical architecture. The class-based hierarchical LM not only effectively captures the features of named entities, but also handles the data sparseness problem. (2) Modeling for NE abbreviation is put forward. Our modeling-based method for NE abbreviation has significant advantages over rule-based ones. (3) In addition, we employ a two-level architecture for ORG model, so that the nested entities in organization names can be identified. When decoding, two-step strategy is adopted: identifying PER and LOC; and identifying ORG. The evaluation on a large, wide-coverage open-test data has empirically demonstrated that the class-based hierarchical language modeling, which integrates segmentation and NE identification, unifies the abbreviation modeling into one framework, has achieved competitive results of Chinese NE identification.

11.BBN's Information Extraction System SIFT (detailed description in Chinese)
Scott Miller, Michael Crystal, Heidi Fox, Lance Ramshaw, Richard Schwartz,
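This is the description of SIFT, the system BBN entered in the MUC-7 evaluation. I have translated it; the basic meaning is clear, but I may not have captured some details accurately. If you find problems, please write to me.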

12.(slides) Chinese Named Entity Identification using class-based language model
Jian Sun, Jianfeng Gao, Lei Zhang, Ming Zhou, and Changning Huang
These are the slides for the 19th International Conference on Computational Linguistics.

13.Chinese Named Entity Identification using class-based language model
Jian Sun, Jianfeng Gao, Lei Zhang, Ming Zhou, and Changning Huang
We consider here the problem of Chinese named entity (NE) identification using statistical language model (LM). In this research, word segmentation and NE identification have been integrated into a unified framework that consists of several class-based language models. We also adopt a hierarchical structure for one of the LMs so that the nested entities in organization names can be identified. The evaluation on a large test set shows consistent improvements. Our experiments further demonstrate the improvement after seamlessly integrating with linguistic heuristic information, cache-based model and NE abbreviation identification.

14.MUC-7 EVALUATION OF IE TECHNOLOGY: Overview of Results
Elaine Marsh, Dennis Perzanowski
Reviews MUC-7 and presents the results and progress reported at the conference.

15.Method of k-Nearest Neighbors

16.Multilingual Topic Detection and Tracking: Successful Research Enabled by Corpora and Evaluation
Charles L. Wayne
Topic Detection and Tracking (TDT) refers to automatic techniques for locating topically related material in streams of data such as newswire and broadcast news. DARPA-sponsored research has made enormous progress during the past three years, and the tasks have been made progressively more difficult and realistic. Well-designed corpora and objective performance evaluations have enabled this success.

17.An Overview of Information Extraction
A survey report by Luo Weihua.

18.Information Extraction Supported Question Answering
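The IE system from Cymfony, aimed mainly at question answering (QA); it includes the NE system already implemented and prototypes of the CE and GE components still to be implemented.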

19.ALGORITHMS THAT LEARN TO EXTRACT INFORMATION

20.Description of the American University in Cairo's System Used for MUC-7

21.Analyzing the Complexity of a Domain With Respect To An Information Extraction Task

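22.Learning Information Extraction Rules for Semi-structured and Free Text

The author, Stephen Soderland, is a professor in the computer science department at the University of Washington. This paper has been cited more than 50 times. Taking the information extraction system WHISK as its example, the paper describes a technique for automated information extraction in which the system is trained on a small set of samples and uses machine learning to learn extraction patterns for the target text automatically. The technique is both highly instructive and of real practical value.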

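23.A Survey of Information Extraction Research

This paper comes from the Department of Computer Science and Technology at Peking University and surveys some basic concepts of information extraction.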

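24.Visual Information Extraction with Lixto

The authors analyze the architecture of the Lixto extraction system and introduce a semi-automatic wrapper generation technique together with an automated Web information extraction technique.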

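25.A Survey of Web Data Extraction Tools

The authors group current Web data extraction tools into six categories (languages for wrapper development, HTML-aware tools, NLP-based tools, wrapper induction tools, modeling-based tools, and semantics-based tools), describe the working principles and characteristics of each tool in turn, and compare the general quality of their output.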

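26.Extraction and Annotation for Short BBS Texts

The first half of this paper introduces concepts related to ontologies; the second half describes how ontologies are applied in our system. To support information extraction, some prior knowledge and statistical information are needed, so we built our own extraction and annotation tool for short BBS texts. For this we constructed ontology knowledge and present it in an intuitive form. Combined with an ontology reasoner, our annotation tool can perform inference while annotating, which makes annotation more intelligent, and it can preview extraction results by calling a packaged extraction algorithm.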

27.XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources

Ling Liu, Calton Pu, Wei Han

This paper describes the methodology and the software development of XWRAP, an XML-enabled wrapper construction system for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates tasks of building wrappers that are specific to a Web source from the tasks that are repetitive for any source, and uses a component library to provide basic building blocks for wrapper programs. Second, it provides a user-friendly interface program to allow wrapper developers to generate their wrapper code with a few mouse clicks. Third and most importantly, we introduce and develop a two-phase code generation framework. The first phase utilizes an interactive interface facility to encode the source-specific metadata knowledge identified by individual wrapper developers as declarative information extraction rules. The second phase combines the information extraction rules generated at the first phase with the XWRAP component library to construct an executable wrapper program for the given web source. We report the initial experiments on performance of the XWRAP code generation system and the wrapper programs generated by XWRAP.

28.Data Mining on Symbolic Knowledge Extracted from the Web

Rayid Ghani, Rosie Jones, Dunja Mladenić, Kamal Nigam, Seán Slattery
Information extractors and classifiers operating on unrestricted, unstructured
texts are an errorful source of large amounts of potentially
useful information, especially when combined with a crawler which
automatically augments the knowledge base from the world-wide
web. At the same time, there is much structured information on the
World Wide Web. Wrapping the web-sites which provide this kind of
information provide us with a second source of information; possibly
less up-to-date, but reliable as facts. We give a case study of combining
information from these two kinds of sources in the context
of learning facts about companies. We provide results of association
rules, propositional and relational learning, which demonstrate
that data-mining can help us improve our extractors, and that using
information from two kinds of sources improves the reliability of
data-mined rules.

29.A Brief Survey of Web Data Extraction Tools
Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira

In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval,...

30.Toward Semantic Understanding: An Approach Based on Information Extraction Ontologies
Information is ubiquitous, and we are flooded with more than we can process. Somehow, we must rely less on visual processing, point-and-click navigation, and manual decision making and more on computer sifting and organization of information and automated negotiation and decision making. A resolution of these problems requires software with semantic understanding, a grand challenge of our time. More particularly, we must solve problems of automated interoperability, integration, and knowledge sharing, and we must build information agents and process agents that we can trust to give us the information we want and need and to negotiate on our behalf in harmony with our beliefs and goals.
This paper proffers the use of information-extraction ontologies as an approach that may lead to semantic understanding.
Keywords: Semantics, information extraction, high-precision classification, schema mapping, data integration, Semantic Web, agent communication, ontology, ontology generation.

31.HowNet-based Extraction of Chinese Message Structures
The Chinese message structure is composed of several Chinese fragments, which may be characters, words, or phrases. Every message structure carries certain information. We have developed a HowNet-based extractor that can extract Chinese message structures from a real text and serves as an interactive tool for building a large-scale bank of Chinese message structures. The system utilizes the HowNet Knowledge System as its basic resource. It is an integrated system of a rule-based analyzer, statistics based on the examples, and the analogy given by a HowNet-based concept similarity calculator.
Keywords: Chinese message structure; Knowledge Database Mark-up Language (KDML); parsing; chunk

32.Wrapper induction: Efficiency and expressiveness (Extended abstract)

Recently, many systems have been built that automatically interact with Internet information resources. However, these resources are usually formatted for use by people; e.g., the relevant content is embedded in HTML pages. Wrappers are often used to extract a resource's content, but hand-coding wrappers is tedious and error-prone. We advocate wrapper induction, a technique for automatically constructing wrappers. We have identified several wrapper classes that can be learned quickly (most sites require only a handful of examples, consuming a few CPU seconds of processing), yet which are useful for handling numerous Internet resources ([…] of surveyed sites can be handled by our techniques).
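
To make the idea of a learnable "wrapper class" concrete, here is a minimal sketch of running one very simple kind of wrapper: one (left, right) delimiter pair per attribute, applied repeatedly across a page. The abstract above does not define its wrapper classes, so this delimiter-based form, the function name `run_lr_wrapper`, and the sample page are assumptions made for illustration. The induction step, which would learn the delimiters from a handful of labeled examples, is not shown.

```python
from typing import List, Sequence, Tuple


def run_lr_wrapper(
    page: str, delimiters: Sequence[Tuple[str, str]]
) -> List[Tuple[str, ...]]:
    """Extract tuples from a page using one (left, right) delimiter pair per attribute.

    Scanning stops as soon as any delimiter can no longer be found.
    """
    rows: List[Tuple[str, ...]] = []
    if not delimiters:
        return rows
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)  # continue scanning after this attribute
        rows.append(tuple(row))


if __name__ == "__main__":
    # Hypothetical page listing names and numeric codes.
    page = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
    print(run_lr_wrapper(page, [("<b>", "</b>"), ("<i>", "</i>")]))
```

Because the whole wrapper is just a short list of delimiter strings, a learner only has to find strings that consistently surround the labeled values, which is one reason such wrappers can plausibly be induced from few examples.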

33.WysiWyg Web Wrapper Factory (W4F)

In this paper, we present the W4F toolkit for the generation of wrappers for Web sources. W4F consists of a retrieval language to identify Web sources, a declarative extraction language (the HTML Extraction Language) to express robust extraction rules and a mapping interface to export the extracted information into some user-defined data-structures. To assist the user and make the creation of wrappers rapid and easy, the toolkit offers some wysiwyg support via some wizards. Together, they permit the fast and semi-automatic generation of ready-to-go wrappers provided as Java classes. W4F has been successfully used to generate wrappers for database systems and software agents, making the content of Web sources easily accessible to any kind of application.

34.Adaptive Information Extraction from Text by Rule Induction and Generalisation
(LP)2 is a covering algorithm for adaptive Information Extraction from text (IE). It induces symbolic rules that insert SGML tags into texts by learning from examples found in a user-defined tagged corpus. Training is performed in two steps: initially a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in tagging. Induction is performed by bottom-up generalization of examples in the training corpus. Shallow knowledge about Natural Language Processing (NLP) is used in the generalization process. The algorithm has a considerable success story. From a scientific point of view, experiments report excellent results with respect to the current state of the art on two publicly available corpora. From an application point of view, a successful industrial IE tool has been based on (LP)2. Real world applications have been developed and licenses have been released to external companies for building other applications. This paper presents (LP)2, experimental results and applications, and discusses the role of shallow NLP in rule induction.
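
The two kinds of rules mentioned in the abstract (tagging rules that insert SGML tags, and correction rules that repair imprecise tag placement) can be sketched as follows. This only shows how such rules might be applied once they exist; the rule representation, the `<stime>` tag, and the sample sentence are hypothetical, and the induction of rules from a tagged corpus, which is the actual contribution of (LP)2, is not shown.

```python
from typing import List, Tuple

# A tagging rule: (left context, right context, tag to insert between them).
TaggingRule = Tuple[List[str], List[str], str]
# A correction rule: if `tag` sits directly before `token`, move it one step right.
CorrectionRule = Tuple[str, str]


def matches(tokens: List[str], i: int, left: List[str], right: List[str]) -> bool:
    """True if `left` ends just before position i and `right` starts at i."""
    return (
        i >= len(left)
        and tokens[i - len(left):i] == left
        and tokens[i:i + len(right)] == right
    )


def apply_tagging_rules(tokens: List[str], rules: List[TaggingRule]) -> List[str]:
    """Insert each rule's tag at every position where its context matches."""
    out: List[str] = []
    for i in range(len(tokens) + 1):
        for left, right, tag in rules:
            if matches(tokens, i, left, right):
                out.append(tag)
        if i < len(tokens):
            out.append(tokens[i])
    return out


def apply_correction_rules(
    tagged: List[str], corrections: List[CorrectionRule]
) -> List[str]:
    """Shift a misplaced tag one token to the right where a correction applies."""
    out = list(tagged)
    for tag, token in corrections:
        for i in range(len(out) - 1):
            if out[i] == tag and out[i + 1] == token:
                out[i], out[i + 1] = out[i + 1], out[i]
    return out


if __name__ == "__main__":
    tokens = "the seminar starts at 4 pm".split()
    rules = [(["starts"], ["at"], "<stime>"), (["pm"], [], "</stime>")]
    tagged = apply_tagging_rules(tokens, rules)
    print(" ".join(tagged))
    print(" ".join(apply_correction_rules(tagged, [("<stime>", "at")])))
```

In this toy run the first tagging rule places the opening tag one token too early, and the correction rule moves it past "at", which is the kind of imprecision the second training step is described as addressing.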

35.Advanced Web Technology Information Extraction

Article URL: http://www.FullSearcher.Com/n200511171744735.asp

Site URL: http://www.FullSearcher.Com/

Source: fullsearcher
