AI & Shopping
@tableofcontents(omitlevel1)
Research
'''Information extraction: distilling structured data from unstructured text'''
!McCallum, Andrew, ACM Queue vol. 3, no. 9 pgs. 48 November 2005
!
Full Article
:In 2001 the U.S. Department of Labor was tasked with building a Web site that would help people find continuing education opportunities at community colleges, universities, and organizations across the country. The department wanted its Web site to support fielded Boolean searches over locations, dates, times, prerequisites, instructors, topic areas, and course descriptions. Ultimately it was also interested in mining its new database for patterns and educational trends. This was a major data-integration project [...] a daunting technical challenge, but one that was ultimately solved successfully. More details about the solution follow, but first, let's place this problem in context.
'''AUTOMATED SEMANTIC PARTITIONING OF WEB DOCUMENTS'''
!Nagarajan, Saravanakumar, 2004
!
Full Article
:The Semantic Web is a vision for the web as an information source with well-defined meaning for all of its content, thereby enabling machine-to-machine exchange and automated processing of information. For transforming the current HyperText Markup Language (HTML) web into the Semantic Web, there is a need to annotate all the web documents using standards like Resource Description Framework (RDF) and Extensible Markup Language (XML). The transformation of HTML to XML cannot be done manually, and hence there is a need for (semi-)automatic transformation techniques. This thesis proposes a “Semantic Partitioner” algorithm, which uses the structural and presentation regularities of HTML documents to automatically transform them into so-called “semantic partitions”. A semantic partition of a web document is a hierarchical representation of all of its concepts, instances, and attributes with their values. Also, this thesis discusses how semantic partitioning can be used as a preprocessing step for information integration applications like Meteor, OntoMiner [OM 04], DataRover [DR 03], and Hearsay [MUK 03].
'''Roadrunner: Towards Automatic Data Extraction from Large Web Sites'''
!V. Crescenzi, G. Mecca, and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. Technical Report n. RT2001, D.I.A., Università di Roma Tre, 2001.
!
Full article
:The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life data-intensive Web sites confirm the feasibility of the approach.
!
Works that cite this paper (248 cites)
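A rough illustration of the idea in the Roadrunner abstract (compare pages, keep what they share, treat what differs as data). This is only a minimal sketch and not the paper's algorithm: it aligns two token streams with Python's `difflib` and assumes the pages differ only in single data fields, with no optional or repeated sections. The function names `tokenize`, `induce_wrapper`, and `extract` are made up for this note.

```python
import re
from difflib import SequenceMatcher

# A token is either an HTML tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<\s][^<]*")

def tokenize(html):
    """Split HTML into tag tokens and text tokens, dropping surrounding whitespace."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]

def induce_wrapper(page_a, page_b):
    """Keep tokens common to both pages; each differing run becomes a field slot (None)."""
    a, b = tokenize(page_a), tokenize(page_b)
    wrapper = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            wrapper.extend(a[i1:i2])
        else:
            wrapper.append(None)  # mismatch between the pages: assume it is data
    return wrapper

def extract(wrapper, page):
    """Apply the wrapper to a new page and return the text found in each slot."""
    tokens = tokenize(page)
    fields, pos = [], 0
    for k, item in enumerate(wrapper):
        if item is not None:
            if pos >= len(tokens) or tokens[pos] != item:
                raise ValueError(f"page does not match wrapper at token {pos}")
            pos += 1
            continue
        # Field slot: consume tokens until the next constant token of the wrapper.
        stop = wrapper[k + 1] if k + 1 < len(wrapper) else None
        start = pos
        while pos < len(tokens) and tokens[pos] != stop:
            pos += 1
        fields.append(" ".join(tokens[start:pos]))
    return fields

page1 = "<html><b>Title:</b> Dune <i>Author:</i> Herbert</html>"
page2 = "<html><b>Title:</b> Emma <i>Author:</i> Austen</html>"
page3 = "<html><b>Title:</b> Ulysses <i>Author:</i> James Joyce</html>"

wrapper = induce_wrapper(page1, page2)
print(extract(wrapper, page3))
```

The wrapper is learned from two sample pages only and then applied to a third page from the same (hypothetical) site; real sites would need handling for optional and repeated blocks.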
'''Extracting structured data from Web pages'''
!
Full Text
!Arvind Arasu & Hector Garcia-Molina
:Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated pages. The algorithm deduces the unknown template used to generate the pages and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
'''Semi-Automated Extraction of Targeted Data from Web Pages'''
!From
IEEE Digital Library
!Fabrice Estievenart, Jean-[...], "Semi-Automated Extraction of Targeted Data from Web Pages," icdew, p. 48, 22nd International Conference on Data Engineering Workshops (ICDEW'06), 2006.
:The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet, if the main standard of publication on the Web (HTML) is quite suited to human reading, its poor semantics makes it difficult for computers to process and use embedded data in a smart and automated way. In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of recurring types of information in a cluster of similar Web documents and their location in those documents. Relying on these rules, HTML-[...]-friendly tool called Retrozilla.
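To make the notion of a mapping rule concrete: a rule pairs a semantic label with a location inside pages of the same cluster. The sketch below is an assumption for illustration only; the abstract does not give Retrozilla's rule format. Here a rule is a field name mapped to an ElementTree path, and the page is assumed to be well-formed XHTML. `RULES` and `apply_rules` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping rules: semantic label -> location (ElementTree path).
RULES = {
    "name": ".//h1",
    "price": ".//span[@class='price']",
}

def apply_rules(rules, xhtml):
    """Return one record per page: each label mapped to the text found at its location."""
    root = ET.fromstring(xhtml)
    record = {}
    for field, path in rules.items():
        node = root.find(path)
        # A missing node or empty element yields None rather than an error.
        record[field] = node.text.strip() if node is not None and node.text else None
    return record

page = (
    "<html><body><h1>Blue Mug</h1>"
    "<p>Ships in 2 days. <span class='price'>7.50</span></p>"
    "</body></html>"
)
print(apply_rules(RULES, page))
```

Because the rules describe locations rather than values, the same `RULES` dictionary can be reused on every page in the cluster that shares this layout.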
'''Searching with Numbers''' (July/Aug 2003)
: A large fraction of the useful Web is comprised of specification documents that largely consist of (attribute name, numeric value) pairs embedded in text. Examples include product information, classified advertisements, resumes, etc. The approach taken in the past to search these documents by first establishing correspondences between values and their names has achieved limited success because of the difficulty of extracting this information from free text. We propose a new approach that does not require this correspondence to be accurately established. Provided the data has "low reflectivity", we can do effective search even if the values in the data have not been assigned attribute names and the user has omitted attribute names in the query. We give algorithms and indexing structures for implementing the search. We also show how hints (i.e., imprecise, partial correspondences) from automatic data extraction techniques can be incorporated into our approach for better accuracy on high reflectivity data sets. Finally, we validate our approach by showing that we get high precision in our answers on real data sets from a variety of domains.
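A toy version of searching on numbers alone, with no attribute names in either the documents or the query. This is a simplified sketch, not the paper's algorithms or index structures: each document is reduced to the bag of numbers it contains, and a query number is scored by its relative distance to the closest number in the document. The names `numbers_in`, `score`, and `search` and the sample product texts are invented for this note.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numbers_in(text):
    """Pull every integer or decimal out of free text, ignoring the attribute names."""
    return [float(n) for n in NUMBER.findall(text)]

def score(query_nums, doc_nums):
    """Sum, over the query numbers, of the relative distance to the closest document number."""
    if not doc_nums:
        return float("inf")
    return sum(
        min(abs(q - d) / max(abs(q), 1e-9) for d in doc_nums)
        for q in query_nums
    )

def search(query, docs, k=1):
    """Rank documents by how closely their numbers match the numbers in the query."""
    q = numbers_in(query)
    return sorted(docs, key=lambda d: score(q, numbers_in(d)))[:k]

docs = [
    "Laptop with 16 GB memory, 512 GB disk, weighs 1.4 kg",
    "Laptop with 8 GB memory, 256 GB disk, weighs 2.1 kg",
    "Desktop with 32 GB memory, 2000 GB disk",
]
print(search("16 512", docs))
print(search("8 250", docs))
```

This only works well when numbers for different attributes fall in different ranges, which is roughly the intuition behind the "low reflectivity" condition in the abstract.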
'''A hybrid method for Web data extraction'''
!This paper appears in: Web Intelligence, 2003. WI 2003. Proceedings. IEEE/WIC International Conference on (13 420)
!Available for purchase at
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&arnumber=1241229&isnumber=27823
: Web data extraction refers to the technology that helps people find wanted information from the Web. We first classify existing data extraction algorithms into two classes: top-[...]-up algorithm, but satisfactory performance.
'''Structured and semantic data extraction from Web pages'''
!Yong Gan; Su 2935 vol.5, Digital Object Identifier 10.1109/ICMLC.2004.1378533
!
for purchase
: With the development of the Internet, the Web has become an invaluable information source. In order to use this information for more than human browsing, Web pages in HTML must be converted into a format meaningful to software programs. Wrappers have been a useful technique to convert HTML documents into semantically meaningful XML files. In this paper, we propose a data extraction approach based on a user pre-defined schema, which automatically generates a wrapper to extract data from an HTML document and produce an XML document conforming to a given DTD. After the user defines the extraction data schema in the form of a DTD, the wrapper is generated automatically with the induction and learning algorithm. The experiment indicates that the approach can extract the required data from the source document with high accuracy.
'''Semi-structured data extraction and schema knowledge mining'''
!This paper appears in: EUROMICRO Conference, 1999. Proceedings. 25th
: It is well known that the World Wide Web has become a huge information resource. Therefore, it is very important for us to utilize this kind of information effectively. This paper proposes a semi-[...]-structured data. This knowledge can make users understand the information structure on the web more deeply and thoroughly. At the same time, it can also provide a kind of effective schema for the querying of web information.
'''Data extraction from Web data sources''' Buy online
here
!Robinson, J. 288)
: An explanation is given of the basic data structures used in a new page analysis technique to create wrappers (data extractors) for the result pages produced by Web sites in response to user queries via Web page forms. The key structure, called a tpGrid, is a representation of the web page which is easier to analyse than the raw HTML code. The analysis looks for repetition patterns of sets of tagSets, which are defined in the paper.
'''Reengineering Web applications to Web-service providers''' (April 2006)
!
Full Citation
: Web services are the latest technology to integrate applications through the Internet. Many B2C services backed by web applications could be reused in an application integration scenario. Correctly and effectively migrating those web applications that sit behind a presentation layer poses a challenging problem for researchers of software engineering. ServiceBuilder is a tool that can automatically generate a web-[...]-mining techniques on the collected response documents and with little user involvement, generates data extraction rules for data around predefined labels of interest. Finally, ServiceBuilder generates a wrapper that at run time forwards the service requester's input data to the web application, extracts the output data from the responding document, and returns it to the service requester.
'''Data Mining Technologies for Digital Libraries and Web Information Systems'''
!
Full citation
!Ramakrishnan Srikant P. Lim, S. Foo, C. Khoo, H. Chen, E. Fox, S. Urs, T. Costantino (Eds.):
: In the first half of the talk, I will discuss data mining technologies that can result in better browsing and searching. Consider the problem of merging documents from different categorizations (taxonomies) into a single master categorization. Current classifiers ignore the implicit similarity information present in the source categorizations. I will show that by incorporating this information into the classification model, classification accuracy can be substantially improved [1]. Next, I will demonstrate novel search technology that treats numbers as first-[...]-rich documents [2].
S. Berchtold, C. Bohm, D. Keim, F. Krebs, and H. P. Kriegel (2001), On optimizing nearest neighbor queries in high-[...]449.
'''Crawling the Client-Side Hidden web''' (2004)
!
Full Article
: There is a great amount of information on the web that cannot be accessed by conventional crawler engines. This portion of the web is usually called hidden web data. To be able to deal with this problem, it is necessary to solve two tasks: crawling the client-[...]-side hidden web, dealing with aspects such as JavaScript technology, non-[...]-up menus, etc. Our approach leverages current browser APIs and implements novel crawling models and algorithms.
'''Semi-Automatic Wrapper Generation for Commercial Web Sources''' (2002)
!
Full Article
: Semi-[...]-programmer staff to successfully wrap more than 700 commercial web sources in several industrial applications. We describe our approach for wrapper generation and show the difficulties found with other systems for wrapping this kind of source.
'''Medical data extraction and organization from the Internet'''
!Buy online
here
!Riano, D. & Gramajo, J. 384)
: The huge amount of medical data on the Internet requires intelligent tools to retrieve and process medical information. In this paper, we introduce GINY, a computer system that integrates several procedures to deal with the structured, semi-[...]-structured documents that can be found on the Internet. From an ontology that describes a medical concept and its properties, the system retrieves related web pages, analyzes the contents of the pages, and organizes the extracted data in a relational database (RDB) or a Resource Description Framework (RDF) document. The stages of the process can be made partially or totally automatic.
'''Other resources:'''
*
IBM Almaden Research Center
Conferences
Events
Only listing conferences in the US (for now)
*
Twenty Eliot spoke to them, no go
* 4th International Conference on Computer Science and its Applications (ICCSA 2006)
** 27 to 29 June 2006
** San Diego, CA
** Full paper (Electronic Submission or by mail) due: '''May 01, 2006'''
*** Abstracts for Posters (Electronic Submission or by mail): May 01, 2006
*** Workshop, Tutorial, Panel Proposals due: May 01, 2006
*** Notification of Acceptance: May 15, 2006
*** PreReady paper/abstract due: May 28, 2006
*
IASTED International Conference on Computational Intelligence
** November 20-22, 2006
** San Francisco, CA
** Submissions due: June 15, 2006
*** Notification of acceptance: August 1, 2006
*** Camera-ready manuscripts due: September 1, 2006
*** Registration Deadline: September 15, 2006
** SPECIAL SESSION: "Natural Language Processing for Real Life Applications" German Research Center for Artificial Intelligence, Germany
*
Semantic Web for Collaborative Knowledge Acquisition
** October 12-15, 2006
** Hyatt Crystal City, Arlington, VA
** Paper submission: May 1, 2006
*** Notification of acceptance: May 22, 2006
*** Camera ready papers: June 2, 2006
** Potential participants are invited to submit full papers (up to 8 pages in length), poster summaries or extended abstracts (1-2 pages in length) by May 1, 2006
*
15th ACM Conference on Information and Knowledge Management (CIKM 2006)
** November 6-11, 2006
** Sheraton Crystal City Hotel, Arlington, VA
** Deadline for research and industrial paper submissions: May 31, 2006
*** Workshop Proposals due to the Workshop Chair: March 30, 2006
*** Tutorial Proposals due to the Tutorial Chair: May 31, 2006
*** Notification to authors: August 1, 2006
*** Final Camera-ready version due (research paper and Industry track): August 31, 2006
** Topics: Databases, information retrieval, and knowledge management (including novel data mining algorithms)
*
Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining MISSED THIS ONE
** August 20 - 23, 2006
** Philadelphia, PA
** Research Track abstracts due: March 3, 2006
*** Industrial Track abstracts due: March 6, 2006
*** Paper submissions and Proposals due: March 10, 2006
*
7th International Workshop on Multimedia Data Mining: Presented in conjunction with KDD RELEVANT?
** August 20, 2006
** Philadelphia, PA
** Submissions due: May 31
*** Acceptance: June 21
*** Camera ready copy: July 11
*
International Conference on Tools with Artificial Intelligence
** November 13-15, 2006
** Arlington, VA
** June 15, 2006 Deadline for paper submissions.
*** August 16, 2006 Notification of acceptance.
*** September 10, 2006 Final camera-ready paper & author registration due.
*** November 13-15, 2006 ICTAI 2006 conference.
Calendars
*
http://www.conferencealerts.com/ai.htm
*
http://cs.nju.edu.cn/zhouzh/zhouzh.files/ai_resource/cfp.htm
*
http://www.cs.wisc.edu/areas/ai/conf.html
Carlos Learning
From Google Scholar:
Doorenbos, R., Etzioni, O., Weld, D.: '''A scalable comparison-shopping agent for the World-Wide Web'''. Proc. First International Conference on Autonomous Agents (1997) 39–48
'''Automatic information extraction from semi-structured Web pages by pattern discovery'''
Authors: [...]-Cheng Lui, ChungHwa Telecommunication Laboratories, Yangmei, Tauyuan 326, Taiwan. Publisher: Elsevier Science Publishers B. V., Amsterdam, The Netherlands
'''MORPHEUS: a more scalable comparison-shopping agent'''
Authors: Jaeyoung Yang (Dept. of Computer Science and Engineering, Hanyang University, Korea); Heekyoung Seo (HCI Laboratory, Samsung Advanced Institute of Technology, Korea); Joongmin Choi (Dept. of Computer Science and Engineering, Hanyang University, Korea). Sponsor: SIGART, ACM Special Interest Group on Artificial Intelligence. Publisher: ACM Press, New York, NY, USA