Semistructured Data and XML
The Web provides access to large data sources which are not explicitly
organized as databases. Instead, the information is presented as semistructured
data. In contrast to integrating classical distributed databases,
handling such data raises several new problems, such as schema
discovery, wrapping and reorganizing the data sources, and coping with
changes in autonomous sources. XML solves the problem of wrapping from
HTML to another data model.
Information Integration from the Web
As a continuation of the FLORID project, the FLORID system has been
extended with Web access capabilities (versions 2.x). A methodology has
been developed for wrapping and integrating HTML pages by mapping their
information into an integrated F-Logic data model that both represents
the structure of the data sources and contains an application-level
model of the information. HTML pages are wrapped using generic rules
for the usual structuring means (i.e., lists, tables, comma-separated
lists, emphasized keywords). The MONDIAL case study documents the
practicability of the approach.
The experiences with F-Logic and FLORID are now carried forward in the
LoPiX (Logic Programming in XML) project, which deals with the
integration of XML data.
Rewriting queries with caches
In the environment of the World Wide Web, the heterogeneity of online
data sources (e.g. databases on the Internet) poses new challenges for
data models and database theory in data integration: an expressive
language must be defined; some data sources are not always available;
access to remote databases is expensive; data sources are sometimes
incomplete; and so on. A new caching technology, the proxy cache
server, provides a promising solution for data integration. Parts of
earlier queries are stored in the caches, and later queries can be
answered directly from the caches instead of going to the data sources.
In this part we tackle the problem of semantic cache answering, or
query rewriting using caches, from the theoretical side: how to decide
whether the caches can answer the query, partially or equivalently; how
to compute the answer; and, if the query is only partially answered,
how to obtain the remainder of the query. Due to the flexibility of
Internet data sources, we are confronted with a more complex data
model, with inequality, negation, and schema information.
- Diplom / Studienarbeit / Master Thesis: Linear-time implementation
of XPATH using labeling
It has been recognised that query processing of XML with XPATH is
efficient. The theoretical basis is that each XPATH expression can be
effectively transformed to CTL, a branching-time logic used in model
checking. Since CTL model checking runs in linear time (linear in both
the size of the formula and the size of the structure), it can be
concluded that processing XML with XPATH achieves this complexity too.
In fact, one such implementation already exists. There are two
approaches to consider:
- Transform each XPATH expression to CTL and the XML document to a
Kripke structure, then apply an existing model checking algorithm.
- Write the algorithm natively, "borrowing" the idea of the labeling
method used in CTL model checking. (recommended)
Required knowledge: XPATH, XML, Java programming
Tools: XML/DOM parser, XPATH parser in Java (both are standard), JDK.
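To give an impression of the second approach, the following is a minimal
sketch, not the existing implementation mentioned above. It assumes a
small hypothetical formula language (Name, And, Not, HasChild,
HasDescendant) that covers only qualifiers in the style of
//book[author], not full XPATH. As in CTL labeling, each subformula is
handled once, with HasChild playing the role of EX (take the parents of
the labeled nodes) and HasDescendant the role of EF (walk upwards and
stop at already labeled ancestors), so the work per subformula is linear
in the size of the tree. The class name XPathLabeling and its helper
methods are made up for this sketch; only the standard DOM parser from
the JDK is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class XPathLabeling {
    // Hypothetical mini-language for qualifiers; not a full XPATH grammar.
    interface Formula {}
    static final class Name implements Formula {
        final String tag;
        Name(String tag) { this.tag = tag; }
    }
    static final class And implements Formula {
        final Formula left, right;
        And(Formula left, Formula right) { this.left = left; this.right = right; }
    }
    static final class Not implements Formula {
        final Formula inner;
        Not(Formula inner) { this.inner = inner; }
    }
    static final class HasChild implements Formula {
        final Formula inner;
        HasChild(Formula inner) { this.inner = inner; }
    }
    static final class HasDescendant implements Formula {
        final Formula inner;
        HasDescendant(Formula inner) { this.inner = inner; }
    }

    // Identity-based set, so labels are attached to node objects themselves.
    private static Set<Node> newLabelSet() {
        return Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
    }

    /** Collects all elements of the subtree rooted at n in document order. */
    static void collect(Node n, List<Node> out) {
        if (n instanceof Element) out.add(n);
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) collect(c, out);
    }

    /** Returns the set of elements labeled with f; each subformula costs one linear pass. */
    static Set<Node> label(Formula f, List<Node> all) {
        Set<Node> result = newLabelSet();
        if (f instanceof Name) {
            for (Node n : all) if (n.getNodeName().equals(((Name) f).tag)) result.add(n);
        } else if (f instanceof And) {
            result.addAll(label(((And) f).left, all));
            result.retainAll(label(((And) f).right, all));
        } else if (f instanceof Not) {
            result.addAll(all);
            result.removeAll(label(((Not) f).inner, all));
        } else if (f instanceof HasChild) {
            // Analogue of EX: label the parent of every node that satisfies the inner formula.
            for (Node n : label(((HasChild) f).inner, all)) {
                Node p = n.getParentNode();
                if (p instanceof Element) result.add(p);
            }
        } else if (f instanceof HasDescendant) {
            // Analogue of EF: walk upwards, stopping at ancestors that are already labeled.
            for (Node n : label(((HasDescendant) f).inner, all)) {
                Node p = n.getParentNode();
                while (p instanceof Element && result.add(p)) p = p.getParentNode();
            }
        } else {
            throw new IllegalArgumentException("unknown formula type");
        }
        return result;
    }

    /** Parses the XML text and counts the elements that satisfy f. */
    static int count(Formula f, String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<Node> all = new ArrayList<>();
            collect(root, all);
            return label(f, all).size();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

For example, the qualifier //book[author] corresponds to
new And(new Name("book"), new HasChild(new Name("author"))). A thesis
implementation would additionally need a parser from XPATH syntax to
such formulas and support for the remaining axes.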