
DiGIR

Distributed Generic Information Retrieval

Stan Blum, Dave Vieglais, P.J. Schwartz

DiGIR 1
Project Goals
- To define a protocol for retrieving structured data from multiple, heterogeneous databases
- To build a reference implementation of said protocol

Design Goals
- To use open protocols and standards, such as HTTP, XML, and UDDI, to leverage existing and emerging technologies
- To decouple the protocol, software, and semantics
- To automate the establishment of a new data provider as much as possible

High-level Architecture

Protocol
Provider
Portal
Registry

Protocol
- Defines request and response message formats for communication between Provider and Portal
- Assumes Providers conform to a known federation schema
- Remains flexible to allow for federation schema pluggability

Provider
- Makes structured data available to portals
- Communicates via protocol-compliant messaging only
- Complies with a known federation schema
- Supplies meta-data to describe data classification and availability

Portal
- The entry point for a “user”
- Can make requests of any number (N) of providers
- Communicates via protocol-compliant messaging only
- Queries the registry for available providers
- Can determine, based on provider meta-data, whether a provider should be queried

Project Information
- The DiGIR project is a collaborative effort
- DiGIR is currently established as an open source project on SourceForge (http://sourceforge.net).
- Further documentation is available on the SourceForge site.
- Please join us in collaborating!

Protocol Details

Protocol Details
- Specified in an XML Schema (.xsd)
- Intended to work in conjunction with federation schemas, also expressed as XML Schemas
- Actual request and response documents are instance documents conforming to both the protocol schema and a federation schema

<request xmlns="http://www.namespaceTBD.org/digir"
         xmlns:darwin="http://www.namespaceTBD.org/darwin"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd
                             http://www.namespaceTBD.org/darwin darwin.xsd">
  <header>
    <requestType>search</requestType>
  </header>
  <search>
    <dbName>myDiggableBipesDB</dbName>
    <filter>
      <and>
        <in>
          <list xsi:type="darwin:list">
            <darwin:Month>11</darwin:Month>
            <darwin:Month>12</darwin:Month>
          </list>
        </in>
        <equals>
          <darwin:Genus>Bipes</darwin:Genus>
        </equals>
      </and>
    </filter>
    <records start="0" count="50" />
  </search>
</request>
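
A request like the one above could be assembled programmatically, for example by a portal translating form input into protocol XML. The following is a minimal Python sketch, not part of the DiGIR reference implementation. The function name `build_search_request` and its parameters are assumptions made for illustration, and the namespace URIs are the "TBD" placeholders from the slide.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs copied from the slide (still "TBD" there).
DIGIR_NS = "http://www.namespaceTBD.org/digir"
DARWIN_NS = "http://www.namespaceTBD.org/darwin"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Make serialized output use the same prefixes as the slide.
ET.register_namespace("", DIGIR_NS)
ET.register_namespace("darwin", DARWIN_NS)
ET.register_namespace("xsi", XSI_NS)


def q(ns, tag):
    """Return an ElementTree qualified name: {namespace}tag."""
    return f"{{{ns}}}{tag}"


def build_search_request(db_name, genus, months, start=0, count=50):
    """Hypothetical helper: build a search request for one genus and a list of months."""
    request = ET.Element(q(DIGIR_NS, "request"))
    header = ET.SubElement(request, q(DIGIR_NS, "header"))
    ET.SubElement(header, q(DIGIR_NS, "requestType")).text = "search"

    search = ET.SubElement(request, q(DIGIR_NS, "search"))
    ET.SubElement(search, q(DIGIR_NS, "dbName")).text = db_name

    flt = ET.SubElement(search, q(DIGIR_NS, "filter"))
    and_op = ET.SubElement(flt, q(DIGIR_NS, "and"))

    # <in> carries a typed list of values for one concept.
    in_op = ET.SubElement(and_op, q(DIGIR_NS, "in"))
    lst = ET.SubElement(in_op, q(DIGIR_NS, "list"),
                        {q(XSI_NS, "type"): "darwin:list"})
    for month in months:
        ET.SubElement(lst, q(DARWIN_NS, "Month")).text = str(month)

    equals = ET.SubElement(and_op, q(DIGIR_NS, "equals"))
    ET.SubElement(equals, q(DARWIN_NS, "Genus")).text = genus

    ET.SubElement(search, q(DIGIR_NS, "records"),
                  {"start": str(start), "count": str(count)})
    return request


request = build_search_request("myDiggableBipesDB", "Bipes", [11, 12])
xml_text = ET.tostring(request, encoding="unicode")
```

The sketch only produces a well-formed document; validating it against the protocol and federation schemas would need an XML Schema validator, which the Python standard library does not provide.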

Request Explanation
- Composed of elements from the protocol namespace (default) and the schema namespace
- <header> contains information about the payload
- <search> contains dbName, filter, and record specification (will also specify result format)
- <filter> is effectively an XML representation of a SQL WHERE clause
- This search request asks for the first 50 specimen records of genus Bipes that were found in November or December.

Filter Building
LOPs (logical operators), which can be nested:
- <and>
- <or>
- <andNot>
- <orNot>

COPs (comparison operators):
- <equals>
- <lessThan>
- <lessThanOrEquals>
- <notEquals>
- <greaterThan>
- <greaterThanOrEquals>
- <like>
- <in> (multi-value)
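
To illustrate how a filter built from these operators corresponds to a SQL WHERE clause, here is a minimal Python sketch of a recursive translation. It is an illustration under stated assumptions, not the DiGIR filter handler: the function `to_where`, the `DATA_MAP` from concept names to column names, and the choice of `?` placeholders are all hypothetical, and `andNot`/`orNot` are handled for two operands only.

```python
import xml.etree.ElementTree as ET

# Operator tables taken from the operator names on the slide.
LOGICAL = {"and": "AND", "or": "OR", "andNot": "AND NOT", "orNot": "OR NOT"}
COMPARISON = {
    "equals": "=", "notEquals": "<>",
    "lessThan": "<", "lessThanOrEquals": "<=",
    "greaterThan": ">", "greaterThanOrEquals": ">=",
    "like": "LIKE",
}

# Hypothetical data map: federation concept -> local column name.
DATA_MAP = {"Genus": "genus", "Month": "month_collected"}


def local(tag):
    """Strip the {namespace} part of an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def to_where(node, params):
    """Translate a filter element into SQL text, appending values to params."""
    name = local(node.tag)
    children = list(node)
    if name == "filter":
        return to_where(children[0], params)
    if name in LOGICAL:
        parts = [to_where(child, params) for child in children]
        return "(" + f" {LOGICAL[name]} ".join(parts) + ")"
    if name in COMPARISON:
        concept = children[0]
        params.append(concept.text)
        return f"{DATA_MAP[local(concept.tag)]} {COMPARISON[name]} ?"
    if name == "in":
        values = list(children[0])  # the typed <list> element
        params.extend(value.text for value in values)
        marks = ", ".join("?" for _ in values)
        return f"{DATA_MAP[local(values[0].tag)]} IN ({marks})"
    raise ValueError(f"unsupported filter element: {name}")


SAMPLE = """
<filter xmlns="http://www.namespaceTBD.org/digir"
        xmlns:darwin="http://www.namespaceTBD.org/darwin">
  <and>
    <in><list><darwin:Month>11</darwin:Month><darwin:Month>12</darwin:Month></list></in>
    <equals><darwin:Genus>Bipes</darwin:Genus></equals>
  </and>
</filter>
"""

params = []
where = to_where(ET.fromstring(SAMPLE), params)
# where  -> "(month_collected IN (?, ?) AND genus = ?)"
# params -> ["11", "12", "Bipes"]
```

Values are passed as bound parameters rather than spliced into the SQL text, so the sketch does not depend on quoting rules of any particular database.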

What “binds” the schemas?
- The protocol schema defines various abstract types and elements:
<xsd:element name="searchCondition" abstract="true" />
<xsd:element name="alphaSearchCondition" abstract="true"
    substitutionGroup="searchCondition" />
<xsd:complexType name="listType" abstract="true" />
<xsd:complexType name="numericListType" abstract="true" />

- A federation schema must define searchable concepts, or groups of them, as substitutable for these abstract elements or as extensions of the abstract types:
<xsd:element name="Species" type="xsd:string"
    substitutionGroup="digir:alphaSearchCondition" />

<xsd:complexType name="list">
  <xsd:complexContent>
    <xsd:extension base="digir:listType">
      <xsd:sequence>
        <xsd:choice>
          <xsd:element ref="ScientificName" maxOccurs="unbounded" />
          <xsd:element ref="Kingdom" maxOccurs="unbounded" />
          <xsd:element ref="Phylum" maxOccurs="unbounded" />
          <xsd:element ref="Class" maxOccurs="unbounded" />
          <xsd:element ref="Order" maxOccurs="unbounded" />
          <xsd:element ref="Family" maxOccurs="unbounded" />
          <xsd:element ref="Genus" maxOccurs="unbounded" />
          <xsd:element ref="Species" maxOccurs="unbounded" />
          <…>
        </xsd:choice>
      </xsd:sequence>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>
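
One consequence of classifying concepts under abstract types is that an operator's validity can be decided from the concept's class alone. The Python sketch below mirrors that idea outside of XML Schema. The table `CONCEPT_CLASS` is hypothetical (in DiGIR the classification would come from the federation schema, and treating Month as numeric is an assumption), as is the function name `operator_allowed`.

```python
# Hypothetical classification of concepts. In DiGIR this information would be
# expressed by substitution groups in the federation schema, not by a table.
CONCEPT_CLASS = {
    "Genus": "alpha",
    "Species": "alpha",
    "Month": "numeric",  # assumption made for this example
}

# Comparison operators that apply to every class of concept.
GENERAL_COMPARISONS = {
    "equals", "notEquals",
    "lessThan", "lessThanOrEquals",
    "greaterThan", "greaterThanOrEquals",
    "in",
}

# LIKE is only meaningful for string ("alpha") data.
ALLOWED = {
    "alpha": GENERAL_COMPARISONS | {"like"},
    "numeric": GENERAL_COMPARISONS,
}


def operator_allowed(operator, concept):
    """Return True if the operator may be applied to the given concept."""
    concept_class = CONCEPT_CLASS.get(concept)
    if concept_class is None:
        return False  # unknown concepts are not searchable
    return operator in ALLOWED[concept_class]
```

Adding a new concept to a federation then only requires classifying it; the operator rules themselves stay untouched.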

Why “bind” like this?
- To provide data-typing (string, numeric, etc.) for various concepts within operators at an abstract level (e.g., LIKE is only valid for string data; IN allows for multiple values, but in a controlled fashion)
- To allow federation schemas to simply classify data as types without having to redefine or extend operators

Request Issues
- Do we need another abstract element such as dateSearchCondition?
- What information will be useful in the header?
- How should we specify the format of the results? What standard formats should be offered (e.g., brief, full)?
- Will tblName be part of the meta-data required of providers?
- What concepts of Darwin Core 2 are searchable?

Response Prototype
<response xmlns="http://www.namespaceTBD.org/digir"
          xmlns:darwin="http://www.namespaceTBD.org/darwin"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd
                              http://www.namespaceTBD.org/darwin darwin.xsd">
  <header>
    <!-- contents TBD -->
  </header>
  <content>
    <record>
    </record>
  </content>
  <diagnostics>
  </diagnostics>
</response>

Response Issues
- How do we format and validate the response content?
- What elements are needed for the <header>, if any?
- Do we always have diagnostics, or only if there is an error?
- Should a finite set of diagnostics be created and maintained in its own XML Schema? Will there ever be a diagnostic that is specific to a federation schema?

Provider Details

Provider Details
- Implemented as a web application that answers questions
- Interface is not specific to a particular information domain
- No state information is recorded
- Each request is treated as unique and uninfluenced by previous requests
- Must always generate a valid response
- Consists of four key components:
  - Request handler
  - Filter handler
  - Result set cache
  - Response generator

Request Handler
- Receives the XML document
- Validates the document
- Generates internal structures for further processing
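
The three steps above can be sketched in Python as follows. This is an illustrative outline only: the `SearchRequest` structure and `parse_request` function are hypothetical names, the namespace URI is the "TBD" placeholder from the slides, and the "validation" shown is limited to well-formedness and a few structural checks, since the Python standard library has no XML Schema validator.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

DIGIR_NS = "http://www.namespaceTBD.org/digir"  # placeholder URI from the slides
NS = {"d": DIGIR_NS}


@dataclass
class SearchRequest:
    """Hypothetical internal structure handed on to the filter handler."""
    db_name: str
    filter_root: ET.Element
    start: int
    count: int


def parse_request(xml_text):
    """Receive, check, and convert a request document; raise ValueError if invalid."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"request is not well-formed XML: {exc}") from exc

    if root.tag != f"{{{DIGIR_NS}}}request":
        raise ValueError("root element is not a DiGIR <request>")
    if root.findtext("d:header/d:requestType", namespaces=NS) != "search":
        raise ValueError("only 'search' requests are handled in this sketch")

    db_name = root.findtext("d:search/d:dbName", namespaces=NS)
    filter_root = root.find("d:search/d:filter", NS)
    records = root.find("d:search/d:records", NS)
    if db_name is None or filter_root is None or records is None:
        raise ValueError("<search> must contain dbName, filter, and records")

    return SearchRequest(
        db_name=db_name,
        filter_root=filter_root,
        start=int(records.get("start", "0")),
        count=int(records.get("count", "50")),
    )
```

Because a provider must always generate a valid response, a caller would typically catch the `ValueError` and report its message as a diagnostic instead of letting the request fail.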

Filter Handler
- Holds an internal representation of the filter (query) structure
- Responsible for generating a native query string for querying the database
- Communicates with UDDI to obtain the standard database definition
- Custom-configured to work with a specific database implementation

Result Set Cache
- Contains the results of applying a query
- Responsible for generating the response records in the requested format
- Somewhat directly integrated with the response generator

Response Generator
- Generates the response XML document
- Serializes the response header information
- Serializes diagnostic information
- Serializes the requested subset of records
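
As a rough illustration of these responsibilities, the Python sketch below assembles a response document from a list of result records. Several details are assumptions, because the slides leave them open: the header is left empty (its contents are TBD), each record field is written as an element in the Darwin namespace, and each diagnostic is written as a hypothetical `<diagnostic>` child element. The function name `build_response` is also invented for this example.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs from the slides.
DIGIR_NS = "http://www.namespaceTBD.org/digir"
DARWIN_NS = "http://www.namespaceTBD.org/darwin"

ET.register_namespace("", DIGIR_NS)
ET.register_namespace("darwin", DARWIN_NS)


def build_response(records, diagnostics, start=0, count=50):
    """Serialize header, the requested slice of records, and diagnostics."""
    response = ET.Element(f"{{{DIGIR_NS}}}response")

    # Header contents are still TBD in the protocol, so it stays empty here.
    ET.SubElement(response, f"{{{DIGIR_NS}}}header")

    # Only the requested subset of the result set is serialized.
    content = ET.SubElement(response, f"{{{DIGIR_NS}}}content")
    for row in records[start:start + count]:
        record = ET.SubElement(content, f"{{{DIGIR_NS}}}record")
        for concept, value in row.items():
            # Assumption: one Darwin-namespace element per field.
            ET.SubElement(record, f"{{{DARWIN_NS}}}{concept}").text = str(value)

    # Assumption: diagnostics are always present, one child per message.
    diags = ET.SubElement(response, f"{{{DIGIR_NS}}}diagnostics")
    for message in diagnostics:
        ET.SubElement(diags, f"{{{DIGIR_NS}}}diagnostic").text = message

    return ET.tostring(response, encoding="unicode")


rows = [
    {"Genus": "Bipes", "Month": 11},
    {"Genus": "Bipes", "Month": 12},
    {"Genus": "Bipes", "Month": 12},
]
xml_text = build_response(rows, ["3 records matched"], start=1, count=1)
```

Emitting an empty `<content>` together with diagnostics when something goes wrong is one way to satisfy the requirement that a provider always returns a valid response.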

Provider Configuration
[Diagram: a Portal and a Profile Schema, shown with two Data Provider Systems; each Data Provider System is labeled with a DiGIR Provider, a Data Map, a Schema, and Data.]

Portal Details

Portal Details
- Divided into two distinct components: a presentation layer and PortalServices
- The presentation layer supports the UI and translates requests (HTTP requests from forms or links) into protocol-compliant XML requests
- The presentation layer also handles all display issues involving the responses, such as format, sorting, collating, etc.
- The presentation layer is envisioned to be an application server/web server implementation

Portal Details
- PortalServices handles all external network activity (UDDI calls, provider calls, etc.)
- PortalServices limits provider calls to those necessary based on provider meta-data
- PortalServices threads provider calls for increased performance (i.e., response time)
- PortalServices is envisioned to be a webapp and supporting classes running within an application server, such as Tomcat
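
The combination of meta-data filtering and threaded provider calls can be sketched as below. This is a Python illustration, not the envisioned webapp: the meta-data shape (a `genera` set per provider), the `should_query` rule, and the injected `send` callable are all hypothetical, since the slides leave the meta-data contents and filtering rules as open issues.

```python
from concurrent.futures import ThreadPoolExecutor


def should_query(metadata, genus):
    """Hypothetical rule: skip a provider that declares it lacks the genus."""
    genera = metadata.get("genera")
    return genera is None or genus in genera  # no meta-data: query anyway


def query_providers(providers, genus, request_xml, send,
                    max_workers=8, timeout=30):
    """Send request_xml to every relevant provider in parallel.

    providers maps a name to {"endpoint": ..., "metadata": {...}};
    send(endpoint, request_xml) performs the actual call and is injected so
    that the transport (HTTP or otherwise) stays outside this sketch.
    """
    selected = {
        name: info for name, info in providers.items()
        if should_query(info.get("metadata", {}), genus)
    }
    results = {}
    if not selected:
        return results

    workers = min(max_workers, len(selected))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(send, info["endpoint"], request_xml)
            for name, info in selected.items()
        }
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout)
            except Exception as exc:  # one failing provider must not sink the rest
                results[name] = exc
    return results
```

With this shape, the total response time is bounded roughly by the slowest selected provider rather than by the sum of all provider response times.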

PortalServices
- RegistryAccess
- ProviderCache
- PortalConfig
- PortalServlet
- PortalRequestHandler
- ProviderFilterer
- Marshallers

Portal Issues
- What information will be stored in UDDI about a provider?
- What information will be needed to communicate with a Provider (e.g., IP address, port, etc.)?
- What meta-data will be provided, and what are the rules for using such data for provider filtering?
- What requirements are there for logging and monitoring?
