
Table Of Contents

1 INTRODUCTION
  1.1 Database Management System
  1.2 Information Retrieval and Database Querying
  1.3 Ranking Based Querying

2 QUERIED UNITS (QUNITS)
  2.1 Definition
  2.2 Qunit Utility
  2.3 Qunit Based Search

3 QUNIT DERIVATION
  3.1 Using Schema and Data
  3.2 Using Query Log
  3.3 Using External Evidence

4 EXPERIMENTS AND EVALUATION
  4.1 Understanding Search
  4.2 Movie Querylog Benchmark
  4.3 Evaluating Result Quality

5 CONCLUSION

6 REFERENCES


1 INTRODUCTION

1.1 DATABASE MANAGEMENT SYSTEM


A Database Management System (DBMS) is a set of computer programs that controls the creation, maintenance, and use of the databases of an organization and its end users. It allows organizations to place control of organization-wide database development in the hands of database administrators (DBAs) and other specialists. A DBMS is a system software package that supports the use of an integrated collection of data records and files known as databases, and it allows different user application programs to easily access the same database. DBMSs may use any of a variety of database models, such as the network model or the relational model. In large systems, a DBMS allows users and other software to store and retrieve data in a structured way. Instead of having to write computer programs to extract information, a user can ask simple questions in a query language. Many DBMS packages also provide Fourth-Generation Language (4GL) application development features. A DBMS helps to specify the logical organization of a database and to access and use the information within it. It provides facilities for controlling data access, enforcing data integrity, managing concurrent access, and restoring the database.

Structured data is data that resides in fixed fields within a record or file; relational databases and spreadsheets are examples of structured data. The information stored in databases is known as structured data because it is represented in a strict format: each record in a relational database table follows the same format as the other records in that table. For structured data, it is common to carefully design the database in order to create the database schema. The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.


However, not all data is collected and inserted into carefully designed structured databases. In some applications, data is collected in an ad-hoc manner before it is known how it will be stored and managed. This data may have a certain structure, but not all the information collected will have an identical structure. Some attributes may be shared among the various entities, but other attributes may exist only in a few entities. Moreover, additional attributes can be introduced in some of the newer data items at any time, and there is no predefined schema. This type of data is known as semistructured data. A number of data models have been introduced for representing semistructured data, often based on tree or graph data structures rather than the flat relational model structures. A key difference between structured and semistructured data concerns how the schema constructs (such as the names of attributes, relationships, and entity types) are handled. In semistructured data, the schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance.
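To make the distinction concrete, the following small Python sketch (a hypothetical illustration with made-up records, not data from this report) contrasts a fixed relational-style record layout with semistructured records whose attributes vary from object to object.

# Structured data: every record follows the same fixed schema.
movie_schema = ("id", "title", "year", "genre")
movies = [
    (1, "Star Wars", 1977, "Sci-Fi"),
    (2, "Ocean's Eleven", 2001, "Crime"),
]

# Semistructured data: each object carries its own attributes, and the
# schema information (attribute names) is mixed in with the data values.
semistructured = [
    {"title": "Star Wars", "year": 1977, "director": "George Lucas"},
    {"title": "Ocean's Eleven", "cast": ["George Clooney", "Brad Pitt"]},
]

# With no predefined schema, code must discover which attributes exist.
all_attributes = set()
for obj in semistructured:
    all_attributes.update(obj.keys())
print(sorted(all_attributes))  # ['cast', 'director', 'title', 'year']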


The database system contains not only the database itself but also a complete definition or description of the database structure and constraints. This definition is stored in the DBMS catalog, which contains information such as the structure of each file, the type and storage format of each data item, and various constraints on the data. The information stored in the catalog is called meta-data.

The description of a database is called the database schema, which is specified during database design and is not expected to change frequently. A schema diagram displays only some aspects of a schema, such as the names of record types and data items, and some types of constraints. The actual data in a database may change quite frequently. The data in the database at a particular moment in time is called a database state or snapshot. It is also called the current set of occurrences or instances in the database.

1.2 INFORMATION RETRIEVAL AND DATABASE QUERYING

Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents, as well as of searching relational databases and the World Wide Web. IR is interdisciplinary, drawing on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, statistics, and physics. Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals, and other documents. Web search engines are the most visible IR applications.



A database query is the operation that extracts a recordset from a database. A query consists of search criteria expressed in a database language called SQL. For example, the query can specify that only certain columns or only certain records be included in the recordset.
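As a small illustration (a self-contained sketch using an in-memory SQLite database with made-up rows, not data from this report), the query below restricts both the columns and the records that end up in the recordset.

import sqlite3

# Build a tiny in-memory database with a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (id INTEGER, title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?)",
    [(1, "Star Wars", 1977), (2, "Casablanca", 1942), (3, "Up", 2009)],
)

# The query names only certain columns (title, year) and only certain
# records (movies released after 1970), so the recordset is a subset of
# both the columns and the rows of the table.
recordset = conn.execute(
    "SELECT title, year FROM movie WHERE year > 1970 ORDER BY year"
).fetchall()
print(recordset)  # [('Star Wars', 1977), ('Up', 2009)]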

1.3 RANKING BASED QUERYING

To rank documents, IR systems assign a score to each document as an estimate of the document's relevance to the given query. Automated ranking of the results of a query is a popular aspect of the query model in Information Retrieval (IR). In contrast, database systems support only a Boolean query model. For example, a selection query on a SQL database returns all tuples that satisfy the conditions in the query. Therefore, the following two scenarios are not gracefully handled by a SQL system:

Empty answers: When the query is too selective, the answer may be empty. In that case, it is desirable to have the option of requesting a ranked list of approximately matching tuples without having to specify the ranking function that captures proximity to the query. An FBI agent or an analyst involved in data exploration will find such functionality appealing.

Many answers: When the query is not selective enough, too many tuples may be in the answer. In such a case, it is desirable to have the option of ordering the matches automatically, ranking the more globally important answer tuples higher and returning only the best matches.


Conceptually, the automated ranking of query results is really the problem of taking a user query (say, a conjunctive selection query) and mapping it to a Top-K query with a ranking function that depends on the conditions given in the user query. The key questions are: How to derive such ranking functions automatically? How well do ranking functions from IR apply? Are the ranking techniques for handling the empty-answers and many-answers problems different? How to execute such Top-K queries efficiently over large databases?
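The following is a minimal sketch of the Top-K idea (a hypothetical table and an equal-weight ranking function chosen purely for illustration): instead of returning only the tuples that satisfy every condition, each tuple is scored by how many query conditions it satisfies, and the K best matches are returned. This covers both the empty-answers case (near misses are still ranked) and the many-answers case (only the top K are shown).

# Hypothetical tuples and a conjunctive query over them.
cars = [
    {"make": "Honda", "color": "red", "year": 2003},
    {"make": "Honda", "color": "blue", "year": 2001},
    {"make": "Toyota", "color": "red", "year": 2003},
]
query = {"make": "Honda", "color": "red", "year": 2001}  # too selective: no exact match

def score(tuple_, conditions):
    # Fraction of query conditions the tuple satisfies (equal weights assumed here).
    matched = sum(1 for attr, value in conditions.items() if tuple_.get(attr) == value)
    return matched / len(conditions)

def top_k(tuples, conditions, k=2):
    # Rank all tuples by score and keep only the K best matches.
    return sorted(tuples, key=lambda t: score(t, conditions), reverse=True)[:k]

for t in top_k(cars, query):
    print(round(score(t, query), 2), t)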


2 QUERIED UNITS (QUNITS)

2.1 DEFINITION
Qunit: A qunit (Queried Unit) is the basic, independent semantic unit of information in a database. Conceptually, a qunit represents the desired result for some query against the database, and each qunit can be treated as a document for standard IR-style document retrieval.

A qunit definition can be considered as a combination of these two expressions:

Base expression: This can be considered as a view on the database. It consists of a stored query, accessible as a virtual table composed of the result set of that query.

Conversion expression: This converts the data into the form in which we want to present it. It can thus be used to determine various presentations of the given data.

A Basic Qunit Example

For example, consider the IMDb (Internet Movie Database) database. We would like to create a qunit definition corresponding to the information need "cast". The cast is defined as the people associated with a movie. We do not want the name of the movie repeated with each record; instead, we would like the presentation to show the movie title on top and one record for each cast member. The base data in IMDb is relational, and against its schema we would write the base expression in SQL and the conversion expression in XSL-like markup as follows:


SQL, often referred to as Structured Query Language, is a database computer language designed for managing data in relational database management systems (RDBMS), and is originally based upon relational algebra. Its scope includes data query and update, schema creation and modification, and data access control.

XSL: Extensible Stylesheet Language (XSL) refers to a family of languages used for transforming and rendering XML documents. XML (Extensible Markup Language) is a set of rules for encoding documents electronically. Although XML's design focuses on documents, it is widely used for the representation of arbitrary data structures.

Base Expression:

SELECT * FROM person, cast, movie
WHERE cast.movie_id = movie.id
AND cast.person_id = person.id
AND movie.title = "$x"

Conversion Expression:

RETURN
<cast movie="$x">
  <foreach:tuple>
    <person>$person.name</person>
  </foreach:tuple>
</cast>

The combination of these two expressions (base and conversion) forms our qunit definition. On applying this definition to a database, we derive qunit instances, one per movie.
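A minimal sketch of how such a definition could be applied is shown below (a hypothetical in-memory Python rendition of the base and conversion expressions above; the actual evaluation machinery of the system is not described at this level of detail in this report). Each binding of the parameter $x yields one qunit instance.

# Hypothetical in-memory tables standing in for the IMDb relations.
movies = [{"id": 1, "title": "Star Wars"}, {"id": 2, "title": "Ocean's Eleven"}]
people = [{"id": 10, "name": "Harrison Ford"}, {"id": 11, "name": "George Clooney"}]
cast = [{"movie_id": 1, "person_id": 10}, {"movie_id": 2, "person_id": 11}]

def base_expression(title):
    # The join of person, cast, and movie for one movie title (the SQL above).
    return [
        p
        for c in cast
        for m in movies
        for p in people
        if c["movie_id"] == m["id"] and c["person_id"] == p["id"] and m["title"] == title
    ]

def conversion_expression(title, tuples):
    # Render the result as in the XSL-like markup above: one <person> per tuple.
    body = "".join("<person>%s</person>" % t["name"] for t in tuples)
    return '<cast movie="%s">%s</cast>' % (title, body)

# One qunit instance per movie.
for m in movies:
    print(conversion_expression(m["title"], base_expression(m["title"])))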

2.2 QUNIT UTILITY

The notion of qunit utility applies to both qunit definitions and qunit instances. By qunit utility we mean the importance of a qunit in relation to a user query over the database. The total number of possible views in a database is very large, so the number of candidate qunit definitions is massive. The importance of a qunit is captured as a utility score, which is used to select the most relevant and useful qunits from the large pool of candidates, so that the most relevant output is returned for a particular user query. Qunit utility is relative to each user's purpose and need, and so differs from user to user. We therefore quantify qunit utility with a well-defined objective surrogate; this is similar to measuring document relevance in information retrieval, where TF-IDF is used to approximate relevance.

The TF-IDF weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
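As a minimal sketch of the TF-IDF idea (a toy corpus and an unsmoothed formula chosen only for illustration; this report does not prescribe a particular variant), the weight of a term in a document rises with its frequency in that document and falls with the number of documents in the corpus that contain it.

import math

docs = [
    "star wars cast",
    "star wars soundtrack",
    "george clooney movies",
]

def tf_idf(term, doc, corpus):
    # Term frequency in the document, discounted by how common the term is in the corpus.
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

print(tf_idf("star", docs[0], docs))  # common term across the corpus -> lower weight
print(tf_idf("cast", docs[0], docs))  # rarer term -> higher weight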


2.3 QUNIT BASED SEARCH


Consider the user query "star wars cast", as shown in the figure. Queries are first processed to identify entities using standard query segmentation techniques. In this case, one high-ranking segmentation is [movie.name] [cast], and this has a very high overlap with the qunit definition that involves a join between movie.name and cast. Standard IR techniques can then be used to evaluate this query against qunit instances of the identified type, each considered independently even if they contain elements in common. The qunit instance describing the cast of the movie Star Wars is chosen as the appropriate result. The qunits-based approach is thus a far cleaner way to model database search.

The benefit of maintaining a clear separation between ranking and database content is that structured information can be considered as one source of information amongst many others. This makes our system easier to extend and enhance with additional IR methods for ranking, such as relevance feedback. Additionally, it allows us to concentrate on making the database more efficient using indices and query optimization, without having to worry about extraneous issues such as search and ranking. It is observed that this conceptual demarcation of rankings and results does not imply materialization of all qunits.
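A minimal sketch of this flow is given below (hypothetical qunit definitions, instances, and a toy word-overlap ranking; a real system would use proper query segmentation and IR scoring): the query is segmented, matched against qunit definitions, and then evaluated against the instances of the winning definition.

# Hypothetical qunit definitions, described by the schema elements they involve.
qunit_definitions = {
    "movie_cast": {"movie.name", "cast"},
    "person_filmography": {"person.name", "movie.name"},
}

# Hypothetical qunit instances of the "movie_cast" definition.
qunit_instances = {
    "movie_cast": [
        {"movie": "Star Wars", "text": "star wars cast harrison ford mark hamill"},
        {"movie": "Ocean's Eleven", "text": "ocean's eleven cast george clooney brad pitt"},
    ],
    "person_filmography": [],
}

def segment(query):
    # Toy segmentation: map known strings in the query onto schema elements.
    segments = set()
    if "cast" in query:
        segments.add("cast")
    if "star wars" in query or "ocean's eleven" in query:
        segments.add("movie.name")
    return segments

def search(query):
    segments = segment(query)
    # Pick the qunit definition with the largest overlap with the segmentation.
    best = max(qunit_definitions, key=lambda d: len(segments & qunit_definitions[d]))
    # Rank instances of that definition by simple word overlap (stand-in for IR ranking).
    q_words = set(query.split())
    return max(qunit_instances[best], key=lambda i: len(q_words & set(i["text"].split())))

print(search("star wars cast"))  # the qunit instance for the cast of Star Wars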


3 QUNIT DERIVATION

Qunits can be identified in two ways: by manual identification, or by automated techniques. They can be identified manually by the database creator at the time of database creation; the creator has the best knowledge of the data in the database, so qunit identification done by the creator is likely to be superior to anything that automated techniques can provide. Identifying qunits involves writing a set of view definitions for commonly expected query result types, and the manual effort involved is only a small part of the total cost of database design. However, manual identification of qunits may not always be feasible; for example, legacy systems have already been created without qunits, so automated techniques for deriving qunits from a database are important. There are several possible sources of information that can be used to infer qunits, with knowledge of the database schema as the starting point. Independent sources of information that can be used for deriving qunits are: 1) the data contained in the database, 2) the history of keyword queries previously posed to the system, and 3) published results and reports based on information from the database in question.


3.1 USING SCHEMA AND DATA

In this method, the concept of the queriability of a database schema is used to infer the database's important schema entities and attributes. Queriability is defined as the likelihood of a schema element being used in a query, and is computed using the cardinality of the data that the schema element represents. The base expression of a qunit is generated by looking at the top-k schema entities in descending order of queriability score. Each of the top-k1 schema entities is then expanded to include the top-k2 neighbouring entities as a join, where k1 and k2 are tunable parameters. This method does not always derive optimal qunits, for example in the case of underspecified queries, as the following example shows.

A Basic Example


Here the query "George Clooney" is performed. From the schema in the figure above, creating a qunit for person would result in the inclusion of both the important movie genre table and the unimportant movie location table, since every movie has a genre and a location. However, the information about the shooting location is not of much importance or interest to most people. Thus this method of generating qunits is not always optimal.
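A minimal sketch of this derivation follows (hypothetical schema statistics, and a cardinality-based stand-in for the queriability score; the report only states that queriability is computed from the cardinality of the underlying data, so the exact formula here is an assumption).

# Hypothetical schema entities with the cardinality of the data they represent.
schema = {
    "movie": {"cardinality": 500000, "neighbours": ["cast", "genre", "location"]},
    "person": {"cardinality": 900000, "neighbours": ["cast"]},
    "cast": {"cardinality": 3000000, "neighbours": ["movie", "person"]},
    "genre": {"cardinality": 30, "neighbours": ["movie"]},
    "location": {"cardinality": 10000, "neighbours": ["movie"]},
}

def queriability(entity):
    # Stand-in score: entities representing more data are assumed more likely to be queried.
    return schema[entity]["cardinality"]

def derive_base_expressions(k1=2, k2=2):
    # Take the top-k1 entities by queriability and join each with its top-k2 neighbours.
    top_entities = sorted(schema, key=queriability, reverse=True)[:k1]
    plans = []
    for entity in top_entities:
        joins = sorted(schema[entity]["neighbours"], key=queriability, reverse=True)[:k2]
        plans.append((entity, joins))
    return plans

for entity, joins in derive_base_expressions():
    print("SELECT * FROM " + entity + " JOIN " + " JOIN ".join(joins))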

3.2 USING QUERY LOG


This method uses a query rollup strategy over query logs, inspired by the observation that keyword queries are inherently under-specified, and hence the qunit definition for an under-specified query is an aggregation of the qunit definitions of its specializations. For example, if the expected qunit for the query "george clooney" is a personality profile about the actor George Clooney, it can be constructed by considering the popular specialized variations of this query, such as "george clooney actor", "george clooney movies", and so on.

We perform the following steps: We sample the database for entities and look them up in the search query log. The matching query log entries are then collected, along with the number of times they occur in the query log (the query frequency). For each query, we then map each recognized entity onto the schema, constructing simple join plans. Finally, we consider the popular plan fragments for the qunit definition.

15 | P a g e

An example using query log

We consider the schema element person.name. Instances of this element are:
i. George Clooney
ii. Tom Hanks

These instances are looked up in the query log, where we find search queries such as:
i. George Clooney actor
ii. George Clooney batman
iii. Tom Hanks castaway

Using these three queries, we build an annotated set of schema links in which person.name links to cast.role once and to movie.name twice. This suggests that the rollup of the qunit representing person.name should contain movie.name and cast.role, in that order.
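A minimal sketch of this rollup is shown below (a hypothetical query log and a hand-written mapping from query words to schema elements; the frequencies are made up for illustration).

from collections import Counter

# Hypothetical query log entries containing instances of person.name.
query_log = [
    "george clooney actor",   # "actor" maps to cast.role
    "george clooney batman",  # "batman" maps to movie.name
    "tom hanks castaway",     # "castaway" maps to movie.name
]

# Hypothetical mapping from non-entity query words to schema elements.
word_to_schema = {"actor": "cast.role", "batman": "movie.name", "castaway": "movie.name"}
entities = ["george clooney", "tom hanks"]

def rollup(log):
    # Count which schema elements co-occur with person.name instances in the log.
    links = Counter()
    for query in log:
        if any(entity in query for entity in entities):
            for word, schema_element in word_to_schema.items():
                if word in query:
                    links[schema_element] += 1
    return links

print(rollup(query_log).most_common())  # [('movie.name', 2), ('cast.role', 1)]
# The rollup of the qunit for person.name would thus include movie.name, then cast.role.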


3.3 USING EXTERNAL EVIDENCE

In this third method, external evidence is used to create qunits for the database. The goal is to learn qunit definitions by considering the useful information contained in this external evidence.

External evidence: External evidence is useful information that exists in the following forms: reports; published results of queries to the database; and relevant web pages that present parts of the data.

Example: Movie information from sources such as Wikipedia and IMDb is organized, and the information from these two sources greatly overlaps. The aim is to learn the organization of this overlapping data from Wikipedia.

The Document Object Model (DOM) is an application programming interface (API) for valid HTML and well-formed XML documents. It defines the logical structure of documents and the way a document is accessed and manipulated. In the DOM specification, the term "document" is used in the broad sense: increasingly, XML is being used as a way of representing many different kinds of information that may be stored in diverse systems, and much of this would traditionally be seen as data rather than as documents. Nevertheless, XML presents this data as documents, and the DOM may be used to manage this data. In the DOM, documents have a logical structure which is very much like a tree; to be more precise, like a "forest" or "grove" that can contain more than one tree. Each document contains zero or one doctype nodes, one root element node, and zero or more comments or processing instructions; the root element serves as the root of the element tree for the document.

However, the DOM does not specify that documents must be implemented as a tree or a grove, nor does it specify how the relationships among objects are implemented. The DOM is a logical model that may be implemented in any convenient manner. The DOM specification uses the term structure model to describe the tree-like representation of a document, and the term "tree" when referring to the arrangement of those information items which can be reached by using tree-walking methods (this does not include attributes). One important property of DOM structure models is structural isomorphism: if any two Document Object Model implementations are used to create a representation of the same document, they will create the same structure model, in accordance with the XML Information Set.

A type signature defines the inputs and outputs of a function, subroutine, or method. A type signature includes at least the function name and the number of its arguments. In some programming languages, it may also specify the function's return type, the types of its arguments, or the errors it may pass back. The foreach statement repeats a group of embedded statements for each element in an array or an object collection; it is used to iterate through a collection to get the desired information. Cardinality specifies how many instances of an entity relate to one instance of another entity.

Using the records in the database, the entities mentioned in each document are identified. A type signature is then computed for each web page, utilizing the DOM tree and the frequency of each occurrence. An example of a type signature for the filmography page of a person on Wikipedia would be ((person.name:1) (movie.name:40)), which suggests using person.name as a label field, followed by a foreach consisting of movie.name, based on the relative cardinality in the signature and the number of tuples generated by our qunit base expression. By aggregating the type signatures over a collection of pages, we can infer the appropriate qunit definition.
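A minimal sketch of computing such a type signature is given below (hypothetical entity lists and page text, and a plain substring match standing in for a real DOM traversal).

from collections import Counter

# Hypothetical database entity values, grouped by schema element.
entities = {
    "person.name": ["George Clooney", "Brad Pitt", "Julia Roberts"],
    "movie.name": ["Ocean's Eleven", "Up in the Air", "Gravity", "Syriana"],
}

# Hypothetical text extracted from one external web page (e.g. a filmography page).
page_text = "George Clooney filmography: Ocean's Eleven, Up in the Air, Gravity, Syriana"

def type_signature(text):
    # Count how many instances of each schema element appear on the page.
    signature = Counter()
    for schema_element, values in entities.items():
        signature[schema_element] = sum(1 for value in values if value in text)
    return signature

print(type_signature(page_text))  # Counter({'movie.name': 4, 'person.name': 1})
# person.name occurs once -> label field; movie.name occurs many times -> foreach body.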

4 EXPERIMENTS AND EVALUATION

An experiment is performed to explore the nature of keyword searches that users pose against a structured database. The experiment uses a real-world query log to evaluate the efficacy of qunit-based methods.

4.1 UNDERSTANDING SEARCH


The Internet Movie Database (IMDb) is a well-known repository of movie-related information on the Internet. We performed a user study with five users (A, B, C, D, E), all familiar with IMDb and all with a moderate interest in movies. The subjects had a large variance in knowledge about databases: two were graduate students specializing in databases, while the other three were non-computer-science majors. The subjects were asked to consider a hypothetical movie database that could answer all movie-related queries. Given this, the users were asked to come up with five information needs, and the corresponding keyword queries that they would use to query the database. The summary page of a movie was the most sought-after page; this information need is expressed in five different ways by the users (row 1). The cast of a movie and finding connections between two actors are also common interests, and these are expressed in many different ways (rows 2 and 6). A query that only specifies the title of a movie may be issued on account of four different information needs (column 1). Users may specify an actor's name when they mean to look for either of two different pieces of information: the actor's filmography, or information about co-actors (rows 3 and 4). There exists a many-to-many relationship between information needs and queries.


Table 1: Information Needs vs. Keyword Queries. Five users (A, B, C, D, E) were each asked for their movie-related information needs, and what queries they would use to search for them. The rows of the table are the information needs: movie summary, cast, filmography, co-actorship, posters, related movies, awards, movies of a period, charts/lists, recommendations, soundtracks, trivia, and box office. The columns are the abstracted keyword query types: [Title], [Actor], [Movie], [Genre], [Actor] [Actor], [Title] cast, [Title] poster, [Title] year, [Title] OST, [Title] plot, [Title] box office, [Title] freetext, [Award] [year], freetext, and "don't know". Each cell lists the users who stated they would issue that query type for that information need; the individual cell entries are not reproduced here.

Another key observation is that 10 of the 25 queries here are single-entity queries, 8 of which are underspecified: the query could have been written better by adding additional predicates. The results of the interviews are displayed in Table 1. Each row in this table is an information need suggested by one or more users. Each column is the query structure the user thought to use to obtain an answer for this information need. The users themselves stated specific examples; for example, if a user said they would query for "star wars cast", it was abstracted to the query type [title] cast. The unmatched portion of the query (cast) is still relevant to the schema structure and is hence considered an attribute. Conversely, users often issue queries with words that are non-structural details about the result, such as "movie space transponders". We consider these words free-form text in our query analysis. Some users came up with multiple queries to satisfy the same information need, and hence are entered more than once in the corresponding rows.


4.2 MOVIE QUERYLOG BENCHMARK

To construct a typical workload, we use a real-world dataset of web search engine query logs spanning 650,000 users and 20,000,000 queries. All query strings are first aggregated to combine all identities into a single anonymous crowd, and the queries that resulted in a navigation to the www.imdb.com domain are considered, resulting in 98,549 queries, or 46,901 unique queries. We consider this to be our base query log for the IMDb dataset. 93% of the unique queries (calculated by sampling) were identified as movie-related terms. We then construct a benchmark query log by first classifying the base query log into various types.

4.2.1 TOKEN
A token is an instance of a type. In knowledge representation, the type-token distinction separates an abstract concept from the objects which are particular instances of the concept. For example, the particular bicycle in your garage is a token of the type of thing known as "the bicycle". Tokens in the query log are first replaced with schema types by looking for the largest possible string overlaps with entities in the database. This leaves us with typed templates, such as [name] movies for "george clooney movies". We then randomly pick two queries that match each of the top (by frequency) 14 templates, giving us 28 queries that we use as a workload for qualitative assessment. We observed that our dataset reflects properties consistent with previous reports on query logs. At least 36% of the distinct queries to the search engine were single-entity queries that were just the name of an actor or the title of a movie, while 20% were entity-attribute queries, such as "terminator cast". Approximately 2% of the queries contained more than one entity, such as "angelina jolie tombraider", while less than 2% of the queries contained a complex query structure involving aggregate functions, such as "highest box office revenue".
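A minimal sketch of this tokenization step follows (a hypothetical entity index and queries; the report only states that the largest possible string overlaps with database entities are replaced by their schema types).

# Hypothetical database entities, keyed by the schema type they belong to.
entity_index = {
    "george clooney": "[name]",
    "star wars": "[title]",
    "terminator": "[title]",
}

def to_template(query):
    # Replace the longest matching entity strings with their schema types.
    template = query.lower()
    for entity in sorted(entity_index, key=len, reverse=True):  # longest overlap first
        if entity in template:
            template = template.replace(entity, entity_index[entity])
    return template

queries = ["george clooney movies", "star wars cast", "terminator cast"]
print([to_template(q) for q in queries])  # ['[name] movies', '[title] cast', '[title] cast']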

4.3 EVALUATING RESULT QUALITY

The result quality of a search system is measured by its ability to satisfy a user's information need. This metric is subjective due to the diversity of user intent and cannot be evaluated against a single hand-crafted gold standard. We conducted a result relevance study using a real-world search query log, as described in the preceding subsection, against the Internet Movie Database. We asked 20 users to compare the results returned by each search algorithm for 25 different search queries, rating each result between 1 (result is correct and relevant) and 0 (result is wrong or irrelevant). For our experiments, we created a survey using 25 of the 28 queries from the movie querylog benchmark. The workload generated using the query log is first issued against each of the competing algorithms and their results are collected. For our algorithms described in Sec. 3, we implemented a prototype in Java, using the imdb.com database (converted using IMDbPy (http://imdbpy.sf.net) to 15 tables, 34M tuples) and the base query log as our derivation data. To avoid the influence of a presentation format on our results, all information was converted by hand into a paragraph of simplified natural English with short phrases. To remove bias, the phrases were collated from two independent sources. The 20 users were sourced using the Amazon Mechanical Turk (http://mturk.com) service, all being moderate to advanced users of search engines, with moderate to high interest in movies. Users were first primed with a sample information need and query combination:

the need being to find out more about Julio Iglesias, and "julio iglesias" being the search query term. Users were then presented with a set of possible answers from a search engine, and were asked to rate the answers with one of the options listed in Table 2. Users were then asked to repeat this task for the 25 search queries mentioned above. The table also shows the score we internally assigned to each option. If the answer is incorrect or uninformative, it should be scored 0. If it is the correct answer, it should be scored 1. Where an answer is partially correct (incomplete or excessive), we give it a score between 0 and 1 depending on how correct it is; an average value for this is 0.5. To provide an objective example of a qunit-based system, we utilize the structure and layout of the imdb.com website as an expert-determined qunit set. Each page on the website is considered a unique qunit instance, identified by a unique URL format. A list of the qunits is generated by performing a breadth-first crawl of 100,000 pages of the website, starting at the homepage, and clustering the different types of URLs. Qunit definitions were then created by hand based on each type of URL, and queried against the test workload. Users were observed to be in agreement with each other, with a third of the questions having an 80% or higher majority for the winning answer. We now compare the performance of currently available approaches against the qunits described in the derivation section, using a prototype based on ideas from Sec. 3. To do this, we first ran all the queries on the BANKS [3] online demonstration. A crawl of the imdb.com website was converted to XML to retrieve the LCA (Lowest Common Ancestor) and MLCA [20] (Meaningful Lowest Common Ancestor). The MLCA operator is unique in that it ensures that the LCA derived is unique to the combination of queried nodes that connect to it, improving result relevance.

In addition to these algorithms, we also include a data point for the theoretical maximum performance in keyword search, where the user rates every search result from that algorithm as a perfect match. Results are presented in Fig. 3 by considering the average relevance score for each algorithm across the query workload. As we can see, we are still quite far away from reaching the theoretical maximum for result quality. Yet, qunit-based querying clearly outperforms existing methods.
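As a minimal sketch of how the average relevance scores could be computed from the collected ratings (the ratings below are made up; only the 0, 0.5, 1 scoring scale follows the description above):

from statistics import mean

# Hypothetical ratings: ratings[algorithm][query] is a list of per-user scores
# (1 = correct and relevant, 0.5 = partially correct, 0 = wrong or irrelevant).
ratings = {
    "qunits": {"star wars cast": [1, 1, 0.5], "george clooney": [1, 0.5, 1]},
    "banks": {"star wars cast": [0.5, 0, 0.5], "george clooney": [0, 0.5, 0]},
}

def average_relevance(per_query_ratings):
    # Average the per-user scores for each query, then average across the workload.
    return mean(mean(scores) for scores in per_query_ratings.values())

for algorithm, per_query in ratings.items():
    print(algorithm, round(average_relevance(per_query), 3))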


5 CONCLUSION


