
DOTTORATO DI RICERCA IN

INGEGNERIA DELL’INFORMAZIONE
XVIII CICLO

Sede Amministrativa
Università degli Studi di MODENA e REGGIO EMILIA

TESI PER IL CONSEGUIMENTO DEL TITOLO DI DOTTORE DI RICERCA

Information Retrieval Techniques


for Pattern Matching

Ing. Riccardo Martoglia

Relatore:
Chiar.mo Prof. Paolo Tiberio

Anno Accademico 2004 - 2005


Keywords:
Pattern matching
Similarity searching
Textual information searching
XML query processing
Approximate XML query answering
Contents

Acknowledgments 1

Introduction 3

I Pattern Matching for Plain Text 7


1 Approximate (sub)sequence matching 9
1.1 Foundation of approximate matching for (sub)sequences . . . 11
1.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.2 Approximate sub²sequence matching . . . . . . . . . . 13
1.2 Approximate matching processing . . . . . . . . . . . . . . . . 18
1.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 20
1.3.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . 21
1.3.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2 Approximate matching for EBMT 27


2.1 Research in the EBMT field . . . . . . . . . . . . . . . . . . . 30
2.1.1 Logical representation of examples . . . . . . . . . . . 30
2.1.2 Similarity metrics and scoring functions . . . . . . . . . 31
2.1.3 Efficiency and flexibility of the search process . . . . . 32
2.1.4 Evaluation of EBMT systems . . . . . . . . . . . . . . 32
2.1.5 Some notes about commercial systems . . . . . . . . . 33
2.2 The suggestion search process in EXTRA . . . . . . . . . . . . 33
2.2.1 Definition of the metric . . . . . . . . . . . . . . . . . . 34
2.2.2 The involved processes . . . . . . . . . . . . . . . . . . 35
2.3 Document analysis . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4 The suggestion search process . . . . . . . . . . . . . . . . . . 39
2.4.1 Approximate whole matching . . . . . . . . . . . . . . 40

2.4.2 Approximate sub² matching . . . . . . . . . . . . . . . 41


2.4.3 Meeting suggestion search and ranking with translator
needs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 43
2.5.1 Implementation notes . . . . . . . . . . . . . . . . . . . 44
2.5.2 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.3 Effectiveness of the system . . . . . . . . . . . . . . . . 45
2.5.4 Efficiency of the system . . . . . . . . . . . . . . . . . 55
2.5.5 Comparison with commercial systems . . . . . . . . . . 57

3 Approximate matching for duplicate document detection 61


3.1 Document similarity measures . . . . . . . . . . . . . . . . . . 64
3.1.1 Logical representation of documents . . . . . . . . . . . 64
3.1.2 The resemblance measure . . . . . . . . . . . . . . . . 65
3.1.3 Other possible indicators . . . . . . . . . . . . . . . . . 68
3.2 Data reduction . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2.1 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.2 Intra-document reduction . . . . . . . . . . . . . . . . 73
3.2.3 Inter-document reduction . . . . . . . . . . . . . . . . 75
3.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 80
3.4.1 The similarity computation module . . . . . . . . . . . 81
3.4.2 Document generator . . . . . . . . . . . . . . . . . . . 83
3.4.3 Document collections . . . . . . . . . . . . . . . . . . . 83
3.4.4 Test results . . . . . . . . . . . . . . . . . . . . . . . . 85

II Pattern Matching for XML Documents 99


4 Query processing for XML databases 101
4.1 Tree signatures . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.1.1 The signature . . . . . . . . . . . . . . . . . . . . . . . 105
4.1.2 Twig pattern inclusion evaluation . . . . . . . . . . . . 106
4.2 A formal account of twig pattern matching . . . . . . . . . . . 109
4.2.1 Conditions on pre-orders . . . . . . . . . . . . . . . . . 113
4.2.2 Conditions on post-orders . . . . . . . . . . . . . . . . 115
4.2.3 On the computation of new answers . . . . . . . . . . . 118
4.2.4 Characterization of the delta answers . . . . . . . . . . 119
4.3 Exploiting content-based indexes . . . . . . . . . . . . . . . . 120
4.3.1 Path matching . . . . . . . . . . . . . . . . . . . . . . 120
4.3.2 Ordered twig matching . . . . . . . . . . . . . . . . . . 121

4.3.3 Unordered twig matching . . . . . . . . . . . . . . . . 128


4.4 An overview of pattern matching algorithms . . . . . . . . . . 130
4.5 Unordered decomposition approach . . . . . . . . . . . . . . . 135
4.5.1 Identification of the answer set . . . . . . . . . . . . . 136
4.5.2 Efficient computation of the answer set . . . . . . . . . 139
4.6 The XML query processor architecture . . . . . . . . . . . . . 142
4.7 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . 145
4.7.1 Experimental setting . . . . . . . . . . . . . . . . . . . 145
4.7.2 General performance evaluation . . . . . . . . . . . . . 148
4.7.3 Evaluating the impact of each condition . . . . . . . . 149
4.7.4 Decomposition approach performance evaluation . . . . 154

5 Approximate query answering in heterogeneous XML collections 157
5.1 Matching and rewriting services . . . . . . . . . . . . . . . . . 159
5.1.1 Schema matching . . . . . . . . . . . . . . . . . . . . . 162
5.1.2 Automatic query rewriting . . . . . . . . . . . . . . . . 167
5.2 Structural disambiguation service . . . . . . . . . . . . . . . . 169
5.2.1 Overview of the approach . . . . . . . . . . . . . . . . 171
5.2.2 The disambiguation algorithm . . . . . . . . . . . . . . 174
5.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.3.1 Approximate query answering . . . . . . . . . . . . . . 177
5.3.2 Free-text disambiguation . . . . . . . . . . . . . . . . . 178
5.3.3 Structural disambiguation . . . . . . . . . . . . . . . . 179
5.4 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . 180
5.4.1 Matching and rewriting . . . . . . . . . . . . . . . . . . 181
5.4.2 Structural disambiguation . . . . . . . . . . . . . . . . 185
5.5 Future extensions towards Peer-to-Peer scenarios . . . . . . . 191

6 Multi-version management
and personalized access to XML documents 195
6.1 Temporal versioning and slicing support . . . . . . . . . . . . 197
6.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . 197
6.1.2 Providing a native support for temporal slicing . . . . 200
6.2 Semantic versioning and personalization support . . . . . . . . 208
6.2.1 The complete infrastructure . . . . . . . . . . . . . . . 208
6.2.2 Personalized access to versions . . . . . . . . . . . . . . 210
6.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.3.1 Temporal XML representation and querying . . . . . . 214
6.3.2 Personalized access to XML documents . . . . . . . . . 216
6.4 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . 217

6.4.1 Temporal slicing . . . . . . . . . . . . . . . . . . . . . 217


6.4.2 Personalized access . . . . . . . . . . . . . . . . . . . . 222

Conclusions and Future Directions 225

A More on EXTRA techniques 227


A.1 The disambiguation techniques . . . . . . . . . . . . . . . . . 227
A.1.1 Preliminary notions . . . . . . . . . . . . . . . . . . . . 227
A.1.2 Noun disambiguation . . . . . . . . . . . . . . . . . . . 229
A.1.3 Verb disambiguation . . . . . . . . . . . . . . . . . . . 233
A.1.4 Properties of the confidence functions and optimization 235
A.2 The MultiEditDistance algorithms . . . . . . . . . . . . . . 235

B The complete XML matching algorithms 241


B.1 Notation and common basis . . . . . . . . . . . . . . . . . . . 241
B.2 Path matching . . . . . . . . . . . . . . . . . . . . . . . . . . 242
B.2.1 Standard version . . . . . . . . . . . . . . . . . . . . . 242
B.2.2 Content-based index optimized version . . . . . . . . . 245
B.3 Ordered twig pattern matching . . . . . . . . . . . . . . . . . 248
B.3.1 Standard version . . . . . . . . . . . . . . . . . . . . . 248
B.3.2 Content-based index optimized version . . . . . . . . . 253
B.4 Unordered twig pattern matching . . . . . . . . . . . . . . . . 261
B.4.1 Standard version . . . . . . . . . . . . . . . . . . . . . 261
B.4.2 Content-based index optimized version . . . . . . . . . 266
B.5 Sequential scan range filters . . . . . . . . . . . . . . . . . . . 270
B.5.1 Basic filter . . . . . . . . . . . . . . . . . . . . . . . . . 270
B.5.2 Content-based filter . . . . . . . . . . . . . . . . . . . . 271

C Proofs 273
C.1 Proofs of Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . 273
C.2 Proofs of Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . 274
C.3 Proofs of Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . 275
C.4 Proofs of Appendix B . . . . . . . . . . . . . . . . . . . . . . . 279
List of Figures

1.1 sub²Position Filter . . . . . . . . . . . . . . . . . . . . . . 15


1.2 Query for approximate sub²sequence match filtering . . . . . . 19
1.3 Filtering tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4 Running time tests . . . . . . . . . . . . . . . . . . . . . . . . 23
1.5 Scalability tests for Collection 2 . . . . . . . . . . . . . . . . . 24

2.1 Processes in an EBMT system . . . . . . . . . . . . . . . . . . 28


2.2 The suggestion search and document analysis processes . . . . 35
2.3 The syntactic and semantic analysis and the actual WSD phase 38
2.4 The role of the DBMS in pretranslation . . . . . . . . . . . . . 44
2.5 Examples of full and partial matches . . . . . . . . . . . . . . 45
2.6 Coverage of the collections as outcome of the search process . 46
2.7 Percentages of disambiguation success . . . . . . . . . . . . . . 49
2.8 Schema of a generic translation model . . . . . . . . . . . . . 51
2.9 Trends of total translation time in simulation . . . . . . . . . 54
2.10 Running time tests . . . . . . . . . . . . . . . . . . . . . . . . 56
2.11 Further efficiency tests . . . . . . . . . . . . . . . . . . . . . . 57

3.1 Representation of a mapping between chunks . . . . . . . . . . 65


3.2 Distance between document bubbles . . . . . . . . . . . . . . 77
3.3 The DANCER Architecture . . . . . . . . . . . . . . . . . . . 80
3.4 Effectiveness of the resemblance measure for Times100L . . . . 86
3.5 Sampling and length-rank selection experiments for Times100L 87
3.6 Chunk clustering experiments for Times100L . . . . . . . . . . 88
3.7 Document bubble experiments for Times100S and Times500s . 89
3.8 Chunk size tests results for Cite100M . . . . . . . . . . . . . . 89
3.9 Chunk size and length rank tests results in CiteR25 . . . . . . 90
3.10 Web of duplicate documents . . . . . . . . . . . . . . . . . . . 92
3.11 Results of the runtime tests with no data reduction . . . . . . 95
3.12 Efficiency tests results for intra-doc data reduction . . . . . . 96
3.13 Further efficiency tests results . . . . . . . . . . . . . . . . . . 97

4.1 Pre-order and post-order sequences of a tree . . . . . . . . . . 104


4.2 Properties of the pre-order and post-order ranks. . . . . . . . . 105
4.3 Sample query tree Q . . . . . . . . . . . . . . . . . . . . . . . 108
4.4 An example of a data tree and pattern matching results . . . . 110
4.5 Behavior of the domains during the scanning . . . . . . . . . . 111
4.6 Representation of the pre-order conditions in the domain space 113
4.7 Representation of the post-order conditions in the domain space 117
4.8 Target list management example . . . . . . . . . . . . . . . . . 122
4.9 Ordered Twig Examples . . . . . . . . . . . . . . . . . . . . . 124
4.10 Ordered Twig Examples . . . . . . . . . . . . . . . . . . . . . 125
4.11 Unordered Twig Examples . . . . . . . . . . . . . . . . . . . . 129
4.12 Pattern matching algorithms . . . . . . . . . . . . . . . . . . . 132
4.13 Examples for decomposition approach . . . . . . . . . . . . . . 136
4.14 Structural join of Example 4.9 . . . . . . . . . . . . . . . . . . 138
4.15 Structural join of Example 4.11 . . . . . . . . . . . . . . . . . 139
4.16 The unordered tree pattern evaluation algorithm . . . . . . . . 140
4.17 Evaluation of paths in Algorithm of Figure 4.16: an example . 141
4.18 Abstract Architecture of XSiter . . . . . . . . . . . . . . . . . 143
4.19 Structure and content of a Datastore . . . . . . . . . . . . . . 144
4.20 The queries used in the main tests . . . . . . . . . . . . . . . . 147
4.21 The query templates used in the decomposition approach tests 147
4.22 Comparison for mean domains size in different settings . . . . 152
4.23 Comparison between time in different settings . . . . . . . . . 153

5.1 The role of schema matching and query rewriting . . . . . . . 160


5.2 A given query is rewritten in order to be compliant . . . . . . 161
5.3 The XML S³MART matching and rewriting services . . . . . 162
5.4 Example of structural schema expansion (Schema A) . . . . . 163
5.5 Example of two related schemas and of the expected matches . 165
5.6 RDF and corresponding PCG for portions of Schemas A and B 166
5.7 Examples of query rewriting between Schema A and Schema B 168
5.8 A portion of the eBay® categories . . . . . . . . . . . . . . 170
5.9 The STRIDER graph disambiguation service. . . . . . . . . . 171
5.10 The disambiguation algorithm . . . . . . . . . . . . . . . . . . 175
5.11 The TermCorr() function . . . . . . . . . . . . . . . . . . . 175
5.12 The ContextCorr() function . . . . . . . . . . . . . . . . . . . 177
5.13 A small selection of the matching results before filtering . . . . 182
5.14 Results of schema matching between DBLP and SIGMOD . . 184
5.15 Mean precision levels for the three groups . . . . . . . . . . . 187
5.16 Typical context selection behavior for Group1 (Yahoo tree) . . 188
5.17 Typical context selection behavior for Group2 (IMDb tree) . . 189

5.18 Confidence delta values for OLMS . . . . . . . . . . . . . . . . 191

6.1 Reference example. . . . . . . . . . . . . . . . . . . . . . . . . 198


6.2 The temporal inverted indices for the reference example . . . . 201
6.3 The basic holistic twig join four level architecture . . . . . . . 202
6.4 Skeleton of the holistic twig join algorithms (HTJ algorithms) 203
6.5 The buffer loading algorithm Load . . . . . . . . . . . . . . . . 205
6.6 State of levels L1 and L2 during the first iteration . . . . . . . 206
6.7 The Complete Infrastructure . . . . . . . . . . . . . . . . . . . 209
6.8 An example of civic ontology . . . . . . . . . . . . . . . . . . . 212
6.9 Comparison between TS1 and TS2. . . . . . . . . . . . . . . . 219
6.10 Additional execution time comparisons. . . . . . . . . . . . . . 220
6.11 Comparison between the two collections C-R and C-S. . . . . 221
6.12 Scalability results for TS1. . . . . . . . . . . . . . . . . . . . . 222

A.1 Example of WordNet hypernym hierarchy . . . . . . . . . . . 228


A.2 Example of the gaussian decay function D . . . . . . . . . . . 231
A.3 MultiEditDistance between subsequences in σq and σ . . . . . 236
A.4 Example of approximate sub² matches with inclusion filter . . . 237
A.5 Approximate sub² matching with inclusion filtering . . . . . . . 238

B.1 How domains and pointers are implemented . . . . . . . . . . 242


B.2 Path matching algorithm . . . . . . . . . . . . . . . . . . . . . 244
B.3 Content index optimized path matching algorithm . . . . . . . 247
B.4 Ordered twig matching algorithm . . . . . . . . . . . . . . . . 250
B.5 Ordered twig matching auxiliary functions . . . . . . . . . . . 251
B.6 Ordered twig matching solution construction . . . . . . . . . . 252
B.7 Content index optimized ordered twig matching algorithm . . 256
B.8 Content index optimized ordered twig matching algorithm aux-
iliary functions . . . . . . . . . . . . . . . . . . . . . . . . . . 257
B.9 Ordered Target list management functions (part 1) . . . . . . 259
B.10 Ordered Target list management functions (part 2) . . . . . . 260
B.11 Unordered twig matching algorithm . . . . . . . . . . . . . . . 262
B.12 Unordered twig matching auxiliary functions . . . . . . . . . . 263
B.13 Unordered twig matching solution construction (part 1) . . . . 264
B.14 Unordered twig matching solution construction (part 2) . . . . 265
B.15 Content index optimized unordered twig matching algorithm . 268
B.16 Content index optimized unordered twig matching algorithm
auxiliary functions . . . . . . . . . . . . . . . . . . . . . . . . 269
B.17 Unordered Target list management functions . . . . . . . . . . 270
List of Tables

2.1 Examples of improvements offered by stemming and WSD . . 48


2.2 Description of the simulation input parameters . . . . . . . . . 52
2.3 Results of the simulation runs . . . . . . . . . . . . . . . . . . 53
2.4 Pre-translation comparison test results for Collection1 . . . . . 59

3.1 Specifications of the synthetic collections used in the tests . . 85


3.2 Specifications of the real collections used in the tests . . . . . 85
3.3 Affinity and noise values for the CiteR25 collection . . . . . . 90
3.4 Correlation discovery comparison between Citeseer and Dancer 91
3.5 Results of violation detection tests . . . . . . . . . . . . . . . 93

4.1 The XML data collections used for experimental evaluation . . 145
4.2 DBLP Test-Collection Statistics . . . . . . . . . . . . . . . . . 146
4.3 Pattern matching results . . . . . . . . . . . . . . . . . . . . . 148
4.4 Summary of the discussed cases . . . . . . . . . . . . . . . . . 150
4.5 Behavior of the isCleanable() and isNeeded() functions . . . . 151
4.6 Performance comparison for unordered tree inclusion . . . . . 154

5.1 Features of the tested trees . . . . . . . . . . . . . . . . . . . . 186


5.2 Delta values of the selected senses . . . . . . . . . . . . . . . . 190

6.1 Evaluation of the computation scenarios with TS1. . . . . . . 218


6.2 Features of the test queries and query execution time . . . . . 223

A.1 Symbols and meanings . . . . . . . . . . . . . . . . . . . . . . 229

B.1 Path matching functions . . . . . . . . . . . . . . . . . . . . . 243


B.2 Path matching functions for content-based search . . . . . . . 246
B.3 Ordered twig matching functions . . . . . . . . . . . . . . . . 249
B.4 Ordered twig matching functions for content-based search . . . 254
B.5 Target list management functions . . . . . . . . . . . . . . . . 255
B.6 Unordered twig matching functions . . . . . . . . . . . . . . . 261

B.7 Target list management functions . . . . . . . . . . . . . . . . 267


Acknowledgments

I would like to begin this thesis with some words of gratitude to all people
who helped me and were close to me during my Ph.D. experience.
First of all, the people that made all this possible: My supervisor Paolo
Tiberio, an always stimulating and incredibly kind person who gave me the
possibility of working on the topics that interested me in the Information Sys-
tems research group, and Federica Mandreoli. Indeed, to thank you properly,
Federica, I would have to write several pages... what could I say in a few words?
Ever since my master thesis, you have been (and are) an amazing supervisor,
an ideal co-author, a colleague with whom working is more a pleasure than just
“work”, and a good friend. Thank you for showing me the beauty of research
and for all the interesting discussions we had; it is always incredible for me
to see your “multi-tasking” brain getting all those brilliant ideas, even when
you are away from university doing something else, such as driving your car
back to Bologna or taking care of your sons...
I would like to thank all my colleagues and coauthors (and friends) at
ISGroup: Thank you Enrico and Mattia, for all the enjoyable moments we
have spent and the remarkable results we achieved together, and thank you
Simona, the cooperation with you has just started but you instantly left your
mark in our group (and in me) with the exceptional enthusiasm, wonderful
pleasantness and high-quality work that distinguish you. Many thanks also to
Pavel Zezula and Fabio Grandi, working with you is always a truly inspiring
experience. Thank you all, this thesis would not have been as it is without
your precious collaboration!
Special thanks to Sonia Bergamaschi and Domenico Beneventano, who have
always been friendly and helpful to me on many occasions, and to all my
friends at LabInfo and “lunch mates”, Francesco, Luca, Maurizio, Robby...
Last but not least, to my parents: A huge “Thank You” for the constant
support you provided in all these years!
January 25, 2006
Riccardo Martoglia
Introduction

Information is the main value of the Information Society. The recent developments
in computing power and telecommunications, along with the steady drop in the
costs of Internet access, data management and storage, created the right conditions
for the global diffusion of the Web and, more generally, of new research tools able
to analyze information and its contents. However, for them to create added value
in all areas of the Internet/Information economy, including education, research and
engineering, Information Retrieval (IR) techniques must be able to answer every
user information need both effectively and efficiently, guiding users through the
sea of information rather than confusing them with information overload.
Depending on the particular application scenario and on the type of in-
formation that has to be managed and searched, different techniques need to
be devised. The work presented in this thesis mainly deals with two types
of information: Plain text, to which Part I is dedicated, and semi-structured
data, in particular XML documents, discussed in depth in Part II.
Text information is the main form of electronic information representation:
it is sufficient to consider the quantity of text available on the web (more
than 8 billion indexed pages, doubling every six months) and its richness
and flexibility of use. Sequences, intended as logical units of meaningful
successions of terms, such as genetic sequences or plain natural language
sentences, can be considered the backbone of textual data. In order to exploit
the full potential of textual sequence repositories, it is necessary to rely
on pattern matching techniques and, in particular, on approximate (similarity)
matching ones, developing new algorithms and data structures that go
beyond exact match and are able to find out how similar a given text (i.e. a
query) is to another (i.e. one of the available textual data items). In Chapter
1 we present a purely syntactic approach for searching similarities within
sequences. The underlying similarity measure is exploitable for any language
since it is based on the similarity between sequences of terms, such that the
parts closest to a given one are those which maintain most of the original
form and contents. Efficiency in retrieving the most similar parts available in
the sequence repository is ensured by exploiting specifically devised filtering
and data reduction techniques. The proposed techniques have been special-
ized and applied to a number of application contexts, from Example-Based
Machine Translation (to which Chapter 2 is dedicated) to syntactical doc-
ument similarity search and independent sentence repositories correlation
(analyzed in Chapter 3). For both these areas complete working systems,
named EXTRA and DANCER, respectively, have been designed, developed
and successfully tested.
Besides textual information, semi-structured and, specifically, XML data
is becoming more and more popular. Indeed, nowadays XML has quickly
become the de facto standard for data exchange and for heterogeneous data
representation over the Internet. This is also due to the recent emergence
of wrappers for the translation of web documents into XML format. Even
more than for text, querying and accessing semi-structured information in an
effective and efficient way requires considerable effort in several synergic areas.
First of all, robust query processing techniques over data that conform
to the labelled-tree data model are needed. The idea behind evaluating tree
pattern queries, sometimes called twig queries, is to find all existing ways
of embedding the pattern in the data. In Chapter 4, efficient evaluation
techniques for all the main types of tree pattern matching are presented;
by taking advantage of pre/post order coding scheme and of the sequential
nature of a compact representation of the data, we define a complete set of
conditions under which data nodes accessed at a given step are no longer
necessary for the subsequent generation of matching solutions and show how
such conditions can be used to write pattern matching algorithms that are
correct and which, from a numbering scheme point of view, cannot be further
improved. The proposed algorithms have formed the backbone of a complete
and extensible XML query processor, named XSiter, which also incorporates,
in a flexible architecture, ad-hoc indexing structures and storing facilities.
Efficient exact structural search techniques, such as the one presented, are
necessary and should be exploited in every next-generation structural search
engine; however, they are not sufficient on their own to fully answer user needs
in the most advanced scenarios. In particular, providing effective and efficient
search among large numbers of “related” XML documents is a particularly
challenging task. Such repositories often collect documents coming from
different sources, heterogeneous in the structures adopted for their
representation but related in the contents they deal with. In order
to fully exploit the data available in such document repositories, an entire
ensemble of systems and services is needed to help users to easily find and
access the information they are looking for. In Chapter 5, we propose novel
approximate query answering techniques, which allow the approximation of
the user queries with respect to the different documents available in a col-
lection. Such techniques first exploit a reworking of the documents’ schemas
(schema matching); then, with the extracted information, the structural
components of the query are interpreted and adapted (query rewriting). The key
to good effectiveness is to exploit the right meaning of the terminology
employed in the schemas. To this end, we propose a further service for
automatic structural disambiguation which can prove valuable in enhancing the
effectiveness of the matching (and rewriting) techniques. Indeed, the presented
approach is completely generic and versatile and can be used to make
explicit the meaning of a wide range of structure-based information, such as
XML schemas, the structures of XML documents, web directories and also
ontologies. All the discussed services have been implemented in our XML
S³MART system, which includes the STRIDER disambiguation component.
Finally, in Chapter 6, we also consider the versioning aspect of XML
management and deal with the problem of managing and querying time-
varying multi-version XML documents. Indeed, as data changes over time,
the possibility to deal with historical information is essential to many com-
puter applications, such as accounting, banking, law, medical records and
customer relationship management. The central issue of supporting tempo-
ral versioning, i.e. most temporal queries in any language, is time-slicing
the input data while retaining period timestamping. A time-varying XML
document records a version history and temporal slicing makes the different
states of the document available to the application needs. Standard XML
query engines are not aware of the temporal semantics, and this makes
it more difficult to map temporal XML queries into efficient “vanilla” queries
and to apply query optimization and indexing techniques particularly suited
for temporal XML documents. In the light of these facts, in a completely
general setting, we firstly propose a native solution to the temporal slicing
problem, addressing the question of how to construct a complete XML query
processor supporting temporal querying. Then, we focus on one of the most
interesting scenarios in which such techniques can be successfully exploited,
the eGovernment one. In this context, we present how the slicing technology
can be adapted and exploited in a complete normative system in order to
provide efficient access to repositories of temporal XML norm texts. Further,
we propose additional techniques in order to support personalized access
to them.
Part I

Pattern Matching
for Plain Text
Chapter 1

Approximate (sub)sequence
matching

Textual data is the main electronic form of knowledge representation. With
the advent of databases, storing large amounts of textual data has become
an effortless and widespread task. On the other hand, exploiting the full
potentiality of unstructured repositories and thus understanding the utility of
the information they contain is a much more complex task, strictly connected
to the application they serve.
Sequences, meant as logic units of meaningful term successions, can be
considered the backbone of textual data. Consider, for instance, genetic
sequences, where the terms are genetic symbols, or plain natural language
sentences, formed by words. To name just few examples of sequence use, con-
sider the adoption of sentences for the description of the real world modelled
in the database and their role in composing documents. Searching in se-
quence repositories often requires to go beyond exact matching to determine
the sequences which are similar or close to a given query sentence (approxi-
mate matching). The similarity involved in this process can be based either
on the semantics of the sequence or just on its syntax. The former consid-
ers the meaning of the terms in the sequences, thus allowing information
extraction and manipulation. Such an approach is often based on informa-
tion retrieval techniques, such as the analysis of term frequencies, and has
been widely adopted for text segmentation, categorization and summariza-
tion [7]. The latter approach disregards the semantic content and focuses
on the structure of the sequence, thus enabling the location of similar word
sequences.
Many applications may benefit from such a facility. We will now briefly
introduce the two main applications that motivated us in devising the new
generic approximate sequence matching techniques we will present in this
chapter. Chapters 2 and 3 will then provide a much more detailed analysis
of how such techniques can be exploited in real applications. The first sce-
nario is the one of EBMT (Example-Based Machine Translation), one of the
most promising paradigms for multilingual document translation. An EBMT
system translates by analogy: it is given a set of sentences in the source lan-
guage (from which one is translating) and their corresponding translations
in the target language, and uses those examples to translate other, similar
source-language sentences into the target language. Large corpora of bilin-
gual text are maintained in a database, known as translation memory. Due to
the high complexity and extent of languages, in most cases it is rather difficult
for a translation memory to store the exact translation of a given sentence.
Thus, an EBMT system proves to be a useful translator assistant only
if most of the suggestions provided are based on similarity searching rather
than exact matching. Besides EBMT, there are many other motivating ap-
plications, such as syntactical document similarity search and independent
sentence repositories correlation. Syntactical document similarity search in-
volves the comparison of a query document against the data documents so
that some relevance between the document and the query is obtained. For
this purpose, documents are usually broken up into more primitive units
such as sentences and inserted into a database. When a document is to be
compared against the stored documents, only the documents that overlap at
the unit level will be considered. This type of document similarity can be
exploited both for copy detection [122] and for similar document retrieval
services, as the one offered by the digital library CiteSeer [80]. The corre-
lation of independent sentence repositories is a prerequisite of applications
such as warehousing and mining which analyse textual data. In this context,
correlation between the data should be based on approximate joins which
also take flexibility in specifying sentence attributes into account.
We argue that the kind of similarity matching useful for most applications
should go beyond the search for whole sequences. The similarity matching
we refer to attempts to match any parts of data sequences against any query
parts. Although complex, this kind of search enables the detection of
similarities that could otherwise remain unidentified. Even if some works in the
literature address the problem of similarity search in the context of information
extraction, we are not aware of works related to finding syntactic similarities
between sequences. In particular, we are not aware of solutions fitting into
a DBMS context, which represents the most common choice adopted by the
above cited applications for managing their large amount of textual data.
In this chapter, we propose this kind of solution, based on a purely syntac-
tic approach for searching similarities within sequences [90]. The underlying
similarity measure is exploitable for any language since it is based on the
similarity between sequences of terms, such that the parts closest to a
given one are those which maintain most of the original form and contents.
Applying an approximate sub²sequence matching algorithm to a given query
sequence and a collection of data sequences is extremely time consuming.
Efficiency in retrieving the most similar parts available in the sequence repos-
itory is ensured by exploiting filtering techniques. Filtering is based on the
fact that it may be much easier to state that two sequences do not match
than to tell that they match. We introduce two new filters for approximate
sub²sequence matching which quickly discard sequences that cannot
match, efficiently ensuring no false dismissals and few false positives. Section
1.1 introduces the foundation, i.e. the similarity measure and filters, of our
approximate sub²sequence matching.
As far as the matching processing is concerned, we chose a solution that
would require minimal changes to existing databases. In Section 1.2 we show
how sequence similarity search can be mapped into SQL expressions and
optimized by conventional optimizers. The immediate practical benefit of
our techniques is twofold: approximate sub²sequence matching can be widely
deployed without changes to the underlying database and existing facilities,
like the query optimizer, can be reused, thus ensuring efficient processing.
Finally, in Section 1.3 we assess and evaluate the results of the conducted
experiments and, in Section 1.4, we discuss related works on approximate
matching.

1.1 Foundation of approximate matching for


(sub)sequences
The problem of searching similarities between sequences is addressed by in-
troducing a syntactic approach which analyzes the sequence contents in order
to find similar parts. In particular, we characterize the problem of approxi-
mate matching between sequences as a problem of searching for similar whole
sequences or parts of them. In Section 1.1.1 we briefly review the notion
of sequence matching from the literature, mainly focusing on sequences of
characters. Then, in Section 1.1.2 we introduce the notion of approximate
sub²sequence matching.

1.1.1 Background
The problem of sequence matching has been extensively studied in the lit-
erature as sequences constitute a large portion of data stored in computers.
In this context, approximate matching based on a specified distance function
can be classified into two categories [55]:

• Whole matching. Given a query sequence and a collection of data
sequences, find those data sequences that are within a given distance
threshold from the query sequence.

• Subsequence matching. Given a query sequence and a collection of
data sequences, find those data sequences that contain matching subse-
quences, that is subsequences that are within a given distance threshold
from the query sequence.

As far as the underlying distance function is concerned, we opt to exploit the
analogy between a sequence of terms, such as a sentence, and a sequence of
characters, i.e. a string, and thus we rely on the existing distance functions
between strings. In most cases, dealing with sequences of characters implies
adopting the edit distance notion to capture the concept of approximate
equality between strings [103]. In the following, we introduce the notion of
edit distance between sequences of elements by adopting the notation below:
given a sequence σ of elements, then:

• |σ| is the length of the sequence;

• σ[i] is the i-th element of the sequence;

• σ[i . . . j] is the subsequence of length j − i + 1 starting at position i and
ending at position j.

Definition 1.1 (Edit Distance between sequences) Let σ1 and σ2 be two
sequences. The edit distance between σ1 and σ2 (ed(σ1 , σ2 )) is the minimum
number of edit operations (i.e., insertions, deletions, and substitutions) of
single elements needed to transform the first sequence into the second.

One of the techniques for computing approximate string matching efficiently
relies on filters. Filtering is based on the fact that it may be much
easier to state that two sequences do not match than to state that they
match. Therefore, filters quickly discard sequences that cannot match, ef-
ficiently ensuring no false dismissals and few false positives. In particular,
three well known filtering techniques are widely used for approximate string
matching: count filtering, position filtering, and length filtering. They rely
on matching short parts of length q, denoted as q-grams [134, 126, 127, 61],
of the involved sequences. Given a sequence σ, its positional q-grams are
obtained by “sliding” a window of length q over the elements of σ. Since
q-grams at the beginning and the end of the sequence can have fewer than
q terms from σ, new terms “#” and “$” not in the term grammar are in-
troduced, and the sequence σ is conceptually extended by prefixing it with
q − 1 occurrences of “#” and suffixing it with q − 1 occurrences of “$”.

Definition 1.2 (Positional q-gram) A positional q-gram of a sequence σ
is a pair (i, σ[i . . . i + q − 1]), where σ[i . . . i + q − 1] is the q-gram of σ that starts
at position i, counting on the extended sequence. The set Gσ of all positional
q-grams of a sequence σ is the set of all the |σ| + q − 1 pairs constructed from
all q-grams of σ.

The filtering techniques basically take the total number of q-gram matches
and the positions of individual q-gram matches into account:
Proposition 1.1 (Count Filtering) Consider sequences σ1 and σ2 . If σ1
and σ2 are within an edit distance of d, then the cardinality of Gσ1 ∩ Gσ2 ,
ignoring positional information, must be at least max(|σ1 |, |σ2 |)−1−(d−1)∗q.

Proposition 1.2 (Position Filtering) If sequences σ1 and σ2 are within


an edit distance of d, then a positional q-gram in one cannot correspond to a
positional q-gram in the other that differs from it by more than d positions.

Proposition 1.3 (Length Filtering) If sequences σ1 and σ2 are within an


edit distance of d, their lengths cannot differ by more than d.

Proof and explanations of the above filters can be found in [126, 127].
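To make the above notions concrete, the following Java sketch (our own illustration: class and method names are not part of [126, 127], and terms are plain strings) extracts the positional q-grams of Definition 1.2 and applies the count and length filters of Propositions 1.1 and 1.3 to a pair of term sequences.

import java.util.*;

// Illustrative sketch of positional q-grams (Def. 1.2) and of the count and
// length filters (Prop. 1.1 and 1.3) over sequences of terms.
class QGramFiltering {

    // A positional q-gram: the starting position i (on the extended sequence)
    // and the q terms starting at i.
    record PositionalQGram(int pos, List<String> terms) {}

    // Builds the |sigma| + q - 1 positional q-grams of sigma, conceptually
    // extended with q-1 leading "#" and q-1 trailing "$" terms.
    static List<PositionalQGram> qgrams(List<String> sigma, int q) {
        List<String> ext = new ArrayList<>(Collections.nCopies(q - 1, "#"));
        ext.addAll(sigma);
        ext.addAll(Collections.nCopies(q - 1, "$"));
        List<PositionalQGram> gs = new ArrayList<>();
        for (int i = 0; i + q <= ext.size(); i++)
            gs.add(new PositionalQGram(i + 1, List.copyOf(ext.subList(i, i + q))));
        return gs;
    }

    // Count filtering (Prop. 1.1): the number of common q-grams, ignoring
    // positions, must be at least max(|s1|,|s2|) - 1 - (d-1)*q.
    static boolean countFilter(List<String> s1, List<String> s2, int q, int d) {
        Map<List<String>, Integer> bag = new HashMap<>();
        for (PositionalQGram g : qgrams(s1, q)) bag.merge(g.terms(), 1, Integer::sum);
        int common = 0;
        for (PositionalQGram g : qgrams(s2, q)) {
            Integer c = bag.get(g.terms());
            if (c != null && c > 0) { common++; bag.put(g.terms(), c - 1); }
        }
        return common >= Math.max(s1.size(), s2.size()) - 1 - (d - 1) * q;
    }

    // Length filtering (Prop. 1.3): lengths cannot differ by more than d.
    static boolean lengthFilter(List<String> s1, List<String> s2, int d) {
        return Math.abs(s1.size() - s2.size()) <= d;
    }
}

Position filtering (Proposition 1.2) would additionally compare the pos fields of the matching q-grams and discard those matches whose positions differ by more than d.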

1.1.2 Approximate sub²sequence matching


In this section we introduce the concept of similarity searches within se-
quences. In particular, the kinds of approximate matches between sequences
considered are based on one part of one sequence being similar to another.
We adopt the edit distance defined in Def. 1.1 as similarity measure between
(parts of) sequences. This approach enables us to take the position of terms
in the sequence into account: we consider two parts to be more similar the more
they maintain the same order of the same terms. For instance, consider a sentence
example: the part “the dog eats the cat” is more similar to “the dog
eats the mouse” than to “the cat eats the dog”. Thus, given two sequences
of terms σ1 and σ2 , ed(σ1 [i1 . . . j1 ], σ2 [i2 . . . j2 ]) denotes the edit
distance between the two parts σ1 [i1 . . . j1 ] and σ2 [i2 . . . j2 ]. In particular, if
i1 = 1, i2 = 1, j1 = |σ1 |, j2 = |σ2 |, ed(σ1 [i1 . . . j1 ], σ2 [i2 . . . j2 ]) will be de-
noted as ed(σ1 , σ2 ) and represents the edit distance between the two whole
sequences of terms.
The operation of approximate matching that we introduce in the following
definition extends the notion of subsequence/whole matching in order to
locate (parts of) sequences that match (parts of) query sequences.
Definition 1.3 (Approximate sub²sequence matching) Given a collec-
tion of query sequences Q and a collection of data sequences D not neces-
sarily distinct, a distance threshold d and a minimum length minL, find
all pairs of sequences (σ1 [i1 . . . j1 ], σ2 [i2 . . . j2 ]) such that σ1 ∈ Q, σ2 ∈ D,
(j1 −i1 +1) ≥ minL, (j2 −i2 +1) ≥ minL and ed(σ1 [i1 . . . j1 ], σ2 [i2 . . . j2 ]) ≤ d.
Notice that whole matching applies when both sequences σ1 and σ2 are con-
sidered as a whole, whereas subsequence matching applies with only one of
them as a whole.
Applying an approximate sub²sequence matching algorithm to a given
query sequence and a collection of data sequences is extremely time consum-
ing. The main challenge is thus to find filtering techniques suitable for the
problem introduced in Def. 1.3. Such filtering techniques should operate on
whole sequence pairs and efficiently hypothesize a small set of them as match-
ing candidates. As to sequence content, only that of the candidate answers
will be further analyzed. For this purpose, we reexamine the approaches in-
troduced for approximate whole matching. Since our problem has fewer constraints
than the one studied in the literature, most of the properties that should be
satisfied by the matching sequence pairs and on which filters are based are
no longer true. For instance, length filtering is clearly not applicable in this
case. As far as the properties underlying count filtering and position filtering
are considered, in the following we propose two new filters.

sub²Count filtering


The concept underlying the count filter is still applicable, since the two
sequences must obviously share a certain minimum number of q-grams in order
for one to contain a part suggestion for the other. In particular, the number
of common q-grams must be determined with respect to some parts of the
sequences. Unfortunately, it is almost impractical to extend all the possible
parts of all the sequences for q-gram computation. Thus, for approximate
sub²sequence matches we consider q-grams from sequences not extended with
“#” and “$”.
Proposition 1.4 If two sequences σ1 , σ2 have a pair of sequences (σ1 [i1 . . . j1 ],
σ2 [i2 . . . j2 ]) such that (j1 − i1 + 1) ≥ minL, (j2 − i2 + 1) ≥ minL and
ed(σ1 [i1 . . . j1 ], σ2 [i2 . . . j2 ]) ≤ d then the cardinality of Gσ1 ∩ Gσ2 must be at
least minL + 1 − (d + 1) ∗ q.
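A possible rendering of the sub²Count filter in Java follows (again an illustrative sketch with assumed names); note that, as stated above, the q-grams are taken from the sequences without the “#”/“$” extension and the threshold becomes minL + 1 − (d + 1) ∗ q.

import java.util.*;

// Illustrative sketch of sub²Count filtering (Prop. 1.4): q-grams are taken
// from the NON-extended sequences and compared ignoring positions, as in Prop. 1.1.
class Sub2CountFiltering {

    static boolean sub2CountFilter(List<String> s1, List<String> s2,
                                   int q, int minL, int d) {
        Map<List<String>, Integer> bag = new HashMap<>();
        for (int i = 0; i + q <= s1.size(); i++)
            bag.merge(List.copyOf(s1.subList(i, i + q)), 1, Integer::sum);
        int common = 0;
        for (int i = 0; i + q <= s2.size(); i++) {
            List<String> g = List.copyOf(s2.subList(i, i + q));
            Integer c = bag.get(g);
            if (c != null && c > 0) { common++; bag.put(g, c - 1); }
        }
        return common >= minL + 1 - (d + 1) * q;   // necessary condition of Prop. 1.4
    }
}

With minL = 4, d = 1 and q = 1, for instance, the threshold evaluates to 3.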
sub²PosFilter(String S1, String S2, int minL, int d)
{
  int w ← minL;               // window size
  int c ← minL − d;           // count threshold

  int[] S1_c;                 // counters
  int p1, p2;                 // positions in sentences
  boolean[] S1_lim;           // increment/decrement limitation check

  for (p2 = 1 ... |S2|)       // outer sentence cycle
  {
    if (p2 − w > 0)
    {
      S1_lim ← false;
      for (p1 = 1 ... |S1|)   // inner sentence cycle
      {
        if (S1[p1] = S2[p2 − w])
        {
          for (i = 0 ... w−1) // decrement cycle
          {
            if (!S1_lim[p1+i]) { S1_c[p1+i]--; S1_lim[p1+i] ← true; }
          }
        }
      }
    }

    S1_lim ← false;
    for (p1 = 1 ... |S1|)     // inner sentence cycle
    {
      if (S1[p1] = S2[p2])
      {
        for (i = 0 ... w−1)   // increment cycle
        {
          if (!S1_lim[p1+i]) { S1_c[p1+i]++; S1_lim[p1+i] ← true; }
        }
        if (S1_c[p1] >= c) return true;
      }
    }
  }

  return false;
}

Figure 1.1: sub²Position Filter

sub²Position filtering


While sub²Count filtering is effective in improving the efficiency of
approximate sub²sequence matching, it does not take advantage of position and
order information.

Example 1.1 Let minL = 4, d = 1, q = 1. Then, the minimum number of
common q-grams (in this case, words) threshold for sub²Count filtering is 3.
Consider the following sequence (here, sentence) pair, where equal words are
emphasized (software, computer and graphics):
σ1 : ABC software welcomes you to the world of computer graphics.
σ2 : XPaint is a new computer graphics software.
As you can see, since the actual number of common words is 3, sub²CountFilter
(σ1 ,σ2 ) returns TRUE even though they do not contain similar parts satisfying
the parameters specified at the beginning of the example. ¤
Providing a position filter for approximate sub²sequence matching is not
a simple task, since the algorithm, without knowing where the candidate
matching sequences start, should still be efficient and effective. The sub²Position
filter is a new filtering technique explicitly designed to further improve the
performance of the approximate sub²sequence matching operation: it offers a
much better filtration rate than simple count filtering, efficiently pruning out
a higher number of sequence pairs. In particular, the sub²Position filter works by
dynamically analyzing the relative positions and (partially) the order of equal
terms in the sequences. The schematized dynamic programming algorithm is
shown in Figure 1.1.
The filter receives as input the two sequences σ1 and σ2 , a minimum
length minL and a distance threshold d, and performs two nested cycles: an
outer cycle on the terms of σ2 and an inner cycle on the terms of σ1 . Each
position p1 in σ1 has an associated counter σ1 c[p1 ]: given a term in σ2 , say
σ2 [p2 ], each time it is equal to a term in σ1 , say σ1 [p1 ], the term counters
σ1 c[p1 ] through σ1 c[p1 + w − 1], where w (w = minL) is the filter window size,
are incremented. When the term counter σ1 c[p1 ] at position p1 reaches
the count threshold c and its term, i.e. σ1 [p1 ], is equal to the current one in
the outer cycle, i.e. σ2 [p2 ], the filter stops and returns TRUE; otherwise, that
is, if none of the counters reaches the threshold, the filter returns FALSE.
Notice that the count threshold c used in sub²Position filtering is the
same as the one used in the sub²Count filter when q-grams have length 1
(i.e. q = 1). Furthermore, notice that the filter algorithm “cleverly” marks
an entire window of terms ahead of the equal term found, making it easier to
identify “clusters” of equal terms without having to analyze the surroundings
of each one of them. For further accuracy, together with the increment cycle,
the filter also provides a decrement cycle, to decrement the counters of the
terms in w or more positions earlier. An increment/decrement limitation
is also used: it improves the filter effectiveness in the case of clusters of
repeated terms, preventing the modification of each counter more than once
for each outer cycle. As an explanation of the filter mechanisms, consider
the following examples.
Example 1.2 Let minL = 4, d = 1. Then, the threshold c for sub²Position
filtering is 3 and the window size w is 4. Consider the sentence pair of
Example 1.1, where the first four terms of σ2 are not found in σ1 . From
p2 = 5 to p2 = 7, the algorithm works as follows (the term marked with ►
is the current term in the outer cycle; the number in parentheses after the
term at position p1 in σ1 is the current value of σ1 c[p1 ]):

p2 = 5 :
σ1 : ABC software welcomes you to the world of computer(1) graphics(1).
σ2 : XPaint is a new ►computer graphics software.

p2 = 6 :
σ1 : ABC software welcomes you to the world of computer(1) graphics(2).
σ2 : XPaint is a new computer ►graphics software.

p2 = 7 :
σ1 : ABC software(1) welcomes(1) you(1) to(1) the world of computer(1) graphics(2).
σ2 : XPaint is a new computer graphics ►software.

Notice that none of the counters reaches the threshold 3, so sub²PosFilter
(σ1 ,σ2 ) returns FALSE, correctly pruning out this pair of sequences (sentences). ¤

By using the ahead-marking technique, the filter is even able to ensure a
smaller number of false positives than that offered by standard position
filters, as shown in the following example:

Example 1.3 Let minL = 3, d = 1. Then, the threshold c for sub²Position
filtering is 2 and the window size w is 3. Consider the following sequence
(sentence) pair which represents a wrong candidate answer for the standard
position filter since it contains two close and equal terms:
σ1 : XPaint is very easy to use.
σ2 : Is XPaint a bitmap processing software?
The filter counters are updated for the first two (common) terms (same
notation as in Example 1.2):

p2 = 1 :
σ1 : XPaint is(1) very(1) easy(1) to use.
σ2 : ►Is XPaint a bitmap processing software?

p2 = 2 :
σ1 : XPaint(1) is(2) very(2) easy(1) to use.
σ2 : Is ►XPaint a bitmap processing software?
As you can see, some counters reach the threshold (i.e. 2), but not σ1 c[1],
which is the counter of the only term equal to the current term in the outer
cycle. So, sub²PosFilter(σ1 ,σ2 ) returns FALSE, correctly pruning out even
this pair of sentences. ¤
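For completeness, a runnable Java transcription of the pseudocode of Figure 1.1 is sketched below; the 0-based indexing, the explicit bounds checks on p1 + i and the assumption that terms have already been normalized (e.g. lower-cased) by the document analysis phase are our own implementation choices.

// Sketch of the sub²Position filter of Figure 1.1 in plain Java.
class Sub2PositionFilter {

    static boolean sub2PosFilter(String[] s1, String[] s2, int minL, int d) {
        int w = minL;                       // window size
        int c = minL - d;                   // count threshold
        int[] s1c = new int[s1.length];     // one counter per position of s1

        for (int p2 = 0; p2 < s2.length; p2++) {            // outer cycle on s2
            if (p2 - w >= 0) {                              // decrement cycle on the term w positions back
                boolean[] limDec = new boolean[s1.length];  // at most one update per counter
                for (int p1 = 0; p1 < s1.length; p1++)
                    if (s1[p1].equals(s2[p2 - w]))
                        for (int i = 0; i < w && p1 + i < s1.length; i++)
                            if (!limDec[p1 + i]) { s1c[p1 + i]--; limDec[p1 + i] = true; }
            }
            boolean[] limInc = new boolean[s1.length];
            for (int p1 = 0; p1 < s1.length; p1++) {        // increment cycle on the current term
                if (s1[p1].equals(s2[p2])) {
                    for (int i = 0; i < w && p1 + i < s1.length; i++)
                        if (!limInc[p1 + i]) { s1c[p1 + i]++; limInc[p1 + i] = true; }
                    if (s1c[p1] >= c) return true;          // enough close matches ahead of p1
                }
            }
        }
        return false;
    }
}

On the (lower-cased) sentences of Example 1.3, sub2PosFilter(σ1 , σ2 , 3, 1) returns false, as expected.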
As far as the correctness of the filter is concerned, we provide the following
theorem.

Theorem 1.1 Let σ1 and σ2 be two sequences, d be a distance threshold and
minL be a minimum length. If there exists a pair (σ1 [i1 . . . j1 ] ∈ σ1 , σ2 [i2 . . . j2 ] ∈
σ2 ) such that (j1 − i1 + 1) ≥ minL, (j2 − i2 + 1) ≥ minL and ed(σ1 [i1 . . . j1 ],
σ2 [i2 . . . j2 ]) ≤ d, then the sub²Position filter sub²PosFilter(σ1 , σ2 , minL, d)
returns TRUE.

1.2 Approximate matching processing


The approximate sub²sequence matching problem can be easily expressed
in any database system supporting user-defined functions (UDFs), such as
Oracle and DB2. In the rest of the section, we show how the algorithms
used for sequence similarity search can be mapped into SQL expressions and
optimized by conventional optimizer. The immediate practical benefit of our
techniques is that approximate search can be widely and efficiently deployed
without changes to the underlying database. Let D be a table containing
the data sequences and Q an auxiliary table storing the query sequences,
which is created on-the-fly. Both tables share the same schema (COD, SEQ),
where COD is the key attribute and SEQ the sequence. In order to enable
approximate sub2 sequence matching processing through filtering techniques
based on q-grams, the database must be augmented with the data about
q-grams corresponding to the data and query sequences, maintained in D and
Q respectively, and stored in two auxiliary tables Qq and Dq with the same
schema (COD,POS,Qgram). For each sequence σ, its positional q-grams are
represented as separate tuples in the above tables, where POS identifies the
position of the q-gram Qgram. The positional q-grams of S share the same
value for the attribute COD, which serves as the foreign key attribute to the
table storing S.
The SQL expression exploiting the filtering techniques for approximate
sub2 sequence matches has the form pictured in Figure 1.2. It shows that
filters can be expressed as an SQL expression and efficiently implemented
by a commercial relational query engine. The involved SQL expression joins
the auxiliary tables for q-gram sequences, Dq and Qq, with the query ta-
ble Q and the data table D to retrieve the sequence pairs to be further
SELECT S1.COD AS COD1, S2.COD AS COD2
FROM Q S1, Qq S1q, D S2, Dq S2q
WHERE S1.COD = S1q.COD
AND S2.COD = S2q.COD
AND S1q.Qgram = S2q.Qgram
-- position filtering
AND sub2Position(S1.SEQ, S2.SEQ, minL, d)
-- count filtering
GROUP BY S1.COD, S2.COD
HAVING COUNT(*) >= minL + 1 - (d + 1)*q

Figure 1.2: Query for approximate sub²sequence match filtering

ble Q and the data table D to retrieve the sequence pairs to be further
analysed for approximate sub²sequence matches. The sub²Position filtering
algorithm shown in Figure 1.1 is implemented by means of a UDF
sub2Position(S1,S2,minL,d). The sub²Count filtering is implemented by
comparing the number of q-gram matches with the length of the involved
sequences.
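As a sketch of how this filtering step can be issued from application code through plain JDBC (connection handling, the use of bind parameters and the assumption that the sub2Position UDF returns 1 for TRUE are ours, not prescribed by the text), the query of Figure 1.2 could be run as follows.

import java.sql.*;
import java.util.*;

// Illustrative JDBC sketch issuing the filtering query of Figure 1.2.
class FilteringStep {

    static final String FILTER_SQL =
        "SELECT S1.COD AS COD1, S2.COD AS COD2 " +
        "FROM Q S1, Qq S1q, D S2, Dq S2q " +
        "WHERE S1.COD = S1q.COD AND S2.COD = S2q.COD " +
        "AND S1q.Qgram = S2q.Qgram " +
        "AND sub2Position(S1.SEQ, S2.SEQ, ?, ?) = 1 " +   // position filtering (UDF)
        "GROUP BY S1.COD, S2.COD " +
        "HAVING COUNT(*) >= ? + 1 - (? + 1) * ?";          // count filtering

    // Returns the candidate pairs (query COD, data COD) to be verified further.
    static List<String[]> candidatePairs(Connection conn, int minL, int d, int q)
            throws SQLException {
        List<String[]> pairs = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(FILTER_SQL)) {
            ps.setInt(1, minL);
            ps.setInt(2, d);
            ps.setInt(3, minL);
            ps.setInt(4, d);
            ps.setInt(5, q);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next())
                    pairs.add(new String[] { rs.getString("COD1"), rs.getString("COD2") });
            }
        }
        return pairs;
    }
}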
Notice that the structure of the query of Figure 1.2 requires that at least
one q-gram is shared by the two sequences. The fact that the parts of the
sequences analyzed for filtering purposes cannot benefit from extended q-
grams (as already outlined in subsection 1.1.2) also influences the size of
q-grams stored in tables Dq and Qq. Indeed, choosing a q-gram size too big
with respect to the minimum length minL and/or to the number of allowed
errors d could imply that a sequence pair shares no q-gram even if they have
some approximate matching parts.

Proposition 1.5 The propositional formula “If σ1 and σ2 have some ap-
proximate matching parts then there is at least one q-gram shared by σ1 and
σ2 ” is true if and only if q ∈ [1, ⌊minL/(d + 1)⌋].

Intuitively, in order to establish the q limits we consider the minimum length
of the part suggestion, i.e. minL, and we distribute the d allowed errors.
The worst case occurs when we have d + 1 “safe” parts with the same size:
⌊minL/(d + 1)⌋.
Once the candidate sequence pairs for approximate sub²sequence match-
ing have been selected by the filters, they must be further analysed to locate
the approximate sub2 sequence matches. In this context, we are not mainly
interested in finding a specific solution for this problem. Besides considering
the introduction of an ad-hoc algorithm, we explored the possibility to keep
using filtering techniques relying on a DBMS by reducing the problem to the
two well-known approaches of whole matching and subsequence matching.
To reduce the problem to a whole matching case means to transform the
candidate answers so that the edit distance function can be applied. More
precisely, for each pair of candidate answers (σ1 , σ2 ), we consider all possible
subsequences of both sequences having length not smaller than minL. Given
σ1 [i1 . . . j1 ] and σ2 [i2 . . . j2 ], we compute the edit distance ed(σ1 [i1 . . . j1 ],
σ2 [i2 . . . j2 ]) and return such a subsequence pair if and only if the corresponding
value is not greater than d (as required by Definition 1.3). Whole matching
can be efficiently processed by means
of the filtering techniques shown in Section 1.1.1. In this case, a mapping
into SQL expression is also possible, where the input of this query would be
the output of the one shown in Figure 1.2. Interested readers can refer to
[61, 91].
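A brute-force sketch of this reduction is given below (illustrative Java, not the optimized implementation): it enumerates all subsequence pairs of length at least minL of a candidate pair and keeps those whose edit distance, computed with the classic dynamic programming scheme of Definition 1.1, does not exceed d.

import java.util.*;

// Naive verification of a candidate pair via reduction to whole matching:
// every subsequence pair of length >= minL is checked with the edit distance.
class Sub2Verification {

    // Edit distance between two term sequences (Definition 1.1).
    static int editDistance(List<String> a, List<String> b) {
        int[][] m = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) m[i][0] = i;
        for (int j = 0; j <= b.size(); j++) m[0][j] = j;
        for (int i = 1; i <= a.size(); i++)
            for (int j = 1; j <= b.size(); j++) {
                int subst = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                m[i][j] = Math.min(Math.min(m[i - 1][j] + 1, m[i][j - 1] + 1),
                                   m[i - 1][j - 1] + subst);
            }
        return m[a.size()][b.size()];
    }

    // Prints every pair of parts (s1[i1..j1-1], s2[i2..j2-1]) of length >= minL
    // whose edit distance is at most d (Def. 1.3).
    static void approximateSub2Matches(List<String> s1, List<String> s2,
                                       int minL, int d) {
        for (int i1 = 0; i1 + minL <= s1.size(); i1++)
            for (int j1 = i1 + minL; j1 <= s1.size(); j1++)
                for (int i2 = 0; i2 + minL <= s2.size(); i2++)
                    for (int j2 = i2 + minL; j2 <= s2.size(); j2++) {
                        int dist = editDistance(s1.subList(i1, j1), s2.subList(i2, j2));
                        if (dist <= d)
                            System.out.println("s1[" + i1 + ".." + (j1 - 1) + "] ~ s2["
                                               + i2 + ".." + (j2 - 1) + "], ed = " + dist);
                    }
    }
}

Its cost is clearly prohibitive on whole collections, which is precisely why the filters of Section 1.1 are applied first and only the surviving candidate pairs are verified in this way.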
Following a similar approach, the problem can be reduced to a subse-
quence matching case. Indeed, if we consider all possible subsequences of
length greater than minL of only one of the two sequences, say σ1 , we can
implement one of the algorithms surveyed in [103]. Also in this case, filtering
techniques exist and a mapping into SQL expression is shown in [61]. The
only problem is that such algorithms do not locate the subsequence of σ2
matching σ1 [i1 . . . j1 ], but most of them can be extended in order to fulfill
this requirement.

1.3 Experimental Evaluation


In this section we present the results of an experimental evaluation of the
techniques described in the previous sections. Before doing so, in Section
1.3.1 we show the data sets used for the experiments and in Section 1.3.2 we sum-
marize some interesting implementation aspects. As reference application,
we chose the EBMT environment, where the considered sequences are sen-
tences. Therefore, in this section the terms “sentence” and “sequence” will
be used as synonyms.

1.3.1 Data sets


To effectively test our techniques, we used two real data sets:
• Collection1, taken from two versions of a software technical manual.
It consists of 1497 reference sentences corresponding to one version of
the manual and 400 query sentences corresponding to a part of the
following version. Thus, query and data sentences deal with the same
topic and have similar style.

• Collection2, a complete Translation Memory provided by LOGOS, a
worldwide leader in multilingual document translation. It contains
translations from one of their customers’ technical manuals. There
are 34550 reference sentences and 421 query sentences. Because of its
greater size, Collection2 presents a lower homogeneity with respect to
Collection1.

1.3.2 Implementation
The similarity search techniques described in the previous sections have been
implemented using Java2 and JDBC code; the underlying DBMS is Oracle
9i Enterprise Edition running on a Pentium IV 1.8 GHz Microsoft Windows
XP Pro workstation.
As for query efficiency, by issuing conventional SQL expressions we have
been able to rely on the DBMS standard optimization algorithms; we just
added some appropriate indexes on the q-gram tables to further speed up the
filtering and search processes.
As to the computation of sub2 sequence matching, we tested the three
alternatives described in Section 1.2. Since our experiments do not fo-
cus on such a computation performance, the ad-hoc algorithm we imple-
mented is a naive algorithm which performs two nested cycles, one for each possible
starting term in the two sequences, to compute the matrices of the
edit distance dynamic programming algorithm. In particular, given two sequences
σ1 and σ2, a minimum length minL, and a distance threshold
d, multi-edit-distance(σ1, σ2, minL, d) locates and presents all possible
matching parts along with their distance. This function will be presented
in more depth in Chapter 2, in the context where we first tested it, i.e. our
complete EBMT system EXTRA. In order to implement the two queries
for whole and subsequence matching we further extended the database by
introducing a new table storing all possible starting and ending positions
for each part of each sentence and by extending the q-gram tables with all
possible extended q-grams. Despite the high computational complexity of
the implemented algorithm, we noticed that its performance was better than
that of the two full-database solutions. Suffice it to say that the benefits of
using q-gram filtering were completely nullified by the enormous overhead in
generating, storing and exploiting the new q-grams: not only for their bigger
quantity (approximately 2(|S| − minL)(q − 1) more for each sentence) but
also for the complications of considering each extended q-gram in the right
place. We measured the execution time of these solutions, but already from
the first tests they turned out to be more than ten times slower than the
naive algorithm solution and so in the following discussions they will not be
further considered.

[Figure 1.3: Filtering tests. Candidate set size (log scale) produced by the Sub2Count and Sub2Pos filters, compared with the Real answer set and the full Cross Product, for the settings (MinL=3, K=1, Q=1), (MinL=6, K=1, Q=3), (MinL=4, K=2, Q=1) and (MinL=4, K=1, Q=2): (a) Collection1, (b) Collection2.]
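For illustration, a naive multi-edit-distance of this kind can be sketched in Python as follows (ours, with an absolute threshold d and reporting parts by their inclusive start/end token positions; the actual EXTRA function, which also handles relative thresholds, is presented in Chapter 2):

    from typing import List, Tuple

    def multi_edit_distance(s1: List[str], s2: List[str], min_l: int,
                            d: int) -> List[Tuple[Tuple[int, int], Tuple[int, int], int]]:
        # For every pair of starting terms, fill one edit-distance matrix over the two
        # suffixes and report every pair of parts of length >= minL within distance d.
        results = []
        for i1 in range(len(s1) - min_l + 1):
            for i2 in range(len(s2) - min_l + 1):
                t1, t2 = s1[i1:], s2[i2:]
                dp = [[0] * (len(t2) + 1) for _ in range(len(t1) + 1)]
                for r in range(len(t1) + 1):
                    dp[r][0] = r
                for c in range(len(t2) + 1):
                    dp[0][c] = c
                for r in range(1, len(t1) + 1):
                    for c in range(1, len(t2) + 1):
                        cost = 0 if t1[r - 1] == t2[c - 1] else 1
                        dp[r][c] = min(dp[r - 1][c] + 1, dp[r][c - 1] + 1, dp[r - 1][c - 1] + cost)
                # dp[l1][l2] is the distance between s1[i1:i1+l1] and s2[i2:i2+l2]
                for l1 in range(min_l, len(t1) + 1):
                    for l2 in range(min_l, len(t2) + 1):
                        if dp[l1][l2] <= d:
                            results.append(((i1, i1 + l1 - 1), (i2, i2 + l2 - 1), dp[l1][l2]))
        return results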

1.3.3 Performance
In this chapter, we are mainly interested in presenting the performance of the
new filters. The main objective of filtering techniques is to reduce the number
of candidate answer pairs. Obviously, the more effective the filters are, the closer
the size of the candidate set gets to the size of the answer set. In
order to examine how effective each filter and each combination of filters is,
we ran different queries, enabling different filters each time, and measured
the size of the candidate set with respect to the cross product of the query
sentences and data sentences. Another key aspect of filtering techniques is
their efficiency. Indeed, for a filter to be useful, the overall response time it yields should not
be greater than the processing time of the matching algorithm alone on the whole
cross product. In order to examine how efficient each combination of filters
and matching algorithm is, we ran different queries, enabling different filters
each time, and measured the response time, also considering the scalability
for the two most meaningful cases.

Effectiveness of filters
We conducted experiments on both the data sets introduced in Section 1.3.1.
Performance trends were observed under the parameters that are associated
with our problem, that is the minimum length minL, the number of allowed
errors d, and the values of q-gram size allowed by Prop. 1.5.

[Figure 1.4: Running time tests. Response times (in seconds) of the matching algorithm alone (MED) and of the filter combinations Sub2Count > MED, Sub2Pos > MED and Sub2Count + Sub2Pos > MED, for the settings (MinL=3, K=1, Q=1), (MinL=6, K=1, Q=3), (MinL=4, K=2, Q=1) and (MinL=4, K=1, Q=2): (a) Collection1, (b) Collection2.]

We started by considering the most meaningful minimum length minL which, in most cases,
is 3, and thus we allowed at most one error in order to get a significant
answer set (q must be 1). Then we analysed the effect of increasing each of
the two parameters and the subsequent value(s) of q. The most meaning-
ful experiments are shown in Figure 1.3. Obviously, the sub2 Position filter
always filters better than the sub2 Count filter since, besides counting the
number of equal terms (with the same threshold as sub2 Count when q = 1),
it also considers their positions. For this reason, we did not take into account
possible combinations of filters. In particular, the sub2 Count filter gave a candidate
set that was between 0.003% and 11% of the cross-product size, and the
sub2 Position filter filters from five to several tens of times better than the sub2 Count
filter. Notice that the sub2 Count filter filters better on Collection2 than on
Collection1 since Collection1 contains more homogeneous sentences and thus
it is more likely that they share common terms. In any case, the sub2 Count
filter works better with q values greater than 1 and preferably smaller than
4, since q = 1 does not allow for q-gram overlap. The comparison of the two
alternatives having minL = 4 shows it: setting the similarity threshold to
75% enables q = 2 and thus doubles the filtering performance. Even more
evident is the case of q = 3, where the reduction of all the data sets is more
than 99.8%.

Efficiency of filters
Figure 1.4 presents the response time of the experiments detailed in the ef-
fectiveness of filters paragraph. In particular, it shows the times required
to get the answer sentence pairs for each possible combination of filters and
the matching algorithm (denoted as MED).

[Figure 1.5: Scalability tests for Collection2. Response times (in seconds, log scale) of MED and of the filter combinations Sub2Count > MED, Sub2Pos > MED and Sub2Count + Sub2Pos > MED when 25%, 50% and 100% of the collection is considered: (a) minL = 3, k = 1, q = 1; (b) minL = 4, k = 1, q = 2.]

The assessment and evaluation


of the obtained values focus on determining the best choice for filters with
respect to the parameter values. Enabling filtering techniques reduces the
response time by a factor of at least 7 in the worst case. In particular, any combination
that includes the sub2 Count filter always improves performance.
Indeed, even if it filters less than the sub2 Position filter (see Figure 1.3), it
plays an important role by pruning out a large portion of sentence pairs and
thus leaving a small set of them on which the sub2 Position filter or directly
the matching algorithm is applied. Moreover, its execution requires a small
amount of time since it relies on the facilities offered by the DBMS. In detail,
we observed the plans generated by the DBMS showing that, using indexes
on the codes of the phrases and on the q-grams, joins are computed using
sort-merge-join algorithms. This improvement is even more evident for Col-
lection2, where response time is significantly reduced, especially for values
of q equal to 2 and 3. Moreover, turning the sub2 Position filter on provides
an even faster execution, with particular benefits for q = 1. From the four
parameter combinations, we selected one parameter combination with q = 1
and one with q > 1 as representative cases characterizing the filters' behaviour,
and we analyzed their scalability. In particular, Figure 1.5 shows the
scalability of all the possible filter combinations for q = 1 and q = 2, on
Collection 2. It shows that all combinations grow linearly with the number
of sentence pairs and that the best choice is generally to turn on all the avail-
able filters. As a final remark, note that the sub2 Count filter has a less than
logarithmic growth (11, 13 and 15 seconds at 25%, 50% and 100%, respectively). The reason why
this strong performance is not evident from the graphs is that, in the total
times shown, the linearly growing MED time prevails.

1.4 Related work


A large body of work has been devoted to the problem of matching sequences
with respect to a similarity measure. Starting from the works of Faloutsos
et al. [2, 55] addressing the problem of whole and subsequence matching for
sequences of the same length, the problem has been considered in different
fields such as text and information processing, genetics (e.g. [8]), and time
series (e.g. [3, 29, 73]). In particular, the paper [8] presents a fast algorithm
for all-against-all genetic sequence matching. The authors adopt a suffix tree as
the indexing method in order to efficiently compare all possible sequences. A
revised version of this algorithm could be adopted to implement our
multi-edit-distance.
As far as text and information processing is concerned, the work [103]
is an excellent survey on the current techniques to cope with the problem
of string matching allowing errors. The problem has been addressed by
proposing solutions based on specific algorithms (e.g. [104, 105]), indexes
(e.g. [34, 40]), and filters (e.g. [61, 57]). Such solutions are limited to the
problem of string matching and substring matching, where in the latter case
the main objective is to verify if a pattern string is contained in a given
text without necessarily locating the exact position of the occurrences. A
notable exception is the search of the Longest Common Subsequence (LCS)
[114] between two sequences which, however, is limited to the location of
the longest part by allowing only insertions and deletions. As to indexes,
customized secondary storage indexes or indexing techniques for arbitrary
metric spaces have to be supported by the DBMS in order to be useful
for techniques accessing large amounts of data stored in databases, such as
the approach we propose. Amongst the others, we found the work [61] of
particular interest. It presents some filtering techniques relying on a DBMS
for approximate string joins and it offered the starting ideas for our work.
Chapter 2

Approximate matching
for EBMT

Nowadays we are witnessing the need to translate ever increasing quantities


of texts, with an ever increasing quality. The importance of – and the expenses
for – inter-language communication and translation are soaring both in the
private and in the public sector. Consider the case of the European Union,
for instance: With the recent joining of ten countries, the number of official
languages has increased by 82% and has reached 20. The consequent
budget allocated to communication will grow from 550 to more than 800 million
Euros, while the number of pages to be translated by European institutions
was more than 2 million in 2004 and is expected to grow by 40% each year.
The same problems arise in the private sector: As enterprises grow, they may
face instances where similar content is translated several times for separate
uses, and this can result in inconsistencies and uncontrolled translation
expenses. Large teams of professional translators, supported by highly experienced
linguistic revisors, and an extensive number of freelance translators
are more frequently employed in every context, both in public and in private
enterprises; however, their expertise and skill alone are not entirely sufficient
to achieve highly effective and efficient translation performance. The
best way for translating very large quantities of documents, while ensuring
optimal translation times and costs, is to exploit the constantly growing Ma-
chine Translation (MT) possibilities.
Indeed, translation is a repetitive activity requiring a high degree of attention
and the full range of human knowledge. MT tools provide a way to
automate this process and have a clearly defined goal: Bringing better
quality and quantity in less time. Devised with the aim of preserving
and treasuring the richness and accuracy that only human translation can
achieve, Machine-Assisted (-aided) Human Translation (MAHT) and, in particular, Example-Based Machine Translation (EBMT) represent one of the most promising translation paradigms.

[Figure 2.1: Processes in an EBMT system. The suggestion search process matches a text file in the source language against the Translation Memory and returns translation suggestions in the target language to the translator; the pairs of sentences of the translated text file then feed the TM update process. The TM stores, for each example, the original sentence in the source language and the translated sentence in the target language, e.g. “Now press the right mouse button.” / “Ora premere il pulsante destro del mouse.”]
Example-based translation is essentially translation by analogy. EBMT
systems can speed up a translator’s work significantly: They analyse the
technical issues of the different document formats, they search and manage
terminologies, glossaries and reference texts, generate translation suggestions
to speed up the translation process itself and ultimately help translators in
achieving a good level of terminological uniformity in the result. But how
do EBMT systems basically work and attempt to achieve all this? All past
translations are stored and organized in a large text repository known as
Translation Memory (TM) (see Figure 2.1). Such TM is organized in a
series of translation examples, which include source language (from which
one is translating) and target language text, together with possible additional
information. Each time a new document is translated, the TM is updated
with the set of sentences in the source language and their corresponding
translations in the target language; such process is denoted as TM update.
Then, with the suggestion search process, the system uses those examples
to help to translate other, similar source-language sentences into the target
language. Due to the high complexity and extent of languages, in most cases
it is quite unlikely that a translation memory stores the exact translation
of a given sentence. An EBMT system consequently proves to be a useful
translator assistant only if most of the suggestions provided are based on
similarity searching rather than exact matching.
In this chapter we show how the approximate matching techniques de-
scribed in the previous chapter can be successfully applied to the EBMT


scenario. To this end, we present EXTRA (EXample-based TRanslation As-
sistant), the EBMT system we have developed over the last few years [91, 92]
to support the translation of texts written in Western languages, and propose
an in-depth analysis of its features, together with an extensive evaluation of
the results that it can achieve. EXTRA is designed to be completely modu-
lar, bringing high flexibility in its design and complete functionalities to the
translator. The heart of EXTRA is its innovative suggestion search engine,
whose foundation is built on a specialization of the approximate matching
techniques discussed in Chapter 1. Instead of relying on Artificial Intelli-
gence, we founded EXTRA’s search algorithms on advanced Information Re-
trieval techniques, ensuring a good trade-off between the effectiveness of the
results and the efficiency of the processes involved. EXTRA is able to propose
effective suggestions through the cascade execution of two processes. First,
it performs analyses of the available bilingual texts ranging from syntactic
text operations to advanced semantic word sense disambiguation techniques.
Then, it applies a metric which is the basis for the suggestion-ranking and
whose properties are exploited to take full advantage of the TM content and
to speed up the retrieval process. Indeed, the system is able to efficiently
search large amounts of bilingual texts stored in the translation memory for
suggestions, also in complex search settings. Furthermore, EXTRA does not
use external knowledge requiring the intervention of users, such as anno-
tated corpora, and is completely customizable and portable as it has been
implemented on top of a standard DBMS.
As to experimental evaluation, the objective and detailed evaluation of
the results provided by EBMT systems has always been a complex matter,
particularly for evaluation of the effectiveness. We also provide a thorough
evaluation of both the effectiveness and the efficiency of EBMT systems and
a comparison between the results offered by our system and the major com-
mercially available systems. Furthermore, we dedicate specific sections to the
simulation of ad-hoc statistical, process-oriented, discrete-event models for
quantifying the benefits offered by EXTRA assisted translation over manual
translation, and to the specific analysis of the effectiveness of the semantic
analysis of the texts.
The work presented in this chapter has been developed as a joint effort
with Logos Group. Logos is a worldwide leader in multilingual technical
document translation. The Logos solution includes professional in-country
specialist localization teams, centralized quality control, software engineering
and dedicated project management and translation memories.
The rest of the chapter is organized as follows: In Section 2.1 we present
an overview of research carried out in the EBMT and related fields. In
Section 2.2 we present our approach and discuss the reasons behind it. In
particular, we present an overview of the processes involved, i.e. document
analysis (analysed in detail in Section 2.3) and suggestion search, to which
Section 2.4 is dedicated. Finally, Section 2.5 presents a detailed discussion
on the results of the experiments performed.

2.1 Research in the EBMT field


Since the EBMT paradigm is founded on predicting which sentences in the
translation memory are useful and which are not, one main problem regarding
an EBMT system is the suggestion search activity. The key to the working of
EBMT systems is to support techniques allowing the system to distinguish
what text segment is similar to the submitted one and how similar it is. For
this purpose, the topics that have drawn major attention from researchers
over the last two decades are: The conception of different ways to represent
the examples in the TM, the definition of metrics and scoring functions, and
of the algorithms for searching the TM for useful text segments. Naturally,
Machine Translation is the research area that has produced most studies on
these aspects and to which most of the researchers proposing EBMT tech-
niques have turned. In this context, we found the papers [124, 51] to be good
surveys. However, since EBMT systems basically manage text information,
the reader should not be surprised to discover that part of the work that has
been done in the area of Information Retrieval (IR) can indeed prove valu-
able for reviewing the EBMT issues from a new and interesting perspective
and devising new techniques to these aims. The dissertation does not deal
with statistical approaches, as they are quite peculiar and work quite differently
from our system. While such techniques relieve EBMT systems of
the problem of evaluating the goodness (similarity) of the translations, they
require highly extensive monolingual and/or bilingual corpora for the un-
avoidable training of their models, which are not always available, and they
are strongly dependent on the particular language and subject that they were
trained for.

2.1.1 Logical representation of examples


First of all, the available approaches differ in the way they logically represent
and store the available examples in the translation memory; such a choice
is particularly critical, since what is stored in the TM is also likely to be
exploited by the similarity metric in order to determine the text similarity.
Good points of view to distinguish representations of text fragments are the
employed structure(s), such as sequences or trees, and the amount and the
kind of linguistic processing performed [41]. For instance the approach in [24]
stores examples as strings of words together with some alignment and infor-
mation on equivalence classes (numbers, weekdays, etc.). Other approaches
[42, 41] are more language- and knowledge-dependent and perform some text
operations (e.g. POS tagging and stemming) on the sentences in order to
store more structural information about them. Finally, approaches such as
[118] perform advanced parsing on the text and choose to represent the full
complexity of its syntax by means of complex syntactic tree structures. It
should be noticed that the more complex an approach and the consequent logical
representation are, the more information can be exploited in the search for
useful material in the translation memory; however, such complexity is paid for
in terms of huge computational costs for the creation, storage and matching/retrieval
algorithms (see next paragraph). Furthermore, a strong advantage
of EBMT should be the ability to develop systems automatically despite the lack
of external knowledge resources [124]; instead, some of the cited approaches
assume the availability of particular knowledge strictly related to the language
involved, such as the equivalence classes.

2.1.2 Similarity metrics and scoring functions


The example(s) an EBMT system determines to be equivalent (or at least
similar enough) to the text to be translated varies according to the approach
adopted for the similarity metrics. The ones reported in the literature can be
characterised depending on the text patterns that they are applied to: They
can be character-, word- or structure-based. Character-based metrics are not
particularly effective for most Western languages, providing only a superfi-
cial way to compare text fragments. As to word-based metrics, they compare
individual words and define a similarity of the fragment by combining the
words’ similarities. Nagao [102], the pioneer of EBMT systems, suggested an
early word-based metric that attempts to emulate human translation practice
in recognizing the similarity of a new source language sentence to an example
by selecting identical phrases available in the translation memory except for
a similar content word. The closeness of the match would be determined
by a semantic distance between the two content words as measured by some
metric based on a thesaurus or ontology. Brown [24] proposes a matching
algorithm searching the example base for contiguous occurrences of succes-
sive words by means of an inverted index. The partial match is performed
by allowing equivalence classes. In [125] the similarity between the words is
determined by means of a thesaurus abstraction hierarchy. Further, in word-
based metrics, the exploitation of the information provided by the order of
the words can be fundamental: Word order-sensitive approaches are demon-


strated to generally outperform bag-of-words methods [10]. For instance,
the works [41] and [42] perform advanced hybrid word and structure based
similarity matching techniques. Most of the similarity metrics proposed in
literature are also exploited in the fundamental suggestion-ranking operation
[42, 118]. In [24] on the other hand a rough ranking mechanism based on the
freshness of the translation memory suggestion is proposed, while in [41] a
more subtle ranking mechanism is exploited, taking advantage of the defined
similarity metric and of adaptability scores between the source and target
language of a particular translation suggestion.

2.1.3 Efficiency and flexibility of the search process


Another fundamental aspect of EBMT systems, together with the similarity
metric that they are based on, is the flexibility they offer in order to extract
useful translation parts from the translation memory, along with the effi-
ciency of the proposed search algorithms. Indeed, in a large number of the
proposed research systems the obvious and intuitive “grain-size” for examples
seems to be the sentence [124], i.e. the entire sentence is the smallest unit of
search. For instance, in [102, 42] the assumption is that the aim of the match-
ing process is to find a single best-matching example. Such low flexibility is
a big issue: Consider, for instance, the frequent cases in which a translator
decides to merge two sentences into a single one, or when the translation
memory contains no whole suggestions. In [41] a different approach is taken,
i.e. both the examples and the query are first decomposed into “chunks”,
then the matching function makes a collection of matched chunk material;
however such subdivision in chunks is entirely static and determined before
the query is submitted, thus the approach still lacks the flexibility needed by
translators. Furthermore, most of the EBMT researchers do not specifically
propose algorithms and data structures in order to make the search tech-
niques more efficient. While research on part matching is not particularly
encouraging in the Machine Translation field, much has been done and many
ideas can be taken from the approximate string matching field and adapted to
example-based suggestion search. For a discussion of approximate matching
related work refer to Section 1.4.

2.1.4 Evaluation of EBMT systems


The evaluation of the effectiveness of an EBMT system is not particularly
straightforward. In the Information Retrieval field, evaluation is often achieved
by computing recall and precision figures over well known test collections.
However, such “reference” collections have never been defined in the Ma-
chine Translation field and, more generally, there are no universally accepted
ways to automatically and objectively evaluate MT systems. In [97], the
authors propose to measure the closeness between the output of an
MT system and a “reference” translation in proportion to
the number of matching words. The “BLEU” measure [108] enhances this
technique by partially considering the order of the words: It counts
the number of matching q-grams. The above techniques require the existence of
a complete set of reference (hand made) and automatic (machine generated)
translations to be applicable. Therefore, they are clearly not completely ap-
plicable to EBMT systems, which, by definition, do not generally provide a
complete translation as their output.

2.1.5 Some notes about commercial systems


A discussion of related work on EBMT would not be complete without a final mention of
commercially available EBMT systems, as commercial systems take an im-
portant role in the context of computer-aided translation. The simplest form
of EBMT software able to search for and retrieve information from a
bilingual parallel corpus is represented by bilingual concordancers, such as ParaConc [109].
Such tools are very straightforward but are generally designed just to search
for words or very short phrases in an exact way [18]. A popular set of more
advanced products include translation memory software such as Trados [133]
and Déjà Vu [46]. They offer some interesting applications for document
management, such as semi-automatic processes for document alignment, but
they basically work in the same manner and show the drawbacks discussed
previously. In particular, linguistic and semantic analysis such as stemming
or word sense disambiguation is generally lacking and the employed similar-
ities between the text fragments appear to be very simple character-based
ones, which can lead to wrong and only superficially similar suggestions.
In Section 2.5.5 we will delve further into the performance and features of
commercial EBMT systems, comparing them with the ones of EXTRA.

2.2 The suggestion search process in EXTRA


In this section, we present the suggestion search activity devised for EXTRA
(Example-based TRanslation Assistant). It attempts to overcome some of
the above mentioned deficiencies of existing approaches through an exten-
sive use of innovative ideas conceived in the IR field. In particular, the
contribution of EXTRA to the state of the art in the way it retrieves useful
suggestions can be summarized in the following items:

• it relies on a polymorphic, language-independent, effective and rigorous metric which is the basis for the suggestion-ranking and whose properties are exploited to speed up the retrieval process;

• it is able to efficiently search large amounts of bilingual texts stored in the translation memory for suggestions;

• it fully exploits the TM content also in complex settings by supporting different grain-sizes of search;

• it is versatile because it allows the combination of different ways to logically represent the examples with different ways to search them;

• it does not use external knowledge requiring the intervention of users;

• it is customizable and portable, as it has been implemented on top of a standard DBMS.

2.2.1 Definition of the metric


As for all the EBMT systems relying on similarity matching, for EXTRA too
the similarity metric plays a fundamental role in the selection of the examples
in the TM that are useful and those that are not and it obviously influences
the logical representation of the examples. Moreover, it is well known that
the definition of a measure quantifying the similarity between objects is a
complex task, heavily dependent on the information retrieval needs, and that
the more it meets the human intuition of what constitutes a pair of similar objects,
the more effective it is. The first step towards the definition of a similarity
measure is the identification of the properties that characterize the objects
to be compared and that can influence the similarity judgement. Thinking
about the retrieval task in an EBMT system designed to support the transla-
tion of texts in Western languages, we cannot forget that the translator can
submit text in different languages and that the retrieved suggestions should
help translators in speeding-up the translation process and achieving a good
level of terminological uniformity in the result. The adopted similarity met-
ric should consequently be independent from the translation context, such
as involved languages (provided that they are Western languages) and sub-
jects. Moreover, any suggestion allows the translator to save time only when
editing the suggestion takes less time than generating a translation from
scratch. Editing the translation means adding text, deleting text, swapping
text, modifying the text and so on in the retrieved examples. Intuitively, the
more an example maintains the same words in the same positions as the
unit of text to be translated, the more its translation is a good suggestion.

[Figure 2.2: The suggestion search and document analysis processes. A new document to be translated and previously translated documents both go through document analysis; the resulting queries (sentqS, σq) and translations (sentS, σ, sentT) feed, respectively, the suggestion search process and the translation memory, and a ranked list of translation suggestions is returned for the new document.]
For these reasons, we argue that the classical IR models based on bag of
words approaches are not suitable for our purposes, as they do not take into
account any notion of word order and position (e.g. both the sentences “The
cat eats the dog” and “The dog eats the cat” would be represented by the
set {dog, eat, cat}). Instead, the examples can be logically represented as
sequences of representative items named tokens 1 . The knowledge concern-
ing the position of the tokens in the example can thus be exploited through
the edit distance [103], the metric we presented in Chapter 1 for general
sequence matching and which constitutes the foundation of the suggestion
search process in EXTRA.
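The point can be made concrete with a small Python sketch (ours; a standard dynamic-programming edit distance over raw token sequences, not the actual EXTRA implementation):

    def edit_distance(a, b):
        # Dynamic-programming edit distance over two sequences of tokens
        prev = list(range(len(b) + 1))
        for i in range(1, len(a) + 1):
            curr = [i] + [0] * len(b)
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
            prev = curr
        return prev[len(b)]

    s1 = "the cat eats the dog".split()
    s2 = "the dog eats the cat".split()
    print(set(s1) == set(s2))     # True: the two sentences coincide under a bag-of-words view
    print(edit_distance(s1, s2))  # 2: the swapped positions of "cat" and "dog" are detected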

2.2.2 The involved processes


Figure 2.2 depicts the flow of a document. The translation memory of EX-
TRA contains a collection of translations. Each translation t is a tuple
(sentS , σ, sentT ) where sentS is the sentence2 in the source language, σ is
the sequence of tokens corresponding to the logical representation of sentS
and sentT is the translation of sentS in the target language.
1 Each token is a group of characters with collective significance.
2 In the following we will use the terms translation and example interchangeably.

Each document submitted to the system goes through a document analysis
process, either because it has to be added into the translation memory or
because the translator searches useful suggestions for it. Document analysis
is the process of converting the stream of characters representing an input
document into a set of sentences and generating a logical view of them (i.e.
sequences of tokens). Document analysis is a fundamental step towards the
suggestion search process, the importance of which cannot be ignored. By setting
out the internal representation of the sentences to be compared, it ob-
viously influences the effectiveness and the efficiency of the search process.
When a translated document is submitted to the system, it is transformed
through the document analysis into a set of translations, each represented
by the tuple (sentS , σ, sentT ), to be added to the translation memory. The
same process is applied to each document to be translated when it is submit-
ted to the system: it is transformed into a collection of queries, where each
query q is represented by the tuple (sentqS , σ q ) where sentqS is the sentence
for which the translator needs suggestions in the target language and σ q the
corresponding sequence of tokens. The suggestion search process takes the
collection of queries as input {(sentqS , σ q )}, compares their logical representa-
tion with that of the examples {(sentS , σ, sentT )} through the edit distance
and some measure of relevance between the queries and the examples is ob-
tained. Finally, thanks to the alignment processes, EXTRA is able to switch
from the source to the target language in order to present to the translator
a ranked list of suggestions.
Notice that in the design of the EXTRA system we modularized the
suggestion search activity by providing two autonomous processes: The doc-
ument analysis always produces sequences of tokens, independently from the
way they are generated, while the suggestion search takes a query sequence
as input and produces a ranked list of suggestions for it, independently from
the properties of the involved tokens with the exception of their positions.
In this way, we introduce a good level of versatility in the search process by
allowing the testing of the impact of different logical representations in the
search process. For the same reason, the management of the translation con-
text and, in particular, of the involved languages, is limited to the document
analysis phase where the text analysis procedure takes place.
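The interface between the two processes can be summarised by the following minimal data structures (a Python sketch, ours; the field names simply mirror the tuples introduced above):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Translation:
        # A translation memory entry (sentS, sigma, sentT)
        sent_s: str          # sentence in the source language
        sigma: List[str]     # logical representation: sequence of tokens
        sent_t: str          # translation in the target language

    @dataclass
    class Query:
        # A sentence to be translated and its logical representation (sentS^q, sigma^q)
        sent_s: str
        sigma: List[str]

Document analysis produces these objects, while suggestion search only relies on the sigma fields and on the positions of their tokens.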

2.3 Document analysis


The main task of the document analysis process is to transform each docu-
ment into a set of sentences and to transform each sentence sentS or sentqS
into its logical representation σ or σ q , respectively. As we mentioned in Sec-
tion 2.1, the logical representation of a sentence can be extracted by means of
text operations. In our approach, we considered various text operations for
the document analysis process. Starting from their original formats, the doc-
uments submitted to EXTRA first go through a “chunking” process where


they are broken into a set of sentences. Then, each sentence has to be trans-
formed into a sequence of tokens as required by the edit distance. There are
various alternatives, each of which is able to affect the search process as it
determines the internal representation of the managed corpora. The most
obvious way to do it is to directly compare the sentences as they are. This
type of approach is not resilient to small changes in the sentences, such as
the modification of an article, a stem change to a term, or the addition of
a comma. For this reason, we considered gradually more complex
alternatives to pre-process a sentence and produce its logical representation,
the one on which the similarity search algorithms can be applied:

1. simple punctuation removal;

2. word stemming (and stopword removal);

3. word sense disambiguation (WSD).

Such options are incremental, i.e. stemming also includes punctuation re-
moval and WSD also includes stemming, and produce different logical rep-
resentations with an increasing level of resilience. By switching between the
first two options, the logical representation of each sentence ranges from the
sentence itself with the exception of the punctuation to the sequence of its
most meaningful terms with the exception of their inflections. In a sug-
gestion search perspective, notice that in the former representation all the
words in the sentence are equally important while the latter disregards com-
mon words such as articles and so on while focusing on the most meaningful
terms. On the other hand, translators usually find it easier to translate com-
mon words rather than far-fetched terms, thus the latter representation helps
to find more useful suggestions than the former representation as only the
differences among the fundamental terms are important. Both the above
mentioned representations disregard semantic aspects, as they are the prod-
uct of a syntactic analysis where the meanings of terms are not taken into
account. The syntactic representation of a sentence is thus a sequence of
tokens, i.e. the stemmed version of its surviving meaningful terms (see the upper
block of Figure 2.3).
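A toy sketch of such a syntactic analysis is shown below (ours; a hand-made stopword list and a crude suffix-stripping stemmer stand in for the real text operations):

    import re
    from typing import List

    STOPWORDS = {"the", "is", "a", "an", "of", "to", "and"}   # toy list, for illustration only

    def toy_stem(word: str) -> str:
        # Crude suffix stripping standing in for a real stemmer
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def syntactic_analysis(sentence: str) -> List[str]:
        # Punctuation removal, stopword removal and stemming, in this order
        words = re.findall(r"[a-zA-Z]+", sentence.lower())
        return [toy_stem(w) for w in words if w not in STOPWORDS]

    print(syntactic_analysis("The white cat is hunting the mouse."))
    # ['white', 'cat', 'hunt', 'mouse']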
Finally, with the Word Sense Disambiguation option, we perform a se-
mantic analysis to disambiguate the main terms in the sequence, thus en-
abling the comparison between meanings instead of terms. For instance,
consider the technical context of a computer graphic software manual: In
such a context, the author could refer to the artistic creation of the user
as both “image” or “picture”, which should be considered as two equivalent
words.

[Figure 2.3: The syntactic and semantic analysis and the actual WSD phase. The original sentence “The white cat is hunting the mouse.” is stemmed into the sequence of tokens “white cat hunt mouse”; POS tagging then marks nouns and verbs, whose lists are extracted and disambiguated, producing the final sequence of tokens “white *N-1788952* *V-903354* *N-1993014*”.]

On the other hand, he could describe the picture of a “mouse” from
Cinderella, obviously meant as an animal, which should not be mistaken for
the “mouse”, the electronic device, used to digitally paint it. By employing
WSD, different terms which, in the context of the sequences they belong
to, have the same sense, can be judged by the comparison scheme to be
the same. On the other hand, terms which, used in different contexts, have
different meanings can be considered as distinct tokens. By reconsidering
the previous example, “image” and “picture” would be considered to be the
same, while the two instances of “mouse” would be considered different. The
WSD techniques we designed can be categorized as relational information,
knowledge-driven ones, exploiting one of the best-known lexical resources for
the English language: WordNet [100, 101]. Specifically, we devised two com-
pletely automatic techniques for the disambiguation of nouns and verbs that
ensure good effectiveness while not requiring any additional external data
or corpora besides WordNet and an English language probabilistic Parts-Of-
Speech (POS) tagger. Note that we disregarded categories such as adjectives
and adverbs, which are usually less ambiguous and assume a marginal role
in the meaning of the sentence.
The syntactic analysis precedes the actual WSD phase which receives
a stemmed sentence as input and produces a disambiguated version of it,
where the WordNet-disambiguated nouns and verbs are substituted with


the codes of their meanings (named synsets in Wordnet). In Figure 2.3
we show the different steps of the WSD elaboration and an example of the
produced output. The first operation we perform on the stemmed sentence
is POS tagging, which associates each of the terms in the sentence with a tag
corresponding to a part of speech. The tags shown in the figure are exactly
the ones produced by our tagger, which uses a variant of the common Brown-
Penn tagset [96]. In particular, we need to identify nouns (tags starting with
N) and verbs (tags starting with V) and create stemmed lists N and V of each
of these two categories, complete with pointers to the positions of the words
in the sentence. Such lists are the input for the WSD techniques themselves,
which attempt to associate each entry with the code of the WordNet synset
that best matches its sense. To enhance the readability of the output we add
a “N-” (noun) or “V-” (verb) prefix to each of the codes. The final step is
to assemble the partial results into the sequence of tokens corresponding to the
surviving meaningful terms. Not all of the stemmed terms are substituted with
a code: In the example, the word “white” is unchanged because it is neither
a noun nor a verb. Notice that the same would happen for words tagged
as nouns or verbs but not present in WordNet. Refer to Appendix A.1 for
a detailed description of our WSD techniques.
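Schematically, the final substitution step can be sketched as follows (ours; a hypothetical disambiguate helper and a toy sense inventory stand in for WordNet and for the techniques of Appendix A.1, and the synset codes are the ones of the example in Figure 2.3):

    from typing import Dict, List, Optional, Tuple

    # Toy sense inventory standing in for WordNet and for the actual WSD techniques
    TOY_SENSES: Dict[Tuple[str, str], str] = {
        ("cat", "N"): "N-1788952",
        ("mouse", "N"): "N-1993014",
        ("hunt", "V"): "V-903354",
    }

    def disambiguate(lemma: str, pos: str, context: List[str]) -> Optional[str]:
        # Hypothetical WSD step: return the synset code chosen for this lemma in this context
        return TOY_SENSES.get((lemma, pos))

    def wsd_substitution(tagged: List[Tuple[str, str]]) -> List[str]:
        # Replace disambiguated nouns (N) and verbs (V) with their synset codes,
        # leaving all the other tokens unchanged
        context = [lemma for lemma, _ in tagged]
        out = []
        for lemma, pos in tagged:
            code = disambiguate(lemma, pos, context) if pos in ("N", "V") else None
            out.append(f"*{code}*" if code else lemma)
        return out

    print(wsd_substitution([("white", "JJ"), ("cat", "N"), ("hunt", "V"), ("mouse", "N")]))
    # ['white', '*N-1788952*', '*V-903354*', '*N-1993014*']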

2.4 The suggestion search process


Given a document to be translated, the suggestion search process accesses
the translation memory by comparing the submitted text with past trans-
lations and returns a ranked list of useful suggestions. Each translator has
her/his own way to achieve her/his objective. EXTRA tries to meet the skills
and the work habits of the translators in the best way possible by putting
two similarity matching approaches at their disposal, which can be freely
combined in order to obtain the kind of suggestions they consider to be the
most useful ones (see Section 2.4.3 for a discussion). The approximate whole
matching searches the TM for such sentences that are similar enough to the
query sentence, whereas the approximate sub2 matching extends the above
approach by comparing parts of the query with parts of the available sen-
tences. To give an idea of the usefulness of the two approaches, let us show
the work habits of two possible types of translators, confident and “casual”,
and their plausible interaction with the system. Confident translators might
prefer to translate from scratch such sentences for which no very close sug-
gestions are available. Indeed, they usually know the material they are going
to translate and thus their main objective is simply to carry out the assigned
job as soon as possible. The approximate whole matching probably repre-


sents the main source of suggestions for this kind of translators. Different
is the case of more “casual” translators who look for such suggestions al-
lowing them to obtain an acceptable coherence in the adopted terminology.
In this case, they might be willing to edit and combine suggestions covering
part of the submitted corpora, provided that they are of good quality. As
the translation memory often does not contain good quality whole matches,
the approximate sub2 matching might prove to be particularly useful in this
situation. For a detailed description of the foundation of the employed ap-
proximate matching techniques see Section 1.1. In the following, we will only
describe how the general techniques have been modified and customized to
the EBMT scenario in EXTRA.

2.4.1 Approximate whole matching


Given a collection of queries representing a document to be translated and
a relative distance threshold, the approximate whole matching efficiently re-
trieves the most similar sentences available in the translation memory and
returns their target-language counterparts together with the likelihood val-
ues. They correspond to translation memory sentences whose sequences are
within an absolute distance threshold, derived from the specified relative one,
of the query sequence.
More precisely, the approximate whole matching works on the trans-
lation memory {(sentS , σ, sentT )} and on the document to be translated
{(sentqS , σ q )} as specified in the following definition.

Definition 2.1 (Approximate whole matching in EXTRA) Given a document to be translated corresponding to a collection Q = {(sentqS, σq)} of queries, a collection TM = {(sentS, σ, sentT)} of translations, and a relative distance threshold d, find all translations t ∈ TM such that, for some query q ∈ Q, ed(σq, σ) ≤ round(d ∗ |σq|), and return (sentT, ed(σq, σ)).

As to the distance threshold, the maximum number of allowed errors


is defined as a user-specified percentage d of the query sequence length in-
stead of an absolute number that is usually specified in string processing [7].
Indeed, in the context of EBMT systems, searching for sentences within a
number of errors that is independent of the length of the query could be of
little meaning. For example, searching for sentences within 3 errors given a
query of length 6 is totally different to searching sentences within the same
number of errors given a query of length 20. For this reason we consider
d as the percentage of admitted errors with respect to the sentences to be
translated. For example, setting d = 0.3 would allow 3 errors w.r.t. a query
of length 10, 6 errors w.r.t. a query of length 20, and so on.
Efficiency in retrieving the most similar sequences available in the trans-
lation memory is once again ensured by exploiting filtering techniques (see
Section 1.1).
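A minimal sketch of Definition 2.1 follows (ours; a linear scan of the translation memory without the filtering step, shown only to make the relative threshold concrete):

    from typing import Iterator, List, Tuple

    def edit_distance(a: List[str], b: List[str]) -> int:
        # Standard dynamic-programming edit distance over token sequences
        prev = list(range(len(b) + 1))
        for i in range(1, len(a) + 1):
            curr = [i] + [0] * len(b)
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
            prev = curr
        return prev[len(b)]

    def whole_matching(query_sigma: List[str],
                       tm: List[Tuple[str, List[str], str]],
                       d: float) -> Iterator[Tuple[str, int]]:
        # tm entries are (sentS, sigma, sentT); d is the relative distance threshold
        k = round(d * len(query_sigma))        # absolute number of allowed errors
        for _sent_s, sigma, sent_t in tm:
            dist = edit_distance(query_sigma, sigma)
            if dist <= k:
                yield sent_t, dist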

2.4.2 Approximate sub2 matching


Approximate whole matching is not the only search mechanism provided
by EXTRA. Experience with several language pairs has shown that pro-
ducing an EBMT system that provides reasonable translation coverage of
unrestricted texts requires a large number of pre-translated texts [24]. Con-
sequently, translators may submit sentences for which no whole match exists.
Nevertheless, the sentences stored in the translation memory could still be partially
useful. Thus, to fully exploit the translation memory potentialities, EX-
TRA exploits approximate sub2 matching, the powerful similarity matching
technique introduced in Section 1.1. It goes beyond “standard” whole and
subsequence matching, as it attempts to match any part of the sequences in
the translation memory against any part of the query sequences. Although
complex, this kind of search enables the detection of similarities that could
otherwise be unidentified.
Given a translation t ∈ T M , represented by a tuple (sentS , σ, sentT ), we
denote with σ[i . . . j] the subsequence of σ ranging from its i-th to its j-th
token and with sentT [iσ . . . j σ ] the part of the sentence in the target language
corresponding to σ[i . . . j].

Definition 2.2 (Approximate sub2 matching in EXTRA) Given a document to be translated corresponding to a collection Q = {(sentqS, σq)} of queries, a collection TM = {(sentS, σ, sentT)} of translations, a relative distance threshold d and a minimum length minL, find all subsequences σ[i . . . j] of the translations t = (sentS, σ, sentT) ∈ TM such that, for some subsequence σq[iq . . . jq] of a query q ∈ Q, ed(σq[iq . . . jq], σ[i . . . j]) ≤ round(d ∗ (jq − iq + 1)), (jq − iq + 1) ≥ minL and (j − i + 1) ≥ minL. Then, return (sentT[iσ . . . jσ], ed(σq[iq . . . jq], σ[i . . . j])).

Performing approximate sub2 matching generates a number of new and


challenging issues. Firstly, unlike whole matching, the retrieval does not
concern whole sentences, but parts of them. In particular, the result of the
approximate sub2 matching consists of those parts of the target sentence sentT corresponding
to the subsequences satisfying the specified distance threshold.
In order to correctly resolve this link between the sequence and the target
language sentence, i.e. identify the right part sentT [iσ . . . j σ ], a special align-
ment technique is needed where the sentence in the source language acts
as “bridge” between the sequence and the sentence in the target language.
Such problem, which we call word alignment, has been addressed in [92].
Furthermore, in order to keep a good efficiency, we exploit the new filtering
techniques described in Section 1.1. In EXTRA, such filters are adapted
to the context of a relative distance threshold but are based on the same
properties which have been previously described.

2.4.3 Meeting suggestion search and ranking with translator needs
Sub2 sequence matching extends whole matching, i.e. sub2 sequence matching
is able to identify a whole match too. For this reason, there is no point in
applying the two search techniques to the same document to be translated.
On the other hand, some translators may prefer to receive only suggestions
for the whole query sentences, and, in this case, only the intervention of the
whole matching process is required, which is easier and takes less time than
sub2 sequence matching. As a matter of fact, most existing commercial sys-
tems only support this kind of search process. For some translators, however,
such kinds of suggestions may not be sufficient for their needs, consequently
they can rely on sub2 sequence matching. In particular, considering that
translators may submit sentences for which no whole match exists but the
sentences stored in the translation memory could be partially useful, a pos-
sible way to combine the two matching approaches is to search for matching
parts only for the query sentences for which no whole match exists.
Another interesting issue related to the search process is the ranking of
the suggestions. Translators who submit a document to an EBMT system for
suggestion search expect a list of suggestions in the target language ranked
in a meaningful order, for each of the sentences they must translate. Indeed,
in most cases, they do not want to waste time and, consequently, read only
the top suggestions, where the likelihood values help in quickly evaluating the
level of similarity with the material submitted. The two matching techniques
being the basis of the suggestion search process provide suggestions for the
translation of a given document together with the edit distance values (see
Defs. 2.1 and 2.2). Thus the most straightforward way to rank the retrieved
suggestions for each sentence in the submitted document is to sort them by
increasing edit distance values. In this way, the suggestions appearing at
the top of this order are the ones with the lower values of edit distance and
thus are the ones for which it should take less time to adapt them to the
actual translation. This is certainly true for the results of whole matching
as they are suggestions for the whole sentence to be translated. Different is
the case of sub2 sequence matching that suggests parts of the TM sentences
that match parts of the query sentence. The suggestions concern parts of
the query sentence starting in different positions and have variable lengths.
Thus, one possible way to prepare the suggestions for presentation is to group
them by starting point, i.e. considering the position of the first word of the
query segment for which a translation is proposed. Moreover, together with
the edit distance value, the length is also a factor that can affect the time
required to complete the actual translation. Indeed, in most cases, it takes
longer to use a large number of short suggestions than using a smaller number
of longer suggestions, even when the longer are less similar to the involved
parts than the short ones. Thus, for each starting point the suggestions can
be ordered by length and then by edit distance values. As for each starting
point the longest suggestions contain other suggestions, another possibility
is to output only such suggestions and to sort them on the basis of the edit
distance values. Contained matches are in fact usually not useful, since they
would typically only slow the work of the translator down while giving no
additional hint. In this case, we avoid computing the unnecessary suggestions
by exploiting an ad-hoc algorithm. An algorithm that implements this idea
is shown in Appendix A.2. The impact of document analysis, of the type of
suggestions, and of the ranking on the translation process has been subject
of several experiments. A detailed account of the results that we obtained is
presented in Section 2.5.
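The last ranking policy can be sketched as follows (ours, not the algorithm of Appendix A.2; a suggestion is modelled here as a (query start, query end, edit distance, target text) tuple and the sample data is purely illustrative):

    from collections import defaultdict
    from typing import Dict, List, Tuple

    Suggestion = Tuple[int, int, int, str]   # (query start, query end, edit distance, target text)

    def rank_suggestions(suggestions: List[Suggestion]) -> Dict[int, List[Suggestion]]:
        # Group the suggestions by the starting position of the matched query segment
        groups: Dict[int, List[Suggestion]] = defaultdict(list)
        for s in suggestions:
            groups[s[0]].append(s)
        ranked: Dict[int, List[Suggestion]] = {}
        for start, group in groups.items():
            # Keep only the longest suggestions of each group (contained matches give
            # no additional hint), then sort the survivors by increasing edit distance
            max_len = max(s[1] - s[0] + 1 for s in group)
            longest = [s for s in group if s[1] - s[0] + 1 == max_len]
            ranked[start] = sorted(longest, key=lambda s: s[2])
        return ranked

    sample = [(0, 5, 2, "collegamento elettrico, montare il piano cottura dall'alto"),
              (0, 3, 1, "collegamento elettrico"),
              (7, 10, 0, "come indicato")]
    print(rank_suggestions(sample))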

2.5 Experimental Evaluation


In this section we present the results of an experimental evaluation of the
techniques described in the previous sections.
System performance was tested both in terms of the effectiveness and the
efficiency of the proposed search techniques. As to effectiveness, we analysed
the quality of the assistance offered by EXTRA by means of some new metrics
we introduced for EBMT systems. As far as efficiency is concerned, we
experimentally observed performance trends under different settings, both
for document analysis and for suggestion search. Unless explicitly specified,
all experiments were carried out by setting the relative distance thresholds d
and dSub to 0.2.
For additional efficiency tests of the approximate matching algorithms
underlying suggestion search refer to Section 1.3.
[Figure 2.4: The role of the DBMS in pretranslation. The translation memory resides in a permanent DB table; when a new document to be translated is submitted, document analysis fills an on-the-fly DB table, and whole sentences and parts of sentences feed the whole matching and sub2 matching processes, respectively.]

2.5.1 Implementation notes


As far as the design of the suggestion search process is concerned, the whole
and sub2 matching algorithms and the corresponding filtering techniques were
implemented on top of a standard DBMS by mapping them into vanilla SQL
expressions (by following the guidelines proposed in Section 1.2). Designing
a solution that fits into a DBMS context allowed us to efficiently manage
the large bilingual corpora of the translation memory and ensure total com-
patibility with other applications. The immediate practical benefit of our
techniques is that approximate search in translation memory can be widely
and efficiently deployed without changes to the underlying database.
Figure 2.4 traces the broad outline of the interaction with the DBMS
where the examples are recorded into as many tables and an auxiliary table
is created on-the-fly to store the queries whenever the translator submits a
document. EXTRA has been implemented using Java JDK 1.4.2 and JDBC
code; the underlying DBMS is Oracle 10g Enterprise Edition running on
a Pentium 4 2.5 GHz Windows XP Professional workstation, equipped with
512 MB RAM and a RAID0 cluster of two 80 GB EIDE disks with NT file system
(NTFS).

2.5.2 Data Sets


To effectively test the system, we used the two real data sets described in
Section 1.3. Collection1, containing only English sentences, was mainly used
to test the effectiveness of the system in finding useful suggestions in a rela-
tively small translation memory. Collection2, on the other hand, was created
by professional translators and contains years of translation work on a very
specific technical subject. Therefore, such collection is much better estab-
lished than Collection1 and over 20 times larger. For these reasons, we used
it not only to test the reaction of the system to more repetitive data, but

Query sentence: Position the 4 clips (D) as shown and at the specified dimen-
sions.
Similar sentence in the source language: Position the 4 clips (A) as shown
and at the specified distance.
Corresponding sentence in the target language: Posizionare le 4 mollette
(A) come indicato e alla distanza prevista.
Query sentence: On completion of electrical connections, fit the cooktop in
place from the top and secure it by means of the clips as shown.
Sentence containing a similar part: After the electrical connection, fit the
hob from the top and hook it to the support springs, according to the illustration.
Corresponding sentence in the target language: Dopo aver eseguito il col-
legamento elettrico, montare il piano cottura dall’ alto e agganciarlo alle molle di
supporto come da figura.
Suggestion in the target language: collegamento elettrico, montare il piano
cottura dall’alto
Sentence containing a similar part: Secure it by means of the clips.
Suggestion in the target language: Fissare definitivamente per mezzo dei
ganci.

Figure 2.5: Examples of full and partial matches

also to test the efficiency of the system in a larger and, thus, more challeng-
ing scenario. Furthermore, Collection2 contains both English sentences and
their Italian translations; since the sentences are available in two languages, the
system could also be tested in aligning them and giving suggestions in the
target language.

2.5.3 Effectiveness of the system


To evaluate the effectiveness of a retrieval system means to assess the quality
of the answer set with respect to a user query. In the Information Retrieval
field, this is often achieved by computing recall and precision figures over
well known test collections. However, such “reference” collections have never
been defined in the Machine Translation field and, more generally, there are
no universally accepted ways to automatically and objectively evaluate MT
systems. Moreover, as we already outlined in Section 2.1, the few efforts to
define some measures of effectiveness do not apply to EBMT systems.
Given these premises, we first gained an insight into the quality of the suggestions
by examining a significant sample of matches retrieved by EXTRA. Some
of these examples for Collection2 are shown in Figure 2.5, where the emphasized
parts are those considered to be interesting suggestions. At a first

[Figure 2.6: Coverage of the collections as outcome of the search process.
(a) Percentages of sentence coverage:
                Collection1          Collection2
  d, dSub       Whole     Sub        Whole     Sub
  0.1           8.0%      25.5%      95.7%     2.1%
  0.2           12.0%     62.3%      96.7%     2.6%
  0.3           17.8%     58.0%      97.6%     1.9%
(b) Mean suggestions per sentence:
                Collection1          Collection2
  d, dSub       Whole     Sub        Whole     Sub
  0.1           1.1       2.0        2.6       1.5
  0.2           1.1       11.4       4.5       12.5
  0.3           1.1       15.2       10.1      15.875]

glance, they appeared to be well aimed with respect to the submitted text.


We then decided to introduce and evaluate “ad hoc” test figures, focused
on our particular problem. In particular, we analyzed the quality (perti-
nence, completeness) of the translation suggestions that EXTRA proposes
to its users, along with the benefits offered by our stemming and WSD tech-
niques. Furthermore, we precisely quantified the benefits of such translation
assistance w.r.t. standard manual translation, by performing several simu-
lation runs on discrete event models that we specifically designed for this
purpose.

Coverage
The content of the translation memory represents the history of past transla-
tions. When a translator is going to translate texts concerning subjects that
have already been dealt with, the EBMT system should help him/her save
time by exploiting the potentialities of the translation memory contents.
In order to quantify the ability of EXTRA in retrieving suggestions for
the submitted text, we propose a new measure, named coverage, which corresponds
to the percentage of query sentences for which at least one suggestion,
obtained from either a whole or a sub2 match, has been found in the translation
memory. Such a measure is a good indicator of the effectiveness of a
suggestion search process only if there is a good correlation between the
text to be translated and the translation memory as it is for our collections.
Moreover it proves to be useful in the comparisons of different systems, as
shown in Section 2.5.5. Figure 2.6-a shows that our search techniques en-
sure a good coverage for the considered collections, while Figure 2.6-b shows
the high number of retrieved suggestions per sentence and demonstrates the
wide range of proposed translations. The good size and consolidation of Col-
lection2 implies a very high level of coverage (over 97%, even with a very
restrictive setting of d and dSub = 0.1), where most of the suggestions con-
cern whole matches. As to Collection1, which is relatively small and not so
well established, notice that, as we expected, sub2 matching covers a remark-
able percentage of query sentences and becomes essential to further exploit
the translation memory potentialities. In particular, by setting the distance
thresholds d and dSub to at least 0.2, the obtained coverage is more than 70%
of the available sentences. Furthermore, notice that, while the mean number
of whole suggestions per sentence is generally particularly sensitive to the
TM size (e.g. the possibilities of finding similar whole sentences in small
TMs are quite low), the mean number of partial suggestions is sufficiently
high for both scenarios (see Figure 2.6-b), thus proving the good flexibility
of our matching techniques.
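As a small illustrative sketch (the function and the data layout are our own assumptions, not part of EXTRA), the coverage measure defined above can be computed as follows:

```python
# Minimal sketch of the coverage measure: the percentage of query sentences
# for which at least one suggestion (whole or sub2 match) was retrieved.
# The hit sets are assumed to come from the matching step.

def coverage(query_ids, whole_hits, sub_hits):
    """query_ids: iterable of query sentence ids;
    whole_hits / sub_hits: sets of query ids with at least one match."""
    queries = list(query_ids)
    if not queries:
        return 0.0
    covered = [q for q in queries if q in whole_hits or q in sub_hits]
    return 100.0 * len(covered) / len(queries)

if __name__ == "__main__":
    queries = list(range(10))
    print(coverage(queries, whole_hits={0, 1}, sub_hits={1, 5, 7}))  # 40.0
```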

Effectiveness of the document analysis techniques


After having evaluated the quality and coverage of the translation suggestions,
we felt the need to analyze in further detail the impact that the document
analysis techniques described in Section 2.3 have on the final results.
In particular, we compared the suggestion fragments that are retrieved by
EXTRA by employing the syntactic analysis with the ones offered by the
WSD analysis techniques. These specific tests were performed not only on
the two technical translation collections, Collection1 and Collection2, but
also on some classic literature works, since the effects of WSD can be even
more manifest when applied to semantically and lexically richer texts.
Table 2.1 shows a selection of some of the most interesting results of the
effectiveness of the document analysis comparisons. The use of stemming
techniques not only makes it possible to significantly speed up the sugges-
tion search phase (see Section 2.5.4) but, more importantly, it enables the
retrieval of many additional useful suggestions having very similar meanings
but different forms. The upper part of the table shows some of these cases,
where the differences between the query (left column) and the retrieved frag-
ments (right column) are emphasized in italic: The useful fragments contain
contractions (“you’re”), plural forms (“users”, “brushes”), prepositions and
other non-semantically significant words (“through”, “other”, “just by”, and
so on), but they are nonetheless very similar in meaning and, therefore, their
translation can be useful to speed up the translator’s work.
By coupling stemming with our WSD techniques, EXTRA can offer even
more quality in the finally proposed suggestions. In particular, the enhance-

Query fragments                                       Useful fragments that are retrieved only with stemming
. . . if you're new to computer graphics . . .        . . . you are new to computer graphics . . .
. . . browse through your PC User's guide . . .       . . . consult your PC Users Guide . . .
. . . be sure you have the proper information . . .   . . . be sure to have information . . .
. . . to understand a program feature . . .           . . . that you understand other program features . . .
. . . move on to Chapter 1, Getting Started . . .     . . . begin with Chapter Two, Getting Started . . .
. . . specify a beginning and ending . . .            . . . just by specifying the beginning and ending . . .
. . . create a custom brush . . .                     . . . created custom brushes . . .

Query fragments                                       Useful fragments that are retrieved only with WSD
. . . in the original picture has been . . .          . . . as if the original image had been . . .
. . . the legend of the demon dog . . .               . . . legend of the fiend dog . . .
. . . poor lad is an orphan . . .                     . . . the poor fellow was an orphan . . .

Query fragments                                       Wrong fragments that would be retrieved without WSD
. . . cream puffs . . .                               . . . cream of asparagus . . .
. . . knob to set . . .                               . . . control knob set to . . .
. . . the fellow shall be in my power . . .           . . . we should be powerless . . .
. . . do not go across the moor . . .                 . . . what goes on upon the moor . . .

Table 2.1: Examples of the improvements in the effectiveness of the suggestion search process offered by stemming and WSD

ment to the effectiveness of the suggestion retrieval offered by our in-depth
semantic analysis concerns both precision and recall: Not only are more useful
suggestions delivered to the translator, since synonyms are considered
as equivalent words (see the central part of Table 2.1), but also the num-
ber of wrong or uncorrelated suggestions is significantly reduced, since two
words having the same form but different meaning are now considered to be
different (lower part of Table 2.1). For instance, the nouns “picture” and
“image”, “demon” and “fiend”, “lad” and “fellow” have clearly the same
meaning in the shown fragments and, therefore, are equal for the suggestion
search purposes. On the other hand, the term “cream” in “cream puffs” and
in “cream of asparagus” has obviously different meanings. The same applies
to the term “knob”, which in the query fragment is a verb, while in the
translation memory segment is a noun: Avoiding such deceptively similar
suggestions is clearly important for the quality of the offered translation aid
and, ultimately, to save translation time.
In order to quantify the effectiveness of the WSD techniques we also ex-
tracted a significant sample of 100 sentences from each of the two collections
and systematically analyzed and judged the correctness of the disambigua-
tion of their nouns and verbs. Figure 2.7 depicts the results of such analysis
for each of the collections, with a context window of one (“standard”) or
three (“expanded context”) sentences. Along with the “right” and “wrong”

[Figure 2.7: Percentages of disambiguation success on samples from the two collections, with a context window of one sentence (Standard) or three sentences (Exp. context):
                     Collection1                 Collection2
                     Standard    Exp. context    Standard    Exp. context
  Right              72%         76%             71%         79%
  Partially Right    9%          9%              7%          7%]

classifications, we also considered a "partially right" one, since in many cases
the WordNet senses can be very similar and more than one disambiguation
can be considered sufficiently correct. For instance, in the sentence “In addi-
tion you can make global changes to your artwork with ease”, the best sense
for the term “change” is “alteration, modification”, but also the senses “a
relational difference between states”, “action of changing something” (iden-
tified by EXTRA), and “the result of alteration or modifications” can be
considered sufficiently close to it.
On both collections, our WSD techniques provide rather good precision,
which is very close to 90% when the context window is expanded. Indeed,
we found that context window expansion is useful in a good number of cases.
Consider, for instance, the following paragraph from Collection2: “Before
any maintenance on the appliance disconnect it from the electrical power
supply. Do not use abrasive products bleach oven cleaner spray or pan
scourers.” Without expanding the context, the term “maintenance” in the
first sentence would be disambiguated as “the act of sustaining”. By con-
sidering the next sentence, which contains topical nouns such as “cleaner”
and “products”, the term is correctly disambiguated as “activity involved
in maintaining something in good working order”. The following is another
example from Collection1: “You can even pick up an animation as a brush
and produce a painting with it. Your brush works like a little animation.”.
Here the disambiguation of “brush” changes from “the act of brushing your
teeth” to “brush, an implement that has hairs or bristles firmly set into a
handle”, thanks to the presence of words such as “painting” in the preceding
sentence. We also noticed further positive enhancement, which is not repre-
sented in the graph of Figure 2.7, in verb disambiguation by exploiting not
only the definitions but also the examples of their senses. For instance, in
the sentence “For example you can learn painting with the mouse”, the verb
learn would be disambiguated as “commit to memory”, while, considering
the examples of usage of its different senses, such as “she learned dancing
from her sister”, it is correctly disambiguated as “acquire or gain knowledge
or skills”.

Translation models and simulations


In order to quantify the benefits offered by the EXTRA translation assis-
tance w.r.t. standard manual translation, as well as the effectiveness of the
suggestions provided by the approximate matching process, we devised two
process-oriented, discrete-event models and simulated them with the aid of
specific simulation toolkits.
The two models simulate the manual and assisted translation of a certain
document and allow us to estimate the time that it would take real translators
in actual translation sessions, offering insight into the dynamic behavior of
such processes. Both models share the same structural scheme shown in
Figure 2.8, involving a configurable number of translators working on the
translation of the text (“servers” in the simulation field) and of the sentences
of the text (“users”). Note that the models we envisaged are completely
general but, since we are not mainly interested in varying the number of
translators or in settings involving large groups of translators in the tests we
present, we will typically configure them for just one translator. Furthermore,
we assume a unary capacity of each translator, that is each translator works,
as one would expect, on a single sentence at a time. In this context, the
mean service time of a translator will be the time spent to translate a given
sentence. Such time is the central element of our model and depends on
the type of translation (assisted or manual) and on a number of factors that
we will analyze later in this section. Each sentence of the document comes
from the source and waits for the translator in a standard FIFO queue. The
simulation ends when all the sentences have been translated; thus, the main
result that we are interested in is the overall simulation time, namely the
time required to translate the whole document in each particular setting.
Table 2.2 shows all the input parameters that we used to describe the two
translation models, together with their default value. Most of the parame-
ters are common to the manual and assisted models and are presented on the
top, while the ones differentiating the two scenarios are shown at the bot-
tom of the table. Parameters describing random variables, such as the ones
involving time, are expressed in terms of their mean value (x) and standard

[Figure 2.8: Schema of a generic translation model — the sentences of the document queue up and are served by the translators before exiting.]

deviation (σ); for the other parameters we simply specify the corresponding
value (val ). All the time values are expressed in seconds. The number of
translators (Ntrans ) is 1 by default, while the document to be translated con-
sists of the (Nsent ) query sentences from Collection1. The amount of time
needed to perform a translation is proportional to the length (in words) of
the sentences to be translated. For this reason we added a parameter, the
base word translation time tword base , representing the time needed to trans-
late one word whether or not the translator is confident in their translation.
Depending on the translator’s experience and on the difficulty of the text
(factors described by the probability of word look-up parameter Plook ), cer-
tain words may require an additional amount of time, which we call word
look-up time tword look : It is the time needed to look-up the word in a dictio-
nary and to decide the right meaning and translation. The default setting is
that, on average, 1 out of 20 words (5%) requires such additional time. On
the other hand, there are a number of very common and/or frequent words
whose translation time can be less than the base translation time, since
the translator has already translated them in the preceding sentences and is
very confident about their meaning. Therefore, the word recall saved time
parameter tword rec models such a time saving for each of the recalled words, while
the probabilities of recalling a word translation Prec range linearly from a
minimum (beginning of the translation, less confidence) to a maximum (end
of the translation) value.
Let us now specifically analyze the manual translation model: One pe-
culiar parameter is the time required to make the “hand-made” translation
coherent in its terminology, avoiding inconsistencies with the previous sen-
tences and with the other translators, if any, working on the same document.
For these reasons, such sentence coherence-check time tsent coher is added for
each of the translated sentences and ranges from a minimum to a maximum
value in proportion to the number of sentences already translated. The max-
imum value is also proportional to the number of translators working on the
same task, since the higher the number of translators working on the translation,

Input param    val / x     σ      Description                                       Man  Ass
Ntrans         1           -      Number of translators                             X    X
Nsent          400         -      Number of sentences to be translated              X    X
tword base     2.5         0.5    Base word translation time                        X    X
Plook          0.05        -      Probability of word look-up                       X    X
tword look     10.0        3.0    Word look-up time                                 X    X
Prec           0.0 → 0.1   -      Probability of recalling a word translation       X    X
tword rec      1.0         0.5    Word recall saved time                            X    X
tsent coher    0.0 → 5.0   2.0    Sentence coherence-check time (per translator)    X
Nread          5           -      Maximum number of suggestions read                     X
tword read     0.2         0.05   Suggestion word reading time                           X

Table 2.2: Description of the simulation input parameters (time in seconds)

the greater is the coordination time needed to obtain a final coherent work. The
following formula can be used to summarize all the contributions to the total
document translation time for one non-assisted translator, while setting aside
all probability considerations:
à !
X X
tmanual = tsent coher + (tword base + tword look − tword rec ) (2.1)
sentences words

As to assisted translation, its behavior can be modelled starting from the manual model but applying the following substantial modifications:
• if the offered suggestions are error free (null distance to the query segments), they automatically translate the words of the source sentences that they cover. For this reason, all the "covered" words do not require translation time;
• all the "uncovered" words can be treated as in manual translation. The same applies to each of the erroneous words that need to be modified in the suggestions in order to produce the new translation;
• the suggestions do speed up translation but they also require time to be read and chosen: For this reason we introduce the tword read parameter, representing the time needed to read one word of a suggestion, and the maximum number of suggestions to be read, Nread (the default value of this parameter is 5, an optimal value as we will show later in the analysis of the simulation results);
• finally, the sentence coherence time is considerably reduced, since assisted translation suggestions automatically guarantee an optimal coherence with already translated segments and terminology and can substantially be ignored.

Scenario                           Confidence interval            x           σ²
Manual, 1 translator
  Mean sentence translation time   28.175 < µ < 28.347            28.261      0.014
  Max sentence translation time    76.057 < µ < 81.559            78.808      14.232
  Total translation time           11270.125 < µ < 11338.425      11304.275   2193.06
Manual, 2 translators
  Mean sentence translation time   30.836 < µ < 31.058            30.947      0.023
  Max sentence translation time    90.172 < µ < 98.228            94.2        30.505
  Total translation time           6167.397 < µ < 6211.589        6189.493    918.107
Assisted, 1 translator
  Mean sentence translation time   18.859 < µ < 19.037            18.948      0.015
  Max sentence translation time    76.856 < µ < 84.084            80.47       24.561
  Total translation time           7542.862 < µ < 7615.218        7579.04     2461.268

Table 2.3: Results of the simulation runs (time in seconds)

The approximate formula that simplifies these assisted translation contributions to the total document translation time is the following:
\[
t_{assisted} = \sum_{sentences} \Big( \sum_{s\,words} t_{word\,read} + \sum_{u\,words} \big( t_{word\,base} + t_{word\,look} - t_{word\,rec} \big) \Big) \tag{2.2}
\]
where s words denotes the words of the read suggestions, while u words de-
notes the query words which either are not covered by a suggestion or, while
being covered, still need to be modified.
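The following minimal sketch (our own simplification, not the toolkit-based simulator used for the experiments; parameter values are those of Table 2.2, and the random draws are only indicative) estimates the two per-sentence times according to Eq. 2.1 and 2.2:

```python
# Simplified sketch of the manual vs. assisted per-sentence time estimate,
# loosely following Eq. 2.1 and 2.2 (probabilistic details are simplified).
import random

T_WORD_BASE, T_WORD_LOOK, T_WORD_REC = 2.5, 10.0, 1.0
T_SENT_COHER, T_WORD_READ = 5.0, 0.2
P_LOOK, P_REC = 0.05, 0.1

def word_time():
    """Base time plus possible look-up, minus possible recall saving."""
    t = T_WORD_BASE
    if random.random() < P_LOOK:
        t += T_WORD_LOOK
    if random.random() < P_REC:
        t -= T_WORD_REC
    return t

def manual_sentence_time(n_words):
    return T_SENT_COHER + sum(word_time() for _ in range(n_words))

def assisted_sentence_time(n_words, covered_words, suggestion_words_read):
    # Covered, error-free words cost nothing; uncovered words as in manual;
    # reading the retrieved suggestions adds a small overhead.
    reading = suggestion_words_read * T_WORD_READ
    return reading + sum(word_time() for _ in range(n_words - covered_words))

if __name__ == "__main__":
    random.seed(0)
    print(manual_sentence_time(12))
    print(assisted_sentence_time(12, covered_words=8, suggestion_words_read=40))
```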
In our simulation experiments, we considered three scenarios, manual
translation with one or two translators and assisted translation with one
translator, for each of which we were interested in three figures: The mean
and maximum time required for the translation of one sentence and the total
time required for the translation of the whole document. Then we chose a
level of confidence (probability that a confidence interval or region will con-
tain the true figures) of 95% from which we derived the confidence interval
and the minimum number of runs. Table 2.3 shows the mean value (x) and
the variance (σ²) of the 10 runs we performed in order to estimate the three
figures accurately. The time required by an unassisted translator to translate
the query sentences in the given document is approximately 3 hours and 8
minutes; notice that, as expected, the time required to perform the same
work by two translators working in team is not exactly half of this time but
is slightly more, i.e. 1 hour and 43 minutes. This is realistic and is due
to the overhead given by the coordination (or coherence check) between the
translators working on different sentences in the same document: In fact, the

[Figure 2.9: Trends of total translation time (sec.) as obtained in the simulation runs by varying (a) Nsent (assisted with 1 translator vs. manual with 1 and 2 translators), (b) Nread (assisted with 1 translator), and (c) Plook (assisted vs. manual, 1 translator).]

mean and maximum translation time per sentence is higher in this scenario.
Now, notice that the assisted translation time is slightly more than 2 hours
(2 hours and 6 minutes) and is significantly closer to the performance of
the team of two translators than to the one of the single translator. This is
good proof of the significant improvement that can be obtained by employing
assisted translation software but, above all, it quantifies the real effectiveness
of the EXTRA translation suggestions. In particular, the time required to
adapt the suggestions retrieved from past translations is not only significantly
lower than the one required to produce the same translations from scratch,
but the speed gain is also particularly substantial: The mean sentence trans-
lation time is by far the lowest of the three scenarios considered, and this
corresponds to the highest per-translator productivity. Furthermore, notice
that the maximum time required to translate a sentence in the given query
document is also particularly close to that of the manual single-translator scenario;
this is because even for the query sentences for which many suggestions are
available the time spent in reading them is limited by the maximum number
of suggestions parameter.
Figure 2.9 shows the trends of the total translation time for automatic
and/or manual translation obtained by our simulation models by varying
the number of query sentences (Nsent , Figure 2.9-a), the maximum number
of suggestions to be read (Nread , Figure 2.9-b), and the probability of look-up
(Plook , Figure 2.9-c) parameters. Notice that the variation of the number of
sentences produces a sort of scalability graph, with linear trends for the three
models: The automatic translation trend stands, as expected, between the
two manual ones. Figure 2.9-b demonstrates the trade-off between the time
saved by exploiting the translation suggestions and the time spent in reading
and selecting them: The best trade-off is given by reading (and presenting)
4 or 5 suggestions at most; therefore, such values are the optimal ones
that have been used in the other assisted translation simulations and that
can be used in EXTRA itself in order to deliver a balanced suggestion range
to the user. Finally, Figure 2.9-c shows the trends of total translation time
by varying the Plook parameter, which may represent the ability and expe-
rience of the translator: In this case, the graph shows that the translation
assistance is equally useful both for experienced translators (Plook = 0.01)
and inexperienced ones (Plook > 0.3).

2.5.4 Efficiency of the system


In this section we analyze the efficiency of the document analysis and similar-
ity search techniques. Figure 2.10 shows the results of the tests we performed
in order to estimate the running time of our algorithms on Collection1 (400
query sentences) and on Collection2 (421 query sentences). In particular, in
subfigure 2.10-a, the time required for whole and sub2 matching on the two
collections is shown for three different configurations of the relative distance
parameters d and dSub. Notice that, for these tests, the two search techniques
were applied sequentially: sub2 matching was applied to the sentences that
were not covered by whole suggestions, as this is the most popular scenario
for translators. Further, we only performed tests with d less or equal to 0.3,
since greater values would have led to almost useless suggestions. Translators
do indeed usually set the value of d and dSub from 0.2 to 0.1. The algorithms
perform efficiently for all the parameters’ settings, with total running time
(whole + sub2 ) of less than 3 seconds for Collection1 and less than 12 seconds
for Collection2 in the most demanding setting. The value of d = dSub = 0.2
proves to be the optimal one since, while delivering a very good coverage,
nearly the same as d = dSub = 0.3 (see Section 2.5.3), it also enables par-
ticularly fast response time (30% faster than the 0.3 setting). Figure 2.10-b
shows the scalability of the total matching time on the two collections. It
shows that, in both cases, time grows linearly with the number of query sen-
tence pairs. For further discussion on the matching algorithms performance
see Section 1.3, where the performance trends were observed in a generic en-
vironment under all the different parameters associated to the problem, and
the minimum length minL and the values of z-gram size are also considered.
The fast running time of the EXTRA search algorithms is mainly due
to the good performance of the filtering and document analysis techniques
that we presented in the previous sections. The tests results are depicted in
Figure 2.11-a and show the impact that the whole and sub2 filtering and the
document analysis techniques have on the total time shown in the previous

[Figure 2.10: Running time tests.
(a) Whole and sub2 matching time (mSec):
                Collection1               Collection2
  d, dSub       Whole      Sub2           Whole      Sub2
  0.1           390        532            4,719      2,605
  0.2           422        1,312          6,766      2,957
  0.3           437        1,950          9,047      2,709
(b) Total matching time scalability (mSec), by percentage of query sentences:
                25%        50%        75%        100%
  Collection1   211        554        1,027      1,734
  Collection2   3,928      5,743      7,541      9,723]

experiment, where all the filters were on and stemming was performed on all
the query and the TM sentences. Notice that the graph employs a logarithmic
scale. Disabling stemming, but keeping the filters enabled, produces a to-
tal running time of nearly 22 seconds for Collection1 and approximately 185
seconds for Collection2, more than 20 times higher than in standard con-
figuration; conversely, disabling filters, but keeping the stemming enabled,
produces a huge performance loss, with total searching time of more than
2 minutes and 20 minutes, respectively, for the first and second collection.
Thus, filters have great impact on the final execution time of the search algo-
rithms. Enabling all the available whole and sub2 filters allows the system
to reduce the overall response times by a factor of at least 70.
As to document analysis, it can be extremely useful not only for enhanc-
ing the effectiveness of the retrieved suggestions (see Section 2.5.3) but also
for a clear increase in performance. Notice that the document analysis time
is not included in the total response time, since this is not strictly related to
the search algorithms; such time is shown for the query sentences of the two
collections in subfigure 2.11-b, both for stemming and WSD. In particular,
the “no analysis” time corresponds to the time required to read the docu-
ments, extract their sentences and store them in the DB, along with their
z-grams. For instance this time is 6 seconds for the 400 query sentences in
Collection1. By enabling stemming, such time increases by only 2 seconds,
thus showing the performance of over 200 sentences per second offered by
our stemming algorithms. The same graph also reports on the WSD analysis
time; such analysis is much more complex and, therefore, the time required
for it is higher than for stemming. However, 10 sentences per second are

[Figure 2.11: Further efficiency tests: impact of the filtering and document analysis techniques.
(a) Total time (mSec, logarithmic scale) with / without filters and stemming:
                  Collection1     Collection2
  Filters + Stem  1,734           9,723
  Stem Off        21,812          185,320
  Filters Off     128,453         1,250,438
(b) Query document analysis time (Sec):
                  Collection1     Collection2
  No analysis     6               7
  Stemming        8               9
  Stemming + WSD  51              54]

still processed (approx. 50 seconds in total) and such time is still quite low,
especially if we consider that WSD can prove valuable to achieve an optimal
effectiveness in the search techniques (see Section 2.5.3).

2.5.5 Comparison with commercial systems


In this concluding section, we present a series of tests that we performed
in order to delve into the suggestion search process of commercial EBMT
systems and to compare the performances of such systems with those offered
by EXTRA. The softwares analyzed are two of the most successful products
in this field: Trados version 5 and IBM Translation Manager 2.6.0, which is
no longer commercially available but is still widely used.

On the suggestion search process of commercial EBMT systems


Software producers usually provide little or no information about the
techniques adopted in their systems. As the role of commercial systems in
the computer-aided translation field cannot be ignored, we tried to understand
the principles underlying the suggestion search process of two of the
most widespread EBMT systems through a series of targeted tests. In these
experiments we initially built a translation memory using Collection1, then
we submitted a new text containing ad-hoc modifications of the reference sentences
or sentences taken from the query set of the same collection. Finally,
we pre-translated it and analyzed the results.

The first series of tests concerns the way past translations are logically
represented and the penalty scheme adopted by the two systems to judge the
differences between the submitted text and the TM content. Starting from
a sentence in the TM, we initially modified (deleted) a word from it to find
out if the systems were still able to identify the match and, if so, how much
the matching score was penalized. Then we modified (deleted) other words
and analyzed the penalty trend. After several tests we came to the following
conclusions:
• the programs identify the different parts up to a certain level of modifications. The penalties given seem to depend not only on the number of modified (deleted) words but also on the length of the original sentences;
• unlike the EXTRA approach, the systems seem to perform no stemming on the TM and query sentences' words. For example, the systems give an equal penalty to the modification of "graphics" into "graphic" and of "graphics" into "raphics", while they should penalize much more heavily the second variation, where the new term is completely unconnected to the original one from a linguistic point of view.
In the second series of experiments, we inverted some of the positions of
the words in a given sentence in order to verify if the comparison mechanism is
sensitive to the order of words. In particular, we inverted the terms “graphics
tool” in “tool graphics” and “ease and precision” in “precision and ease”. In
this case, the search algorithms appeared to be order sensitive, similarly to
the EXTRA edit distance approach, and not simply based on a "bag of words"
approach. Further, Trados identified and displayed the moved segments with
a specific color.
Finally, we tested the ability of such systems to identify interesting parts
in the stored text. In these experiments, we joined two whole sentences (s1
and s2 in the following) contained in the TM and tried to pre-translate the
newly created sentence in each of the systems. The text was created in three
different ways:
• separation with a conjunction ("s1 and s2");
• separation with a comma ("s1, s2");
• inclusion of one sentence s within s1, previously split by a comma into two parts s1¹ and s1² ("s1¹, s, s1²").
These scenarios create no problem for EXTRA, since it is able to identify the
suggestions from the two original sentences by exploiting our sub2 matching

                  Trados      TrMan      EXTRA
Exact Match       4.96%       1.91%      n/a
Fuzzy Match       14.74%      24.79%     n/a
Total Coverage    19.70%      26.70%     74.30%
Time              23 s        3.5 s      1.7 s

Table 2.4: Pre-translation comparison test results for Collection1

algorithms. As to commercial systems, in all three cases Trados was not able
to automatically retrieve any suggestions. In particular, we found out that
the system was not able to dynamically segment the query and/or the exam-
ples in order to find sub-matches. To retrieve the suggestions to our queries,
for example, the user would have two possibilities, both of them being quite
impractical: To translate in Concordance mode, manually selecting the par-
tial segments to look for in the TM, or, in batch mode, (s)he would have to
change the segmentation rules, adding the comma and the conjunction as per-
manent separators. The only way to automatically retrieve matches between
segments is indeed to insert the interesting parts, already segmented, in the
TM and to split the query sentences in multiple segments. The problem is
that the user does not know which segments could match and adding static
segmentation rules would not be a general solution. Translation Manager
was able to partially solve the three queries we submitted. The Transla-
tion Manager similarity model seems to be more complex than the Trados
one: While Trados does not analyze further unknown sentences, Translation
Manager tries to find more suggestions for the unknown sequences of words
by re-analyzing the sequences of which a segment consists. For example, in
the sentence "s1¹, s, s1²" the system first identified the sentence s1, then it
specifically restricted the search to the contaminating segment (s). Some
limitations still remain: The system is not able to suggest the interesting
part of the TM sentence that matches a part of the query sentence, but it
just presents the whole sentence to the user. Furthermore, for the segments
to be identified, they must match the majority of a TM sentence, otherwise
they cannot be retrieved by the engine.

Pre-translation coverage and efficiency

In order to evaluate coverage and efficiency, we performed the pre-translation
of Collection1 with both systems and examined how many sentences were
matched with the ones in the TM and the time that it took for the search.
In particular we kept the distinction made by the two programs between ex-
act match (with a similarity greater than 95%) and fuzzy match (from 50%
to 95% of similarity). Parts with a smaller similarity were categorized as
“not found”. The results are shown in Table 2.4, where we also report, for
ease of comparison, the results obtained by EXTRA on the same collection.
For EXTRA we considered the default parameters’ values and, for the time
comparison, we considered the total time for whole and sub2 matching; fur-
ther, the distinction between exact and fuzzy match does not apply to our
system. The quantity of exact matches favors Trados, but Translation Man-
ager is able to find almost twice the number of fuzzy matches and ultimately
proves to have the more effective similarity search engine of the two
commercial systems. On the other hand, the level of coverage provided in
EXTRA by our whole and sub2 matching algorithms is nearly three times
higher and guarantees a much more accurate suggestion retrieval. Finally,
notice that the time required for the pre-translation operation is quite differ-
ent for the three systems. In particular, the EXTRA search algorithms are
more complex but do not require more time; indeed, the EXTRA search
techniques actually prove to be the most efficient ones.
Chapter 3

Approximate matching for duplicate document detection

The recent advances in computing power and telecommunications and the
constant drop of the costs of Internet access, data management and storage
have created the right conditions for the global diffusion of data sources on
nearly every topic. In recent years, the sheer size of this amount of electronic
information, mainly textual data, along with its intrinsic property of
extraordinary accessibility, has made the building of digital libraries easier.
Nowadays, digital libraries represent a concrete alternative to traditional
libraries, especially for reference needs. An informal definition of a digital
library is that of a “managed collection of information, with associated ser-
vices, where the information is stored in digital formats and accessible over
a network" [136]. The preferred access channel is the Internet, which re-
duces the marginal costs of the distribution of digital contents and makes
their fruition readily and easily accessible to the users. However, such a wide
web of data portals makes it much easier to collect and distribute duplicates.
One of the reasons for this danger is that electronic media facilitates ma-
licious users in the illegal appropriation, re-elaboration and distribution of
other people’s work. This gives rise to problems of protecting the owners of
intellectual property.
One possibility of addressing these problems lies in preventing violations
from occurring. Much work has been done in this direction, involving, for
instance, new software systems and new laws for copyrighted work copy pre-
vention and new technologies developed in order to limit the access and/or
inhibit the copy of digital information, providing encryption, new hardware
for authorization, and so on. As already observed in [22], the main draw-
back of these approaches is that, while never being completely secure, they
are quite cumbersome and impose too many restrictions on the users. For
these reasons, copy prevention is not really the best choice for the digital
library context. Digital libraries are open and shared systems, supposed to
be designed to advance and spread knowledge. Nonetheless, as it has been
recently highlighted in [83], the application of too many and too strict limita-
tions in such a domain often leads to the paradoxical and intolerable result
of severely limiting the usefulness, stability, accessibility and flexibility of
digital libraries. Further, it has been noted that, since copies are typically
what we preserve, works that are copy protected are less likely to survive into
the future [21], thus also limiting the preservation of our cultural heritage.
In our opinion, a good trade-off between protecting the information and
ensuring its availability can be reached by using duplicate detection ap-
proaches that allow the free access to the documents while identifying the
works and the users that violate the rules. One of the techniques following
this approach relies on watermark schemes [78], where a publisher adds a
unique signature in a document at publishing time so that when an unau-
thorized copy is found, the source will be known. On the other hand,
watermark-based protection systems can be easily broken by attackers who
remove embedded watermarks or cause them to be undetectable. Another
approach, which is the one we advocate, is that of proper duplicate detection
techniques [22, 23, 33, 66, 121]. These techniques are able to identify such
violations that occur when a document infringes upon another document in
some way (e.g. by rearranging portions of text). For a duplicate detection
technique, the notion of security represents how hard it is for a malicious
user to break the system [22]. Obviously, not all the duplicate detection
approaches present the same level of security. The level of security deliv-
ered by duplicate detection techniques is variable and is strictly correlated
to the scheme adopted for the comparison of documents. For instance, any
approach relying on an exact comparison of documents cannot be particularly
secure, since a few insignificant changes to a document may prevent the
approach from identifying it as a duplicate.
A duplicate detection technique ensuring a good level of security can thus
be employed as service of an infrastructure that gives users access to a wide
variety of digital libraries and information sources [22] and that, at the same
time, protects the owners of intellectual properties by detecting different
levels of violation such as plagiarisms, subsets, overlaps and so on. In con-
texts requiring secrecy, a duplicate detection service could be supported by
a number of other important services, such as encryption and authorization
mechanisms, which would help too in the protection of intellectual property.
Moreover, in a digital library context, the use of duplicate detection tech-
niques is not limited to the safeguard of the intellectual property. Indeed,
duplicates are widely diffused for a number of reasons other than the il-
legal ones: the same document may be stored in almost identical form in
multiple places (i.e. mirror sites, partially overlapping data sources) or dif-
ferent versions of a document may be available (i.e. multiple revisions and
updates over time, various summarization levels, different formats). As a
consequence, since the document collection of a digital library is the union of
data obtained from multiple sources, it usually presents a high level of duplication.
"Legal" duplicates do not give any extra information to the user
and therefore lower the accuracy of the results of searches in a digital li-
brary. The availability of duplicate detection techniques represents an added
value for a digital library search engine as they improve the quality and the
correctness of search results.
For a duplicate detection technique to be secure, a solid pair-wise document
comparison scheme is clearly needed, one which is able to detect different
levels of duplication, ranging from (near) duplicates up to (partial) overlaps.
By exploiting the approximate matching techniques described in Chapter 1,
in this chapter we specialize them to the document comparison scenario, de-
vising effective similarity measures that allow us to accurately determine how
much a document is similar to (or is contained into) another. Conceptually,
our pair-wise document comparison scheme tries to detect the resemblance
between the content of documents. In order to do it, we do not extract
stand-alone keywords but we consider the document chunks representing the
contexts of words selected from the text. The comparison of the informa-
tion conveyed by the chunks allows us to accurately quantify the similarity
between the involved documents. The security delivered by such approach
is particularly improved w.r.t. other schemes relying on crisp similarities,
since it is able to identify with much more precision and reliability the actual
level of overlap and similarity between two documents, thus detecting differ-
ent levels of duplication and violations. Further, we address efficiency and
scalability by introducing a number of data reduction techniques that are
able to reduce both time and space requirements without affecting the good
quality of the results. This is achieved by reducing the number of useless
comparisons (filtration), as well as the amount of space required to store the
logical representation of documents (intra-document reduction) and the doc-
ument search space (inter-document reduction). Such techniques have been
implemented in a system prototype named DANCER (Document ANalysis
and Comparison ExpeRt) [93].
The chapter is organized as follows: In Section 3.1 we define the new doc-
ument similarity measures for duplicate detection. Section 3.2 presents the
data reduction techniques. In Section 3.3 we discuss related work. Finally,
in Section 3.4 we show the DANCER architecture and the results of some
experiments.

3.1 Document similarity measures


In order to deal with the problems we outlined, we need to establish mea-
sures that effectively quantify the level of duplication between two given
documents. The definition of what constitutes a duplicate is at present un-
clear. We agree with the definition given in [33] where Chowdhury et al.
state that if a document contains roughly the same content it is a dupli-
cate whether or not it is a precise syntactic match. The similarity measures
we are going to define capture the informal notion of “roughly the same”
(resemblance measure) and “roughly contained” (containment measure) in a
rigorous way. Documents can have different formats. Our measures do not
actually compare documents by analyzing their original formats but their
logical representation. The following subsection describes the logical view
we adopt to represent documents.

3.1.1 Logical representation of documents


The resemblance and containment measures compare the content of pairs of
documents getting up to a specific level of detail. In particular, documents
contain some structural information which can be exploited in order to divide
them into well defined units (e.g. sequences of words, of sentences, of para-
graphs, and so on) known as chunks [22] or shingles [33]. Starting from their
original formats, documents that are to be compared undergo a “chunking”
process [121] where they are broken up into more primitive units. The unit
of chunking determines the logical representation of documents. Unlike previous
approaches, we will only consider units having a stand-alone meaning
and defining a context. These two properties are essential for the compar-
ison scheme we are going to introduce since it relies on the correlation of
the information conveyed by the chunks. For the above reason, single words
cannot be considered as chunks and the smallest unit is the sentence which
represents a context of words whose order determines the meaning. Thus,
one chunk can be equal to one sentence, two sentences, three sentences, k
sentences, but also to one paragraph, and so on. For the sake of simplicity,
in the following the chunking unit will be one sentence unless specified.
Once documents have been broken up into an initial set of chunks, chunks
undergo some filtrations to improve the resilience of our approach to insignif-
icant document changes. In particular, with syntactic filtration we remove
suffixes and stopwords from the document chunks and stem the chunks by
converting the set of the remaining terms to a common root form [7], whereas
length filtration allows us to perform an initial selection amongst the chunks
by applying a length threshold expressing the minimum length (in words) of

[Figure 3.1: Representation of a mapping between chunks: (a) a mapping between the chunks c_i^1 ... c_i^n of Di and c_j^1 ... c_j^m of Dj; (b) the corresponding permutation pm of the Dj chunk indexes.]

the chunks worth keeping. The resulting logical representation of a document D consists of a sequence of filtered chunks c^1 ... c^n extracted from D. In the following, In = {1, ..., n} will denote the corresponding index set and |c^k| will denote the length in words of the chunk c^k.
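A minimal sketch of this chunking and filtration pipeline is given below (the stopword list, the crude stemmer and the sentence splitter are illustrative stand-ins; the actual DANCER implementation may differ):

```python
# Sketch of the chunking pipeline: split into sentence chunks, apply
# syntactic filtration (stopword removal + crude stemming) and length
# filtration (drop chunks shorter than a minimum number of words).
import re

STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are"}

def crude_stem(word):
    # Toy suffix stripping, standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def chunk_document(text, min_len=4):
    chunks = []
    for sentence in re.split(r"[.!?]+", text):
        words = [w.lower() for w in re.findall(r"[A-Za-z']+", sentence)]
        filtered = [crude_stem(w) for w in words if w not in STOPWORDS]
        if len(filtered) >= min_len:          # length filtration
            chunks.append(filtered)
    return chunks

if __name__ == "__main__":
    print(chunk_document("The clips are positioned as shown. Fit the hob."))
```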

3.1.2 The resemblance measure


Before introducing the similarity measures, we give an intuition about what,
in our opinion, good measures of resemblance should satisfy. To this end, as in
[121], we start from a document D0 to be compared to a generic document Di .
Let D0 be represented by a sequence of chunks (e.g. sentences) indicated with
upper case letters ABC, we consider the following documents: D1 = ABC,
D2 = BAC, D3 = AB, D4 = ABCD, D5 = A . . A} , and D6 = A0 BC where
| .{z
n times
chunks identified by different capital letters are completely different whereas
A0 is a “variant” of A. Informally, we would expect to obtain the following
suggestions: D0 is the exact duplicate of D1 and D2 where the latter is a
chunk level rearrangement of D0 ; D3 and D4 are quite similar to D0 where
D3 (D4 ) is the more similar the shorter is the chunk C (D) with respect to
A and B; D3 is contained in D0 whereas D4 contains D0 ; D5 is somewhat
similar for low n and not very similar for high n; D6 is the more similar the
more A′ is similar to A.
More formally, the above comparison approach requires two documents
Di and Dj, logically represented by the sequences c_i^1 ... c_i^n and c_j^1 ... c_j^m,
respectively, and a similarity measure between chunks sim(c_i^k, c_j^h) stating how
much two chunks are approximately equal (an effective similarity measure
will be introduced in the following). Intuitively, a possible similarity mea-
sure between documents satisfying the requirements stated above is the one
that looks for the mapping between chunks maximizing the overall document
similarity (see Figure 3.1.a). The statement can also be formulated in the
following way. Let us suppose that the longest document in terms of number
of chunks is Dj , i.e. m > n, and let us consider the index set Im of Dj ,
then a permutation of Im is a function pm allowing the rearrangement of the
positions of the Dj chunks. We look for the permutations pm maximizing the
overall document similarity obtained by combining the similarities between
the chunks in the same position (see Figure 3.1.b). Its formalization is given
in the following definition.

Definition 3.1 (Resemblance measure) Given two documents Di = c_i^1 ... c_i^n and Dj = c_j^1 ... c_j^m, such that m ≥ n, and a similarity measure sim(c_i^k, c_j^h) between chunks, the similarity Sim(Di, Dj) between Di and Dj is the maximum of the following set {Sim_pm(Di, Dj) | pm is a permutation of Im}, where
\[
Sim_{p_m}(D_i, D_j) = \frac{\sum_{k=1}^{n} \big( |c_i^k| + |c_j^{p_m(k)}| \big) \cdot sim\big( c_i^k, c_j^{p_m(k)} \big)}{\sum_{k=1}^{n} |c_i^k| + \sum_{k=1}^{m} |c_j^k|} \tag{3.1}
\]

Notice that the similarity measure is defined in order to return a similarity
score ranging from 0 (totally different contents) to 1 (equal contents, possibly
rearranged). Further, notice that Eq. 3.1 weights similarities between chunks
on the basis of their relative lengths |c_i^k| and |c_j^pm(k)|. Indeed,
in the denominator we sum the lengths (in words) of the two documents
whereas in the numerator we only sum the lengths of the matching chunks.
In this way, similar small parts of the two documents will have a low weight
in the computation of the similarity. Moreover, the relative lengths of the
non-matching parts affect the resulting score: the bigger they are the lower
is the resemblance value.

Example 3.1 Let us consider the situation depicted at the beginning of
the present section. Let us suppose that the lengths of the involved chunks,
expressed in terms of word numbers, are the following: |A| = 10, |B| = 8,
|C| = 2, |D| = 7, and |A′| = 8. Intuitively, the best mapping is the one
which associates the (approximately) equal chunks in the two documents.
We recall that different capital letters correspond to completely different
chunks whereas A′ is a "variant" of A. Thus, the similarities sim(c_i^k, c_j^h)
between all chunks but sim(A, A′) will be 1 if c_i^k = c_j^h, 0 otherwise. The
permutation on the chunk indexes of the longest document maximizing the
similarity with D0 and the resulting score are summarized in the following
table:

  Doc   Permutation              Similarity with D0
  D1    Identity                 1
  D2    {1 ↦ 2; 2 ↦ 1; 3 ↦ 3}    1
  D3    Identity                 [(10+10)·1 + (8+8)·1] / [(10+8+2) + (10+8)] = 0.947
  D4    Identity                 [(10+10)·1 + (8+8)·1 + (2+2)·1] / [(10+8+2) + (10+8+2+7)] = 0.851
  D5    Identity                 (10+10)·1 / [(10+8+2) + n·10] ≤ 0.667
  D6    Identity                 [(8+10)·sim(A,A′) + (8+8)·1 + (2+2)·1] / [(10+8+2) + (8+8+2)] ≥ 0.526

Notice that, as we expected, D0 is estimated to be equal to D1 and D2 ,


where in D2 the D1 chunks are simply rearranged. D3 and D4 are both quite
similar to D0 but D3 is more similar to D0 than D4 since C is shorter than
D. As to D5 , notice that the score is 0.667 when n = 1, it is 0.5 when n = 2,
0.4 when n = 3, and so on. Finally, the more A′ is similar to A, the higher
the score Sim(D0, D6). For instance, if sim(A, A′) = 0.9 then the
similarity score is 0.953, whereas if the chunk similarity is lower, e.g. 0.6,
then we obtain 0.811. □
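The following sketch (a brute-force illustration, not an optimized implementation) computes Eq. 3.1 by trying all injective assignments of the chunks of the shorter document to those of the longer one; with a crisp chunk similarity it reproduces, e.g., the D3 score of Example 3.1:

```python
# Brute-force sketch of the resemblance measure of Eq. 3.1.
# Chunks are (length_in_words, content) pairs; sim() is pluggable.
from itertools import permutations

def resemblance(doc_i, doc_j, sim):
    # Ensure doc_j is the longer document (m >= n).
    if len(doc_i) > len(doc_j):
        doc_i, doc_j = doc_j, doc_i
    n, m = len(doc_i), len(doc_j)
    denom = sum(l for l, _ in doc_i) + sum(l for l, _ in doc_j)
    best = 0.0
    # Try every assignment of the n chunks of doc_i to distinct chunks of doc_j.
    for assign in permutations(range(m), n):
        num = sum((doc_i[k][0] + doc_j[assign[k]][0]) *
                  sim(doc_i[k][1], doc_j[assign[k]][1])
                  for k in range(n))
        best = max(best, num / denom)
    return best

if __name__ == "__main__":
    crisp = lambda a, b: 1.0 if a == b else 0.0
    d0 = [(10, "A"), (8, "B"), (2, "C")]
    d3 = [(10, "A"), (8, "B")]
    print(round(resemblance(d0, d3, crisp), 3))   # 0.947
```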

The resemblance measure introduced above relies on a comparison measure
between chunks. In the literature, comparison between chunks is usually
quantified by means of a crisp function, this being the simplest and most
straightforward measure to define. In our opinion this measure has one major draw-
back: The obtained scores are completely dependent on the size of the chunks
and are not satisfactorily precise in any case. Such a problem has been recog-
nized in other related papers [22, 23, 66, 121]. For instance, in [22] Garcia-
Molina et al. propose a copy detection system and measure the notion of
security in terms of how many changes need to be made to a document so
that it will not be identified as a copy. In this context, they state that “the
unit of chunking is critical since it shapes the subsequent overlap search and
storage cost”. Indeed, by using large chunks and a crisp chunk compari-
son measure, any change, even to a very limited portion of the documents,
would increase the probability of missing actual overlaps thus considerably
decreasing the security. On the other hand, using chunks smaller than the
sentence, such as words, would produce a very large number of small chunks,
too small to be able to truly capture the contents of the document.
For these reasons, we propose the introduction of a chunk comparison
function that goes beyond equality by analyzing the chunks' contents and
computing how similar they are. To this end, we consider a chunk as a sequence
of terms and we exploit the edit distance metric. In previous chapters, the
68 Approximate matching for duplicate document detection

edit distance has proved to be a good metric for detecting syntactic similari-
ties between sentences. In this context, we extend the concept of sentence to
that of chunk having a stand alone meaning and defining a context. Given
two chunks chi ∈ Di and ckj ∈ Dj and an edit distance threshold t, we define
the similarity between chi and ckj in the following way:

1 − ed(chi ,ckj ) if
ed(ch k
i ,cj )
≤t
h k h k
max(|ci |,|cj |)) max(|ci |,|ckj |))
h
sim(ci , cj ) = (3.2)
0 otherwise
By computing Eq. 3.1 of the resemblance measure with the above similarity
measure between chunks, we are able to obtain a good level of effectiveness,
since we perform a term comparison without giving up the context-awareness
which is guaranteed by the use of chunks. The security delivered by such an
approach is particularly improved w.r.t. other schemes relying on crisp similarities,
since it is much more independent of the chosen size of the chunks
and it is able to identify with much more precision the actual level of overlap
and similarity between two documents, thus detecting different levels
of duplication and violations (see Section 3.4 for an extensive experimental
evaluation).
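As a concrete illustration of Eq. 3.2, the following Python sketch treats each chunk as a list of terms and uses a standard dynamic-programming edit distance; the tokenization and the default threshold are placeholder assumptions, not the actual DANCER implementation.

```python
def edit_distance(a, b):
    """Word-level edit distance between two term sequences (standard DP)."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        curr = [i]
        for j, tb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ta != tb)))    # substitution
        prev = curr
    return prev[-1]

def chunk_sim(chunk_i, chunk_j, t=0.4):
    """Chunk similarity of Eq. 3.2; chunks are given as lists of terms."""
    d = edit_distance(chunk_i, chunk_j) / max(len(chunk_i), len(chunk_j))
    return 1.0 - d if d <= t else 0.0
```

For example, two five-word sentences differing in a single term have normalized distance 0.2 and therefore similarity 0.8, while chunks farther apart than the threshold t score 0.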

3.1.3 Other possible indicators


In a duplicate detection context, besides the above resemblance measure, some
applications could also be interested in other indicators. Another possible
similarity measure is an asymmetric one representing the length of the max-
imum contiguous part of a document (approximately) overlapping another
document. This gives rise to the maximum contiguous overlap measures.
Definition 3.2 (Maximum contiguous overlap w.r.t. a permutation)
Let D_i = c_i^1 ... c_i^n and D_j = c_j^1 ... c_j^m be two documents, such that m > n, and
let I_n = {1, ..., n} and I_m = {1, ..., m} be the corresponding index sets. Let
sim(c_i^k, c_j^h) be a similarity measure between chunks and p_m be a permutation
on I_m. Then:

• The maximum contiguous part in D_i overlapping D_j w.r.t. p_m is denoted
MaxOverlap(D_i)_{p_m}^{D_j} and is defined as:

\max_{s,e \in I_n : s \le e} \left( \sum_{k \in [s,e]} |c_i^k| \right) \qquad (3.3)

where [s, e] is a sequence of indexes in I_n such that sim(c_i^k, c_j^{p_m(k)}) > 0
for each k ∈ [s, e].

• The maximum contiguous part in D_j overlapping D_i w.r.t. p_m is denoted
MaxOverlap(D_j)_{p_m}^{D_i} and is defined as:

\max_{s,e \in I_n : s \le e \,\wedge\, sim(c_i^k, c_j^{p_m(k)}) > 0 \;\forall k \in [s,e]} \left( \sum_{k \in [s,e]} |c_j^{p_m(k)}| \right) \qquad (3.4)

The role of the above measure is twofold. When it is computed w.r.t. one of
the permutations p_m associated with the maximum value Sim_{p_m}(D_i, D_j) of Def.
3.1, it is an additional indicator that helps to understand the meaning of the
score of the resemblance measure. Otherwise, it can provide a different score
quantifying the maximum contiguous part overlapping another document.
For instance, such a score can help to easily identify partial plagiarisms, such
as an entire paragraph copied from one document to another, even if the
relative size of the copied part is very small w.r.t. the final document. The
formula to be computed is:

MaxOverlap(D_i)^{D_j} = \max_{p_m} MaxOverlap(D_i)_{p_m}^{D_j} \qquad (3.5)

Notice that, in general, the two scores MaxOverlap(D_i)_{p_m}^{D_j} and
MaxOverlap(D_i)^{D_j} are different.
Example 3.2 Let us reconsider the document set of Example 3.1. As we
have already shown, the measure Sim(D0, D3) = 0.947 is indicative of a
high similarity but gives no idea about the maximum overlap. For a deeper
analysis we compute MaxOverlap(D0)^{D3} = MaxOverlap(D3)^{D0} = (10 +
8) = 18. Such scores state that the content of D3 is approximately a copy of
a great part of D0. □
Another useful indicator is the one telling how much one of the two documents
is contained in the other. It can be measured by introducing an
asymmetric measure as a variant of the resemblance measure: the containment
measure.
Definition 3.3 (Containment measure) Given two documents D_i = c_i^1 ... c_i^n,
D_j = c_j^1 ... c_j^m, such that m > n, and a similarity measure sim(c_i^k, c_j^h)
between chunks, the containment aSim(D_i, D_j) of D_i in D_j is estimated by
the maximum of the following value set {aSim_{p_m}(D_i, D_j) | p_m is a permutation
of I_m}, where

aSim_{p_m}(D_i, D_j) = \frac{\sum_{k=1}^{n} |c_i^k| \cdot sim(c_i^k, c_j^{p_m(k)})}{\sum_{k=1}^{n} |c_i^k|} \qquad (3.6)
The containment aSim(D_j, D_i) of D_j in D_i can be obtained by substituting
c_i^k with c_j^{p_m(k)}.

Example 3.3 Let us consider again the document set of Example 3.1 and,
in particular, the containment of D3 in D0 and of D0 in D4, and vice versa.
The scores are aSim(D3, D0) = (10·1 + 8·1)/(10 + 8) = 1, aSim(D0, D3) =
(10·1 + 8·1)/(10 + 8 + 2) = 0.9, aSim(D0, D4) = (10·1 + 8·1 + 2·1)/(10 + 8 + 2) = 1,
and aSim(D4, D0) = (10·1 + 8·1 + 2·1)/(10 + 8 + 2 + 7) = 0.74.
In particular, aSim(D3, D0) and aSim(D0, D4) state that document D3 is
fully contained in document D0 which, in turn, is fully contained in document
D4. □
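A sketch of the containment score of Eq. 3.6 for one fixed mapping is equally simple; again, the names are illustrative.

```python
def containment_for_mapping(len_i, chunk_sim, mapping):
    """aSim_{p_m}(D_i, D_j) of Eq. 3.6 for one fixed chunk mapping.

    len_i     : lengths (in words) of the chunks of D_i
    chunk_sim : function (k, h) -> similarity between c_i^k and c_j^h
    mapping   : dict k -> h induced by the permutation p_m
    """
    matched = sum(len_i[k] * chunk_sim(k, h) for k, h in mapping.items())
    return matched / sum(len_i)   # only D_i's length appears in the denominator
```

With the values of Example 3.3, the containment of D3 in D0 evaluates to (10 + 8)/18 = 1.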

3.2 Data reduction


So far we have introduced a similarity measure between documents and we
have given an intuition of its efficacy (more details will be given in Section
3.4). However, we are aware that the comparison of documents, in particular
for large collections, is a time-consuming operation. In this section,
we address the problems of efficiency and scalability by introducing three
fine-tunable approaches for data reduction. With the term "data reduction"
we mean reducing the number of comparisons performed by techniques that
rely on a similarity function to detect duplicates. More
precisely, we concentrate our attention on similarity search and clustering in
a document search space. In a similarity search context, given a query doc-
ument, the system retrieves the documents in the search space whose level
of duplication with the query exceeds a given similarity threshold (similarity
search) or which are the most “duplicate” ones (k-NN search). On the other
hand, with clustering the system analyzes a large set of documents and col-
lects them into clusters, that is collections of duplicate documents that can
thus be treated collectively as a group. In both cases, we assume that the
detection of duplicates relies on the resemblance measure given in Def. 3.1.
The approaches we devised are the following:

• filtering, to reduce the number of useless comparisons;

• intra-document reduction, to reduce the number of chunks in each document;

• inter-document reduction, to reduce the number of stored documents.

They are orthogonal and thus fully combinable. The order of presentation
corresponds to the increasing impact on the document search space. Filtering
leaves it unchanged since, although filters reduce the document search space,
they ensure no false dismissals; intra-document reduction introduces an ap-
proximation in the logical representation of documents by storing a selected
number of chunk samples; inter-document reduction approximates the docu-
ment search space representation by pruning out less significant documents
while maintaining significant ones.
The tests we performed, which will be presented in Section 3.4, show
that the similarity measures are robust w.r.t. these three data reduction
techniques and that a good trade-off between the effectiveness of
the measures and the efficiency of the duplicate detection technique can be
reached, as data reduction allows us to decrease both time and space
requirements while keeping a very high effectiveness in the results.

3.2.1 Filtering
Filters allow the reduction of the number of useless comparisons while always
ensuring the correctness of the results. Indeed, filtering is based on the fact
that it may be much easier to state that two data items do not match than
to state that they match. In this section, we consider the application of
filtering techniques based on (dis)similarity thresholds for the comparison of
both documents and their chunks.
Since the edit distance threshold t in Eq. 3.2 allows the identification of
pairs of chunks similar enough, we can ensure efficiency in the chunk similar-
ity computation by applying the filtering techniques described in Chapter 1
before the edit distance computation: count filtering, position filtering, and
length filtering. In the following, we will refer to such filters as chunk filters.
Exploiting chunk filters allows us to greatly reduce the cost of approximate
chunk matching, i.e. the first phase of the document similarity computation.
Because of the way such filters work, it is not possible to give a theoretical
approximation of the computational benefits they offer: In
fact the behavior of all such filters depends on the data to which they are
applied. For instance, the results offered by the length filter are strictly
influenced by the distribution of the sentences' lengths. In a
typical case, the number of required comparisons between chunks (sentences)
can be reduced from hundreds of thousands to a few dozen,
thus drastically reducing the required time, without missing any of the
final results.
The pair-wise document comparison scheme introduced in Def. 3.1 and
its variants essentially try to find the best mapping between the document
chunks by considering the similarities between all possible pairs of chunks
themselves. The mapping computation phase is the second and last phase
in the document similarity computation; further details on the approach
we exploit to solve this mapping problem, together with a discussion of its
complexity, will be given in Section 3.4.1. In order to improve the efficiency
also in this phase, since the number of document pairs having at least one
similar chunk is very high, we introduce a filtering technique which is able
to output a candidate set of document pairs. The basic idea of our filter is
to take advantage of the information conveyed by the similarities between
the involved chunks, ignoring possible mappings between the chunks. The
intuition is that two documents that are very similar have a great number of
similar chunks.

Theorem 3.1 Given a pair of documents D_i = c_i^1 ... c_i^n, D_j = c_j^1 ... c_j^m,
where m ≥ n, and a minimum document similarity threshold s (s ∈ [0, 1]),
if at least one of the following three conditions holds:

• Sim(D_i, D_j) ≥ s,

• aSim(D_i, D_j) ≥ s,

• aSim(D_j, D_i) ≥ s

then one of the following two conditions holds:

• \widetilde{aSim}(D_i, D_j) ≥ s,

• \widetilde{aSim}(D_j, D_i) ≥ s

where

\widetilde{aSim}(D_i, D_j) = \frac{\sum_{k=1}^{n} \sum_{h=1}^{m} |c_i^k| \cdot sim(c_i^k, c_j^h)}{\sum_{k=1}^{n} |c_i^k|} \qquad (3.7)

The above theorem shows the correctness of the filter we devised for
data reduction. In other words, we ensure no false dismissals, i.e. that
none of the pairs of documents whose similarity actually exceeds a given
threshold is left out by the filter. Consequently, whenever we solve range
queries where a query document Q and a duplication threshold s are specified,
we can first apply the filter shown in the theorem (i.e., for each document
D in the collection, check whether \widetilde{aSim}(Q, D) ≥ s or \widetilde{aSim}(D, Q) ≥ s),
which quickly discards documents that cannot match. Then, we compute
the resemblance level between the query document and each document in
the resulting candidate set. As for the chunk filters, it is not possible to define
an a priori formula quantifying in general terms the reduction offered by the
document filter to the computational complexity of the mapping computation
between the chunks, since the improvement depends on the data to which
it is applied. In Section 3.4 we describe the effect of such a filter by means
of experimental evaluation on different document collections.
Since the chunk filters and the document filter are correct, they reduce but
do not approximate the document search space. In particular, the adoption
of the chunk filtering techniques requires a space overhead to store the q-
gram repository. By following the arguments given in [61], we state that the
required space is bounded by some linear function of q times the size of the
corresponding chunks.
Finally, in Section 3.4 we quantify the effectiveness of the chunk filters
and the document filter in the reduction of the number of comparisons. From
our tests, we infer that filters ensure a small number of false positives as in
most cases the level of duplication between documents is heavily dependent
on the number of similar chunks.
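A minimal sketch of the document filter of Theorem 3.1 is shown below: \widetilde{aSim} aggregates all chunk similarities surviving the chunk filters without solving the mapping problem, so it can be evaluated cheaply. The data layout and names are assumptions made for the example.

```python
def asim_bound(len_i, sims):
    """Filter score of Eq. 3.7, an upper bound used to discard document pairs.

    len_i : lengths (in words) of the chunks of D_i
    sims  : sims[k] is the list of similarities that chunk c_i^k has with
            chunks of D_j (only the pairs that survived the chunk filters)
    """
    num = sum(len_i[k] * s for k in range(len(len_i)) for s in sims[k])
    return num / sum(len_i)

def is_candidate_pair(len_i, sims_ij, len_j, sims_ji, s):
    """Keep the pair (D_i, D_j) only if one of the two bounds reaches s."""
    return asim_bound(len_i, sims_ij) >= s or asim_bound(len_j, sims_ji) >= s
```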

3.2.2 Intra-document reduction


The intra-document reduction aims at reducing the number of chunks per
document which have to be stored and compared. To this end, we consider
two different techniques: length-rank chunk selection and chunk clustering.
Both act by selecting for each document the percentage of its chunks spec-
ified by a given reduction ratio chunkRatio. Each of these techniques is
completely fine-tunable to achieve good results, both in efficacy and effi-
ciency, for document collections of different size and type (see Section 3.4 for
more details on the results).

Length-rank chunk selection


The length-rank chunk selection is a variant of the chunk sampling approach
previously addressed in [22, 66]. It acts by selecting the longest chunks of
each document. The number of chunks selected from each document is the
chunkRatio percentage of the total number of its chunks. Even though it is a
simple idea, it works better than sampling. Indeed, it aims at selecting similar
chunks from similar documents. Moreover, it prunes out small chunks, which
have little impact on the document similarity computation, as the latter is weighted
on the chunk lengths. On the other hand, small chunks can also be a source of
noise, especially for real data sets, as shown in Section 3.4. With length-rank
chunk selection, we thus ignore the similarities between small chunks and
stress the similarities between large portions of the involved documents. For
these reasons, we found it to be particularly effective, keeping good quality
results while significantly reducing computing time and required storage space.
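As a simple illustration of length-rank chunk selection, the sketch below keeps the chunkRatio fraction of longest chunks of a document; the parameter names are illustrative.

```python
def length_rank_selection(chunks, chunk_ratio):
    """Keep the longest chunkRatio fraction of a document's chunks.

    chunks      : list of chunks, each given as a list of terms
    chunk_ratio : fraction of chunks to retain, e.g. 0.3
    """
    keep = max(1, int(round(chunk_ratio * len(chunks))))
    # rank chunks by their length in words, longest first, and keep the top ones
    return sorted(chunks, key=len, reverse=True)[:keep]
```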

Chunk clustering
The chunk clustering is the process of cluster analysis in the chunk search
space representing one document. The intuition is that, if a document con-
tains two or more very similar chunks, chunk clustering stores just one of
them. Its effectiveness is mainly due to the availability of the proximity
measure defined in Eq. 3.2, which is appropriate to the chunk domain and on
which the document resemblance measure relies. By exploiting the similarities
among chunks, chunk clustering is able to choose the "right" representatives,
giving particularly good results for documents with a remarkable inner
repeatedness. More precisely, given a document D containing n chunks, the
clustering algorithm produces chunkRatio · n clusters in the chunk space.
As to the clustering algorithm, among those proposed in the literature, we
experimentally found out that the most suitable for our needs is a hierarchical
complete-link agglomerative clustering [70]. For each cluster γ, we keep
some features which will be exploited in the document similarity computation.
To this end, the cluster centroid R will be used in place of the chunks
it represents. The centroid corresponds to the chunk minimizing the maximum
of the distances between itself and the other elements of the cluster.
Moreover, in order to weight the contribution of γ to the document similarity,
the total length |γ| in words and the number N of chunks in γ are also
considered.
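As an illustration only, the sketch below obtains such clusters with an off-the-shelf complete-link agglomerative clustering (SciPy), using 1 − sim as the chunk distance and the min-max criterion described above for the centroid; it is not the actual DANCER implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_chunks(chunks, chunk_ratio, sim):
    """Group a document's chunks into roughly chunkRatio * n clusters.

    chunks : list of chunks (lists of terms)
    sim    : chunk similarity (e.g. Eq. 3.2); the distance used is 1 - sim
    Returns one (centroid chunk, N, total length in words) tuple per cluster.
    """
    n = len(chunks)
    k = max(1, int(round(chunk_ratio * n)))
    if n <= k:
        return [(c, 1, len(c)) for c in chunks]
    dist = np.array([[1.0 - sim(a, b) for b in chunks] for a in chunks])
    tree = linkage(squareform(dist, checks=False), method='complete')
    labels = fcluster(tree, t=k, criterion='maxclust')
    clusters = []
    for c in range(1, labels.max() + 1):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        # centroid: the member minimizing its maximum distance to the others
        centroid = min(idx, key=lambda i: max(dist[i][j] for j in idx))
        clusters.append((chunks[centroid], len(idx),
                         sum(len(chunks[i]) for i in idx)))
    return clusters
```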
Therefore, in the chunk clustering setting, documents are no longer represented
as sequences of chunks but as sequences of chunk clusters, and an adjustment
of the resemblance measure of Def. 3.1 is required. More precisely,
we start from two documents D_i and D_j of n and m chunks, respectively,
which have been clustered in n′ = chunkRatio · n and m′ = chunkRatio · m
clusters, and thus are represented by the sequences D_i = γ_i^1 ... γ_i^{n′},
D_j = γ_j^1 ... γ_j^{m′}, where the k-th cluster γ_{i(j)}^k in D_i (D_j) is a tuple
(R_{i(j)}^k, N_{i(j)}^k, |γ_{i(j)}^k|) with centroid R_{i(j)}^k, number of chunks N_{i(j)}^k,
and length |γ_{i(j)}^k|. Since the reduction ratio chunkRatio is common to the two
documents, it follows that if n ≤ m then n′ ≤ m′. In the revised form of the
resemblance measure of Def. 3.1, the similarity Sim(D_i, D_j) between D_i and D_j
is the maximum of the similarity scores between the two documents, each one
computed on a permutation p_{m′} on I_{m′}. To this end, we devised two variants of Eq. 3.1.
Cluster-Based Function  The Cluster-Based Function is a straightforward
adaptation of Eq. 3.1:

CSim_{p_{m'}}(D_i, D_j) = \frac{\sum_{k=1}^{n'} \left( |\gamma_i^k| + |\gamma_j^{p_{m'}(k)}| \right) \cdot sim(R_i^k, R_j^{p_{m'}(k)})}{\sum_{k=1}^{n'} |\gamma_i^k| + \sum_{k=1}^{m'} |\gamma_j^k|} \qquad (3.8)

Average-Length Cluster-Based Function  The second function provides
a different way to weight the similarities between the cluster representatives
by using the average length of each cluster. It is thus able to
distinguish large clusters with small chunks from small clusters with
large ones by giving more weight to the latter than the former. It is
defined as follows:

ALCSim_{p_{m'}}(D_i, D_j) = \frac{\sum_{k=1}^{n'} \left( \frac{|\gamma_i^k|}{N_i^k} + \frac{|\gamma_j^{p_{m'}(k)}|}{N_j^{p_{m'}(k)}} \right) \cdot sim(R_i^k, R_j^{p_{m'}(k)})}{\sum_{k=1}^{n'} \frac{|\gamma_i^k|}{N_i^k} + \sum_{k=1}^{m'} \frac{|\gamma_j^k|}{N_j^k}} \qquad (3.9)

As to the accuracy of the intra-document reduction techniques, by introduc-


ing an approximation in the logical representation of documents we would
expect an approximation of the duplication scores computed on the origi-
nal document search space, too. The loss of accuracy is analyzed in Section
3.4. On the other hand, such techniques reduce the space required to store
the logical representation of documents as well as the number of pair-wise
chunk comparisons, which is strictly correlated with the response time. More
precisely, for each document represented by n chunks, both techniques store
the n × chunkRatio most representative chunks. Thus, given two documents
D_i and D_j having n_i and n_j chunks respectively, by adopting the intra-document
reduction we cut down the number of pair-wise chunk comparisons
from n_i × n_j to chunkRatio² × n_i × n_j.

3.2.3 Inter-document reduction


In this subsection, we discuss the approach we follow for the inter-document
reduction. The technique we propose derives from a recent idea based on
the notion of data bubble [20], which is a particular data structure employed
to reduce the amount of data on which to perform hierarchical clustering
algorithms [69]. The intuition behind a data bubble is that of a “convenient
abstraction summarizing the sufficient information on which clustering can
be performed” [20]. Starting from the original definition of a data bubble, we
define the concept of document bubble, which is the key for inter-document
reduction. Intuitively, a document bubble works for inter-document reduc-
tion as a chunk cluster works in intra-document data reduction: It reduces
the amount of data to be stored and processed for similar documents, keep-
ing just a representative (in this case, a document) and a series of values
summarizing the involved set of similar documents. Data bubbles provide a
very general framework for applying hierarchical clustering algorithms to data
items randomly chosen from an arbitrary data set, assuming only that a distance
measure is defined for the original objects [20]. We introduce the notion
of document bubble, which is a specialization of the data bubble notion in
the context of duplicate detection, where the distance measure between documents
d(D_i, D_j) is the complement to 1 of the resemblance measure shown
in Def. 3.1, i.e. 1 − Sim(D_i, D_j).
Definition 3.4 (Document bubble) Let D = {D_1, ..., D_n} be a set of n
documents. The document bubble B w.r.t. D is a tuple (R, N, ext, inn),
where:

• R is the representative document for D, corresponding to the document
in D minimizing the maximum of the distances between itself and the
other documents of the cluster;

• N is the number of documents in D;

• ext is a real number such that the documents of D are located within a
radius ext around R;

• inn is the average of the nearest-neighbor distances within the set of
documents D.
We build document bubbles in the following way: Given a set of documents
DS and a reduction ratio bubRatio, by applying Vitter's algorithm [135]
we perform a random sampling in order to select the initial bubRatio · |DS|
document originators. The remaining documents are assigned to the "closest"
originator by applying the standard document similarity algorithm. The
outcome is a collection of bubRatio · |DS| sets of documents. Finally, for each
document set, the features of the resulting cluster are computed and stored
as a document bubble.
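The construction just described can be sketched as follows; plain random sampling stands in for Vitter's algorithm and all names are illustrative.

```python
import random

def build_document_bubbles(docs, bub_ratio, doc_sim):
    """Summarize a document set into roughly bubRatio * |DS| document bubbles.

    docs    : the document collection DS
    doc_sim : document resemblance of Def. 3.1; the distance used is 1 - doc_sim
    Returns (R, N, ext, inn) tuples as in Def. 3.4.
    """
    k = max(1, int(round(bub_ratio * len(docs))))
    originators = random.sample(range(len(docs)), k)   # stand-in for Vitter's sampling
    groups = {o: [o] for o in originators}
    for i in range(len(docs)):
        if i not in groups:                            # assign to the "closest" originator
            closest = max(originators, key=lambda o: doc_sim(docs[i], docs[o]))
            groups[closest].append(i)
    dist = lambda a, b: 1.0 - doc_sim(docs[a], docs[b])
    bubbles = []
    for members in groups.values():
        rep = min(members, key=lambda a: max(dist(a, b) for b in members))
        ext = max(dist(rep, b) for b in members)
        inn = (sum(min(dist(a, b) for b in members if b != a) for a in members)
               / len(members)) if len(members) > 1 else 0.0
        bubbles.append((docs[rep], len(members), ext, inn))
    return bubbles
```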
Therefore, the inter-document reduction summarizes the starting set DS of
documents to be clustered into bubRatio · |DS| document bubbles which
approximate the original document search space. In comparison with traditional
hierarchical clustering, inter-document reduction allows us to cut
down the number of pair-wise document comparisons from |DS| × |DS| to
bubRatio² × |DS| × |DS|. In such a context, the distance measure suitable for
hierarchical clustering is no longer the distance d(D_i, D_j) between documents
but the distance measure between two bubbles B and C given in [20]:

d(B, C) =
\begin{cases}
d(R_B, R_C) - (ext_B + ext_C) + (inn_B + inn_C) & \text{if } d(R_B, R_C) - (ext_B + ext_C) \ge 0 \\
\max(inn_B, inn_C) & \text{otherwise}
\end{cases}
\qquad (3.10)
Figure 3.2: Distance between document bubbles (non-overlapping vs. overlapping document bubbles)

The two conditions correspond to non-overlapping and overlapping document
bubbles (see Figure 3.2): The distance between two non-overlapping
document bubbles is the distance of their representatives minus their radii
plus the sum of their average nearest-neighbor distances. Otherwise, if the
document bubbles overlap, we take the maximum of their average nearest-neighbor
distances as their distance.
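Eq. 3.10 translates directly into a few lines of code; here the bubble is assumed to be the (R, N, ext, inn) tuple of Def. 3.4 and doc_dist the 1 − Sim distance between representatives.

```python
def bubble_distance(b, c, doc_dist):
    """Distance of Eq. 3.10 between two document bubbles (R, N, ext, inn)."""
    rb, _, ext_b, inn_b = b
    rc, _, ext_c, inn_c = c
    gap = doc_dist(rb, rc) - (ext_b + ext_c)
    if gap >= 0:                          # non-overlapping bubbles
        return gap + (inn_b + inn_c)
    return max(inn_b, inn_c)              # overlapping bubbles
```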

3.3 Related work


The problem of duplicate document detection has been addressed in differ-
ent contexts and for purposes also going beyond security issues. A good
discussion of the state of the art is given in [33].
The COPS working prototype [22] is a copy detection service where orig-
inal documents can be registered, and copies can be detected. It has been
devised as one of the main components of an information infrastructure that
gives users access to a wide variety of digital libraries. To this end, they define
different violation tests (subset, overlap, and plagiarism) and as many ordinary
operational tests (OOTs) that approximate the desired violation tests. Such
OOTs essentially count the number of common chunks between each pair of
documents and select such registered documents that exceed some threshold
fraction of matching chunks. Finally, they address the efficiency and scala-
bility of OOTs by applying random selections at the chunk and document
levels. Since no semantic premise is used to reduce the amount of data to be
compared, a degree of approximation is introduced to the matching process
resulting in a decay of the accuracy, which is even more substantial for doc-
uments of unequal sizes. The same approach is followed in the KOALA [66]
and in the DSC [23] techniques which address the general problem of textual
matching for document search and clustering and plagiarism/copyright ap-
plications. In such approaches, chunks or shingles are obtained by “sliding” a
window of fixed length over the tokens of the document. The DSC algorithm
has a more efficient alternative, DSC-SS, which uses super shingles but does
not work well with short documents and thus is not suitable for web clus-
tering. Such approaches, as outlined by the authors, are heavily dependent
on the type and the dimension of chunks which affect both the accuracy and
the efficiency of the systems. As to accuracy, the bigger the chunking unit
the lower the probability of matching unrelated documents but the higher
the probability of missing actual overlaps. As to search cost, the larger the
chunk, the lower the running time but the higher the potential number of
distinct chunks that will be stored. Moreover, such approaches are usually
not resilient to small changes occurring within chunks as only equal chunks
are considered. Our approach differs from the ones cited above since it is less
dependent on the size of the chunks to be compared, as our comparison
scheme enters into the chunks. In this way, we are able to accurately detect dif-
ferent levels of duplication. Indeed, notice that if we adopt a crisp similarity,
the similarity measure of Eq. 3.1 computes the number of common chunks
weighted on their lengths. For the above approaches, finding the chunk size
ensuring a “good” level of efficiency is almost impossible. The problem is
also outlined in [121] where Garcia-Molina et. al. address again the problem
of copy detection by proposing a different approach named SCAM and com-
pare SCAM with COPS by showing that the adoption of sentence unit as
chunk implies a percentage of false negatives of approximatively 50% (more
than 1000 netnews articles). SCAM investigates the use of words as the unit
of chunking. It essentially measures the similarity between pairs of docu-
ments by considering the corresponding vectors of words together with their
occurrences and by adopting a variant of the cosine similarity measure which
uses the relative frequencies of words as indicators of similar word usage. In
our opinion, it does not fully meet the characteristics of a good measure of
resemblance or, at least, of those depicted in Section 3.1. For instance, the
authors show that, setting the measure parameters to the values adopted in
the implementation, the similarity between ⟨a, b, c⟩ and ⟨a, b⟩ is 1; the same
holds for ⟨a, b, c⟩ against ⟨a, b, c, d⟩ and ⟨a^k⟩ with k = 1, whereas no similarity
is detected with k > 1. Moreover, as documents are broken up into stand-
alone words with no notion of context and the adopted measure relies on
the cosine similarity measure which is insensitive to word order, documents
differing in contents but sharing more or less the same words are judged very
similar.
Recently, an approach based on collection statistics has been proposed in
[33] for clustering purposes. The approach extracts from each document a
set of significant words selected on the basis of their inverse document frequency
and then computes a single hash value for the document by using the
SHA1 algorithm [1], which is order sensitive. All documents resulting in the
same hash value are considered duplicates. The approach supports a high similarity
threshold, justified by the aim of clustering large amounts of documents. It
thus turns out to be not very suitable for plagiarism or similar problems
where the detection of different degrees of duplication within a collection is
required. As to the efficacy of the measure, a comparison with our approach
is also possible. Indeed, they conducted some experiments on synthetic doc-
uments generated in a way similar to ours, i.e. 10 seed documents and 10
variants for each seed document. Although they introduce a mechanism
which lowers the probability of modification, they show that the average number of
document clusters formed for each seed document, which should ideally be
1, is 3.3 with a maximum of 9.
As far as the techniques for data reduction are concerned, the main ob-
jective of our work has been to devise techniques able to reduce the number
of document comparisons relying on our scheme and to test the robustness
of the resemblance measures w.r.t. such techniques. Filters have been suc-
cessfully employed in the context of approximate string matching (e.g. [61])
where the properties of strings whose edit distance is within a given thresh-
old are exploited in order to quickly discard strings which cannot match.
In our context, we devised an effective filter suitable for the resemblance
measure of the comparison scheme we propose. Data bubbles capitalize on
sampling as a means of improving the efficiency of the clustering task. In our
work, we have revised and adapted to our context the general data bubble
framework proposed in [20]. The construction of the data bubble could be
sped up by adopting the very recent data summarization method proposed
in [142] which is very quick and is based only on distance information. Fi-
nally, several indexing techniques have been proposed to efficiently support
similarity queries. Among those, the Metric Access Methods (MAMs) only
require that the distance function used to measure the (dis)similarity of the
objects is a metric. They are orthogonal to the data reduction techniques
we proposed and thus they could be combined in order to deal with dupli-
cate detection problems involving large amounts of documents. Indeed, the
fact that our resemblance measure is not a metric (it does not satisfy the
triangle inequality property) does not prevent us from using a MAM such
as the M-Tree [36] for indexing the document search space. The paper [35]
shows that the M-Tree and any other MAM can be extended so as to process
queries using distance functions based on user preferences. In particular any
distance function that is lower-bounded by the distance used to build the
tree can be used to correctly query the index without any false dismissal.

Figure 3.3: The DANCER Architecture (document pre-processing: conversion to ASCII, chunk identification, syntactic filtration, length filtration; document repository; statistic and graphic analysis; inter/intra-document data reduction; similarity computation: chunk similarity search and chunk mapping optimization)

In this enlarged scenario, several distances can be dealt with at a time: the
query (user-defined) distance, the index distance (used to build the tree)
and the comparison distance (used to quickly discard uninteresting index
paths), where the query and/or the comparison distance functions can also
be nonmetric. As a matter of fact, a metric which is a lower-bound of our
resemblance measure can be easily defined and our filters can be used as
comparison distances.

3.4 Experimental Evaluation


To evaluate both the effectiveness and the efficiency of the ideas and the
techniques presented so far, we conducted a number of exploratory experi-
ments using a system prototype named DANCER (Document ANalysis and
Comparison ExpeRt), whose architecture is summarized in Figure 3.3. The
aim of the document pre-processing modules is to translate the documents to
be stored in DANCER into their logical representations suitable for the simi-
larity computation (see Subsection 3.1.1). When a document is submitted to
the system, the module first converts the contents of the document to ASCII,
then it applies a simple finite-state automaton in order to identify both sentences
and paragraphs, and provides the user with the option of choosing the
desired size of chunks in terms of a fixed number of sentences or paragraphs.


The last two steps are the chunk syntactic filtration and the chunk length
filtration which have already been described in Subsection 3.1.1. The logical
representations of the submitted documents are then stored in the document
repository. The remaining system modules access such information in order
to perform different tasks: document similarity computations, data reduc-
tion functions, both intra and inter-document (see Sections 3.2.2 and 3.2.3
respectively), statistical and graphical analysis of the document collections, and
generation of synthetic document collections.
The following sections describe:

• the similarity computation module;

• the document generator tool we developed to create synthetic document collections;

• the document collections used in the tests;

• assessment and evaluation of the experiments we conducted.

3.4.1 The similarity computation module


The Similarity Computation module is able to compute the similarities be-
tween two sets V and W of documents by following two phases: the Chunk
Approximate Matching and the Chunk Mapping Optimization. The former
approximately matches the chunks of the involved documents, i.e., for each
D_i ∈ V and D_j ∈ W it returns all pairs of chunks c_i^h ∈ D_i and c_j^k ∈ D_j
satisfying the edit distance threshold t, as required by Eq. 3.2. It is implemented on top of
the DBMS managing the document repository by means of an SQL expres-
sion which, before computing the edit distance, quickly filters pairs of chunks
that do not match by means of the chunk filtering techniques described in
the previous sections (for more details see Chapter 1).
As shown in Section 3.1, finding the similarities between the different
chunks might not be sufficient to correctly compute the resemblance measure
Sim(Di , Dj ) between Di and Dj . Indeed, a chunk in document Di could be
similar to more than one chunk in document Dj . Those chunks in Dj could,
in turn, have some similarities with other chunks of Di , and so on, making
the problem quite complex.
Given a similarity measure, say the symmetric one introduced in Def. 3.1,
the goal of Chunk Mapping Optimization is to find the best mapping between
the chunks in the two documents, that is the permutation p_m maximizing
Eq. 3.1. Instead of generating all possible permutations and testing them, we
express the problem in the following form: We find the maximum of the function

\frac{\sum_{k=1}^{m} \sum_{h=1}^{n} \left( |c_i^h| + |c_j^k| \right) \cdot sim(c_i^h, c_j^k) \cdot x_{h,k}}{\sum_{k=1}^{n} |c_i^k| + \sum_{k=1}^{m} |c_j^k|}

where x_{h,k} is a boolean variable stating whether the pair of chunks (c_i^h, c_j^k) actually
participates in the similarity computation, thus assuming the values

x_{h,k} = \begin{cases} 1 & \text{if chunk } c_i^h \text{ is associated with chunk } c_j^k \\ 0 & \text{otherwise} \end{cases}

under the following constraints:

\forall k \;\; \sum_{h} x_{h,k} \le 1 \qquad\qquad \forall h \;\; \sum_{k} x_{h,k} \le 1

that is, a chunk in a document is coupled with at most one chunk in the other
document (see Figure 3.1.a).
With those premises, the resemblance measure computation becomes a
matter of solving an integer linear programming (ILP) problem, for which a standard
ILP package can be employed. The complexity of algorithms solving
the linear programming problem has generally been high; however, there exist
worst-case polynomial-time algorithms, such as [74], which make the costs of
this type of calculation feasible. Further, in our case, the computation of the
resemblance measure requires the intervention of ILP algorithms only when a
pair of chunks c_i^h and c_j^k exists such that c_i^h has similarity greater than 0 with
more than one chunk in document D_j, also including c_j^k, and vice-versa. In
all the other cases, which are the majority, the mapping between chunks can
be straightforwardly computed in linear time w.r.t. the number of matching
chunk pairs. Obviously, the more pairs of similar chunks there are, the higher
the probability that the intervention of the ILP package is required. In
our experiments, we set the edit distance threshold t to 0.4 or lower values,
that is, we only consider pairs of chunks whose similarity score is at least 0.6.
We found out that such a setting has no impact on the effectiveness of the
measure (see Section 3.4) and improves system efficiency,
since it implies that the ILP package is activated only a few times.
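Since each chunk may be coupled with at most one chunk of the other document, the maximization above is an assignment problem. The sketch below, given as an illustration only, solves it with SciPy's generic maximum-weight matching routine rather than the ILP package used in the prototype; the optimum is the same because zero-similarity pairs contribute nothing to the objective.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_mapping_score(len_i, len_j, sim):
    """Resemblance of Def. 3.1: the maximum of Eq. 3.1 over all chunk mappings.

    len_i, len_j : chunk lengths (in words) of D_i and D_j
    sim          : sim[h][k] = sim(c_i^h, c_j^k), zero for filtered-out pairs
    """
    weights = np.array([[(len_i[h] + len_j[k]) * sim[h][k]
                         for k in range(len(len_j))]
                        for h in range(len(len_i))])
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return weights[rows, cols].sum() / (sum(len_i) + sum(len_j))
```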
Finally, let us go back to the two kinds of content-based document analysis
techniques we depicted in Section 3.2. Whenever a range query w.r.t. a query
document Q is submitted to DANCER, the Similarity Computation module
computes the similarities between Q, i.e. V = {Q}, and the data documents
contained in W . Clustering, instead, is performed on a document collection,
i.e. V = W where V represents the documents to be clustered.

3.4.2 Document generator


To test the performance more in depth, we developed a document generator
tool, designed to automatically produce random variations out of an initial
set of seed documents. The algorithm takes one seed document at a time
and a set of “variation” documents, from which we extract the new material
needed to modify the seed documents.
The document generator is fully parameterizable and works by randomly
applying modifications to the different parts of each seed document. The
types of modifications are deletions, swaps, insertions, and substitutions,
and they involve paragraphs, sentences, and words. The frequencies of modifications
at the three different levels of the document structure are specified
by as many parameters: parFreq, sentFreq, and wordFreq. The gener-
ator algorithm is based on independent streams of random numbers which
determine the type of modification to operate and, in case of insertions or
substitutions, the elements of the variation documents to be used. Finally,
an additional parameter specifies the number of documents (variations) to
generate out of each seed document.
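A much simplified sketch of the generator's sentence-level loop is shown below; the probability model, the operator set and all names are illustrative assumptions that mirror the description above rather than the actual tool.

```python
import random

def perturb_sentences(sentences, variation_pool, sent_freq):
    """Randomly modify roughly one sentence every sent_freq sentences.

    sentences      : the seed document as a list of sentences
    variation_pool : sentences taken from the "variation" documents
    sent_freq      : e.g. 6 means one modification every 6 sentences on average
    """
    result = list(sentences)
    for i in range(len(result)):
        if random.random() >= 1.0 / sent_freq:
            continue                                   # leave this sentence untouched
        op = random.choice(("delete", "swap", "insert", "substitute"))
        if op == "delete":
            result[i] = ""
        elif op == "swap" and i + 1 < len(result):
            result[i], result[i + 1] = result[i + 1], result[i]
        elif op == "insert":
            result[i] = result[i] + " " + random.choice(variation_pool)
        else:                                          # substitute
            result[i] = random.choice(variation_pool)
    return [s for s in result if s]
```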

3.4.3 Document collections


As stated in [22], in the document similarity / copy detection field there are no
standard reference collections or test beds which are universally known and
accepted. For our experiments we created / collected different document sets,
coming from various contexts and studied to stress the different features and
capabilities of our approach in various scenarios. Though they are not intended
to be a reference or to be comprehensive w.r.t. every possible situation, the
document sets we use allow us to analyze in detail the effectiveness in copy
(and related) detection and in violation detection scenarios, as well as the efficiency of the
different matching techniques together with the impact of the proposed data
reduction approaches.
We report the experiments performed on eight different data sets, five of
which are synthetic ones and three are real data sets. The synthetic ones
were created with our document generator:

• Times1000S, Times500S, Times100L, Times100S: collections of 1000 (short), 500 (short), 100 (long) and 100 (short) documents, containing 10 variations (including the seed documents themselves) for each of the 100, 50, 10 and 10 seed documents, respectively, extracted from articles of the Times newspaper digital library;

• Cite100M: collection of 100 medium-sized documents, containing 10 variations (including the seed documents themselves) for each of the 10 seed documents extracted from various scientific articles coming from the Citeseer [80] digital library.

For the Times100S, Times100L and Cite100M collections, we simulate
a sufficiently varied but still realistic scenario of document changes by setting
the following document generator parameters: sentFreq as one modification every 6
sentences and wordFreq as one every 8 words. In the "biggest" collections,
i.e. Times1000S and Times500S, we experimented with a slightly more "aggressive"
setting (sentFreq as one modification every 4 sentences and wordFreq as one every
5 words).
Finally, to test the system behavior also with less inner-correlated doc-
uments, we decided to search for some real document sets. Finding sets
that could be significant for evaluating the effectiveness of the measures
was not a trivial task. The contents of such documents should guide
their grouping into distinct sets of (near) duplicates, therefore a random se-
lection of uncorrelated articles was clearly not the right choice. In order to
fulfill such requirements, we collected groups of documents starting from se-
lected articles and then exploiting the documents suggested as similar by the
Citeseer digital library search engine. In particular, we selected two sets of
real documents: CiteR25 and CiteR8. These are collections of 25 (long)
and 8 (long) scientific articles coming from the Citeseer digital library. For
the CiteR8 collection we started from the article “A Faster Algorithm for
Approximate String Matching” [9] and then recursively followed the “similar
on the sentence level” links to select 7 correlated documents. In CiteR25
we selected 5 initial documents and, by following the same technique, we
formed 5 groups of 5 correlated documents each. In this way, the obtained
collections of scientific articles are significant real data sets since there is a
certain level of repeatedness within each group while the inter-group cross
correlation is quite low.
Details about the different collections are depicted in Tables 3.1 and 3.2.
Notice that they involve a cross product of up to 1,000,000 (1000 × 1000)
document comparisons, with a number of chunk comparisons of nearly 330,000,000
different pairs.
            Times1000S   Times500S   Times100L   Times100S   Cite100M
# docs      1000         500         100         100         100
# sents     17231        8606        10833       2078        4045
# words     392478       219438      223171      44192       91273

Table 3.1: Specifications of the synthetic collections used in the tests

            CiteR25   CiteR8
# docs      25        8
# sents     8223      4086
# words     164935    81339

Table 3.2: Specifications of the real collections used in the tests

3.4.4 Test results


The system performance was tested both with respect to the effectiveness of
the resemblance and the other measures and the efficiency of the proposed
data reduction techniques. In particular, the effectiveness of the measures
was also tested at increasing levels of data reduction with the aim of evaluat-
ing the robustness and of finding a good tradeoff between the accuracy and
the efficiency of the system. Unless otherwise specified, tests were performed
with a chunk size of one sentence, a minimum chunk length of 3 words (i.e.
length threshold 3), and an edit distance threshold t between chunks ranging
from 0.2 to 0.4. All experiments were run on a Pentium 4 2.5 GHz Windows
XP Pro workstation, equipped with 512 MB RAM and a RAID0 cluster of two
80 GB EIDE disks.

Effectiveness tests

In this section, we present the experiments we performed to evaluate the


effectiveness of the similarity measures, and their robustness w.r.t. the dif-
ferent data reduction techniques described in Section 3.2; further, we present
an experimental analysis of the level of security offered by our scheme.
For our experiments, we inserted the documents of each collection in
the system and we queried the collection against itself in order to obtain
a duplication matrix containing a duplication score between each pair of
documents. Then, we analyzed the duplication matrix in different ways:

• direct graphic visualization of the duplication matrix values;

• numerical computation of average and standard deviation of the duplication scores.

Figure 3.4: Effectiveness of the resemblance measure for Times100L

Most of the tests we present involve the synthetic collections. As test-beds
for duplicate detection are lacking, the use of synthetic collections makes
the effectiveness evaluation more objective, since the generation process we
adopted, together with its settings, provides a clear indication of the expected
results. Let us start from the Times collections, in particular from
Times100L, containing 10 groups of 10 (near) duplicates. We use a direct
visualization of the duplication matrix, by associating a particular shade of
color with each duplication value. In this way, each pixel represents the level
of duplication of one pair of documents, with a dark color associated with non-duplicate
pairs and a brighter one with more duplicate ones. This technique
allows us to show the quality of the computation at a glance, while preserving
most of the detail needed for an in-depth analysis. Figure 3.4 shows the
image generated with this technique for Times100L. Notice that, in order
to make the image instantly readable, all the documents generated from the
same seed document are inserted with consecutive indexes and, thus,
visualized in a specific zone of the plot. In Figure 3.4 the 10 groups of 10 (near)
duplicates are clearly visible, meaning that the system is able to distinguish
the 10 clusters we generated. The obtained results show the robustness of
the similarity measures w.r.t. text modifications. The level of security of our
duplicate detection technique is thus good: The similarity scores it computes
are proportional to the number of changes applied to the seed documents and
many changes need to be made so that the resulting documents are identified
as unrelated to the original documents.
The other aspect we considered is the robustness of the resemblance mea-
sure w.r.t. data reduction. To this end, we evaluated the decay of the effective-
ness by comparing the “original” duplication matrix of a reference collection
with those obtained by applying different data reduction settings.

Figure 3.5: Sampling and length-rank selection experiments for Times100L: (a) sampling 0.5; (b) sampling 0.3; (c) sampling 0.1; (d) length-rank selection 0.5; (e) length-rank selection 0.3; (f) length-rank selection 0.1

Obviously, we did not perform tests with the filtering techniques, which have been proved
to be correct (see Subsection 3.2.1) and thus do not affect the accuracy
results of the length-rank chunk selection with the sampling approach used
in [22]. Sampling was implemented both through a fixed-step sampling al-
gorithm keeping every i-th chunk and with the sequential random sampling
algorithm by Vitter [135]. Since the two implementations produced approx-
imately the same quality level, we only present random sampling results.
Figure 3.5 compares the effect of the different ratios of sampling (first line)
and length-rank chunk selection (second line). For instance, the third column
represents the image matrixes obtained by applying a reduction ratio of 0.1,
i.e. in both cases we kept one every 10 chunks per document. Here, the ref-
erence collection is again Times100L and thus the decay of the effectiveness
can be evaluated by comparing the image matrixes of Figure 3.5 with the
image matrix of Figure 3.4. With sampling 0.5 and 0.3 the 10 groups are still
recognizable but the intra-group duplication scores are much reduced (pixels
are darker than the corresponding ones in Figure 3.4). With sampling 0.1
the quality of the results is quite poor. The quality of the length-rank chunk
selection is, instead, very good and even at the highest selection level (0.1)
the similarity scores are much better than any sampling 0.5.

Figure 3.6: Chunk clustering experiments for Times100L: (a) C-B 0.5; (b) C-B 0.3; (c) C-B 0.1; (d) A-L C-B 0.5; (e) A-L C-B 0.3; (f) A-L C-B 0.1

Figure 3.6 shows the results of the experiments we conducted with the
chunk clustering technique at three different levels of reduction ratio, where
duplication scores were computed using the cluster-based (C-B) and the
average-length cluster-based (A-L C-B) functions (see Section 3.2.2). Since
the inner repeatedness of the collection is quite low, the results are not as
good as for length-rank chunk selection. Notice that the A-L C-B resemblance
measure (see Eq. 3.9) produces "smooth" results with respect to the
different reduction settings: Results have an acceptable quality up to a 0.2-0.1
reduction ratio. On the other hand, the C-B resemblance measure (see
Eq. 3.8) behaves almost in a crisp way: the resemblance results near the
extremes, 0 and 1, degrade very slowly at the different reduction ratios and
several document pairs are recognized with a full score even at 0.1.
The results of the effectiveness experiments we conducted for inter-document
reduction are depicted in Figure 3.7. The collections tested are
Times100S and Times500S. The reduction ratio was set to 0.2, that is, 20 and
100 document bubbles were generated for Times100S and Times500S, respectively.

Figure 3.7: Document bubble experiments for Times100S and Times500S: (a) Times100S, bubbles 0.2; (b) Times500S, bubbles 0.2

We kept the document bubbles ordered as much as possible like the
original documents. In this way, in comparison with the original duplication
matrix, a good effectiveness with document bubbles would be represented
by a lower resolution duplication matrix maintaining the original pattern of
squares. Clearly, a pixel in the matrix represents the resemblance score be-
tween a pair of document bubbles. As shown in Figure 3.7, the resemblance
measure is robust also w.r.t. the document bubble technique.

Figure 3.8: Chunk size test results for Cite100M: (a) sentences as chunks; (b) paragraphs as chunks

We also performed tests on the impact of the chunking unit on the quality
of the obtained results (Figure 3.8). The results concern Cite100M; other
synthetic collections behaved similarly. The tests shown so far have been
performed with a chunk size of one sentence (Figure 3.8a). By choosing a
bigger chunk size (i.e. a paragraph, Figure 3.8b), the similarity scores were
noticeably lower, but the similar document groups were still perfectly identifiable.
Such results are due to the fact that our comparison scheme, through the
chunk similarity measure, succeeds in correlating the information conveyed
by the chunks.

                          Sim                                       aSim
Group        Affinity              Noise                 Affinity              Noise
1            0.36496 ± 0.39617     0.00012 ± 0.00038     0.37176 ± 0.40008     0.00011 ± 0.00037
2            0.28984 ± 0.40268     0.00013 ± 0.00053     0.29572 ± 0.40308     0.00009 ± 0.00040
3            0.31576 ± 0.39189     0.00030 ± 0.00066     0.32352 ± 0.39356     0.00045 ± 0.00074
4            0.32384 ± 0.40992     0.00024 ± 0.00067     0.32580 ± 0.41103     0.00035 ± 0.00109
5            0.50104 ± 0.32719     0.00015 ± 0.00041     0.50332 ± 0.32709     0.00011 ± 0.00035
Total Avg.   0.35909 ± 0.38557     0.00019 ± 0.00053     0.36402 ± 0.38697     0.00022 ± 0.00059

Table 3.3: Affinity and noise values for the CiteR25 collection

Figure 3.9: Chunk size and length-rank test results in CiteR25: (a) chunks = sentences, no reduction; (b) paragraphs as chunks; (c) length-rank selection 0.3
In order to test the effectiveness of the resemblance measure in deter-
mining duplicate documents also in real heterogeneous document sets, we
performed “ad-hoc” tests on the CiteR25 and CiteR8 collections. First, we
inserted into the system the 5 groups of 5 documents of CiteR25 and then
computed the affinity and noise values for the symmetric and asymmetric
similarity measures. Affinity represents how close documents in the same
group are in terms of duplication. The affinity value a ± d reported in each
line of Table 3.3 is the average a and the standard deviation d over the intra-
group document comparisons. Noise represents undesired matches between
documents belonging to different groups and the value reported in each line
of Table 3.3 is the average and the standard deviation over the comparisons
between the documents of the group w.r.t. the other groups. Notice the
extremely low noise and the high affinity values, which are three orders of
magnitude higher than the noise ones. This once again confirms the good-
ness of our resemblance measures, even for real and therefore not extremely
correlated document sets. The image matrix of CiteR25 is depicted in Figure
3.9a, whereas Figure 3.9b shows the effect of the variation of the chunking
unit, and Figure 3.9c that of the length-rank chunk selection.

Citeseer
          Doc. 1   Doc. 2   Doc. 3   Doc. 4   Doc. 5   Doc. 6   Doc. 7   Doc. 8
Doc. 1    -        31.5%    18.0%    -        -        -        -        -
Doc. 2    41.1%    -        47.6%    -        -        -        -        -
Doc. 3    -        -        -        5.2%     5.2%     -        -        -
Doc. 4    -        -        41.6%    -        -        9.2%     -        -
Doc. 5    -        -        19.7%    -        -        -        -        -
Doc. 6    -        -        20.7%    8.8%     -        -        20.5%    13.8%
Doc. 7    -        -        14.3%    -        -        35.4%    -        68.1%
Doc. 8    -        -        12.0%    -        -        27.4%    59.4%    -

DANCER
          Doc. 1   Doc. 2   Doc. 3   Doc. 4      Doc. 5     Doc. 6     Doc. 7     Doc. 8
Doc. 1    100.0%   44.3%    25.3%    3.4%        3.7%       3.4%       2.6%       1.6%
Doc. 2    57.3%    100.0%   46.2%    3.2%        3.7%       3.1%       2.6%       1.9%
Doc. 3    2.0%     2.9%     100.0%   8.3%        7.1%       5.5%(A)    3.0%       2.3%
Doc. 4    1.8%     1.3%     54.6%    100.0%      5.9%(B)    18.2%      8.8%(C)    7.2%(D)
Doc. 5    0.9%     0.7%     22.7%    2.8%        100.0%     2.2%       4.1%       0.9%
Doc. 6    1.8%     1.2%     36.2%    18.5%       4.5%       100.0%     31.0%      24.1%
Doc. 7    2.3%     1.8%     32.0%    14.0%(E)    4.1%       48.6%      100.0%     66.7%
Doc. 8    1.8%     1.7%     30.6%    14.5%(F)    3.9%       47.8%      83.8%      100.0%

Table 3.4: Correlation discovery: a comparison between the Citeseer search engine and DANCER

As to the
first aspect (Figure 3.9b), the obtained results are comparable to the ones
obtained in the synthetic tests. As to the second aspect, notice that the qual-
ity of the computation on the reduced data is even better than the original
one, as noise decreases and affinity increases. This is due to the fact that
length-rank chunk selection stresses the similarities between large portions
of the involved documents while small chunks are excluded from the com-
putation. In the Citeseer real data sets, small chunks often correspond to
short sentences, such as "The figure shows the results", "Technical
report", "Experimental evaluation", or even the authors' affiliations, which are
quite common in scientific articles, so that even unrelated documents might
both contain them. They are thus a source of noise that is left out by the
length-rank chunk selection.
The second test on real collections involved CiteR8 and was devised to
stress the DANCER capabilities in discovering inter-correlations between sets
of scientific articles coming from a digital library. Such correlations are quite
important in order to classify the documents and to show to the user a web
of links involving articles similar to the one being accessed.

Figure 3.10: Web of duplicate documents

In fact, the 8 documents were selected by starting from the article [9] and navigating the
"similar on the sentence level" links proposed by the Citeseer digital library
search engine. They correspond to variations of the original article and to
articles on the same subject from the same authors, and include extended
versions, extended abstracts and surveys. In order to provide the links, the
Citeseer engine adopts a sentence-based approach to detect the similarities
between the documents: It maintains the collection of all sentences of all the
stored documents and then computes the percentage of identical sentences
between all documents. The percentages computed by Citeseer between the
CiteR8 documents are shown in the upper part of Table 3.4 (Doc. 1 is the
starting document). Notice that the matrix is asymmetric as each value rep-
resents the percentage of the sentences that the document in the row shares
with the document in the column. For instance, Doc. 1 is linked to Doc. 2
(with which it shares 31.5% of its sentences) and Doc. 3 (18.0%). Further,
notice that the resulting correlation matrix is quite sparse.
Now, consider the corresponding matrix produced by DANCER (lower part
of Table 3.4). To make the resulting values directly comparable with the
ones from Citeseer, we used the containment measure (see Subsection 3.1.3).
Notice that the output given by our approach is much more detailed, provid-
ing more precise similarities between all the available document pairs. For
instance, Citeseer does not provide a link from Doc.1 to Doc.4, whereas our
approach reveals that the first shares 3.4% of its contents with the latter.
Even by considering a 5.0% of minimum similarity threshold, that could be
employed by Citeseer not to flood the user with low similarity links, there
still remains a good number of links that have to be discovered in order to
consider the similarity computation a complete and correct one. Such signif-
icant links, which are not provided by Citeseer, are marked in Table 3.4 with
capital letters A to F and are also graphically represented in Figure 3.10.
Their significance can not be ignored, since they involve similarities even up

Type of test    Amount          Similarity range (DANCER)                  Similarity range (others)
                                Resemblance   Containment   Overlap        Crisp, stem   Crisp
Exact copy - 100%(A) 100% 1924-2705 100%(A) 100%
Partial copy 50% subset 69%-78%(B) 100%(B) 1090-1254 69%-78%(B) 69%-78%
25% subset 35%-51% 100% 460-595 35%-51% 35%-51%
10% subset 14%-27% 100% 211-267 14%-27% 14%-27%
5% subset 6%-13%(C) 100%(C) 102-132 6%-13%(C) 6%-13%
2 paragraphs 6%-11% 5%-12% 202-285 6%-11% 6%-11%
1 paragraph 3%-5%(D) 2%-7% 110-134(D) 3%-5%(D) 3%-5%
Plagiarism Low(S) 88%-97%(E) 86%-98% 422-793 88%-97%(E) 88%-97%(E)
Med(S) 65%-73% 61%-77% 88-179 65%-73% 65%-73%
High(S) 26%-38% 23%-39% 50-99 26%-38% 26%-38%
Low(S) Low(W) 84%-92%(F ) 81%-94% 380-764 19%-30%(F ) 5%-13%(F )
Low(S) Med(W) 41%-48% 39%-49% 64-148 2%-6% 1%
Low(S) High(W) 12%-18%(G) 11%-20% 35-47 1%-2%(G) 0%(G)
Med(S) Low(W) 61%-69% 60%-71% 290-383 12%-27% 3%-8%
Med(S) Med(W) 33%-39%(H) 31%-40% 42-125 2%-3%(H) 1%(H)
Med(S) High(W) 8%-13% 7%-15% 18-29 1%-2% 0%
Genuine - 1%-12%(I) 1%-14% 5-25 1%-3%(I) 1%(I)
(related)

Table 3.5: Results of violation detection tests

to 14.5%.
Finally, in order to specifically test the level of security provided by our
approach, we performed an in-depth violation detection analysis, whose re-
sults are shown in Table 3.5. The tests consisted in verifying the output (i.e.
the similarity scores) of our approach obtained by simulating well-defined
expected-case violation scenarios: Exact copy, several levels of partial copy,
several levels and types of plagiarism. Each row of data in the table sum-
marizes the results of 20 similarity tests performed on a specific scenario; for
this reason each cell of the results is represented by a similarity range, show-
ing the lowest and the highest computed similarity score. The table shows
all the three DANCER similarity measures (resemblance, containment and
maximum contiguous overlap) defined in Section 3.1. For the two asym-
metric measures, the presented values refer to the violating document w.r.t.
the original one. The overlap measure is expressed in number of (stemmed
and stopword-filtered) words. Further, we computed and compared the re-
semblance similarities that would be delivered by other approaches based on
crisp chunk similarities, such as [22, 33], with and without syntactic filtra-
tions (such as stemming): Such values are shown on the two last columns of
the table. For each test a document of the Times collection was chosen as
the “original” document and its similarity was measured w.r.t. a “violating”
document, derived from the original by applying various modifications. By
choosing specific similarity thresholds, such as 15%-20% for resemblance and
containment, and 100 words for overlap, such values can be used to help the
human judgement in the various violations, even well substituting and ap-
94 Approximate matching for duplicate document detection

proximating it where highly effective measures such as the ones we described


are employed.
For the exact copy test the violating document is exactly the same as
the original one. The computed resemblance is 100% clearly reporting the
violation, as expected, both for DANCER and for standard crisp approaches
(values A in Table 3.5). As to partial copy, we simulated the “subset” sce-
nario, in which only a portion of the original document is kept in order to
obtain a smaller document, and the even more typical scenario in which a
portion (for instance, a whole paragraph or two) of the original document
is copied and inserted in the context of a new document. When the copied
document is a big subset of the original one (50%, 25%) the resemblance
measure can still be a good indicator, even in the crisp approach (values
B). However, when the copied document represents a small portion of the
original one (10%, 5%), or only one or two small paragraphs are copied, such
violations are hard to detect with the standard symmetric measure, while the
DANCER containment and overlap measures are clearly able to report them:
See, for instance, values (C), where containment is maximum, and (D), where
the detected overlap is more than 100 words and should not pass unnoticed.
As to the plagiarism test we performed the analysis by synthetically gen-
erating violating documents at different plagiarism levels, even very subtle
ones: In fact, it often happens that only very small parts (i.e., sparse sen-
tences) are copied from an original document and then slightly modified, for
instance by changing a word or two. In order to test the DANCER effec-
tiveness also against such subtle but very frequent behaviors, we exploited our
document generator and, by varying the values of the sentFreq and wordFreq
parameters, produced several “gradual” variations of the original documents.
The variations include three levels of modifications on the sentence level (de-
noted on the table with Low(S), Med(S), and High(S)), involving deletions,
insertions, swaps or substitutions of whole sentences, and three levels of mod-
ifications on the words of a sentence (Low(W), Med(W), High(W)). When
the modifications only affect sentences as a whole, the resemblance measures
detected by DANCER, as well as the crisp ones, clearly identify the viola-
tions (see, for instance, values E), even for high amounts of modifications.
However, when the modifications also affect some of the words of the sen-
tences, the DANCER results are very different from the crisp ones. Even
for very low amounts of modifications on the words (for instance, one in
ten) the crisp measures significantly drop and no longer truthfully represent
the gravity of the violation (see values F). This is even more true for the
crisp measure applied with no stemming. In the case of medium and high
amounts of modifications (for instance values G and H) the plagiarism is less
accentuated but still present: Here the crisp approaches clearly show an un-

Figure 3.11: Results of the runtime tests with no data reduction – total time
in seconds, split into chunk matching and document similarity computation,
for the Times100S, Cite100M, CiteR8, CiteR25, Times500S, Times100L and
Times1000S collections

acceptable security level, showing similarities well below threshold and thus
producing false negatives.
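The “gradual” variations used in these tests can be pictured with a small sketch of the kind of perturbations applied by the generator; the sent_freq and word_freq parameters below are only an assumed reading of the sentFreq and wordFreq parameters mentioned above (roughly one modification every sent_freq sentences and every word_freq words), not the generator's actual interface.

import random

def gradual_variation(sentences, sent_freq, word_freq, seed=42):
    """Toy sketch: roughly one sentence-level edit (deletion, swap or duplication)
    every sent_freq sentences and one word substitution every word_freq words.
    The parameter semantics are assumed, for illustration only."""
    rng = random.Random(seed)
    out = list(sentences)
    for i in range(len(out) - 1, -1, -1):          # sentence-level modifications
        if rng.randrange(sent_freq) == 0:
            op = rng.choice(["delete", "swap", "duplicate"])
            if op == "delete":
                out.pop(i)
            elif op == "swap" and i > 0:
                out[i - 1], out[i] = out[i], out[i - 1]
            else:
                out.insert(i, out[i])
    for i, sentence in enumerate(out):             # word-level modifications
        words = sentence.split()
        for j in range(len(words)):
            if rng.randrange(word_freq) == 0:
                words[j] = words[j] + "_mod"       # stand-in for a word substitution
        out[i] = " ".join(words)
    return out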
The last row of the table shows the resulting similarities in a “genuine”
document scenario: Here the analyzed document pairs are extracted from
related documents, i.e. documents involving the same topics but with no
common or plagiarized parts. As expected, since the DANCER approach is
much more sensitive, the obtained similarities are higher than those of the crisp
approach; however, they are still low and below typical violation thresholds.
Thus, the percentage of false positives delivered by our approach is still kept
very low, similarly to crisp techniques and differently from other approaches
which are able to identify subtle violations but at the cost of very high per-
centages of false positives, such as [121].

Efficiency tests
As far as the efficiency of the proposed techniques is concerned, we measured
the system runtime performance in performing the resemblance computations
by analyzing the impact of the data reduction techniques.
The computing time required to query each collection against itself when
only chunk filters are applied is shown in Figure 3.11. The graph makes
a distinction between the time required for each of the two phases, that is
chunk matching and similarity scores’ computation. The similarity score
computation time represents the time required to compute the similarity
scores for each pair of documents. The overall time can be significantly
decreased by applying the data reduction techniques we devised.
As to the document filter, setting the similarity threshold s to a value

Figure 3.12: Efficiency test results for intra-document data reduction – total
time in seconds at reduction ratios from 1.0 down to 0.1 for the sampling,
clustering and length-rank (LenRank) techniques: (a) Times100L, (b) Cite100M

greater than 0, even at a minimal setting as 0.1, halves the total required
time. Such an improvement is due to the ability of the filter to quickly
discard pairs of documents which cannot match and to ensure few false pos-
itives thus limiting the number of useless comparisons. For instance, for
Times100L the cross product computation requires 100*100/2 comparisons
(resemblance measure is symmetric). By setting the similarity threshold at
0.1, the document filter leaves out the 91% of the document pairs while the
worth surviving document pairs on which we compute the similarity scores
are 450. From such a computation, we found out that all the pairs contained
in the candidate set are actually similar enough, that is the similarity score
is at least 0.1. In this case we have no false positives and, thus, the best
filter performance. The same applies to threshold values up to 0.6. As the
threshold grows over 0.6, the number of positive document pairs obviously
decrease (up to 0 when the threshold is 0.9) and the filter leaves out more
document pairs. The worst case occurs when the threshold is 0.7 where we
have a candidate set of 378 document pairs containing 5 false positives. Such
a behavior is mainly due to the fact that documents having a certain num-
ber of similar chunks are often similar documents. In most of these cases,
the mapping between chunks can be straightforwardly computed without
requiring the intervention of the ILP package (see Subsection 3.4.1).
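One simple way to see why such a filter can discard pairs so cheaply is the following size-based bound, shown here purely as a generic Python illustration (it is not necessarily the document filter of Section 3.2): the resemblance of two chunk sets can never exceed the ratio between the smaller and the larger set size, so pairs whose bound already falls below the threshold s need no further processing.

def resemblance_upper_bound(size_a, size_b):
    """|A intersect B| / |A union B| can never exceed min(|A|,|B|) / max(|A|,|B|):
    the intersection is at most the smaller set, the union at least the larger."""
    return min(size_a, size_b) / max(size_a, size_b)

def candidate_pairs(chunk_counts, s):
    """Keep only the document pairs whose resemblance could still reach s."""
    docs = list(chunk_counts)
    return [(a, b) for i, a in enumerate(docs) for b in docs[i + 1:]
            if resemblance_upper_bound(chunk_counts[a], chunk_counts[b]) >= s]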
Moreover, the intra-document reduction techniques provide us with fur-
ther ways to enhance the system performance, as shown in Figure 3.12, where
we compare the time required to compute the resemblance matrix employing
the three intra-document reduction techniques on Times100L and Cite100M
at different values of reduction ratios, starting from 1.0 which means no re-
duction. Using length-rank chunk selection at 0.1, for instance, allows us to

Figure 3.13: Further efficiency test results (time in seconds): (a) the document
bubble technique, (b) different chunk sizes (sentences vs. paragraphs) on Cite100M

reduce the total computing time by a factor of up to 1:9, without compro-


mising the good quality of the results (see the previous section). In general,
our intra-document reduction techniques offer a good trade-off between ef-
fectiveness and efficiency, allowing remarkable time reductions, comparable
to the sampling approach, but with much higher quality results.
Further efficiency test results are on inter-document data reduction. Doc-
ument bubbles bring a very substantial time reduction (Figure 3.13a),
for example a setting of 0.2 is more than 20 times faster than the stan-
dard computation. Moreover, Figure 3.13b shows the impact of chunk size:
Changing the default size of a chunk from a sentence to an entire paragraph
reduces the number of total chunks and therefore produces a significant speed
up, but at the price of a reduced quality of the similarity scores.
Notice that we showed the runtime impact of the various data reduction
techniques separately but it is possible to combine them to achieve even faster
computations.
As a final remark, the pre-processing time of the various data reduction
approaches (i.e. clustering and document bubbles construction) is not shown
because it is not part of the computation itself, but it is an initial prepa-
ration phase that has to be done just once. Notice, however, that simpler
techniques like length-rank chunk selection are almost immediate, while the
most complex ones, such as chunk clustering and document bubbles, require a
larger amount of time. For example, from our tests, document bubbles con-
struction required approximately the same time as a complete resemblance
computation on the same data set. As we have already stated in Section 3.3,
it could be reduced by adopting the approach proposed in [142]. Further,
notice that such construction techniques could be modified in order to make
them dynamic with respect to the addition of new documents to the database.


Part II

Pattern Matching
for XML Documents
Chapter 4

Query processing for XML databases

With the rapidly increasing popularity of XML for data representation, there
is a lot of interest in query processing over data that conform to the labelled-
tree data model. The idea behind evaluating tree pattern queries, sometimes
called the twig queries, is to find all existing ways of embedding the pattern in
the data. From the formal point of view, three main types of pattern match-
ing exist: One involving paths and two involving trees. XML data objects
can be seen as ordered labelled trees, so the problem can be characterized
as the ordered tree pattern matching, of which the path pattern matching can
be seen as a particular case. Though there are certainly situations where
the ordered tree pattern matching perfectly reflects the information needs
of users, there are many others that would prefer to consider query trees as
unordered. For example, when searching for a twig of the element person
with the subelements first name and last name (possibly with specific val-
ues), ordered matching would not consider the case where the order of the
first name and the last name is reversed. However, this could exactly be
the person we are searching for. The way to solve this problem is to con-
sider the query twig as an unordered tree in which each node has a label
and where only the ancestor-descendant relationships are important – the
preceding-following relationships are unimportant. This is called unordered
tree pattern matching.
In general, since XML data collections can be very large, efficient evalu-
ation techniques for all types of tree pattern matching are needed. A naive
approach to solve the problem is to first decompose complex relationships into
binary structural relationships (parent-child or ancestor-descendant) between
pairs of nodes, then match each of the binary relationships against the XML
database, and finally combine the results of those basic matches.

The main disadvantage of such a decomposition-based approach is that in-


termediate result sizes can become very large, even for quite small search re-
sults. Another disadvantage is that users must wait long to get (even partial)
results. In order to avoid such problems, the holistic twig join approach was
proposed [25]. In order to compactly represent partial results of individual
query root-to-leaf paths, a chain of linked stacks is used as a structure. Then
the algorithm merges the sorted lists of participating element sets together
and in this way avoids creating large intermediate results. Such an approach
was further improved by using additional index structures on element sets to
quickly locate the first match for a sub-twig pattern, see for example [32, 72].
Another interesting way to support tree data processing tries to make the
relational system aware of tree-structured data [63, 64]. With local changes
to the traditional (relational) database kernel, the system is easily able to
identify a subtree size, intersection of paths, inclusion or disjointness of sub-
trees, and, in general, improves the performance of XPath processing.
Most of the advanced solutions share two characteristics: (i) they are
based on a coding scheme, which retains structural relationships between
tree elements; (ii) they process the supporting data structures in a sequen-
tial way, skipping areas of obviously no query answers, whenever possible.
The generic trends aiming at efficiency of matching call for compact struc-
tures, stored possibly in main memory, and at evaluation algorithms able to
provide (partial) solutions as soon as possible. The obvious key to success
is to skip as much of the underlying data as possible, and at the same time
never go back in the processed sequence. This is fundamental for the
important segment of XML applications processing data streams [76]. How-
ever, some data must be retained in special structures to make the reasoning
possible. In order to build correct and efficient twig-matching algorithms,
strong theoretical bases must be established so that the skipped area can be
maximized with the minimum amount of retained data.
In this chapter we deal with the three problems of pattern matching
(path, ordered and unordered twig matching) by exploiting the tree signa-
ture approach [137], which has originally been proposed for the ordered tree
matching. By taking advantage of the tree signature properties (Section 4.1),
and in particular of the involved pre/post order coding scheme and of its se-
quential nature, we first characterize the pattern matching problems from a
pre- and post-order point of view (Section 4.1) and then we show:

• that the pre/post-order ranks are sufficient to define a complete set
of conditions under which a data node accessed at a given step is no
longer necessary for the subsequent generation of a matching solution
(Section 4.2);

• the properties of the twig matching solutions, generated at each step
of the scanning (Section 4.2);

• how to take advantage of the indexes built on the content of the
document nodes (Section 4.3);

• how the discovered conditions and properties can be used to write pat-
tern matching algorithms that are correct and which, from a numbering
scheme point of view, cannot be further improved (Section 4.4 presents
a summary of their main features).

All the twig matching algorithms have been implemented in the XSiter (XML
SIgnaTure Enhanced Retrieval) system, a native and extensible XML query
processor providing very high querying performance in general XML query-
ing settings (see Section 4.6 for an overview of the system architecture and
features). A detailed description of the complete pattern matching algo-
rithms is available in Appendix B. Further, we also consider an alternative
approach [138] specific for unordered tree matching, which in certain query-
ing scenarios can provide efficiency equal to (or even better than) that of
the “standard” algorithms (Section 4.5). Such an approach is based on
decomposition and structurally consistent joins. Finally, we provide exten-
sive experimental evaluation performed on real and synthetic data of all the
proposed algorithms (Section 4.7).

4.1 Tree signatures


The idea of tree signatures proposed in [137] is to maintain a small but
sufficient representation of the tree structures able to decide the ordered tree
inclusion problem for the XML data processing. As a coding schema, the
pre-order and post-order ranks [47] are used. In this way, tree structures are
linearized, and extended string processing algorithms are applied to identify
the tree inclusion.
An ordered tree T is a rooted tree in which the children of each node
are ordered. If a node v ∈ T has k children then the children are uniquely
identified, left to right, as i1 , i2 , . . . , ik . A labelled tree T associates a label
(name) tv ∈ Σ (the domain of tree node labels) with each node v ∈ T . If the
path from the root to v has length n, we say that the node v is on the level
n, i.e. level(v) = n. Finally, size(v) denotes the number of descendants
of v – the size of any leaf node is zero. In this section, we consider ordered
labelled trees.

Sample Data Tree (pre-order and post-order ranks in brackets): A (1,10) is the
root, with children B (2,5) and F (7,9); B has children C (3,3) and G (6,4);
C has children D (4,1) and E (5,2); F has the child H (8,8), whose children
are O (9,6) and P (10,7).

pre: A B C D E G F H O P
post: D E C G B O P H F A
rank: 1 2 3 4 5 6 7 8 9 10

Figure 4.1: Pre-order and post-order sequences of a tree

The pre-order and post-order sequences are ordered lists of all nodes of
a given tree T . In a pre-order sequence, a tree node v is traversed and
assigned its (increasing) pre-order rank, pre(v), before its children are recur-
sively traversed from left to right. In the post-order sequence, a tree node v
is traversed and assigned its (increasing) post-order rank, post(v), after its
children are recursively traversed from left to right. For illustration, see the
pre-order and post-order sequences of our sample tree in Figure 4.1 – the
node’s position in the sequence is its pre-order/post-order rank, respectively.
Pre- and post-order ranks are also indicated in brackets near the nodes.
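A minimal Python sketch of how both rankings can be produced with a single recursive traversal follows; the nested (label, children) representation of the tree is an assumption made only for this illustration.

def pre_post_ranks(root):
    """Assign pre-order and post-order ranks to every node of an ordered tree
    with a single recursive traversal. A node is a (label, children) pair and
    labels are assumed unique, as in the sample tree of Figure 4.1."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        label, children = node
        counters["pre"] += 1
        pre = counters["pre"]              # rank assigned before the children
        for child in children:
            visit(child)
        counters["post"] += 1              # rank assigned after the children
        ranks[label] = (pre, counters["post"])

    visit(root)
    return ranks

# the sample data tree of Figure 4.1
tree = ("A", [("B", [("C", [("D", []), ("E", [])]), ("G", [])]),
              ("F", [("H", [("O", []), ("P", [])])])])
print(pre_post_ranks(tree))
# e.g. A -> (1, 10), B -> (2, 5), H -> (8, 8), P -> (10, 7), as in the figure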
Given a node v ∈ T with pre(v) and post(v) ranks, the following proper-
ties are important towards our objectives:

• all nodes x with pre(x) < pre(v) are ancestors or preceding of v;

• all nodes x with pre(x) > pre(v) are descendants or following of v;

• all nodes x with post(x) < post(v) are descendants or preceding of v;

• all nodes x with post(x) > post(v) are ancestors or following of v;

• for any v ∈ T , we have pre(v) − post(v) + size(v) = level(v);

• if pre(v) = 1, v is the root; if pre(v) = n, v is a leaf. For all the other
neighboring nodes vi and vi+1 in the pre-order sequence, if post(vi+1 ) >
post(vi ), then vi is a leaf.

Figure 4.2: Properties of the pre-order and post-order ranks – in the pre/post
plane, the ancestor (A), descendant (D), preceding (P) and following (F) nodes
of a reference node v lie in the four regions around v

As proposed in [62], such properties can be summarized in a two dimen-


sional diagram. See Figure 4.2 for illustration, where the ancestor (A), de-
scendant (D), preceding (P), and following (F) nodes of v are strictly located
in the proper regions. Notice that in the pre-order sequence all descendant
nodes (if they exist) form a continuous sequence, which is constrained on the
left by the reference node and on the right by the first following node (or the
end of the sequence). The parent node of the reference is the ancestor with
the highest pre-order rank, i.e. the closest ancestor of the reference.

4.1.1 The signature


The tree signature is a list of entries for all nodes in ascending pre-order.
Besides the node name, each entry also contains the node’s position in the
post-order sequence.

Definition 4.1 Let T be an ordered labelled tree. The signature of T is


a sequence, sig(T ) = ⟨t1 , post(t1 ); t2 , post(t2 ); . . . tn , post(tn )⟩, of n = |T |
entries, where ti is a name of the node with pre(ti ) = i. The post(ti ) value
is the post-order value of the node named ti and the pre-order value i.

Observe that the index in the signature sequence is the node’s pre-order, so
the value actually serves two purposes. In the following, we use the term
pre-order when we mean the rank of the node; when we consider the position
of the node’s entry in the signature sequence, we use the term index. For ex-
ample, ⟨a, 10; b, 5; c, 3; d, 1; e, 2; g, 4; f, 9; h, 8; o, 6; p, 7⟩ is the signature of the
tree from Figure 4.1. The first signature element a is the tree root. Leaf
nodes in signatures are all nodes with post-order smaller than the post-order
of the following node in the signature sequence, that is nodes d, e, g, o – and
the last node (node p in our example), which is always a leaf. We can also deter-

mine the level of leaf nodes because size(i) = 0 for all leaves i, and thus
level(i) = i − post(i).

Extended Signatures

By extending entries of tree signatures with two pre-order numbers repre-


senting pointers to the first following, ff , and the first ancestor, fa, nodes,
the extended signatures are defined in [137]. The generic i-th entry of the
extended signature is ⟨ti , post(ti ), ffi , fai ⟩. Such a version of the tree sig-
natures makes it possible to compute the level of any node as level(i) = ffi −
post(i) − 1. The cardinality of the descendant node set can also be computed:
size(i) = ffi − i − 1. For the tree from Figure 4.1, the extended signature is:
sig(T ) = ⟨a, 10, 11, 0; b, 5, 7, 1; c, 3, 6, 2; d, 1, 5, 3; e, 2, 6, 3; g, 4, 7, 2; f, 9, 11, 1;
h, 8, 11, 7; o, 6, 10, 8; p, 7, 11, 8⟩.
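As a quick sanity check of these formulas, the following lines (Python used purely as a calculator) recompute levels and descendant counts from the extended signature just given, and verify them against the identity pre(v) − post(v) + size(v) = level(v) recalled earlier.

# extended signature of the tree in Figure 4.1, copied from the text above;
# each entry is (name, post, ff, fa) and the entry index is the node's pre-order
ext_sig = [("a", 10, 11, 0), ("b", 5, 7, 1), ("c", 3, 6, 2), ("d", 1, 5, 3),
           ("e", 2, 6, 3), ("g", 4, 7, 2), ("f", 9, 11, 1), ("h", 8, 11, 7),
           ("o", 6, 10, 8), ("p", 7, 11, 8)]

for pre, (name, post, ff, fa) in enumerate(ext_sig, start=1):
    level = ff - post - 1                      # level(i) = ff_i - post(i) - 1
    size = ff - pre - 1                        # size(i)  = ff_i - i - 1
    assert pre - post + size == level          # the identity recalled above
    print(name, "level:", level, "size:", size)
# e.g. node h: level = 11 - 8 - 1 = 2, size = 11 - 8 - 1 = 2 (its descendants o and p)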

Sub-Signatures

A sub-signature sub sigS (T ) is a specialized (restricted) view of T through


signatures, which retains the original hierarchical relationships of elements in
T . Specifically, sub sigS (T ) = ⟨ts1 , post(ts1 ); ts2 , post(ts2 ); . . . tsk , post(tsk )⟩ is
a sub-sequence of sig(T ), defined by the ordered set S = (s1 , s2 , . . . sk ) of
indexes (pre-order values) in sig(T ), such that 1 ≤ s1 < s2 < . . . < sk ≤ n.
Naturally, the set operations of the union and the intersection can be applied
on sub-signatures provided the sub-signatures are derived from the same
signature and the results are kept sorted. For example, consider two sub-
signatures of the signature representing the tree from Figure 4.1, defined by
ordered sets S1 = (2, 3, 4) and S2 = (2, 3, 5, 6). The union of S1 and S2 is the
set (2, 3, 4, 5, 6), that is the sub-signature representing the subtree rooted at
the node b of our sample tree.

4.1.2 Twig pattern inclusion evaluation


The problem of twig pattern inclusion evaluation on tree signatures can be
seen as a problem of finding all sub-signatures of a given data signature
matching with the twig pattern at node name level and satisfying some of
the relationships of parent-child (ancestor-descendant) and sibling between
the nodes.
In the following we will consider three kinds of twig pattern inclusion
evaluation: ordered tree inclusion, path inclusion, and unordered tree inclu-
sion.

Ordered tree inclusion evaluation


Let D and Q be ordered labelled trees. The tree Q is included in D, if
D contains all elements (nodes) of Q and when the sibling and ancestor
relationships of the nodes in D are the same as in Q. Using the concept
of signatures, we can formally define the ordered tree inclusion problem as
follows. Suppose the data tree D and the query tree Q specified by signatures

sig(D) = ⟨d1 , post(d1 ); d2 , post(d2 ); . . . dm , post(dm )⟩,

sig(Q) = ⟨q1 , post(q1 ); q2 , post(q2 ); . . . qn , post(qn )⟩.


Let sub sigS (D) be the sub-signature (i.e. a subsequence) of sig(D) induced
by a name sequence-inclusion of sig(Q) in sig(D) – a specific query signature
can determine zero or more data sub-signatures. Regarding the node names,
any sub sigS (D) ≡ sig(Q), because qi = dsi for all i, but the corresponding
entries can have different post-order values. The following lemma defines the
necessary constraints for qualification.

Lemma 4.1 The query tree Q is included in the data tree D in an or-
dered fashion if the following two conditions are satisfied: (1) on the level of
node names, sig(Q) is sequence-included in sig(D) determining sub sigS (D)
through the ordered set of indexes S = (s1 , s2 , . . . sn ) where s1 < . . . <
sn , (2) for all pairs of entries i and j in sig(Q) and sub sigS (D), i, j =
1, 2, . . . |Q| − 1 and i + j ≤ |Q|, whenever post(qi+j ) > post(qi ) it is also true
that post(dsi+j ) > post(dsi ) and whenever post(qi+j ) < post(qi ) it is also true
that post(dsi+j ) < post(dsi ).

Observe that Lemma 4.1 defines a weak inclusion of the query tree in the
data tree, in the sense that the parent-child relationships of the query are im-
plicitly reflected in the data tree as only the ancestor-descendant. However,
due to the properties of pre-order and post-order ranks, such constraints can
easily be strengthened, if required.

Example 4.1 For example, consider the data tree D in Figure 4.1 and the
query tree Q in Figure 4.3. Such a query qualifies in D, because sig(Q) =
⟨h, 3; o, 1; p, 2⟩ determines sub sigS (D) = ⟨h, 8; o, 6; p, 7⟩ through the ordered
set S = (8, 9, 10), because (1) q1 = d8 , q2 = d9 , and q3 = d10 , (2) the
post-order of node h is higher than the post-order of nodes o and p, and
the post-order of node o is smaller than the post-order of node p (both
in sig(Q) and sub sigS (D)). If we change in our query tree Q the label
h for f , we get sig(Q) = ⟨f, 3; o, 1; p, 2⟩. Such a modified query tree is

Sample Twig Query: H (1,3) with children O (2,1) and P (3,2)

Figure 4.3: Sample query tree Q

also included in D, because Lemma 4.1 does not insist on the strict parent-
child relationships, and implicitly considers all such relationships as ancestor-
descendant. However, the query tree with the root g, resulting in sig(Q) =
hg, 3; o, 1; p, 2i, does not qualify, even though it is also sequence-included (on
the level of names) as the sub-signature sub sigS (D) = hg, 4; o, 6; p, 7i|S =
(6, 9, 10). The reason is that the query requires the post-order to go down
from g to o (from 3 to 1) , while in the sub-signature it actually goes up
(from 4 to 6). That means that o is not a descendant node of g, as required
by the query, which can be verified in Figure 4.1. ¤
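The conditions of Lemma 4.1 can be checked mechanically; the brute-force Python sketch below enumerates every increasing index set that matches the query on node names and keeps those whose post-orders reproduce the post-order relationships of the query. It is exponential and serves only to make the definition and Example 4.1 concrete, not as one of the efficient algorithms discussed later.

from itertools import combinations

def ordered_inclusions(sig_d, sig_q):
    """Brute-force check of Lemma 4.1: enumerate every increasing index set that
    matches sig_q on node names and keep those whose post-orders reproduce, pair
    by pair, the post-order relationships of the query. Illustration only."""
    n = len(sig_q)
    answers = []
    for S in combinations(range(1, len(sig_d) + 1), n):       # s_1 < ... < s_n
        if any(sig_d[s - 1][0] != sig_q[i][0] for i, s in enumerate(S)):
            continue                                          # name-level mismatch
        if all((sig_q[j][1] > sig_q[i][1]) == (sig_d[S[j] - 1][1] > sig_d[S[i] - 1][1])
               for i in range(n) for j in range(i + 1, n)):
            answers.append(S)
    return answers

# data tree of Figure 4.1 and the query tree of Figure 4.3 (Example 4.1)
sig_d = [("a", 10), ("b", 5), ("c", 3), ("d", 1), ("e", 2),
         ("g", 4), ("f", 9), ("h", 8), ("o", 6), ("p", 7)]
print(ordered_inclusions(sig_d, [("h", 3), ("o", 1), ("p", 2)]))   # [(8, 9, 10)]
print(ordered_inclusions(sig_d, [("g", 3), ("o", 1), ("p", 2)]))   # [] - o is not a descendant of g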

Multiple nodes with common names may result in multiple tree inclu-
sions. As demonstrated in [137], the tree signatures can easily deal with
such situations just by simply distinguishing between node names and their
unique occurrences.

Path inclusion evaluation


The path inclusion evaluation is a special case of the ordered tree inclusion
evaluation as all the relationships between the nodes in any path P are of
parent-child (ancestor-descendant) type. Following the numbering scheme
of a path P signature sig(P ) = ⟨t1 , post(t1 ); . . . ; tn , post(tn )⟩, it means that
the post-order values of subsequent entries i and i + j (i = 1, 2, . . . n − 1 and
i + j ≤ n) satisfy the inequality post(ti ) > post(ti+j ). The lemma below
easily follows from the above observation and from the fact that inequalities
are transitive.

Lemma 4.2 A path P is included in the data tree D if the following two
conditions are satisfied: (1) on the level of node names, sig(P ) is
sequence-included in sig(D) determining sub sigS (D) through the ordered set
of indexes S = (s1 , . . . , sn ) where s1 < . . . < sn , (2) for each i ∈ [1, |P | − 1]:
post(dsi ) > post(dsi+1 ).

Unordered tree inclusion evaluation


Let Q and D be ordered labelled trees. An unordered tree inclusion of Q
in D is identified by a total mapping from nodes in Q to some nodes in
D, such that only the ancestor-descendant structural relationships between
nodes in Q are satisfied by the corresponding nodes in D. The unordered
tree inclusion evaluation essentially searches for a node mapping keeping
the ancestor-descendant relationships of the query nodes in the target data
nodes. Using the concept of signature, the query tree Q is included in the
data tree D in an unordered fashion if at least one qualifying index set exists.

Lemma 4.3 The query tree Q is included in the data tree D in an unordered
fashion if the following two conditions are satisfied: (1) on the level of node
names, an ordered set of indexes S = (s1 , s2 , . . . sn ) exists, 1 ≤ si ≤ m for
i = 1, . . . , n, such that dsi = qi , for i = 1, . . . , n, (2) for all pairs of entries
i and j, i, j = 1, 2, . . . |Q| − 1 and i + j ≤ |Q|, if post(qi+j ) < post(qi ) then
post(dsi+j ) < post(dsi ) ∧ si+j > si .

Notice that the index set S is ordered but, unlike the ordered inclusion of
Lemma 4.1, indexes are not necessarily in an increasing order. In other
words, an unordered tree inclusion does not necessarily imply the node-name
inclusion of the query signature in the data signature. Should the signature
sig(Q) of the query not be included on the level of node names in the signa-
ture sig(D) of the data, S would not determine the qualifying sub-signature
sub sigS (D). Anyway, as shown in [138], any S satisfying the properties
specified in Lemma 4.3 can always undergo a sorting process in order to de-
termine the corresponding sub-signature of sig(D) qualifying the unordered
tree inclusion of Q in D.
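Analogously to the ordered case, the conditions of Lemma 4.3 can be made concrete with a brute-force sketch; here the chosen index set is no longer required to be increasing, so all name-compatible combinations are enumerated. Again, this is only an exponential illustration of the definition, verified on the unordered twig of Figure 4.4.

from itertools import product

def unordered_inclusions(sig_d, sig_q):
    """Brute-force check of Lemma 4.3: choose, for every query node, any data
    position with the same name and keep the combinations in which every required
    post(q_j) < post(q_i) relationship is mirrored by post(d_sj) < post(d_si)
    together with s_j > s_i. Illustration only."""
    n = len(sig_q)
    domains = [[p for p, (name, _) in enumerate(sig_d, start=1) if name == q_name]
               for q_name, _ in sig_q]
    answers = []
    for S in product(*domains):
        if len(set(S)) < n:                    # a data node cannot play two roles
            continue
        if all(not (sig_q[j][1] < sig_q[i][1]) or
               (sig_d[S[j] - 1][1] < sig_d[S[i] - 1][1] and S[j] > S[i])
               for i in range(n) for j in range(i + 1, n)):
            answers.append(S)
    return answers

# data tree and unordered twig query of Figure 4.4
sig_d = [("book", 8), ("author", 3), ("firstName", 1), ("lastName", 2),
         ("title", 4), ("author", 7), ("lastName", 5), ("firstName", 6)]
sig_q = [("author", 3), ("firstName", 1), ("lastName", 2)]
print(unordered_inclusions(sig_d, sig_q))      # [(2, 3, 4), (6, 8, 7)]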
Figure 4.4 shows further examples of the three types of pattern matching.
In the shown trees, the nodes’ names are appended by their pre-order and
post-order ranks in brackets.

4.2 A formal account of twig pattern matching
Given a twig pattern Q and a data tree D, represented by the signatures

sig(Q) = ⟨q1 , post(q1 ); q2 , post(q2 ); . . . ; qn , post(qn )⟩

and

sig(D) = ⟨d1 , post(d1 ); d2 , post(d2 ); . . . ; dm , post(dm )⟩

Sample Data Tree: book (1,8) with children author (2,3), title (5,4) and
author (6,7); the first author has children firstName (3,1) and lastName (4,2),
the second author has children lastName (7,5) and firstName (8,6).

Path matching query: book (1,3) / author (2,2) / lastName (3,1).
Matches: ⟨1,2,4⟩, ⟨1,6,7⟩.

Ordered twig matching query: book (1,3) with children title (2,1) and
lastName (3,2). Matches: ⟨1,5,7⟩.

Unordered twig matching query: author (1,3) with children firstName (2,1)
and lastName (3,2). Matches: ⟨2,3,4⟩, ⟨6,8,7⟩.

Figure 4.4: An example of a data tree and pattern matching results

we denote with ansQ (D) the set of answers to the ordered inclusion of Q
in D and with U ansQ (D) the set of answers to the unordered inclusion of
Q in D. For the sake of brevity we will use the notation (U )ansQ (D) to
designate situations which apply to both the cases. Obviously, if Q is a path
P then ansP (D) = U ansP (D). In the previous section, we have shown the
properties an index set must satisfy so that it is an answer to the inclusion
of Q in D. In all the three cases, the matching on the level of node names
is required. Let Σi designate the set (domain) of all positions j in the data
signature where the query node name qi occurs (i.e. dj = qi ). The set of
answers (U )ansQ (D) is a subset of the Cartesian product of the domains
Σ1 × Σ2 × . . . × Σn determined by Lemma 4.1 or Lemma 4.3, respectively.
Obviously, if one of the Σi sets is empty, Q is not included in D, because the
Cartesian product is empty.
A naïve strategy to compute the desired Cartesian product subset is to
first compute the sets Σi , for i ∈ [1, n] and then discard from the Cartesian
product Σ1 × Σ2 × . . . × Σn the tuples the corresponding sub-signatures of
which do not satisfy the properties required by specific pattern matching con-
straints. The intrinsic limitation of this approach is twofold: it can produce
very large intermediate results and, even more important, such evaluation

Figure 4.5: Behavior of the domains during the scanning – query nodes 1, . . . , n
against the sequentially scanned data positions 1, . . . , m; the domain Σh collects
the positions, up to the current step k, where qh occurs

procedure does not exploit the sequential nature of tree signatures.


In the following sections we present algorithms that perform the twig pat-
tern matching by sequentially scanning the data signature. Our algorithms
exploit properties of the pre-order/post-order numbering scheme adopted in
the construction of tree signatures.
At each step j of the sequential scan of D, for j = 1, . . . , m, Σji de-
notes the set of all positions in the range from 1 up to j where qi occurs
and (U )ansjQ (D) denotes the set of answers to the (un)ordered inclusion
of the twig pattern Q in D computable from the domains Σj1 , Σj2 , . . . , Σjn .
Notice that the sets ansjQ (D) and U ansjQ (D) are subsets of the Cartesian
product Σj1 × Σj2 × . . . × Σjn and that ansjQ (D) ⊆ U ansjQ (D) as any answer
to the ordered inclusion of Q in D is an answer to the unordered inclu-
sion of Q in D. Obviously, an inclusion relationship between the answer
sets holds, (U )ans1Q (D) ⊆ (U )ans2Q (D) ⊆ . . . ⊆ (U )ansm Q (D), and the last
answer set is the set of answers to the (un)ordered inclusion of Q in D,
(U )ansmQ (D) = (U )ansQ (D). Thus, the set of answers can be incrementally
constructed by sequentially scanning the data signature. For example, con-
sider the ordered twig matching scenario in Figure 4.4. At the step 5 of the
sequential scanning, the domains are Σ51 = {1}, Σ52 = {5}, Σ53 = {4}, and
ans5Q (D) = ∅. In principle, at each step new answers can be added to the
answer set of the earlier one. Assuming that ans0Q (D) = ∅, we denote with
∆(U )ansjQ (D) = (U )ansjQ (D) \ (U )ansj−1Q (D) the j-th delta answer, i.e. the
set of matches which can be computed (decided) at step j and which have not
been computed at an earlier step. Notice that the complete set of matches is
the union of the delta answers and that the matches which can be computed
at step j result from the domains Σj1 , . . . , Σjn .
For efficiency reasons, domains should be maintained in main memory

where, unfortunately, their growth poses some fundamental problems from


the performance point of view. For illustration, consider Figure 4.5 showing
the domains on a plane where the y-axis represents the query nodes and the
x-axis the data nodes (domain space in the following). During the sequential
scan, domains grow in an uncontrolled way. Such a growth is continuous
and is influenced by the peculiarities of the data and query. Tree structured
data often have a considerable number of nodes (see e.g. XML documents)
and the 80-20 law tells us that most of the queries involve a limited set of
labels, which are in fact the most frequent in the data tree. The combination
of these two issues deteriorates the performance of the matching processes;
it can even make the management of the domains unfeasible due to the
constraints on the main memory size. In this context, our ultimate goal is to
regulate the growth of the domains so that the pattern matching problems
are solved efficiently even for tricky queries and data. By exploiting the
pre/post ordering scheme, we want to maintain the domains as compact as
possible, by inserting nothing which is useless and by deleting elements that
are no longer necessary for the generation of the subsequent answers.

In other words, at a given step j, not all data nodes represented in the
domains Σj1 , . . . , Σjn are necessary for the generation of the delta answers
∆(U )ansjQ (D), . . . , ∆(U )ansmQ (D). Thus, we denote with ∆Σji the “reduced”
versions of the original domains Σji which are needed to decide the delta
answers from the j-th step. In the following, we show the pre- and post-
order conditions ensuring that a data node already accessed (or accessed
at a given step) is not necessary for the generation of the solutions from
that step up to the end of the scanning. Necessary conditions are founded
on the relative positions between such data node and the other data nodes
accessed so far, because at a given step of the sequential scanning we have no
information about the properties of the data nodes that follow. Moreover, we
characterize the delta answers that can be decided at each step. To this end,
we will consider the snapshot of the sequential scanning process occurring at
the k-th step (see Figure 4.5). We assume that lk (D) matches lh (Q) and thus
k should be added to ∆Σkh . Our main aim is to determine the conditions
under which any of the already accessed data nodes, i.e. those with pre-
orders 1, . . . , k, will always violate one of the pre- or post-order relationships
required by Lemmas 4.1-4.3 with the data nodes already accessed or those
following the k-th. In this case, either the data node has already been used
in the generation of the previous delta answers or it will never be used and
thus is unnecessary for the generation of ∆(U )ansjQ (D), for each j ≥ k.

Figure 4.6: Representation of the pre-order conditions in the domain space:
(a) Condition PRO1, (b) Condition PRO2

4.2.1 Conditions on pre-orders


Let us first consider pre-order codes and the ordered case. Recall that a se-
quential scan of a signature means that the data nodes are visited according
to their increasing pre-order codes. Moreover, any set of indexes (pre-order
values) S = (s1 , s2 , . . . , sn ) qualifying the ordered inclusion is also ranked
according to pre-order, 1 ≤ s1 < . . . < sn ≤ m, i.e. a total order is required.
A direct consequence of this property is given by the following Lemma (Con-
dition PRO1 - PR stands for PRe-order, O stands for Ordered). It states
that if a data node matches the last query node qn then it will never belong
to the solutions that can be computed in the following steps. For illustration
see Figure 4.6-a depicting Condition PRO1 on the domain space.
Lemma 4.4 (Condition PRO1) If h = n then k ∉ S for each S ∈ ∆ansjQ (D),
for each j ∈ [k + 1, m]. Thus k does not belong to ∆Σjh .
From the previous Lemma, it follows that ∆Σjn is always empty except when
dk = qn and, in this case, ∆Σkn only contains k.
Besides the previous Lemma, the following Lemmas state the conditions
under which dk is no longer necessary for the generation of the delta answers.
The first one states that if at the k-th step a domain ∆Σki preceding ∆Σkh is
empty then k will never belong to the solutions that can be computed in the
k-th step and in the following ones.
Lemma 4.5 (Condition PRO2 applied to k) If ∆Σki = ∅ for some i ∈ [1, h − 1],
then for each S ∈ ∆ansjQ (D), for each j ∈ [k, m]: k ∉ S. Thus k does not
belong to ∆Σjh .
The second Lemma extends the condition of the previous one to the sub-
sequent steps. Its proof is similar to the previous one. For illustration see
Figure 4.6-b.

Lemma 4.6 (Condition PRO2 applied to k′ < k) If k′ ∈ ∆Σk−1h′ and
∆Σk′i ∩ ∆Σki = ∅, i ∈ [1, h′ − 1], then for each S ∈ ∆ansjQ (D), for each
j ∈ [k, m]: k′ ∉ S. Thus k′ does not belong to ∆Σjh′ .

The following Theorem shows that the three previous conditions together
constitute the sufficient conditions such that a data node, due to its pre-order
value, is no longer necessary.

Theorem 4.1 (Completeness) For the ordered case, beyond the condi-
tions expressed in Lemmas 4.4, 4.5, and 4.6, there is no other condition
ensuring that at each step k, any data node due to its pre-order value does
not belong to the solutions which will be generated in the following steps.

              step:   1     2     3     4     5     6     7     8
query node 1         {1}   {1}   {1}   {1}   {1}   {1}   {1}   {1}
query node 2         {}    {}    {}    {}    {5}   {5}   {5}   {5}
query node 3         {}    {}    {}    {}    {}    {}    {7}   {}

Example 4.2 With respect to the example of Figure 4.4, the Table above
shows the impact of Conditions PRO1 and PRO2 on the composition of the
delta domains in the domain space during the sequential scan for ordered
twig matching. Notice that at the 4-th step, Condition PRO2 avoids the
insertion of the data node 4 in the pertinence domain ∆Σ43 , as ∆Σ42 is empty.
Moreover at the 8-th step, thanks to Condition PRO1 ∆Σ83 is empty. ¤

As to the unordered case, notice that the pre-order values of any qual-
ifying set of indexes are not required to be completely ordered as it is for
the ordered evaluation. For this reason, the Lemmas above are no longer
sound. However, the unordered evaluation requires a partial order among
the pre-order values of a qualifying set of indexes (s1 , . . . , sn ). In particular,
whenever post(qi+j ) < post(qi ) it is required that post(dsi+j ) < post(dsi ) and
that si+j > si . Thus Lemmas 4.5 and 4.6 rewritten in the following way are
still sound.

Lemma 4.7 (Condition PRU applied to k) If ∆Σki = ∅ for some i ∈ [1, h − 1]
with post(qi ) > post(qh ), then for each S ∈ ∆U ansjQ (D), for each j ∈ [k, m]:
k ∉ S. Thus k does not belong to ∆Σjh .

Lemma 4.8 (Condition PRU applied to k′ < k) If k′ ∈ ∆Σk−1h′ , ∆Σk′i ∩
∆Σki = ∅, and post(qi ) > post(qh′ ) for some i ∈ [1, h′ − 1], then for each S ∈
∆U ansjQ (D), for each j ∈ [k, m]: k′ ∉ S. Thus k′ does not belong to ∆Σjh′ .

On the other hand, there is no counterpart for Lemma 4.4. Indeed, as no


total order among the pre-order values is required, the “position” of the query
node matching the data node does not influence the use of such data node in
the solutions which will be generated in the following steps. More precisely,
in the ordered case a concept of “last” query node from a pre-order point of
view exists and thus whenever dk = qn any solution (s1 , . . . , sn−1 , k) involving
dk is generated at the k-th step, as the pre-order values s1 , . . . , sn−1 of all
the other data nodes are required to be smaller than k (and thus accessed in
the steps preceding the k-th). Instead, in the unordered case, the data node
dk accessed at the k-th step can always be useful for the generation of the
answers in the subsequent steps, unless some query node must be matched by
a data node with pre-order smaller than k and no such node exists in the
delta domains. This last
aspect is considered in the two Lemmas above. For this reason the following
Theorem is sound and the proof is similar to that of Theorem 4.1.

Theorem 4.2 (Completeness) For the unordered case, beyond the condi-
tions expressed in Lemmas 4.7 and 4.8, there is no other condition ensuring
that at each step k, any data node due to its pre-order value does not belong
to the solutions which will be generated in the following steps.

In this way, we have shown the sufficient and necessary pre-order conditions
for the exclusion of a data node in the generation of the delta solutions of
the ordered and unordered inclusion of a query tree in a data tree.

4.2.2 Conditions on post-orders


As far as post-order requirements are involved, here the distinction is be-
tween the path matching and the (un)ordered twig matching. Indeed, path
matching requires a total order among the post-order values of the data nodes
belonging to a match whereas in the twig matching only a partial order is
sufficient. Thus, we first introduce some general rules which will then be
used to study each of the matching approaches.
First of all, the following Lemma easily follows from the property on post-
order values a solution for twig pattern matching must satisfy. It allows us
to prevent the introduction of the current data node dk in the pertinence
domain ∆Σjh for the construction of the solutions in the steps from j = k to
j = m.

Lemma 4.9 (Condition POT1) If i ∈ [1, h − 1] exists such that for each
si ∈ ∆Σk−1i , post(dsi ) < post(dk ) is required but post(dsi ) > post(dk ), or
post(dsi ) > post(dk ) is required but post(dsi ) < post(dk ), then k ∉ S, for each
S ∈ ∆(U )ansjQ (D), for each j ∈ [k, m]. Thus k can be deleted from ∆Σjh .

The case of the deletion of a data node preceding the k-th in the sequential
scanning, and thus already belonging to a delta domain, is different. Notice
that, as in the pre-order case, at the k-th step of the sequential scanning, a
node belonging to a delta domain will no longer be necessary for the genera-
tion of the delta answer sets due to its post-order value if one of the required
post-order relationships w.r.t. the other nodes will always be violated by the
data nodes following the k-th in the sequential scanning. It means that either
dj with j < k has already been used in the generation of the previous delta
answer set or it will never be used. In the pattern matching definition, two
kinds of relationships are taken into account between the post-order values
of two data nodes di and dj : either it is required that post(di ) < post(dj )
or post(di ) > post(dj ). Given the relationship between the post-order value
of the k-th node, post(dk ), and that of a preceding node post(dj ) (j < k),
we want to predict what kind of inequality relationship will hold between
the post-order value of dj and those of the nodes following dk in the se-
quential scanning. Only if we are able to do it, we can state that dj is no
longer necessary due to its post-order value. At first glance, by considering
the properties of the pre-order and post-order ranks given in Figure 4.2, it
seems that the post-order relationships between post(dj ) and post(dk ) and
post(dj ) and post(dj0 ) with j0 > k are completely independent. This is true
when post(dj ) > post(dk ). The case of the other post-order relationship,
post(dj ) < post(dk ), which is taken into account in the following Lemma, is
different.

Lemma 4.10 Let j < k and post(dj ) < post(dk ). It follows that post(dj ) <
post(dj0 ), for each j0 ∈ [k, m].

Thus, both in the cases of path and (un)ordered twig matching, a data node
dj can be deleted from the pertinence delta domain if and only if it is re-
quired that its post-order value is greater than that of another data node but
that condition will never be verified from a particular step of the scanning
process onwards. We analyze such situations in depth in the following.
Let us first consider the case of path matching. Lemma 4.10 allows the
introduction of the following Lemma showing a sufficient condition such that
a data node, due to its post-order value, is no longer necessary to generate
the answers to the path inclusion evaluation. For illustration see Figure 4.7-a
depicting Condition POP on the domain space.

Lemma 4.11 (Condition POP) Let Q be a path P , si ∈ ∆Σki where
i ∈ [1, n] and post(dsi ) < post(dk ). It follows that si ∉ S, for each S ∈
∆ansjP (D), for each j ∈ [k, m]. Thus si can be deleted from ∆Σji .

Figure 4.7: Representation of the post-order conditions in the domain space:
(a) Condition POP, (b) Condition POT2

Example 4.3 Consider the example of Figure 4.4 at step 6, when the com-
position of the delta domains is the following: ∆Σ61 = {1} (book), ∆Σ62 = {2}
(author), and ∆Σ63 = {4} (lastName). In this case, the post-order of the
current data node is post6 (D) = 7 which is greater than both post2 (D) = 3
and post4 (D) = 2. Thus nodes 2 and 4 can be deleted from their pertinence
domains and the composition of the delta domains becomes ∆Σ61 = {1},
∆Σ62 = {}, and ∆Σ63 = {}. Intuitively, they have already been used in the
generation of the delta answer ans4P = {h1, 2, 4i} at the 4-th step and they
will never be used again because node 6 belongs to another path. ¤
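To make the interplay between the delta domains and the pruning conditions concrete, the following small Python sketch performs the sequential scan for weak path inclusion on the data of Figure 4.4. It applies the pre-order test of Condition PRO2 and the post-order pruning of Condition POP, but it is only a simplified illustration: it assumes distinct labels along the path and omits the remaining bookkeeping of the complete algorithms described in Appendix B.

def path_matches(sig_d, path_labels):
    """Naive sequential-scan sketch for weak path inclusion (only ancestor-
    descendant relationships are enforced). One delta domain is kept per query
    position; sig_d is a list of (label, post-order) pairs listed in pre-order."""
    n = len(path_labels)
    domains = [[] for _ in range(n)]            # domains[i] holds (pre, post) pairs
    answers = []

    def extend(chain, i, max_pre, min_post):
        # pick, for query position i, a stored node with smaller pre-order and
        # larger post-order than the node chosen for position i + 1
        if i < 0:
            answers.append(tuple(pre for pre, _ in reversed(chain)))
            return
        for pre, post in domains[i]:
            if pre < max_pre and post > min_post:
                extend(chain + [(pre, post)], i - 1, pre, post)

    for k, (label, post) in enumerate(sig_d, start=1):
        if label not in path_labels:
            continue
        h = path_labels.index(label)            # assumes distinct labels in the path
        # Condition POP: stored nodes with post-order below post(d_k) can never
        # be ancestors of d_k or of any later node (Lemma 4.10), so drop them
        for dom in domains:
            dom[:] = [(p, pp) for p, pp in dom if pp > post]
        if h == 0:
            domains[0].append((k, post))
        elif domains[h - 1]:                    # Condition PRO2: a live candidate
            if h == n - 1:                      # for the preceding position exists
                extend([(k, post)], h - 1, k, post)   # decide the new delta answers
            else:
                domains[h].append((k, post))
    return answers

# data tree of Figure 4.4 and the path query book / author / lastName
sig_d = [("book", 8), ("author", 3), ("firstName", 1), ("lastName", 2),
         ("title", 4), ("author", 7), ("lastName", 5), ("firstName", 6)]
print(path_matches(sig_d, ["book", "author", "lastName"]))   # [(1, 2, 4), (1, 6, 7)]

The printed answers coincide with the path matches ⟨1,2,4⟩ and ⟨1,6,7⟩ reported in Figure 4.4.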

Given the situation depicted in Lemma 4.9, the previous Lemma produces the
same effect on the current data node dk as Lemma 4.9, i.e. its deletion from
∆Σjh . On the other hand, Lemma 4.11 also acts on the nodes preceding dk in
the sequential scanning. For this reason, Lemma 4.11 can be used in place
of Lemma 4.9. The proof is given in the following proposition. Notice that,
since the query Q is a path P , we only consider the case post(qi ) > post(qh ).
Lemma 4.12 If for each i ∈ [1, h − 1], for each si ∈ ∆Σk−1i , post(dsi ) <
post(dk ) then, due to Lemmas 4.5 and 4.11, k ∉ S, for each S ∈ ∆ansjP (D),
for each j ∈ [k, m]. Thus k can be deleted from ∆Σjh .
It should be emphasized that Condition POP is a necessary and sufficient
condition, i.e. it states the only possible condition such that a data node due
to its post-order value can be deleted. The proof is given by Theorem 4.4, where we
show that in the computation of the delta answers there is no condition on
post-order values to be checked.
The problem of the generic twig inclusion evaluation is more involved
than the path matching problem. In this case, both kinds of post-order
relationships are in principle allowed between any two data nodes in

a set qualifying the pattern matching and only a partial order is required.
The following Lemma is the counterpart of Lemma 4.11 as it shows the
post-order conditions under which the nodes preceding dk in the sequential
scanning can be deleted. It only considers the inequality condition on post-
order values where it is required that dk is greater than any other data node.
For illustration see Figure 4.7-b.
Lemma 4.13 (Condition POT2) Let si ∈ ∆Σki and post(dsi ) < post(dk ).
If ī ∈ [1, n] exists such that post(qī ) < post(qi ) and either ∆Σkī = ∅ or, for each
s ∈ ∆Σkī such that s > si , post(ds ) > post(dsi ), then si ∉ S, for each
S ∈ ∆(U )ansjQ (D), for each j ∈ [k, m].

Finally, whenever Condition POT2 involves the root domain, i.e. ∆Σk1 , no
check on the other delta domains is required and all the nodes s1 ∈ ∆Σk1
such that post(ds1 ) < post(dk ) can be deleted from ∆Σk1 .
Lemma 4.14 (Condition POT3) Let s1 ∈ ∆Σk1 and post(ds1 ) < post(dk );
then s1 ∉ S, for each S ∈ ∆(U )ansjQ (D), for each j ∈ [k, m].

Example 4.4 Consider the unordered twig matching of the example of Fig-
ure 4.4 at step 6 when the delta domains are as follows: ∆Σ61 = {2} (author),
∆Σ62 = {3} (firstName), and ∆Σ63 = {4} (lastName). At this step, a new
root arrives, i.e. the node 6, as post6 (D) > post2 (D). Thus Condition POT3
allows us to delete node 2 from ∆Σ61 , which becomes empty. Consequently, the
delta domains ∆Σ62 and ∆Σ63 can also be emptied thanks to Condition PRO2.
Note that such situation is very frequent in data-centric XML scenarios. ¤

Theorem 4.3 (Completeness) For the twig case, beyond the conditions
expressed in Lemmas 4.13 and 4.14, there is no other condition ensuring
that at each step k, any data node due to its post-order value does not belong
to the solutions which will be generated in the following steps.

4.2.3 On the computation of new answers


In this subsection, we want to detect at which step of the sequential scanning
new matches can be decided. For this problem, we must again consider two
cases: the ordered and the unordered. For the ordered case, we exploit the
total order among the pre-order values of the data nodes in a match. A
direct consequence of this property is given by the following fact, stating
that the set of matches which can be computed at step k is empty unless
lk (D) matches with the “last” query node ln (Q).
Lemma 4.15 If h ≠ n then ∆anskQ = ∅.

For the unordered case, where only a partial order is required, new matches
can only be decided when the data node k matches with a query node which
is a leaf.
Lemma 4.16 If i > h exists such that post(qh ) > post(qi ) then ∆U anskQ = ∅.

4.2.4 Characterization of the delta answers


In this subsection, we characterize the delta answers generated at each se-
quential step, respecting our three kinds of pattern matching strategies.
The following Theorem represents an important result for path match-
ing. It shows how the set of delta answers can be computed at each step of the
sequential scanning. It also shows that Lemma 4.11 together with Theorem
4.1 delete all the data nodes which are no longer necessary.
Theorem 4.4 If Lemmas 4.4, 4.5, 4.6, and 4.11 have been applied at each
step of the sequential scanning, then the set of answers ∆anskP (D) which can
be generated at step k for the path P is the subset of the Cartesian product
∆Σk1 × . . . × ∆Σkn defined as {(s1 , . . . , sn ) | si ∈ ∆Σsi+1i for each i ∈ [1, n − 1]}.
For the ordered case, post-order values must be checked. On the other
hand, the completeness of the conditions shown in the previous section en-
sures that we cover all possible node deletions from the pre- and post-order
points of view. In particular, in this case, due to Lemma 4.5, the “last” delta
domain ∆Σkn is always empty unless the current data node k matches with
the n-th query node and, in this case, ∆Σkn only contains k. Thus, whenever
∆anskQ (D) is not empty, it only contains new matches.
Theorem 4.5 If Lemmas 4.4, 4.5, 4.6, 4.9, 4.13, and 4.14 have been applied
at each step of the sequential scanning then the set of answers ∆anskQ (D)
which can be generated at step k for the twig Q is the subset of the Cartesian
product ∆Σk1 × . . . × ∆Σkn where, for each (s1 , . . . , sn ): (1) si ∈ ∆Σsi+1i for
each i < n, (2) the conditions on the post-order values expressed in Lemma 4.1
are satisfied.
In the unordered case, instead, as the delta domains of the query leaves are
not always empty, we must avoid producing redundant results.
Theorem 4.6 If Lemmas 4.7, 4.8, 4.9, 4.13, and 4.14 have been applied at
each step of the sequential scanning, then the set of answers ∆U anskQ (D),
which can be generated for the twig Q at step k whenever qh is a leaf, is
the subset of the Cartesian product ∆Σk1 × . . . × ∆Σkn such that, for each
(s1 , . . . , sh , . . . , sn ): (1) sh = k, (2) for all pairs of entries i and i + j, i, j =
1, 2, . . . n − 1 and i + j ≤ n, if posti+j (Q) < posti (Q) then postsi+j (D) <
postsi (D) and si ∈ ∆Σsi+ji .

4.3 Exploiting content-based indexes


In the following sections we describe how to take advantage of indexes built
on the content of document nodes. Generally, two kinds of operations can
exploit content-based index information: first, we can avoid inserting useless
nodes into the delta domains; second, we can avoid scanning some fragments
of the data document, since we are guaranteed that no useful nodes will be
found in the skipped part. Given a query leaf ql that specifies a value condition
and is associated with a content-based index, we can obtain from the index
all the occurrences of ql in the current document that satisfy the condition.
Let T(ql) be such set of occurrences, ordered by increasing pre-order value,
and let t(ql) ∈ T(ql) be the next potential match for ql; in the following,
t(ql) is also called the current target for ql.

4.3.1 Path matching


Consider the evaluation of a path: if the query specifies a value condition
on the leaf node ql and we have a content-based index built on ql, we can
improve the scanning process as follows. From the index we can obtain
the list T(ql); each solution to the path matching will contain one of the
occurrences contained in T(ql). Obviously, the opposite is not true, i.e. not
all occurrences will belong to a solution.

Current Target Definition As we said, t(ql) is the next potential match
for the query node ql; for the path case, t(ql) is defined as the first node in
T(ql) that has a pre-order value greater than the current document pre-order
value.

Insertion Avoidance and Skipping Policy A path answer requires that
the pre-order and post-order values of its elements are totally ordered in increasing
and decreasing order, respectively. Therefore, during the sequential scan,
knowing a priori the post-order value of the next node matching with the
query leaf enables us to avoid the domain insertion of useless nodes and
to reduce the fragment of the data tree to be scanned. More precisely, if,
during the sequential scan, we access a node du (with u < pre(t(ql))) having
post(du) < post(t(ql)), we are guaranteed that node du does not belong
to any answer ending with node t(ql), as post-order values must be in descending
order. We are also guaranteed that node du does not belong
to any answer ending with any node t′(ql) ∈ T(ql) such that pre(t′(ql)) > pre(t(ql)),
since by Lemma 4.10 if post(du) < post(t(ql)) and u < pre(t(ql)) then
post(du) < post(t′(ql)) for each t′(ql) ∈ T(ql) with pre(t′(ql)) > pre(t(ql)). Under
these conditions, if T(ql) is scanned in pre-order sorted order, we can safely
discard the node du since it will never belong to any answer.
Moreover, since post(du′) < post(du) for each descendant du′ of du, the
conditions above also hold for these nodes, so we can safely discard them as well.
From the above considerations we can conclude that if, during the sequential
scan, we access a node du such that post(du) < post(t(ql)), we can directly
continue the scan from the first following of du. If the signature does not
contain the first following values ffi, we can still safely skip a part of the
document thanks to the following observation: we have ff_{du} = u + size(du) + 1
and size(du) = post(du) − u + level(du); since 0 ≤ level(du) ≤ h, where h is
the height of the data tree, we can safely continue the scan from the node
having pre-order equal to post(du) + 1.
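
To make the skipping rule concrete, here is a minimal Python sketch (an illustration only, assuming hypothetical field names; it is not the actual XSiter code) that computes the pre-order position from which the scan can safely continue.

# Minimal sketch of the path skipping rule. A node is described by its
# pre-order, post-order and, when the signature stores it, its first-following
# value ff; all names are illustrative.

def next_scan_position(current_pre, current_post, target_post, ff=None):
    """Return the pre-order value from which the scan can safely continue."""
    if current_post < target_post:
        # the current node cannot be an ancestor of the current target:
        # its whole subtree can be skipped
        if ff is not None:
            return ff                     # exact skip: first following of d_u
        return current_post + 1           # safe bound, since ff = post + level + 1
    return current_pre + 1                # no skip possible: move to the next node

For instance, with current_post = 4, target_post = 9 and no stored ff values, the scan would resume from pre-order 5.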

4.3.2 Ordered twig matching


In the previous section we described how to speed up the sequential scan in
the path matching when an index on the value condition specified on the leaf
of the path query is available. In this section we extend those observations
to support a similar improvement for the ordered twig pattern matching.
We start by analyzing the main differences between the path and the
twig matching. The first difference is that, while in a path we can have at
most one value condition, i.e. on the leaf of the path, a twig query can
contain more than one condition. If we have at least two content-based
indexes on the leaves subject to value conditions, from those indexes we will
obtain as many lists of data nodes satisfying the conditions, and we have to
manage them coherently. The second difference is that in the path matching
any query node is an ancestor of the leaf (unless it is the leaf itself) and
the same relationship must be retained in each solution, whereas in the twig
matching the relationship between each query node and a query leaf can be
either of the ancestor or of the following-preceding type. Given a twig query, we assume
to have a list L = {ql1, ql2, .., qlr} that contains all the query leaves subject
to a value condition and associated with a content index, and that the list is
ordered according to the leaves' pre-order values (li < li+1 ∀i ∈ [1, r − 1]). For each
leaf qli ∈ L, through the associated index we can obtain a list T(qli) of all the
occurrences of qli in the current document that satisfy the specified condition,
ordered according to pre-order values.
We first discuss the definition of the current targets and the management
of the lists T (qli ) and then we discuss the skip policy.

[Figure 4.8: Target list management example. Ordered twig query: A (1,3) with children B (2,1) and C (3,2). Data tree: A (1,6) with children C (2,1), C (3,2), B (4,3), C (5,4), C (6,5). Nodes are annotated with (pre-order, post-order) values.]

Current Target Definition and List Management Each element in
T(qli) matches with qli; however, an ordered twig answer requires that the pre-orders
of its elements are totally ordered in increasing order (see Lemma 4.1),
so not all nodes in each T(qli) will be candidates for domain insertion.

Example 4.5 Consider Figure 4.8, where the nodes are represented as circles
filled with different shades:

• a white circle is a generic node;

• a dark-grey circle is a query node with a value condition (or a document
node with a value);

• a light-grey patterned circle is a query node ancestor of at least one
value constrained leaf.

The query in the example has two value constrained leaves; suppose that each
document leaf satisfies the corresponding condition. Then the lists T(B2) and
T(C3) obtained through the index are {B4} and {C2, C3, C5, C6} respectively.
An answer to the query requires that any node matching “C” follows (i.e.
has a greater pre-order value) any node matching “B”; thus we know that
the elements {C2, C3} in T(C3) will never be candidates for domain insertion, because
no element of T(B2) has a smaller pre-order value. ¤

While we perform a sequential scan over the input document we associate


each list T (qli ) with the current target t(qli ) that represents the next potential
match for the element qli . Current targets are related to each other and
depend on the current document pre-order value and on the current state of
the delta domains.

Definition 4.2 During the sequential scan, let k be the current document
pre-order; we say that two lists T(qli) and T(qli+1) are aligned iff:

• pre(t(qli)) > k and pre(t(qli+1)) > k;

• pre(t(qli+1)) > pre(t(qli)) or pre(t(qli+1)) > minPre(∆Σ^k_{li}),

where minPre(∆Σ^k_{li}) is the minimum pre-order value of the nodes in the delta
domain ∆Σ_{li} at the k-th step of the algorithm.
The alignment property is transitive, i.e. if T(qli) is aligned to T(qli+1) and
T(qli+1) is aligned to T(qli+2), then T(qli) is aligned to T(qli+2). The first
alignment is performed before starting the sequential scan; in this case, the
alignment depends only on the elements contained in the lists, because each
delta domain ∆Σ_{li} is empty. The sequential scan progressively increments the
current document pre-order, and this can lead to the definition of new targets.
From the definition of the alignment property we can derive how new targets
should be defined. The target t(qli+1) for the element qli+1 depends on:

• the current document pre-order (k);

• the pre-order of the current target for the element qli, i.e. pre(t(qli));

• the minimum pre-order of the elements in ∆Σ^k_{li}, i.e. minPre(∆Σ^k_{li}).

With these three values we can define the minimum pre-order value that the
target t(qli+1) must assume. More precisely, given

minPre = max{k, min{pre(t(qli)), minPre(∆Σ^k_{li})}}

t(qli+1) is the first element in T(qli+1) that has a pre-order value greater
than or equal to minPre. The updating of the targets should therefore be performed
during the sequential scan and whenever a deletion is performed on a ∆Σ_{li}
delta domain. In order to take advantage of the transitive property and
to minimize the number of operations, we perform the update starting
from t(qli) and then propagate it to the targets on its right (with increasing
values of i). If the target lists are aligned, we also say that the current targets are
aligned.
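
The update of the targets can be sketched as follows in Python (a simplification under the assumption that each target list is just the sorted sequence of pre-order values of the index occurrences, and that minPre of the corresponding delta domain is available or None when the domain is empty; the names are illustrative, not the actual XSiter API).

from bisect import bisect_left

# Sketch of the left-to-right target re-alignment: targets[i] is an index into
# T[i], the sorted list of pre-order values of the occurrences of the i-th
# constrained leaf; min_pre_delta[i] models minPre of its delta domain.

def realign_targets(k, T, targets, min_pre_delta):
    for i in range(len(T)):
        if i == 0:
            min_pre = k                   # strictness details are omitted here
        else:
            prev_target = (T[i - 1][targets[i - 1]]
                           if targets[i - 1] < len(T[i - 1]) else float("inf"))
            prev_domain = (min_pre_delta[i - 1]
                           if min_pre_delta[i - 1] is not None else float("inf"))
            min_pre = max(k, min(prev_target, prev_domain))
        targets[i] = bisect_left(T[i], min_pre)   # first occurrence >= min_pre
    return targets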

Insertion Avoidance and Skipping Policy For the path case we have
shown that, during a sequential scan, depending on the current target and on
the post-order value of the current document element, we can safely skip some
parts of the input document, because we are sure that no useful elements for the
current query will be found in the skipped document parts. For the ordered
twig case we need to introduce more constraints that limit the applicability
of the skipping strategy. As for the path case, the skipping policy can only
be based on conditions on the post-order values between document nodes
and current targets.

[Figure 4.9: Ordered Twig Examples — the ordered twig query A (1,3) with children B (2,1) and C (3,2), and the two data trees used in Example 4.6 (Data tree 1 rooted at D (1,n+5), Data tree 2 rooted at A (1,n+5), both with an X (2,n+1) subtree); nodes are annotated with (pre-order, post-order) values.]

In particular, Lemma 4.10 ensures that if j < k and
post(dj) < post(dk) then post(dj) < post(dj′), for each j′ ∈ [k, m]. From the
skipping policy point of view, this means that, while we are looking for answers
that include the current targets, if we access a document node that should have
a post-order value greater than the targets' one (i.e. we are looking for
an ancestor of the targets) but is actually smaller, then we can skip all
the descendants of the current document element, because none of them could be
useful.

Example 4.6 Consider Figure 4.9. The sample query specifies a value condition
over its leaf “C”; suppose that for both input documents the “C” elements
satisfy this condition. Let us first analyze the first data tree. During the sequential
scan we first access the “D” element, which does not match with any query
element; then we access its first child and, independently of its label “X”,
we can entirely skip its subtree. At this step, the delta domain associated
to “A” is empty and the current document post-order is smaller than the
current target one (i.e. the current target is outside the subtree rooted by
“X”). Instead, the second data tree case is different. In this case the first
element matches with the root query element and, being an ancestor of the
current target, it is inserted into the corresponding delta domain. Again
we access element “X” but this time, even if the current post-order value is
smaller than the current target one, we cannot skip the subtree of the current
element: the delta domain associated to “A” is not empty, and possible
matches for “B” (whose post-order is not required to be greater than the one
of “C”) could be lost if we skipped the current element subtree (as the example
shows). ¤

This simple example shows that, differently from the path case, we cannot
establish if a skip is safe or not by taking into consideration only post-order
values.

[Figure 4.10: Ordered Twig Examples. Ordered twig query: A (1,3) with children B (2,1) and C (3,2). Data tree: D (1,12) with children A (2,3), A (5,8) and A (10,11); A (2,3) has children B (3,1) and E (4,2); A (5,8) has children B (6,4) and A (7,7), the latter with children E (8,5) and C (9,6); A (10,11) has children B (11,9) and C (12,10).]

Before explaining under which conditions a skip is safe, we need to
make some considerations on the relationship between a generic query node
and the query leaves subject to a value condition.
Each query node can be an ancestor of zero, one or more query leaves subject
to a value condition; thus we associate each query node, say qx, with a sublist
of L, Lx = {ql_1^x, ql_2^x, .., ql_u^x}, that contains all the query leaves in L that are
descendants of qx. Notice that l_1^x < l_2^x < .. < l_u^x and also that post(ql_1^x) <
post(ql_2^x) < .. < post(ql_u^x). By convention, if qx ∈ L then Lx = {qx}.
Given a set of aligned targets, in order to establish whether a node matching qx
with Lx ≠ ∅ is useless, we can simply consider the post-order of the current
target for ql_u^x. In particular, let dk be the current data node matching with qx;
if post(dk) < post(t(ql_u^x)), then dk does not belong to any subsequent answer
due to Lemma 4.10, as T(ql_u^x) is ordered by pre-order value. For this
reason we can avoid inserting dk in the domain associated to qx. Otherwise,
if post(dk) ≥ post(t(ql_u^x)), dk can potentially belong to an answer and thus
it can be inserted in the corresponding domain. In particular, if ∆Σ^k_{lj} = ∅
for all ql_j ∈ Lx, we also know that dk is an ancestor of all t(ql_i^x), since post(dk)
> post(t(ql_i^x)) for each i < u. The condition above is necessary but not
sufficient in order to establish if a node is useful.
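
The association between query nodes and their reference targets can be computed once from the pre/post-order values, as in the following Python sketch (the query is modelled simply as dictionaries of pre/post values; this is an assumption of the sketch, not the system's actual representation).

# For each query node x, compute the sub-list L_x of value-constrained leaves
# that are descendants of x (a node j is a descendant of x iff pre(x) < pre(j)
# and post(j) < post(x)) and its reference leaf, i.e. the one with the highest
# pre-order in L_x. Names are illustrative.

def reference_targets(pre, post, constrained_leaves):
    """constrained_leaves is the list L, ordered by pre-order."""
    result = {}
    for x in pre:
        lx = [l for l in constrained_leaves
              if l == x or (pre[x] < pre[l] and post[l] < post[x])]
        result[x] = (lx, lx[-1] if lx else None)   # last element = highest pre-order
    return result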

Example 4.7 Consider Figure 4.10. As before, suppose that each document
leaf satisfies the corresponding value condition; the initial target lists are T(B2) =
{B3, B6, B11} and T(C3) = {C9, C12} and they are aligned; the current
targets are therefore t(B2) = B3 and t(C3) = C9. The first accessed node is D1,
which does not match with any query node. Then we access A2, which matches with
the root of the query; we have L1 = {B2, C3} and, since post(t(C3)) = 6 and
the current document post-order value is 3, A2 is a useless element and it
will not be inserted into the corresponding domain. Next we access B3, which
is the current target for B2, but since the domain of A1 is empty we cannot
insert it into the domain; the target for B2 becomes B6 (which is still coherent
with the current t(C3)). The next node does not match with any query
node, so we continue the scan. We arrive at A5, which matches with the query
node and, since post(t(C3)) = 6 and the current document post-order value is
8, the element could be useful and we insert it into the corresponding domain.
Since ∆Σ^5_2 = ∅ and ∆Σ^5_3 = ∅, we know that A5 is an ancestor of the current
targets for B2 and C3. Now we access B6, which is the current target for B2;
since the parent domain is not empty we can insert it into the domain, and the
target for B2 becomes B11, which is beyond t(C3); however, since we have inserted
the previous target into the domain, t(C3) does not change (see the previous
section). Now we access A7 and, for the same reasons explained before, we
need to insert it into the corresponding domain. It has to be noted that
if, instead of checking only the post-order of t(C3), we had also checked the
post-order of t(B2), we would have concluded that A7 is a useless element;
in the implementation one could choose which condition has to be applied
depending on the desired trade-off. The successive node does not match with
any query node, so we continue the scan; we now access C9, which is the current
target for C3; the node is useful and we can generate the first solution
{5,6,9}, and t(C3) becomes C12. The next node is A10, which is a following of A5, so all
the domains are emptied. Node A10 matches with the query root and, since
post(t(C3)) = 10 and the current document post-order value is 11, the element
could be useful and we insert it into the corresponding domain. As before,
we also know that A10 is an ancestor of both t(B2) and t(C3). Next we find
and insert B11 and C12 into the corresponding domains and we generate the
last solution {10,11,12}. ¤

For each twig node qx we have:

• no reference target if Lx = ∅;

• one reference target, ql_u^x, i.e. the leaf with the highest pre-order in Lx, otherwise.

The observation above enables us to avoid the insertion of useless nodes but,
in order to perform a skip, we must first define under which conditions a skip
is safe. Basically, a skip is safe if no useful element is in the skipped part of
the document. This rough condition can be refined as follows:

• none of the current partial solutions can be extended by nodes that are
descendants of the current document node;

• it is not possible to build a complete solution with nodes that are
descendants of the current document node.
A matching node can be inserted into its domain if the delta domains of
its preceding and ancestor elements are not empty. For the ordered twig
matching we know that domains are filled “from left to right”; in other words,
if Di is empty then each Dj with j ∈ [i + 1, n] is also empty. By analyzing the
delta domains we can establish whether a skip is safe or not. A skip is considered
safe iff:

∃qi : ∆Σ^k_i = ∅ ∧ ∀j ∈ [1, i] ∃ql_v^j ∈ Lj : post(dk) < post(t(ql_v^j))

Instead of checking the existence of a ql_v^j whose target is not in the subtree
of the current document node, we can simply check whether the current reference
target t(ql_u^j) is a descendant of the current document node; the condition above
then becomes:

∃qi : ∆Σ^k_i = ∅ ∧ ∀j ∈ [1, i] post(dk) < post(t(ql_u^j))

In order to verify whether a solution could be completely built with nodes that are
descendants of the current data node, it is sufficient to check whether post(dk) <
post(t(ql_u^1)). If the check succeeds, we are guaranteed that no solution will be
completely built with nodes that are descendants of dk; however, this condition
is subsumed by the previous one since, if we cannot extend a partial solution,
it is not possible to build a complete one and, if no partial solutions exist, the
two conditions collapse. Let us analyze the first condition. If the first empty
domain has a reference target or belongs to a target itself (say qli), then we
know that we are still looking for a match for qli; so, if post(dk) < post(t(qli)),
we know that our target is not a descendant of dk and, as we have shown
for the path case, any subsequent t(qli) is not a descendant of dk. The
condition also ensures that each reference target t(ql_u^j) with j ∈ [1, i] is
not a descendant of dk, which is necessary in order not to miss any useful match.
If the first empty domain belongs to a node qi with Li = ∅, then it belongs
to a node that is related to the targets by a following-preceding relationship; for
such nodes we have no information about the next occurrence, so we cannot
be sure that no useful occurrence is a descendant of dk, and in this situation
we cannot perform a skip. The same happens when at least one non-empty
domain associated with a node qi having Li = ∅ exists. It has to be noted
that even in the previous situation we have no information about the next
occurrences of nodes matching with a qi having Li = ∅, but we know that each
of their occurrences will be useless because at least one preceding domain will
be empty. Finally, we highlight that, if we can perform a skip at step k, then
we are guaranteed that, even if dk matches with a query node qj, dk will not
be inserted into ∆Σj: if 1 ≤ j ≤ i, the skip condition ensures
that t(ql_u^j) is not a descendant of dk; otherwise, if j > i, at least one
empty delta domain ∆Σ^k_i preventing the insertion exists.
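
The refined condition can be checked as in the following Python sketch, assuming that the domains are given in query pre-order, that ref_target_post[i] holds the post-order of the current reference target of the i-th query node (None when L_i = ∅), and that the emptiness of each domain is known; these names are assumptions of the sketch, not the system's actual API.

# Ordered-twig skip test: a skip is safe iff some query node q_i has an empty
# delta domain and every q_j with j <= i has its reference target outside the
# subtree of the current data node d_k (post(d_k) < post(t(ql_u^j))).

def skip_is_safe_ordered(dk_post, domains, ref_target_post):
    for i in range(len(domains)):
        if domains[i]:
            continue                              # domain not empty: try the next q_i
        for j in range(i + 1):
            if ref_target_post[j] is None:        # L_j = empty: no information, unsafe
                return False
            if dk_post >= ref_target_post[j]:     # target inside d_k's subtree
                return False
        return True                               # first empty domain satisfies the test
    return False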

4.3.3 Unordered twig matching


Similarly to the ordered case, we can have more than one leaf with a specified
value condition and associated with a content index and, as in the previous
case, we need to coherently manage the lists of occurrences obtained through
the indexes.

Current Target Definition and List Management As in the ordered
case, given a twig query we can obtain a list L = {ql1, ql2, .., qlr} that contains
all the query leaves subject to value conditions and associated with a content
index; for each leaf qli ∈ L, through the associated index we can obtain a
list T(qli) of all the occurrences of qli in the current document that satisfy
the specified condition, ordered according to pre-order values. During the
sequential scan, we associate each list T(qli) with the current target t(qli),
which represents the next potential match for the element qli. Since there
is no order constraint, for the unordered twig matching we do not have any
alignment property between the target lists. This means that current targets
are not related to each other and do not depend on the current state of the
delta domains. For the unordered matching, current targets depend only on
the current document pre-order value: the current target for the leaf qli is
the first element in T(qli) that has a pre-order value greater than or equal to the
current document pre-order value.
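
Since in the unordered case the current target depends only on the document pre-order, its computation reduces to a binary search, as in this small sketch (assuming each target list simply stores sorted pre-order values; an illustration, not the actual XSiter code).

from bisect import bisect_left

def current_target(target_pre_orders, k):
    """First occurrence with pre-order >= k, or None when the list is exhausted."""
    pos = bisect_left(target_pre_orders, k)
    return target_pre_orders[pos] if pos < len(target_pre_orders) else None

# e.g. current_target([3, 6, 11], 5) returns 6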

Insertion Avoidance and Skipping Policy Before starting to analyze
the skipping policy, we need to highlight another difference between the unordered
and the ordered case, induced by the absence of the order constraint.
For the ordered case we could statically define, for each twig node, its reference
target among the associated targets; for the unordered case this static
definition is not possible. Consider a node with more than one associated
target: we cannot assume that the target with the highest query pre-order value
is also the one whose current target has the highest post-order value. Since
matching nodes need to be ancestors of each associated target, they need to
have a post-order value greater than that of each associated target, i.e. greater
than the highest one. We can still define a single reference target (the one
whose current target has the highest post-order value) for each query node,
but this reference target needs to be dynamically updated along with the
update of the related current targets.

[Figure 4.11: Unordered Twig Examples — the unordered twig query A (1,3) with children B (2,1) and C (3,2), and the two data trees used in Example 4.8 (Data tree 1 rooted at D (1,n+5), Data tree 2 rooted at A (1,n+5), both with an X (2,n+1) subtree); nodes are annotated with (pre-order, post-order) values.]

Assume that the current reference target
for qx is ql_i^x and that qx matches with the current document node dk; if
post(dk) < post(t(ql_i^x)), then the node dk is useless and we can avoid inserting
it into the delta domain associated to qx. We can now discuss the skipping
policy used for the unordered matching algorithm. As in the previous cases,
the skipping policy can only be based on conditions over the post-order values
between nodes and current targets and, as in the ordered case, in order
to establish whether a skip is safe or not we need to analyze the status of the delta
domains. For the unordered case, a matching node can be inserted in the
corresponding domain if there is at least one supporting occurrence (a node
with a greater post-order value) in the domain of its parent/ancestor.
Example 4.8 Consider Figure 4.11. The scenario shown is very similar to
the one for the ordered case; the query specifies a value condition on the node
“B”, and we suppose again that for both input documents all “B” nodes satisfy this
condition. Let us analyze the first data tree: the sequential scan first accesses
node “D”, which does not match with any query node; then it accesses its first
child and, independently of its label, we can avoid scanning its subtree. At this
step all the delta domains are empty and the current document post-order
value is smaller than the post-order value of the current target for “B”.
Even if we could build a partial solution with nodes found in the subtree of
the current element, we would never be able to complete it
with the required “B” node (the next match for “B” is outside the subtree and,
since a match for “A” is still missing, it is not possible to complete partial
solutions with this node). Now let us analyze the second data tree case.
The root of the document matches with “A” and, since it is an ancestor of the
current target for “B”, we can insert it into the delta domain associated to
“A”. We then access the first child of the root; again the current document
post-order value is smaller than the one of the current target for “B”, but this
time we cannot skip the subtree. The delta domain associated to “A” is not
empty, so possible matches for “C” (like the one shown in the example) will
be useful. It has to be noted that, if we introduce the order constraint, the
same subtree could be safely skipped (since “C” matches are useless unless
a preceding match for “B” has been found). ¤
Starting from these examples we can derive the conditions under which a
skip is safe for the unordered matching algorithm. From the point of view of a query
node qi, a skip is considered safe iff:

Li ≠ ∅ ∧ post(dk) < post(t(ql_u^i)) ∧ (∆Σ^k_i = ∅ ∨ the skip is safe ∀qj ∈ children(qi))

It is obvious that if qi is related to the targets only by following-preceding relationships
(i.e. Li = ∅), or its current reference target is a descendant of the
current document node, the skip is unsafe: in the former case, we
have no information about the next matches for these kinds of nodes and, in the
latter case, we know that at least one target would be lost if we performed a skip.
The second part of the condition is less intuitive. First, if the delta domain
associated to qi is empty, the skip is considered safe because we are guaranteed
that elements matching qi or any descendant of qi will never
be part of a solution, since there is no valid match for ql_u^i in the subtree
of the current document node. If the delta domain associated to qi is not
empty, we need to verify whether the skip is safe for all the children of qi; if for at
least one child the skip is considered unsafe, then the skip is unsafe for
qi as well. In order to establish whether a skip is safe or not it is sufficient to check the
condition above for the query root q1.
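
The recursive condition can be rendered as follows in Python (a sketch under the assumption that the query is given as children links and that, for each node, the post-order of its current reference target and the emptiness of its delta domain are known; names are illustrative, not the actual XSiter API).

# Unordered-twig skip test, to be invoked on the query root.

def skip_is_safe_unordered(qi, dk_post, children, ref_target_post, delta_empty):
    if ref_target_post[qi] is None:          # L_i = empty: no information, unsafe
        return False
    if dk_post >= ref_target_post[qi]:       # reference target inside d_k's subtree
        return False
    if delta_empty[qi]:                      # no supporting ancestor: subtree useless
        return True
    return all(skip_is_safe_unordered(c, dk_post, children,
                                      ref_target_post, delta_empty)
               for c in children[qi])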

4.4 An overview of pattern matching algorithms
In Section 4.2 we have introduced a theoretical framework consisting of a set
of pre/post-order conditions for a node to be deleted. The next challenge is
to conceive sequential pattern matching algorithms that exploit the theoretical
framework to manage the domains efficiently. The set of conditions is
complete, thus ensuring that the domains are kept as compact as possible
from a numbering scheme point of view. At the same time, efficiently
putting the theoretical framework into practice also means finding implementation
solutions that consume little time. It should be noticed that a smart
management of the domains in the sequential scanning does not prevent the
adoption of other improvements, such as filters or the use of indexes.

In this section we show how the conditions presented so far can be used
in pattern matching algorithms to manage the domains associated with the
query nodes. In particular, for each query node identified by its pre-order
value i, we assume that the post(i) operator accesses its post-order value and
the l(i) operator accesses its label, and we associate with it a domain Di together
with the minimum and the maximum post-order values of the data nodes
stored in Di (accessible by means of the minPost and maxPost operators,
respectively). In this context we will only show a sketch of the ideas and the
basic skeletons of the algorithms. Further, we will not analyze the content-index
optimized versions exploiting the properties described in Section 4.3.
For a full discussion of the complete algorithms and all their different versions
see Appendix B.
Nodes are scanned in sorted order of their pre-order values and insertions
in the domains are always performed on the top by means of the push operator.
Thus, the data nodes from the bottom up to the top of each domain are
in increasing pre-order. Moreover, each data node k in the pertinence
domain Dh consists of a pair (post(k), pointer to a node in Dprev(h)),
where prev(h) is h − 1 in the ordered case, whereas it is the parent of h in Q
in the unordered case. When the data node is inserted into Dh, its pointer
indicates the pair which is at the top of Dprev(h). In this way, the set of
data nodes in Dprev(h) from the bottom up to the data node pointed by k
implements ∆Σ^k_{prev(h)}. By recursively following such links from Dprev(h) to
D1, we can derive ∆Σ^k_{prev(prev(h))}, . . . , ∆Σ^k_1. Figure 4.12 shows the algorithm
skeletons with place-holders implementing the complete set of deletion conditions,
as suggested by Theorems 4.4, 4.5 and 4.6 (e.g. PRO1 for Condition
PRO1). Even if the implementation of the complete set of deletion conditions
ensures the compactness of the domains, putting the theoretical framework
into practice also means selecting the most effective reduction conditions with
respect to the pattern and the data tree. Indeed, in some cases the CPU
time spent to apply a condition is not repaid by the advantages gained on
the domain dimensions and/or the execution time. A deep analysis of this
aspect is provided in Section 4.7.
The three kinds of pattern matching considered share a common skeleton,
shown at the top of Figure 4.12. Line 0 determines the sequence of data nodes for
the sequential scanning. In the worst case it is the whole data tree; otherwise,
it can be reduced by filters or auxiliary structures as in [25, 32, 72], in which
case our theoretical framework can easily be adopted. Indeed, whenever the
pre-order value of the first occurrence first(l) of each label l in D and
of the last one last(l) are available, then, thanks to Condition PRO2, the
sequential scanning can start from first(l(1)).

(0) getRange(start, end);
(1) for each k where k ranges from start to end
(2)     and matches with the query node h: l(k)=l(h)

(a) PMatch(P):  (3) POP          (4) PRO2  (5) Lemma 4.15  (6) PRO1
(b) OTMatch(Q): (3) POT2 & POT3  (4) PRO2  (5) POT1        (6) Lemma 4.15  (7) PRO1
(c) UTMatch(Q): (3) POT2 & POT3  (4) PRU   (5) POT1        (6) Lemma 4.16

Figure 4.12: Pattern matching algorithms

In the ordered case, it can end at last(l(n)) due to Lemma 4.15 whereas, in the unordered case,
Lemma 4.16 suggests setting the end as the maximum value among last(l(l))
for each leaf l in the query. Note that, if the algorithms are performed on
compressed structural surrogates of the original data tree, then the first and
last values can be computed in the surrogate construction phase. Therefore,
assuming that the current data node k matches with the h-th query node, and
thus should be added to Dh (lines 1-2), the three algorithms implement the
required conditions in the most effective order. First, they try to delete nodes
by means of the conditions on post-orders (line 3); then they check whether
the intersection between domains at different steps is empty (line 4) and, in this
case, they delete all data nodes in the domains specified in the corresponding
Conditions; finally, they work on the current node. In particular, the twig
algorithms implement Condition POT1 to check whether k can be added
to Dh, and all the algorithms verify whether solutions can be generated. Finally,
through the PRO1 code fragment, the ordered algorithms delete k if it is the
last node.
As to the PMatch(P) algorithm, domains can be treated as stacks, that
is deletions are implemented following a LIFO policy by means of the pop
operator. Indeed, node deletions are only performed in the POP fragment
which corresponds to the following code:

(1) for each Di where i ranges from 1 to n


(2) while(¬isEmpty(Di ) ∧ post(top(Di ))<post(k))
(3) pop(Di );

Instead of checking all nodes as specified in Condition POP, we stop looking
at the nodes in Di whenever post(top(Di)) > post(k) (line 2). This
fully implements Condition POP because, if post(top(Di)) > post(k), then

post(si )>post(k) for each si ∈ Di and thus Condition POP can no longer
be applied. Moreover, the fact that domains are stacks allows us to implement
Condition PRO2 (isEmpty(Di) checks whether Di is empty, whereas
empty(Di′) empties Di′):

(1) for each Di where i ranges from 1 to n


(2) if(isEmpty(Di ))
(3) for each Di′ where i′ ranges from i + 1 to n
(4) empty(Di′);
(5) if(¬isEmpty(Dh−1 ))
(6) push(Dh ,(post(k),pointerToTop(Dh−1 )));

where the addition of k to Dh takes place at line 6. Observe that, instead of
checking the intersection between the states of the domains at different steps
as required by Condition PRO2, we only check whether Di is empty (line
2). Indeed, it can be shown that, in order to delete the nodes belonging to a
domain Di at step k, it is first necessary to delete the nodes belonging to Di
at a step preceding the k-th one. More details can be found in Appendix B.
Notice that, when both POP and PRO2 are applied, the external cycle shared
by both conditions (line 1) is merged. Finally, the Lemma 4.15 code
fragment is:

(1) if(h = n)
(2) showSolutions(h,1);

where showSolutions(h,1) is a recursive function implementing Theorem


4.4 and the PRO1 one is:

(1) if(h = n)
(2) pop(Dh );
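
To give a concrete feeling of how these fragments cooperate, the following self-contained Python sketch combines POP, PRO2, the solution generation of Lemma 4.15 and PRO1 for the path case. It is only an illustration of the ideas above (the real PMatch works on tree signatures and richer domain entries): the query path is given as a list of labels from root to leaf, the data tree as a list of (label, post-order) pairs in pre-order, and only the number of answers is returned.

# Simplified, didactic sketch of stack-based path matching; not the actual
# XSiter implementation.

def pmatch_count(query_labels, data_nodes):
    n = len(query_labels)
    domains = [[] for _ in range(n)]          # D_1..D_n as stacks of (post, ptr)

    def count_answers(level, upto):
        # recursively combine the entries reachable through the pointers
        if level < 0:
            return 1
        return sum(count_answers(level - 1, domains[level][idx][1])
                   for idx in range(upto + 1))

    answers = 0
    for label, post in data_nodes:            # sequential scan in pre-order
        # POP: drop nodes that cannot be ancestors of the current node
        for D in domains:
            while D and D[-1][0] < post:
                D.pop()
        # PRO2: if a domain is empty, all the following ones must be emptied
        for i in range(n - 1):
            if not domains[i]:
                for j in range(i + 1, n):
                    domains[j].clear()
                break
        # insertion of the current node into every matching domain
        for h in reversed(range(n)):
            if query_labels[h] != label:
                continue
            if h == 0:
                domains[0].append((post, -1))
            elif domains[h - 1]:
                domains[h].append((post, len(domains[h - 1]) - 1))
            # Lemma 4.15: new answers only when the last query node matches
            if h == n - 1 and domains[h]:
                answers += count_answers(n - 2, domains[h][-1][1])
                domains[h].pop()              # PRO1: the leaf entry is no longer needed
    return answers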

The domains of the other two algorithms, OTMatch(Q) and UTMatch(Q), cannot
be stacks because they are not ordered on post-order values; thus, deletions can
be applied at any position of the domains. In these cases, the code fragment
corresponding to POT2 & POT3 is:

(1) for each Di where i ranges from 1 to n


(2) for each si in Di in ascending order
(3) if(post(si )<post(k) ∧ isCleanable(i,si ))
(4) pos ← index(Di ,si );
(5) delete(Di ,si );
(6) if(i ≠ n)
(7) updateLists(i,pos);

where the boolean function isCleanable() checks whether si can be deleted.
In particular, if i is the root, it simply returns true (Condition POT3);
otherwise, it checks the conditions expressed in Condition POT2. In this
case, the links connecting each domain Di with the domains of the descendants
of i in the twig Q and the minPost operator are exploited to speed up the
process. In particular, instead of checking the post-order value of each data
node in the domains, we check whether minPost(D) > post(k) for each domain
D. Whenever a node si is deleted, the updateLists() function updates the
pointers of all the nodes pointing to si in order to make them point to the node
below si. Such an update is performed in descending order and stops when
a node pointing to a node below si is accessed.
As to PRO2 and PRU, they consist of two code fragments. The first one
is the application of Conditions PRO2 and PRU to the current data node k:

(1) if(¬isEmpty(Dprev(h) ))
(2) push(Dh ,(post(k),pointerToTop(Dprev(h) )));

where it is sufficient to check only Dprev(h) because, whenever Conditions
PRO2 and PRU are applied, if a domain Di is empty then all the domains
“following” Di are emptied. These emptyings are performed by the second
fragment, which is the application of Conditions PRO2 and PRU to a node
k′ preceding k and thus already belonging to a domain Dh′. In this case,
k′ is deleted in the updateLists() procedure when, due to the deletions
applied in the POT2 & POT3 code fragment, its pointer becomes dangling.
We recall that, for instance, ∆Σ^{k′}_{prev(h′)} is implemented by the portion of
Dprev(h′) between the bottom and the data node pointed by k′, whereas ∆Σ^k_{prev(h′)}
is the current state of Dprev(h′). Thus, intuitively, if the pointer of k′ is
dangling it means that ∆Σ^{k′}_{prev(h′)} ∩ ∆Σ^k_{prev(h′)} = ∅, as required by Conditions
PRO2 and PRU. The same reasoning can be recursively applied to the other domains
∆Σ^k_{prev(prev(h′))}, . . . , ∆Σ^k_1. As to POT1, the code fragment is:

(1) if (isNeeded(h,k))
(2) push(Dh ,(post(k),pointerToTop(Dprev(h) )));

where the boolean function isNeeded() checks the condition shown in Condition
POT1 by using the minPost(D) and maxPost(D) values of each
domain D, instead of comparing k with each data node in D. Obviously,
whenever both POT1 and PRO2 or PRU are applied, the two conditions of
lines 1 are put together: ¬isEmpty(Dprev(h)) ∧ isNeeded(h,k). By analogy
with the path matching algorithm, Lemma 4.15 and Lemma 4.16 check whether
new solutions can be generated and, in this case, call recursive functions
implementing Theorems 4.5 and 4.6, respectively.

4.5 Unordered decomposition approach


In this section we propose an alternative approach, specific to unordered tree
matching, which in certain querying scenarios can provide efficiency equivalent
to (or even better than) that of the “standard” algorithms presented
in the previous section. The idea is not to consider the twig as a
whole but to decompose it into a collection of root-to-leaf paths and search
for their embedding in the data trees. Then the structurally consistent path
qualifications are joined to find unordered query tree inclusions in the data.
More formally, suppose that the data tree D is specified by the signature sig(D) and the
query tree Q by the signature sig(Q). The query tree Q is decomposed
into a set of root-to-leaf paths Pi and the inclusion of the corresponding
signatures sig(Pi) in the data signature sig(D) is evaluated. Any path Pi
represents all (and only) the ancestor-descendant relationships between the
involved nodes. Thus, an ordered inclusion of sig(Pi) in sig(D) states that
a mapping, preserving the ancestor-descendant relationships, exists from the
nodes in Pi to some nodes in D. If there are structurally consistent answers
to the ordered inclusion of all the paths Pi in D, the unordered tree inclusion
of Q in D is found. In principle, the decomposition approach consists of the
following three steps:

1. decomposition of the query Q into a set of paths Pi ;

2. evaluation of the inclusion of the corresponding signatures sig(Pi ) in


the data signature sig(D);

3. identification of the set of answers to the unordered inclusion of Q in


D.

The query decomposition process transforms a query twig into a set of
root-to-leaf paths so that the ordered tree inclusion can be safely applied.
For efficiency reasons, we sort the paths on the basis of their selectivity, so
that in the next phase the more selective paths are evaluated before the less
selective ones. The outcome of this phase is an ordered set rew(Q) of the sub-signatures
sub sigPj(Q) defined by the index sets Pj, for each leaf j. See [138]
for more details. Evaluating the path inclusions is a quite straightforward
operation (see the previous sections for the required properties); therefore, we
will now focus on the last point, i.e. identifying the set of answers to the
unordered inclusion of Q in D.
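
The first step can be illustrated by the following Python sketch, which decomposes a twig given as parent links into its root-to-leaf index sets; the selectivity-based ordering and the signature machinery are omitted, and the names are illustrative rather than the actual XSiter code.

# Given the twig as a dict mapping each node (query pre-order value) to its
# parent (None for the root), produce one index set per leaf, containing the
# pre-order values on the root-to-leaf path.

def decompose(parent):
    internal = {p for p in parent.values() if p is not None}
    leaves = sorted(n for n in parent if n not in internal)
    paths = []
    for leaf in leaves:
        path, node = [], leaf
        while node is not None:
            path.append(node)
            node = parent[node]
        paths.append(sorted(path))                # index set ordered by pre-order
    return paths

# For the first query of Figure 4.13 (root 1 with leaves 2 and 3):
# decompose({1: None, 2: 1, 3: 1}) -> [[1, 2], [1, 3]]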

[Figure 4.13: Examples for the decomposition approach. Unordered twig query 1: A (1,3) with children F (2,1) and B (3,2). Unordered twig query 2: A (1,4) with children F (2,1), B (3,2) and F (4,3). Data tree: A (1,5) with children A (2,3) and F (5,4); A (2,3) has children B (3,1) and C (4,2).]

4.5.1 Identification of the answer set


The answer set ansQ (D) of the unordered inclusion of Q in D can be deter-
mined by joining compatible answer sets ansP (D), for all P ∈ rew(Q). The
main problem is to establish how to join the answers for the paths in rew(Q)
to get the answers of the unordered inclusion of Q in D. Not all pairs of
answers of two distinct sets are necessarily “joinable”. The condition is that
any pair of paths Pi and Pj share a common sub-path (at least the root) and
differ in the other nodes (at least the leaves). Such commonalities and differences
must find a correspondence in any pair of index sets Si ∈ ansPi(D)
and Sj ∈ ansPj(D), respectively, for them to be joinable. In this case,
we say that Si ∈ ansPi(D) and Sj ∈ ansPj(D) are structurally consistent.

Example 4.9 Consider Figure 4.13, in particular the first unordered query
shown, which we will call Q, and the data tree, D. We have sig(Q) =
⟨a, 3, 4, 0; f, 1, 3, 1; b, 2, 4, 1⟩ and sig(D) = ⟨a, 5; a, 3; b, 1; c, 2; f, 4⟩. The only
sub-signature qualifying the unordered tree inclusion of Q in D is defined by
the index set {1, 5, 3}, and the corresponding sub-signature is sub sig{1,3,5}(D)
= ⟨a, 5; b, 1; f, 4⟩. Notice that the index set {1, 5, 3} satisfies both conditions
of Lemma 4.3, whereas the index set {2, 5, 3} only matches at the level of
node names but is not a qualifying one. The rewriting of Q gives rise
to the following paths: rew(Q) = {P2, P3}, where P2 = {1, 2} and P3 =
{1, 3}, and the outcome of their evaluation is ansP2 = {{1, 5}} and ansP3 =
{{1, 3}, {2, 3}}. The common sub-path between P2 and P3 is P2 ∩ P3 =
{1}. The index 1 occurs in the first position both in P2 and P3. From the
Cartesian product of ansP2(D) and ansP3(D) it follows that the index sets
{1, 5} ∈ ansP2(D) and {1, 3} ∈ ansP3(D) are structurally consistent, as they
share the same value in the first position and have different values in the
second position, whereas {1, 5} ∈ ansP2(D) and {2, 3} ∈ ansP3(D) are not
structurally consistent and thus not joinable. ¤

The following definition states the meaning of structural consistency for two
generic subtrees Ti and Tj of Q – paths Pi and Pj are particular instances of
Ti and Tj.

Definition 4.3 (Structural consistency) Let Q be a query twig, D a data
tree, Ti = {t_i^1, . . . , t_i^n} and Tj = {t_j^1, . . . , t_j^m} two ordered sets of indexes determining
sub sigTi(Q) and sub sigTj(Q), respectively, and ansTi(D) and ansTj(D)
the answers of the unordered inclusion of Ti and Tj in D, respectively.
Si = {s_i^1, . . . , s_i^n} ∈ ansTi(D) and Sj = {s_j^1, . . . , s_j^m} ∈ ansTj(D) are
structurally consistent if:

• for each pair of common indexes t_i^h = t_j^k, s_i^h = s_j^k;

• for each pair of different indexes t_i^h ≠ t_j^k, s_i^h ≠ s_j^k.

Definition 4.4 (Join of answers) Given two structurally consistent answers
Si ∈ ansTi(D) and Sj ∈ ansTj(D), where Ti = {t_i^1, . . . , t_i^n}, Tj =
{t_j^1, . . . , t_j^m}, Si = {s_i^1, . . . , s_i^n} and Sj = {s_j^1, . . . , s_j^m}, the join of Si and Sj,
Si ⋈ Sj, is defined on the ordered set Ti ∪ Tj = {t^1, . . . , t^k} as the index set
{s^1, . . . , s^k} where:

• for each h = 1, . . . , n, there exists l ∈ {1, . . . , k} such that t_i^h = t^l and s_i^h = s^l;

• for each h = 1, . . . , m, there exists l ∈ {1, . . . , k} such that t_j^h = t^l and
s_j^h = s^l.

Any answer to the unordered inclusion of Q in D is the result of a sequence
of joins of structurally consistent answers, one for each P ∈ rew(Q),
identifying distinct paths in sig(D). The answer set ansQ(D) can thus be
computed by sequentially joining the sets of answers of the evaluation of the
path queries. We denote such an operation as the structural join.

Definition 4.5 (Structural join) Let Q be a query twig, D a data tree, Ti
and Tj two ordered sets of indexes determining sub sigTi(Q) and sub sigTj(Q),
respectively, and ansTi(D) and ansTj(D) the answers of the unordered inclusions
of Ti and Tj in D, respectively.
The structural join sj(ansTi(D), ansTj(D)) between the two sets ansTi(D)
and ansTj(D) is the set ansT(D) where:

• T = {t^1, . . . , t^k} is the ordered set obtained by the union Ti ∪ Tj of the
ordered sets Ti and Tj;

• ansT(D) contains the join Si ⋈ Sj of each pair of structurally consistent
answers (Si ∈ ansTi(D), Sj ∈ ansTj(D)).

[Figure 4.14: Structural join of Example 4.9. ansP2(D) (query P2 = {1, 2}): {1, 5}. ansP3(D) (query P3 = {1, 3}): {1, 3}, {2, 3}. sj(ansP2(D), ansP3(D)) (query P2 ∪ P3 = {1, 2, 3}): {1, 5, 3}.]

The structural join sj(ansTi (D), ansTj (D)) thus returns an answer set de-
fined on the union of two sub-queries Ti and Tj as the join of the structurally
consistent answers of ansTi (D) and ansTj (D). Starting from the set of an-
swers {ansPx1 (D), . . . , ansPxk (D)} for paths in rew(Q), we get the answer
set ansQ (D) identifying the unordered inclusion of Q in D by incrementally
merging the answer sets by means of the structural join. Since the structural
join operator is associative and symmetric, we can compute ansQ (D) as:

ansQ (D) = sj(ansPx1 (D), . . . , ansPxk (D)) (4.1)

where rew(Q) = {Px1 , . . . , Pxk }.
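
A compact Python sketch of Definitions 4.3–4.5 and of Eq. 4.1 is the following, where an answer is modelled as a dictionary mapping query indexes to data indexes (a simplification of the index-set notation; not the actual XSiter implementation).

from itertools import product

def structurally_consistent(si, sj):
    """Def. 4.3: common query indexes map to the same data node,
    different query indexes map to different data nodes."""
    for ti, sv_i in si.items():
        for tj, sv_j in sj.items():
            if ti == tj and sv_i != sv_j:
                return False
            if ti != tj and sv_i == sv_j:
                return False
    return True

def join(si, sj):
    """Def. 4.4: the join is simply the union of the two mappings."""
    return {**si, **sj}

def structural_join(ans_ti, ans_tj):
    """Def. 4.5: join every structurally consistent pair of answers."""
    return [join(si, sj) for si, sj in product(ans_ti, ans_tj)
            if structurally_consistent(si, sj)]

def answer_set(path_answer_sets):
    """Eq. 4.1: incrementally merge the path answer sets."""
    result = path_answer_sets[0]
    for ans in path_answer_sets[1:]:
        result = structural_join(result, ans)
    return result

# Example 4.9: ansP2 = [{1: 1, 2: 5}], ansP3 = [{1: 1, 3: 3}, {1: 2, 3: 3}]
# answer_set([ansP2, ansP3]) -> [{1: 1, 2: 5, 3: 3}], i.e. the index set {1, 5, 3}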


Example 4.10 The answer set ansQ (D) of Example 4.9 is the outcome of
the structural join sj(ansP2 (D), ansP3 (D)) = ansP2 ∪P3 (D) where P2 ∪ P3 =
{1, 2} ∪ {1, 3} is the ordered set {1, 2, 3}. The answers to the individual
paths and the final answers are shown in Figure 4.14 (the first line of each
table represents the query). It joins the only pair of structurally consistent
answers: {1, 5} ∈ ansP2 (D) and {1, 3} ∈ ansP3 (D). ¤

Example 4.11 Consider Figure 4.13 again. In this example we show the
evaluation of the unordered tree inclusion of the second twig query depicted,
which we will call Q, in the data tree D. It can be easily verified that there
is no qualifying sub-signature since at most two of the three paths find a
correspondence in the data tree.
The rewriting phase produces the set rew(Q) = {P2 , P3 , P4 } where P2 =
{1, 2}, P3 = {1, 3}, and P4 = {1, 4}. The final result ansQ (D) is the outcome
of the structural join:
sj(ansP2 (D), ansP3 (D), ansP4 (D)) =
= sj(sj(ansP2 (D), ansP3 (D)), ansP4 (D)) = ∅
The answer sets of the separate paths and of sj(ansP2 (D), ansP3 (D)) are
shown in Figure 4.15. The final result is empty since the only pair of joinable
answers {1, 5, 3} ∈ sj(ansP2 (D), ansP3 (D)) and {1, 5} ∈ ansP4 (D) is not
structurally consistent: the two different query nodes 2 ∈ P2 ∪ P3 and 4 ∈ P4

[Figure 4.15: Structural join of Example 4.11. ansP2(D) (query P2 = {1, 2}): {1, 5}. ansP3(D) (query P3 = {1, 3}): {1, 3}, {2, 3}. ansP4(D) (query P4 = {1, 4}): {1, 5}. sj(ansP2(D), ansP3(D)) (query P2 ∪ P3 = {1, 2, 3}): {1, 5, 3}.]

correspond to the same data node 5. It means that there are not as many
data tree paths as query tree paths. ¤

Theorem 4.7 Given a query twig Q and a data tree D, the answer set
ansQ (D) as defined by Eq. 4.1 contains all and only the index sets S quali-
fying the unordered inclusion of Q in D according to Lemma 4.3.

4.5.2 Efficient computation of the answer set


In the previous section, we have specified two distinct phases for the decom-
position approach for unordered tree pattern matching: the computation of
the answer set for each root-to-leaf path of the query and the structural join
of such sets. The main drawback of this approach is that many intermediate
results may not be part of any final answer. In the following, we show how
these two phases can be merged into one to avoid unnecessary computations.
The basic idea is to evaluate at each step the most selective path among
the available ones and to directly combine the partial results computed with
structurally consistent answers of the paths.
The full algorithm is depicted in Figure 4.16. It makes use of the pop
operation which extracts the next element from the ordered set of paths
rew(Q). The algorithm essentially computes the answer set by incrementally
joining the partial answers collected up to that moment with the answer set
of the next path P in rew(Q). As paths are sorted by their selectivity, P is
the most selective path among those which have not been evaluated yet. In
particular, from step 1 to step 3, the algorithm initializes the partial query pQ
evaluated up to that moment to the most selective path P and stores in the partial
answer set anspQ(D) the evaluation of the inclusion of pQ in D. From step 4
to step 12, it iterates the process by joining the partial answer set anspQ(D)
with the answer set ansP(D) of the next path P of rew(Q). Notice that,
at each step, it does not actually compute first the answer set ansP(D) and

Input: the paths of the rewriting phase rew(Q)
Output: ansQ(D)
Algorithm:

(1) P = pop(rew(Q));
(2) pQ = P;
(3) evaluate anspQ(D);
(4) while((rew(Q) not empty) AND (anspQ(D) not empty))
(5)   P = pop(rew(Q));
(6)   pP = P \ (P ∩ pQ);
(7)   tk is the parent of pP, k is its position in pQ;
(8)   PAns = ∅;
(9)   for each answer S in anspQ(D)
(10)    evaluate anspP(sub sig{sk+1,...,ffsk−1}(D));
(11)    if(anspP(sub sig{sk+1,...,ffsk−1}(D)) not empty)
(12)      add sj({S}, anspP(sub sig{sk+1,...,ffsk−1}(D))) to PAns;
(13)  pQ = pQ ∪ P;
(14)  anspQ(D) = PAns;

Figure 4.16: The unordered tree pattern evaluation algorithm

the structural join sj(anspQ(D), ansP(D)) as shown in Eq. 4.1, but rather
applies a sort of nested loop algorithm in order to perform the two phases
in one shot. As each pair of index sets must be structurally consistent in
order to be joinable, we compute only those answers in ansP(D) which are
structurally consistent with some of the answers in anspQ(D). As a matter
of fact, only such answers may be part of the answers to Q. In order to do
so, the algorithm tries to extend each answer in anspQ(D) to the answers to
pQ ∪ P by evaluating only the sub-path of P which has not been evaluated
in pQ. In particular, step 6 stores in the sub-path pP the part of the path
P to be evaluated which is not in common with the query pQ evaluated up
to that moment: P \ (P ∩ pQ). Step 7 identifies tk as the parent of the
partial path pP, where k is its position in pQ. For instance, considering
Example 4.9, the two paths P2 and P3 of the query Q are depicted in Figure
4.17-a. If rew(Q) = {P2, P3}, then at step 5 pQ = P2 and, as the part of the
path P3 corresponding to the query node a has already been evaluated while
evaluating P2, the partial path pP to be evaluated and the parent tk of pP
are depicted in Figure 4.17-b.
For each index set S ∈ anspQ(D), each index set in ansP(D) which is
structurally consistent with S must share the same values in the positions
corresponding to the common sub-path P ∩ pQ. In other words, we assume

[Figure 4.17: Evaluation of paths in the algorithm of Figure 4.16: an example. (a) The two root-to-leaf paths P2 = A–F and P3 = A–B of the query. (b) The partial query pQ = A–F evaluated so far and the residual sub-path pP = B, whose parent tk is the query root A.]

that the part of the path P which is common to pQ has already been evalu-
ated and that the indexes of the data nodes matching P ∩ pQ are contained
in S. In particular, the index sk in S actually represents the entry of the
data node matching the query node corresponding to tk . Thus, in order to
compute the answers in ansP (D) that are structurally consistent with S and,
then, join with S, the algorithm extends S to the answers to P ∪ pQ by only
evaluating in the “right” sub-tree of the data tree the inclusion of the part
pP of the path P which has not been evaluated yet (step 10). As the path P
has been split into two branches P ∩ pQ and pP , where tk is the parent of pP
and S contains a set of indexes matching P ∩ pQ, the evaluation of pP must
be limited to the descendants of the data node d_{sk}, which in the tree signature
correspond to the sequence of nodes having pre-order values from sk + 1 up
to ff_{sk} − 1. Then it joins S with such answer set by only checking that
different query entries correspond to different data entries (step 12). Notice
that, in step 10, by shrinking the index interval to a limited portion of the
data signature, we are able to reduce the computing time for the sequence
inclusion evaluation.
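
In the signature representation, the restriction of step 10 boils down to a simple index interval, as in the sketch below (assuming the first-following values ff are available; the names are illustrative, not the actual XSiter structures).

# The descendants of the node with pre-order s_k occupy exactly the contiguous
# pre-order interval [s_k + 1, ff(s_k) - 1].

def descendant_range(sk, ff):
    """Return the (inclusive) pre-order interval of the descendants of s_k;
    an empty interval (lo > hi) means s_k is a leaf."""
    lo, hi = sk + 1, ff[sk] - 1
    return lo, hi

# Example 4.12: for the data tree of Figure 4.13, the node with pre-order 1
# (ff = 6) yields descendant_range(1, {1: 6}) == (2, 5), i.e. the sub-signature
# defined by the indexes {2, 3, 4, 5} used to evaluate the residual path.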
The algorithm ends when we have evaluated all the paths in rew(Q) or
when the partial answer set collected up to that moment anspQ (D) becomes
empty. The latter case occurs when we evaluate a path P having no answer
which is structurally consistent with those in anspQ (D): sj(anspQ (D), ansP
(D)) = ∅. In this case, for each answer S in anspQ (D) two alternatives exists.
Either the evaluation of the partial path pP fails (line 11), which means that
none of the answers in ansP (D) share the same values of S in the positions
corresponding to the common sub-path P ∩pQ, or the structural join between
S and the answers to pP fails (line 12), which means that some of the answers
in ansP (D) share the same values of S in positions corresponding to different
indexes in P and pQ.
Example 4.12 Let us apply the algorithm described in Figure 4.16 to Example
4.9, where the signatures involved are sig(Q) = ⟨a, 3, 4, 0; f, 1, 3, 1; b, 2, 4, 1⟩
and sig(D) = ⟨a, 5; a, 3; b, 1; c, 2; f, 4⟩. Since the two paths are of the
same length, we start from P2 = {1, 2}, whose answer set is ansP2(D) =
{{1, 5}}. Then, the algorithm essentially deals with the next path, e.g.
P3 = {1, 3}, in the way shown in Figure 4.17. It computes P2 ∩ P3 = {1},
tk = 1 where k = 1, and pP = {3}. It then considers the only index set
S = {1, 5} in ansP2(D) and stores in anspP(D) the index sets qualifying the
inclusion of the query sub sig{3}(Q) = ⟨b, 2, 4, 1⟩ on the sub-tree rooted by the
data node labelled with a and having index s1 = 1, that is, in the signature
sub sig{2,3,4,5}(D) = ⟨a, 3; b, 1; c, 2; f, 4⟩. The outcome is thus anspP(D) =
{{3}} and ansQ(D) = {{1, 5, 3}}. Being P2 and P3 of the same length, we
can also start from ansP3(D) = {{1, 3}, {2, 3}}. In this case pP = {2} while,
as in the previous case, tk = 1 where k = 1. We then consider the first
index set {1, 3} and evaluate anspP(D) on the descendants of the data node
having index s1 = 1. The answer anspP(sub sig{2,3,4,5}(D)) to the inclusion
of sub sig{pP}(Q) = ⟨f, 1, 3, 1⟩ in sub sig{2,3,4,5}(D) = ⟨a, 3; b, 1; c, 2; f, 4⟩ is
{{5}}. Thus ansQ(D) = {{1, 5, 3}}. For the next index set {2, 3}, it is required
to evaluate sub sig{pP}(Q) on the sub-tree rooted by s1 = 2, that is
sub sig{3,4}(D) = ⟨b, 1; c, 2⟩, and the answer set anspP(D) is empty. ¤

In summary, the proposed solution performs a small number of additional


operations on the paths of the query twig Q, but dramatically reduces the
number of operations on the data trees by avoiding the computation of useless
path answers. In this way, we remarkably reduce computing efforts.

4.6 The XML query processor architecture


All the twig matching algorithms we described in the previous sections have
been implemented in the XSiter (XML SIgnaTure Enhanced Retrieval) sys-
tem, a native and extensible XML query processor providing very high query-
ing performance in general XML querying settings. In this section we briefly
describe the main system architecture and features.
In Figure 4.18 we can see the abstract architecture of XSiter. The system
is essentially composed of three subsystems that manage, respectively, from
top to bottom, the interaction with the user (GUI), the import of documents
and queries and the query processing (Core System), and finally
the persistence of the managed documents (Store System). In the Store System,
datastore structures (Figure 4.19) are managed. Essentially, a datastore is a
collection used for aggregating conceptually related documents and for
keeping their internal representations persistent across different query sessions.

[Figure 4.18: Abstract Architecture of XSiter. The GUI subsystem (query specifier and result visualizer) sits on top of the Core System (query importer, document importer and query engine working on the internal query/document representations), which in turn relies on the Store System managing the collection of datastores.]

Queries and documents are transformed into an almost homogeneous representation;
in the following we will briefly describe only the document one,
since it is the most complex. The chosen internal representation addresses the main
issues involved in querying XML documents and consists of four main parts
(see the right part of Figure 4.19; a minimal sketch of such a representation is given after the list):

• the Tree Signature which, as we have seen, is used for solving
tree pattern matching (structural constraints);

• a simple document summary (Local Tag Index), used as a filter for
limiting the search space; in particular, for each tag that is present in
a document, the first and the last document positions are kept;

• the values, which are not included in the signature, are stored separately and
are evaluated only when needed (value constraints);

• values, i.e. element contents or attribute values, can be indexed (Content-Based
Indexes) in order to speed up the search process; this part is
optional and is generated according to user needs.
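
A hypothetical sketch of such a stored document record, mirroring the four parts listed above, could look as follows in Python (field names are illustrative assumptions, not XSiter's actual classes).

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class StoredDocument:
    # tree signature: one (label id, post-order) pair per node, in pre-order
    signature: List[Tuple[int, int]]
    # local tag index: tag id -> (first pre-order position, last pre-order position)
    local_tag_index: Dict[int, Tuple[int, int]]
    # values (element contents / attribute values), kept outside the signature
    values: Dict[int, str]
    # optional content-based indexes: tag id -> value -> sorted pre-order lists
    content_indexes: Optional[Dict[int, Dict[str, List[int]]]] = None

@dataclass
class Datastore:
    # global tag index: tag id -> set of documents containing that tag
    global_tag_index: Dict[int, set] = field(default_factory=dict)
    # mapping between textual tags and compact numeric ids
    tag_mapping: Dict[str, int] = field(default_factory=dict)
    documents: Dict[str, StoredDocument] = field(default_factory=dict)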
Finally, along with the document internal representations, in a datastore two
shared global structures are also kept (see left part of Figure 4.19), named

[Figure 4.19: Structure and content of a Datastore. Each datastore holds the shared Global TagIndex and TagMapping structures together with the stored documents in their internal representation (tree signature, local tag index, values and optional content-based indexes).]

Global TagIndex and TagMapping. The Global TagIndex keeps track of
which tags are present in each managed document and is used to quickly filter
out documents that cannot contain matches for a particular query. TagMapping,
instead, provides the mapping between textual tags and corresponding
unique numbers (ids), which are used to store tags in the most compact
way possible.
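
The Global TagIndex filter essentially amounts to an intersection of document sets, as in this small sketch (the index shape, tag id to set of document ids, is an assumption of the sketch, not the system's actual layout).

# A document can contain a match only if it contains every tag used by the query.

def candidate_documents(query_tag_ids, global_tag_index, all_documents):
    """Return the documents that contain all the query tags."""
    candidates = set(all_documents)
    for tag in query_tag_ids:
        candidates &= global_tag_index.get(tag, set())
    return candidates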
In general, as discussed in depth in the previous sections, all XSiter algorithms efficiently process the supporting data structures in a sequential way, skipping, whenever possible, areas that obviously contain no query answers. The key to their efficiency is to skip as much of the underlying data as possible and, at the same time, to never go back in the processed sequence. Repositories involving a large number of documents are efficiently managed and queried, also thanks to the document filters exploiting the Global TagIndex data. Further, a filter based on the Local TagIndexes is also available, which is used to limit the parts of a document that have to be scanned in order to solve a particular query. These features, together with the minimal memory usage of the algorithms, make the system suitable for querying and managing very large documents.
XSiter is currently implemented as a general-purpose system, meaning that few domain-specific optimizations have been applied so far, but the architecture was designed to be easily extensible. In particular, the current algorithms access content indexes through a high-level abstraction which enables us to use different kinds of indexes without changing the search algorithms. Further, specialized index structures can be easily integrated and exploited in our system in order to better match different application needs.

4.7 Experimental evaluation


In this section we present the results of the experimental evaluation of the XSiter query processor matching algorithms described in the previous sections. In particular, in the main part of the tests (Subsections 4.7.2 and 4.7.3) we show the performance of each of the algorithms of Section 4.4 and we evaluate the benefits offered by each of the conditions discussed, both in terms of the reduction of the domain sizes and of the time saved w.r.t. their standard execution. Further, in Subsection 4.7.4, we specifically evaluate the performance of our decomposition technique for unordered tree inclusion, as described in Section 4.5.

4.7.1 Experimental setting


The data sets

                Type               Dimensions                        Labels
         Real    Synth    Size      Depth   F/O     F/O (root)     #     Equi
DBLP      X               3814975   3       2-12    376698         20
Gen1               X      2000000   5       3       50000          7      X
Gen2               X      32000     8       3       30             7      X

Table 4.1: The XML data collections used for experimental evaluation

In our tests, we used both real and synthetic data sets. We present the results we obtained on one real and two synthetic collections (see Table 4.1). In order to show the performance of the matching algorithms on real-world "data-centric" XML scenarios, we chose the complete DBLP Computer Science Bibliography archive. The file consists of over 3.8 million elements. Table 4.2 shows more details about this XML archive. It is a very "flat" (3 levels) and wide (very high root fan-out) data tree, as is typical of "data-centric" XML documents. As in most real data sets, the distribution of the node labels is not equiprobable. In fact, the whole set presents repetitions of typical patterns (for instance, "article-author"). Since typical real data sets are very flat, this would not allow us to test some of the most complex conditions, such as POT2. For this reason, we also generated two synthetic data sets, Gen1 and Gen2, as random trees, using the following parameters: depth (5 and 8, respectively), fan-out (3) and root fan-out (50000 and 30, respectively). Both synthetic collections differ from the DBLP set in their label distribution, which is uniform. However, while Gen1

Middle-level                          Leaf-level
Element name       Occs               Element name      Occs
inproceedings      241244             author            823369
article            129468             title             376737
proceedings        3820               year              376709
incollection       1079               url               375192
book               1010               pages             361989
phdthesis          72                 booktitle         245795
mastersthesis      5                  ee                143068
                                      crossref          141681
                                      editor            8032
                                      publisher         5093
                                      isbn              4450
                                      school            77
Summary
Total number of elements     3814975
Total number of values       3438237
Maximum tree height          3

Table 4.2: DBLP Test-Collection Statistics

presents a similarly wide and slightly deeper tree, Gen2 is very deep and has a smaller root fan-out, thus simulating more "text-centric" trees. Note that the size of the collections (in particular the root fan-out) is not very important: our aim is mainly to analyze the behavior of the algorithms and the trends of the domain sizes, which typically become clear after a significant portion of the data has been scanned.
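The generator behind Gen1 and Gen2 is not detailed beyond these parameters; under that assumption, the following sketch only illustrates how random trees with a given depth, fan-out, root fan-out and uniformly distributed labels could be produced.

# Minimal sketch of a synthetic tree generator driven by depth, fan-out and
# root fan-out, with uniformly distributed labels. This is an assumption
# about the generator, not the actual tool used for the thesis experiments.
import random
from typing import List


def random_tree(depth: int, fanout: int, root_fanout: int,
                labels: List[str], rng: random.Random) -> dict:
    def subtree(level: int) -> dict:
        node = {"label": rng.choice(labels), "children": []}
        if level < depth:
            node["children"] = [subtree(level + 1) for _ in range(fanout)]
        return node

    # the root sits at level 1; its children start the uniformly labelled subtrees
    return {"label": "root", "children": [subtree(2) for _ in range(root_fanout)]}


# A Gen2-like collection: depth 8, fan-out 3, root fan-out 30, 7 equiprobable labels
gen2_like = random_tree(depth=8, fanout=3, root_fanout=30,
                        labels=list("ABCDEFG"), rng=random.Random(42))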

The testing queries

Figure 4.20 shows the testing queries we used to perform the main tests on the algorithms discussed in Section 4.4. The upper row shows the queries on the DBLP collection (denoted with Dn), while the lower one defines the queries for the synthetically generated collections (denoted with Gn). In both cases, we provide a path and two twigs. For DBLP, the query depth is limited to 2, therefore we differentiated the queries by means of an increasing fan-out. The labels are specifically chosen among the least selective ones, in order to test our algorithms in the most demanding settings. As to Gen1 and Gen2, we created queries G2 and G3, which are deeper than the DBLP ones. They are specifically conceived to test all the conditions which would not be activated in the shallow DBLP setting.

Figure 4.20: The queries used in the main tests — path D1 and twigs D2, D3 on the DBLP collection (rooted at article and inproceedings, over elements such as author, title, pages, ee, url, year, booktitle and crossref); path G1 and twigs G2, G3 on the synthetic collections (over labels A–G)

Figure 4.21: The query templates used in the decomposition approach tests — low-selectivity templates xL2-2, xL3-2 and xL8-2 rooted at inproceedings, and high-selectivity templates xH2-2, xH3-2 and xH7-3 rooted at book (xH7-3 at dblp, over book and phdthesis)

Further, we performed a specific series of tests on the decomposition approach (Section 4.5) by means of another set of queries, derived from the six query twig templates of Figure 4.21. Such templates present different element name selectivities, i.e. numbers of elements having a given element name, different branching factors, i.e. maximum numbers of sibling elements, and different tree heights. We refer to the templates as "xSb-h", where S stands for the element name selectivity and can be H(igh) or L(ow), b is the branching factor, and h the tree height. To understand the element name selectivity, refer to Table 4.2, showing the number of occurrences of each name in the DBLP data set. In particular, we used inproceedings for the low selectivity and book and phdthesis for the high selectivity.

Query    Time (msec)   # sols    sols/constr   MDS    Insertions #   % avoid.
DBLP collection
P-D1     1109          260540    1             1      390008         144%
O-D2     2890          553997    2.32          1.44   1038632        39%
O-D3     5984          98934     2.58          1.17   1843762        51%
U-D2     3750          559209    2.3           1.43   1041697        38%
U-D3     7266          149700    2.56          1.17   2245014        24%
Gen1 collection
P-G1     941           13209     1.09          1.05   390036         120%
O-G2     1392          848       1.59          1.32   392751         264%
O-G3     2394          94        2.35          1.59   393119         410%
U-G2     2238          1412      1.53          1.47   487100         193%
U-G3     2854          920       2.59          1.75   587464         240%
Gen2 collection
P-G1     90            1320      1.4           1.2    8425           69%
O-G2     113           884       7.82          2.93   8511           178%
O-G3     190           3173      6.47          5.99   8709           276%
U-G2     180           2788      10.64         5.4    11061          114%
U-G3     961           26854     132.9         7.92   14039          133%

Table 4.3: Pattern matching results for the different queries and collections

We have conducted experiments by using not only queries defined by the plain templates (designated as "NSb-h"), which only contain tree structural relationships, but also queries (designated as "VSb-h") where the templates are extended with predicates on the author name. Value accesses are supported by a content index. We have chosen highly content-selective predicates because we believe that this kind of query is especially significant for highly selective fields, such as the author name. On the other hand, the performance of queries with low-selectivity fields should be very close to that of the corresponding templates. In this way, we measure the response time of twelve queries, half of which contain predicates.
All experiments were executed on a Pentium 4 2.5 GHz Windows XP Professional workstation, equipped with 512 MB of RAM and a RAID0 cluster of two 80 GB EIDE disks with the NT file system (NTFS).

4.7.2 General performance evaluation


Table 4.3 summarizes the results obtained by applying our pattern matching
algorithms to solve the proposed queries, for each of the three collections.

Queries are denoted with a prefix signifying the applied matching algorithm
(P- for path matching, O- for ordered and U- for unordered twig matching).
For each query setting, we present the fundamental details of the algorithms
execution, i.e. the total execution time (in milliseconds), the total number
of solutions retrieved, the mean number of solutions constructed each time a
solution construction is started, the mean domain size (denoted with MDS),
the total number of node insertions, and the percentage of avoided insertions
with respect to their total number. Observe that in all the cases the time
is in the order of a few seconds (7 seconds at most for query U-D3), even
though each of the settings presents non-trivial query execution challenges: a
very wide data tree for both DBLP and Gen1, a considerable repetitiveness
for DBLP labels and patterns (notice the very high number of solutions,
over half a million for queries D2) and a very deep and involved tree for
Gen2 (notice the high number of solutions for each solution construction,
nearly one or even two orders of magnitude larger than for the other two
collections). Also observe the large number of node insertions, and especially
the high percentage of avoided insertions, which is very significant for all
collections (e.g. nearly a million avoided insertions for DBLP O-D3 query,
while in Gen1 and Gen2 twigs the number of non-inserted nodes is much
higher than the ones inserted). Finally, the MDS parameter is particularly
significant for all the queries. It represents the mean size of the domains
measured each time a solution construction is called for the whole size of
the collection. Its low values in each of the settings (less than 1.8 nodes
for DBLP and Gen1, reaching 7.92 for the most complex query in the deep
Gen2) testimony the good efficacy of our reduction conditions. In particular,
since the mean domains size is low, this means that the number of deletions
is very near to the number of insertions. Keeping the MDS low is essential
for efficiency reasons, since the time spent in constructing the solutions is
roughly proportional to the Cartesian product of the domains size, but, in
many cases, it is also essential for the good outcome of the matching, since
an overflow of the domains would mean a total failure.

4.7.3 Evaluating the impact of each condition


Now that the effectiveness of the algorithms and of the conditions is clear, we still need to quantify the benefits offered by enabling each of them. In the following, we analyze in depth the MDS, i.e. the behavior of the domains, and the execution time, measuring how much each of the conditions applied in the algorithms influences their trends. In order to simplify the analysis we discuss path and twig matching separately, identifying the most significant cases for each of them. We will present all the specific results and graphs only for

              Post-order conditions              Pre-order conditions
Case          POP    POT1    POT2    POT3        PRO1    PRO2 (PRU)
P-A            x
P-B / T-E                                         x
P-C / T-F                                                  x
T-A                           x       x
T-B                           x
T-C                                   x
T-D                   x

Table 4.4: Summary of the discussed cases (disabled conditions are marked with an 'x')

some of the most complex and interesting cases, briefly discussing the others in words. Table 4.4 presents a summary of the cases we will discuss, together with the associated disabled conditions.

Path matching

Let us start by considering the simpler path matching scenario, where we


distinguish between cases P-A, P-B and P-C. In the following, we refer to
the case where all the conditions are turned on as the “standard” (std) case.
The first case is P-A, where we turn off the post-order condition; it is expected to significantly degrade the matching performance, since such a condition is clearly the key to the high number of node deletions. In fact, case P-A produces an uncontrolled growth in the domains' size, preventing the conclusion of the matching for all the query settings, both because of domain size overflow and because of the consequently exploded execution time. Then come the cases involving the deactivation of pre-order conditions (P-B, P-C), which again influence deletions, but to a lesser extent. Case P-B means we do not empty the domains on the "right" when a given domain becomes empty (thus there will be "dangling" pointers), while in P-C the nodes in the last domain are no longer deleted after solution construction. Case P-B produces larger but still controlled domain sizes (20%-30% higher MDS than the "standard" cases), while the execution time is nearly unchanged (only 2% less). This is expected since, at least for short paths, the time required to apply the conditions offsets the shorter solution generation time. Notice that the modified path algorithm for case P-B is equivalent to the one proposed in [25]. Finally, with case P-C we obtained results which were almost identical to the standard case. The deletion of the nodes in the last stacks, which

          isCleanable()                            isNeeded()
Query     Calls      % true     % true      Calls      % false     % false
                     (POT3)     (POT2)                 (POT1p)     (POT1s)
DBLP collection
O-D2      1389881    17.36%     -           1439703    27.65%      0.21%
O-D3      7477979    3.23%      -           2780210    19.25%      14.43%
U-D2      1395093    17.29%     -           1439703    27.65%      n/a
U-D3      7866217    3.07%      -           2780212    19.25%      n/a
Gen1 collection
O-G2      417154     68.48%     15.14%      1429087    66.5%       6.02%
O-G3      456767     62.54%     14.68%      1999979    71.48%      8.86%
U-G2      542592     52.65%     21.74%      1429090    65.92%      n/a
U-G3      762319     37.47%     24.79%      1999979    70.63%      n/a
Gen2 collection
O-G2      27645      17.1%      7.48%       23679      54.71%      9.34%
O-G3      57733      8.17%      3.6%        32758      59.25%      14.15%
U-G2      55344      8.54%      6%          23679      53.28%      n/a
U-G3      109385     4.3%       4.67%       32758      57.14%      n/a

Table 4.5: Behavior of the isCleanable() and isNeeded() functions

would be immediately provided by condition PRO1, is equally carried out by the other conditions (i.e. POP, PRO2) just a few steps later, on average. This results in nearly identical execution times and mean domain sizes (nearly 6% larger).

Twig matching
As to twig matching, the number of available conditions requires a deeper analysis. As for paths, we will first inspect the post-order conditions (i.e. POT), which are the main source of avoided insertions and deletions. As shown in the algorithms (Section 4.4), the key functions which activate the POT conditions are: isCleanable() for deletions (exploiting POT2 and POT3) and isNeeded() (which becomes isNeededOrd() for the ordered and isNeededUnord() for the unordered case) for insertions (exploiting POT1). We started by analyzing the "percentage of success" of such functions for each call in each of the queries; Table 4.5 provides such statistics. For isCleanable(), "success" means allowing a node deletion (returning true, either for POT3 or POT2), while for isNeeded() it means preventing a useless insertion (returning false for POT1). Notice that POT1 can be satisfied by examining nodes in the parent domain (denoted in the table with POT1p) or,

(a) MDS, std. vs. T-A (O-G3, Gen1), plotted against the document pre-order
(b) MDS, std. vs. T-B and T-D — mean domain sizes per query:
              O-G2    O-G3    U-G2    U-G3    O-G2    O-G3    U-G2    U-G3
              (Gen1)  (Gen1)  (Gen1)  (Gen1)  (Gen2)  (Gen2)  (Gen2)  (Gen2)
Case std.     1.32    1.59    1.47    1.75    2.93    5.99    5.40    7.92
Case T-B      1.65    1.96    1.88    2.29    4.40    7.94    6.42    8.74
Case T-D      1.44    1.83    1.85    2.16    3.70    7.81    8.50    12.11

Figure 4.22: Comparison for mean domains size in different settings

only in the ordered case, in the sibling ones (POT1s in the table). The percentage
of success for both functions is considerable in all cases. In DBLP, the percentage of deletion success is lower than in the other collections. This is due both to the more repetitive and simpler structure of its data tree and to the inapplicability of condition POT2 (DBLP queries have only two levels). Such condition proves instead quite useful in the other collections, particularly for unordered matching, where it is applied almost as often as POT3. As to POT1, the main contribution is given by situation POT1p, while POT1s can also give a good contribution in the ordered matching. For the two functions we are discussing, we also performed some CPU utilization tests and found out that their contribution is worthwhile also from an execution time point of view, since their percentage of CPU utilization is typically less than 4% of the total CPU time.
To quantify the specific effects of the conditions on the domain size and time, we distinguish cases T-A to T-D (see Table 4.4). The first three cases will clearly produce fewer node deletions, while case T-D will allow useless insertions. Case T-A is conceptually equivalent to case P-A, i.e. deletions are almost totally prevented. As in P-A, we found out that, as expected, the domain sizes grow uncontrollably, preventing the termination of the matching in acceptable time. As an example, the graph in Figure 4.22-a compares, after each data node, the mean stack size of case T-A to the standard one for query O-G3 on collection Gen1. As to cases T-B and T-D, the algorithms generally produced domain sizes which were larger than in the standard case (see Figure 4.22-b). Even if the difference in size may not seem particularly significant, we have to consider that the time spent in constructing the solutions is roughly proportional to the size of the Cartesian product of the domains, thus the differences in execution time may become

Execution time (msec), collection Gen2:
              O-G2    O-G2b   O-G3    U-G2    U-G2b   U-G3
Case std.     113     188     190     180     592     961
Case T-D      152     227     266     319     2516    2552

Figure 4.23: Comparison between time in different settings

more evident. For instance, if the domains are on average one and a half times larger, as for query U-G3 (collection Gen2, case T-D), each solution construction run becomes nearly 20 times longer. While for the simplest queries we found out that the execution time is still not much affected, since the time required to check the conditions offsets the shorter solution generation time, for the most complex settings the difference in execution time can be remarkable. As an example, Figure 4.23 shows, for the most complex queries and for collection Gen2, the comparison between the time of the standard case and that of case T-D. Notice that, in order to verify the execution time savings in more complex situations, we also employed new queries specifically for these tests, e.g. a modified version of query U-G2, named U-G2b, where second-level nodes have two children instead of one. As seen in the graph, the difference in execution time can reach a proportion of 1:5 (query U-G2b). The results obtained for case T-C are very different between the ordered and unordered settings. Disabling condition POT3 produces almost no variation in the ordered matching (condition POT2 produces the same deletions at the cost of a little more time spent in checking the hypotheses), while it proves essential for unordered matching (time and domain size grow uncontrollably). Note that if we disabled PRO1 together with POT3, case T-C would degenerate in the ordered setting too, since the last domain would not be empty and POT2 could no longer always be applied.
Finally, as for the paths, we can also briefly analyze the cases involving the deactivation of the pre-order conditions (PRO, PRU), denoted as T-E and T-F in Table 4.4. Simulating case T-E (which is conceptually similar to case P-B for paths) means disabling the deletions produced by the pointer updates, i.e. there will be dangling pointers. Unlike P-B, such a case produces uncontrolled growth in domain size and in time, proving that such conditions are essential for more complex queries. As to case T-F, this is

Query    Twig       Twig       Evaluation     Evaluation       Permutation   Permutation   Permutation
         elem. #    pred. #    solutions #    Decomp. (sec)    N             mean (sec)    total (sec)
NH2-2    3          0          1343           0.016            2             0.014         0.028
NH3-2    4          0          1343           0.016            6             0.015         0.105
NH7-3    10         0          90720          1.1              288           0.9           259.2
NL2-2    3          0          559209         2.2              2             2.28          4.56
NL3-2    4          0          559209         4.2              6             2.49          14.94
NL8-2    9          0          149700         7.7              40320         4.8           193536
VH2-2    3          1          1              0.015            2             0.014         0.028
VH3-2    4          1          1              0.016            6             0.016         0.096
VH7-3    10         2          1              0.031            288           0.03          8.64
VL2-2    3          1          39             0.65             2             0.832         1.664
VL3-2    4          1          36             0.69             6             1.1           6.6
VL8-2    9          1          29             0.718            40320         2.3           92736

Table 4.6: Performance comparison for unordered tree inclusion

equivalent to case P-C, and the results obtained for ordered matching confirm those discussed for that case.

4.7.4 Decomposition approach performance evaluation


In this section we provide a specific evaluation of the performance of our decomposition technique for unordered tree inclusion, as described in Section 4.5. In particular, we measure the time needed to process the different query twigs using the path decomposition approach, and compare the obtained results with the query processing performance of a naïve permutation approach. The permutation approach considers all the permutations of the query satisfying its ancestor-descendant relationships and computes the answers to the ordered inclusion of the corresponding signatures in the data signature; the union of the partial answers is the result of the unordered inclusion (a small sketch of this baseline is given below).
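The following sketch only illustrates the permutation baseline under simplifying assumptions: the query twig is a plain nested dictionary rather than an actual signature, and the ordered-inclusion routine it relies on is assumed rather than implemented.

# Sketch of the naive permutation baseline: generate every sibling ordering of
# the query twig (all orderings compatible with its ancestor-descendant
# relationships), solve each as an ordered inclusion, and union the answers.
from itertools import permutations, product


def sibling_orderings(node):
    """Yield one copy of `node` (a dict with 'label' and 'children') for every
    ordering of siblings, applied recursively."""
    child_variants = [list(sibling_orderings(c)) for c in node["children"]]
    for perm in permutations(range(len(node["children"]))):
        for combo in product(*(child_variants[i] for i in perm)):
            yield {"label": node["label"], "children": list(combo)}


def unordered_inclusion(query_twig, data_signature, ordered_inclusion):
    """Union of the ordered-inclusion answers over all sibling orderings;
    `ordered_inclusion` stands for any ordered tree-inclusion routine."""
    answers = set()
    for ordered_twig in sibling_orderings(query_twig):
        answers |= ordered_inclusion(ordered_twig, data_signature)
    return answers

The number of generated orderings is the product of the factorials of the sibling group sizes, which explains the factorial growth of column N in Table 4.6 (e.g. 8! = 40320 for the branching-8 templates).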
Table 4.6 summarizes the results of the unordered tree inclusion performance tests for both approaches we considered. For each query twig, the total number of elements and predicates, the number of solutions (inclusions) found in the data set, and the processing time, expressed in seconds, are reported. For the permutation approach, the number of needed permutations and the mean per-permutation processing time are also presented. It is evident that the decomposition approach is superior and achieves lower times in every case. In particular, with low branching factors (i.e. 2), such an approach is twice as fast for both selectivity settings. With higher branching factors (i.e. 3, 8) the speedup becomes larger and larger – the num-

ber of permutations required in the alternative approach grows factorially: for queries NL8-2 and VL8-2 the decomposition method is more than 25,000 times faster. The decomposition approach is particularly fast with the high-selectivity queries. Even for greater heights (i.e. in VH7-3), the processing time remains in the order of milliseconds.
For the decomposition method, as we did not have statistics on the path selectivity at our disposal, we measured the time needed to solve each query for each possible order of path evaluation and reported only the lowest one. As expected, we found that starting with the most highly selective paths always increases the query evaluation efficiency. In particular, the time spent is nearly proportional to the number of occurrences of the starting path in the data. Evaluating query NL2-2 starting with the title path produces a response time of 2.2 seconds, while starting with the less selective author path, the time would nearly double (3.9 sec.). This holds for all the query twigs – for NL8-2, the time ranges from 7.7 sec (crossref path) up to 15.7 sec (author path). Of course, for the predicate queries the best time is obtained by starting the evaluation from the value-enabled paths.
Finally, notice that the permutation approach also requires an initial
“startup” phase where all the different permutation twigs are generated; the
time used to generate such permutations is not taken into account.
Chapter 5

Approximate query answering


in heterogeneous XML
collections

In recent years, the continuous improvements in computational resources and telecommunications, along with the considerable drop in digitizing costs, have fostered the development of systems which are able to electronically store, access and diffuse via the Web a large number of digital documents and multimedia data. In such a sea of electronic information, the user can easily get lost in her/his struggle to find the information (s)he
requires. Heterogeneous collections of various types of documents, such as ac-
tual text documents or metadata on textual and/or multimedia documents,
are more and more widespread on the web. Think of the several available
portals and digital libraries offering search capabilities, for instance those pro-
viding scientific data and articles, or those assisting the users in finding the
best bargains for their shopping needs. Such repositories often collect data
coming from different sources: the documents are heterogeneous in the structures adopted for their representation, but related in the contents they deal with.
In this context, XML has quickly become the de facto standard for data exchange and for heterogeneous data representation over the Internet. This is also due to the recent emergence of wrappers (e.g. [12, 43]) for the translation of web documents into XML format. Along with XML, languages for describing the structures and data types of XML documents and for querying them are becoming more and more popular. Among the several languages proposed in recent years, XML Schema [132] and XQuery [16] are the W3C recommendation/working draft for the former and the latter purpose, respectively. Thus, in a large number of heterogeneous
web collections, data are most likely expressed in XML and associated with XML Schemas, while structural queries submitted to XML web search engines are written in XQuery, a language expressive enough to allow users to perform structural inquiries, going beyond the "flat" bag-of-words approaches of common plain-text search engines.
In order to exploit the data available in such document repositories, an
entire ensemble of systems and services is needed to help users to easily find
and access the information they are looking for. Sites offering access to large
document bases are now widely available all over the web, but they are still
far from perfect in delivering the information required by the user. Efficient
exact structural search techniques, like the ones described in Chapter 4, are
indeed necessary and should be exploited in the underlying search engines,
but they are not sufficient to fully answer the user needs in these scenarios.
In particular, one key issue which is still an open problem is the effective and efficient search among large numbers of "related" XML documents. Indeed, if, on the one hand, the adoption of XQuery allows users to perform structural inquiries, on the other hand such high flexibility could also mean more complexity: a user hardly ever knows the exact structure of all the documents contained in the document base. Further, XML documents about the same subject and describing the same reality, for instance compact disks in a music store, but coming from different sources, could use largely different structures and element names, even though they could all be useful to satisfy the user's information need.
Given these premises, the need for solutions that allow queries to be answered on all the useful documents of the document base, including the ones which do not exactly comply with the structural part of the query but are similar enough to it, becomes apparent. Further, it is also evident that such solutions should focus on the structural properties of the accessed information and, in order to be really effective, they should know the right meaning of the employed terminology. However, due to the ambiguity of natural languages, terms describing information usually have several meanings, and making their semantics explicit can be a very complex and tedious task. Indeed, while the problem of making the meanings of words explicit is usually delegated to human intervention, in many application contexts human intervention is not feasible.
In this chapter, we propose a series of techniques trying to give an answer to the above-mentioned needs and providing, altogether, an effective and efficient approach for approximate query answering in heterogeneous document bases in XML format. In particular:

• in Section 5.1 we propose the XML S3MART services [94], which are able

to approximate the user queries with respect to the different documents


available in a collection. Instead of working directly on the data, we
first exploit a reworking of the documents’ schemas (schema match-
ing), then with the extracted information we interpret and adapt the
structural components of the query (query rewriting);

• in Section 5.2 we propose a further service for automatic structural disambiguation [86, 87] which can prove valuable in enhancing the effectiveness of the matching (and rewriting) techniques. Indeed, the presented approach is completely generic and versatile and can be used to make explicit the meaning of a wide range of structure-based information, such as XML schemas, the structures of XML documents, web directories, and also ontologies.

In Section 5.3 we provide a discussion of related work on schema matching, query rewriting and structural disambiguation. In Section 5.4 we provide an extensive experimental evaluation of all the proposed techniques. Finally, in Section 5.5, we briefly describe how we plan to enhance the XML S3MART system in order to support distributed Peer-to-Peer (P2P) systems and, in particular, Peer Data Management System (PDMS) settings.

5.1 Matching and rewriting services


The services we propose rely on the information about the structures of
the XML documents which we suppose to be described in XML Schema.
A schema matching process extracts the semantic and structural similari-
ties between the schema elements, which are then exploited in the proper
query processing phase where we perform the rewriting of the submitted
queries, in order to make them compatible with the available documents’
structures. The queries produced by the rewriting phase can thus be issued
to a “standard” XML engine and enhance the effectiveness of the searches.
Such services have been implemented in our XML S3MART (XML Semantic Structural Schema Matcher for Automatic query RewriTing) system. In this section we give a brief overview of the XML S3MART motivations and functionalities and introduce how such a module can work in an open-architecture web repository offering advanced XML search functions. Then, in Sections 5.1.1 and 5.1.2 we specifically analyze the matching and rewriting features.
From a technological point of view, the principle we followed in planning,
designing and implementing the matching and rewriting functionalities was to
offer a solution allowing easy extensions of the offered features and promoting
information exchange between different systems. In fact, next generation web

Figure 5.1: The role of schema matching and query rewriting in an open-architecture web repository offering advanced XML search functions — interface logic (XQuery GUI client), business logic (XML S3MART) and data logic (data manager and search engine over the XML and multimedia repositories)

systems offering access to large XML document repositories should follow an open architecture standard and be partitioned into a series of modules which, together, deliver all the required functionalities to the users. Such modules should be autonomous but should cooperate in order to make the XML data available and accessible on the web; they should access data and be accessed by other modules, ultimately tying their functionalities together into services. For all these reasons, XML S3MART provides web services making use of SOAP which, together with the XML standard, give the architecture a high level of interoperability.
The matching and rewriting services offered by XML S3MART can be thought of as the middleware (Figure 5.1) of a system offering access to large XML repositories containing documents which are heterogeneous, i.e. incompatible from a structural point of view, but related in their contents. Such middleware interacts with other services in order to provide advanced search engine functionalities to the user. At the interface level, users can exploit a graphical user interface, such as [19], to query the available XML corpora by drawing their request on one of the XML Schemas (named the source schema). XML S3MART automatically rewrites the query expressed on the source schema into a set of XML queries, one for each of the XML schemas the other useful documents are associated with (the target schemas). Then, the resulting XML queries can be submitted to a standard underlying data manager and search engine, such as [28], as they are consistent with the structures of the useful documents in the corpus. The results are then gathered, ranked and sent to the user interface component. Notice that the returned results can be actual XML textual documents but also multimedia data for which XML metadata are available in the document base.
Let us now concentrate on the motivation and characteristics of our ap-
proximate query answering approach. The basic premise is that the struc-

Original query (upper left):
  FOR $x IN /musicStore
  WHERE $x/storage/stock/compactDisk/albumTitle = "Then comes the sun"
  RETURN $x/signboard/namesign

... rewritten query (lower left):
  FOR $x IN /cdStore
  WHERE $x/cd/cdTitle = "Then comes the sun"
  RETURN $x/name

Document in Repository (right):
  <cdStore>
    <name>Music World Shop</name>
    <address> ... </address>
    <cd>
      <cdTitle>Then comes the sun</cdTitle>
      <vocalist>Elisa</vocalist>
    </cd>
    ...
  </cdStore>

Figure 5.2: A given query is rewritten in order to be compliant with useful documents in the repository

tural parts of the documents, described by XML schemas, are used to search
the documents as they are involved in the query formulation. Due to the
intrinsic nature of semi-structured data, all such documents can be useful to answer a query only if, though being different, the target schemas share meaningful similarities with the source one, both structural (a similar structure of the underlying XML tree) and semantic (the employed terms have similar meanings). Consider for instance the query shown in the upper left part of Figure 5.2, asking for the names of the music stores selling
a particular album. The document shown in the right part of the figure
would clearly be useful to answer such need, however, since its structure and
element names are different, it would not be returned by a standard XML
search engine. In order to retrieve all such useful documents available in the
document base, thus fully exploiting the potentialities of the data, the query
needs to be rewritten (lower left part of Figure 5.2). Since such similarities are independent of the queries which could be issued, they are identified by a schema matching operation which is preliminary to the proper query processing phase. Then, using all the information extracted by such an analysis, the approximate query answering process is performed by first applying a query rewriting operation in a completely automatic, effective and efficient way. See also Figure 5.3, which depicts the role of the two operations and the interaction between them.

As a final remark, notice that some kind of approximation could also be required for the values expressed in the queries, as they usually concern the contents of the stored documents (texts, images, video, audio, etc.), thus requiring us to go beyond exact matching. We concentrate our attention only
on the structural parts of the submitted queries and we do not deal with the
problem of value approximation, which has been considered elsewhere (for
instance in [130]).

Figure 5.3: The XML S3MART matching and rewriting services — schema matching (structural expansion, semantic annotation and matching computation over the original XML Schemas) produces the matching information exploited by query rewriting, which translates a query submitted on the source schema into rewritten queries on the target schemas

5.1.1 Schema matching


The schema matching operation takes as input the set of XML schemas
characterizing the structural parts of the documents in the repository and,
for each pair of schemas, identifies the “best” matches between the attributes
and the elements of the two schemas. It is composed of three sub-processes (as shown in Figure 5.3), the first two of which, the structural expansion and the terminology disambiguation, are needed to maximize the effectiveness of the third phase, the actual matching one.

Structural Schema Expansion


The W3C XML Schema [132] recommendation defines the structure and data
types for XML documents. The purpose of a schema is to define a class of
XML documents. In XML Schema, there is a basic difference between com-
plex types, which allow elements as their content and may carry attributes,
and simple types, which cannot have element content and attributes. There
is also a major distinction between definitions which create new types (both
simple and complex), and declarations which enable elements and attributes
with specific names and types (both simple and complex) to appear in docu-
ment instances. An XML document referencing an XML schema uses (some
of) the elements introduced in the schema and the structural relationships
between them to describe the structure of the document itself.
In the structural schema expansion phase, each XML schema is modi-
fied and expanded in order to make the structural relationships between the
involved elements more explicit and thus to represent the class of XML doc-
uments it defines, i.e. the structural part of the XML documents referencing

Figure 5.4: Example of structural schema expansion (Schema A) — the original XML Schema fragment, with named complex types such as musicStoreType and locationType, is rewritten into an expanded schema whose nested element declarations directly reflect the underlying tree structure (musicStore with children location, signboard, storage; location with children town, country; ...)

it. As a matter of fact, searches are performed on the XML documents stored
in the repository and a query in XQuery usually contains paths expressing
structural relationships between elements and attributes. For instance the
path /musicStore/storage/stock selects all the stock elements that have
a storage parent and a musicStore grandparent which is the root element.
The resulting expanded schema file abstracts from several complexities of
the XML schema syntax, such as complex type definitions, element refer-
ences, global definitions, and so on, and ultimately better captures the tree
structure underlying the concepts expressed in the schema.
Consider, for instance, Figure 5.4, showing a fragment of an XML Schema describing the structural part of documents about music stores and their merchandise, along with a fragment of the corresponding expanded schema file and a representation of the underlying tree structure expressing the structural relationships between the elements which can appear in the XML documents complying with the schema. As can be seen from the figure, the original XML Schema contains, along with the element definitions whose importance is definitely central (i.e. elements musicStore, location), also type definitions (i.e. complex types musicStoreType, locationType) and regular expression keywords (i.e. all), which may interfere with or even distort the discovery of the real underlying tree structure, which is essential for an effective schema matching. In general, XML Schema constructs need to be resolved and rewritten in a more explicit way in order for the structure of the schema to be as similar as possible to its underlying conceptual tree structure involving elements and attributes. Going back to the example of Figure 5.4, the element location, for instance, is conceptually a child of musicStore: this relation is made explicit only in the expanded version of the schema, while in the original XML Schema location was the child of an all node, which was in turn the child of a complex type. Further, every complex type

and keyword is discarded.

In the resulting expanded schema, every element or attribute node has a name. Middle elements have children, and these can be deduced immediately from the new explicit structure. On the other hand, leaf elements (or attributes) can hold a textual value, whose primitive type is maintained and specified in the "type=..." attribute of the output schema. A minimal sketch of this expansion step is given below.
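The following sketch only illustrates the expansion idea on schemas like the one of Figure 5.4; it handles named complex types and the all/sequence/choice model groups, ignores element references and other XML Schema features, and is not the actual XML S3MART implementation.

# Illustrative sketch of structural schema expansion: resolve named complex
# types and look through model-group keywords (all/sequence/choice) so that
# only the element tree remains. Assumption-level sketch, not the actual
# XML S3MART code; element references (ref=) are not handled.
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"
WRAPPERS = {XSD + "all", XSD + "sequence", XSD + "choice", XSD + "complexType"}


def expand_schema(schema_text: str) -> dict:
    root = ET.fromstring(schema_text)
    named_types = {ct.get("name"): ct
                   for ct in root.iter(XSD + "complexType") if ct.get("name")}

    def child_elements(content):
        """Direct element declarations, looking through model-group wrappers."""
        queue, found = list(content), []
        while queue:
            node = queue.pop(0)
            if node.tag == XSD + "element":
                found.append(node)
            elif node.tag in WRAPPERS:
                queue.extend(list(node))
        return found

    def expand(elem):
        content = named_types.get(elem.get("type"), elem)
        return {"name": elem.get("name"),
                "children": [expand(c) for c in child_elements(content)]}

    return expand(root.find(XSD + "element"))

Applied to the fragment of Figure 5.4, such a function would return a nested dictionary rooted at musicStore, with location (and its town and country children) among its descendants.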

Terminology disambiguation
As discussed in the introduction, after the structural relationships of a schema have been made explicit by the expansion process, a further step is required in order to further refine and complete the information delivered by each schema, thus maximizing the effectiveness of the subsequent matching computation. This time the focus is on the semantics of the terms used in the element and attribute definitions. In this step each term is disambiguated, that is, its meaning is made explicit, as it will be used for the identification of the semantic similarities between the elements and attributes of the schemas, which actually rely on the distance between meanings. To this end, we exploit one of the best-known lexical resources for the English language: WordNet [100]. The WordNet (WN) lexical database is conceptually organized in synonym sets, or synsets, representing different meanings or senses. Each term in WN is usually associated with more than one synset, signifying that it is polysemous, i.e. it has more than one meaning (some preliminary WordNet concepts are also available in Appendix A.1). In this section the focus
is on the rewriting and matching computation techniques, therefore we will
consider term disambiguation as a semi-automatic operation where the op-
erator, by using an ad-hoc GUI, is required to “annotate” each term used in
each XML schema with the best candidate among the WN terms and, then,
to select one of its synsets. Our new structural disambiguation technique
greatly enhances this approach, providing completely automatic terminology
disambiguation, and will be specifically discussed later in this chapter, in
Section 5.2.

Matching Computation
The matching computation phase performs the actual matching operation
between the expanded annotated schemas made available by the previous
steps. For each pair of schemas, we identify the “best” matchings between
the attributes and the elements of the two schemas by considering both the
structure of the corresponding trees and the semantics of the involved terms.
Indeed, in our opinion the meanings of the terms used in the XML schemas

Figure 5.5: Example of two related schemas and of the expected matches. Schema A: musicStore with subtrees location (town, country), signboard (colorsign, namesign) and storage/stock/compactDisk (albumTitle and songList/track with songTitle and singer). Schema B: cdStore with name, address (city, street, state) and cd (vocalist, cdTitle, trackList/passage/title). Expected matches are marked with the same letter (e.g. albumTitle-cdTitle, songTitle-title, singer-vocalist)

cannot be ignored as they represent the semantics of the actual content of the
XML documents. On the other hand, the structural part of XML documents
cannot be considered as a plain set of terms as the position of each node
in the tree provides the context of the corresponding term. For instance,
let us consider the two expanded schemas represented by the trees shown
in Figure 5.5. Though different in structure and in the adopted terminology, they both describe the contents of the albums sold by music stores, for which the information about their location is also represented. Thus, among the results of a query expressed by using Schema A we would also expect documents consistent with Schema B. In particular, by looking at the two schemas of Figure 5.5, a careful reader would probably identify the matches which are represented by the same letter. At this step, the terms used in the two schemas have already been disambiguated by choosing the best WN synset. As WordNet is a general-purpose lexical ontology, it does not provide meanings for specific contexts. Thus, the best choice is to associate terms such as albumTitle and songTitle for Schema A and cdTitle and title for Schema B with the same WN term, i.e. title, for which the
best synset can thus be chosen. In these cases, which are quite common, it
is only the position of the corresponding nodes which can help us to better
contextualize the selected meaning. For instance, it should be clear that the
node albumTitle matches with the node cdTitle, as both refer to album
title, and that songTitle matches with the node title, as both refer to
song title.
The steps we devised for the matching computation are partially derived
from the ones proposed in [98] and are the following:
1. the involved schemas are first converted into directed labelled graphs
following the RDF specifications [79], where each entity represents an

Figure 5.6: RDF and corresponding PCG for portions of Schemas A and B — the RDF models link full-path entities such as /musicStore and /musicStore/location (or /cdStore and /cdStore/address) to their names through child and name arcs; the pairwise connectivity graph pairs them up, e.g. (/musicStore, /cdStore) connected by a child arc to (/musicStore/location, /cdStore/address)

element or attribute of the schema identified by the full path (e.g.


/musicStore/location) and each literal represents a particular name
(e.g. location) or a primitive type (e.g. xsd:string) which more
than one element or attribute of the schema can share. As to the la-
bels on the arcs, we mainly employ two kinds of them: child, which
captures the involved schema structure, and name. Such label set can
be optionally extended for further flexibility in the matching process.
From the RDF graphs of each pair of schemas, a pairwise connectivity graph (PCG), whose nodes are pairs of nodes (one from each RDF graph), is constructed [98]: a labelled edge connects two such node pairs if an edge with the same label connects the corresponding nodes in both RDF graphs. Figure 5.6 shows a portion of the RDF graphs and of the pairwise connectivity
graph for Schema A and Schema B.

2. Then, an initial similarity score is computed for each node pair con-
tained in the PCG. This is one of the most important steps in the
matching process. In [98] the scores are obtained using a simple string
matcher that compares common prefixes and suffixes of literals. In-
stead, in order to maximize the matching effectiveness, we chose to
adopt an in-depth semantic approach. Exploiting the semantics of the
terms in the XML schemas provided in the disambiguation phase, we
follow a linguistic approach in the computation of the similarities be-
tween pairs of literals (names), which quantifies the distance between the involved meanings by comparing the WN hypernym hierarchies of the involved synsets. We recall that hypernym relations are also known as IS-A relations (for instance "feline" is a hypernym of "cat", since you can say "a cat is a feline"). In this case, the score for each pair of synsets (s_1, s_2) is obtained by computing the depths of the synsets in the WN hypernym hierarchy and the length of the path connecting them, as follows (a small illustrative sketch of this score is given after this list):

\[
\frac{2 \cdot \text{depth of the least common ancestor}}{\text{depth of } s_1 + \text{depth of } s_2}
\]

3. The initial similarities, reflecting the semantics of the single node pairs, are refined by an iterative fixpoint calculation as in the similarity flooding algorithm [98], which brings the structural information of the schemas into the computation. In fact, this method is one of the most versatile and also provides realistic metrics for match accuracy [48]. The intuition behind this computation is that two nodes belonging to two distinct schemas are the more similar, the more similar their adjacent nodes are. In other words, the similarity of two elements propagates to their respective adjacent nodes. The fixpoint computation is iterated until the similarities converge or a maximum number of iterations is reached.

4. Finally, we apply a stable marriage filter which produces the "best" matching between the elements and attributes of the two schemas. The stable marriage filter guarantees that, for each pair of nodes (x, y), no other pair (x′, y′) exists such that x is more similar to y′ than to y and y′ is more similar to x than to x′.
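As a small illustration of the synset score of step 2 (and not of the thesis implementation itself), the following sketch computes it through the NLTK WordNet interface, which is assumed to be available; the measure coincides with the well-known Wu-Palmer similarity.

# Illustrative computation of 2*depth(LCA) / (depth(s1) + depth(s2)) over the
# WordNet hypernym hierarchy, using NLTK (assumed available); this is not the
# implementation used in the thesis.
from nltk.corpus import wordnet as wn


def synset_score(s1, s2):
    lcs = s1.lowest_common_hypernyms(s2)
    if not lcs:
        return 0.0
    depth_lca = lcs[0].max_depth() + 1                # count nodes, root depth = 1
    return 2.0 * depth_lca / ((s1.max_depth() + 1) + (s2.max_depth() + 1))


cat, feline = wn.synset("cat.n.01"), wn.synset("feline.n.01")
print(synset_score(cat, feline))        # close to 1, since "a cat is a feline"
print(wn.wup_similarity(cat, feline))   # NLTK's built-in Wu-Palmer equivalent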

5.1.2 Automatic query rewriting


By exploiting the best matches provided by the matching computation, we straightforwardly rewrite a given query, written w.r.t. a source schema, on the target schemas. Each rewrite is assigned a score, in order to allow the ranking of the results retrieved by the query. Query rewriting is simplified by the fact that the previous phases were devised for this purpose: the expanded structure of the schemas summarizes the actual structure of the XML data, where elements and attributes are identified by their full paths, and such paths have a key role in XQuery FLWOR expressions. At present, we support conjunctive queries with standard variable use, predicates and wildcards (e.g. the query given in Figure 5.2). We will now briefly explain the approach and show some meaningful examples. After having substituted each path in the WHERE and RETURN clauses with the corresponding full path and then discarded the variable introduced in the FOR clause, we rewrite the query for each of the target schemas in the following way (a minimal sketch of these steps follows the list):

1. all the full paths in the query are rewritten by using the best matches
between the nodes in the given source schema and target schema (e.g.

Query 1 — original query on source Schema A:
  FOR $x IN /musicStore
  WHERE $x/storage/*/compactDisk//singer = "Elisa"
  AND $x//track/songTitle = "Gift"
  RETURN $x/signboard/namesign
Query 1 — automatically rewritten query on target Schema B:
  FOR $x IN /cdStore
  WHERE $x/cd/vocalist = "Elisa"
  AND $x/cd/trackList/passage/title = "Gift"
  RETURN $x/name

Query 2 — original query on source Schema A:
  FOR $x IN /musicStore/storage/stock/compactDisk/songlist/track
  WHERE $x/singer = "Elisa"
  AND $x/songtitle = "Gift"
  RETURN $x
Query 2 — automatically rewritten query on target Schema B:
  FOR $x IN /cdStore/cd
  WHERE $x/vocalist = "Elisa"
  AND $x/trackList/passage/title = "Gift"
  RETURN $x/trackList/passage

Query 3 — original query on source Schema A:
  FOR $x IN /musicStore
  WHERE $x/storage/stock/compactDisk = "Gift"
  AND $x/location = "Modena"
  RETURN $x
Query 3 — automatically rewritten query on target Schema B:
  FOR $x IN /cdStore
  WHERE ( $x/cd/vocalist = "Gift"
          OR $x/cd/cdTitle = "Gift"
          OR $x/cd/trackList/passage/title = "Gift" )
  AND ( $x/address/city = "Modena"
        OR $x/address/street = "Modena"
        OR $x/address/state = "Modena" )
  RETURN $x

Figure 5.7: Examples of query rewriting between Schema A and Schema B

the path /musicStore/storage/stock/compactDisk of Schema A is automatically rewritten into the corresponding best match, /cdStore/cd of Schema B);

2. a variable is reconstructed and inserted in the FOR clause in order to link all the rewritten paths (its value will be the longest common prefix of the involved paths);

3. a score is assigned to the rewritten query: it is the average of the scores assigned to each path rewriting, each of which is based on the similarity between the involved nodes, as specified in the match.
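A minimal sketch of these three steps follows; the path strings and similarity scores in the example are purely illustrative, and the best-match table is assumed to come from the matching phase.

# Hypothetical sketch of the path-rewriting core: map each source full path
# through the best-match table, rebuild the FOR variable as the longest
# common prefix of the rewritten paths, and average the per-path scores.
from typing import Dict, List, Tuple

# best matches: source full path -> (target full path, similarity score)
BestMatches = Dict[str, Tuple[str, float]]


def common_prefix(paths: List[str]) -> str:
    segments = [p.strip("/").split("/") for p in paths]
    prefix = []
    for parts in zip(*segments):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return "/" + "/".join(prefix)


def rewrite_paths(full_paths: List[str], matches: BestMatches):
    rewritten = [matches[p] for p in full_paths]
    targets = [t for t, _ in rewritten]
    score = sum(s for _, s in rewritten) / len(rewritten)
    return common_prefix(targets), targets, score


# Example values only (the scores are illustrative, not computed by the thesis system)
matches = {
    "/musicStore/storage/stock/compactDisk/albumTitle": ("/cdStore/cd/cdTitle", 0.9),
    "/musicStore/signboard/namesign": ("/cdStore/name", 0.8),
}
print(rewrite_paths(list(matches), matches))
# -> ('/cdStore', ['/cdStore/cd/cdTitle', '/cdStore/name'], 0.85...)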

Figure 5.7 shows some examples of query rewriting. The submitted queries are written by using Schema A of Figure 5.5, and the resulting rewriting on Schema B is shown alongside each of them. Query 1 involves the rewriting of a query containing paths with wildcards: in order to successfully elaborate them, the best matches are accessed not exactly but by means of regular expression string matching. For instance, the only path of the tree structure of Schema A satisfying the path /musicStore/storage/*/compactDisk//singer is /musicStore/storage/stock/compactDisk/songList/track/singer, and the corresponding match in Schema B (label J in Figure 5.5) will be the one used in the rewrite. When more than one path of the source schema satisfies a wildcard path, all the corresponding paths are rewritten and put in an OR clause. Query 2 demonstrates the rewriting behavior in the variable management. The value of the $x variable in the submitted query is the path of the element track in Schema A, and the corresponding element in Schema B is passage (label H in Figure 5.5). However,

directly translating the variable value in the rewritten query would lead to a wrong rewrite: while the elements singer and songTitle referenced in the query are descendants of track, the corresponding best matches in Schema B, that is vocalist and title, are not both descendants of passage. Notice that, in these cases, the query is correctly rewritten, as we first substitute each path with the corresponding full path and only then reconstruct the variable, which in this case holds the value of path /cdStore/cd. In example 3 an additional rewrite feature is highlighted, concerning the management of predicates involving values and, in particular, textual values. At present, whenever the best match for an element containing a value is a middle element, the predicates expressed on such an element are rewritten as OR clauses on the elements which are descendants of the matching target element and which contain a compatible value. For instance, the element compactDisk and its match cd on Schema B are not leaf elements, therefore the condition is rewritten on the descendant leaves vocalist, cdTitle and title.

5.2 Structural disambiguation service


In this section we propose a service for automatic structural disambigua-
tion which can prove valuable in enhancing the effectiveness of the matching
(and rewriting) techniques described in the previous section and, in gen-
eral, of the majority of the available knowledge-based applications. Indeed,
knowledge-based approaches, i.e. approaches which exploit the semantics of
the information they access, are rapidly acquiring more and more importance
in a wide range of application contexts. We refer to “hot” research topics,
not only schema matching and query rewriting, as considered in the previ-
ous section for XML S3 MART and also in peer data management systems
(PDMS) [84], but also XML data clustering and classification [128, 131] and
ontology-based annotation of web pages and query expansion [39, 53], all go-
ing in the direction of the Semantic Web “... an extension of the current web
in which information is given well-defined meaning, better enabling comput-
ers and people to work in cooperation” [14]. In these contexts, most of the
proposed approaches share a common basis: They focus on the structural
properties of the accessed information, which are represented adopting XML
or ontology-based data models, and their effectiveness is heavily dependent on knowing the right meaning of the employed terminology. For example, Figure 5.8 shows the hierarchical representation of a portion of the categories offered by eBay®, one of the world's most famous online marketplaces (nodes are uniquely identified by their pre-order values). It contains many polysemous words, from string to batteries and memory, to which com-
Approximate query answering
170 in heterogeneous XML collections

buy 1

computers 2 8 cameras 11 antiques


musical
desktop PC 9 accessories 12 forniture 14
3 instruments
components 10 batteries 13 chair 15 string

4 memory 5 speaker 6 fan 7 mouse

Figure 5.8: A portion of the eBayr categories.

monly available vocabularies associates several meanings. The information


given by the surrounding nodes allows us to state, for instance, that string
are “stringed instruments that are played with a bow” and batteries are
electronic devices and not a group of guns or whatever else.
Our generic disambiguation service works on graph-like structured infor-
mation, mainly focusing on trees. It can be used to make explicit the meaning
of a wide range of structure based information, including XML schemas as
employed for instance in XML S3 MART, but also the structures of XML doc-
uments, web directories, and ontologies. Starting from the lesson learnt in
the word sense disambiguation (wsd) field [68], where several solutions have
been proposed for free text, we have conceived a versatile approach which
tries to disambiguate the terms occurring in the nodes’ labels by analysing
their context and by using an external knowledge source. More precisely,
starting from a given node, we support several ways of navigating the graph
in order to extract the context, which can thus be tailored to the specific
application needs. Moreover, the disambiguation method does not depend
on training data or extensions, which are not always available. For instance,
in a PDMS, peers do not necessarily store actual data. We follow instead a dif-
ferent approach: The exploitation of the information provided by commonly
available thesauri such as WordNet [100]. In particular, disambiguation is
founded on the hypernymy/hyponymy hierarchy, as suggested by most of the
classic wsd studies, and the sense contexts, extracted from the thesaurus, can
be compared against the graph context to refine the results. The outcome of
the overall process is a ranking of the plausible senses for each term. In this
way, we are able to support both the assisted annotation and the completely
automatic one whenever the top sense is selected. This service has been
implemented in our STRIDER (STRucture-based Information Disambiguation
ExpeRt) system. Subsection 5.2.1 presents an overview of our disambiguation
approach, while the proper disambiguation algorithm is presented in Section 5.2.2.

[Figure 5.9: The STRIDER graph disambiguation service. The pipeline is composed
of four components (terms/senses selection, graph context extraction, context
expansion and disambiguation); it takes as input the graph, optional terms/senses
suggestions and arc weights, exploits external knowledge sources and terms/senses
feedback, and outputs a ranking of the plausible senses for each term.]

5.2.1 Overview of the approach


In this section we present the functional architecture of the generic structural
disambiguation service (see Figure 5.9) and introduce relevant terminology.
Since trees are particular kinds of graphs, without loss of soundness, in the fol-
lowing we will use the terms tree and graph interchangeably. Indeed, at the end
of the present section, we will show that the service can be straightforwardly
extended to graphs. We emphasize that no extension or training data is re-
quired for our disambiguation purposes, as such data are not always available. The
only external source is a thesaurus associating each word with the concepts
or senses it is able to express.
The service is able to disambiguate XML schemas, the structures of XML
documents, web directories, and, in general, such information descriptions
which can be represented as trees. As a particular case, XML schemas are
represented as trees which make explicit the structural relationships between
the involved elements, thus capturing the element context, and abstract from
the complexity of the language syntax. The tree contains a set of nodes whose
labels must be disambiguated and a set of arcs which connect pairs of nodes
and which may as well be labelled (e.g. type, property). The identification
of the correct sense for each label is made possible by analysing the context
of the involved terms and by using an external knowledge source. Arcs are
particularly important as they connect each label with its context. Each arc
label is associated with two weights between 0 and 1 (default value 1), one for
each crossing direction (direct and inverse). Weights will be used to compute
the distance between two nodes in the graph: the lower the weight of an
arc, the closer the two nodes it connects.
The "terms/senses selection" component in Figure 5.9 takes the label of
each node N of the tree, extracts the contained terms (which can also be
more than one as for instance desktop PC components in Figure 5.8) and
associates each of these terms (t, N )1 with a list of senses Senses(t, N ) =
[s1 , s2 , . . . , sk ]. In principle, such list is the complete list of senses provided by
the thesaurus but it can also be a shrunk version suggested either by human
or machine experts or as feedback of a previous disambiguation process.
Each polysemous term (t, N ) is then associated with its context. The
context is first extracted from the tree but it does not necessarily coincide
with the entire tree. Indeed, different applications require different con-
texts. For instance, while disambiguating the term string in the musical
instruments category of eBay®, using categories such as women's clothing
would be quite misleading. Thus we support different contexts by means of
different crossing settings. By default, the nodes reachable by the term’s
node N through any arc belong to the term’s context. The set of crossable
arc labels and the corresponding crossing directions is shrinkable, that is it is
possible to specify which kinds of arcs are crossable, in which direction and
the maximum number of crossings (distance from the term’s node). More-
over, as we deal with trees, we also provide the possibility of including the
siblings of the term’s node in the context. The above options can be freely
combined. As a special case, let us consider trees having no label on the arcs.
This case actually represents the conceptual structure of the most common applica-
tion contexts such as web directories, XML documents, and XML schemas.
When the only crossing direction is the direct one, the context is defined by
the descendants or subtree of the term’s node. Conversely, it is represented
by the ancestors. For instance, for the eBay example, one of the best crossing
settings is to include ancestors, descendants, and siblings whereas the whole
structure would be useful for structures dealing with more “contextualized”
topics such as book descriptions.
Given a crossing setting, the “graph context extraction” component in
Figure 5.9 contextualizes each polysemous term (t, N ) by extracting its graph
context Gcontext(t, N ) from the set of terms belonging to the reachable
nodes. Not all nodes contribute with the same weight to the disambigua-
tion of a term. In principle, the more one node is close to the term’s node
and is connected by arcs with low weights the more it influences the term
disambiguation. For this reason, we associate each reachable node Nc in the
context with a weight weight(Nc ) computed as follows. Given the path from
the node Nc to the term's node N, we count the number of instances corresponding
to each pattern specified in the crossing setting (i.e. arc label and arc crossing
direction) and we define the distance d between N and Nc as the sum of the
products of the weights associated to each pattern and the corresponding number
of instances. Then, weight(Nc) is computed by applying a gaussian distance decay
function defined on d:

$$weight(N_c) = \frac{2 \cdot e^{-d^2/8}}{\sqrt{2\pi}} + 1 - \frac{2}{\sqrt{2\pi}}$$

[Footnote 1: Notice that the same term could be included more than once and that
the disambiguation is strictly dependent on the node each instance belongs to.]

Thus each element of the graph context is a triple ((tc, Nc), Senses(tc, Nc),
weight(Nc)) defined from each term tc belonging to each reachable node Nc.

Example 5.1 Assume that in the eBay tree (Figure 5.8) the context is made
up of the siblings and ancestors, that the weight of the parent/child arcs is 1 in
the direct direction and 0.5 in the opposite one, and that the maximum num-
ber of crossings is 2. The graph context of the term (mouse, 7) is made up of
the terms (computers, 2), (desktop, 3), (PC, 3), (components, 3), (memory, 4),
(speaker, 5), and (fan, 6). The distances between node 7 and node 2, node 3,
and nodes 4, 5 and 6 are 1 (i.e. two arcs crossed in the opposite direction with
weight 0.5 each), 0.5 (i.e. one arc crossed in the opposite direction with weight
0.5), and 1.5 (i.e. one arc crossed in the opposite direction with weight 0.5 and
one arc crossed in the direct direction with weight 1), respectively. Then,
weight(2) = 0.91, weight(3) = 0.95, and weight(4) = weight(5) = weight(6) = 0.8. □

The context of each term (t, N) can be expanded by the contexts Scontext(s)
of each sense s in Senses(t, N). This is particularly useful when the graph
context provides too little information. In particular, for each sense we con-
sider the definitions, the examples and any other explanation of the sense
provided by the thesaurus. As most of the semantics is carried by noun words
[68], the “context expansion” module in Figure 5.9 defines Scontext(s) as the
set of nouns contained in the sense explanation.
Finally, each term (t, N ) with its senses Senses(t, N ) is disambiguated by
using the previously extracted context. The proper disambiguation process
is the subject of the following section. The result is a ranked version of
Senses(t, N ) where each sense s ∈ Senses(t, N ) is associated with a confi-
dence φ(s) in choosing s as a sense of (t, N ).
The overall approach is quite versatile. It supports several disambiguation
needs by means of parameters which can be freely combined, from the weights
to the graph context. Moreover, the ranking approach has been conceived in
order to support two types of graph disambiguation services: The assisted
and the completely automatic one. In the former case, the disambiguation
task is committed to a human expert and the disambiguation service assists
him/her by providing useful suggestions. In the latter case, there is no human
intervention and the selected sense can be the top one. Moreover, the above
approach can be straightforwardly applied also to graphs. Indeed, only the
context extraction phase accesses the submitted structure whereas the actual
disambiguation algorithm is completely independent from it. The only prob-
lem is in the weight computation where more than one path can connect a
pair of nodes. In this case, the one with the lower distance could be selected.
In this way, we would be able to disambiguate ontologies written in different
languages such as OWL and RDF where arc labels are quite frequent (e.g. in
RDF arcs can be of subClassOf type or range type or many other types).
However, at present, trees are our main focus of interest and, in particular,
trees having no label on the arcs, which have been the subject of our tests; we
plan to deal with general graphs in the future.

5.2.2 The disambiguation algorithm


The algorithm for disambiguation we devised follows a relational information
and knowledge-driven approach. Indeed, the context is not merely consid-
ered as a bag of words but other information such as their distance from
the polysemous term to be disambiguated and semantic relations are also
extracted. Moreover, we use additional information provided by thesauri:
The hypernymy/hyponymy relationships among senses and the sense expla-
nations and frequencies of use. Note that some of the techniques presented
in this section are borrowed and adapted from the ones of the free-text word
sense disambiguator we devised for our EXTRA system (see Chapter 2 and
Appendix A.1 for a detailed description).
The algorithm is shown in Figure 5.10. It takes in input a term (t, N ) to be
disambiguated and produces a vector φ of confidences in choosing each of the
senses in Senses(t, N ). In particular, given Senses(t, N ) = [s1 , s2 , . . . , sk ],
φ is a vector of k values and φ[i] is the confidence in choosing si as sense
of (t, N ). The obtained confidence vector tunes two contributions (line 11):
That of the context, whose weight is expressed by the constant α and which
is subdivided in the graph context (confidence vector φG , weight γ) and the
expanded context (confidence vector φE, weight ε), with γ + ε = 1, and that of
the frequency of sense use in the English language (confidence vector φU) with
weight β, where α + β = 1 (all operations on the vectors are the ones usually defined).
The terms surrounding a given one provide a good informational context
and good hints about what sense to choose for it. The contribution of the

algorithm Disambiguate(t, N)
//graph context contribution
(01) φG = [0, . . . , 0]   (one entry for each sense in Senses(t, N))
(02) norm = 0
(03) for each (tc, Nc) in Gcontext(t, N)
(04) φG = φG + weight(Nc) ∗ TermCorr(t, tc, norm)
(05) norm = norm ∗ weight(Nc)
(06) φG = φG / norm
//expanded context contribution
(07) for i from 1 to the number of senses in Senses(t, N)
(08) if expanded context
(09) φE[i] = ContextCorr(Gcontext(t, N), Scontext(si))
(10) φU[i] = decay(si)
(11) φ = α(γ ∗ φG + ε ∗ φE) + β ∗ φU

Figure 5.10: The disambiguation algorithm
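As a complement to the pseudocode, the following illustrative Python snippet shows the final combination performed at line 11; the numeric weights are the optimal values reported later in the experimental evaluation (α = γ = 0.7, β = ε = 0.3), and the function name is an assumption of ours, not part of the original implementation.

ALPHA, BETA = 0.7, 0.3      # context contribution vs. frequency-of-use contribution
GAMMA, EPSILON = 0.7, 0.3   # graph context vs. expanded context, within the context part

def combine(phi_g, phi_e, phi_u,
            alpha=ALPHA, beta=BETA, gamma=GAMMA, epsilon=EPSILON):
    # line 11 of Figure 5.10: phi = alpha*(gamma*phi_G + epsilon*phi_E) + beta*phi_U
    return [alpha * (gamma * g + epsilon * e) + beta * u
            for g, e, u in zip(phi_g, phi_e, phi_u)]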

function TermCorr(t, tc, norm)
(1) c(t, tc) is the minimum common hypernym of t and tc
(2) φC = [0, . . . , 0]
(3) for i from 1 to the number of senses in Senses(t, N)
(4) if c(t, tc) is an ancestor of si
(5) φC[i] = sim(t, tc)
(6) norm = norm + sim(t, tc)
(7) return φC

Figure 5.11: The TermCorr() function

graph context is computed from step 1 to step 6. In particular, φG is the
sum of the values measuring the level of semantic correlation between the
polysemous term t and the ones in the graph context Gcontext(t, N ) (step
4). The contribution of each context term (tc , Nc ) is weighted by the relative
position in the graph of the tc ’s node, Nc , w.r.t. N (i.e. weight(Nc )). Finally
in step 6 the whole vector φG is divided by the norm value in order to obtain
normalized confidences.
The basis of function TermCorr() (see Figure 5.11) derives from the one
in [113]. As in [113], the confidence in choosing one of the senses associated
with each term is directly proportional to the semantic similarities between
that term and each term in the context; the intuition behind the similarity is
that the more similar two terms are, the more informative will be the most
specific concept that subsumes them both. However, our approach differs in
the semantic similarity measure sim(t, tc ) as it does not rely on a training
phase on large pre-classified corpora but exploits the hypernymy hierarchy
of the thesaurus. In this context, one of the most promising measures is the
Leacock-Chodorow measure [82], which has been revised in the following way:

$$sim(t, t_c) = \begin{cases} -\ln\left(\dfrac{len(t, t_c)}{2 \cdot H}\right) & \text{if a common hypernym exists} \\ 0 & \text{otherwise} \end{cases} \qquad (5.1)$$

where len(t, tc ) is the minimum among the number of links connecting each
sense in Senses(t, N ) and each sense in Senses(tc , Nc ) and H is the height
of the hypernymy hierarchy (in WordNet it is 16). Moreover, we define the
minimum common hypernym c(t, tc ) of t and tc as the sense which is the
most specific (lowest in the hierarchy) of the hypernyms common to the two
senses (i.e. that crossed in the computation of len(t, tc )). For instance, in
WordNet the minimum path length between the terms “cat” and “mouse” is
5, since the senses of such nouns that join most rapidly are “cat (animal)”
and “mouse (animal)” and the minimum common hypernym is “placental
mammal”. Obviously these two values are not computed within the function
but once for each pair of the involved terms. Eq. 5.1 is decreasing as one
moves higher in the taxonomy, thus guaranteeing that "more abstract" means
"less informative". Therefore, function TermCorr() increases the confidence of
those senses in Senses(t, N) which are descendants of the minimum common
hypernym (lines 3-4), and the increment is proportional to how informative the
minimum common hypernym is (line 5). At the end of the process (Figure 5.10,
line 6), the value assigned in φG to each sense is the proportion of support it
receives out of the total possible support, which is kept updated by function
TermCorr() (line 6) and in the main algorithm (Figure 5.10, line 5).
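For concreteness, here is a minimal Python sketch of the revised similarity of Eq. 5.1 and of the support-distribution step of TermCorr(); the precomputed path length and the is_descendant_of_mch predicate are illustrative assumptions of this sketch, whereas in the real system they come from the WordNet hypernymy hierarchy.

import math

H = 16  # height of the WordNet hypernymy hierarchy

def sim(path_len):
    # Revised Leacock-Chodorow similarity (Eq. 5.1); path_len is the minimum
    # number of links between any sense of t and any sense of tc, or None
    # when the two terms share no hypernym.
    if path_len is None:
        return 0.0
    return -math.log(path_len / (2.0 * H))

def term_corr(senses_t, is_descendant_of_mch, s):
    # Lines 3-6 of Figure 5.11: give the support s = sim(t, tc) to the senses of t
    # that descend from the minimum common hypernym c(t, tc); the second returned
    # value is the increment to the normalization factor norm.
    return [s if is_descendant_of_mch(sense) else 0.0 for sense in senses_t], s

# "cat" / "mouse": minimum path length 5 through "placental mammal"
print(round(sim(5), 3))   # ~1.856, a fairly informative common hypernym
print(round(sim(30), 3))  # ~0.065, a very abstract (almost uninformative) one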
Beside the contribution of the graph context, also the expanded context
can be exploited in the disambiguation process (Figure 5.10, lines 7-9). In this
case, the main objective is to quantify the semantic correlation between the
context Gcontext(t, N ) of the polysemous term (t, N ) and the explanation
of each sense s in Senses(t, N ) represented by Scontext(s). In particular,
the confidence in choosing s is proportional to the computed similarity value
(Figure 5.10, line 9). The pseudocode of function ContextCorr() is shown
in Figure 5.12. It essentially computes the semantic similarity between each
term ti in the graph context and the terms in the sense context Scontext(s)
(lines 3-7) by calling the TermCorr() function for each term tsj in Scontext(s)
(line 6) and then by computing the maximum of the obtained confidence

function ContextCorr([t1, . . . , tn], [ts1, . . . , tsm])
(1) φC = [0, . . . , 0]
(2) for i from 1 to n
(3) φT = [0, . . . , 0]
(4) norm = 0
(5) for j from 1 to m
(6) φT = φT + TermCorr(ti, tsj, norm)
(7) φC[i] = max(φT / norm)
(8) return mean(φC)

Figure 5.12: The ContextCorr() function

vector φT . The returned value (line 8) is the mean of the similarity values
computed for the terms in Gcontext(t, N ).
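A possible Python rendering of ContextCorr() follows; the term_corr_fn callback, returning a confidence vector and a normalization increment, mirrors the TermCorr() sketch above and is an assumption of this illustration rather than the system's actual interface.

def context_corr(graph_context_terms, sense_context_terms, term_corr_fn):
    # For every term ti of the graph context, accumulate its correlation with all
    # terms of the sense context, keep the best normalized confidence (line 7 of
    # Figure 5.12) and finally return the mean over the graph context (line 8).
    phi_c = []
    for ti in graph_context_terms:
        phi_t, norm = None, 0.0
        for tsj in sense_context_terms:
            contrib, incr = term_corr_fn(ti, tsj)
            phi_t = contrib if phi_t is None else [a + b for a, b in zip(phi_t, contrib)]
            norm += incr
        phi_c.append(max(v / norm for v in phi_t) if phi_t and norm > 0 else 0.0)
    return sum(phi_c) / len(phi_c) if phi_c else 0.0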
The last contribution is that of function decay(), exploiting the frequency
of use of the senses in the English language (Figure 5.10, line 10). In particular,
WordNet orders its list of senses WNSenses(t) of each term t on the basis
of the frequency of use (i.e. the first is the most common sense, etc.). We
increment the confidence in choosing each sense s in Senses(t, N) in a way
which is inversely proportional to its position, pos(s), in such ordered list:

$$decay(s_i) = 1 - \rho \cdot \frac{pos(s_i) - 1}{|WNSenses(t)|}$$

where 0 < ρ < 1 is a parameter we usually set to 0.8 and |WNSenses(t)| is
the cardinality of WNSenses(t). In this way, we quantify the frequency of
the senses: the first sense suffers no decay, while the last one is decayed to
roughly one fifth (with ρ = 0.8). Such an adjustment attempts to emulate the common sense of a human
in choosing the right meaning of a noun when the context gives little help.
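The following short Python sketch reproduces the decay contribution; with the usual ρ = 0.8 and, for instance, a term having five WordNet senses, the decay goes from 1.0 for the most common sense down to 0.36 for the least common one.

RHO = 0.8  # decay parameter, usually set to 0.8

def decay(pos, n_wn_senses, rho=RHO):
    # pos is the 1-based position of the sense in WordNet's frequency-ordered
    # list WNSenses(t); the first sense suffers no decay at all.
    return 1.0 - rho * (pos - 1) / n_wn_senses

print([round(decay(p, 5), 2) for p in range(1, 6)])  # [1.0, 0.84, 0.68, 0.52, 0.36]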
As a final remark, notice that for the sake of simplicity of presentation,
algorithm Disambiguate() takes one term at a time. However, for efficiency
reasons, in the actual implementation the sim() computation is performed
only once for a given pair of terms (even when the arguments are swapped, as
sim() is a symmetric measure).

5.3 Related work


5.3.1 Approximate query answering
Recently, several works took into account the problem of answering approx-
imate structural queries against XML documents. Much research has been
done on the instance level, trying to reduce the approximate structural query
evaluation problem to well-known unordered tree inclusion [119] or tree edit
distance [65] problems directly on the data trees. However, the process of un-
ordered tree matching is difficult and extremely time consuming; for instance,
the edit distance on unordered trees was proven to be NP-hard in [141]. Ad-hoc
approaches based on explicit navigation of the nodes’ instances, such as [37],
are equally very expensive and generally deliver inadequate performance due
to the very large size of most of the available XML data trees.
On the other hand, a large number of approaches prefer to address the
problem of structural heterogeneity by first trying to solve the differences
between the schemas on which data are based. Schema matching is a prob-
lem which has been the focus of work since the 1970s in the AI, DB and
knowledge representation communities [15, 48]. Many systems have been de-
veloped: The most interesting ones working on XML data are COMA [49],
which supports the combination of different schema matching techniques,
CUPID [85], combining a name and a structural matching algorithm, and
Similarity Flooding (SF) [98], providing a particularly versatile graph match-
ing algorithm. Many approaches also combine schema level with instance
level analysis, such as LSD and GLUE [50], which are based on machine
learning approaches needing a preliminary training phase. However, most of
the work on XML schema matching has been motivated by the problem of
schema integration: A global view of the schemas is constructed, and from
this point the fundamental aspect of query rewriting remains particularly
problematic and difficult to solve [115]. As to rewriting, most of
the works present interesting and complex theoretical studies [107]. In [27]
the theoretical foundations for query rewriting techniques based on views of
semi-structured data, in particular for regular path queries, are laid. Some rewriting
methods have also been studied in the context of mediator systems, in or-
der to rewrite the submitted queries on the involved sources: For instance,
[111] presents an approach based on the exploitation of a description logic,
while [106] deals with the problem of the informative capability of each of
the sources. However, in general, the proposed rewriting approaches rarely
actually benefit from the great promises of the schema matching methods.

5.3.2 Free-text disambiguation


Before discussing the few approaches proposed for the “structural” disam-
biguation problem, we will first briefly review our disambiguation approach
in the more “classic” and well studied field of wsd for free text. The neces-
sity of looking at the context of a word in order to correctly disambiguate
it is universally accepted, nonetheless two different approaches exist: The
bag of words approach, where the context is merely a set of words next to
the term to disambiguate, and the relational information approach, which
extends the former with other information such as their distance or rela-
tion with the involved word. The disambiguation algorithm developed by
us adopts the relational information approach which is more complex but
generally performs much better. In the literature, a further distinction is
based on the kind of information source used to assign a sense to each word
occurrence [68]. Our disambiguation method is a knowledge-driven method
as it combines the context of the word to be disambiguated with additional
information extracted from an external knowledge source, such as electron-
ically oriented dictionaries and thesauri. Such an approach often benefits from
a general applicability and is able to achieve good effectiveness even when it
is not restricted to specific domains. WordNet is, without doubt, the most
used external knowledge source [26] and its hypernym hierarchies constitute
a very solid foundation on which to build effective relatedness and similarity
measures, the most common of which are the path based ones [82]. Further,
the descriptions and glosses provided for each term can deliver additional
ways to perform or refine the disambiguation: The gloss overlap approach
[11] is one of them. Among the alternative approaches, the most common
one is the corpus-based or statistic approach where the context of a word is
combined with previously disambiguated instances of such word, extracted
from annotated corpora [4, 5]. Recently, new methods relying on the entire
web textual data, and in particular on the page count statistics gathered by
search engines like Google [38, 39] have also been proposed. However, gen-
erally speaking, the problem of such approaches is that they are extremely
data hungry and require extensive training on huge textual corpora, which are
not always available, and/or a very large amount of manual work to produce
the sense-annotated corpora they rely on. This problem prevents their use
in the application contexts we refer to, as even “raw” data are not always
available (e.g. in a PDMS, peers not necessarily store actual data).

5.3.3 Structural disambiguation


Structural disambiguation is acknowledged as a very real and frequent prob-
lem for many semantic-aware applications. However, to our best knowledge,
up to now it has only been partially considered in two contexts, schema
matching and XML data clustering, and few actual structural disam-
biguation approaches have recently been presented. In many schema match-
ing approaches, the semantic closeness between nodes relies on syntactic ap-
proaches, such as simple string matching possibly extended with synonyms
(e.g. [98, 85]). Also, a good number of statistical wsd approaches have
been proposed in the matching context (e.g. [84]). However, as we already
outlined, they rely on additional data which may not always be available.
As to the proper structural disambiguation approaches, in [131] the authors
propose a technique for XML data clustering, where disambiguation is per-
formed on the documents’ tag names. The local context of a tag is captured
as a bag of words containing the tag name itself, the textual content of the
element and the text of the subordinate elements and then it is enlarged
by including related words retrieved with WordNet. This context is then
compared to the ones associated to the different WordNet senses of the term
to be disambiguated by means of standard vector model techniques. In a
similar scenario, the method proposed in [128] performs disambiguation by
applying a shortest path algorithm on a weighted graph constructed on the
terms in the path from each node to the root and on their related WordNet
terms. For the graph construction, WordNet relations are navigated just one
level. In a schema matching application, [17] presents a node disambigua-
tion technique exploiting the hierarchical structure of a schema tree together
with WordNet hierarchies. In order for this approach to be fully effective,
the schema relations have to coincide, at least partially, with the WordNet
ones, and this appears to be a quite strong requirement.
Generalizing, the approach we presented in Section 5.2 differs from the
existing structural disambiguation approaches as it has not been conceived
in a particular scenario but it is versatile enough to be applicable to different
semantic-aware application contexts. It fully exploits the potentialities of the
context of a node in a graph structure and its extraction is flexible enough
to include relational information between the nodes and different kinds of
relationships, such as ancestors, descendants or siblings. Further, we fully
exploit WordNet hierarchies, and in particular the hypernym ones which are
the most used for building effective relatedness measures between terms in
free text wsd.

5.4 Experimental evaluation


In this section we provide an extensive experimental evaluation of the tech-
niques we proposed in this chapter. In particular, we present a selection of
the most interesting results we obtained through the experimental evaluation
performed on the prototypes of XML S3 MART (Section 5.4.1) for the match-
ing and rewriting services, and STRIDER for the structural disambiguation
one (Section 5.4.2).

5.4.1 Matching and rewriting


Since in our method the rewriting phase and its effectiveness are completely
dependent on the schema matching phase, and the rewriting itself becomes quite
straightforward, we will mainly focus on the quality of the matches produced by the
matching process.

Experimental setting
We evaluated the effectiveness of our techniques in a wide range of contexts
by performing tests on a large number of different XML schemas. Such
schemas include the ones we devised ad hoc in order to precisely evaluate
the behavior of the different features of our approach, an example of which
is the music store example introduced in Figure 5.5, and schemas officially
adopted in worldwide DLs in order to describe bibliography metadata or
audio-visual content.
In particular, we further tested XML S3 MART on carefully selected pairs
of similar “official” schemas derived from:

• generic digital library metadata description standards, such as DCMI
(Dublin Core Metadata Initiative), IFLA-FRBR (International Federation
of Library Associations and Institutions Functional Requirements for
Bibliographic Records) and the RSLP (Research Support Libraries Programme)
Collection Description proposal;

• specific IFLA-FRBR extensions for the description of audio-visual content,
such as the ones proposed in the ECHO (European CHronicles Online) Project;

• XML multimedia description standards, notably MPEG-7;

• specific XML languages such as ODRL (Open Digital Rights Language),
proposed by the Digital Rights Management community for declaring the
rights connected with digital distributions of works, such as e-books;

• schemas employed in popular XML digital libraries of scientific works,
in particular the DBLP Computer Science Bibliography archive and the ACM
SIGMOD Record XML Collection.

In this section we will discuss the results obtained for the music store
example and for a real case concerning the schemas employed for storing
the two most popular digital libraries of scientific references in XML format:
The DBLP Computer Science Bibliography archive and the ACM SIGMOD Record.

[Figure 5.13: A small selection of the matching results between the nodes of
Schema A (on the left) and B (on the right) before filtering; similarity scores
are shown on the edges. The strongest candidate matches are musicStore - cdStore
(0.98), compactDisk - cd (0.59) and stock - street (0.08); weaker cross scores
(0.18-0.31) also link these nodes across the two schemas.]

Effectiveness of matching
For the music store schema matching, we devised the two schemas so as to have
both different terms describing the same concept (such as musicStore and
cdStore, location and address) and also different conceptual organizations
(notably singer, associated to each of the tracks of a cd in Schema A, vs.
vocalist, pertaining to a whole cd in Schema B). Firstly, we performed a
careful annotation, in which we associated each of the different terms to the
most similar term and sense available in WordNet. The annotation phase for
the ad-hoc schemas was quite straightforward, since we basically only had to
associate each term with the corresponding WN term; the only peculiarities
are the annotations of composite terms (e.g. cdTitle, annotated as title) and
of the very few terms not present in WN (e.g. compactDisk annotated as cd ).
After annotation, the XML S3 MART iterative matching algorithm automatically
identified the best matches among the node pairs, which coincide with the
ones shown in Figure 5.5. For instance, matches A, E, F and G between
nodes with identical annotation and a similar surrounding context are clearly
identified. A very similar context of surrounding nodes, together with similar
but not identical annotations, is also the key to identify matches B, C,
D, H and J. The matches I and K require particular attention: Schema A
songTitle and albumTitle are correctly matched respectively with Schema
B title and cdTitle. In these cases, all four annotations are the same
(title) but the different contexts of surrounding nodes allow XML S3 MART
to identify the right correspondences. Notice that before applying the stable
marriage filtering each node in Schema A is matched to more than one node
in Schema B; simply choosing the best matching node in Schema B for each
of the nodes in Schema A would not represent a good choice. Consider for
instance the small excerpt of the results before filtering shown in Figure 5.13:
The best match for stock (Schema A) is cdStore, but such node has a
better match with musicStore. The same applies between stock - cd, and
cd - compactDisk. Therefore, the matches for musicStore and compactDisk
are correctly selected (similarities in bold), while stock, a node which has no
correspondent in Schema B, is ultimately matched with street. However,
the score for such match is very low (< 0.1) and will be finally filtered out
by a threshold filter. We also tested our system against specific changes in
the ad-hoc schemas. For instance, we modified Schema B by making the
vocalist node child of the node passage, making the whole schema more
similar to Schema A structure, and we tested if the similarities between the
involved nodes increased as expected. In particular, we noticed that the
similarities of the matches between cd, passage and vocalist and their
Schema A correspondents were 10% to 30% higher.
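To give an idea of the filtering step discussed above, here is a hedged Python sketch (not the actual XML S3 MART implementation) of a 1:1 selection followed by the 0.1 similarity threshold; when preferences on both sides derive from the same score matrix, repeatedly picking the best remaining pair yields a stable matching, which is the spirit of the stable marriage filter.

def filter_matches(scores, threshold=0.1):
    # scores: dict (nodeA, nodeB) -> similarity; returns a 1:1 assignment
    # keeping only pairs whose similarity reaches the threshold.
    chosen, used_a, used_b = {}, set(), set()
    for (a, b), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if a not in used_a and b not in used_b and s >= threshold:
            chosen[a] = (b, s)
            used_a.add(a)
            used_b.add(b)
    return chosen

# the three pairs readable from Figure 5.13: stock - street is pruned by the threshold
scores = {("musicStore", "cdStore"): 0.98,
          ("compactDisk", "cd"): 0.59,
          ("stock", "street"): 0.08}
print(filter_matches(scores))  # musicStore -> cdStore, compactDisk -> cd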
As to the tests on a real case, Figure 5.14 shows the two involved schemas,
describing the proceedings of conferences along with the articles belonging
to conference proceedings. Along with the complexities already discussed
in the ad-hoc test, such as different terms describing the same concept
(proceedings and issue, inproceedings and article), the proposed pair
of schemas presents additional challenges, such as a higher number of nodes,
structures describing the same reality with different levels of detail (as for
author) and different distribution of the nodes (more linear for DBLP, with
a higher depth for SIGMOD), making the evaluation of the matching phase
particularly critical and interesting. In such real cases the annotation process
is no longer trivial and many terms could not have a WN correspondence:
For instance DBLP’s ee, the link to the electronic edition of an article, is
annotated as link, inprocedings, a term not available in WN, as article. In
general, we tried to be as objective and as faithful to the schemas’ terms
as possible, avoiding, for instance, to artificially hint at the right matches
by selecting identical annotations for different corresponding terms: For in-
stance, terms like proceedings (DBLP) and issue (SIGMOD) were anno-
tated with the respective WN terms, dblp was annotated as bibliography while
sigmodRecord as record.
After annotation, XML S3 MART matcher automatically produced the
matches shown in Figure 5.14. Each match is identified by the same letter
inside the nodes and is associated with a similarity score (on the right). The
effectiveness is very high: Practically all the matches, from the fundamental
ones like B and G, involving articles and proceedings, to the most subtle,
such as L involving the link to electronic editions, are correctly identified
without any manual intervention. Notice that the nodes having no match
(weak matches were pruned out by filtering them with a similarity threshold
of 0.1) actually represent concepts not covered in the other schema, such as
authorPosition or location for SIGMOD.

[Figure 5.14: Results of schema matching between Schema DBLP and Schema
SIGMOD. Each match, represented by a letter, is associated to a similarity
score: A 0.48, B 0.64, C 0.21, D 0.29, E 0.29, F 0.29, G 0.98, H 0.22, I 0.28,
J 0.22, K 0.31, L 0.29.]

The semantics delivered by the terminology disambiguation plays a great role in
deciding the matches, from
matches D, E, F, G, L, and J, involving terms with similar contexts and iden-
tical annotations, to A and B, where the similarity of the annotations and
contexts is very high. On the other hand, also the fixed point computation
relying on the structure of the involved schemas is quite important. Indeed,
nodes like key and title are present twice in DBLP but are nonetheless cor-
rectly matched thanks to the influence of the surrounding similar nodes: In
particular, the key of an inproceedings is associated to the articleCode
of an article, while the key of a proceedings has no match in Schema
B. Matches C and I are particularly critical and can be obtained by anno-
tating the terms with particular care: The node conference in Schema B
actually represents the title of a given conference, and as such has been an-
notated and correctly matched. The same applies for the node author in
Schema A, representing the name of an author. In this regard, we think that
a good future improvement, simplifying the process of making annotations
of complex terms like these, would be to allow and exploit composite anno-
tations such as “the title of a conference”, “the name of an author”, or,
for nodes like Schema B articles, “a group of authors”. Further, an addi-
tional improvement, this time to the matching algorithm, might be to enable
the identification of 1:n or, even more generally, of m:n correspondences be-
tween nodes: For instance match K is not completely correct since the node
pages would have to be matched with two nodes of Schema B, initPage
and endPage, and not just with initPage.
We performed many other tests on XML S3 MART effectiveness, generally
confirming the correct identification of at least 90% of the available matches.
Among them, we conducted “mixed” tests between little correlated schemas,
for instance between Schema B and DBLP. In this case, the matches’ scores
were very low as we expected. For instance, the nodes labelled title in
Schema B (title of a song) and DBLP (title of articles and proceedings)
were matched with a very low score, more than three times smaller than the
corresponding DBLP-SIGMOD match. This is because, though having the
same annotation, the nodes had a completely different surrounding context.
Finally notice that, in order to obtain such matching results, it has been
necessary to find a good trade-off between the influence of the similarities
between given pairs of nodes and that of the surrounding nodes, i.e. between
the annotations and context of nodes; in particular, we tested several graph
coefficient propagation formulas [98] and we found that the one delivering
the most effective results is the inverse total average.

5.4.2 Structural disambiguation


In this section we present the results of the actual implementation of our
STRIDER disambiguation approach.

Experimental setting
Tests were conceived in order to show the behavior of our disambiguation
approach in different scenarios. We tested 3 groups of trees characterized by
2 dimensions of interest. The first dimension, specificity, indicates how much
a tree is contextualized in a particular scope; trees with low specificity can
be used to describe heterogeneous concepts, such as a web directory, whereas
trees with high specificity are used to represent specialized fields such as data
about movies and their features and staff. The second dimension, polysemy,
indicates how much the terms are ambiguous. Trees with high polysemy
contain terms with very different meanings: For instance, rock and track
whose meanings radically change in different contexts. On the other hand,
trees with low polysemy contain mostly terms whose senses are characterized
by subtle shades of meaning, such as title. For each feasible combination
of these properties we formed a group by selecting the three most represen-
tative trees. Group1 is characterized by a low specificity and a polysemy
which increases along with the level of the tree; it is the case of web directo-
Approximate query answering
186 in heterogeneous XML collections

# terms # senses Perc. Sense


mean max correct simil.
eBay 16 3.062 8 0.327 3.321
Google 23 3.522 11 0.296 3.201
Yahoo 15 2.733 6 0.366 3.372
Group1 18.000 3.106 8.333 0.330 3.298
IMDb 41 3.854 10 0.278 2.991
OLMS 21 6.286 17 0.159 2.31
Shakes. 15 8 29 0.133 2.152
Group2 25.667 6.047 18.667 0.190 2.484
DBLP 14 5.429 11 0.224 2.6
DCMI 17 5 10 0.235 2.983
Sigmod 18 6.444 13 0.198 2.901
Group3 16.333 5.624 11.333 0.219 2.828

Table 5.1: Features of the tested trees

ries in which we usually find very different categories under the same root, a
low polysemy at low levels and high polysemy at the leaf level. The trees we
selected for Group1 are a small portion of Google™'s and Yahoo®'s web direc-
tories and of eBay®'s catalog. Group2 is characterized by a high specificity
and a high polysemy; we chose structures extracted from XML documents of
Shakespeare’s plays, Internet Movie Database (IMDb®, www.imdb.org) and
a possible On Line Music Shop (OLMS). Finally, Group3 is characterized
by a high specificity and a low polysemy and contains representative XML
schemas from the DBLP and SIGMOD Record scientific digital libraries and
the Dublin Core Metadata Initiative (DCMI®, dublincore.org) specifica-
tions. Low specificity and high polysemy are hardly compatible, therefore
we will not consider this one as a feasible combination.
Table 5.1 shows the features of each tree involved in our experimental eval-
uation. From left to right: the number of terms, the mean and maximum
number of terms' senses, the percentage of correct senses among all the possible
senses, and the average similarity among the senses of each given term in the
tree (computed by using a variant of Eq. 5.1). Notice that our trees are composed
of 15-40 terms. Even though not big, they allow us to generate a significant
variety of graph contexts. The other features are
instead important in order to understand the difficulty of the disambiguation
task: For instance, the higher the number of senses of the involved terms, the
more difficult their disambiguation will be. The mean number of senses of Group2
and Group3 is almost double that of Group1, thus we expect their dis-
ambiguation to be harder. This is confirmed by the percentage of correct
senses among all the possible senses, which can be considered an even more
significant "ease factor" and is higher in Group1. The last feature partially
expresses how the trees are positioned w.r.t. the polysemy dimension: the higher
the average sense similarity, the lower the polysemy, i.e. the closer the meanings
of the different senses. This is true in particular for Group1 and Group3 trees,
confirming the initial hypothesis.

[Figure 5.15: Mean precision levels for the three groups.

              Group 1                Group 2                Group 3
           Graph  Exp    Comb     Graph  Exp    Comb     Graph  Exp    Comb
P(1)       0.892  0.614  0.900    0.816  0.617  0.832    0.631  0.627  0.694
P(2)       1.000  0.914  1.000    0.960  0.865  0.944    0.844  0.811  0.883
P(3)       1.000  1.000  1.000    0.960  0.944  0.944    0.939  0.963  0.939]

Effectiveness evaluation
In our experiments we evaluated the performance of our disambiguation al-
gorithm mainly in terms of effectiveness. Efficiency evaluation is not crucial
for a disambiguation approach, so it will not be discussed in depth (in any case,
the disambiguation process for the analysed trees required at most a few seconds).
Traditionally, wsd algorithms are evaluated in terms of precision and recall
figures [68]. In order to produce a deeper analysis not only of the quality
of the results but also of its possible motivations w.r.t. the different tree
scenarios, we considered the precision figure along with a number of newly
introduced indicators. The recall parameter is not considered because its compu-
tation is usually based on frequent repetitions of the same terms in different
documents, and we are not interested in evaluating the wsd quality from a
single term perspective.
The disambiguation algorithm has first been tested on the entire collection
of trees using the default graph context: all the terms in the tree. Figure 5.15
shows the precision results for the disambiguation of the three groups. Three
contributions are presented: The graph context one (Graph), the expanded
context one (Exp) and the combined one (Comb). In general, precision P
is the mean of the number of terms correctly disambiguated divided by the
number of terms in the trees of each group.

[Figure 5.16: Typical context selection behavior for Group1 (Yahoo tree).

           Complete context            Selected context
           Graph  Exp    Comb          Graph  Exp    Comb
P(1)       0.867  0.533  0.867         1.000  0.533  1.000
P(2)       1.000  0.867  1.000         1.000  0.867  1.000
P(3)       1.000  1.000  1.000         1.000  1.000  1.000]

Since we have at our disposal complete ranking results, we compute precision P(M)
at different levels of quality, by considering the results up to the first M ranks:
For instance, P(1)
will be the percentage of terms in which the correct senses are at the top
position of the ranking. Combination of graph context and expanded context
contributions produces good P(1) precision levels of 90% and of over 83% for
groups Group1 and Group2, respectively. Precision results for Group3 are
lower (nearly 70%), but we have to consider the large number and higher
similarity between the senses of the involved terms; even in this difficult
settings, the results are quite encouraging, particularly if we notice that P(2)
is above 88%. As to the effectiveness of the context expansion, notice that
its contribution alone (Exp) is generally very near to the graph context one,
particularly in the complex Group3 setting, meaning a good efficacy of this
approach too; further, in all the three cases the combination of the two
contributions (Comb) produces better results than each of the contributions
alone. This is achieved by using optimal values for the α and γ (0.7) and the
β and ε (0.3) weights, as obtained from a series of exploratory tests.
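For clarity, the P(M) figures can be computed as in the following illustrative Python helper (not part of the STRIDER prototype), given the full sense rankings and the correct sense of each term.

def precision_at(rankings, correct, m):
    # rankings: dict term -> list of senses ordered by decreasing confidence
    # correct:  dict term -> the correct sense
    # P(M): fraction of terms whose correct sense appears within the first M ranks
    hits = sum(1 for t, ranked in rankings.items() if correct[t] in ranked[:m])
    return hits / len(rankings)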
The next step was to evaluate the different behaviors of the trees disam-
biguation by varying the composition of their terms’ context. We tested an
extensive range of combinations for all the available trees, with “selected”
contexts including only ancestor, descendant and/or sibling terms, and dis-
covered two main behaviors: Group1 trees respond well to a more careful
context selection, while Group2 and Group3 show an opposite trend. Fig-
ures 5.16 and 5.17 show two illustrative comparisons between complete and
selected contexts for Yahoo tree (Group1) and IMDb tree (Group2), respec-
tively. Notice that, in the first case, the combined precision P(1) raises from
86% to a perfect 100% for a selected setting involving only ancestors, descen-
dants and siblings. This is due to the fact that Group1 concepts are very het-
erogeneous and including in the context only directly related terms reduces
the disambiguation "noise" produced by completely uncorrelated ones.

[Figure 5.17: Typical context selection behavior for Group2 (IMDb tree).

           Complete context            Selected context
           Graph  Exp    Comb          Graph  Exp    Comb
P(1)       0.878  0.659  0.878         0.756  0.659  0.805
P(2)       0.951  0.951  0.951         0.902  0.951  0.927
P(3)       0.951  0.976  0.951         0.951  0.976  0.951]

For
instance, when the complete Yahoo tree is used to disambiguate the term
hygiene in the health category, the top sense is that related to the health
science as the process is wrongly influenced by terms like neurology and
cardiology contained in the medicine category. Instead, when the tree
terms are specific and more contextualized, such as in the other two groups,
the result is the opposite: Notice the IMDb combined precision dropping
from nearly 88% to 80% when only ancestors and descendant terms are kept
(Figure 5.17).
Precision figures are the fundamental way to evaluate a wsd approach,
however, in our case, we wanted to analyze the results more in depth and
from different perspectives. For instance, precision P(1) might be high thanks
to the effectiveness of the approach but also because of the possibly small number
of senses of the involved terms (think of terms with just one sense). In
order to deepen our analysis, we computed additional “delta” parameters
(see Table 5.2): The left part of the table shows delta values between rank
positions, while the right part shows delta values between confidences. Delta
rank values express the mean difference between the position in the ranking
of the correct sense and that of the last one; we computed them when the
right senses appear in the first (rank1 in table), second (rank2) and third
(rank3) position. For a given rank, we indicate by a ‘-’ the situation where
there are no correct senses with that rank. In general, the higher the “delta
to last” rank value is, the harder the disambiguation task should be. At
a first glance, Group2 and Group3 confirm their inherent complexity w.r.t.
Group1, where rank1 delta values are nearly double. Also notice the very
high rank1 delta of some trees, such as the Shakespeare one, meaning that our
approach also correctly disambiguates terms with a very high number of senses.

[Table 5.2: Delta values of the selected senses.

            Delta to last (rank)           Delta (conf)
          rank1    rank2    rank3       to foll.   from top
eBay      2.154    0.667      -          0.307     -0.043
Google    2.5      2          -          0.244     -0.003
Yahoo     1.733    -          -          0.184      0
Group1    2.129    1.333      -          0.245     -0.015
IMDb      2.444    2.667      -          0.14      -0.017
OLMS      4.118    10.5       -          0.171     -0.02
Shakes.   9.125    2.2        -          0.171     -0.042
Group2    5.229    5.122      -          0.161     -0.026
DBLP      3.4      6.667      -          0.142     -0.039
DCMI      3.273    4          5          0.125     -0.039
Sigmod    5.5      4.375      -          0.168     -0.035
Group3    4.058    5.014      5          0.145     -0.038]

Further, we wanted to analyze the actual confidence values and, in particular:
How far the right senses' confidences are from the incorrect ones, i.e.
how confident the algorithm is in its choices (delta confidence to the
followings, first column of the right part of the table), and, when the choice
is not right, how far the correct sense confidence is from the chosen
one (delta confidence from the top). We see that the “to the followings”
values are sufficiently high (from 14% of Group3 to over 24% of Group1),
while the “from the top” ones are nearly null, meaning very few and small
mistakes. Notice that the wsd choices performed on Group1, which gave the
best results in terms of precision, are also the most “reliable” ones, as we
expected.
In Table 5.2 we showed aggregate delta values for each group; however, we
also found it interesting to investigate the visual trend of the delta confidences
of the terms of a tree. Figure 5.18 shows the double histogram composed of
the delta to the followings (top part) and the delta from the top (bottom
part) values, where the horizontal axis represents the 21 terms of the On Line
Music Shop tree. Notice that for two terms no contributions are present: This
is due to the fact that these terms have only one available sense and, thus,
their disambiguation is not relevant. Further, the graph shows that, when the
upper bars are not particularly high (low confidence), the bottom bars are
not null (wrong disambiguation choices), but only in a very limited number
of cases. In most cases, the upper bars are evidence of good disambiguation
confidence and reliability, with peaks of over 40%.
Up to now we have not considered the contribution of the terms/senses’
feedback to the overall effectiveness of the results, in particular in the
disambiguation of the most ambiguous terms in the tree.

[Figure 5.18: Confidence delta values for OLMS. The double histogram shows, for
each of the 21 OLMS terms, the delta confidence to the following senses (upper
bars) and the delta from the top sense (lower bars).]

For illustration, suggest-
ing the correct meaning of the term volume in the DBLP tree as a book helps
the algorithm in choosing the right meaning for number as a periodic publi-
cation. Moreover, suggesting the correct meaning of the term line (part of
character’s speech) in the Shakespeare tree produces better disambiguation
results, for instance for the speaker term, where the position of the right
sense passes from second to first in the ranking. Notice that, in this case, and
in many others, the feedback on the term merely confirms the top sense in the
ranking (i.e. our algorithm is able to correctly disambiguate it); nonetheless,
this has a positive effect on the disambiguation of the nearby terms since the
“noise” produced by the wrong senses is eliminated. The flexibility of our
approach also allows us to benefit from a completely automatic feedback, where
the results of a given run are refined by automatically disabling the contri-
butions of all but the top X senses in the following runs. We can generally
choose a very low X value, such as 2 or 3, since the right sense is typically
occupying the very top positions in the ranking. For instance, by choosing
X = 2 in the SIGMOD tree, the results of the second run show a precision
increment of almost 17%, and similar results are generally obtainable on all
the considered trees.
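The completely automatic feedback loop can be sketched in Python as follows; the disambiguate() driver named in the comments is hypothetical and stands for a full run of the algorithm of Figure 5.10 over the tree.

def automatic_feedback(rankings, top_x=2):
    # Keep only the top X senses of every term after a run, so that the "noise"
    # of the discarded senses does not affect the following run.
    return {term: senses[:top_x] for term, senses in rankings.items()}

# first_run  = disambiguate(tree, all_senses)              # hypothetical driver
# second_run = disambiguate(tree, automatic_feedback(first_run, top_x=2))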

5.5 Future extensions towards Peer-to-Peer scenarios
In this concluding section, we briefly describe how we plan to enhance the
XML S3 MART system and some of its previously described features in order
to support distributed Peer-to-Peer (P2P) systems and, in particular, Peer
Data Management Systems (PDMS) settings. Indeed, this is a particularly
challenging scenario, requiring additional advanced techniques for efficient
and effective approximate query answering. P2P systems are increasingly
widespread on the Internet and are characterized by high flexibility, rapid evo-
lution and decentralization; however, searching for particular information in
large P2P networks is often quite a long and disappointing task for users.
PDMSs [129] represent a recent proposal trying to make a synthesis between
the flexibility of P2P systems and the semantic expressiveness of the recent
database and XML search techniques. In a PDMS architecture, each user
should be able to search for and exploit all the available content, even when
it resides on peers which are different from the queried one.
In order to deal with the problem of effective search in a PDMS archi-
tecture, many of the techniques presented in this chapter can be exploited,
however several additional issues need to be addressed.
First of all, it is necessary to deal with the data heterogeneity of the
PDMS participants. To this end, the schema matching and query rewriting
approach of the XML S3 MART system can be very useful: Since single peers
are independent entities, they might adopt very different schemas to represent
their data. Thus, suitable semantic mappings containing all the correspon-
dences between their different concepts need to be computed. In order to
exploit XML S3 MART schema matching features, the mapping computation
needs to be modified and transformed into an incremental computation, which
can efficiently update the semantic mappings when new peers connect and
disconnect from the network. With the modified algorithms, each peer can
establish how each of its concepts can be approximated in its “neighbor”
entities by means of a numerical score expressing their semantic closeness.
A further key issue in PDMS querying is the following: It is not always
convenient for a peer to propagate a query towards all other peers. Indeed,
it would be very inefficient to involve entities which contain unrelated data,
since the requesting peer would be overloaded with a large number of insignif-
icant results and the network traffic would be uselessly multiplied. What
would be needed is to exploit the mappings discovered by XML S3 MART in
order to select, for each of the incoming queries, which of the neighbors are
potentially able to solve it effectively. In this way, queries could be propa-
gated only to the peers having a satisfying mapping score for the involved
concepts, a technique which can be defined as routing by mapping.
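A very rough Python sketch of routing by mapping is given below; the per-neighbor mapping scores, the min aggregation and the threshold value are all assumptions used only for illustration, not features of the actual system.

def route_query(query_concepts, neighbor_mappings, threshold=0.5):
    # neighbor_mappings: dict neighbor -> {concept: semantic closeness score}
    # Forward the query only to the neighbors whose mappings approximate
    # every queried concept well enough.
    selected = []
    for neighbor, mapping in neighbor_mappings.items():
        scores = [mapping.get(c, 0.0) for c in query_concepts]
        if scores and min(scores) >= threshold:
            selected.append(neighbor)
    return selected

# route_query(["article", "title"],
#             {"peer1": {"article": 0.9, "title": 0.8}, "peer2": {"article": 0.2}})
# -> ["peer1"]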
In choosing the most convenient neighbors for query propagation, all the
information available in the subnetworks rooted at them should be taken into
consideration. However, in a P2P context it is not possible for each node to
perform schema matching on every other possible entity in the network. One
interesting way to handle this problem would be devising and exploiting
ad-hoc data structures, which could be named Semantic Routing Indexes


(SRIs), in which each peer could store summary information on how its
contents are semantically approximated in the whole subnetworks rooted on
the neighboring peers.
Finally, another essential improvement to the XML S3 MART system
would be to immerse it in a complete PDMS simulation environment in which
to verify its behavior in a distributed setting and, ultimately, to assess the
effectiveness of its matching and semantic indexing features.
Some very promising initial results about all the ideas and techniques
described in this section can be found in [117].
Chapter 6

Multi-version management
and personalized access
to XML documents

Nowadays XML is universally accepted as the standard for structural data


representation and exchange and, as we have also seen in the last two chap-
ters, the problem of supporting structural querying in XML databases is an
appealing research topic for the database community. As data changes over
time, the possibility to deal with historical information is essential to many
computer applications, such as accounting, banking, law, medical records
and customer relationship management. In recent years, researchers have
tried to provide answers to this need by proposing models and languages for
representing and querying the temporal aspect of XML data. Recent works
on this topic include [44, 56, 58, 99].
The central issue in supporting temporal versioning, i.e. in answering most
temporal queries in any language, is time-slicing the input data while retaining period
timestamping. A time-varying XML document records a version history and
temporal slicing makes the different states of the document available to the
application needs. While a great deal of work has been done on temporal
slicing in the database field [54], the paper [56] has the merit of having been
the first to raise the temporal slicing issue in the XML context, where it is
complicated by the fact that timestamps are distributed throughout XML
documents. The solution proposed in [56] relies on a stratum approach whose
advantage is that it can exploit existing techniques in the underlying XML
query engine, such as query optimization and query evaluation. However,
standard XML query engines are not aware of the temporal semantics, which
makes it more difficult to map temporal XML queries into efficient
“vanilla” queries and to apply query optimization and indexing techniques

particularly suited for temporal XML documents. Thus, a native solution to
the temporal slicing problem is needed in order to effectively and efficiently
manage temporal versioning.
One of the most interesting scenarios in which this is particularly essen-
tial is the eGovernment one. Indeed, we are witnessing a strong institutional
push towards the implementation of eGovernment support services, aimed
at a higher level of integration and involvement of the citizens in the Public
Administration (PA) activities that concern them. In this context, collec-
tions of norm texts and legal information presented to citizens are becoming
popular on the internet, and one of the main objectives of many research
activities and projects is the development of techniques supporting not only
temporal querying but also, as in [52], another versioning dimension, the se-
mantic one, thus enabling personalization facilities. Here, personalization plays an
important role, because some norms or some of their parts have or acquire a
limited applicability. For example, a given norm may contain some articles
which are only applicable to particular classes of citizens (e.g. public employ-
ees). Hence, a citizen accessing the repository may be interested in finding
a personalized version of the norm, that is a version only containing articles
which are applicable to his/her personal case. In existing works, personal-
ization is either absent (e.g. www.normeinrete.it) or predefined by human
experts and hardwired in the repository structure (e.g. www.italia.gov.it),
whereas flexible and on-demand personalization services are lacking.
In this chapter we propose new techniques for the effective and efficient
management and querying of time varying XML documents and for their
personalized access. In particular:

• in the first part (Section 6.1) we deal with the problem of managing and
querying time-varying multi-version XML documents in a completely
general scenario. In particular, we propose a native solution to the
temporal slicing problem, addressing the question of how to construct
an XML query processor supporting time-slicing [88];

• in the second part (Section 6.2) we focus on the eGovernment scenario


and we present how the slicing technology described in Section 6.1 can
be adapted and exploited in a complete normative system in order to
provide efficient access to temporal XML norm texts repositories [89].
Further, we propose additional techniques in order to support semantic
versioning and personalized access to them.

Finally, we provide a related work discussion on temporal representation,


querying and personalized access (Section 6.3), and we conclude by providing
an extensive experimental evaluation of all the proposed techniques (Section


6.4).

6.1 Temporal versioning and slicing support


In this section we propose a native solution to the temporal slicing problem,
addressing the question of how to construct an XML query processor sup-
porting time-slicing. The underlying idea is to propose the changes that a
“conventional” XML pattern matching engine would need to be able to slice
time-varying XML documents. The advantage of this solution is that we can
benefit from the XML pattern matching techniques present in the literature,
where the focus is on the structural aspects which are intrinsic also in tem-
poral XML data, and that, at the same time, we can freely extend them to
become temporally aware.
We begin by providing some background in Section 6.1.1, where the tem-
poral slicing problem is defined. Then, we propose a novel temporal index-
ing scheme (first subsection of Section 6.1.2), which adopts the inverted list
technology proposed in [139] for XML databases and changes it in order to
allow the storing of time-varying XML documents, and we show how a time-
varying XML document can be encoded in it. Finally, we devise a flexible
technology supporting temporal slicing (remaining parts of Section 6.1.2).
It consists of alternative solutions supporting temporal slicing on the above
storing scheme, all relying on the holistic twig join approach [25], which is one
of the most popular approaches for XML pattern matching. The proposed
solutions act at the different levels of the holistic twig join architectures with
the aim of limiting main memory space requirements, I/O and CPU costs.
They include the introduction of novel algorithms and the exploitation of
different access methods.

6.1.1 Preliminaries
A time-varying XML document records a version history, which consists of
the information in each version, along with timestamps indicating the lifetime
of that version [44]. The left part of Figure 6.1 shows the tree representa-
tion of our reference time-varying XML document taken from a legislative
repository of norms. Data nodes are identified by capital letters. For sim-
plicity’s sake, timestamps are defined on a single time dimension and the
granularity is the year. Temporal slicing is essentially the snapshot of the
time-varying XML document(s) at a given time point but, in its broader
meaning, it consists in computing simultaneously the portion of each state
of time-varying XML document(s) which is contained in a given period and
which matches with a given XML query twig pattern. Moreover, it is often
required to combine the results back into a period-stamped representation
[56]. The right part of Figure 6.1 shows the output of a temporal slicing
example in the period [1994, now] for the query twig contents//article.

[Figure: the left panel, “Time-varying XML document (fragment)”, shows the
tree representation of the reference document with its timestamped nodes; the
right panel, “Time-slicing example”, shows the resulting slices, rooted at the
contents node B, for the query twig contents//article in the period
[1994, now].]

Figure 6.1: Reference example.
This section introduces a notation for time-varying XML documents and a
formal definition for the temporal slicing problem.

Document representation
A temporal XML model is required when there is the need of managing
temporal information in XML documents and the adopted solution usually
depends on the peculiarities of the application one wants to support. For
the sake of generality, our proposal is not bound to a specific temporal XML
model. On the contrary, it is able to deal with time-varying XML docu-
ments containing timestamps defined on an arbitrary number of temporal
dimensions and represented as temporal elements [54], i.e. disjoint union of
periods, as well as single periods.
In the following, we will refer to time-varying XML documents by adopt-
ing part of the notation introduced in [44]. A time-varying XML database
is a collection of XML documents, also containing time-varying documents.
We denote with DT a time-varying XML document represented as an ordered
labelled tree containing timestamped elements and attributes (in the follow-
ing denoted as nodes) related by some structural relationships (ancestor-
descendant, parent-child, preceding-following). The timestamp is a temporal
element chosen from one or more temporal dimensions and records the life-
time of a node. Not all nodes are necessarily timestamped. We will use the
notation nT to signify that node n has been timestamped and lif etime(nT )
to denote its lifetime. Sometimes it can be necessary to extend the lifetime
of a node n[T ] , which can be either temporal or snapshot, to a temporal di-
mension not specified in its timestamp. In this case, we follow the semantics
given in [45]: If no temporal semantics is provided, for each newly added tem-
poral dimension we set the value on this dimension to the whole time-line,
i.e. [t0 , t∞ ).
The snapshot operator is an auxiliary operation which extracts a complete
snapshot or state of a time-varying document at a given instant and which
is particularly useful in our context. Timestamps are not represented in the
snapshot. A snapshot at time t replaces each timestamped node nT with its
non-timestamped copy if t is in lifetime(nT ), or with the empty string
otherwise. The snapshot operator is defined as snp(t, DT ) = D, where D is
the snapshot at time t of DT .
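
For illustration only, the snapshot operator can be sketched in a few lines of
Python (a simplified fragment under assumed representations; the node and
lifetime encodings below are hypothetical, not the system's actual ones):

    # Minimal sketch of the snapshot operator snp(t, D^T).
    # Assumption: a node is a dict with 'label', 'lifetime' (a list of
    # (start, end) periods, or None for snapshot nodes) and 'children'.

    def alive(node, t):
        """A node without timestamp is always alive; otherwise t must fall
        in one of the periods of its lifetime."""
        if node.get('lifetime') is None:
            return True
        return any(start <= t <= end for (start, end) in node['lifetime'])

    def snp(t, node):
        """Return the non-timestamped copy of the subtree rooted at node at
        time t, or None (standing for the empty string) if the node is not
        alive at t."""
        if not alive(node, t):
            return None
        children = [c for c in (snp(t, child) for child in node['children'])
                    if c is not None]
        return {'label': node['label'], 'children': children}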

The time-slice operator

The time-slice operator is applied to a time-varying XML database and is


defined as time-slice(twig,t-window). The twig parameter is a non-
temporal node-labeled twig pattern which is defined on the snapshot schema
[44] of the database through any XML query language, e.g. XQuery, by
specifying a pattern of selection predicates on multiple elements having some
specified tree structured relationships. It defines the portion of interest in
each state of the documents contained in the database. It can also be the
whole document. The t-window parameter is the temporal window on which
the time-slice operator has to be applied. More precisely, by default temporal
slicing is applied to the whole time-lines, that is by using every single time
point contained in the time-varying documents. With t-window, it is possible
to restrict the set of time points by specifying a collection of periods chosen
from one or more temporal dimensions.
Given a twig pattern twig, a temporal window t-window and a time-varying
XML database TXMLdb, a slice is a mapping from nodes in twig to nodes
in TXMLdb, such that: (i) query node predicates are satisfied by the corre-
sponding document nodes, thus determining the tuple (n1[T], . . . , nk[T]) of
database nodes that identifies a distinct match of twig in TXMLdb; (ii)
(n1[T], . . . , nk[T]) is structurally consistent, i.e. the parent-child and ancestor-
descendant relationships between query nodes are satisfied by the correspond-
ing document nodes; (iii) (n1[T], . . . , nk[T]) is temporally consistent, i.e. its
lifetime lifetime(n1[T], . . . , nk[T]) = lifetime(n1[T]) ∩ . . . ∩ lifetime(nk[T]) is
not empty and is contained in the temporal window, lifetime(n1[T], . . . , nk[T])
⊆ t-window. For instance, in the reference example, the tuple (B,D) is struc-
turally but not temporally consistent as lifetime(B) ∩ lifetime(D) = ∅. In
this chapter, we consider the temporal slicing problem:

    Given a twig pattern twig, a temporal window t-window and
    a time-varying XML database TXMLdb, for each distinct slice
    (n1[T], . . . , nk[T]), time-slice(twig,t-window) computes the snap-
    shot snp(t, (n1[T], . . . , nk[T])), where t ∈ lifetime(n1[T], . . . , nk[T]).

Obviously, it is possible to provide a period-timestamped representation of
the results by associating each distinct state snp(t, (n1[T], . . . , nk[T])) with its
pertinence lifetime(n1[T], . . . , nk[T]) in t-window.
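
As a purely illustrative reference, the semantics above can be rendered as a
brute-force Python sketch (not the technique developed in the remainder of
the section, which avoids this enumeration; structural matching is assumed to
be already available, lifetimes are lists of periods, and the temporal window is
a single period for simplicity):

    def periods_intersection(a, b):
        """Intersection of two temporal elements (lists of (start, end) periods)."""
        out = []
        for s1, e1 in a:
            for s2, e2 in b:
                lo, hi = max(s1, s2), min(e1, e2)
                if lo <= hi:
                    out.append((lo, hi))
        return out

    def time_slice(structural_matches, t_window):
        """structural_matches: tuples of (node_id, lifetime) pairs, one per
        distinct structural match of the twig; t_window: a single (From, To)
        period. Yields (nodes, pertinence) for every temporally consistent slice."""
        w_from, w_to = t_window
        for match in structural_matches:
            common = match[0][1]
            for _, lifetime in match[1:]:
                common = periods_intersection(common, lifetime)
            # non-empty intersection contained in the temporal window
            if common and all(w_from <= s and e <= w_to for s, e in common):
                yield tuple(node for node, _ in match), common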

6.1.2 Providing a native support for temporal slicing


Existing work on “conventional” XML query processing (see, for example,
[139]) shows that capturing the XML document structure using traditional
indices is a good solution, on which it is possible to devise efficient structural
or containment join algorithms for twig pattern matching. Since timestamps
are distributed throughout the structure of XML documents, we decided to start
from one of the most popular approaches for XML query processing, whose
efficiency in solving structural constraints has been proven. In particular, our so-
lution for temporal slicing support consists of an extension to the indexing
scheme described in [139], such that time-varying XML databases can be im-
plemented, and of alternative changes to the holistic twig join technology [25]
in order to efficiently support the time-slice operator in different scenarios.

The temporal indexing scheme


The indexing scheme described in [139] is an extension of the classic inverted
index data structure in information retrieval which maps elements and strings
to inverted lists. The position of a string occurrence in the XML database
is represented in each inverted list as a tuple (DocId, LeftPos,LevelNum)
and, analogously, the position of an element occurrence as a tuple (DocId,
LeftPos:RightPos,LevelNum) where (a) DocId is the identifier of the doc-
ument, (b) LeftPos and RightPos can be generated by counting word num-
bers from the beginning of the document DocId until the start and end
of the element, respectively, and (c) LevelNum is the depth of the node
in the document. In this context, structural relationships between tree
nodes can be easily determined: (i) ancestor-descendant: A tree node n2
encoded as (D2, L2:R2, N2) is a descendant of the tree node n1 encoded as
(D1, L1:R1, N1) iff D1 = D2, L1 < L2, and R2 < R1; (ii) parent-child: n2 is
a child of n1 iff it is a descendant of n1 and L2 = L1 + 1.

    law       (1, 1:14, 1 | 1970:now)
    contents  (1, 2:13, 2 | 1991:1994), (1, 2:13, 2 | 1996:now)
    section   (1, 3:8, 3 | 1970:2003), (1, 9:12, 3 | 2004:now)
    article   (1, 4:5, 4 | 1970:1990), (1, 6:7, 4 | 1995:1998),
              (1, 6:7, 4 | 2001:2003), (1, 10:11, 4 | 2004:now)

Figure 6.2: The temporal inverted indices for the reference example
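
The two containment conditions translate directly into a couple of predicates
over the encoded tuples. The following Python sketch is only illustrative (the
Node representation and the function names are assumptions, not taken from
the actual implementation):

    from collections import namedtuple

    # (DocId, LeftPos, RightPos, LevelNum) encoding of an element occurrence
    Node = namedtuple('Node', 'doc left right level')

    def is_descendant(n2, n1):
        """n2 is a descendant of n1 iff they belong to the same document and
        the region of n2 is strictly contained in the region of n1."""
        return n2.doc == n1.doc and n1.left < n2.left and n2.right < n1.right

    def is_child(n2, n1):
        """Parent-child additionally requires the levels to differ by one."""
        return is_descendant(n2, n1) and n2.level == n1.level + 1

    # Example with the 'contents' and 'section' tuples of Figure 6.2:
    contents = Node(1, 2, 13, 2)
    section = Node(1, 3, 8, 3)
    assert is_child(section, contents)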
As temporal XML documents are XML documents containing time-varying
data, they can be indexed using the interval-based scheme described above
and thus by indexing timestamps as “standard” tuples. On the other hand,
timestamped nodes have a specific semantics which should be exploited
when documents are accessed and, in particular, when the time-slice op-
eration is applied. Our proposal adds time to the interval-based indexing
scheme by substituting the inverted indices in [139] with temporal inverted
indices. In each temporal inverted index, besides the position of an el-
ement occurrence in the time-varying XML database, the tuple (DocId,
LeftPos:RightPos,LevelNum|TempPer) contains an implicit temporal at-
tribute [54], TempPer. It consists of a sequence of From:To temporal at-
tributes, one for each involved temporal dimension, and represents a pe-
riod. Thus, our temporal inverted indices are in 1NF and each timestamped
node nT , whose lifetime is a temporal element containing a number of peri-
ods, is encoded through as many tuples having the same projection on the
non-temporal attributes (DocId, LeftPos:RightPos,LevelNum) but with
different TempPer values, each representing a period. All the temporal in-
verted indices are defined on the same temporal dimensions such that tuples
coming from different inverted indices are always comparable from a tem-
poral point of view. Therefore, given the number h of the different tempo-
ral dimensions represented in the time-varying XML database, TempPer is
From1 :To1 ,...,Fromh :Toh .
In this context, each time-varying XML document to be inserted in the
database undergoes a pre-processing phase where (i) the lifetime of each node
is derived from the timestamps associated with it, (ii) if needed, the resulting
lifetime is extended to the temporal dimensions on which it has not been
defined by following the approach described in Subsection 6.1.1. Figure 6.2
illustrates the structure of the four indices for the reference example. Notice
that the snapshot node A, whose label is law, is extended to the temporal
dimension by setting the pertinence of the corresponding tuple to [1970, now].
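
This pre-processing step can be sketched as follows (a simplified Python
fragment under assumed representations; the unfolding of a multi-period,
multi-dimensional lifetime into one tuple per period combination is a hedged
reading of the scheme, not the system's actual code):

    T0, T_INF = 0, float('inf')   # the whole time-line [t0, t_inf)

    def encode_node(doc_id, left, right, level, lifetime, n_dims):
        """Unfold a (possibly timestamped) node into temporal inverted index
        tuples (DocId, LeftPos:RightPos, LevelNum | TempPer).
        lifetime: for each temporal dimension, a list of (From, To) periods,
        or None if the node is not timestamped on that dimension."""
        # extend missing dimensions to the whole time-line
        dims = [periods if periods else [(T0, T_INF)]
                for periods in (lifetime or [None] * n_dims)]
        # one tuple per combination of periods (TempPer = From1:To1,...,Fromh:Toh)
        tuples = [((doc_id, left, right, level), ())]
        for periods in dims:
            tuples = [(pos, temp + (p,)) for (pos, temp) in tuples for p in periods]
        return tuples

    # Node E of the reference example, one temporal dimension, two periods:
    print(encode_node(1, 6, 7, 4, [[(1995, 1998), (2001, 2003)]], 1))
    # -> [((1, 6, 7, 4), ((1995, 1998),)), ((1, 6, 7, 4), ((2001, 2003),))]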

A technology for the time-slice operator

[Figure: the four levels of the architecture, from bottom to top: the inverted
indices Iq1, . . . , Iqn (level L0); the current nodes nq1, . . . , nqn with their
buffers Bq1, . . . , Bqn (level L1); the stacks Sq1, . . . , Sqn (level L2); the
solutions (level SOL).]

Figure 6.3: The basic holistic twig join four level architecture

The basic four level architecture of the holistic twig join approach is de-
picted in Figure 6.3. Similarly to the tree signature twig matching algorithms
we described in Chapter 4, the approach maintains in main-memory a chain
of linked stacks to compactly represent partial results to root-to-leaf query
paths, which are then composed to obtain matches for the twig pattern (level
SOL in Figure). In particular, given a path involving the nodes q1 , . . . , qn , the
two stack-based algorithms presented in [25], one for path matching and the
other for twig matching, work on the inverted indices Iq1 , . . . , Iqn (level L0 in
Figure) and build solutions from the stacks Sq1 , . . . , Sqn (level L2 in Figure).
During the computation, thanks to a deletion policy the set of stacks contains
data nodes which are guaranteed to lie on a root-to-leaf path in the XML
database and thus represents in linear space a compact encoding of partial
and total answers to the query twig pattern. The skeleton of the two holistic
twig join algorithms (HTJ algorithms in the following) is presented in Figure
6.4. At each iteration the algorithms identify the next node to be processed.
To this end, for each query node q, at level L1 is the node in the inverted in-
dex Iq with the smallest LeftPos value not yet processed. Among those,
the algorithms choose the node with the smallest value, let it be nq̄. Then,
given knowledge of such node, they remove partial answers from the stacks
that cannot be extended to total answers and push the node nq̄ into the stack
Sq̄ . Whenever a node associated with a leaf node of the query path is pushed
on a stack, the set of stacks contains an encoding of total answers and the
algorithms output these answers. The algorithms presented in [25] have been
further improved in [30, 72]. As our solutions do not modify the core of such
algorithms, we refer interested readers to the above cited papers.

    While there are nodes to be processed
    (1) Choose the next node nq̄
    (2) Apply the deletion policy
    (3) Push the node nq̄ into the pertinence stack Sq̄
    (4) Output solutions

Figure 6.4: Skeleton of the holistic twig join algorithms (HTJ algorithms)
The time-slice operator can be implemented by applying minimal changes
to the holistic twig join architecture. The time-varying XML database is
recorded in the temporal inverted indices which substitute the “conventional”
inverted index at the lower level of the architecture and thus the nodes in the
stacks will be represented both by the position and the temporal attributes.
Given a twig pattern twig and a temporal window t-window, a slice is the
snapshot of any answer to twig which is temporally consistent. Thus the
holistic twig join algorithms continue to work as they are responsible for
the structural consistency of the slices and provide the best management of
the stacks from this point of view. Temporal consistency, instead, must be
checked on each answer output of the overall process. In particular, for each
potential slice ((D, L1 : R1 , N1 |T1 ), . . . , (D, Lk : Rk , Nk |Tk )) it is necessary
to intersect the periods represented by the values T1 , . . . , Tk and then check
both that such intersection is not empty and that it is contained in the
temporal window. Finally, the snapshot operation is simply a projection
of the temporally consistent answers on the non-temporal attributes. In
this way, we have described the “first step” towards the realization of a
temporal XML query processor. On the other hand, the performances of
this first solution are strictly related to the peculiarities of the underlying
database. Indeed, XML documents usually contain millions of nodes and
this is absolutely true in the temporal context where documents record the
history of the applied changes. Thus, the holistic twig join algorithms can
produce a lot of answers which are structurally consistent but which are
eventually discarded as they are not temporally consistent. This situation
implies useless computations due to an uncontrolled growth of the number
of tuples put on the stacks.
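
Concretely, with the multidimensional TempPer representation the consistency
check reduces to intersecting periods dimension by dimension. The following
Python sketch of the post-filter is only illustrative (tuple representation and
function names are assumptions, not the actual implementation):

    def intersect_md(p1, p2):
        """Intersection of two multidimensional periods, each a tuple of
        (From, To) pairs, one per temporal dimension; None if empty."""
        out = []
        for (f1, t1), (f2, t2) in zip(p1, p2):
            lo, hi = max(f1, f2), min(t1, t2)
            if lo > hi:
                return None
            out.append((lo, hi))
        return tuple(out)

    def contained(p, window):
        """True if the multidimensional period p is contained in window."""
        return all(wf <= f and t <= wt for (f, t), (wf, wt) in zip(p, window))

    def check_slice(answer, t_window):
        """answer: list of ((DocId, Left, Right, Level), TempPer) tuples
        produced by the HTJ algorithms. Returns the snapshot (the projection
        on the positional part) if the answer is temporally consistent,
        None otherwise."""
        common = answer[0][1]
        for _, temp_per in answer[1:]:
            common = intersect_md(common, temp_per)
            if common is None:
                return None
        return [pos for pos, _ in answer] if contained(common, t_window) else None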
Temporal consistency considers two aspects: The intersection of the in-
volved lifetimes must be non-empty (non-empty intersection constraint in the
following) and it must be contained in the temporal window (containment
constraint in the following). We devised alternative solutions which rely on
the two different aspects of temporal consistency and act at the different

levels of the architecture with the aim of limiting the number of temporally
useless nodes the algorithms put in the stacks. The reference architecture is
slightly different from the one presented in Figure 6.3. Indeed, in our con-
text, any timestamped node whose lifetime is a temporal element is encoded
into multiple tuples (e.g. see the encoding of the timestamped node E in the
reference example). Thus, at level L1, each node nq must be interpreted as
the set of tuples encoding nq . They are stored in buffer Bq and step 3 of the
HTJ algorithms empties Bq and pushes the tuples in the stack Sq .

Non-empty intersection constraint


Not all the temporal tuples which enter level L1 will eventually belong to the
set of slices. In particular, some of them will be discarded due to the non-
empty intersection constraint. The following proposition characterizes this as-
pect. Without loss of generality, it only considers paths, as the twig matching
algorithm relies on the path matching one.

Proposition 6.1 Let (D, L : R, N | T) be a tuple belonging to the temporal
inverted index Iq, let Iq1, . . . , Iqk be the inverted indices of the ancestors of q,
and let TPqi = ⋃ σLeftPos<L(Iqi)|TempPer, for i ∈ [1, k], be the union of the temporal
pertinences of all the tuples in Iqi having LeftPos smaller than L. Then
(D, L : R, N | T) will belong to no slice if the intersection of its temporal
pertinence with TPq1, . . . , TPqk is empty, i.e. T ∩ TPq1 ∩ . . . ∩ TPqk = ∅.

Notice that, at each step of the process, the tuples having LeftPos smaller
than L can be in the stacks, in the buffers or still have to be read from
the inverted indices. However, looking for such tuples in the three levels
of the architecture would be quite computationally expensive. Thus, in the
following we introduce a new approach for buffer loading which allows us
to look only at the stack level. Moreover, we avoid accessing the temporal
pertinence of the tuples contained in the stacks by associating a temporal
pertinence to each stack (temporal stack ). Such a temporal pertinence must
therefore be updated at each push and pop operation. At each step of the
process, for efficiency purposes both in the update and in the intersection
phase, such a temporal pertinence is the smallest multidimensional period Pq
containing the union of the temporal pertinences of the tuples in the stack Sq.
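
A possible way to maintain Pq is sketched below (a simplified, single-dimension
Python fragment; recomputing the enclosing period on pop is one assumed
maintenance strategy among others, not necessarily the one adopted by the
actual system):

    class TemporalStack:
        """Stack of (tuple, period) entries that also maintains P_q, the
        smallest period enclosing the union of the pertinences it contains."""

        def __init__(self):
            self.entries = []      # list of (positional_tuple, (From, To))
            self.pq = None         # enclosing period, None when empty

        def push(self, pos, period):
            self.entries.append((pos, period))
            if self.pq is None:
                self.pq = period
            else:
                self.pq = (min(self.pq[0], period[0]), max(self.pq[1], period[1]))

        def pop(self):
            pos, period = self.entries.pop()
            if self.entries:
                # recompute the tight enclosing period of the remaining entries
                self.pq = (min(p[0] for _, p in self.entries),
                           max(p[1] for _, p in self.entries))
            else:
                self.pq = None
            return pos, period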
The aim of our buffer loading approach is to avoid loading the temporal
tuples encoding a node n[T ] in the pertinence buffer Bq if the inverted indices
associated with the parents of q contain tuples with LeftPos smaller than
that of nq and not yet processed. Such an approach is consistent with step
1 of the HTJ algorithms, as it chooses the node at level L1 with the smallest
LeftPos value, and it ensures that when n[T] enters Bq all the tuples involved
in Prop. 6.1 are in the stacks. The algorithm implementing step 1 of the
HTJ algorithms is shown in Figure 6.5.

    Input: Twig pattern twig, the last processed node n←q
    Output: Next node nq̄ to be processed
    Algorithm Load:
    (1)  if all buffers are empty
    (2)      start = root(twig);
    (3)  else
    (4)      start = ←q;
    (5)  for each query node q from start to leaf(twig)
    (6)      get nq;
    (7)      minq is the minimum between nq.LeftPos and minparent(q);
    (8)      if nq.LeftPos is equal to minq
    (9)          load nq into Bq;
    (10) return the last node inserted into the buffers

Figure 6.5: The buffer loading algorithm Load

We associate each buffer Bq with
the minimum minq among the LeftPos values of the tuples contained in the
buffer itself and those of its ancestors. Assuming that all buffers are empty,
the algorithm starts from the root of the twig (step 2) and, for each node q
up to the leaf, it updates the minimum minq and inserts nq, the node in Iq
with the smallest LeftPos value not yet processed, if its LeftPos value equals
minq. The same applies when some buffers are not empty. In this case, it
starts from the query node matching with the previously processed data node
and it can be easily shown that the buffers of the ancestors of such node are
not empty whereas the buffers of the subpath rooted by such node are all
empty.
Lemma 6.1 Assume that step 1 of the HTJ algorithms depicted in Figure
6.4 is implemented by the algorithm Load. The tuple (D, L : R, N | T) in Bq
will belong to no slice if the intersection of its temporal pertinence T with the
multidimensional period Pq1→qk = Pq1 ∩ . . . ∩ Pqk, obtained by intersecting the
periods of the stacks of the ancestors q1, . . . , qk of q, is empty.
For instance, at the first iteration of the HTJ algorithms applied to the
reference example, step 1 and step 3 produce the situation depicted in Figure
6.6. Notice that when the tuple (1, 4 : 5, 4|1970 : 1990) encoding node D
(label article) enters level L1 all the tuples with LeftPos smaller than 4
are already at level L2 and due to the above Lemma we can state that it will
belong to no slice.

[Figure: at step 1 the two tuples encoding node B, (1, 2:13, 2 | 1991:1994)
and (1, 2:13, 2 | 1996:now), are loaded into the contents buffer at level L1
(mincontents = minarticle = 2); at step 3 they are pushed into the contents
stack at level L2, whose enclosing period becomes [1991, now], while the
article stack is still empty.]

Figure 6.6: State of levels L1 and L2 during the first iteration

Thus, the non-empty intersection constraint can be exploited to prevent


the insertion of useless nodes into the stacks by acting at level L1 and L2 of
the architecture. At level L2 we act at step 3 of the HTJ algorithms by sim-
ply avoiding pushing into the stack Sq each temporal tuple (D, L : R, N |T )
encoding the next node to be processed which satisfies Lemma 6.1, i.e. such
that T ∩ Pq1 →qk = ∅. At level L1, instead, we act at step 9 of the algorithm
Load by avoiding loading in any buffer Bq each temporal tuple encoding nq
which satisfies Lemma 6.1. More precisely, given the LeftPos value of the last
processed node, say CurLef tP os, we only load each tuple (D, L : R, N |T )
such that L is the minimum value greater than CurLef tP os and T inter-
sects Pq1 →qk . To this purpose, our solution uses time-key indices combining
the LeftPos attribute with the attributes Fromj :Toj in the TempPer im-
plicit attribute representing one temporal dimension in order to improve the
performances of range-interval selection queries on the temporal inverted in-
dices. In particular, we considered two access methods: The B+-tree and a
temporal index, the Multiversion B-tree (MVBT) [13].
A one-dimensional index like the B+-tree clusters data primarily on a
single attribute. Thus, we built B+-trees that cluster first on the LeftPos
attribute and then on the interval end time Toj. In this way, we can take
advantage of sequential I/O as tree leaf pages are linked and records in them
are ordered. In particular, we start with the first leaf page that contains a
LeftPos value greater than CurLef tP os and a Toj value greater than or
equal to Pq1 →qk |Fromj , i.e. the projection of the period Pq1 →qk on the interval
start time Fromj . Then we proceed by loading the records until the leaf page
with the next LeftPos value or with a Fromj value greater than Pq1 →qk |Toj is
met. This has the effect of selecting each tuple (D, L : R, N | T) where L is the
smallest value greater than CurLeftPos and whose period T|Fromj:Toj intersects
the period Pq1→qk|Fromj:Toj, as T|Toj ≥ Pq1→qk|Fromj and T|Fromj ≤ Pq1→qk|Toj.
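
The leaf scan can be emulated in memory as in the following Python sketch
(only illustrative: the records list stands in for the linked, ordered leaf pages,
and the function name is hypothetical):

    def load_candidates(records, cur_left, window):
        """records: index entries (LeftPos, To, From, payload) ordered on
        (LeftPos, To), emulating the linked leaf pages of the B+-tree built on
        a temporal inverted index. Returns the payloads having the smallest
        LeftPos greater than cur_left and a period [From, To] intersecting
        window = (w_from, w_to)."""
        w_from, w_to = window
        out, target = [], None
        for left, to, frm, payload in records:
            if left <= cur_left or to < w_from or frm > w_to:
                continue               # outside the range-interval selection
            if target is None:
                target = left          # first qualifying LeftPos found
            if left != target:
                break                  # next LeftPos value reached: stop
            out.append(payload)
        return out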
The alternative approach we considered is to maintain multiple versions


of a standard B+-tree through an MVBT. An MVBT index record contains
a key, a time interval and a pointer to a page and, thus, this structure is able
to directly support our range-interval selection requirements.

Containment constraint
The following proposition is the equivalent of Prop. 6.1 when the containment
constraint is considered.

Proposition 6.2 Let (D, L : R, N |T ) be a tuple belonging to the temporal


inverted index Iq . Then (D, L : R, N |T ) will belong to no slice if the in-
tersection of its temporal pertinence with the temporal window t-window is
empty.

It allows us to act at levels L1 and L2, but also between level L0 and level L1.
At levels L1 and L2 the approach is the same as for the non-empty intersection
constraint; it is sufficient to use the temporal window t-window, and thus
Prop. 6.2, instead of Lemma 6.1. Moreover, it is also possible to add an
intermediate level between level L0 and level L1 of the architecture, which
we call “under L1” (UL1), where only the tuples satisfying Prop. 6.2 are
selected from each temporal inverted index, ordered on the basis of their
(DocId,LeftPos) values and then pushed into the buffers. Similarly to the
approach explained in the previous section, to speed up the selection we
exploit B+-tree indices built on one temporal dimension. Notice that this
solution deals with buffers as streams of tuples, and thus it provides interesting
efficiency improvements only when the temporal window is quite selective.
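
A sketch of such an intermediate UL1 step, under the same assumed tuple
representation used in the previous fragments (one temporal dimension,
pos = (DocId, Left, Right, Level)):

    def under_l1(index, t_window):
        """Select from a temporal inverted index only the tuples whose period
        intersects t_window (Prop. 6.2), ordered on (DocId, LeftPos), so that
        the buffers can consume them as a stream."""
        w_from, w_to = t_window
        selected = [(pos, period) for pos, period in index
                    if period[1] >= w_from and period[0] <= w_to]
        return sorted(selected, key=lambda t: (t[0][0], t[0][1]))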

Combining solutions
The non-empty intersection constraint and the containment constraint are
orthogonal thus, in principle, the solutions presented in the above subsections
can be freely combined in order to decrease the number of useless tuples we
put in the stacks. Each combination gives rise to a different scenario denoted
as “X/Y”, where “X” and “Y” are the employed solutions for the non-empty
intersection constraint and for the containment constraint, respectively (e.g.
scenario L1/L2 employs solution L1 for the non-empty intersection constraint
and solution L2 for the containment constraint). Some of these scenarios will
be discussed in the following. First, scenario L1/UL1 is not applicable since
in solution UL1 selected data is kept and read directly from buffers, with no
chance of additional indexing. Instead, in scenario L1/L1 the management of
the two constraints can be easily combined by querying the indices with the

intersection of the temporal pertinence of the ancestors (Proposition 6.1) and


the required temporal window. All other combinations are straightforwardly
achievable, but not necessarily advisable. In particular, when L1 is involved
for any of the two constraints the L1 indices have to be built and queried:
Therefore, it is best to combine the management of the two constraints as
in L1/L1 discussed above. Finally, notice that the baseline scenario is the
SOL/SOL one, involving none of the solutions discussed in this chapter.

6.2 Semantic versioning and personalization support

In this section, we show how the slicing technology described in Section 6.1
can be adapted and exploited in a real application scenario. In particular,
in the context of the research activity entitled “Semantic web techniques for
the management of digital identity and the access to norms”, which we have
carried out as part of the PRIN national project “European Citizen in eGov-
ernance: legal-philosophical, legal, computer science and economical aspects”
[52], we focus on eGovernment and on the development of a complete norma-
tive querying system providing efficient access to temporal XML norm texts
repositories. Further, we add support for the semantic versioning dimension,
allowing a fully personalized access to the required documents. Indeed, as we
have seen in the introduction of the chapter, in eGovernment personalization
plays a very important role, because some norms or some of their parts have
or acquire a limited applicability. For example, a given norm may contain
some articles which are only applicable to particular classes of citizens (e.g.
public employees). Hence, a citizen accessing the repository may be inter-
ested in finding a personalized version of the norm, that is a version only
containing articles which are applicable to his/her personal case.
The following subsections are organized as follows: Section 6.2.1 describes
the complete eGovernment infrastructure, while Section 6.2.2 investigates
the aspects of personalized access to multi-version documents, where the
versioning is both temporal and semantic.

6.2.1 The complete infrastructure


In order to enhance the participation of citizens in an eGovernance proce-
dure of interest, their automatic and accurate positioning within the reference
legal framework is needed. To solve this problem we employ Semantic Web
techniques and introduce a civic ontology, which corresponds to a classifi-
cation of citizens based on the distinctions introduced by subsequent norms
which imply some limitation (total or partial) in their applicability. In the
following, we refer to such norms as founding acts.

[Figure: the simple elaboration unit, acting as user interface, interacts with
the Web services and databases of the Public Administration for the identifi-
cation phase (step 1), with the classification service based on the civic ontol-
ogy OC, which assigns a class CX to the citizen (step 2), and with the XML
repository of annotated norms, maintained through the creation/update phase,
for the querying phase and the collection of the results (step 3).]

Figure 6.7: The Complete Infrastructure

Moreover, we define
the citizen’s digital identity as the total amount of information concerning
him/her (necessary for the sake of classification with respect to the ontology)
which is available online [116]. Such information must be retrievable in an
automatic, secure and reliable way from the PA databases through suitable
Web services (identification services). For instance, in order to see whether
a citizen is married, a simple query concerning his/her marital status can
be issued to registry databases. In this way, the classification of the citizen
accessing the repository makes it possible to produce the most appropriate
version of all and only norms which are applicable to his/her case.
Hence, the resulting complete infrastructure is composed of various com-
ponents that have to communicate with each other to collect partial and
final results (see Figure 6.7). Firstly, in order to obtain a personalized access,
a secure authentication is required for a citizen accessing the infrastructure.
This is performed through a simple elaboration unit, also acting as user inter-
face, which processes the citizen’s requests and manages the results. Then,
we can identify the following phases:

• the identification phase (step 1 in Figure 6.7) consists of calls to


identification services to reconstruct the digital identity of the authen-
ticated user on-the-fly. In this phase the system collects pieces of infor-
mation from all the involved PA web services and composes the identity
of the citizen.

• the citizen classification phase (step 2 in Figure 6.7) in which the


classification service uses the collected digital identity to classify the
citizen with respect to the civic ontology (OC in Figure 6.7), by means
of an embedded reasoning service. In Figure 6.7, the most specific class

CX has been assigned to the citizen;


• finally, in the querying phase (step 3 in Figure 6.7) the citizen’s
query is executed on the multi-version XML repository, by accessing
and reconstructing the appropriate version of all and only norms which
are applicable to the class CX. The querying phase will be analyzed
in depth in the next section.
In order to supply the desired services, the digital identity is modelled and
represented within the system in a form such that it can be translated into
the same language used for the ontology (e.g. a Description Logic [6]). In
this way, during the classification procedure, the matching between the civic
ontology classes and the citizen’s digital identity can be reduced to a standard
reasoning task (e.g. ontology entailment for the underlying Description Logic
[67]).
Furthermore, the civic ontology used in steps 2 and 3 needs to be created
and constantly maintained: each time a new founding act is enforced, the
execution of a creation/update phase is needed. Notice that this process
(and also the introduction of semantic annotations into the multi-version
XML documents) is a delicate task which needs the advice of human experts
and “official validation” of the outcomes and, thus, it can only partially be
automated. However, computer tools and graphic environments (e.g. based
on the Protégé platform [112]) could be provided to assist the human experts
to perform this task. For the specification of the identification, classification
and creation/update services, we plan to adopt a standard declarative for-
malism (e.g. based on XML/SOAP [123]). The study of the services and of
the mechanisms necessary to their semi-automatic specification will be dealt
with in future research work.

6.2.2 Personalized access to versions


Temporal concerns are widespread in the eGovernment domain and a legal
information system should be able to retrieve or reconstruct on demand any
version of a given document to meet common application requirements. In
fact, whereas it is crucial to reconstruct the current (consolidated) version of
a norm, as it is the one that currently belongs to the regulations and must be
enforced today, past versions are also still important, and not only for historical
reasons. For example, if a Court has to pass judgment today on some fact
committed in the past, the version of norms which must be applied to the
case is the one that was in force then. Temporal versioning aspects are exam-
ined in the next subsection. We then extend the temporal framework with
semantic versioning in order to provide personalized access to norm texts.
Semantic versioning also plays an important role, due to the limited applica-
bility that norms or some of their parts have or acquire. Hence, it is crucial
to maintain the mapping between each portion of a norm and the maximal
class of citizens it applies to in order to support an effective personaliza-
tion service. Finally, notice that temporal and limited applicability aspects,
though orthogonal, may also interplay in the production and management of
versions. For instance, a new norm might state a modification to a preexist-
ing norm, where the modified norm becomes applicable to a limited category
of citizens only (e.g. retired persons), whereas the rest of the citizens remain
subject to the unmodified norm.

Temporal versioning
We first focused on the temporal aspects and on the effective and efficient
management of time-varying norm texts. Our work on these aspects is based
on our previous research experiences [58, 59, 60] and on the work discussed
in Section 6.1. To this purpose, we developed a temporal XML data model
which uses four time dimensions to correctly represent the evolution of norms
in time and their resulting versioning. The considered dimensions are:

Validity time. It is the time the norm is in force. It has the same semantics
as valid time in temporal databases [71], since it represents the time
the norm actually belongs to the regulations in the real world.

Efficacy time. It is the time the norm can be applied to concrete cases.
As long as such cases exist, the norm retains its efficacy even if it is no
longer in force. It also has the semantics of valid time, although it is
independent of validity time.

Transaction time. It is the time the norm is stored in a computer system.


It has the same semantics as transaction time in temporal databases
[71].

Publication time. It is the time of publication of the norm on the Official


Journal. It has the same semantics as event time in temporal databases
[75]. As a global and unchangeable norm property, it is not used as a
versioning dimension.

The data model was defined via an XML Schema, where the structure of
norms is defined by means of a contents-section-article-paragraph hierarchy
and multiple content versions can be defined at each level of the hierarchy.
Each version is characterized by timestamp attributes defining its temporal
pertinence with respect to each of the validity, efficacy and transaction time
dimensions.

    The sample ontology:
        Citizen (1,8)
        ├─ Unemployed (2,1)
        ├─ Employee (3,6)
        │   ├─ Subordinate (4,4)
        │   │   ├─ Public (5,2)
        │   │   └─ Private (6,3)
        │   └─ Self-employed (7,5)
        └─ Retired (8,7)

    A fragment of an XML document supporting personalized access:
        <article num="1">
          <ver num="1">
            <aa applies_to="3"/> [… Temporal attributes …]
            <paragraph num="1">
              <ver num="1"> [… Text …]
                <aa applies_to="4"/> [… Temporal attributes …]
              </ver>
            </paragraph>
            <paragraph num="2">
              <ver num="1"> [… Text …]
                <aa applies_also="8"/> [… Temporal attributes …]
              </ver>
            </paragraph>
          </ver>
        </article>

Figure 6.8: An example of civic ontology, where each class has a name and
is associated to a (pre,post) pair, and a fragment of an XML norm containing
applicability annotations.
Legal text repositories are usually managed by traditional information
retrieval systems where users are allowed to access their contents by means
of keyword-based queries expressing the subjects they are interested in. We
extended such a framework by offering users the possibility of expressing
temporal specifications for the reconstruction of a consistent version of the
retrieved normative acts (consolidated act).

Semantic versioning
The temporal multi-version model described above has then been enhanced
to include a semantic versioning mechanism to provide personalized access,
that is retrieval of all and only norm provisions that are applicable to a
given citizen according to his/her digital identity. Hence, the semantic ver-
sioning dimension encodes information about the applicability of different
parts of a norm text to the relevant classes of the civic ontology defined
in the infrastructure (OC in Figure 6.7). Semantic information is mapped
onto a tree-like civic ontology, which is based on a taxonomy induced by IS-A
relationships. The tree-like civic ontology is sufficient to satisfy basic
application requirements as to applicability constraints and personalization
services, though more advanced application requirements may need a more
sophisticated ontology definition.
For instance, the left part of Figure 6.8 depicts a simple civic ontology
built from a small corpus of norms ruling the status of citizens with respect
to their work position. The right part shows a fragment of a multi-version
XML norm text supporting personalized access with respect to this ontology.
As we currently manage tree-like ontologies, this allows us to exploit the pre-


order and post-order properties of trees in order to enumerate the nodes and
check ancestor-descendant relationships between the classes. These codes
are represented in the upper left part of the ontology classes in the Figure,
in the form: (pre-order,post-order). For example, the class “Employee” has
pre-order “3”, which is also its identifier, whereas its post-order is “6”. The
article in the XML fragment on the right-hand-side of Figure 6.8 is composed
of two paragraphs and contains applicability annotations (tag aa).
Notice that applicability is inherited by descendant nodes unless locally
redefined. Hence, by means of redefinitions we can also introduce, for each
part of a document, complex applicability properties including extensions
or restrictions with respect to ancestors. For instance, the whole article in
the Figure is applicable to civic class “3” (tag applies to) and by default to
all its descendants. However, its first paragraph is applicable to class “4”,
which is a restriction, whereas the second one is applicable to class “8” (tag
applies also), which is an extension. The reconstruction of pertinent versions
of the norm based on its applicability annotations is very important in an e-
Government scenario. The representation of extensions and restrictions gives
rise to high expressiveness and flexibility in such a context.

Accessing the right version for personalization


The queries that can be submitted to the system can contain four types of
constraints: temporal, structural, textual and applicability. Such constraints
are completely orthogonal and allow users to perform very specific searches in
the XML norm repository. Let us focus first on the applicability constraint.
Consider again the ontology and norm fragment in Figure 6.8 and let John
Smith be a “self-employed” citizen (i.e. belonging to class “7”) retrieving the
norm: hence, the sample article in the Figure will be selected as pertinent,
but only the second paragraph will be actually presented as applicable. Fur-
thermore, the applicability constraint can be combined with the other three
ones in order to fully support a multi-dimensional retrieval. For instance,
John Smith could be interested in all the norms ...

• which contain paragraphs (structural constraint) dealing with health


care (textual constraint), ...

• which were valid and in effect between 2002 and 2004 (temporal con-
straint), ...

• which are applicable to his personal case (applicability constraint).



Such a query can be issued to our system using the standard XQuery FLWR
syntax as follows:
FOR $a IN norm
WHERE textConstr ($a//paragraph//text(), ’health AND care’)
AND tempConstr (’vTime OVERLAPS
PERIOD(’2002-01-01’,’2004-12-31’)’)
AND tempConstr (’eTime OVERLAPS
PERIOD(’2002-01-01’,’2004-12-31’)’)
AND applConstr (’class 7’)
RETURN $a
where textConstr, tempConstr, and applConstr are suitable functions al-
lowing the specification of the textual, temporal and applicability constraints,
respectively (the structural constraint is implicit in the XPath expressions
used in the XQuery statement). Notice that the temporal constraints can
involve all the four available time dimensions (publication, validity, efficacy
and transaction), allowing high flexibility in satisfying the information needs
of users in the eGovernment scenario. In particular, by means of validity
and efficacy time constraints, a user is able to extract consolidated current
versions from the multi-version repository, or to access past versions of par-
ticular norm texts, all consistently reconstructed by the system on the basis
of the user’s requirements and personalized on the basis of his/her identity.

6.3 Related work


6.3.1 Temporal XML representation and querying
In recent years, there has been a growing interest in representing and query-
ing the temporal aspect of XML data. Recent papers on this topic include
those of Currim et al. [44], Gao and Snodgrass [56], Mendelzon et al. [99],
and Grandi et al. [58], where the history of the changes XML data undergo is
represented in a single document from which versions can be extracted when
needed. In [44], the authors study the problem of consistently deriving a
scheme for managing the temporal counterpart of non-temporal XML docu-
ments, starting from the definition of their schema. The paper [56] presents a
temporal XML query language, τ XQuery, with which the authors add tempo-
ral support to XQuery by extending its syntax and semantics to three kinds
of temporal queries: Current, sequenced, and representational. Similarly, the
TXPath query language described in [99] extends XPath for supporting tem-
poral queries. Finally, the main objective of the work presented in [58] has
been the development of a computer system for the temporal management of
multiversion norms represented as XML documents and made available on


the Web.
Closer to our definition of the time-slice operator, Gao and Snodgrass [56]
need to time-slice documents in a given period and to evaluate a query in
each time slice of the documents. The authors suggest an implementation
based on a stratum approach to exploit the availability of XQuery implemen-
tations. Even if they propose different optimizations of the initial time-slicing
approach, this solution results in long XQuery programs even for simple tem-
poral queries, and requires post-processing phases in order to coalesce the query
results. Moreover, an XQuery engine is not aware of the temporal semantics,
which makes it more difficult to apply query optimization and indexing tech-
niques particularly suited for temporal XML documents. Native solutions are, in-
stead, proposed in [31, 99]. The paper [31] introduces techniques for storing
and querying multiversion XML documents. Each time one or more updates
occur on a multiversion XML document, the proposed versioning scheme cre-
ates a new physical version of the document where it stores the differences
w.r.t. the previous version. This leads to large overheads when “conven-
tional” queries involving structural constraints and spanning over multiple
versions are submitted to the system. In [99] the authors propose an approach
for evaluating TXPath queries which integrates the temporal dimension into
a path indexing scheme by taking into account the available continuous paths
from the root to the elements, i.e. paths that are valid continuously during a
certain time interval. While twig querying is not directly handled in this ap-
proach, path query performance is enhanced w.r.t. standard path indexing,
even though the main memory representation of their indices is more than
10 times the size of the original documents. Moreover, query processing can
still be quite heavy for large documents, as it requires the full navigation of
the document collection structure, in order to access the required element
tables, and the execution of a binary join between them at each level of the
query path.
Similarly to the structural join approach [139] proposed for XML query
pattern matching, the temporal slicing problem can be naturally decom-
posed into a set of temporal-structural constraints. For instance, solving
time-slice(//contents//section//article,[1994, now]) means finding
all the occurrences in a temporal XML database of the basic ancestor-descendant
relationships (contents,section) and (section,article) which are tem-
porally consistent. In the literature, a great deal of work has been devoted
to the processing of temporal join (see e.g. [110]) also using indices [140].
Given the temporal indexing scheme, we could have extended temporal join
algorithms to the structural join problem or vice versa. However the main
drawback of the structural join approach is that the sizes of the results of

binary structural joins can get very large, even when the input and the final
result sizes obtained by stitching together the basic matches are much more
manageable.
Our native approach extends one of the most efficient approaches for XML
query processing and the underlying indexing scheme in order to support tem-
poral slicing and overcome most of the previously discussed problems. Start-
ing from the holistic twig join approach [25], which directly avoids the prob-
lem of very large intermediate results size by using a chain of linked stacks
to compactly represent partial results, we proposed new flexible technolo-
gies consisting in alternative solutions and extensively experimented them in
different settings.

6.3.2 Personalized access to XML documents


The problem of information overload when browsing and searching the web
becomes more and more crucial as the web keeps growing exponentially: A
personalized access to resources is the most popular remedy for increasing the
quality and speeding up these tasks. In particular, as we experienced in recent
years, the personalization in web search has been mainly influenced by two
factors: The analysis of the user context and the exploitation of ontologies.
The analysis of the user context involves the recording of the user behavior
during searches, such as the subject of the visited pages and the dwelling time
on a page, to learn a retrieval function which is used to produce a customized
ranking of search results that suits user preferences. Whenever this kind of
information is collected without any effort of the user, an implicit feedback
is provided. For instance, [120] presents a client-side web search agent
that uses query expansion based on previous searches and an immediate
re-ranking of results based on clickthrough information. On the other hand, the
user context could be collected through an explicit feedback. For instance,
[77] presents a contextual search system that allows users to explicitly
and conveniently define their context features to obtain personalized results.
The biggest concern of personalized search through the analysis of the
user context is privacy, especially when it is done on the server-side. On the
other hand, ontology-based approaches, such as the one we presented, avoid
this problem by exploiting specific domain ontologies to answer user queries
on a conceptual level. These approaches tend to gather semantics from the
documents to define an ontology of the concepts, rather than use classification
techniques to automatically create user profiles.
To conclude, our personalized search techniques differ from the other
approaches cited above because they exploit a domain ontology, rather than
user profiles, to perform queries on the repository: This implies the inter-
vention of human experts to build the ontology, but avoids privacy issues.
Moreover, we access more complex documents, such as semi-structured ones,
rather than unstructured documents such as web pages; indeed, the high flex-
ibility we provide makes it possible to access fragments of the documents,
returning all and only the ones that fit to the user needs and avoiding the
retrieval of useless information.

6.4 Experimental evaluation


In this section we present the results of an actual implementation of our XML
query processor supporting multi-version XML documents. In particular, in
subsection 6.4.1 we focus on temporal slicing support, showing the behavior
of our system on different document collections and in different execution
scenarios. Then, in subsection 6.4.2 we specifically evaluate the system in
the eGovernment scenario, also testing its personalized access performance. All experiments have been performed on a Pentium 4 3GHz Windows XP Professional workstation, equipped with 1GB RAM and a 160GB EIDE HD with NT file system (NTFS).

6.4.1 Temporal slicing


Experimental setting

The document collections follow the structure of the documents used in [58],
where three temporal dimensions are involved, and have been generated by
a configurable XML generator. On average, each document contains 30-40 nodes and has a depth of 10; 10-15 of these nodes are timestamped nodes n^T, each available in 2-3 versions composed of the union of 1-2 distinct periods. We are also able to change the length of the periods and the probability that the temporal pertinences of the document nodes overlap. Finally, we investigate different kinds of probability density functions generating collections with different distributions, thus directly affecting the containment constraint.
Experiments were conducted on a reference collection (C-R), consisting of 5000 documents (120 MB) generated following a uniform distribution and characterized by nodes that are not much temporally scattered, and on several variations of it. We tested the performance of the time-slice operator with different twig and t-window parameters. In this context we will deepen the performance analysis by considering the same path, involving three nodes, and different temporal windows, since our focus is not on the structural aspects.

Evaluation   Execution    Non-Consistent     Tuples (%)
scenario     Time (ms)    Solutions (%)      Buffer      Stack
L1/L1          1890         23.10 %            7.99 %     7.76 %
L2/L1          1953         23.10 %            9.23 %     7.76 %
SOL/L1         2000         39.13 %            9.43 %     9.17 %
L1/L2          2625         23.10 %           17.95 %     7.76 %
L2/L2          2797         23.10 %           23.37 %     7.76 %
SOL/L2         2835         39.13 %           23.80 %     9.17 %
L1/SOL        12125         95.74 %           88.92 %    88.85 %
L2/SOL        12334         95.74 %           99.33 %    88.85 %
SOL/SOL       12688         96.51 %          100.00 %   100.00 %

Table 6.1: Evaluation of the computation scenarios with TS1.

Evaluation of the default setting


We started by testing the time-slice operator with a default setting (denoted
as TS1 in the following). Its temporal window has a selectivity of 20%, i.e.
20% of the tuples stored in the temporal inverted indexes involved by the
twig pattern intersect the temporal window. The returned solutions are 5584.
Table 6.1 shows the performance of each scenario when executing TS1. In
particular, from the left: The execution time, the percentage of potential
solutions at level SOL that are not temporally consistent and, in the last two
columns, the percentage of tuples that are put in the buffers and in the stacks
w.r.t. the total number of tuples involved in the evaluation. Notice that the temporal inverted indices exploited at level L1 are B+-trees; the comparison of the performances of the B+-tree and MVBT implementations will be shown in the following.
The best result is given by the computation scenario L1/L1: Its execution
time is more than 6 times faster than the execution time of the baseline
scenario SOL/SOL. Such a result clearly shows that combining solutions
at a low level of the architecture, such as L1, avoids I/O costs for reading
unnecessary tuples and their further elaboration cost at the upper levels. The
decrease of read tuples from 100% of SOL/SOL to just 7.99% of L1/L1 and
the decrease of temporally inconsistent solutions from 96.51% of SOL/SOL
to 23.1% of L1/L1 represent a remarkable result in terms of efficiency. Let us now have a look at the other scenarios. TS1 represents a typical querying
setting where the containment constraint is much more selective than the
non-empty intersection constraint. This consideration induces us to analyse
the obtained performances by partitioning the scenarios in three groups,
*/L1, */L2 and */SOL, on the basis of the adopted containment constraint solution.

Figure 6.9: Comparison between TS1 and TS2.
(a) Percentage of tuples in the buffers:
          L1/L1    L2/L1    L1/L2    SOL/L2    SOL/SOL
   TS1     7.99     9.23    17.95     23.80     100.00
   TS2    15.58    17.68    25.15     32.07     100.00
(b) Execution time (ms):
          L1/L1    L2/L1    L1/L2    SOL/L2    SOL/SOL
   TS1     1290     1953     2625      2835      12688
   TS2     2812     2844     3422      3547      12697

The scenarios within each group show similar execution time and
percentages of tuples. In group */L1 the low percentage of tuples in buffers
(10%) means low I/O costs and this has a good influence on the execution
time. In group */L2 the percentages of tuples in buffers are more than double
of those of group */L1, while the execution time is about 1.5 times higher.
Finally, group */SOL is characterized by percentages of tuples in buffers and execution times approximately ten and six times higher than those in */L1, respectively. Moreover, within each group it should be noticed that raising the non-empty intersection constraint solution from level L1 to level SOL progressively deteriorates the overall performance.

Changing the selectivity of the temporal window


We are now interested in showing how our XML query processor responds
to the execution of temporal slicing with different selectivity levels; to this
purpose we considered a second time-slice (TS2) having a selectivity of 31%
(lower than TS1) and returning 12873 solutions. Figure 6.9 shows the percentage of read tuples (Figure 6.9-a) and the execution time (Figure 6.9-b) of TS2 compared with our reference time-slice setting (TS1). Notice that the trend of growth of the percentage of read tuples along the different scenarios is similar. However, for TS1 the execution time follows the same trend as the read tuples, whereas for TS2 the execution times of the different scenarios are closer. In this case, the lower selectivity of the temporal window makes
the benefits achievable by the L1 solutions less appreciable. Notice that, in
the SOL/SOL scenario both queries have the same number of tuples in the
buffers because no selectivity is applied at the lower levels; this explains also
the same execution time.

Figure 6.10: Additional execution time comparisons.
(a) UL1 scenario performances, execution time (ms):
          L1/L1    L2/UL1    SOL/UL1    SOL/SOL
   TS1     1290      3078       3081      12688
   TS2     2812      3938       3953      12691
   TS3     1031       797        813       9891
(b) MVBT and structural approach performances, execution time (ms):
          L1/L1 (B+-tree)    L1/L1 (MVBT)    SOL/SOL    STRUCT
   TS1               1290            2655      12688     17750
   TS2               2812            5709      12691     17859

Evaluation of the performance of solution UL1

In order to evaluate the results of exploiting access methods at level UL1 we considered a third time-slice (TS3) that is characterized by a highly selective
temporal window (1%) and returns 123 solutions. Figure 6.10-a compares
the execution time of the scenarios involving UL1 solutions (*/UL1) with
the best and the baseline scenarios shown above (L1/L1 and SOL/SOL).
As one would expect, it shows that */UL1 scenarios are inefficient for low-
selectivity settings, while they are the best ones with high-selectivity settings.
In particular the best computation scenario for TS3 is L2/UL1.

Comparison with MVBT and purely structural techniques

In Figure 6.10-b we compare the execution time for scenario L1/L1 when the
access method is the B+-tree w.r.t. the MVBT. Notice that when MVBT
indices are used to access data the execution time is generally higher than with the B+-tree solution. This might be due to the implementation we used, which is a beta version included in the XXL package [130]. The last comparison
involves the holistic twig join algorithms applied on the original indexing
scheme proposed in [139] where temporal attributes are added to the index
structure but are considered as common attributes. Notice that in this in-
dexing scheme tuples must have different LeftPos and RightPos values and
thus each temporal XML document must be converted into an XML docu-
ment where each timestamped node gives rise to a number of distinct nodes
equal to the number of distinct periods. The results are shown on the right of
Figure 6.10-b where it is clear that the execution time of the purely structural

Figure 6.11: Comparison between the two collections C-R and C-S.
(a) Execution time (ms):
                     TS1                            TS2
          L1/L1    SOL/L1    SOL/SOL      L1/L1    SOL/L1    SOL/SOL
   C-R     1890      2000      12688       2812      2859      12691
   C-S      906      1383       9766       1250      1797       9875
(b) Percentage of non-consistent solutions:
                     TS1                            TS2
          L1/L1    SOL/L1    SOL/SOL      L1/L1    SOL/L1    SOL/SOL
   C-R    23.10     39.13      96.51      29.96     43.23      91.95
   C-S    32.50     95.01      99.98      63.17     98.22      99.88

approach (STRUCT) is generally higher than our baseline scenario and thus
also than the other scenarios (13 times slower than the best scenario). This
demonstrates that the introduction of our temporal indexing scheme alone
brings significant benefits on temporal slicing performance. We refer the in-
terested reader also to Section 6.3 where we provide additional discussion of
state of the art techniques w.r.t. ours.

Evaluation on differently distributed collections

We also considered the performance of our XML query processor on another collection (C-S) of the same size as the reference one, but characterized
by temporally scattered nodes. Figure 6.11 shows the execution time and
the number of temporally inconsistent potential solutions of TS1 and TS2
on both collections. The execution time of scenarios L1/L1 and SOL/L1,
depicted in Figure 6.11-a, shows that it is almost unchanged for collection
C-R, whereas the difference is more remarkable for both temporal slicing
settings for collection C-S. Notice also that the percentage of temporally
inconsistent potential solutions when no solution is applied under level SOL is
limited in the C-R case but explodes in the C-S case (see for instance SOL/L1
in Figure 6.11-b). The non-empty intersection constraint is mainly influenced
by the temporal sparsity of the nodes in the collection: The more the nodes
are temporally scattered the more the number of temporally inconsistent
potential solutions increases. Therefore, when temporal slicing is applied to this kind of collection, the best way to process it is to adopt a solution
exploiting the non-empty intersection constraint at the lowest level, i.e. L1.

Execution time (ms):
            5000 Docs    10000 Docs    20000 Docs
   L1/L1         1890          3531          5654
   L2/L2         2797          5329          9844
   SOL/SOL      12688         22893         45750

Figure 6.12: Scalability results for TS1.

Scalability
Figure 6.12 reports the performance of our XML query processor in executing TS1 for the reference collection C-R and
for two collections having the same characteristics but different sizes: 10000
and 20000 documents. The execution time grew linearly in every scenario,
with a proportion of approximately 0.75 w.r.t. the number of documents for
our best scenario L1/L1. Such tests have also been performed on the other
temporal slicing settings where we measured a similar trend, thus showing
the good scalability of the processor in every type of query context.

6.4.2 Personalized access


Experimental setting
The architecture of the system we designed for the eGovernment scenario is
based on an “XML-native” approach, as it is composed of a purpose-built Multi-version XML Query Processor, which is able to manage the
XML data repository and to support all the temporal, structural, textual
and applicability query facilities in a single component. The prototype ex-
ploits the temporal slicing technology whose evaluation has been shown in
the previous subsection, together with additional ad-hoc data structures (re-
lying on embedded “light” DBMS libraries) and algorithms which allow users
to store and reconstruct on-the-fly the XML norm texts satisfying the four
types of constraints. As in the temporal slicing section, such a component
stores the XML norms not as entire documents but by converting them into
a collection of ad-hoc temporal tuples, representing each of its multi-version
parts (i.e. paragraphs, articles, and so on). Textual constraints are handled
by means of an inverted index. The system accesses and retrieves only the

Query          Constraints       Selectivity        Performance (msec)
               Tm    St    Tx
Q1 (Q1-A)       -     X     X     0.6% (0.23%)       1046 (1095)
Q2 (Q2-A)       -     X     X     4.02% (1.65%)      2970 (3004)
Q3 (Q3-A)       X     X     -     2.9% (1.3%)        6523 (6760)
Q4 (Q4-A)       X     X     X     0.68% (0.31%)      1015 (1020)
Q5 (Q5-A)       X     X     X     1.46% (0.77%)      2550 (2602)

Table 6.2: Features of the test queries and query execution time (time in msec, collection C1)

strictly necessary data by querying ad-hoc and temporally-enhanced structures without accessing whole documents; hence, there is no need to build
space-consuming structures such as DOM trees to process a query and only
the parts which satisfy the query constraints are used for the reconstruc-
tion of the results. Furthermore, the architecture also provides support to
personalized access by handling the applicability constraints. Owing to the
properties of the adopted pre- and post-order encoding of the civic classes,
the system is able to very efficiently deal with applicability constraints during
query processing by means of simple comparisons involving such encodings
and semantic annotations.
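As a minimal illustration (the class names and the encodings below are hypothetical, not the system's actual data), the kind of comparison enabled by a pre/post-order encoding can be sketched in a few lines of Python: testing whether an annotated civic class is the queried class or one of its descendants reduces to two integer comparisons.

    from collections import namedtuple

    # Pre/post-order ranks of a node in the civic classification hierarchy.
    Encoding = namedtuple("Encoding", ["pre", "post"])

    def is_applicable(queried_class, annotated_class):
        """True if the annotated class coincides with the queried civic class
        or is one of its descendants (ancestor-or-self test on pre/post ranks)."""
        return (queried_class.pre <= annotated_class.pre
                and queried_class.post >= annotated_class.post)

    # Toy hierarchy: "citizen" (pre 1, post 3) with children
    # "disabled citizen" (pre 2, post 1) and "parent" (pre 3, post 2).
    citizen, disabled, parent = Encoding(1, 3), Encoding(2, 1), Encoding(3, 2)
    print(is_applicable(citizen, disabled))  # True: descendant class
    print(is_applicable(disabled, parent))   # False: sibling class

Such constant-time comparisons can be evaluated on the fly while the temporal tuples are scanned, which is consistent with the negligible overhead of the personalized queries reported below.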
In order to evaluate the performance of our system, we built a specific
query benchmark and conducted a number of exploratory experiments to
test its behavior under different workloads. We performed the tests on three
XML document sets of increasing size: collection C1 (5,000 XML norm text
documents), C2 (10,000 documents) and C3 (20,000 documents). In the
following, we will present in detail the results obtained on the main collection
(C1), then we will briefly describe the scalability performance shown on the
other two collections. The total size of the collections is 120MB, 240MB,
and 480MB, respectively. In all collections the documents were synthetically
generated by means of an ad-hoc XML generator we developed, which is able
to produce different documents compliant to our multi-version model. For
each collection, the average, minimum and maximum document size is 24KB,
2KB and 125KB, respectively.

Execution time on main collection

Experiments were conducted by submitting queries of five different types (Q1-Q5). Table 6.2 presents the features of the test queries and the query
execution time for each of them. All the queries require structural support

(St constraint); types Q1 and Q2 also involve textual search by keywords (Tx
constraint), with different selectivities; type Q3 contains temporal conditions
(Tm constraint) on three time dimensions: transaction, valid and publica-
tion time; types Q4 and Q5 mix the previous ones since they involve both
keywords and temporal conditions. For each query type, we also present a
personalized access variant involving an additional applicability constraint,
denoted as Qx-A in Table 6.2 (performance figures in parentheses).
Let us first focus on the “standard” queries. Our approach shows a good
efficiency in every context, providing a short response time (including query
analysis, retrieval of the qualifying norm parts and reconstruction of the
result) of approximately one or two seconds for most of the queries. Notice
that the selectivity of the query predicates does not impair performance, even when large amounts of documents containing some (typically small) relevant portions have to be retrieved, as happens for queries Q2 and Q3. Our system is able to deliver fast and reliable performance in all cases, since
it practically avoids the retrieval of useless document parts. Furthermore,
consider that, for the same reasons, the main memory requirements of the
Multi-version XML Query Processor are very small, less than 5% with respect
to “DOM-based” approaches such as the one adopted in [59, 58]. Notice that
this property is also very promising towards future extensions to cope with
concurrent multi-user query processing.
The time needed to answer the personalized access versions of the Q1–Q5
queries is approximately 0.5-1% more than for the original versions. More-
over, since the applicability annotations of each part of an XML document are
stored as simple integers, the size of the tuples with applicability annotations
is practically unchanged (only a 3-4% storage space overhead is required with
respect to documents without semantic versioning), even with quite complex
annotations involving several applicability extensions and restrictions.

Scalability
Finally, let us comment on the performance in querying the other two collections C2 and C3 and, therefore, on the scalability of the system. We ran the same queries of the previous tests on the larger collections and saw that the computing time always grew sub-linearly with the number of documents. For instance, query Q1 executed on the 10,000 documents of collection C2 (which is twice the size of C1) took 1,366 msec (i.e. the system was only 30%
slower); similarly, on the 20,000 documents of collection C3, the average
response time was 1,741 msec (i.e. the system was less than 30% slower than
with C2). Also with the other queries the measured trend was the same, thus
showing the good scalability of the system in every type of query context.
Conclusions
and Future Directions

In this thesis we presented different techniques allowing effective and efficient management and search in large amounts of information. In particular, we
considered different application scenarios and two main kinds of information,
i.e. textual and semi-structured XML documents. The main contributions
of our work are the following:

• We proposed similarity measures between text sequences and we defined novel approximate matching algorithms searching for matches
within them. We successfully applied such techniques to the EBMT
and duplicate document detection scenarios, also fulfilling additional
requirements which are peculiar to these tasks. Efficiency and porta-
bility are ensured by the introduction of ad-hoc filtering techniques and
a mapping into plain SQL expressions, respectively;

• As to semi-structured information, we deeply studied XML pattern matching properties and we defined a set of reduction conditions, based
on the pre/post-order numbering scheme, which is complete and which
is applicable to the three kinds of tree pattern matching. We showed
that such a theoretical framework can be applied for building efficient
pattern matching algorithms;

• We considered the problem of approximate query answering for heterogeneous XML document bases and we proposed a method for structural
query approximation which is able to automatically identify semantic
and structural similarities between the involved schemas and to exploit
them in order to automatically rewrite a query written on a source
schema towards other available schemas. The effectiveness of such
application widely benefits from the definition and exploitation of a
versatile approach for the disambiguation of the schema terminological
information;

• Finally, we delved into the multi-version XML management and querying scenario. We proposed a novel temporal indexing scheme and,
starting from the holistic twig join approach, we proposed new flexible
technologies consisting in alternative solutions supporting the tempo-
ral slicing problem. We also applied and specialized such approach to
the eGovernment scenario: In the context of a complete and modular
infrastructure, we developed a temporal and semantical XML query
processor supporting both temporal versioning, essential in normative
systems, and semantic versioning;

• We implemented all the presented techniques in full system prototypes and performed very detailed experimental evaluations
on them. We showed that the results and performance that our systems
achieve can widely benefit the different reference applications.

Starting from the ideas and work presented in this thesis, several interesting
issues for future research can be considered. These include:

• Testing and specializing XML matching techniques towards multimedia querying, exploiting and supporting, for instance, the peculiar features
and potentialities of the XML-based MPEG-7 audio-visual metadata
description schema [95];

• Enhancing the schema matching, structural disambiguation and query rewriting techniques, by studying automatic deduction of the schemas
underlying the submitted queries and ad-hoc rewriting rules, and by
extending and testing their applicability toward graph-like schemas and
ontologies;

• Deeply analyzing the Peer-to-Peer and Peer Data Management Systems scenarios, bringing approximate query answering techniques toward the
requirements of such distributed settings;

• Extending the multi-version query processor framework in order to fully support semantic versioning through generic graph-like ontologies, also
improving the XML documents annotation scheme and their storage
organization.
Appendix A

More on EXTRA techniques

In this appendix we present and discuss in detail some of the techniques of the EXTRA system which could not be systematically analyzed in Chapter 2. In particular, Section A.1 explains the noun and verb disambiguation techniques, while Section A.2 details the MultiEditDistance matching algorithms.

A.1 The disambiguation techniques


A.1.1 Preliminary notions
Before analyzing our Word Sense Disambiguation techniques in depth, we
introduce some preliminary notions based on WordNet. WordNet is a lexi-
cal database [100] containing the definitions of most of the English language
words. It is organized conceptually in synonym sets or synsets, represent-
ing different concepts or senses. We indicate the set of WordNet senses as
S = {s_1, s_2, ..., s_m} and with S_w the set of senses (S_w ⊆ S) a word w (which can be a noun n or a verb v) is able to express. Because not all the senses of a term are equally frequent in any language, in WordNet S_w is an ordered list of senses [s_1^w, s_2^w, ..., s_m^w]. The senses constitute the nodes of a word sense
network, which are linked by a variety of semantic relations. The predomi-
nant relationship for nouns, and the one we are particularly interested in, is
hypernymy/hyponymy (IS-A). An example of hypernymy/hyponymy hierar-
chy is shown in Figure A.1, where the senses are represented with a short
textual description (e.g. “cat (animal)”). By considering any pair of senses,
we introduce the notion of minimum common hypernym csi ,sj (also called
most informative subsumer in [113]) between two senses si and sj , as the
most specific (lowest in the hierarchy) of their common hypernyms, if available (e.g. note that the noun hierarchy has nine root senses).

[Figure A.1: Example of WordNet hypernym hierarchy — "cat (animal)" (s_1) reaches "placental mammal" through "feline, felid" and "carnivore", while "mouse (animal)" (s_2) reaches it through "rodent"; "placental mammal" is their minimum common hypernym c_{s_1,s_2}.]

Furthermore,
whenever a minimum common hypernym between the two senses s_i and s_j is available, we call path length, denoted as len(s_i, s_j), the number of links connecting s_i with s_j and passing through the minimum common hypernym node. For instance, by starting from senses s_1 "cat (animal)" and s_2 "mouse (animal)" and moving up in the hierarchy, we see that they first intersect at the common sense "placental mammal", which therefore is the minimum common hypernym c_{s_1,s_2}, and the path length is len(s_1, s_2) = 3 + 2 = 5 (3 links from s_1 to c_{s_1,s_2} plus 2 from c_{s_1,s_2} to s_2). As to nouns, we define the path length between two nouns n_i, n_j as:

\[ len(n_i, n_j) = \min_{s_k^{n_i} \in S_{n_i},\; s_{k'}^{n_j} \in S_{n_j}} len(s_k^{n_i}, s_{k'}^{n_j}) \qquad (A.1) \]

In the same way, we can introduce the minimum common hypernym c_{n_i,n_j} between two nouns n_i, n_j as the minimum common hypernym sense corresponding to the minimum path length between the two nouns. For instance, the minimum path length between nouns "cat" and "mouse" is 5, since the senses of such nouns that join most rapidly are the ones depicted in Figure A.1. Their minimum common hypernym is the sense "placental mammal".
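As a concrete illustration, the following Python sketch reproduces this example on a toy IS-A hierarchy (the data structure and function names are ours, not WordNet's API): it returns the minimum common hypernym and the path length for the two senses of Figure A.1.

    # Each sense maps to its direct hypernym (IS-A parent) in a toy hierarchy.
    HYPERNYM = {
        "cat (animal)": "feline",
        "feline": "carnivore",
        "carnivore": "placental mammal",
        "mouse (animal)": "rodent",
        "rodent": "placental mammal",
    }

    def ancestors(sense):
        """Return {ancestor: distance in links} for a sense, including itself."""
        dist, d = {sense: 0}, 0
        while sense in HYPERNYM:
            sense, d = HYPERNYM[sense], d + 1
            dist[sense] = d
        return dist

    def min_common_hypernym(s1, s2):
        """Return (minimum common hypernym, path length len(s1, s2)),
        or (None, None) when the two senses share no hypernym."""
        a1, a2 = ancestors(s1), ancestors(s2)
        common = set(a1) & set(a2)
        if not common:
            return None, None
        c = min(common, key=lambda s: a1[s] + a2[s])
        return c, a1[c] + a2[c]

    print(min_common_hypernym("cat (animal)", "mouse (animal)"))
    # ('placental mammal', 5): 3 links from the cat sense plus 2 from the mouse sense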
In the literature, the hypernym hierarchies and the notion of path length between pairs of synsets have been extensively studied as a backbone for the definition of similarities between two senses [26]. Among the proposed measures, one of the most promising ones that does not require a training phase on large pre-classified corpora is the Leacock-Chodorow measure [81]. In the following definition, we introduce a more general variant of such a measure for quantifying the similarity between two nouns, where we also consider the case where the minimum common hypernym is not available.

Definition A.1 (Semantic similarity between nouns) Given two nouns n_i, n_j, the semantic similarity sim(n_i, n_j) between n_i and n_j is defined as:

\[ sim(n_i, n_j) = \begin{cases} -\ln \dfrac{len(n_i, n_j)}{2 \cdot H_{max}} & \text{if } \exists\, c_{n_i,n_j} \\ 0 & \text{otherwise} \end{cases} \qquad (A.2) \]

where H_max has value 16 as it is the maximum height of the WordNet IS-A hierarchy.

The resulting similarity is a non-negative number. For example, the similarity between the nouns "cat" and "mouse" is -ln(5/32) = 1.86. A null similarity corresponds to the similarity between two nouns having no common hypernyms.

Table A.1 summarizes the set of symbols introduced so far and that will be used in the following.

N                  Set of noun tokens
V                  Set of verb tokens
n, n_h, n_i, n_j   Nouns
S_n                Set of senses of n
v, v_h, v_i, v_j   Verbs
S_v                Set of senses of v
s_k^n (s_k^v)      k-th sense of n (v)
c_{n_i,n_j}        Minimum common hypernym between n_i and n_j
sim(n_i, n_j)      Semantic similarity between n_i and n_j

Table A.1: Symbols and meanings
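The similarity of Definition A.1 can be computed directly from a path length; the following minimal Python sketch (with H_max fixed to 16 as in the definition) reproduces the "cat"/"mouse" value reported above.

    import math

    H_MAX = 16  # maximum height of the WordNet IS-A hierarchy (Definition A.1)

    def semantic_similarity(path_length):
        """Semantic similarity of eq. (A.2); a None path length means that no
        common hypernym exists, which yields similarity 0."""
        if path_length is None:
            return 0.0
        return -math.log(path_length / (2 * H_MAX))

    print(round(semantic_similarity(5), 2))  # 1.86, the "cat"/"mouse" example
    print(semantic_similarity(None))         # 0.0, no common hypernym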

A.1.2 Noun disambiguation


After these premises, we are now ready to discuss the core of our WSD technique. Given the list of nouns (N) and verbs (V) to be disambiguated, the techniques we have devised compute the confidence in choosing each of the senses associated with each term; the sense with the highest confidence, that is the most probable one, will be the chosen one. Since the technique for verb disambiguation is strictly based on the noun one, we
will first consider the latter. The basic idea of such technique is that the
nouns surrounding a given one in a sentence provide a good informational
context and good hints about what sense to choose for it. In fact, it is well
known that, when two polysemous nouns are correlated, it is quite proba-
ble that their minimum common hypernym provides valuable information
about which sense of each noun is the relevant one. For instance, consider

the sentence in Figure 2.3 again: The nouns to disambiguate are “cat” and
“mouse”. If no context is available, a mouse could be an animal, but also
an electronic device, while a cat could also be a vehicle. If we consider their
minimum common hypernym, i.e. “placental mammal”, the two senses that
join through it, “cat (animal)” and “mouse (animal)”, will have the highest
confidence and will be the ones chosen.
The basis of the technique we propose derives from the one in [113]. As
in [113], the confidence in choosing one of the senses associated with each
term is directly proportional to the semantic similarities between that term
and the other ones. On the other hand, we deal with sentences of variable
lengths and whose meaning is also affected by the relative positions of their
terms. Thus, we refined the confidence computation by introducing two
enhancements which make our approach more effective in such context. In
particular, in the computation of the confidence in choosing one of the senses
associated with each noun, we weigh the contribution of the surrounding
nouns w.r.t. their positions and we consider the frequency of use of such a sense in the English language.
As far as the first aspect is concerned, note that the contributions of the surrounding nouns to the confidence of each of the senses of a given noun are not equally important. In particular, we assume that the closer the positions of the two nouns in the sentence, the more correlated the two nouns are and therefore the more important the information extracted from the computation of their semantic similarity will be. Thus, given a noun n_h ∈ N to be disambiguated, we weigh the similarity between n_h and each of the surrounding nouns n_j ∈ N on their relative positions by adopting a gaussian distance decay function D centered on n_h and defined on their distance d_{h,j}:

\[ D(d_{h,j}) = 2 \cdot \frac{e^{-d_{h,j}^2/8}}{\sqrt{2\pi}} + 1 - \frac{2}{\sqrt{2\pi}} \qquad (A.3) \]
For two close nouns, the decay is almost absent (values near 1), while for
distant words the decay asymptotically tends towards values around 1/5 of
the full value. For example, consider the following technical sentence: “As
soon as the curve is the right shape, click the mouse button to freeze it in that
position” (see Figure A.2). The decay function is centered on “mouse”, the
noun to be disambiguated, while the surrounding nouns forming its context
are underlined. In this example, the “mouse” is clearly an electronic device
and not an animal; the best hint of this fact is given by the presence of the
noun “button”, located close to the term “mouse” (point B in the figure,
low decay), clearly providing a good way to understand the meaning. More
distant nouns such as “curve” or “position” are less correlated to “mouse”
and have a lesser influence on its disambiguation (point A in the figure, high decay).

[Figure A.2: The gaussian decay function D for the disambiguation of "mouse", defined on the words of the example sentence: "button" (point B) undergoes almost no decay, while more distant nouns such as "curve" or "position" (point A) are strongly decayed.]
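A few lines of Python suffice to evaluate eq. (A.3) and confirm the behavior just described; this is only an illustrative sketch of the formula.

    import math

    def distance_decay(d):
        """Gaussian distance decay D of eq. (A.3), defined on the word distance d."""
        return 2 * math.exp(-d * d / 8) / math.sqrt(2 * math.pi) \
            + 1 - 2 / math.sqrt(2 * math.pi)

    print(round(distance_decay(0), 2))  # 1.0: distance 0, no decay
    print(round(distance_decay(1), 2))  # 0.91: a close noun such as "button"
    print(round(distance_decay(6), 2))  # 0.21: distant words, near the 1/5 floor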
As to the second aspect, we exploit the rank of the senses, which is based
on the frequency of use of the senses in the English language, by incrementing the confidence of one sense of a given noun n in a way which is inversely proportional to its position in the list S_n. In this case, we defined a linear sense-decay function R on the sense number k (ranging from 1 to m) of a given noun n:

\[ R(s_k^n) = 1 - \gamma \cdot \frac{k-1}{m-1} \qquad (A.4) \]

where 0 < γ < 1 is a parameter we usually set at 0.8. In this way, we quantify the frequency of the senses: the first sense has no decay and the last sense is decayed to 1/5 of the full value. Such an adjustment attempts to emulate
the common sense of a human in choosing the right meaning of a noun when
the context gives little help. In our experiments, it has been proved to be
particularly useful for short sentences where the choice of a sense can be led
astray by the small number of surrounding nouns.
All the concepts discussed are involved in the notion of confidence in
choosing a sense for a noun nh ∈ N . It is defined as a sum of two components,
the first one involving the semantic similarity and the distance decay between
nh and the surrounding nouns, while the second represents the contribution
of the sense decay. The confidence in choosing the meaning of a noun in a
sentence is formally defined as follows.

Definition A.2 (Meaning of a noun in a sentence) Given the set N of nouns in a given sentence, the confidence φ(s_k^{n_h}) in choosing sense s_k^{n_h} ∈ S_{n_h} of noun n_h ∈ N is:

\[ \varphi(s_k^{n_h}) = \alpha \cdot \frac{\sum_{n_j \in N, j \neq h} sim(n_h, n_j) \cdot D(d_{h,j}) \cdot x_{h,j,k}}{\sum_{n_j \in N, j \neq h} sim(n_h, n_j) \cdot D(d_{h,j})} + \beta \cdot R(s_k^{n_h}) \qquad (A.5) \]

where α > 0 and β > 0 (α + β = 1) and x_{h,j,k} is a boolean variable depending on the minimum common hypernym c_{n_h,n_j} between n_h and n_j, defined as:

\[ x_{h,j,k} = \begin{cases} 1 & \text{if } c_{n_h,n_j} \text{ is a hypernym of } s_k^{n_h} \\ 0 & \text{otherwise} \end{cases} \qquad (A.6) \]

The meaning chosen for n_h ∈ N in the sentence is the sense s^{n_h} ∈ S_{n_h} such that φ(s^{n_h}) = max{φ(s_k^{n_h}) | s_k^{n_h} ∈ S_{n_h}}.

By varying the values of the α and β parameters we can change the relative
weight of each of the components; the default values, 0.7 and 0.3, respectively,
make the contribution of the first component predominant. Confidences are
values between 0 and 1; a high value of φ(s_k^{n_h}) indicates a high probability that the correct sense of n_h is s_k^{n_h}. Therefore, we choose the sense with the highest
confidence as the most probable meaning for each noun in the sentence. For
instance, in the sentence “You can even pick up an animation as a brush and
paint a picture with it”, the noun “brush” has several senses in WordNet. Our
disambiguation algorithm is able to correctly disambiguate it by measuring
the semantic similarities between it and the other nouns, i.e. “animation”
and “picture”. At the end of the algorithm the sense having the highest
confidence (0.79) is “an implement that has hairs or bristles firmly set into
a handle”, which is the correct one and will be the one chosen, while other
off-topic senses, such as “a dense growth of bushes” or “the act of brushing
your teeth”, all have a much lower confidence (less than 0.4).
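To make the computation of eq. (A.5) concrete, here is a minimal Python sketch. It is not the EXTRA implementation: sim(), the sense lists and the hypernym test are assumed to be provided by a WordNet wrapper (they appear as callables in the signature below), and the word distance d_{h,j} is approximated by the index difference within the noun list.

    import math

    ALPHA, BETA, GAMMA = 0.7, 0.3, 0.8  # default parameter values from the text

    def distance_decay(d):
        """Gaussian distance decay D of eq. (A.3)."""
        return 2 * math.exp(-d * d / 8) / math.sqrt(2 * math.pi) \
            + 1 - 2 / math.sqrt(2 * math.pi)

    def rank_decay(k, m):
        """Linear sense-rank decay R of eq. (A.4) for the k-th of m senses."""
        return 1.0 if m == 1 else 1 - GAMMA * (k - 1) / (m - 1)

    def noun_confidences(h, nouns, senses_of, sim, hypernym_matches):
        """Confidences (eq. A.5) of the senses of nouns[h], given:
           senses_of(n)                  -> ordered list of senses of noun n,
           sim(n_i, n_j)                 -> semantic similarity of eq. (A.2),
           hypernym_matches(n_h, n_j, s) -> the boolean x_{h,j,k} of eq. (A.6)."""
        n_h = nouns[h]
        senses = senses_of(n_h)
        m = len(senses)
        confidences = []
        for k, s in enumerate(senses, start=1):
            num = den = 0.0
            for j, n_j in enumerate(nouns):
                if j == h:
                    continue
                w = sim(n_h, n_j) * distance_decay(abs(h - j))
                den += w
                num += w if hypernym_matches(n_h, n_j, s) else 0.0
            first = num / den if den > 0 else 0.0
            confidences.append((s, ALPHA * first + BETA * rank_decay(k, m)))
        # The chosen meaning is the sense with the highest confidence.
        return confidences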
As previously described, the nouns N contained in a sentence provide
a good informational context for the correct disambiguation of each noun
nh ∈ N . However, there may be some situations in which such context is not
sufficient, for example when the nouns in N are not strictly inter-correlated
or when the length of the sentence is too short. Consider, for instance, the
sentence “The cat hunts the beetle”: The two nouns “cat” and “beetle”
alone could be correlated both as vehicles and as animals. If we consider the
following sentence, however, "It just ran away from the dog", the noun "dog"
would clearly be very useful in order to disambiguate the meaning of “cat”
and “beetle” as animals. Therefore, in order to provide good effectiveness
also in such difficult situations, we allow the expansion of the dimension of
the context window to include the nouns of the sentences before and after
the one involved. The nouns contained in this window constitute a wider
noun set N 0 and they will contribute to the disambiguation of the nouns N
of the central sentence. The distance decay is still defined on the distance, in
words, between the nouns in the sentences: For this purpose, the sentences
in the context window are considered as one long sentence. In this way, the
nouns from the near sentences automatically contribute less than the ones
from the central sentence, as one would expect.

A.1.3 Verb disambiguation


The automatic disambiguation of verbs is a more complex and a less studied
problem. The adoption of the same approach used for noun disambigua-
tion would not be effective for several reasons. As nouns are organized in
hypernymy structures, verbs in WordNet are organized in troponimy (“is a
manner of”) hierarchies. However, the verbs of a sentence have a much lower
semantic inter-correlation than nouns, making the computation of seman-
tic similarities between them quite inappropriate. Consider the verbs in the
technical sentence of the previous examples: “As soon as the curve is the
right shape, click the mouse button to freeze it in that position”. It is evi-
dent that the disambiguation of “click” is totally independent from the one
of “freeze”. In fact, verbs could be better disambiguated by analyzing their
relations with the objects and subjects, but this would generate two main
problems: The need for a complete sentence parser, much more resource and
knowledge “hungry” than a tagger, and of external knowledge sources, other
than WordNet, where the nouns’ hierarchies should be connected with the
verbs’ ones.
The approach we adopt for verbs capitalizes on the good effectiveness of
the noun technique, while also exploiting one of the most interesting additional pieces of information WordNet provides for verbs: The usage examples. Basi-
cally, in order to disambiguate a verb vh ∈ V of a given sentence, we introduce
a technique similar to the one used for noun disambiguation, where the sim-
ilarity measure is applied between each noun ni ∈ N of the given sentence
and each noun n_j in the usage examples of the verb v_h. In particular, we denote with N(s_k^{v_h}) the nouns in the usage examples of the sense s_k^{v_h} of the verb v_h and with N(S_{v_h}) the nouns in the usage examples of all the senses of verb v_h. As with nouns, we can formally define the meaning of a verb in
a sentence.

Definition A.3 (Meaning of a verb in a sentence) Given the set V of verbs of a given sentence, the confidence φ(s_k^{v_h}) in choosing sense s_k^{v_h} ∈ S_{v_h} of a verb v_h ∈ V is computed as:

\[ \varphi(s_k^{v_h}) = \alpha \cdot \frac{\sum_{n_i \in N} \max_{n_j \in N(s_k^{v_h})} sim(n_i, n_j) \cdot D(d_{i,h})}{\sum_{n_i \in N} \max_{n_{j'} \in N(S_{v_h})} sim(n_i, n_{j'}) \cdot D(d_{i,h})} + \beta \cdot R(s_k^{v_h}) \qquad (A.7) \]

where α > 0 and β > 0 (α + β = 1). The meaning chosen for v_h ∈ V in the sentence is the sense s^{v_h} ∈ S_{v_h} such that φ(s^{v_h}) = max{φ(s_k^{v_h}) | s_k^{v_h} ∈ S_{v_h}}.

Again, notice that the confidence is a value between 0 and 1; a high value of φ(s_k^{v_h}) indicates a high probability for s_k^{v_h} to be the correct sense of v_h. Therefore, we choose the sense with the highest confidence as the most probable meaning for each verb in the sentence. Since for each noun n_i ∈ N in the sentence we identify the most semantically close noun from the usage examples (max_{n_j ∈ N(s_k^{v_h})} sim(n_i, n_j)), the most probable sense is the one whose nouns
choose the sense whose usage examples best reflect the use of the verb in the
sentence.
For example, consider the disambiguation of the verb “click” in the frag-
ment “click the mouse button”. In WordNet such verb has 7 different senses,
each with different usage examples. The correct sense, that is “to move or
strike with a click”, contains the following usage fragment: “he clicked on
the light”. Since the semantic similarities between “mouse”, “button” and
“light” as electronic devices are very high, the algorithm chooses “light” as
the best match both for “mouse” and “button” and, consequently, it is able
to choose the correct sense for the verb, since this match greatly increments
its confidence.
Good hints for the disambiguation of a verb can also be extracted from
the definitions of its different senses. In particular, our approach is also able
to extract, along with the nouns of the usage examples, or instead of them,
the nouns present in such definitions, computing the similarities between
them and the nouns of the original sentence. Consider the sentence “the
white cat is hunting the mouse” again. A large contribution to the correct
disambiguation of the verb “hunt” comes from the nouns of the definition
“pursue for food or sport (as of wild animals)”: The semantic similarity
between “cat”, “mouse” and “animal” is obviously very high and decisive in
correctly choosing such a sense as the preferred one.
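Analogously, eq. (A.7) can be sketched in Python as follows; again this is only illustrative, with the usage-example (or gloss) noun sets, the similarity function and the rank decay assumed to be supplied by the caller.

    import math

    ALPHA, BETA = 0.7, 0.3

    def distance_decay(d):
        """Gaussian distance decay D of eq. (A.3)."""
        return 2 * math.exp(-d * d / 8) / math.sqrt(2 * math.pi) \
            + 1 - 2 / math.sqrt(2 * math.pi)

    def verb_confidence(sense_k, all_senses, sentence_nouns, noun_distances,
                        example_nouns_of, sim, rank_decay_k):
        """Confidence (eq. A.7) of one sense of a verb, given:
           sentence_nouns       -> nouns n_i of the sentence,
           noun_distances[i]    -> word distance d_{i,h} between n_i and the verb,
           example_nouns_of(s)  -> nouns in the usage examples of sense s,
           sim(n_i, n_j)        -> semantic similarity of eq. (A.2),
           rank_decay_k         -> the value R(s_k^{v_h}) of eq. (A.4)."""
        all_example_nouns = [n for s in all_senses for n in example_nouns_of(s)]
        num = den = 0.0
        for n_i, d in zip(sentence_nouns, noun_distances):
            best_k = max((sim(n_i, n_j) for n_j in example_nouns_of(sense_k)),
                         default=0.0)
            best_all = max((sim(n_i, n_j) for n_j in all_example_nouns),
                           default=0.0)
            num += best_k * distance_decay(d)
            den += best_all * distance_decay(d)
        first = num / den if den > 0 else 0.0
        return ALPHA * first + BETA * rank_decay_k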

A.1.4 Properties of the confidence functions and optimization
Both the confidence functions φ(s_k^{n_h}) and φ(s_k^{v_h}) have two components depending on the k-th sense, whose values are between 0 and 1 and whose weights are specified by the α and β parameters. Since verbs and nouns in WordNet can have quite a large number of senses (in some cases over 30), the complexity of the disambiguation process could be very high. To prevent the computation of the confidences of all the different senses, we adopt a simple optimization based on the monotonic properties of the R(·) sense decay function, whose value can be directly and easily computed from the sense position. In the following we show it for φ(s_k^{n_h}); the same holds for φ(s_k^{v_h}). After the computation of the confidence φ(s_k^{n_h}), where φ(s_k^{n_h}) is the local maximum value computed so far, the confidence φ(s_{k+1}^{n_h}) for the subsequent sense s_{k+1}^{n_h} is computed only if φ(s_k^{n_h}) < α + βR(s_{k+1}^{n_h}). Indeed, whenever φ(s_k^{n_h}) ≥ α + βR(s_{k+1}^{n_h}), then φ(s_k^{n_h}) ≥ φ(s_{k+1}^{n_h}) ≥ ... ≥ φ(s_m^{n_h}) (m is the cardinality of S_{n_h}), as the α component is between 0 and 1 and R(·) is a decay function. Thus, the fact that the meaning chosen for n_h has the maximum value of φ(·) among the computed ones allows us to avoid computing the subsequent values φ(s_{k+1}^{n_h}), ..., φ(s_m^{n_h}).
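The pruning rule can be sketched as follows (illustrative Python code: confidence(k) and rank_decay(k) stand for φ and R evaluated on the k-th sense of the term being disambiguated).

    def best_sense(num_senses, confidence, rank_decay, alpha=0.7, beta=0.3):
        """Pick the sense with the highest confidence, skipping the remaining
        senses as soon as the current maximum cannot be beaten: the first
        component of the confidence is at most 1 and R(.) decays with the
        sense rank, so any later sense k is bounded by alpha + beta * R(k)."""
        best_k, best_conf = None, float("-inf")
        for k in range(1, num_senses + 1):
            if best_k is not None and best_conf >= alpha + beta * rank_decay(k):
                break  # no subsequent sense can exceed the current maximum
            c = confidence(k)
            if c > best_conf:
                best_k, best_conf = k, c
        return best_k, best_conf

Since the senses are examined in WordNet's frequency order, the bound typically triggers after very few confidence computations.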

A.2 The MultiEditDistance algorithms


Given a query sequence σ q and a TM sequence σ, coming out from the filter-
ing phase, a function is needed to locate all the possible matching parts along
with their distances. For this purpose, the multiEditDistance function performs two nested cycles over each possible starting token of the two sequences and, for each pair of starting positions, it computes the matrices of the edit distance dynamic programming algorithm [103], thus allowing the identification of the sub2 matches. Figure A.3 shows some aspects of the computation
for the two sequences σ q “white dog hunt mouse” and σ “cat hunt mouse”.
Each cell M [i][j] of each matrix M represents the edit distance value between
the subsequences ranging from the active starting positions to the i-th token
of the first sequence and the j-th token of the second one. For instance, the
cell denoted as A in the figure represents the edit distance value between the
subsequences σ q [2 . . . 4] “dog hunt mouse” and σ[1 . . . 3] “cat hunt mouse”,
while the cell denoted as B represents the edit distance value between the
subsequences σ q [3 . . . 4] “hunt mouse” and σ[2 . . . 3] “hunt mouse”.
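The nested computation can be mirrored by the following Python sketch on the example tokens (a simplified illustration with unit-cost edit operations, not the actual EXTRA code); positions in the returned tuples are 0-based and end-exclusive.

    def edit_distance_matrix(a, b):
        """Standard dynamic-programming edit distance between token lists a and b;
        returns the full (len(a)+1) x (len(b)+1) matrix."""
        m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            m[i][0] = i
        for j in range(len(b) + 1):
            m[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                m[i][j] = min(m[i - 1][j] + 1, m[i][j - 1] + 1,
                              m[i - 1][j - 1] + cost)
        return m

    def multi_edit_distance(query, tm, min_len, d):
        """Sketch of multiEditDistance: for every pair of starting tokens it
        computes an edit distance matrix and reports the sub2 matches that are
        at least min_len tokens long and within the distance threshold d."""
        matches = []
        for i0 in range(len(query)):
            for j0 in range(len(tm)):
                m = edit_distance_matrix(query[i0:], tm[j0:])
                for i in range(min_len, len(query) - i0 + 1):
                    for j in range(min_len, len(tm) - j0 + 1):
                        if m[i][j] <= round(d * i):
                            matches.append((i0, i0 + i, j0, j0 + j, m[i][j]))
        return matches

    q = "white dog hunt mouse".split()
    s = "cat hunt mouse".split()
    print(multi_edit_distance(q, s, min_len=2, d=0.3))
    # among the results: (1, 4, 0, 3, 1) corresponds to cell A
    # and (2, 4, 1, 3, 0) to cell B of Figure A.3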
[Figure A.3: MultiEditDistance between subsequences in σ^q and σ — the edit distance matrices computed for each pair of starting positions in σ^q = "white dog hunt mouse" and σ = "cat hunt mouse"; cell A corresponds to ed(σ^q[2..4], σ[1..3]) and cell B to ed(σ^q[3..4], σ[2..3]).]

By computing all the cells' values with multiEditDistance and by checking them w.r.t. the minimum length minL and the maximum distance d, it

is possible to perform all the steps required for sub2 matching and to identify
all the valid sub2 matches. However, as discussed in Chapter 2, the transla-
tor could be interested just in suggestions that are not contained in other
suggestions, i.e. the longest ones. To satisfy this particular need, a filtration
process, which we call inclusion filtering, is also required in order to prune
out the shorter matches contained in the longer ones.
In order to better understand the set of requirements, consider the follow-
ing example of approximate sub2 matching with inclusion filtering. Suppose
that minL = 3 and d = 0.3 and that we search for the sub2 matches for
the query sentence “So, welcome to the world of computer generated art”
where “welcome world compute generate art” is the sequence outcome of the
syntactic analysis. Let us suppose that the translation memory contains the
five sentences shown in the following table (number of sentence in the first
column, sentence in the source language in the second one):

2518 So, welcome to the world of music.


3945 We welcome our guests to the Madrid Art Expo!
5673 Welcome to the world of computer aided translation!
10271 We welcome you to the world of computer generated fractals
13456 This is a computer generated art work.

The logical representation is the following:


2518    welcome world music
3945    welcome guest Madrid art Expo
5673    welcome world compute aid translation
10271   welcome world compute generate fractal
13456   be compute generate art work

[Figure A.4: Example of approximate sub2 matches with inclusion filter — for each translation memory sequence, the tokens it has in common with the query sequence "welcome world compute generate art" are highlighted.]

All the above sequences present some tokens in common with the query se-
quence (see Figure A.4). However, the subsequence of 2518 is not a valid sub2 match since it is too short, while the part of 3945 is not considered since ed(σ^3945[welcome ... art], σ^q[welcome ... art]) > round(0.3 · 5). The remaining sequences all satisfy the minL and d requirements. However, note that the inclusion requirement excludes σ^5673[welcome ... compute] since it is contained in σ^10271[welcome ... generate].
In order to implement a sub2 matching satisfying all the discussed require-
ments, the multiEditDistance algorithm could still be used as a first step to
produce all candidate sub2 matches, then a subsequent phase could check for
the d, minL, and inclusion constraints and prune out all the invalid matches.
However, a more efficient way to proceed is to avoid the computation of the
undesired matches, at least partially. In particular, it could be possible to
exclude the computation of the shorter matches that would not satisfy the
minimum length requirement and of those matches that would be included
in already identified ones. To do so, we developed a modified version of the
multiEditDistance function, named multiEditDistanceOpt, which is able
to efficiently solve the sub2 matching problem by considering all the above
constraints.
The pseudo-code is shown in Figure A.5. It is applied to the collection
of the pairs of sequences coming out from the filters. Such collection is
ordered on the basis of the query sequences σ^q; in this way all the pairs
of sequences sharing the same query sequence will be one after the other.
1  sub2MatchingOpt( d, minL )
2  {
3    matches ← ∅                          // set of resulting matches
4    σ^q_last ← empty sequence
5    ∀ (σ^q, σ) ∈ sub2Count and sub2Position filter results
6      if ( σ^q_last ≠ σ^q )
7        σ^q_last ← σ^q
8        L ← |σ^q| − minL + 1
9        if ( L < 1 )
10         next (σ^q, σ)
11       lastPos ← L − 1
12       maxEnd[L] ← {−1, . . . , −1}      // array of L integers
13       matches ← matches ∪ pMatches
14       pMatches[L] ← {∅, . . . , ∅}      // array of L sets
15     pMatches ← multiEditDistanceOpt( σ^q, σ, d, minL,
16                                      maxEnd, lastPos, L, pMatches )
17     matches ← matches ∪ pMatches
18   return matches
19 }

21 multiEditDistanceOpt( σ^q, σ, d, minL, maxEnd, lastPos, L, pMatches )
22 {
23   ∀ iext ∈ (0 . . . lastPos)            // cycle for each starting position in σ^q
24     ∀ jext ∈ (0 . . . |σ| − minL)       // cycle for each starting position in σ
25       ∀ i ∈ ((iext + 1) . . . |σ^q|)
26         ∀ j ∈ ((jext + 1) . . . |σ|)
27           Compute DM[i][j]              // distance matrix in position i, j
28           if ( ( i − iext ≥ minL ) and ( j − jext ≥ minL )
29                and ( DM[i][j] ≤ round(d ∗ (i − iext)) ) )
30             insertFlag ← true           // new match
31             ∀ pos ∈ (0 . . . L)
32               if ( ( pos < iext and maxEnd[pos] ≥ i − 1 )        // contained match
33                    or ( pos = iext and maxEnd[pos] ≥ i − 1 ) )
34                 insertFlag ← false
35               else if ( pos = iext and maxEnd[pos] < i − 1 )     // containing match
36                 pMatches[pos] ← empty set
37             if ( insertFlag = true )                             // new match insertion
38               pMatches[iext] ← pMatches[iext] ∪
39                                subMatch( iext + 1, i, jext + 1, j, DM[i][j] )
40               maxEnd[iext] ← i − 1
41               if ( i = |σ^q| )
42                 lastPos ← iext
43   return pMatches
44 }

Figure A.5: Algorithms for approximate sub2 matching with inclusion filtering

For each pair (line 5) the multiEditDistanceOpt function is called by the sub2MatchingOpt function, which also manages the set of solutions and some auxiliary structures used in the computation.
The matches set (initialized at line 3) will store all the resulting matches.

Each match is a tuple containing the start and end positions in σ^q, the start and end positions in σ and the computed edit distance. Besides matches, a few auxiliary data structures are exploited in the multiEditDistanceOpt function and initialized by the sub2MatchingOpt function (lines 8, 11, 12, 14) for each new query sequence σ^q. The goal of the integers L and lastPos is to optimize the computation w.r.t. the minimum length constraint, whereas that of the arrays maxEnd and pMatches is to dynamically apply the inclusion filtering. In particular, L is a constant representing the number of acceptable starting positions for a match on the query sequence; it depends on the length of the query sequence and the minimum length of any suggestion: |σ^q| − minL + 1 (line 8). lastPos represents the current last acceptable starting position for the query sequence; its initial value is L − 1 (line 11). As to the inclusion filter structures, maxEnd is an array of L integers, each one maxEnd[i] representing the maximum end position of all the already computed matches starting at the i-th position in the query. Finally, pMatches is an array of L sets, each one pMatches[i] containing the computed matches starting at the i-th position in the query.
If, for a given σ^q, L is less than 1, the query sequence is discarded for insufficient length and is not analyzed any further (lines 9-10); otherwise, the current pair of sequences (σ^q, σ) is passed to the multiEditDistanceOpt function. Such function considers each starting position in each of the two sequences (lines 23-24), computes the corresponding distance matrix (lines 25-27), identifies new matches (lines 28-29), checks whether a match is included in or includes other matches (lines 31-36) and, eventually, if the match is not included in others, inserts it in the relevant set (lines 38-39) and updates the maxEnd and lastPos values accordingly (lines 40-42). More precisely, if the new match is longer than the already computed matches having the same starting position, thanks to the inclusion filter, the function empties the relevant match set (line 36). For the same reason, if the match is included in others, the insertion is not performed by setting insertFlag to false (line 34). Notice that, with the help of the auxiliary structures, the inclusion filtering process is performed quite efficiently, without ever re-accessing the already computed matches. Furthermore, note that lastPos needs to be updated only when the ending position of the query sequence in the new match coincides with the last token of such sequence (line 41): In this case, for the properties of the inclusion requirements, lastPos is set to the starting position of the found match and, for such query sequence, the following starting positions will not be analyzed any further since they would only produce shorter contained matches.
Appendix B

The complete XML matching algorithms

In this appendix we present and discuss the complete versions of the sequen-
tial twig pattern matching algorithms whose basic properties and ideas have
been shown as an overview in Section 4.4. We have three classes of algo-
rithms which, respectively, perform path matching (Section B.2), ordered
(Section B.3) and unordered (Section B.4) twig matching. For each class
of algorithms we present both the “standard” and the content-based index
optimized versions. All these algorithms perform a sequential scan of the
tree signature; in Section B.5 we give more detail over the current solutions
used to delimit the scan range. The algorithms associate one domain to each
query node and rely on two principles: generate the qualifying index set as
soon as possible and delete from the domains those data nodes which are no
longer needed for the generation of the subsequent answers. With reference
to the previous section, during the scanning process the algorithms generate
the “delta” answer sets ∆(U)ans_Q^k, that is the set of answers which can be
computed at step k.

B.1 Notation and common basis


Let sig(D) = ⟨d_1, post(d_1); d_2, post(d_2); ...; d_m, post(d_m)⟩ denote the signature of the data tree and sig(Q) = ⟨q_1, post(q_1); q_2, post(q_2); ...; q_n, post(q_n)⟩ denote the signature of a query twig pattern. To distinguish a path from
a general tree, we use the capital letter P in place of Q. As in Chapter 4,
for each query node qi , we assume that the post(qi ) operator accesses its
post-order value and Di represents its domain. Together with the domain,
the maximum and the minimum post-order values of the data nodes stored in D_i, accessible by means of the minPost and maxPost operators, respectively, are associated to each query node.

[Figure B.1: How ∆Σ^k_{prev(h')} and ∆Σ^{k'}_{prev(h')} are implemented through the pointers stored in the domains D_{prev(h')} and D_{h'}.]

first(q_i) is the pre-order value of
the first occurrence of the name of qi in D, i.e. the minimum pre-order, and
last(qi ) is the last one, i.e. the maximum pre-order value of node with name
qi . Recall that a pre-order of a node is also the node’s position (index) in the
signature. Notice that both these values can be computed while constructing
the data tree signature.
Insertions in the domains are always performed on the top by means of
the push() operator. Thus the data nodes from the bottom up to the top of
each domain are in pre-order increasing order. Moreover, each data node dk
in the pertinence domain Dh consists of a pair: (post(dk ), pointer to a
node in Dprev(h) ) where prev(h) is h−1 in the ordered case whereas it is the
parent of h in Q in the unordered case. When the data node dk is inserted
into Dh , its pointer indicates the pair which is at the top of Dprev(h) . For
illustration see Figure B.1, where a node d0k preceding dk is represented with
its pointers. In this way, the set of data nodes in Dprev(h0 ) from the bottom up
0
to the data node pointed by k 0 implements ∆Σkprev(h0 ) and the whole content
of Dprev(h0 ) at step k implements ∆Σkprev(h0 ) . By recursively following such
0 0
links from Dprev(h0 ) to D1 , we can derive ∆Σkprev(prev(h0 )) , . . . , ∆Σk1 . As a final
note, we allow the access to a particular position (from the bottom to the
top) in each domain by means of the dot notation. For instance D3 .2 means
the second entry from the bottom of D3 .
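The bookkeeping just described can be illustrated with a small Python sketch (ours, not the actual implementation): each pushed entry records how many entries the previous domain held at insertion time, and answers are enumerated by following these pointers backwards, in the spirit of the showSolutions() function presented below.

    class Domain:
        """Domain D_h of a query node: entries are (post-order value, pointer),
        where the pointer records how many entries D_prev(h) held at push time."""
        def __init__(self):
            self.entries = []

        def push(self, post_order, prev_domain=None):
            ptr = len(prev_domain.entries) if prev_domain else 0
            self.entries.append((post_order, ptr))

    def enumerate_answers(domains, h, idx, partial, out):
        """Combine the idx-th entry of domains[h] with the entries of the
        previous domain, from the bottom up to the one its pointer designates."""
        post, ptr = domains[h].entries[idx]
        partial = [post] + partial
        if h == 0:
            out.append(partial)
            return
        for i in range(ptr):
            enumerate_answers(domains, h - 1, i, partial, out)

    # Toy usage for a two-node path query: one matching pair of data nodes.
    D1, D2 = Domain(), Domain()
    D1.push(10)        # a data node matching q1 (post-order value 10)
    D2.push(4, D1)     # a data node matching q2, pushed while D1 holds 1 entry
    answers = []
    enumerate_answers([D1, D2], 1, 0, [], answers)
    print(answers)     # [[10, 4]]: the post-order values of the matched nodes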

B.2 Path matching


B.2.1 Standard version
In this subsection we propose the algorithm for the matching of a path pattern
in a data tree. In this case, domains can be treated as stacks, that is deletions

DATA STACK MANAGEMENT


empty(stack) Empties stack
isEmpty(stack) Returns true if stack is empty, false otherwise
pointer(elem) Returns the pointer of the given element
pointerToTop(stack) Returns a pointer to the top of the given stack
pop(stack) Pops (and returns) an element from stack
push(stack,elem) Pushes given elem in stack
SOLUTION CONSTRUCTION
showSolutions(...) Recursively builds the path solutions

Table B.1: Path matching functions

are implemented following a LIFO policy. Table B.1 provides a summary of


the path matching auxiliary functions. In particular, the top of the table
presents all the functions which are needed in order to manage the stacks; in
addition to the already introduced push() function, notice the complemen-
tary pop() one, which performs the deletions. Further, isEmpty() checks
whether a given stack is empty, whereas empty() empties the stack. In the
lower part of the table the functions needed for the solutions construction are
shown, in our case the only showSolutions(). We assume that the query
pattern has unique node names. The algorithm for path matching, which is
depicted in Figure B.2, can be easily extended to the case where multiple query nodes have the same name, and we will show it in the following. Fi-
nally, for path matching we do not need the maximum and the minimum
post-order values associated to each stack.
The key idea of the algorithm is to scan the portion of the signature from
start to end, which in this case will be from first(q1 ) to last(qn ). For each
data node, from Line 2 to Line 8, it deletes those data nodes in the stacks
which are no longer useful to generate the delta answers. Then it adds the
k-th data node in the proper stack (Line 10) and, if the data node matches
with the leaf of the query path, it shows the answers which can be generated
(Line 12). Notice that the k-th data node points to that data node which
matches qh−1 and which has the highest pre-order value smaller than k. Such
pointers will be used in the showSolutions() function. At Line 13, due to Condition PRO1, the algorithm deletes dk if it matches the last query node.
Instead of checking all nodes as specified in Condition POP, we stop looking at the nodes in Di whenever post(top(Di))>post(dk) (Line 4). It
fully implements Condition POP because, as we will prove in the following,
if post(top(Di ))>post(dk ) then post(si )>post(dk ) for each si ∈ Di and
thus Condition POP can no longer be applied. Moreover, Condition PRO2 is

Input: path P having signature sig(P ); rew(Q)


Output: ansP (D)

algorithm PathMatch(P)
(0) getRange(start, end);
(1) for each dk where k ranges from start to end
(2) for each h such that qh = dk in descending order
(3) for each Di where i ranges from 1 to n
(4) while(¬isEmpty(Di ) ∧ post(top(Di ))<post(dk ))
(5) pop(Di );
(6) if(isEmpty(Di ))
(7) for each Di0 where i0 ranges from i + 1 to n
(8) empty(Di0 );
(9) if(¬isEmpty(Dh−1 ))
(10) push(Dh ,(post(dk ),pointerToTop(Dh−1 )));
(11) if(h = n)
(12) showSolutions(h,1);
(13) pop(Dh );
(14) for each Di where i ranges from 1 to n
(15) if(isEmpty(Di ) ∧ last(qi )<k)
(16) exit;

procedure showSolutions(h,p)
(1) index[h] ← p;
(2) if(h = 1) output(D1 .index[1],...,Dn .index[n])
(3) else
(4) for i = 1 to pointer(Dh .index[h])
(5) showSolutions(h − 1,i);

Figure B.2: Path matching algorithm

implemented in Lines 6-9. Observe that, instead of checking the intersection


between the state of the domains at different steps as required by Condition
PRO2, we only check whether Dh−1 is empty (line 9). Indeed, we will show
that, in order to delete the nodes belonging to a domain Di at step k, it is
first necessary to delete the nodes belonging to Di at a step preceding the
k-th one.
Before demonstrating the correctness of the algorithm, we present three
properties of the data nodes stored in each stack Di .

Lemma B.1 If post(top(Di))>post(dk) then post(d_{k'})>post(dk) for each d_{k'} ∈ Di.

Lemma B.2 For each i ∈ [1, n], ∆Σ^{j}_{i} ∩ ∆Σ^{k}_{i} = ∅ iff the condition isEmpty(Di) is true at any step j' with k ≤ j' ≤ j.

Lemma B.3 At each step j and for each query index i, the stack Di is a subset of Σ^{j}_{i} containing only the data entries that cannot be deleted from Σ^{j}_{i}, i.e. it has the same content as ∆Σ^{j}_{i} when Lemmas 4.4, 4.5, 4.6, and 4.11 have been applied.

Starting from the data node in the leaf stack Dn (function call at Line 12
of the main algorithm), function showSolutions() uses the pointer associ-
ated with each data node dk to “jump” from one stack Di to the previous one
Di−1 (up to D1 ) and recursively combines dk with each node from the bot-
tom of Di−1 up to the node pointed by dk . The correctness of the algorithm
follows from the properties shown so far.
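For concreteness, the following is a compact, runnable Python transcription of the stack-based path matching of Figure B.2 and of the showSolutions() recursion. It is a sketch under simplifying assumptions, not the actual implementation: query node names are unique, the signature is modelled as a list of (name, post-order) pairs in pre-order, and the early-exit test of Lines 14-16 is omitted for brevity.

def path_match(query, signature):
    """query: the path q1..qn as a list of node names (unique names assumed).
    signature: list of (name, post_order) pairs; position k (1-based) is the pre-order.
    Yields answers as tuples of pre-order values (pre(s1), ..., pre(sn))."""
    n = len(query)
    stacks = [[] for _ in range(n)]            # stacks[i] holds (pre, post, ptr) triples

    def show_solutions(h, p, index):
        index[h] = p
        if h == 0:
            yield tuple(stacks[i][index[i]][0] for i in range(n))
        else:
            # follow the pointer: combine with every entry of the previous stack,
            # from its bottom up to the entry that was on top when stacks[h][p] was pushed
            for i in range(stacks[h][p][2]):
                yield from show_solutions(h - 1, i, index)

    for k, (name, post) in enumerate(signature, start=1):
        if name not in query:
            continue
        h = query.index(name)
        # Condition POP: drop entries whose post-order is smaller than post(dk);
        # if a stack becomes empty, the stacks to its right are emptied too
        for i in range(n):
            while stacks[i] and stacks[i][-1][1] < post:
                stacks[i].pop()
            if not stacks[i]:
                for j in range(i + 1, n):
                    stacks[j].clear()
        # insertion (Condition PRO2): only if the previous stack is non-empty
        if h == 0 or stacks[h - 1]:
            stacks[h].append((k, post, len(stacks[h - 1]) if h > 0 else 0))
            if h == n - 1:                     # leaf match: emit the delta answers
                yield from show_solutions(h, len(stacks[h]) - 1, [0] * n)
                stacks[h].pop()                # Condition PRO1

# toy data tree a(b, b): pre-orders a=1, b=2, b=3; post-orders 3, 1, 2
print(list(path_match(["a", "b"], [("a", 3), ("b", 1), ("b", 2)])))   # [(1, 2), (1, 3)]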

Theorem B.1 For each data node dj, S = (s1, . . . , sn) ∈ ∆ans^{j}_{P}(D) iff the algorithm, by calling the function showSolutions(), generates the solution S.

Finally, let us consider the correctness of the scanned range, which is between the first occurrence of q1 and the last occurrence of qn in sig(D). As D1 is empty before first(q1) has been accessed, we can avoid accessing all the data nodes before first(q1), which would be inserted in D2, . . . , Dn but never used, due to Lemma 4.5. On the other hand, from Lemma 4.4, it follows that ∆ans^{k}_{Q}(D) = ∅ for k ∈ [last(qn) + 1, m]. Therefore ans_{Q}(D) = ⋃_{k=first(q1)}^{last(qn)} ∆ans^{k}_{Q}. Moreover, the algorithm exits whenever any stack Di is empty and no data node matching with qi will be accessed, i.e. last(qi) < k (Lines 14-16). This means that ∆Σ^{k'}_{i} = ∅ for each k' ∈ [k, m] and thus that ∆ans^{k'}_{Q}(D) = ∅.

B.2.2 Content-based index optimized version


In the previous section we described the general algorithm for the matching of a path pattern in a data tree; in this section we propose another path matching algorithm (detailed in Figure B.3) that can be used for path queries with a specified value condition, provided that a content index over the leaf of the query is available. In this case we can reduce the part of the document that has to be scanned by simply considering the post-orders of the document's nodes that satisfy the value condition specified by the query. Since we need to use a content-based index, we have to introduce some functions to deal with it (see Table B.2).

PATH QUERY NAVIGATION


getLastElement(signature) Returns the last element of the path
getValue(element) Returns the value for the specified element
getCondition(element) Returns the value condition for the specified
element
INDEX
getMatchList(element,value, Returns an iterator that holds the list of
condition) elements in the document corresponding
to element that satisfy the
condition condition with the value
value. The list is ordered by pre-order
values.
ITERATOR MANAGEMENT
hasNext(iterator) Checks whether a subsequent element exists
getNext(iterator) Returns the subsequent element

Table B.2: Path matching functions for content-based search

Differently from the previous algorithm, we first retrieve the list of the document's elements of type ql (the last element of the path, i.e. the only one with a specified value condition) that satisfy the specified value condition by calling the function getMatchList() (Line 2). If the list is empty no answer can be found (i.e. there is no element ql that satisfies the specified value condition), so we can terminate the algorithm (Lines 3 and 4); otherwise we can start the search. Since the list is ordered by pre-order value and answers are sought sequentially from start to end, we will first find (if they exist) the answers that end with the first element of the list, then the answers ending with the second element, and so on. Each element of the list is sequentially used as a target element (curLeaf) for the search. Following the observations made in Section 4.3.1, we can skip the document subtrees rooted at nodes dk such that post(dk) < post(curLeaf). Such a skip is performed by the loop at Lines 13 and 14; the proposed algorithm assumes a signature that does not contain the first following values; if the signature contains those values, we can simply replace Line 14 with k ← ff(dk). The loop from Lines 8 to 12 updates the target of the search, if needed. If we have gone past curLeaf (Line 8) we need to change the target: if a next target exists (Line 9) we simply update the curLeaf variable, otherwise no other answers can be found and we can terminate the algorithm (Line 12). In order to keep curLeaf and k coherently updated, we need to repeat the two loops described above (Lines 8 to 14) until we have curLeaf and k such that k ≤ pre(curLeaf) and post(dk)

Input: path P having signature sig(P ); rew(Q)


Output: ansP (D)

algorithm PathMatchCont(P)
(0) getRange(start, end);
(1) ql ← getLastElement(P );
(2) docLeaves ← getMatchList(ql ,getValue(ql ),getCondition(ql ));
(3) if (¬hasNext(docLeaves))
(4) exit;
(5) curLeaf ← getNext(docLeaves);
(6) for each dk where k ranges from start to end
(7) while((k > pre(curLeaf )) ∨ ((post(dk ) < post(curLeaf ))))
(8) while(k > pre(curLeaf ))
(9) if(hasNext(docLeaves))
(10) curLeaf ← getNext(docLeaves);
(11) else
(12) exit;
(13) while(post(dk ) < post(curLeaf ))
(14) k ← post(dk ) + 1;
(15) for each h such that qh = dk in descending order
(16) for each Di where i ranges from 1 to n
(17) while(¬isEmpty(Di ) ∧ post(top(Di ))<post(dk ))
(18) pop(Di );
(19) if(isEmpty(Di ))
(20) for each Di0 where i0 ranges from i + 1 to n
(21) empty(Di0 );
(22) if(¬isEmpty(Dh−1 ))
(23) push(Dh ,(post(dk ),pointerToTop(Dh−1 )));
(24) if(h = n ∧ dk = curLeaf )
(25) showSolutions(h,1);
(26) pop(Dh );
(27) for each Di where i ranges from 1 to n
(28) if(isEmpty(Di ) ∧ last(qi )<k)
(29) exit;

Figure B.3: Content index optimized path matching algorithm

≥ post(curLeaf ).
From Lines 15 to 29 the algorithm is substantially the same as in the previous section; on Line 24, before generating answers, we also have to check that dk is the same node as curLeaf, because we need to verify that the value of dk satisfies the specified condition (and that is possible only if dk = curLeaf).
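To ground the index functions of Table B.2, the following toy sketch shows what getMatchList() could look like over a simple inverted content index; the index layout (a map from element names to (pre-order, value) pairs) is our own illustrative assumption, not the structure actually used in the system.

def get_match_list(content_index, element, value, condition):
    """content_index: dict mapping an element name to the list of (pre_order, value)
    pairs of its occurrences, kept in increasing pre-order. Returns an iterator over
    the pre-orders of the occurrences whose value satisfies the given condition."""
    tests = {
        "=":  lambda v: v == value,
        "<":  lambda v: v < value,
        ">":  lambda v: v > value,
        "<=": lambda v: v <= value,
        ">=": lambda v: v >= value,
    }
    test = tests[condition]
    return iter([pre for pre, v in content_index.get(element, []) if test(v)])

# hasNext()/getNext() can be emulated with next(it, None)
idx = {"price": [(4, 30), (9, 12), (15, 30)]}
it = get_match_list(idx, "price", 30, "=")
print(next(it, None), next(it, None), next(it, None))   # 4 15 None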

B.3 Ordered twig pattern matching


B.3.1 Standard version
As already discussed in Section 4.4, the domains of the twig matching algo-
rithms cannot be stacks because they are not ordered on post-order values,
thus deletions can be applied at any position of the domains. Therefore, in
this case we will implement the query domains as lists. Table B.3 shows
a summary of the functions employed in the ordered twig matching algo-
rithms; note that we omit the functions which we already discussed in the
path case (functions such as empty() or push(), which are the same but
applied to lists instead of stacks). deleteLast() is the equivalent of the
pop() used for stacks. Besides the functions managing lists, the rest of the
table shows the functions which interact with the main algorithm in order
to produce the matching results: The boolean functions isCleanable() and
isNeededOrd(), checking whether a node can be deleted and whether a node
insertion can be avoided, respectively, and updateRightLists(), updating
domains after a deletion. Each of these matching functions will be shown
in detail, together with the main algorithm. Further, as we will show, twig
query navigation functions and more advanced solution construction func-
tions are now needed, in this case descendants() and the other ones shown in the lower part of the table, which will be discussed later while explaining
the solution construction algorithm. Finally, in the twig matching algorithms
we also need the maximum and the minimum post-order values associated
to each domain, and thus we will also exploit the minPost and maxPost
operators discussed in Section B.1.
The ordered twig matching algorithm is shown in Figure B.4. The scanned
range is the same as the one we previously discussed for paths. Further, as
for the path algorithm, also in this case we implement the required condi-
tions in the most effective order. First, we try to delete nodes by means of
the post-order conditions, in particular POT2 and POT3, (Lines 3-9) and,
if a deletion is performed, we update the pointers in the right lists (Line 9).
Then, we work on the current node (Lines 10-14), checking if an insertion is
needed (condition PRO2 and POT1, Line 10) and verifying if new solutions
can be generated (Lines 12-13). As in path matching, condition PRO1 is
used to delete a node from the last stack (Line 14) and the algorithm exits
whenever any stack Di is empty and no data node matching with qi will be
accessed (Lines 15-17).
Let us first analyze the deletion part of the algorithm (Lines 3-9). Here,
the boolean function isCleanable() (see Figure B.5 for the complete code)
checks whether di can be deleted. In particular, if i is the index of the

DATA LIST MANAGEMENT


decreasePointer(elem) Decreases by 1 the pointer in given elem
delete(list,elem) Deletes given elem from list
deleteLast(list) Deletes last elem from list
index(list,elem) Returns the position of elem in list
noEmptyLists() Returns false if there is at least an empty list
pointerToLast(list) Returns a pointer to the top of the given list
MATCHING
isCleanable(pre,elem) Returns true if current elem can be deleted
from the pre-th list, false otherwise
isNeededOrd(pre,elem) Checks if insertion of given elem is needed
in the pre-th list (ordered version)
updateRightLists(pre,pos) Updates the pointers in the (pre + 1)-th list,
possibly propagating the update, following the
deletion of the pos-th element from the
pre-th list
TWIG QUERY NAVIGATION
descendants(pre) Returns the descendants of the given twig node
SOLUTION CONSTRUCTION
findSolsOrd(...) Recursively builds the ordered twig solutions
checkPostDir(node, Returns true if given nodes have the correct
precN ode) post-order direction w.r.t. the query
twigBlock(pre) Returns the block information for the pre-th
query node (1 for block opening, -n for n
blocks closing, 0 otherwise)
updateStackMax(stack,post) Updates top element of stack if less than post

Table B.3: Ordered twig matching functions

twig root, it simply returns true (Condition POT3), otherwise it checks the
conditions expressed in Condition POT2. In this case, links connecting each
domain Di with the domains of the descendants ı̄ of the i-th twig node and the
minPost operator are exploited to speed up the process. In particular, instead
of checking the post-order value of each data node in the domains, we check
if minPost(Dı̄)>post(di) for each domain Dı̄ (Line 4 of the isCleanable()
code). Whenever a node di is deleted, the updateRightLists() function,
shown in detail in Figure B.5, updates the pointer of all the nodes pointing
to di in order to make it point to the node below di (Line 12 of the function).
Such an update is performed in a descending order and stops when a node
pointing to a node below di is accessed (Line 3).
As to the current node insertion (Lines 10-11 of the main algorithm),

Input: query Q having signature sig(Q); rew(Q)


Output: ansQ (D)

algorithm OrderedTwigMatch(Q)
(0) getRange(start, end);
(1) for each dk where k ranges from start to end
(2) for each h such that qh = dk in descending order
(3) for each Di where i ranges from 1 to n
(4) for each di in Di in ascending order
(5) if(post(di )<post(dk ) ∧ isCleanable(i,di ))
(6) pos ← index(Di ,di );
(7) delete(Di ,di );
(8) if(i 6= n)
(9) updateRightLists(i,pos);
(10) if(¬isEmpty(Dh−1 ) ∧ isNeededOrd(h,dk ))
(11) push(Dh ,(post(dk ),pointerToLast(Dh−1 )));
(12) if(h = n)
(13) findSolsOrd(h,1);
(14) deleteLast(Dh );
(15) for each Di where i ranges from 1 to n
(16) if(isEmpty(Di ) ∧ last(qi )<k)
(17) exit;

Figure B.4: Ordered twig matching algorithm

condition PRO2 is exploited, making it sufficient to only check Dh−1 . This is


because, whenever Condition PRO2 is applied, if a domain Dh is empty then
all the domains “following” Dh are emptied. Such emptying operations are performed in the updateRightLists() code from Lines 5 to 10, and are the application of PRO2 to a node inserted at a step k' (di+1 in the algorithm) preceding the current k-th data node and thus already belonging to a domain D_{h'}, where h' > i. In this case, di+1 is deleted when, due to the deletions applied in the main algorithm, its pointer becomes dangling. We recall that, for instance, ∆Σ^{k'}_{h'−1} is implemented by the portion of D_{h'−1} between the bottom and the data node pointed by di+1, and ∆Σ^{k}_{h'−1} is the current state of D_{h'−1}. Thus, intuitively, if the pointer of di+1 is dangling it means that ∆Σ^{k'}_{h'−1} ∩ ∆Σ^{k}_{h'−1} = ∅, as required by Condition PRO2. The same can be recursively applied to the other domains ∆Σ^{k}_{h'−2}, . . . , ∆Σ^{k}_{1}.
Before inserting a new node (Lines 10-11 of the main algorithm), we also
call the boolean function isNeededOrd(), which checks the condition shown
in Condition POT1 by using the minPost(D) and maxPost(D) values for
each domain D instead of comparing the current data node with each data

function isCleanable(i,di )
(1) if(i = 1)
(2) return true;
(3) for each ı̄ in descendants(i)
(4) if(isEmpty(Dı̄ ) ∨ minPost(Dı̄ )>post(di ))
(5) return true;
(6) return false;

function isNeededOrd(i,di )
(1) if(i = 1)
(2) return true;
(3) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di ))
(4) return false;
(5) if(isEmpty(Di−1 ) ∨ minPost(Di−1 )>post(di ))
(6) return false;
(7) return true;

procedure updateRightLists(i,pos)
(1) for each di+1 in Di+1 in descending order
(2) if(pointer(di+1 )<pos)
(3) return;
(4) if(pointer(di+1 )=1)
(5) for each di+1 from di+1 in descending order
(6) rP os ← index(Di+1 ,di+1 );
(7) delete(Di+1 ,di+1 );
(8) if(i + 1 6= n)
(9) updateRightLists(i + 1,rP os);
(10) return;
(11) else
(12) decreasePointer(di+1 );

Figure B.5: Ordered twig matching auxiliary functions

node in D. Notice that, in order to speed up the process, we only perform


the check in the parent domain (Line 3 of isNeededOrd()) and in the first
left sibling (Line 5), due to the transitivity of the relationships. Finally, by
analogy to the path matching algorithm, we check if new solutions can be
generated (Lemma 4.15) and, in this case, call (Line 13 of the main algorithm)
the recursive function findSolsOrd() implementing Theorem 4.5.
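The two checks can be rendered quite directly in Python; the sketch below is our own reading of the pseudocode of Figure B.5, assuming 0-based query pre-orders with the root at index 0, precomputed parent[] and descendants[] arrays for the twig, and domains stored as lists of (post-order, pointer) pairs.

def is_cleanable(i, post_di, domains, descendants):
    """Conditions POT3/POT2: the entry with post-order post_di can be dropped from the
    i-th domain if i is the twig root, or if some descendant domain is empty or only
    holds nodes with a larger post-order (no descendant of di can ever match it)."""
    if i == 0:
        return True
    for j in descendants[i]:
        if not domains[j] or min(p for p, _ in domains[j]) > post_di:
            return True
    return False

def is_needed_ord(i, post_dk, domains, parent):
    """Condition POT1 (ordered case): the new node is useful only if the parent domain
    holds a potential ancestor (larger post-order) and the left-sibling domain D_{i-1}
    holds a potential preceding node (smaller post-order)."""
    if i == 0:
        return True
    if not domains[parent[i]] or max(p for p, _ in domains[parent[i]]) < post_dk:
        return False
    if not domains[i - 1] or min(p for p, _ in domains[i - 1]) > post_dk:
        return False
    return True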
From the above analysis, the Lemma below easily follows.
Lemma B.4 At each step j and for each query index i, the list Di is a subset of Σ^{j}_{i} containing only the data entries that cannot be deleted from Σ^{j}_{i}, i.e. it

procedure findSolsOrd(h,p)
(1) index[h] ← p;
(2) if(h = 1) output(D1 .index[1],...,Dn .index[n])
(3) else
(4) if(twigBlock(h)<0)
(5) for i = 0 to twigBlock(h)
(6) push(postStack,post(Dh .index[h]));
(7) else if (twigBlock(h)=0)
(8) updateStackMax(postStack,post(Dh .index[h]));
(9) if(twigBlock(h − 1)>0)
(10) curP ost ← pop(postStack);
(11) for i = 1 to pointer(Dh .index[h])
(12) okT oContinue ← false;
(13) if(checkPostDir(Dh .index[h],Dh−1 .i))
(14) if(twigBlock(h − 1)<=0)
(15) okT oContinue ← true;
(16) else
(17) if(post(Dh−1 .i)>curP ost)
(18) okT oContinue ← true;
(19) updateStackMax(postStack,post(Dh−1 .i));
(20) if(okT oContinue)
(21) findSolsOrd(h − 1,i);

Figure B.6: Ordered twig matching solution construction

has the same content as ∆Σ^{j}_{i} when Lemmas 4.4, 4.5, 4.6, 4.9, 4.13, and 4.14 have been applied.

The ordered twig matching solution construction, shown in detail in Fig-


ure B.6, is inspired by the path one, i.e. it is a function which recursively
builds each solution one step at a time, starting from the last domain and
outputting the current solution when reaching the first domain (Line 2).
However, the function is more complex since the solutions have to be checked
while being built; in particular, in the ordered twig matching we would have
to do all the post-order checks defined in Lemma 4.1. Instead of perform-
ing all these checks, and in order to maintain the step-by-step backward
construction behavior, at each step the algorithm verifies, by means of the
checkPostDir() function, if the post order of the current node and the one
in the preceding domain have the correct post-order direction (increasing
/ decreasing) w.r.t. the corresponding twig query nodes (Line 13). More-
over, the algorithm checks that the post-orders of all the children of a given
node are actually smaller than the parent one: This is done by using a stack

structure named postStack, in which the post orders of the children nodes
are kept, and a function named twigBlock(), which helps in identifying the
structure of the query, i.e. its “blocks” of children. In particular, for a given
node, the twigBlock() function returns an integer: 1 for block opening (i.e.
the given node is the father of other nodes), -n for n blocks closing (i.e. the given node is a leaf and is the last child for n parents), 0 otherwise (no blocks opening or closing). If a block is closing (Line 4), and this will be the first case since we are constructing the solutions from the last domain, the current post-order is saved in the stack (Line 6); such post-order is then updated in case of other siblings (Lines 7-8) in order to keep the maximum one of the current block. Then, in case of a parent node (block opening, Line 9) such value is retrieved and, if the post-order direction check succeeds (Line 13), Line 17 checks whether such value, representing the maximum post-order of the children, is less than the post-order of the current node. Finally, if all the checks succeed, the algorithm continues by recursively calling itself (Line 21).
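To make the twigBlock() bookkeeping more tangible, the following small Python sketch (our own reading of the description above, assuming the query twig is given as a 0-based parent array in pre-order) computes the block value of each query node: 1 if it opens a block, -n if it is a leaf closing n nested blocks, 0 otherwise.

def twig_blocks(parent):
    """parent[i] = parent of query node i (0-based pre-order); parent[0] = -1 (root)."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for i in range(1, n):
        children[parent[i]].append(i)        # pre-order keeps children ordered left-to-right

    blocks = []
    for i in range(n):
        if children[i]:                       # internal node: opens a block
            blocks.append(1)
            continue
        closed = 0                            # leaf: count how many nested blocks it closes
        node = i
        while node != 0 and children[parent[node]][-1] == node:
            closed += 1
            node = parent[node]
        blocks.append(-closed if closed else 0)
    return blocks

# query twig a(b, c(d, e)): pre-order a=0, b=1, c=2, d=3, e=4
print(twig_blocks([-1, 0, 0, 2, 2]))          # [1, 0, 1, 0, -2]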

Theorem B.2 For each data node dj, S = (s1, . . . , sn) ∈ ∆ans^{j}_{Q}(D) iff the algorithm, by calling the function findSolsOrd(), generates the solution S.

B.3.2 Content-based index optimized version


Basically, the content-based index optimized algorithm follows the same principles explained for the path algorithm (see Section B.2.2), but since a twig could have more than one leaf with a specified value condition, and some query elements might not be related to the target leaves by an ancestor-descendant relationship, there are some differences. Before starting to analyze the algorithm we need to introduce some new functions (see Table B.4), in particular some that help to identify value constrained leaves (isConstrainedLeaf() and getValueConstrainedLeaves()), establish which is the current document target leaf for a specified element (getTarget()), and check if a skip is applicable or not (canSkipOrd()). We modified the isNeededOrd() function (now called isNeededOrdCont()) and updateRightLists() (now called updateRightListsCont()) in order to take target information into account. Besides these functions, we introduced a completely new set of functions needed to manage document target lists (see Table B.5). Now that we have briefly introduced the new functions, we can start to analyze the algorithm (detailed in Figure B.7). After the computation of the scan range (Line 0) we retrieve the list of query leaves that specify a value condition by calling the function getValueConstrainedLeaves() (Line 1). In the following we assume that the returned list is not empty and that for each leaf li contained

QUERY NAVIGATION
isConstrainedLeaf(element) Returns true if the passed element
is a value constrained leaf
getValueConstrainedLeaves Returns the list of leaves that specify
(signature) a value condition
getTarget(pre) Returns the current target for twig
pre-th element, by convention
if the element is a target itself
it returns the relative target
MATCHING
isNeededOrdCont(pre,elem) Checks if insertion of given elem is
needed in the pre-th list
considering the current targets
updateRightListsCont(pre,pos, Updates the pointers in the (pre + 1)-th
curDocP re) list, possibly propagating the update,
following the deletion of the pos-th
element from the pre-th list. It
also performs target alignment if it
is needed
canSkipOrd(post) Returns true if, evaluating the
current document post-order value,
current targets and current domains,
it is possible to perform a skip

Table B.4: Ordered twig matching functions for content-based search

in the list we have an available content-based index. From Lines 2 to 5 we


retrieve (through the index) for each leaf li in targetLeaves an iterator that
holds the list of document elements that match with element li and satisfy
the specified value condition (function getMatchList() on Line 3). If one of
these lists is found to be empty (Lines 4 and 5) we can stop the algorithm
because no answer will be found in the input document. Then we perform the
first target list alignment by calling the function firstAlignment() (Line
6) that will be discussed later. If the alignment fails (i.e. at least one list results empty after the attempt to align the lists) we can stop the algorithm, since no answer can be found in the input document; otherwise we can start the sequential scan. At each step of the scan we first try to adjust the current targets in order to be coherent with the current document pre-order value (k) and with the current state of the domains (Line 9). Again, if the adjustment fails we can stop the algorithm, because some target list is found to be empty and the relative domain is also empty (i.e. no other match for the element could

TARGET LIST MANAGEMENT


adjustTargetsOrd(pre) Adjusts the current targets depending on the
current document pre and domains.
Returns true if the operation was completed
successfully
firstAlignment(pre) Performs the first target list alignment, taking into
consideration pre as the starting scan pre-order
value.
Returns true if the operation was completed
successfully
alignment(pre,element) Performs the alignment between the target list
associated to element and the subsequent one.
Returns true if the operation was completed
successfully

Table B.5: Target list management functions

be found). From Line 11 to Line 14 we check if it is possible to perform a


skip by repetitively calling the function canSkipOrd(). If we can perform
a skip, on Line 12 we update the current document pre-order value k with the approximate value of the first following of the current element (see Section B.2.2). For each update we need to re-adjust the current targets, so on Lines 13 and 14 we apply the same scheme applied on Lines 9 and 10. From Line 15 to 20 the algorithm works exactly as the previous one, thus it will not be further discussed. On Line 21, after the deletion of an element from the relative domain, we check whether the deletion involves the first element of a domain that belongs to a target element (say li); in this case we need to perform an alignment between the target list targetList_{li} and the target list targetList_{li+1} (see Definition 4.3.2) by calling the function alignment() (Line 22). The rest of the algorithm, from Line 23 to Line 32, works exactly as the previous one; it has to be noted that on Line 24 we call the newly defined function updateRightListsCont() instead of updateRightLists() and on Line 25 we call isNeededOrdCont() instead of isNeededOrd().
Let us now discuss more deeply the newly proposed functions (shown
in Figure B.8). We start with the modified version of isNeededOrd(), called isNeededOrdCont(); all the differences from the original version are in the first two lines. If the i-th element of the query twig has a reference target (i.e. it is at least an ancestor of one target) we check if the post-order value of di is smaller than the post-order value of the current target related to the i-th element. If the check succeeds, we know that element di is not an ancestor of the related current target and, due to Lemma 4.10, we are sure that di

Input: query Q having signature sig(Q); rew(Q)


Output: ansQ (D)

algorithm OrderedTwigMatchCont(Q)
(0) getRange(start, end);
(1) targetLeaves ← getValueConstrainedLeaves(Q);
(2) for each li in targetLeaves
(3) targetListli ← getMatchList(li ,getValue(li ),getCondition(li ));
(4) if(¬hasNext(targetListli ))
(5) exit;
(6) if(¬firstAlignment(start))
(7) exit;
(8) for each dk where k ranges from start to end
(9) if(¬adjustTargetsOrd(k))
(10) exit;
(11) while(canSkipOrd(post(dk )))
(12) k ← post(dk ) + 1;
(13) if(¬adjustTargetsOrd(k))
(14) exit;
(15) for each h such that qh = dk in descending order
(16) for each Di where i ranges from 1 to n
(17) for each di in Di in ascending order
(18) if(post(di )<post(dk ) ∧ isCleanable(i,di ))
(19) pos ← index(Di ,di );
(20) delete(Di ,di );
(21) if(isConstrainedLeaf(qi ) ∧ pos = 1)
(22) alignment(k,qi )
(23) if(i 6= n)
(24) updateRightListsCont(i,pos,k);
(25) if(¬isEmpty(Dh−1 ) ∧ isNeededOrdCont(h,dk ))
(26) push(Dh ,(post(dk ),pointerToLast(Dh−1 )));
(27) if(h = n)
(28) findSolsOrd(h,1);
(29) deleteLast(Dh );
(30) for each Di where i ranges from 1 to n
(31) if(isEmpty(Di ) ∧ last(qi )<k)
(32) exit;

Figure B.7: Content index optimized ordered twig matching algorithm

will not be an ancestor of any subsequent target, so di can be considered a useless element and we can directly return false (Line 1). Also the function updateRightLists() has been modified, since the current targets depend also

function isNeededOrdCont(i,di )
(0) if(hasAReferenceTarget(i) ∧ post(di ) < post(getTarget(i)))
(1) return false;
(2) if(i = 1)
(3) return true;
(4) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di ))
(5) return false;
(6) if(isEmpty(Di−1 ) ∨ minPost(Di−1 )>post(di ))
(7) return false;
(8) return true;

procedure updateRightListsCont(i,pos,pre)
(1) for each di+1 in Di+1 in descending order
(2) if(pointer(di+1 )<pos)
(3) return;
(4) if(pointer(di+1 )=1)
(5) for each di+1 from di+1 in descending order
(6) rP os ← index(Di+1 ,di+1 );
(7) delete(Di+1 ,di+1 );
(8) if(isConstrainedLeaf(qi+1 ) ∧ rP os = 1)
(9) alignment(pre,qi+1 )
(10) if(i + 1 6= n)
(11) updateRightListsCont(i + 1,rP os,pre);
(12) return;
(13) else
(14) decreasePointer(di+1 );

function canSkipOrd(post)
(0) for i from 1 to n
(1) if( ¬(hasAReferenceTarget(i) ∨ isContrainedLeaf(qi )) ∨
post ≥post(getTarget(i)) )
(2) return false;
(3) else
(4) if( isEmpty(Di ) ∨
(hasAReferenceTarget(i) ∧ post >maxPost(Di )) )
(5) return true;

Figure B.8: Content index optimized ordered twig matching algorithm aux-
iliary functions

on the status of the domains associated to the target leaves. After the deletion of an occurrence (Line 7), we check whether we deleted the first occurrence of a

domain associated to a value constrained leaf; if the check succeeds we need to perform an alignment between the target list targetList_{qi+1} and the subsequent one by calling the function alignment() (Line 9). The rest of the function remains equal to the previous version. The function canSkipOrd() establishes if it is possible to perform a skip; the conditions under which a skip is safe are described in Section 4.3.2, and this function simply checks those conditions. The function consists of a main loop that analyzes each Di in increasing order of i. For each domain, we first check if it belongs to an element related to the targets only by a following-preceding relationship, or if the current document post-order value is greater than or equal to the post-order value of the current target for li; if the check succeeds we must return false because, in the former case, we have no information about the next matches for elements that are related with the target leaves only by following-preceding relationships and, since the preceding domains are not empty, useful matches for element qi could be found in the current document subtree. In the latter case the next match for li is a descendant of the current document element. If the check fails we need to check the state of the domain Di. If the domain is empty or the current document post-order value is greater than the maximum post-order value of its occurrences, we can safely perform a skip and we return true. The condition explained in Section 4.3.2 takes into consideration only empty domains; however, since in the main matching algorithm (see Figure B.7) domains are cleaned (Lines 16 to 24) only when a match occurs, a domain could be substantially though not physically empty, and this is the reason why we have introduced the second part of the condition. It has to be noted that there is no default return value: this is because at least the last domain is always empty (in the main matching algorithm, as soon as we insert an element in the last domain we generate all the corresponding answers and then we remove it). So, if we reach the last domain (all the preceding domains are non-empty and no return condition has been found), if the last leaf is value constrained and the next match for it is not a descendant of the current document element we return true, otherwise we return false.
Now let us analyze the target lists management functions (Figures B.9
and B.10). At the beginning of the new algorithm we retrieve through the
content indexes the lists of document leaves that satisfy the value conditions
specified by the query. In Section 4.3.2 we have introduced the definition of
alignment property that holds for the ordered twig matching; the functions that we are going to describe are used to align the lists and keep them aligned. As in the path matching algorithm, lists are accessed through an iterator pattern; for each managed list we hold the last accessed element (for list targetList_{li} the last accessed element is curTarget_{li}), which represents the current document target for the corresponding element. After the retrieval,

function adjustTargetsOrd(pre)
(0) if((curT argetl1 is null) ∧ isEmpty(Dl1 ))
(1) return false;
(2) while((curT argetl1 is not null) ∧ pre(curT argetl1 )<pre)
(3) if(hasNext(targetListl1 ))
(4) curT argetl1 ← getNext(targetListl1 );
(5) else
(6) curT argetl1 ← null;
(7) if(isEmpty(Dl1 ))
(8) return false;
(9) return alignment(pre,l1 );

function firstAlignment(pre)
(0) foreach targetListli in targetLists
(1) if(i = 1)
(2) minP re ← pre;
(3) else
(4) minP re ← pre(curT argetli−1 );
(5) do
(6) if(¬hasNext(targetListli ))
(7) return false;
(8) curT argetli ← getNext(targetListli );
(9) while(pre(curT argetli )<minP re)
(10) return true;

Figure B.9: Ordered Target list management functions (part 1)

lists are not aligned; in order to perform a first alignment we call the function firstAlignment(). It has to be noted that, instead of introducing a new function, we could perform the same task by initializing all the curTargets and then calling the function adjustTargetsOrd(), but since the first alignment is simpler than a normal adjustment (we do not have to check the domains because at the beginning they are all empty) we have chosen to develop an ad-hoc function. The function firstAlignment() moves the iterator of each list until it finds an element whose pre-order value is not smaller than minPre (loop from Lines 5 to 9). The minPre value is the first pre-order value scanned by the main algorithm (the value start returned by the getRange() function) for the first list, or the pre-order value of the target of the preceding list for the other lists. If we reach the end of a list, we return false (Line 7); if we reach the end of the main loop (i.e. we have a valid target for each element) we can return true (Line 10).
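As an illustration, the first alignment can be sketched in a few lines of Python over plain iterators (a toy rendering under our own conventions: each constrained leaf contributes an iterator of ascending pre-orders, given in query order):

def first_alignment(start, target_iters):
    """Returns the list of current targets (one pre-order per constrained leaf) or None
    if some list is exhausted before a consistent assignment is found."""
    cur_targets = []
    min_pre = start                      # first list: targets before start are useless
    for it in target_iters:
        t = next(it, None)
        while t is not None and t < min_pre:
            t = next(it, None)
        if t is None:
            return None
        cur_targets.append(t)
        min_pre = t                      # the next leaf's target must not precede this one
    return cur_targets

# two constrained leaves; their index hits in document (pre-order) order
print(first_alignment(4, [iter([5, 11, 20]), iter([3, 8, 14])]))   # [5, 8]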
At each step of the main algorithm we may need to update the targets,

function alignment(pre,li )
(0) if(@li+1 )
(1) return true;
(2) if(isEmpty(Dli ))
(3) minP re ← pre(curT argetli );
(4) else
(5) minP re ← pre(Dli [0]);
(6) if(minP re < pre)
(7) minP re ← pre;
(8) while((curT argetli+1 is not null) ∧ (pre(curT argetli+1 )<minP re))
(9) if(hasNext(targetListli+1 ))
(10) curT argetli+1 ← getNext(targetListli+1 );
(11) else
(12) curT argetli+1 ← null;
(13) if(isEmpty(Dli+1 ))
(14) return false;
(15) return alignment(pre,li+1 );

Figure B.10: Ordered Target list management functions (part 2)

this is made by calling the function adjustTargetsOrd(). If we do not have


a target for the first constrained leaf (i.e. during previous adjustments we reached the end of the target list) and the corresponding domain is empty (Line 0), we can return false (Line 1) because, independently of any other condition, we are sure that no other answer could be found in the input document. From Line 2 to Line 8 we update the current target for the first constrained leaf if needed (Line 4) and until we reach the end of the list; in the latter case we set the current target to null (Line 6) and, if the domain associated to the first constrained leaf is empty, we return false (Line 8) for the same reasons explained before. Finally we start the alignment for the subsequent
targets by calling the function alignment() and we return its result (Line
9).
The alignment between the target lists for elements li and li+1 is performed by the function alignment(). Basically the function is recursive: on Line 0, if we have required the alignment between the last constrained leaf and the subsequent one (which does not exist), we simply return true (Line 1). From Line 2 to Line 7 we establish the minimum pre-order value (minPre) that the target for leaf li+1 must assume, following the above definition and taking into account that some of the values could be undefined. From Line 8 to Line 14 we update the target for leaf li+1, if needed, as we described for the

MATCHING
isNeededUnord(pre,elem) Checks if insertion of given elem is needed
in the pre-th list (unordered version)
updateDescLists(pre,pos) Updates the pointers in the descendants of the
pre-th list, possibly propagating the update,
following the deletion of the pos-th element
from the pre-th list
TWIG QUERY NAVIGATION
firstChild(pre) Returns the pre-order of the first child of the
given twig node, -1 if the node is a leaf
firstLeaf() Returns the pre-order of the first leaf in twig
isLeaf(pre) Returns true if given twig node is a leaf, false
otherwise
parent(pre) Returns the pre-order of the parent of the given
twig node
siblings(pre) Returns the pre-orders of the siblings of the
given twig node
SOLUTION CONSTRUCTION
findSolsUnord(...) Recursively builds the unordered twig solutions
preVisit(...) Used by findSolsUnord to navigate domains
extendSols(...) Used by findSolsUnord to build the solutions

Table B.6: Unordered twig matching functions

adjustTargetsOrd() function. On Line 15 we recursively call the function


in order to perform the alignment between the target list for li+1 and the one
for li+2 and we return its result.

B.4 Unordered twig pattern matching


B.4.1 Standard version
In this section we will show the complete unordered twig matching algo-
rithm, commenting on the parts which differ from the ordered one discussed
in the previous section. Table B.6 shows a summary of the employed functions which have not already been introduced for the ordered case. In particular, the
upper part shows the new functions which interact with the main algo-
rithm in order to produce the matching results: isNeededUnord(), checking
whether a node insertion can be avoided, and updateDescLists(), updat-
ing domains after a deletion. Such functions are the unordered counter-
parts of the isNeededOrd() and updateRightLists() discussed in the or-

Input: query Q having signature sig(Q); rew(Q)


Output: ansQ (D)

algorithm UnorderedTwigMatch(Q)
(0) getRange(start, end);
(1) for each dk where k ranges from start to end
(2) for each h such that qh = dk in descending order
(3) for each Di where i ranges from 1 to n
(4) for each di in Di in ascending order
(5) if(post(di )<post(dk ) ∧ isCleanable(i,di ))
(6) pos ← index(Di ,di );
(7) delete(Di ,di );
(8) if(¬isLeaf(i))
(9) updateDescLists(i,pos);
(10) if(¬isEmpty(Dparent(h) ) ∧ isNeededUnord(h,dk ))
(11) push(Dh ,(post(dk ),pointerToLast(Dparent(h) )));
(12) if(isLeaf(h) ∧ noEmptyLists())
(13) findSolsUnord(firstLeaf(),-1, h, indexesList);
(14) for each Di where i ranges from 1 to n
(15) if(isEmpty(Di ) ∧ last(qi )<k)
(16) exit;

Figure B.11: Unordered twig matching algorithm

dered case. Further, additional twig query navigation functions are needed,
which are quite self-explanatory, and, since the solution construction is dif-
ferent from the ordered case, new functions are also needed in this respect
(findSolsUnord(), preVisit() and extendSols()). These functions will
be discussed later while explaining the solution construction algorithm.
The unordered twig matching algorithm is shown in Figure B.11. As in
the other two algorithms we analysed, we first try to delete nodes by means
of the post-order conditions, in particular POT2 and POT3, (Lines 3-9) and,
if a deletion is performed, we update the pointers in the subsequent lists, in
this case in the descendant ones (Line 9). Then, we work on the current node
(Lines 10-13), checking if an insertion is needed (condition PRU and POT1,
Line 10) and verifying if new solutions can be generated (Lines 12-13). This
time, condition PRO1 is not available and, thus, we do not delete a node
from the last stack after solution construction as in the other algorithms.
Finally, the algorithm exits whenever any stack Di is empty and no data
node matching with qi will be accessed (Lines 14-16).
The boolean function isCleanable() is the same as in the ordered case and
will not be further discussed. In this case, whenever a node di is deleted, the

function isNeededUnord(i,di )
(1) if(i = 1)
(2) return true;
(3) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di ))
(4) return false;
(5) return true;

procedure updateDescLists(i,pos)
(1) i ← firstChild(i);
(2) if(i 6= -1)
(3) children ← i ∪ siblings(i);
(4) for each h in children
(5) for each dh in Dh in descending order
(6) if(pointer(dh )<pos)
(7) break;
(8) if(pointer(dh )=1)
(9) for each dh from dh in descending order
(10) dP os ← index(Dh ,dh );
(11) delete(Dh ,dh );
(12) if(¬isLeaf(h))
(13) updateDescLists(h,dP os);
(14) break;
(15) else
(16) decreasePointer(dh );

Figure B.12: Unordered twig matching auxiliary functions

updateDescLists() function, shown in detail in Figure B.12, updates the


pointer of all the nodes pointing to di in order to make it point to the node
below di (Line 16 of the function). It basically works in the same manner
as updateRightLists() for the ordered case but, instead of updating the
pointers in the following domain, at each call it updates the pointers in all the child domains of the given one, possibly propagating the update to
the descendants.
As to the current node insertion (Lines 10-11 of the main algorithm),
all the considerations done for the ordered case are still true, but in this
case condition PRU is exploited instead of PRO2, making it sufficient to
only check Dparent(h) . Then, before inserting a new node (Lines 10-11 of the
main algorithm), we call the boolean function isNeededUnord(), which, like
isNeededOrd(), checks the condition shown in Condition POT1. In this case
the only relation required and, thus, the only one that has to be checked, is
the parent-child one (Line 3 of isNeededUnord()). Finally, as in the other

procedure findSolsUnord(h,prec,lastLeaf ,indexesList)


(1) if(isLeaf(h))
(2) extendSols (h,prec,0,lastLeaf ,indexesList)
(3) if(h>1)
(4) findSolsUnord(parent(h),h,lastLeaf ,indexesList);
(5) else
(6) extendSols (h,prec,1,lastLeaf ,indexesList);
(7) for each s in siblings(prec)
(8) preVisit(s,lastLeaf ,indexesList);
(9) if(h>1)
(10) findSolsUnord(parent(h),h,lastLeaf ,indexesList);

procedure preVisit(h,lastLeaf ,indexesList)


(1) extendSols (h,parent(h),-1,lastLeaf ,indexesList);
(2) h ← firstChild(h);
(3) if(h 6= -1)
(4) preVisit(h,lastLeaf ,indexesList);
(5) for each s in siblings(h)
(6) preVisit(s,lastLeaf ,indexesList);

Figure B.13: Unordered twig matching solution construction (part 1)

algorithms, we check if new solutions can be generated (following Lemma


4.16, Chapter 4) and, in this case, call (Line 13 of the main algorithm) the
recursive function findSolsUnord() implementing Theorem 4.6.

Lemma B.5 At each step j and for each query index i, the list Di is a subset of Σ^{j}_{i} containing only the data entries that cannot be deleted from Σ^{j}_{i}, i.e. it has the same content as ∆Σ^{j}_{i} when Lemmas 4.7, 4.8, 4.9, 4.13, and 4.14 have been applied.

The unordered twig matching solution construction, and all its required
functions, are shown in detail in Figure B.13 and B.14. In this case the
step-by-step backward construction behavior of the other cases would not
be the best choice, since the pointers are now from children to parents and
not from right to left domains. In this case the solution construction starts
from the first leaf (see the initial call at Line 13 of the main algorithm), then
navigates all the query nodes one by one, gradually checking and producing
all the answers by extending them with the nodes contained in the associ-
ated domains with the extendSols() function. The solutions are kept in
indexesList, which contains, for each of them, an index array pointing to
the domain nodes, as in the path and ordered matching case.

procedure extendSols(h,prec,dir,lastLeaf ,indexesList)


(1) indexesListOrig ← indexesList;
(2) for each index in |indexesListOrig|
(3) if(dir=0)
(4) for i = 1 to |Dh |
(5) if (lastLeaf =h) i ← |Dh |;
(6) index[h] ← i;
(7) put index in indexesList;
(8) else
(9) dprec ← Dprec .index[prec];
(10) if(dir>0)
(11) for each i = 1 to pointer(dprec )
(12) if(checkPostDir(Dh .i,dprec ))
(13) extend indexes in indexesList;
(14) else
(15) i ← |Dh |
(16) while(pointer(Dh .i)>=index[prec])
(17) if(checkPostDir(Dh .i,dprec ))
(18) extend indexes in indexesList;
(19) if (lastLeaf =h) break;
(20) i ← i-1;

Figure B.14: Unordered twig matching solution construction (part 2)

Let us examine the way in which the query domains are navigated in
order to build the solutions: Starting from the first leaf, findSolsUnord()
goes up the query twig by recursively calling itself up to the query root node
(Lines 4, 10 of its code), thus covering the left most path. For each of the
navigated nodes having right siblings, before navigating up it calls on each
of the siblings the preVisit() function (Line 8), which recursively explores
in pre-visit, from parent to child, the subtrees of the given nodes (Lines 4,6 of
its code). In this way, all the query nodes (domains) are covered and we move
from one domain to the other in the most suitable way w.r.t. the pointers
in the domain nodes and the solution construction. Indeed, first we go up
from a leaf to its parent, thus exploiting the available node pointers going in
the same direction; in this way we extend the current solution only with the
pointed node, which is a sort of upper bound, and the ones underlying it (the
same as in path and ordered construction) (Lines 10-13 of extendSols()).
Then, when we have reached a parent node, we go downward from it to its
other children, doing the opposite: Starting from the last node in the child
domain to be included in the solution, we extend the solutions with all the
nodes which point to the parent node, or to any node above it (Lines 15-20 of

extendSols()). In other words, in the “downward” solution extension the


parent node acts as a lower bound. The dir parameter actually codes the
direction in which to perform the extension: 1 if going up, -1 if going down. 0
is used to insert the first domain nodes in the solutions (no actual extension,
Line 2 of findSolsUnord(), Lines 3-7 of extendSols()). During the extensions, all the post-order checks defined in Lemma 4.3 are performed (Lines
12, 17). Finally, the parameter lastLeaf is the pre-order of the query node
which started solution construction: This is used at Line 19 of extendSols()
in order to limit the extension relative to this domain only to the last inserted
node. This is necessary since, as we said before, the node starting solution
construction is not deleted immediately after it, as in the other matching
algorithms, and therefore, without this check, duplicate solutions would be
generated.

Theorem B.3 For each data node dj, S = (s1, . . . , sn) ∈ ∆Uans^{j}_{Q}(D) iff the algorithm, by calling the function findSolsUnord(), generates the solution S.

B.4.2 Content-based index optimized version


The content-based optimized algorithm for unordered pattern matching (de-
tailed in Figure B.15) is based on the observations made in Section 4.3.3.
As we did for the ordered case we introduce the functions needed to adjust
current targets and to establish if it is possible to perform a skip. We have
also modified the function verifying if an element is needed in order to take target information into consideration (see Table B.7).
Since there are not many differences with respect to the previous cases, we
first briefly review the main matching algorithm and then we show the newly introduced functions. From Line 0 to Line 5 we perform the same operations
discussed for the ordered match algorithm (see Section B.3.2). Starting from
Line 6 we have the main scanning loop that reads the input document using
the previously computed range. At each step of the scan we first try to adjust
current targets in order to be coherent with the current document pre-order
value k (Line 7). If the adjustment fails we can stop the algorithm because
some target list is found to be empty and the relative domain is also empty
(i.e. no other match for the element could be found). From Line 9 to Line
12 we check if it is possible to perform a skip by repeatedly calling the function canSkipUnord(). If we can perform a skip, on Line 10 we update the current document pre-order value k with the approximate value of the first following of the current element (see Section B.2.2). For each update we need to re-adjust the current targets, so on Lines 11 and 12 we apply the same scheme applied

MATCHING
isNeededUnordCont(pre,elem) Checks if insertion of given elem is
needed in the pre-th list
considering the current targets
canSkipUnord(post) Returns true if, evaluating the
current document post-order value,
current targets and current domains,
it is possible to perform a skip
TARGET LIST MANAGEMENT
adjustTargetsUnord(pre) Adjusts the current targets depending on
the current document pre.
Returns true if the operation was completed
successfully

Table B.7: Target list management functions

on Lines 7 and 8. From Line 13 to 28 the algorithm works exactly as the


previous one, thus it will not be further discussed. It has to be noted that
on Line 21 we call the newly defined function isNeededUnordCont() instead of isNeededUnord().
Now we can examine the auxiliary functions used by the main algorithm.
In Figure B.16 we have the matching auxiliary functions. First we have the
modified version of isNeededUnord(), called isNeededUnordCont(). As for
the ordered case, all the differences are in the first two lines, where, if the i-th twig element has a reference target and the current document post-order value is smaller than that of the current target for the element, we can directly return false. The rest of the function is equal to the previous version and it will not be discussed further. The function canSkipUnord() simply calls the recursive function checkSkipUnord(), passing it the received post-order value and 1 as pre-order value, which means that the function will start the domain check from the root domain. Function checkSkipUnord() returns true if, analyzing the pre-th domain and possibly its child/descendant domains, a skip is considered safe; otherwise it returns false. It has to be noted that the function must be called on elements that have a reference target or are targets themselves (the root element always has a reference target as long as a target exists). On Line 0 we check if the current document post-order is smaller than that of the reference target of the current element; if so, we go ahead with the analysis, otherwise we directly return false (Line 15: the target for the current element is a descendant of the current document element). If the domain associated to the pre-th element is empty, or the current document

Input: query Q having signature sig(Q); rew(Q)


Output: ansQ (D)

algorithm UnorderedTwigMatchCont(Q)
(0) getRange(start, end);
(1) targetLeaves ← getValueConstrainedLeaves(Q);
(2) for each li in targetLeaves
(3) targetListli ← getMatchList(li ,getValue(li ),getCondition(li ));
(4) if(¬hasNext(targetListli ))
(5) exit;
(6) for each dk where k ranges from start to end
(7) if(¬adjustTargetsUnord(k))
(8) exit;
(9) while(canSkipUnord(post(dk )))
(10) k ← post(dk ) + 1;
(11) if(¬adjustTargetsUnord(k))
(12) exit;
(13) for each h such that qh = dk in descending order
(14) for each Di where i ranges from 1 to n
(15) for each di in Di in ascending order
(16) if(post(di )<post(dk ) ∧ isCleanable(i,di ))
(17) pos ← index(Di ,di );
(18) delete(Di ,di );
(19) if(¬isLeaf(i))
(20) updateDescLists(i,pos);
(21) if(¬isEmpty(Dparent(h) ) ∧ isNeededUnordCont(h,dk ))
(22) push(Dh ,(post(dk ),pointerToLast(Dparent(h) )));
(23) if(isLeaf(h) ∧ noEmptyLists())
(24) findSolsUnord(firstLeaf(),-1, h, indexesList);
(25) for each Di where i ranges from 1 to n
(27) if(isEmpty(Di ) ∧ last(qi )<k)
(28) exit;

Figure B.15: Content index optimized unordered twig matching algorithm

post-order value is greater than the maximum post-order value in the pre-th domain (recall that, since domains are cleaned only when we find a match, a domain could be substantially though not physically empty; see Section 4.3.3 for the skipping policy), we can directly return true (Line 2). If the domain is not empty we need to verify what kind of children the current element has and possibly call the function again on them. If the current element has at least one child that is related to the targets only by a following-

function isNeededUnordCont(i,di )
(0) if(hasAReferenceTarget(i) ∧ post(di ) < post(getTarget(i)))
(1) return false;
(2) if(i = 1)
(3) return true;
(4) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di ))
(5) return false;
(6) return true;

function canSkipUnord(post)
(0) return checkSkipUnord(1,post);

function checkSkipUnord(pre,post)
(0) if(post < post(getTarget(pre)))
(1) if( isEmpty(Dpre ) ∨
(hasAReferenceTarget(pre) ∧ post >maxPost(Dpre )) )
(2) return true;
(3) else
(4) j ← firstChild(pre);
(5) if(j = -1)
(6) return true;
(7) children ← j ∪ siblings(j);
(8) for each qi in children
(9) if(¬(hasAReferenceTarget(i) ∨ isContrainedLeaf(qi )))
(10) return false;
(11) if(¬checkSkipUnord(i,post))
(12) return false;
(13) return true;
(14) else
(15) return false;

Figure B.16: Content index optimized unordered twig matching algorithm


auxiliary functions

preceding relationship (Line 9; for those elements we have no information about the next matches and, since the parent domain is not empty, useful matches could be found in the subtree of the current document element) or at least one check over its children fails (Line 11), we return false (Lines 10 and 12); otherwise we can safely return true (Line 13).
Finally we have the target list management functions (see Figure B.17).
Since there is no order constraint, no alignment property exists,
so the only target management function is adjustTargetsUnord(). This

function adjustTargetsUnord(pre)
(0) for each li in targetLeaves
(1) if((curT argetli is null) ∧ isEmpty(Dli ))
(2) return false;
(3) while((curT argetli is not null) ∧ pre(curT argetli )<pre)
(4) if(hasNext(targetListli ))
(5) curT argetli ← getNext(targetListli );
(6) else
(7) curT argetli ← null;
(8) if(isEmpty(Dli ))
(9) return false;
(10)return true;

Figure B.17: Unordered Target list management functions

function adjusts the current targets in order to be coherent with the current document pre-order value. The function consists of a main loop (from Line 0 to Line 9) that goes through all the targets, trying to update them if needed. If at least one target is found to be null (which means that we previously reached the end of its target list) and its domain is found to be empty (Line 1), then we return false (Line 2) because no other match could be found in the current document. Otherwise we update each target as needed, until we find a target that has a pre-order value greater than or equal to the current document one (loop from Line 3 to Line 9). If during the search of a new target we
reach the end of the relative target list we set that target to null (Line 7) and
if the relative domain is empty we return false for the same reasons explained
before. Finally if we have successfully updated all the targets we can return
true (Line 10).

B.5 Sequential scan range filters


In this section we give more details on the solutions currently used to delimit the portion of the document that has to be scanned. The scan range is actually computed by two different kinds of filters. Each filter outputs a rough range, and the final range (used by the algorithms) is obtained by intersecting those ranges.

B.5.1 Basic filter


The first kind of filter operates starting from the observations made in Sections 4.4 and B.1. Thanks to Condition PRO2 we can safely start the sequential

scan from the first occurrence of q1 (first(q1)), while to establish the right limit of the range we need to separate the ordered (and path) case from the unordered one. In the former case, due to Lemma 4.15, we can stop the scan
with the last occurrence of qn (last(qn )) whereas, in the latter case, Lemma
4.16 suggests to stop with the maximum value among last(ql ) for each leaf
l in the query.
During the computation of the range we can also make some checks that could identify an empty answer space: for each element of the query we can compute a specific range that represents the part of the document where occurrences of that element can be found, that is [first(qi), last(qi)]. If some twig element specifies a value condition and we have a content index for those elements, we can limit the specific range for an element qi to [firstV(qi), lastV(qi)], where firstV(qi) and lastV(qi) return the first and the last pre-order value for element qi in the document satisfying the specified value condition, respectively.
By analyzing the specific ranges we can immediately conclude whether a document cannot contain a solution. Again, we need to analyze the ordered and unordered cases separately. Ordered matching requires by definition (see Lemma 4.1) that answer elements are totally ordered by their pre-order values, so subsequent elements in the query must be subsequent in the answer. If the specific range of an element qj ends before the beginning of the specific range of qi with i < j, the document cannot contain any answer. For each query node qi with i ∈ [1, n − 1] we need to perform the described check with any node qj with j ∈ [i + 1, n]. It is obvious that these checks represent a necessary condition but not a sufficient one.
Unordered matching requires only a partial order between answer node pre-orders, so the checks related to a node qi are performed only with the nodes qj with qj ∈ descendants(qi).
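The following minimal sketch summarizes the basic filter; the dictionaries first and last are hypothetical stand-ins for the first(·)/last(·) occurrence information, so this is only an illustration of the checks above and not the actual implementation.

# Illustrative sketch of the basic filter: scan range computation plus the
# ordered-case compatibility check over the specific ranges [first(qi), last(qi)].
def basic_scan_range(first, last, query_nodes, leaves, ordered=True):
    # Returns (start, end) of the sequential scan, or None if some node never occurs.
    if any(q not in first for q in query_nodes):
        return None
    start = first[query_nodes[0]]                 # first occurrence of q1 (condition PRO2)
    if ordered:                                   # ordered and path cases (Lemma 4.15)
        end = last[query_nodes[-1]]
    else:                                         # unordered case (Lemma 4.16)
        end = max(last[q] for q in leaves)
    return (start, end)

def ordered_ranges_compatible(first, last, query_nodes):
    # Necessary condition: the range of qj must not end before the range of qi, i < j.
    for i, qi in enumerate(query_nodes[:-1]):
        for qj in query_nodes[i + 1:]:
            if last[qj] < first[qi]:
                return False                      # empty answer space
    return True

# Hypothetical occurrence data for a three-node query q1, q2, q3.
first = {"q1": 0, "q2": 3, "q3": 7}
last = {"q1": 40, "q2": 35, "q3": 39}
print(basic_scan_range(first, last, ["q1", "q2", "q3"], leaves=["q2", "q3"]))
print(ordered_ranges_compatible(first, last, ["q1", "q2", "q3"]))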

B.5.2 Content-based filter


As the name suggests, this filter can be used only if the query specifies some value condition and if we have a content-based index built on the nodes that specify those conditions. The basic idea of this filter is that, given an element de that satisfies the value condition specified by the query, if the document contains a solution involving de then this solution lies in the subtree rooted at the furthest ancestor of de matching q1 (i.e. the one with the smallest pre-order value; we call this ancestor d1e). Given a query node qj that specifies a value condition and has a content-based index built on it, we can obtain from the index all the document elements that satisfy the specified condition. For each of these elements de we can identify a range [d1e, ff(d1e)−1]

and, considering all these ranges, we can identify a single range for the query node qj, that is [minStart(qj), maxEnd(qj)], where minStart(qj) represents the minimum start (the start with the smallest pre-order) among the previous ranges, while maxEnd(qj) represents the maximum end (the end with the greatest pre-order) among the previous ranges. The second kind of filter computes such a range for all nodes qj that satisfy the conditions explained before and then returns a range obtained from the intersection of these ranges with the whole data tree range.
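A possible rendering of this second filter is sketched below; the content index lookup, the ancestor computation and ff(·) (the first following node) are mocked with plain dictionaries and functions, so the sketch only illustrates the range construction, not the system's actual index interface.

# Illustrative sketch of the content-based filter: for each element d_e returned by
# the content index for a value-constrained node q_j, the candidate range is
# [d1_e, ff(d1_e)-1]; the node range is [minStart(q_j), maxEnd(q_j)] and the final
# range is the intersection of all such ranges with the whole data tree range.
def node_filter_range(matching_elements, furthest_q1_ancestor, first_following):
    ranges = []
    for de in matching_elements:
        d1e = furthest_q1_ancestor(de)            # smallest pre-order ancestor matching q1
        if d1e is not None:
            ranges.append((d1e, first_following(d1e) - 1))
    if not ranges:
        return None                               # empty answer space
    return (min(r[0] for r in ranges), max(r[1] for r in ranges))

def intersect(r1, r2):
    if r1 is None or r2 is None:
        return None
    start, end = max(r1[0], r2[0]), min(r1[1], r2[1])
    return (start, end) if start <= end else None

# Hypothetical data: elements 12 and 25 satisfy the value condition on q_j.
ancestors = {12: 10, 25: 20}                      # d_e -> d1_e
following = {10: 18, 20: 30}                      # d1_e -> ff(d1_e)
r = node_filter_range([12, 25], ancestors.get, lambda d: following[d])
print(r, intersect(r, (0, 100)))                  # (10, 29) intersected with the tree range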
Appendix C

Proofs

C.1 Proofs of Chapter 1

Proof of Proposition 1.4. Let us start from the threshold of the count filter of Prop. 1.1, max(|σ1|, |σ2|) − 1 − (d − 1) ∗ q, let us substitute max(|σ(σ1)|, |σ(σ2)|) with the minimum length minL, and let us subtract the number of qsub-grams with wild cards from the total count, i.e. (qsub − 1) ∗ 2. Then, the formula is minL − 1 − (d − 1) ∗ q − (q − 1) ∗ 2, which is equivalent to minL + 1 − (d + 1) ∗ q. ¤
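For completeness, the algebraic simplification in the last step can be expanded as:

\[
minL - 1 - (d-1)q - 2(q-1) \;=\; minL - 1 - dq + q - 2q + 2 \;=\; minL + 1 - (d+1)q .
\]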

Proof of Theorem 1.1. First, notice that showing Theorem 1.1 is equivalent to proving the following statement: if extPosFilter(σ1, σ2, minL, d) returns FALSE, then there does not exist a pair (σ1[i1 . . . j1] ∈ σ1, σ2[i2 . . . j2] ∈ σ2) such that (j1 − i1 + 1) ≥ minL, (j2 − i2 + 1) ≥ minL and ed(σ1[i1 . . . j1], σ2[i2 . . . j2]) ≤ d. Let extPosFilter(σ1, σ2, minL, d) return FALSE; then there are two alternatives:

1. There is no common term between the two sequences.

In this case, it is obvious that no pair (σ1[i1 . . . j1] ∈ σ1, σ2[i2 . . . j2] ∈ σ2) exists such that (j1 − i1 + 1) ≥ minL, (j2 − i2 + 1) ≥ minL and ed(σ1[i1 . . . j1], σ2[i2 . . . j2]) ≤ d.

2. For each common term σ1[p1] = σ2[p2] the corresponding counter has the following value: σ1c[p1] < c.

In this second case, if we show that for each common term σ1[p1] = σ2[p2], if σ1c[p1] < c then ed(σ1[p1 − w + 1 . . . p1], σ2[p2 − w + 1 . . . p2]) > d, then it is obvious that no pair (σ1[i1 . . . j1] ∈ σ1, σ2[i2 . . . j2] ∈ σ2) exists such that (j1 − i1 + 1) ≥ minL, (j2 − i2 + 1) ≥ minL and ed(σ1[i1 . . . j1], σ2[i2 . . . j2]) ≤ d.

Let w = minL be the window size, and let w1 ≡ [p1 − w + 1 . . . p1], w2 ≡ [p2 − w + 1 . . . p2] be the two windows.
Given the set PP of position pairs (k1, k2) in the two windows (k1 ∈ w1 and k2 ∈ w2) whose corresponding terms in the two sequences are equal (σ1[k1] = σ2[k2]), let us introduce three subsets of PP with the following properties:

• PP1 is obtained from PP by pruning out the pairs for which another pair in PP exists with the same position in σ2: ∀(k1, k2) ∈ PP1 : ∄(k1′, k2′) ∈ PP1 : k2 = k2′.

• PP2 is a subset of PP1 with the following additional property: ∀(k1, k2) ∈ PP2 : ∄(k1′, k2′) ∈ PP2 : k1 = k1′.

• PP3 is the totally ordered set obtained from PP2, i.e. satisfying the following additional property: ∀(k1, k2), (k1′, k2′) : if k1 < k1′ then k2 < k2′.

From the edit distance definition and from the fact that PP3 contains the ordered and non-repeated equal terms, it directly follows that the PP3 cardinality satisfies:

w − ed(σ1(w1), σ2(w2)) = card(PP3) ≤ card(PP2) ≤ card(PP1) = σ1c[p1] < c

where the two inequalities follow from PP3 ⊆ PP2 and PP2 ⊆ PP1, respectively, and card(PP1) = σ1c[p1] < c from the filter definition and by hypothesis. Then, w − ed(σ1(w1), σ2(w2)) < c. But, since c = w − d, then w − ed(σ1(w1), σ2(w2)) < w − d. It follows that ed(σ1(w1), σ2(w2)) > d. ¤
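The counting bound exploited in this proof can be checked mechanically on small instances. The sketch below is purely illustrative (it is not the extPosFilter implementation of Chapter 1): it computes card(PP1) for two windows of terms together with their edit distance, and verifies that card(PP1) < c = w − d indeed forces ed > d.

# Illustrative check of the bound w - ed <= card(PP1) used in the proof:
# whenever the window counter is below c = w - d, the edit distance exceeds d.
def edit_distance(a, b):
    # Standard dynamic-programming edit distance between two term sequences.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return dp[n]

def card_pp1(w1, w2):
    # card(PP1): positions of w2 whose term also occurs somewhere in w1
    # (each position of w2 is counted at most once, as in the definition of PP1).
    terms1 = set(w1)
    return sum(1 for t in w2 if t in terms1)

w, d = 6, 2
w1 = "the cat sat on the mat".split()
w2 = "a dog ran over my hat".split()
count, ed = card_pp1(w1, w2), edit_distance(w1, w2)
# The proof guarantees: count < w - d  implies  ed > d.
print(count, ed, (count >= w - d) or (ed > d))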

C.2 Proofs of Chapter 3

Proof of Theorem 3.1. First notice that âSim(Di, Dj) ≥ aSim(Di, Dj) and âSim(Dj, Di) ≥ aSim(Dj, Di), that is, the two approximations are upper bounds of the corresponding asymmetric document similarity values. Indeed, the set {(c_i^k, c_j^h) | k ∈ [1, n], h ∈ [1, m]} obviously contains the set of pairs given by the permutation pm maximizing the similarity: {(c_i^k, c_j^{pm(k)}) | k ∈ [1, n]}.

Furthermore, we state that either aSim(Di, Dj) ≤ Sim(Di, Dj) ≤ aSim(Dj, Di) or the other way round, that is aSim(Dj, Di) ≤ Sim(Di, Dj) ≤ aSim(Di, Dj). Indeed, from Eq. 3.1, it follows that Sim(Di, Dj) is equal to

\[
Sim(D_i, D_j) \;=\; \frac{\overbrace{\sum_{k=1}^{n} |c_i^k| \cdot sim(c_i^k, c_j^{p_m(k)})}^{\alpha} \;+\; \overbrace{\sum_{k=1}^{n} |c_j^{p_m(k)}| \cdot sim(c_i^k, c_j^{p_m(k)})}^{\beta}}{\underbrace{\sum_{k=1}^{n} |c_i^k|}_{\gamma} \;+\; \underbrace{\sum_{h=1}^{m} |c_j^h|}_{\delta}}
\]

where α, β, γ, and δ are positive values. Moreover, from Eq. 3.6, we have aSim(Di, Dj) = α/γ and aSim(Dj, Di) = β/δ. Obviously, either aSim(Di, Dj) ≤ aSim(Dj, Di) or aSim(Di, Dj) ≥ aSim(Dj, Di). Let us suppose that aSim(Di, Dj) ≤ aSim(Dj, Di); then α/γ ≤ β/δ ⇒ α/γ ≤ (α + β)/(γ + δ) ≤ β/δ, that is aSim(Di, Dj) ≤ Sim(Di, Dj) ≤ aSim(Dj, Di). As to the second inequality, notice that (α + β)/(γ + δ) ≤ β/δ ⇔ (α + β)δ ≤ (γ + δ)β ⇔ (αδ + βδ) ≤ (βγ + βδ) ⇔ αδ ≤ βγ, which is true since α/γ ≤ β/δ. In the same way, if aSim(Di, Dj) ≥ aSim(Dj, Di) then aSim(Dj, Di) ≤ Sim(Di, Dj) ≤ aSim(Di, Dj).
It follows that if aSim(Di, Dj) ≤ aSim(Dj, Di) then aSim(Di, Dj) ≤ Sim(Di, Dj) ≤ aSim(Dj, Di) ≤ âSim(Dj, Di), whereas if aSim(Di, Dj) ≥ aSim(Dj, Di) then aSim(Dj, Di) ≤ Sim(Di, Dj) ≤ aSim(Di, Dj) ≤ âSim(Di, Dj), from which the statements of the theorem follow. ¤
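As a purely illustrative numeric instance of the mediant inequality exploited above (the values of α, β, γ, δ are arbitrary):

\[
\frac{\alpha}{\gamma} = \frac{1}{2} \;\le\; \frac{\alpha+\beta}{\gamma+\delta} = \frac{1+3}{2+4} = \frac{2}{3} \;\le\; \frac{\beta}{\delta} = \frac{3}{4} .
\]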

C.3 Proofs of Chapter 4

Proof of Lemma 4.1. Because the index i increases according to the pre-order sequence, node i + j must be either a descendant or a following node of node i. If post(qi+j) < post(qi), the node i + j in the query is a descendant of the node i, thus also post(dsi+j) < post(dsi) is required. Analogously, if post(qi+j) > post(qi), the node i + j in the query is a following node of i, thus also post(dsi+j) > post(dsi) must hold. ¤
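Since all the proofs of this section reason on pre-order and post-order ranks, the following minimal sketch (illustrative only, unrelated to the tree signature implementation) shows how the two ranks can be assigned and how the descendant/following/ancestor/preceding relationships used above are read off them.

# Illustrative pre/post-order numbering of a small tree and the structural
# relationships exploited by the proofs of Chapter 4.
def number_tree(tree, root):
    # tree: {node: [children]}. Returns {node: (pre, post)}.
    pre, post, counters = {}, {}, {"pre": 0, "post": 0}
    def visit(node):
        pre[node] = counters["pre"]; counters["pre"] += 1
        for child in tree.get(node, []):
            visit(child)
        post[node] = counters["post"]; counters["post"] += 1
    visit(root)
    return {n: (pre[n], post[n]) for n in pre}

def relationship(u, v, ranks):
    # Classify u with respect to v using only the (pre, post) ranks.
    (pre_u, post_u), (pre_v, post_v) = ranks[u], ranks[v]
    if pre_u > pre_v and post_u < post_v:
        return "descendant"
    if pre_u > pre_v and post_u > post_v:
        return "following"
    if pre_u < pre_v and post_u > post_v:
        return "ancestor"
    return "preceding"

tree = {"a": ["b", "e"], "b": ["c", "d"]}           # a is the root
ranks = number_tree(tree, "a")
print(ranks)                                        # {'a': (0, 4), 'b': (1, 2), ...}
print(relationship("c", "b", ranks))                # descendant
print(relationship("e", "b", ranks))                # following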

Proof of Lemma 4.5. Let us suppose that j ∈ [k, m] exists such that S = (s1, . . . , sn) ∈ ∆ansjQ(D) and sh = k. Notice that it should be si < k, but ∆Σki = ∅ and thus no index si exists such that dsi = qi and si < k. ¤

Proof of Theorem 4.1. The proof is ab absurdo. In particular, we will show that the following four facts together lead to a contradiction:

1. Let dk′ be a data node with k′ ≤ k and dk′ = qh′; then k′ does not belong to ∆ansjQ(D), j ∈ [k, m], because for each (s1, . . . , sh′−1) ∈ ∆Σj1 × . . . × ∆Σjh′−1, either i exists such that si > k′ or, given that (s1, . . . , sh′−1) belongs to an answer S and k belongs to it too, S can not belong to ∆ansjQ;

2. for each i ∈ [1, h′ − 1], at step k′, ∆Σk′i ≠ ∅ (condition of Lemma 4.5);

3. for each i ∈ [1, h′ − 1], at step k, ∆Σk′i ∩ ∆Σki ≠ ∅ (condition of Lemma 4.6);

4. h′ ≠ n (condition of Lemma 4.4).

Indeed, whenever facts 2, 3, and 4 are true, a tuple (s1, . . . , sh′−1) ∈ ∆Σk1 × . . . × ∆Σkh′−1 exists such that, for each i, si < k′. As, for each i < h′, ∆Σk′i ≠ ∅ and ∆Σk′i ∩ ∆Σki ≠ ∅, then si ∈ ∆Σki exists such that si < k′. Moreover, h′ ≠ n and thus, for any solution S = (s1, . . . , sh′, sh′+1, . . . , sn) such that sh′ = k′, it must be sh′+1 > k′, . . . , sn > k′, that is S ∉ anskQ(D). Therefore, without any knowledge about the data nodes following k′ in the sequential scanning, S could belong to ∆ansjQ. ¤

Proof of Lemma 4.10. Obviously, due to the premise, post(dj) < post(dj′) when j′ = k. As to j′ ∈ [k + 1, m], let us consider the two possible alternatives: either post(dk) < post(dj′) or post(dk) > post(dj′). In the former case, it easily follows that post(dj) < post(dj′). The latter case means that dk is an ancestor of dj′ since k < j′. Moreover, dj is a preceding node of dk since post(dj) < post(dk) and j < k. It follows that dj is a preceding node of dj′ and thus post(dj) < post(dj′). ¤

Proof of Lemma 4.11. Let us suppose that j ∈ [k, m] exists such that S = (s1, . . . , sn) ∈ ∆ansjQ(D), with si ∈ S. Notice that (s1, . . . , sn) ∈ ∆ansjQ(D) iff j exists such that sj′ > k for each j′ ≥ j. Let us consider sj. Being si < sj, S is a solution iff post(dsi) > post(dsj). But this is impossible since post(dsi) < post(dk) and, from Lemma 4.10, post(dsi) < post(dsj). ¤

Proof of Lemma 4.12. Notice that, due to Lemma 4.11, each data node in ∆Σ_i^{k−1}, for i ∈ [1, h − 1], can be deleted from its domain. In this way, each ∆Σki is empty and thus, due to Lemma 4.5, k can be deleted from ∆Σjh. ¤

Proof of Lemma 4.13. Let us suppose that j ∈ [k, m] exists such that S = (s1, . . . , sn) ∈ ∆(U)ansjQ(D) with si ∈ S, and that ī ∈ [1, n] exists such that post(qī) < post(qi). Notice that, being post(qī) < post(qi), the node matching with qī, i.e. the one with pre-order value equal to sī, both in the case of ordered and unordered matching must satisfy the following property: post(dsi) > post(dsī). On the other hand, whenever ∆Σkī = ∅ or, for each s ∈ ∆Σkī, post(ds) > post(dsi), the condition post(dsi) > post(dsī) can be satisfied iff sī > k. But, being post(dsi) < post(dk), due to Lemma 4.10 post(dsi) < post(ds) for each s > k and thus also for s = sī. Thus the above condition can not be satisfied. ¤

Proof of Lemma 4.14. First notice that, for each qi with i ∈ [2, n], post(q1) > post(qi) since q1 is the root of the twig pattern. Thus, for each j ∈ [k, m] and for each solution S = (s1, s2, . . . , sn) ∈ ∆(U)ansjQ(D) involving s1, it should be post(ds1) > post(dsi). On the other hand, being post(ds1) < post(dk), from Lemma 4.10 it follows that post(ds1) < post(ds) for each s > k, and S ∈ ∆(U)ansjQ(D) iff at least one si is greater than k. ¤

Proof of Theorem 4.3. For a data node si ∈ ∆Σki, whenever post(dsi) > post(dk) there is no way to predict the relationship between the post-order of the data node si and that of the nodes after k. On the other hand, whenever post(dsi) < post(dk), we will show that the following two facts together lead to a contradiction:

1. si does not belong to ∆ansjQ(D), j ∈ [k, m], because for each (si+1, . . . , sn) ∈ ∆Σji+1 × . . . × ∆Σjn, i′ exists such that it is required that post(dsi′) < post(dsi) but post(dsi′) > post(dsi);

2. i ≠ 1 and, for each ī ∈ [1, n] such that post(qī) < post(qi), ∆Σkī ≠ ∅ and there exists s ∈ ∆Σkī such that s > k and post(ds) > post(dsi) (condition of Lemma 4.13).

Indeed, whenever the fact above is true, (si+1, . . . , sn) ∈ ∆Σji+1 × . . . × ∆Σjn exists such that, for each i′ > i such that post(qi′) < post(qi), si′ > si and post(dsi′) < post(dsi). Indeed, the data nodes of such a partial solution can be the ones specified in the fact above. ¤

Proof of Lemma 4.16. Let us suppose that S = (s1, . . . , sh, . . . , sn) exists such that S ∈ ∆anskQ, with sh = k. Moreover, by hypothesis, i > h exists such that post(qh) > post(qi). Thus S is a solution only if si > k, that is, si must be accessed after k in the sequential scanning. For this reason S can not belong to ∆anskQ. ¤

Proof of Theorem 4.4. The set of answers ∆anskP(D) is a subset of the cartesian product ∆Σk1 × . . . × ∆Σkn as, by applying Lemmas 4.4, 4.5, 4.6, and 4.11, we never delete useful data nodes.
Given that premise, we have to show that if (s1, . . . , sn) ∈ ∆anskP(D) then si ∈ ∆Σ_i^{s_{i+1}} for each i ∈ [1, n − 1], and vice versa. If (s1, . . . , sn) ∈ ∆anskP(D) then s1 < . . . < sn and thus si must be processed before si+1 in the sequential scanning, that is si ∈ ∆Σ_i^{s_{i+1}}. The other way around, we have to show that if si ∈ ∆Σ_i^{s_{i+1}} for each i ∈ [1, n − 1] then (s1, . . . , sn) ∈ ∆anskP(D). We first show that (s1, . . . , sn) is a solution. Notice that (s1, . . . , sn) is a solution as s1 < s2 < . . . < sn, because si ∈ ∆Σ_i^{s_{i+1}}, and, for each i ∈ [1, n − 1], post(dsi) > post(dsi+1) since, due to Lemma 4.11, we have that post(dsi) > post(dj) for each j such that si < j ≤ k, being si ∈ ∆Σki. Finally, we show that (s1, . . . , sn) ∉ ans_P^{k−1}(D). This is true as sn ∈ ∆Σkn implies, due to Lemma 4.4, that k = sn, and thus it straightforwardly follows that sn ∉ Σ_n^{k−1}, which is one of the domains of ans_P^{k−1}(D). ¤

Proof of Theorem 4.5. The set of index tuples specified in the theorem is a subset of the cartesian product ∆Σk1 × . . . × ∆Σkn as, by applying Lemmas 4.4, 4.5, 4.6, 4.9, 4.13, and 4.14, we never delete useful data nodes. Moreover, any index tuple (s1, . . . , sn) which satisfies conditions 1 and 2 is a solution because, as in the path case, condition 1 implies that s1 < s2 < . . . < sn, whereas condition 2 explicitly requires that the relationships between post-orders are satisfied. Finally, the fact that (s1, . . . , sn) ∉ ans_Q^{k−1}(D) is shown exactly as in the path case. ¤

Proof of Theorem 4.6. The set of index tuples specified in the theorem is a subset of the cartesian product ∆Σk1 × . . . × ∆Σkn as, by applying Lemmas 4.7, 4.8, 4.9, 4.13, and 4.14, we never delete useful data nodes. Moreover, any index tuple (s1, . . . , sn) which satisfies conditions 1 and 2 is a solution, and the proof is similar to the above ones. ¤

C.4 Proofs of Appendix B

Proof of Lemma B.1. Let k′ ∈ Di and let top(Di) = k″; then k′ < k″ as k″ is at the top of Di and the data signature is scanned in a sequential way. Then there are two alternatives: either post(dk′) > post(dk″) or post(dk′) < post(dk″). In the first case, it straightforwardly follows that post(dk′) > post(dk) as, due to the premise, post(dk″) > post(dk), whereas the second case is impossible as, when k″ was added to Di, the algorithm should have deleted k′ from Di (see Line 5). ¤

Proof of Lemma B.2. Let us suppose that ∆Σj′i = ∅ with k ≤ j′ ≤ j; then, being k ≤ j′ and j′ ≤ j, it easily follows that ∆Σji ∩ ∆Σki = ∅.
As far as the opposite direction is involved, we show that if, for each step j′ with k ≤ j′ ≤ j, we have that ∆Σj′i ≠ ∅, then ∆Σji ∩ ∆Σki ≠ ∅. The proof is by induction. When j = k and ∆Σki ≠ ∅, then ∆Σki ∩ ∆Σki = ∆Σki ≠ ∅. Let us suppose that the statement is true for j = r; we show it for j = r + 1. As, if for each step j′ with k ≤ j′ ≤ r, ∆Σj′i ≠ ∅ then ∆Σri ∩ ∆Σki ≠ ∅, let us suppose that ∆Σri ∩ ∆Σki = (i1, . . . , in). We have to show that if, for each step j′ with k ≤ j′ ≤ (r + 1), ∆Σj′i ≠ ∅, then ∆Σ_i^{r+1} ∩ ∆Σki ≠ ∅. Notice that ∆Σ_i^{r+1} ∩ ∆Σki = ∅ iff at step r + 1 we delete the index set (i1, . . . , in) from ∆Σ_i^{r+1}. But, as such domains are stacks and k ≤ (r + 1), (i1, . . . , in) are at the bottom of the stack, and the deletion of (i1, . . . , in) implies the deletion of all the data nodes in ∆Σ_i^{r+1}, which is impossible because ∆Σ_i^{r+1} ≠ ∅. ¤

Proof of Lemma B.3. First, notice that Di ⊆ Σji as, for each data node dj, at Line 10 the algorithm adds dj to the right stack. Moreover, the algorithm deletes some indexes from Di by means of the pop and empty operators. If k can not belong to ∆Σji due to its pre-order value, this can be due to one of the three possible alternatives shown in Theorem 4.1. The algorithm detects all these conditions and deletes k from Di. At Lines 6-8, assuming that Di = ∆Σji, due to Lemma 4.6 it deletes all the nodes in Di+1 ∪ . . . ∪ Dn. Indeed, in Lemma B.2 it has been shown that ∆Σji must be empty in order that ∆Σki ∩ ∆Σj′i = ∅, for j′ ≥ j. Thus, at the j-th step we delete all the “unnecessary” data nodes. At Lines 9-10, it applies Lemma 4.5. In particular, as whenever a stack is empty we empty all the stacks at “its right”, it is sufficient to check stack Dh−1. Finally, at Line 13, it applies Lemma 4.4. Moreover, the algorithm deletes k from Di due to its post-order value at Lines 4-5, where it applies Lemma 4.11. Notice that, due to Lemma B.1, the algorithm correctly avoids going on with Di whenever post(top(Di)) > post(dj) as, in this case, no other data node can be deleted from that stack due to its post-order value. It follows that the algorithm never deletes from the stacks data nodes which could belong to a delta answer set ∆ansj′P(D) for j′ ∈ [j, m]. Thus, k ∈ Di at step j iff k ∈ ∆Σji. ¤

Proof of Theorem B.1. Due to Theorem 4.4, S = (s1, . . . , sn) belongs to ∆ansjP(D) iff, for each i ∈ [1, n], si ∈ ∆Σ_i^{s_{i+1}} and si ∈ ∆Σki. Lemma B.3 states that si ∈ ∆Σki iff si ∈ Di at step k. Moreover, the “chain” of pointers followed by the function showSolutions() allows us to state that such a function only generates those solutions S = (s1, . . . , sn) such that si ∈ ∆Σ_i^{s_{i+1}}. Indeed, whenever the algorithm adds a new data node to any stack Dh, it sets the pointer to the top of the “preceding” stack Dh−1 (Line 10). Thus, as the algorithm sequentially scans the data signature, for each data node si+1 in Di+1, the nodes from the bottom of Di up to the node pointed by si+1 are all those nodes matching qi and whose pre-order value is less than si+1, i.e. all those in Σ_i^{s_{i+1}}. ¤
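The pointer chaining described in this proof can be rendered by the following minimal sketch (illustrative only; it is not the showSolutions() code and it deliberately omits the domain maintenance performed during the scan): each entry pushed on a stack records the index of the current top of the preceding stack, and solutions are enumerated by walking these links backwards.

# Illustrative sketch of the pointer chaining behind showSolutions(): each entry
# pushed on stack D_h records the index of the current top of D_{h-1}; a value
# s_{i+1} in D_{i+1} is then compatible with every entry of D_i from the bottom up
# to (and including) the recorded one. Stack pruning/domain maintenance is omitted.
def push(stacks, h, pre):
    link = len(stacks[h - 1]) - 1 if h > 0 else -1    # -1 means "no predecessor"
    stacks[h].append((pre, link))

def solutions(stacks, h=None, upto=None, partial=()):
    if h is None:                                     # start from the last stack
        h, upto = len(stacks) - 1, len(stacks[-1]) - 1
    if h < 0:
        yield partial
        return
    for idx in range(upto + 1):                       # entries below the linked position
        pre, link = stacks[h][idx]
        yield from solutions(stacks, h - 1, link, (pre,) + partial)

stacks = [[], [], []]                                 # one stack per query node q1, q2, q3
push(stacks, 0, 1); push(stacks, 1, 2); push(stacks, 0, 4)
push(stacks, 1, 5); push(stacks, 2, 6)
print(list(solutions(stacks)))                        # [(1, 2, 6), (1, 5, 6), (4, 5, 6)]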
