
1 | CS 2201 - Data Structures Unit 4

UNIT 4 HASHING AND DISJOINT SET


1. Define Equivalence relation.
A relation R is defined on a set S if, for every pair of elements (a, b) in S,
a R b is either true or false.
An equivalence relation is a relation R that satisfies the following three properties:
i. a R a, for each element a in S (said to be Reflexive)
ii. a R b if and only if b R a (said to be Symmetric)
iii. a R b and b R c implies a R c (said to be Transitive)
Example: Electrical connectivity.

2. What are the basic operations that are performed on Disjoint Set ADT and specify the
data structures used for representing the Set ADT.
(i) The basic operations performed on Disjoint Set ADT are,
Union (x, y) - Performs a Union of the sets containing the two elements x and y
Find (x) - Returns a pointer to the set containing the element x
(ii) The Data structures that are used for representing the SET are,
Array
Linked List
Tree

3. What is path compression?
Path compression is performed during a Find ( ) operation on the Set ADT: every node on the
path from the element to the root is made to point directly to the root.
It speeds up later Find ( ) operations without reworking the data structure entirely.

4. Define Equivalence Classes. Specify the properties of Equivalence Classes.
Definition: The equivalence class of an element a (in S) is the subset of S that contains
all elements related to a.
Properties of Equivalence Classes
(i) Each element must belong to exactly one equivalence class.
(ii) All equivalence classes are mutually disjoint.

5. Define hashing.
Hashing is the transformation of a given key (integer, real or string) into a shorter fixed
length value (called hash value or location) that represents the original key.
Hashing is used to index and retrieve items in a database because it is faster to find the
item using the short hashed key than to find it using the original value.


6. What is a hash collision? List the different collision resolving techniques?
Collision: When the hash function maps two different keys to the same hash location (value)
in the hash table, it is termed a hash collision.
The hash collision resolving techniques are
(i) Separate chaining or External hashing
(ii) Open addressing or Closed hashing

7. What do you mean by Open addressed hashing system?
In an Open addressing hashing system, when a collision occurs, alternative cells are tried
until an empty cell is found. The cells h_i(x), h_{i+1}(x), h_{i+2}(x), ... are tried in succession.
The Hash function of the Open addressed hashing system is
h_i (key) = (Hash (key) + F(i)) % TABLESIZE, where F(i) is the collision function.

8. What is Extendible Hashing System?
Extendible hashing is a hash system which treats a hash value as a bit string and uses a
trie for bucket lookup. It is hierarchical in nature, so re-hashing is an incremental operation
that can be performed one bucket at a time, as needed.

9. What is a hash table?
The hash table data structure is merely an array of some fixed size, containing the keys.
Each key is mapped into some number (called hash location or value) in the range from
0 to TABLESIZE - 1 and placed in the appropriate cell.

10. List the applications of SET and Hashing systems.
Applications of Disjoint Set ADT:
i. Connected components algorithm
ii. Minimum spanning tree algorithm
iii. Maze construction & Puzzles and games.
Applications of Hashing System:
a. Symbol table management in Compilers
b. Online Spell Checkers and Dictionary System
c. Graph theory
d. Error detection in computer networks
e. Puzzles & Gaming




PART B
1. Define Hash function. Write routines to find and insert an element in the Separate
Chaining hash system. (8)
Hash Function:
A hash function h (Key) is a key to address transformation (maps a key value into hash
value or location) which acts upon a given key to compute the relative position of the key
in an array called Hash table.
Properties of Hash function
The hash function should be simple and it must distribute the data evenly.
It should minimize the number of hash collisions.
Separate Chaining Hash System:
Separate Chaining is a common hash collision resolving method which keeps a linked
list of all the Key values that hash into the same hash location.
The Hash table entries act as the HEAD of each linked list.
Find Operation:
To perform a search, a hashing function is used to determine the linked list to traverse.
Traverse the linked list in normal manner and return the position where the element is
found.
Insertion
To perform insertion of an element, traverse down the appropriate list to check whether
the element is already in place.
If the element turns out to be a new one, it can be inserted either at the front of the list or at
the end of the list.
If it is a duplicate element, an extra field (a count) is usually kept and incremented.
Case (i) Inserting a New Key at front of the Linked List
Inserting a new key at front of the list is easy and convenient.
Recently inserted keys are often the most likely to be accessed in the near
future, and inserting at the front avoids traversing the linked list.
Case (ii) Inserting a New Key at end of the Linked List
Inserting a new Key at the end helps to avoid redundancy.
It requires the appropriate list to be traversed.
Algorithm
void Insert (int Key, Hashtable H)
begin
/* Traverse the list to check whether the key is already present */
Pos = Find (Key, H);
if (Pos == NULL) /* Key is not found */
begin
Newcell = getnode();
if (Newcell != NULL)
begin
/* Loc is the header node of the list at hash location Key % HashTableSize in H */
Loc = header of list (Key % HashTableSize);
Newcell->Element = Key;
Newcell->Next = Loc->Next;
/* Insert the key at the front of the list */
Loc->Next = Newcell;
end
end
end.

Pros and Cons of Separate Chaining Hash System
Pros: Unlimited memory - more Key values can be inserted, as it uses an array of
linked lists.
Cons: It requires pointers, which occupy more memory space.
It takes more effort to perform a search, since it takes time to evaluate the hash function
and also to traverse the list.

Example:
Keys: 64, 81, 0, 4, 25, 49, 36, 16 and 9, with TableSize = 10.
H (Key) = Key % TableSize
H (64) = 64 % 10 = 4
H (81) = 81 % 10 = 1
H (0) = 0 % 10 = 0
H (4) = 4 % 10 = 4
H (25) = 25 % 10 = 5
H (49) = 49 % 10 = 9
H (36) = 36 % 10 = 6
H (16) = 16 % 10 = 6
H (9) = 9 % 10 = 9
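The find and insert routines for separate chaining can be sketched in Python as below. This is only an illustrative sketch, not the routine from the notes: the names `Node`, `ChainedHashTable`, `heads` and `chain` are assumptions made for the example, and the table stores the first node of each chain directly instead of a separate header node.

```python
class Node:
    """One cell of a chain: a key and a link to the next cell."""
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class ChainedHashTable:
    """Hypothetical separate-chaining table for integer keys."""
    def __init__(self, table_size=10):
        self.table_size = table_size
        # Each slot holds the first node of its chain (None when empty).
        self.heads = [None] * table_size

    def _hash(self, key):
        return key % self.table_size

    def find(self, key):
        """Return the node holding key, or None if the key is absent."""
        node = self.heads[self._hash(key)]
        while node is not None and node.key != key:
            node = node.next
        return node

    def insert(self, key):
        """Insert key at the front of its chain unless it is already present."""
        if self.find(key) is None:
            loc = self._hash(key)
            self.heads[loc] = Node(key, self.heads[loc])

    def chain(self, loc):
        """List the keys stored at hash location loc, front first."""
        keys, node = [], self.heads[loc]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys


table = ChainedHashTable(10)
for k in [64, 81, 0, 4, 25, 49, 36, 16, 9]:
    table.insert(k)
print(table.chain(4))   # [4, 64]
print(table.chain(6))   # [16, 36]
```

Because new keys go to the front, the later key of each colliding pair (4, 16, 9) appears first in its chain.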






2. Explain how the Collision is handled in the Open addressing Hashing System?
Or
Discuss the collision resolving strategies used in the closed hashing system.

OPEN ADDRESSED HASH SYSTEM or CLOSED HASH SYSTEM
In an Open addressing hashing system, when a collision occurs, alternative cells are tried
until an empty cell is found. The cells h_i(x), h_{i+1}(x), h_{i+2}(x), ... are tried in succession.
The Hash function of the Open addressed hashing system is
h_i (key) = (Hash (key) + F(i)) % TABLESIZE, where F( ) is the collision function.
There are three common collision resolution strategies. They are
(i) Open Addressing with Linear Probing
(ii) Open Addressing with Quadratic probing
(iii) Open Addressing with Double Hashing.

LINEAR PROBING
In linear probing, the collision function F(i) is a linear function of i, which amounts to
trying cells sequentially (with wrap around) in search of an empty cell.
If the end of the table is reached and no empty cell has been found, then the search is
continued from the beginning of the table. It has a tendency to create clusters in the table.
h_i (key) = (Hash (key) + F(i)) % TABLESIZE,
where F(i) = i is the collision function for the i-th collision.

Example: Insert the following keys 89, 18, 49, 58, 69 (TABLESIZE = 10)
h_i (key) = (Hash (key) + F(i)) % TABLESIZE, F(i) = i
(i) h_0 (89) = (89 + F(0)) % 10 = 9 (No Collision)
(ii) h_0 (18) = (18 + F(0)) % 10 = 8 (No Collision)
(iii) h_0 (49) = (49 + F(0)) % 10 = 9 (1st Collision, so F(1) = 1)
      h_1 (49) = (49 + F(1)) % 10 = 50 % 10 = 0 (No Collision)
(iv) h_0 (58) = (58 + F(0)) % 10 = 8 (1st Collision, so F(1) = 1)
      h_1 (58) = (58 + F(1)) % 10 = 59 % 10 = 9 (2nd Collision, so F(2) = 2)
      h_2 (58) = (58 + F(2)) % 10 = 60 % 10 = 0 (3rd Collision, so F(3) = 3)
      h_3 (58) = (58 + F(3)) % 10 = 61 % 10 = 1 (No Collision)
(v) h_0 (69) = (69 + F(0)) % 10 = 9 (1st Collision, so F(1) = 1)
      h_1 (69) = (69 + F(1)) % 10 = 70 % 10 = 0 (2nd Collision, so F(2) = 2)
      h_2 (69) = (69 + F(2)) % 10 = 71 % 10 = 1 (3rd Collision, so F(3) = 3)
      h_3 (69) = (69 + F(3)) % 10 = 72 % 10 = 2 (No Collision)



Limitations: As long as the table is big enough, a free cell can always be found, but the time
to do so can get large.
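
The linear probing example above can be reproduced with a short Python sketch. The function name `linear_probe_insert` and the use of `None` to mark an empty cell are assumptions made for this illustration.

```python
def linear_probe_insert(table, key):
    """Insert key with F(i) = i and return the slot that was used."""
    size = len(table)
    for i in range(size):              # at most `size` probes, with wrap-around
        slot = (key + i) % size
        if table[slot] is None:        # None marks an empty cell
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")


table = [None] * 10
slots = [linear_probe_insert(table, k) for k in [89, 18, 49, 58, 69]]
print(slots)   # [9, 8, 0, 1, 2]
```

The keys 49, 58 and 69 end up in the adjacent cells 0, 1 and 2, which illustrates the clustering tendency mentioned above.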

QUADRATIC PROBING
It is a collision resolution method that eliminates the primary clustering problem using a
quadratic collision function. The collision function is F(i) = i^2.
In Quadratic probing, on the first collision, look ahead one position (1^2) from the original
hash location and place the key in the hash table if that cell is empty. On the second collision,
look 2^2 positions ahead, on the third collision look 3^2 positions ahead, and so on.

Example: Insert the keys 89, 18, 49, 58, 69 (TABLESIZE = 10)
h_i (key) = (Hash (key) + F(i)) % TABLESIZE, F(i) = i^2
(i) h_0 (89) = (89 + F(0)) % 10 = 9 (No Collision)
(ii) h_0 (18) = (18 + F(0)) % 10 = 8 (No Collision)
(iii) h_0 (49) = (49 + F(0)) % 10 = 9 (1st Collision, so F(1) = 1^2 = 1)
      h_1 (49) = (49 + 1) % 10 = 50 % 10 = 0 (No Collision)
(iv) h_0 (58) = (58 + F(0)) % 10 = 8 (1st Collision, so F(1) = 1^2 = 1)
      h_1 (58) = (58 + 1) % 10 = 59 % 10 = 9 (2nd Collision, so F(2) = 2^2 = 4)
      h_2 (58) = (58 + 4) % 10 = 62 % 10 = 2 (No Collision)
(v) h_0 (69) = (69 + F(0)) % 10 = 9 (1st Collision, so F(1) = 1^2 = 1)
      h_1 (69) = (69 + 1) % 10 = 70 % 10 = 0 (2nd Collision, so F(2) = 2^2 = 4)
      h_2 (69) = (69 + 4) % 10 = 73 % 10 = 3 (No Collision)

Limitations: In a Quadratic Probing System,
The TableSize needs to be Prime, and
a new key element is guaranteed to be inserted only while the hash table is at least half empty;
beyond that, an insertion may fail even though empty cells remain.
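
A minimal Python sketch of quadratic probing, reproducing the example above. The function name `quadratic_probe_insert` and the `None` marker for empty cells are assumptions for illustration. The table size of 10 is taken from the example even though it is not prime.

```python
def quadratic_probe_insert(table, key):
    """Insert key with F(i) = i * i and return the slot that was used."""
    size = len(table)
    for i in range(size):
        slot = (key + i * i) % size
        if table[slot] is None:        # None marks an empty cell
            table[slot] = key
            return slot
    # Unlike linear probing, the probe sequence may miss empty cells.
    raise OverflowError("no empty cell found along the probe sequence")


table = [None] * 10
slots = [quadratic_probe_insert(table, k) for k in [89, 18, 49, 58, 69]]
print(slots)   # [9, 8, 0, 2, 3]
```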




DOUBLE HASHING:
In double hashing, a second hash function, hash_2(x), is applied, and cells are probed at
distances hash_2(Key), 2 hash_2(Key), 3 hash_2(Key) and so on.
The collision function is F(i) = i * hash_2(Key), where i = 1, 2, 3, 4, ...
Here the second hash function is hash_2(Key) = R - (Key % R), where R is any prime < Tablesize.
Example: Insert the keys 89, 18, 49, 58, 69 (TABLESIZE = 10, R = 7)
h_i (key) = (Hash (key) + F(i)) % TABLESIZE, F(i) = i * hash_2(Key)
(i) h_0 (89) = (89 + F(0)) % 10 = 9 (No Collision)
(ii) h_0 (18) = (18 + F(0)) % 10 = 8 (No Collision)
(iii) h_0 (49) = (49 + F(0)) % 10 = 9 (1st Collision)
      F(1) = 1 * hash_2(49) = 1 * (7 - (49 % 7)) = 7, taking R = 7
      h_1 (49) = (49 + 7) % 10 = 56 % 10 = 6 (No Collision)
(iv) h_0 (58) = (58 + F(0)) % 10 = 8 (1st Collision)
      F(1) = 1 * hash_2(58) = 1 * (7 - (58 % 7)) = 5, taking R = 7
      h_1 (58) = (58 + 5) % 10 = 63 % 10 = 3 (No Collision)
(v) h_0 (69) = (69 + F(0)) % 10 = 9 (1st Collision)
      F(1) = 1 * hash_2(69) = 1 * (7 - (69 % 7)) = 1, taking R = 7
      h_1 (69) = (69 + 1) % 10 = 70 % 10 = 0 (No Collision)

Note: Taking common PRIME number for the second hash function during collision is
advisable.
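
The double hashing example above can be checked with the following Python sketch. The names `hash2` and `double_hash_insert` are hypothetical, and R = 7 is taken from the example.

```python
def hash2(key, r=7):
    """Second hash function: R - (key % R), which is never zero."""
    return r - (key % r)


def double_hash_insert(table, key, r=7):
    """Insert key with F(i) = i * hash2(key) and return the slot used."""
    size = len(table)
    step = hash2(key, r)
    for i in range(size):
        slot = (key + i * step) % size
        if table[slot] is None:        # None marks an empty cell
            table[slot] = key
            return slot
    raise OverflowError("no empty cell found along the probe sequence")


table = [None] * 10
slots = [double_hash_insert(table, k) for k in [89, 18, 49, 58, 69]]
print(slots)   # [9, 8, 6, 3, 0]
```

With a table size of 10, a step such as 5 revisits only two cells, so the probe sequence does not cover the whole table. Choosing a prime table size avoids this.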



3. Write short notes on Re-hashing and Extendible hashing System with suitable
example. (10)

RE-HASHING SYSTEM
Rehashing increases the size of the hash table array and re-inserts all of the items into
the new array using a new hash function.

When the original hash table is too full,
Build a new hash table that is about twice as big (the next prime that is at least
twice the current table size) with an associated new hash function.
Scan down the original hash table, compute the new hash location for each element, and
insert the elements into the new hash table. Then drop the original table.
When should we Rehash?
Rehashing can be triggered when
the original hash table is HALF full,
an insertion fails, or
the load reaches a certain level (load factor); this is the best option for rehashing.
Load Factor (λ) = (Number of Key elements in the hash table) / TableSize, so that
λ = 0 (table empty); λ = 0.5 (half full); λ = 1 (table full)

Example: Hash the following key elements 18, 15, 6 and 24 for the TableSize of 7.

Hash (18) = 18 % 7 = 4
Hash (15) = 15 % 7 = 1
Hash (6) = 6 % 7 = 6
Hash (24) = 24 % 7 = 3


Hash table (TableSize = 7):
Index  Key
0
1      15
2
3      24
4      18
5
6      6


Initiating the Rehashing System,

New hash table Size = 7 * 2 = 14,
and the nearest PRIME greater than 14 is 17.

By scanning down the original hash table and
Rehashing using the new hash function as,

Hash (18) = 18 % 17 = 1
Hash (15) = 15 % 17 = 15
Hash (6) = 6 % 17 = 6
Hash (24) = 24 % 17 = 7
Now the original Hash table is freed.
New HashTable


Pros and Cons of Rehashing System
Pros:
Rehashing can be used in other Data structures as well. For instance, if a Queue data
structure becomes full, declare a double-sized array and copy everything over, freeing the
original.
Rehashing frees the programmer from worrying about the table size.
Cons:
Rehashing is time consuming, since every element must be hashed once again.
It is also very expensive when running short of memory space.
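
The rehashing steps described in this answer can be sketched in Python as follows. The helper names `next_prime`, `insert` and `rehash` are assumptions for illustration, and linear probing is assumed for collision handling because the notes do not fix a strategy for this example.

```python
def next_prime(n):
    """Smallest prime >= n (trial division; adequate for small tables)."""
    def is_prime(m):
        if m < 2:
            return False
        d = 2
        while d * d <= m:
            if m % d == 0:
                return False
            d += 1
        return True
    while not is_prime(n):
        n += 1
    return n


def insert(table, key):
    """Insert key using key % TableSize with linear probing (assumed)."""
    size = len(table)
    for i in range(size):
        slot = (key + i) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")


def rehash(old_table):
    """Build a table of the next prime size >= 2 * old size and re-insert."""
    new_table = [None] * next_prime(2 * len(old_table))
    for key in old_table:              # scan down the original table
        if key is not None:
            insert(new_table, key)
    return new_table                   # the caller drops the old table


old = [None] * 7
for k in [18, 15, 6, 24]:
    insert(old, k)
new = rehash(old)
print(len(new))                                      # 17
print([new.index(k) for k in [18, 15, 6, 24]])       # [1, 15, 6, 7]
```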

EXTENDIBLE HASHING SYSTEM
Extendible hashing is a type of hash system which treats a hash as a bit string, and uses
a trie for bucket lookup. Because of the hierarchical nature of the system, re-hashing is an
incremental operation (done one bucket at a time, as needed). This means that time-sensitive
applications are less affected by table growth than by standard full-table rehashes.
Hash Function: The hash function Hash (Key) for the extendible hash system returns a binary
number. The first i bits of each string will be used as indices to figure out where they will go in
the "directory" (hash table). Additionally, i is the smallest number such that the first i bits of all
keys are different.
Key terms used here are:
1. The key size that maps the directory (the Global depth), and
2. The key size that has previously mapped the bucket (the Local depth)


New hash table after rehashing (TableSize = 17):
Index  Key
0
1      18
2
3
4
5
6      6
7      24
8
9
10
11
12
13
14
15     15
16

Operations on hash table:
1. Doubling the directory when a bucket becomes full - If the local depth is equal to the
global depth, then there is only one pointer to the bucket and no other directory
pointer maps to it, so the directory must be doubled.
2. Creating a new bucket, and re-distributing the entries between the old and the new
bucket - If the bucket is full and the local depth is less than the global depth, then there
exists more than one pointer from the directory to the bucket, and the bucket can be
split.
Example: Keys (say k1, k2, k3) to be used: 100100, 010110, 110110
Initially the bucket size is 1. The first two keys to be inserted, k1 and k2, can be
distinguished by the most significant bit, and would be inserted into the table.
Now, if k3 were to be hashed to the table, it wouldn't be enough to distinguish all three keys by
one bit (because k3 and k1 both have 1 as their leftmost bit). Also, because the bucket size is one,
the table would overflow. Because comparing the first two most significant bits would give each
key a unique location, the directory size is doubled to 4.

And so now k1 and k3 have a unique location, being distinguished by the first two leftmost bits.
Because k2 is in the top half of the table, both 00 and 01 point to it because there
is no other key to compare to that begins with a 0.
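
The example above can be traced with a small Python sketch of extendible hashing. It is a simplified illustration under stated assumptions: keys are given directly as bit strings, the directory is indexed by the leading bits, and the names `Bucket`, `ExtendibleHash`, `global_depth` and `local_depth` are chosen for the example. The sketch starts from a single bucket (directory of size 1) and reaches the same final directory as the example.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHash:
    def __init__(self, key_bits=6, bucket_size=1):
        self.key_bits = key_bits
        self.bucket_size = bucket_size
        self.global_depth = 0
        self.directory = [Bucket(0)]        # always 2**global_depth entries

    def _index(self, key):
        # The first global_depth bits of the key select the directory entry.
        return int(key[:self.global_depth], 2) if self.global_depth else 0

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if key in bucket.keys:
            return
        if len(bucket.keys) < self.bucket_size:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            if self.global_depth == self.key_bits:
                raise OverflowError("no more bits to distinguish keys")
            # Double the directory: entry j becomes entries 2j and 2j+1.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        # Split the full bucket on its next bit.
        bit = bucket.local_depth
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        for i, b in enumerate(self.directory):
            # Entries whose index has a 1 at position `bit` move to new_bucket.
            if b is bucket and (i >> (self.global_depth - 1 - bit)) & 1:
                self.directory[i] = new_bucket
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:                   # redistribute the old entries
            self.directory[self._index(k)].keys.append(k)
        self.insert(key)                     # retry; may split again


table = ExtendibleHash()
for k in ["100100", "010110", "110110"]:     # k1, k2, k3
    table.insert(k)
print(table.global_depth)                    # 2
print([b.keys for b in table.directory])
# [['010110'], ['010110'], ['100100'], ['110110']]
```

Directory entries 00 and 01 refer to the same bucket object (local depth 1), which is why k2 is printed twice.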

4. Consider a hash table of size 10, initially empty, after adding the following elements with
h(x) = x mod 10 as the hash function. Assume that the hash table uses linear probing and
rehashing occurs at the start of an add where the load factor is 0.5. keys are 7, 84, 31, 57, 44,
19, 27, 14, and 64 (6)

Refer class work note book. 1

5. Show the result of the following sequence of instructions on the sets from 1 to 17 integer
digits: Union (1,2); Union (3,4); Union (1,7); Union (3,6); Union (8,9); Union (1,8); Union (3,10);
Union (3,11); Union (8,12); Union (9,13); Union (14,15); Union (16,17); Union (14,16); Union
(1,3); Union (1,14) when unions are performed as
(i) Arbitrarily (ii) by height (iii) by Size (iv) find sets of 13, 7, 10 through path
compression. (6)
Ans: Refer class work note book. 1





6. What do you mean by Disjoint Sets. Discuss in detail the various representations of Disjoint
Sets. (16)
BASIC DEFINITIONS:
(i) A set is a collection of objects.
(ii) Set A is a subset of set B, if all elements of A are in B. Subsets are also Sets.
(iii) Union of two sets A and B is a set C which consists of all elements in A and B.
(iv) Two sets are mutually disjoint if they do not have a common element, also called
Disjoint Sets
(v) A relation R is defined on a set S if for every pair of elements (a, b), a, b ∈ S, a R b is
either true or false. If a R b is true, then we say that a is related to b.
(vi) An equivalence relation is a relation R that satisfies three properties:
(reflexive) a R a, for all a ∈ S.
(symmetric) a R b if and only if b R a.
(transitive) a R b and b R c implies that a R c.
(vii) An equivalence relation partitions a set into distinct equivalence classes.

Operations on Disjoint Set
1. Union ( a, b)
Check if a and b are already related, i.e., whether they are in the same equivalence class.
If not, merge the two equivalence classes containing a and b into a new
equivalence class.
2. Find (x )
Return the name (pointer or index of representative) of the set containing a given
element X.
Implementation / Representation of Disjoint Set
The Disjoint Set can be implemented using three data structures as
1. Array Implementation.
2. Linked List Implementation.
3. Tree Implementation.

ARRAY IMPLEMENTATION OF DISJOINT SET
Array representation assigns one position for each element. Each position stores the element
and an index to the representative. Initially, each element is in its own set.
Find-Set(): To make the Find-Set operation fast, it stores the name of each equivalence class in
the array. Thus the find takes constant time, O(1).
Union-Sets(): Assume element a belongs to set i and element b belongs to set j. When we
perform Union(a,b), all j's have to be changed to i's. Each union operation unfortunately takes
Θ(n) time. So for n-1 unions the time taken is Θ(n^2).

Algorithms:
Initialize( int N )
begin
int array[N+1];
for (int i=1; i<=N; i++)
array[i] = i;
end
int find( int i )
begin
return array[i];
end

void UnionSets( int i, int j )
begin
rooti=find(i);
rootj=find(j);
for (int k=1; k<=N; k++)
begin
if (array[k] == rootj)
array[k] = rooti;

end
end
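
The array algorithms above translate almost directly into Python. The class name `QuickFind` and the attribute `name` are assumptions for this sketch. Index 0 is left unused so that elements are numbered 1..N as in the pseudocode.

```python
class QuickFind:
    """Array representation: name[i] is the set name of element i."""
    def __init__(self, n):
        # Initially each element is in its own set; index 0 is unused.
        self.name = list(range(n + 1))

    def find(self, i):
        return self.name[i]                 # constant time

    def union(self, i, j):
        root_i, root_j = self.find(i), self.find(j)
        if root_i == root_j:
            return                          # already in the same set
        for k in range(1, len(self.name)):  # linear scan on every union
            if self.name[k] == root_j:
                self.name[k] = root_i


sets = QuickFind(5)
sets.union(1, 2)
sets.union(3, 4)
sets.union(2, 4)
print(sets.name[1:])   # [1, 1, 1, 1, 5]
```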

Limitations: Using Array for representing the Disjoint Set requires more memory.
LINKED LIST IMPLEMENTATION OF DISJOINT SET
Each set is represented by a linked list
The first object in each linked list serves as its set's representative.
Each object in the linked list contains
a set member,
a pointer to the object containing the next set member,
a pointer back to the representative.
Each list maintains pointers, head, to the representative, and tail, to the last object in the
list.
Within each linked list, the objects may appear in any order (subject to our assumption that
the first object in each list is the representative).
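
A minimal Python sketch of the linked list representation follows. The names `Member`, `rep`, `tail`, `find`, `union` and `members` are assumptions for illustration. As a simplification, the tail pointer is stored on the representative itself rather than in a separate list header.

```python
class Member:
    """One object in a set's linked list."""
    def __init__(self, value):
        self.value = value
        self.next = None     # next member of the same set
        self.rep = self      # pointer back to the representative
        self.tail = self     # last object; meaningful only on the representative


def find(member):
    """Return the representative of the set containing member."""
    return member.rep


def union(a, b):
    """Append the list containing b to the list containing a."""
    head_a, head_b = find(a), find(b)
    if head_a is head_b:
        return head_a
    head_a.tail.next = head_b          # link the second list after the first
    head_a.tail = head_b.tail
    node = head_b
    while node is not None:            # update every moved member
        node.rep = head_a
        node = node.next
    return head_a


def members(rep):
    """List the values of a set, starting from its representative."""
    values, node = [], rep
    while node is not None:
        values.append(node.value)
        node = node.next
    return values


f, g, d = Member("f"), Member("g"), Member("d")
c, h, e = Member("c"), Member("h"), Member("e")
union(f, g); union(f, d)               # set F = {f, g, d}
union(c, h); union(c, e)               # set C = {c, h, e}
union(f, c)                            # Union (F, C)
print(members(find(e)))                # ['f', 'g', 'd', 'c', 'h', 'e']
```

Find takes constant time here, while a union costs time proportional to the length of the appended list because every moved member needs its representative pointer updated.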


Array implementation, initial state for N = 5 (each element names its own set):
Index: 1 2 3 4 5
Name:  1 2 3 4 5

Example: Set C and set F using Linked list & Union ( F, C).








TREE IMPLEMENTATION OF DISJOINT SET
A tree data structure can be used to represent a disjoint set ADT. Each set is represented by a
tree. The elements in the tree have the same root and hence the root is used to name the set.
The trees do not have to be binary since we only need a parent pointer.

Operations & algorithms:
Initialize Set (int N)
begin
int parent [N];
for ( int i = 0; i < N; ++i )
{ parent[i] = -1; }
end


If parent[i] == -1, then i is a root node. Initially, each integer is in its own set
Find-set( ): The Find-Set operation takes a time proportional to the depth of the tree.
int find-set( int i ) // Iterative Find-set () algorithm.
begin
while( parent[i]!=-1)
i = parent[i];
return i;
end.

Initial parent array for 8 elements (every element is a root):
Element: 1 2 3 4 5 6 7 8
Parent: -1 -1 -1 -1 -1 -1 -1 -1

int find-set( int i ) // Recursive Find-set () algorithm
begin
if(parent[i]==-1)
return i;
else
return find-set(parent[i]);
end.

Union-sets() operation: (an Arbitrary Union-Sets () algorithm)
void UnionSets( int i, int j )
begin
i = find-set ( i ); // root of i
j = find-set ( j ); // root of j
if ( i != j )
parent[j] = i; // 2nd set is appended to 1st set
end.
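
The tree operations above can be sketched in Python as follows. The function names `make_sets`, `find_set` and `union_sets` are assumptions for this illustration. The parent array has one extra entry so that elements can be numbered 1..8 as in the example, leaving index 0 unused.

```python
def make_sets(n):
    """Every element starts as the root of its own tree (parent -1)."""
    return [-1] * n


def find_set(parent, i):
    """Follow parent pointers up to the root (iterative version)."""
    while parent[i] != -1:
        i = parent[i]
    return i


def union_sets(parent, i, j):
    """Arbitrary union: the tree containing j is attached under i's root."""
    i, j = find_set(parent, i), find_set(parent, j)
    if i != j:
        parent[j] = i


parent = make_sets(9)            # elements 1..8; index 0 is unused
union_sets(parent, 5, 6)
union_sets(parent, 7, 8)
union_sets(parent, 5, 7)
print(parent[1:])                # [-1, -1, -1, -1, -1, 5, 5, 7]
print(find_set(parent, 8))       # 5
```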
Example: Tree representation of disjoint set ADT



After union-sets (5,6)

Tree representation of disjoint set ADT after union (7,8),

Tree representation of disjoint set ADT after union (5,7) as



7. With algorithm, discuss the effect of path compression and the various smart Union
strategies in disjoint sets with suitable example. (10)

The Set union problem: The set union problem consists of performing a sequence of union and
find operations, starting from a collection of n singleton sets {1}, {2}, ..., {n}.
The union-sets in the basic tree data structure representation were performed arbitrarily, by
making the second tree a subtree of the first.
Union-sets(X,Y) operation will add Y as subtree of X, irrespective of the depth of the tree.
Tree representation after Union-Set(5,7)

Arbitrary algorithm for Union-Sets(4,5)

The basic approaches to improve the Union-sets algorithm are
(i) Union by Size
(ii) Union by Height / Union by Rank

Union by Size(X, Y) Algorithm:
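Union by Size makes the root of the smaller tree point to the root of the larger tree, so the
smaller tree becomes a subtree of the larger one.
This requires that the size of each tree is maintained.
Union by size is easy to implement and requires no extra space, since the array entry of a root
can store the negated size of its tree.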
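
A minimal Python sketch of union by size, assuming the convention that the array entry of a root holds the negated size of its tree and every other entry holds a parent index. The names `find_set` and `union_by_size` and the sample forest are hypothetical.

```python
def find_set(s, i):
    """Follow parent links until a root (negative entry) is reached."""
    while s[i] >= 0:
        i = s[i]
    return i


def union_by_size(s, a, b):
    """Attach the smaller tree under the root of the larger tree."""
    root1, root2 = find_set(s, a), find_set(s, b)
    if root1 == root2:
        return root1
    if s[root1] > s[root2]:      # sizes are negated, so root1 is smaller
        root1, root2 = root2, root1
    s[root1] += s[root2]         # the new size is the sum of both sizes
    s[root2] = root1             # the smaller root now points to the larger
    return root1


s = [-1] * 9                     # elements 1..8; index 0 is unused
union_by_size(s, 5, 6)
union_by_size(s, 7, 8)
union_by_size(s, 5, 7)           # the tree rooted at 5 now has 4 nodes
union_by_size(s, 3, 7)           # single node 3 joins the larger tree
print(s[3], s[5])                # 5 -5
```

Although 3 is the first argument of the last call, it becomes a child of 5 because its tree is the smaller one.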




Example: Union-sets(3, 7)

Result of Union by Size (3, 7)


Union by Height (X, Y) / Union by Rank (X, Y) Algorithm:
Union-by-height is a trivial modification of union-by-size.
It keeps track of the height, instead of the size, of each tree and performs unions by making the
shallow tree a subtree of the deeper tree.
It requires maintaining the height of the subtree rooted at each node (also referred to as the
rank of a node.)
The height of a tree increases only when two equally deep trees are joined (and then the height
goes up by one).

Algorithm:

Step 1: Compare the heights of the two trees.
Step 2: Link the shorter tree as a child of the taller tree.
Step 3: If the heights are equal, make an arbitrary choice.
Step 4: Increment the height of the merged tree if it has changed; this happens only when
merging two trees of equal height.




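void UnionSetByRank (root1, root2) // s[] holds the sets; a root stores its negated height
{
// root1 and root2 must be two distinct roots
if (s[root1] < s[root2]) // tree rooted at root1 is deeper
s[root2] = root1;
else if (s[root2] < s[root1]) // tree rooted at root2 is deeper
s[root1] = root2;
else // equal heights: the merged tree grows by one level
{
s[root2]--;
s[root1] = root2;
}
}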
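
Union by height can be sketched in Python as below. The sketch assumes that a root stores the negated height of its tree minus one (so a single node stores -1) and that non-roots store a parent index. The names `find_set` and `union_by_height` and the sample forest are hypothetical.

```python
def find_set(s, i):
    """Follow parent links until a root (negative entry) is reached."""
    while s[i] >= 0:
        i = s[i]
    return i


def union_by_height(s, a, b):
    """Make the shallower tree a subtree of the deeper tree."""
    root1, root2 = find_set(s, a), find_set(s, b)
    if root1 == root2:
        return root1
    if s[root2] < s[root1]:      # root2's tree is deeper
        s[root1] = root2
        return root2
    if s[root1] == s[root2]:     # equal heights: the result is one level taller
        s[root1] -= 1
    s[root2] = root1
    return root1


s = [-1] * 9                     # elements 1..8; index 0 is unused
union_by_height(s, 5, 6)
union_by_height(s, 7, 8)
union_by_height(s, 5, 7)         # two trees of equal height are merged
union_by_height(s, 3, 7)         # shallow tree 3 goes under the deeper root 5
print(s[3], s[5])                # 5 -3
```

The last union leaves the stored height of root 5 unchanged, because the height grows only when two equally deep trees are joined.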

Example: Union by rank (3,7)



Result of Union by height (3,7)

APPLICATION OF DISJOINT SETS
1. Maze generation (using a modified Kruskal's algorithm)
2. Construction of spanning tree for the graphs
3. Connected component labeling (electrical connections, network connections, etc)
4. Online maintenance of biconnected components
5. Alias analysis system software (compilers)
6. Used in construction of contour trees


Path Compression Algorithm:

After finding the root V of the tree containing U in a find-set (U), traverse the path from U to V
one more time and change the parent pointers of all vertices along the path to point directly to
the root node V. This process is called path compression.

Path compression is quite simple and very effective: during Find-set operations, each node on
the find path is made to point directly to the root. Path compression does not change any
ranks.
Algorithm:
int Find(int x)
begin
if (parent[x] < 0)
return x
else
return parent[x] = Find(parent[x])
end
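
The two-pass description of path compression can also be written iteratively. The Python sketch below is an illustration with a hypothetical chain of four elements, in which roots are marked by a negative entry as in the algorithm above.

```python
def find(parent, x):
    """Find the root of x, then point every node on the path at the root."""
    root = x
    while parent[root] >= 0:         # first pass: locate the root
        root = parent[root]
    while parent[x] >= 0:            # second pass: compress the path
        next_node = parent[x]
        parent[x] = root
        x = next_node
    return root


# Chain 4 -> 3 -> 2 -> 1, where 1 is the root; index 0 is unused.
parent = [-1, -1, 1, 2, 3]
print(find(parent, 4))               # 1
print(parent[1:])                    # [-1, 1, 1, 1]
```

After the call, elements 2, 3 and 4 all point directly to the root, so later finds on them take a single step.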
Example: