Partitioning and Distribution of Crawled Pages

INTRODUCTION
This report describes an algorithm for partitioning a database of crawled pages and copying the
partitions to remote machines, so that the web documents can be indexed in parallel on
different machines.
We first cover the prerequisites and initial settings that the algorithm needs in order to run
successfully, and then briefly describe the algorithm itself, with reference to the SQL
statements it uses.
PREREQUISITES
This section lists the prerequisites and the detailed steps for setting up each one.
1. Enable Remote Connections on all SQL Servers of all the machines
To do this, follow these steps:
a. Open SQL Server Surface Area Configuration (Start -> Programs -> Microsoft SQL Server
-> Configuration Tools -> SQL Server Surface Area Configuration).
b. On the SQL Server Surface Area Configuration page, click Surface Area Configuration for
Services and Connections.
c. On the Surface Area Configuration for Services and Connections page, expand Database
Engine, click Remote Connections, click Local and remote connections, click Using
both TCP/IP and named pipes, and then click Apply.
d. Click OK when you receive the following message: Changes to Connection Settings will
not take effect until you restart the Database Engine service.
e. On the Surface Area Configuration for Services and Connections page, expand Database
Engine, click Service, click Stop, wait until the MSSQLSERVER service stops, and then
click Start to restart the MSSQLSERVER service.

2. Set all TCP ports to 1433
This step is mostly needed for the SQL Server Express edition, but it may also be needed in
other editions; the steps are the same.
This can be done as follows:
a. Open SQL Server Configuration Manager, go to Protocols for SQLEXPRESS, right-click the
enabled TCP/IP entry (on the right), and click Properties. In the IP Addresses tab, enter
"1433" in every TCP Port field.
b. Restart SQL Server or, preferably, the whole PC.
If the TCP port is not set to 1433, you will get this error: "provider: TCP Provider, error: 0
- No connection could be made because the target machine actively refused it."
3. Set the authentication mode to Mixed Mode (Windows and SQL Server
authentication)
This can be done as follows:
a. In SQL Server Management Studio, right-click the server instance and click Properties.
b. Click "Security" on the left-hand side.
c. On the right, select the "SQL Server and Windows Authentication mode" radio button.
4. Add a login on each machine with the same user name and password
This is done in SQL Server Management Studio, through the Logins node under the SQL Server
instance (New Login).
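Once the prerequisites above are in place, a quick reachability check can catch a missed port
setting before the algorithm is run. The following is a minimal sketch, not part of the original
procedure: the function names is_port_open and unreachable_machines are hypothetical, and it
only tests that a TCP connection to port 1433 is accepted. It does not verify the login or the
authentication mode.

```python
import socket

SQL_SERVER_PORT = 1433  # the TCP port configured in prerequisite 2


def is_port_open(host, port=SQL_SERVER_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def unreachable_machines(hosts, port=SQL_SERVER_PORT, timeout=3.0):
    """Return the hosts that do not accept TCP connections on the given port."""
    return [host for host in hosts if not is_port_open(host, port, timeout)]
```

For example, calling unreachable_machines(["192.168.1.10", "192.168.1.11"]) (hypothetical
addresses) would return the machines on which steps 1 and 2 should be rechecked.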
ALGORITHM DESCRIPTION
The algorithm consists of two phases. The first phase is local and is responsible for
partitioning the main crawled pages database; the number of database partitions equals the
number of machines. The second phase is remote and is responsible for creating a database on
each of the involved machines and copying the corresponding partition into it.
Input: the IP addresses of the machines across which the database will be distributed.

LOCAL PHASE
In this phase the database that contains all the crawled pages is partitioned; the number of
partitions depends on the number of machines. The main steps of this phase are:
1. The total number of documents in the database is calculated:
"select count(ID) from <TableName>"
2. The size of each partition is calculated from the total number of rows and the number
of machines across which the database is to be distributed.
3. For each machine, a partition is created and held in a SqlDataReader in preparation for
being copied remotely to that machine:
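"select Top <PartitionLength> * from <TableName> where
ID >= <MinDocID> order by ID"
where <PartitionLength> is the size of each partition that was just calculated, and
<MinDocID> is the first DocID in the database partition. The order by clause ensures that
Top returns the rows with the lowest IDs from <MinDocID> onward.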
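The steps above can be sketched as follows. This is an illustration under stated assumptions
rather than the report's actual implementation: the report does not say how the partition size
is rounded, so rounding up is assumed here; the inclusive comparison and the order by clause in
the generated query are also assumptions; and the function names are hypothetical.

```python
import math


def partition_length(total_rows, machine_count):
    """Rows per partition, rounded up so that no row is left over (assumption)."""
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    return math.ceil(total_rows / machine_count)


def partition_query(table_name, length, min_doc_id):
    """Build the SELECT for one partition that starts at min_doc_id (inclusive).

    table_name is pasted into the SQL text, so it must come from trusted input.
    """
    return (f"select top {int(length)} * from {table_name} "
            f"where ID >= {int(min_doc_id)} order by ID")


def split_ids(sorted_ids, machine_count):
    """Simulate the partitioning on an in-memory list of sorted document IDs."""
    length = partition_length(len(sorted_ids), machine_count)
    if length == 0:
        return []  # empty table: nothing to distribute
    return [sorted_ids[i:i + length] for i in range(0, len(sorted_ids), length)]
```

With 10 documents and 3 machines, this rounding gives partitions of 4, 4 and 2 rows. With very
few rows, rounding up can produce fewer partitions than machines, so a real implementation
would need to decide how to handle a machine that receives nothing.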

REMOTE PHASE
In this phase the partitions are copied to all the machines remotely. After a database
partition is created, it is copied to its destination machine as follows:
1. A database is created remotely on the machine:
"IF EXISTS (select * from sys.databases where name
='<databaseName>')
Drop database <databaseName>
create database <databaseName> Collate <collation>"
where <databaseName> is the name of the database to be created on the remote
machine, and <collation> is the collation of the source database.
2. Next, a table is created in the newly created destination database, with the same fields
and data types as the source database table that holds the crawled pages:
"IF EXISTS (select * from sys.tables where name =
'<TableName>')
Drop table <TableName>
create table <TableName> (ID bigint Identity (1,1)
,HTML_field varbinary(max), encoding nchar(20))"
3. Finally, the SqlBulkCopy class is used to copy all the rows in the created partition to the
destination machine, which is much faster than copying row by row from the source
to the destination machine.
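As an illustration of steps 1 and 2, the sketch below builds the two setup statements for one
destination machine. It is a hedged example, not the report's code: the function names are
hypothetical, and the restriction of names to plain identifiers is an added assumption, made
because the names are pasted directly into the SQL text. Executing the statements and the
bulk copy itself are not shown.

```python
import re

# Assumption: database, table and collation names are plain identifiers.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name):
    """Reject any name that is not a plain identifier before it is put into SQL."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def create_database_sql(database_name, collation):
    """Statement that drops the database if it exists, then recreates it."""
    db, coll = _check(database_name), _check(collation)
    return (f"IF EXISTS (select * from sys.databases where name = '{db}') "
            f"drop database {db}; "
            f"create database {db} collate {coll};")


def create_table_sql(table_name):
    """Statement that drops the pages table if it exists, then recreates it."""
    tbl = _check(table_name)
    return (f"IF EXISTS (select * from sys.tables where name = '{tbl}') "
            f"drop table {tbl}; "
            f"create table {tbl} (ID bigint identity(1,1), "
            f"HTML_field varbinary(max), encoding nchar(20));")
```

For example, create_database_sql("CrawlPart1", "SQL_Latin1_General_CP1_CI_AS") produces the
statement for a hypothetical database named CrawlPart1.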

[Figure: the crawled pages database is split into a 1st, 2nd, and 3rd partition, and the
indexer is run on each partition.]
