
Running TPC-H on Pig

Jie Li
jieli@cs.duke.edu
Ralf Diestelkaemper
ralf@cs.duke.edu

Koichi Ishida
ki13@duke.edu
Xuan Wang
xw45@duke.edu

Muzhi Zhao
zhaomuzh@cs.duke.edu
Yin Lin
linyin@cs.duke.edu

1. Introduction
As one of the most popular high-level platforms on top of Hadoop, Pig has been shown to be
less efficient than its fellow Hive [1]. While Hive has conducted intensive benchmarks to improve its
performance [2], Pig is in great need of more comprehensive benchmarks.
Therefore, we implement the TPC-H benchmark queries for Pig, which is the de facto standard to
compare relational databases in their data warehousing performance. It consists of 22 queries with
different degrees of complexity.
By comparing Pig's results with Hive's results, we identify several performance bottlenecks in
Pig. We show how to write efficient Pig scripts by addressing some of these bottlenecks or by
making full use of Pig's features. Our work opens many optimization opportunities for Pig and
makes a direct performance comparison of Pig and Hive possible.
This report is organized as follows. In Section 2 we describe six rules of writing efficient Pig
scripts. We present our benchmark results and analysis in Section 3. Finally, we conclude with
future work in Section 4.

2. Six Rules of Writing Efficient Pig Scripts


Since Pig uses a rule-based optimizer, certain rules can be applied to improve its efficiency. We
summarize six rules for writing efficient Pig scripts, continuing the work of [3, 4].

2.1. Reorder Joins to Reduce Disk I/O


A join usually involves multiple passes of reading and writing the tables. For a multi-join
query, executing the joins with the smaller result sets first decreases the amount of intermediate
data flowing between the joins, which reduces the overall amount of data spilled to and read from
disk. Though it is nontrivial to determine which join produces the least output, we can assign
higher priority to certain joins, such as joins with small or filtered tables, and joins between a
primary key and a foreign key, whose results do not expand. Here we profit from the fact
that the table sizes and table attributes are well defined for the TPC-H benchmark.
We also consider different types of joins. Besides the default hash join, the replicated join is
applicable to joins with small tables. For TPC-H, however, it is not very effective: regardless of
the scale factor, only the tables nation and region are small enough for the replicated join, and
these two tables are always joined with other relatively small tables such as supplier and customer,
while the overall performance is dominated by the joins involving the large tables lineitem and orders.
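As an illustration, the following fragment (a hypothetical sketch; relation and field names follow the TPC-H schema, and we assume the tables are already loaded) performs the small, selective joins first and uses a replicated join for the tiny nation table:

```pig
-- Hypothetical sketch: do the small, selective joins first.
n  = FILTER nation BY n_name == 'GERMANY';
-- the replicated join ships the tiny filtered nation table to every map task
sn = JOIN supplier BY s_nationkey, n BY n_nationkey USING 'replicated';
-- only now join in lineitem, the largest table, against the small result
ls = JOIN lineitem BY l_suppkey, sn BY s_suppkey;
```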

2.2. Use COGROUP to Implement Join and Group-by


We find that Pig's COGROUP operator can be used to implement a join and a group-by on the
same key, a pattern that is very common in TPC-H queries. Pig does not recognize the common key
and compiles the join and the group-by into two separate MapReduce jobs, each requiring a full
map-shuffle-reduce pass. With COGROUP, however, only one job is required, as the join and
the group-by can be performed together. Table 1 shows an example in which a COGROUP can
replace the join and the group-by.
SQL:

    SELECT A.x, COUNT(B.y)
    FROM A JOIN B ON A.x = B.x
    GROUP BY A.x

Pig:

    t1 = COGROUP A BY x, B BY x;
    t2 = FOREACH t1 GENERATE group, COUNT(B.y);

Table 1: Use COGROUP to Implement Join and Group-by On The Same Key
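For contrast, the straightforward translation (a hypothetical sketch using the same relations A and B) needs a JOIN followed by a GROUP, which Pig compiles into two full MapReduce jobs:

```pig
-- Naive version: the JOIN and the GROUP compile into two separate
-- MapReduce jobs, each with a full map-shuffle-reduce pass.
j = JOIN A BY x, B BY x;
g = GROUP j BY A::x;
-- each tuple in bag j pairs one A row with one matching B row
t = FOREACH g GENERATE group, COUNT(j);
```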

2.3. Use FLATTEN to Implement Self-join and Group-by


If the join in Rule 2 is a self-join, we can do even better by replacing the COGROUP
with a GROUP and using FLATTEN to perform the self-join. After the aggregate is calculated for
each group, the tuples in the group are FLATTENed to apply the self-join, so the table
needs to be processed only once. Table 2 shows an example of this optimized self-join
implementation.
SQL:

    SELECT *
    FROM A AS A1
    WHERE A1.y < (SELECT AVG(A2.y)
                  FROM A AS A2
                  WHERE A2.x = A1.x)

Pig:

    t1 = GROUP A BY x;
    t2 = FOREACH t1 GENERATE FLATTEN(A), AVG(A.y) AS avg_y;
    t3 = FILTER t2 BY y < avg_y;

Table 2: Use FLATTEN to Implement Self-join and Group-by On The Same Key

2.4. Project Before (CO)GROUP


When rewriting the scripts, we consistently observed that FLATTEN and COGROUP were not
as effective as expected. After analyzing the MapReduce statistics, we noticed that the MapReduce
jobs incurred an incredibly large amount of local I/O, because all fields of the input tables were
processed. This issue can be addressed by projecting explicitly before GROUP and
COGROUP: though Pig prunes irrelevant columns as early as possible for join
operations, it currently does not prune the columns of the tables involved in GROUP and
COGROUP [5].
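A minimal sketch of the fix, reusing the relations from Table 1 (the extra fields of A and B are hypothetical): project away everything but the grouping key and the aggregated field before the COGROUP:

```pig
-- Keep only the columns the query needs before grouping; Pig does not
-- yet prune unused (CO)GROUP input columns automatically [5].
a  = FOREACH A GENERATE x;
b  = FOREACH B GENERATE x, y;
t1 = COGROUP a BY x, b BY x;
t2 = FOREACH t1 GENERATE group, COUNT(b.y);
```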

2.5. Remove Types in LOAD


In its current implementation, Pig casts each field to the type defined in the schema upon
loading, even when the field is never used later. Pig can only delay such a cast when the type of
the field is not declared; in that case, it infers the type of the field from its usage, which may be
wrong, e.g. inferring an integer to be a double. Currently there is no way in Pig to both declare a
field's type and delay its conversion; this has been an open issue since 2008 [6]. We can, however,
simply remove the types of the fields a query does not use.
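For example (the file path and field names are illustrative), compare a typed and an untyped LOAD of the same file:

```pig
-- Typed load: every field, used or not, is cast on load.
l1 = LOAD 'lineitem.tbl' USING PigStorage('|')
       AS (l_orderkey:int, l_quantity:double, l_comment:chararray);

-- Untyped load: fields are cast lazily, only where actually used,
-- so the unused l_orderkey and l_comment are never converted.
l2 = LOAD 'lineitem.tbl' USING PigStorage('|')
       AS (l_orderkey, l_quantity, l_comment);
r  = FOREACH l2 GENERATE l_quantity * 0.5;  -- l_quantity is cast here
```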

2.6. Use Hash-based Aggregation


Currently Pig supports only sort-based aggregation, so the map output is sorted before
being passed to the combiner. Hive, on the other hand, already implements hash-based
aggregation: a hash table is maintained in the map phase, and when the number of distinct keys is
small enough to fit in the hash table, no sort or spill is required. We notice that Pig is going to
support hash-based aggregation in the next release [7].
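Once the feature lands, it is expected to be switchable per script; the following is a hypothetical sketch, in which the property name is an assumption taken from the discussion in [7] and may change before release:

```pig
-- Hypothetical: enable map-side partial (hash-based) aggregation;
-- the property name below is an assumption based on PIG-2228.
SET pig.exec.mapPartAgg true;
g = GROUP lineitem BY l_returnflag;
s = FOREACH g GENERATE group, SUM(lineitem.l_quantity);
```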

3. Experiments
In this section we evaluate the performance of Pig using the TPC-H benchmark. First we describe
our benchmark environment and our initial benchmark results without applying the optimization
rules in Section 2. Then we choose several representative queries to explore the improvement by
our optimization rules in detail. Finally we present the updated benchmark result with these rules.

3.1. Benchmark Environment


We generate a 100 GB dataset with scale factor 100, and use Hadoop 0.20.203, Pig 0.9.0,
and Hive 0.7.1. We launch two 8-slave clusters on Amazon EC2 to benchmark
Pig and Hive in parallel, using m1.small instances, each with one virtual core, 1.7 GB of
memory, and 160 GB of instance storage. We use the default Hadoop configuration except that each
node has only one reduce slot and the HDFS block size is 128 MB. Both Hive and
Pig are configured to use eight reducers per job.

3.2. Initial Overall Result


As none of us had any prior experience with Pig, we first learned how to write Pig scripts and
focused on correctness. It is therefore not surprising that our initial benchmark result (Figure 1)
shows Hive outperforming Pig in almost all queries, except Q19, which we failed to run on Hive,
and Q16, for which Pig was slightly faster than Hive.

Figure 1: Initial TPCH Query Performance

3.3. Improvement of Optimization Rules


In this section we illustrate the improvement of our optimization rules (except the straightforward
Rule 4) using several representative queries.

3.3.1. Rule 1
We first apply Rule 1 to Q7 and Q9, both of which involve five joins. In our initial scripts the joins
with the lineitem table (the largest table) are executed earlier than necessary. After rearranging the
joins, we speed up Q7 by 2x and Q9 by 3x, as shown in Figure 2. We observe that Hive's queries
can also be optimized in this way; the failure of Hive's Q9 might be attributed to its poor join
order, which results in larger memory requirements.

Figure 2: Applying Rule 1 to Q7 and Q9

3.3.2. Rule 2
Figure 3 shows the improvement of Q13 from applying Rule 2, using COGROUP to implement a
join and a group-by on the same key.
Note that the additional aggregation is performed inside the COGROUP job, so the COGROUP
produces less output and is even slightly faster than the join alone; for Q13, COGROUP is a big win.
Though Hive also compiles these two operators separately, its join is aware of the following
group-by and is thus able to run the aggregation in advance, reducing the time of the group-by.

Figure 3: Applying Rule 2 to Q13


3.3.3. Rule 3
Rule 3 is applicable to Q17, which involves one self-join of lineitem and one implicit group-by on
the same key. With the FLATTEN operator in the group-by, we eliminate the expensive self-join and
reduce the running time by half. Note that Q17 has another join, between part and lineitem, on the
same key as the self-join. Once the self-join is eliminated, Q17's remaining structure is exactly a join
followed by a group-by on the same key, which can be evaluated efficiently with COGROUP as
discussed earlier. So with the COGROUP and FLATTEN operators, two joins plus one group-by
can be done within a single MapReduce job. Though Hive treats the self-join as a regular join,
its performance is better than we expected due to the optimization mentioned earlier.
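A simplified sketch of this single-job plan (field names follow the TPC-H schema; the real Q17 additionally filters part by brand and container, which is omitted here):

```pig
-- One MapReduce job: COGROUP fuses the part-lineitem join with the
-- group-by on partkey, and FLATTEN replaces the self-join of lineitem.
t1 = COGROUP lineitem BY l_partkey, part BY p_partkey;
t2 = FOREACH t1 GENERATE FLATTEN(lineitem), FLATTEN(part),
                         AVG(lineitem.l_quantity) AS avg_qty;
t3 = FILTER t2 BY l_quantity < 0.2 * avg_qty;
```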

Figure 4: Applying Rule 2 and Rule 3 to Q17

3.3.4. Rule 5
We had a hard time analyzing why Pig was slower than Hive on the simplest query, Q6, which
applies a few filters to a table and evaluates a global aggregation. We were surprised to see a 20%
performance boost for Pig after dropping the types in the schema; only in this way can we benefit
from Pig's lazy type conversion. However, Pig is still 50% slower than Hive due to its
sort-based aggregation. The huge gap between Pig's sort-based aggregation and Hive's hash-based
aggregation can also be seen in Q1.

Figure 5: Apply Rule 5 to Q6

Figure 6: The Effect of Rule 6

3.3.5. All Rewritten Queries


Figure 7 shows all the queries rewritten based on these rules. The improvement can be
significant, and Pig can be competitive with, or even faster than, Hive. We would like to post our
scripts to the community for further analysis.

Figure 7: All Rewritten Queries

3.4. Updated Overall Result


Figure 8 shows the updated result with the eight rewritten Pig scripts. Pig is no longer
dominated by Hive; instead, the two are competitive.

Figure 8: Updated TPC-H Query Performance

4. Conclusion and Future Work


In this report, we have summarized six rules for writing efficient Pig scripts. These rules fall
into three categories. First, we can choose a better query plan for Pig, especially the
order of joins. Second, we can make full use of Pig's features, such as COGROUP and FLATTEN.
Third, we need to be aware of Pig's current issues, such as column pruning, type conversion, and
sort-based aggregation. We verify these rules by comparing Pig with Hive on the TPC-H
benchmark, and the results show that by following them Pig can be competitive with Hive.
In future work, we will cooperate with the Pig community to continue the benchmarking [8]. We
will keep watching the existing issues highlighted by our benchmark, and help incorporate
novel query rewriting rules into Pig's rule-based optimizer. We will also help the community
reproduce our benchmark, so that it can be used to evaluate Pig's new features and releases. In the
long run, we expect Pig to move to a cost-based optimizer, for which the TPC-H benchmark will be
a precious set of workloads.
While our work focuses on Pig, we will also take efforts to connect the Pig and Hive
communities, so that they compete and learn from each other; our ultimate goal is to boost the
prosperity of the Hadoop ecosystem.

Acknowledgements
We referred to six Pig scripts used in [9]. We thank Amazon for its EC2 education grant.

References
[1] Hive performance benchmarks. https://issues.apache.org/jira/browse/HIVE-396
[2] Running TPC-H queries on Hive. https://issues.apache.org/jira/browse/HIVE-600
[3] Performance and Efficiency. http://pig.apache.org/docs/r0.9.1/perf.html
[4] Making Pig Fly. http://ofps.oreilly.com/titles/9781449302641/making_pig_fly.html
[5] Logical Optimizer: Nested column pruning. https://issues.apache.org/jira/browse/PIG-1324
[6] PERFORMANCE: delay type conversion. https://issues.apache.org/jira/browse/PIG-410
[7] Support partial aggregation in map task. https://issues.apache.org/jira/browse/PIG-2228
[8] Running TPC-H on Pig. https://issues.apache.org/jira/browse/PIG-2397
[9] Sai Wu, Feng Li, Sharad Mehrotra, and Beng Chin Ooi. Query optimization for massively
parallel data processing. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC
'11).
