Jie Li
jieli@cs.duke.edu
Ralf Diestelkaemper
ralf@cs.duke.edu
Koichi Ishida
ki13@duke.edu
Xuan Wang
xw45@duke.edu
Muzhi Zhao
zhaomuzh@cs.duke.edu
Yin Lin
linyin@cs.duke.edu
1. Introduction
Being one of the most popular high-level platforms on top of Hadoop, Pig has been shown to be
less efficient than its counterpart Hive [1]. While Hive has undergone intensive benchmarking to
improve its performance [2], Pig is in great need of a more comprehensive benchmark.
Therefore, we implement the queries of the TPC-H benchmark for Pig. TPC-H is the de facto
standard for comparing the data warehousing performance of relational databases; it consists of
22 queries of varying complexity.
By comparing Pig's results with Hive's, we identify several bottlenecks in Pig's performance. We
show how to write efficient Pig scripts by either addressing some of these bottlenecks or making
full use of Pig's features. Our work opens many optimization opportunities for Pig and makes a
direct performance comparison of Pig and Hive possible.
This report is organized as follows. In Section 2 we describe six rules of writing efficient Pig
scripts. We present our benchmark results and analysis in Section 3. Finally, we conclude with
future work in Section 4.
Note that the table sizes and table attributes are well defined for the TPC-H benchmark.
We also consider different types of joins. Besides the default hash join, the replicated join is
applicable to joins with small tables. For TPC-H it is not very effective: only the tables nation
and region are small enough for a replicated join regardless of the scale factor, and these two
tables only appear in joins with other small tables such as supplier and customer, while the
overall performance is dominated by the joins involving the large tables lineitem and orders.
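For reference, a replicated join in Pig only needs the using 'replicated' clause. The following is a sketch for the supplier/nation join; the loads are schematic, with only a few TPC-H fields shown:

```pig
supplier = load 'supplier' using PigStorage('|')
           as (s_suppkey, s_name, s_address, s_nationkey);
nation   = load 'nation' using PigStorage('|')
           as (n_nationkey, n_name, n_regionkey);
-- every relation after the first is replicated into memory
-- on each map task, so it must be small
j = join supplier by s_nationkey, nation by n_nationkey using 'replicated';
```

Because the small relation is shipped to every map task, the join runs map-side and avoids a shuffle entirely.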
Table 1: Use COGROUP to Implement Join and Group-by On The Same Key
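As a hedged sketch of the pattern named in Table 1 (relations A, B and key x are illustrative): instead of a JOIN followed by a GROUP on the same key, a single COGROUP groups both inputs at once and the aggregate is computed over the cogrouped bag:

```pig
-- SQL equivalent: select A.x, COUNT(B.y)
--                 from A join B on A.x = B.x group by A.x
c = cogroup A by x, B by x;
-- keep only keys present in both inputs to mimic the inner join
c2 = filter c by not IsEmpty(A) and not IsEmpty(B);
r = foreach c2 generate group as x, COUNT(B) as cnt;
```

The rewrite collapses two shuffle phases (one for the join, one for the group-by) into a single one.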
Table 2: Use FLATTEN to Implement Self-join and Group-by On The Same Key
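As a sketch of the pattern named in Table 2, the FLATTEN rewrite attaches the per-group aggregate back onto each row, avoiding an explicit self-join (relation A with fields x and y is illustrative):

```pig
-- SQL equivalent: select A1.* from A as A1
--   where A1.y < (select AVG(A2.y) from A as A2 where A2.x = A1.x)
t1 = group A by x;
-- flatten restores the original rows, each paired with its group's average
t2 = foreach t1 generate flatten(A), AVG(A.y) as avg_y;
t3 = filter t2 by y < avg_y;
```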
At first, these rewrites were not as effective as expected. After analyzing the MapReduce
statistics, we noticed that the MapReduce jobs had a surprisingly large amount of local I/O,
because all fields of the input tables were processed. This issue can be addressed by performing
explicit projections before GROUP and COGROUP. Though Pig is able to prune irrelevant columns as
early as possible for join operations, it currently does not prune the columns of the tables
involved in a GROUP or COGROUP [5].
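A hedged sketch of the workaround, keeping only the needed fields before grouping (relation A and fields x and y are illustrative):

```pig
-- project out irrelevant columns before the group, so the
-- MapReduce job shuffles only the grouping key and the
-- field being aggregated
A_slim = foreach A generate x, y;
g = group A_slim by x;
agg = foreach g generate group as x, SUM(A_slim.y) as total_y;
```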
3. Experiments
In this section we evaluate the performance of Pig using the TPC-H benchmark. First we describe
our benchmark environment and our initial benchmark results without applying the optimization
rules in Section 2. Then we choose several representative queries to explore in detail the
improvement achieved by our optimization rules. Finally we present the updated benchmark results
with these rules applied.
3.3.1. Rule 1
We first apply Rule 1 to Q7 and Q9, both of which involve five joins. In our initial scripts the
joins with the lineitem table (the largest table) are executed earlier than necessary. After
rearranging the joins, we speed up Q7 by 2x and Q9 by 3x, as shown in Figure 2. We observe that
Hive's queries can also be optimized in this way. The failure of Hive's Q9 might be attributed to
its poor join order, which results in larger memory requirements.
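Assuming the TPC-H tables are already loaded, a simplified sketch of the rearrangement: join the small dimension tables first and bring lineitem in last (only a subset of the join conditions is shown):

```pig
-- small tables first
t1 = join supplier by s_nationkey, nation by n_nationkey;
t2 = join t1 by s_suppkey, partsupp by ps_suppkey;
-- the largest table, lineitem, joins last, so the
-- biggest intermediate result is produced only once
t3 = join t2 by (ps_partkey, ps_suppkey),
     lineitem by (l_partkey, l_suppkey);
```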
3.3.2. Rule 2
Figure 3 shows the improvement of Q13 by applying Rule 2, using COGROUP to implement a
join and a group-by on the same key.
Note that the additional aggregation is done inside the COGROUP. The COGROUP therefore produces
less output and is slightly faster than the join, which makes it a big win for Query 13. Though
Hive also compiles these two operators separately, its join is aware of the following group-by
and can run the aggregation in advance, reducing the time spent in the group-by.
3.3.4. Rule 5
We had a hard time analyzing why Pig was slower than Hive for the simplest query, Q6, which
applies a few filters to a table and evaluates a global aggregation. We were surprised to see a
20% performance boost in Pig after dropping the types in the load schema; only in this way can
Pig benefit from its "lazy conversion of types" feature [6]. However, Pig is still 50% slower
than Hive due to its sort-based aggregation. The large gap between Pig's sort-based aggregation
and Hive's hash-based aggregation also shows up in Q1.
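Dropping the types means omitting type annotations from the LOAD schema, so fields remain bytearrays and Pig converts only those an expression actually touches. A Q6-style sketch follows (the load is schematic; the real lineitem table has more columns, in a fixed order):

```pig
-- no types declared: conversion happens lazily, per expression
l = load 'lineitem' using PigStorage('|')
    as (l_quantity, l_extendedprice, l_discount, l_shipdate);
f = filter l by l_shipdate >= '1994-01-01' and l_shipdate < '1995-01-01'
    and l_discount >= 0.05 and l_discount <= 0.07 and l_quantity < 24.0;
p = foreach f generate l_extendedprice * l_discount as rev;
g = group p all;
out = foreach g generate SUM(p.rev) as revenue;
```

Fields that a query never references are then never converted, which is where the savings come from.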
Acknowledgements
We referred to six Pig scripts used in [9]. We appreciate Amazon's EC2 education grants.
References
[1] Hive performance benchmarks. https://issues.apache.org/jira/browse/HIVE-396
[2] Running TPC-H queries on Hive. https://issues.apache.org/jira/browse/HIVE-600
[3] Performance and Efficiency. http://pig.apache.org/docs/r0.9.1/perf.html
[4] Making Pig Fly. http://ofps.oreilly.com/titles/9781449302641/making_pig_fly.html
[5] Logical Optimizer: Nested column pruning. https://issues.apache.org/jira/browse/PIG-1324
[6] PERFORMANCE: delay type conversion. https://issues.apache.org/jira/browse/PIG-410
[7] Support partial aggregation in map task. https://issues.apache.org/jira/browse/PIG-2228
[8] Running TPC-H on Pig. https://issues.apache.org/jira/browse/PIG-2397
[9] Sai Wu, Feng Li, Sharad Mehrotra, and Beng Chin Ooi. Query optimization for massively
parallel data processing. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC
'11).