Performance Validation of Dynamic-Remapping-Based Task Scheduling On 3D Multi-Core Processors

Chien-Hui (Christina) Liao and Hung-Pin (Charles) Wen
Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan. Email: liangel.cm97g@g2.nctu.edu.tw, opwen@g2.nctu.edu.tw
Abstract-Many heuristics applying Dynamic Voltage and Frequency Scaling (DVFS) techniques have been proposed for energy minimization on three-dimensional (3D) multi-core processors. However, most previous works were built upon a fixed task-to-core mapping, which leaves many slack slots that can be further exploited. In our previous research, we proposed a dynamic remapping strategy, Iterative Dynamic Remapping (IDR), to enhance an energy-aware task-scheduling algorithm while considering transmission cost. In this paper, the performance of IDR with transmission costs between cores taken into account is validated through comparison with a Quadratic-Programming-based (QP-based) method and a Genetic-Algorithm-based (GA-based) method. Experimental results show that the IDR strategy runs at least five orders of magnitude faster than the QP-based method while achieving comparable total energy consumption. Compared to the GA-based method, the IDR strategy runs at least three orders of magnitude faster while achieving comparable (or even better) total energy consumption.
Fig. 1. System overview of the task scheduler (objective: energy minimization).
Fig. 2 legend (transmission cost): α at the same core; β to the neighbor core at the same layer; γ to the neighbor layer.
I. Introduction

To fulfill high-performance demands on embedded systems, the Multiprocessor System-on-a-Chip design methodology has arisen as a new paradigm, with 3D integration as the latest enabling technique. However, a 3D multi-core processor often consumes excessive energy, which leads to high power density [1] [2], so energy efficiency becomes a paramount concern. Many previous studies focused on energy minimization for 3D multi-core systems at the physical level and at the behavioral level, where behavioral-level techniques are typically more effective than physical-level ones [3]. In particular, Dynamic Voltage and Frequency Scaling (DVFS) scheduling algorithms have prevailed recently. Wu et al. [4] proposed an energy-efficient task-scheduling algorithm via DVFS at the system level and formulated a priority gain function that considers both gains and losses when selecting tasks to be scaled.

Figure 1 shows the system overview of the timing-and-resource-constrained scheduler, which requires the following input information:

Data flow graph (DFG): The DFG, also called the task graph, describes the behavior of the design.
Timing constraint: The timing constraint can be specified by the user but must be larger than the length of the critical path of the given DFG.
Resource constraint: The 3D multi-core processor is illustrated in Figure 2, where both the number of cores per layer and the number of layers are parameterized in our work. The transmission cost between any two cores is also considered and can be specified by users.
Energy model: The energy model shown in Table I was proposed in [3] and includes three voltage levels: 5V, 3V and 2.4V. In this work, energy consumption is defined as the execution delay multiplied by the power.

Every task must be assigned to exactly one core, and the tasks must execute in an order that respects the dependencies in the DFG. Moreover, energy minimization is the scheduling objective, and the Energy-Saving Rate (ESR), defined in Equation (1), is used to measure the energy efficiency of the computed schedule.
Fig. 2. Transmission cost in a 3D multi-core processor.
Figure 3 (a) shows the result of [4] for scheduling 31 tasks on a 3D processor with eight cores, considering transmission cost, under a timing constraint of 15. Taking Figure 3 (a) as an example, we explore the slack slots that remain after applying DVFS. Due to the fixed task-to-core mapping, many time slots (denoted by "X" in Figure 3 (a)) can be further utilized. For example, if we move task N2 from core 001 to core 010 and run it at a slower frequency, as shown in Figure 3 (b), the remaining slots are better utilized and the energy-saving rate improves.

Built on top of the previous task-scheduling algorithm [4], a dynamic task-to-core mapping strategy, Iterative Dynamic Remapping (IDR), was proposed in our previous research to reduce slack slots and improve the Energy-Saving Rate. Experimental results show that the ESR of the algorithm with the IDR strategy is 16 percent higher than that of the previous work [4] on average. Based on this result, a scheduling algorithm with dynamic task-to-core remapping can produce more energy-efficient schedules than one with a fixed task-to-core mapping. We also implemented an ILP-based method to obtain an optimal solution without considering the transmission costs between cores. The experimental results show that the IDR strategy achieves performance comparable to ILP: the energy-saving difference between IDR and ILP is only -2.54 percent on average.

In this paper, we further validate our IDR strategy when transmission costs between cores are considered. Hence, the problem is formulated as Quadratic Programming (QP) and compared to the IDR solution with transmission costs between cores. However, the scalability problem of the QP-based method becomes more severe with transmission costs. Therefore, scheduling of larger task graphs is implemented with a Genetic Algorithm (GA)
TABLE I
Energy model

Voltage (V)   Delay   Power (W)   Energy (J)
5             t       25          25t
3             2t      4           8t
2.4           3t      1.6         4.8t

\[
\mathrm{ESR} = \frac{E_{init}(5V) - E_{final}(5V, 3V, 2.4V)}{E_{init}(5V)} \times 100\% \tag{1}
\]
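As a minimal sketch of how Table I and Equation (1) fit together, the snippet below computes the ESR of a voltage assignment. The function names, the normalization of the base delay t to 1, and the example task list are illustrative assumptions and are not taken from the paper.

```python
# Energy model of Table I: voltage -> (delay factor relative to t, power in W).
# Energy of a task = delay * power, as defined in the text.
ENERGY_MODEL = {5.0: (1, 25.0), 3.0: (2, 4.0), 2.4: (3, 1.6)}


def task_energy(voltage, base_delay=1.0):
    """Energy of one task whose delay at 5V is base_delay (hypothetical helper)."""
    delay_factor, power = ENERGY_MODEL[voltage]
    return delay_factor * base_delay * power


def energy_saving_rate(base_delays, voltages):
    """ESR in percent, following Equation (1)."""
    e_init = sum(task_energy(5.0, d) for d in base_delays)
    e_final = sum(task_energy(v, d) for d, v in zip(base_delays, voltages))
    return (e_init - e_final) / e_init * 100.0


if __name__ == "__main__":
    # Hypothetical example: four unit-delay tasks, two of them scaled down.
    print(energy_saving_rate([1, 1, 1, 1], [5.0, 3.0, 2.4, 5.0]))
```

In this example the final energy is 25 + 8 + 4.8 + 25 = 62.8 against an initial energy of 100, so the ESR is about 37.2 percent.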
978-1-4577-2081-9/12/$26.00 2012 IEEE
and also compared to our IDR solution. Experiments show that IDR runs at least five orders of magnitude faster than QP and three orders of magnitude faster than GA while achieving comparable or even better energy saving.

The rest of this paper is organized as follows. Section II presents the framework of task-to-core mapping and scheduling with DVFS in [4] as well as the dynamic task-to-core remapping strategy IDR. Section III elaborates the performance validation of the proposed solution against QP and GA. Finally, Section IV concludes this paper.

II. Scheduling by Dynamic Remapping

Wu et al. [4] proposed an energy-efficient task-scheduling algorithm via DVFS at the system level and formulated a priority gain function that considers both gains and losses when selecting tasks whose frequency is scaled down. Using their algorithm [4] as a baseline, we further proposed a dynamic task-to-core remapping strategy, Iterative Dynamic Remapping (IDR), to reduce slack slots and save more energy. In this section, we first give an overview of the baseline scheduling algorithm from [4] (denoted as ORI) and then briefly describe the IDR strategy. Due to space limitations, we refer the reader to [5] for more details of the IDR method.

In the baseline scheduling algorithm, the earliest possible time (As Soon As Possible, ASAP [6]) and the latest possible time (As Late As Possible, ALAP [6]) of each operation are computed first. Second, task-to-core mapping (computed by a List-Scheduling-based approach [6] [7]) decides the core that each task runs on and its execution order. After task-to-core mapping, the initial energy of each task at 5V can be derived. Next, a task-candidate set is computed based on a gain function, and the tasks with the highest rankings take turns being selected for voltage scaling. Last, the Energy-Saving Rate (ESR) is derived according to Equation (1) to evaluate the energy efficiency of the schedule computed by the ORI algorithm.

Fig. 3. Examples of different task-to-core mapping strategies: (a) fixed task-to-core mapping (ESR = 31.49%); (b) dynamic remapping (ESR = 34.37%). Legend: 5V, 3V, 2.4V.
Fig. 4. Design flow of the Iterative Dynamic Remapping (IDR) strategy. Steps shown: task-to-core mapping, initial energy computation, ASAP & ALAP analysis, voltage scaling (5V → 3V), increasing current timing constraint, task-to-core remapping, energy-saving rate computation.
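The ASAP and ALAP analysis that opens the baseline algorithm can be sketched as follows. This is a textbook version under stated assumptions: it ignores transmission costs and core assignment, and the function name, the input format and the example graph are hypothetical. Mobility is computed here as ALAP minus ASAP, the usual slack convention.

```python
from collections import defaultdict


def asap_alap(delays, edges, timing_constraint):
    """Return (asap, alap) start times for every task of a DFG.

    delays: {task: execution delay}; edges: iterable of (predecessor, successor).
    """
    succs, preds = defaultdict(list), defaultdict(list)
    for p, s in edges:
        succs[p].append(s)
        preds[s].append(p)

    # Topological order (Kahn's algorithm).
    indegree = {t: len(preds[t]) for t in delays}
    ready = [t for t in delays if indegree[t] == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for s in succs[t]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)

    # ASAP: a task starts as soon as all of its predecessors have finished.
    asap = {}
    for t in order:
        asap[t] = max((asap[p] + delays[p] for p in preds[t]), default=0)

    # ALAP: a task finishes no later than the latest start of its successors.
    alap = {}
    for t in reversed(order):
        latest_finish = min((alap[s] for s in succs[t]), default=timing_constraint)
        alap[t] = latest_finish - delays[t]
    return asap, alap


if __name__ == "__main__":
    delays = {"A": 1, "B": 2, "C": 1, "D": 1}
    edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
    asap, alap = asap_alap(delays, edges, timing_constraint=5)
    mobility = {t: alap[t] - asap[t] for t in delays}
    print(asap, alap, mobility)
```

Tasks on the critical path (A, B, D in the example) end up with the smallest mobility, which is why mobility is a natural priority when tasks compete for cores.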
From the previous example in Figure 3 (a), many time slots are still available after a fixed mapping. Therefore, we proposed a dynamic task-to-core remapping strategy, IDR, and integrated it with the ORI algorithm. The idea of the IDR strategy comes from observing the execution of the ORI algorithm under a fixed core mapping: after voltage scaling, the distribution of tasks over the cores can become more nonuniform. If task-to-core remapping is applied after voltage scaling, the task density of the cores becomes more uniform, and more energy can be saved.

Figure 4 shows the flow of the IDR strategy on top of the ORI method, where multiple rounds of task-to-core remapping and voltage scaling are performed. The strategy requires two inputs: an initial timing constraint used for the first round and a timing-constraint limit used for the final round. The initial timing constraint must be less than or equal to the timing-constraint limit. In the first round, IDR applies task-to-core mapping and voltage scaling under the initial timing constraint. IDR then executes multiple rounds of task-to-core remapping and voltage scaling under the current timing constraint, which increases incrementally up to the timing-constraint limit. Finally, IDR applies another round of voltage scaling under the timing-constraint limit.

In the first round of the IDR strategy, the task-to-core mapping (computed by a List-Scheduling-based approach [6] [7]) and voltage scaling (5V → 3V) are performed under the initial timing constraint. Only 5V → 3V voltage scaling is applied, to prevent later task-to-core remapping from failing. After the first round of voltage scaling, the ASAP and ALAP times of each task are updated. In each subsequent round, the mobility of each task (the gap between its ASAP and ALAP times) is recomputed, and task-to-core remapping and voltage scaling under the current timing constraint are applied iteratively.
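The way the current timing constraint is relaxed from round to round can be illustrated with a small sketch. The function name is hypothetical; the default factors (1.05X CPL initially, 0.1X CPL per round, 1.5X CPL as the limit) are the values reported in the experimental setup of Section III, and how the last increment is clipped to the limit is an assumption.

```python
def timing_constraints(cpl, initial=1.05, step=0.1, limit=1.5):
    """List the timing constraint used in each IDR round, ending at the limit.

    cpl is the critical-path length; the factors are multiples of cpl.
    """
    # Work in integer hundredths of CPL to avoid floating-point drift.
    cur, inc, lim = round(initial * 100), round(step * 100), round(limit * 100)
    if inc <= 0 or cur > lim:
        raise ValueError("need step > 0 and initial <= limit")
    rounds = []
    while cur < lim:
        rounds.append(cpl * cur / 100)   # remapping rounds below the limit
        cur += inc
    rounds.append(cpl * lim / 100)       # final round at the timing-constraint limit
    return rounds


if __name__ == "__main__":
    print(timing_constraints(100))
```

For a critical-path length of 100, the sketch yields 105, 115, 125, 135 and 145 for the remapping rounds and 150 for the final voltage-scaling round.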
Note that task-to-core remapping is also computed by a List-Scheduling-based approach [6] [7], and the priority with which tasks are remapped to cores is decided by their mobility. After task-to-core remapping, the distribution of executed tasks over the cores can be more uniform. As in the first round, only 5V → 3V voltage scaling is applied, to prevent later task-to-core remapping from failing. The current timing constraint is relaxed in each round of task-to-core remapping and 5V → 3V voltage scaling. Once the current timing constraint reaches the timing-constraint limit, we perform the final round of voltage scaling (5V → 3V and 3V → 2.4V) under the timing-constraint limit.

In our previous work, experiments showed that the Energy-Saving Rate (ESR) of the IDR method is 16 percent higher than that of the baseline algorithm on average. We also implemented an ILP-based method to obtain an optimal solution without considering the transmission costs between cores. The IDR method achieves performance comparable to ILP, and the energy-saving difference between IDR and ILP is
only -2.54 percent on average. The scheduling algorithm with the dynamic task-to-core remapping strategy, IDR, achieves higher energy efficiency than that with a fixed task-to-core mapping.

III. Performance Validation

Previous results have shown that our scheduling algorithm with the dynamic task-to-core remapping strategy, IDR, can be more energy efficient than one with a fixed task-to-core mapping. In this paper, we also validate how energy efficient IDR is when transmission costs are considered. Hence, the performance validation of IDR in this section includes transmission costs. First, the problem is formulated as Quadratic Programming (QP) and compared to the IDR solution with transmission costs. Experimental results show that IDR runs at least five orders of magnitude faster than QP and achieves comparable energy saving. The energy-saving difference between IDR and QP is only 0.66 percent on average. Next, because QP suffers from the scalability problem, scheduling of larger task graphs is implemented with a Genetic Algorithm (GA) and is also compared to the IDR solution. Experimental results show that IDR runs at least three orders of magnitude faster than GA and achieves comparable (or even better) energy saving. The energy-saving difference between IDR and GA is only -2.52 percent on average.

A. Comparison with Quadratic Programming (QP)

In this section, we compare the IDR strategy with a QP-based method. We formulate each problem as a QP instance with the transmission costs between cores. To keep the illustration brief, the QP instance shown here assumes that all tasks have the same delay. In our actual experiments, tasks have different delay values, so the QP problem is more complex than the one shown here. First, the earliest possible time (As Soon As Possible, ASAP [6]) and the latest possible time (As Late As Possible, ALAP [6]) of each operation are computed.

The following variables and parameters are used in the QP formulation:

E_i and L_i are the ASAP time and ALAP time of task N_i, respectively.
Pred_{X_i} denotes all the immediate predecessors of task X_i in the DFG.
CN is the total number of processors in the 3D multi-core architecture.
TN is the size of the DFG.
X_{i,j,l,m} are 0/1 integer variables: X_{i,j,l,m} = 1 if task N_i (1 ≤ i ≤ TN) executes at step l (E_i ≤ l ≤ L_i) on processing element m (1 ≤ m ≤ CN) with voltage j (1 ≤ j ≤ 3 for the three supply-voltage levels); otherwise, X_{i,j,l,m} = 0.
P_1, P_2 and P_3 are the power consumption under the different supply voltages. According to the power model shown in Table I, P_1 = 25, P_2 = 4 and P_3 = 1.6.

The objective function in Equation (2) minimizes the energy consumption. Equation (3) requires each task to start at exactly one step (between E_i and L_i) on one core with one specific voltage. Equation (4) ensures that no step of any core m executes more than one task. Equation (5) guarantees the precedence constraints: for a task X_{is,js,ls,ms}, every predecessor X_{ip,jp,lp,mp} in Pred_{X_is} must be scheduled at an earlier step, where tr(m_s, m_p) is the transmission cost between core m_s and core m_p. In other words, if X_{is,js,ls,ms} X_{ip,jp,lp,mp} = 1, then l_p + tr(m_s, m_p) ≤ l_s.

TABLE II
Energy-saving rate (ESR) and the ESR difference (ΔESR) between IDR and QP

task size   ESR_QP (%)   ESR_IDR (%)   ΔESR (%)   speed-up (t_QP/t_IDR)
10          48.31        47.91         -0.83      526360
11          50.29        49.99         -0.60      231578
12          55.28        52.87         -4.36      135508
13          55.66        53.33         -4.19      562260
14          52.89        53.76         +1.64      118758
15          52.48        54.90         +4.61      108310
16          53.92        55.26         +2.49      159438
17          51.88        55.92         +7.79      232713
18          54.95        54.95         +0.00      231675
19          54.66        54.66         +0.00      584593
average     53.03        53.36         +0.66      285556
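QP constraints (4) and (5):

\[
\sum_{i=1}^{TN} \sum_{j=1}^{3} X_{i,j,l,m} \le 1, \quad \forall l \in step, \ \forall m \in M \tag{4}
\]

\[
\sum_{l_s=E_{i_s}}^{L_{i_s}} \sum_{l_p=E_{i_p}}^{L_{i_p}} \bigl(l_s - l_p - tr(m_s, m_p)\bigr)\, X_{i_s,j_s,l_s,m_s} X_{i_p,j_p,l_p,m_p} \ge 0,
\quad \forall i_s, i_p, \ 1 \le i_s \le TN, \ X_{i_p} \in Pred_{X_{i_s}}, \quad \forall j_s, j_p, \ 1 \le j_s, j_p \le 3, \quad \forall m_s, m_p, \ 1 \le m_s, m_p \le CN \tag{5}
\]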
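To make the QP formulation concrete, the sketch below checks a candidate schedule against it in the simplified unit-delay case. It is not a solver and not the authors' CPLEX model: the schedule representation, the function names, the (layer, index) core encoding and the uniform per-layer transmission cost are illustrative assumptions. The cost values α = 0, β = 2 and γ = 1 are those reported for the GA experiments.

```python
POWER = {1: 25.0, 2: 4.0, 3: 1.6}   # P1, P2, P3 from Table I
ALPHA, BETA, GAMMA = 0, 2, 1        # transmission costs (assumed values, see lead-in)


def tr(core_a, core_b):
    """Transmission cost between two cores given as (layer, index) pairs."""
    if core_a == core_b:
        return ALPHA                 # same core
    if core_a[0] == core_b[0]:
        return BETA                  # same layer (assumed uniform within a layer)
    return GAMMA                     # different layers (assumed uniform)


def objective(schedule):
    """Objective (2) in the unit-delay case: sum of P_j over all tasks."""
    return sum(POWER[j] for (_step, _core, j) in schedule.values())


def violations(schedule, preds, windows):
    """Return a list of violated constraints (empty if the schedule is feasible).

    schedule: {task: (step, core, voltage index)}; one entry per task, so the
    one-assignment-per-task constraint (3) holds by construction.
    preds: {task: [immediate predecessors]}; windows: {task: (E_i, L_i)}.
    """
    problems = []
    occupied = {}
    for task, (step, core, _j) in schedule.items():
        earliest, latest = windows[task]
        if not earliest <= step <= latest:
            problems.append(f"{task}: step {step} outside [{earliest}, {latest}]")
        if (step, core) in occupied:                      # constraint (4)
            problems.append(f"{task} and {occupied[(step, core)]} overlap")
        occupied.setdefault((step, core), task)
        for p in preds.get(task, ()):                     # constraint (5)
            p_step, p_core, _ = schedule[p]
            if p_step + tr(core, p_core) > step:
                problems.append(f"{p} -> {task}: precedence violated")
    return problems


if __name__ == "__main__":
    preds = {"N2": ["N1"], "N3": ["N1"]}
    windows = {"N1": (1, 4), "N2": (1, 4), "N3": (1, 4)}
    good = {"N1": (1, (0, 0), 1), "N2": (2, (0, 0), 2), "N3": (2, (1, 0), 3)}
    print(violations(good, preds, windows), objective(good))
```

Moving N3 to core (0, 1) at the same step would violate the precedence constraint, because the same-layer cost β = 2 pushes its earliest start to step 3.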
We first ran our task-scheduling algorithm, IDR, as well as the QP-based method to schedule task graphs with 10 to 19 tasks on 3D four-core processors, with the timing constraint set to 1.5X the Critical-Path Length (CPL). Our framework was implemented in C/C++ and executed on a Linux machine with a Pentium Core Duo (2.4 GHz) processor and 4 GB of memory. For the QP-based method, a QP solver (the mixed integer programming package from ILOG CPLEX [10]) was used to minimize energy consumption. Table II compares the Energy-Saving Rate of QP (ESR_QP) and IDR (ESR_IDR) and lists the energy-saving-rate difference (ΔESR) of IDR relative to the QP-based method. ΔESR is defined as the relative ESR difference between IDR and the QP-based (or GA-based) method:

\[
\Delta_{ESR} = \frac{ESR(IDR) - ESR(QP\ \text{or}\ GA)}{ESR(QP\ \text{or}\ GA)} \times 100\ (\%) \tag{6}
\]
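As a quick check of Equation (6), the following sketch recomputes a few ΔESR entries from the ESR values listed in Table II. The function name is an illustrative choice.

```python
def esr_difference(esr_idr, esr_ref):
    """Relative ESR difference in percent, following Equation (6)."""
    return (esr_idr - esr_ref) / esr_ref * 100.0


if __name__ == "__main__":
    # (task size, ESR_QP, ESR_IDR) taken from Table II.
    for size, esr_qp, esr_idr in [(10, 48.31, 47.91), (12, 55.28, 52.87), (17, 51.88, 55.92)]:
        print(size, round(esr_difference(esr_idr, esr_qp), 2))
```

The three rows reproduce the tabulated values -0.83, -4.36 and +7.79. Note that ΔESR is relative to the reference ESR, so it is not the plain difference of the two percentages.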
Table II demonstrates the efficiency of the IDR strategy. The energy-saving rate of the QP-based method is 53.03 percent on average, whereas that of the IDR method is 53.36 percent. Their energy-saving-rate difference is only 0.66 percent on average. The experimental results show that our dynamic remapping method achieves performance comparable to QP, and even better performance in exceptional cases: for some task graphs, such as those with 14 to 19 tasks, QP terminates because it runs out of memory while searching for a solution. Besides, the proposed method takes less than 1 second to complete, resulting in 100,000X to 500,000X speedups over the QP-based method. Furthermore, we also attempted to schedule larger task graphs with the QP-based method. However, QP ran out of memory before any feasible solution was found. Hence, scalability is the bottleneck of the QP-based method.

B. Comparison with Genetic Algorithm (GA)

Since the QP-based method suffers from the scalability problem, we implemented a Genetic Algorithm (GA) to schedule larger task graphs and compared its results with the IDR solution. The experiments were conducted on task graphs from the Standard Task Graph (STG) set [8] on different 3D multi-core processors. We further performed the dynamic
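minimize:

\[
\sum_{j=1}^{3} P_j \times \sum_{i=1}^{TN} \sum_{m=1}^{CN} \sum_{l=E_i}^{L_i} X_{i,j,l,m} \tag{2}
\]

subject to:

\[
\sum_{j=1}^{3} \sum_{m=1}^{CN} \sum_{l=E_i}^{L_i} X_{i,j,l,m} = 1, \quad \forall i, \ 1 \le i \le TN \tag{3}
\]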
TABLE III
Energy-saving rate (ESR) and the ESR difference (ΔESR) between IDR and GA

                                       GA (init rand)                               GA (init IDR)
task size  timing limit  ESR_IDR (%)   ESR_GA (%)  ΔESR (%)  speed-up (t_GA/t_IDR)  ESR_GA (%)  ΔESR (%)  speed-up (t_GA/t_IDR)
30         1.2CPL        37.42         33.65       +11.20    2556                   39.23       -4.61     1858
30         1.3CPL        44.00         36.90       +19.24    2867                   44.00       +0.00     2308
30         1.4CPL        55.02         47.21       +16.54    2457                   58.34       -5.69     2471
30         1.5CPL        63.17         49.96       +26.44    2325                   63.27       -0.16     2244
50         1.2CPL        47.88         44.87       +7.36     7154                   49.45       -3.93     5583
50         1.3CPL        50.03         47.32       +9.09     6714                   52.74       -2.28     5588
50         1.4CPL        54.50         52.85       +3.51     7140                   56.32       -3.10     4965
50         1.5CPL        58.32         56.67       -1.78     7617                   57.73       -3.40     6168
100        1.2CPL        35.37         30.37       +33.44    14488                  35.60       -1.48     12234
100        1.3CPL        37.55         32.64       +26.24    34784                  38.23       -1.94     18858
100        1.4CPL        48.24         37.91       +24.39    8244                   49.00       -1.76     12166
100        1.5CPL        53.09         39.44       +29.66    25668                  53.94       -1.62     17851
300        1.2CPL        36.99         -           -         -                      36.99       +0.00     11853
300        1.3CPL        42.43         -           -         -                      42.43       +0.00     10342
300        1.4CPL        53.26         -           -         -                      53.36       -0.19     9377
300        1.5CPL        58.85         -           -         -                      58.85       +0.00     10855
average                  45.18         41.34       +21.24    11796                  49.34       -2.52     8418

Note: for 300 tasks, only two GA (init rand) entries are listed (ESR_GA 29.97 and 39.17; ΔESR +41.57 and +50.24; speed-up 20132 and 22993); the timing limits they belong to are not identified.
remapping strategy, IDR, on graphs with 30, 50, 100 and 300 nodes, randomly selecting 10 cases of each size. The scheduling settings are 30 to 100 tasks with eight cores and 300 tasks with sixteen cores on 3D multi-core processors with transmission costs (α = 0, β = 2 and γ = 1). Timing constraints were set from 1.2X to 1.5X the Critical-Path Length (CPL) for each case. For the IDR method, we set the initial timing constraint to 1.05X CPL, and the current timing constraint increases by 0.1X CPL in each round of voltage scaling.

For comparison, the settings used in the GA approach are as follows. The population size is set to 10000, and the number of genes equals the task size. The operators used are reproduction, crossover and mutation. In each generation, tasks are randomly assigned to processors and the voltage of each task is also randomly decided; the execution order of the tasks on each core is then decided by considering the dependencies between tasks as well as the transmission cost. The GA stops when the fitness score (i.e., the energy-saving rate) has not improved over the last 1000 generations. In addition, the GA-based method was initialized either randomly or with our IDR solution.

Table III shows the Energy-Saving Rate (ESR) of the dynamic remapping method IDR (ESR_IDR) and of the GA-based method (ESR_GA), as well as their ESR difference (ΔESR), defined in Equation (6), between IDR and the GA-based method. As shown in Table III, the energy-saving rate of the IDR method is 45.18 percent. The energy-saving rate of the GA method with random initialization is 41.34 percent, and the energy-saving-rate difference between IDR and GA (init rand) is 21.24 percent on average. The energy-saving rate of the GA method initialized with an IDR solution is 49.34 percent, and the energy-saving-rate difference between IDR and GA (init IDR) is -2.52 percent on average.
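The GA operators and the stagnation-based stopping rule described above can be illustrated with a toy example. This is a deliberately reduced sketch, not the GA used in the experiments: it only chooses a voltage level per task for a serial chain of unit-delay tasks on a single core, so it ignores core assignment, execution ordering and transmission cost. All names, the population size, the patience value and the mutation rate are illustrative assumptions.

```python
import random

DELAY = {0: 1, 1: 2, 2: 3}          # relative delay at 5V, 3V, 2.4V (Table I)
POWER = {0: 25.0, 1: 4.0, 2: 1.6}   # power at 5V, 3V, 2.4V (Table I)


def fitness(chrom, limit):
    """ESR in percent of a serial task chain, or -1 if the timing limit is violated."""
    if sum(DELAY[g] for g in chrom) > limit:
        return -1.0
    e_init = len(chrom) * DELAY[0] * POWER[0]
    e_final = sum(DELAY[g] * POWER[g] for g in chrom)
    return (e_init - e_final) / e_init * 100.0


def run_ga(n_tasks, limit, pop_size=50, patience=100, p_mut=0.1, seed=0):
    """Toy GA; requires n_tasks >= 2 and limit >= n_tasks."""
    rng = random.Random(seed)
    # Random initialization, plus the all-5V chromosome as a feasible fallback.
    pop = [[rng.randrange(3) for _ in range(n_tasks)] for _ in range(pop_size - 1)]
    pop.append([0] * n_tasks)
    best = max(pop, key=lambda c: fitness(c, limit))
    stale = 0
    while stale < patience:          # stop after `patience` generations without improvement
        pop.sort(key=lambda c: fitness(c, limit), reverse=True)
        parents = pop[: pop_size // 2]              # reproduction: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_tasks)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [rng.randrange(3) if rng.random() < p_mut else g for g in child]  # mutation
            children.append(child)
        pop = parents + children
        candidate = max(pop, key=lambda c: fitness(c, limit))
        if fitness(candidate, limit) > fitness(best, limit):
            best, stale = candidate, 0
        else:
            stale += 1
    return best, fitness(best, limit)


if __name__ == "__main__":
    print(run_ga(n_tasks=8, limit=12))   # limit = 1.5X the chain length
```

Seeding the population with a known feasible solution mirrors, in a very reduced form, the idea of initializing the GA with an IDR solution instead of purely at random.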
The experimental results show that our dynamic remapping method achieves performance comparable to (or even better than) GA. In particular, compared to the GA method with random initialization, IDR performs much better. Moreover, as the task size increases, it becomes more difficult for the GA-based method to find feasible or better solutions. Besides, the proposed method takes no more than a few seconds to complete, resulting in 2,000X to 35,000X speedups over the GA-based method, which validates that IDR is a highly energy-efficient scheduling algorithm.

IV. Conclusions

In our previous research, a dynamic task-to-core remapping strategy, IDR, was proposed on top of a baseline task-scheduling algorithm from [4]. In this paper, the performance of IDR is validated with transmission costs between cores taken into account. For validation, the scheduling problem is implemented with a QP-based method and a GA-based method. Experimental results show that the solution quality of the dynamic remapping strategy, IDR, is very close to that of QP, with only a 0.66 percent difference. Besides, IDR runs 100,000X to 500,000X faster than the QP method. Our experiments also show that the energy-saving rate of the IDR method is 21.24 percent higher on average than that of the GA with random initialization, and that the difference between IDR and the GA initialized with an IDR solution is only -2.52 percent. In addition, IDR runs 2,000X to 35,000X faster than the GA method. As a result, the energy efficiency of the scheduling algorithm with the dynamic remapping strategy, IDR, is validated through comparison with the QP-based and GA-based approaches.

References
[1] W.-L. Hung and G. M. Link and Y. Xie and N. Vijaykrishnan and M. J. Irwin: 'Interconnect and Thermal-aware Floorplanning for 3D Microprocessors'. Proc. Int. Symposium on Quality Electronic Design, Washington, DC, USA, 2006, pp. 98-104
[2] K. Puttaswamy: 'Thermal Analysis of a 3D Die-Stacked High Performance Microprocessor'. Proc. GLSVLSI, 2006
[3] S. Raje and M. Sarrafzadeh: 'Variable voltage scheduling'. Proc. Int. Symposium on Low Power Design, Dana Point, California, United States, 1995, pp. 9-14
[4] C.-B. Wu and Y.-L. Lin: 'Energy-Efficient Task Scheduling for DVFS-Multi-Core 3D IC'. Master thesis, National Tsing Hua University, 2010
[5] C. H. Liao and Y. Z. Lin and H.-P. Wen: 'Enhancing Energy-Efficient Task Scheduling on 3D Multi-Core Processors by Dynamic Remapping', technical report, 2011. http://dl.dropbox.com/u/30720041/technical_main.pdf
[6] D. D. Gajski and N. D. Dutt and A. C.-H. Wu and S. Y.-L. Lin: 'Scheduling', 'High-Level Synthesis: Introduction to Chip and System Design' (Springer, 1992, 1st edn.), pp. 213-258
[7] T. L. Adam and K. M. Chandy and J. R. Dickson: 'A comparison of list schedules for parallel processing systems'. Commun. ACM, 1974, 17, (12), pp. 685-690
[8] 'Standard Task Graph Set, Kasahara Laboratory, Department of Electrical, Electronics and Computer Engineering, Waseda University', http://www.kasahara.elec.waseda.ac.jp/schedule/index.html
[9] A. P. Chandrakasan and M. Potkonjak and R. Mehra and J. Rabaey and R. Brodersen: 'Optimizing power using transformations', IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Jan. 1995, 14, (1), pp. 12-31
[10] ILOG CPLEX Optimizer, IBM, http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/
