
www.ietdl.org
Published in IET Computers & Digital Techniques, 2010, Vol. 4, Iss. 2, pp. 143-156. Received on 21st September 2008; revised on 26th February 2009. doi: 10.1049/iet-cdt.2008.0120

ISSN 1751-8601

Evolvable multi-processor: a novel MPSoC architecture with evolvable task decomposition and scheduling
S. Vakili S.M. Fakhraie S. Mohammadi
Silicon Intelligence Lab, School of ECE, University of Tehran, Tehran, Iran E-mail: sh.vakili@ece.ut.ac.ir

Abstract: The multi-processor system-on-chip (MPSoC) approach is an emerging trend for designing high-performance computational systems. This trend faces some restrictive challenges in hardware and software development. This paper presents a novel MPSoC system that tries to overcome some of these major challenges using new architectural techniques. The main novelty of this system is the accomplishment of both task decomposition and scheduling at run-time using hardware units. Hence, parallel programming or compile-time parallelisation is not required in this system, and thus it can directly and efficiently execute single-processor sequential programmes. This system utilises evolutionary algorithms to perform the decomposition and scheduling operations, and is therefore called the evolvable multi-processor (EvoMP) system. This approach finds an efficient scheme to distribute different segments of the running application among the available computational resources. Such dynamic adaptability is also beneficial in achieving advantageous features like low-cost fault tolerance. This paper presents the operational and architectural details, improvements, constraints, and experimental results of the EvoMP.

1 Introduction

Contemporary digital design methods can be classified into three major orientations: pure hardware, reconfigurable hardware and microprocessor-based designs. The flexibility, simplicity of development, and short design time of microprocessor-based solutions have made them the most popular and widely applied design method. On the other hand, low performance is the main disadvantage of general-purpose microprocessors in comparison with the other approaches. The introduction of a large variety of techniques and architectures in the last two decades has led to great improvements in performance and a proportional growth in the hardware complexity of processors. However, these improvements seem to have saturated in recent years. The sequential essence of conventional processors and their software is one of the most restrictive constraints preventing parallel execution of code. Although some architectural techniques such as very-long-instruction-word (VLIW) have tried to address this issue, they could not meet the increasing demands on processing power.

The multi-processor approach is one of the most remarkable trends in the design of new high-performance computational systems [1]. The emerging MPSoC design field demonstrates the implied multi-processor orientation in embedded systems and system-on-chip (SoC) devices. MPSoC is a processor-centric solution, and therefore most of the desirable advantages of uniprocessors, such as short time-to-market, post-fabrication reusability, flexibility and programmability, are also achievable in MPSoC designs [2, 3]. However, moving from single-processor to multi-processor systems is accompanied by many challenges in hardware and software development. The most complicated design challenge in MPSoC systems is software development, due to the sequential nature of conventional programming models. Software developers have been trained for many decades to think about programs sequentially. However, multi-processor systems require concurrent software whose execution can be distributed among different processors. Approximately all existing software is developed using classical sequential

& The Institution of Engineering and Technology 2010

models [2]. Thus, in order to execute these programs in a multi-processor environment, they must first be converted to concurrent ones. In recent years, some research has focused on compile-time techniques that aim to perform this conversion automatically. Nevertheless, programming with parallel models is still the most commonly used approach to obtaining concurrent executable software. Parallel programming models are often supported by standard application programming interface libraries, such as MPI and OpenMP [4, 5]. But reprogramming all existing software for future MP systems would require a huge amount of investment. Furthermore, writing efficient programs using these libraries is much more complicated than classical sequential programming. Two activities are necessary for concurrent software generation: decomposition of the program into tasks, and scheduling of those tasks among the cooperating processors in the system. Both task decomposition and scheduling are NP-complete (non-deterministic polynomial time complete) problems and major issues for concurrent execution. Optimal decomposition of an application described in a serial manner into a set of concurrent tasks is a very difficult job, and there are still very few applications that can be decomposed automatically, despite many years of research in this field [2]. Another complicated challenge in such systems is task scheduling. All task scheduling mechanisms can be divided into static (programming-time or compile-time) and dynamic (run-time) categories [4]. The static scheduling approach can potentially find more optimal solutions than dynamic scheduling, because the programmer or compiler can see the entire application and compare different solutions, whereas a dynamic scheduler must make decisions instantaneously, according to the available resources and pending tasks.
On the other hand, the number of computational resources must be constant and predetermined in static scheduling. This means that this approach results in non-scalable software, whereas dynamic scheduling systems do not face such constraints [6-9]. Synchronisation of processing elements is another important issue in MPSoC design. Data dependencies between different tasks necessitate inter-processor communications [2]. These communications must be managed by an appropriate controlling mechanism. In static scheduling systems, control and synchronisation information is embedded in the software. In dynamic scheduling systems, a dedicated scheduler unit (which can be implemented in hardware, middleware or the operating system) usually performs these activities. Debugging, security and the lack of a design methodology for on-chip interconnection networks are other challenges facing MPSoC designs [1-3]. The remarkable advantages of adaptive and dynamic MPSoC systems have motivated much research on novel architectures and techniques for the development of such systems in recent years [10-14]. This paper introduces a novel homogeneous MPSoC architecture that can perform all activities necessary for parallel execution of a program dynamically, in hardware. In other words, this system accomplishes task decomposition and scheduling, and also addresses data dependency requirements, at run-time through hardware mechanisms. An evolutionary algorithm (EA) hardware core is exploited to perform both task decomposition and scheduling; therefore, this system is called the evolvable multi-processor (EvoMP) system. Hence, existing classical sequential software can be effectively partitioned and mapped onto different processors automatically. One of the main goals of these novelties is to find a solution that avoids the huge investment required to reprogram available software.
Furthermore, all controlling and synchronisation operations are distributed among the processing elements. Run-time parallelisation gives this system adaptability and flexibility. Low-cost fault tolerance is another advantageous feature of EvoMP that benefits from this adaptability. The presented version of the EvoMP uses a 2-D mesh topology and utilises a network-on-chip (NoC) for interconnections. The size of each dimension can be simply determined by setting a configurable parameter. The EvoMP uses a shared data memory; accesses to this memory are also accomplished via the NoC. This paper presents the primary version of EvoMP, which uses a genetic algorithm (GA). For this purpose, a custom hardware core is designed and exploited for the GA computations. In [15-17], GA is also used for task scheduling, but at compile time and in static scheduling systems. The EvoMP system is designed and implemented in RT-level VHDL. Subsequent sections of this paper are organised as follows. Section 2 describes the EvoMP system architecture and some of its major constitutive units. The architecture of each processor is explained in Section 3, which also clarifies the principles of operation of the entire system. The scope of the work and experimental results are presented in Section 4, and finally Section 5 contains the conclusions and future work.

2 EvoMP system architecture

The EvoMP utilises a GA core to perform both task decomposition and scheduling simultaneously at run time. The genetic core generates an encoded bit string (chromosome) that contains the decomposition and scheduling information; that is, this bit string determines the processor in charge of executing each instruction in the programme. These data are received and used by all cooperating processors. The top view of the EvoMP system is shown in Fig. 1a. Evolutionary strategies obviously require enough time to evolve. Therefore this system is best suited to iterative programs, such as DSP applications, that perform constant computations on different data. When this system


Figure 1 EvoMP architecture


a Overview of EvoMP machine b NoC switch architecture

starts to execute such a program, the genetic core generates random data that result in a random decomposition and scheduling of instructions among processors. When all processors reach the end of the iteration, the genetic core looks at a dedicated counter, which counts clock cycles. At the end of an iteration, the output of this counter shows the number of clock cycles taken to execute the entire recent iteration. This value is used as the fitness value for the corresponding chromosome generated by the genetic core. Then the counter is reset, the genetic core generates the next chromosome, and the system starts execution of the next iteration with a new parallelisation scheme. After a few initial random chromosomes (the first population), the genetic core goes to the evolution state, in which new chromosomes are generated by recombination of the best-found solutions of previous generations with random data. This process is repeated until the core finds an appropriate solution for task decomposition and scheduling. Thereafter, the genetic core goes from the evolution to the termination state, where the best-found solution is used as the constant output of the genetic core. Hence, the EvoMP does not require prior information about task decomposition and scheduling of the target program, but it needs a primary evolution time to find the best decomposition and scheduling solution according to the number of available computational resources and the running program.
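The initialise/evolution/termination flow described above can be sketched in software. In this simplified Python model, `run_iteration` is a stand-in for executing one complete program iteration in hardware and returning the Clock-Count value, which serves directly as the (to-be-minimised) fitness; the chromosome length, population size and patience threshold are illustrative assumptions, not EvoMP parameters.

```python
import random

def crossover(a, b, rng):
    """Single-point crossover of two chromosomes (lists stand in for bit strings)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve_schedule(run_iteration, pop_size=8, patience=3, seed=0):
    """Behavioural sketch of the genetic core's control flow.
    run_iteration(chromosome) -> clock-cycle count for one iteration."""
    rng = random.Random(seed)
    rand_chrom = lambda: [rng.randint(0, 255) for _ in range(16)]

    # Initialise state: evaluate a first population of random chromosomes.
    scored = sorted((run_iteration(c), c) for c in
                    [rand_chrom() for _ in range(pop_size)])
    best, stall = scored[0], 0

    # Evolution state: children come from elite x elite, elite x random,
    # or pure random data, mirroring the core's three production ways.
    while stall < patience:
        elite1, elite2 = scored[0][1], scored[1][1]
        children = [crossover(elite1, elite2, rng),
                    crossover(elite1, rand_chrom(), rng)]
        children += [rand_chrom() for _ in range(pop_size - 2)]
        scored = sorted((run_iteration(c), c) for c in children)
        if scored[0][0] < best[0]:
            best, stall = scored[0], 0   # new best chromosome found
        else:
            stall += 1                   # no improvement this generation

    # Termination state: the best chromosome is used for all later iterations.
    return best
```

Note that, as in the hardware, the two best chromosomes found so far are retained separately from the working population, and the search stops once the best solution has not changed for a given number of generations.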

The most important remaining problem is data dependency. When the program is divided into tasks that must be executed on different processors, the data dependency requirements between these tasks must be met by appropriate inter-processor communications. Control of these communications is fully distributed in EvoMP; that is, each processor automatically detects and sends its own data to all processors that need them. The functionality of this system is tightly coupled with the following key point: all processors have a dedicated copy of the program in their internal instruction memory. Accordingly, all processors have enough information to recognise the processor on which each instruction must be executed. Hence, they can detect any dependency on their locally computed data and do not require a request-response scheme. The architectural details of the mechanism utilised to meet the dependency requirements are presented in Section 3.
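The key point above, that every processor holds the full program and schedule and can therefore determine cross-processor dependencies locally, can be illustrated with a small Python sketch (the dictionary-based encodings of the program and the assignment are hypothetical, chosen only for illustration):

```python
def dependency_sends(assignment, program):
    """Sketch of the distributed dependency rule: the owner of a producing
    instruction determines, from its local copy of program and schedule,
    which other processors consume its result.

    program:    instruction ID -> tuple of source-operand instruction IDs
    assignment: instruction ID -> processor address
    Returns the set of (value_id, from_proc, to_proc) transfers required."""
    sends = set()
    for instr_id, src_ids in program.items():
        for src in src_ids:
            producer, consumer = assignment[src], assignment[instr_id]
            if producer != consumer:      # cross-processor dependency
                sends.add((src, producer, consumer))
    return sends
```

Since every node evaluates the same rule over the same data, the producer can push each value to its consumers without any request-response traffic.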

2.1 Inter-processor communication scheme


NoC is an advanced SoC interconnection architecture. Enhanced performance and scalability are the main advantages of NoC in comparison with previous communication architectures (e.g. shared buses, or segmented buses with bridges) [18-21]. The EvoMP system exploits a custom-designed NoC with a simple XY routing algorithm. The architectural overview of the designed switch is shown in Fig. 1b. This architecture utilises the globally-asynchronous locally-synchronous approach to prevent probable hazards caused by the clock skew problem. The data output port of the overall system is connected to the memory management unit (MMU in Fig. 1a), and the output values are sent to this port through the NoC. The bit length of the flits (i.e. the bit length of the communication links between switches) is one of the configurable parameters of the EvoMP. As shown in Fig. 1b, a shared bus is used for communication between the input and output ports in the NoC switch, in order to reduce the hardware area of the switches. The highest-priority input module that contains a new packet obtains control of the bus and holds it until all flits of the current packet are sent to the destination output port. Simulations have confirmed that the throughput of this switch architecture properly meets the throughput requirements of the system.

2.2 Encoded decomposition and scheduling data format

The decomposition and scheduling data generated by the genetic core consist of scheduling words. Each word determines the processor in charge of executing a run of successive instructions in the program. A scheduling word consists of Proc_Addr and Instr_num fields: Proc_Addr indicates the target processor address and Instr_num specifies the number of instructions that must be executed on it. The maximum number of instructions scheduled by one such word (i.e. the maximum number of instructions in an individual task) depends on the bit length of the Instr_num field. This bit length is also a configurable parameter in the EvoMP system. Fig. 2 illustrates sample scheduling data for the 2-tap finite impulse response (FIR) filter program and the corresponding parallelisation scheme.

2.3 Memory organisation

All multi-processor systems can be divided into two categories: shared and distributed data memory systems. A comprehensive comparison between these two approaches can be found in [1, 4]. The EvoMP system utilises the shared memory scheme. The memory accesses are performed through the NoC to help the scalability of the system. For this purpose, address 00 in the mesh (as shown in Fig. 1a) is dedicated to the data memory and MMU, and no processing or computational circuit exists in this unit. As stated earlier, the mesh size (number of contributing processors) is configurable in this system. The only address valid in all mesh sizes is 00, and therefore this position is selected for the data memory. The MMU also has an internal instruction memory and a scheduling-data FIFO (first-in first-out memory). This unit reads the instruction memory in order to find Load and Store instructions. It then reads the data word addressed by each Load instruction from the data memory and sends it to the processor responsible for that instruction. For a Store instruction, the processor sends the data and the write address to the MMU via a packet. Only Store instruction packets are received by the MMU; it manages and then writes the received data to the memory. In the data management phase, read and write addresses must be compared in order to prevent data dependency hazards, including read-after-write, write-after-write and write-after-read. If the size of the mesh increases, the traffic near the MMU NoC switch becomes a performance bottleneck. Our simulations show that this limiting issue appears only when the system is large enough (often in 4 × 4 or greater sizes, depending on the application). This is the main obstacle to scalability of the current system, and we hope to eliminate it by designing a distributed-physical-memory version of the EvoMP in the near future.

2.4 Genetic core architecture

EAs are population-based metaheuristic search methods inspired by biological evolution in living organisms. Candidate solutions are usually encoded as vectors of numbers (commonly binary). Each population comprises a fixed number of candidates. The fitness function quantitatively evaluates the suitability of each candidate. Promising candidates are selected and kept for subsequent generations, and poor ones are weeded out. GA is the most popular EA and is the one used in the presented version of the EvoMP system. The candidate solutions are called chromosomes in GA. The first generation of solutions is chosen randomly. After fitness evaluation, the elites are selected. In subsequent generations, the chromosomes are created through a process of selection and recombination.

Figure 2 Sample of scheduling data (chromosome) for 2-tap FIR filter program and corresponding execution scheme
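The packed scheduling-word format illustrated in Fig. 2 can be modelled in a few lines of Python (a sketch only; the 4-bit Instr_num width is an example value for what is, in EvoMP, a configurable parameter):

```python
def decode_schedule(words, instr_num_bits=4):
    """Expand packed scheduling words into a per-instruction processor map.
    Each word is a (Proc_Addr, Instr_num) pair meaning 'the next Instr_num
    instructions run on processor Proc_Addr'."""
    max_run = (1 << instr_num_bits) - 1   # largest run one word can encode
    per_instr = []                        # processor address per instruction
    for proc_addr, instr_num in words:
        if not 1 <= instr_num <= max_run:
            raise ValueError("Instr_num out of range for field width")
        per_instr.extend([proc_addr] * instr_num)
    return per_instr
```

For example, `decode_schedule([(1, 3), (2, 2)])` assigns the first three instructions to processor 1 and the next two to processor 2.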

Recombination (crossover) operators merge the information contained within pairs of selected parents by placing random subsets of the information from both parents into their respective positions in a child. Owing to the random factors involved in producing children chromosomes, the children may, or may not, have higher fitness values than their parents. In this way, the generations gradually move towards better regions of the search space [22]. As mentioned earlier, a hardware genetic core is designed and exploited to perform the decomposition and scheduling activities in the EvoMP. The genetic core hardware area must be considered an overhead in this system; therefore we have tried to design a low-complexity architecture for this core. Each chromosome consists of some decomposition and scheduling words that must be sent to the processors. Fig. 3a shows the internal architecture of the designed core. A chromosome memory is used to store all chromosomes of each population. When the system starts to work, the genetic core is in the initialise state, in which the output words are generated randomly using a dedicated linear feedback shift register (LFSR). These words are generated and distributed in successive clock cycles at the beginning of each new iteration. The Instr_num field in these words is accumulated, and the result is compared with the program length stored in a register. When the accumulator value exceeds the program length, the chromosome is complete, because all instructions have been scheduled, and the core stops generating new words. Note that all output words are also stored in the chromosome memory. The core then waits for the End_Loop signal, which is only activated when all contributing processors reach the end of an iteration. The genetic core then looks at the Clock-Count counter, which shows the number of clock cycles spent executing the completed iteration. This value is used as the fitness value of the corresponding chromosome.
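The word-generation step of the initialise state can be sketched as follows (a Python model; a seeded pseudo-random generator stands in for the LFSR, and the field widths are illustrative):

```python
import random

def generate_schedule_words(rng, program_length, n_procs, instr_num_bits=4):
    """Emit random scheduling words while accumulating their Instr_num
    fields against the stored program length; generation stops once the
    accumulated total covers every instruction, mirroring the
    accumulator/comparator described above."""
    words, covered = [], 0
    max_run = (1 << instr_num_bits) - 1
    while covered < program_length:       # accumulator vs. program length
        proc = rng.randrange(n_procs)     # stand-in for LFSR output
        run = rng.randint(1, max_run)
        words.append((proc, run))
        covered += run
    return words
```

The final word may overshoot the program length slightly, just as the hardware stops only once the accumulator exceeds the stored length.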
Two storage components (called Best_Chrom in Fig. 3a) always store the fitness value and starting memory address of the two best chromosomes found so far (i.e. the elite count is constant and equal to two in this core). The Clock-Count is then reset to zero for the next iteration. When the first population has been completed, the genetic core goes to the evolution state, in which the new population is generated by the recombination operators in the crossover module. The internal architecture of this module is depicted in Fig. 3b. In the recombination state, new chromosomes can be produced in one of the following three ways: 1. by crossover between the two best found chromosomes (elites), 2. by crossover between one of the best chromosomes and random data generated by the LFSR, 3. as a pure random chromosome generated by the LFSR. There are some configurable parameters in this core, including the population size (number of chromosomes in a population) and the number of chromosomes that must be generated by each of the above approaches in each population. These parameters affect the evolution speed. The evolution process continues until the termination condition is met. The core state then changes to the termination state, in which the best chromosome obtained in the evolution phase is permanently used for decomposition and scheduling. Thus, the execution time for all iterations is constant in this state. The termination condition can be one of the following options: 1. the best achieved chromosomes have not changed for a predetermined number of populations,

Figure 3 Internal architecture of


a Genetic core b Crossover unit


2. a solution is found that satisfies predetermined criteria, 3. a fixed number of generations is reached. The first option is exploited in the current version of the EvoMP. Note that the execution time of each iteration varies during the evolution phase. Thus, this phase is problematic in applications in which the response time is strictly bounded. On the other hand, more evolution time obviously leads to better solutions and better results. The genetic core remains in the termination state as long as there is no faulty processor. When a fault is detected in a processor, no task will be assigned to that processor in future iterations; in effect, it is eliminated from future computation. The genetic core therefore returns to the evolution phase to find an appropriate solution for the new situation. An online built-in self-test technique is used for fault detection in the processors. There are sometimes invalid processor addresses in this system. For example, assume that a 2 × 3 mesh of processors is instantiated. At least three bits are needed to address these processors (one bit for rows and two bits for columns). But as there are only three columns, two addresses (i.e. 111 and 011) are invalid. Hence, these invalid addresses must not appear in the Proc_Addr field of the output words of the genetic core. The Convert_Address unit (shown in Fig. 3a) is used to map invalid addresses to valid ones. A simple and scalable logic is used for this address conversion. This unit also contains an address mapping table, which plays an important role in the fault tolerance scheme utilised in the EvoMP system.

2.5 Fault tolerance scheme

Low-cost fault tolerance is one of the features of the EvoMP that benefit from its adaptability. Different fault tolerance techniques have been proposed in the literature for various types of MP systems. These mechanisms can be divided into two major categories. The first category consists of fault tolerance techniques for homogeneous reconfigurable hardware architectures and static-scheduling multi-processor systems, in which the number of operational processing elements and their tasks are predetermined. Thus, fault tolerance can only be achieved with dedicated spare PEs; the faulty PEs are replaced by spare ones to perform their assigned tasks [23]. For example, [24] uses spare modules (molecules, which are simple reconfigurable hardware units) in each computational cell of the Embryonics project. Triple modular redundancy (TMR) and similar redundancy-based schemes are also used for fault tolerance in the bio-inspired POEtic tissue project [25]. The second category of these mechanisms is focused on dynamic task scheduling systems, in which the tasks are distributed among all operational resources and, whenever a PE becomes faulty, the scheduler stops assigning new tasks to it. A major issue in such systems is the existence of centralised scheduler and controlling units, because if a fault occurs in one of these components, the entire system fails [26, 27].
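The mapping-table update that underlies the EvoMP fault-handling step can be sketched in Python (the dictionary encoding of the Convert_Address table is an assumption made for illustration):

```python
import random

def remap_on_fault(table, faulty_addr, rng):
    """When a fault is detected, redirect every table entry that points at
    the faulty processor to a randomly chosen healthy address, so later
    scheduling words never name the faulty node.
    table: raw genetic-core address -> valid processor address."""
    healthy = sorted(set(table.values()) - {faulty_addr})
    return {raw: (rng.choice(healthy) if dst == faulty_addr else dst)
            for raw, dst in table.items()}
```

After this update the genetic core re-enters the evolution state, so the schedule re-adapts to the reduced set of processors.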

3 Architecture of each processor

This section describes the internal architecture of each processor in the EvoMP system. The main feature that distinguishes these processors from conventional ones is their capability for automatic data dependency detection and transmission of the corresponding data. As shown in Fig. 4, this architecture is a multi-functional-unit (multi-FU) design. A shared-bus scheme is used for data communication between the different FUs. The number and types of FUs can vary. Hence, adding a new instruction to this architecture can easily be achieved by designing the required hardware and exploiting it as a new FU; the communication scheme of this new FU must obviously be compatible with the others. Before studying the architecture, the dedicated machine-code style of the EvoMP system must be considered.


Figure 4 Internal block diagram of each EvoMP processor

3.1 EvoMP machine code style


Run-time detection of the data dependencies was the most complicated challenge in this work. It is achieved by combining a special machine-code style designed for the EvoMP with some architectural techniques. In the EvoMP machine-code style, each line of the program has a line number, called its ID. When an instruction requires a register as a source operand, the line number of the most recent instruction that modified this register is used instead of the register number. Assume that the following three instructions are a segment of a sample program. The left-side numbers represent the line number (ID) of each instruction.

10. ADD R1, R2, R3   ; R3 ← R1 + R2
11. AND R2, R6, R7   ; R7 ← R2 & R6
12. SUB R7, R3, R4   ; R4 ← R7 - R3

Accordingly, the R7 and R3 operands in the above code must be replaced by 11 and 10, respectively. Thus, the SUB instruction will be converted to the following line in EvoMP machine code:

12. SUB (11), (10), R4

The processor in charge of executing this instruction requests these IDs as operands. If they are also computed on this processor, they will be found in the register file; otherwise, the corresponding processors detect the dependency and send them, along with their IDs, to this processor. The ID number is also stored in the register file; in the above example, 12 will be saved in a dedicated position in the R4 register. The word length of the ID numbers must be large enough to identify all instructions of the program. The only remaining problem is data dependency between successive iterations. This problem is solved by adding another field to the ID numbers. This field specifies the iteration number and acts just like an iteration counter. The bit length of this field depends on the maximum distance between dependent instructions. In our experimental applications, one bit is sufficient, because two dependent instructions are at most one iteration apart (i.e. they are in the same or two successive iterations). Fig. 5a shows the assembly program of the 2-tap FIR filter. The data dependencies are illustrated by arrows in this figure. The only inter-iteration dependency is distinguished by a solid downward arrow. Fig. 5b shows the same program after applying the required changes (EvoMP code style). The ID of each instruction is equal to its address in the instruction memory, except for the initialisation section of the program, for which the IDs are specified in the header section of the code.
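The register-to-ID conversion above can be prototyped in a few lines of Python, using the paper's own three-instruction example (the tuple-based instruction encoding is an assumption made for illustration):

```python
def to_evomp_ids(program):
    """Replace each source register with the ID (line number) of the most
    recent instruction that wrote it, as in the EvoMP machine-code style.
    program: list of (id, op, src1, src2, dest) tuples with 'Rn' names;
    sources with no known producer in this fragment keep their names."""
    last_writer = {}                      # register name -> producer ID
    out = []
    for line_id, op, s1, s2, dest in program:
        s1 = last_writer.get(s1, s1)      # swap in producer ID if known
        s2 = last_writer.get(s2, s2)
        out.append((line_id, op, s1, s2, dest))
        last_writer[dest] = line_id       # this line now produces dest
    return out
```

Running it on the sample segment turns the SUB line into `SUB (11), (10), R4`, matching the conversion shown above.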

3.2 More detailed operational view of each processor


The internal architecture of each processor is represented in Fig. 4. The Fetch_Issue unit has access to both the instruction memory and the decomposition and scheduling data generated by the genetic core. This unit can determine the processor,

Figure 5 2-tap FIR filter assembly code in


a Regular style b EvoMP style


which is in charge of executing each instruction. Local instructions (instructions that must be executed on this processor) will appear on the Instr bus to be received and executed by an FU. When the highest-priority non-busy FU observes an executable instruction on the Instr bus, it sends a signal through the shared bus to the other FUs and the Fetch_Issue unit to inform them of the reception of the current instruction. The Fetch_Issue unit will read the next instruction when it receives this signal. A token-ring technique is utilised to specify the FU that must receive the pending instruction when more than one non-busy FU exists. Both operand IDs on the Instr bus are checked in the register file module. The data value corresponding to each existing ID is put on the R1_Data or R2_Data bus, and the receiving FU stores this value as an operand. All of these operations are performed in combinational circuits. If an operand is not found in the register file, the FU receives the instruction but the position of this operand remains empty, and the FU does not start the computation until the operand is received through Extra_Bus. Two reasons may cause an operand to be unavailable: (i) another processor possesses this operand and has not sent it yet, or (ii) another local FU is computing this operand. In both cases, the required value will appear on Extra_Bus sooner or later, and the pending FU grabs it immediately. This architecture also supports an in-order-issue, out-of-order-execution scheme; that is, the instructions appear on the Instr bus in the same order as their appearance in the program, but the execution times of different instructions may vary. Thus, the result of an instruction may be computed before that of a prior instruction. All types of data dependencies are addressed by appropriate hardware mechanisms in this architecture. The register file unit contains 15 register modules (R1-R15), each of which contains a register to store the ID and a register to store the value.
The Fetch_Issue unit exploits a dual-read-port instruction memory. The second output port is connected to the Invalidate_Instr bus (Fig. 4). This bus is used for invalidating register contents and for detecting data dependencies of other processors on local register data. All instructions in the program (local and non-local) appear on this bus one by one. Register modules monitor the destination field and ID of the Invalidate_Instr bus. When the destination register is occupied and the ID stored in the register is smaller than the ID field on the Invalidate_Instr bus, the register contents are invalidated, because this means a posterior instruction has been met whose destination is this register; the prior value of the register is therefore useless thereafter. Invalidate_Instr also carries the two operand IDs of the corresponding instruction. These IDs are used to detect the dependency of other processors on local data. All valid register modules compare these IDs with their own; on a match, the corresponding value is put on the Send1_Data or Send2_Data bus, and the NoC interface module sends it to the appropriate processor. Note that instructions on Invalidate_Instr are not going to be executed, so a single clock cycle is adequate to check them, whereas execution of the local instructions appearing on the Instr bus may require many clock cycles. However, PC2, the address of the Invalidate_Instr bus instruction, must never exceed PC1 (the address of the Instr bus instruction). The Fetch_Issue module also contains two FIFO memory units to store the encoded scheduling data (shown in Fig. 2). The processor does not start to work until the first scheduling word is received by the Scheduling Data FIFO. The output word of this FIFO determines whether or not the specified instruction must be executed on this processor.
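The invalidation rule and the dependency detection can be modelled together in a short sketch. This is an illustrative software model, not the register module's RTL; the dictionary layout and helper name are assumptions.

```python
def process_invalidate(regs, dest, instr_id, op_ids):
    """Apply one Invalidate_Instr bus cycle to a register file model.

    `regs` maps register name -> (stored_id, value) for valid registers.
    Dependency detection: any valid register whose stored ID matches one
    of the incoming operand IDs holds data another processor needs, so
    its value is collected for sending over the NoC.
    Invalidation: if `dest` is occupied and its stored ID is smaller
    than the incoming instruction ID, a posterior instruction overwrites
    it, so the old value is dropped.
    """
    to_send = []
    for stored_id, value in regs.values():
        if stored_id in op_ids:
            to_send.append(value)
    if dest in regs and regs[dest][0] < instr_id:
        del regs[dest]  # newer write to `dest` makes the old value useless
    return to_send
```

A single pass like this corresponds to the one clock cycle the paper says suffices to check each non-local instruction.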
If another processor is designated in this word, PC1 is advanced by the Instr_num field to bypass all of the specified instructions. The words read from the Scheduling Data FIFO are immediately stored in the Scheduled FIFO (Fig. 6). The output of this FIFO determines the address of the processor responsible for the current instruction on the Invalidate_Instr bus; the NoC interface module uses this value to recognise the destination address of the transmitted data. A low-complexity reconfigurable computing hardware is designed as a general FU that can perform different arithmetic and logical operations in different configurations. Fig. 7 shows the internal architecture of this FU. The R1 and R2 components in this figure store the data operands of the issued instruction. The multiply operation is realised with add and shift in this architecture. Table 1 lists the EvoMP instruction set supported by the presented FU architecture. Note that Load and Store are the only instructions that are not executed in FUs. There is also an immediate version of each instruction, in which the second operand is an immediate value.

4 Experimental results

This section presents the target application domain and the performance evaluation results of the EvoMP system. Four representative DSP programs were developed and used as experimental applications. These sample applications are executed on EvoMP systems of different sizes. Three other decomposition and scheduling schemes are also implemented on EvoMP and evaluated for comparison. The same applications are also executed on a NIOS II soft core to make the evaluation more comprehensive.

4.1 Scope of work


The EvoMP system can be used efficiently for iterative applications, as described in [6, 28]. The only crucial requirement on such applications is that their main part has to be executed iteratively a considerable number of times, because the system is efficient only when the number of iterations required by the evolution phase is negligible in comparison with the total number of similar iterations. This is especially the case for applications that perform an identical computation on different data samples. Furthermore, the EvoMP is not suitable for applications in which numerous forward jumps take place (e.g. control applications), as these jumps would affect the execution time and the fitness value, and thus lead to problems in fitness evolution. Backward jumps (loops) are also better unrolled at compile time to permit decomposition and scheduling to take place precisely.

At first glance, it seems that any conditional jump may lead to unfairness in fitness evaluation, because conditional jumps may be taken or not taken in different iterations, and thus the number of instructions executed in different iterations is not necessarily equal. This is specifically the case in multimedia applications, in which the computations highly depend on the input data. A primary solution is to consider the number of executed instructions in the fitness evaluation. In this way, the size of the accomplished computations is also measured and used for fitness estimation in order to achieve fairness.

In the embedded systems area, the EvoMP can be used in applications where the same computation is performed on a stream of data inputs. Encoders and decoders, signal processing applications in communication systems, encryption and decryption standards, and packet processing in network systems are some example application domains of the EvoMP system. The configuration of the EvoMP (including the number of contributing processors) must be selected properly to meet the processing power requirements of the target application. Note that activities like coding-style conversion and loop unrolling do not require high-level analysis and can be accomplished by simple compile-time algorithms or even by object-code modification.

Figure 6 Internal architecture of Fetch_Issue unit (general view)

Figure 7 Internal architecture of exploited configurable FU

Table 1 Instruction set of the EvoMP

Instruction          | Instruction category
load/store           | memory
MOV                  | data movement
shift/rotate left    | shift and rotate
shift/rotate right   | shift and rotate
AND/OR/XOR/NOT       | logical
ADD/SUB/MUL          | arithmetic
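The instruction-count fairness adjustment described in Section 4.1 can be sketched as a simple normalisation. This formula is a hypothetical illustration of the idea, not the paper's actual fitness estimator.

```python
def fair_fitness(cycles, executed_instrs, nominal_instrs):
    """Scale the measured cycle count of one iteration so that
    iterations that skipped instructions (not-taken conditional jumps)
    are not unfairly rewarded with a lower fitness value.

    `nominal_instrs` is the full instruction count of one iteration;
    `executed_instrs` is how many were actually executed this time.
    """
    if executed_instrs == 0:
        raise ValueError("an iteration must execute at least one instruction")
    return cycles * nominal_instrs / executed_instrs
```

An iteration that executed only half the program in 100 cycles is thus rated the same as one that executed the whole program in 200 cycles.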

4.2 Configuration of the experimental environment


The EvoMP system has some configurable parameters that affect the performance and implementation results. Hence, these parameters and their values in the experimental environment must be set carefully based on the running application. Table 2 lists these parameters and the configuration values used for our experimental applications. These values were selected according to the best simulation results obtained after testing different configurations. The mesh size is another configurable parameter that is varied in our experiments and is therefore not listed in Table 2.

Table 2 Configurable parameters of the EvoMP and their values in each experiment

Parameter     | FIR16 | DCT8 | DCT16 | MATRIX55 | Description
Word_Len      | 16    | 16   | 16    | 16       | processor word length
FU_num        | 1     | 1    | 1     | 1        | number of instantiated FUs in each processor
Flit_Word_Len | 16    | 16   | 16    | 16       | bit-length of connection links between NoC switches
Pop_Size      | 8     | 16   | 16    | 16       | number of chromosomes in each population
Cross_Rate1   | 1     | 1    | 2     | 2        | chromosomes per generation produced by crossover between the best found chromosomes
Cross_Rate2   | 4     | 4    | 8     | 8        | chromosomes per generation produced by crossover between a random chromosome and the best chromosome
Rand_Size     | 3     | 11   | 6     | 6        | chromosomes per generation produced randomly

Figure 8 Evolution-phase best chromosome fitness value (number of clock cycles required for execution of each iteration) in different EvoMP sizes for
a 16-tap FIR filter
b 8-point discrete cosine transform
c 16-point discrete cosine transform
d 5 × 5 matrix multiplication

Table 3 Fitness value of the final best chromosome (in clock cycles) and the corresponding speed-up and evolution time for four decomposition and scheduling schemes using different numbers of processors

                                        FIR-16       DCT-8      DCT-16     MATRIX-5×5
number of instructions                  74           88         324        406
number of multiply instructions         16           32         128        125
NIOS II: required clock cycles          510          810        3452       3838

1 processor (1×2), all three schemes
  fitness (clock cycles)                350          671        2722       3181
  speed-up                              1            1          1          1

2 processors (1×3)
  presented EvoMP: fitness              214          403        1841       2344
    speed-up                            1.63         1.66       1.47       1.37
    evolution time (us)                 27 342       42 807     74 582     198 384
  SDGS: fitness                         202          401        1812       2218
    speed-up                            1.73         1.67       1.50       1.43
    evolution time (us)                 1967         29 315     84 365     65 119
  first free: fitness                   293          733        2529       2487
    speed-up                            1.19         0.91       1.08       1.27
  pure random: fitness                  306          656        2441       2655
    speed-up                            1.14         1.022      1.11       1.19

3 processors (2×2)
  presented EvoMP: fitness              171          319        1460       1868
    speed-up                            2.04         2.10       1.86       1.70
    evolution time (us)                 30 174       54 790     23 319     294 828
  SDGS: fitness                         161          306        1189       1817
    speed-up                            2.17         2.19       2.28       1.75
    evolution time (us)                 10 739       52 477     536 565    10 092
  first free: fitness                   239          681        1933       2098
    speed-up                            1.46         0.98       1.40       1.51
  pure random: fitness                  291          589        2213       2492
    speed-up                            1.20         1.13       1.23       1.27

4 processors (2×3)
  presented EvoMP: fitness              unevaluated  285        1213       1596
    speed-up                            -            2.33       2.25       1.99
    evolution time (us)                 -            93 034     630 482    546 095
  SDGS: fitness                         -            256        1106       1575
    speed-up                            -            2.62       2.46       2.01
    evolution time (us)                 -            41 023     111 118    178 219
  first free: fitness                   -            496        1587       1815
    speed-up                            -            1.35       1.71       1.75
  pure random: fitness                  -            500        1837       2176
    speed-up                            -            1.34       1.48       1.46

Table 4 Synthesis results of a 2×2-mesh EvoMP system on a XC2V3000 FPGA

Module        | Area (total LUTs)
NoC switch    | 741 (2%)
Genetic core  | 1891 (6%)
MMU           | 3612 (12%)
Processor     | 4583 (15%)
Total system  | 12 877 (max freq. 92.4 MHz)

The following equation illustrates the relation between the genetic core parameters described in Table 2

Pop_Size = Cross_Rate1 + Cross_Rate2 + Rand_Size    (1)
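Equation (1) states that every new generation is fully accounted for by its three sources. A minimal sketch of that breeding policy follows; the single-point crossover and random-chromosome helpers are illustrative placeholders, not the genetic core's actual operators.

```python
import random

def next_generation(best, second_best, cross_rate1, cross_rate2,
                    rand_size, chrom_len):
    """Produce Pop_Size = Cross_Rate1 + Cross_Rate2 + Rand_Size chromosomes."""
    def crossover(a, b):
        cut = random.randrange(1, chrom_len)  # single-point crossover
        return a[:cut] + b[cut:]

    def random_chrom():
        return [random.randint(0, 1) for _ in range(chrom_len)]

    pop = []
    # crossover between the best found chromosomes
    pop += [crossover(best, second_best) for _ in range(cross_rate1)]
    # crossover between a random chromosome and the best one
    pop += [crossover(random_chrom(), best) for _ in range(cross_rate2)]
    # purely random chromosomes keep exploring the search space
    pop += [random_chrom() for _ in range(rand_size)]
    return pop

# DCT16/MATRIX55 configuration from Table 2: 2 + 8 + 6 = 16 chromosomes
pop = next_generation([0] * 8, [1] * 8, 2, 8, 6, 8)
assert len(pop) == 16
```

The random component (Rand_Size) prevents the population from collapsing onto a local optimum, while the two crossover sources exploit the best decomposition/scheduling schemes found so far.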

4.3 Simulation and synthesis results


Fig. 8 shows the fitness value (number of clock cycles required to execute each iteration) of the best found chromosome at each instant of the evolution phase, for different EvoMP sizes. Note that the execution time in the single-processor EvoMP is constant, as decomposition and scheduling are meaningless in such a system. These results confirm the applicability of the approach, because increasing the number of processors yields better results. This means that the expected resource utilisation is obtained without noticeable changes to the sequential code development process. Consider that increasing the program length and the number of contributing processors also increases the time required for evolution, obviously because of the larger decomposition and scheduling search space. Exploiting a more advanced and dedicated heuristic method for decomposition and scheduling (instead of the current pure genetic architecture) could reduce the required evolution time.

As illustrated in Table 3, the final achieved speed-up gradually saturates as the number of processors in the system increases. The remarkable growth of the decomposition and scheduling search space, the restrictive data dependencies in the program, and the increasing communication cost are the main reasons for this phenomenon. Simultaneous exploration of an efficient solution for decomposition and scheduling requires searching a very large space.

As mentioned in previous sections, many dynamic task scheduling architectures and techniques have been introduced in the literature, but run-time task decomposition is novel to EvoMP. Comparison between the presented version of EvoMP and other approaches is necessary to prove its applicability. Hence, we have designed three other schedulers with a predetermined (static) task decomposition scheme to make the comparison feasible. The first scheduler uses the GA and the second utilises the classical First Free (FF) [9] algorithm for dynamic task scheduling. The third is a pure random scheduler. All of them use a predetermined decomposition scheme (manually specified inside the program). The simplicity of the developed experimental programs allowed us to partition them into small tasks manually in an efficient way. The genetic core in the presented EvoMP architecture (studied in Section 2.4) is replaced by these new schedulers.

The static decomposition and genetic scheduler (SDGS) is a previously introduced approach [29] in which the genetic core only performs the task scheduling; the task decomposition scheme is determined statically. The FF approach is a simple and well-known scheduling scheme: it starts from address 01 and selects the first free node able to execute the first pending task in the job queue. This scheduler neither saves its decisions nor receives any feedback about the efficiency of its prior decisions [9]. In the pure random approach, the scheduling scheme differs in each iteration; thus, we have used the mean value (number of clock cycles per iteration) over 1000 iterations as the result. The simulation results of these three schedulers are given in Table 3. As shown, the final results of both the proposed EvoMP and SDGS are much better than those of the FF and pure random approaches. Furthermore, the results of SDGS are better than those of the proposed scheme (better fitness values achieved in less evolution time), obviously because of its much smaller search space. However, note that static decomposition necessitates parallel programming or compile-time task decomposition, which equates to the loss of sequential program execution capability.

The NIOS II processor is also exploited to execute the same applications. The number of clock cycles required to execute one iteration of each application on this processor is measured and presented in Table 3 to make the results more comprehensive. Note that we have used a NIOS configuration that has a hardware multiplier. The EvoMP is completely implemented in RT-level VHDL. Table 4 shows the synthesis results of a 2×2-mesh system on a Xilinx Virtex-II XC2V3000 FPGA.
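The FF policy described above can be sketched in a few lines. The node addressing and the capability predicate are assumptions for illustration; the actual mapper in [9] targets heterogeneous NoC-based MPSoCs.

```python
def first_free(nodes, task):
    """Classical First Free mapping: scan nodes in address order,
    starting from the lowest address, and return the first free node
    able to execute `task`.

    `nodes` maps address -> {'free': bool, 'ops': set of supported
    operations}. No history is kept and no feedback is used, matching
    the FF scheduler's stateless behaviour.
    """
    for addr in sorted(nodes):
        node = nodes[addr]
        if node['free'] and task in node['ops']:
            node['free'] = False  # node becomes busy with this task
            return addr
    return None  # no free capable node: the task waits in the job queue
```

Because the scan order is fixed and no cost model is consulted, FF tends to pile work onto low-address nodes, which is one reason it trails the genetic schedulers in Table 3.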

5 Conclusions and future works

This paper presented the EvoMP, a novel NoC-based MPSoC system with dynamic task decomposition and scheduling capability. Conventional sequential programs can be executed efficiently on this system. EvoMP exploits a hardware genetic core for run-time task decomposition and scheduling operations. A special architecture is also designed for each processor to achieve automatic detection and alleviation of data dependencies. The centralised memory is the main bottleneck for the scalability of the EvoMP, while low-cost fault tolerance is a beneficial feature of the system. EvoMP is suitable for applications that perform a unique computation on a huge amount of data or a data stream. The operational mechanism, architecture,

advantages and challenges of the system are presented in this paper. The experimental results also confirm the applicability of EvoMP's novel ideas. Note that the final goal of the authors in this research was the presentation and demonstration of the applicability of novel ideas in designing an MPSoC system. These ideas can be utilised in future MP-based high-performance computing architectures. The centralised physical memory causes some scalability issues, so designing a distributed-physical-memory version of EvoMP is a useful future work. Note that distributing the address space seems to be impossible, but techniques like distributed shared memory, which keep the address space shared while distributing the physical memory [30], can be useful in this system. A pure genetic core is used in the presented version of the EvoMP. The authors believe that the performance of this system can be improved by the design and utilisation of a more dedicated heuristic algorithm; utilisation of such techniques is another beneficial future work.

6 References

[1] JERRAYA A.A., WOLF W.: 'Multiprocessor systems-on-chips' (Morgan Kaufmann Publishers, 2005, 1st edn.)

[2] MARTIN G.: 'Overview of the MPSoC design challenge'. Proc. Design and Automation Conf., San Francisco, USA, July 2005, pp. 274-279

[3] WOLF W.: 'The future of multiprocessor systems-on-chips'. Proc. Int. Design Automation Conf., San Diego, USA, June 2004, pp. 681-685

[4] PARHAMI B.: 'Introduction to parallel processing: algorithms and architectures' (Kluwer Academic Press, 1999, 1st edn.)

[5] EL-REWINI H., ABD-EL-BARR M.: 'Message passing interface (MPI)', in 'Advanced computer architecture and parallel processing' (Wiley, 2005, 1st edn.), pp. 205-233

[6] PARHI K., MESSERSCHMITT D.: 'Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding', IEEE Trans. Comput., 1991, 40, pp. 178-195

[7] MANIMARAN G., MURTHY C.S.R.: 'An efficient dynamic scheduling algorithm for multiprocessor real-time systems', IEEE Trans. Parallel Distrib. Syst., 1998, 9, (3), pp. 312-319

[8] KHAN A.A., MCCREARY C.L., JONES M.S.: 'A comparison of multiprocessor scheduling heuristics'. Proc. Int. Conf. Parallel Processing, 1994, pp. 243-250

[9] CARVALHO E., CALAZANS N., MORAES F.: 'Heuristics for dynamic task mapping in NoC-based heterogeneous MPSoCs'. Proc. Int. Rapid System Prototyping Workshop, 2007, pp. 34-40

[10] HUBNER M., PAULSSON K., BECKER J.: 'Parallel and flexible multiprocessor system-on-chip for adaptive automotive applications based on Xilinx MicroBlaze soft-cores'. Proc. Int. Symp. Parallel and Distributed Processing, Washington, DC, USA, 2005, p. 149.1

[11] GOHRINGER D., HUBNER M., SCHATZ V., BECKER J.: 'Runtime adaptive multi-processor system-on-chip: RAMPSoC'. Proc. Int. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7

[12] LIANG-THE L., CHIA-YING T., SHIEH-JIE H.: 'An adaptive scheduler for embedded multi-processor real-time systems'. Proc. IEEE TENCON Conf., October 2007, pp. 1-6

[13] MALANI P., MUKRE P., QIU Q., WU Q.: 'Adaptive scheduling and voltage scaling for multiprocessor real-time applications with non-deterministic workload'. Proc. Design, Automation and Test Conf. in Europe, April 2007, pp. 652-657

[14] KLIMM A., BRAUN L., BECKER J.: 'An adaptive and scalable multiprocessor system for Xilinx FPGAs using minimal sized processor cores'. Proc. Parallel and Distributed Processing Symp., April 2008, pp. 1-7

[15] ZOMAYA A.Y., WARD C., MACEY B.: 'Genetic scheduling for parallel processor systems: comparative studies and performance issues', IEEE Trans. Parallel Distrib. Syst., 1999, 10, pp. 795-812

[16] YI-WEN Z., JIAN-GANG Y.: 'A genetic algorithm for tasks scheduling in parallel multiprocessor systems'. Proc. Int. Conf. Machine Learning and Cybernetics, November 2003, pp. 1785-1790

[17] HOU E., ANSARI N., REN H.: 'A genetic algorithm for multiprocessor scheduling', IEEE Trans. Parallel Distrib. Syst., 1994, 5, (2), pp. 113-120

[18] LEE S.J., LEE K., YOO H.J.: 'Analysis and implementation of practical, cost-effective networks on chips', IEEE Des. Test Comput., 2005, 22, (5), pp. 422-433

[19] BJERREGAARD T., MAHADEVAN S.: 'A survey of research and practices of network-on-chip', ACM Comput. Surv., 2006, 38, pp. 1-54

[20] FREEH V.W., BLETSCH T.K., RAWSON F.L.: 'Scaling and packing on a chip multiprocessor'. Proc. Parallel and Distributed Processing Symp., March 2007, pp. 1-8

[21] RUGGIERO M., GUERRI A., BERTOZZI D., POLETTI F., MILANO M.: 'Communication-aware allocation and scheduling framework for stream-oriented multi-processor systems-on-chip'. Proc. Design, Automation and Test Conf. in Europe, March 2006, pp. 6-12

[22] MONTAZERI F., SALMANI-JELODAR M., FAKHRAIE S.N., FAKHRAIE S.M.: 'Evolutionary multiprocessor task scheduling'. Proc. Int. Symp. Parallel Computing in Electrical Engineering, 2006

[23] OBERMAISSER R., KRAUT H., SALLOUM C.: 'A transient-resilient system-on-a-chip architecture with support for on-chip and off-chip TMR'. Proc. Int. Dependable Computing Conf., 2008, pp. 123-134

[24] CANHAM R., TYRRELL A.: 'An embryonic array with improved efficiency and fault tolerance'. Proc. NASA/DoD Conf. Evolvable Hardware, July 2003, pp. 265-272

[25] BARKER W., HALLIDAY D.M., THOMA Y., ET AL.: 'Fault tolerance using dynamic reconfiguration on the POEtic Tissue', IEEE Trans. Evol. Comput., 2007, 11, (5), pp. 666-684

[26] MANIMARAN G., MURTHY C.S.R.: 'A fault-tolerant dynamic scheduling algorithm for multiprocessor real-time systems and its analysis', IEEE Trans. Parallel Distrib. Syst., 1998, 9, (11), pp. 1137-1152

[27] BEITOLLAHI H., DECONINCK G.: 'Fault-tolerant partitioning scheduling algorithms in real-time multi-processor systems'. Proc. Pacific Rim Symp. Dependable Computing, December 2006, pp. 296-304

[28] JAGADISH H.V., KAILATH T.: 'Multiprocessor implementation models for adaptive algorithms', IEEE Trans. Signal Process., 1996, 44, (9), pp. 2319-2331

[29] PAGE A.J., NAUGHTON T.J.: 'Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing'. Proc. Int. Symp. Parallel and Distributed Processing, April 2005, p. 189.1

[30] EL-REWINI H., ABD-EL-BARR M.: 'Introduction to advanced computer architecture and parallel processing', in ZOMAYA A.Y. (ED.): 'Advanced computer architecture and parallel processing' (Wiley, 2005, 1st edn.), pp. 1-17
