
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 5, MAY 2015

A Heterogeneous Multicore Crypto-Processor With Flexible Long-Word-Length Computation
Jun Han, Member, IEEE, Renfeng Dou, Lingyun Zeng, Shuai Wang, Zhiyi Yu, Member, IEEE, and
Xiaoyang Zeng, Member, IEEE

Abstract—A domain specific multicore processor for public-key cryptography is proposed in this paper. This processor provides flexible and efficient computation for various forms of RSA and ECC algorithms, fulfilling the low-latency or high-throughput requirements of different application scenarios. By using a heterogeneous multicore architecture, the proposed processor enables high speed parallel implementations of the kernel arithmetics of public-key algorithms. A long-word-length modular multiplication can be partitioned into parallel tasks executed by the high performance multipliers distributed in multiple cores. Dedicated communication mechanisms minimize the inter-core data transferring latencies of the processor. The proposed processor is implemented under TSMC 65 nm LP CMOS technology. Experimental results show that our design outperforms previous works based on varied platforms in performance; for instance, it can complete a 1024-bit RSA encryption in 0.087 ms at 960 MHz. Moreover, we also study area reduction techniques for the proposed multicore processor from the perspectives of algorithm, architecture, and circuit.

Index Terms—Cryptography, ECC, modular multiplication, multicore, RSA.

I. INTRODUCTION

CLOUD computing as an evolutionary computing technology has become more and more popular in recent years. While it brings economic benefits, some challenges still need to be solved. Cloud data security and privacy protection are the most concerning obstacles impeding the wide adoption of cloud computing [1], especially for business applications under public clouds [2]. Several researchers have discussed the security issues of cloud computing in detail [3], [4], and solutions proposed in [5], [6] demonstrate that public-key cryptography can be the infrastructure to preserve the confidentiality, integrity, and authenticity of data and communications. Since one node in a cloud might face numerous requests for secure services, a high-throughput scheme to process public-key ciphers is desired.

In wireless communication, public-key ciphers like RSA and Elliptic-Curve Cryptography (ECC) also play an important role. For example, in emerging applications like WiMAX, femtocell, and ubiquitous computing, the devices of an access point (AP) or small base station should provide high-security and low-latency authentication and authorization services [7]. If these security services cannot be low-latency, the quality of delay-sensitive applications such as streaming audio/video and IP telephony might be greatly affected when a user moves from one AP to another.

For most cryptographic applications, high security strength becomes more and more necessary, owing to improved computing ability and new attack methods. This demands the use of large-size keys as well as countermeasures against potential threats such as the power analysis attack [8]. However, the workloads of public-key ciphers are dramatically enlarged along with the increment of key size. The anti-attack algorithms of [8] require extra computation overheads.

Hardware platforms are expected to be more powerful to meet the requirements of high throughput, low latency, and high security. Moreover, security applications evolve fast and new cryptographic protocols will require support in the future. A platform must implement different algorithms and offer certain freedom for software development. This flexibility makes the hardware implementation problem more difficult.

A general purpose processor (GP) is the most flexible platform but fails to meet the performance requirements. This kind of platform is inefficient in supporting the kernel arithmetics of public-key ciphers, such as long-word-length operations. Besides, the energy efficiency of a GP is unacceptable in many application scenarios.

Opposite to the GP, an ASIC can provide the highest performance and achieve the best energy efficiency. However, ASIC implementations dedicated to specific applications offer the lowest flexibility. Under nanoscale technology, fabricating an ASIC chip for each application leads to extremely high cost. There are many attempts to improve the adaptiveness of ASIC designs; for example, several ECC processors presented in [8]-[11] support both binary and prime fields with different sizes. Nevertheless, ASIC implementations will not provide the true programmability that enables efficient software development.

A graphic processing unit (GPU) possesses abundant computation resources, and it can run a lot of threads in parallel to achieve high throughput for cryptographic services. Nevertheless, executing multiple tasks on a GPU does not shorten the latency of a single task. For instance, an RSA-1024 exponentiation in [12] has a latency of 6930 ms at a working frequency of 1.35 GHz.

Public-key calculations, such as RSA and ECC, contain thousands of modular multiplications. Due to data dependencies, a large proportion of these multiplications cannot be executed in parallel. Therefore, a single user of RSA or ECC authentication must wait a considerable time, although the high-throughput platform processes many users' requests concurrently.

Manuscript received June 05, 2014; revised September 15, 2014; accepted February 13, 2015. Date of publication April 06, 2015; date of current version April 28, 2015. This work was supported by the National Natural Science Foundation of China under Grant 61176023 and Grant 61234002, and in part by the Project of State Key Laboratory of ASIC and System under Grant 11MS005. This paper was recommended by Associate Editor V. Chandra.

The authors are with the State Key Laboratory of ASIC and Systems, Fudan University, Shanghai 201203, China (e-mail: junhan@fudan.edu.cn; rdou12@fudan.edu.cn; 13210720074@fudan.edu.cn; 09210720068@fudan.edu.cn; zhiyiyu@fudan.edu.cn; xyzeng@fudan.edu.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2015.2407431


To adopt a multicore architecture, rather than to increase the complexity of a single core, can be an effective solution to boost the performance of a platform with better energy efficiency [13]. Multicore architecture has the inherent merits of parallel speedup, programmability, and low power density. Several previous works have studied how to map public-key algorithms, such as RSA, onto multicore platforms of homogeneous architecture [14], [15]. However, data-computation-dominated tasks and control-flow-dominated tasks cannot both be well supported by a homogeneous architecture. Heterogeneous architectures have been widely exploited in recent years to address this problem. For example, AMD's heterogeneous system architecture (HSA) delivers improvements across power, performance, programmability, and portability by using heterogeneous computing units [16].

For cryptographic applications, the main challenge to multicore architecture is how to implement computations with long-word-length operands as well as to support flexible control flow and parallelism exploitation for algorithms. In this paper, we investigate what architecture improvements can manage this kind of challenge and show experimental results toward public-key ciphers. Our contributions are highlighted here:

- We present a low-latency scheme for iterative long-word-length data computations. Modular multiplication (MM), which is the kernel of public-key algorithms, always has long-word-length operands and causes heavy calculation loads. Using a multicore platform, executing multiple independent MMs in parallel is a straightforward way to offer high throughput. However, it is difficult to accelerate a single MM to enable real time security services. So in this paper three approaches to reducing the latency of MM are proposed for a multicore platform:
  1) A high speed modular multiplier is developed as the critical component of each processing element (PE), achieving both a high working frequency and fewer iteration cycles.
  2) The cooperation of multiple PEs realizes the parallel execution of a single MM.
  3) Dedicated inter-core communication mechanisms speed up data transferring, which can be a performance bottleneck of parallel computation.
- We present a heterogeneous multicore architecture for a cryptographic processor. Domain specific multicore processors are developed for many applications but are not well studied for cryptography. To design a real cryptographic processor, it is important to balance flexibility and computation capability. We decouple the regular tasks of intensive computation from the irregular tasks of high-level control and allocate different tasks to heterogeneous cores. This accelerates the performance of the cryptographic processor and eases the software development.
- We propose several area reduction techniques for the multicore processor implementing public-key ciphers. Although parallelization increases the energy efficiency, it results in a high area cost that becomes a big challenge for a multicore system. Area reduction techniques are useful to save the total fabrication cost of multicore processors, about which four aspects are studied in this paper: improving the algorithm of Montgomery modular multiplication with quotient pipelining (MMQP), employing a folded carry save adder (CSA) in the modular multiplier, using a shared register file to avoid operand duplication, and simplifying the hardware complexity of PE by reducing its capacity of program control and memory access.

This paper is organized as follows. In Section II, an improved MMQP algorithm is introduced, based on which the hardware structure and parallel implementation can be developed. The microarchitecture and hardware design of the proposed processor are shown in Section III. The instruction set architecture (ISA) and the parallelization of applications are detailed in Section IV. In Section V, experimental results are provided and used to make comparisons with related works. Finally, the conclusion is drawn in Section VI.

II. IMPROVED MONTGOMERY MODULAR MULTIPLICATION ALGORITHM

A. Original MMQP Algorithm

The Montgomery MM algorithm was proposed by Peter L. Montgomery to avoid the division by the modulus [17]. This algorithm dramatically reduces the computation complexity and has become the dominant method for both software [18], [19] and hardware implementations [20], [21]. The MMQP algorithm proposed by Orup [22], as shown in Algorithm 1, is a variant of the original Montgomery algorithm, which simplifies the quotient determination and allows the quotient to be pipelined in a hardware design. The drawback of this algorithm is the bit-extension of the operands in the Montgomery field: the operands must extend their bit-width beyond the field size of the normal Montgomery algorithm. This extension leads to non-negligible overhead, especially in high-radix modular multipliers where the radix always has a relatively large value.

Algorithm 1 Original MMQP algorithm [22]

Input: the two operands in the Montgomery domain, the modulus, and the precomputed quotient-pipelining constants, subject to the preconditions defined in [22].
Output: the Montgomery product of the two operands, within the bound guaranteed in [22].
1: initialize the intermediate result;
2: for each iteration do
3:   quotient determination step: the current quotient word is taken from the low-order part of the intermediate result;
4:   reduction step: the intermediate result is updated and divided by the radix;
5: end for
6: return the accumulated result.
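For reference, the operation that Algorithms 1 and 2 restructure is ordinary Montgomery modular multiplication. The following Python sketch (ours, not the paper's hardware algorithm) shows the classic word-serial form of [17]; it omits the quotient pipelining and operand bit-extension that distinguish MMQP, and the word size w is an illustrative choice.

# Word-serial Montgomery multiplication (classic form of [17]), shown only to
# illustrate the operation that MMQP accelerates. Assumes 0 <= a, b < m and m odd.
def montgomery_multiply(a, b, m, w=32):
    n = -(-m.bit_length() // w)            # number of w-bit words in the modulus
    mask = (1 << w) - 1
    m_prime = pow(-m, -1, 1 << w)          # -m^-1 mod 2^w (quotient constant)
    s = 0
    for i in range(n):
        a_i = (a >> (w * i)) & mask        # i-th word of the multiplier
        s = s + a_i * b
        q = (s * m_prime) & mask           # quotient determination step
        s = (s + q * m) >> w               # reduction step: exact division by 2^w
    if s >= m:
        s -= m
    return s                               # equals a * b * 2**(-w*n) mod m

With both inputs already in Montgomery form (a*R mod m and b*R mod m, R = 2**(w*n)), the routine returns the product a*b*R mod m, so a chain of multiplications can stay in the Montgomery domain.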

B. Improved MMQP Algorithm

We propose an improved MMQP algorithm, shown as Algorithm 2, which has a lower upper bound on its output than the original algorithm. Thus the bit width of the operands used by MMQP can be reduced.

Algorithm 2 Proposed improved MMQP algorithm

Input: the two operands in the Montgomery domain, the modulus, and the precomputed quotient-pipelining constants, subject to tighter preconditions than those of Algorithm 1.
Output: the Montgomery product of the two operands, with a lower upper bound than that of Algorithm 1.
1: initialize the intermediate result;
2: for each iteration do
3:   quotient determination step;
4:   reduction step;
5: end for
6: return the accumulated result.

1) Correctness: Following the procedures in [22], we prove the correctness of the proposed algorithm. According to line 4 of Algorithm 2, the intermediate result after the final iteration can be calculated through (1).

(1)

Then, (1) can be rewritten as (2) by using several premises, such as the quotient bound that is inferred from line 3 of the algorithm and two relations that are both derived from the preconditions of the inputs.

(2)

By iteratively applying line 4 of the algorithm over all iterations, and considering the input conditions, (2) can be transformed into (3).

(3)

The final result of the algorithm can be computed by (4), if line 6 of the algorithm replaces the intermediate result with its expression in (3). Noticing the ranges of the precomputed constants, the congruence required of the Montgomery product can be proved based on (4).

(4)

2) Input and Output Range: Given the constraints on the inputs, we can prove that the output of the proposed algorithm is constrained by a tighter bound than in the original algorithm. Based on (4), we get (5).

(5)

Using the preconditions of Algorithm 2, a bound on the operands is obtained. Applying this constraint to (5), we obtain the upper bound of the output in (6).

(6)

3) Advantages: The output of the improved algorithm has an upper bound much lower than that required by the original version. This reduction shortens the bit width of the operands of MMQP and thus leads to two implementation advantages. First, for the same radix, the improved algorithm needs fewer iterations than the original one and therefore results in more efficient computation. Second, the data path of the high-radix multiplier implementing MMQP, as well as the register files that store the operands, occupy less area, since the improved algorithm saves bits in the width of the operands.

III. MICROARCHITECTURE OF PROPOSED MULTICORE PROCESSOR

A. Overall Heterogeneous Multicore Architecture

Fig. 1. Overall heterogeneous multicore architecture.

The overall architecture of the proposed processor consists of two clock domains responsible for different functionalities. As shown in Fig. 1, the high and low frequency domains contain four PEs and a RISC core, respectively. These heterogeneous cores have different hardware structures in order to support various tasks. Surrounding the inter-core communication modules, the four PEs work at high frequency to perform the intensive but regular computation for public-key ciphers. In contrast, the RISC core runs relatively slowly, dealing with the irregular tasks, such as the high-level control flows of public-key ciphers, memory initialization, and IO services. These two clock domains are connected by asynchronous FIFOs, through several of which the RISC core sends commands, named macro instructions, to the PEs in order to direct their computations. On the other hand, one of these FIFOs can serve as the channel between the data memory of the RISC core and the shared regfile, which is the common storage area for the PEs.
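As a purely behavioral analogy (ours, not the RTL or the actual command encoding), the decoupling between the slow control domain and the fast compute domain can be pictured as a bounded queue between a producer and a consumer; the macro-instruction names below are invented for illustration only.

import queue, threading

# Toy model of the two clock domains of Fig. 1: the RISC core produces macro
# instructions, a PE consumes them and expands each into its microcode loop.
# The bounded queue plays the role of an asynchronous FIFO.
macro_fifo = queue.Queue(maxsize=4)

def risc_core():
    for macro in ["LOAD_OPERANDS", "MM_1024", "STORE_RESULT", "HALT"]:
        macro_fifo.put(macro)               # blocks when the FIFO is full

def processing_element():
    while True:
        macro = macro_fifo.get()            # blocks when the FIFO is empty
        if macro == "HALT":
            break
        print(f"PE expands {macro} into its microcode loop")

t = threading.Thread(target=risc_core)
t.start()
processing_element()
t.join()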
Fig. 2. The architecture of the 5-stage processing element.

B. Processing Element

In the multicore architecture, PE is a programmable building block that provides high performance arithmetic calculations, such as long-word-length modular multiplication and addition, for public-key algorithms. It has a RISC-like 5-stage pipeline architecture equipped with a dedicated modular multiplier.

1) Pipeline Architecture: PE can be divided into 5 pipeline stages: instruction fetch, instruction decode, first execution, second execution, and write back, as illustrated in Fig. 2. Compared with a classical RISC pipeline, we remove the memory stage, since PE doesn't access data memory and only manipulates operands in the private and shared regfiles. Moreover, PE needs to perform long-word-length additions of 292 bits. To decrease the critical path delay of the adder circuits, two execution stages are employed.

At the fetch stage, 32-bit instructions are fetched from an instruction memory. In fact these instructions serve as microcodes, which can be translated from the macro instructions sent by the RISC core managing high-level control flows. At the decode stage, if an instruction for PE is decoded, operand reading, operand pre-computation, and modular multiplier configuration are performed in the same cycle. A PE instruction has at most 3 source operands that might be read from the private or shared regfile. The private regfile can provide 3 operands through its 3 read ports, while the shared regfile only provides one operand for each PE. Arithmetics with 3 operands are implemented by PE, so at the decode stage we also place some pre-computing logics to merge the three operands into two. Implemented by a shift unit of type 1 (SU1), two shift & inversion units (SIU), and a CSA compressor as shown in Fig. 2, these logics simplify the calculation in subsequent stages. To define the operand selection signals and the operation type for modular multiplications, PE must set up the modular multiplication unit (MMU) configuration register (MMCR) at this stage. The modular multiplier operates with its own independent pipeline, but it receives operands and configuration information at the decode stage.

At the two execution stages, the long-word-length additions are performed. Each stage has a 146-bit adder that processes the high or low half of a 292-bit operand, and after the two stages of calculation the whole result of the 292-bit addition is available. Instructions like branch-if-equal are executed in the execution stages by checking the equality of the operands. Notice that multiple PEs can cooperate to add up long-word-length operands, thus the carries of the addition must be latched and transferred among PEs. At the write-back stage, the result will be adjusted by a shift unit of type 2 (SU2) and then written back to the private or shared regfile.

2) High Performance Montgomery Modular Multiplier: In PE, the modular multiplier is the fundamental unit to support high-throughput and low-latency public-key applications. It implements Algorithm 2 and adopts a folded CSA (FCSA) architecture to make a good trade-off between performance and area. Moreover, the multipliers on different PEs have extension interfaces to connect to each other, whereby they can work together to complete a long-word-length MM with low latency.

Fig. 3. FCSA architecture. (a) Original CSA; (b) the data flow of FCSA at the first cycle; (c) the data flow of FCSA at the second cycle.

a) FCSA tree: Considering the long-word-length operands used in Algorithm 2, compressing the partial products straightforwardly can result in a large CSA tree. This causes a long critical path delay as well as a high area cost. Instead, we fold the compressor tree of the CSA as demonstrated in Fig. 3.

The compression completes in two cycles, but the hardware cost, i.e., the number of full adders (FAs), is lowered.
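A quick software illustration of the carry-save idea behind the FCSA (our own sketch, not the hardware netlist): a 3:2 compressor reduces three addends to a sum word and a carry word without propagating any carries, which is what keeps a CSA tree shallow.

# 3:2 carry-save compressor on integers: three addends become a (sum, carry)
# pair with no carry propagation; the single slow carry-propagating addition
# is deferred to the very end.
def csa_3to2(x, y, z, width=292):
    mask = (1 << width) - 1
    s = (x ^ y ^ z) & mask                              # bitwise sum, no carries
    c = (((x & y) | (y & z) | (x & z)) << 1) & mask     # carries, shifted left
    return s, c

a, b, c = 123456789, 987654321, 555555555
s, cy = csa_3to2(a, b, c)
assert s + cy == a + b + c          # one final carry-propagate add recovers the sum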
b) Pre-computation and partial products generation: The number of partial products exerts a strong influence upon the hardware complexity of a modular multiplier. As seen in line 4 of Algorithm 2, many partial products need to be compressed in every iteration, because two long operands are each multiplied by a multi-bit word. However, the number of partial products can be halved by using pre-computation. Each combination of corresponding bits of the two multiplier words selects one combined partial product. So in the proposed multiplier, we reduce the partial products by 50% through pre-computing a combined parameter and using the bit codes of both multiplier words to select the inputs.
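A software sketch of this selection trick (ours; the names A, M, b, q below are illustrative stand-ins, since the paper's own symbols for the reduction step were not reproduced here): instead of accumulating two partial products per bit position, a precomputed A + M lets each bit pair select a single combined partial product.

# Halving partial products by pre-computation: for each bit pair (b_i, q_i)
# we add one value chosen from {0, A, M, A+M} instead of the two separate
# partial products b_i*A and q_i*M.
def accumulate_with_precompute(A, M, b, q, k):
    table = [0, A, M, A + M]          # pre-computed once per multiplication
    acc = 0
    for i in range(k):
        sel = ((b >> i) & 1) | (((q >> i) & 1) << 1)
        acc += table[sel] << i        # one combined partial product per bit pair
    return acc

# Sanity check: the combined accumulation equals the two separate ones.
A, M, b, q, k = 0xDEADBEEF, 0x12345678, 0b1011, 0b0110, 4
assert accumulate_with_precompute(A, M, b, q, k) == b * A + q * M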
c) Pipelined multiplier architecture: Fig. 4 shows the proposed Montgomery modular multiplier, the main data path of which is a pipelined compressor tree based on the FCSA. The multiplier has five inputs, provided by some special registers in the private or shared regfile. The quotient bits produced by the slice generator, the other four inputs, and the pre-computed parameter are used to generate 24 partial products. Notice that although the bit-width of the operands is 288, the partial products are extended to 320 bits for compression. So the FCSA tree has a 160-bit width, one-half that of the partial products. The FCSA tree is partitioned into two stages. In the first stage, the 24 partial products are merged into 8 and latched into a pipeline register. Then in the second stage, the 8 results of the previous stage and the intermediate result generated by the former iteration are compressed into 3 results. Since the partial product compression is handled part by part, each result register is divided into two components to capture the corresponding results. With the chosen radix and iteration count, this multiplier is an implementation of Algorithm 2; since one iteration of the algorithm costs two cycles on average, a multiplication needs roughly twice as many cycles as there are iterations. Compared to a non-pipelined FCSA-based compressor, the pipeline architecture makes almost no change to the total cycle count of the multiplication but decreases the critical path delay of the multiplier, which in both cases is proportional to the latency of a full adder. Therefore a high working frequency can be achieved, and the time delay of the multiplication is shortened. Basically, this multiplier executes the 288-bit MMQP, but multiplications with a much longer bit width are supported by the cooperation of multiple multipliers.

Fig. 4. Proposed FCSA-based modular multiplier with extension interface.
d) Extension interface: The multipliers of different PEs can cooperate with each other, executing an MMQP with a very long bit-width, such as 552 or 1056. Therefore, one multiplier must transfer signals to the other multipliers when they are concurrently executing the tasks composing a single MMQP. Some input and output signals are employed by the multiplier to fulfill this requirement, as shown in Fig. 4. The output signals include the 24-bit quotient and the low 24 bits of the intermediate result, while the input signals comprise the corresponding bits of the quotient and the intermediate result from the multiplier in another PE. These signals form an extension interface.

C. Inter-core Communication Mechanism

Computing parallel tasks distributed among multiple PEs is an approach to performance acceleration. However, the inter-core communication overheads caused by data dependencies can increase the total latency of a public-key algorithm. Moreover, the distributed tasks need to duplicate some common operands in the PEs they are allocated to, requiring extra storage area. Therefore we propose inter-core communication mechanisms, including broadcasting and forwarding, that reduce both the communication latency and the area cost.

1) Data Dependencies of Distributed MM Computation and Real Time Data Transferring: As analyzed in [15], the data dependencies on the quotient and on the intermediate result strongly impact the latency of a long-word-length MM which is partitioned into tasks distributed among different cores. To minimize the penalties due to data dependencies, we propose dedicated mechanisms to transfer these values in real time.

First, the extension interface in the multiplier mentioned above can forward the low 24 bits of the intermediate result immediately. The benefits of this forwarding will be illustrated in Section IV. Second, a switch network to broadcast the quotient is developed. Without the corresponding quotient, the block-level tasks of a distributed MM cannot start. Instead of transferring the quotient sequentially through normal communication channels in multicore processors, such as FIFOs, broadcasting it in real time can cut down the communication latency and thus speed up the parallel computation. The hardware structure of the configurable switch network for broadcasting is shown in Fig. 5. The selection bits from the MMCR registers of the PEs determine which input is used for MM computing. For example, if two-way MM-521s are executed, two of the quotient outputs are selected to drive the inputs of the two PE pairs, respectively; as for MM-1024, a single quotient output is selected to drive the inputs of all the other PEs.

Fig. 5. Switch network for broadcasting.
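The selection logic of this broadcast network can be summarized by a small sketch (ours; the concrete wiring used for the MM-521 case below is an illustrative assumption rather than the exact configuration):

# Behavioral model of the quotient-broadcast switch of Fig. 5: each PE's MMCR
# holds a selection field naming the PE whose quotient output feeds its
# quotient input.
def route_quotients(q_out, sel):
    # q_out[i] is PE i's quotient word this cycle; sel[i] names PE i's source PE.
    return [q_out[sel[i]] for i in range(len(q_out))]

q_out = [0xA1, 0xB2, 0xC3, 0xD4]
mm1024_sel = [0, 0, 0, 0]          # one MM-1024: PE0's quotient is broadcast to all
two_mm521_sel = [0, 0, 2, 2]       # two MM-521s: PE0 feeds PE0/PE1, PE2 feeds PE2/PE3
assert route_quotients(q_out, mm1024_sel) == [0xA1] * 4
assert route_quotients(q_out, two_mm521_sel) == [0xA1, 0xA1, 0xC3, 0xC3]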
2) Low Cost General Purpose Inter-core Communication and Synchronization: As shown in Fig. 6, a shared regfile that contains 16 288-bit registers is employed to avoid operand duplication as well as to facilitate inter-core communication.

To compute RSA and ECC algorithms through distributed tasks, some parameters used in MMQP must be shared among these tasks. Moreover, one operand of an MM, such as one input of Algorithm 2, might be used by several PEs. Duplicating these long-word-length variables in every PE requires a large amount of storage area. To save cost, we use the shared regfile as a common data pool that provides parameters or operands to multiple PEs. For instance, its registers store the common operand for all PEs while they are computing a 1024-bit MM. Fig. 6 illustrates that the shared regfile is connected to the 4 PEs as well as the RISC core by five read/write ports. Registers are placed in front of the read ports in order to prevent PEs from suffering a long path delay of operand fetching. This shared regfile can also be used as a general purpose channel to exchange data between PEs. For example, PE0 writes a value into one of the shared registers, and then other PEs can fetch the value by accessing that register directly. This kind of channel can implement flexible communication schemes, such as unicasting, multicasting, and broadcasting. Because the shared regfile acts as both a data pool and an inter-core channel, the hardware is highly reused.

Fig. 6. The structure and interface of the shared regfile.

Two PEs independently accessing the same register in the shared regfile might cause read-after-write (RAW) hazards. To synchronize the actions of the PEs, we propose a cycle-accurate global synchronization module (GSM). As shown in Fig. 7, each PE is assigned a 4-bit register to latch the synchronization bits contained in a synchronization instruction. These bits from right to left correspond to the synchronization requests of PE0, PE1, PE2, and PE3, respectively. Taking the synchronization between PE0 and PE1 as an example, PE1 sets PE1.Sync.Bits in the GSM to the value 0011. PE1 will stall its computation until PE0 also sets the same value in PE0.Sync.Bits. As long as the two PEs specify the same synchronization bits, a hit signal will be generated by the simple circuit in the GSM to inform both PEs of a successful synchronization.

Fig. 7. Global synchronization module and an illustration of synchronization hit generation for PE0.
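A behavioral sketch of this hit generation (our Python model of the scheme described above; the exact logic implemented in the GSM circuit may differ):

# Each PE latches a 4-bit sync pattern; bit i set means "I want to synchronize
# with PE i". A hit is reported to PE p once every partner named in its
# pattern has latched the identical pattern.
def sync_hit(sync_bits, p):
    pattern = sync_bits[p]
    partners = [i for i in range(4) if (pattern >> i) & 1]
    return all(sync_bits[i] == pattern for i in partners)

sync_bits = [0b0000, 0b0011, 0b0000, 0b0000]   # PE1 requests sync with PE0 and itself
print(sync_hit(sync_bits, 1))                  # False: PE0 has not set 0011 yet
sync_bits[0] = 0b0011                          # PE0 latches the same value
print(sync_hit(sync_bits, 1))                  # True: hit signal for both PEs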
IV. INSTRUCTION SET ARCHITECTURE AND PARALLELISM EXPLOITATION FOR APPLICATIONS

TABLE I
INSTRUCTION SET ARCHITECTURE OF PE

A. Instruction Set Architecture

PE has a simple but dedicated ISA, as shown in Table I. This ISA provides special arithmetic and modular multiplication instructions to realize high performance cryptographic computations. Its arithmetic instructions implement the three-operand add-type operations described in Section III-B, with variants that shift a source operand. Here, an instruction with the corresponding postfix can put the highest 28 bits of its result into carry registers, which will be used by an instruction issued on another PE. Four instructions in Table I perform varied-word-length MMs: one of them can be executed on a single PE, while the others require two or four PEs performing calculations simultaneously. In fact, the largest MM instruction is a package of four instructions, each of which represents the operation executed by one PE. Notice that the instructions for MM have postfixes to specify the quotient selection and shift-enable signals. For example, a PE executing one of these packaged instructions should get the quotient from PE0 and receive the value shifted in by its neighbor. The ISA also provides branch if equal, branch if not equal, branch if less than zero, branch if greater than or equal to zero, and jump instructions. A dedicated instruction is used to stall the pipeline of the PE until it receives a macro instruction from the RISC core; the PE can also be frozen while waiting for the accomplishment of the MMU's operation. Flag-setting instructions inform the RISC core of the status of a task allocated to the PEs. A synchronization instruction is used for the synchronization among PEs by setting the registers in the GSM shown in Fig. 7. Only simple branch and control instructions are included in the ISA, and complex program control mechanisms, such as stack maintenance and interrupt handling, aren't supported by PEs. Besides, this ISA removes instructions for data memory access, since all operations in PEs only manipulate the variables in a private regfile (PR), as shown in Fig. 8, or the shared regfile; the data exchange between the shared regfile and a memory unit is managed by the RISC core. These simplifications indeed reduce the hardware complexity of PEs.

B. Parallel Long-Word-Length MMs

With the proposed ISA, parallel versions of long-word-length MMs can be implemented on the four PEs. For example, Fig. 9 shows the assembly program for MM-1024 using the instructions dedicated to PEs. This program is developed based on Algorithm 3, which is the parallel form of MM-1024 derived from Algorithm 2. The kernel execution flow of this parallel algorithm on four PEs is illustrated in Fig. 10. Using multiplicand-based partitioning, we divide MM-1024 into several computation tasks and evenly distribute them to the four cores. Thus, the PEs will compute their tasks in parallel and a high speedup ratio for MM-1024 can be expected. It is worth pointing out that the necessary communication and cooperation among PEs are efficient. Without receiving the quotient generated in PE0, all PEs cannot continue their computations. Using the switch network shown in Fig. 5 to broadcast the quotient immediately can minimize the time overhead due to this data dependency. PE0, PE1, and PE2 also require the adjacent PE at their right side to provide the forwarded low bits of its intermediate result. The real time forwarding through the extension interface of each PE removes the possible stalls in PEs that would be caused by transferring these values through slow communication channels. At the end of the parallel algorithm (lines 16 to 19), some computations must be carried out across different PEs to obtain the final result. This kind of cooperative computation can be implemented by the special arithmetic and synchronization instructions. As we can see in Fig. 9, the add-with-carry instructions perform additions with carry propagation, and the synchronization instructions keep the four PEs in the sequential execution order of these additions required by the data dependency of the algorithm.

Fig. 8. Private regfile elements and special connections.

Fig. 9. Assembly program sample for parallel implementation of MM-1024.
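Algorithm 3 itself is not reproduced here, but the multiplicand-based partitioning described above can be sketched in software (ours; a plain multi-word multiplication rather than the full MMQP recurrence, with a hypothetical 256-bit block size):

# Multiplicand-based partitioning, software analogy: one 1024-bit operand is
# split into four 256-bit blocks, each "PE" multiplies the full multiplier by
# its own block, and the partial results are recombined at the end by the
# carry-propagating additions that the add/sync instructions perform in Fig. 9.
BLOCK = 256

def split_blocks(x, blocks=4, width=BLOCK):
    mask = (1 << width) - 1
    return [(x >> (width * i)) & mask for i in range(blocks)]

def parallel_multiply(a, b):
    partials = [a * b_j for b_j in split_blocks(b)]    # independent per-PE work
    result = 0
    for j, p in enumerate(partials):                   # final cross-PE additions
        result += p << (BLOCK * j)
    return result

a = (1 << 1024) - 12345
b = (3 << 1000) + 6789
assert parallel_multiply(a, b) == a * b

In the real processor the per-block work is the Montgomery recurrence of Algorithm 2 and the recombination is interleaved with the quotient broadcast and result forwarding, but the partitioning principle is the same.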

C. Parallel ECC

Elliptic curve scalar multiplication (ECSM) is used to perform a multiplication map on an elliptic curve, taking a point P to kP for a private key k. The ECSM can be implemented by the left-to-right (LR) double-and-add (DA) binary method [8]. Several previous works have studied the parallelization of ECC algorithms. For instance, in [23], Jyu-Yuan Lai et al. proposed a two-phase scheduling methodology that leads to an efficient ECC design. Like [23], we exploit the parallelism among ECC operations (curve and field arithmetics) through task scheduling based on our multicore architecture. The experimental results of the parallel implementation are demonstrated in the following section.
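For reference, the LR double-and-add method mentioned above has the following well-known structure, shown here as a Python sketch (ours) that is the serial baseline our task scheduling parallelizes; ec_double and ec_add are assumed placeholder point operations, not functions of this design.

# Left-to-right double-and-add ECSM: scan the key bits from the most
# significant end, doubling at every step and adding the base point
# whenever the current key bit is 1.
def ecsm_left_to_right(k, P, ec_double, ec_add):
    bits = bin(k)[2:]
    Q = P                               # consume the leading 1 bit of k
    for bit in bits[1:]:
        Q = ec_double(Q)
        if bit == '1':
            Q = ec_add(Q, P)
    return Q                            # equals k * P for k >= 1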

V. EXPERIMENTAL RESULTS

We have realized two implementations of the multicore processor for public-key applications: Design I, our earlier work [24], and Design II, the improved version. Under TSMC 65 nm LP CMOS technology, Design I has been fabricated and Design II has been implemented by using Synopsys IC Compiler. The performance and power of Design II are obtained from post-layout simulation and evaluation.

A. Benefits of the Improved MMQP Algorithm

The benefits of the improved MMQP algorithm can be demonstrated by comparing the implementation results of two MMUs based on Algorithm 1 and Algorithm 2, respectively. Synthesized under the same constraints, the MMU with the improved algorithm obtains a 10% reduction in both area and execution cycles for MM-256, compared to the one with the original algorithm, as shown in Fig. 11. With respect to the area-time product, a 1.2x improvement has been achieved by the one with a reduced bit width of the data path and fewer iterations. Moreover, by using the improved algorithm, the data width of the regfiles of our processor is decreased from 312 bits to 288 bits, which makes the regfiles reduce their area by about 10%.

B. Comparison with Related Works

Experimental results demonstrate that the proposed designs achieve the goal of providing both high-throughput and low-latency computation for public-key algorithms.

Fig. 10. Parallel execution of the kernel iteration of MM-1024 cooperatively by four PEs.

TABLE II
PERFORMANCE COMPARISON WITH RELATED WORKS ON ECC

We choose the authors' result under Jacobian coordinates for a fair comparison.

Fig. 11. Performance improvements for MM-256 by using the improved MMQP algorithm compared with the original algorithm.

1) ECC: The comparison results for ECC algorithms are listed in Table II. Implementing the conventional LR-DA ECSM algorithm, Designs I and II have higher throughput than the software implementation on an AMD Opteron 252 server CPU working at 2.6 GHz [25]. Designs I and II also outperform the GPU based work [12] in both throughput and latency. The GPU implementation achieves a throughput of 1429 op/s with a field size of 224, where op/s represents the number of ECSM operations per second, but each operation suffers a long latency of 305 ms. The LR-DAA and RL-DAA ECSM algorithms are well-known countermeasures against power analysis attacks [8], [11]. LR-DAA ECSM can resist not only the SPA attack but also the DPA attack if a randomized base point technique is used. RL-DAA ECSM is a security-enhanced algorithm to defeat the doubling attack. As illustrated by Table II, Design II also achieves much higher performance in running the LR-DAA and RL-DAA ECSM algorithms, compared to previous works. Although the anti-attack algorithms introduce extra operations to disable side channel leakage and require more intensive computations, the proposed multicore processor can also speed them up because of its parallel hardware resources.

2) RSA: The comparison results for RSA algorithms are shown in Table III. Our designs obtain a higher throughput of the RSA exponentiation using the Montgomery Ladder method than the GPU implementation [12]. Meanwhile, Design II has a much lower execution latency for a single exponentiation. For example, Design II only needs 0.434 ms for a 1024-bit exponentiation, while the GPU implementation costs 6930 ms. Design II also outperforms two ASIC implementations [26], [27] in the speed of executing the RSA exponentiation using CRT and the Montgomery Ladder method, respectively.

The Montgomery Ladder method [28] is deemed to be a technique with enhanced resistance to side channel attacks. Our Design II can speed up the RSA implementation using this kind of technique.
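The ladder referenced here interleaves two accumulators so that every key bit triggers the same multiply/square pattern. A minimal sketch (ours) of the exponentiation structure, using plain modular arithmetic rather than the processor's Montgomery hardware:

# Montgomery powering ladder [28] for modular exponentiation: both branches
# perform one multiplication and one squaring per key bit, which removes the
# key-dependent operation pattern exploited by simple power analysis.
def ladder_exp(x, e, n):
    r0, r1 = 1, x % n
    for bit in bin(e)[2:]:             # scan exponent bits MSB-first
        if bit == '1':
            r0 = (r0 * r1) % n
            r1 = (r1 * r1) % n
        else:
            r1 = (r0 * r1) % n
            r0 = (r0 * r0) % n
    return r0                          # equals x**e mod n

assert ladder_exp(7, 560, 561) == pow(7, 560, 561)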

TABLE III
PERFORMANCE COMPARISON WITH RELATED WORKS ON RSA

CIOS modular exponentiation.
The layout area including macro cells and the core area of logic gates (counted in kGates) are reported separately. In [27] the total number of transistors, 804 k, is presented, and we convert it to kGates by dividing by 4.

3) Design II vs. Design I: Tables II and III show that Design II obtains better performance in executing ECC-256, ECC-521, RSA-1024, and RSA-2048 than Design I. Moreover, Design II saves 36% of the layout area with respect to the physical implementations of the two designs, as shown by Fig. 12. This is attributed to the fact that it cuts down 10% of the logic gates, reduces the memory blocks and volumes (from 64 KB to 40 KB), and has a higher standard-cell utilization in layout than Design I.

Fig. 12. Area reduction compared with our previous Design I.

VI. CONCLUSION

This paper proposes a heterogeneous multicore processor that is devoted to computing public-key algorithms. It can fulfill varied requirements according to specific application scenarios, providing both low-latency and high-throughput cryptographic computation. High performance of long-word-length MMs is achieved by using efficient modular multipliers and employing fast inter-core communication. Moreover, algorithm, architecture, and circuit approaches for area reduction are studied to save the fabrication cost of this processor.

REFERENCES

[1] H. Takabi, J. B. Joshi, and G.-J. Ahn, "Security and privacy challenges in cloud computing environments," IEEE Security Privacy, vol. 8, no. 6, pp. 24-31, 2010.
[2] P. Hofmann and D. Woods, "Cloud computing: The limits of public clouds for business applications," IEEE Internet Comput., vol. 14, no. 6, pp. 90-93, 2010.
[3] L. M. Kaufman, "Data security in the world of cloud computing," IEEE Security Privacy, vol. 7, no. 4, pp. 61-64, 2009.
[4] S. Ramgovind, M. M. Eloff, and E. Smith, "The management of security in cloud computing," in Proc. IEEE Inf. Security South Africa (ISSA 2010), pp. 1-7.
[5] C. Wang, Q. Wang, K. Ren, and W. Lou, "Privacy-preserving public auditing for data storage security in cloud computing," in Proc. IEEE INFOCOM 2010, pp. 1-9.
[6] D. Zissis and D. Lekkas, "Addressing cloud computing security issues," Fut. Gener. Comput. Syst., vol. 28, no. 3, pp. 583-592, 2012.
[7] D. He, C. Chen, S. Chan, and J. Bu, "Secure and efficient handover authentication based on bilinear pairing functions," IEEE Trans. Wireless Commun., vol. 11, no. 1, pp. 48-53, 2012.
[8] J.-W. Lee, S.-C. Chung, H.-C. Chang, and C.-Y. Lee, "Efficient power-analysis-resistant dual-field elliptic curve cryptographic processor using heterogeneous dual-processing-element architecture," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 1, pp. 49-61, Jan. 2014.
[9] J.-Y. Lai and C.-T. Huang, "A highly efficient cipher processor for dual-field elliptic curve cryptography," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 5, pp. 394-398, 2009.
[10] J.-Y. Lai and C.-T. Huang, "Energy-adaptive dual-field processor for high-performance elliptic curve cryptographic applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 8, pp. 1512-1517, 2011.
[11] J.-W. Lee, Y.-L. Chen, C.-Y. Tseng, H.-C. Chang, and C.-Y. Lee, "A 521-bit dual-field elliptic curve cryptographic processor with power analysis resistance," in Proc. ESSCIRC, Sep. 2010, pp. 206-209.
[12] R. Szerwinski and T. Güneysu, "Exploiting the power of GPUs for asymmetric cryptography," in Proc. Cryptogr. Hardware Embedded Syst. (CHES 2008), pp. 79-99.
[13] D. Geer, "Chip makers turn to multicore processors," Computer, vol. 38, no. 5, pp. 11-13, May 2005.
[14] Z. Chen and P. Schaumont, "A parallel implementation of Montgomery multiplication on multicore systems: Algorithm, analysis, prototype," IEEE Trans. Comput., vol. 60, no. 12, pp. 1692-1703, Dec. 2011.
[15] J. Han, S. Wang, W. Huang, Z. Yu, and X. Zeng, "Parallelization of radix-2 Montgomery multiplication on multicore platform," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 12, pp. 2325-2330, Dec. 2013.
[16] L. Su, "Architecting the future through heterogeneous computing," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2013, pp. 8-11.
[17] P. L. Montgomery, "Modular multiplication without trial division," Math. Comput., vol. 44, no. 170, pp. 519-521, 1985.
[18] Ç. K. Koç, T. Acar, and B. S. Kaliski, Jr., "Analyzing and comparing Montgomery multiplication algorithms," IEEE Micro, vol. 16, no. 3, pp. 26-33, 1996.
[19] J. Fan, K. Sakiyama, and I. Verbauwhede, "Montgomery modular multiplication algorithm on multi-core systems," in Proc. IEEE Workshop Signal Process. Syst., 2007, pp. 261-266.
[20] T. Blum and C. Paar, "Montgomery modular exponentiation on reconfigurable hardware," in Proc. 14th IEEE Symp. Comput. Arith., 1999, pp. 70-77.

[21] S. E. Eldridge and C. D. Walter, "Hardware implementation of Montgomery's modular multiplication algorithm," IEEE Trans. Comput., vol. 42, no. 6, pp. 693-699, 1993.
[22] H. Orup, "Simplifying quotient determination in high-radix modular multiplication," in Proc. IEEE 12th Symp. Comput. Arith., 1995, pp. 193-199.
[23] J.-Y. Lai and C.-T. Huang, "Elixir: High-throughput cost-effective dual-field processors and the design framework for elliptic curve cryptography," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 11, pp. 1567-1580, Nov. 2008.
[24] S. Wang, J. Han, Y. Li, Y. Bo, and X. Zeng, "A 920 MHz quad-core cryptography processor accelerating parallel task processing of public-key algorithms," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), 2013, pp. 1-4.
[25] P. Longa and C. Gebotys, "Efficient techniques for high-speed elliptic curve cryptography," in Proc. Cryptogr. Hardware Embedded Syst. (CHES 2010), pp. 80-94.
[26] A. Miyamoto, N. Homma, T. Aoki, and A. Satoh, "Systematic design of RSA processors based on high-radix Montgomery multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 7, pp. 1136-1146, Jul. 2011.
[27] B. Devlin, M. Ikeda, H. Ueki, and K. Fukushima, "Completely self-synchronous 1024-bit RSA crypt-engine in 40 nm CMOS," in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2013, pp. 309-312.
[28] M. Joye and S.-M. Yen, "The Montgomery powering ladder," in Proc. Cryptogr. Hardware Embedded Syst., 2003, pp. 291-302.

Jun Han received the B.S. degree from Xidian University, Shaanxi, China, in 2000 and the Ph.D. degree in microelectronics from Fudan University, Shanghai, China, in 2006.
He joined Fudan University as an Assistant Professor in July 2006 and is currently an Associate Professor with the State Key Laboratory of ASIC and Systems. He is working on high performance domain specific processors, especially for digital signal processing and cryptography.

Renfeng Dou received the B.S. degree in electronic information engineering from Wuhan University, Hubei, China, in 2012. He is currently pursuing the M.S. degree in microelectronics at the State Key Laboratory of ASIC and System, Fudan University, Shanghai, China.
His current research interests include high-performance and energy-efficient digital VLSI design on parallel computing architectures and on-chip networks.

Lingyun Zeng received the B.S. degree in electronic science and engineering from Southeast University, Jiangsu, China, in 2013. He is currently pursuing the M.S. degree in microelectronics at the State Key Laboratory of ASIC and System, Fudan University, Shanghai, China.
His current research interests include high-performance digital VLSI design and parallel computing on multicore architecture.

Shuai Wang received the B.S. degree in electronic science and engineering from Zhejiang University, Zhejiang, China, in 2009, and the M.S. degree in microelectronics from the State Key Laboratory of ASIC and System, Fudan University, Shanghai, China, in 2012.
His current research interests include high-performance digital VLSI design and parallel computing on multicore architecture.

Zhiyi Yu received the B.S. and M.S. degrees in electrical engineering from Fudan University, Shanghai, China, in 2000 and 2003, respectively, and the Ph.D. degree in electrical and computer engineering from the University of California, Davis, CA, USA, in 2007.
From 2007 to 2008, he was with IntellaSys Corporation, USA. He is currently an Associate Professor with the State Key Laboratory of ASIC and System, Microelectronics Department, Fudan University. He has published 2 books (chapters) and over 60 papers. His research interests include high-performance and energy-efficient digital VLSI design with an emphasis on many-core processors.
Prof. Yu serves as a member of the Technical Program Committee of several conferences such as the IEEE Asian Solid-State Circuits Conference (A-SSCC) and the IEEE International Conference on ASIC.

Xiaoyang Zeng received the B.S. degree from Xiangtan University, Xiangtan, China, in 1992 and the Ph.D. degree from the Changchun Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, China, in 2001.
Since 2007, he has been a Full Professor and the Director of the State Key Laboratory of ASIC and System, Fudan University, China, where he was a Postdoctoral Researcher from 2001 to 2003, and later an Associate Professor. His research interests include information security chips, VLSI signal processing, and communication systems design.
Prof. Zeng is a Steering Committee Member of the Asia and South Pacific Design Automation Conference (ASP-DAC), and a TPC member of the IEEE Asian Solid-State Circuits Conference (A-SSCC).
