Abstract: A domain specific multicore processor for public-key cryptography is proposed in this paper. This processor provides flexible and efficient computation for various forms of RSA and ECC algorithms, fulfilling the low-latency or high-throughput requirements of different application scenarios. By using a heterogeneous multicore architecture, the proposed processor enables high speed parallel implementations of the kernel arithmetic of public-key algorithms. A long-word-length modular multiplication can be partitioned into parallel tasks executed by the high performance multipliers distributed in multiple cores. Dedicated communication mechanisms minimize the inter-core data transfer latencies of the processor. The proposed processor is implemented in TSMC 65 nm LP CMOS technology. Experimental results show that our design outperforms previous works based on varied platforms in performance; for instance, it can complete a 1024-bit RSA encryption in 0.087 ms at 960 MHz. Moreover, we also study area reduction techniques for the proposed multicore processor from the perspectives of algorithm, architecture, and circuit.

Index Terms: Cryptography, ECC, modular multiplication, multicore, RSA.

Manuscript received June 05, 2014; revised September 15, 2014; accepted February 13, 2015. Date of publication April 06, 2015; date of current version April 28, 2015. This work was supported by the National Natural Science Foundation of China under Grant 61176023 and Grant 61234002, and in part by the Project of State Key Laboratory of ASIC and System under Grant 11MS005. This paper was recommended by Associate Editor V. Chandra.
The authors are with the State Key Laboratory of ASIC and Systems, Fudan University, Shanghai 201203, China (e-mail: junhan@fudan.edu.cn; rdou12@fudan.edu.cn; 13210720074@fudan.edu.cn; 09210720068@fudan.edu.cn; zhiyiyu@fudan.edu.cn; xyzeng@fudan.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2015.2407431
1549-8328 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

CLOUD computing as an evolutionary computing technology has become more and more popular in recent years. While it brings economic benefits, some challenges still need to be solved. Cloud data security and privacy protection are the most concerning obstacles impeding the wide adoption of cloud computing [1], especially for business applications under public clouds [2]. Several researchers have discussed the security issues of cloud computing in detail [3], [4], and solutions proposed in [5], [6] demonstrate that public-key cryptography can be the infrastructure to preserve the confidentiality, integrity, and authenticity of data and communications. Since one node in a cloud might face numerous requests for secure services, a high-throughput scheme to process public-key ciphers is desired.

In wireless communication, public-key ciphers such as RSA and Elliptic-Curve Cryptography (ECC) also play an important role. For example, in emerging applications like WiMAX, femtocell, and ubiquitous computing, the access point (AP) or small base station devices should provide high-security and low-latency authentication and authorization services [7]. If these security services are not low-latency, the quality of delay-sensitive applications such as streaming audio/video and IP telephony might be greatly affected when a user moves from one AP to another.

For most cryptographic applications, high security strength becomes more and more necessary because of improved computing ability and new attack methods. This demands large keys as well as countermeasures against potential threats such as power analysis attacks [8]. However, the workloads of public-key ciphers grow dramatically as the key size increases, and the anti-attack algorithms of [8] require extra computation overhead.

Hardware platforms are expected to be more powerful to meet the requirements of high throughput, low latency, and high security. Moreover, security applications evolve fast and new cryptographic protocols will require support in the future. A platform must implement different algorithms and offer a certain freedom for software development. This flexibility makes the hardware implementation problem more difficult.

A general purpose processor (GP) is the most flexible platform but fails to meet the performance requirements. This kind of platform is inefficient at supporting the kernel arithmetic of public-key ciphers, such as long-word-length operations. Besides, the energy efficiency of GP is unacceptable in many application scenarios.

In contrast to GP, ASIC can provide the highest performance and achieve the best energy efficiency. However, ASIC implementations dedicated to specific applications offer the lowest flexibility. Under nanoscale technology, fabricating an ASIC chip for each application leads to extremely high cost. There are many attempts to improve the adaptiveness of ASIC designs; for example, several ECC processors presented in [8]-[11] support both binary and prime fields with different sizes. Nevertheless, ASIC implementations will not provide the true programmability that enables efficient software development.

A graphic processing unit (GPU) possesses abundant computation resources, and it can run a lot of threads in parallel to achieve high throughput for cryptographic services. Nevertheless, executing multiple tasks on a GPU does not shorten the latency of a single task. For instance, an RSA-1024 exponentiation in [12] has a latency of 6930 ms at a working frequency of 1.35 GHz.

Public-key calculations, such as RSA and ECC, contain thousands of modular multiplications. Due to data dependencies, a large proportion of these multiplications cannot be executed in parallel. Therefore, a single user of RSA or ECC authentication must wait a considerable time, although the high-throughput platform processes many users' requests concurrently.
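The data dependency between modular multiplications can be illustrated with a minimal Python sketch of left-to-right square-and-multiply exponentiation. The function name `modexp_count` and its operation counter are assumptions introduced here for illustration and are not part of the proposed processor. Each loop step consumes the result of the previous one, so the multiplications of a single exponentiation form a sequential chain.

```python
def modexp_count(base: int, exponent: int, modulus: int):
    """Left-to-right square-and-multiply; returns (result, number of modular multiplications)."""
    result = 1
    mults = 0
    for bit in bin(exponent)[2:]:
        # The squaring needs the result of the previous iteration.
        result = (result * result) % modulus
        mults += 1
        if bit == "1":
            # The multiplication needs the squaring that was just computed.
            result = (result * base) % modulus
            mults += 1
    return result, mults


if __name__ == "__main__":
    modulus = (1 << 1024) - 159        # arbitrary 1024-bit odd modulus (illustrative only)
    exponent = (1 << 1024) - 1         # worst case: all exponent bits set
    value, count = modexp_count(3, exponent, modulus)
    print(count, value == pow(3, exponent, modulus))
```

For an exponent with bit length L and Hamming weight h, the loop performs L + h modular multiplications, none of which can start before its predecessor has finished.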
HAN et al.: A HETEROGENEOUS MULTICORE CRYPTO-PROCESSOR WITH FLEXIBLE LONG-WORD-LENGTH COMPUTATION
Adopting a multicore architecture, rather than increasing the complexity of a single core, can be an effective way to boost the performance of a platform with better energy efficiency [13]. Multicore architecture has the inherent merits of parallel speedup, programmability, and low power density. Several previous works have studied how to map public-key algorithms, such as RSA, onto multicore platforms of homogeneous architecture [14], [15]. However, a homogeneous architecture cannot support both data-computation-dominated tasks and control-flow-dominated tasks well. Heterogeneous architectures have been widely exploited in recent years to address this problem. For example, AMD's heterogeneous system architecture (HSA) delivers improvement across power, performance, programmability, and portability by using heterogeneous computing units [16].

For cryptographic applications, the main challenge to multicore architecture is how to implement computations with long-word-length operands while supporting flexible control flow and parallelism exploitation for algorithms. In this paper, we investigate what architecture improvements can manage this kind of challenge and show experimental results for public-key ciphers. Our contributions are highlighted here:

• We present a low-latency scheme for iterative long-word-length data computations. Modular multiplication (MM), which is the kernel of public-key algorithms, always has long-word-length operands and causes heavy calculation loads. Using a multicore platform, executing multiple independent MMs in parallel is a straightforward way to offer high throughput. However, it is difficult to accelerate a single MM to enable real time security services. So in this paper three approaches to reducing the latency of MM are proposed for a multicore platform:
1) A high speed modular multiplier is developed as the critical component of each processing element (PE), achieving both high working frequency and fewer iteration cycles.
2) The cooperation of multiple PEs realizes the parallel execution of a single MM.
3) Dedicated inter-core communication mechanisms speed up data transfer, which can be a performance bottleneck of parallel computation.
• We present a heterogeneous multicore architecture for a cryptographic processor. Domain specific multicore processors are developed for many applications but not well studied for cryptography. To design a real cryptographic processor, it is important to balance the flexibility and the computation capability. We decouple the regular tasks of intensive computation from the irregular tasks of high-level control and allocate different tasks to heterogeneous cores. This accelerates the performance of the cryptographic processor and eases the software development.
• We propose several area reduction techniques for the multicore processor implementing public-key ciphers. Although parallelization increases the energy efficiency, it results in a high area cost that becomes a big challenge for a multicore system. Area reduction techniques are useful to save the total fabrication cost of multicore processors. Four aspects are studied in this paper: improving the algorithm of Montgomery modular multiplication with quotient pipelining (MMQP), employing a folded carry save adder (CSA) in the modular multiplier, using a shared register file to avoid operand duplication, and simplifying the hardware complexity of the PE by reducing its capacity for program control and memory access.

This paper is organized as follows. In Section II, an improved MMQP algorithm is introduced, based on which the hardware structure and parallel implementation can be developed. The microarchitecture and hardware design of the proposed processor are shown in Section III. The instruction set architecture (ISA) and the parallelization of applications are detailed in Section IV. In Section V, experimental results are provided and compared with related works. Finally, the conclusion is given in Section VI.

II. IMPROVED MONTGOMERY MODULAR MULTIPLICATION ALGORITHM

A. Original MMQP Algorithm

The Montgomery MM algorithm was proposed by Peter L. Montgomery to avoid the division by modulus [17]. This algorithm dramatically reduces the computation complexity and has become the dominant method for both software [18], [19] and hardware implementations [20], [21]. The MMQP algorithm proposed by Orup [22], as shown in Algorithm 1, is a variant of the original Montgomery algorithm, which simplifies the quotient determination and allows quotient pipelining in hardware design. The drawback of this algorithm is the bit-extension for operands in Montgomery field. The operands must extend their bit-width to that increases by bits compared to , which is also called the field size of normal Montgomery algorithm. This extension leads to non-negligible overhead, especially in high-radix modular multipliers where the radix always has a relatively large value.

Algorithm 1 Original MMQP algorithm [22]

Input:
, ,
, where , for
, , is Modulus with
. Positive integers , such that ,
i.e., , if is the bit width of .

Output:
mod ,
where , .

1: ;
2: for to
3: mod ; {quotient determination step.} and are the th words of quotient and result, respectively.
4: div ; {reduction step.}
5: end for
6: .
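The two loop steps named in Algorithm 1, a quotient determination that is a plain "mod" and a reduction that is a plain "div", can be illustrated with a minimal Python sketch of radix-2^k Montgomery multiplication with simplified quotient determination, in the spirit of [22]. This is a sketch under stated assumptions and is not claimed to match Algorithm 1 line by line: the function name `montgomery_mm`, the variable names, and the choice of a scaled modulus congruent to -1 modulo 2^k are introduced here for illustration, quotient pipelining is omitted (a delay of zero), and the returned value is not fully reduced.

```python
def montgomery_mm(a: int, b: int, m: int, k: int, n: int) -> int:
    """Return s with s * 2**(k*n) congruent to a*b (mod m).

    Assumptions: m is odd, 0 <= b < 2**(k*n), radix is 2**k. Requires Python 3.8+.
    """
    r = 1 << k
    m_prime = (-pow(m, -1, r)) % r      # -m^{-1} mod 2^k
    m_tilde = m_prime * m               # scaled modulus, congruent to -1 mod 2^k
    m_dd = (m_tilde + 1) >> k           # (m_tilde + 1) / 2^k, an exact division
    s = 0
    for i in range(n + 1):              # n word steps plus one final shift step
        q = s & (r - 1)                 # quotient determination: q_i = s mod 2^k
        b_i = (b >> (k * i)) & (r - 1) if i < n else 0
        # Reduction: (s + q * m_tilde) / 2^k equals (s div 2^k) + q * m_dd.
        s = (s >> k) + q * m_dd + b_i * a
    return s


if __name__ == "__main__":
    a, b, m, k, n = 5, 6, 7, 2, 2
    s = montgomery_mm(a, b, m, k, n)
    print((s << (k * n)) % m == (a * b) % m)
```

Because the scaled modulus is congruent to -1 modulo the radix, the quotient digit is simply the low word of the running sum, so no multiplication sits on the quotient path. The price is a wider operand range, which is the bit-extension overhead discussed in the text.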
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 62, NO. 5, MAY 2015
Algorithm 2 Improved MMQP algorithm

Input:
, ,
, , where
, , ,
. Positive integers , such that
.

Output:
mod , where .

1:
2: for to
3: mod ; {quotient determination step}
4: div ; {reduction step}
5: end for
6: .

1) Correctness: Following the procedures in [22], we prove the correctness of the proposed algorithm. According to the line 4 of Algorithm 2, can be calculated through (1).

(1)

Then, it can be rewritten to (2) by using several premises, such as that is inferred from the line 3 of the algorithm, and , which are both derived from .

(2)

By iteratively using the line 4 of the algorithm, from to , and considering the conditions , , and , (2) can be transformed into (3).

(3)

The final result of the algorithm, , can be computed by (4), if the line 6 of the algorithm replaces the with its expression in (3). Notice that , , and . Therefore can be proved based on (4), where .

2) Input and Output Range: Given , we can prove that the output of the proposed algorithm is constrained by . Based on (4), we get

(5)

Using the preconditions of Algorithm 2, and , is got. Applying this constraint to (5), we obtain the upper bound of such that

(6)

3) Advantages: The output of the improved algorithm has the upper bound of much lower than that is required by the original version. This reduction shortens the bit width of operands of MMQP by and thus leads to two implementation advantages. First, if parameter in the original and improved algorithm being represented by and , respectively, must be smaller than , given same radix . This is because can be lower than , according to and . So the improved algorithm has fewer iterations and results in more efficient computation than the original one, given a fixed . Second, the data path of the high-radix multiplier implementing MMQP as well as the register files that store the operands occupy less area, since the improved algorithm saves bits in the width of operands.

Fig. 1. Overall heterogeneous multicore architecture.
TABLE I
INSTRUCTION SET ARCHITECTURE OF PE
Fig. 6. The structure and interface of shared regfile.

in GSM with value 0011. PE1 will stall its computation until PE0 also sets the same value in PE0.Sync.Bits. As soon as two PEs specify the same synchronization bits, a simple circuit in GSM generates a hit signal that informs both PEs of a successful synchronization.
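The rendezvous behavior described here can be mimicked in software. The following Python sketch is a toy model only: the class `SyncBitsModel`, its `sync` method, and the use of threads to stand in for PEs are assumptions made for illustration and say nothing about the actual circuit in GSM. Each party posts its synchronization bits and stalls until its partner has posted the same value, at which point both observe a hit.

```python
import threading


class SyncBitsModel:
    """Toy software model: two parties rendezvous when their posted sync bits match."""

    def __init__(self):
        self._cond = threading.Condition()
        self._posted = {}  # PE id -> sync bits currently posted by that PE

    def sync(self, pe_id: int, partner_id: int, bits: int, timeout: float = 5.0) -> bool:
        """Post `bits` for `pe_id`, then stall until `partner_id` posts the same value.

        Returns True on a hit and False if the timeout expires first.
        """
        with self._cond:
            self._posted[pe_id] = bits
            self._cond.notify_all()  # wake a partner that may already be stalled
            return self._cond.wait_for(
                lambda: self._posted.get(partner_id) == bits, timeout
            )


def demo():
    gsm = SyncBitsModel()
    log = []

    def pe0():
        log.append("PE0 computing")  # PE0 still has work to do before the sync point
        log.append(("PE0 hit", gsm.sync(0, 1, 0b0011)))

    def pe1():
        log.append(("PE1 hit", gsm.sync(1, 0, 0b0011)))  # PE1 stalls until PE0 posts 0011

    threads = [threading.Thread(target=pe1), threading.Thread(target=pe0)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(demo())
```

In this model the posted bits are never cleared, so successive synchronization points would need distinct bit patterns to avoid a false hit.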
Fig. 9. Assembly program sample for parallel implementation of MM-1024.

C. Parallel ECC

Elliptic curve scalar multiplication (ECSM) is used to perform a multiplication map on an elliptic curve, taking a point P to kP by a private key k. The ECSM can be implemented by the left-to-right (LR) double-and-add (DA) binary method [8]. Several previous works have studied the parallelization of ECC algorithms. For instance, in [23], Jyu-Yuan Lai et al. proposed a two-phase scheduling methodology that leads to an efficient ECC design. Like [23], we exploit the parallelism among ECC operations (curve and field arithmetic) through task scheduling based on our multicore architecture. The experimental results of the parallel implementation are demonstrated in the following section.
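The left-to-right double-and-add binary method mentioned above can be sketched in a few lines of Python. The toy curve y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1), the affine coordinates, and the names `point_add` and `scalar_mult` are assumptions chosen for illustration; they are not the curves, coordinate systems, or scheduling used by the proposed processor. The sketch is also not constant-time, so it carries none of the side-channel countermeasures discussed in [8].

```python
P_MOD = 17   # toy field prime (illustrative assumption)
A_COEF = 2   # toy curve y^2 = x^3 + 2x + 2 over F_17


def point_add(p, q):
    """Add two affine points; None represents the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # q is the negative of p (also covers doubling a point with y = 0)
    if p == q:
        # Tangent slope for point doubling.
        lam = (3 * x1 * x1 + A_COEF) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        # Chord slope for addition of distinct points.
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k: int, p):
    """Left-to-right double-and-add: scan the bits of k from the most significant one."""
    result = None
    for bit in bin(k)[2:]:
        result = point_add(result, result)   # double
        if bit == "1":
            result = point_add(result, p)    # add
    return result


if __name__ == "__main__":
    print(scalar_mult(2, (5, 1)))
```

Every doubling depends on the previous loop iteration, so the parallelism available to a multicore platform lies mainly inside each point operation, among its field multiplications, rather than across iterations.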
V. EXPERIMENTAL RESULTS
We have realized two implementations of the multicore pro-
cessor for public-key applications: Design I, our earlier work
[24], and Design II, the improved version. Under TSMC 65 nm
LP CMOS technology, Design I has been fabricated and Design
II is implemented by using Synopsys IC Compiler. The perfor-
mance and power of Design II are obtained from post-layout
simulation and evaluation.
Fig. 10. Parallel execution of kernel iteration of MM-1024 cooperatively by four PEs.
TABLE II
PERFORMANCE COMPARISON WITH RELATED WORKS ON ECC OVER
We choose the author's result under Jacobian coordinates for a fair comparison.
TABLE III
PERFORMANCE COMPARISON WITH RELATED WORKS ON RSA
[21] S. E. Eldridge and C. D. Walter, "Hardware implementation of Montgomery's modular multiplication algorithm," IEEE Trans. Comput., vol. 42, no. 6, pp. 693-699, 1993.
[22] H. Orup, "Simplifying quotient determination in high-radix modular multiplication," in Proc. IEEE 12th Symp. Comput. Arith., 1995, pp. 193-199.
[23] J.-Y. Lai and C.-T. Huang, "Elixir: High-throughput cost-effective dual-field processors and the design framework for elliptic curve cryptography," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 11, pp. 1567-1580, Nov. 2008.
[24] S. Wang, J. Han, Y. Li, Y. Bo, and X. Zeng, "A 920 MHz quad-core cryptography processor accelerating parallel task processing of public-key algorithms," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), 2013, pp. 1-4.
[25] P. Longa and C. Gebotys, "Efficient techniques for high-speed elliptic curve cryptography," in Proc. Cryptogr. Hardware Embedded Syst. (CHES 2010), pp. 80-94.
[26] A. Miyamoto, N. Homma, T. Aoki, and A. Satoh, "Systematic design of RSA processors based on high-radix Montgomery multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 7, pp. 1136-1146, Jul. 2011.
[27] B. Devlin, M. Ikeda, H. Ueki, and K. Fukushima, "Completely self-synchronous 1024-bit RSA crypt-engine in 40 nm CMOS," in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2013, pp. 309-312.
[28] M. Joye and S.-M. Yen, "The Montgomery powering ladder," in Proc. Cryptogr. Hardware Embedded Syst., 2003, pp. 291-302.

Jun Han received the B.S. degree from Xidian University, Shanxi, China, in 2000 and the Ph.D. degree in microelectronics from Fudan University, Shanghai, China, in 2006.
He joined Fudan University as an Assistant Professor in July 2006 and is currently an Associate Professor with the State Key Laboratory of ASIC and Systems. He is working on a high performance domain specific processor especially for digital signal processing and cryptography.

Renfeng Dou received the B.S. degree in electronic information engineering from Wuhan University, Hubei, China, in 2012. He is currently pursuing the M.S. degree in microelectronics at State Key Laboratory of ASIC and System, Fudan University, Shanghai, China.
His current research interests include high-performance and energy-efficient digital VLSI design on parallel computing architecture and on-chip-networks.

Lingyun Zeng received the B.S. degree in electronic science and engineering from Southeast University, Jiangsu, China, in 2013. He is currently pursuing the M.S. degree in microelectronics at State Key Laboratory of ASIC and System, Fudan University, Shanghai, China.
His current research interests include high-performance digital VLSI design and parallel computing on multicore architecture.

Shuai Wang received the B.S. degree in electronic science and engineering from Zhejiang University, Zhejiang, China, in 2009. He received the M.S. degree in microelectronics at State Key Laboratory of ASIC and System, Fudan University, Shanghai, China, in 2012.
His current research interests include high-performance digital VLSI design and parallel computing on multicore architecture.

Zhiyi Yu received the B.S. and M.S. degrees in electrical engineering from Fudan University, Shanghai, China, in 2000 and 2003, respectively, and the Ph.D. degree in electrical and computer engineering from the University of California, Davis, CA, USA, in 2007.
From 2007 to 2008, he was with IntellaSys Corporation, USA. He is currently an Associate Professor with the State Key Laboratory of ASIC and System, Microelectronics Department, Fudan University. He has published 2 books (chapters) and over 60 papers. His research interests include high-performance and energy-efficient digital VLSI design with an emphasis on many-core processors.
Prof. Yu serves as a member of the Technical Program Committee of several conferences such as the IEEE Asian Solid-State Circuits Conference (ASSCC) and the IEEE International Conference on ASIC.

Xiaoyang Zeng received the B.S. degree from Xiangtan University, Xiangtan, China, in 1992 and the Ph.D. degree from Changchun Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, China, in 2001.
Since 2007, he has been a Full Professor and the Director of the State Key Laboratory of ASIC and System, Fudan University, China, where he was a Postdoctoral Researcher from 2001 to 2003, and later an Associate Professor. His research interests include information security chip, VLSI signal processing, and communication systems design.
Prof. Zeng is the Steering Committee Member of the Asia and South Pacific Design Automation Conference (ASP-DAC), and the TPC member of the IEEE Asian Solid-State Circuits Conference (A-SSCC).