
2007 IEEE Nuclear Science Symposium Conference Record

N15-119

A VLSI-FPGA System-on-Chip for Detectors Monitoring


Alberto Aloisio, Francesco Cevenini, Raffaele Giordano and Vincenzo Izzo
Abstract: In this paper we present a System-on-Chip (SoC) designed to offer a self-contained, compact data acquisition platform for particle detector monitoring. With a companion ADC, this architecture can acquire the detector signals, process the data and perform monitoring tests. The SoC is based on a custom, expanded version of a free, open-source 8-bit microprocessor. We extended the instruction set, added internal memory resources and integrated peripherals to interface the core with the external ADC. The peripherals implement threshold checking, pedestal suppression, waveform recording and a multi-channel analyzer in hardware. The SoC inherits from the open-source core some attractive features for real-time applications: high working frequency, constant instruction execution time, short and fixed IRQ latency and low logic resource requirements. The processor makes it possible to execute additional off-line data processing in software, such as averaging, FWHM calculation and peak finding. It includes an 8-bit I/O bus for interfacing with external logic and a UART for RS-232 communications. We also present two implementations of the SoC: on a Xilinx Virtex II FPGA and on a standard-cell CMOS 0.18 µm ASIC. Their resource occupation and performance are compared, in view of a deployment in the monitoring system of an experimental apparatus.

Index Terms: System on Chip, microprocessor, FPGA, VLSI, CMOS, standard cell, real-time, monitoring, detector.

Manuscript received Nov. 22, 2007. Alberto Aloisio, Francesco Cevenini, Raffaele Giordano and Vincenzo Izzo are with I.N.F.N. Sezione di Napoli and Università di Napoli "Federico II", Dipartimento di Scienze Fisiche, Via Cintia - 80126 Napoli, Italy (e-mail: aloisio@na.infn.it, cevenini@na.infn.it, rgiordano@na.infn.it, izzo@na.infn.it).

I. INTRODUCTION

Particle detectors are often monitored during data taking in order to control their operating conditions. Gain drift, noise performance and aging studies are based on pulse height analysis, and they require digitizing and processing the detector analog output. In a typical experimental apparatus, size, power, cabling and cost are serious constraints, so only a few channels can be equipped with an electronic chain suited to this purpose. In this paper we present a System-on-Chip designed to overcome this problem by offering a stand-alone tool to acquire and monitor the detector analog output. We developed a complete data acquisition platform on a single chip. Our system hosts a custom microprocessor, which controls the acquisition, performs programmable data analysis and executes digital I/O toward external hardware. The system only requires an external ADC. The board hosting the chip would be tiny, thus minimizing the overall size and allowing it to be placed directly on the detector. The SoC is able to transmit acquired and analyzed data to a personal computer or a serial terminal by using a dedicated on-chip UART. In a real application, the user interacts with the SoC by sending commands to the microprocessor and by checking the results of data acquisitions and analyses. The SoC could also be enabled to control external hardware to modify the detector working conditions (e.g. supply voltage), thus creating a feedback loop and acting as a detector controller.

II. THE LILOBLAZE PROCESSOR

We designed an 8-bit soft core based on the PicoBlaze soft processor designed by Xilinx. PicoBlaze is an 8-bit open-source microprocessor and exists only as an HDL description. The company provides several versions of the CPU. All of them are described as a netlist of FPGA primitives, except for the CR version, which is described in synthesizable VHDL code. We chose the PicoBlaze CR [1] because we wanted to modify the description and also implement it in a Very Large Scale Integration (VLSI) standard cell technology. This core has some attractive features for real-time applications: high working frequency, around 100 MHz (50 MIPS, i.e. millions of instructions per second) on a Virtex II V1000 FPGA [2][3] with the XST synthesizer [4]; timing predictability: every instruction always takes 2 clock cycles to execute; short and fixed IRQ latency: 5 or 6 clock cycles; low logic resource occupation: 4% of a Virtex II V1000 FPGA. The processor has 8 general-purpose 8-bit registers, 25 assembly instructions, an 8-bit program counter, an 8-bit ALU with zero and carry flags, 256 8-bit I/O ports, a 4-level deep call/return hardware stack and a maskable interrupt request (IRQ) input.

This core does not have an integrated RAM, so the only memory resources are the registers, which are insufficient even for very simple tasks (e.g. building a histogram of acquired data). The ALU does not provide any instruction that compares two operands (two registers, or a register and an immediate operand) and sets the processor flags. The only way to do it is to use the subtraction instruction, but this alters the destination register. Moreover, the stack depth allows only 4 nested calls, which prevents the user from efficiently organizing the code into subroutines. Finally, there is a single IRQ input, so the microprocessor cannot receive IRQs from multiple sources. To overcome the PicoBlaze limitations we developed the LiloBlaze processor, which enhances the described architecture with RAM resources, an improved ALU, a deeper call/return

1-4244-0923-3/07/$25.00 ©2007 IEEE.


Figure 1. Simplified block diagram of the LiloBlaze architecture.

stack and two independent IRQ inputs. All the functionalities and instructions of the original core are supported by LiloBlaze, so all code written for the PicoBlaze processor can be used on the LiloBlaze processor. We integrated two RAMs in the architecture: 1) DAQ_RAM, a 256x8-bit dual-port memory that stores data from the ADC or serves as the multichannel analyzer memory; 2) DATA_RAM, a 32x8-bit memory that works as a support resource for the CPU program. We expanded the instruction set with instructions to read and write both RAMs with direct and indirect addressing, i.e. the address can be specified by an immediate operand in the instruction (directly) or by the content of a register (indirectly). Fig. 1 shows a simplified block diagram of the architecture we designed and shows how the RAMs have been embedded in the core. Access to the DATA_RAM memory is managed by the instruction decode block: the address provided to the memory can be the content of a register or a 5-bit immediate operand encoded in the instruction word. Input data are read from a processor register, and output data are written to a processor register. The same holds for port 1 of the DAQ_RAM memory, but in this case the address space is 8 bits wide. We made DAQ_RAM a dual-port memory so that one port can be used by LiloBlaze while the other is available for direct memory access by the peripherals.
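The two addressing modes described above can be illustrated with a small behavioral model. The following Python sketch is not the authors' VHDL: the names `MemoryModel`, `store` and `fetch` are hypothetical, and discarding the upper bits of a register used as a DATA_RAM address is an assumption, since the text does not say how out-of-range addresses are handled.

```python
class MemoryModel:
    """Toy model of the register file and the two embedded RAMs."""

    def __init__(self):
        self.regs = [0] * 8                  # 8 general-purpose 8-bit registers
        self.rams = {
            "DATA_RAM": [0] * 32,            # 32 x 8 bit, 5-bit address
            "DAQ_RAM": [0] * 256,            # 256 x 8 bit, 8-bit address (CPU port)
        }

    def _address(self, ram, operand, indirect):
        # Direct: the operand is the immediate address encoded in the instruction.
        # Indirect: the operand selects a register whose content is the address.
        addr = self.regs[operand] if indirect else operand
        # Assumption: address bits beyond the RAM depth are simply ignored.
        return addr % len(self.rams[ram])

    def store(self, ram, src_reg, operand, indirect=False):
        """Write the content of register src_reg to the selected RAM."""
        self.rams[ram][self._address(ram, operand, indirect)] = self.regs[src_reg] & 0xFF

    def fetch(self, ram, dst_reg, operand, indirect=False):
        """Read the selected RAM location into register dst_reg."""
        self.regs[dst_reg] = self.rams[ram][self._address(ram, operand, indirect)]


model = MemoryModel()
model.regs[0] = 0x2A
model.store("DATA_RAM", src_reg=0, operand=5)                  # direct: address 5
model.regs[1] = 5
model.fetch("DATA_RAM", dst_reg=2, operand=1, indirect=True)   # indirect: address held in register 1
```

In this model both calls address the same DATA_RAM location, once through an immediate operand and once through a register, which is the distinction the added instructions make.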

We added a compare instruction to the ALU, which allows the comparison between two CPU registers or between a register and an immediate operand. We also added a barrel shifter to the ALU, with corresponding instructions that shift to the right or to the left, filling the vacated bits with 0 or 1. The shift amount can be specified either directly or indirectly. Finally, we expanded the depth of the call/return stack to 16 levels, allowing 1 IRQ and 15 nested calls to be executed. It is important to note that, although we added hardware to the core, we did not degrade the frequency performance with respect to the original design, as we will show in Section IV.

III. THE SYSTEM-ON-CHIP ARCHITECTURE

Our SoC includes the LiloBlaze microprocessor, a program memory (Program RAM) and peripherals for data acquisition and real-time processing. The SoC also includes a UART to communicate over an RS-232 serial link (Fig. 2). Data from the ADC are handled by the ADC synchronization logic, which takes care of reading and synchronizing the data when they become available. Data are then transferred to specific peripherals that subtract a pedestal and apply a threshold without CPU overhead. Pedestal and threshold are set by the CPU via internal registers. After pedestal suppression, data over threshold can either be processed by the MultiChannel Analyzer (MCA) control logic or be stored directly into a dual-port RAM bank (DAQ RAM). The selection between the


Figure 2. Simplified block diagram of the SoC internals.

two operating modes is performed by writing to a dedicated register, which drives a multiplexer. The multiplexer in turn lets either the MCA control logic or the threshold checking unit write to the RAM. This architecture allows the CPU to run a program and analyze the data while the system is acquiring, because the CPU has read access to the DAQ RAM through one port while the peripherals have read/write access through the other.

The microprocessor loads the 8-bit Sample Counter (SC) with the number of samples to acquire. The SC is decremented for each sample acquired from the ADC. When the SC reaches zero, an IRQ is issued and the ADC synchronization logic is disabled, thus stopping the acquisition.

A dual-port 256x16-bit RAM (Program RAM) stores the program code for the CPU and is serially initialized after power-on through a dedicated Serial Peripheral Interface (SPI) bus. The reason for choosing a serial bus to load the program memory is explained in Section IV.

The ADC synchronization logic handles data from the ADC and flags the reading of each sample both on an output pin of the SoC and internally to the SC. We designed this logic to match a straightforward 8-bit ADC output interface. We selected the ADC0820 from National Semiconductor, which is an industry standard and has a very simple control interface, adopted by many other models [5]. All peripherals are fully pipelined and allow reading from the external ADC at every clock cycle. The system is fully synchronous, i.e. microprocessor and peripherals share the same clock domain. This guarantees a fixed latency of a few clock cycles between data arrival from the ADC and its storage in memory, which makes the timing of an acquisition easy to predict.

The microprocessor inherits from the open-source core some attractive features for real-time applications: high working frequency, constant instruction execution time, short and fixed IRQ latency and a low footprint in terms of logic resources. The processor makes it possible to execute additional off-line data processing in software, such as averaging, FWHM calculation and peak finding. Moreover, parallel input and output operations always take two clock cycles to move a datum between the CPU internal registers and the I/O pins. Thus, the I/O timing is predictable and has a short latency.

The LiloBlaze processor is interfaced to the peripherals through its 8-bit I/O bus. A pair of assembly instructions (INPUT and OUTPUT) is dedicated to I/O operations. INPUT transfers an 8-bit word from the input bus to one of the internal registers of the processor. OUTPUT transfers the content of a register to the output bus. Both instructions allow the port address (PORT_ID) to be specified either directly or indirectly. The peripherals are managed using reserved I/O ports (PORT_IDs from 0x04 to 0xFF). Ports from 0x00 to 0x03 are available for interfacing the SoC to external hardware. Every register of each DAQ peripheral has an assigned PORT_ID in the reserved range. Writing to a register (or reading its content) is performed using the OUTPUT (INPUT) instruction with the corresponding PORT_ID. Reserved ports allow the user to set the state of the peripherals' control signals and to write to (read from) their internal registers to set (get) acquisition parameters. For instance, an I/O port allows the programmer to write into the Sample Counter (SC), setting the number of samples to acquire and starting the acquisition.

In order to enable the processor to receive an interrupt request at the end of an acquisition without losing the capability to receive an external IRQ, we added an OR gate on the IRQ input of LiloBlaze. To determine whether the request came from an external device, the processor can read a dedicated flip-flop through its parallel


Figure 3. Layout of the System on Chip on a Virtex II 1000 FPGA.

Figure 5. Top: dump of the MCA memory on a PC serial terminal. Bottom: histogram of the received data.

Figure 4. Resource occupation of the System on Chip on a Virtex II 1000 FPGA.

I/O bus: SC_LAST. If SC_LAST is set, the SC has reached zero, so the interrupt was generated by the end of the acquisition. An IRQ is also generated on overflow of the multichannel analyzer: this happens when a location of the channel memory contains the value 255 and the MCA tries to increment it. The processor can distinguish this IRQ from the previous one by reading from the SC the number of samples still to be acquired.

IV. FPGA AND VLSI IMPLEMENTATIONS

Fig. 3 shows the SoC layout on a Xilinx Virtex II 1000 FPGA. This device includes 40 18-kbit BlockRAMs, which can be configured with several depths and word widths. The DAQ_RAM memory included in the microprocessor and the Program RAM have been implemented by suitably configuring two BlockRAMs. The logic utilization reaches only 8% of the total available slices, corresponding to 625 Look-Up Tables (LUTs) and 281 flip-flops. For a detailed resource occupation report see Fig. 4. Such a small resource requirement makes it possible to implement more than one instance (about 10) of the SoC in the same Virtex II 1000 device. The maximum working frequency is the same as that of the LiloBlaze microprocessor (about 100 MHz): we added the peripheral hardware in such a way that it did not degrade the frequency performance.

We used the Memec V2MB1000 demo board to test our SoC. We took advantage of an RS-232/LVTTL transceiver hosted on the board to let the system perform serial I/O toward a Personal Computer (PC) through the SoC's UART. We designed an on-chip ADC emulator to perform data acquisition tests without building a dedicated board. The emulator is asynchronous with respect to the system logic, as a real ADC would be, and provides a programmable sequence of 2k 8-bit words to stimulate the ADC synchronization logic. Fig. 5 shows the output produced by the microprocessor while running a test program, which displayed the histogram of the acquired data on a serial terminal and calculated the FWHM and the maximum. In that case the ADC emulator was programmed to provide a Gaussian-shaped sequence.

Our SoC requires an 870 × 870 µm² core area and a 1.2 × 1.2 mm² total die area in a VLSI CMOS 0.18 µm standard cell technology (Fig. 6). The CPU Program RAM (which stores 256 16-bit words) takes 25% of the core area, while the DAQ


Figure 6. Layout of the System in a VLSI CMOS 0.18 µm standard cell technology (routing has been hidden for the sake of clarity).

RAM (which stores 256 8-bit words) takes 16% of it. Both RAMs have been implemented using hard macros generated by the silicon foundry. We decided to use the SPI bus to load the program memory in order to limit the number of pins of the system. In this way we achieved a good trade-off between the area required by the core logic and the area required by the I/O pads. A parallel loading logic would have made the integrated circuit pad-limited, increasing the area and consequently the cost. The die has 12 bonding pads per side, for a total of 48. There are 38 signal pins, 6 power supply pins for I/O and 4 power supply pins for core logic. The placement of both the I/O and core supply pins has been optimized to minimize the IR drop on the supply rings; moreover, we placed them beside sensitive pins, such as clock and reset, to shield these from cross-talk. The maximum working frequency we achieved in the VLSI integrated circuit was about 180 MHz. The chip is currently being manufactured.

V. INTEGRATED DEVELOPMENT ENVIRONMENT

The microprocessor firmware is developed in assembly language. The assembler analyzes instruction syntax and flags errors if they occur. We modified the original assembler to support the instructions we added. We also developed a graphical Integrated Development Environment (IDE), which integrates an editor with syntax highlighting, an assembler, a simulator and a debugger (Fig. 7). The IDE is based on a free open-source tool named Kpicosim [6], which we tailored and expanded to support our architecture. It is written in C++ and runs under Linux, but could easily be ported to most common operating systems. The user writes assembly source code in the editor and then runs the assembler, which checks the syntax of the source and, if the check is successful, produces executable code for the microprocessor. The simulator is based on an object-oriented software model of the System-on-Chip and is able to emulate the execution of the code provided by the assembler. The debugger controls the simulator and allows the user to set breakpoints and to execute the program step by step. It also makes it possible to monitor the program execution flow, including the microprocessor registers, dumps of both the DAQ and the DATA RAM, and the operation of the peripherals and their control registers. Moreover, it allows the developer to analyze the output on I/O ports and to stimulate them with input data.
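To give an idea of what an object-oriented software model of the acquisition peripherals of Section III might look like, the following Python sketch models pedestal suppression, threshold checking, the MCA and waveform modes, the Sample Counter and the two interrupt conditions. It is an illustration only and not the simulator's actual C++ code: the class and attribute names are hypothetical, and several behaviors that the text leaves open are assumptions marked in the comments (clamping at zero after pedestal subtraction, a strict "greater than" threshold test, sequential storage in waveform mode, and a saturated channel keeping the value 255).

```python
class AcquisitionModel:
    """Toy software model of the acquisition peripherals (hypothetical names)."""

    MCA_MODE, WAVEFORM_MODE = 0, 1

    def __init__(self, pedestal=0, threshold=0, mode=MCA_MODE):
        self.pedestal = pedestal
        self.threshold = threshold
        self.mode = mode                 # mode register driving the multiplexer
        self.daq_ram = [0] * 256         # DAQ RAM: 256 x 8 bit
        self.sample_counter = 0          # 8-bit Sample Counter (SC)
        self.sc_last = False             # set when SC reaches zero
        self.irq = False
        self.enabled = False
        self._write_addr = 0             # assumption: waveform samples are stored sequentially

    def start(self, n_samples):
        """Load SC with the number of samples to acquire and start the acquisition."""
        self.sample_counter = n_samples & 0xFF
        self.sc_last = False
        self.irq = False
        self._write_addr = 0
        self.enabled = self.sample_counter > 0   # assumption: loading 0 starts nothing

    def push_sample(self, adc_value):
        """Process one 8-bit ADC sample; ignored while the acquisition is stopped."""
        if not self.enabled:
            return
        # Pedestal suppression (assumption: the result is clamped at zero).
        value = max((adc_value & 0xFF) - self.pedestal, 0)
        # Threshold checking (assumption: strictly greater than the threshold).
        if value > self.threshold:
            if self.mode == self.MCA_MODE:
                if self.daq_ram[value] == 255:
                    self.irq = True      # MCA overflow; assumption: the channel stays at 255
                else:
                    self.daq_ram[value] += 1
            else:
                self.daq_ram[self._write_addr] = value
                self._write_addr = (self._write_addr + 1) & 0xFF
        # SC is decremented for every sample read from the ADC.
        self.sample_counter -= 1
        if self.sample_counter == 0:
            self.sc_last = True          # end of acquisition
            self.irq = True
            self.enabled = False
```

In this sketch, an interrupt handler would tell the two IRQ sources apart as the text describes: `sc_last` is set (and `sample_counter` is zero) only when the interrupt marks the end of the acquisition, whereas after an MCA overflow samples are still pending.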


Figure 7. Screenshot of the IDE main window.

The simulator is open-source software, so it is possible to expand it with software models of additional hardware and to simulate their communication with the SoC. For instance, the simulator provides an ASCII terminal emulator, which emulates a hardware terminal connected to an I/O port of the processor and is very helpful for debugging. The ADC input data to simulate are described by writing a list of ADC output values, each with the time at which it will appear at the input of the SoC.

VI. CONCLUSIONS

The SoC we have developed has been successfully tested on an FPGA and it fulfills all the starting requirements. The VHDL description of the system is completely technology independent, allowing the user to implement the SoC either by programming an FPGA or by designing a standard cell Application Specific Integrated Circuit (ASIC). Our system has a very low logic footprint: it requires just 625 LUTs on a Xilinx Virtex II 1000 FPGA and a 1.2 × 1.2 mm² die area in a VLSI CMOS 0.18 µm process. The maximum working frequency is 100 MHz in the FPGA device and 180 MHz in the standard cell ASIC. The system works in real time and its timing is easy to predict. It also provides low interrupt and parallel I/O latency (5-6 clock cycles for the IRQ and 2 for I/O). Thanks to the tiny size of our system, the user could integrate it in the detector readout electronics with a negligible impact on resource occupation. For the same reason, it is also possible to replicate the system many times in the same integrated circuit (to monitor more than one detector channel). Our system could also be used with virtually any detector with an analog output, provided that a suitable analog pre-processing of the signal is available (e.g. amplification, sample and hold, integration). Our architecture is extremely versatile and can be deployed even with systems that are not detectors. For instance, it has been successfully used to build an integrated testing and calibration platform for two FPGA-based TDCs [7]. Moreover, it has been used to implement a Bit Error Rate Tester for 2eSST block transfers on the VME64x bus [8].

REFERENCES
[1] (2003). XAPP387 (v1.1) PicoBlaze 8-Bit Microcontroller for CPLD Devices. Xilinx. [Online]. Available: http://www.xilinx.com/bvdocs/appnotes/xapp387.pdf
[2] (2007). DS031 (v3.5) Virtex-II Platform FPGAs: Complete Data Sheet. Xilinx. [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds031.pdf
[3] (2007). UG002 (v2.2) Virtex-II Platform FPGA User Guide. Xilinx. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug002.pdf
[4] (2007). XST User Guide. Xilinx. [Online]. Available: http://toolbox.xilinx.com/docsan/xilinx92/books/docs/xst/xst.pdf
[5] (2003). ADC0820 - 8-Bit High Speed µP Compatible A/D Converter with Track/Hold Function. National Semiconductor. [Online]. Available: http://www.national.com/mpf/DC/ADC0820.html
[6] (2005). Kpicosim. A simulator and assembler for the picoblaze, with a graphical user interface. Mark Six. [Online]. Available: http://www.xs4all.nl/~marksix/kpicosim.html
[7] A. Aloisio, P. Branchini, R. Cicalese, R. Giordano, V. Izzo and S. Loffredo, "FPGA Implementation of High-Resolution Time-to-Digital Converter," in Proc. of 2007 Nuclear Science Symposium, 2007.
[8] A. Aloisio, F. Cevenini, R. Cicalese, R. Giordano and V. Izzo, "Beyond 320 Mbyte/s with 2eSST and Bus Invert coding on VME64," IEEE Trans. on Nucl. Sci., to be published in 2008.

