Using LISATek for the Design of an ASIP Core including Floating Point Operations

Reimund Klemm, Javier Prieto Sabugo, Hendrik Ahlendorf, Gerhard Fettweis
Vodafone Chair Mobile Communication Systems, Technische Universität Dresden
D-01062 Dresden, Germany
{klemm, fettweis}@ifn.et.tu-dresden.de
Abstract - Application-specific instruction-set processors (ASIPs) have recently become more important for overcoming compute bottlenecks in digital signal processing systems with tight power constraints. Within the last years, commercial tools like the LISATek framework have emerged that allow designers to specify ASIP architectures in a dedicated description language, shortening the design cycle dramatically compared to classical register-transfer-level (RTL) approaches. However, if such ASIPs are to be used in complex system-on-chip designs, they must integrate easily into existing design flows to allow an iterative design process. In this paper we investigate the capabilities of the LISATek framework by implementing a RISC core compliant to the SH-1. Additionally, we integrated an existing IP core, a floating point processing unit, into the architecture to demonstrate instruction customization. In doing so we observed some minor limitations, which should be overcome in the future for a better practical design flow.
Index Terms - ASIP, RISC, LISA, floating point, architecture exploration
1 Introduction
In recent years, application-specific instruction-set processors (ASIPs) have been seen as one potential solution to cope with the increasing complexity of complete system-on-chip platforms [1]. Pure ASIC-centric solutions for systems with 1 M+ gate counts become difficult to manage due to the verification and integration burden, while general purpose architectures are not applicable to strongly power-limited applications like wireless communication. Recent trends in application-specific system-on-chip platforms like StepNP [2] or MeP [3] show that one possible way to build powerful architectures is to aggregate multiple simple processor cores with an instruction set customized for the targeted application domain. Customizable architectures like the Xtensa from Tensilica [4] or the 750D from ARC [5] have gained significant importance in modern system-on-chip platforms, but they allow the instruction set to be customized only in a very limited way.
If larger flexibility in customizing the instruction set architecture (ISA) is needed, tools operating at the architecture level using architecture description languages (ADLs) can be used. There has been much research in this field, but so far LISA is the only such language that has also gained commercial acceptance. We used this approach to design a RISC core with a five-stage pipeline, focusing on how to enrich the instruction set with advanced functionality like floating point operations.
2 Tool flow for ASIP design
For the case study we used the LISATek tool suite, including the HDL Processor Generator, for architecture exploration and design. The design is specified in the LISA 2.0 language, a mixed-level behavioral/structural architecture description language (ADL) [6]. The instruction syntax and binary encoding of the complete instruction set are defined in a tree-like structure, and the assembler and linker are generated from this description.
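To illustrate what such a tree-like coding structure implies for decoding, the following C sketch selects an instruction group from the top nibble of a 16-bit word and refines it from further bits. The encodings and names here are purely illustrative and not taken from the paper's SH-1 coding tree.

```c
#include <stdint.h>

/* Illustrative tree-like decode: the top nibble of a 16-bit
 * instruction word acts as the root of the coding tree, further
 * bits refine the operation. Hypothetical encodings, not the
 * real SH-1 instruction map. */
typedef enum { OP_ADD, OP_MOV, OP_UNKNOWN } opcode_t;

opcode_t decode(uint16_t insn) {
    switch (insn >> 12) {            /* root node of the coding tree */
    case 0x3:                        /* hypothetical arithmetic group */
        if ((insn & 0xF) == 0xC) return OP_ADD;
        break;
    case 0x6:                        /* hypothetical transfer group */
        if ((insn & 0xF) == 0x3) return OP_MOV;
        break;
    }
    return OP_UNKNOWN;
}
```

A real coding tree for roughly 180 instructions simply has more levels and leaves, but the same hierarchical selection principle applies.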
Figure 1: Flow chart of the used design process
To address the structural properties of the architecture to be designed, functional units, memories and pipelines must be defined. Control flow is described via activation records. Additionally, dedicated function directives for the pipeline like stall() or flush() simplify the control of the pipeline registers. For details on the architecture modeling, please refer to [7]. If certain guidelines of the language are met, an HDL description of the complete model can be generated in either Verilog or VHDL [8, 9]. We put the generated
Process:              130 nm, 1.2 V low power
Operating condition:  Typical, 1.2 V, 25 °C
Synthesis wire load:  10K gates

Table 1: Technology parameters at synthesis
HDL code into a classic synthesis platform, namely Design Compiler and related tools from the Synopsys Galaxy platform (see figure 1). Using a standard cell library with the properties listed in table 1, the performance in terms of timing and area could be estimated for different architecture set-ups. Additionally, the binaries from the generated assembler were used for a first verification at netlist level. In this study we did not include back-annotated data from place and route. Our experience is that for small processor cores like the one discussed, the synthesis wire load models match well if pessimistic input/output constraining of half the clock cycle is used; the resulting delay estimates can then be considered feasible. All area numbers mentioned hereafter refer to post-synthesis values for the core from Synopsys Design Compiler. We did not consider post place-and-route chip area, as it is strongly influenced by the memories and the floorplan quality. Overall, such a methodology speeds up the design of ASIP architectures dramatically. A designer familiar with the LISA language and tools can specify a complete synthesizable architecture within one month. Since the assembler is generated from the architecture description, it is straightforward to create first test benches for the netlist. In our experience, reaching the first successful run of assembly code takes much longer with a classical HDL approach.
3 Designing the Processor Core
For our case study we took the instruction set of the SH-1 SuperH core from Renesas Technology as a point of reference [10]. As mentioned before, the goal is to carry the core design from an architecture description down to a physical implementation of a real processor core.
Figure 2: Used pipeline structure
The SH-1 was developed by Hitachi in the early 90s, and the first models were integrated into CD drives and fax machines. Nowadays this core architecture is widely used in the embedded domain, especially in the automotive sector for control purposes. The SH-1 itself is a 32 bit processor with a fixed-length 16 bit instruction word. A multiplier and a 64 bit accumulator register are provided to efficiently implement signal processing operations like FIR filtering.
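The multiply-accumulate datapath mentioned above can be sketched in plain C: a 32x32-bit multiply whose products are summed into a 64-bit accumulator, as used for FIR filtering. Function and variable names are illustrative, not SH-1 mnemonics or registers.

```c
#include <stdint.h>

/* Sketch of a MAC-based FIR step: each 32x32-bit product is widened
 * and summed into a 64-bit accumulator, avoiding overflow for
 * realistic tap counts. */
int64_t fir_mac(const int32_t *coeff, const int32_t *sample, int taps) {
    int64_t acc = 0;                           /* 64-bit accumulator */
    for (int i = 0; i < taps; i++)
        acc += (int64_t)coeff[i] * sample[i];  /* widen before multiply */
    return acc;
}
```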
if((IN.pRn == WB.IN.pRn) &&
   (WB.IN.writeback_code == WRITEBACK_WRITERN)){
    // bypass the value from WB
    OUT.op_2 = WB.IN.result;
}else{
    // read the value from the register file
    TpRn = IN.pRn;
    OUT.op_2 = R[TpRn.ExtractToLong(0,4)];
}

Figure 3: Definition of the bypass from WB to the OPFE stage
By specifying the complete coding tree for the SH-1 instructions, we are able to generate SH-1 compliant binaries using the LISATek assembler generator. Using the Renesas Embedded Workshop [11], which includes a full-blown gcc compiler, even standard C applications could be built on top of it. We only had to do some minor Perl scripting to adapt the generated assembly to the LISATek assembler, mainly because of some differing label and immediate naming conventions in LISA. The definition of the coding tree, comprising around 180 instructions, was done within one man-week. With the visualization tool Instruction-Set Designer [12], even complex instruction sets like that of the SH-1, which has a strongly irregular coding structure, can be handled. Especially when binary code compatibility for an architecture matters, we consider the Instruction-Set Designer a very handy tool.

For the architecture itself we implemented a 5 stage RISC-like pipeline (see figure 2). Compared to the classic DLX structure [13], we slightly reordered the stages into the following arrangement:

1. IF - instruction fetch from the program memory
2. DE - instruction decode
3. OPFE - operand fetch from the register file or data memory
4. EXE - execution of the issued operation
5. WB - write back to the register file or memory

The implementation challenge for every pipelined architecture is hazards, resulting either from resource conflicts (structural hazards), data dependencies (data hazards) or the program flow (control hazards). In this paper we concentrate on data hazards and how different architectural trade-off scenarios for avoiding them can easily be studied in LISA. Looking again at our pipeline structure in figure 2, it is obvious that a data hazard occurs when instruction I in pipeline stage WB writes the same data value that is fetched by instruction I+2 in stage OPFE.
Here I denotes the instruction number in the program sequence. The same holds if instruction I+1 in the EXE stage uses an operand for its calculation that is written by stage WB. The simplest solution is to stall the stages IF, DE and OPFE if a data conflict is detected at the WB stage. In LISA this can be implemented by calling dedicated function
primitives like FE.stall() or OPFE.stall() in the appropriate behavior. However, this slows down the pipeline, as an empty cycle is introduced. Bypassing the value directly from WB to OPFE is the preferable solution, as no delay cycle has to be added. This technique is very well known, but implementing different set-ups in a hierarchical HDL design is labor-intensive: the values from the appropriate stages must be wired into the stage doing the actual bypass evaluation. In contrast, the bypass in our LISA model is just a short if-else construct, with the whole concern located in a few lines of code (see the code listing in figure 3). For our realization we implemented different bypasses preventing read-after-write (RAW) data hazards for the OPFE and EXE stages. It turned out that placing bypass evaluation logic in the EXE stage is critical in terms of delay. Of course an experienced processor designer would know this, but for more complicated pipeline set-ups that do some pre-calculation in other stages, this is less obvious. To obtain a better balanced pipeline (with nearly the same critical path length for all stages), we placed all the evaluation logic in the OPFE stage. For Bypass 3 this was achieved by writing the results of the EXE behavior into a global LISA variable, which synthesizes into a wire. Overall we studied the following pipeline constellations:

• No bypass: the pipeline resolves RAW hazards by stalling
• Bypass 1: the pipeline resolves RAW hazards between instructions I and I+2, with the evaluation logic located in stage OPFE
• Bypass 2: in addition to Bypass 1, RAW hazards between instructions I and I+1 are resolved; the bypass evaluation logic is located in the EXE stage
• Bypass 3: in addition to Bypass 1, RAW hazards between instructions I and I+1 are resolved; the bypass evaluation logic is located in the OPFE stage
Each operation in the execute stage writes to a global LISA variable that is fed into the bypass evaluation of the OPFE stage. Overall it can be concluded that, to resolve the RAW hazards in our pipeline structure, Bypass 3 should be preferred over Bypass 2: the additional area of Bypass 3 is far outweighed by the shortening of the critical path, resulting in a better area-delay product than Bypass 2 (see figure 4).
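The bypass decision of figure 3, extended to the forwarding path from EXE (Bypass 3), can be modeled in a few lines of C: the OPFE stage compares its source register index against the destination registers of the younger instructions in flight and selects the forwarded value instead of the register file read. Structure and function names are illustrative, not the LISA model's identifiers.

```c
#include <stdint.h>
#include <stdbool.h>

/* One in-flight instruction's writeback information, as seen by the
 * bypass evaluation in OPFE. */
typedef struct {
    int      dest;    /* destination register index */
    uint32_t result;  /* value to be written back */
    bool     writes;  /* does this instruction write a register? */
} stage_t;

/* Operand fetch with RAW bypassing: prefer the youngest producer
 * (EXE, i.e. instruction I+1) over the older one (WB, I+2), and fall
 * back to the register file when no hazard exists. */
uint32_t opfe_read(int src, const uint32_t *regfile,
                   const stage_t *exe, const stage_t *wb) {
    if (exe->writes && exe->dest == src) return exe->result;
    if (wb->writes  && wb->dest  == src) return wb->result;
    return regfile[src];
}
```

Priority order matters here: if both EXE and WB target the same register, EXE holds the newer value and must win.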
Figure 4: Area and delay trade-off for the different bypass realizations
Figure 5: Datapath with "ghost" floating point unit
4 Integrating IP Cores for Floating Point Extension
The previous section showed that we can easily elaborate different architecture set-ups and observe their impact at the physical level. However, for advanced system-on-chip designs it is often necessary to integrate custom IP cores from library vendors into the model. For instance, libraries like Synopsys DesignWare offer various components to efficiently perform timing-critical data processing, or complete interfaces for advanced memory controllers like SDRAM. As an example, we considered the integration of a floating point processing unit. Karuri [14] has already done this completely in synthesizable LISA behavior, which has the advantage of higher portability, as no specific library instantiation at RTL is required. However, we think the performance disadvantages on timing-critical paths like floating point processing justify the integration of dedicated IP cores, which cannot be inferred from HDL operators. To easily modify the HDL generated by the Processor Generator, we introduced a "ghost" floating point (FP) unit (see figure 5). In the EXE stage of the processor description, operations like addition and multiplication are introduced using a reinterpret-cast (see the code listing in figure 6). This is possible because the generator inlines the LISA behavior into C/C++ code to generate the processor simulator and assembler. Note that the generated simulator is not fully IEEE 754 compliant: when the code is ported to another host simulation architecture (x86, SPARC, Itanium), the saturation modes might differ. Since the LISATek Processor Designer also allows external libraries to be compiled and linked into its model, fully IEEE compliant floating point libraries like SoftFloat [15] can be used in the LISA behavior description; for our investigative purposes this effort was not justified. By setting pragmas for the Processor Generator, the generated HDL for the FP unit is only a skeleton.
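The reinterpret-cast idea of figure 6 can be sketched in portable C: the 32-bit integer pipeline registers carry IEEE 754 bit patterns, and the FP behavior views them as floats. This sketch uses memcpy for the type punning to stay free of aliasing issues; the LISA listing uses the pointer-cast form directly.

```c
#include <stdint.h>
#include <string.h>

/* Add two single precision values that travel through the pipeline
 * as raw 32-bit register contents. The host's FP unit performs the
 * actual addition, as in the generated simulator. */
uint32_t fp_add_bits(uint32_t a_bits, uint32_t b_bits) {
    float a, b, r;
    uint32_t r_bits;
    memcpy(&a, &a_bits, sizeof a);   /* reinterpret register as float */
    memcpy(&b, &b_bits, sizeof b);
    r = a + b;                       /* host floating point addition */
    memcpy(&r_bits, &r, sizeof r);   /* back to the register view */
    return r_bits;
}
```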
The combinational logic for the floating point operations must then be inlined by manually instantiating the IP components in Verilog. By doing so, we do not manipulate any HDL code that is generated for the pipeline control (stall, flush, bypassing) or that has crosscutting concerns in multiple HDL modules. From our experience in processor design, debugging the control structures for the pipeline and the interfaces is the most error-prone and stressful task, as it means testing and covering all possible state transitions of the associated logic. Breaking a top-down flow to include some strongly optimized HDL for the behavior is less critical, as its debugging is relatively straightforward. After synthesizing the HDL and simulating example applications at netlist level, it turned out, as expected, that the FP unit became the critical path. For better pipeline balancing and to minimize the overall critical path, we decided to implement a second pipeline with two execute stages for the floating point operations (see figure 7). This
OPERATION fp_op_EXE IN pipe.EXE {
    BEHAVIOR{
        #pragma analyze(off)
        float fpop1, fpop2, fpresult;
        fpop1 = *((float*)&IN.op_1);
        fpop2 = *((float*)&IN.op_2);
        #pragma analyze(on)
        ...
        #pragma analyze(off)
        fpresult = fpop1 + fpop2;
        OUT.result = *((int*)&fpresult);
        #pragma analyze(on)
        ...
    }
}

Figure 6: Defining floating point addition in LISA using Reinterpret-Cast

Core with single cycle floating point instruction
                        Area [µm²]   Delay [ns]   A*T [µm²·µs]
Delay optimization      145,570      3.70         538
Compromised synthesis   132,646      4.04         535

Core with dual cycle floating point instruction
                        Area [µm²]   Delay [ns]   A*T [µm²·µs]
Delay optimization      163,339      3.00         490
Compromised synthesis   146,605      3.79         555

Table 2: Area and delay characteristics with single precision floating point instruction
pipeline is activated in the activation records of the preceding OPFE stage if a floating point instruction is detected one cycle before its actual execution. Additionally, if a single cycle execute operation follows a floating point operation, the pipeline has to be stalled for one cycle; otherwise a structural hazard occurs in the WB stage. As mentioned before, such pipeline controls in LISA are just simple function calls in the appropriate stage. One remaining limitation is that the Processor Generator cannot generate HDL for designs with multiple pipelines. Hence, including the second pipeline required some more HDL coding, as the control and clock signals had to be wired in as well. Finally, we were able to include a two-stage FP unit in our design and successfully run assembly code on the LISA model as well as on the netlist. For the pipeline balancing of the two-stage FP unit, we used the register retiming feature of the Synopsys Design Compiler [16]: two register stages are placed at the output of the FP unit and are shifted by the synthesis into the logic of the unit to shorten the critical path.
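The structural-hazard condition described above reduces to a simple latency comparison: a single-cycle instruction issued directly after a two-cycle FP instruction would reach WB in the same cycle, so OPFE must stall for one cycle. A minimal C sketch of that check (the function name is illustrative, not a LISA primitive):

```c
#include <stdbool.h>

/* Decide whether the pipeline must stall for one cycle: the two
 * writebacks collide in WB exactly when the current instruction
 * finishes one cycle faster than its predecessor. Latencies are
 * 1 for single-cycle ops and 2 for the two-stage FP ops. */
bool must_stall(int prev_latency, int cur_latency) {
    return prev_latency - cur_latency == 1;
}
```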
Figure 7: Datapath with floating point extension pipeline
Overall, our approach resulted in a processor core with a 3.8 ns delay for the compromised synthesis (see table 2), employing a 130 nm standard cell library with the parameters from table 1. In contrast, Karuri et al. [14] obtained a critical path of 15 ns in a 130 nm process for a RISC core with single precision floating point instructions. Even though our approach takes some more effort to port to other RTL tool flows, we showed that including HDL behavior at the appropriate parts of the design enables a significant performance boost with a negligible increase in integration and verification effort. If the Processor Generator supports multiple pipelines in the future, some of these pipelines could be used to conveniently generate code skeletons for integrating custom HDL cores.
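The area-delay products in table 2 follow directly from the area and delay columns; since area is given in µm² and delay in ns, A*T in µm²·µs needs a factor of 10⁻³. A one-line check:

```c
/* A*T in µm²·µs from area in µm² and delay in ns; the 1e-3 factor
 * converts ns to µs, matching the units used in table 2. */
double area_delay(double area_um2, double delay_ns) {
    return area_um2 * delay_ns * 1e-3;
}
```

For example, the delay-optimized single-cycle core gives 145,570 µm² · 3.70 ns · 10⁻³ ≈ 538 µm²·µs, as listed.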
5 Conclusions
In this paper we explored a RISC-like architecture, taking the SH-1 as a point of reference. Using a tool flow addressing the architecture level as well as the register transfer level, we were able to consider different architecture set-ups and determine their physical impact on the silicon realization. Furthermore, by showing how to integrate custom IP cores for advanced instruction extensions like floating point operations, we arrived at a solution with hardly any penalty over an RTL-based approach.

We would like to thank Dr. Oliver Schliebusch from CoWare Inc. for valuable advice on the features of the LISATek Processor Generator.
References

[1] M. Gries and K. Keutzer, Building ASIPs: The Mescal Methodology. Springer, 2005. [Online]. Available: http://www.gigascale.org/pubs/603.html
[2] P. G. Paulin, C. Pilkington, E. Bensoudane, M. Langevin, and D. Lyonnard, "Application of a multi-processor SoC platform to high-speed packet forwarding," in DATE '04: Proceedings of the Conference on Design, Automation and Test in Europe, Paris, France, 2004, pp. 58–63.
[3] MeP core website, http://www.mepcore.com, 2004.
[4] Xtensa Instruction Set Architecture (ISA) Reference Manual, Tensilica Inc., 3255-6 Scott Blvd., Santa Clara, CA, 2005.
[5] 750D Core Architecture, ARC Inc., 3590 N. First Street, San Jose, CA, 2005.
[6] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, O. Wahlen, and H. Meyr, "A methodology for the design of application specific instruction set processors (ASIP) using the machine description language LISA," in ICCAD '01: Proceedings of the 2001 IEEE/ACM International Conference on Computer-Aided Design. San Jose, California: IEEE Press, 2001, pp. 625–630.
[7] LISA Language Reference Manual, CoWare Inc., 1732 N. First Street, San Jose, CA, 2005.
[8] O. Schliebusch, A. Chattopadhyay, E. M. Witte, D. Kammler, G. Ascheid, R. Leupers, and H. Meyr, "Optimization techniques for ADL-driven RTL processor synthesis," in RSP '05: Proceedings of the 16th IEEE International Workshop on Rapid System Prototyping (RSP'05), Montréal, Canada, 2005, pp. 165–171.
[9] LISATek Methodology Guidelines for the Processor Generator, CoWare Inc., 1732 N. First Street, San Jose, CA, 2006.
[10] SH-1/SH-2/SH-DSP Software Manual, Renesas Technology Corp., Marunouchi Bldg., 2-4-1 Marunouchi, Chiyoda-ku, Tokyo 100-6334, Japan, 2004.
[11] SuperH High-performance Embedded Workshop V.3 User's Manual, Renesas Technology Corp., Marunouchi Bldg., 2-4-1 Marunouchi, Chiyoda-ku, Tokyo 100-6334, Japan, 2004.
[12] LISATek Processor Designer Manual, CoWare Inc., 1732 N. First Street, San Jose, CA, 2005.
[13] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 1996.
[14] K. Karuri, R. Leupers, G. Ascheid, H. Meyr, and M. Kedia, "Design and implementation of a modular and portable IEEE 754 compliant floating-point unit," in DATE '06: Proceedings of the Conference on Design, Automation and Test in Europe. Munich, Germany: European Design and Automation Association, 2006, pp. 221–226.
[15] SoftFloat distribution, http://www.jhauser.us/arithmetic/SoftFloat.html, 2002.
[16] Design Compiler Reference Manual, Synopsys, Inc., 700 East Middlefield Road, Mountain View, CA, 2006.