

# Cycle Approximate Simulation of RISC-V Processors

<u>Lee Moore</u>, Duncan Graham, Simon Davidmann Imperas Software Ltd.

Felipe Rosa Universidade Federal do Rio Grande do Sul

Embedded World conference 27 February 2018





- Why is timing estimation important for embedded systems?
- Current techniques for timing estimation
- Instruction accurate simulation
- Instruction Accurate + Estimation (IA+E)
- Results








# Timing Estimation is Important



- Most embedded systems have real-time requirements of some kind,
  e.g. response time limits
- Clearly needed for automotive systems, but also for other transportation systems, medical electronics, industrial controls, IoT, ...
- Timing estimation can help with system testing; it can also help demonstrate to a prospective customer that the SoC under consideration can meet the system requirements










### Challenge

More and more software in electronic products, with more and more system scenarios, means more and more simulation is needed, which in turn means more and more simulation speed is needed

- So if more timing detail is needed, it must come with as little loss of simulation speed as possible
  - Many applications execute many millions of instructions


# **Techniques for Timing and Power Estimation**



| Technique                    | Strengths                 | Weaknesses                                                                                        |
|------------------------------|---------------------------|---------------------------------------------------------------------------------------------------|
| Manual spreadsheets          | Ease of use               | Lack of accuracy; inability to support estimations with real software                             |
| Hardware emulators           | Cycle accurate            | High cost (millions USD); needs RTL; < 5 MIPS performance                                         |
| FPGA prototypes              | Cycle accurate            | High cost (hundreds of thousands USD); needs RTL                                                  |
| Cycle approximate simulation | Good performance          | Lack of accuracy; lack of availability of models                                                  |
| Cycle accurate simulation    | Cycle accurate            | High cost (hundreds of thousands of USD); lack of availability of models                          |
| gem5                         | Microarchitectural detail | A lot of work to develop a model of specific microarchitecture and to get realistic traces of SoC |



#### **Cycle Approximate, Cycle Accurate Simulation Market Survey**

Several approaches over the years ...



- RTL (SystemVerilog) simulation
  - Expensive, slow, late in project, restricted access to IP
- Cycle accurate simulation (RTL based)
  - RTL converted to C, compiled and simulated
  - Expensive, slow, late in project, complex to set up
- Cycle approximate models (C based)
  - Hand coded models...
  - Complex to create models, expensive to build, slowish
- Cycle accurate performance simulation (open source)
  - SimpleScalar, gem5
  - Limited processor architectures, standalone, slow
  - On your own ..., support?
- Cycle accurate performance simulation (proprietary)
  - Users who develop their own
  - Requires expert resources, maintenance
  - Usability? Interfaces to standards? Speed? (slow)
  - Traditionally performance simulation has been done with 'trace based' solutions as a separate post simulation process that uses large memory and other resources








#### Virtual Platforms Provide a Simulation Environment Such That the Software Does Not Know That It Is Not Running On Hardware



- The virtual platform is a set of instruction accurate models that reflect the hardware on which the software will execute
  - Could be 1 SoC, multiple SoCs, board, system; no physical limitations
- Run the executables compiled for the target hardware
- Models are typically written in C or SystemC
- Models for individual components (interrupt controller, UART, Ethernet, ...) are connected just like in the hardware
- Peripheral components can be connected to the real world by using the host workstation resources: keyboard, mouse, screen, ethernet, USB, ...
- High performance: 200-500 million instructions per second typical, or boots Linux in <5 sec

# **Example RISC-V IA Virtual Platform**





Boots Linux in under 5 seconds








# Overview of CPU Timing Estimator





- > platform.exe -timing
- > platform.exe -timing --stretch
- CPU Characterization Data for each CPU variant
- Timing Estimator loaded onto CPU instance as binary intercept library (SlipStreamer API)
- No edits/changes needed in CPU model binary, or platform, or other models
- Controlled by simulation command line arguments
- Imperas simulation speeds up to 500 MIPS with timing estimation
  - Note that this is 200-500x faster than the callback method with instruction accurate simulation
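
The speed-up figure above can be turned into a rough run-time comparison. The sketch below is a back-of-envelope calculation only: it assumes the 200-500x factor is measured against the 500 MIPS peak (the slide does not state the baseline), and the workload size is a hypothetical placeholder.

```python
# Back-of-envelope check of the speed-up figures quoted on the slide.
# Assumption: the 200-500x factor is relative to the 500 MIPS peak.

PEAK_MIPS = 500.0                      # from the slide: "up to 500 MIPS"
SPEEDUPS = (200.0, 500.0)              # from the slide: "200-500x faster"
WORKLOAD_INSTRUCTIONS = 1_000_000_000  # hypothetical workload size


def run_time_seconds(instructions: int, mips: float) -> float:
    """Wall-clock seconds to simulate `instructions` at `mips` million instr/s."""
    return instructions / (mips * 1e6)


def implied_callback_mips(peak_mips: float, speedup: float) -> float:
    """Throughput of the slower method implied by a given speed-up factor."""
    return peak_mips / speedup


if __name__ == "__main__":
    fast = run_time_seconds(WORKLOAD_INSTRUCTIONS, PEAK_MIPS)
    print(f"intercept library at {PEAK_MIPS:.0f} MIPS: {fast:.1f} s")
    for speedup in SPEEDUPS:
        mips = implied_callback_mips(PEAK_MIPS, speedup)
        slow = run_time_seconds(WORKLOAD_INSTRUCTIONS, mips)
        print(f"callback method at {mips:.1f} MIPS: {slow:.0f} s")
```

Under these assumptions the callback method would run at roughly 1 to 2.5 MIPS, so a workload that takes seconds with the intercept library would take minutes with callbacks.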

#### **Mechanism**





- Cycles are added according to the instruction executed
- The simulator has a mode (--stretch) that advances simulation time according to the timing calculations
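
A minimal sketch of how cycle accumulation and time stretching might interact. This is not the Imperas API: `SimClock`, `retire`, `run` and the 10 ns cycle period are all hypothetical, and the sketch assumes that without stretching, simulated time advances as if every instruction took one cycle.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SimClock:
    """Hypothetical simulated-time clock; not the Imperas API."""
    cycle_period_ns: float    # assumed time per cycle
    stretch: bool = False     # models the effect of the --stretch mode
    instructions: int = 0
    estimated_cycles: int = 0
    sim_time_ns: float = 0.0

    def retire(self, cycles: int) -> None:
        """Account for one executed instruction costing `cycles` cycles."""
        self.instructions += 1
        self.estimated_cycles += cycles  # the estimate is always accumulated
        # Assumption: unstretched time advances one cycle per instruction;
        # stretched time follows the estimated cycle count instead.
        ticks = cycles if self.stretch else 1
        self.sim_time_ns += ticks * self.cycle_period_ns


def run(cycle_costs: Iterable[int], stretch: bool) -> SimClock:
    """Feed a sequence of per-instruction cycle costs through a clock."""
    clock = SimClock(cycle_period_ns=10.0, stretch=stretch)  # placeholder period
    for cycles in cycle_costs:
        clock.retire(cycles)
    return clock
```

In both modes the cycle estimate is available as a statistic; only the stretched run lets the estimate influence how simulated time elapses, which matters when software interacts with timers or peripherals.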

### **Timing Model**





- The parser module disassembles the binary code and identifies the instruction to be executed
- The identified instruction is used as a hash table key to determine which class the instruction belongs to
- The cycle count is computed and the instruction is executed in the CPU
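
The lookup steps above can be sketched as two hash tables. Everything here is illustrative: the instruction classes and cycle costs are hypothetical placeholders (real values would come from CPU characterization data, which the slides do not list), and the sketch starts from disassembly text rather than from binary code.

```python
# Hypothetical instruction classes and per-class cycle costs.
INSTRUCTION_CLASS = {
    "add": "alu", "addi": "alu", "xor": "alu",
    "lw": "load", "sw": "store",
    "mul": "mul", "div": "div",
    "beq": "branch", "jal": "jump",
}
CLASS_CYCLES = {
    "alu": 1, "load": 2, "store": 1,
    "mul": 3, "div": 20, "branch": 2, "jump": 2,
}
DEFAULT_CYCLES = 1  # unknown instructions fall back to one cycle


def parse_mnemonic(disassembly: str) -> str:
    """Extract the mnemonic from one line of disassembly, e.g. 'lw a0, 0(sp)'."""
    return disassembly.split()[0].lower()


def cycles_for(disassembly: str) -> int:
    """Map an instruction to its class, then the class to a cycle count."""
    instruction_class = INSTRUCTION_CLASS.get(parse_mnemonic(disassembly))
    return CLASS_CYCLES.get(instruction_class, DEFAULT_CYCLES)


def estimate_cycles(program: list[str]) -> int:
    """Sum the estimated cycles over a sequence of executed instructions."""
    return sum(cycles_for(line) for line in program)


if __name__ == "__main__":
    program = [
        "addi sp, sp, -16",
        "sw ra, 12(sp)",
        "lw a0, 0(a1)",
        "mul a0, a0, a2",
        "beq a0, zero, done",
    ]
    print("instructions:", len(program))
    print("estimated cycles:", estimate_cycles(program))
```

Grouping instructions into classes keeps the table small, and a hash lookup per instruction is cheap enough to leave simulation speed largely intact. A context-sensitive estimator would need more state than this sketch carries.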

# **Timing Data**



- Two parts to timing data
  - Cycle information: number of cycles for a given instruction, in context
  - Timing information: estimated time per cycle for a given silicon implementation
- Provided as a separately linked dynamic library
  - Enables processor designers to create a cycle approximate timing simulation without sharing any internal information
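
One way to read the two parts of the timing data is as the factors of a product. The notation below is an assumption introduced for illustration, not taken from the slides: $c(i_k)$ is the cycle count assigned to the $k$-th executed instruction in its context, $N$ is the number of executed instructions, and $t_{\mathrm{cycle}}$ is the estimated time per cycle for the silicon implementation.

$$
C_{\mathrm{est}} = \sum_{k=1}^{N} c(i_k), \qquad T_{\mathrm{est}} = t_{\mathrm{cycle}} \cdot C_{\mathrm{est}}
$$

As a worked example with hypothetical numbers: a sequence estimated at $C_{\mathrm{est}} = 9$ cycles on an implementation clocked at 100 MHz ($t_{\mathrm{cycle}} = 10\,\mathrm{ns}$) gives $T_{\mathrm{est}} = 90\,\mathrm{ns}$. Because the factors are separate, the same cycle information can be reused across silicon implementations by changing only $t_{\mathrm{cycle}}$.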






### Imperas Environment for Embedded Software Development, Debug & Test





#### **Characterization Example**

- Reference Board
  - STM32F4-Discovery board (ARM Cortex-M4F CPU)
- Cortex-M4F running FreeRTOS
  - both are widely used in high-performance embedded system design




#### **Accuracy Evaluation vs Board**

- Experimental Runs
  - Cortex-M4F running FreeRTOS
  - WCET and other benchmarks





### **Performance**



- For Cortex-M4F CPU estimator vs. Board
- Worst case error <13%
- Average error across all benchmarks run <5%
- Most errors are <8%
- Simulation speeds with timing up to 500 MIPS

- Caveat
  - Performance and accuracy are application dependent
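
The slides do not define how the error figures were computed. A minimal sketch, assuming the error is the relative difference between the estimated and board-measured cycle counts; the function names are hypothetical, and the benchmark names and numbers are placeholders, not the measured results.

```python
def relative_error_pct(estimated: float, measured: float) -> float:
    """Absolute relative error of the estimate against the board, in percent."""
    return abs(estimated - measured) / measured * 100.0


def summarize(results: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Worst-case and average error over {benchmark: (estimated, measured)}."""
    errors = [relative_error_pct(est, meas) for est, meas in results.values()]
    return {"worst": max(errors), "average": sum(errors) / len(errors)}


if __name__ == "__main__":
    # Placeholder data for illustration only; not from the presentation.
    placeholder_results = {
        "bench_a": (1040.0, 1000.0),
        "bench_b": (950.0, 1000.0),
        "bench_c": (2200.0, 2000.0),
    }
    summary = summarize(placeholder_results)
    print(f"worst case error: {summary['worst']:.1f}%")
    print(f"average error:    {summary['average']:.1f}%")
```

Reporting both the worst case and the average matters here because, as the caveat notes, accuracy depends on the application: a low average can hide a benchmark that the estimator models poorly.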



#### **Example RISC-V Platform**

- RISC-V RV32 processor model
- Various peripheral models
- Several benchmark applications run, with different compiler optimizations




# **RISC-V Experiments**

- Timing estimation using IA+E for RISC-V 32 bit
  - Andes Technology N25
  - Microsemi Mi-V RV32IMA
  - SiFive E31
- Different benchmarks used for each processor
  - Do not want to compare processor performance
- Only cycle information results are presented, to avoid providing misleading timing estimation data on different processors





#### **Andes N25 Results**

#### Comparing:

IA = 1 cycle per instruction

IA+E = estimated cycles per instruction







#### Microsemi Mi-V RV32IMA Results

#### **Comparing:**

IA = 1 cycle per instruction

IA+E = estimated cycles per instruction







#### **SiFive E31 Results**

#### Comparing:

IA = 1 cycle per instruction

IA+E = estimated cycles per instruction






### **Summary & Conclusions**

- IA+E technique shows excellent results for speed and minimal overhead, with acceptable accuracy
  - Very fast (up to 500 MIPS)
  - Simple timing estimation, or simulation time stretching
- Limitations
  - In-Order deterministic processors only
- Further work
  - Extend to more complex processors: cache, multi-core, out-of-order execution
  - Apply this technique to power estimation




#### **Thank You!**

 See Imperas at the RISC-V Foundation booth, Hall 3A-419

Stay for the next RISC-V presentation, after the break:

Securing RISC-V Machines Dynamically with Hardware-Enforced Metadata Policies, Steve Milburn, Dover Microsystems

