## New RISC-V processor family

# From microcontroller to multicore processor

10 November 2021, 11:30 | By Frank Riemenschneider, Segger Microcontroller



The latest announcement of a strategic partnership with Renesas opened the door for RISC-V startup SiFive into the club of established semiconductor manufacturers. With its Series 7 processors, SiFive bridges the gap from microcontroller to multicore processor, which is anything but trivial.

The announcement was spectacular: **SiFive**, the largest supplier of processor IP based on the RISC-V instruction set architecture, but still with the image of a startup, will become a **strategic partner of** Renesas, including for high-end automotive applications in the field of ADAS (Advanced Driver Assistance Systems) and autonomous driving. »The SiFive RISC-V portfolio is silicon proven and available in leading and advanced manufacturing foundries, offering flexibility to customers and partners,« Renesas stated. With that, at the latest, SiFive was promoted to the circle of established IP providers.

In terms of flexibility, SiFive already offers an amazingly broad range of processor IP. In addition to 32- and 64-bit standard cores with a focus on embedded applications **(table)**, software and hardware for accelerating AI/ML applications with SiFive AI ISA extensions and RISC-V vector extensions have recently been offered under the name **»SiFive Intelligence«**.

The broadest span is covered by the **Processor Series 7**, which bridges the gap from microcontroller to microprocessor and is certainly of great interest to embedded developers.

The dual-issue in-order processor core is comparable in complexity to, for example, Arm's Cortex-A55. SiFive offers versions for real-time embedded processing as well as for Linux applications.

| | E cores: 32-bit embedded CPUs | S cores: 64-bit embedded CPUs | U cores: 64-bit application processors |
|---|---|---|---|
| 8 Series: highest performance with 11-stage out-of-order CPU pipeline | | | U84 |
| 7 Series: high performance with 8-stage, superscalar dual-issue CPU pipeline | E76, E76MC quad-core | S76, S76MC quad-core | U74, U74MC (4xU74 + 1xS76) |
| 3/5 Series: focus on energy efficiency with 5- to 6-stage single-issue CPU pipeline | E31, E34 (E31 + FPU) | S51, S54 (S51 + FPU) | U54, U54MC (4xU54 + 1xS51) |
| 2 Series: optimised for minimal power consumption and silicon area, 2- to 3-stage single-issue CPU pipeline | E20, E21 (E20 + user mode, atomic instructions, multiplier, TIM), E24 (E21 + FPU) | S21 | |

Table: With its IP offering of 32- and 64-bit processors, SiFive targets classic embedded applications.

At the high end of the performance scale is the new U74MC IP core, which builds on the U54 and already offers multicore configurations and Linux compatibility. The U74MC features a **double precision floating point unit (FPU) as standard**. Up to nine of the 64-bit cores can share an L2 cache with ECC protection. For embedded applications, there is the 32-bit E76 and the 64-bit S76, which contain an FPU that calculates with single precision. With 4.9 CoreMarks/MHz and a clock frequency that is around 10 % higher than the predecessor Series 5, users get significantly higher computing performance with the Series 7. In addition, there are further improvements to the memory subsystem compared to the Series 5:

- Zero clock cycle load-to-use latency, instead of 1 clock cycle,
- 2 clock cycles access time to the SRAM in the worst case, instead of 5 clock cycles, and
- a fast I/O port, called Fast I/O or FIO for short. This is tightly coupled to the core and enables core-to-memory and low-latency accelerator operations. The FIO port can also be used to incorporate larger SRAM as well as custom accelerators via the accelerator register interface (Figure 1).

*Figure 1. The FIO port tightly coupled to the processor core enables low-latency transfers to/from the core from/to memory or hardware accelerators.*

Overall, the 7 series processor achieves a 63 % improvement in CoreMarks/MHz over the Series 5 (4.9 CM/MHz). **The basis of the 7 series is a cluster with up to nine CPUs (8+1, Fig. 2)**. The cores can be a mix of the Series 7 cores as well as other existing processor cores from SiFive. All elements in the cluster are cache-coherent, including all advanced SRAM options as well as any custom accelerators attached to the cores. The cluster can be further scaled by using AMBA, which allows integration of up to 64 clusters on a single chip. Multi-chip support is also possible via ChipLink.
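As a rough plausibility check on the figures quoted so far, a 63 % gain that ends at 4.9 CoreMarks/MHz implies a Series 5 baseline of about 3.0 CoreMarks/MHz. The sketch below works this out and also estimates the combined speedup once the roughly 10 % higher clock is included. The function names are illustrative, and multiplying the two gains assumes that CoreMark throughput scales linearly with clock frequency, which the article does not state.

```python
# Back-of-the-envelope check of the Series 7 vs. Series 5 figures quoted in the text.

SERIES7_CM_PER_MHZ = 4.9  # CoreMarks/MHz, from the text
PER_MHZ_GAIN = 0.63       # 63 % improvement in CoreMarks/MHz, from the text
CLOCK_GAIN = 0.10         # "around 10 %" higher clock, approximate, from the text


def implied_baseline(score, gain):
    """Score the predecessor must have had for `score` to be `gain` higher."""
    return score / (1.0 + gain)


def combined_speedup(per_mhz_gain, clock_gain):
    """Overall speedup, ASSUMING performance scales linearly with clock frequency."""
    return (1.0 + per_mhz_gain) * (1.0 + clock_gain)


if __name__ == "__main__":
    baseline = implied_baseline(SERIES7_CM_PER_MHZ, PER_MHZ_GAIN)
    speedup = combined_speedup(PER_MHZ_GAIN, CLOCK_GAIN)
    print(f"Implied Series 5 score: {baseline:.2f} CoreMarks/MHz")
    print(f"Combined speedup at peak clock: {speedup:.2f}x")
```

Under these assumptions, a Series 7 core at its peak clock would deliver roughly 1.8 times the CoreMark score of a Series 5 core at its own peak clock.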

Following a \$50 million funding round in April 2018, SiFive expanded its focus to IP for embedded applications.

Part of the differentiation of SiFive IP is undoubtedly its configurability. Customers can start with the specification for a standard core and add or remove standard instruction-set extensions, change memory details and configure other features.



*Figure 2. Up to eight identical CPUs plus another CPU of a different type can be integrated into a cluster in the Series 7.* 

## **New CPU in Series 7**

The new dual-issue Series 7 CPU represents a departure from SiFive's previous CPUs. Series 5 uses a simple five-stage scalar pipeline; implemented in TSMC's 28 nm process, it achieves a **clock frequency of up to 1.5 GHz**. The S54 includes the RV64I base ISA as well as the Multiply and Divide (M), Atomic (A) and Compressed (C) extensions. Optionally, the Series 5 processor supports the single-precision (F) and double-precision (D) floating-point extensions.
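The base ISA and extension letters named above are conventionally written as a single ISA string such as `rv64imac`, which is also the form RISC-V builds of GCC and Clang accept for `-march`. The following sketch assembles such a string from a set of extension letters. The helper `isa_string` is hypothetical and covers only the extensions mentioned in this article.

```python
# Build a RISC-V ISA string from the extensions named in the article.
# Only the letters M, A, F, D and C are handled; the base integer ISA "I"
# is always included.

EXTENSION_ORDER = "mafdc"  # conventional ordering of these letters


def isa_string(xlen, extensions):
    """Return e.g. 'rv64imac' for xlen=64 and extensions='mac'."""
    if xlen not in (32, 64):
        raise ValueError("xlen must be 32 or 64")
    exts = set(extensions.lower())
    unknown = exts - set(EXTENSION_ORDER)
    if unknown:
        raise ValueError(f"unsupported extension letters: {sorted(unknown)}")
    if "d" in exts and "f" not in exts:
        # The double-precision extension builds on the single-precision one.
        raise ValueError("extension D requires extension F")
    return f"rv{xlen}i" + "".join(e for e in EXTENSION_ORDER if e in exts)


if __name__ == "__main__":
    print(isa_string(64, "mac"))    # S54-style core without FPU options
    print(isa_string(64, "macfd"))  # same core with both FPU extensions
```

The two calls yield `rv64imac` and `rv64imafdc`, matching the S54 with and without the optional floating-point extensions.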

As Figure 3 shows, the Series 7 processor expands the pipeline to eight stages and adds several execution units for superscalar operations. The first execution slot performs memory operations (load/store) and simple integer operations, whereas the second slot performs arbitrary integer operations (including multiply/divide), branch resolution, and floating-point operations. SiFive has added a second fetch stage and a second data memory access stage to allow for larger L1 cache and scratchpad memories. A second decode stage handles superscalar dispatching.

*Figure 3. The 8-stage dual-issue pipeline of the SiFive S7 core.*

Both execution slots contain arithmetic logic units (ALUs) in the fifth stage. They handle most of the arithmetic instructions. Branch resolution can use these ALUs immediately, resulting in five clock cycles of latency if a branch is mispredicted. However, when an ALU instruction needs the output of a pending load, it moves to stage seven, which contains a second set of ALUs. These »late« ALUs **provide a load-to-use latency of zero cycles**, meaning that a dependent ALU instruction can be processed in the cycle immediately following the instruction that loads its data. When a branch is resolved with the late ALUs, the misprediction latency increases to seven clock cycles.
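To get a feel for what the two misprediction latencies mean in practice, the sketch below computes an average penalty for a given share of branches resolved in the late ALUs. Only the 5-cycle and 7-cycle values come from the text. The branch rate, misprediction rate and late-resolution share in the example are made-up illustrative inputs, not SiFive measurements.

```python
# Simple expected-value model of branch misprediction cost on the Series 7 pipeline.

EARLY_PENALTY = 5  # cycles, branch resolved by the stage-5 ALUs (from the text)
LATE_PENALTY = 7   # cycles, branch resolved by the stage-7 "late" ALUs (from the text)


def avg_mispredict_penalty(late_fraction):
    """Average penalty when `late_fraction` of mispredicted branches resolve late."""
    if not 0.0 <= late_fraction <= 1.0:
        raise ValueError("late_fraction must be between 0 and 1")
    return (1.0 - late_fraction) * EARLY_PENALTY + late_fraction * LATE_PENALTY


def cycles_lost_per_1000_instructions(branch_rate, mispredict_rate, late_fraction):
    """Cycles lost to mispredictions per 1000 instructions (all rates are fractions)."""
    mispredicts = 1000.0 * branch_rate * mispredict_rate
    return mispredicts * avg_mispredict_penalty(late_fraction)


if __name__ == "__main__":
    # Hypothetical workload: 20 % branches, 5 % mispredicted, 25 % resolved late.
    lost = cycles_lost_per_1000_instructions(0.20, 0.05, 0.25)
    print(f"{lost:.1f} cycles lost per 1000 instructions")
```

In this model the penalty always lies between five and seven cycles, so keeping branch conditions independent of freshly loaded data moves the average toward the lower bound.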

The biggest change in the 7 Series is the revision of the memory subsystem with data cache and optional tightly integrated memory (TIM). The FIO port bypasses the core complex bus. **Figure 4 shows the structure of the CPU.** 

*Figure 4. Microarchitecture of SiFive's S-CPU.*

The U74MC has a 64-bit register set and 64-bit data path, L1 instruction and data caches protected by ECC, a physical memory protection (PMP) unit, and a memory management unit (MMU) that enables the use of Linux. The MMU implements the 39-bit version (Sv39) of the RISC-V virtual memory system. The PMP protects up to eight memory areas and **allows permissions to be assigned for user-mode accesses**. The processor core can also contain a core-local interrupt controller (CLIC) to enable interrupt prioritization and preemption. To prevent side-channel attacks, system software can clear branch history when switching processes.
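The article does not describe how PMP regions are programmed. As a minimal sketch, the following code computes the register values for one region in the NAPOT (naturally aligned power-of-two) mode defined by the RISC-V privileged specification. The region address and size are hypothetical, and the helper names are illustrative rather than part of any SiFive API.

```python
# Encode one PMP region in NAPOT mode, following the RISC-V privileged specification.

PMP_R = 1 << 0        # read permission
PMP_W = 1 << 1        # write permission
PMP_X = 1 << 2        # execute permission
PMP_A_NAPOT = 3 << 3  # address-matching mode field A = NAPOT
PMP_L = 1 << 7        # lock bit


def napot_pmpaddr(base, size):
    """Value for a pmpaddr register covering [base, base + size)."""
    if size < 8 or size & (size - 1):
        raise ValueError("size must be a power of two of at least 8 bytes")
    if base % size:
        raise ValueError("base must be aligned to size")
    # pmpaddr holds address bits [..:2]; the trailing ones encode the region size.
    return (base >> 2) | ((size >> 3) - 1)


def pmpcfg_byte(read=False, write=False, execute=False, locked=False):
    """Configuration byte for one NAPOT region with the given permissions."""
    cfg = PMP_A_NAPOT
    cfg |= PMP_R if read else 0
    cfg |= PMP_W if write else 0
    cfg |= PMP_X if execute else 0
    cfg |= PMP_L if locked else 0
    return cfg


if __name__ == "__main__":
    # Hypothetical 4 KiB read/execute region at 0x8000_0000.
    print(hex(napot_pmpaddr(0x8000_0000, 0x1000)))
    print(hex(pmpcfg_byte(read=True, execute=True)))
```

On real hardware these values would be written to the `pmpaddr` and `pmpcfg` control and status registers from machine mode. With eight regions available, firmware typically reserves a few for code, data and peripherals.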

The 32-bit E76 and 64-bit S76 are microcontroller-class CPUs that, unlike the U74, lack an MMU but include **optional tightly integrated memory (TIM)** and the FIO. SiFive configures the E7x cores with a 64-KB instruction cache with four-way associativity, an instruction TIM addressable in a single cycle, or both.

For data, a cache or TIM can be selected. Although the data TIM ranges from 4 KB to 256 KB, most developers opt for 32 KB. For real-time capable processors, **developers can use instruction TIM** and disable dynamic branch prediction at boot time. These processor cores typically run a real-time operating system (RTOS) and small applications, so complex cache structures are not required.

The core complex includes a fully coherent and shared memory area. A platform-level interrupt controller (PLIC) distributes global interrupts. **Each processor core can be configured** individually: for example, one with SRAM, another with an accelerator, and a third without either. All processor cores are connected to a cache-coherent bus and can see and access the FIO port on all other cores, which means they can also access the SRAM and any user-defined accelerator of the other cores.

For a simple microcontroller, a single **E76** core with TIM, FIO and CLIC (core-local interrupt controller) functions can be used instead of a multi-core configuration, and the L2 cache and PLIC block can be omitted.

Excluding memory, such a microcontroller occupies 0.112 mm<sup>2</sup> of silicon area in TSMC's 28HPC process when using a standard 9-track cell library. According to SiFive, it consumes 20.4 mW when running the Dhrystone benchmark at a 400 MHz clock frequency, again without memory. For maximum performance, a 12-track library should allow worst-case operation at 875 MHz, with the processor core occupying 0.174 mm<sup>2</sup> and consuming 74.4 mW of power.
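The trade-off between the two library options becomes clearer when the quoted figures are normalised. The short sketch below derives power per MHz and the relative area and frequency change from the numbers above. It uses only the values given in the text, and the dictionary keys and function name are illustrative.

```python
# Normalise the E76 implementation figures quoted in the text (core only, no memory).

CONFIGS = {
    "9-track, 400 MHz": {"area_mm2": 0.112, "power_mw": 20.4, "freq_mhz": 400.0},
    "12-track, 875 MHz": {"area_mm2": 0.174, "power_mw": 74.4, "freq_mhz": 875.0},
}


def microwatts_per_mhz(power_mw, freq_mhz):
    """Power normalised to clock frequency, in microwatts per MHz."""
    return power_mw * 1000.0 / freq_mhz


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        uw = microwatts_per_mhz(cfg["power_mw"], cfg["freq_mhz"])
        print(f"{name}: {uw:.1f} uW/MHz on {cfg['area_mm2']} mm^2")

    low, high = CONFIGS["9-track, 400 MHz"], CONFIGS["12-track, 875 MHz"]
    print(f"Frequency ratio: {high['freq_mhz'] / low['freq_mhz']:.2f}x")
    print(f"Area ratio: {high['area_mm2'] / low['area_mm2']:.2f}x")
```

The result is about 51 µW/MHz for the 9-track variant and about 85 µW/MHz for the 12-track variant. The faster option therefore buys roughly 2.2 times the clock rate for about 1.55 times the area and about two thirds more power per MHz.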

#### Vector unit for SiFive's S7

VIS7, unveiled in early 2021, is a processor capable of 64 billion FP32 operations per second and designed for deterministic operation. It combines the S7 processor with a 512-bit wide vector unit.

The new VIS7 processor includes the vector extension RVV 1.0. The vector unit works with 8-, 16- and 32-bit data in floating-point, fixed-point and integer formats. It uses a 512-bit vector ALU and a 512-bit vector memory unit.

The VIS7 can be compared to the Cortex-R82, Arm's first 64-bit real-time processor. To increase the SIMD performance of the R82, licensees can integrate an optional 128-bit wide Neon unit. The VIS7 and R82 have eight cores and offer real-time determinism to deliver predictable throughput. Both processors use tightly integrated or tightly coupled memory to reduce memory transaction times and improve determinism.

SiFive's VIS7 processor achieves 5.1 CoreMarks/MHz, 12 % behind Arm's R82, and it operates at a similar peak frequency of 2.0 GHz. However, it shines in SIMD operations, as its vector unit is 4× wider and almost quadruples the FP32 peak throughput of the R82's Neon unit. Programmers can also set LMUL (the length multiplier, a setting that groups vector registers) to 2, 4 or even 8, creating a 4,096-bit wide virtual register in extreme cases (8 × 512 bits). LMUL does not improve peak throughput, but it does reduce the number of instructions needed to supply the vector unit.
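The throughput and register-grouping figures in this box can be reproduced with simple arithmetic. The sketch below assumes that each FP32 lane completes one fused multiply-add per cycle, counted as two operations, and that a vector register is as wide as the 512-bit vector unit. Neither assumption is stated in the article, and the function names are illustrative.

```python
# Peak FP32 throughput and LMUL register grouping, derived from the figures in the text.

OPS_PER_FMA = 2  # ASSUMPTION: one fused multiply-add per lane and cycle, counted as 2 ops


def lanes(vector_bits, element_bits):
    """Number of elements processed side by side in one vector operation."""
    return vector_bits // element_bits


def peak_gops(vector_bits, element_bits, clock_ghz):
    """Peak throughput in billions of operations per second."""
    return lanes(vector_bits, element_bits) * OPS_PER_FMA * clock_ghz


def grouped_register_bits(vlen_bits, lmul):
    """Width of the virtual register formed by grouping `lmul` vector registers."""
    if lmul not in (1, 2, 4, 8):
        raise ValueError("this sketch handles integer LMUL values 1, 2, 4 and 8")
    return vlen_bits * lmul


if __name__ == "__main__":
    print(f"VIS7, 512-bit at 2.0 GHz: {peak_gops(512, 32, 2.0):.0f} GFLOPS FP32")
    print(f"128-bit SIMD at 2.0 GHz: {peak_gops(128, 32, 2.0):.0f} GFLOPS FP32")
    bits = grouped_register_bits(512, 8)
    print(f"LMUL=8: {bits} bits, {lanes(bits, 32)} FP32 elements per instruction")
```

Under these assumptions the 512-bit unit reaches the quoted 64 billion FP32 operations per second at 2.0 GHz, and a single instruction with LMUL set to 8 covers 128 FP32 elements.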

### SiFive vs. Arm

The U74 competes directly with Arm's Cortex-A55. The U74 and A55 both have an **eight-stage in-order dual-issue pipeline**, with SiFive's design achieving about 11 % more CoreMarks/MHz. The U74 also comes out ahead in power and area efficiency. On the other hand, the A55 includes an FPU that handles Neon single-instruction, multiple-data (SIMD) vector instructions.

The E76 is comparable to Arm's Cortex-M7 in terms of integer performance. Both are dual-issue microcontrollers that deliver about 5.0 CoreMarks/MHz, with the **Arm microcontroller having a slight edge**. The Cortex-M7 includes DSP/SIMD enhancements that the E76 does not; both manufacturers offer optional FPUs. Although the E76 doesn't quite match the M7 for power efficiency, it achieves a higher clock speed of up to 1.6 GHz in the same manufacturing process. The S76 does not have a 64-bit competitor from Arm. The Cortex-R8 is similar, but it is a 32-bit processor and does not come close to the S76 in the CoreMark benchmark.

SiFive's and Arm's offerings also differ in their multicore configurations. The 7 series has a shared L2 cache. In contrast, the **Cortex-A55 has a private L2 cache and a shared L3 cluster cache**, while the Cortex-M and Cortex-R CPUs do not support private L2 caches.

## The agony of choice – Segger supports them all

As always, which choice is best depends on the application. SiFive's 7-series delivers an impressive 63 % performance increase over the 5-series. More importantly, it allows SiFive to compete against dual-issue Arm CPUs such as Cortex-M7 and Cortex-A55. The increased performance of the **U74 also expands the range of Linux applications that RISC-V can serve**. To compete with Arm in applications that require DSP or AI processing, SiFive had to add a vector unit. Although VIS7 has been announced for this purpose (see box: Vector unit for SiFive's S7), the product is not officially offered on SiFive's website at this time.

Regardless of whether a processor from Arm or SiFive is chosen, Segger offers a uniform debugging tool with the J-Link [1]. It was voted the best debugger by Elektronik readers in a survey [2]. Segger's Embedded Studio IDE, which scored excellently in the last Elektronik reader test, is also available for Arm and RISC-V [3]. And last but not least, embOS for RISC-V, the preferred RTOS choice for engineers all over the world, offers incomparable ease of use and guarantees 100 % deterministic real-time operation for any embedded RISC-V device. Certified by TÜV SÜD, embOS complies with the functional safety standards IEC 61508 SIL 3 and IEC 62304 Class C. More technical details can be found in Segger's Platform for RISC-V overview [4].

#### **References:**

[1] SEGGER's J-Link debugging tools for SiFive and Arm CPUs: https://www.segger.com/products/debug-probes/j-link/

[2] Schlichtmeier, T.: Reader survey on debugging - These are the top 3. elektronik.de, May 25, 2021, www.elektroniknet.de/embedded/entwicklungstools/das-sind-die-top-3.186641.html

[3] Stelzer, G.: »Embedded Studio« from Segger with top rating. Elektronik 2021, issue 4, pp. 6-9.

[4] SEGGER's RISC-V platform: https://www.segger.com/risc-v/