The Challenge of Spatial Accelerators

Domi Yan
5 min read · Mar 23, 2020


This article is my personal opinion on spatial accelerators and one technical challenge to their wide adoption: the compiler. It is composed of 3 parts: 1. An introduction to spatial accelerators. 2. How a compiler works for spatial accelerators. 3. What the challenge for wide adoption is.

What is a spatial accelerator?

If you are interested in computing in general, you have probably heard the following statement for a long time, and you will continue to hear it in the years to come:

Moore’s Law Is Dead

Whether you believe it or not, the fact is that we can no longer shrink transistors and increase clock frequency at the pace we used to. As many people have said, the free lunch is over. The days when you could get a 2X speed boost every 18 months without modifying your software are gone. Computer scientists are seeking new architectures to make computation faster, and one class of them can be called spatial accelerators. In recent years, we have seen a trend emerging in this category.

The key features of spatial accelerators are flexibility and configurability. Unlike fixed hardware (a CPU, which is general-purpose-oriented, or a GPU/TPU, which is extremely good at certain tasks), a spatial architecture is flexible: it can be tailored to your compute workload. If your application needs more arithmetic logic units (ALUs), the chip can be configured to provide more adders/multipliers. If you need more on-chip storage for caching/routing purposes, the basic functional units can be configured as registers and on-chip memory.
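
To make this concrete, here is a small, hypothetical workload (a 4-tap FIR filter, written in plain C purely for illustration) where that flexibility pays off:

    /* On a CPU the four multiplies below share one ALU and run one after
     * another. On a spatial accelerator the fabric can instead be configured
     * with four multipliers plus an adder tree so all taps are computed in
     * parallel, and spare logic blocks can be configured as small on-chip
     * buffers that hold the sliding window of samples. */
    float fir4(const float x[4], const float coeff[4]) {
        float acc = 0.0f;
        for (int i = 0; i < 4; i++) {
            acc += coeff[i] * x[i];   /* each tap can get its own hardware unit */
        }
        return acc;
    }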

Hardware architecture: a spatial architecture usually provides a group of basic configurable functional units and interconnects between them. The basic functional units can be configured to perform different functions, and the interconnects between them are also configurable. Let's take the FPGA, a representative spatial accelerator, as an example. Below is an illustration of an FPGA's architecture.

Figure 1: An FPGA contains 3 components: programmable logic blocks (basic configurable functional units), programmable interconnections (interconnects), and programmable I/O blocks (interfaces).

Other publicly announced (released or under development) spatial accelerators besides the FPGA are the Xilinx ACAP, Intel CSA, and Cerebras WSE.

These all sound great. One key issue preventing wide adoption among general software developers is the compiler.

How does a compiler targeting a spatial architecture work?

A CPU/GPU compiler works this way: it translates a higher-level language into a lower-level representation. The target can be the instruction set architecture (ISA) of the hardware or instructions recognized by a virtual machine. Some famous examples are the x86 ISA, RISC ISAs, and Java bytecode. When the program runs, the low-level representation is executed at runtime on the hardware (or virtual machine).
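
For example, a trivial C function lowers to a short instruction sequence; the assembly below is only a rough sketch of what a typical x86-64 compiler might emit (exact instructions and registers vary by compiler and flags):

    int add(int a, int b) {
        return a + b;
    }

    /* Roughly:
     *
     *   add:
     *       lea  eax, [rdi + rsi]   ; eax <- a + b
     *       ret
     *
     * At runtime the CPU fetches and executes these instructions one by one
     * on its fixed ALU; the hardware itself is never reconfigured. */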

A spatial accelerator's compiler works differently. The final output is not a sequence of instructions; it is a configuration file for the hardware, describing how components are constructed and hooked up together.

The difference is not hard to understand. Imagine you have an "add" operation in your program. The final output for the CPU contains an "add" instruction; during execution, this instruction is fetched and executed on the corresponding hardware unit (in this case, an ALU). For configurable hardware, a functional unit is selected, configured to perform the add operation, and connected to its inputs and outputs at compile time. This transformation, at a high level, can be divided into 2 stages:

  1. From high-level language to logical functional units: decompose the functionality expressed in the high-level language into basic functional units and the connections between them.
  2. From logical functional units to physical functional units: map the basic functional units onto physical units on the hardware and connect them using the reserved routing resources on the chip.

The first step is called synthesis and the second step is usually called "placement and route". These are terms from the EDA (Electronic Design Automation) world.
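
As a very rough mental model (this is a toy sketch in C, not the data model of any real tool), the result of the two stages could be pictured like this:

    /* Stage 1 -- synthesis: the program is decomposed into logical
     * functional units plus the connections (nets) between them. */
    typedef enum { UNIT_ADD, UNIT_MUL, UNIT_REG, UNIT_RAM } unit_kind;

    typedef struct {
        unit_kind kind;   /* what the unit does, e.g. an adder for "a + b" */
        int inputs[2];    /* ids of the units feeding this one             */
    } logical_unit;

    /* Stage 2 -- placement and route: each logical unit is assigned to a
     * physical block on the chip, and its connections are mapped onto the
     * chip's reserved routing wires. */
    typedef struct {
        int block_row, block_col;   /* which physical block on the fabric   */
        int routing_channel;        /* which wires carry its inputs/outputs */
    } placement;

    /* Conceptually, the compiler's final output is the full table of
     * (logical_unit, placement) pairs serialized into a configuration file,
     * not a sequence of instructions. */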

On a conceptual level, the compiler is helping you design a customized circuit based on your description in a higher-level language (C/C++/Python, etc.). And this is the problem.

To get good performance out of the design, users are required to think like digital circuit designers. Tools have been developed to help automate this process, but they are not there yet (not good enough to replace a digital circuit designer's job). Here is a list of things I think could be big barriers for developers who can't think like an EDA tool user.

Challenges for developers using a spatial architecture

1. The “area” needed for the program increases with the “size” of the program.

What is "area"? Remember, during the configuration stage, certain resources get used; "area" describes the amount of computing resources your program occupies on the chip. It is determined by your program: as the program grows bigger, it uses more and more resources until it reaches the chip's limit. The size of the program (if it gets too big) can now be a functional issue!
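
A small illustration (the numbers are made up and assume a compiler that fully unrolls the loops): on a CPU both functions below reuse the same single multiplier, but on a spatial accelerator the first might be configured with 8 multipliers and the second with 64, so the second costs roughly 8X the area even though the source code is barely longer.

    float dot8(const float a[8], const float b[8]) {
        float acc = 0.0f;
        for (int i = 0; i < 8; i++) acc += a[i] * b[i];    /* ~8 multipliers  */
        return acc;
    }

    float dot64(const float a[64], const float b[64]) {
        float acc = 0.0f;
        for (int i = 0; i < 64; i++) acc += a[i] * b[i];   /* ~64 multipliers */
        return acc;
    }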

2. Compile time can be very long.

The underlying placement-and-route problems can and will take a long time to solve. The time can be hours or even days.

3. Very hard to debug. There is no concept of a stack or heap during execution.

Whether you use an IDE or a command-line interface, the pain of debugging code line by line, breakpoint by breakpoint, suddenly becomes a pleasure compared to hardware debugging.

4. Even harder to tune performance.

You need very deep knowledge of the underlying hardware to tune your program, because the compiler can't automatically do it for you.
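
As one hedged example (what actually happens depends on the specific compiler), the two reductions below are functionally equivalent, but they can turn into very different circuits: the first forms one long chain of dependent adds that limits the clock frequency or pipeline throughput of the generated hardware, while the second exposes a balanced tree that maps onto parallel adders. A CPU compiler mostly hides this choice from you; for a spatial accelerator you often have to write the second form yourself.

    float sum_chain(const float x[8]) {
        float acc = 0.0f;
        for (int i = 0; i < 8; i++) acc += x[i];      /* long dependency chain */
        return acc;
    }

    float sum_tree(const float x[8]) {
        float s0 = (x[0] + x[1]) + (x[2] + x[3]);     /* independent partial sums */
        float s1 = (x[4] + x[5]) + (x[6] + x[7]);
        return s0 + s1;                               /* shallow adder tree */
    }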

5. Can't be built incrementally. The compute engine/accelerator is created as a monolithic unit.

Because the final result is a graph-like configuration in which everything affects everything else, incremental compilation is very hard to achieve without a lot of constraints and performance trade-offs.

All of the above problems have the same cause: while the concept of "programming" can be mapped to similar processes (synthesis, "placement and route") on the surface, the development experience can't be "ported", due to the nature of the underlying compiler technology.

But don't lose hope. Think about CPU programmers in the early 1960s: their work was very cumbersome and not efficient at all, and they had to know far more about how the CPU hardware worked than today's developers do. As I write this, a bunch of smart compiler engineers behind those spatial accelerators are trying hard to make them better and easier to use. I hope their work will be successful and one day truly democratizes the use of spatial accelerator devices.
