# Verified Timing Transformations in Synchronous Circuits with $\lambda\pi$ -Ware

João Paulo Pizani Flor, Wouter Swierstra

Utrecht University {J.P.PizaniFlor, W.S.Swierstra}@uu.nl

#### 1 Introduction

Modelling electronic circuits has been a fertile ground for functional programming (Sheeran, 2005) and theorem proving (Hanna and Daeche, 1992). There have been numerous efforts to describe, simulate, and verify circuits using functional languages such as MuFP (Sheeran, 1984) and more recently  $C\lambda aSH$  (Baaij, 2015) and ForSyDe (Sander and Jantsch, 2004).

Functional languages have also been used to *host* an Embedded Domain-Specific Language (EDSL) for hardware description. Some of these EDSLs, such as Wired (Axelsson et al., 2005), capture low-level information about the layout of a circuit; others aim to use the host language to provide a higher-level of abstraction to describe the circuit's intended behaviour. A notable example of the latter approach is Lava (Bjesse et al., 1999) and its several variations (Gill et al., 2009; Singh, 2004).

Also interactive theorem proving and programming with dependent types have been fruitfully used to support hardware verification efforts, with some based on HOL (Melham, 1993; Boulton et al., 1992), some on Coq (Braibant, 2011; Braibant and Chlipala, 2013) and some on Martin-Löf Type Theory (Brady et al., 2007) Following this line of research, we utilize a dependently-typed programming language (Agda) as the *host* of our hardware EDSL, for its proving capabilities and convenience of embedding.

In particular, this paper focuses on verification related to the hardware design aspect of *timing*, that is, the behaviour of a circuit in terms of its inputs *over time*. When implementing algorithms in hardware, a compromise must be made between the *area* occupied by a circuit and the *number of clock cycles* it takes for a computation to be performed.

Circuits with more computation being performed *in parallel* will occupy a larger area. This might lead to design constraints being violated, as well as having an impact in power consumption. A more sequential circuit, on the other hand, will occupy less area, but might be a bottleneck in computational throughput and impact other parts of the design that depend on its results.

There are many different ways to implement any specific functional behaviour, and it can be difficult to find the right spot in the design space upfront. Timing-related circuit transformations are quite invasive and error-prone – making it difficult to correct bad design decisions a posteriori With this paper, we attenuate these issues by defining a language for circuit description that

facilitates the exploration of different points in the timing design space. More specifically, this paper makes the following contributions:

- We show how to embed a typed domain specific language for circuit description and verification, λπ-Ware, in the general purpose dependently typed programming language Agda (Section 3), together with an executable semantics based on state transitions (Section 4).
- Next, we define common recursion patterns to build circuits in both parallel and sequential architectures in λπ-Ware (Section 5). We show how some well-known circuits can be expressed in terms of these recursion patterns.
- Finally, we define a precise relation between the parallel and sequential versions of circuits that exhibit equivalent behaviour (Section 5.1). By proving that different flavours of our recursion schemes are convertible, we allow hardware designers to tune the degree of parallelism while being certain that the timing transformations preserve functional semantics up to timing.

Altogether, these contributions help to *separate the concerns* between the *values* that a circuit must compute and the *speed* with which the computation is performed. In this way, timing decisions can be modified more easily in a later stage of design process.

The codebase in which the ideas exposed in this paper are developed is available online. For the sake of presentation, code excerpts in this paper may differ slightly from the corresponding ones in the repository.

#### ${f 2}$ Overview

We begin by presenting a hardware EDSL,  $\lambda\pi$ -Ware, as the vehicle for describing our main contributions. Although it is inspired by our previous work,  $\Pi$ -Ware (Pizani Flor et al., 2016), it has several distinguishing characteristics and novel features that will be described as we present them. In this section, we illustrate the language by means of two variations on a simple circuit. Later section cover the syntax and semantics of circuits in greater detail.

Example: Horner's method We will look at two example circuits: one performing its computation in parallel and the other sequentially. Both circuits compute the value of a polynomial at a given point using the well-known Horner's method, however, the structure of the circuits is very different.

For any coefficients  $a_0, \ldots, a_n$  in  $\mathbb{N}$ , we can define a polynomial as follows:

$$p(x) = \sum_{i=0}^{n} a_i x^i = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots + a_n x^n,$$

In order to compute the value of the polynomial at a specific point  $x_0$  of its domain, Horner's method proceeds by using the following sequence of values:

https://gitlab.com/joaopizani/lambda1-hdl/tree/paper-2017-comb-seq

$$b_n := a_n$$

$$b_{n-1} := a_{n-1} + b_n x_0$$

$$\vdots$$

$$b_0 := a_0 + b_1 x_0.$$

Then  $b_0$  is the value of of our polynomial at  $x_0$ , that is,  $p(x_0)$ . By iteratively expanding definitions for each of the  $b_i$  in the equations above, one arrives at a factorized form of the polynomial easily seen as equivalent to the usual series of powers.

Parallel version This computational process is easily expressed as a fold, and in  $\lambda\pi$ -Ware we can build a parallel circuit to compute this fold without the use of any state or memory element, for any given degree n. When reading the signature of the horner-par definition below, one must note that only the parameters with the type former  $\lambda H$  are circuit inputs, and the others are synthesis parameters. Thus horner-par takes two inputs of type  $\mathbb N$  and produce one output of type  $\mathbb N$ .

$$\begin{array}{ll} \mathsf{horner\text{-}par} : \ \forall \ \mathsf{n} \ (\mathsf{x}_0 \ : \ \mathsf{\lambda}\mathsf{H} \ \mathbb{N}) \ (\mathsf{a}_n \ : \ \mathsf{\lambda}\mathsf{H} \ \mathbb{N}) \ (\mathsf{as} \ : \ \mathsf{Vec} \ (\mathsf{\lambda}\mathsf{H} \ \mathbb{N}) \ \mathsf{n}) \to \mathsf{\lambda}\mathsf{H} \ \mathbb{N} \\ \mathsf{horner\text{-}par} \ \mathsf{x}_0 \ = \ \mathsf{foldl\text{-}par} \ (\mathsf{\lambda} \ \mathsf{s} \ \mathsf{a} \to \mathsf{a} \ : + : \ \mathsf{x}_0 \ : ^* : \mathsf{s}) \end{array}$$

The circuit horner-par n will compute the value of a polynomial of degree n at a given point. This circuit will contain three inputs: the point at which to evaluate the polynomial,  $x_0:\lambda H\ \mathbb{N}$ ; an accumulating argument,  $a_n:\lambda H\ \mathbb{N}$  and a vector of coefficients, as : Vec  $(\lambda H\ \mathbb{N})$  n. Later in Section 4 we present the precise executable semantics of circuits, but for now we can say that horner-par n will behave accordingly to the behaviour of fold from the standard library.

To grasp the difference between the parallel and sequential versions of horner, it is also useful to look at each version's *architecture*. The parallel version, horner-par, will have the structure in Figure 1:



Fig. 1. Block diagram of the horner-par circuit.

In the block diagram of Figure 1, we can clearly see that the inputs of the circuit are a vector, that the circuit contains no loops nor memory cells, and that

the body of the foldl is replicated n times, thus corresponding to a fully parallel implementation. Not all parallel circuits are completely stateless as horner-par, but fully stateless circuits are also commonly called combinational in hardware parlance.

Sequential version Next, we describe a sequential circuit to do the same calculation, using internal state to produce a sequence of outputs. The output value of the circuit at clock cycle i corresponds to the sum of all polynomial terms with degree smaller than or equal to i, evaluated at point  $x_0$ .

```
\begin{array}{ll} \mathsf{horner\text{-}seq} \,:\, \forall \; (\mathsf{x}_0 \,:\, \mathsf{\lambda}\mathsf{H} \; \mathbb{N}) \; (\mathsf{a} \,:\, \mathsf{\lambda}\mathsf{H} \; \mathbb{N}) \to \mathsf{\lambda}\mathsf{H} \; \mathbb{N} \\ \mathsf{horner\text{-}seq} \; \mathsf{x}_0 \;=\; \mathsf{foldl\text{-}seq} \; (\mathsf{\lambda} \; \mathsf{s} \; \mathsf{a} \to \mathsf{a} \;:\text{+:} \; \mathsf{x}_0 \;:\text{*:} \; \mathsf{s}) \end{array}
```

The circuit takes two inputs:  $x_0$ , the point at which we desire to evaluate the polynomial; and a, a single input containing the n- (i+1)-th coefficient at the i-th clock cycle. The circuit is defined using the foldl-seq combinator, that iterates its argument function. This function corresponds to the loop body, mapping the current approximation, s, and the current value of the inputs, a, to a new approximation. As we shall see, to execute this sequential circuit, we will need to provide an initial value for the state, s.

The architecture of horner-seq is the extreme opposite of horner-par when it comes to parallelism, and is shown in Figure 2.



Fig. 2. Block diagram of the horner-seq circuit.

In the block diagram of Figure 2 we can see the body of the fold is the same as in the parallel version, but now instead of n instances of the body we have a single instance with one of its outputs *tied back* in a loop with a memory cell (shift register).

We have seen that the parallel and sequential definitions are syntactically similar, but have very different timing behaviour and generate very different architectures. First of all, the coefficients input of horner-par is a vector (a bus in hardware parlance), while the corresponding input of horner-seq is a single number. Also, all the coefficients are consumed by horner-par in a single clock

cycle, while horner-loop consumes the sequence of coefficients over n clock cycles. It is only after these n cycles, that the results of the two circuits will coincide.

This paper will make precise how two such circuits are related. Before we can do so, however, we need to describe our domain specific language,  $\lambda\pi$ -Ware, in greater detail.

#### 3 λπ-Ware

We begin by fixing the universe of types, UB, for the elements that circuits may produce or consume. This type is parameterized over the type of data carried over the circuit's wires (B). An obvious choice for B would be bits or booleans, but other choices are possible when modelling a high-level circuit, such as integers or a datatype representing assembly instructions for a microprocessor.

```
\begin{array}{lll} \textbf{data} \ \textbf{U} \ (\textbf{B} \ : \ \textbf{Set}) \ : \ \textbf{Set} \ \textbf{where} \\ & \textbf{unit} & : \ \textbf{U} \ \textbf{B} \\ & \iota & : \ \textbf{U} \ \textbf{B} \\ & \underset{-}{\Rightarrow}_{-} : \ (\sigma \ \tau \ : \ \textbf{U} \ \textbf{B}) \rightarrow \textbf{U} \ \textbf{B} \\ & \underset{-}{\otimes}_{-} : \ (\sigma \ \tau \ : \ \textbf{U} \ \textbf{B}) \rightarrow \textbf{U} \ \textbf{B} \\ & \underset{-}{\oplus}_{-} : \ (\sigma \ \tau \ : \ \textbf{U} \ \textbf{B}) \rightarrow \textbf{U} \ \textbf{B} \\ & \textbf{vec} & : \ (\tau \ : \ \textbf{U} \ \textbf{B}) \ (\textbf{n} \ : \ \textbf{N}) \rightarrow \textbf{U} \ \textbf{B} \end{array}
```

This collection of type codes is hardly controversial, consisting of a unit type (1) and the base type  $(\iota)$ , closed under function space  $(\Rightarrow)$ , products  $(\otimes)$ , coproducts  $(\oplus)$ , and homogeneous arrays of fixed-size (vec). The decoding function El maps the elements of U to their corresponding type in the obvious fashion.

Core datatype As mentioned before, our language is a deep-embedding in Agda. The core datatype representing circuit terms, λB, is given below:

Elements of  $\lambda B$  are circuit models embedded into Agda using typed De Bruijn indices for variable binding. However, we do have a convenience layer on top of  $\lambda B$ , called  $\lambda H$ , as seen previously in the overview. Definitions using  $\lambda H$  are essentially a shallow embedding of circuit models into Agda, using Higher-Order Abstract Syntax (HOAS), offering a more convenient and user-friendly programming interface. The unembedding technique (Atkey et al., 2009) guaranteed that it is always possible to go from a circuit definition using  $\lambda H$  to an equivalent one using  $\lambda B$ .

Returning to the  $\lambda B$  datatype itself, it is indexed by the context ( $\Gamma$ : Ctxt B) representing the arguments to the circuit or any free variables currently in scope. The datatype is also indexed by the circuit's output type, ( $\tau$ : UB).

The entire development is parameterized by a type of fundamental gates,  $\mathsf{Gate}: \mathsf{U} \; \mathsf{B} \to \mathsf{Set},$  corresponding to the primitive components of our circuit. The  $\langle \_ \rangle$  constructor creates a circuit from such a fundamental gate. These typically correspond to the usual triple of boolean gates ({NOT, AND, OR}) with Bool as the chosen base type; circuit designers, however, are free to choose those fundamental gates that best fit their domain. Besides these fundamental gates, there are constructors to refer to variables (var), compose circuits through application ( $\_\$\_$ ), and to introduce sharing (let') or recursion (loop).

While these constructors form the heart of the  $\lambda B$  datatype, we have additional constructors for introducing and eliminating products, coproducts, and vectors:

We give the elimination forms for both products and coproducts uniformly as case constructs, instead of projections that matches on its argument and introduces newly bound variables to the context. For vectors,  $\lambda B$  has the two usual introduction forms: one to produce an empty vector of any type (nil) and to extend an existing vector with a new element (cons). Finally, the *accumulating map*, mapAccumL-par, performs a combination of map and fold! The input vector with elements of type  $\rho$  is pointwise transformed into one with elements of type  $\tau$ , all the while threading an accumulating parameter of type  $\sigma$  from left to right.

This eliminator is less general than the usual type theoretic elimination principle for vectors; embedding this more general eliminator would require dependent types and higher-order functions in our circuit language. To keep our object language simple, however, we chose a more simple elimination principle capable of expressing the most common hardware constructs.

## 4 Semantics and properties

Where the previous section defined the *syntax* of our circuit language, we now turn our attention to its *semantics*. Although there are many different interpretations that we could assign to our circuits, for the purpose of this paper we will focus on describing a circuit's input/output behaviour.

State transition semantics Circuits defined in  $\lambda B$  can be classified in two ways. Combinational circuits do not have any loops; sequential circuits may contain loops. To define the semantics of sequential circuits, we will need to define the type of state associated with a particular circuit. To do so, we define the inductive family  $\lambda s$ :

```
data \lambda s: (c: \lambda B \Gamma \tau) \rightarrow Set where _s,_ : (sx: \lambda s \, x) \, (sy: \lambda s \, y) \rightarrow \lambda s \, (x, y) sLoop : \{c: \lambda B \, (\sigma:: \Gamma) \, (\sigma \otimes \tau)\} \rightarrow (si: El \, \sigma) \rightarrow (sc: \lambda s \, c) \rightarrow \lambda s \, (loop \, c) ...
```

This family has a constructor for each constructor of  $\lambda DB$ . Most of these constructors either contain no significant information, or simply follow the structure of the circuit, like in the clause for pairs,  $\_s$ , shown above. The most interesting case is sLoop, in which the state required to simulate a circuit of the form loop c consists of a value of type  $El\ \sigma$ — where  $\sigma$  is the type of the state that the circuit produces — together with any additional state that may arise from the loop body.

One other constructor of  $\lambda s$  deserves special attention: sMapAccumL-par. A circuit built with mapAccumL-par consists of n copies of a subcircuit f connected in a row. Hence, the state of such a circuit consists of a vector of states, one for each of the copies of f. Correspondingly, we define the state associated with such an accumulating map as follows:

```
sMapAccumL-par : (sf : Vec (\lambda s f) n) (se : \lambda s e) (sxs : \lambda s xs) \\ \rightarrow \lambda s (mapAccumL-par f e xs)
```

With this definition of state in place, we turn our attention to the semantics of our circuits. We will sketch the definition of our single step semantics, <code>[\_|]s</code>, mapping a circuit, initial state and environment to a the value produced by the circuit and a new state:

The environment  $\gamma$  assigns values to any free variables in our circuit definition. The base cases for our semantics are as follows:

In the case for gates, we apply the semantics of our atomic gates, described by the auxiliary function [-]g; in the case for variables, we lookup the corresponding value from the environment. Both these cases do not refer to the circuit's state. This state becomes important when simulating loops. In the clauses for application, let' and loop, shown in Listing 1, we do need to consider the circuit's state.

In the cases of application and **let**, each subcircuit simply "takes a step" independently and the next state of the whole circuit is a combination of the

Listing 1: State-combining clauses of the single-step state transition semantics.

next states of each subcircuit. The case for loop is slightly more interesting: the loop body, f, takes an additional input, namely the current state given by the ml parameter of sLoop constructor.

The further clauses of the transition function handle the introduction and elimination forms of products, coproducts and vectors. They are all defined simply by recursive evaluation of the subcircuits, and are straightforward enough to omit from the presentation here. For example, the clause for coproduct elimination is shown below:

First the coproduct value  $(x \lor y)$  is evaluated, computing a result value and its next state. The result of the evaluation  $(rx \lor ry)$  is then fed to Agda's coproduct eliminator  $([\_,\_])$ ; the functions that process the left and right injections proceed accordingly. In either case, the value is fed into evaluation of the appropriate body (either f or g), and the result is then used as the result of the whole coproduct evaluation.

Similarly our elimination principle for vectors, mapAccumL-par, is worth highlighting:

The above clause is key in the relation that we later establish (Section 5.1) between parallel and sequential versions of circuits. The three key sub-steps involved in this clause are: evaluation of the left identity element (e), the evaluation of the row of inputs (xs) and the row of step function copies (f).

The first two steps are as expected: both the identity and row of inputs take a step, and we thus obtain the next state and result values of each. The core step is then evaluating the row of copies of f, and its semantics are given using the auxiliary function mapAccumL2.

The mapAccumL2 function is simply a two-input version of an accumulating map, which works by simply zipping the pair of input vectors and calling the mapAccumL function from Agda's standard library.

```
\begin{array}{ll} \mathsf{mapAccumL} \,:\, (\sigma \to \alpha \to (\sigma \times \beta)) \to \sigma \to \mathsf{Vec} \; \alpha \; \mathsf{n} \to \sigma \times \mathsf{Vec} \; \beta \; \mathsf{n} \\ \mathsf{mapAccumL} \; \mathsf{f} \; \mathsf{s} \; [] &= \mathsf{s} \;, \; [] \\ \mathsf{mapAccumL} \; \mathsf{f} \; \mathsf{s} \; (\mathsf{x} :: \mathsf{xs}) &= \mathsf{let} \; \mathsf{s}' \;, \; \mathsf{y} &= \mathsf{f} \; \mathsf{s} \; \mathsf{x} \\ & \mathsf{s}'' \;, \; \mathsf{ys} &= \mathsf{mapAccumL} \; \mathsf{f} \; \mathsf{s}' \; \mathsf{xs} \\ & \mathsf{in} \; \mathsf{s}'' \;, \; (\mathsf{y} :: \mathsf{ys}) \\ \mathsf{mapAccumL2} \; : \; (\sigma \to \alpha \to \gamma \to (\sigma \times \beta \times \delta)) \to \sigma \to \mathsf{Vec} \; \alpha \; \mathsf{n} \to \mathsf{Vec} \; \gamma \; \mathsf{n} \\ & \to \sigma \times \mathsf{Vec} \; \beta \; \mathsf{n} \times \mathsf{Vec} \; \delta \; \mathsf{n} \\ \mathsf{mapAccumL2} \; \mathsf{f} \; \mathsf{s} \; \mathsf{xs} \; \mathsf{ys} \\ &= \mathsf{map} \times \mathsf{id} \; \mathsf{unzip} \; \$ \; \mathsf{mapAccumL} \; (\mathsf{uncurry} \circ \mathsf{f}) \; \mathsf{s} \; (\mathsf{zip} \; \mathsf{xs} \; \mathsf{ys}) \end{array}
```

In the semantics of mapAccumL-par, we apply mapAccumL2 to the vector with the result of xs (called rxs) as well as the vector with states for the copies of f (called mfs). Then, as the result of the application we obtain the final accumulator value and vector of result values, together with the vector of next state values (mfs').

*Multi-step semantics* To describe the behaviour of a circuit *over time*, we need to define another semantics. More specifically, in this work we consider only discrete-time synchronous circuits, and thus we will show how to use  $[\![\ ]\!]$ s to define a multi-step state-transition semantics.

When simulating a circuit for n cycles, we need to take not *one* input environment but n, and instead of producing a single value, the simulation returns a vector of n values. Just as we saw for mapAccumL-par, we ensure that the newly computed state is threaded from one simulation cycle to the next.

This is exactly the behaviour of an accumulating map, thus the use of mapAccumL here. The use of mapAccumL here is the key to the connection between the multi-cycle of circuits using loop and the single-cycle behavior of circuits using mapAccumL-par.

#### 5 Parallel and sequential combinators

With  $\lambda\pi$ -Ware we intend to give a hardware developer more freedom to explore the trade-offs between area, frequency and number of cycles that a circuit might

take to complete a computation. This freedom comes from the proven guarantees of convertibility between parallel and sequential versions of circuits.

To make it easier to explore this design space, we provide some *circuit combinators* for common patterns. Each of these patterns comes in a sequential and a parallel version, together with a lemma relating the two. If a circuit is defined using one of these combinators, changing between a serial or parallel design is as easy as changing the combinator version used. The associated lemma guarantees the relation between the functional behaviour of the versions.

All combinators in this section are derived from the two primitive constructors loop and mapAccumL-par. By appropriate partial application and the use of "wrappers" to create the loop body, all sequential combinators are derived from loop. Similarly, using the same wrappers but with mapAccumL-par, we derive all parallel combinators.

Of notice is also the fact that, in this section, we present the combinators in *De Bruijn* style, as this is the most useful representation to use when evaluating circuit (generators), which is covered in 5.1

The map combinators For example, we might want to easily build circuits that map a certain function over its inputs. We will define both the sequential and parallel map combinators in terms of a third circuit, mapper. The sequential version is given by map-seq:

```
\begin{array}{l} \mathsf{mapper} \,:\, (f\,:\, \lambda B\,\, (\rho ::\, \Gamma)\,\, \tau) \to \lambda B\,\, (\sigma ::\, \rho ::\, \Gamma)\,\, (\sigma \otimes \tau) \\ \mathsf{mapper}\,\, f \,=\, \#_0 \,\,,\, K_1\,\, f \\ \mathsf{map-seq} \,:\, (f\,:\, \lambda B\,\, (\rho ::\, \Gamma)\,\, \tau) \to \lambda B\,\, (\rho ::\, \Gamma)\,\, \tau \\ \mathsf{map-seq}\,\, f \,=\, \mathsf{loop}\, \{\sigma \,=\, \mathbb{1}\}\,\, (\mathsf{mapper}\, f) \end{array}
```

We define map-seq by applying loop to the mapper f circuit. In mapper, the next state (first projection of the pair) is a copy of its first input  $(\#_0)$ , whereas the second projection is made by the weakened f, which discards its first input.

The parallel version of the same combinator (map-par) is defined in terms of mapAccumL-par and mapper:

```
map-par : (f: \lambda B \ (\rho :: \Gamma) \ \tau) \ (xs: \lambda B \ \Gamma \ (vec \ \sigma \ n)) \rightarrow \lambda B \ \Gamma \ (vec \ \tau \ n) map-par f \ xs = snd \ (mapAccumL-par \ (mapper \ f) \ unit \ xs)
```

In the above definition we note that we are free to choose the type of the "initial element" (2nd argument), but we use 1 (value unit), as units can always be used regardless of the base type chosen in the development. Furthermore, we use snd to extract only the second element of the pair (the output vector), and discard the "final element" outputted.

The foldl-scanl combinators Perhaps even more useful than mapping is scanning and folding over a vector of inputs. To obtain the sequential and parallel versions of such combinators, we again apply the loop and mapAccumL-par primitives to a special body which wraps the binary operation (f) of the scan/fold.

```
\begin{array}{ll} \mathsf{folder} \,:\, (\mathsf{f} \,:\, \lambda \mathsf{B} \,\, (\sigma :: \rho :: \Gamma) \,\, \sigma) \to \lambda \mathsf{B} \,\, (\sigma :: \rho :: \Gamma) \,\, (\sigma \otimes \sigma) \\ \mathsf{folder} \,\, \mathsf{f} \,\, = \,\, \#_0 \,\,,\, \mathsf{f} \\ \mathsf{foldl\text{-}scanl\text{-}seq} \,:\, (\mathsf{f} \,:\, \lambda \mathsf{B} \,\, (\sigma :: \rho :: \Gamma) \,\, \sigma) \to \lambda \mathsf{B} \,\, (\rho :: \Gamma) \,\, \sigma \\ \mathsf{foldl\text{-}scanl\text{-}seq} \,\, \mathsf{f} \,\, = \,\, \mathsf{loop} \,\, (\mathsf{folder} \,\, \mathsf{f}) \end{array}
```

The wrapper called folder makes the next state equal to the first input of the binary operator, and the output be the result of applying the binary operator. In the above definition of foldl-scanl-seq, we get the behaviour of scanl and foldl combined: The circuit outputs from clock cycle 0 to n form the result of the scanl operation, and the last one at cycle n+1 is the value of the foldl.

The parallel version has also such a combined behaviour:

```
 \begin{array}{l} \mathsf{foldl\text{-}scanl\text{-}par} \,:\, (\mathsf{f}\,:\, \lambda \mathsf{B} \,\, (\sigma :: \rho :: \Gamma) \,\, \sigma) \,\, (\mathsf{e}\,:\, \lambda \mathsf{B} \,\, \Gamma \,\, \sigma) \,\, (\mathsf{xs}\,:\, \lambda \mathsf{B} \,\, \Gamma \,\, (\mathsf{vec} \,\, \rho \,\, \mathsf{n})) \\ \qquad \, \, \to \lambda \mathsf{B} \,\, \Gamma \,\, (\sigma \otimes \mathsf{vec} \,\, \sigma \,\, \mathsf{n}) \\ \mathsf{foldl\text{-}scanl\text{-}par} \,\, \mathsf{f} \,\, \mathsf{e} \,\, \mathsf{xs} \,\, = \,\, \mathsf{map} \mathsf{AccumL\text{-}par} \,\, (\mathsf{folder} \,\, \mathsf{f}) \,\, \mathsf{e} \,\, \mathsf{xs} \end{array}
```

In the parallel case, we obtain a pair as output, of which the first element is the fold component, and the second element is the scanl (vector) component. Thus by simply applying the fst and snd functions we can obtain the usual fold and scanl.

Whereas these combinators capture some common patterns in hardware design, their usefulness also depends on lemmas relating their parallel and sequential versions.

#### 5.1 Convertibility of parallel and sequential versions

In this section we make precise the relation between circuits with different levels of parallelism. In the examples of this paper we convert between extreme cases: completely parallel (combinational) versus completely sequential. However, nothing in treatment that follows precludes that it be used for *partial unrolling*.

We will show that when two circuits are deemed "convertible up to timing", they can be substituted for one another with minor interface changes in the surrounding context but no alteration of the values ultimately produced.

The relation of convertibility relies on the fact that any sequential circuit will have an occurrence of the loop constructor. As such, a more parallel variant of such a circuit can be obtained by substituting the occurrence of loop with one of mapAccumL-par, thereby unrolling the loop. The fundamental relation between loop and mapAccumL-par is what we now establish. First, recall the types of the single- and multi-step semantic functions:

Now, to establish the desired relation, we apply both the single and multicycle semantics. The *step function* subcircuit (called f) is equal in both cases, and the mapAccumL-par case takes 2 extra parameters besides f.

The second parameter of mapAccumL-par must be a circuit whose value is the same as the first parameter of sLoop, and we use here the simplest possible such circuit: (valm). The third parameter (xs) is the input vector of size n, and is used to build the vector of environments used by the multi-cycle semantics (map ( ::  $\gamma$ ) xs).

Finally, the state of the mapAccumL-par case is built by simply replicating one state of f by n times. Stating the convertibility property in this way makes it be valid only for a state-independent f, that is, when the input/output semantics of f is independent of the state.

```
\begin{array}{l} \mathsf{state}\text{-}\mathsf{independent} \ : \ \forall \ (\mathsf{c} \ : \ \mathsf{\lambda}\mathsf{B} \ \Gamma \ \mathsf{\tau}) \to \mathsf{Set} \\ \mathsf{state}\text{-}\mathsf{independent} \ \mathsf{c} \ = \ \forall \ (\mathsf{sa} \ \mathsf{sb} \ : \ \mathsf{\lambda}\mathsf{s} \ \mathsf{c}) \to [\![\ \mathsf{c}\ |]\!]\mathsf{s} \ \mathsf{sa} \ \gamma \equiv [\![\ \mathsf{c}\ |]\!]\mathsf{s} \ \mathsf{sb} \ \gamma \end{array}
```

This restriction on f could be somewhat further loosened (as is discussed in Section 6.2), but we work here with state-independent loop bodies to simplify the presentation.

As we have seen, the results from applying each semantic function have different types (Tpar and Tloop), so the relation comparing these results is more subtle than just equality. We define this relation, called \_=\*\_, as follows:

Both sides of (\_=\*\_) consist of a pair of next state and circuit outputs. In the mapAccumL-par case, the next state can be ignored in the comparison, but in the loop case, the *value* stored in the loop state (obtained by gets<sub>0</sub>) must be equal to the first output of evaluating mapAccumL-par. With the comparison function defined, we can finally completely express the relation we desire:

[ mapAccumL-par f (val m) xs |]s (sMapAccumL-par (replicate mf)) 
$$\gamma$$
 =\* [ loop f |]n (sLoop m mf) n (map ( ::  $\gamma$ ) xs)

Proof of the basic relation The proof of the basic convertibility relation between mapAccumL-par and loop proceeds by induction on the input vector xs. Due to the deliberate choice of semantics for both constructors involved, and the choice of the right parameters for the application of each, a considerable part of the proof is achieved by just the built-in reduction behaviour of the proof assistant (Agda).

The only key lemma involved is shown below. Namely, the state-independence principle is shown to hold for a whole vector, assuming that it holds for the body circuit f.

This lemma is useful because both left-hand side and right-hand side of the convertibility relation can be transformed into applications of mapAccumL2 simply by reduction, but with different state vector parameters. Thus the lemma is used to bring the sub-goals to a state where they can be closed by using the induction hypothesis.

Convertibility of derived combinators When building circuits using the derived parallel and sequential combinators (map, foldl-scanl, etc.), the convertibility between different (more or less parallel) variants of such circuits rely on the convertibility between the different variants of the combinators themselves.

The basic convertibility principle shown above between mapAccumL-par and loop is the most general one, and can be directly applied to the derived combinators as well, as they are all just a specialized instance of mapAccumL-par or loop. However, for the derived combinators, some more specific properties are useful.

With regards to the map combinators, for example, we wish that the vectors produced by the parallel and sequential versions be equal, without any regard for initial or final states. This can be succinctly expressed as:

```
snd ([\![ map-par f xs |\![]\!]s units \gamma)

\equiv snd ([\![ map-seq f |\![]\!]n units' (map ( :: \gamma) xs))
```

Where units and units' are simply the states (composed of units) that need to be passed to the semantic function but are irrelevant for the computed vectors.

On the other hand, when comparing foldl-par to foldl-seq, the intermediate values produced in the output of foldl-seq are disregarded, and only the final state matters.

```
\begin{array}{ll} \text{fst } ([\![ \text{ foldl-par f } (\text{val e}) \text{ xs } | ]\!] \text{s m} & \gamma) \\ \equiv \text{fst } ([\![ \text{ foldl-seq f} & | ]\!] \text{n } (\text{sFoldl e m}) \text{ } (\text{map } (\_ :: \gamma) \text{ xs})) \end{array}
```

Both of these properties (for map and for foldl) can simply be proven by application of the general property shown above for mapAccumL-par and loop. This is because the definition of the derived combinators is just a partial application of mapAccumL-par and loop, along with projections.

### 5.2 Applications of the parallel and sequential combinators

In this section we describe several variants of circuit families that compute matrix multiplication, as a commonly used application of the aforementioned techniques.

The first design choice involved in this example application is how to represent matrices, i.e., the choice of the matrix type. Traditionally in computing contexts, matrices are mostly represented in two ways: row major (vector of rows) and column major (vector of columns). As it turns out, both representations are useful for our purposes, so we show both here:

```
RMat CMat : (rc : \mathbb{N}) \rightarrow U \mathbb{N}
RMat rc = vec (vec \mathbb{N} c) r
CMat rc = vec (vec \mathbb{N} r) c
```

Here, RMat r c and CMat r c both represent matrices with r rows and c columns, the difference being only whether they are row- or column-major. Going further with the example, we need to define the basic ingredient of matrix multiplication: the *dot product* of two equally-sized vectors.

```
\begin{array}{l} \text{dp} \ : \ \lambda H \ (\text{vec} \ \mathbb{N} \ n) \to \lambda H \ (\text{vec} \ \mathbb{N} \ n) \to \lambda H \ \mathbb{N} \\ \text{dp} \ xs \ ys \ = \ foldl-par } \ :+: \ \ (\text{val} \ 0) \ (\text{zipWith-par} \ :^*: \ \ xs \ ys) \end{array}
```

The dot product is simply defined as element-wise multiplication of the vectors and summing up the results. We can then use the dot product m times in order to multiply a vector by a compatibly-sized matrix.

```
\begin{array}{lll} \mathsf{vec} \times \mathsf{mat}\text{-par} \,:\, \lambda \mathsf{H} \,\, (\mathsf{vec} \,\, \mathbb{N} \,\, \mathsf{n}) \to \lambda \mathsf{H} \,\, (\mathsf{CMat} \,\, \mathsf{n} \,\, \mathsf{m}) \to \lambda \mathsf{H} \,\, (\mathsf{vec} \,\, \mathbb{N} \,\, \mathsf{m}) \\ \mathsf{vec} \times \mathsf{mat}\text{-par} \,\, \mathsf{v} \,\, \mathsf{m} \,\, = \,\, \mathsf{map}\text{-par} \,\, (\mathsf{dp} \,\, \mathsf{v}) \,\, \mathsf{m} \end{array}
```

Here an important detail resides: as the dot product is done for  $each\ column$  of the matrix, the matrix argument of  $vec \times mat$  must be in column-major representation. Also, here we start having choices: we may either have the computation done in parallel as above, or sequentially as below:

```
vec \times mat\text{-seq} : \lambda H \text{ (vec } \mathbb{N} \text{ n)} \rightarrow \lambda H \text{ (vec } \mathbb{N} \text{ n)} \rightarrow \lambda H \mathbb{N}

vec \times mat\text{-seq } v \text{ m} = map\text{-seq } (dp \text{ v}) \text{ m}
```

With the multi-step semantics in mind, we know that each of the m columns of the matrix will be present on the circuit's second input, one per clock cycle, and that collecting the output values for m cycles gives the same vector of results as the one from the parallel version.

For defining the multiplication of two matrices, we simply use  $\text{vec}\times\text{mat}$  on each row of the left matrix. If using  $\text{vec}\times\text{mat-par}$ , we obtain a matrix multiplication circuit with area proportional to r\*c, whereas by using  $\text{vec}\times\text{mat-seq}$  the area is proportional to r\*1.

```
\label{eq:mat_mat_par} \begin{array}{l} \mathsf{mat} \times \mathsf{mat}\text{-}\mathsf{par} \ : \ \lambda \mathsf{H} \ (\mathsf{RMat} \ \mathsf{n} \ \mathsf{m}) \to \lambda \mathsf{H} \ (\mathsf{RMat} \ \mathsf{n} \ \mathsf{p}) \\ \mathsf{mat} \times \mathsf{mat}\text{-}\mathsf{par} \ \mathsf{mr} \ \mathsf{mc} \ = \ \mathsf{map}\text{-}\mathsf{par} \ (\mathsf{flip} \ \mathsf{vec} \times \mathsf{mat}\text{-}\mathsf{par} \ \mathsf{mc}) \ \mathsf{mr} \\ \mathsf{mat} \times \mathsf{mat}\text{-}\mathsf{seq} \ : \ \lambda \mathsf{H} \ (\mathsf{RMat} \ \mathsf{n} \ \mathsf{m}) \to \lambda \mathsf{H} \ (\mathsf{vec} \ \mathbb{N} \ \mathsf{m}) \to \lambda \mathsf{H} \ (\mathsf{vec} \ \mathbb{N} \ \mathsf{n}) \\ \mathsf{mat} \times \mathsf{mat}\text{-}\mathsf{seq} \ \mathsf{mr} \ \mathsf{mc} \ = \ \mathsf{map}\text{-}\mathsf{par} \ (\mathsf{flip} \ \mathsf{vec} \times \mathsf{mat}\text{-}\mathsf{seq} \ \mathsf{mc}) \ \mathsf{mr} \end{array}
```

In the parallel version ( $mat \times mat$ -par), all the rows in the resulting matrix are computed in parallel, with the column-positioned values inside each row computed also in parallel. In the sequential version, at each clock cycle one whole column is produced, with the row-positioned values inside each column computed in parallel.

Matrix multiplication as defined here has two nested recursion blocks, and thus four ways in which it could be sequentialized. Above we have shown two possible such choices, and the other two can simply be obtained by swapping map-par for map-seq.

#### 6 Discussion

#### 6.1 Related work

There is a rich tradition of using functional programming languages to model and verify hardware circuits, Sheeran (2005) gives a good overview – we restrict ourselves to the most closely related languages here. Languages embedded in Haskell, such as Lava and Wired, typically rely on automated theorem provers and testing using QuickCheck for verification. In  $\lambda\pi$ -Ware, however, we can perform inductive verification of our circuits. Existing embeddings in most theorem provers, such as Coquet and  $\pi$ -Ware, have a more limited treatment of variable scoping and types. More recent work by Choi et al. (2017) is higher level, but sacrifices the ability to be simulated directly in the theorem prover.

#### 6.2 Future work

Other timing transformations While our language easily lets you explore possible designs, trading time and space, there are several alternative transformations, such as *pipelining* that we have not yet tried to describe in this setting.

While we have a number of combinators for transforming between parallel and sequential circuits, these are mostly aimed at linear, list-like data. Event though these structures are the most prevalent in hardware design, we would like to explore related timing transformations on tree-structured circuits. To this end, it would be interesting to look into the formalization and verification of flattening transformations, and of the work done in the field of nested data parallelism.

Relaxed unrolling restriction In Section 5.1 we mention that the proof of semantics preservation for loop unrolling relies on the premise that the loop body is state-independent, that is, it has the same input/output behaviour for any given state. This premise can be relaxed somewhat, and proving that loop unrolling still preserves semantics under this relaxed premise is (near-)future work.

The relaxed restriction on the body f of a loop to be unrolled is as follows:

```
state-input-independent : \forall (c : \lambda B \Gamma \tau) \rightarrow Set state-input-independent c = fst (\llbracket c \rrbracket \rrbracket s sa \gamma) \equiv fst (\llbracket c \rrbracket \rrbracket s sa \delta)
```

That is, the next state of the loop body (fst projection of the evaluation result) does not depend on the values in its input environment. This condition is necessary because when writing the parallel version of a loop construct we must give each copy of f its own initial state. As the desired initial state for each such copy must be known at verification time, it cannot depend on input.

#### 7 Conclusion

There are several advantages to be gained by embedding a hardware design DSL in a host language with dependent types, such as Agda. Among these advantages are the easy enforcement of some well-formedness characteristics of circuits, the power given by the host's type system to express object language types and design constraints. The crucial advantage though, is the ability to have modelling, simulation, synthesis and theorem proving in the same language.

By using the host language's theorem-proving abilities, we are able not only to show properties of individual circuits, but of (infinite) classes of circuits, defined by using *circuit generators*. Particularly interesting is the ability to have *verified transformations*, preserving some semantics.

The focus of this paper lies on timing-related transformations, but we also recognize the promise of theorem proving for the formalization of other non-functional aspects of circuit design, such as power consumption, error correction, fault-tolerance and so forth. The formal study of all these aspects of circuit construction and program construction could benefit from mechanized verification.

## Acknowledgments

We would like to thank the very fruitful collaboration and helpful feedback gathered during the visit to Chalmers University of Technology, funded by COST Action EUTypes CA15123. Especially valuable were the meetings and discussions with Mary Sheeran, whose deep knowledge of the field oriented and gave perspective to this work in its beginning stage. Also, we are deeply thankful to the comments and feedback gathered during our presentation on this topic given at the TFP2017 conference in Canterbury.

This work was supported by the Netherlands Organization for Scientific Research (NWO) project on A Dependently Typed Language for Verified Hardware.

# Bibliography

- Robert Atkey, Sam Lindley, and Jeremy Yallop. Unembedding Domain-specific Languages. In *Proceedings of the 2Nd ACM SIGPLAN Symposium on Haskell*, Haskell '09, pages 37–48, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-508-6. doi: 10.1145/1596638.1596644. URL http://doi.acm.org/10.1145/1596638.1596644.
- Emil Axelsson, Koen Claessen, and Mary Sheeran. Wired: Wire-Aware Circuit Design. In *Correct Hardware Design and Verification Methods*, pages 5-19. Springer, Berlin, Heidelberg, October 2005. doi: 10.1007/11560548\_4. URL https://link.springer.com/chapter/10.1007/11560548\_4.
- Christiaan Pieter Rudolf Baaij. Digital circuits in ClaSH: functional specifications and type-directed synthesis. info:eu-repo/semantics/doctoralThesis, University of Twente, Enschede, January 2015. URL https://doi.org/10.3990/1.9789036538039.
- Per Bjesse, Koen Claessen, Mary Sheeran, and Satnam Singh. Lava: hardware design in Haskell. *ACM SIGPLAN Notices*, 34(1):174–184, January 1999. ISSN 03621340. doi: 10.1145/291251.289440. URL http://portal.acm.org/citation.cfm?doid=291251.289440.
- Richard J Boulton, Andrew D Gordon, Michael JC Gordon, John Harrison, John Herbert, and John Van Tassel. Experience with embedding hardware description languages in hol. In *TPCD*, volume 10, pages 129–156, 1992.
- Edwin Brady, James Mckinna, and Kevin Hammond. Constructing Correct Circuits: Verification of Functional Aspects of Hardware Specifications with Dependent Types. In *Trends in Functional Programming 2007*, 2007.
- Thomas Braibant. Coquet: A Coq Library for Verifying Hardware. In Jean-Pierre Jouannaud and Zhong Shao, editors, Certified Programs and Proofs, number 7086 in Lecture Notes in Computer Science, pages 330-345. Springer Berlin Heidelberg, January 2011. ISBN 978-3-642-25378-2, 978-3-642-25379-9. URL http://link.springer.com/chapter/10.1007/978-3-642-25379-9\_24.
- Thomas Braibant and Adam Chlipala. Formal Verification of Hardware Synthesis. In Natasha Sharygina and Helmut Veith, editors, *Computer Aided Verification*, number 8044 in Lecture Notes in Computer Science, pages 213–228. Springer Berlin Heidelberg, January 2013. ISBN 978-3-642-39798-1 978-3-642-39799-8. URL http://link.springer.com/chapter/10.1007/978-3-642-39799-8\_14.
- Joonwon Choi, Muralidaran Vijayaraghavan, Benjamin Sherman, Adam Chlipala, and Arvind. Kami: A platform for high-level parametric hardware specification and its modular verification. *Proc. ACM Program. Lang.*, 1 (ICFP):24:1–24:30, August 2017. ISSN 2475-1421. doi: 10.1145/3110268. URL http://doi.acm.org/10.1145/3110268.
- Andy Gill, Tristan Bull, Garrin Kimmell, Erik Perrins, Ed Komp, and Brett Werling. Introducing Kansas Lava. In *Proceedings of the Symposium on Im-*

- plementation and Application of Functional Languages, volume 6041 of LNCS. Springer-Verlag, Sep 2009.
- F. K. Hanna and N. Daeche. Dependent Types and Formal Synthesis. *Philosophical Transactions: Physical Sciences and Engineering*, 339(1652):121-135, April 1992. ISSN 0962-8428. URL http://www.jstor.org/stable/54016.
- T. Melham. Higher Order Logic and Hardware Verification, volume 31 of Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 1993. ISBN 0-521-41718-X. doi: 10.1017/CBO9780511569845. URL http://www.cs.ox.ac.uk/tom.melham/pub/Melham-1993-HOL.html.
- Joao Paulo Pizani Flor, Yorick Sijsling, and Wouter Swierstra. π-ware: Hardware description and verification in agda. In Tarmo Uustalu, editor, 21th International Conference on Types for Proofs and Programs (TYPES 2015), Leibniz International Proceedings in Informatics (LIPIcs), 2016.
- I Sander and A Jantsch. System modeling and transformational design refinement in ForSyDe [formal system design]. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 23(1):17–32, January 2004. ISSN 0278-0070. doi: 10.1109/TCAD.2003.819898.
- M Sheeran. Hardware Design and Functional Programming: a Perfect Match. 2005. URL http://www.jucs.org/jucs\_11\_7/hardware\_design\_and\_functional/jucs\_11\_7\_1135\_1158\_sheeran.pdf.
- Mary Sheeran. muFP, a language for VLSI design. In *Proceedings of the 1984 ACM Symposium on LISP and functional programming*, pages 104–112. ACM Press, 1984. ISBN 0897911423. doi: 10.1145/800055.802026. URL http://portal.acm.org/citation.cfm?doid=800055.802026.
- S. Singh. Designing reconfigurable systems in Lava. In *VLSI Design*, 2004. Proceedings. 17th International Conference on, pages 299–306, 2004. doi: 10. 1109/ICVD.2004.1260941.