# A Comparison of Double Point Multiplication Algorithms and their Implementation over Binary Elliptic Curves 

Reza Azarderakhsh and Koray Karabina


#### Abstract

Efficient implementation of double point multiplication is crucial for elliptic curve cryptographic systems. We revisit three recently proposed simultaneous double point multiplication algorithms. We propose hardware architectures for these algorithms, and provide a comparative analysis of their performance. We implement the proposed architectures on Xilinx Virtex-4 FPGA, and report on the area and time results. Our results indicate that differential addition chain based algorithms are better suited to compute double point multiplication over binary elliptic curves for both high performance and resource constrained applications.


## Index Terms

Elliptic curve cryptography (ECC), differential addition chains, binary fields, double point multiplication, Field Programmable Gate Array (FPGA).

## I. Introduction

Elliptic curves have been extensively used in public key cryptography especially in embedded, resource-constrained, and highperformance applications. Point multiplication is a major operation in many elliptic curve based cryptosystems. For example, if a subgroup $\langle P\rangle$ of an elliptic curve $E$ is deployed in a Diffie-Hellman type key exchange protocol, then a party $A$ chooses a secret random integer $a$, computes $a P$, and sends it to the other party with whom $A$ wants to share a secret key. For another example, if a prime order cyclic subgroup $\mathbb{G}$ of an elliptic curve $E$ is deployed in a Cramer-Shoup encryption scheme, then in the key generation phase a party $A$ computes $a P+b Q$ as a part of her public key. Here, $P, Q$ are two random generators of $\mathbb{G}$, and $a, b$ are two random integers all chosen by the party $A$. Even though $A$ announces $P$ and $Q$ as a part of her public key, she has to keep $a$ and $b$ private as a part of her secret key. Similarly, in the decryption phase, $A$ computes $\tilde{a} P_{1}+\tilde{b} P_{2}$, where $P_{1}, P_{2} \in \mathbb{G}$ are parts of a ciphertext, and the integers $\tilde{a}, \tilde{b}$ have to be kept secret by $A$. The security of such cryptosystems relies heavily on the difficulty of the discrete logarithm problem (DLP) in $\mathbb{G}$ (i.e., given $P, a P \in \mathbb{G}$, compute $a$ ). A generic way to solve DLP in $\mathbb{G}$ is to use the Pollard's rho method that runs in time $O(\sqrt{G})$ [20]. Side channel analysis includes a class of other methods to recover the secret $a$ by making use of side channel information extracted from the computation of $a P$. A conventional method for computing $a P$ is to use a variant of double-and-add type algorithms based on the binary representation of the secret exponent $a$. Such an algorithm would suffer from power analysis attacks when doubling and addition operations are distinguishable [8]. One method to provide Diffie-Hellman type protocols with some level of protection against side channel attacks is to split the scalar $a=r+(a-r)$ for some secret random integer $r$, and to compute $a P=r P+(a-r) P$ [7].

For the sake of generality, let $\mathbb{G}$ be an additive abelian group. Given an integer $a$ and a point $P \in \mathbb{G}$, a (single) point multiplication algorithm computes $a P \in \mathbb{G}$. Given two integers $a, b$ and two points $P, Q \in \mathbb{G}$, a double point multiplication algorithm computes $a P+b Q \in \mathbb{G}$. As we see in the above examples, having an efficient and secure ${ }^{1}$ double point multiplication algorithm is crucial for many cryptographic schemes. Another scenario where one needs efficient and secure double point multiplication is to speed up single point multiplication over elliptic curves with endomorphisms, see [11],[10],[12].

A naive way to perform double point multiplication is to perform two single point multiplications. A more efficient method is to compute $a P+b Q$ simultaneously. Straus-Shamir's trick (see Algorithm 14.88 in [17]) and interleaving [18] are two such methods. Straus-Shamir's type simultaneous double point multiplication algorithms are vulnerable to side-channel analysis because double and add instructions are not performed in a regular fashion. Fortunately, recoding the scalars $a$ and $b$ allows us to adapt Straus-Shamir's type algorithms in such a way that the same instructions are executed in the same order. Joye and Tunstall [15] proposed several methods of regular recoding of scalars for regular point multiplication algorithms, which can immediately be adapted to yield regular simultaneous double point multiplication algorithms. In particular, their signeddigit recoding method with the digit set $\{ \pm 1, \pm 3\}$ yields a regular double point multiplication algorithm, that we call the $J T-\{ \pm 1, \pm 3\}$ algorithm. $J T-\{ \pm 1, \pm 3\}$ costs half addition and one doubling per scalar bit. Using differential addition chains (DAC) is another method to perform simultaneous double point multiplication; see for instance [19], [2], and [6]. DAC-method

[^0]Table I. A comparison of three simultaneous double point multiplication algorithms $J T-\{ \pm 1\}, B-N B C$, and $A K-D A C$. Double and add operations are denoted by $D$ and $A$, respectively.

| Algorithm | Per-bit <br> cost | Regular | DAC- <br> based | Parallelizable |
| :---: | :---: | :---: | :---: | :---: |
| $J T-\{ \pm 1, \pm 3\}$ | $0.5 A+1 D$ | Yes | No | No |
| $B-N B C$ | $2 A+1 D$ | Yes | Yes | Yes |
| $A K-D A C$ | $1.4 A+1.4 D$ | Yes | Yes | Yes |

is attractive because it yields potentially simple power analysis resistant algorithms due to the uniform pattern of operations executed; and it is especially efficient in elliptic curves setting because double and add operations can be performed using $x$-coordinates only. Bernstein [6] proposed a double point multiplication algorithm based on the new binary chain, that we call the $B-N B C$ algorithm. $B-N B C$ has a uniform structure, and costs two additions and one doubling per scalar bit. More recently, Azarderakhsh and Karabina [4] proposed a simultaneous double point multiplication algorithm based on DAC, that we call the $A K-D A C$ algorithm. $A K-D A C$ has a uniform structure, and costs 1.4 additions and 1.4 doublings per scalar bit.

In Table I, we present a brief comparison of these three simultaneous double point multiplication algorithms $J T-\{ \pm 1, \pm 3\}$, $B-N B C$, and $A K-D A C$. All of these three algorithms are regular, and so they are potentially resistant against power analysis attacks. However, comparing these algorithms from the efficiency point of view is not straightforward. Even though JT$\{ \pm 1, \pm 3\}$ has the best per-bit cost, $B-N B C$ and $A K-D A C$ have the advantage of being based on DAC. For example, in elliptic curves setting, one can implement $B-N B C$ and $A K-D A C$ using the addition formulas that use only the $x$-coordinates of the points, and that are much more efficient than their traditional counterparts. Moreover, $J T-\{ \pm 1, \pm 3\}$ is not parallelizable in the sense that the double and add operations cannot be executed in parallel because an addition operation should always follow after two consecutive doubling operations. Double and add operations can be totally parallelized in both $B-N B C$ and $A K-D A C$. If one deploys two parallel addition/doubling units, then the per-bit costs of $B-N B C$ and $A K-D A C$ becomes $1 A+1 D$ and $1.4 A$, respectively. Similarly, if one deploys three parallel addition/doubling units, then the per-bit cost of $B-N B C$ becomes $1 A$.

In this paper, we realize hardware implementations of $J T-\{ \pm 1, \pm 3\}, B-N B C$, and $A K-D A C$ using standard Weierstrass binary elliptic curve groups, and present detailed performance comparisons with several area and time results. To the best of our knowledge, these three algorithms are some of the most promising regular algorithms with low precomputation and storage requirements, and their relative performance comparisons have not been analyzed.

The rest of the paper is organized as follows. In Section II, we review the naive method and the three algorithms $J T-\{ \pm 1, \pm 3\}$, $B-N B C$, and $A K-D A C$ for computing double point multiplication. In Section III, we provide hardware architectures and implement them of FPGA and compare the implementation results. Finally, we conclude the paper in Section IV.

## II. A Review of Double Point Multiplication Algorithms

In this section, we review the three algorithms $J T-\{ \pm 1, \pm 3\}, B-N B C, A K-D A C$, and the naive method for computing double point multiplication. We also introduce the elliptic curve equation over which we realize our implementation, and introduce some notation that we refer throughout the paper.

Let $E_{W, a, b}$ be a non-supersingular binary generic elliptic curve (short Weierstrass) defined as

$$
\begin{equation*}
E_{W, a, b}: y^{2}+x y=x^{3}+a x^{2}+b \tag{1}
\end{equation*}
$$

where $a, b \in \mathbb{F}_{2^{\ell}}$, and $b \neq 0$. The set of points $(x, y), x, y \in \mathbb{F}_{2^{\ell}}$, that satisfy (1) together with a special point at infinity $\mathcal{O}$ (group identity) form a finite additive abelian group that we denote by $E_{W, a, b}\left(\mathbb{F}_{2^{\ell}}\right)$. The group operation can be performed using the chord-and-tangent rule [13]. We have, $P+\mathcal{O}=\mathcal{O}+P=P$ for all $P \in E_{W, a, b}\left(\mathbb{F}_{2^{\ell}}\right)$, and the inverse of the point $P=(x, y)$ is $-P=(x, x+y)$.

## A. Traditional Scheme

The traditional scheme (in Hardware) for fast computation of double point multiplication is to employ two parallel (point multiplication) circuits to compute $a P$ and $b Q$ separately and add the final results together. The latency of computing double point multiplication based on this scheme is one point multiplication and a point addition (using explicit addition formulas) which requires to duplicate the hardware.

## B. The $J T-\{ \pm 1, \pm 3\}$ algorithm

Joye and Tunstall [15] proposed several methods of regular recoding of scalars for regular point multiplication algorithms. One of these algorithms is so called the signed-digit recoding algorithm that allows regular implementation of $m$-ary point multiplication algorithms. We represent their recoding algorithm for $m=4$ in Algorithm 1. We should note that a typical choice

```
Algorithm \(1 J T-\{ \pm 1, \pm 3\}\) scalar recoding algorithm
Inputs: \(a\) odd, \(m=4\)
Output: \(a=\left(a_{\ell-1}, \ldots, a_{0}\right)\) with odd \(a_{i} \in\{ \pm 1, \pm 3\}\)
    \(i \leftarrow 0\)
    While \(a>m\) do
        \(a_{i} \leftarrow(a \bmod 2 m)-m\)
        \(a \leftarrow\left(a-a_{i}\right) / m\)
        \(i \leftarrow i+1\)
    end While
    \(a_{i} \leftarrow a\)
```

Table II. An example to compute $71 P+93 Q$ using $J T-\{ \pm 1, \pm 3\}$

| $a$ | 1 | 1 | -3 | 3 |
| :---: | :---: | :---: | :---: | :---: |
| $b$ | 1 | 1 | 3 | 1 |
| Point | $P+Q$ | $5 P+5 Q$ | $17 P+23 Q$ | $71 P+93 Q$ |

for $m$ is $m=2^{k}$ for some positive integer $k$, and the choice $k=2$ seems to be optimal to get a competitive exponentiation algorithm with reasonable storage requirements for resource constrained applications. For example, the choice of $k=2$ requires to store 8 group elements, whereas with the choice of $k=3$, one has to store 32 group elements.

This recoding algorithm immediately yields a regular double point multiplication algorithm to compute $a P+b Q$, and we call this algorithm $J T-\{ \pm 1, \pm 3\}$. If $a$ and $b$ are $\ell$-bit integers, then $J T-\{ \pm 1, \pm 3\}$ requires about $\ell / 2$ iterations, and at each iteration a point $X$ is updated to a point $4 X+R$ for some $R \in\{ \pm(P+Q), \pm(P-Q)\}$. Therefore, the per-bit cost of $J T-\{ \pm 1, \pm 3\}$ is $0.5 A+1 D$. For example, Algorithm 1 recodes $a=71=(1,1,-3,3)$ and $b=(1,1,3,1)$, and $a P+b Q=71 P+93 Q$ is computed as in Table II.

The cost of point addition and doubling in $E_{W, a, b}\left(\mathbb{F}_{2^{\ell}}\right)$ are $13 \mathbf{M}+4 \mathbf{S}+9 \mathbf{A}$ and $5 \mathbf{M}+4 \mathbf{S}+5 \mathbf{A}$, respectively [13]. Here, $\mathbf{M}$, $\mathbf{S}$, and $\mathbf{A}$, are the costs of multiplication, squaring, and addition in $\mathbb{F}_{2^{\ell}}$, respectively. In Lopez-Dahap coordinates [16] where one of the points is represented in affine, the cost of mixed projective point addition, i.e., $\left(X_{3}, Y_{3}, Z_{3}\right)=\left(X_{1}, Y_{1}, Z_{1}\right)+\left(x_{2}, y_{2}\right)$, reduces to $9 \mathbf{M}+5 \mathbf{S}+9 \mathbf{A}$ [3]. The explicit formulas for point addition (PA) and point doubling (PD) are as follows [3]:

$$
\begin{aligned}
& \mathrm{PA}:\left\{\begin{array}{l}
A=Y_{1}+y_{2} Z_{1}^{2}, B=X_{1}+x_{2} Z_{1}, C=B Z_{1}, \\
Z_{3}=C^{2}, D=x_{2} Z_{3}, X_{3}=A^{2}+C\left(A+B^{2}+a C\right), \\
Y_{3}=\left(D+X_{3}\right)\left(A C+Z_{3}\right)+\left(y_{2}+x_{2}\right) Z_{3}^{2}
\end{array}\right. \\
& \mathrm{PD}:\left\{\begin{array}{l}
A=X_{1} Z_{1}, B=X_{1}^{2}, C=B+Y_{1}, D=A C \\
Z_{3}=A^{2}, X_{3}=C^{2}+D+a Z_{3} \\
Y_{3}=\left(Z_{3}+D\right) X_{3}+B^{2} Z_{3} .
\end{array}\right.
\end{aligned}
$$

In Fig. 1, the data dependency graph for computing point addition and point doubling are illustrated employing four and two parallel multipliers, respectively. As one can see, the total cost (latency) of computing point addition and point doubling is $3 M+13$ and $3 M+10$ clock cycles, respectively. Therefore, the cost of computing double point multiplication using $J T$ $\{ \pm 1, \pm 3\}$ is $\approx 0.5 \times(l-1) \times(3 M+13)+(l-1) \times(3 M+10)$. As we noted earlier, the latency of this scheme cannot be reduced further.

## C. The B-NBC algorithm

We briefly explain Bernstein's double point multiplication algorithm based on the new binary chain [6]. Let $a$ and $b$ be two positive integers. The new binary chain for $(a, b)$ is computed as follows. Let $(M, N)=(a, b)$ and $D=a \bmod 2 . C_{D}(0,0)$ is defined as $(0,0),(1,0),(0,1),(1,-1)$. For $(M, N) \neq(0,0), C_{D}(M, N)$ is defined recursively:

$$
\begin{aligned}
C_{D}(M, N)= & C_{d}(m, n) \\
& (M+(M+1 \bmod 2), N+(N+1 \bmod 2)) \\
& (M+(M \bmod 2), N+(N \bmod 2)) \\
& (M+(M+D \bmod 2), N+(N+D+1 \bmod 2)),
\end{aligned}
$$



Figure 1. Data dependency graph for computing (a) point addition [5] and (b) doubling employing parallel multipliers.
where $m=\lfloor M / 2\rfloor, n=\lfloor N / 2\rfloor$, and

$$
d= \begin{cases}0 & \text { if }(m+M, n+N) \bmod 2=(0,1) \\ 1 & \text { if }(m+M, n+N) \bmod 2=(1,0) \\ D & \text { if }(m+M, n+N) \bmod 2=(0,0) \\ 1-D & \text { if }(m+M, n+N) \bmod 2=(1,1)\end{cases}
$$

Building the new binary chain for $(a, b)$ requires $\max \left(\left\lceil\log _{2} a\right\rceil,\left\lceil\log _{2} b\right\rceil\right)$ iterations, and at the each iteration three vectors are added to the sequence. Let $V_{0}, V_{1}, V_{2}, \ldots, V_{\ell}$ be the new binary chain for $(a, b)$, where $V_{0}=C_{D}(0,0)$ and $V_{k}=v_{k}^{(1)}, v_{k}^{(2)}, v_{k}^{(3)}$ for $i=1, \ldots, \ell$. Because of the correspondence between the tuple $(i, j)$ and the group element $i P+j Q$, it will be convenient for us to call $V_{0}$ the input, and call $V_{1}$ the initial state (IS). By construction, there are six possibilities for $V_{1}: v_{1}^{(1)}$ is always $(1,1)$, and $\left(v_{1}^{(2)}, v_{1}^{(3)}\right) \in\{((2,0),(2,1)),((2,0),(1,0)),((0,2),(0,1)),((0,2),(1,2)),((2,2),(2,1)),((2,2),(1,2))\}$ In any case, initial state $V_{1}$ can be obtained from the input $V_{0}$ at a cost of at most 2 additions and 1 doubling. Furthermore, $V_{k}$ can be obtained from $V_{k-1}$ at a cost of 2 additions and 1 doubling for all $2 \leq k \leq \ell$. In particular, we have $v_{k}^{(1)}=v_{k-1}^{(1)}+v_{k-1}^{(2)}$, $v_{k}^{(2)}=2 v_{k-1}^{\left(i_{k}\right)}$, and $v_{k}^{(3)}=v_{k-1}^{\left(j_{k}\right)}+v_{k-1}^{(3)}$ for some $i_{k} \in\{1,2,3\}$ and $j_{k} \in\{1,2\}$. The values of $i_{k}$ and $j_{k}$ can be determined easily while computing the new binary chain as follows. By construction, the parities of the vectors $v_{i}^{(1)}, v_{i}^{(2)}, v_{i}^{(3)}$ must be either (odd,odd), (even,even), (odd, even) or (odd,odd), (even,even), (even, odd), respectively, for all $i=1, \ldots, \ell$. This already shows that $v_{k}^{(1)}=v_{k-1}^{(1)}+v_{k-1}^{(2)}$. Moreover, if the value of $v_{k}^{(2)}$ modulo 4 is $(0,0)$ then $v_{k}^{(2)}=2 v_{k-1}^{(1)}$; if it is $(4,4)$ then $v_{k}^{(2)}=2 v_{k-1}^{(2)}$; and if it is $(2,4)$ or $(4,2)$ then $v_{k}^{(2)}=2 v_{k-1}^{(3)}$. Finally, if the parities of $v_{k}^{(3)}$ and $v_{k-1}^{(3)}$ are the same then $v_{k}^{(3)}=v_{k-1}^{(2)}+v_{k-1}^{(3)}$; otherwise $v_{k}^{(3)}=v_{k-1}^{(1)}+v_{k-1}^{(3)}$. Therefore, the new binary chain $\left\{V_{k}\right\}_{k=0}^{\ell}$ can be associated with what we call the chain sequence

$$
\begin{equation*}
\mathrm{CS}=\left\{\left(i_{k}, j_{k}\right)\right\}_{k=1}^{\ell}, i_{k} \in\{1,2,3\}, j_{k} \in\{1,2\} \tag{2}
\end{equation*}
$$

It also follows from the construction of $\left\{V_{k}\right\}_{k=0}^{\ell}$ that when two vectors in $V_{k-1}$ are added to obtain a vector in $V_{k}$, the difference of the vectors $v_{k-1}^{(1)}-v_{k-1}^{(2)}$ and $v_{k-1}^{\left(j_{k}\right)}-v_{k-1}^{(3)}$ must belong to the set $\{ \pm(P+Q), \pm(P-Q)\}$ and $\{ \pm P, \pm Q\}$, respectively. Therefore, the new binary chain $\left\{V_{k}\right\}_{k=0}^{\ell}$ can be associated with what we call the differences sequence

$$
\begin{equation*}
\mathrm{DS}=\left\{\left(\left(a_{k}, b_{k}\right),\left(c_{k}, d_{k}\right)\right)\right\}_{k=1}^{\ell} \tag{3}
\end{equation*}
$$

where $a_{k}, b_{k}, c_{k}, d_{k} \in\{-1,0,1\}, a_{k}$ and $b_{k}$ are nonzero, exactly one of $c_{k}$ and $d_{k}$ is zero, and $\left(\left(a_{k}, b_{k}\right),\left(c_{k}, d_{k}\right)\right)$ represents $a_{k} P+b_{k} Q$ and $c_{k} P+d_{k} Q$.

Table III. An example to compute $71 P+93 Q$ using $B-N B C$.

| k | CS | DS | $v_{k+1}^{(1)}$ | $v_{k+1}^{(2)}$ | $v_{k+1}^{(3)}$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 0 |  |  | $P+Q$ | $2 P+2 Q$ | $P+2 Q$ |
| 1 | $(1,1)$ | $(-1,-1),(0,-1)$ | $3 P+3 Q$ | $2 P+2 Q$ | $2 P+3 Q$ |
| 2 | $(3,1)$ | $(1,1),(1,0)$ | $5 P+5 Q$ | $4 P+6 Q$ | $5 P+6 Q$ |
| 3 | $(2,2)$ | $(1,-1),(-1,0)$ | $9 P+11 Q$ | $8 P+12 Q$ | $9 P+12 Q$ |
| 4 | $(3,1)$ | $(1,-1),(0,-1)$ | $17 P+23 Q$ | $18 P+24 Q$ | $18 P+23 Q$ |
| 5 | $(3,2)$ | $(-1,-1),(0,1)$ | $35 P+47 Q$ | $36 P+46 Q$ | $36 P+47 Q$ |
| 6 | $(3,1)$ | $(-1,1),(-1,0)$ | $71 P+93 Q$ | $72 P+94 Q$ | $71 P+94 Q$ |



Figure 2. The data dependency graph for computing double point multiplication.

To summarize, given two positive integers $a, b \in \mathbb{Z}$ and two group elements $P, Q \in \mathbb{G}$, the new binary chain for $(a, b)$ allows us to generate the chain sequence CS and the differences sequence DS as described in the previous paragraph. We can then compute $a P+b Q$ at a cost of $(2 A+1 D) \cdot \max \left(\left\lceil\log _{2} a\right\rceil,\left\lceil\log _{2} b\right\rceil\right)$, where $A$ and $D$ represent the cost of addition and doubling in $\mathbb{G}$, respectively. The chain sequence $C S$ specifies the input to the doubling and addition operations at each iteration. The differences sequence DS encodes the differences of the points that are the input points to the addition operations at each iteration. Note that if $P$ and $-P$ can be identified with a same string $S_{P}$ that only depends on $P$ for all $P \in \mathbb{G}$, then the differences of the points encoded by the differences sequence during the computation of $m P+n Q$ can be identified only with $S_{P}, S_{Q}, S_{P+Q}$, and $S_{P-Q}$. Table III presents an example for computing $71 P+93 Q$.

The computation of double point multiplication in $E_{W, a, b}\left(\mathbb{F}_{2^{\ell}}\right)$ (see (1)) can be performed using differential point addition and doubling formulas. As mentioned earlier, its per-bit cost is $2 A+1 D$, and one can employ two point addition circuits and one point doubling circuit in parallel to reduce the latency. The mixed projective differential point addition and doubling formulas for generic elliptic curves over $\mathbb{F}_{2^{\ell}}$ are defined as [16]

$$
\begin{align*}
Z_{A d d} & =\left(X_{1} \cdot Z_{2}+X_{2} \cdot Z_{1}\right)^{2} \\
X_{A d d} & =x \cdot Z_{A d d}+\left(X_{1} \cdot Z_{2}\right) \cdot\left(X_{2} \cdot Z_{1}\right) \tag{4}
\end{align*}
$$

and

$$
\begin{align*}
Z_{D b l} & =\left(X_{2} \cdot Z_{2}\right)^{2} \\
X_{D b l} & =X_{2}^{4}+b \cdot Z_{2}^{4} \tag{5}
\end{align*}
$$

where $P_{i}=\left[X_{i}, Y_{i}, Z_{i}\right], P_{1}+P_{2}=\left[X_{A d d}, Y_{A d d}, Z_{A d d}\right], 2 P_{2}=\left[X_{D b l}, Y_{D b l}, Z_{D b l}\right]$ in projective coordinates, and $P_{1}-P_{2}=$ $(x, y)$. The combined point addition and doubling requires [16] $6 \mathbf{M}+5 \mathbf{S}+3 \mathbf{A}$ over $\mathbb{F}_{2^{\ell}}$, where $\mathbf{M}, \mathbf{S}$, and $\mathbf{A}$, are the costs of multiplication, squaring and addition in $\mathbb{F}_{2^{\ell}}$, respectively. Note that in hardware the fastest possible implementation of combined point addition and doubling utilizes 3 parallel multipliers, and its latency is two multiplications.

In Fig. 2, the data dependency graph for computing two differential point additions and one point doubling in parallel is illustrated. As one can see, it requires five parallel finite field multipliers, two circuits to perform double squaring, one circuit to perform single squaring, and three adders to operate in parallel. The critical path has two field multipliers, two field adders, and one field squarer. As one can see in Fig. 2 we achieved $100 \%$ multiplier utilization employing 5 field multipliers. The differential input of DPA-1 is denoted by $x_{\text {diff }}$ which is either $x_{(P+Q)}$ or $x_{(-P+Q)}$. Also, the differential input of DPA-2 is denoted by $x_{P} / x_{Q}$ which could be $x_{P}$ or $x_{Q}$ based on the given DS sequence. As one can see, the latency of computing double point multiplication without considering coordinate conversion is about $\approx(l-1) \times(2 M+5)$ clock cycles.

## D. The AK-DAC algorithm

Let $a$ and $b$ be two positive integers. In order to compute $a P+b Q, A K-D A C$ starts with the initial values $d=a, e=b$, $\vec{R}=(P, Q), \vec{u}=(1,0), \vec{v}=(0,1)$, and $\vec{\Delta}=(1,-1)$. We also define $R_{u}=\vec{u} \cdot \vec{R}, R_{v}=\vec{v} \cdot \vec{R}$, and $R_{\Delta}=\vec{\Delta} \cdot \vec{R}$. The initial

```
Algorithm \(2 A K-D A C\) double point multiplication algorithm
Inputs: \(a>0, b>0, P, Q\)
Output: \(a P+b Q\)
    \(d \leftarrow a, e \leftarrow b, \vec{u} \leftarrow(1,0), \vec{v} \leftarrow(0,1), \vec{\Delta} \leftarrow(1,-1)\)
    \(R_{u} \leftarrow P, R_{v} \leftarrow Q, R_{\Delta} \leftarrow P-Q\)
    While \(d \neq e\) do
        Execute the first applicable rule in Table IV
    end While
    Using single point multiplication with input \(d\) and
    \(\left(R_{u}+R_{v}\right)\), compute and return \(d\left(R_{u}+R_{v}\right)\)
```

Table IV. Update rules for double point multiplication

| Rule | Condition | $d$ | $e$ | $\vec{u}$ | $\vec{v}$ | $\vec{\Delta}$ | $R_{u}$ | $R_{v}$ | $R_{\Delta}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| R 1 | $d \equiv e(\bmod 2)$ and $d>e$ | $(d-e) / 2$ | $e$ | $2 \vec{u}$ | $\vec{u}+\vec{v}$ | $\vec{\Delta}$ | $2 R_{u}$ | $R_{u}+R_{v}$ | $R_{\Delta}$ |
| $\mathrm{R}^{\prime}$ | $d \equiv e(\bmod 2)$ and $d<e$ | $d$ | $(e-d) / 2$ | $\vec{u}+\vec{v}$ | $2 \vec{v}$ | $\vec{\Delta}$ | $R_{u}+R_{v}$ | $2 R_{v}$ | $R_{\Delta}$ |
| R2 | $d \equiv 0(\bmod 2)$ | $d / 2$ | $e$ | $2 \vec{u}$ | $\vec{v}$ | $\vec{u}+\vec{\Delta}$ | $2 R_{u}$ | $R_{v}$ | $R_{u}+R_{\Delta}$ |
| R2 $^{\prime}$ | $e \equiv 0(\bmod 2)$ | $d$ | $e / 2$ | $\vec{u}$ | $2 \vec{v}$ | $\vec{\Delta}+(-\vec{v})$ | $R_{u}$ | $2 R_{v}$ | $R_{\Delta}+\left(-R_{v}\right)$ |

values yield $R_{u}=P, R_{v}=Q, R_{\Delta}=R_{u}-R_{v}=P-Q$, and $d R_{u}+e R_{v}=a P+b Q$, and the values $d, e, \vec{u}, \vec{v}, \vec{\Delta}, R_{u}, R_{v}, R_{\Delta}$ are updated so that $d R_{u}+e R_{v}=a P+b Q$ and $R_{\Delta}=R_{u}-R_{v}$ hold, $d, e>0$, and $(d+e)$ decreases until $d=e$. When $d=e$, we will have $a P+b Q=d R_{u}+e R_{v}=d\left(R_{u}+R_{v}\right)$ which can be computed using a single point multiplication algorithm with base $R_{u}+R_{v}$ and scalar $d$. Note that when $\operatorname{gcd}(a, b)=1,(d+e)$ in the algorithm will decrease until $d=e=1$ and we have $a P+b Q=d\left(R_{u}+R_{v}\right)=R_{u}+R_{v}$.

It is discussed in [4] that, if a and b are $\ell$-bit integers, then $a P+b Q$ can on average be computed in about $1.4 \ell$ additions and $1.4 \ell$ doublings. Moreover addition and doubling operations can be performed using differential addition and differential doubling formulas as the difference of the group elements to be added are known by construction. We give an example in Table V to show intermediate values of Algorithm 2 with input $a=71, b=93, P, Q$ and $P-Q$. Note that in step 6 of Algorithm 2, we have $d=1, R_{u}=31 P+37 Q, R_{v}=40 P+56 Q$, and the output is $R_{u}+R_{v}=71 P+93 Q$, as required.
The data dependency graph for computing double point multiplication employing four parallel multipliers, three squarers, and two adders is illustrated in Fig. 3 based on differential point addition and doubling formulae given in [22]. One should note that the difference of two points is given in projective coordinates as we need to update them at each iteration based on the conditions given in Algorithm 2. As one can see, we first perform data-flow analysis for ECC computations to understand how data has to move between the different logic and computational elements such as field multipliers, adders, and squarers. Then, we perform a latency analysis to determine where potential bottlenecks may occur and then find a balance between desired performance and the cost of implementing the design. Therefore, the latency of computing double point multiplication on binary generic curves is $\approx 1.4 \times(l-1) \times(2 M+9)$, without considering the cost of conversion from mixed projective coordinates to affine coordinates, where $M$ is the cost of a field multiplication.

## III. Implementations of Double Point Multiplication Algorithms

## A. Hardware Architectures

In this section, we propose hardware architectures for computing double point multiplication algorithms reviewed in Section II, and implement them. The hardware architectures are depicted in the Figs 4 a and 4 b for $B-N B C$ algorithm and $A K-D A C$

Table V. An example to compute $71 P+93 Q$ using $A K-D A C$

| Rule | $d$ | $e$ | $\vec{u}$ | $\vec{v}$ | $\vec{\Delta}$ | $R_{u}$ | $R_{v}$ | $R_{\Delta}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | 71 | 93 | $(1,0)$ | $(0,1)$ | $(1,-1)$ | $P$ | $Q$ | $P-Q$ |
| R1' $^{\prime}$ | 71 | 11 | $(1,1)$ | $(0,2)$ | $(1,-1)$ | $P+Q$ | $2 Q$ | $P-Q$ |
| R1 | 30 | 11 | $(2,2)$ | $(1,3)$ | $(1,-1)$ | $2 P+2 Q$ | $P+3 Q$ | $P-Q$ |
| R2 | 15 | 11 | $(4,4)$ | $(1,3)$ | $(3,1)$ | $4 P+4 Q$ | $P+3 Q$ | $3 P+Q$ |
| R1 | 2 | 11 | $(8,8)$ | $(5,7)$ | $(3,1)$ | $8 P+8 Q$ | $5 P+7 Q$ | $3 P+Q$ |
| R2 | 1 | 11 | $(16,16)$ | $(5,7)$ | $(11,9)$ | $16 P+16 Q$ | $5 P+7 Q$ | $11 P+9 Q$ |
| R1 $^{\prime}$ | 1 | 5 | $(21,23)$ | $(10,14)$ | $(11,9)$ | $21 P+23 Q$ | $10 P+14 Q$ | $11 P+9 Q$ |
| R1 $^{\prime}$ | 1 | 2 | $(31,37)$ | $(20,28)$ | $(11,9)$ | $31 P+37 Q$ | $20 P+28 Q$ | $11 P+9 Q$ |
| R2 $^{\prime}$ | 1 | 1 | $(31,37)$ | $(40,56)$ | $(-9,-19)$ | $31 P+37 Q$ | $40 P+56 Q$ | $-9 P-19 Q$ |



Figure 3. Data dependency graph for differential point addition and doubling on binary elliptic curves using four parallel multipliers [23].


Figure 4. Hardware Architecture for computing double point multiplication based on (a) $B-N B C$ algorithm (b) $A K-D A C$ algorithm.
algorithm, respectively.Since the architectures of the naive method and the $J T-\{ \pm 1, \pm 3\}$ algorithm are rather straightforward, we do not include them in the following.

## B. Arithmetic Unit

The arithmetic unit is the main part for each architecture which is composed of $\mathbb{F}_{2^{\ell}}$ adders, squarers, and finite field multipliers as described in the following.

1) Addition and Squaring: Addition of two field elements, say, $A=\sum_{i=0}^{l-1} a_{i} \alpha^{i}=\left(a_{l-1}, \cdots, a_{1}, a_{0}\right)$ and $B=\sum_{i=0}^{l-1} b_{i} \alpha^{i}=$ $\left(b_{m-1}, \cdots, b_{1}, b_{0}\right)$ in $\mathbb{F}_{2^{\ell}}$ represented by polynomial basis is $C=A+B$ and can be obtained by pair-wise addition of the coordinates of $A$ and $B$ over $\mathbb{F}_{2}$ (i.e., modulo 2 addition) as $c_{i}=a_{i} \oplus b_{i}$. Addition requires only one clock cycle to store the results in the registers.

For squaring an element $A \in \mathbb{F}_{2^{\ell}}$, we first simply insert zeros between each bit in the bit-vector representing $A$ which must be followed by a reduction operation as $A^{2}=\sum_{i=0}^{l-1} a_{i} \alpha^{2 i} \bmod f(x)$. The reduction, $\bmod f(x)(f(x)$ is a degree- $l$ irreducible polynomial) is computed using XOR and shift operations only. Squaring is a simply hardwired permutations (inserting zeros) and requires only one clock cycle over $\mathbb{F}_{2^{\ell}}$ (note that the irreducible polynomial is fixed in this work).
2) Multiplication: Finite field multipliers are available in bit-level (with area complexity of $O(\mathrm{~m})$ and time complexity of $O(m)$ ), digit-level (with area complexity of $O(m \delta)$ and time complexity of $O(m / \delta)$ ), and bit-parallel (with area complexity of $O\left(m^{2}\right)$ for quadratic and $O\left(m^{\log _{2} 3}\right)$ for subquadratic with time complexity of $\left.O(1)\right)$ architectures depending on the available resources. The digit-level polynomial basis multiplier architecture proposed in [21] is used in this work. For the binary extension field $\mathbb{F}_{2^{233}}$, recommended by NIST [1], the irreducible polynomial is $\mathbb{F}_{2^{\ell}}: \mathbb{F}_{2^{233}}=\mathbb{F}_{2}[x] / x^{233}+x^{74}+1$ is a trinomial. In Fig. 5, the polynomial basis digit-level multiplier with serial-in parallel-out (SIPO) architecture is depicted. As one can see, in


Figure 5. Then architecture of employed digit-level SIPO polynomial basis multiplier for trinomial irreducible polynomials [21].

Table VI. The FPGA implementation results of different double point multiplication algorithms over $G F\left(2^{233}\right)$ on Xilinx Virtex-4.

| Naive Method 6 Mults. (Section II-A) |  |  |  |  |  |  | $B-N B C 5$ Mults. (Section II-C) [6] |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $d$ | $q$ | Latency | CPD | Time | Area | AT | Latency | CPD | Time | Area | AT |
|  |  | [\# Clock cycles] | [ $n \mathrm{~s}$ ] | [ $\mu \mathrm{s}$ ] | [\# Slices] | Area $\times$ Time | [\# Clock cycles] | [ $n \mathrm{~S}$ ] | [ $\mu \mathrm{s}$ ] | [\# Slices] | Area $\times$ Time |
| 7 | 34 | 17,937 | 3.40 | 60.9 | 6,218 | 0.38 | 17,828 | 3.38 | 60.2 | 5,207 | 0.31 |
| 13 | 18 | 10,305 | 3.93 | 40.4 | 9,693 | 0.39 | 10,244 | 3.90 | 39.9 | 8,117 | 0.32 |
| 18 | 13 | 7,920 | 3.97 | 31.4 | 11,335 | 0.35 | 7,874 | 3.91 | 30.7 | 9,492 | 0.29 |
| 26 | 9 | 6012 | 4.31 | 25.9 | 16,612 | 0.43 | 5,978 | 4.29 | 25.7 | 13,911 | 0.35 |
| $J T-\{ \pm 1, \pm 3\} 4$ Mults. (Section II-B) [15] |  |  |  |  |  |  | $A K-D A C 4$ Mults. (Section II-D) [4] |  |  |  |  |
| 7 | 34 | 40,057 | 3.42 | 136.9 | 4,196 | 0.57 | 25,437 | 3.38 | 85.9 | 4,146 | 0.35 |
| 13 | 18 | 23,145 | 3.98 | 92.1 | 6,541 | 0.60 | 14,884 | 3.88 | 57.7 | 6,462 | 0.37 |
| 18 | 13 | 17,860 | 4.01 | 71.6 | 7,649 | 0.54 | 11,586 | 3.97 | 45.9 | 7,557 | 0.34 |
| 26 | 9 | 13,632 | 4.33 | 59.1 | 11,210 | 0.66 | 8,947 | 4.28 | 38.2 | 11,075 | 0.42 |

each clock cycle $\delta$ coefficients of the operand $A$ are processed having all bits of operand $B$ available through multiplication process. In this architecture, the $J$ blocks perform bit-wise AND operation as $a_{i} \odot B$. The $\times x^{i}$ blocks perform corresponding shift operations and are only wiring. Once the $\delta$ partial products are computed at the output of $J$ blocks, they are multiplied by $x^{i}, 1 \leq i \leq \delta$ and then reduced using mod $f(x)$ blocks. The $\mathbb{F}_{2^{\ell}}$ adder block performs addition (XOR) over $\delta+1 l$-bit field elements. Therefore, the critical-path delay of the $\mathbb{F}_{2^{\ell}}$ adder is $\left(\left\lceil\log _{2}(\delta+1)\right\rceil\right) T_{X}$. For multiplier operation, first the registers $\langle Y\rangle$ and $\langle Z\rangle$ are preloaded with the operand $B$ and zero $\left(0 \in \mathbb{F}_{2^{\ell}}\right)$, respectively. The register $\langle X\rangle$ provides in each clock cycle $d$ bits of operand $A$. Then, the results of the multiplication are available after $M_{q}=\left\lceil\frac{l}{\delta}\right\rceil, 1 \leq \delta \leq m$ clock cycles in the register $\langle Z\rangle$. The main advantage of this multiplier is that it operates in higher clock frequencies in comparison to the counterparts available in the literature such as Karatsuba multiplier.
3) Inversion: Inversion is the most expensive operation and can be computed using the Extended Euclidean Algorithm (EEA) or Fermat's Little Theorem (FLT) [9]. Base on FLT, one can write $A^{2^{l}-2}=A^{-1}$ whose computation requires $l-1$ squarings and $l-2$ multiplications as $2^{l}-2=(11 \cdots 110)_{2}$. However, Itoh and Tsujii (IT) [14] proposed an efficient algorithm for computing inversion over $\mathbb{F}_{2^{\ell}}$. The IT scheme requires $\left\lfloor\log _{2}(l-1)\right\rfloor+H W(l-1)-1$ multiplications and $l-1$ squarings, where $H W(l-1)$ is the Hamming weight (number of ones) of the binary representation of $l-1$. Inversion over $\mathbb{F}_{2233}$ using Itoh-Tsujii scheme [14] requires $\left\lfloor\log _{2}(l-1)\right\rfloor+h(l-1)-1=10$ multiplications and $l-1=232$ squarings.

## C. Control Unit and Memory

The control unit is designed with a finite state machine (FSM) based on the double point multiplication algorithms given in the previous sections. It schedules the computation tasks by generating the signals and switching the operands for arithmetic units. The intermediate results are stored in the register files. We note that the control unit is simpler and requires smaller area than the other units in the data path. Since it is implemented as a FSM, it can easily mapped into the FPGA by the synthesis tools. To store the input and output points and the intermediate results we employed a register file using flip-flops of the FPGA. Also, several multiplexers are employed to chose appropriate registers and connect to the arithmetic unit.


Figure 6. Implementation results of different double point multiplication algorithms and their comparison to the counterparts in terms of (a) digit-size and area-time products and (b) area and computation time over $G F\left(2^{233}\right)$ on Xilinx Virtex-4 FPGA.

## D. Implementation Results and Comparisons

In this section, we implement the proposed architecture for double point multiplication in the previous sections to evaluate its area and time requirements. We have selected the Xilinx ${ }^{\circledR}$ Virtex ${ }^{\mathrm{TM}}-4 \mathrm{xc} 4 \mathrm{vlx} 200$ device as the target FPGA. The proposed architecture is modeled in VHDL and synthesized for different digit sizes using $\mathrm{XST}^{\mathrm{TM}}$ of Xilinx ${ }^{\circledR}{ }^{\circledR}$ ISE $^{\mathrm{TM}}$ version 12.1 design software. The results of implementations of double point multiplication algorithms based on the proposed hardware architectures are reported in Table VI for $l=233$ and different digit sizes $d=\{7,13,18,26\}$. As shown in this table, we provided the latency (number of clock cycles), total time of computation, critical path delay (CPD), occupied area (number of slices) and area-time products. The Naive method requires largest area in comparison to the other schemes and $A K-D A C$ scheme requires the smallest area. The $B-N B C$ scheme provides fastest results and best area-time trade-offs in comparison to the counterparts. The $J T-\{ \pm 1, \pm 3\}$ is the slowest method and is not efficient in terms of time-area trade-offs. In Fig. 6, we plotted the area, time, and area-time results in terms of the digit-size of different scheme for more clarification.

In Fig. 6, implementation results of different double point multiplication algorithms and its comparison to the traditional method is illustrated. In Fig. 6a, we plot area-time products in terms of the digit-size and in Fig. 6b the area given by number of occupied slices is plotted in terms of time of computing double point multiplication.

## IV. Conclusion

In this paper, efficient implementation of double point multiplication over binary elliptic curves is presented. We provide a comprehensive analysis and comparison of double point multiplication algorithms based on differential addition chains, binary double and add method, and naive method. We investigate the performance and efficiency of these schemes based on the required area and time of computation. Our results indicate that the differential addition chain based schemes are the most suitable schemes for computing double point multiplication. For instance, we show that the scheme proposed in [6] provides the fastest double point multiplication, and the one presented in [4] requires the smallest silicon area for simultaneous computation.

## REFERENCES

[1] National Institute of Standards and Technology. Digital Signature Standard, 186-2, January 2000.
[2] T. Akishita. Fast Simultaneous Scalar Multiplication on Elliptic Curve with Montgomery Form. Selected Areas in Computer Science SAC 2001, LNCS, 2259:225-267, 2001.
[3] E. Al-Daoud, R. Mahmod, M. Rushdan, and A. Kilicman. A New Addition Formula for Elliptic Curves Over GF (2m). IEEE Transactions on Computers, 51(8):972-975, 2002.
[4] R. Azarderakhsh and K. Karabina. A New Double Point Multiplication Algorithm and its Application to Binary Elliptic Curves with Endomorphisms. IEEE Transactions on Computers, to appear:pp, 2013.
[5] Reza Azarderakhsh and Arash Reyhani-Masoleh. High-Performance Implementation of Point Multiplication on Koblitz Curves. IEEE Trans. on Circuits and Systems, 60-II(1):41-45, 2013.
[6] D. Bernstein. Differential addition chains. Technical report, 2006. Available at http://cr.yp.to/ecdh/diffchain-20060219.pdf.
[7] C. Clavier and M. Joye. Universal Exponentiation Algorithm - A First Step towards Provable SPA-Resistance. Lecture Notes in Computer Science, CHES 2001, 2162:300-308, 2001.
[8] J-S. Coron. Resistance against differential power analysis for elliptic curve cryptosystems. Lecture Notes in Computer Science, CHES 1999, 1717:292-302, 1999.
[9] K. Fong, D. Hankerson, J. López, and A. Menezes. Field inversion and point halving revisited. IEEE Transactions on Computers, pages 1047-1059, 2004.
[10] D. Galbraith, X. Lin, and M. Scott. Endomorphisms for Faster Elliptic Curve Cryptography on a Large Class of Curves. Journal of Cryptology, 24:446-469, 2011.
[11] R. Gallant, R. Lambert, and S. Vanstone. Faster point multiplication on elliptic curves with efficient endomorphisms. Advances in Cryptology - CRYPTO 2011, LNCS, 2139:190-200, 2001.
[12] D. Hankerson, K. Karabina, and A. Menezes. Analyzing the Galbraith-Lin-Scott point multiplication method for elliptic curves over binary fields. IEEE Transactions on Computers, 58:1411-1420, 2009.
[13] D.R. Hankerson, S.A. Vanstone, and A.J. Menezes. Guide to Elliptic Curve Cryptography. Springer-Verlag New York Inc, 2004.
[14] Toshiya Itoh and Shigeo Tsujii. A Fast Algorithm for Computing Multiplicative Inverses in $G F\left(2^{m}\right)$ Using Normal Bases. Information Computing, 78(3):171-177, 1988.
[15] M. Joye and M. Tunstall. Exponent recoding and regular exponentiation algorithms. Lecture Notes in Computer Science, AFRICACRYPT 2009, 5580:334-349, 2009.
[16] J. López and R. Dahab. Fast Multiplication on Elliptic Curves Over $G F\left(2^{m}\right)$ Without Precomputation. In Proceedings of Workshop on Cryptographic Hardware and Embedded Systems (CHES 1999), pages 316-327, 1999.
[17] A. Menezes, P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. New York, 1996.
[18] B. Möller. Algorithms for Multi-exponentiation. Selected Areas in Computer Science SAC 2001, LNCS, 2259:165-180, 2001.
[19] P. Montgomery. Evaluating recurrences of form $X_{m+n}=f\left(X_{m}, X_{n}, X_{m-n}\right)$ via Lucas chains. www.cwi.nl/ftp/pmontgom/Lucas.ps.gz, December 13, 1983; Revised March, 1991 and January, 1992.
[20] J.M. Pollard. Monte Carlo Methods for Index Computation (mod p). Mathematics of computation, 32(143):918-924, 1978.
[21] Chang Shu, Soonhak Kwon, and Kris Gaj. Reconfigurable Computing Approach for Tate Pairing Cryptosystems over Binary Fields. IEEE Transactions on Computers, 58(9):1221-1237, 2009.
[22] M. Stam. On Montgomery-Like Representations for Elliptic Curves over $G F\left(2^{k}\right)$. (PKC 2003), pages 240-253.
[23] M. Stam. On Montgomery-like Representations for Elliptic Curves Over GF (2 $\left.{ }^{k}\right)$. In Proceedings of The 3rd International Conference on Practice and Theory of Public Key Cryptography (PKC 2003), pages 240-254, 2003.


[^0]:    R. Azarderakhsh is with the Department of Combinatorics and Optimization, Center for Applied Cryptographic Research (CACR), University of Waterloo, Waterloo, Ontario, Canada N2L 3G1. E-mail address: razarder@uwaterloo.ca.
    K. Karabina is with the Department of Mathematics, Bilkent University, Bilkent, Ankara, Turkey, 06800. E-mail address: karabina@fen.bilkent.edu.tr.
    ${ }^{1}$ Here, "secure" implies resistance against side channel analysis atacks.

