
This work studies fast modular multiplication on a state-of-the-art digital signal processor (DSP). More specifically, Montgomery multiplication over a prime field for an arbitrary 256-bit prime p is implemented on the Texas Instruments TMS320C6678 DSP. Two implementations are designed, optimized for latency and for throughput, respectively. Both are based on the k-bit divided Montgomery modular multiplication algorithm by Kornerup. The algorithm is extended to run two independent Montgomery multiplications in parallel, so that it runs efficiently on the target DSP by exploiting its symmetric functional units. The proposed implementations outperform the previous implementation by Itoh et al. in both latency and throughput. The latency of the proposed implementation, 0.496 [\(\upmu \)s], is only 17% of the 2.86 [\(\upmu \)s] of the implementation by Itoh et al. Moreover, the throughput of \(4.03 \times 10^6\) [Montgomery multiplications (MM)/s] is more than 10 times the \(0.37 \times 10^6\) [MM/s] of the previous work.
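For readers unfamiliar with the operation being accelerated, the following is a minimal Python sketch of textbook Montgomery multiplication (REDC) for a 256-bit modulus. It is not the k-bit divided algorithm by Kornerup and not the DSP implementation described above; it only illustrates what a single Montgomery multiplication computes. The function names, the choice R = 2^256, and the sample prime (the secp256k1 field prime) are assumptions made for illustration.

```python
# Textbook Montgomery multiplication (illustrative sketch only).
# Assumptions: R = 2^BITS, p is an odd prime below R, Python 3.8+ for pow(p, -1, R).

BITS = 256
R = 1 << BITS
MASK = R - 1


def mont_setup(p):
    """Return p' = -p^{-1} mod R, the constant needed by the reduction step."""
    return (-pow(p, -1, R)) % R


def mont_mul(a, b, p, p_prime):
    """Return a * b * R^{-1} mod p for 0 <= a, b < p."""
    t = a * b
    # m is chosen so that t + m * p is divisible by R.
    m = ((t & MASK) * p_prime) & MASK
    u = (t + m * p) >> BITS  # exact division by R
    # u < 2p, so at most one conditional subtraction is needed.
    return u - p if u >= p else u


def to_mont(x, p):
    """Map x to its Montgomery representation x * R mod p."""
    return (x * R) % p


def from_mont(x, p, p_prime):
    """Map a Montgomery representation back to the ordinary residue."""
    return mont_mul(x, 1, p, p_prime)


# Hypothetical sample modulus: the secp256k1 field prime.
P = 2**256 - 2**32 - 977
P_PRIME = mont_setup(P)


def modmul_via_montgomery(a, b, p=P, p_prime=P_PRIME):
    """Compute a * b mod p by converting in and out of Montgomery form."""
    am, bm = to_mont(a, p), to_mont(b, p)
    return from_mont(mont_mul(am, bm, p, p_prime), p, p_prime)
```

In practice, operands stay in Montgomery form across a long chain of multiplications, so the conversion cost is paid only once at each end. The two-way parallel extension mentioned in the abstract would correspond to evaluating two independent `mont_mul` calls side by side.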
