forked from kalray/Benchmark_FFT
# Kalray Inc.: http://www.kalrayinc.com/
#
# Bostan Fast-Fourier Transform Implementation (64K complex-float)
# Developed by J. Hascoet
# Description:
# This distributed FFT implementation uses the 6-step method to split the work
# over the compute clusters of the MPPA processor.
# The FFT 6-step method is well described in [1], page 7.
# However, our current implementation only supports input array sizes that are
# a power of 4 (e.g. 64K = 256 x 256), so that the matrix to transpose is square
# (the transposes are steps 1, 3 and 6 of the 6-step method).
# In this benchmark, the IO generates the input buffer in the DDR.
# The input is a 1D complex array whose imaginary parts are all zero
# and whose real parts are random numbers.
# First, the compute clusters get a tile of the 1D array interpreted as a
# 2D array (tiling).
# Second, all compute clusters (CC) execute the 6-step FFT. All twiddle factors
# are pre-computed.
# Finally, the result is written back to the DDR in the IO, which executes
# a sequential FFT and performs a correctness check.
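# The flow described above can be illustrated with a small host-side sketch in
# Python. This is not the benchmark's MPPA code: it runs on a single machine,
# uses a simple recursive radix-2 FFT as the "sequential FFT" reference, and
# follows the commonly described ordering of the 6-step method (transpose, row
# FFTs, twiddle multiplication, transpose, row FFTs, transpose). The
# benchmark's own step numbering and kernels may differ. The names fft,
# transpose and six_step_fft are hypothetical.

```python
import cmath
import random


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def six_step_fft(x, n1, n2):
    """FFT of x (length n1*n2) via the 6-step decomposition.

    n1 and n2 must be powers of two here because of the radix-2 helper.
    """
    n = n1 * n2
    assert len(x) == n
    # Interpret the 1D array as an n1 x n2 matrix (row-major).
    a = [list(x[r * n2:(r + 1) * n2]) for r in range(n1)]
    # Twiddle factors, computed up front (the benchmark keeps them in a LUT).
    w = [[cmath.exp(-2j * cmath.pi * j2 * k1 / n) for k1 in range(n1)]
         for j2 in range(n2)]
    b = transpose(a)                                   # 1: n2 x n1
    c = [fft(row) for row in b]                        # 2: n2 FFTs of size n1
    c = [[c[j2][k1] * w[j2][k1] for k1 in range(n1)]   # 3: twiddle multiply
         for j2 in range(n2)]
    d = transpose(c)                                   # 4: n1 x n2
    e = [fft(row) for row in d]                        # 5: n1 FFTs of size n2
    f = transpose(e)                                   # 6: n2 x n1
    return [v for row in f for v in row]


if __name__ == "__main__":
    # Same kind of input as the benchmark: random real part, zero imaginary.
    random.seed(0)
    x = [complex(random.random(), 0.0) for _ in range(256)]
    ref = fft(x)                      # sequential reference
    out = six_step_fft(x, 16, 16)     # square 16 x 16 decomposition
    print("max abs error:", max(abs(p - q) for p, q in zip(out, ref)))
```

# In a distributed setting, the row FFTs of steps 2 and 5 are independent of
# each other, which is what allows the rows to be split over several clusters.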
# References:
# [1] 'https://www.nas.nasa.gov/assets/pdf/techreports/1989/rnr-89-004.pdf'
# Performance measures:
# We measure both the DDR access time (I/O) and the computation time
# on the MPPA matrix.
# There is no batching (batch-1), so the throughput is directly determined by
# the latency (one transform per latency period). It is a low-latency
# implementation.
# The time to initialize the twiddle-factor LUT is not measured
# (it is treated as system initialization).
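# As a rough illustration of how a batch-1 latency translates into throughput,
# here is a minimal Python sketch. The function names are hypothetical, the
# 5*N*log2(N) flop count is only the nominal convention often used for complex
# FFT benchmarks (not an exact operation count of this implementation), and the
# latency value in the usage example is a placeholder, not a measured result.

```python
import math


def nominal_fft_flops(n):
    """Nominal flop count 5*N*log2(N) for an N-point complex FFT (convention)."""
    return 5.0 * n * math.log2(n)


def batch1_metrics(n, latency_s):
    """Derive throughput figures from a single-transform (batch-1) latency."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return {
        # With no batching, one transform completes per latency period.
        "transforms_per_s": 1.0 / latency_s,
        "nominal_gflops": nominal_fft_flops(n) / latency_s / 1e9,
    }


if __name__ == "__main__":
    # Placeholder latency of 1 ms for a 64K-point transform (hypothetical).
    print(batch1_metrics(64 * 1024, 1e-3))
```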
# Requirements:
# This benchmark requires Kalray's AccessCore Toolchain and Kalray's MPPA hardware.
# Validated with Kalray's AccessCore >= 2.9.0.
# Multi-cluster - Matrix topology condition
# Only nb_cluster=1, 2, 4, 8 or 16 are supported (selected at build time)
# Intra-cluster
# The number of cores can range from 1 to 16 (nb_core variable at build time).
# How to execute on MPPA hardware
# By default 16 clusters and 16 cores in each cluster are used.
# Using only JTAG (no PCIe, standalone mode)
make nb_core=<NUM_CORE> nb_cluster=<NUM_CLUSTER> [stand_alone_board=<ab01|ab04>] run_jtag
# Using PCIe
make nb_core=<NUM_CORE> nb_cluster=<NUM_CLUSTER> run_pcie
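# For scripted runs, the constraints and command lines above can be wrapped in
# a small helper. This is only a sketch: make_command is a hypothetical name,
# and it simply checks the documented limits (nb_cluster in 1, 2, 4, 8, 16;
# nb_core from 1 to 16; stand_alone_board ab01 or ab04 for run_jtag) before
# composing the argument list. It assumes stand_alone_board only applies to
# run_jtag, since it is only shown for that target.

```python
VALID_CLUSTERS = (1, 2, 4, 8, 16)
VALID_BOARDS = ("ab01", "ab04")
VALID_TARGETS = ("run_jtag", "run_pcie")


def make_command(nb_core=16, nb_cluster=16, target="run_jtag",
                 stand_alone_board=None):
    """Return the make invocation as an argument list, after validation."""
    if nb_cluster not in VALID_CLUSTERS:
        raise ValueError("nb_cluster must be one of 1, 2, 4, 8, 16")
    if not 1 <= nb_core <= 16:
        raise ValueError("nb_core must be between 1 and 16")
    if target not in VALID_TARGETS:
        raise ValueError("target must be run_jtag or run_pcie")
    cmd = ["make", f"nb_core={nb_core}", f"nb_cluster={nb_cluster}"]
    if stand_alone_board is not None:
        # Assumption: the board option is only meaningful in JTAG mode.
        if target != "run_jtag":
            raise ValueError("stand_alone_board is only used with run_jtag")
        if stand_alone_board not in VALID_BOARDS:
            raise ValueError("stand_alone_board must be ab01 or ab04")
        cmd.append(f"stand_alone_board={stand_alone_board}")
    cmd.append(target)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) on a machine with the toolchain installed.
    print(" ".join(make_command(nb_core=8, nb_cluster=4, target="run_pcie")))
```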