Written by Noda.
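A memo for people who want to do high-level synthesis with OpenCL.
With this you can at least get an addition running (just getting it to run was the hard part...).
The host code feels messy, but it works for now, so please forgive me.
Use it as a starting point to build whatever circuit you like.
Circuit optimization is not covered here; if you want the details, see the references below.
This is a simple guide to implementing a circuit on the Arria 10 SoC using the high-level synthesis environment "Intel FPGA SDK for OpenCL". After reading this memo, you should be able to implement a simple addition circuit on the Arria 10 with that environment.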
If you want to optimize your code, please read the following references.
https://www.altera.com/en_US/pdfs/literature/hb/opencl-sdk/aocl_getting_started.pdf
https://www.altera.com/en_US/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf
Arria 10 SoC is a system on chip from Intel that combines an ARM CPU and an FPGA.
The strength of Arria 10 is that hard macros (DSP blocks) for float operations are embedded in the FPGA. Please note that the DSP blocks do not support double operations.
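One practical consequence is that it is easy to end up with double arithmetic by accident. The sketch below is plain C++, not kernel code, and the function names are hypothetical; it only shows the C/C++ promotion rule that an unsuffixed literal such as 0.5 is a double. Whether this matters in your kernel depends on the compiler, so treat it as a point to check rather than a rule from the SDK.

```cpp
#include <type_traits>

// Unsuffixed literal: 0.5 is a double, so the product is computed in double.
inline double scale_promoted(float x) {
    static_assert(std::is_same<decltype(x * 0.5), double>::value,
                  "float * double literal is promoted to double");
    return x * 0.5;
}

// With the f suffix the whole expression stays in single precision.
inline float scale_single(float x) {
    static_assert(std::is_same<decltype(x * 0.5f), float>::value,
                  "float * float literal stays float");
    return x * 0.5f;
}
```

Both functions return the same value for an input like 3.0f; the difference is only the type in which the multiplication is carried out.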
OpenCL is a framework for parallel computation in heterogeneous environments (CPU + GPU, CPU + FPGA, etc.). Intel FPGA SDK for OpenCL is an OpenCL-based HLS environment for FPGAs provided by Intel, and it is said that it lets you describe a high-performance FPGA circuit in a short time (is it true?).
In an OpenCL environment we use host code and kernel code. The former is code for the control processor (the ARM on Arria 10 SoC in this case) and is written in C++ with the OpenCL API. The latter is code for the arithmetic cores (the FPGA on Arria 10 SoC in this case) and is written in the OpenCL C language.
In this HLS environment, a Board Support Package (BSP) is provided for each board, so users do not need to develop the I/O interfaces connecting the CPU, FPGA, and memory. The BSP can be rewritten, but doing so is likely to break the environment, so it is better not to touch it. If you want to try anyway, take a backup first.
The sample addition code is placed in the following directory.
/home/asap2/noda/arria_test
There are two directories, "add" and "common". The addition code is in "add", and it will not work without the "common" directory. Copy the "arria_test" directory and move into the "add" directory. Below, we assume that your current directory is "add".
The host code is in "host/src/main.cpp", and the kernel code is in "device/add.cl". Besides these, there are several shell scripts, the so-called "secret sauce" (秘伝のタレ).
This code sums up 100 randomly generated elements. To verify the calculation, the result calculated on the FPGA is compared with the one calculated on the CPU. The code also measures the kernel calculation time and the turn-around time.
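The verification flow described here can be sketched in plain C++ as follows. This is a minimal sketch and not the actual "host/src/main.cpp": the function names (make_input, sum_reference, verify), the fixed seed, the value range, and the tolerance are all assumptions made for illustration, and the FPGA result is simply passed in as a number.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: generate n random floats in the range 0 to 1.
// The seed is fixed only so that repeated runs give the same input.
std::vector<float> make_input(std::size_t n, unsigned seed = 1) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> v(n);
    for (float &x : v) x = dist(rng);
    return v;
}

// CPU reference: accumulate in single precision, in index order.
float sum_reference(const std::vector<float> &v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Compare the device output with the reference using a relative tolerance,
// because a different summation order can change the last few bits.
bool verify(float output, float reference, float rel_tol = 1e-5f) {
    assert(rel_tol >= 0.0f);
    const float scale = std::fmax(1.0f, std::fabs(reference));
    return std::fabs(output - reference) <= rel_tol * scale;
}
```

A host program would call make_input(100), obtain the sum from the device, and print PASS or FAIL depending on verify(output, sum_reference(input)). Using a tolerance instead of exact equality is a design choice of this sketch; the original code may compare differently.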
We will ssh and scp to Arria 10 later, but before that your ".ssh/id_rsa.pub" has to be added to authorized_keys on Arria 10. If you email your public key to noda@am.ics.keio.ac.jp, I will add it. Then you can connect to Arria 10.
Before implementing the circuit on the FPGA, we debug on a CPU machine, neutrino.
First, ssh to neutrino, then copy "~noda/.bash_profile" and source it.
Now that PATH is set, try the emulation. There is a shell script "emu_go" in the directory "add". Running it starts the emulation.
There are various compile options. Check references for details.
The result of execution is as follows.
bash-4.1$ ./emu_go
aoc: Environment checks are completed successfully.
You are now compiling the full flow!!
aoc: Selected target board a10soc_2ddr
aoc: Running OpenCL parser....
aoc: OpenCL parser completed successfully.
aoc: Compiling for Emulation ....
aoc: Emulator Compilation completed successfully.
Emulator flow is successful.
To execute emulated kernel, invoke host with
  env CL_CONTEXT_EMULATOR_DEVICE_ALTERA=1 <host_program>
For multi device emulations replace the 1 with the number of devices you which to emulate
Initializing OpenCL
Platform: Altera SDK for OpenCL
Using 1 device(s)
  EmulatorDevice : Emulated Device
Using AOCX: add.aocx
Arria 10 SoC
Turn_around_Time: 0.712237 ms
Kernel time (device 0)(getStartEndTime): 0.619050 ms
Output: 93.649620
Reference: 93.649620
Verification: PASS
You can check the flow of calculation on the CPU. Debug the host and kernel code until they work properly.
Although we can confirm that the calculation is done correctly, we cannot simulate the execution time (we get numbers, but they are unreliable). Execution time can be measured only after implementing on the FPGA.
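As a rough illustration of what a turn-around measurement involves, the sketch below times an arbitrary piece of work with a wall-clock timer from std::chrono. The helper name elapsed_ms is an assumption, and the actual host code may measure time differently (the "getStartEndTime" label in the output suggests that the kernel time comes from a separate helper).

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: run work() once and return the elapsed
// wall-clock time in milliseconds.
double elapsed_ms(const std::function<void()> &work) {
    assert(static_cast<bool>(work));  // an empty std::function cannot be called
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    work();
    const auto end = clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Wrapping the whole "write inputs, run kernel, read result" sequence in such a call gives a turn-around style number. steady_clock is used here because it never jumps backwards, which makes it a safer choice for intervals than the system clock.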
We can also check FPGA resource usage. Execute "emu_resource" in the directory "device". The execution result is below.
bash-4.1$ cd device/
bash-4.1$ ./emu_resource
aoc: Environment checks are completed successfully.
aoc: Selected target board a10soc_2ddr
aoc: Running OpenCL parser....
aoc: OpenCL parser completed successfully.
aoc: Compiling....
aoc: Linking with IP library ...
+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;    2%                     ;
; ALUTs                                  ;    1%                     ;
; Dedicated logic registers              ;    1%                     ;
; Memory blocks                          ;    3%                     ;
; DSP blocks                             ;    0%                     ;
+----------------------------------------+---------------------------;
aoc: First stage compilation completed successfully.
aoc: To compile this project, run "aoc add.aoco"
Float operations automatically use the DSP. In the table above the DSP usage is 0%, but this seems to be because the circuit is too small; the DSP appears to be used properly.
We have confirmed that the sample addition code works properly, so we move on to the next step.
First, run the shell script "aocx_go" to compile the kernel code. This is very time-consuming: even for this sample code it takes more than 1 hour. We will compile the host code later on the ARM of Arria 10. The execution result is below.
bash-4.1$ ./aocx_go
aoc: Environment checks are completed successfully.
You are now compiling the full flow!!
aoc: Selected target board a10soc_2ddr
aoc: Running OpenCL parser....
aoc: OpenCL parser completed successfully.
aoc: Compiling....
aoc: Linking with IP library ...
+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;    2%                     ;
; ALUTs                                  ;    1%                     ;
; Dedicated logic registers              ;    1%                     ;
; Memory blocks                          ;    3%                     ;
; DSP blocks                             ;    0%                     ;
+----------------------------------------+---------------------------;
aoc: First stage compilation completed successfully.
aoc: Hardware generation completed successfully.
When compilation starts, the directory "to_a10soc" specified in the shell script is created. It contains an intermediate file "add.aoco" and a directory "add" holding various data. After compilation, a binary file "add.aocx" is generated in "to_a10soc".
After compiling the kernel code, transfer the generated aocx file and the (uncompiled) host code to Arria 10 with scp. Here we transfer them using the shell script "go_scp" in the directory "to_a10soc". Please change the transfer destination to your own.
./to_a10soc/go_scp
Then ssh to Arria 10.
ssh root@131.113.69.239
Currently everyone logs in as superuser, so be careful about what you do.
Before compiling, run the following incantation on Arria 10. Ignore the error.
source ~/init_opencl.sh
After that, move to the transfer destination directory on Arria 10. In this example, "~/test/" contains the aocx file, "main.cpp", and a previously prepared "Makefile".
If you prepared your own directory, copy "~/test/Makefile" into it.
Finally, run make to compile "main.cpp". The execution result is shown below. The messages it prints can be ignored.
root@Arria10_linaro:~/test/test_add# make clean
root@Arria10_linaro:~/test/test_add# make all
../common/src/AOCLUtils/opencl.cpp: In function ‘void* aocl_utils::alignedMalloc(size_t)’:
../common/src/AOCLUtils/opencl.cpp:55:49: warning: ignoring return value of ‘int posix_memalign(void**, size_t, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
   posix_memalign (&result, AOCL_ALIGNMENT, size);
                                                 ^
../common/src/AOCLUtils/opencl.cpp: In function ‘bool aocl_utils::setCwdToExeDir()’:
../common/src/AOCLUtils/opencl.cpp:278:14: warning: ignoring return value of ‘int chdir(const char*)’, declared with attribute warn_unused_result [-Wunused-result]
   chdir(path);
              ^
Then, a directory "bin" is created. Inside it is "host", the compiled host code.
Finally, move the aocx file to the directory "bin" and execute the "./bin/host" command. The execution result is as follows.
root@Arria10_linaro:~/test/test_add# ./bin/host
Initializing OpenCL
Platform: Altera SDK for OpenCL
Using 1 device(s)
  a10soc_2ddr : Arria 10 SoC Development Kit
Using AOCX: add.aocx
Reprogramming device with handle 1
Arria 10 SoC
Turn_around_Time: 1.022762 ms
Kernel time (device 0)(getStartEndTime): 0.107940 ms
Output: 93.649620
Reference: 93.649620
Verification: PASS
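The two times reported by this run can be put side by side with simple arithmetic. The sketch below only subtracts and divides numbers taken from the log; the struct and function names are assumptions for illustration, and reading the difference as host-side overhead (data transfer, kernel launch, and so on) is an interpretation, not something the log itself states.

```cpp
#include <cassert>

// Times copied from a run's log, in milliseconds.
struct RunTimes {
    double turn_around_ms;
    double kernel_ms;
};

// Time spent outside the kernel itself.
double non_kernel_ms(const RunTimes &t) {
    assert(t.turn_around_ms >= t.kernel_ms);
    return t.turn_around_ms - t.kernel_ms;
}

// Fraction of the turn-around time spent in the kernel (0.0 to 1.0).
double kernel_fraction(const RunTimes &t) {
    assert(t.turn_around_ms > 0.0);
    return t.kernel_ms / t.turn_around_ms;
}
```

For the FPGA run, RunTimes{1.022762, 0.107940} gives about 0.915 ms outside the kernel, so the kernel accounts for roughly a tenth of the turn-around time. For a circuit this small, the time around the kernel dominates.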
Congrats! Now we are the king of addition!!!
When the kernel code is compiled with the "--profile" option and then run on the FPGA, "profile.mon" is generated in the directory "bin". Transfer the mon file back to neutrino (using go_mon), and execute the "aocl report" command with the "aocx" (also "aoco") file. The GUI profiler then launches. (Do not forget to enable X forwarding.)
bash-4.1$ ./go_mon
Enter passphrase for key '/home/hlab/hoge/.ssh/id_rsa':
profile.mon                                   100%   97     0.1KB/s   00:00
bash-4.1$ aocl report profile.mon add.aocx &
I have run out of energy, so the rest will have to wait for another time. Please add your knowledge to this wiki!!!