Joo Aun Saw
Intel recently introduced a USB device called the Movidius Neural Compute Stick (NCS) that’s capable of running deep neural networks. This convenient USB form factor allows developers to easily add neural network capability to existing hardware and start experimenting with artificial intelligence (AI). The ability to run AI locally on an embedded system with low power consumption, coupled with low device cost, makes AI at the IoT edge a reality. Running the neural network locally has a few advantages over a cloud-based model, such as lower latency and immunity to network outages.
I was given an opportunity to play with the Movidius NCS at our recent DiUS hack day. I tried out the various examples that Intel provides, but I wanted to get my hands dirty, so I decided to add NCS support to an existing piece of software.
My first practical application
I’ve been playing with the Motion open source security camera software for some time now (my weekend adventures with it are well known around the office), and I thought adding a neural network would greatly improve the motion detection system. There’s also an open source project called MotionEye that provides a user-friendly web front end for configuring Motion. Together, Motion and MotionEye make a well-rounded security camera solution, which is perfect as my first target application. Motion is written in C, and fortunately the NC SDK provides a C version of the API.
I implemented the neural network (MVNC) based detection as an alternative to the traditional motion detection system. The neural network detection system is designed to run ChuanQi’s MobileNet SSD model, which has a good balance between detection speed and accuracy. It supports 20 classes of objects, including persons, cats, dogs, and cars. Users can choose which classes of objects to detect and set a confidence threshold. A detected object is only considered valid if the detection confidence is above this threshold. I have been testing this for a few weeks and it has been performing well so far. It detects people reliably at a reasonable frame rate and has been running continuously without a hiccup.
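As a rough illustration of the class selection and confidence threshold described above, here is a minimal Python sketch. It is not the code that went into Motion (which is C). The label order is the one published with the PASCAL VOC MobileNet SSD model, while the function name and the (class_id, confidence) pair layout are hypothetical and chosen only for this example.

```python
# Label order of the PASCAL VOC MobileNet SSD model (index 0 is background).
VOC_LABELS = (
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor",
)


def filter_detections(detections, wanted_classes, threshold):
    """Keep detections of a wanted class whose confidence is above threshold.

    detections: iterable of (class_id, confidence) pairs (hypothetical layout).
    """
    kept = []
    for class_id, confidence in detections:
        label = VOC_LABELS[class_id]
        if label in wanted_classes and confidence > threshold:
            kept.append((label, confidence))
    return kept


# Made-up detections: person 0.92, cat 0.40, car 0.81, person 0.30.
raw = [(15, 0.92), (8, 0.40), (7, 0.81), (15, 0.30)]
print(filter_detections(raw, {"person", "cat"}, 0.5))  # [('person', 0.92)]
```

Only the first person survives: the cat and the second person fall below the threshold, and the car is not one of the selected classes.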
Just a side note: if you are running this on a Raspberry Pi, you may want to record videos at a high frame rate, which requires offloading H264 video encoding to the GPU. Follow these steps to enable the GPU H264 video encoder (h264_omx).
- Disable the ffmpeg OMX zero copy feature. Apply this patch, compile and install. You may have to uninstall your existing ffmpeg library first.
- Remove h264_omx from Motion’s blacklist. Apply this patch to Motion.
- Finally, configure Motion to use h264_omx encoder by selecting h264_omx as the preferred encoder. Example: ffmpeg_video_codec mp4:h264_omx. Note that “ffmpeg_video_codec” has been renamed to “movie_codec” in Motion 4.2.
Various neural network models
The performance and accuracy of the detection system depend heavily on the neural network model used. There are a few popular neural network models designed for visual processing, such as GoogleNet, YOLO, and MobileNet SSD. The table below, from WeiLiu’s GitHub repository, summarises the performance of various models.
The models are put through the PASCAL Visual Object Classes Challenge 2007 (VOC2007) test and given a score for accuracy. This test involves putting a set of images through a model and scoring it based on its detection accuracy. Better accuracy gives a higher score. The model also outputs a list of bounding boxes for all the detected objects. From the above table, it looks like SSD300 is the most suitable for our application because of its good accuracy and high frame rate.
Movidius NC workflow
This is the workflow to get a deep neural network onto the Movidius NCS:
- Train your neural network on a powerful machine. The Movidius SDK supports two Deep Neural Network frameworks: TensorFlow and Caffe.
- Compile the trained model into a “graph”, which is a format that is understood by the Movidius NCS. Optionally, you may profile and tune the model here.
- Load the “graph” onto the target device and start using the trained neural network model.
For our application, I skipped step one as someone else had already done the hard work of training the model. I downloaded the MobileNet SSD Caffe model, compiled it into a graph, and it was ready to go. It was very straightforward. Kudos to Intel for making it so simple.
Application software implementation details
The very first thing we need to do is to initialise the MVNC hardware. The process is quite easy and there are lots of examples on the internet.
- Create a device handle. For a multi-stick environment, this also selects which stick to use.
- Open the device.
- Read the graph file into memory. A graph file is a compiled neural network model.
- Create a graph handle.
- Load the graph to the device and allocate input and output FIFOs.
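The five steps above can be sketched as follows. Note that this uses the Python binding of NC SDK version 2 (the `mvncapi` module) rather than the C API used in Motion, purely to keep the example short. The function names, the graph name "motion_ssd" and the `stick_index` parameter are hypothetical. The SDK module is passed in as `api`, so the sketch does not import it itself.

```python
def read_graph(path):
    """Step 3: read the compiled graph file into memory."""
    with open(path, "rb") as graph_file:
        return graph_file.read()


def init_ncs(api, graph_buffer, stick_index=0):
    """Steps 1, 2, 4 and 5. `api` is expected to be the mvnc.mvncapi module."""
    sticks = api.enumerate_devices()
    if len(sticks) <= stick_index:
        raise RuntimeError("no Movidius NCS found at index %d" % stick_index)
    # Step 1: create a device handle; the index selects the stick to use.
    device = api.Device(sticks[stick_index])
    # Step 2: open the device.
    device.open()
    # Step 4: create a graph handle (the name is arbitrary).
    graph = api.Graph("motion_ssd")
    # Step 5: load the graph onto the device and allocate both FIFOs.
    input_fifo, output_fifo = graph.allocate_with_fifos(device, graph_buffer)
    return device, graph, input_fifo, output_fifo
```

With the SDK installed, a call would look something like `init_ncs(mvncapi, read_graph("graph"))` after `from mvnc import mvncapi`, where "graph" is whatever file name the compile step produced.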
Now we are ready to do inference.
The above diagram shows where the MVNC detection was implemented in the main Motion software loop. Inference on the MVNC takes a considerable amount of time, achieving only 5.5 frames per second, so I kept it asynchronous from the main Motion loop to give a real-time impression. The resulting inference delay of about 180 ms is acceptable in this application.
1. YUV hi-res image to BGR 300x300 image
The Motion software gets the high resolution raw image from the camera in YUV format, but our MobileNet SSD model expects a 300 x 300 pixel input, so we need to scale the image down. Our MobileNet SSD model also expects the image to be in BGR format with a floating point value from -1.0 to 1.0 for each colour channel, which means we need to do the format conversion as well. Fortunately, the ffmpeg library has a “sws_scale” function that does both scaling and conversion in one call. After putting the image through “sws_scale”, we end up with a raw image in BGR24 format, which has a value from 0 to 255 for each colour channel. We can dump this raw BGR24 image to a file and verify using “gnuplot” that the scaling and conversion were done correctly. This is the gnuplot command that I used to view the raw image:
plot 'scaled_bgr24.raw' binary array=(300,300) flipy format='%uchar%uchar%uchar' using 3:2:1 with rgbimage
The BGR24 to floating point conversion is straightforward.
BGR float = (BGR24 - 127.5) * 0.007843
Again, I used gnuplot to view the raw image:
plot 'scaled_bgr_float.raw' binary array=(300,300) flipy format='%float%float%float' using ($3*127.5+127.5):($2*127.5+127.5):($1*127.5+127.5) with rgbimage
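For reference, the BGR24 to float formula above can be written as a short Python sketch. This is not the C code used in Motion. The function name is hypothetical, and the input is assumed to be the packed BGR24 byte buffer produced by the scaling step.

```python
from array import array

SCALE = 0.007843  # approximately 1 / 127.5, the constant quoted above


def bgr24_to_float(bgr24):
    """Map every 0..255 channel byte to a 32-bit float in roughly -1.0..1.0."""
    return array("f", ((byte - 127.5) * SCALE for byte in bgr24))


# One BGR pixel: blue = 0, green = 255, red = 128.
pixel = bgr24_to_float(bytes([0, 255, 128]))
print([round(value, 4) for value in pixel])  # [-1.0, 1.0, 0.0039]
```

Writing the result out with `pixel.tofile(...)` on a file opened in binary mode would produce a raw float file of the kind the gnuplot command above reads, assuming the machine's native float layout matches what gnuplot expects.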
Initially I was confused about whether to use 16-bit or 32-bit floating point numbers. Version 1 of the NC SDK used 16-bit floats, whereas version 2 defaults to 32-bit floats even though the device only supports 16-bit. It was not clear to me whether I needed to configure the FIFO for 16-bit. I tried feeding it 32-bit floats and, to my surprise, it worked! Eventually I found out from the Movidius forum that 32-bit floats are converted to 16-bit automatically anyway.
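To get a feel for how little is lost in that 16-bit conversion, the sketch below rounds each of the 256 possible normalised channel values to IEEE 754 half precision using Python's `struct` module. It assumes ordinary round-to-nearest conversion, which may not be exactly what the SDK does, and the helper name is hypothetical.

```python
import struct


def to_fp16_and_back(value):
    """Round a float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# All 256 possible channel values after the BGR24 to float conversion.
levels = [(byte - 127.5) * 0.007843 for byte in range(256)]
rounded = [to_fp16_and_back(value) for value in levels]

max_error = max(abs(a - b) for a, b in zip(levels, rounded))
distinct = len(set(rounded))
print("largest rounding error: %.6f" % max_error)
print("distinct 16-bit values: %d" % distinct)  # 256
```

The largest rounding error is far smaller than the 0.007843 step between adjacent input levels, so all 256 levels remain distinguishable after the conversion.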
2. Queueing an image for inference
To perform an inference, we need to write the image to the input FIFO. Writing to the FIFO is a blocking operation: the caller is blocked until there is enough space in the FIFO. For asynchronous inference, we need this to be non-blocking, but that is not implemented in the SDK yet, so I ended up checking the FIFO level each time before writing. The number of images we maintain in the FIFO is important because it affects the inference latency and throughput of the device. Inference latency is the time between writing to the FIFO and getting the result. Maintaining more images in the FIFO means higher throughput, but at the cost of higher latency. I did a few tests and decided to only maintain one image in the FIFO. I will explain the reasons behind it later.
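The check-before-write idea can be sketched in a few lines of Python. The two callables stand in for the SDK's fill level query and blocking write. They, the function name and the `max_in_flight` parameter are all hypothetical, and a plain list plays the role of the input FIFO so that the example runs without the hardware.

```python
def try_queue_image(get_fill_level, write_image, image, max_in_flight=1):
    """Queue image only if the FIFO holds fewer than max_in_flight images.

    Returns True if the image was written and False if it was skipped,
    so the caller is never blocked by a full FIFO.
    """
    if get_fill_level() >= max_in_flight:
        return False  # an inference is still pending; skip this frame
    write_image(image)
    return True


# Stand-in for the input FIFO.
fifo = []
print(try_queue_image(lambda: len(fifo), fifo.append, "frame 1"))  # True
print(try_queue_image(lambda: len(fifo), fifo.append, "frame 2"))  # False
```

Skipped frames are simply not analysed, which is acceptable here because the camera keeps delivering frames faster than the stick can process them.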
3. Retrieving the results
Each time we go through the main Motion loop, we look for an inference result by checking the output FIFO level. Reading the FIFO is also a blocking call, so we only read the FIFO when the inference result is ready. The length and format of the output depend on the neural network model. Our MobileNet SSD model outputs a list of detected objects with the associated confidence level and bounding box.
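A minimal sketch of this polling step follows. The output layout is an assumption: it is the flat array commonly seen in NCS MobileNet SSD examples, where element 0 holds the number of valid detections and each detection is a 7-value record starting at index 7 (image id, class id, confidence, then the box corners normalised to 0..1). Check this against your own compiled graph. The function names and the two callables are hypothetical stand-ins for the SDK calls.

```python
def parse_ssd_output(output):
    """Split a flat SSD result into (class_id, confidence, box) tuples.

    ASSUMED layout: output[0] is the detection count, and records of
    7 values each (image_id, class_id, confidence, x_min, y_min, x_max,
    y_max) start at index 7.
    """
    count = int(output[0])
    detections = []
    for index in range(count):
        base = 7 + index * 7
        _, class_id, confidence, x1, y1, x2, y2 = output[base:base + 7]
        detections.append((int(class_id), confidence, (x1, y1, x2, y2)))
    return detections


def poll_result(get_read_level, read_result):
    """Return parsed detections if a result is waiting, otherwise None."""
    if get_read_level() == 0:
        return None  # nothing ready yet; do not block the main loop
    return parse_ssd_output(read_result())


# One made-up detection: class 15 at 90% confidence.
fake_output = [1, 0, 0, 0, 0, 0, 0, 0, 15, 0.9, 0.1, 0.2, 0.5, 0.8]
print(poll_result(lambda: 0, lambda: fake_output))  # None
print(poll_result(lambda: 1, lambda: fake_output))
```

A production version would also want to guard against non-finite values and out-of-range class ids before trusting a record.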
Real world performance and issues
After getting the basics to work, I profiled the system and pushed the device to see where the limits are. I used a Raspberry Pi 3B+ with a Pi Camera and a single Movidius NCS as my test system. Initially I was only maintaining one image in the input FIFO, which resulted in 5.5 fps throughput with the device running at just under 60°C. I increased this FIFO level to two images and managed to achieve 11 fps throughput. Increasing the FIFO level further did not yield any more improvement. Unfortunately, the device was not able to sustain the 11 fps throughput for long as it started thermal throttling at 70°C. The throughput eventually dropped to 8 fps. This was at 24°C ambient temperature. At a higher ambient temperature, it would have throttled further. There is no point in pushing the device too hard, so I decided to go back to my initial one-image FIFO setup.
Intel has made AI on embedded systems a reality with this low-power, low-cost Movidius NCS. Integrating neural networks into existing software was quite easy. The only disappointment is that the device throttles thermally whenever it is pushed. Nevertheless, the Movidius NCS is still a viable option for running AI at the edge.
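The relationship between FIFO depth, throughput and latency in these measurements can be estimated with Little's law (latency = images in flight / throughput). The sketch below assumes steady state and that the number of images kept in the FIFO equals the number of images in flight. It is a back-of-the-envelope estimate, not a measurement, and the function name is hypothetical.

```python
def estimated_latency_ms(images_in_flight, throughput_fps):
    """Little's law: average latency = items in the system / throughput."""
    return 1000.0 * images_in_flight / throughput_fps


cases = [
    ("one image at 5.5 fps", 1, 5.5),
    ("two images at 11 fps", 2, 11.0),
    ("two images, throttled to 8 fps", 2, 8.0),
]
for label, in_flight, fps in cases:
    print("%s: about %.0f ms" % (label, estimated_latency_ms(in_flight, fps)))
```

This prints roughly 182 ms for the first two cases and 250 ms for the throttled one. In other words, once the stick throttles, the deeper FIFO still gives more throughput than the one-image setup but with noticeably longer latency.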