luxonis/depthai

DepthAI Pipeline Builder Gen2

Luxonis-Brandon opened this issue · 4 comments

Start with the why:

Several of the real-world applications desired of the DepthAI platform are actually series, parallel, or combined arrangements of neural networks, with regions of interest (ROI) passed from one network to one or more subsequent networks.

The Myriad X hardware is capable of multi-stage neural inference in parallel with computer vision functions, disparity depth, video encoding, etc., but no system exists to easily use this functionality to solve real-world problems. If a user can modularly piece these capabilities together (i.e. in a pipeline builder), this enables super-interesting applications, an example of which is below for sports filming:

  • Detecting action in a scene (neural inference, say detecting where a soccer ball is)
  • Automatically tracking the action (say tracking the ball)
  • Automatically digitally zooming (#135) using the 12MP camera dynamically (lossless zoom up to 6x while producing 1080p encoded video). (say running motion detection and only encoding the subset of the video that has the motion… in sports, no motion probably means no action)
  • Running parallel neural inference for ball/player detection and tracking them in 3D space - to produce game statistics such as total distance traveled (in miles) by the ball, by each player, etc.
  • Running re-identification (neural inference-based) on players as they move (and occlude each other) so that each player is tracked individually.

So this is just an example of how the pipeline builder can be used to string together really interesting functionalities. The core value of the builder is that it would allow many hardware/firmware capabilities to be strung together in series/parallel combinations to solve real-world problems easily:

  • Neural inference (e.g. Object detection, image classification of the ROI of a detected object, etc.)
  • 3D object localization (both monocular object detection plus stereo depth and stereo neural inference supported)
  • Object tracking
  • Stereo depth (initial Gen2 example, here)
  • h.264/h.265 encoding
  • Digital zoom (leveraging the full 12MP sensor resolution... which is 6x full 1080p streams)
  • Background subtraction
  • Feature tracking
  • Motion estimation
  • Arbitrary crop/rescale/reformat and ROI return

In many of these pipeline flows of multiple nodes, there is a need for custom rules and logic between nodes (e.g. filtering out which ROIs 'make the cut' for the next stage). In many cases, the pipeline is not doable without these rules, as the rules are often a key encoding of a-priori knowledge by the designer, without which the solution is not tractable.

As such, having support for custom code/functions/etc. to enable rules is a critical feature. And support for this feature is equally necessary whether DepthAI is used with or without a host.
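To make the "rules between nodes" idea concrete, here is a minimal host-side sketch. The detection format (a normalized bounding box plus a confidence score) is an assumption for illustration, not the actual DepthAI message layout:

```python
# Hypothetical sketch: a rule that decides which detection ROIs "make the cut"
# before being passed to a second-stage network. The detection dict format
# (normalized bbox + confidence) is assumed, not the real DepthAI message type.

def filter_rois(detections, min_confidence=0.6, min_area=0.01):
    """Keep detections that are both confident enough and large enough.

    detections: list of dicts with 'conf' (float) and
    'bbox' = (xmin, ymin, xmax, ymax), all normalized to [0, 1].
    """
    kept = []
    for det in detections:
        xmin, ymin, xmax, ymax = det["bbox"]
        # Tiny ROIs usually produce garbage in the second-stage network,
        # so an area threshold encodes that a-priori knowledge.
        area = max(0.0, xmax - xmin) * max(0.0, ymax - ymin)
        if det["conf"] >= min_confidence and area >= min_area:
            kept.append(det)
    return kept
```

A rule like this is exactly the kind of a-priori knowledge (minimum useful ROI size, confidence floor) that makes a multi-stage pipeline tractable.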

DepthAI used with host

When using DepthAI and megaAI with a host, the capability to implement these rules/functions/etc. on the host is very convenient, as the engineer can then leverage the full convenience of the host for running rules, functions, and even CV capabilities.

To facilitate this most flexibly, architecting the pipeline builder such that every node (including the camera node(s)) can optionally send its output to the host and optionally receive input from the host is a key capability of such a pipeline builder.

Importantly, such a capability for each node to send information to (and receive it from) the host also enables easier development workflows:

  • Debugging (testing each node for accuracy/performance by itself)
  • QA (capability to test thousands (or millions) of images through the whole pipeline, or parts of it, from existing datasets)
  • Model refinement and accuracy testing (being able to test the node accuracy fully on the hardware, after conversion, in a quantitative way)
  • Visualization (being able to see on a computer the output of each stage to easily see how things are looking in each stage)
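The per-node host tap described above can be modeled with a small sketch. This is an illustrative data structure only; the names do not correspond to the real DepthAI API:

```python
# Hypothetical model of the per-node host tap: each node can optionally mirror
# its output to a host-visible queue (debugging, QA, visualization) and
# optionally accept input injected by the host (e.g. replaying a dataset
# through one node in isolation). Names here are invented for illustration.
from collections import deque

class Node:
    def __init__(self, fn, tap_to_host=False):
        self.fn = fn                  # the node's processing function
        self.tap_to_host = tap_to_host
        self.host_queue = deque()     # host-visible copy of every output

    def process(self, data, host_override=None):
        # The host may inject input, bypassing the upstream node entirely.
        inp = host_override if host_override is not None else data
        out = self.fn(inp)
        if self.tap_to_host:
            self.host_queue.append(out)
        return out
```

With this shape, QA becomes feeding thousands of dataset images in via `host_override` and reading results off `host_queue`, without touching the rest of the pipeline.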

UPDATE 20 Nov. 2020: The first example of this host-integrated use-case is here: https://github.com/luxonis/depthai-experiments/blob/master/gaze-estimation

DepthAI used without host (i.e. embedded use-case)

When there is no host present - for example when DepthAI is running completely standalone and directly actuating IO or communicating over SPI/UART/I2C - it is still equally necessary to allow such rules/custom code/etc.

To support this, the capability for the user to run arbitrary code on DepthAI (as nodes) is critical.

It is worth noting that when using DepthAI without a host in deployment, one could still use the with-host mode above for debugging, while still running the full embedded flow.

The how:

To support such arbitrary pipeline builds in both with-host and without-host use-cases, we architect the pipeline builder so that every node can send data to/from the host and so that CPython code can be run directly as nodes.

Integrating this, we have settled on the following approach, which breaks down into 3 modalities of nodes that are used in the pipeline builder to solve embedded CV/AI problems and to leverage this information to interact with the physical world.

Node modalities:

  1. Fast, easy, limited flexibility: These are the accelerated blocks listed above, like neural inference, 3D object localization, etc. They come pre-packaged and are trivial to make use of. But they often need application-specific logic between them, hence modality 2. And if your CV algorithm isn't on that list (or maybe you've invented your own proprietary algorithm) and you need it to run performantly on DepthAI, see modality 3.

  2. Slow, easy, quite flexible: CPython bindings for scripts running directly on DepthAI as a node (issue #207).
    This allows you to apply custom rules to metadata from neural inference results, write custom protocols that run on-chip as part of the pipeline, communicate with sensors/actuators or other systems over SPI, UART, I2C, etc. based on pipeline results, and so on. For example, you can write rules that make sense of neural-inference metadata and then control the performant crop/resize/reformat that connects stages of accelerated CV functions.

  3. Fast, hard, quite flexible: OpenCL (here), G-API (more details soon), and ML frameworks for vectorized math are used to compile custom computer-vision functions to run performantly on the SHAVEs in DepthAI. So you can take your computer vision function, write it in OpenCL, G-API, or say PyTorch, and drop it in as a node in the pipeline builder. This allows custom algorithms, including proprietary algorithms, to be hardware accelerated in the pipeline as a node. And the pipeline builder leverages the hardware-accelerated crop/rescale/reformat to match inputs and outputs. This could even be used for non-CV functions, for example running custom arbitrary mathematical functions on audio data brought in via CPython over I2C. For an EXCELLENT example of how to run custom CV code on DepthAI using PyTorch, see this guide by Rahul Ravikumar.
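The crop/rescale glue mentioned in modalities 2 and 3 amounts to mapping a first-stage network's normalized bounding box onto pixel coordinates for the next stage. A minimal sketch, assuming the common normalized-bbox convention (the real hardware does this step accelerated on-device):

```python
# Hedged sketch of the crop glue between pipeline stages: convert a
# normalized bounding box from a first-stage detector into a pixel-aligned
# crop rectangle for the second-stage network's input.

def bbox_to_pixel_crop(bbox, frame_w, frame_h):
    """bbox = (xmin, ymin, xmax, ymax), each normalized to [0, 1]."""
    def clamp(v):
        # Detectors can emit slightly out-of-range coordinates; clamp so the
        # resulting crop is always valid.
        return min(max(v, 0.0), 1.0)

    xmin, ymin, xmax, ymax = bbox
    return (int(clamp(xmin) * frame_w), int(clamp(ymin) * frame_h),
            int(clamp(xmax) * frame_w), int(clamp(ymax) * frame_h))
```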

The what:

If we support the following with our pipeline builder, it seems it would be sufficiently flexible.
So the task is to implement a pipeline builder which can be used to express the flows below.

UPDATE 26 December 2021: The docs for Gen2 are materializing here: https://docs.luxonis.com/projects/api/en/gen2_develop/

Example Neural Pipelines To support:

  • The OpenVINO security barrier demo (here).

    • This does vehicle detection, followed by two parallel networks that operate on the ROI of the vehicle:
      • 1 x NN for vehicle color and vehicle type
      • 1 x NN for license plate detection
        • Then the ROI from the plate detection is passed to another NN which outputs region and OCRs the plate.
  • Update 26 January 2021: Github issue for this example pipeline is luxonis/depthai-experiments#47

  • Interactive Face Detection Demo (here)

    • Does face detection and allows running the following as secondary networks run on the ROI of the face
      • Age + Gender recognition
      • Facial Expression estimation (incorrectly called ‘emotion’ in their doc)
      • Facial Landmarks
      • UPDATE 16 MARCH 2020 ArduCam produced this example using the Gen2 Pipeline builder, here
  • Interactive Face Recognition Demo (here)

    • Detects faces and runs landmarks and face-reidentification to recognize the people.
    • Github issue for Gen2 Example implementation here
    • UPDATE 16 MARCH 2020 ArduCam produced this example using the Gen2 Pipeline builder, here
  • Cross Road Camera Demo (here)

    • Detects people, vehicles, and bikes, and then runs person attributes and person re-identification on the ROI of detected people.
    • UPDATE 16 March 2020: ArduCam actually implemented this, here and we have our WIP version here. (We started before we realized ArduCam had already produced this example!)
  • Pedestrian Tracker (i.e. Person ReID here)

    • Detects people, re-identification runs on the ROI from person detection
    • UPDATE Nov 23 2020: Initially implemented in Gen2, here
  • Text Detection and Recognition (OCR) (here)

    • Detects regions that contain text, and then OCRs those regions.
    • UPDATE 30 Dec 2020: Initially implemented in Gen2, here
  • Gaze Estimation (here and here)

    • Does face detection, ROI of which goes to both head pose estimation and facial landmarks.
      • The outputs of head pose estimation and facial landmarks are passed to the gaze estimation model
    • UPDATE Oct 23 2020: Initially implemented in Gen2, here
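As one concrete illustration of the flows above, here is the security-barrier shape (series detection feeding two parallel networks, one of which feeds OCR) in plain Python. The "networks" are stub callables standing in for the real OpenVINO models; only the series/parallel dataflow is the point:

```python
# Illustrative sketch only: the security-barrier flow as plain Python.
# detect, attrs_net, plate_net, and ocr_net are stand-ins for real models.

def run_security_barrier(frame, detect, attrs_net, plate_net, ocr_net):
    results = []
    for roi in detect(frame):                 # stage 1: vehicle detection
        attrs = attrs_net(roi)                # stage 2a: color/type (parallel)
        plates = plate_net(roi)               # stage 2b: plate detection (parallel)
        texts = [ocr_net(p) for p in plates]  # stage 3: OCR each plate ROI
        results.append({"attrs": attrs, "plates": texts})
    return results
```

On DepthAI this whole loop body would be nodes and links rather than Python calls, but the graph shape is the same.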

Of the examples on the OpenVINO repository, the following seems like it should not be implemented initially, as it is the only one that does series, parallel, and output of parallel back into a single model, so it is much more specialized.

This will then cover the following items which were previously independently on the DepthAI roadmap:

  • Get two-stage face detection and following age-gender or emotion working (prototype here)
  • Person detection, tracking, and reidentification.
  • Add capability to run multiple neural networks in parallel (prototype here)
  • Integrate face detection and identification with the Python API (e.g. here)
    • First step: without depth
    • Second step: with depth.
    • Most common complement to object detection
  • Be able to run multiple models in sequence (e.g. facial detection -> facial landmark -> landmark tracking) (prototype here)
    • This is different from multiple-output tensors (which are already implemented, PR here)
  • Text Detection and OCR Support (#124)

To keep in mind, but maybe not support initially:

  • This smart motion (#132) sort of pipeline, here, which uses motion detection to determine what subset of a scene to pass into object detection, followed by object tracking on the detected objects
  • Utilizing onboard storage with Pipeline Builder (#134)
  • Option to return the depth map for just the ROI of the detected object (#125)

@Luxonis-Brandon,

A pipeline builder can make things quicker and more straightforward to piece together! :)
Some things I'm about to try:

Default mobilenet SSD (Coco) with depth:

If person

Run face recognition and face reidentification

If stranger

Run age / gender estimator

Run facial expression estimator

Run action classifier
Output: 09:00. Person. Male. 20 to 25 years old. Looking happy. Standing. 2 meters away.

If not stranger

Run facial expression estimator

Run action classifier
Output: 09:00. Marx. Looking happy. Standing. 2 meters away.

If not person

Run OCR detection

If text detected

Run text recognition
Output: Dead center. Monitor. 1 meter away. Text reads, " Warning: Aliens Spotted Near You ".

If no text

Pass
Output: Dead center. Monitor. 1 meter away.

:D
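The branching flow above can be sketched as host-side Python to show how it would hang together. Every classifier here is a stub stand-in for a real network, and the detection/result formats are invented for illustration:

```python
# Sketch of the proposed branching pipeline. `nets` maps stage names to stub
# callables standing in for real networks; a face_reid result of None means
# "stranger". All formats here are assumptions, not DepthAI message types.

def describe(detection, nets):
    parts = []
    if detection["label"] == "person":
        identity = nets["face_reid"](detection)   # None => stranger
        if identity is None:
            parts.append("Person")
            parts.append(nets["age_gender"](detection))
        else:
            parts.append(identity)
        parts.append(nets["expression"](detection))
        parts.append(nets["action"](detection))
    else:
        parts.append(detection["label"].capitalize())
        text = nets["ocr"](detection)             # OCR branch for non-persons
        if text:
            parts.append('Text reads, "%s"' % text)
    parts.append("%d meters away" % detection["z"])
    return ". ".join(parts) + "."
```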

Great feedback @MXGray ! Discussing internally now how difficult such results-based dynamic pipelines would be to implement. I definitely see how useful this would be... now to investigate the relative difficulty/feasibility.

The initial Gaze estimation example is implemented here: luxonis/depthai-experiments#8
Gaze Example Demo

This is now implemented and mainlined. Most things that were possible in Gen1 API are now possible in Gen2. See below for resources: