de-code/layered-vision

AutoTrack filter

benbatya opened this issue · 14 comments

This is the tracking issue for the autotrack feature originally mentioned at python-tf-bodypix#67
Make AutoTrack a filter that can be applied.
This will require a new filter type which takes the bodypix masks as an argument to the filter function

Great, thank you.

This will require a new filter type which takes the bodypix masks as an argument to the filter function

Currently the filters just receive a single input. That works fine when just using a single mask. But in this case you probably want to have multiple inputs. Also the current approach when blurring the background is to pass the webcam image into two branches, where the background branch is blurred and the foreground branch has the mask applied to it. "Zooming" both probably requires a filter after both branches are applied, and you need a separate input from the bodypix model I guess. Open to your ideas on how to best achieve that.

Hi @de-code, here are some thoughts:
Maybe we can build a simple data-flow API which can define nodes with zero or more inputs and one ndarray output.
Data would be pushed from upstream nodes to downstream nodes.
To make the data-flow simpler, each source node will be notified of the new frame and will produce a new output.
Then the downstream nodes will take the output from the upstream nodes and process them and produce new output.
Nodes will be scheduled using a BFS queue, which adds a node once all of its inputs are ready.
Input/output compatibility is checked whenever a node is processed. This could also be done once lazily.
For simplicity, a valid connection is an output and input ndarray with matching dims on all axes.
All of a node's inputs must be connected in order for it to be validated.

So for replace background of a movie:

  1. The webcam node outputs a 640x480x3 ndarray and is marked as a source node
  2. The video_player node outputs a 640x480x3 ndarray and is marked as a source node
  3. The bodypix node's 640x480x3 input is connected to the webcam output, and it outputs two 640x480x1 ndarrays labelled "all_mask" and "face_mask"
  4. The erode node's 640x480x1 input is connected to the bodypix.all_mask output, and it outputs a 640x480x1 ndarray
  5. dilate, blur and motion_blur are the same as erode
  6. The composite node's inputs are a 640x480x3 background, a 640x480x3 foreground and a 640x480x1 mask, and its output is 640x480x3
  7. The video_player output is connected to the composite's bg input, the webcam output is connected to the composite's fg input, and the motion_blur output is connected to the composite's mask input
  8. The window node's input is connected to the composite's output and it has a 0x0x0 output

webcam -> bodypix -> erode -> dilate -> blur -> motion_blur --(mask)--> composite -> window
webcam -------------------------------------------------------(fg)---> composite
video_player --------------------------------------------------(bg)---> composite
A more flexible connection system would be nice, allowing ANYxANYx1-style dim definitions with dynamic dim checking and pass-through during the lazy check.
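To make this a bit more concrete, here is a minimal Python sketch of such a push-style graph (all names are made up for illustration; nothing like this exists in layered-vision yet). Source nodes are fed their values for the current frame, and the remaining nodes are processed via a BFS queue once all of their inputs are ready:

from collections import deque

class Node:
    def __init__(self, name, fn, output_shape, inputs=()):
        self.name = name
        self.fn = fn                      # callable taking one ndarray per input node
        self.output_shape = output_shape  # e.g. (480, 640, 3), or (480, 640, 1) for masks
        self.inputs = list(inputs)        # upstream nodes, in argument order

def run_graph(nodes, source_values):
    # source_values: {source_node: ndarray} for the current frame (webcam, video_player, ...)
    values = dict(source_values)
    queue = deque(node for node in nodes if node not in values)
    while queue:
        node = queue.popleft()
        if not all(upstream in values for upstream in node.inputs):
            queue.append(node)            # inputs not ready yet, re-queue the node
            continue
        args = [values[upstream] for upstream in node.inputs]
        for upstream, arg in zip(node.inputs, args):
            # a valid connection is an output/input pair with matching dims on all axes
            assert arg.shape == upstream.output_shape, (upstream.name, node.name)
        values[node] = node.fn(*args)
    return values

In the example above, composite would be a Node whose inputs are (video_player, webcam, motion_blur), and window would consume composite's output.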
Another approach would be to use gstreamer with the python API and convert bodypix to a gstreamer node.

What do you think?

Thank you for putting your thoughts down. I will need some time to read and understand it properly.
The current system is pull rather than push. It seemed easier to make things lazy. But I will read your proposal more carefully.

But since you mentioned gstreamer: there is Google Coral's project-bodypix which is using gstreamer. But as far as I understand it only supports the Coral Edge TPU. It might be something to look at for inspiration. I don't have any experience with the gstreamer API (only a vague recollection that it required more system dependencies).

Making it pull would be fine for small graphs.
If the graph gets big enough, then turning it into an efficient multi-process app will be more difficult.
Queuing and distributing large numbers of node calculations will be easier with a push system.
But maybe that's over-designing the system.
And I have no idea how to efficiently pass ndarrays between multiple python processes.

Hi @de-code, I got inspired by your lazy pull approach so I adopted it for a more general graph.
The graph is acyclic and dependent nodes must appear after their dependees.

Running the graph consists of recursively calling node.calculate() starting with the last node.
Normally the output of the called node would be the default input of the calling node.
But if a node needs more than one input, or its default input is not connected to the default output, it can add that request to a map of output-requests.
When an output matches an output-request, it is added to the map before returning the default output.
This means that memory reallocation for default connections ("input: prev_node.output") is eliminated.

Steps of Node.calculate():

  1. Check if the node has any inputs.
  2. If so, add the non-default requests to the output-requests map and call the previous node in the list with the map
  3. Gather and validate the inputs from the returned default value and output-requests map
  4. Run the filter/source/sink code on the inputs to produce outputs
  5. Store the output-request values into the map and return the default output

This way the nodes can be purely functional and lazily pulled.
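For illustration, here is a minimal sketch of what Node.calculate could look like under this scheme (illustrative only; these are not the real layered-vision classes). Nodes live in a flat list with dependees before dependents, and output_requests maps "node.output_name" keys to values as they become available:

class Node:
    def __init__(self, name, fn, extra_input_names=()):
        self.name = name
        self.fn = fn  # (default_input, *extras) -> (default_output, {output_name: value})
        self.extra_input_names = list(extra_input_names)

    def calculate(self, nodes, index, output_requests):
        default_input = None
        if index > 0:  # 1. the node has inputs
            # 2. register the non-default requests, then call the previous node in the list
            for key in self.extra_input_names:
                output_requests.setdefault(key, None)
            default_input = nodes[index - 1].calculate(nodes, index - 1, output_requests)
        # 3. gather the inputs from the returned default value and the output-requests map
        extras = [output_requests.get(key) for key in self.extra_input_names]
        # 4. run the source/filter/sink code on the inputs
        default_output, named_outputs = self.fn(default_input, *extras)
        # 5. store any requested named outputs and return the default output
        for key, value in named_outputs.items():
            if key in output_requests:
                output_requests[key] = value
        return default_output

# processing one frame means calling calculate() on the last node (typically the sink):
# nodes[-1].calculate(nodes, len(nodes) - 1, {})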
Here's an example configuration:

nodes:
  - source: video_source
    # id: bg  # id is optional. type name is default for id. Since each id must be unique in the graph,
    # if two nodes of the same type are created, use id to identify them 
    # Source: https://www.pexels.com/video/carnival-rides-operating-in-an-amusement-park-3031943/
    input_path: "https://www.dropbox.com/s/oqftndbs29g8ekd/carnival-rides-operating-in-an-amusement-park-3031943-360p.mp4?dl=1"
    repeat: true
    preload: true
  - source: webcam
    device_path: "/dev/video0"
    fourcc: "MJPG"
  - filter: bodypix  # automatic connection from webcam.output to bodypix.input
    model_path: "https://storage.googleapis.com/tfjs-models/savedmodel/bodypix/mobilenet/float/050/model-stride16.json"
    # model_path: "https://storage.googleapis.com/tfjs-models/savedmodel/bodypix/resnet50/float/model-stride16.json"
    internal_resolution: 0.5
    threshold: 0.5
  - filter: erode  
    # input: bodypix.output # bodypix default output is the all mask
    value: 20
  - filter: dilate  # automatic connection
    value: 19
  - filter: box_blur
    value: 10
  - filter: motion_blur
    frame_count: 3
    decay: 0
  - filter: composite
    # input: motion_blur.output   # NOTE: this isn't needed. It would be the default connection anyway
    input_fg: webcam.output
    input_bg: video_source.output
  - filter: auto_track      # crops input to crop_mask, adds a padding and then resizes back to the size of input array
    # input: composite.output # NOTE: not needed
    crop_input: bodypix.output_face # Alternative is a union node of bodypix.output_lface and bodypix.output_rface
    padding: 20
  - sink: v4l2_loopback
    device_path: "/dev/video4"

Hi @benbatya thanks again. I haven't had much time to think about it yet but I will try to respond in part anyway (please let me know if anything doesn't make sense, I am sure I am missing a lot).

And I have no idea how to efficiently pass ndarrays between multiple python processes.

That I am not sure about either. I guess it is one of the major weaknesses of Python. There seems to be shared memory support that could potentially be used.
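For reference, a rough sketch of how the standard library's shared memory support (Python 3.8+, multiprocessing.shared_memory) can back an ndarray; this is just the mechanism, not something layered-vision currently uses:

import numpy as np
from multiprocessing import shared_memory

frame = np.zeros((480, 640, 3), dtype=np.uint8)

# in the producer process: back the array with a named shared memory block
shm = shared_memory.SharedMemory(create=True, size=frame.nbytes)
shared_frame = np.ndarray(frame.shape, dtype=frame.dtype, buffer=shm.buf)
shared_frame[:] = frame                  # write the frame into shared memory

# in a consumer process: attach to the same block by name and wrap it as an ndarray
existing = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray((480, 640, 3), dtype=np.uint8, buffer=existing.buf)
# ... use `view` ...
existing.close()

shm.close()
shm.unlink()                             # the producer releases the block when done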

Perhaps one other consideration here is that the initial use-case will probably be live desktop usage, maybe even during a meeting. Maybe it's good / okay if it is not maxing out all of the CPUs.

Another potential argument for pull in that setting is that it reduces issues with back-pressure, i.e. if the bodypix model and the webcam output are slower than the background video frame rate, then we can skip video frames rather than putting them on a queue. (I guess that is different to a typical data flow scenario, where we definitely want to process all of the data.)
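Purely as an illustration of the frame-skipping idea (not existing code): a pull-based background video source could simply return whichever pre-loaded frame corresponds to the current wall-clock time, so a slow consumer naturally skips frames instead of queueing them:

import time
import numpy as np

class SkippingVideoSource:
    def __init__(self, frames, fps):
        self.frames = frames             # list of pre-loaded ndarray frames
        self.fps = fps
        self.start = time.monotonic()

    def read(self) -> np.ndarray:
        elapsed = time.monotonic() - self.start
        index = int(elapsed * self.fps) % len(self.frames)
        return self.frames[index]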

Some benefits I thought I saw when I introduced the branches:

  • they make it impossible to have cyclic graphs
  • they make it more intuitive / simpler by not having to wire up layers explicitly
  • a branch can somewhat encapsulate a "component"

I am not sure whether the second point is actually true, as I am sometimes confused by it myself. So maybe it wasn't a great idea after all.

I suppose by mandating that referenced layers appear earlier, like you suggest, you are also enforcing the graph to be acyclic.

With the current implementation we could also provide a map of the layers by id to add them in.

Otherwise your proposal seems to be a bit similar to how RuntimeLayer.__next__ is implemented?

The inputs are connected to RuntimeLayer ahead of time, so that it basically is a node in the graph. (Although there is definitely plenty of room for refactoring in that area.)

Some questions:

I guess you prefer to more clearly separate between video_source, webcam and v4l2_loopback? (it will definitely make parameter validation easier)

In the composite filter layer you have input_fg and input_bg. Are these very specific inputs that the filter supports or would it be able to combine an arbitrary number of inputs? (the current branches do, but I am not sure there is an actual use-case for it)

In the auto_track filter layer, we have the crop_input set to bodypix.output_face. I understand bodypix would be the id of the previous layer. But where would the output_face come from? Would the bodypix filter register a dynamic output with that name, which would be calculated when requested by auto_track?

When you explained Node.calculate, do you still mean a pull-based, layer-by-layer request per frame, or do you assume a static input request? (i.e. could the actually requested input still be dynamic?)

I will add some other GitHub issues with potential features..

Hi @de-code, my example config was to try to prove out what an "ideal" configuration might be.
But after thinking about it, the configuration could be replaced by a single high-level Python function which starts with the sources, just passes ndarrays between different filter functions and then dumps the results into the sinks. This is especially true since there's no easy way to multi-thread numpy instances (maybe Parallelpython.com would help?), so scheduling nodes in parallel isn't easy/possible.
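Something like this minimal sketch of that single high-level function (the names here are placeholders, not existing layered-vision APIs):

from typing import Callable, Iterable
import numpy as np

def run_pipeline(
    source_frames: Iterable[np.ndarray],
    filters: Iterable[Callable[[np.ndarray], np.ndarray]],
    write_frame: Callable[[np.ndarray], None],
) -> None:
    filters = list(filters)
    for frame in source_frames:          # e.g. frames read from the webcam
        for apply_filter in filters:     # masking, blur, composite, auto_track, ...
            frame = apply_filter(frame)
        write_frame(frame)               # e.g. dump into the v4l2_loopback sink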

To answer your questions:

  1. I like to define types which will be instantiated as concretely as possible. Then there's no ambiguity with regards to parameters or trying to guess what the concrete type is from the context. To me, it just makes sense for video_source, webcam and v4l2_loopback to be different types, and it allows nodes of those types to be created anywhere. I think that the source, filter and sink distinction can be dropped as well..
  2. input_fg and input_bg are additional inputs. The composite filter requires 3 inputs and the default input is the mask (see the sketch after this list). I was thinking about using input0, input1, etc... but realized that it's clearer in the configuration to force the names to be explicit
  3. bodypix.output_face is a different output. bodypix would have output, which is the all mask, and output_face, which is just the face mask (left_face | right_face). bodypix could have an output per body part (25 with all) but why bother unless there's a specific use case?
  4. Node.calculate is supposed to be the functional core of the node system. Instead of layers, there are just nodes which are passing inputs to each other. This allows for a more flexible setup, because layers disallow data from flowing between layers. The composite node builds the composited output, which is then cropped and resized by the auto_track node. Similar functionality would apply to adding overlays, because they require the mask outputs from bodypix to transform overlay images (crown, halo, horns, etc...) relative to the face mask. Eventually, when the bodypix node can produce the estimated position of eyes, ears, mouth, etc., the overlays can get more interesting à la Snapchat filters.
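Regarding point 2, a sketch of what the composite filter with explicitly named inputs could look like as a plain function (hypothetical, not an existing layered-vision API):

import numpy as np

def composite(mask: np.ndarray, fg: np.ndarray, bg: np.ndarray) -> np.ndarray:
    # mask is the default input (HxWx1, values in [0, 1]); fg / bg correspond to the
    # input_fg / input_bg connections in the example configuration
    mask = mask.astype(np.float32)
    blended = fg.astype(np.float32) * mask + bg.astype(np.float32) * (1.0 - mask)
    return blended.astype(np.uint8)      # HxWx3 composited frame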

If composite functionality for nodes is desired, a sub-graph with clearly defined source inputs and sink outputs can be defined. This is how gstreamer GstGhostPads are defined.

I think that there's a lot of benefit to keeping the graph definition in Python to make creating new filter nodes simpler. bodypix takes up the majority of the CPU right now, so trying to multi-process it is counter-productive. It would be nice to move bodypix to the GPU for processing but I guess that means updating python-tf-bodypix?

A good example of a node network is https://viz.mediapipe.dev/demo/hair_segmentation.
The mediapipe framework is pretty awesome except bodypix doesn't work on it... lol

Okay, these discussions are definitely useful.

When I created this layered-vision project, I had the config in mind. That is so that less technical people could modify it.
Perhaps there could even be a UI for it.
It is currently admittedly not designed as a Python API, but that would be good too of course. Either in the same project or another shared project (if it could be useful for sub-projects).
I hadn't spent a lot of time trying to survey existing solutions.
e.g. is there another similar Python project (preferably with a permissive license) that we could integrate bodypix into instead? (I am also thinking that there may be other bodypix-like models that could be used in the future.)

So I guess at this point the main questions are:

  • is it worth continuing with this project or would you rather adopt another project or start from scratch?
  • if we continue with this project, would you only be happy to do that if it was refactored (or re-architected) first, or would the current structure be okay?
  • would you prioritise the auto track feature over refactoring the project?

It also depends on your available time.

If you think it is still worth investing into this project, then in order to progress it, it could be a good time to start moving different aspects to separate smaller tickets.

The main questions for the autotrack feature are probably (given that it is still configuration driven):

  • How do we configure the bodypix face output?
    • This is what I am not so clear about at the moment. One can already select the part segments to output, although only as a single output. I wouldn't want to always output the face; it should therefore be either on-demand or configured. Perhaps there could be an additional_outputs map, where part segments can be configured per output.
    • Alternatively there could be an entirely separate bodypix filter and it could rely on caching to avoid calculating it twice.
  • How do we configure the face input to the autotrack filter?
    • This seems to be easier and something like you suggested would work. I would maybe prefer a more "generic" auxiliary input, e.g. in the form of an additional_inputs map (see the sketch after this list). But that seems to be a minor detail.
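To make the two ideas above concrete, this is roughly the shape they could take, written here as Python dicts rather than YAML (both additional_outputs and additional_inputs are hypothetical keys, not existing layered-vision options):

# hypothetical bodypix layer: part segments configured per named output
bodypix_layer = {
    "filter": "bodypix",
    "additional_outputs": {
        "face_mask": {"parts": ["left_face", "right_face"]},
    },
}

# hypothetical auto_track layer: a generic auxiliary input referring to that output
auto_track_layer = {
    "filter": "auto_track",
    "additional_inputs": {
        "crop_mask": "bodypix.face_mask",
    },
    "padding": 20,
}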

Another idea could be to decouple the autotrack filter from the bodypix model further. For example OpenCV has face detection, which may not be very useful for removing the background but could be used for autotrack (and would be a lot faster I guess, not tested). i.e. one could imagine a bounding box face input, although it doesn't quite fit any of the other data being passed around. So maybe not a good idea, just throwing it out there anyway.
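For example, a rough (untested) sketch of that OpenCV idea, where a Haar cascade produces a face bounding box that an autotrack filter could crop to:

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_bounding_box(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                      # no face detected in this frame
    # pick the largest detection as the face to track: (x, y, w, h)
    return max(faces, key=lambda box: box[2] * box[3])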


It would be nice to move bodypix to the GPU for processing but I guess that means updating python-tf-bodypix?

You should get CUDA-enabled GPU support via TensorFlow. The current Docker image, however, won't have GPU support, because it doesn't currently include any CUDA libraries. I don't have a GPU on my laptop to test it though.

Just my thoughts: I have started looking at mediapipe by Google (mp) and I think that it's a much better platform for flexible webcam manipulation than trying to get performance out of a Python app. It has a nice graph description API (more understandable than gstreamer), it automatically parallelizes node calculations and it contains most of the functionality needed to produce good results.

The only major piece needed AFAICS is to port the bodypix model to a calculator which can expose results (output_stream in mediapipe terms) and to make a calculator which sinks the results into v4l2loopback.
There's a nice little tutorial that I'm going to run through to test mp segmentation: https://towardsdatascience.com/custom-calculators-in-mediapipe-5a245901d595 but my tflite knowledge is extremely limited...

Once I get through the tutorial, building mp nodes for v4l2loopback and auto_track will be the next goals.
Unfortunately, these won't be python apps....

I apologise for suggesting this feature and then leaving this project. I like what you have done but unfortunately I see the GIL as being a fundamental blocker to performance. Especially when alternatives like mediapipe and gstreamer exist.

I would be extremely happy to collaborate on porting bodypix to mediapipe. Google's new background segmentation model is much more accurate than bodypix but doesn't appear to produce separate masks for different body parts or estimate feature locations AFAICS

https://ai.googleblog.com/2020/10/background-features-in-google-meet.html?m=1

https://drive.google.com/file/d/1lnP1bRi9CSqQQXUHa13159vLELYDgDu0/view

Just my thoughts: I have started looking at mediapipe by Google (mp) and I think that it's a much better platform for flexible webcam manipulation than trying to get performance out of a Python app. It has a nice graph description API (more understandable than gstreamer), it automatically parallelizes node calculations and it contains most of the functionality needed to produce good results.

MediaPipe looks like a good project.
(I am just wondering why the "Used in leading ML products and teams" section uses quotes of four people, all of them previous or current Google employees)
I guess it really depends on what you are trying to do.
Python is certainly not perfect. But it is easy to use and has a wide adoption.
It seems MediaPipe can also be used with Python. But I am not sure whether there are limitations in regards to how you can implement it.

The only major piece needed AFAICS is to port the bodypix model to a calculator which can expose results (output_stream in mediapipe terms) and to make a calculator which sinks the results into v4l2loopback.

Why don't you use a separate simple face detection in the interim to build a proof-of-concept?

There's a nice little tutorial that I'm going to run through to test mp segmentation: https://towardsdatascience.com/custom-calculators-in-mediapipe-5a245901d595

It seems interesting. I am not sure whether I read that correctly: "The build will finish successfully in around 10–15 minutes"

but my tflite knowledge is extremely limited...

My TFLite knowledge is also very limited. I barely added support for that in tf-bodypix. Perhaps it helps you. But there are other example projects as well. I remember having come across a project using Flutter.

Once I get through the tutorial, building mp nodes for v4l2loopback and auto_track will be the next goals.
Unfortunately, these won't be python apps....

I apologise for suggesting this feature and then leaving this project. I like what you have done but unfortunately I see the GIL as being a fundamental blocker to performance. Especially when alternatives like mediapipe and gstreamer exist.

That is fine. I will try to check out MediaPipe more myself. And it is a good feature suggestion (just thinking whether this issue should be renamed to something like "architecture exploration / discussion").

I do want to stick with Python for now, I think. Unless the exposure to other languages was very limited.

I would be extremely happy to collaborate on porting bodypix to mediapipe. Google's new background segmentation model is much more accurate than bodypix but doesn't appear to produce separate masks for different body parts or estimate feature locations AFAICS

I am happy to share any limited knowledge I gained from using the bodypix model, or exchange ideas otherwise.
Perhaps you could make a start trying to integrate something in mediapipe. Would be happy to learn more about it.

https://ai.googleblog.com/2020/10/background-features-in-google-meet.html?m=1

https://drive.google.com/file/d/1lnP1bRi9CSqQQXUHa13159vLELYDgDu0/view

Did you happen to come across the actual model? (I can't seem to see it from the blog)
I'd be interested to try it in this project for example.

I think that the model is at the bottom of google-ai-edge/mediapipe#1460
The license might have been changed from Apache to Google's own, so I'm not sure if it's available for use in open source

Just a note that as part of #94 the internal representation has changed a bit. It is now closer to the nodes you described.
The config still has branches as before but they are converted to a flat list of layers with a list of input layers.
Still no named inputs or named auxiliary outputs.