Going Deep into Real-Time Human Action Recognition

A journey to discover the potential of state-of-the-art techniques

Hi all! My name is Eugenio Liso, Big Data Engineer @ Agile Lab, a remote-first R&D company located in Italy. Our main focus is to build Big Data and AI systems, in a very challenging — yet awesome — environment.

Thanks to Agile Lab, I attended a 2nd level Master’s Degree in Artificial Intelligence at the University of Turin. The Master’s thesis will be the main topic of this post. Enjoy!

Human Action Recognition: What does it mean?

Imagine that a patient is undergoing a rehabilitation exercise at home, and his/her robot assistant is capable of recognizing the patient’s actions, analyzing the correctness of the exercise, and preventing the patient from further injuries.

Such an intelligent machine would be greatly beneficial as it saves the trips to visit the therapist, reduces the medical cost, and makes remote exercise into reality¹.

With the term Human Action Recognition, we refer to the ability of such a system to temporally recognize a particular human action in a small video clip, without waiting for it to finish. This is different from Action Detection, since we do not locate where the action is spatially taking place.

Specifically, my task is to predict in real-time one out of three possible human actions: falling, running or walking.

The most careful readers will already have noticed the “movement-wise” difference between these three actions. Although the falling action has, potentially, very different “features” from the other two, and is therefore a good candidate to be one of the easiest to predict, the same cannot be said of walking and running. Indeed, these two actions share some common characteristics.

Since building ad hoc features for this kind of task can be expensive and complicated, a powerful tool comes to my rescue: Deep Neural Networks (DNNs).

In the following sections, I will provide some details about the DNNs I used to accomplish the task and, finally, we will discuss the results of my experiments.

3D-ResNeXt vs Two Stream Inflated 3D ConvNet (I3D)

After a careful and detailed review of the state of the art, I have decided to focus on two architecturally different but very promising DNNs: the 3D-ResNeXt and the Two Stream Inflated 3D ConvNet.

Luckily, the authors already provide the two networks pre-trained on Kinetics, a VERY large dataset that has proven to be an excellent starting point for training a newly built network from scratch on the Human Action Recognition task.


The first one is called 3D-ResNeXt, a convolution-based network (CNN) implemented in PyTorch. It is based on ResNet, a kind of network that introduces shortcut connections, which let a signal bypass one or more layers. For more details, refer to the authors’ paper².

This kind of network takes K 112×112 RGB frames as input, while its output is the predicted action. Regarding this, I have set K = 16 for two reasons:

  • Taking 16 frames as input is a reasonable default, since almost all architectures have that kind of time granularity
  • Almost all the videos used during the training phase are recorded at 30 FPS. Although the running or walking actions seem not to be highly impacted (since they can be roughly considered “cyclical”), I have empirically established that, by increasing the input frames, the falling action becomes “meaningless”, since 16 frames satisfactorily enclose its abruptness. It might also be a good idea to set K = 32 if the input videos were recorded at 60 FPS.
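As a rough illustration of the input described above, the sketch below cuts a video into non-overlapping clips of K = 16 frames and arranges each clip in the channels-first layout that PyTorch 3D convolutions expect. The random array standing in for decoded frames, the function name `make_clips` and the choice of dropping the incomplete tail are assumptions of this example, not details of the original code.

```python
import numpy as np

def make_clips(frames: np.ndarray, k: int = 16) -> np.ndarray:
    """Split (T, H, W, 3) frames into (N, 3, k, H, W) clips, dropping the incomplete tail."""
    n = frames.shape[0] // k
    clips = frames[: n * k].reshape(n, k, *frames.shape[1:])  # (N, k, H, W, 3)
    return clips.transpose(0, 4, 1, 2, 3)                     # (N, 3, k, H, W)

# Stand-in for a decoded 112x112 RGB video of 70 frames (assumed data).
video = np.random.rand(70, 112, 112, 3).astype(np.float32)
clips = make_clips(video)
print(clips.shape)  # (4, 3, 16, 112, 112)
```

At 30 FPS each clip covers roughly half a second of video, which is the time window the network sees for one prediction.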

An overview on how 3D-ResNeXt (and more generally, the 3D-ConvNets) receives its inputs.


The main building block of the 3D-ResNeXt network. The ResNeXt block is simply built from a concatenation of 3D-Convolutions.

Two Stream Inflated 3D ConvNet (I3D)

The other network is called Two Stream Inflated 3D ConvNet (from now on we will call it I3D for the sake of brevity). It is implemented by DeepMind in Sonnet, a library built on top of TensorFlow. For more details, please refer to the authors’ paper³.

The I3D, differently from the 3D-ResNeXt, uses two kinds of inputs:

  • K RGB frames, like the 3D-ResNeXt, but their size is 224×224
  • Optical Flow frames with size 224×224. The Optical Flow describes the motion between two consecutive frames caused by the movement of an object or camera

An overview on how the I3D network (and more generally, the 3D-ConvNets) receives its inputs.
To the left side, an example of Optical Flow extracted from the RGB clip on the right side. These guys are playing Cricket, but I do not really know the rules. Courtesy of DeepMind.

Fine-Tuning with Online Augmentation

Before stepping into the juicy details about the employed datasets and the final results — don’t be in a hurry! — I would like to focus on the Augmentation, a fundamental technique used when training a CNN. Speaking broadly, during the training phase (on every mini-batch), we do not give the exact same input to the CNN (in my case, a series of frames), but a slightly modified version of it. This increases the CNN’s capability to generalize better on unseen data and decreases the risk of overfitting.

Of course, one could take a dataset, create modified frames from the original ones and save them to disk for later use. Instead, I use this technique during the training phase, without materializing any data. In this case, we talk about Online Augmentation.

The “units” or “modules” responsible for producing a frame similar to the original one are called filters: they can be compared to a function that takes an input image i and produces an output image i’. For this task, I have chosen 7 different filters:

  • Add and Multiply: add or multiply a random value on every pixel’s intensity
  • Pepper and Salt: common Computer Vision transformations that set randomly chosen pixels to black/white
  • Elastic deformation⁴
  • Gaussian blur, which blurs the image with a Gaussian function
  • Horizontal flip, which flips an image horizontally

These filters can be applied in three different ways:

  • Only one (randomly chosen) filter f is applied to the original image i, producing a newly-created frame f(i) = i’
  • Some of the filters are randomly chosen (along with their order) and applied to the original frame. This can be thought of as a “chained” function application: fₙ(…(f₂(f₁(i)))) = iⁿ
  • All the filters (their order is random) are applied sequentially on the input image
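The three application modes can be sketched with NumPy as follows. This is a minimal illustration, not the code used in the thesis: only five of the seven filters are included (Elastic deformation and Gaussian blur are left out to avoid extra dependencies), and the value ranges, the probability `p` and the mode names `one`, `some` and `all` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def add(img):                 # shift every pixel intensity by one random value
    return np.clip(img + rng.integers(-40, 41), 0, 255)

def multiply(img):            # scale every pixel intensity by one random factor
    return np.clip(img * rng.uniform(0.7, 1.3), 0, 255)

def salt(img, p=0.02):        # set randomly chosen pixels to white
    out = img.copy()
    out[rng.random(img.shape[:2]) < p] = 255
    return out

def pepper(img, p=0.02):      # set randomly chosen pixels to black
    out = img.copy()
    out[rng.random(img.shape[:2]) < p] = 0
    return out

def hflip(img):               # mirror the image horizontally
    return img[:, ::-1]

FILTERS = [add, multiply, salt, pepper, hflip]

def augment(img, mode="one"):
    """Apply one, some or all filters (in random order) to an (H, W, 3) uint8 image."""
    order = rng.permutation(len(FILTERS))
    if mode == "one":
        order = order[:1]
    elif mode == "some":
        order = order[: rng.integers(1, len(FILTERS) + 1)]
    elif mode != "all":
        raise ValueError(f"unknown mode: {mode}")
    out = img.astype(np.float32)
    for index in order:
        out = FILTERS[index](out)     # chained application f_n(...f_1(i))
    return out.astype(np.uint8)

frame = np.full((112, 112, 3), 128, dtype=np.uint8)
augmented = augment(frame, mode="some")
```

Because the filters are drawn again on every call, each mini-batch sees a slightly different version of the same frame and nothing is written to disk.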

This GIF represents a man doing push-ups. This is our starting video: on it, we will apply the filters described above. Taken from here.




Starting from the upper left side, and going in order from left to right: Salt, Pepper, Multiply, Add, Horizontal Flip, Gaussian Blur, Elastic Deformation


Test Set for the evaluation of the (pre-trained) 3D-ResNeXt and I3D

To evaluate the two chosen pre-trained NNs, I have extracted a subset of the Kinetics Test Set. This subset has been built by sampling 10 videos for each class among the 400 available classes, resulting in 4000 different videos.

Training/Test Set for the Fine-Tuned 3D-ResNeXt

The dataset employed to fulfill my original task is created by combining different datasets that can be found online.

Total distribution of the dataset used during Fine-Tuning

The Test Set is built from the overall dataset by taking, for each class, roughly 10% of the total available samples, while the remaining 90% represents the Training Set (as shown below).
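A per-class 90/10 split of this kind can be sketched as below. The sample identifiers and class counts are made up for the example, and `stratified_split` is a hypothetical helper rather than the script actually used.

```python
import random
from collections import defaultdict

def stratified_split(samples, test_fraction=0.1, seed=0):
    """samples: list of (video_id, label). Returns (train, test), split class by class."""
    by_class = defaultdict(list)
    for video_id, label in samples:
        by_class[label].append(video_id)
    rng = random.Random(seed)
    train, test = [], []
    for label, ids in by_class.items():
        rng.shuffle(ids)
        n_test = max(1, round(len(ids) * test_fraction))  # keep at least one test sample
        test += [(v, label) for v in ids[:n_test]]
        train += [(v, label) for v in ids[n_test:]]
    return train, test

# Hypothetical class counts, chosen only to show an unbalanced dataset.
counts = {"falling": 50, "running": 200, "walking": 300}
samples = [(f"{label}_{i}", label) for label, n in counts.items() for i in range(n)]
train, test = stratified_split(samples)
print(len(train), len(test))  # 495 55
```

Splitting class by class keeps the minority class represented in the Test Set in the same proportion as in the Training Set.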

Distribution’s summary of the dataset used during Fine-Tuning

It is clear that we are looking at an unbalanced dataset. Since I do not have more data, to mitigate this problem, I will try to use a loss function called Weighted Cross Entropy, which we will discuss later on.

Manually-tagged Dataset for the Fine-Tuned 3D-ResNeXt

Finally, I have built and manually annotated a dataset consisting of some videos publicly available from YouTube.

Summary of the manually-tagged dataset. I can assure you that finding decent quality videos is — perhaps — more complicated than being admitted to the NeurIPS conference.

For this tagging-task, I have used a well-known tool called Visual Object Tagging Tool, an open source annotation and labeling tool for image and video assets. An example is shown below.


Tagging a video with the label falling. Tagging is mainly done in two sequential phases: first, tag the initial frame with the desired label; then, tag the final frame with a dummy tag, simulating the end of that action. Last, parse those “temporally-cut ranges” with a script, take the label (from the initial frame) and save the resulting clip. Et voilà!
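The parsing step described in the caption above can be sketched as follows. The annotation format (a list of (frame, tag) pairs) and the dummy tag name `end` are assumptions for illustration, not the actual export format of the Visual Object Tagging Tool.

```python
def ranges_from_tags(tags, end_tag="end"):
    """Turn (frame_index, tag) annotations into (start_frame, end_frame, label) clips."""
    clips, start, label = [], None, None
    for frame, tag in sorted(tags):
        if tag == end_tag:
            if start is not None:          # close the currently open action
                clips.append((start, frame, label))
                start, label = None, None
        else:                              # a real label opens a new action
            start, label = frame, tag
    return clips

annotations = [(10, "falling"), (42, "end"), (100, "walking"), (260, "end")]
print(ranges_from_tags(annotations))
# [(10, 42, 'falling'), (100, 260, 'walking')]
```

Each resulting range can then be cut from the source video and saved as a clip named after its label.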

Performance Analysis

Metrics and Indicators

For most of the results presented below, several metrics and indicators will be reported, including precision, recall, F-Score and prediction time.

Pre-trained 3D-ResNeXt vs Pre-trained I3D on Kinetics

The purpose of this test is to have a first, rough overview. The only class we are interested in is jogging, while all the other labels represent actions that are completely different and useless for our goal.⁶

A first, rough, comparison between 3D-ResNeXt and I3D on the jogging class

Given this subset of data, the obtained results show that:

  • the I3D seems to have better overall performance than the 3D-ResNeXt. This confirms its superiority, in line with the results in the authors’ paper
  • the prediction time⁵ of the 3D-ResNeXt is competitive, while the I3D is too slow for a Real-Time setting, since its architecture is significantly heavier and more complicated

At this point it is clear that, for our task, the 3D-ResNeXt is a better choice because:
– it is faster and almost capable of sustaining the speed of a 30 FPS video
– it has a simpler network architecture
– it does not need additional preprocessing or different inputs, since it uses only RGB frames

Fine-Tuned 3D-ResNeXt network’s performance on Test Set

Before stepping into the various parameter configurations, to take into account the Training Set’s unbalance, I use a different Loss Function called Weighted Cross Entropy (WCE): instead of giving each class the same weight, I use the Inverse Class Frequency to ensure the network gives more importance to errors on the minority class falling.

So, consider a set of classes C = {c₁ , …, cₙ}. The frequency fᵢ of a class cᵢ ∈ C represents the number of samples of that class in a dataset. The associated weight ICFᵢ ∈ [0, 1] is defined as:
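One normalization that keeps every weight in [0, 1], assumed here for illustration since other variants exist, is ICFᵢ = (1/fᵢ) / Σⱼ (1/fⱼ). The NumPy sketch below computes these weights and a weighted cross entropy that divides by the sum of the weights of the targets, which is the convention PyTorch's `CrossEntropyLoss` follows when a `weight` tensor is passed. The class counts are hypothetical.

```python
import numpy as np

def inverse_class_frequency(counts):
    """Weights proportional to 1/f_i, normalized so that they sum to 1."""
    inv = 1.0 / np.asarray(counts, dtype=np.float64)
    return inv / inv.sum()

def weighted_cross_entropy(logits, targets, weights):
    """Mean cross entropy where each sample is weighted by the weight of its true class."""
    shifted = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(targets)), targets]
    w = weights[targets]
    return float((w * per_sample).sum() / w.sum())

# Hypothetical counts for falling, running, walking.
weights = inverse_class_frequency([100, 400, 500])
logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0]])
targets = np.array([1, 1, 2])   # the first sample is a misclassified running clip
print(weights, weighted_cross_entropy(logits, targets, weights))
```

With these counts the falling class receives the largest weight, so a mistake on a falling sample costs more than the same mistake on a walking sample.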

In the table below, I have reported almost all the training experiments I’ve been able to carry out during my thesis.

The parameters for the different Training runs of the 3D-ResNeXt (using Fine-Tuning). OA means Online Augmentation, while LR denotes the Learning Rate (the starting value is 0.01 and, at each defined Scheduler Step, i.e. a given training epoch, it is decreased by multiplying it by 0.1). It should be noted that I’ve tried some other parameter configurations, but these represent the best runs I’ve been able to analyze. As a related example, an interesting result is that, when using the OA in All mode, the network shows a lower performance: this could be caused by an overly distorted input produced during the Online Augmentation.
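The step-wise Learning Rate decay described in the caption can be written in a few lines. The step epochs below are hypothetical, since the actual values live in the table; the behaviour mirrors what PyTorch's `MultiStepLR` scheduler does with `gamma=0.1`.

```python
def learning_rate(epoch, steps, base_lr=0.01, gamma=0.1):
    """LR at a given epoch: base_lr multiplied by gamma once per scheduler step already reached."""
    n_decays = sum(epoch >= s for s in steps)
    return base_lr * gamma ** n_decays

few_steps = [40]            # hypothetical schedule with a single decay
many_steps = [10, 20, 30]   # hypothetical schedule with frequent decays
for epoch in (0, 15, 45):
    print(epoch, learning_rate(epoch, few_steps), learning_rate(epoch, many_steps))
```

With frequent steps the LR shrinks by three orders of magnitude within 30 epochs, which leaves the network little room to keep learning in later epochs.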

The really big question before this work was: using the little data available for the falling class, will the network be able to reach an acceptable performance? Given the parameters above, the obtained results are summarized in the chart below.

The performance of my Fine-Tuned networks on the Test Set

What can I say about that? The results seem pretty good: the network understands when someone is walking or running, while it is more doubtful when predicting falling. But, aside from that, let us focus on the contoured blue and red rectangles: red denotes training runs with Cross Entropy, while blue denotes training runs with Weighted Cross Entropy.

This chart highlights some interesting findings:

  • When using Cross Entropy, the precision on the falling and running classes is noticeably better and, overall, the runs with Cross Entropy behave better, with balanced precision and recall that are rewarded by the F-Score metric
  • When using Weighted Cross Entropy, the precision on the minority class falling drops dramatically, while its recall increases significantly. During the evaluation, this translates into more frequent (and incorrect) falling predictions
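The trade-off between precision and recall on the falling class can be made concrete with a toy example. The labels and predictions below are invented: a “cautious” model predicts falling rarely but correctly, while an “eager” model predicts it often and is frequently wrong.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F-Score (F1) from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["falling"] * 4 + ["walking"] * 6
cautious = ["falling"] * 2 + ["walking"] * 8   # finds 2 of 4 falls, no false alarms
eager = ["falling"] * 8 + ["walking"] * 2      # finds all falls, 4 false alarms

print(precision_recall_f1(y_true, cautious, "falling"))  # precision 1.0, recall 0.5
print(precision_recall_f1(y_true, eager, "falling"))     # precision 0.5, recall 1.0
```

Both toy models end up with the same F-Score, which is why the F-Score alone does not say which behaviour suits a given application.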

TOP-3 Fine-Tuned NNs on the Manually-Tagged Dataset

Last, but not least (well, these are probably the most interesting and awaited results, am I right?), I have carried out a final test that aims to assess the capability of the Fine-Tuned NNs to generalize on unseen and real data. I have chosen the three best Fine-Tuned networks (their IDs are T1, T7 and T8), taken from the previous experiment, and I have tested them on my manually-tagged dataset.

Drumroll... What will be the winning run? Will the nets be able to defeat overfitting, or will they be the victims? See you in the next episode... I’m kidding! You can see them below!

Performance of the three best training runs (T1 — T7 — T8) on the manually-tagged dataset

Finally, here they are! I’m actually surprised about them, since:

  • The networks generalize quite well on the new manually-tagged dataset. I did not expect such good results! A stable F-Score of 0.8 is reached for the running and walking classes, with a satisfactory performance on the falling class
  • Using Online Augmentation produces a positive effect on the recall of the minority class falling (T7 vs. T8)
  • Less frequent use of LR Steps leads to a better overall performance. Indeed, the recall on the falling class (T1 vs. T8) has improved noticeably
  • The three networks have a VERY HIGH precision on the falling class (T7 is never wrong!) but, unfortunately, they have a mediocre recall: they are very confident on the aforementioned label, but more than half of the time they cannot recognize a falling sample

What have I learned? What have YOU learned?

Finally, we have come to the end of our tortuous but satisfying journey. I admit it: it was tough! I hope this adventure has been to your liking. Thank you for reading; I leave here some useful take-home messages and future developments:

  • The 3D-ResNeXt network is better suited for the Real-Time Human Action Recognition task since it has a dramatically faster prediction time than the I3D
  • The CE and WCE produce radically different results on the minority class falling, while they do not change so much the other two classes: the first produces a well-balanced precision and recall, the latter causes the trained network to predict more frequently the minority class, increasing its recall at the expense of its precision
  • The Online Augmentation helps the network to not overfit on training data
  • An LR that changes more rarely is to be preferred to an LR that changes more frequently
  • Speaking broadly, the result that has emerged during training is not an “absolute truth”: there are some cases where the task requires a higher recall rather than a higher precision. Indeed, based on it, we might prefer a network that predicts falling:
    * more precisely but more rarely (more precision, less recall)
    * less precisely but more frequently (more recall, less precision).

Future Developments

  • The dataset is probably too unbalanced against the minority class. More high-quality data
    (not necessarily only for one class) could dramatically increase the network’s performance
  • The WCE has enormously increased the recall of the minority class. It could be interesting to investigate this behaviour further, or to try another weighting method that reduces the loss weight on the minority class. This could potentially make the network predict the class falling less frequently, thus increasing its precision while keeping the benefit of a higher recall
  • The predictions made by the network could be joined with another system that is able to detect people (whose task is known as People Detection), in order to give a more precise and “bounded” prediction. However, this could require a different dataset, where the input videos contain only the relevant information (e.g. a man or woman falling with minimal or no background or environment)


[1] Yu Kong and Yun Fu, “Human Action Recognition and Prediction: A Survey”, Journal of LaTeX Class Files, vol. 13, no. 9, September 2018.

[2] Kensho Hara, Hirokatsu Kataoka and Yutaka Satoh, “Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?”, 2017.

[3] João Carreira and Andrew Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset”, 2017.

[4] Simard, Steinkraus and Platt, “Best Practices for Convolutional Neural Networks applied to Visual Document Analysis”, in Proc. of the International Conference on Document Analysis and Recognition, 2003.

[5] The prediction time does not take into account the preprocessing time. For 3D-ResNeXt, it is minimal, while for I3D, it is much more lengthy due to the calculation of the optical flow for each input video. As an example, with my specs, 10 seconds of a single video need about 1 minute of preprocessing.

[6] It’s completely fine if you think that this test is simplistic and not very accurate. Due to time and space limitations (those videos were literally eating my free disk space), and also because this is completely out-of-scope, I managed to run these tests only on a small portion of test data.

Written by Eugenio Liso – Agile Lab Big Data Engineer

If you found this article useful, take a look at our blog and follow us on our Medium Publication, Agile Lab Engineering!

An Introduction to Deep Generative Models for Computer Vision – Online Meetup – 11th June 2020 | 6.30PM

Our next meetup, planned on 11th June 2020, will cover two of the most famous and recent generative models, Variational Autoencoders and Generative Adversarial Networks, at an introductory level with the main mathematical insights. The theory behind the techniques and how they work will be presented, together with practical applications and the relative code.

The event will consist of two presentations and a final session for questions and technical time.

First Talk: Variational Autoencoders (VAEs) by Marco Odore, Riccardo Fino and Stefano Samele, Data Scientists at Agile Lab, with computer science, statistics and mathematics backgrounds respectively.

This talk shows how to train a Variational Autoencoder on an image dataset and gives a glimpse of its interpretation through visual arithmetic.

Second Talk: Generative Adversarial Networks (GANs), by Luca Ruzzola, a deep learning researcher at boom.co, working on applied Deep Learning for photo enhancement.

In this talk we’re gonna go back through the development of Generative Adversarial Networks (GANs), the challenges in using them and the amazing results that they can produce, from an introductory perspective.



3D Pose Estimation and Tracking from RGB-D

Hi everyone, this is my first article so I am going to introduce myself.

My name is Lorenzo Graziano and I work as Data Engineer at Agile Lab, an Italian company focused on scalable technologies and AI in production.

Thanks to Agile Lab, during the last two years I attended a 2nd level Master’s Degree in Artificial Intelligence at the University of Turin. In this post, I am going to present part of the work done during this awesome experience.


The project described in this post falls within the context of computer vision, the Artificial Intelligence branch that deals with high-level understanding of images and video sequences.

In many real-world applications, such as autonomous navigation, robotic manipulation and augmented reality, it is essential to have a clear idea of how the environment is structured. In particular, recognizing the presence of objects and understanding how they are positioned and oriented in space is extremely important. In fact, this information enables the interaction with the environment, which otherwise would be ineffective.

In the context of autonomous navigation, pose estimation is fundamental for planning a navigation path and enabling collision avoidance functions

A robot assembles mechanical parts using the pose estimation of the manipulated objects

An augmented reality application exploiting orientation and position of the machine parts for e-learning purposes.

The goal of this project is to create an architecture capable of monitoring the position and orientation of different instances of objects belonging to different classes within a video sequence bearing depth information (RGB-D signal). The result will be a framework suitable for use in different types of applications based on artificial vision.

Pose Estimation: Dense Fusion

The central part of the architecture is clearly the one intended for the 6D pose estimation. After evaluating the performance of various solutions available in the literature, Dense Fusion, from the paper DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion, was chosen as the core architecture module.

Under adversarial conditions (e.g. heavy occlusion, poor lighting, …), it is only possible to estimate the pose of a known object by combining the information contained in the color and depth image channels. The two sources of information, however, reside in different spaces. The main technical challenge in this domain is to extract and fuse features from heterogeneous data sources. For this reason, Dense Fusion has a heterogeneous architecture that processes color and depth information differently, retaining the native structure of each data source, and a dense pixel-wise fusion network that performs color-depth fusion by exploiting the intrinsic mapping between the data sources. Finally, the pose estimation is further refined with a differentiable iterative refinement module.

High-level overview of Dense Fusion architecture

The first stage performs a semantic segmentation for each known object category. For each segmented object, we feed the masked depth pixels converted to a 3D point cloud and the image patch cropped by the bounding box of the mask to the second stage.
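The conversion of masked depth pixels into a 3D point cloud follows the standard pinhole camera model. The sketch below is a minimal NumPy version; the intrinsic parameters `fx`, `fy`, `cx`, `cy` and the toy depth map are placeholder values, not those of the sensor used in the project.

```python
import numpy as np

def masked_depth_to_points(depth, mask, fx, fy, cx, cy):
    """Back-project the valid depth pixels inside a boolean mask to (N, 3) camera-frame points."""
    v, u = np.nonzero(mask & (depth > 0))   # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

depth = np.full((4, 4), 2.0)                # toy depth map, 2 units everywhere
mask = np.zeros((4, 4), dtype=bool)
mask[1, 2] = True                           # a single segmented pixel
points = masked_depth_to_points(depth, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(points)  # [[4. 2. 2.]]
```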

The second stage processes the results of the segmentation and estimates the object’s 6D pose. It is composed of different sub-components:

1. Dense Color Embedding Network: CNN-based encoder-decoder architecture that maps an image of size H × W × 3 into an H × W × d_rgb embedding space.

2. Geometric Embedding Network: variant of PointNet architecture that processes each point in the masked 3D point cloud, converted from segmented depth pixels, to a geometric feature embedding.

3. Pixel-wise Fusion Network: In this module the geometric feature of each point is associated to its corresponding image feature pixel based on a projection onto the image plane using the known camera intrinsic parameters.

4. Pose Estimation Network: it takes as input a set of per-pixel features and predicts one 6D pose for each of them. The loss to minimize for the prediction of each dense-pixel is defined as:

L_i^p = \frac{1}{M} \sum_{j} \left\| (R x_j + t) - (\hat{R}_i x_j + \hat{t}_i) \right\|

where x_j denotes the j-th point of the M randomly selected 3D points from the object’s 3D model, p = [R|t] is the ground truth pose, where R is the rotation matrix and t is the translation vector, and \hat{p}_i = [\hat{R}_i|\hat{t}_i] is the predicted pose generated from the fused embedding of the i-th dense-pixel. The above loss function is only well defined for asymmetric objects; for symmetric objects the loss function becomes:

L_i^p = \frac{1}{M} \sum_{j} \min_{0 < k < M} \left\| (R x_j + t) - (\hat{R}_i x_k + \hat{t}_i) \right\|

As we would like our network to learn to balance the confidence among the per dense-pixel predictions, we weight the per dense-pixel loss with the dense-pixel confidence, and add a second confidence regularization term:

L = \frac{1}{N} \sum_{i} \left( L_i^p c_i - w \log(c_i) \right)

where N is the number of randomly sampled dense-pixel features from the P elements of the segment, c_i is the confidence predicted for the i-th dense-pixel and w is a balancing hyperparameter. We use the pose estimate with the highest confidence as the final output.

5. Iterative self-refinement methodology: The pose residual estimator network is trained to perform the refinement given the initial pose estimation from the main network
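The per dense-pixel losses of step 4 can be sketched in NumPy as below: the asymmetric loss compares each model point with its own counterpart, the symmetric one with the closest predicted point, and the total loss weights each prediction by its confidence. This is an illustrative re-implementation for a single dense-pixel, not the training code of Dense Fusion, and the default value of `w` is only a placeholder.

```python
import numpy as np

def transform(points, R, t):
    """Apply the rigid pose [R|t] to (M, 3) model points."""
    return points @ R.T + t

def asymmetric_loss(points, R, t, R_hat, t_hat):
    """Mean distance between each point under the true pose and under the predicted pose."""
    diff = transform(points, R, t) - transform(points, R_hat, t_hat)
    return float(np.linalg.norm(diff, axis=1).mean())

def symmetric_loss(points, R, t, R_hat, t_hat):
    """Mean distance from each true-pose point to the closest predicted-pose point."""
    gt, pred = transform(points, R, t), transform(points, R_hat, t_hat)
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # (M, M)
    return float(dists.min(axis=1).mean())

def total_loss(per_pixel_losses, confidences, w=0.015):
    """Confidence-weighted loss with the -w*log(c) regularization term."""
    l, c = np.asarray(per_pixel_losses), np.asarray(confidences)
    return float(np.mean(l * c - w * np.log(c)))

# A two-point object that looks identical after a 180 degree turn about z.
points = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
identity, zero = np.eye(3), np.zeros(3)
half_turn = np.diag([-1.0, -1.0, 1.0])
print(asymmetric_loss(points, identity, zero, half_turn, zero))  # 2.0
print(symmetric_loss(points, identity, zero, half_turn, zero))   # 0.0
```

The example shows why the asymmetric loss is ill-suited to symmetric objects: a pose that is visually indistinguishable from the ground truth is still heavily penalized.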

Dense Fusion limitations

Keeping in mind the objective of this project, that is the 6D pose estimation and tracking of different instances of objects from RGB-D signal, we have identified some points on which to intervene in order to obtain the desired result.

We realized that the first stage of this architecture (the semantic segmentation module) implies that only one object per class can be taken into consideration in the same image. It is therefore necessary to revise the object segmentation module in order to handle different instances of the same object in a single frame.

It is also essential to overcome the lack of a system for tracking object poses over time.

Object segmentation module

The object segmentation module proposed in the original Dense Fusion architecture is based on semantic segmentation. This means that the output generated by this module does not allow us to distinguish the different instances of the objects in which we are interested. We are only able to predict the class each pixel belongs to, not the instance it belongs to (as can be seen from the image below).

How can we correctly estimate the position and orientation of the purple cubes if we are unable to distinguish them from each other?

To overcome this problem we have decided to employ a neural network capable of providing instance segmentation of the input. The architecture we have chosen is Mask R-CNN because it is currently the state of the art in the instance segmentation task on different datasets.

Mask R-CNN architecture

Mask R-CNN (Mask Region-based Convolutional Neural Network) is structured in two stages. First, it generates proposals for the regions of the input image where an object could be. Second, it predicts the class of the object, refines the bounding box and generates a pixel-level mask of the object based on the first-stage proposal. Both stages are connected to the backbone structure: an FPN (Feature Pyramid Network) style deep neural network.

Mask R-CNN gives us the possibility to treat each object instance separately

Multiple Object Tracking

In order to remedy the lack of a system for tracking objects over time, we have decided to integrate a Multiple Object Tracking module. Deep SORT (Simple Online and Realtime Tracking with a deep association metric) is a tracking algorithm based on SORT, and it has been chosen for its remarkable results in the Multiple Object Tracking (MOT) problem. SORT exploits a combination of familiar techniques such as the Kalman Filter and the Hungarian algorithm for the tracking components and achieves good performance in terms of tracking precision and accuracy, but it also returns a high number of identity switches.

The main idea of Deep SORT is to use some off-the-shelf model for object detection and then plug the results into the SORT algorithm with deep association metric that matches detected objects across frames.

In Deep SORT, the predicted states from the Kalman filter and the newly detected boxes in the current frame must be associated; this task is solved using the Hungarian algorithm. Motion and appearance information are integrated into this assignment problem through a combination of two metrics:

1. Motion Information: obtained using the squared Mahalanobis distance between predicted Kalman states and newly arrived measurements.

2. Appearance Information: for each bounding box detection dj an appearance descriptor rj is computed. This metric measures the smallest cosine distance between the i-th track and j-th detection in appearance space.

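The resulting metric is a weighted sum of the two, where λ controls the relative weight of the motion term:

c_{i,j} = \lambda \, d^{(1)}(i,j) + (1 - \lambda) \, d^{(2)}(i,j)

Here d^{(1)}(i,j) is the squared Mahalanobis distance and d^{(2)}(i,j) is the smallest cosine distance between the i-th track and the j-th detection.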
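The two distances and their λ-weighted combination can be sketched with NumPy as below. To keep the example dependency-free, an exhaustive search over permutations stands in for the Hungarian algorithm (it returns the same optimal matching on small matrices, but does not scale); all function names and the toy values are assumptions of this sketch.

```python
import itertools
import numpy as np

def motion_cost(track_mean, track_cov, detections):
    """Squared Mahalanobis distance between one predicted state and (D, n) measurements."""
    diff = detections - track_mean
    return np.einsum("dn,nm,dm->d", diff, np.linalg.inv(track_cov), diff)

def appearance_cost(gallery, descriptors):
    """Smallest cosine distance between a track's (K, F) gallery and (D, F) unit descriptors."""
    return (1.0 - descriptors @ gallery.T).min(axis=1)

def combined_cost(d_motion, d_appearance, lam):
    """Weighted sum: lam weights motion, (1 - lam) weights appearance."""
    return lam * d_motion + (1.0 - lam) * d_appearance

def best_assignment(cost):
    """Minimum-cost track/detection matching by brute force (stand-in for the Hungarian algorithm)."""
    n = cost.shape[0]
    best = min(itertools.permutations(range(n)),
               key=lambda perm: sum(cost[i, perm[i]] for i in range(n)))
    return list(enumerate(best))

cost = np.array([[1.0, 10.0], [10.0, 1.0]])   # toy 2 tracks x 2 detections cost matrix
print(best_assignment(cost))                   # [(0, 0), (1, 1)]
```

Lowering `lam` shifts the decision towards appearance, which helps when motion is hard to predict, for example with a low frame rate or a moving camera.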

Architecture overview

The source of information in this architecture is an RGB-D sensor. At each time instant t our RGB-D camera produces an RGB image and a depth information matrix (related to the distance to the sensor) on a per-pixel basis.

The first step of our architecture is represented by our Mask R-CNN network, dedicated to instance segmentation. At each time instant t an RGB signal is sent to the Mask R-CNN network. From this signal the network provides different information: Instance Mask Segmentations, Bounding Boxes and Feature Maps. Then the architecture splits into two distinct branches: one used for the pose estimation task and one needed for multiple object tracking.

Pose Estimation branch

The instance segmentation is used to obtain a crop of the starting image for each identified object instance and a mask of the input depth signal. Then the information concerning the color, depth and classification of the object is propagated to the Dense Fusion network, which uses it to estimate the 6DoF pose of each object segmented by Mask R-CNN.
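Preparing the inputs for one object instance can be sketched as follows: the RGB image is cropped to the bounding box of the instance mask, and the depth map is zeroed outside the mask. The helper name `crop_instance` and the toy arrays are assumptions of this example.

```python
import numpy as np

def crop_instance(rgb, depth, mask):
    """Return the RGB patch inside the mask's bounding box and the masked depth map."""
    rows = np.nonzero(mask.any(axis=1))[0]
    cols = np.nonzero(mask.any(axis=0))[0]
    r0, r1, c0, c1 = rows[0], rows[-1], cols[0], cols[-1]
    patch = rgb[r0:r1 + 1, c0:c1 + 1]
    masked_depth = np.where(mask, depth, 0.0)   # keep depth only on the instance
    return patch, masked_depth

rgb = np.zeros((6, 8, 3), dtype=np.uint8)
depth = np.ones((6, 8))
mask = np.zeros((6, 8), dtype=bool)
mask[2:4, 3:7] = True                           # one instance covering 2 x 4 pixels
patch, masked_depth = crop_instance(rgb, depth, mask)
print(patch.shape, masked_depth.sum())          # (2, 4, 3) 8.0
```

Running this once per mask returned by the instance segmentation is what allows several instances of the same class to be handled separately.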

Multiple Object Tracking branch

The MOT branch uses the output provided by Mask R-CNN at the time instant t and at the previous time instant t − 1. Each bounding box for frame t − 1 is sent to the Deep SORT Estimation module. Within this module each detection is represented through a state vector. Its task is to model the states associated with each bounding box as a dynamic system; this is achieved by using the Kalman Filter. Predicted states from the Deep SORT Estimation module and detected bounding boxes from time instant t are associated inside the Deep SORT Data Association module. Inside this module, motion and appearance information are integrated in order to obtain high-performance Multiple Object Tracking.
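The predict/update cycle of the Kalman Filter can be illustrated on a deliberately simplified state: one coordinate of a bounding box center and its velocity. Deep SORT uses a larger state vector; the matrices and noise values below are assumptions chosen only to keep the sketch short.

```python
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])  # constant-velocity motion model, one frame per step
H = np.array([[1.0, 0.0]])              # only the position is measured
Q = 0.01 * np.eye(2)                    # process noise (assumed)
R = np.array([[1.0]])                   # measurement noise (assumed)

def predict(x, P):
    """Propagate the state and its covariance from frame t-1 to frame t."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the predicted state with the measurement z taken at frame t."""
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(2) - K @ H) @ P
    return x_new, P_new

x, P = np.array([0.0, 2.0]), np.eye(2)   # position 0, velocity 2 pixels per frame
x_pred, P_pred = predict(x, P)
x_new, P_new = update(x_pred, P_pred, np.array([2.0]))
print(x_pred, x_new)
```

The predicted state and its innovation covariance are the quantities that the Mahalanobis distance in the data association step is computed from.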

Proposed Architecture overview


Our experiments were conducted on the popular LineMOD dataset. The system consists of several interconnected modules, and it is therefore useful to evaluate their performance on the individual tasks.

Pose Estimation results

In order to evaluate the 6DoF pose estimation results, we computed the average distance m between the points of the 3D model M transformed with the ground truth pose and the same points transformed with the estimated pose. We say that the model was correctly detected and the pose correctly estimated if k_m · d ≥ m, where k_m is a chosen coefficient (0.1 in our case) and d is the diameter of M. The results on LineMOD are shown in the table below, together with the results of other popular methods.
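The acceptance test can be sketched as below, taking the diameter d as the largest distance between any two model points. The unit cube used as a model and the helper names are assumptions of this example; with a diameter of about 1.73, the threshold is roughly 0.17.

```python
import itertools
import numpy as np

def average_distance(model, R, t, R_hat, t_hat):
    """Average distance m between model points under the true and the estimated pose."""
    diff = (model @ R.T + t) - (model @ R_hat.T + t_hat)
    return float(np.linalg.norm(diff, axis=1).mean())

def diameter(model):
    """Largest pairwise distance between model points."""
    diffs = model[:, None, :] - model[None, :, :]
    return float(np.linalg.norm(diffs, axis=2).max())

def pose_is_correct(model, R, t, R_hat, t_hat, k_m=0.1):
    """Accept the estimate when m does not exceed k_m times the model diameter."""
    return average_distance(model, R, t, R_hat, t_hat) <= k_m * diameter(model)

cube = np.array(list(itertools.product([0.0, 1.0], repeat=3)))   # 8 corners of a unit cube
eye, zero = np.eye(3), np.zeros(3)
print(pose_is_correct(cube, eye, zero, eye, np.array([0.1, 0.0, 0.0])))  # True
print(pose_is_correct(cube, eye, zero, eye, np.array([0.3, 0.0, 0.0])))  # False
```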

Quantitative evaluation of 6D pose methods on the LineMOD dataset

Dense fusion input: RGB channels
Dense fusion input: Depth channel
Input image with the overlapped prediction. The point cloud (red dots) is transformed with the predicted pose and then projected to the 2D image frame

Instance Segmentation results

In order to evaluate the results obtained on our test set, we decided to use the mAP (Mean Average Precision), a metric commonly employed for evaluating object detectors. The mAP corresponds to the mean of the average precision over all classes.
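As a simplified illustration, the sketch below computes a non-interpolated average precision per class from detections sorted by confidence, then averages over classes. Benchmarks differ in the IoU thresholds and interpolation they use, so this is one possible variant rather than the exact protocol behind the reported numbers; the class names and hit lists are invented.

```python
def average_precision(hits, n_ground_truth):
    """hits: True/False per detection, sorted by descending confidence (True = matches an object)."""
    true_positives, precision_sum = 0, 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            precision_sum += true_positives / rank   # precision at this recall level
    return precision_sum / n_ground_truth if n_ground_truth else 0.0

def mean_average_precision(per_class):
    """per_class: {class_name: (hits, n_ground_truth)}."""
    aps = [average_precision(hits, n_gt) for hits, n_gt in per_class.values()]
    return sum(aps) / len(aps)

detections = {
    "eggbox": ([True, False, True], 2),   # one false positive ranked second
    "glue": ([True, True], 2),            # perfect ranking
}
print(mean_average_precision(detections))
```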

mAP using Mask R-CNN on LineMOD dataset
example of instance masks obtained with our Mask R-CNN, from the left to the right: hole p., bench vi., eggbox, glue

MOT results

We chose two different metrics to evaluate performance in the MOT task:

  1. Switch Id: An ID switch occurs when the mapping switches from the previously assigned track ID to a different one.
  2. Fragment: A fragmentation is counted in a sequence of frames when an object is tracked, then interrupts, and then reacquires its ‘tracked’ status at a later point.
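Both metrics can be counted from the sequence of track IDs assigned to one ground-truth object, frame by frame, with `None` marking frames where the object is not tracked. The sequence below is invented for illustration, and the counting rules are one straightforward reading of the two definitions above.

```python
def count_id_switches(assigned_ids):
    """Count how often the assigned ID differs from the previously assigned one."""
    switches, last = 0, None
    for track_id in assigned_ids:
        if track_id is None:
            continue
        if last is not None and track_id != last:
            switches += 1
        last = track_id
    return switches

def count_fragments(assigned_ids):
    """Count interruptions: tracked, then lost, then tracked again."""
    fragments, seen_track, in_gap = 0, False, False
    for track_id in assigned_ids:
        if track_id is None:
            in_gap = seen_track          # a gap only counts after the object was tracked
        else:
            if in_gap:
                fragments += 1
            seen_track, in_gap = True, False
    return fragments

sequence = [1, 1, None, 1, 2, 2, None, None, 3]
print(count_id_switches(sequence), count_fragments(sequence))  # 2 2
```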

The results for some selected objects are shown in the table below.

Quantitative evaluation of Deep SORT on LineMOD dataset (λ = 0.5)

We realized that the sampling frequency of the images in the sequence is quite low and that, moreover, the camera moves around the objects. Given these premises, we can understand why we obtained these rather low performances.

Since it is possible to change the metric used for the data association, we chose to change the weight given to the appearance information extracted from the feature maps provided by Mask R-CNN. The results obtained by increasing the weight of the appearance information with respect to the motion information are shown in the following table.

Quantitative evaluation of Deep SORT on LineMOD dataset (λ = 0.25)


We have designed and implemented an architecture that uses RGB-D data to estimate the 6DoF pose of different object instances and track them over time.

This architecture is mainly based on Dense Fusion.

The segmentation module has been replaced with a custom version of Mask R-CNN. It has been modified in order to extract feature maps computed by the RoiAlign layer.

Finally, Deep SORT was added to the architecture in order to implement our Multiple Object Tracking module.

Regarding the possible future developments of this work, there are several paths that could be investigated:

  • Extending Dense Fusion to estimate not only the 6DoF pose but also the size of the objects: we could add a geometry embedding extraction branch to predict the scale of the object and then normalize the input cloud to the predicted scale
  • Exploiting information from the depth channel for the Instance Segmentation task
  • Integrating information about 6DoF pose to improve our Motion Information metric used by Deep SORT

I hope you enjoyed reading this article!

Written by Lorenzo Graziano – Agile Lab Data Engineer

If you found this article useful, take a look at our blog and follow us on our Medium Publication, Agile Lab Engineering!

Big data pipeline with k8s & Deep learning for NLP – On-Line Meetup, 28th April 2020 | 7pm

Are you interested in Big data pipeline with k8s & Deep learning for NLP?

If you want to discover more, don’t miss our next meetup, with two talks:

  • “How to run big data pipeline in production with k8s”
    In several big data projects we use k8s to guarantee reliability and to automatically manage application failures.
    In this talk we show you how we used k8s in conjunction with Cloudera to run our mission-critical big data applications in production 24/7. Speakers: Matteo Bovetti + Carlo Ventrella, Site Reliability Engineers @ Agile Lab
  • “NLP and Deep Learning: from Information Retrieval to BERT”
    What are the intrinsic difficulties and limitations in handling textual information? A history of NLP technologies, from information retrieval systems to BERT, the SOTA model for most of the tasks tackled by modern NLP. Presentation of a real-world chatbot use case and the implications of using Deep Learning in NLP. Speaker: Gianpiero Sportelli, AI Developer @ CELI – Language Technology

We have moved on-line, but at the end, we’ll still have time to discuss, gather proposals and suggestions for the next meetups.




Video Search Prototype

We are working on a new kind of prototype dedicated to broadcasters: the system is able to analyze different kinds of flows (video, speech to text, structured feeds, social network feeds, etc.) in order to identify “patterns” such as specific images or semantic correlations. The system is based on the WASP infrastructure and benefits from all its distinctive features: data consistency, massive ingestion, and machine learning algorithms with real-time and post-processing capabilities.
Considering the broadcasters’ use cases, where video or audio codecs are the main flows, the system is able to:

  • analyze a specific video flow frame by frame, identifying a reference image pattern (a logo, a face, or in general a pre-defined image) with machine learning methods and/or with embeddable third-party “plug-ins”
  • produce a large number of meta-tags for each pattern-frame match that is identified
  • classify and process the produced tags (associated with specific video content), optionally enriched with Open Data semantic correlations, to build a semantically correlated knowledge base. This is possible thanks to the Sophia Semantic Engine, a library developed by CELI that is fully integrated into the system.

Take a look at the overall logical design:

The system is designed with a scale-out architecture and can be used in full “enterprise” environments without losing performance, even under high input throughput.
This kind of system offers many high-value features that could be used to enhance the legacy systems normally found in the broadcasters’ application landscape. The possible use cases are varied and interesting:

  • we could deliver a totally new tool for the video-editing process: the editor will be able to search a large video library using new keywords associated with specific content in any kind of video or audio fragment
  • a broadcaster with a large historical video repository can now monetize its copyrighted videos in several ways (“Stock Video” e-commerce, etc.)
  • new kinds of applications will become available, such as real-time advertising or product displacement strategies based on the user profile and on the specific content being viewed at the moment

This is a very challenging domain that will totally enhance our platform capabilities not only in terms of performance but also from the “algo” point of view.