The journey to Elasticsearch – Tech Talk with Neodata Group

In this first episode of our Tech Talk series, we hosted Neodata Group to talk about their journey to Elasticsearch, the open-source search and analytics engine for all types of data.
We explored some interesting use cases in data & analytics.

See the full video and stay tuned to discover all Agile Lab’s events on our blog!

Spark 3.0: First hands-on approach with Adaptive Query Execution (Part 3)

In the previous articles (Part 1 and Part 2), we started analyzing the individual features of Adaptive Query Execution introduced in Spark 3.0. In particular, we analyzed “dynamically coalescing shuffle partitions” and “dynamically switching join strategies”. Last but not least, let’s analyze what will probably be the most awaited and appreciated feature:

Dynamically optimizing skew joins

To understand exactly what it is, let’s take a step back to the theory. In Spark, a DataFrame is an abstraction over the RDD, which is in turn a logical abstraction of a dataset that can be processed in a distributed way thanks to the concept of partition. Once loaded into an RDD, a dataset is divided into partitions across which the data is ideally distributed in a balanced way:

In practice, it is very common for RDDs generated by certain operations (such as key grouping) to be unbalanced:

The unfortunate consequence is that, since parallelism in Spark is based on data partitioning, we cannot adequately exploit the resources of the cluster. Identifying such situations is fairly straightforward with the Spark UI. If you use Spark regularly, you will certainly have found yourself in the unpleasant situation where the progress bar of an operation stops for a long time on the last tasks, giving the impression, in the most “serious” cases, that the job has frozen.

Analyzing the tasks in detail, it is easy to identify the offending ones: their duration, size and number of records processed are one or more orders of magnitude larger than those of all the others.

Without AQE, these situations were resolved with techniques such as adding additional join keys (where possible) or key salting, at the cost of additional developer effort for implementation, testing and so on.
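
To make the idea of key salting concrete, here is a minimal pure-Python sketch (no Spark involved). The key names, the record counts, the number of salts and the toy `partition_of` function are all hypothetical and only serve to show how a random suffix spreads one hot key over several partitions.

```python
import random
from collections import Counter

def partition_of(key, num_partitions):
    # Toy deterministic partitioner (an assumption, not Spark's hash partitioner)
    return sum(key.encode("utf-8")) % num_partitions

def salt_key(key, num_salts, rng):
    # Append a random suffix so one hot key becomes up to num_salts distinct keys
    return f"{key}#{rng.randrange(num_salts)}"

rng = random.Random(42)
# Hypothetical skewed data: one hot key and two rare ones
keys = ["FORD_F150"] * 9000 + ["FIAT_500"] * 500 + ["KIA_RIO"] * 500

num_partitions = 8
plain = Counter(partition_of(k, num_partitions) for k in keys)
salted = Counter(partition_of(salt_key(k, 16, rng), num_partitions) for k in keys)

print("largest partition without salting:", max(plain.values()))
print("largest partition with salting:   ", max(salted.values()))
```

In a real join the other table would also have to be expanded with every possible salt value so that the salted keys still match, which is part of the extra effort mentioned above.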

AQE detects skewed partitions and optimizes the join transparently, with no changes to the application code.

Let’s see how to enable the feature and set configuration parameters correctly. The core directive spark.sql.adaptive.skewJoin.enabled is set to true by default. As in the previous cases, it is sufficient to enable AQE (spark.sql.adaptive.enabled) to take advantage of the optimization in question.

To find out what policy Spark uses to identify a skewed partition, simply read the source of the org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin class (which extends Rule[SparkPlan]). A partition is considered skewed when its size exceeds both the median partition size multiplied by a factor and an absolute threshold.

Here ADAPTIVE_EXECUTION_SKEWED_PARTITION_FACTOR corresponds to the configuration property spark.sql.adaptive.skewJoin.skewedPartitionFactor and SKEW_JOIN_SKEWED_PARTITION_THRESHOLD to spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes, which should ideally be larger than the property spark.sql.adaptive.advisoryPartitionSizeInBytes that we have already seen for coalescing shuffle partitions. By default, the threshold is set to 256 MB.
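
As a rough illustration of how the factor and the threshold work together, the check can be sketched in plain Python. This is not the Spark source: the function names are made up for the example, and the default factor of 5 is my assumption about Spark 3.0 and should be checked against the version in use.

```python
from statistics import median

MB = 1024 * 1024

def is_skewed(size, median_size, factor=5, threshold=256 * MB):
    # Both conditions must hold: much larger than the median AND above the threshold
    return size > median_size * factor and size > threshold

def skewed_partitions(sizes, factor=5, threshold=256 * MB):
    # Return the indexes of the partitions that would be flagged as skewed
    med = median(sizes)
    return [i for i, s in enumerate(sizes) if is_skewed(s, med, factor, threshold)]

# Hypothetical partition sizes for one side of a join
sizes = [64 * MB, 70 * MB, 58 * MB, 900 * MB, 66 * MB]
print(skewed_partitions(sizes))
```

With small sample data like that used in this article, both properties have to be lowered, otherwise no partition ever exceeds the 256 MB threshold.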

To show how this optimization works, we’ll borrow the excellent example of the article “The Taming of the Skew” based on two hypothetical tables in a car manufacturer’s database:

The tables are created ad hoc with an imbalance in the number of records for one of the join keys (the make and model pair).

Let’s check that the keys are really unbalanced:

Now let’s try the join with AQE disabled (we also disable broadcast joins, since the tables are very small and would otherwise be broadcast, defeating the purpose of our experiment):

and let’s analyze the results. As expected, the longest task took a good 1.2 minutes out of 1.3 minutes in total, having processed most of the data, which is concentrated in a single partition.

Now let’s repeat the experiment with AQE enabled and the configuration properties set appropriately based on the size of the sample data we’re using:

First of all, we note that the execution time drops from 1.3 minutes to 34 seconds, and that the longest tasks now take only 4 seconds. By fine-tuning the parameters I am convinced that we could get even better performance, but for our purposes we stop here.

If you want to read the previous parts of the article:

Part 1- Dynamically coalescing shuffle partitions

Part 2 – Dynamically switching join strategies

Written by Mario Cartia – Agile Lab Big Data Specialist/Agile Skill Managing Director
 If you found this article useful, take a look at our blog and follow us on Agile Lab Engineering, our Medium Publication.

Agile Lab guest at InnoTech, the Hub of The European House Ambrosetti, to talk about Machine Learning and Big Data in the insurance sector

Alberto Firpo, Co-Founder & CEO of Agile Lab, is the featured guest of an episode of “InnoTechCast – Leaders’ View on Innovation”, where he talks about Machine Learning and Big Data applied to the insurance sector.

The European House Ambrosetti, within InnoTech Hub, a reference point for the Italian innovation and technology ecosystem in Europe and worldwide, launched InnoTechCast in the middle of the COVID pandemic: a new format that tells, in digital form, the point of view of the main leaders of the Italian innovation and technology community.

In this podcast Alberto Firpo explains how the data collected by the “black boxes” installed in cars, thanks to Machine Learning models and real-time systems, provide useful information to insurance companies.

In particular, through Agile Lab’s WASP platform, the data collected in real time allow insurance companies to activate services for their customers in real time and to shape policies around “driving behaviour”, that is, the driver’s behaviour at the wheel, detected through innovative methods.

Listen to the podcast!

Real-time Analytics in mission-critical applications: the Vodafone Automotive case

At the closing conference of the eighth edition of the Big Data & Analytics Observatory of Politecnico di Milano, Alberto Firpo, CEO & Co-Founder of Agile Lab, Yari Franzini, Regional Director of Cloudera Italia, and Paolo Giuseppetti, Head of Innovation and Connected Mobility Platform at Vodafone Automotive, interviewed by Irene Di Deo, Researcher at the Digital Innovation Observatories, presented the innovative project built with WASP, Agile Lab’s Wide Analytics Streaming Platform.

Thanks to this system, Vodafone Automotive was able to use the data collected by the black boxes installed in vehicles and transform them, in real time, into useful information to raise the level of the services offered to its customers.

The 2020 research of the Big Data & Business Analytics Observatory

The goal of the 2020 Research, which Agile Lab supported as a Sponsor, was to capture and understand the state of the art of Analytics in Italy, and in particular to:

  • quantify and analyze the Analytics market in Italy, identifying current trends;
  • investigate the applications of Analytics across sectors and processes;
  • understand the main technological developments in Analytics;
  • estimate the spread of skills and organizational models for managing Data Science;
  • understand the role played by startups in Analytics.

For more information, visit the website

Spark 3.0: First hands-on approach with Adaptive Query Execution (Part 2)

In the previous article, we started analyzing the individual features of Adaptive Query Execution introduced in Spark 3.0. In particular, the first feature analyzed was “dynamically coalescing shuffle partitions”. Let’s get on with our road test.

Dynamically switching join strategies

The second optimization implemented in AQE is the runtime switch of the dataframe join strategy.

Let’s start with the fact that Spark supports a variety of join types (inner, outer, left, etc.). The execution engine supports several implementations that can run them, each with advantages and disadvantages in terms of performance and resource utilization (memory first and foremost). The optimizer’s job is to find the best tradeoff at execution time.

Going into more detail, the join strategies supported by Spark are:

  • Broadcast Hash Join
  • Shuffle Hash Join
  • Sort-merge Join
  • Cartesian Join

Without going into too much detail on the individual strategies (which is beyond the scope of this article), Broadcast Hash Join is the preferred strategy whenever one side of the join is small enough to be broadcast to all executors, so that the join can be performed “map-side”, avoiding the burden of shuffle operations (and the creation of a new execution stage). Where applicable, this technique greatly reduces execution times.
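
The idea behind a broadcast hash join can be sketched in a few lines of plain Python: the small side is turned into an in-memory lookup table that every partition of the large side probes locally, so no shuffle is needed. The rows and the `broadcast_hash_join` function below are hypothetical and only mimic an inner join; this is not how Spark implements it internally.

```python
def broadcast_hash_join(small_rows, large_partitions, key):
    # Build phase: hash the small ("broadcast") side once
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: each partition of the large side is joined independently,
    # which is why no data has to move between partitions
    joined = []
    for partition in large_partitions:
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined

# Hypothetical data: a small titles table and a larger, partitioned cast table
titles = [{"tconst": "tt1", "title": "A"}, {"tconst": "tt2", "title": "B"}]
cast_partitions = [
    [{"tconst": "tt1", "name": "x"}],
    [{"tconst": "tt2", "name": "y"}, {"tconst": "tt3", "name": "z"}],
]
result = broadcast_hash_join(titles, cast_partitions, "tconst")
print(result)
```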

Spark provides the spark.sql.autoBroadcastJoinThreshold configuration property, which makes the optimizer pick this strategy when one of the dataframes involved in the join is smaller than the specified threshold (the default value is 10 MB). Without AQE, however, the size of the dataframe is estimated statically during the optimization phase of the execution plan. In some cases the runtime size of the relation is significantly smaller than its total size: think of a join with a filter condition that discards most records at runtime.
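
The difference between the static and the runtime decision can be illustrated with a deliberately simplified sketch. The function below only compares a size with the threshold and ignores everything else the real planner considers (hints, join type, other strategies); the 2 MB runtime size is a hypothetical value.

```python
MB = 1024 * 1024
AUTO_BROADCAST_THRESHOLD = 10 * MB  # default of spark.sql.autoBroadcastJoinThreshold

def choose_join_strategy(size_in_bytes, threshold=AUTO_BROADCAST_THRESHOLD):
    # Simplified rule: broadcast only if the relation fits under the threshold
    if size_in_bytes <= threshold:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

static_estimate = 195 * MB  # size of the whole relation, known at planning time
runtime_size = 2 * MB       # hypothetical size left after a very selective filter

print("planned without AQE:", choose_join_strategy(static_estimate))
print("re-planned with AQE:", choose_join_strategy(runtime_size))
```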

To better understand the potential of this optimization, let’s work through a practical example. We will use the public IMDb (Internet Movie Database) datasets, in particular the film dataset (title.akas.tsv.gz) and the cast dataset.

The cast dataset is linked to the title dataset through the tconst field. The title dataset weighs about 195 MB and the cast dataset about 325 MB (gzip-compressed).

With the default broadcast threshold, joining the two datasets would of course select SortMerge as the join strategy. Without AQE, SortMerge would still be selected even when applying a very restrictive filter (for example, keeping only the titles related to the Virgin Islands, which are very few). Try:

See what happens instead by activating AQE:

Thanks to the statistics calculated at runtime and the adaptive execution plan, the most correct strategy has been selected in this case.

The last optimization, dynamically optimizing skew joins, will be discussed in the final part of the article. Not to be missed!

Written by Mario Cartia – Agile Lab Big Data Specialist/Agile Skill Managing Director

Spark 3.0: First hands-on approach with Adaptive Query Execution (Part 1)

Apache Spark is a distributed data processing framework whose features make it suitable for any Big Data context. Despite being a relatively recent product (it was first released as open source under a BSD license in 2010 and later donated to the Apache Foundation), on June 18th its third major version was released, introducing several new features, including Adaptive Query Execution (AQE), which we are about to discuss in this article.

A bit of history

Before being donated to the community, Spark was born in 2009 in the academic context of the AMPLab (a curiosity: AMP stands for Algorithms, Machines and People) at the University of California, Berkeley. The winning idea behind the product is the concept of RDD, described in the paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, whose lead author is Matei Zaharia, the “father” of Spark.

The idea is a solution to the main problem of the distributed processing models available at the time (MapReduce first and foremost): the lack of an abstraction layer for the memory of the distributed system. Some complex algorithms that are widely used in big data, such as many of those for training machine learning models or manipulating graph data structures, reuse intermediate results multiple times during computation. The “single-stage” architecture of models such as MapReduce is heavily penalized in such circumstances, since intermediate results must be written to (and then re-read from) persistent storage. I/O operations on persistent storage are notoriously expensive on any type of system, even more so on a distributed one because of the additional overhead introduced by network communications. The RDD concept implemented in Spark brilliantly solves this problem by keeping intermediate results in memory on a “multi-stage” DAG engine.

The other milestone (I am skipping ahead, because going into the details of RDD programming and Spark’s full history, although very interesting, is outside the scope of this article) is the introduction of the Spark SQL module in the first stable version of Spark (which by then had been donated to the Apache community).

One of the reasons for the success of the Hadoop framework before Spark’s birth was the proliferation of products that added functionality to the core modules. Among the most used we must mention Hive, the SQL abstraction layer over Hadoop. Although MapReduce’s limitations make complex SQL queries perform poorly on this engine once “translated” by Hive, Hive is still widespread today, mainly because of its ease of use.

The best way to retrace the history of the SQL layer on Spark is again to start with the reference papers: the one on Shark (Spark SQL’s ancestor), dating back to 2013, and the one titled “Spark SQL: Relational Data Processing in Spark”, where Catalyst, the optimizer at the heart of today’s architecture, is introduced.

Spark SQL features are made available to developers through objects called DataFrames (or Datasets, their type-safe counterpart in Java and Scala), which represent RDDs at a higher level of abstraction. You can use the DataFrame API through a specific DSL or through SQL.

Regardless of which method you choose, DataFrame operations are processed, translated, and optimized by Catalyst (from Spark 2.0 onwards) according to the following workflow:

What’s new

We finally get into the merits of Adaptive Query Execution, a feature that is implemented at this architectural level. More precisely, it is an optimization that intervenes dynamically between the logical plan and the physical plan, taking advantage of the runtime statistics captured during the execution of the various stages, according to the flow shown in the following image:

The Spark SQL execution flow in version 3.0 then becomes:

Optimizations in detail

Because the AQE framework is built on an extensible set of logical and physical plan optimization rules, it is reasonable to assume that developers plan to add further functionality over time. At present, the following optimizations are implemented in version 3.0:

  • Dynamically coalescing shuffle partitions
  • Dynamically switching join strategies
  • Dynamically optimizing skew joins

Let’s look at them one by one, getting hands-on through code examples.

Regarding the creation of the test cluster, we recommend that you refer to the previously published article: “How to create an Apache Spark 3.0 development cluster on a single machine using Docker”.

Dynamically coalescing shuffle partitions

Shuffle operations are notoriously the most expensive on Spark (as well as any other distributed processing framework) due to the transfer time required to move data between cluster nodes across the network. Unfortunately, however, in most cases they are unavoidable.

Transformations on a dataset in Spark, regardless of whether you use the RDD or the DataFrame API, can be of two types: narrow or wide. To complete, wide transformations need partition data to be redistributed among executors: this is the infamous shuffle operation (which also creates a new execution stage).

Without AQE, determining the optimal number of partitions of the DataFrame produced by a wide transformation (e.g. joins or aggregations) was left to the developer, who set the spark.sql.shuffle.partitions configuration property (default value: 200). However, without detailed knowledge of the data it is very difficult to choose an optimal value, with the risk of generating partitions that are too large or too small and causing performance problems.

Let’s say you want to run an aggregation query on data whose groups are unbalanced. Without the intervention of AQE, the number of partitions resulting will be the one we have expressed (e.g. 5) and the final result could be something similar to what is shown in the image:

Enabling AQE instead would put data from smaller partitions together in a larger partition of comparable size to the others, with a result similar to the one shown in the figure.

This optimization is triggered when the two configuration properties spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both set to true. Since the second is true by default, in practice you only need to enable the global AQE property to take advantage of this feature.

Reading the source code, you find that AQE is actually applied only if the query needs shuffle operations or is composed of sub-queries:

and that there is a configuration property that you can use to force AQE even in the absence of one of the two conditions above.

The number of partitions after optimization will depend instead on the setting of the following configuration options:

  • spark.sql.adaptive.coalescePartitions.initialPartitionNum
  • spark.sql.adaptive.coalescePartitions.minPartitionNum
  • spark.sql.adaptive.advisoryPartitionSizeInBytes

where the first represents the starting number of partitions (default: spark.sql.shuffle.partitions), the second represents the minimum number of partitions after optimization (default: spark.default.parallelism), and the third represents the “suggested” size of the partitions after optimization (default: 64 MB).
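
How these three settings interact can be sketched with a simplified greedy merge of adjacent partitions. This is an approximation written for this article, not the Spark implementation: the function name is made up, the partition sizes are hypothetical, and the minimum number of partitions is only used to cap the target size.

```python
def coalesce_partitions(sizes, advisory_size, min_partitions=1):
    total = sum(sizes)
    # Cap the target so that roughly min_partitions partitions can still be produced
    max_target = max(-(-total // min_partitions), 1)  # ceiling division
    target = min(advisory_size, max_target)

    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        # Close the current group when adding this partition would exceed the target
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups

KB = 1024
# Hypothetical sizes of the initial shuffle partitions
sizes = [5 * KB, 4 * KB, 25 * KB, 3 * KB, 6 * KB, 2 * KB, 28 * KB]
print(coalesce_partitions(sizes, advisory_size=30 * KB))
```

Each inner list contains the indexes of the original partitions that end up merged into one coalesced partition.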

To test AQE’s dynamic coalescing of shuffle partitions, we’re going to create two simple datasets (one of which is a lookup table to be joined with the second dataset).

The sample dataset is deliberately unbalanced: the transactions of our hypothetical “Very Big Company” are about 10% of the total, while those of each of the remaining companies are about 1%:

Let’s first test what would happen without AQE.

We will receive output:

Number of partitions without AQE: 50

The value is exactly what we have indicated ourselves by setting the configuration property spark.sql.shuffle.partitions.

We repeat the experiment by enabling AQE.

The new output will be:

Number of partitions with AQE: 7

The value, in this case, was determined based on the default level of parallelism (number of allocated cores), that is, by the value of the spark.sql.adaptive.coalescePartitions.minPartitionNum configuration property.

Now let’s see what happens when we “suggest” the target size of the partitions (in terms of storage). Let’s set it to 30 KB, a value compatible with our sample data.

This time the output will be:

Number of partitions with AQE (advisory partition size 30Kb): 15

regardless of the number of cores allocated on the cluster for our job.

Apart from having a positive impact on performance, this feature is very useful for producing optimally sized output files (try inspecting the contents of the job output directories; I wrote them in CSV format, although it is less efficient, so that the files can be inspected easily).

In the second and third parts of the article we will try the other two new features:

  • Dynamically switching join strategies
  • Dynamically optimizing skew joins.

Stay tuned!

Written by Mario Cartia – Agile Lab Big Data Specialist/Agile Skill Managing Director

Going Deep into Real-Time Human Action Recognition

A journey to discover the potential of state-of-the-art techniques

Hi all! My name is Eugenio Liso, Big Data Engineer @ Agile Lab, a remote-first R&D company located in Italy. Our main focus is to build Big Data and AI systems, in a very challenging — yet awesome — environment.

Thanks to Agile Lab, I attended a 2nd level Master’s Degree in Artificial Intelligence at the University of Turin. The Master’s thesis will be the main topic of this post. Enjoy!

Human Action Recognition: What does it mean?

Imagine that a patient is undergoing a rehabilitation exercise at home, and his/her robot assistant is capable of recognizing the patient’s actions, analyzing the correctness of the exercise, and preventing the patient from further injuries.

Such an intelligent machine would be greatly beneficial as it saves the trips to visit the therapist, reduces the medical cost, and makes remote exercise into reality¹.

With the term Human Action Recognition, we refer to the ability of such a system to temporally recognize a particular human action in a small video clip, without waiting for it to finish. This is different from Action Detection, since we do not locate where the action is spatially taking place.

Specifically, my task is to predict in real-time one out of three possible human actions: falling, running or walking.

The most careful readers will already have noticed the “movement-wise” difference between these three actions. Although the falling action has, potentially, very different “features” from the other two, and is therefore a good candidate to be one of the easiest to predict, the same cannot be said of walking and running. Indeed, these two actions share some common characteristics.

Since building ad hoc features for this kind of task can be expensive and complicated, a powerful tool comes to my rescue: Deep Neural Networks (DNN).

In the following sections, I will provide some details about the DNNs I used to accomplish the task and, finally, we will discuss the results of my experiments.

3D-ResNeXt vs Two Stream Inflated 3D ConvNet (I3D)

After a careful and detailed review of the state of the art, I decided to focus on two architecturally different but very promising DNNs: the 3D-ResNeXt and the Two Stream Inflated 3D ConvNet.

Luckily, the authors already provide the two networks pre-trained on Kinetics, a very large dataset that has proven to be an excellent starting point when training a newly built network for the Human Action Recognition task.


The first one is called 3D-ResNeXt, a convolutional neural network (CNN) implemented in PyTorch. It is based on ResNet, a kind of network that introduces shortcut connections that let a signal bypass one or more layers. For more details, refer to the authors’ paper².

This kind of network takes K 112×112 RGB frames as input, while its output is the predicted action. Regarding this, I have set K = 16 for two reasons:

  • Taking 16 frames as input is a reasonable default, since almost all architectures have that kind of time granularity
  • Almost all the videos used during the training phase are recorded at 30 FPS. Although the running or walking actions seem not to be highly impacted (since they can be roughly considered “cyclical”), I have empirically established that, by increasing the input frames, the falling action becomes “meaningless”, since 16 frames satisfactorily enclose its abruptness. It might also be a good idea to set K = 32 if the input videos were recorded at 60 FPS.
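
The reasoning in the second point is simple arithmetic on the time span covered by a clip, which a couple of lines make explicit. The helper name is made up for this illustration.

```python
def clip_duration_seconds(num_frames, fps):
    # Time span covered by a clip of num_frames consecutive frames
    return num_frames / fps

for k, fps in [(16, 30), (32, 30), (32, 60)]:
    print(f"K={k} at {fps} FPS covers {clip_duration_seconds(k, fps):.2f} s")
```

A clip of 16 frames at 30 FPS and one of 32 frames at 60 FPS cover the same time window, roughly half a second, which is why K = 32 would be the natural choice for 60 FPS videos.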

An overview on how 3D-ResNeXt (and more generally, the 3D-ConvNets) receives its inputs.


The main building block of the 3D-ResNeXt network. The ResNeXt block is simply built from a concatenation of 3D-Convolutions.

Two Stream Inflated 3D ConvNet (I3D)

The other network is called Two Stream Inflated 3D ConvNet; from now on we will call it I3D for the sake of brevity. It is implemented by DeepMind in Sonnet, a library built on top of TensorFlow. For more details, please refer to the authors’ paper³.

The I3D, differently from the 3D-ResNeXt, uses two kinds of inputs:

  • K RGB frames, like the 3D-ResNeXt, but their size is 224×224
  • Optical Flow frames with size 224×224. The Optical Flow describes the motion between two consecutive frames caused by the movement of an object or camera

An overview on how the I3D network (and more generally, the 3D-ConvNets) receives its inputs.
To the left side, an example of Optical Flow extracted from the RGB clip on the right side. These guys are playing Cricket, but I do not really know the rules. Courtesy of DeepMind.

Fine-Tuning with Online Augmentation

Before stepping into the juicy details about the employed datasets and the final results — don’t be in a hurry! — I would like to focus on the Augmentation, a fundamental technique used when training a CNN. Speaking broadly, during the training phase (on every mini-batch), we do not give the exact same input to the CNN (in my case, a series of frames), but a slightly modified version of it. This increases the CNN’s capability to generalize better on unseen data and decreases the risk of overfitting.

Of course, one could take a dataset, create modified frames from the original ones and save them to disk for later use. Instead, I use this technique during the training phase, without materializing any data. In this case, we talk about Online Augmentation.

The “units” or “modules” responsible for producing a frame similar to the original one are called filters: they can be compared to a function that takes an input image i and produces an output image i’. For this task, I have chosen 7 different filters:

  • Add and Multiply: add or multiply a random value on every pixel’s intensity
  • Pepper and Salt: common Computer Vision transformations that set randomly chosen pixels to black/white
  • Elastic deformation⁴
  • Gaussian blur, which blurs the image with a Gaussian function
  • Horizontal flip, which flips an image horizontally

These filters can be applied in three different ways:

  • Only one (randomly chosen) filter f is applied to the original image i, producing a newly-created frame f(i) = i’
  • Some of the filters are randomly chosen (along with their order) and applied to the original frame. This can be thought of as a “chained” function application: fₙ(…(f₂(f₁(i)))) = i⁽ⁿ⁾
  • All the filters (their order is random) are applied sequentially on the input image
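
The three application modes can be sketched in pure Python on a tiny grayscale “image” stored as a list of rows. This is only an illustration of the control flow, not the code used in the thesis: the filter parameters are arbitrary, Gaussian blur and elastic deformation are left out for brevity, and the mode names “one”, “some” and “all” are labels chosen here.

```python
import random

def add(image, rng):
    delta = rng.randint(-30, 30)  # arbitrary range for the example
    return [[min(255, max(0, p + delta)) for p in row] for row in image]

def multiply(image, rng):
    factor = rng.uniform(0.7, 1.3)  # arbitrary range for the example
    return [[min(255, max(0, int(p * factor))) for p in row] for row in image]

def salt(image, rng, prob=0.05):
    return [[255 if rng.random() < prob else p for p in row] for row in image]

def pepper(image, rng, prob=0.05):
    return [[0 if rng.random() < prob else p for p in row] for row in image]

def horizontal_flip(image, rng):
    return [row[::-1] for row in image]

FILTERS = [add, multiply, salt, pepper, horizontal_flip]

def augment(image, mode, rng):
    if mode == "one":      # a single randomly chosen filter
        chosen = [rng.choice(FILTERS)]
    elif mode == "some":   # a random subset, in random order
        chosen = rng.sample(FILTERS, rng.randint(1, len(FILTERS)))
    elif mode == "all":    # every filter, in random order
        chosen = rng.sample(FILTERS, len(FILTERS))
    else:
        raise ValueError(f"unknown mode: {mode}")
    for f in chosen:       # chained application: f_n(...f_2(f_1(image)))
        image = f(image, rng)
    return image

rng = random.Random(0)
frame = [[10, 20, 30], [40, 50, 60]]
print(augment(frame, "some", rng))
```

Because the filters are applied on the fly to each mini-batch, nothing is written to disk and the network rarely sees exactly the same input twice.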

This GIF represents a man doing push-ups. This is our starting video: on it, we will apply the filters described above. Taken from here.




Starting from the upper left side, and going in order from left to right: Salt, Pepper, Multiply, Add, Horizontal Flip, Gaussian Blur, Elastic Deformation


Test Set for the evaluation of the (pre-trained) 3D-ResNeXt and I3D

To evaluate the two chosen pre-trained NNs, I have extracted a subset of the Kinetics Test Set. This subset has been built by sampling 10 videos for each class among the 400 available classes, resulting in 4000 different videos.

Training/Test Set for the Fine-Tuned 3D-ResNeXt

The dataset employed to fulfill my original task is created by combining different datasets that can be found online.

Total distribution of the dataset used during Fine-Tuning

The Test Set is built from the overall dataset by taking, for each class, roughly 10% of the total available samples, while the remaining 90% represents the Training Set (as shown below).

Distribution’s summary of the dataset used during Fine-Tuning
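
A per-class split of this kind can be sketched as follows. The class sizes used in the example are hypothetical (they are not the real dataset counts), and the function is only one straightforward way to take about 10% of each class.

```python
import random
from collections import defaultdict

def stratified_split(samples, test_fraction=0.1, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)

    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        # Take roughly test_fraction of every class, at least one sample
        n_test = max(1, round(len(items) * test_fraction))
        test += [(i, label) for i in items[:n_test]]
        train += [(i, label) for i in items[n_test:]]
    return train, test

# Hypothetical, deliberately unbalanced dataset of clip identifiers
samples = (
    [(f"fall_{i}", "falling") for i in range(50)]
    + [(f"run_{i}", "running") for i in range(400)]
    + [(f"walk_{i}", "walking") for i in range(300)]
)
train_set, test_set = stratified_split(samples)
print(len(train_set), len(test_set))
```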

It is clear that we are looking at an unbalanced dataset. Since I do not have more data, to mitigate this problem, I will try to use a loss function called Weighted Cross Entropy, which we will discuss later on.

Manually-tagged Dataset for the Fine-Tuned 3D-ResNeXt

Finally, I have built and manually annotated a dataset consisting of some videos publicly available from YouTube.

Summary of the manually-tagged dataset. I can assure you that finding decent quality videos is — perhaps — more complicated than being admitted to the NeurIPS conference.

For this tagging-task, I have used a well-known tool called Visual Object Tagging Tool, an open source annotation and labeling tool for image and video assets. An example is shown below.


Tagging a video with the label falling. Tagging is mainly done in two sequential phases: first, tag the initial frame with the desired label; then, tag the final frame with a dummy tag, simulating the end of that action. Last, parse those “temporally-cut ranges” with a script, take the label (from the initial frame) and save the resulting clip. Et voilà!

Performance Analysis

Metrics and Indicators

For most of the results presented below, several metrics and indicators will be reported:

Pre-trained 3D-ResNeXt vs Pre-trained I3D on Kinetics

The purpose of this test is to have a first, rough, overview. The only class we are interested in is jogging, while all the other labels represent actions that are completely different and useless for our goal.⁶

A first, rough, comparison between 3D-ResNeXt and I3D on the jogging class

Given this subset of data, the obtained results show that:

  • the I3D seems to have better overall performance compared to the 3D-ResNeXt. This confirms its superiority, in line with the results in the authors’ paper
  • the prediction time of the 3D-ResNeXt is competitive, while the I3D is too slow for a Real-Time setting, since its architecture is significantly heavier and more complicated

At this point it is clear that, for our task, the 3D-ResNeXt is a better choice because:
– it is faster and almost capable of sustaining the speed of a 30 FPS video
– it has a simpler network architecture
– it does not need additional preprocessing or different inputs, since it uses only RGB frames

Fine-Tuned 3D-ResNeXt network’s performance on Test Set

Before stepping into the various parameter configurations, to take into account the Training Set’s unbalance, I use a different Loss Function called Weighted Cross Entropy (WCE): instead of giving each class the same weight, I use the Inverse Class Frequency to ensure the network gives more importance to errors on the minority class falling.

So, consider a set of classes C = {c₁ , …, cₙ}. The frequency fᵢ of a class cᵢ ∈ C represents the number of samples of that class in a dataset. The associated weight ICFᵢ ∈ [0, 1] is defined as:
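
One common way to obtain inverse-frequency weights in [0, 1], assumed here rather than taken from the thesis, is to normalize the reciprocals of the class frequencies so that they sum to 1. The sketch below uses hypothetical class counts and also shows how such weights would scale a cross-entropy term.

```python
import math

def inverse_class_frequency(frequencies):
    # Assumed definition: ICF_i = (1 / f_i) / sum_j (1 / f_j)
    reciprocals = {c: 1.0 / f for c, f in frequencies.items()}
    total = sum(reciprocals.values())
    return {c: r / total for c, r in reciprocals.items()}

def weighted_cross_entropy(predicted_probs, true_class, weights):
    # Cross entropy for one sample, scaled by the weight of its true class
    return -weights[true_class] * math.log(predicted_probs[true_class])

# Hypothetical class counts, with falling as the minority class
frequencies = {"falling": 100, "running": 450, "walking": 450}
weights = inverse_class_frequency(frequencies)
print(weights)

probs = {"falling": 0.2, "running": 0.5, "walking": 0.3}
print(weighted_cross_entropy(probs, "falling", weights))
```

With these weights, a mistake on a falling sample costs more than the same mistake on a running or walking sample.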

In the table below, I have reported almost all the training experiments I’ve been able to carry out during my thesis.

The parameters for the different Training runs of the 3D-ResNeXt (using Fine-Tuning). OA means Online Augmentation, while LR denotes the Learning Rate (the starting value is 0.01, and at a defined Scheduler Step — i.e. training epoch — it will be decreased by multiplying it with 0.1). It should be noted that I’ve tried some other parameter’s configurations, but these represent the best runs I’ve been able to analyze. As a related example, an interesting result is that, when using the OA in All mode, the network shows a lower performance: this could be caused by an overly-distorted input produced during the Online Augmentation.
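
The learning rate schedule described in the caption amounts to a single step decay. A minimal sketch follows; the scheduler step of 30 epochs is a hypothetical value, since the actual step differs between runs.

```python
def learning_rate(epoch, base_lr=0.01, scheduler_step=30, gamma=0.1):
    # Start at base_lr and multiply by gamma once the scheduler step is reached
    return base_lr * gamma if epoch >= scheduler_step else base_lr

for epoch in (0, 29, 30, 45):
    print(epoch, learning_rate(epoch))
```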

The really big question before this work was: using the little data available for the falling class, will the network be able to achieve an acceptable performance? Given the parameters above, the obtained results are summarized in the chart below.

The performance of my Fine-Tuned networks on the Test Set

What can I say about that? The results seem pretty good: the network understands when someone is walking or running, while it is more doubtful when predicting falling. But, aside from that, let us focus on the contoured blue and red rectangles: red denotes training runs with Cross Entropy, blue denotes training runs with Weighted Cross Entropy.

This chart highlights some interesting findings:

  • When using Cross Entropy, the precision on the falling and running classes is noticeably better and, overall, the runs with Cross Entropy behave better, with balanced precision and recall, which is rewarded by the F-Score metric
  • When using Weighted Cross Entropy, the precision on the minority class falling drops dramatically, while its recall increases significantly. During the evaluation, this translates into more frequent (and incorrect) falling predictions
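To make the trade-off between the two findings concrete, here is a small sketch of how precision, recall and F-Score are computed for one class. It assumes the F-Score is the usual F1 (the harmonic mean of precision and recall), and the counts are invented purely for illustration; they are not taken from the chart:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the falling class (10 real falling samples in both cases).
# A cautious network: few falling predictions, most of them right.
cautious = precision_recall_f1(tp=6, fp=2, fn=4)
# An eager network: many falling predictions, most real falls found, many false alarms.
eager = precision_recall_f1(tp=9, fp=11, fn=1)

print("cautious:", cautious)
print("eager:   ", eager)
```

The eager network finds more of the real falling samples (higher recall) but raises many false alarms (lower precision), which is the pattern described for Weighted Cross Entropy.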

TOP-3 Fine-Tuned NNs on the Manually-Tagged Dataset

Last, but not least (well, these are probably the most interesting and awaited results, am I right?), I have carried out a final test that aims to assess the capability of the Fine-Tuned NNs to generalize on unseen, real data. I have chosen the three best Fine-Tuned networks from the previous experiment (their IDs are T1, T7 and T8), and I have tested them on my manually-tagged dataset.

Drumroll! What will be the winning run? Will the nets be able to defeat overfitting, or will they be the victims? See you in the next episode... I’m kidding! You can see the results below!

Performance of the three best training runs (T1 — T7 — T8) on the manually-tagged dataset

Finally, here they are! I’m actually surprised about them, since:

  • The networks generalize quite well on the new manually-tagged dataset: I did not expect such good results! A stable F-Score of 0.8 is reached for the running and walking classes, with a satisfactory performance on the falling class
  • Using Online Augmentation produces a positive effect on the recall of the minority class falling (T7 vs. T8)
  • Less frequent use of LR Steps leads to a better overall performance. Indeed, the recall on the falling class (T1 vs. T8) has improved noticeably
  • The three networks have a VERY HIGH precision on the falling class (T7 is never wrong!) but, unfortunately, they have a mediocre recall: they are very confident on the aforementioned label, but more than half of the time they cannot recognize a falling sample

What have I learned? What have YOU learned?

Finally, we have come to the end of our tortuous but satisfying journey. I admit it: it was tough! I hope this adventure has been to your liking. Thank you for reading; here are some useful take-home messages and future developments:

  • The 3D-ResNeXt network is better suited for the Real-Time Human Action Recognition task since it has a dramatically faster prediction time than the I3D
  • The CE and WCE produce radically different results on the minority class falling, while they do not change the other two classes much: the former produces a well-balanced precision and recall, while the latter causes the trained network to predict the minority class more frequently, increasing its recall at the expense of its precision
  • The Online Augmentation helps the network to not overfit on training data
  • An LR that changes more rarely is to be preferred to an LR that changes more frequently
  • Speaking broadly, the result that has emerged during training is not an “absolute truth”: some tasks require a higher recall rather than a higher precision. Depending on the task, we might prefer a network that predicts falling:
    * more precisely but more rarely (more precision, less recall)
    * less precisely but more frequently (more recall, less precision).

Future Developments

  • The minority class may be too under-represented. More high-quality data (not necessarily only for one class) could dramatically increase the network’s performance
  • The WCE has enormously increased the recall of the minority class. It could be interesting to investigate this behaviour further, or to try another weighting method that reduces the loss weight on the minority class. This could potentially make the network predict the class falling less frequently, thus increasing its precision while keeping the benefit of a higher recall
  • The predictions made by the network could be joined with another system that is able to detect people (whose task is known as People Detection), in order to give a more precise and “bounded” prediction. However, this could require a different dataset, where the input videos contain only the relevant information (e.g. a man or woman falling with minimal or no background or environment)


[1] Yu Kong and Yun Fu, Human Action Recognition and Prediction: A Survey, 2018.

[2] Kensho Hara, Hirokatsu Kataoka and Yutaka Satoh, Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?, 2017.

[3] João Carreira and Andrew Zisserman, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017.

[4] Simard, Steinkraus and Platt, Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, in Proc. of the International Conference on Document Analysis and Recognition, 2003.

[5] The prediction time does not take into account the preprocessing time. For the 3D-ResNeXt, it is minimal, while for the I3D, it is much longer due to the calculation of the optical flow for each input video. As an example, with my specs, 10 seconds of a single video need about 1 minute of preprocessing.

[6] It’s completely fine if you think that this test is simplistic and not very accurate. Due to time and space limitations (those videos were literally eating my free disk space), and also because this is completely out-of-scope, I managed to run these tests only on a small portion of test data.

Written by Eugenio Liso – Agile Lab Big Data Engineer

If you found this article useful, take a look at our blog and follow us on our Medium Publication, Agile Lab Engineering!

An Introduction to Deep Generative Models for Computer Vision – Online Meetup – 11th June 2020 | 6.30PM

Our next meetup, planned for 11th June 2020, will cover two of the most famous and recent generative models, Variational Autoencoders and Generative Adversarial Networks, at an introductory level with the main mathematical insights. The theory behind the techniques and how they work will be presented, together with practical applications and the related code.

The event will consist of two presentations and a final session for questions and technical time.

First Talk: Variational Autoencoders (VAEs) by Marco Odore, Riccardo Fino and Stefano Samele, Data Scientists at Agile Lab, with computer science, statistics and mathematics backgrounds respectively.

This talk will show how to train a Variational Autoencoder on an image dataset and will give a glimpse of its interpretation through visual arithmetic.

Second Talk: Generative Adversarial Networks (GANs), by Luca Ruzzola, a deep learning researcher at, working on applied Deep Learning for photo enhancement.

In this talk we’re gonna go back through the development of Generative Adversarial Networks (GANs), the challenges in using them and the amazing results that they can produce, from an introductory perspective.

Click here to sign up!