AgileRAI: scalable, real-time, semantic video search



Television content has a very long life cycle. After being produced and broadcast, it is archived so that it can be reused later (days, months or even years afterwards), both for rebroadcasts and for inclusion in other content. This content archive is a central asset in today’s entertainment industry, and considerable resources are devoted to its creation, maintenance, management and use.
A mere archive of content is not useful by itself: we need an efficient way to search the archive in order to find what is relevant to our needs. This requires information on the subject matter of the content, which is usually obtained by adding metadata manually; this approach, however, requires human intervention and is therefore costly, slow and prone to biases and errors. Furthermore, given the ever-growing amount of content produced every day, the manual approach is not sustainable in the long term.
AgileRAI is a system for multimedia content analysis that takes on this challenge: it provides a scalable platform for innovative multimedia archiving and production services, combining advanced pattern recognition techniques with a modern distributed architecture.
The project was developed in collaboration with CELI, which implemented the semantic annotation and the user interface. It is a good example of how two small and dynamic companies can drive innovation together with a public giant like RAI.



The platform supports real-time ingestion of multiple video streams of various types (e.g. RTP streams, video files from storage systems, etc.), which are processed in a parallel and scalable way in order to recognize specific visual patterns such as logos, paintings, buildings and monuments. The video streams are analyzed by extracting visual features from the frames; these features are then matched against a reference database of visual patterns to produce a set of meta-tags describing the ingested content in real time. These tags can then be enriched with semantic information drawn from open semantic data repositories, which allows search and retrieval operations based on high-level concepts. The architecture is designed to be parallel and scalable in order to ensure near-real-time, frame-by-frame pattern detection and recognition in video data.
In the experimental setup of the system, we leveraged the Compact Descriptors for Visual Search (CDVS) standard proposed by the Moving Picture Experts Group (MPEG). The CDVS standard describes how to extract, compress and decompress relevant visual information in a robust and interoperable format that leverages the Scale-Invariant Feature Transform (SIFT) algorithm for feature detection.
The core building blocks of CDVS consist of global and local descriptor extractors and compressors based on selected SIFT features. The first step in extracting these descriptors is removing color information. Then, candidate key-points are extracted using SIFT. These candidate points are evaluated and filtered according to various metrics, in order to select those that are best able to provide a “feature description” of the objects contained in the image. The selected points are then encoded in the descriptors, which can be used to determine whether two images have objects in common.
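The first steps described above (dropping color, then ranking and filtering candidate key-points) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CDVS reference implementation: the `KeyPoint` fields, the `relevance` score (detector response, scale and closeness to the image center) and the function names are all hypothetical, and the SIFT detection step itself is assumed to have already produced the candidates.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class KeyPoint:
    """Hypothetical candidate key-point, as a SIFT-like detector might emit."""
    x: float
    y: float
    scale: float
    peak: float  # detector response; higher means a more distinctive point


def to_grayscale(pixels: Sequence[Sequence[Tuple[int, int, int]]]) -> List[List[float]]:
    """Remove color information using the ITU-R BT.601 luma weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in pixels]


def select_keypoints(candidates: Sequence[KeyPoint], width: int, height: int,
                     limit: int) -> List[KeyPoint]:
    """Keep the `limit` candidates with the highest (assumed) relevance score."""
    cx, cy = width / 2.0, height / 2.0
    max_dist = (cx ** 2 + cy ** 2) ** 0.5 or 1.0

    def relevance(kp: KeyPoint) -> float:
        # Assumption: strong, large-scale points near the center are preferred.
        dist = ((kp.x - cx) ** 2 + (kp.y - cy) ** 2) ** 0.5
        centrality = 1.0 - dist / max_dist
        return kp.peak * kp.scale * (0.5 + 0.5 * centrality)

    return sorted(candidates, key=relevance, reverse=True)[:limit]
```

The `limit` parameter reflects the general idea that a compact descriptor can only carry a bounded number of points, so the weakest candidates are dropped first.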
In order to efficiently match the patterns encoded in the CDVS descriptors with the reference patterns, a reference database is used. This database is built from a collection of images containing visual objects of interest (e.g. paintings, buildings, monuments, etc.) under different scales, views and lighting conditions. Each visual object is represented by a unique label, and each image is marked with the label of the object it shows. When the database is created, the visual features are extracted from the reference images and associated with the corresponding labels. When the database is queried using a CDVS descriptor, the label matching the contained patterns is returned, irrespective of which particular reference image matched.
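To make the "many images, one label" idea concrete, here is a minimal sketch of such a lookup. It is a toy model under loud assumptions: a descriptor is represented as a set of integer feature identifiers rather than a real CDVS descriptor, matching is a simple count of shared features, and the class name, the `min_shared` threshold and the labels are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Optional, Tuple

Descriptor = FrozenSet[int]  # toy stand-in for a CDVS descriptor


class ReferenceDatabase:
    """Maps reference descriptors to the label of the object they depict."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Descriptor, str]] = []

    def add(self, label: str, descriptor: Iterable[int]) -> None:
        # Several reference images (views, scales, lighting) may share a label.
        self._entries.append((frozenset(descriptor), label))

    def query(self, descriptor: Iterable[int], min_shared: int = 3) -> Optional[str]:
        """Return the label of the best-matching reference image, if any."""
        probe = frozenset(descriptor)
        best_label, best_shared = None, 0
        for reference, label in self._entries:
            shared = len(probe & reference)
            if shared >= min_shared and shared > best_shared:
                best_label, best_shared = label, shared
        return best_label


db = ReferenceDatabase()
db.add("dbpedia:Sforza_Castle", [1, 2, 3, 4, 5])       # front view
db.add("dbpedia:Sforza_Castle", [10, 11, 12, 13, 14])  # side view
db.add("dbpedia:Mole_Antonelliana", [20, 21, 22, 23])
```

Querying with features from either view of the castle yields the same label, which is the property the text describes: the caller never needs to know which reference image produced the match.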
Both the feature extraction and the matching step are computationally intensive; moreover, such a system must be able to handle multiple streams at once, both real-time and batch. This raises the need for a scalable architecture.
Let’s see how the AgileRAI system is able to parallelize all the operations needed to analyze a video stream.



The AgileRAI system architecture merges classical information analysis and retrieval components with the most advanced fast data architectures. The high-level architecture of the system may be considered as a cluster made of multiple back-end computation nodes and a single front-end node.

Video processing

The video processing pipeline is based on witboost Data Streams (formerly Wasp), an open source framework written in Scala. You can find a detailed description of the witboost system at
The input RTP (or file-based) video is decoded using the FFmpeg library in order to extract the raw frames from the incoming stream. Then, CDVS descriptors are generated for each frame and pushed to a Kafka queue. Kafka acts as a persistent, fault-tolerant and distributed publish/subscribe layer, which decouples the video feature extraction tasks from the feature matching tasks.
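The decoupling role of the queue can be illustrated with a small producer/consumer sketch. Note the assumptions: an in-memory `queue.Queue` stands in for the Kafka topic (so it is neither persistent nor distributed), and `run_pipeline`, `extract` and `match` are hypothetical names for the two stages, not part of the AgileRAI code base.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List, Tuple

_END = object()  # sentinel marking the end of the stream


def run_pipeline(frames: Iterable[Any],
                 extract: Callable[[Any], Any],
                 match: Callable[[Any], Any]) -> List[Tuple[int, Any]]:
    """Run extraction and matching as independent stages joined by a queue."""
    topic: "queue.Queue[Any]" = queue.Queue(maxsize=8)  # stand-in for a Kafka topic
    results: List[Tuple[int, Any]] = []

    def producer() -> None:
        # Extraction stage: only knows how to publish descriptors.
        for index, frame in enumerate(frames):
            topic.put((index, extract(frame)))
        topic.put(_END)

    def consumer() -> None:
        # Matching stage: only knows how to read descriptors.
        while True:
            item = topic.get()
            if item is _END:
                break
            index, descriptor = item
            results.append((index, match(descriptor)))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because neither stage calls the other directly, each one can be scaled or restarted on its own, which is the property the queue provides in the real system.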
The descriptor queue is consumed by Spark Streaming in order to match the incoming descriptors against the reference image database. The database is broadcast to all the nodes participating in the computation, and the matching operation is performed concurrently on the descriptors. This step produces a list of labels for each frame, representing the visual objects contained therein.
The output of the Spark processing step is a new Kafka queue made of tuples <FrameID, Labels>, where the FrameID is a unique identifier of the processed frame (i.e. input stream reference and timestamp data) and Labels is the list of labels of the matching visual patterns within the frame.
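As a rough illustration of what such a tuple might look like on the wire, the sketch below models it as a pair of dataclasses and serializes it to JSON. The field names (`stream`, `timestamp_ms`) and the choice of JSON are assumptions made for the example; the article does not specify the actual message format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class FrameID:
    """Identifies a processed frame: input stream reference plus timestamp."""
    stream: str
    timestamp_ms: int


@dataclass(frozen=True)
class FrameLabels:
    """One <FrameID, Labels> tuple produced by the matching step."""
    frame_id: FrameID
    labels: List[str]


def encode(record: FrameLabels) -> bytes:
    """Serialize a record to bytes suitable for a message queue payload."""
    return json.dumps(asdict(record), sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> FrameLabels:
    """Rebuild a record from its serialized form."""
    raw = json.loads(payload.decode("utf-8"))
    return FrameLabels(FrameID(**raw["frame_id"]), list(raw["labels"]))


record = FrameLabels(FrameID("channel-1", 1500000000000), ["dbpedia:Sforza_Castle"])
payload = encode(record)
```

A frame with no recognized pattern would simply carry an empty `labels` list, so downstream consumers can treat every frame uniformly.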
Finally, the contents of this Kafka queue are sent to the semantic annotation node for enrichment, storage and publication, and are also delivered to an Elasticsearch cluster for monitoring purposes.

Semantic annotation

The queue of <FrameID, Labels> tuples generated by the video processing pipeline is consumed by the enrichment pipeline. Since the incoming frames are labelled with the URIs of linked data resources, the system is able to access the rich set of properties and relations provided by the referenced web sources. For example, if a monument with an unambiguous URI (e.g. dbpedia:Sforza_Castle) is detected in the input video stream at a certain date and time, the system can extract all its properties (such as geographical location, year of construction and related external content) and use them as metadata for subsequent searches. With this approach, the intellectually expensive and time-consuming work of annotating video data only needs to be performed on a small portion of the data, e.g. when the target image dataset is created or updated, and not on the entire archived or broadcast content. Furthermore, the creation and annotation of the target image dataset may itself be automated, e.g. by using a Web image search engine to source visual training data for target queries.

The collected information (i.e. source stream identifier, detection timestamps, linked data URIs) is stored as RDF triples in a triple store following a purpose-built ontology, in order to be semantically searchable and accessible by means of SPARQL queries.
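The sketch below shows how one detection could be written as RDF triples (in N-Triples syntax) and then queried. The actual AgileRAI ontology is not described here, so the `http://example.org/agilerai#` namespace and the `inStream`, `detectedAt` and `depicts` properties are purely hypothetical placeholders.

```python
from typing import List

AR = "http://example.org/agilerai#"  # hypothetical ontology namespace
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def detection_triples(detection_id: str, stream: str, timestamp: str,
                      resource_uri: str) -> List[str]:
    """Describe one detection as N-Triples lines."""
    subject = f"<{AR}detection/{detection_id}>"
    return [
        f'{subject} <{AR}inStream> "{stream}" .',
        f'{subject} <{AR}detectedAt> "{timestamp}"^^<{XSD_DATETIME}> .',
        f"{subject} <{AR}depicts> <{resource_uri}> .",
    ]


# Example SPARQL query: when, and on which stream, was the castle detected?
QUERY = """
PREFIX ar: <http://example.org/agilerai#>
SELECT ?stream ?when WHERE {
  ?detection ar:depicts <http://dbpedia.org/resource/Sforza_Castle> ;
             ar:inStream ?stream ;
             ar:detectedAt ?when .
}
ORDER BY ?when
"""

triples = detection_triples(
    "42", "channel-1", "2017-05-04T20:15:00Z",
    "http://dbpedia.org/resource/Sforza_Castle",
)
```

Because the object of the hypothetical `depicts` property is a linked data URI, a query can also follow it into the external dataset to filter detections by properties such as location or year of construction.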

User interface

In addition to being stored in the triple store, the semantically enriched metadata is also leveraged in the system’s GUI. This information, along with a transcoded version of the input video, is used by an HTML5 web page that allows the user to choose between the available channels and shows the live video of the input video stream being processed along with a stream of contextual information. The image below shows an example: the input video is depicting a portrait of William Shakespeare. Some pictures of the famous English writer were previously collected and tagged in the AgileRAI image reference dataset. Thus, the system is able to recognise his appearance within the input stream and collect further information listing e.g. his biography and other personal data.



The combination of visual analysis techniques for the detection of visual patterns with Semantic Web technologies enables the extraction and, most importantly, the use of content information from video media, an important capability in today’s entertainment industry. Parallel computation is crucial to achieving scale-out capabilities in this kind of use case. Thanks to the witboost Data Streams (formerly Wasp) framework, AgileRAI makes it possible to ingest several live video streams in parallel and analyze them in a single cluster with a single application deployment, providing powerful video analysis capabilities with low effort.