Real-Time Analytics in Mission-Critical Applications for Usage-Based Insurance (UBI) Services

“The IoT generates large amounts of data and, above all, the expectation among those who use it that the data will be processed and transformed into useful information in real time. We therefore need solutions that are scalable in terms of economics and efficiency and that, at the same time, guarantee the expected results. Choosing partners such as Agile Lab, always attentive to innovation and with solid expertise in new technologies, allows us to achieve ambitious results and to adopt innovative solutions for the challenges of tomorrow.”

 

Paolo Giuseppetti

Head of Innovation and Connected Mobility Platform, Vodafone Automotive

Scenario

To provide insurance companies with specific risk profiles for every driver, extracted from on-board black boxes, Vodafone Automotive needed a whole new system for the acquisition and processing of telemetry data, capable of collecting, managing and analyzing Big Data in real time for over 227 million weekly mileage and driving-related messages.

The new architecture, based on Cloudera and built on Apache Kafka, Apache Spark and the combination of HDFS and HBase, has been specifically designed to run on-premises. The Cloud Computing branch of Vodafone handles the management and maintenance of the environments, while the company relies on its ongoing collaboration with Agile Lab for the development and management of the application services.

This new architecture – introducing a more flexible and innovative platform – has enabled the company to meet the level of service expected by Clients (i.e. a maximum of 15 minutes of latency between data generation and its reception on the Client’s Cloud, including transit time on both mobile networks and the Internet) and has become the solid foundation for developing services that the company wouldn’t otherwise have been able to offer.

Client context

Vodafone Automotive, part of Vodafone Group, operates in the specialized segment of Vodafone IoT and focuses on providing IT platforms to the mobility world. The company supplies telecommunication services and products targeted at the automotive sector. In particular, it offers stolen vehicle assistance and recovery services, theft prevention, and crash and vehicle accident management services. Specifically for insurance-related services, it provides analytical functions for driving habits and styles and risk-management assessments, as well as a wide scope of vehicle management services (i.e. maintenance and life cycle management) for both fleets and OEM manufacturers (of automotive on-board electronics). The main focus of our case study is on Usage-Based Insurance (UBI).

Challenge

UBI aims to address the traditional trade-off between the driver’s privacy and the cost of the insurance plan, a key aspect of the car insurance sector. Vodafone Automotive is able to provide insurance companies with different driving style profiles by collecting multiple pieces of information, such as the location and acceleration of a car, through an electronic device (the black box) installed on board the vehicle.

Through this information, Vodafone Automotive helps insurance companies create a “score” that represents with the utmost precision the risk associated with the type of driver, and therefore with the individual insurance plan, also providing data on the type of roads traveled, driving times, and much more.

The project was born out of the need to extract the maximum value from the data generated by on-board devices (often called black boxes, like the flight recorders on airplanes), to better cater to the needs of insurance companies. They can use this data for policy pricing (computationally, this is organized in time intervals, with pre-established processing cycles and post-processing delivery of the data sets as agreed with the company, used, for example, at the time of policy renewal, quarterly or even annually), but also to offer new services to their subscribers, strengthening and enhancing the customer experience: for example, by sending alerts related to the level of danger of a certain area (i.e. where the client may have parked), or localized weather-related alerts (such as hail alerts).

In line with Vodafone Automotive’s goal of increasing road safety, the company launched this project to revise and revolutionize its systems for the acquisition and processing of telemetry data (generated by devices previously installed by Vodafone Automotive on insured vehicles), introducing the features and capabilities offered by the Cloudera platform to collect, manage and analyze in real time the Big Data sent by the installed black boxes.

Solution

The Vodafone Automotive project, started in 2017, was aimed at deploying, managing and consolidating a platform able to collect and process large quantities of data, to support insurance companies’ risk evaluation process both when issuing insurance plans and when offering real-time services to their customers. The project led to the replacement of the previous architecture with a newer, innovative one based on Cloudera, built on Apache Kafka, Apache Spark and the combination of HDFS and HBase (the “Lambda” architectural model), and later also Apache NiFi, which can process data with a latency of a few seconds regardless of their quantity or frequency. The primary feature of this platform is its ability to flexibly manage high volumes of data and to expand and grow according to the company’s evolving needs.

Data processing occurs mainly through Apache Spark, which captures the data and processes them after extracting them from Kafka. Afterwards, the platform lands the raw data on a distributed HDFS file system, while processed data are saved on the NoSQL database, achieving impressive performance results. The collected data are then routed through the Lambda architecture, enabling both real-time data processing and effective storage for future re-processing needs. To accomplish the latter function, the architecture relies on the NoSQL store HBase. It should be noted that the primary data processing reconstructs the driver’s path from localization and speed data, and from geographical information acquired through the GPS system and the accelerometer in the vehicle’s black box.
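As a purely conceptual sketch (not Vodafone Automotive’s actual code), the Lambda-style ingestion described above can be illustrated with Spark Structured Streaming: raw events read from a Kafka topic are landed as-is on HDFS for re-processing, while a processed view is produced in streaming. Topic names, paths and the sink choices below are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TelemetryIngestionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("telemetry-lambda-sketch").getOrCreate()

    // Raw telemetry events published by the black boxes on Kafka (topic name is hypothetical)
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "vehicle-telemetry")
      .load()

    // Batch layer: land the untouched payloads on HDFS for future re-processing
    raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
      .writeStream
      .format("parquet")
      .option("path", "hdfs:///data/telemetry/raw")
      .option("checkpointLocation", "hdfs:///checkpoints/telemetry-raw")
      .start()

    // Speed layer: a simplified "processed" view; the real pipeline reconstructs trips from
    // GPS/accelerometer data and writes them to a NoSQL store such as HBase
    raw.selectExpr("CAST(value AS STRING) AS event", "timestamp")
      .withColumn("ingestion_time", current_timestamp())
      .writeStream
      .format("console")
      .start()

    spark.streams.awaitAnyTermination()
  }
}
```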

Additional operations are required to guarantee the reliability of the collected data: it is fundamental, for instance, to perform data cleansing and preparation, in order to spot any device malfunction or to differentiate between a pothole and a car impact (and consequently understand whether or not to provide assistance to the driver).

The new architecture has been specifically designed to run on-premises, and its servers are located in Vodafone Group’s Technology Hub in Milan (for Vodafone Italy), which hosts the VFIT services. A back-up server cluster has also been created in Vodafone’s twin data center, as part of the disaster recovery plan. The Cloud Computing branch of Vodafone handles the management and maintenance of the environments (GDC – Group Data Center, where the data processing resources are implemented; Vodafone caters to the Southern European market through this structure), while the company relies on its collaboration with Agile Lab for the development and management of the application services.

The architectural evolution that Vodafone Automotive implemented not only allowed the company to handle high volumes of data effectively, but also proved to be a qualifying element in guaranteeing the availability of real-time analyzed and processed data to insurance companies. Thanks to the new platform, today insurance companies are able to receive real-time information on the driving habits of their clients, with a latency of mere minutes from the event registration in the vehicle’s onboard black box.

The following are some figures from our project:  

  • over 33 million messages per day
  • 227 million weekly mileage and driving-related messages (for insured clients) 
  • 130 terabytes of data collected in 3 years.  

From a management point of view, the project required the involvement of a dedicated team that focused exclusively on the design and development of new architectural scenarios. This organizational choice and the collaboration with Agile Lab – which took charge of every detail regarding the planning, engineering and optimization of the application platform – played a key role in the success of the project. After the project launch, the team created by Vodafone Automotive to manage the development phases joined the company’s IT department, working in the areas of Project Management, Development and Operations.

The greatest challenge faced by the company has been the need to integrate numerous recent technologies into its existing information systems. The IT department was required to manage a series of new tools and platforms, and to take all the necessary steps (also from a training perspective) to both maintain and employ those technologies to their fullest potential.

Achieved Results and Benefits

First of all, the new architecture – introducing a more flexible and innovative platform – has enabled the company to meet the level of service expected by Clients (i.e. a maximum of 15 minutes of latency between data generation and its reception on the Client’s Cloud, including transit time on both mobile networks and the Internet). In addition, the new architecture has become the solid foundation for developing services that the company wouldn’t otherwise have been able to offer. It allowed Vodafone Automotive to acquire a definitive competitive advantage, positioning itself as one of the most innovative players on the market.

Future Developments

Among the potential evolutions of the platform is the possibility of adding a Machine Learning function to be applied to reliability and data quality check processes, even in streaming mode (as the data arrive). The introduction of automatic learning techniques would allow the company to identify any possible device malfunction much more quickly, becoming proactive in the maintenance and, when needed, the replacement of the black box. This would also bring the added benefit of avoiding corrective measures on corrupted data ingested because of device errors or malfunctions.

This case study was published in Italian by the Management Engineering department of Milan’s Polytechnic University as part of the 2020 business cases of the Big Data and Business Analytics Observatory, Digital Innovation Observatories of the School of Management of Politecnico di Milano (Copyright © Politecnico di Milano / Dipartimento di Ingegneria Gestionale).

Darwin, Avro schema evolution made easy!

Hi everybody! I’m a Big Data Engineer @ Agile Lab, a remote-first Big Data engineering and R&D firm located in Italy. Our main focus is to build Big Data and AI systems, in a very challenging — yet awesome — environment.

In Agile Lab we work on different projects where we chose Avro as the encoding format for the data, given its good performance: Avro allows us to store data efficiently in terms of size, but it is not easy to set it up so that it enables data model evolution over time.

Meet Avro!

Avro is a row-based data serialization format. Widely used in Big Data projects, it supports schema evolution in a size-efficient fashion, along with compression and splitting. The size reduction is achieved by not storing the schema along with the data: since the schema is not stored with each element (as it would be with a format like JSON), the serialized elements contain only the actual binary data and not their structure.

Let’s see how this is achieved by looking at an example of serialization using Avro. First of all, the schema of the data that should be serialized can be defined using a specific IDL or using JSON. Here we define the hypothetical structure of a “song” in our model with the Avro IDL and the equivalent JSON representation:

IDL and JSON schemas for Song elements
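Since the original schema listing is not reproduced here, the following is a minimal sketch of what such a Song schema could look like, written in Avro’s JSON form and parsed with the Avro Java API from Scala (the field set is illustrative):

```scala
import org.apache.avro.Schema

object SongSchema {
  // Hypothetical "song" record: the JSON form of the Avro schema (the IDL version is equivalent)
  val songSchemaJson: String =
    """{
      |  "type": "record",
      |  "name": "Song",
      |  "fields": [
      |    {"name": "title",    "type": "string"},
      |    {"name": "artists",  "type": {"type": "array", "items": "string"}},
      |    {"name": "duration", "type": "long"}
      |  ]
      |}""".stripMargin

  // Parse the JSON definition into an org.apache.avro.Schema instance
  val songSchema: Schema = new Schema.Parser().parse(songSchemaJson)
}
```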

 

Since the schema is the same for all elements, each serialized element can contain only the actual data, and the schema will tell a reader how it should be interpreted: the decoder goes through the fields in the order they appear in the schema and uses the schema to tell the datatype of each field. In our example, the decoder would be able to decode a binary sequence like the following:

The binary values are interpreted by the decoder using the schema
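As a hedged sketch of this round trip (using the Avro Java API from Scala, with the Song schema defined above), the encoder writes only the field values and the decoder walks them back in schema order:

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object SongCodec {
  import SongSchema.songSchema

  // Serialize a record: only the field values end up in the byte array, not the schema
  def encode(record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](songSchema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  // Deserialize: the decoder reads the fields in the order declared by the schema
  def decode(bytes: Array[Byte]): GenericRecord = {
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    new GenericDatumReader[GenericRecord](songSchema).read(null, decoder)
  }
}
```

A record built with GenericRecordBuilder against the same schema can be round-tripped through encode and decode unchanged.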

 

Change is inevitable

One of the problems with the above implementation is that in real scenarios data evolve: new fields are introduced, old ones are removed, fields can be reordered and types can change; e.g. we could add the “album” field to our “song”, change its “duration” field from long to int and change the order of the “duration” and “artists” fields.

To ensure that we don’t lose the ability to read previously written data in the process, a decoder must be able to read both new and old “songs”. Luckily, if the evolution is compliant with certain rules (e.g. the new “album” field must have a default value), then Avro is able to read data written with a certain schema as elements compliant with the newest one, as long as both schemas are provided to the decoder. So, after we introduce the new song schema, we can tell Avro to read a binary element written with the old schema using both the old and the new one. The decoder will automatically be able to understand which field of the old schema is mapped to which field of the new one.

Elements written with an older version of the schema can be read with the new one by performing a mapping between the old and the new fields
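A minimal sketch of this evolution-aware read (Scala, Avro Java API): the reader is built with both the writer schema and the new reader schema, and Avro resolves the field mapping. The evolved schema below is hypothetical; note that in this sketch the duration field keeps its long type, since Avro’s resolution rules allow numeric promotion (e.g. int to long) but not demotion.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

object EvolvedSongReader {
  // New version of the Song schema: "album" added (with a default value)
  // and the "duration"/"artists" fields reordered
  val newSongSchema: Schema = new Schema.Parser().parse(
    """{
      |  "type": "record",
      |  "name": "Song",
      |  "fields": [
      |    {"name": "title",    "type": "string"},
      |    {"name": "duration", "type": "long"},
      |    {"name": "artists",  "type": {"type": "array", "items": "string"}},
      |    {"name": "album",    "type": "string", "default": "unknown"}
      |  ]
      |}""".stripMargin)

  // Read bytes written with the old schema, exposing them as records of the new schema
  def readWithEvolution(bytes: Array[Byte], writerSchema: Schema): GenericRecord = {
    val reader = new GenericDatumReader[GenericRecord](writerSchema, newSongSchema)
    reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null))
  }
}
```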

 

Note that if the decoder is not aware of the older schema and tries to read some data written with the old schema, it would throw an error since it would read the “album” field value thinking it represents the “song” duration.

Keep your data close and your schemas closer

So it seems that everything works fine with Avro and schema evolution: why should I worry?

Yes, everything is managed transparently by Avro, but only as long as you provide the decoder not only with the schema your application expects to read, but also with the one used to encode every single record: how can you know which schema was used to write a very old record? A solution could be to store each element along with its schema, but this way we are going back to storing data and schema together (like JSON), losing all the size efficiency!

Another solution could be to write the schema only once, into the header of an Avro file containing multiple elements that share the same schema, in order to amortize the cost of the schema across many elements. This, however, is not a solution that can always be applied, and it is tightly coupled to the number of elements that you can write together in a single file.

Luckily, Avro comes to the rescue again with a particular encoding format called Single-Object Encoding. With this encoding format each element is stored along with an identifier of the schema used for the serialization itself; the identifier is a fingerprint of the schema, i.e. a hash (64-bit fingerprints are enough to avoid collisions for up to about a million entries).
From the Avro documentation:

Single Avro objects are encoded as follows:

* A two-byte marker, C3 01, to show that the message is Avro and uses this single-record format (version 1).

* The 8-byte little-endian CRC-64-AVRO fingerprint of the object’s schema

* The Avro object encoded using Avro’s binary encoding
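As a small sketch of this layout (assuming the CRC-64-AVRO fingerprint is computed with Avro’s SchemaNormalization utility), a single-object encoded payload can be assembled and unpacked by hand like this:

```scala
import java.nio.{ByteBuffer, ByteOrder}
import org.apache.avro.{Schema, SchemaNormalization}

object SingleObjectEncoding {
  private val Marker: Array[Byte] = Array(0xC3.toByte, 0x01.toByte)

  // Wrap an Avro-encoded payload: 2-byte marker + 8-byte little-endian CRC-64-AVRO fingerprint + payload
  def wrap(avroPayload: Array[Byte], schema: Schema): Array[Byte] = {
    val fingerprint = SchemaNormalization.parsingFingerprint64(schema)
    ByteBuffer.allocate(2 + 8 + avroPayload.length)
      .put(Marker)
      .order(ByteOrder.LITTLE_ENDIAN)
      .putLong(fingerprint)
      .put(avroPayload)
      .array()
  }

  // Split a single-object encoded byte array back into (fingerprint, bare Avro payload)
  def unwrap(bytes: Array[Byte]): (Long, Array[Byte]) = {
    require(bytes(0) == Marker(0) && bytes(1) == Marker(1), "not an Avro single-object encoded element")
    val fingerprint = ByteBuffer.wrap(bytes, 2, 8).order(ByteOrder.LITTLE_ENDIAN).getLong
    (fingerprint, bytes.drop(10))
  }
}
```

In practice Darwin (introduced below) takes care of this wrapping and unwrapping for you.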

Darwin: how I stopped worrying and learned to love the evolution

OK, now we don’t have the actual schema, but only a fingerprint… how can we pass the schema to the decoder if we only have its fingerprint?

Here is where Darwin comes to our help!

Darwin is a lightweight repository written in Scala that allows you to maintain different versions of the Avro schemas used in your applications: it is an open-source, portable library that does not require any application server!
To store all the data related to your schemas, you can choose from multiple, easily pluggable storage managers (HBase, MongoDB, etc.).

Darwin keeps all the schemas used to write elements stored together with their fingerprints: this way, each time a decoder tries to read an element knowing only the reader schema, it can simply ask Darwin to get the writer schema given its fingerprint.

To plug in a Connector for a specific storage, simply include the desired connector among the dependencies of your project. Darwin automatically loads all the Connectors it finds on its classpath: if multiple Connectors are found, the first one is used, or you can specify which one to use via configuration.

Even if it is not required when Darwin is used as a library, it also provides a web service that can be queried through REST APIs to retrieve a schema by its fingerprint and to register new schemas. At the time this article was written, Darwin provided connectors for HBase and MongoDB, plus a REST connector that can be linked to a running instance of the Darwin REST server.

Each Connector requires its own set of configurations, e.g. for the HBaseConnector:
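The snippet below is only an indicative sketch of what an HBase connector configuration might look like, expressed as a HOCON string parsed with Typesafe Config from Scala; the key names are assumptions and should be checked against the Darwin documentation:

```scala
import com.typesafe.config.{Config, ConfigFactory}

object DarwinConfig {
  // Illustrative configuration only: key names and values are assumptions,
  // refer to the Darwin README for the options actually supported by each connector
  val config: Config = ConfigFactory.parseString(
    """darwin {
      |  connector = "hbase"              # which Connector to use if more than one is on the classpath
      |  namespace = "AVRO"               # HBase namespace holding the schema table
      |  table = "SCHEMA_REPOSITORY"      # HBase table where schemas and fingerprints are stored
      |  hbaseSite = "/etc/hbase/conf/hbase-site.xml"
      |  coreSite = "/etc/hadoop/conf/core-site.xml"
      |  isSecure = false                 # enable (with principal/keytab) on Kerberized clusters
      |}""".stripMargin)
}
```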

Darwin can be configured to run in three different modes:

  1. Lazy mode: Darwin accesses the connector (and the underlying storage) each time it is asked for a schema. This allows the application to always fetch the schemas up to date, but it could increase the number of accesses to the storage. This mode is suitable for applications that always require the latest values found on the storage and cannot afford to load all values at startup (like Web Services).
  2. Eager mode: Darwin loads all the schemas from the storage only once at startup into an internal cache. Then all the schemas are searched only inside the cache without querying the storage again. This reduces to a minimum the accesses to the storage, but won’t retrieve schemas registered after the startup (suitable for batch applications but not for streaming ones).
  3. Lazy cached mode: a hybrid mode between the two presented above; all schemas are loaded at startup and every request performs a lookup on the internal cache. If a miss occurs, the storage is queried for the schema and, if it is found, the cache is updated accordingly. This mode is suitable for streaming applications that can load all the known values at startup but need to be aware of possible changes during their lifetime.

Darwin in a nutshell

Using Darwin is very easy! Here is a simple Scala example: first of all, we must obtain an instance of AvroSchemaManager, passing the desired configuration:
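A minimal sketch follows; the factory and method names in this and the next snippets follow Darwin’s documented public API, but treat them as indicative and check them against the version you use:

```scala
import it.agilelab.darwin.manager.{AvroSchemaManager, AvroSchemaManagerFactory}

// The configuration is the one sketched above (connector type and storage details)
val manager: AvroSchemaManager = AvroSchemaManagerFactory.initialize(DarwinConfig.config)
```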

Then every application can register the Avro schemas it needs by simply invoking:
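Continuing the sketch, registerAll stores the schemas, together with their fingerprints, through the configured connector:

```scala
import org.apache.avro.Schema

// Schemas taken from the earlier sketches; any schema used to write data should be registered
val mySchemas: Seq[Schema] = Seq(SongSchema.songSchema, EvolvedSongReader.newSongSchema)
manager.registerAll(mySchemas)
```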

The AvroSchemaManager can then be used to retrieve a schema given its fingerprint, and vice versa:
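An indicative sketch of the lookup in both directions, continuing the example:

```scala
// From schema to fingerprint...
val fingerprint: Long = manager.getId(SongSchema.songSchema)

// ...and from fingerprint back to the registered schema (empty if it was never registered)
val maybeSchema: Option[Schema] = manager.getSchema(fingerprint)
```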

Since the data will be stored and retrieved as single-object encoded, Darwin also exposes APIs to create a single-object encoded byte array starting from an Avro-encoded byte array and its schema:
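Indicatively, the manager prepends the marker bytes and the schema fingerprint to the bare Avro payload:

```scala
// avroPayload is a plain Avro-encoded record, e.g. the output of SongCodec.encode above
def toSingleObject(avroPayload: Array[Byte]): Array[Byte] =
  manager.generateAvroSingleObjectEncoded(avroPayload, SongSchema.songSchema)
```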

It also provides APIs to retrieve the schema and the Avro-encoded byte array from a single-object encoded element:
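Indicatively, the fingerprint is read from the header and resolved through Darwin, returning the writer schema together with the bare Avro payload:

```scala
def fromSingleObject(singleObjectEncoded: Array[Byte]): (Schema, Array[Byte]) =
  manager.retrieveSchemaAndAvroPayload(singleObjectEncoded)
```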

The same operations can be performed from Java as well, using the same API.

TL;DR — what was that all about?

In Agile Lab we use Avro in different projects as a serialization system that allows us to store data in a binary, fast, and compact data format. Avro achieves very small serialization sizes of the records by storing them without their schema.

Since data evolves (and its schemas accordingly), in order to read records written with different schemas each decoder must know all the schemas used to write records throughout the history of your application. This forces each application that needs to read the data to worry about the schemas: where to store them, how to fetch them, and how to tell which schema was associated with each record.

To overcome said evolution problems in our projects, we created Darwin!

Darwin is a schema repository and utility library that simplifies the whole process of Avro encoding/decoding with schema evolution. We are currently using Darwin in multiple Big Data projects in production at Terabyte scale to solve Avro data evolution problems.

Darwin is available on Maven Central and it is fully open source: code and documentation are available on GitHub.

So give it a try and feel free to give us feedback and contribute to the project!

Written by Lorenzo Pirazzini – Agile Lab Big Data Engineer

 

If you found this article useful, take a look at our blog and follow us on our Medium Publication, Agile Lab Engineering!

Data Quality for Big Data

In today’s data-intensive society, Big Data applications are becoming more and more common. Their success stems from the ability to analyze huge collections of data, opening up new business prospects. Devising a novel, clever and non-trivial use case for a given collection of data is not enough to guarantee the success of a new enterprise. Data is the main actor in a big data application: it is therefore of paramount importance that the right data is available and that the quality of such data meets certain requirements.
In this scenario we started the development of DataQuality for Big Data, a new open-source project that automatically and efficiently monitors the quality of the data in a big data environment.

Why Data Quality?

One of the main targets of a big data application is to extract valuable business knowledge from the input data. This process involves more than a trivial computation of summary statistics from the raw data. Furthermore, traditional tools and techniques cannot be applied efficiently to the huge collections of data that are becoming commonplace. To achieve the intended objective, a common approach is to build a Data Pipeline that transforms the input data through multiple steps in order to calculate business KPIs, build/apply machine learning models, or make the data usable by other applications.

 Data Pipeline workflow example

In this scenario it is quite common to have pre-defined input data specifications and output data requirements. The pipeline steps are then defined, relying on such specifications to achieve the desired output.
However, in most cases we do not have full control over the input data, or we may have specifications and assumptions that are based on past data analysis. Unfortunately (and fortunately at the same time) data changes over time, and we would like to keep track of changes that could violate our application’s assumptions. In some other cases data can be corrupted (there are a lot of moving parts in complex environments) and we don’t want dirty data to lead to unusable results. We’d like, instead, to intercept and prevent this situation, restore the input data and run our data pipeline in a clean scenario.
Another problem is related to the data pipeline itself. After development tests and even massive UATs there can always be some kind of bug, due to wrong programming, unforeseen corner cases in the input data, or a poorly executed testing/UAT phase. Constantly checking our core application properties over time is a good way to catch these bugs as soon as possible, even when the application is already in production.
A continuous Data Quality check on input, intermediate and output data is therefore strongly advisable.

Approaches to Data Quality

Manual checking of some properties and constraints of the data is of course possible but not affordable in the long run: an automatic approach is needed.
A possible solution could be to implement project-specific controls and checks on the data. This has the obvious advantage of integrating quality checks in different parts of the application. There are, however, several drawbacks:

  • Development time consumption: developers have to spend time on the design, development and testing of the controls.
  • Code readability: both the application-specific code and the data quality code will reside in the same code base.
  • Hard and costly maintainability: additional checks will require a new release of the software.
  • Application-specific checks: every application will have its own controls, possibly redundant on some data inputs. Data quality checks can’t be executed in isolation without the related application (or they can, but with additional overhead in the development phase).

A better solution is to have a generic Data Quality framework, able to read many data sources, easily configurable and flexible enough to accommodate different needs. That’s where our framework shines!

DataQuality project

DQ is an open-source framework to build parallel and distributed quality checks in big data environments. It can be used to analyze structured or unstructured data, calculating metrics (numerical indicators on the data) and performing checks on such metrics to assure the quality of the data. It relies entirely on Spark to perform distributed computation.
Compared to typical data quality products, this framework performs quality checks at the raw level. It doesn’t leverage any SQL-like abstraction such as Hive or Impala, because they perform type checks at runtime, hiding badly formatted data. Hadoop has structured data representations (e.g. through Hive), but it mainly stores unstructured data (files), so we think that quality checks must be performed at the raw level, without typed abstractions.
With DQ it is possible to:

  • Load heterogeneous data from different sources (HDFS, etc.) and various formats (Avro, Parquet, CSV, etc.)
  • Select, define and compute metrics with different granularity (i.e. at file or column level)
  • Compose metrics and perform checks on them (i.e. boundary checks, metric comparison requirements, etc.)
  • Evaluate the quality and consistency of data, defining constraints and properties, both technical and domain-dependent
  • Save check results and historical metrics to multiple destinations (HDFS, MySQL, etc.)

The DQ engine is even more valuable if inserted in an architecture like the one shown below:

Data Quality solution overview

 

 

In this kind of architecture our Data Quality engine takes as input a configuration describing which data has to be checked; it then fetches the data, calculates the metrics and performs the configured checks on them. The results are then saved in a data store, MongoDB in the example. In this scenario a configuration engine (the green box) could be very helpful to configure sources/metrics/checks through a simple web application, or to automatically create standard checks for every input listed in an enterprise’s Data Governance catalog. Other components can be implemented on top of this architecture to send reports, build graphical dashboards or apply recovery strategies in case of corrupted data.

DQ Engine: State of the art

To understand the state of the art of the project, we first have to specify the main abstractions of the engine:

  • Source: As the name suggests it’s the definition of some kind of data that resides in a data storage system. It could be a table, a text or parquet file in HDFS. It could also be a table on a MySQL DB or any other kind of data.
  • Metric: It is a measure, represented as a number, on the data. It could be related to a file  (e.g. number of rows in a text file), or to a specific column (e.g. number of null values in a given column).
  • Check: A check is a property that has to be verified on one or more metrics. For example we can define a check on a metric w.r.t. a given threshold (e.g. number of null values in a column equal to 0), or w.r.t. another metric (e.g. number of distinct values on a column of data A greater than number of distinct value of a column of data B). Checks give the user the ability to verify technical properties, such as integrity of data, or functional properties, like consistency of join condition over multiple data.
  • Target: This is just the definition of a data storage in which we want to save our results, i.e. the computed metrics and check results.

The schema below describes the high-level pipeline of the DataQuality engine.

Pipeline of the Data Quality Engine

The engine is powerful and flexible enough to be usable in different contexts and scenarios. New sources, metrics, checks and output formats can be plugged in and reused without further implementation.

Below is a quick recap of what is already available. As already said, any other component can be plugged in.

  • Sources: HDFS files (parquet or text files, with fixed length fields or with a field separator)
  • Metrics:
    • File metrics: row count
    • Column metrics:
      • Count on distinct values, null values, empty values
      • Min, max, sum, avg on numeric columns
      • Min, max, avg on string length in text columns
      • Count of well formatted dates, numbers or strings
      • Count of column values in a given domain
      • Count of column values out of a given domain
    • Composed metrics
      • It is possible to compose basic metrics with mathematical formulas
  • Checks:
    • Comparison logic
      • Greater than
      • Equal to
      • Less than
    • Comparison terms
      • Metric vs. Threshold
      • Metric vs. Metric
  • Targets:
    • Type:
      • HDFS text files
    • Format:
      • Columnar and file metrics
      • Columnar and file checks

Use case scenario

Let’s now look at a possible scenario to understand a typical usage of DataQuality and its configuration pattern.
Suppose that we are in a bank’s IT department and our application uses as input data a daily transaction flow that arrives in CSV format on HDFS. The expected structure of the file is the following:

  • ID – Unique id of the transaction
  • SENDER_IBAN – IBAN of the transaction issuer
  • RECEIVER_IBAN – IBAN of the transaction receiver
  • AMOUNT – Amount of the transaction
  • CURRENCY – Original currency of the transaction
  • DATE – Timestamp at which the transaction was issued

On a daily basis we want to check the quality of the received data. Some required high-level properties could be:

  1. Uniqueness of the ID in the daily transaction flow
  2. Correctness of the amount (i.e. positive number)
  3. Currency in a set of expected currency (i.e. USD and EUR)
  4. Correctness of the date format (date encoded in a specific format, e.g. yyyy-MM-dd HH:mm )

Let’s see how we can translate these requirements into source, metrics and checks configuration for the DataQuality framework.
Here you can find a possible configuration file for the above use case:
DataQuality Configuration example
In order to read the input file, the only thing we have to do is add a source to our sources list, specifying an id, the type of the source, the HDFS path, the format of the file (CSV, fixed length, …) and a few other parameters. The file date in the path definition will be filled in automatically by the framework, in the specified format, depending on the run parameters.
Then we have to formalize our high-level requirements in terms of metrics and checks (a plain Spark sketch of the same computations follows the list):

  1. Uniqueness of the ID in the daily transaction flow
    1. Metric A – File rows count
    2. Metric B – Distinct values count in column ID
    3. Check – Metric A = Metric B
  2. Correctness of the amount (i.e. number with some precision/scale)
    1. Metric A – File rows count
    2. Metric C – Count of rows with well formatted numbers in amount column
    3. Check – Metric A = Metric C
  3. Currency in a set of expected currency (i.e. USD and EUR)
    1. Metric D – Count of rows whose currency column value is outside the (‘USD’, ‘EUR’) domain
    2. Check – Metric D = 0
  4. Correctness of the date format (date encoded in a specific format, e.g. yyyy-MM-dd HH:mm )
    1. Metric A – File rows count
    2. Metric E – Count of rows with well formatted timestamps in date column
    3. Check – Metric A = Metric E
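For intuition only, the following is not the DataQuality configuration format but a plain Spark sketch of the same metrics and checks, with a hypothetical input path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TransactionChecksSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dq-sketch").getOrCreate()

    // Daily transaction flow in CSV on HDFS (path and header option are illustrative)
    val df = spark.read.option("header", "true").csv("hdfs:///flows/transactions/2018-01-01.csv")

    val metricA = df.count()                                                        // file rows count
    val metricB = df.select("ID").distinct().count()                                // distinct IDs
    val metricC = df.filter(col("AMOUNT").cast("decimal(18,2)").isNotNull).count()  // well formatted amounts
    val metricD = df.filter(!col("CURRENCY").isin("USD", "EUR")).count()            // currencies outside the domain
    val metricE = df.filter(to_timestamp(col("DATE"), "yyyy-MM-dd HH:mm").isNotNull).count() // well formatted dates

    val checks = Seq(
      "ID uniqueness"      -> (metricA == metricB),
      "Amount correctness" -> (metricA == metricC),
      "Currency domain"    -> (metricD == 0L),
      "Date format"        -> (metricA == metricE)
    )
    checks.foreach { case (name, passed) => println(s"$name: ${if (passed) "OK" else "KO"}") }

    spark.stop()
  }
}
```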

In the configuration file metrics and checks are defined separately, because metrics are independent of the checks in which they are used. In fact, a metric can be re-used in different checks (e.g. metric A).
The list at the bottom contains the output files configuration.
Instructions on how to run the framework, injecting a custom configuration file, can be found on the GitHub project page.

Conclusion

It should be clear that assuring data quality over time is a must in a big data environment. Achieving this goal requires an automatic approach: it is not affordable or feasible to build ad-hoc solutions in complex big data environments for every project/data source.
An open, flexible, extensible framework such as DataQuality is the solution that guarantees a high level of control over quality of data in different contexts and scenarios, requiring only minor configuration to extend its usage.
The pluggable nature of the project makes it highly reusable and open to continuous extensions. It’s also open-source, so you are welcome to contribute to the project and make it even better!
GitHub https://github.com/agile-lab-dev/DataQuality

 

If you found this article useful, take a look at our blog and follow us on our Medium Publication, Agile Lab Engineering!

WASP is now open source on GitHub!

WASP is a framework that enables the development of full-stack, complex real-time applications: IoT, complex big data streaming analytics, massive data ingestion, or data offload from legacy systems. Streaming analytics is the next frontier, and we are already there!

We are glad to inform you that WASP is now available on GitHub.
Our business as big data specialists and system integrators relies on open source communities like Spark, Cassandra, Kafka… so now is the time to “give back”, and we hope this can be useful to someone out there.
In Italy there is a lack of open source culture, but we are trying to change this; we have a lot to learn from companies like Confluent, Stratio, H2O, Cloudera, etc.
Feel free to share this link ( http://www.agilelab.it/wasp-is-open-source-on-github/ ) on Social Networks like Linkedin, Twitter etc., it will be very much appreciated.
WASP relies on a kind of Kappa/Lambda architecture mainly leveraging Kafka and Spark.
WASP allows you to save time on DevOps architectures and on integrating different components, letting you focus on your data, business logic and algorithms, without worrying about typical big data problems like:

  • at-least-once or exactly-once delivery
  • publishing your results in real time and being reactive
  • performing data quality checks on unstructured data
  • feeding different datastores from the same data flow in a safe way
  • periodically training a machine learning model

These are the main tech features that WASP provides out of the box:

  • Scalable ingestion, processing and storage
  • Complex Analytics in RT
  • Pluggable Business Logic
  • Pluggable Datastore (Cassandra, Elastic, HBase, Druid, Solr, Cloudera and many others)
  • Data feed publishing in RT via API
  • Integrated model server (batch training, model hot deploy, real-time prediction) for machine learning workloads
  • Lambda and Kappa Architecture
  • Fast data interaction with Notebooks for DataScientists
  • Centralized log collection
  • Fully dockerized
  • Business intelligence ready via JDBC
  • Composable Transformation Pipeline as a directed acyclic graph
  • Easily extensible

Here you can find the first release notes.