New event organized by the Big Data Torino community (most likely it will be held online again, but more precise information will follow in the coming weeks).
As always, we host two talks:
1) AWS Datalake – S3, Athena and QuickSight, by Walter Dal Mut, Co-founder @ Corley
In this talk we will see how to build a data lake in a "serverless" fashion using the managed services of Amazon Web Services. We will wrap up with a quick overview of QuickSight for visualizing our data.
2) Apache Flink – Stateful streaming done right, by Andrea Fonti, Big Data Engineer @ AgileLab
You have surely had the chance to use big data streaming technologies such as the Structured Streaming offered by Apache Spark. Micro-batches are not for you? Does managing state give you headaches? Do you find operationalizing Spark jobs too complex?
This is why Apache Flink is a valid alternative to Spark for implementing streaming systems.
This talk will give an overview of:
– Flink architecture
– Deployment modes
– State management:
API
Introspection
Pre-population via batch
Data Lake and Data Warehouse in real time and at low cost
“A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale”
After two years of apparently no innovation in the field of Big Data technologies, a new wave is now coming: real-time Data Lakes and low-cost DWHs.
The first is the possibility to feed a Data Lake in real time directly from operational data sources, reducing the time to data availability from many hours to a few minutes. This is a game changer because it enables new patterns in the data value chain.
The second is the possibility to apply complex data transformations without being limited by append-only operations and strict immutability. Having data lakes and DWHs sitting on top of the same resources, tools, and practices is a huge gain in terms of TCO and consolidates past investments in big data technologies.
A DWH usually brings the following challenges:
Hard to scale
Expensive
Proprietary technology
Vendor Lock-in
No ML
Now a unified platform for both the data lake and the DWH is possible, and we can also feed it in real time with fast update cycles.
Main problems with current architectures for Data Lakes
The data imported into the file system in bulk generally lacks a global, strict schema specification. Downstream Spark jobs then have to deal with data in inconsistent formats, and each analysis job must filter and handle missing data. There is usually no standard way to manage schema evolution, so any change has to be handled with ad-hoc developments, impact analysis, and deployment operations. In the classical two-level schema-on-read architecture you have great flexibility, but schema management is centralized in the reader layer, while other data consumers may struggle to understand how to interpret the underlying data.
There is no ACID guarantee for the process of writing data to the file system: users may read data in an intermediate state (dirty reads). To avoid this pitfall, people work at the scheduling and orchestration level, adding complexity. Alternatively, you can end up with creative solutions to manage locks, tables available for reading, notifications, or other best-effort mechanisms. In any case, ensuring data integrity is a really tedious process.
Users cannot efficiently upsert or delete historical data. Once a Parquet file is written to HDFS, changing the data requires reading and rewriting the whole file, which is costly and inefficient (a sketch of this rewrite pattern follows this list of problems). This is a very common scenario in operational database offloading projects: online mutations are continuously streamed towards the data lake and then compacted, rewriting huge portions of the lake. In addition, regulations like GDPR require history deletion.
Frequent data imports produce a large number of small files, which can overwhelm the file system, especially file systems with a limited number of files such as HDFS.
Streaming is not a first-class citizen: it is typically not straightforward to stream data into the data lake, and it is not possible to process in streaming the deltas produced by data lake jobs. Did you ever try to add Hive partitions from a streaming job when you cross midnight? I guess there are no clean solutions out there.
If you leverage S3 as the storage layer, things get even more complicated: file operations (listing) are slow and inefficient, the move operation is not atomic, and overwrite is eventually consistent. In addition, it is not recommended to design folder layouts on top of S3.
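To make the upsert/delete and atomicity pain points concrete, here is a minimal PySpark sketch (paths, column names, and partition layout are hypothetical) of a GDPR-style deletion on plain Parquet: the whole partition must be read, filtered, and rewritten, and the final swap relies on a non-atomic file system move.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-delete-sketch").getOrCreate()

partition_path = "/datalake/events/dt=2020-05-01"          # hypothetical partition
staging_path = "/datalake/events_staging/dt=2020-05-01"

# There is no DELETE on plain Parquet: read the whole partition and filter out
# the rows that must disappear, even if they are a tiny fraction of the data.
events = spark.read.parquet(partition_path)
cleaned = events.filter(events.customer_id != "42")

# Rewrite everything to a staging location first, because overwriting the same
# path we are still reading from is unsafe.
cleaned.write.mode("overwrite").parquet(staging_path)

# The final step is a directory swap on HDFS/S3: not atomic and not isolated,
# so any concurrent reader can observe a half-moved (dirty) state.
```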
Anyway, leveraging big data technologies to build data lakes is still the cheapest way to go (in terms of TCO), so market leaders like Databricks, Uber, and Netflix have invested a lot of resources to bring such technologies a step forward.
Table Format
The main problem is due to the old conception of the schema-on-read pattern. We were used to thinking about a *file format* plus some schema-on-read provided through an external table in Hive or Impala. In this scenario the table state is stored in two places:
partitions in the Hive Metastore (HMS)
files in a file system with no transaction support
This creates problems for the atomicity of various operations, and it also affects schema evolution, because the ownership of the schema sits in the HMS from a definition standpoint, but the rules for evolution are tightly coupled to the capabilities of the file format (e.g. CSV by position, Avro by name).
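As an illustration of this two-place table state, here is a small sketch (hypothetical table name, columns, and location, assuming a Spark session created with Hive support enabled): the schema and partition list live in the HMS, while the files live in a non-transactional file system.

```python
# Schema and partition list are owned by the Hive Metastore...
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        event_id BIGINT,
        payload  STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION '/datalake/events'
""")

# ...while Parquet files written under /datalake/events/dt=2020-05-01 stay
# invisible to readers until the partition is registered in the HMS, as a
# separate and non-atomic step.
spark.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2020-05-01')")
```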
A table format is an intermediate level of abstraction where all the concepts related to the table (metadata, indexes, schema, etc.) are paired with the data and handled by a library that you can plug into multiple writers and readers. The processing engine no longer owns the table information; it just interprets what comes from the table format. In this way, we are decoupled from processing engines, file formats, and storage systems (HDFS, S3, etc.).
This is opening interesting new perspectives.
…
Delta Lake
Delta Lake was started and then open-sourced by Databricks.
The main concept in Delta Lake is that everything revolves around a transaction log, including metadata. This enables fast file listing even in large directories, as well as time travel capabilities.
Delta Lake is designed for Spark and with streaming in mind. The deep integration with Spark is probably its best feature: in fact, it is the only one with Spark SQL specific commands (e.g. MERGE or VACUUM), and it also introduces useful DML statements like "UPDATE … WHERE" or "DELETE … WHERE" directly in Spark.
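A minimal PySpark sketch of these capabilities (table paths and column names are hypothetical), assuming an existing Spark session configured with the Delta Lake extensions:

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/datalake/delta/customers")       # existing Delta table
updates = spark.read.format("delta").load("/datalake/delta/customer_updates")

# Upsert via MERGE: update matching rows, insert the new ones, in one ACID commit.
(target.alias("t")
 .merge(updates.alias("u"), "t.customer_id = u.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# In-place mutations that plain Parquet directories cannot offer.
target.update(condition="country = 'IT'", set={"segment": "'EU'"})
target.delete(condition="is_deleted = true")

# Time travel against the transaction log: read the table as of a previous version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/datalake/delta/customers")

# Clean up files no longer referenced by the log (168 hours = default 7-day retention).
spark.sql("VACUUM delta.`/datalake/delta/customers` RETAIN 168 HOURS")
```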
In the end, with ACID and serializability properties, we can now really think about moving a DWH inside a Data Lake. That's why Databricks now uses the term Lakehouse to better represent this new era.
Hudi
In the early days Uber was a pioneer of standard big data and data lake architectures, but then its challenging use cases triggered new requirements in terms of data freshness, and the company also had to face an important need for cost reduction. In some scenarios, they realized they were rewriting 80% of their HDFS files every day.
Therefore, they set out to design a data lake solution that could achieve fast upserts and streaming incremental consumption, while still covering the needs of a general-purpose data lake. This was the start of Hudi.
The Uber team implemented two table types in Hudi: Copy On Write and Merge On Read.
Merge On Read was designed for fast upserts. To put it simply, each batch of incrementally updated data is written to a set of independent delta files, and delta files and existing data files are periodically merged through compaction. At the same time, it provides three different read views to the analysis engine on top: read only the incremental delta files, read only the data files, or merge delta and data files at read time, covering the needs of the various business teams for batch analysis on the data lake.
Copy On Write, instead, compacts data directly in the writing phase, so it is optimized for read-intensive scenarios.
The flexibility to choose how to tune the system is probably the greatest feature of Hudi (see the sketch below). Wrapping up:
low latency: write to separate delta files, with merge-on-read combining snapshot and delta on the fly
high-speed queries: query only the snapshot, with copy-on-write compacting snapshot and delta at write time
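A minimal PySpark sketch (table name, key fields, and paths are hypothetical; `spark` and `updates_df` are assumed to exist) of how the table type is chosen at write time and how the different read views are requested at query time:

```python
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # MERGE_ON_READ: fast upserts into delta files, compaction happens later.
    # COPY_ON_WRITE: compaction at write time, optimized for read-heavy workloads.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}

(updates_df.write
 .format("hudi")
 .options(**hudi_options)
 .mode("append")
 .save("/datalake/hudi/trips"))

# Snapshot view: merge base files and deltas on the fly (freshest data, slower scans).
snapshot_df = (spark.read.format("hudi")
               .option("hoodie.datasource.query.type", "snapshot")
               .load("/datalake/hudi/trips"))

# Read-optimized view: only the compacted base files (fastest scans, possibly stale).
read_optimized_df = (spark.read.format("hudi")
                     .option("hoodie.datasource.query.type", "read_optimized")
                     .load("/datalake/hudi/trips"))
```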
Iceberg
Netflix, in order to solve the previously mentioned pain points, designed its own lightweight data lake format, Iceberg. From the beginning of the design, the authors positioned it as a universal data lake project, so the implementation is highly abstracted. Although its current feature set is not as rich as that of the previous two, thanks to its solid underlying design it will become a very promising open-source data lake solution once those features are completed.
The key idea is simple: track all the files in a table over time through metadata snapshots. The table metadata is immutable and always moves forward.
What is stored in a manifest file:
– Basic data file info:
File location and format
Iceberg tracking data
– Values to filter files for a scan:
Partition data values
Per-column lower and upper bounds
– Metrics for cost-based optimization:
File-level: row count, size
Column-level: value count, null count, size
Such a rich metadata system makes it possible to completely skip any listing interaction with the DFS (potentially expensive or eventually consistent). Query planning also leverages this metadata for fast data scanning.
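A short Spark sketch (catalog, table, and column names are hypothetical) of how Iceberg exposes this metadata as queryable tables, assuming a session already configured with an Iceberg catalog named `local`:

```python
# Create a partitioned Iceberg table; partition transforms like days(ts) are part
# of the table metadata, not of a folder-naming convention.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        event_id BIGINT,
        user_id  BIGINT,
        ts       TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Inspect the immutable snapshot history (the time travel points).
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM local.db.events.snapshots
""").show()

# Inspect per-file statistics tracked by the manifests: this is what lets the planner
# prune files without any listing on the underlying DFS or object store.
spark.sql("""
    SELECT file_path, record_count, lower_bounds, upper_bounds
    FROM local.db.events.files
""").show()
```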
…
Comparison
In the following section, we try to analyze all of them from different points of view, starting from a general perspective.
ACID and Isolation Level Support
Serialization: all readers and writers must be executed serially;
Write Serialization: multiple writers must be strictly serialized, while readers and writers can run simultaneously;
Snapshot Isolation: if the data written by multiple writers has no intersection, the writers can run concurrently; otherwise, they can only be serialized. Readers and writers can run simultaneously.
Schema Evolution
Hudi only supports adding optional columns and deleting columns, i.e. backward-compatible DDL operations, while the other solutions do not have this limitation. The other aspect is whether the data lake defines its own schema interface in order to decouple from the processing engine's schema. Iceberg does a good job here, abstracting its own schema and not binding to any engine-level schema.
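For example, a small sketch (hypothetical table and columns, assuming the Iceberg SQL extensions are enabled in Spark) of schema evolution expressed against the table format itself rather than against an engine-specific schema:

```python
# Columns are added and renamed as metadata-only operations tracked by Iceberg,
# without rewriting the existing data files.
spark.sql("ALTER TABLE local.db.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE local.db.events RENAME COLUMN user_id TO customer_id")
```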
Openness
Here Iceberg is the best data lake solution, with the highest degree of abstraction and flexibility: it achieves a very clean decoupling on all four aspects, while Delta and Hudi are strongly bound to Spark. Storage pluggability means how convenient it is to migrate to other distributed file systems or object stores (such as S3); this requires the data lake to have the least possible semantic dependence on the file system API. For example, if the data lake's ACID guarantees strongly depend on an atomic file system rename, it is difficult to migrate to cheap storage such as S3.
Performance
Iceberg is the only one adding features that processing engines can leverage to skip files based on metadata. Delta Lake can provide advanced features like Z-Order clustering or file skipping, but only in the commercial version provided by Databricks. If they ever release them in open source, it will be a huge step forward in performance.
For data security, Iceberg also provides file-level encryption and decryption functions, an important point not considered by the other solutions.
Wrapping up
Iceberg’s design is very solid and open and is gaining traction.
Hudi's situation is different: it is the oldest on the market, so it has some competitive advantages (like merge-on-read), but it is difficult to evolve. For example, if you are wondering whether to use it with Flink or Kafka Streams, forget about it.
Delta has the best user API, good documentation, and good foundations. The community is entirely powered by Databricks, which owns a commercial version with additional features. This could be a good point or a bad one…
Written by Paolo Platter – Agile Lab CTO & Co-Founder
If you found this article useful, take a look at our blog and follow us on our Medium Publication, Agile Lab Engineering!