November 2022 - Agile Lab News
The latest news from the world of Data Engineering through the lens of Agile Lab.
NewSQL… the new era of Relational Databases?
The term NewSQL denotes a new generation of Relational Database Management Systems
whose goal is to match the scalability guarantees of NoSQL databases without giving up ACID properties or the relational model.
ACID properties are easy to obtain when the database scales vertically [1], but vertical scaling is not practical because the price of a single machine does not grow linearly with its performance. [2]
NoSQL databases give up ACID properties in favour of horizontal scaling: just add more machines, and performance should grow roughly linearly with their number.
If your application does need ACID properties [3], you are stuck reimplementing them at the application level; that money would have been better spent on a good RDBMS engine.
NewSQL engines can give you the best of both worlds.
Inspired by Google Spanner [4], CockroachDB is a NewSQL database built on a transactional and strongly-consistent key-value store (RocksDB).
CockroachDB lets your workload scale horizontally and survives disk, machine, rack, and even datacenter failures.
CockroachDB supports strongly-consistent ACID transactions and provides a familiar SQL API for structuring, manipulating, and querying data.
CockroachDB shards your data into ranges and, as more data comes in, splits those ranges; ranges are continuously rebalanced across cluster nodes.
Horizontal scaling is achieved by assigning ranges to different cluster nodes, failures are survived by keeping replicas of each range, and splitting and rebalancing are mediated by the Raft consensus algorithm.
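The range-based sharding just described can be illustrated with a toy Python model. Key names, the split threshold, and the placement policy below are illustrative assumptions, not CockroachDB internals; in particular, the real system also replicates each range and coordinates replicas via Raft, which this sketch omits.

```python
# Toy sketch of range-based sharding: keys live in contiguous ranges,
# a range splits when it exceeds a size threshold, and new ranges are
# placed on the least-loaded node to keep the cluster balanced.
from bisect import bisect_right

class RangeShardedStore:
    def __init__(self, nodes, max_range_size=4):
        self.nodes = nodes
        self.max_range_size = max_range_size
        self.splits = [""]           # start key of each range, sorted
        self.ranges = [{}]           # key -> value data per range
        self.placement = [nodes[0]]  # node owning each range

    def _range_index(self, key):
        # Find the range whose start key precedes (or equals) this key.
        return bisect_right(self.splits, key) - 1

    def put(self, key, value):
        i = self._range_index(key)
        self.ranges[i][key] = value
        if len(self.ranges[i]) > self.max_range_size:
            self._split(i)

    def _split(self, i):
        # Split the overfull range at its median key.
        keys = sorted(self.ranges[i])
        mid = keys[len(keys) // 2]
        right = {k: v for k, v in self.ranges[i].items() if k >= mid}
        self.ranges[i] = {k: v for k, v in self.ranges[i].items() if k < mid}
        self.splits.insert(i + 1, mid)
        self.ranges.insert(i + 1, right)
        # Naive rebalance: place the new range on the least-loaded node.
        load = {n: self.placement.count(n) for n in self.nodes}
        self.placement.insert(i + 1, min(load, key=load.get))

    def get(self, key):
        return self.ranges[self._range_index(key)].get(key)
```

Writing a few dozen keys into a three-node store is enough to see ranges split and spread across nodes while point lookups still resolve through the range index.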
NuoDB, like CockroachDB, is a NewSQL database; it achieves horizontal scale by employing a two-level architecture.
The transaction processing layer consists of in-memory process nodes called Transaction Engines (TE). TEs allow NuoDB to maintain high, in-memory performance. As an application makes requests of NuoDB, TEs naturally build up in-memory caches with affinity for that application’s workload. Requests for data not currently in cache (cache misses) can be fulfilled from the memory caches of other TEs or from the storage management layer.
The storage management layer consists of process nodes called Storage Managers (SM), which have both in-memory as well as on-disk storage components. The SM also caches data in-memory to speed up data retrieval from disk. SMs also provide on-disk data durability guarantees. Multiple SMs can be used to increase data redundancy.
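The lookup path described above (local TE cache, then peer TE caches, then the Storage Manager backed by disk) can be sketched as follows; class and method names are illustrative, not NuoDB's actual API:

```python
# Illustrative two-layer read path: a Transaction Engine serves from its
# own in-memory cache, falls back to peer TE caches on a miss, and only
# then reaches the Storage Manager, which is backed by durable storage.

class StorageManager:
    def __init__(self, disk):
        self.disk = dict(disk)   # simulated durable on-disk store
        self.cache = {}          # SM-side in-memory cache

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.disk[key]   # cache miss: go to disk
        return self.cache[key]

class TransactionEngine:
    def __init__(self, peers, sm):
        self.cache = {}
        self.peers = peers       # other TEs in the cluster
        self.sm = sm

    def read(self, key):
        if key in self.cache:            # 1. local in-memory cache
            return self.cache[key]
        for peer in self.peers:          # 2. peer TE caches
            if key in peer.cache:
                self.cache[key] = peer.cache[key]
                return self.cache[key]
        value = self.sm.read(key)        # 3. storage management layer
        self.cache[key] = value
        return value
```

A second TE whose peer already holds a key fills its cache from that peer without touching the SM, which is how TEs build up affinity for an application's workload.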
All workloads were performed using synthetic data aimed at simulating those often found in the IoT world. More precisely, three tables were created: devices, measurements, and periods of validity.
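A generator for such a synthetic IoT dataset might look like the sketch below; the schemas, column names, and value ranges are illustrative assumptions, not the benchmark's actual data:

```python
# Hypothetical generator for a three-table synthetic IoT dataset:
# devices, measurements, and validity periods, keyed by device_id.
import random
from datetime import datetime, timedelta

def generate_dataset(n_devices, measurements_per_device, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducible benchmarks
    start = datetime(2022, 1, 1)
    devices, measurements, periods = [], [], []
    for d in range(n_devices):
        devices.append({"device_id": d, "model": f"model-{rng.randint(1, 5)}"})
        periods.append({
            "device_id": d,
            "valid_from": start,
            "valid_to": start + timedelta(days=rng.randint(30, 365)),
        })
        for m in range(measurements_per_device):
            measurements.append({
                "device_id": d,
                "ts": start + timedelta(minutes=m),
                "value": rng.uniform(0.0, 100.0),
            })
    return devices, measurements, periods
```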
All the benchmarks were carried out under the same conditions.
The first workload under analysis consists of 200,000 read operations by primary key; each query therefore selects a single row of the table involved. The transactions were distributed equally among the three tables.
In the second workload, 15,000 write operations were carried out; each operation inserts between one and five rows. Here again, the writes were distributed equally among the three tables.
The third workload is composed of 75,000 read operations, each requiring joins across the three tables. Queries select only two rows from the result of the two inner joins.
This workload is similar to the previous one, but it highlights how the performance of the two databases changes as the number of selected rows increases. In this case, 15,000 reads with joins involving the three tables were carried out, and 300 rows are transferred to the clients.
The last benchmark was performed with a workload of mixed operations. In particular, 100,000 queries were carried out, of which 80% are reads and 20% are writes. 1% of the reads are joins on the three tables, each producing two rows; the remaining reads were all by primary key, returning a single record. Finally, each write inserted between one and five rows.
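The operation mix of this last workload can be expressed as a small sampler. The proportions follow the text above; the sampling mechanics and operation labels are illustrative assumptions:

```python
# Sample a mixed workload: 80% reads and 20% writes, where 1% of the
# reads are three-table joins and the rest are primary-key lookups;
# each write inserts between one and five rows.
import random

def build_mixed_workload(n_ops=100_000, seed=7):
    rng = random.Random(seed)
    ops = []
    for _ in range(n_ops):
        if rng.random() < 0.80:                      # 80% reads
            if rng.random() < 0.01:                  # 1% of reads join
                ops.append("read_join_3_tables")
            else:
                ops.append("read_by_primary_key")
        else:                                        # 20% writes
            ops.append(("insert", rng.randint(1, 5)))  # 1-5 rows each
    return ops
```

Shuffling reads and writes into one stream like this keeps the benchmark closer to a real OLTP mix than running the read and write phases back to back.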
[1] To scale vertically you buy a single machine with better performance.
[2] Machines capable of the needed performance are not available yet.
[3] Applications often do not really need ACID properties.
[4] But without requiring hardware-assisted clock synchronization (GPS and atomic clocks) to ensure global consistency.
Written by Andrea Fonti - Agile Lab Software Mentor & Big Data Engineer
If you found this article useful, take a look at our Knowledge Base and follow us on our Medium Publication, Agile Lab Engineering!