Secondary Indexing on HBase, a.k.a. NoSQL is good but the world is not key-value
So, they told you that NoSQL is the place to store your de-normalized, ex-relational (big) data, right? And you believed them, didn't you? Well, that need not have been your worst decision, as long as you thought about a plan B for when the first users, arriving right after your data migration is completed, ask you:
“I don’t have the key to access the key-value store. Is that OK if I perform a (full?) SCAN to search for the data I need?”
No, it’s NOT OK.
If your data system must provide both massive and pinpoint access to small pieces of data in a lake of potentially billions of records, sequential flows like file systems (even distributed ones) are not the right choice: that's why key-value stores, backed by distributed and scalable architectures, gained so much success in the last few years, especially among companies approaching the Big Data world while re-platforming their relational data systems.
Unfortunately, the lack of efficient access patterns (a metric which can obviously differ use case by use case) other than the key-value one usually brings to light a sad piece of news: there are no secondary indexes! WHAT?! Well, I'm afraid so. At least, not built into most of the broadly adopted technologies like Apache HBase.
So here’s your plan B.
In most of the projects we worked on at AgileLab, we leveraged Apache HBase to tackle the problem of having a consistent, distributed, fault-tolerant and petabyte-scalable database in which to store unstructured or relational-denormalized data in key-value fashion. In fact, the NoSQL datastore HBase provides CP capabilities (in CAP-theorem terms) on top of a Hadoop cluster, leveraging ZooKeeper for state and configuration management and synchronization, and the Hadoop Distributed File System (HDFS) as a scalable, replicated, fault-tolerant persistence layer.
Although HBase provides a "table" abstraction, it doesn't allow users (as of HBase 1.x) to declare any "relational" schema-based structure, besides a "column family" based physical and row-oriented logical data model.
HBase Table structure (source: https://www.oreilly.com/library/view/hbase-the-definitive/9781449314682/httpatomoreillycomsourceoreillyimages889234.png)
Data in HBase is organized in rows (each identified by a unique "row key"), which are physically persisted in different files on HDFS ("HFiles") according to the column family a given piece of data belongs to (HFiles can be multiple, depending on the region splits, but this is out of the scope of this post). Within column families, data is organized in lexicographically ordered "column qualifiers", each of which uniquely identifies a "cell" containing nothing but a bare array of bytes. We can then state that, given a unique key made of <row key + column family + column qualifier> (a.k.a. the "cell qualifier"), HBase offers data access in a key -> value pattern, the key being the "cell qualifier" modeled above and the value being the array of bytes contained in the cell.
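Conceptually (leaving region splits, versions and HFiles aside), this access pattern boils down to a plain map keyed by the full cell qualifier. A toy sketch in Python, purely for illustration (all names are made up):

```python
from typing import Optional

# A toy model of HBase cell addressing: a "table" is just a map from the
# full cell qualifier (row key, column family, column qualifier) to bytes.
table = {}

def put(row_key: str, family: str, qualifier: str, value: bytes) -> None:
    table[(row_key, family, qualifier)] = value

def get(row_key: str, family: str, qualifier: str) -> Optional[bytes]:
    # A GET is a single-key lookup: fast, but you must know the full key upfront.
    return table.get((row_key, family, qualifier))

put("customer-42", "2018", "0010001", b"avro-encoded-order")
assert get("customer-42", "2018", "0010001") == b"avro-encoded-order"
assert get("customer-42", "2018", "unknown") is None
```

This is exactly why the lack of secondary indexes hurts: any access that doesn't start from the full key degenerates into scanning the whole map.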
Don’t panic, let’s make an example
Let's imagine a possible use case. MyOnlineShopCompany has millions of customers, each identified by a unique id (customer_id). Every customer places orders and buys stuff from the online shop. The company's IT decided to store such orders on HBase, because they can be "a lot" and they are accessed from the company's "My Orders" web page one by one and customer by customer. Each customer, during the "active account" period, can place up to thousands of orders, so the order of magnitude of just a hypothetical "orders" table would be in the billions of records. The company must provide online access to such orders for at least 5 years, but it may need to make older historical data available to its data science team, which will perform prediction, analytics, and other cool stuff. Order business objects are modeled with dozens of fields.
OK, accepting the fact that we are gonna leverage HBase to store these orders (we could discuss about business/costs/tradeoffs-wise arguments all night long), how could we design our table?
A possible physical data model could be:
— ROW KEY: the customer_id > this allows us to reference a single row to access all the orders of a customer with a GET operation; it also implies our table doesn't grow "vertically", thus allowing an upfront-optimized management of HBase's region key splits.
— COLUMN FAMILY: the year the order was placed > this allows us to precisely access a single family (read: avoid opening files that don't contain orders of a specific requested year); it also allows hassle-free data historicization (just grab the 6-year-old CF and do your historical read/write/delete data management operations without affecting other data).
— COLUMN QUALIFIER: the day and minute the order was placed, represented as the number of "days left in the year" (1 to 366) concatenated with the "minutes left in the day" > this makes the "first" order in the array of column qualifiers of a single row the most recent one (of a certain year), which can speed up reads following a "most-recent" access pattern (with leap years and timezone shifts taken into account). Again, there are several optimizations (what if a customer places more than one order within a minute? I know, I know) that we could imagine on this model: let's move on and accept it for now, but we can discuss it further in another place.
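To make the qualifier scheme concrete, here is a minimal sketch of how such a qualifier could be computed (Python; the zero-padded field widths are our own assumptions, not something HBase mandates):

```python
import calendar
from datetime import datetime

def order_qualifier(placed_at: datetime) -> str:
    """Column qualifier: "days left in the year" (zero-padded to 3 digits)
    concatenated with "minutes left in the day" (zero-padded to 4 digits).
    Lower values sort first, so within a row the most recent order of the
    year comes first among the lexicographically ordered qualifiers."""
    days_in_year = 366 if calendar.isleap(placed_at.year) else 365
    days_left = days_in_year - placed_at.timetuple().tm_yday + 1        # 1..366
    minutes_left = 24 * 60 - (placed_at.hour * 60 + placed_at.minute)   # 1..1440
    return f"{days_left:03d}{minutes_left:04d}"

# An order placed on Dec 31 sorts before one placed on Jan 1 of the same year:
assert order_qualifier(datetime(2018, 12, 31, 23, 59)) == "0010001"
assert order_qualifier(datetime(2018, 12, 31, 23, 59)) < order_qualifier(datetime(2018, 1, 1, 0, 0))
```

Zero-padding matters here: HBase compares qualifiers as byte strings, so without fixed widths "2" would sort after "10".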
And the content? Avro, of course. The application business object serialized with Avro Single Object Encoding, to be precise (just a byte array from an HBase point of view).
A very common use case, which we at AgileLab leveraged in several projects, is to put complex business objects inside those cells, serializing them with Apache Avro (this requires a schema management framework, and we developed, leveraged and open-sourced Darwin to address that). Although this mechanism gives users the powerful possibility of having multiple schema-less data models within a single table, it forces them to know cell qualifiers upfront to access single records in the "GET" access pattern (which can be pushed down to milliseconds of latency!). HBase also provides "SCAN" access patterns, but these are not as efficient, nor are they the use case this technology has been designed for.
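For reference, Avro Single Object Encoding is just a small framing around the Avro binary payload: a 2-byte magic header (0xC3 0x01), the 8-byte little-endian CRC-64-AVRO fingerprint of the writer schema, then the encoded record. A minimal sketch (the fingerprint is assumed to be precomputed, e.g. by a schema manager such as Darwin; the Avro encoding of the record itself is out of scope here):

```python
import struct
from typing import Tuple

AVRO_SOE_MAGIC = b"\xc3\x01"  # single-object encoding header, per the Avro spec

def wrap_single_object(schema_fingerprint: int, avro_payload: bytes) -> bytes:
    """Frame an Avro-encoded record as a "single object": magic bytes,
    little-endian schema fingerprint, then the binary payload."""
    return AVRO_SOE_MAGIC + struct.pack("<Q", schema_fingerprint) + avro_payload

def unwrap_single_object(cell_value: bytes) -> Tuple[int, bytes]:
    """Split an HBase cell value back into (schema fingerprint, payload)."""
    assert cell_value[:2] == AVRO_SOE_MAGIC, "not a single-object-encoded value"
    (fingerprint,) = struct.unpack("<Q", cell_value[2:10])
    return fingerprint, cell_value[10:]
```

The fingerprint is what lets a reader look up the exact writer schema in the schema registry before deserializing, which is what makes per-cell schema evolution workable.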
So how do we access our huge table by a pattern different from the key -> value one?
Let's say our orders have, besides the other fields, a "category", a "price", and a "description". How would you fetch the orders, for a specific customer or maybe (worse) across all customers, having category "XY", a price between 10 and 100, and a description containing the word "USB" (case insensitive)?
Would you really perform (FULL) SCANs?
(Image source: https://erickbio.files.wordpress.com/2012/05/tebel.jpg)
OK, so maybe now I got your full attention.
Well, the HBase documentation's reference to "Secondary Indexing" suggests that users develop a custom "Coprocessor" that deserializes our "PUT" mutations, extracts secondary index keys and writes them to another "shadow" HBase index table, which … has the same limitations as our source table. Data is opaque, and there's no high-level schema or column value type to help in accessing it, besides the problem of keeping the two tables consistent with each other.
Also, have you ever developed (Java)/debugged (on clusters)/deployed (as jar)/maintained a custom Coprocessor? We did, and it wasn’t a pleasant journey. Coprocessors can be seen as an intermediate server-side entry-point API which is triggered before a Mutation (a PUT/DELETE request from a client) is handled by a Region Server (the slave servers of the master-slave HBase architecture).
We found out this approach is worth the effort only when the workload is beyond terabyte scale. Outside such a scenario, another architecture provides a far easier and more user-friendly approach.
While working on Cloudera clusters, as in most of our customers' scenarios, we leveraged the Lily Indexer + SolR (Cloud) stack to tackle the secondary indexing problem.
Cloudera Search architecture
The Lily HBase Indexer service embeds the NG-Data Indexer to provide a Near-Real-Time (NRT), resilient, automated, configuration-driven mechanism that triggers Morphline (Kite SDK) parsers over an HBase mutation, extracting our significant secondary index fields from the deserialized cell value and publishing them as documents on a pre-determined SolR collection (a.k.a. Cloudera Search in a Cloudera environment, where index data can be easily accessed via the Hue Web UI for browsing and dashboarding). PUT and DELETE mutations are automatically handled by updating the SolR collections.
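A Morphline configuration for the indexer typically opens with the extractHBaseCells command to pull the relevant cells out of each mutation. What follows is a hedged sketch for our example (one mapping per column family/year would be needed in practice, and the elided commands would deserialize the Avro payload and emit the category/price/description fields):

```
morphlines : [
  {
    id : ordersMorphline
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              # every qualifier of the "2018" family; outputField gets the raw bytes
              inputColumn : "2018:*"
              outputField : "order_payload"
              type : "byte[]"
              source : value
            }
          ]
        }
      }
      # ... further commands would deserialize order_payload (Avro single
      # object encoding) and emit the fields to be indexed on SolR ...
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
```

The same file drives both the NRT indexer and the batch rebuild job, which keeps the parsing logic in exactly one place.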
From an application development point of view, indexes on secondary fields provide a millisecond-grade multi-modal access pattern on SolR (either accessed from batch jobs or web services), which would allow us to solve our initial data problem and retrieve HBase’s cell qualifiers in order to access our original full records.
In our example, after having configured our SolR collection with a schema like:
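A minimal sketch of the relevant part of such a schema.xml could look like this (field types are our assumptions for a CDH-era SolR; the three HBase-coordinate fields match those returned by the query below):

```xml
<!-- fields needed to point back at the HBase cell -->
<field name="row_key"       type="string" indexed="true" stored="true" required="true"/>
<field name="column_family" type="string" indexed="true" stored="true" required="true"/>
<field name="qualifier"     type="string" indexed="true" stored="true" required="true"/>
<!-- secondary-index fields extracted from the Avro payload -->
<field name="category"    type="string"       indexed="true" stored="false"/>
<field name="price"       type="int"          indexed="true" stored="false"/>
<field name="description" type="text_general" indexed="true" stored="false"/>
```

Note that only the HBase coordinates need to be stored: the business fields are indexed for searching, while the full record lives in HBase.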
We would issue client-side the SolR query:
category:"XY" AND price:[10 TO 100] AND description:"USB"
This would return the documents containing the “row_key”, “column_family”, “qualifier” stored fields that, on the application side, we could leverage to issue the related GET operations on HBase and efficiently retrieve our records within, probably, milliseconds.
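Application-side, the glue is straightforward: each SolR document carries the three coordinates of an HBase cell. A minimal sketch in plain Python (a real implementation would feed these coordinates to an HBase client's GET calls):

```python
from typing import Dict, List, Tuple

def cell_coordinates(solr_docs: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Turn SolR result documents into (row_key, "family:qualifier") pairs,
    i.e. the coordinates needed to GET each matching order from HBase."""
    return [
        (doc["row_key"], f'{doc["column_family"]}:{doc["qualifier"]}')
        for doc in solr_docs
    ]

docs = [{"row_key": "customer-42", "column_family": "2018", "qualifier": "0010001"}]
assert cell_coordinates(docs) == [("customer-42", "2018:0010001")]
```

Since every lookup is a point GET, the retrieval cost scales with the number of matches, not with the size of the table.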
I like it! What do I need to set it up?
On a Cloudera cluster with the Lily HBase Indexer service properly installed and configured, the application-side setup is pretty easy; we just need to:
1. enable replication on the target HBase table (if you don’t have cluster-wide replication already enabled);
2. register an indexer for each table we are interested in indexing through a shell command (and a Morphline configuration file that contains the parsing logic);
3. properly create/setup the related SolR collections;
4. enjoy accessing the indexes, maybe through Cloudera Hue thanks to which we could also create dynamic interactive dashboards (what about NRT KPI?).
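On CDH, the steps above map to commands along these lines (table, collection and indexer names, paths, and ZooKeeper quorums are placeholders; check the Cloudera Search documentation for the exact syntax of your release):

```shell
# 1. enable replication on the target table/family (from the hbase shell)
echo "alter 'orders', { NAME => '2018', REPLICATION_SCOPE => 1 }" | hbase shell

# 3. create the SolR collection from a local configuration directory
solrctl instancedir --create orders_collection ./orders_solr_conf
solrctl collection --create orders_collection

# 2. register the indexer, pointing it at its XML/Morphline configuration
hbase-indexer add-indexer \
  --name ordersIndexer \
  --indexer-conf ./orders-indexer.xml \
  --connection-param solr.zk=zk01:2181/solr \
  --connection-param solr.collection=orders_collection \
  --zookeeper zk01:2181
```

(The collection is created before the indexer is registered, so the first indexed mutations have somewhere to land.)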
Interactive Dashboarding with Hue (source: https://blog.cloudera.com/blog/2014/07/new-in-cdh-5-1-hues-improved-search-app/ )
The resiliency and at-least-once data handling semantics of the indexers come from the fact that they register themselves as HBase replication peers, thus exploiting HBase's built-in replication mechanism. Also, the Lily HBase Indexer supports Kerberos for authentication and Apache Sentry for authorization. Scalability of SolR Cloud is provided by its sharding capabilities.
Last but not least, what about a distributed batch MapReduce job to rebuild the indexes in case of data migration, data corruption, a schema change or a new environment setup? Well, it's already available in the Cloudera environment, and you just need to launch it, passing it the very same morphline.conf configuration file passed to the Lily HBase Indexer creation shell command.
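In CDH this batch rebuild is the HBaseMapReduceIndexerTool; an invocation could look roughly like the following (jar path and options vary by release, so treat this strictly as a sketch):

```shell
hadoop jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  --hbase-indexer-file ./orders-indexer.xml \
  --zk-host zk01:2181/solr \
  --collection orders_collection \
  --go-live
```

The --go-live flag merges the freshly built index shards into the live collection once the job completes.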
OK the pros, but what about the cons?
Well, there are a few. You tell us which one, in your experience, could be the worst to face among:
a) DDLs (e.g. 'truncate' commands) are not propagated to SolR (but none of you guys trigger truncates in production environments, right?).
b) SolR collections are indexed with Lucene inverted indexes, so pay attention to the range of values your secondary index keys can take: inverted indexes are known to underperform when keys are mostly condensed around a very small subset of values.
c) As of the latest CDH 5.x release, Lily Indexers registered as HBase replication peers will always receive all the mutations of all the tables configured with a "REPLICATION_SCOPE" greater than 0 (from an HBase perspective they are just table-agnostic replication endpoints to send mutations to). Indexers discard all the mutations not related to their pre-configured target table.
At AgileLab we leveraged this architecture to provide a “light-OLTP” access pattern to our customer’s HBase-based Big Data use cases, handling thousands of updates per minute across hundreds of tables and related indexes (per single use-case).
Get in touch with us to know more about this architecture. We also developed an OLAP access layer (another "anti-pattern" for accessing key -> value data stores) on top of HBase that may become the subject of another blog post in the future.
Also, let us know what you’d like us to further write about.
CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem
AgileLab Darwin Avro Schema Evolution manager: https://github.com/agile-lab-dev/darwin
HBase documentation, secondary indexes: https://hbase.apache.org/book.html#secondary.indexes
Cloudera Lily Indexer + SolR: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/search_config_hbase_indexer_for_search.html
NG-Data indexer: https://ngdata.github.io/hbase-indexer/
Morphline parser: http://kitesdk.org/docs/1.1.0/morphlines/
Lucene Inverted Index: https://lucene.apache.org/core/
By R. Coluccio