SQL index on hash attributes in Redis (and Memurai)
Have you ever tried, in Redis, to search for or filter hashes by one or more attributes, or to sort hashes by one or more attributes?
The (probably missing) feature you need is called "secondary indexing"
and I explain it here.
Redis (or Memurai if you are on Windows)
At the most basic level, Redis is an extremely efficient, in-memory key–value store. This efficiency comes at a cost: Redis lacks several features that are common in traditional relational database management systems (RDMSs). Among those features are "secondary indices", which enable users to query data efficiently without altering the underlying data model. Nonetheless, Redis supports a wide variety of data structures that can be used to create and maintain the key–value pairs that effectively emulate the functional and performance-related characteristics of traditional secondary indices. This article aims to provide an overview of how to index your data using Redis.
Redis was originally designed to run on Linux. However, in producing this article, I have been using its alter ego, Memurai, because I am more comfortable working with Windows. Memurai is a Windows port of Redis that has been derived from Microsoft’s OpenTech Redis project and is being actively maintained. Memurai is available free of charge, with commercial subscription licenses also available, via memurai.com.
The dear old SQL index
Before we dive into how to model your data using secondary indices, let us take a step back, consider the problem at a more holistic level, and think about how primary and secondary indices are typically used. We all know that in a relational database, data is stored in tables. Consider an example that assumes you are working on some form of e-commerce platform. In such a case, you might store products in a table with the same name:
sql> SELECT * FROM products; ID Name Description Color Price ---------- ------------ ------------------------------------------------------ ---------- ---------- 1 Giant Snakes Delicious fruit flavor sweets with a soft foamy base! yellow 600 2 Happy Cola B Mini cola-bottle-shaped sweets brown 800 3 Sour Waterme The tastiest plant-based sweets you will ever try. red 100 ...
The ID column represents the primary key for this table. It “identifies” each record, meaning that given the ID of a product, you can look up the corresponding product efficiently, without forcing the RDMS to scan through all records in the table:
sql> SELECT * FROM products WHERE ID = 2; ID Name Description Color Price ---------- ------------------ ------------------------------ ---------- ---------- 2 Happy Cola Bottles Mini-cola-bottle-shaped sweets brown 800
In practice, however, most queries are more complex. For example, you might want to find all products of a particular color or within a certain price range.
This is where secondary indices come into play, enabling you to retrieve data that is filtered efficiently, based on the indexed columns. For example, if you were to search for products of a particular color, you would first create the corresponding index:
sql> CREATE INDEX products_color ON products(color); sql> SELECT * FROM products WHERE color = 'red'; ID Name Description Color Price ---------- ---------------------- -------------------------------------------------- ---------- ---------- 3 Sour Watermelon Sweets The tastiest plant-based sweets you will ever try. red 100 4 Sweet Sausage Candy Not quite vegan, but pretty tasty. red 700
Under the hood, this will result in the RDBMS creating and maintaining a B-tree to facilitate efficient queries based on color. You do not have to maintain this index yourself. The specific implementation, data structures, and performance characteristics will depend on the actual database you are using.
Redis keys and primary index
Redis — being a key–value store — does not have tables, let alone primary keys, in the “traditional” relational sense. Instead, Redis supports a wide variety of data structures that are “indexed” by key. To retrieve or modify data, you have to provide a key corresponding to the value being operated on. This is reflected in commands typically accepting the key as a first argument:
127.0.0.1:6379> SET hello "world" OK 127.0.0.1:6379> GET hello "world"
The closest thing Redis has to a primary index is the key itself. In practice, this does not present a problem: in most cases, the ID of a particular record in a relational database can be used appropriately as the key in Redis.
However, one limitation of Redis is that it has no notion of namespaces or tables. This could be problematic if different entities share a common ID, potentially causing data loss by overwriting. It is therefore a good idea to establish a consistent naming scheme for keys. A common pattern is to prefix the key with the entity name itself: instead of storing a product under the key “1”, you would use “product:1”:
127.0.0.1:6379> HMSET product:1 name "Giant Snakes" description "Delicious fruit flavor sweets with a soft foamy base!" color "yellow" price 600 OK 127.0.0.1:6379> HGETALL product:1 1) "name" 2) "Giant Snakes" 3) "description" 4) "Delicious fruit flavor sweets with a soft foamy base!" 5) "color" 6) "yellow" 7) "price" 8) "600"
Bearing in mind that Redis doesn’t care about the format of this key, we could have just as easily used “products/1” or “product_1”. Depending on the size of your value, it might also be worth maintaining a global enum of prefixes.
In the above example, we used a hash map to represent the product record. However, depending on how your data is being queried, there might be better ways of serializing it. If you are always going to query all fields, it might make more sense to use "Thrift" or "Protocol Buffers" to further reduce the size of the values. However, this should be carefully weighed against the complexity introduced by such serialization schemes.
Using sorted sets to model secondary indices
Redis supports a variety of data structures, including sorted sets. Sorted sets are commonly used to model numerical secondary indices.
Sorted sets have two properties that make them particularly interesting as models for secondary indices:
- Members are guaranteed to be unique (as for the C++
- Each member has a “score”, whereby members can be queried based on their respective scores. Sorted sets also support range queries that enable users to efficiently query members that fall within a certain window.
Suppose you want to facilitate the efficient querying of products based on their price. Many e-commerce websites enable customers to filter products based on their individual prices. To unlock such functionality, you would have to create some form of index. Using a sorted set is a natural choice here, as the price of each product can be used as the score. For the members, you could use the product ID itself or the key under which the product is stored:
127.0.0.1:6379> HMSET product:2 name "Cola Bottles" desc "Mini cola-bottle-shaped sweets" color "brown" price 800 OK 127.0.0.1:6379> HMSET product:3 name "Watermelon Sweets" desc "Tasty plant-based sweets." color "red" price 100 OK 127.0.0.1:6379> HMSET product:4 name "Sweet Sausage Candy" desc "Almost vegan candy." color "red" price 700 OK 127.0.0.1:6379> ZADD product_index:price 600 1 (integer) 1 127.0.0.1:6379> ZADD product_index:price 800 2 (integer) 1 127.0.0.1:6379> ZADD product_index:price 100 3 (integer) 1 127.0.0.1:6379> ZADD product_index:price 700 4 (integer) 1
The returned value of “1” indicates that one member has been added to the set “product_index:price”.
ZADD accepts multiple members, so the above is equivalent to passing multiple members and scores simultaneously:
127.0.0.1:6379> ZADD product_index:price 600 1 800 2 100 3 700 4 (integer) 4
The returned value “4” indicates that four members have been added.
Note that we used “product_index:price” as the key name, but we could equally have used something more descriptive, such as “products_indexed_by_price”. At the end of the day, Redis doesn’t care. What matters is that the name of the index does not conflict with any pre-existing keys. Furthermore, it’s advisable to stick to a consistent naming scheme, such as including the field being indexed in the key name. In this case, we used “price”, which is similar to the relational index created above.
To query the data, we can leverage the
ZRANGEBYSCORE enables users to query sorted sets in an efficient manner (O(log(N)), with N being the number of members stored in the sorted set.
If you wanted to query products in a specific price range, you would first issue a corresponding
ZRANGEBYSCORE call to find matching product IDs, then use those IDs to fetch the products themselves:
127.0.0.1:6379> ZRANGEBYSCORE product_index:price 400 700 1) "1" 2) "4"
This query retrieves members with a score in the range 400 to 700, inclusive. Here, the score represents the price of a product, but this could equally well be a UNIX timestamp, sensor value, or counter in other use cases.
The returned values, "1" and "4", are the IDs of the products that meet the criterion of being priced between 400 and 700. In contrast to the relational database example above, you would now issue another query to retrieve the actual product descriptions:
127.0.0.1:6379> HGETALL product:1 1) "name" 2) "Giant Snakes" 3) "description" 4) "Delicious fruit flavor sweets with a soft foamy base!" 5) "color" 6) "yellow" 7) "price" 8) "600" 127.0.0.1:6379> HGETALL product:4 1) "name" 2) "Sweet Sausage Candy" 3) "description" 4) "Not quite vegan, but pretty tasty." 5) "color" 6) "red" 7) "price" 8) "700"
Depending on your dataset, you might also want to define an upper limit of the number of members being retrieved. This can be achieved by supplying a third argument to
One potential pitfall to look out for here is to not expect all products whose ID has been returned to still exist. As a separate
DEL command might have been issued earlier, the key being indexed might have been deleted by the time the
HGETALL is issued. This is called a "race condition" and it is very tricky to debug. Unfortunately, it is an aspect of most (if not all) non-trivial systems.
We will look in a moment at how to avoid race conditions.
Using sets for indexing categorical data
Real-world applications commonly need to deal with enumerated types, such as colors or categories. In the above example, we might want to enable our customers to find products that match a particular color. A common way of implementing such functionality is to create (unsorted) sets for each category. These sets could then serve as secondary indices.
127.0.0.1:6379> SADD product_index:color:yellow 1 (integer) 1 127.0.0.1:6379> SADD product_index:color:brown 2 (integer) 1 127.0.0.1:6379> SADD product_index:color:red 3 (integer) 1 127.0.0.1:6379> SADD product_index:color:red 4 (integer) 1
We can use the
SADD command to adding elements to a set. It works similarly to the
ZADD command described above. However, it does not use the "score" concept, as the resulting data structure will be unsorted.
To identify products that are of a certain color, we can leverage the Redis
SUNION returns members that are in any of the provided sets. In this case, it enables us to efficiently query products that are "yellow" or "brown", for example:
127.0.0.1:6379> SUNION product_index:color:red product_index:color:brown 1) "2" 2) "3" 3) "4"
Redis also provides the ability to find the intersection of sets by using the
SINTER returns members that are in all the provided sets. This is useful in identifying products that match all the provided criteria. In this case, it would not be useful, because each product can only have a single color. However, if we were to also index products by category, we could construct queries that match products that are both in a certain category and have the desired color:
127.0.0.1:6379> SINTER product_index:color:red product_index:category:candy 1) "3" 2) "4" 3) "9"
Maintaining secondary indices
So far, we have glossed over the details of when and how a secondary index should be created. Because Redis has no built-in support for indices, everything related to creating, updating, and using indices has to be handled at the application level.
There are several ways in which indices can be created.
We will now discuss some of the common approaches.
When updating indices in real time, some of the issues around the lack of consistency guarantees can be mitigated by using "transactions" to update the core data, such as the product and the corresponding indices, in one step.
When updating the price of a product, you might be tempted to issue two commands independently: one to update the product and another to update the index:
127.0.0.1:6379> HSET product:1 price 200 (integer) 0 127.0.0.1:6379> ZADD product_index:price 1 200 (integer) 1
However, this approach can cause problems. The application might crash before the
ZADD command could be issued, which would cause the index to become stale. Race conditions are another issue. A different user might also be updating the price to a different value. Their
HSET command might succeed the price update to 200, but their
ZADD command might precede it. This would cause the index to become unsynchronized.
For these reasons, it is advisable to use Redis transactions:
127.0.0.1:6379> MULTI OK 127.0.0.1:6379(TX)> HSET product:1 price 200 QUEUED 127.0.0.1:6379(TX)> ZADD product_index:price 1 200 QUEUED 127.0.0.1:6379(TX)> EXEC 1) (integer) 0 2) (integer) 0
MULTI begins the transaction and
EXEC commits it. The
ZADD commands are “QUEUED*” and don’t take effect until the transaction has been committed. Transactions in Redis are serializable, meaning that they are executed sequentially. This mitigates the aforementioned race condition. Further, Redis ensures that either all the commands that are part of the transaction are executed or none of them are. Transactions are atomic, hence the indices remain valid at all times (before, after and *during the transaction).
Updating secondary indices can be achieved using a technique that those who enjoy the perils of multithreaded programming will know, namely "locks".
One method of implementing locks in Redis uses the
SETNX adds an item to a Redis set if the provided key does not already exist.
SETNX returns "1" if the item was added successfully and "0" otherwise.
SETNX is the basic tool used to build locks in Redis.
(A full discussion of SETNX-based locking is beyond the scope of this article, but we might get back to it again in the future.)
Another possibility is to use Redlock, but this is also beyond the scope of this article.
KEYS considered harmful
KEYS command enables users to query keys that match the provided glob. While this might “look” like an index, it isn’t. Using the KEYS command is incredibly inefficient (O(N), with N being the total number of keys). While very useful for debugging, its use when querying data in a production environment would almost always be the wrong choice. Even when debugging a production system, manually issuing a
KEYS command can be quite dangerous because of the single-threaded nature of Redis.
If you need to be able to query over all keys of a certain type, storing them in a set is usually a preferable approach.
Redis is fundamentally a key–value store. Secondary indices must be implemented at the application level and require some manual work. Consideration should be given to the performance trade-off involved in the requirement for additional storage that results from the use of such indices. You should also leverage the built-in data structures in Redis to model your data in a way that facilitates efficient querying of core data. Because Redis does not provide any consistency guarantees around secondary indices, you should use transactions or locks to keep the indices synchronized and up to date.