Secondary indexes in Cassandra

CC-BY Judy Gallagher @ Flickr https://www.flickr.com/photos/52450054@N04/49397666902/

While most focus in the Cassandra world is on the upcoming 4.0 release, or perhaps on our project to add JSON and GraphQL APIs to Cassandra, the feature that excites me personally is much more fundamental and at the core of the database itself: finally having a usable secondary indexing implementation!

Datastax published the Storage Attached Index CEP in September. While this implementation won’t make it into Cassandra 4.0, it is already available as GA in Datastax Enterprise 6.8. Introducing a reliable and performant secondary indexing implementation in Cassandra is a big deal! It will open up a whole range of new applications that can now choose Cassandra as the best tool for the job.

Personally, I always want to understand why something is better than what came before. In this blog post I first enumerate the limitations of the original “2i” index implementation, and then explain how SAI deals with those problems much better.

Writing this has also been a good journey generally into understanding where LSM storage engines stand when it comes to secondary indexes. While LSMs in many ways excel in handling primary key based workloads, somewhat surprisingly it seems like the state of the art is still evolving towards truly optimal solutions for secondary indexes. (And SAI indexes very much present a state of the art solution.)

Note that Cassandra also has a newer index implementation known as SASI, developed at Apple (it first shipped in Cassandra 3.4). While I will omit discussion of SASI indexes in this blog post, the short summary is that they share many of the benefits of the new SAI index, so they too are an improvement over the original “2i”. The main difference is that SASI indexes take up much more space. The CEP also covers SASI in more detail.

Limitations with 2i

While Cassandra has had a secondary index implementation practically forever, in practice its performance characteristics were so poor that the common advice was essentially to not use it at all. This is the advice the Datastax documentation used to give:

Do not use an index in these situations:

  • On high-cardinality columns for a query of a huge volume of records for a small number of results.
  • In tables that use a counter column.
  • On a frequently updated or deleted column.
  • To look for a row in a large partition unless narrowly queried.
When you add all of that together, the end result is that 2i indexes are either used rarely, or not at all.

So what exactly is the problem with 2i indexes?

While the above quoted advice is well known, and many Cassandra experts know it by heart, it was surprisingly difficult to learn what exactly the properties of the 2i index implementation are that cause these limitations. I ended up asking several of my more experienced colleagues before I felt satisfied with the explanations. It seems that as people have learned to use 2i sparingly, the original reasons for doing so have slowly been forgotten.

The CEP does a great job at explaining why SAI indexes are better, but the reader is expected to already know how 2i works. Which is not the case for yours truly. This Datastax blog seems to be the most detailed explanation of 2i internals I could find. It will have to do as a reference, since I don’t intend to actually read the source code this time.

Cassandra 2i indexes are implemented internally as a separate, hidden table. This is also how RocksDB and WiredTiger implement secondary indexes: the index leaves store the primary key values of the main table.

From here we can start to understand why 2i indexes have the limitations they have:

2i requires read-before-write

Famously, an LSM database can achieve great write performance because it can store writes (updates, inserts, deletes) without needing to read the underlying record first. The new values are simply written quickly to “the top” of the LSM structure. The penalty comes when the record is queried: old and new versions of the record must be found and reconciled.
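
The write and read paths described above can be sketched in a few lines. This is a toy model of my own, not Cassandra code: writes are "blind" appends, and the read scans versions newest-to-oldest to reconcile them.

```python
# Toy sketch (not Cassandra code) of the LSM write/read trade-off:
# writes never read the old record; reads reconcile all versions.

class ToyLSM:
    def __init__(self):
        self.memtable = {}   # newest writes
        self.sstables = []   # older, immutable segments, newest first

    def write(self, key, value):
        # "Blind" write: no lookup of the existing value is needed.
        self.memtable[key] = value

    def flush(self):
        # Freeze the memtable into an immutable SSTable-like segment.
        self.sstables.insert(0, dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        # The read pays the cost: scan newest-to-oldest until found.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in self.sstables:
            if key in sstable:
                return sstable[key]
        return None

db = ToyLSM()
db.write("id=1", {"a": 2})
db.flush()
db.write("id=1", {"a": 5})   # overwrite without reading the old row
print(db.read("id=1"))       # -> {'a': 5}: the newest version wins
```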

However, this benefit of fast writes is lost when you add 2i indexes to a table. When you update a value in an indexed column, the previous value must also be removed from the index. Example:

INSERT INTO mytable (id, a) VALUES (1, 2);
UPDATE mytable SET a=5 WHERE id=1;

For the UPDATE above, the 2i index must add the index entry (5,1) but also remove (2,1). To know the old value 2, it must first query the record id=1 to find the pre-state of the row. And that brings back the read-before-write behavior common in traditional B-Tree based databases.

Essentially, using 2i indexes in Cassandra combines the worst parts of an LSM and a B-Tree database: both writes and reads have to do a lot of work!

In the above list of when not to use 2i indexes, the advice to only index rarely updated columns relates to this problem: if we don’t update the indexed values, we avoid the read-before-write.
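
To make the read-before-write concrete, here is a hypothetical sketch of 2i-style index maintenance (my own simplification, not Cassandra's actual code): because the index maps column values to primary keys, an UPDATE must first read the old row so its stale index entry can be removed.

```python
# Hypothetical sketch of 2i-style index maintenance (not Cassandra code).
# The index maps column value -> set of primary keys. To keep it correct,
# an UPDATE must first READ the old row so its index entry can be removed.

table = {}   # primary key -> row
index = {}   # value of column "a" -> set of primary keys

def upsert(pk, a):
    old = table.get(pk)              # <-- the read-before-write
    if old is not None:
        index[old["a"]].discard(pk)  # remove the stale index entry
    table[pk] = {"a": a}
    index.setdefault(a, set()).add(pk)

upsert(1, 2)   # INSERT INTO mytable (id, a) VALUES (1, 2)
upsert(1, 5)   # UPDATE mytable SET a=5 WHERE id=1
print(index)   # -> {2: set(), 5: {1}}: (2,1) removed, (5,1) added
```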

Elsewhere: note that RocksDB, for example, appears to have this same limitation. Although the blog post uses phrases like “blind write”, RocksDB only avoids reading the old index key; it still requires reading the primary key record. I’m aware of Tarantool doing something similar to SAI, where the work of removing old keys from the secondary index is deferred entirely to read and compaction operations, enabling truly read-free writes. This seems to be the state of the art in secondary indexes for LSM databases. And this is also how SAI does index maintenance; more on that below.

Tombstones

DELETE statements in an LSM database don’t immediately remove the row they point to; instead, a “tombstone record” is written to the top of the LSM structure. Again, it is up to read queries to reconcile the row that exists with the tombstone that marks it as deleted.

Generally, when you delete data from a database you’d expect to benefit: there will be less data on disk, and hopefully the database becomes faster. But with an LSM, deletes initially add cost: the tombstones written to disk take more space, not less!

Additionally, deletes and tombstones are tricky in Cassandra for distributed-database reasons. Skipping the details: Cassandra by default stores tombstones for 10 days! So if you delete a lot of data, it will take 10 days before you can expect both your original data and the tombstones to actually disappear from your database.

But with 2i indexes, every update is also a delete! This means that with frequently updated columns, your 2i indexes will quickly fill with tombstones. This is very inefficient.
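
A small sketch (assumptions mine, not Cassandra internals) shows how quickly this adds up: modeling the index as an append-only log, each update of an indexed column writes one tombstone for the old entry on top of the new entry.

```python
# Sketch of tombstone accumulation in a 2i-style index (my own model).
# Each update of an indexed column writes a tombstone for the old index
# entry, so a frequently updated column piles up index tombstones.

TOMBSTONE = object()

index_log = []  # append-only log of (value, pk, payload-or-tombstone)

def update_indexed_column(pk, old_value, new_value):
    index_log.append((old_value, pk, TOMBSTONE))   # delete old entry
    index_log.append((new_value, pk, "live"))      # add new entry

# Update the same row's indexed column 5 times:
value = 0
for new in range(1, 6):
    update_indexed_column(pk=1, old_value=value, new_value=new)
    value = new

tombstones = sum(1 for (_, _, p) in index_log if p is TOMBSTONE)
print(tombstones)  # -> 5: one tombstone per update, each kept for days
```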

Other performance issues

My analysis here still doesn’t explain all of the “when not to use” advice above. The remaining issues (like why high-cardinality indexes perform poorly) seem relatively specific to 2i, and I will skip them in this blog post. The last common warning is to not index columns where the same values repeat a lot, because all of those rows end up stored in the same large partition of the index table.

How SAI indexes avoid the above problems

If we traverse the above list backwards, the first benefit of Storage Attached Indexes is obvious: the index entries always live in the same SSTable file as the rows they point to. Whenever an SSTable file is compacted, the primary data and the secondary index stay together!

SAI indexes have their own specific structure (again, see the CEP for details) rather than just being a hidden LSM table. In particular, they don’t use tombstones. Index entries in SAI point to the row in the same SSTable. Whether the values in that row are still valid or have been overridden by an UPDATE or DELETE is a matter for the read query to validate. 

This is also how SAI avoids read-before-write. Each SAI index simply points to the rows in the same SSTable file. This way it doesn’t incur the write-performance penalty that 2i indexes did, because old SAI values in other SSTable files don’t need to be deleted or overwritten. Reconciling what the actual current value is is left for the read query to find out. This behavior brings us back to the performance profile you’ve come to expect from an LSM based database: writes are fast because they postpone work to the reads.
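
Here is my reading of that design as a sketch (based on the CEP, not actual SAI code): each SSTable carries its own index over its own rows, so writes never touch older SSTables. A query probes every SSTable's index, then validates each candidate against the newest version of that row.

```python
# Sketch of a per-SSTable ("storage attached") index, my interpretation
# of the CEP rather than real SAI code. Writes never update old SSTables;
# reads probe every SSTable's local index and filter out stale entries.

sstables = []  # list of (rows: pk -> row, index: value -> pks), newest first

def flush(rows):
    index = {}
    for pk, row in rows.items():
        index.setdefault(row["a"], set()).add(pk)
    sstables.insert(0, (rows, index))   # index lives with its SSTable

def newest_row(pk):
    for rows, _ in sstables:            # newest first
        if pk in rows:
            return rows[pk]
    return None

def query_a_equals(value):
    hits = set()
    for _, index in sstables:           # probe every SSTable's own index
        hits |= index.get(value, set())
    # Validate: an index entry is stale if a newer SSTable overwrote the row.
    return {pk for pk in hits if newest_row(pk)["a"] == value}

flush({1: {"a": 2}})          # old version of the row
flush({1: {"a": 5}})          # newer SSTable overrides it, no read of the old one
print(query_a_equals(2))      # -> set(): stale entry filtered out at read time
print(query_a_equals(5))      # -> {1}
```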

But wait, there’s more

SAI indexes have more great stuff in them than what I’ve covered in this blog post. In particular they are very compact due to using tries and kd-trees as their data structures. 

We expect that an index that is efficient in both write performance and memory consumption will revolutionize how people use secondary indexes in Cassandra. No longer do you need to advise Cassandra users to use indexes sparingly (or, let’s face it, not at all…); instead, indexes can be embraced as a best practice.

In conclusion, SAI indexes finally take Cassandra from a primary-key-oriented database to a powerful general purpose database with secondary indexes. You can try SAI indexes immediately in Datastax Enterprise 6.8, or wait for them to land in a Cassandra binary some time after 4.0.

Further reading

A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases

Comments

I look forward to SAI getting into upstream.

I assume that (SAI) are local secondary indexes -- local to the SST. That allows read-before-write to be avoided, but comes at the cost of fanout on secondary index queries. Once more, lunch is rarely free. But the compromise here is a great one for many workloads. I assume:
* Point query on the SAI must check every SST, made faster by bloom filters
* Range query on the SAI must check every SST

I will read the Tarantool paper soon, but assume the cost of that approach is that secondary-index queries are not index-only because some index entries can be invalid (were not removed on delete) and the base row must be read to confirm.

WRT deletes making space-amp worse, that is bounded when using leveled compaction to ~10%.

B-Trees with MVCC also don't reclaim space on delete. The space can't be reclaimed until there are no snapshots that might read it. Postgres defers that to vacuum. InnoDB defers that to purge.

> I look forward to SAI getting into upstream.

Me too!

> I assume that (SAI) are local secondary indexes -- local to the SST. That allows read-before-write to be avoided, but comes at the cost of fanout on secondary index queries. Once more, lunch is rarely free. But the compromise here is a great one for many workloads. I assume:
> * Point query on the SAI must check every SST, made faster by bloom filters
> * Range query on the SAI must check every SST

Lunch is definitely not free. But reads are already expensive in Cassandra; with this design we can at least keep writes fast.

The range query could be addressed with the SuRF trie-based bloom filter. However, as our indexes already use tries (for text), some testing my colleagues have done indicated that adding a SuRF would not be much different from simply reading the SAI index itself. In the end, their structures are very similar.

> I will read the Tarantool paper soon, but assume the cost of that approach is that secondary-index queries are not index-only because some index entries can be invalid (were not removed on delete) and the base row must be read to confirm.

Yes. I believe SAI reads are equivalent to this. You have to read both the index and the primary key.

> WRT deletes making space-amp worse, that is bounded when using leveled compaction to ~10%.

Why is this?

> B-Trees with MVCC also don't reclaim space on delete. The space can't be reclaimed until there are no snapshots that might read it. Postgres defers that to vacuum. InnoDB defers that to purge.

Good point. The painful part in Cassandra is that we have to store tombstones for a really long time to avoid a scenario where a replica was offline for days, comes back, and via repair we receive a "new" value for a record that is in fact already deleted. (This is an area that can be improved in the future; the current solution is not elegant either. Strictly speaking you'd have to keep all tombstones forever to guard against arbitrarily long outages.)


For the 10% bound on space-amp (an estimate, not exact) with leveled compaction: assume the per-level fanout is 10 (each level is 10x larger). Then ~90% of the data is in the max level and ~10% is in the non-max levels, and in the worst case all of the data in the non-max levels is redundant (updates or deletes) with data in the max level.
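
That estimate can be checked with a few lines of arithmetic (assuming, as above, a fanout of 10 and all non-max data redundant in the worst case):

```python
# Worked version of the space-amp estimate: with per-level fanout f,
# the non-max levels together hold a geometric series of the max level's
# size, and in the worst case all of that data is redundant.

f = 10                                   # each level is 10x larger
levels = [f**i for i in range(5)]        # sizes 1, 10, 100, 1000, 10000
non_max = sum(levels[:-1])               # data in non-max levels: 1111
space_amp = non_max / levels[-1]         # redundant fraction vs. max level
print(round(space_amp, 3))               # -> 0.111, i.e. roughly 10%
```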
