On Monday, Robert Haas committed a patch of mine that considerably speeds up the sorting of text in PostgreSQL. This was the last and the largest in a series of such patches, the patch that adds "abbreviated keys". PostgreSQL 9.5 will have big improvements in sort performance.
In realistic cases, CREATE INDEX operations on text are over 3 times faster than in PostgreSQL 9.4. Not every such utility operation or data warehousing query involving a big sort is sped up by that much, but many will be.
This was a piece of work that I spent a considerable amount of time on over the past few months. It's easy to justify that effort, though: sorting text is a very fundamental capability of any database system. Sorting is likely the dominant cost when creating B-Tree indexes, performing CLUSTER operations, and, most obviously, for sort nodes that are required by many plans that are executed in the service of queries with ORDER BY or DISTINCT clauses, or aggregates using the GroupAggregate strategy. Most of the utility statements that need to perform sorts must perform them with a very disruptive lock on the target relation (CREATE INDEX CONCURRENTLY is a notable exception), so quite apart from the expense of the sort, the duration of sorts often strongly influences how long a production system is seriously disrupted.
My interest in sorting is not new: I first worked on it in 2011. Early research on it back then prompted Robert Haas and Tom Lane to write the SortSupport infrastructure, which I've now extended here. Originally, the SortSupport infrastructure was all about providing alternative versions of comparators for use in sort routines, versions that avoided certain overhead otherwise inherent to calling functions that are generally accessible from SQL. As a highly extensible system, PostgreSQL requires that sort behavior be defined in terms of a default B-Tree operator class, which is itself defined in terms of SQL operators with underlying SQL-callable functions. These functions are written in C for built-in types, but in principle they could be written in a PL, like PL/Python, for example. When the underlying comparator can be expected to compile to just a few CPU instructions, "fmgr elision" (avoiding SQL function call overhead) is important - the time spent in the "fmgr" when not eliding it shows up prominently on profiles with certain types (or rather, it did in the past).
Note that I generalized SortSupport to work in more cases, so B-Tree index builds will get a nice little boost in PostgreSQL 9.5, even for types like integer and float8. That's not what this blog post is really about, though. This blog post is about the interesting new direction in which the SortSupport infrastructure has been taken, beyond mere "fmgr elision": abbreviation.
A well known problem with the text datatype in PostgreSQL is that it uses the operating system/C standard library strcoll() function to resolve comparisons as tuples are sorted, which is very expensive. It's at least a thousand times more expensive than comparing integers, for example. This general problem is something that Robert Haas has expressed concern about in the past.
The expense relates to a normalization process whereby string comparisons use complex tables to make sure that strings are compared according to the rules of some particular culture or nation (that is, some particular collation associated with a locale). Even in English-speaking countries this is important; for example, the en_US collation considers differences in case (upper case versus lower case) only after alphabetical ordering and diacritical differences, so case is considered last of all (the C locale, on the other hand, will sort upper case and lower case strings into two distinct batches, which is typically not desirable). In addition, while English usually doesn't have diacritics, sometimes it does. At work, I'm still sometimes annoyed by the sort order of the Linux Hipchat client's user list, which uses the C locale. Hi Ómar!
It was always suspected that we could more effectively amortize the cost of these locale-aware comparisons, by performing a transformation of strings into binary keys using strxfrm(), and sorting the keys instead (using a strcmp()-based comparator with the keys, which only considers raw byte ordering). This comparison will produce equivalent results to just using strcoll() directly. But the binary keys are much larger than the original strings - typically almost 4x larger. Moreover, we'd still need to do a tie-breaker strcmp() comparison (to check for strict binary equality) using the original string, when strcoll() reports equality, because the historic idea of equality that the text type offers is strict binary equality. There were some historic edge cases where a tie-breaker strcmp() was not performed following strcoll() returning '0', resulting in corrupt B-Tree indexes on a Hungarian database. strcoll() could return 0 despite not being passed a pair of bitwise-identical strings.
Having to keep around the original text datum seemed like an additional burden on the whole idea of using strxfrm() blobs as sort keys. It seemed like using binary keys to sort had a lot of promise, but we couldn't quite work out how to exploit that idea - until recently.
Abbreviated keys were committed:
Use abbreviated keys for faster sorting of text datums.
This commit extends the SortSupport infrastructure to allow operator
classes the option to provide abbreviated representations of Datums;
in the case of text, we abbreviate by taking the first few characters
of the strxfrm() blob. If the abbreviated comparison is insufficient
to resolve the comparison, we fall back on the normal comparator.
This can be much faster than the old way of doing sorting if the
first few bytes of the string are usually sufficient to resolve the
comparison.

There is the potential for a performance regression if all of the
strings to be sorted are identical for the first 8+ characters and
differ only in later positions; therefore, the SortSupport machinery
now provides an infrastructure to abort the use of abbreviation if
it appears that abbreviation is producing comparatively few distinct
keys. HyperLogLog, a streaming cardinality estimator, is included in
this commit and used to make that determination for text.
Peter Geoghegan, reviewed by me.
By storing just the first 8 bytes (on 64-bit platforms; 4 bytes on 32-bit platforms) of the strxfrm() blob in a field that would otherwise contain a pointer-to-text (since text is a pass-by-reference type) - the same type-punned field that directly stores the representation of pass-by-value types like integer - many comparisons can be resolved using just those 8 bytes directly, without pointer-chasing. At the same time, the cost of locale transformations is still quite effectively amortized, because, as always when using strxfrm(), the transformation is performed O(n) times, rather than an average of O(n log n) times (the transformation work performed by strxfrm() and strcoll() may not be exactly comparable, but it's close enough).
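The packing step can be sketched like this (abbrev_key is a hypothetical helper, not the actual PostgreSQL code): the first 8 blob bytes are assembled big-endian-style into an integer, so that ordinary unsigned integer comparison agrees with memcmp() on the zero-padded prefix:

```c
#include <stdint.h>
#include <string.h>

/* Pack up to the first 8 bytes of a strxfrm() blob into a uint64_t such
 * that comparing the integers gives the same order as memcmp() on the
 * (zero-padded) 8-byte prefixes. */
static uint64_t
abbrev_key(const char *blob, size_t blob_len)
{
    unsigned char buf[8] = {0};         /* short blobs are zero-padded */
    uint64_t    key = 0;

    memcpy(buf, blob, blob_len < 8 ? blob_len : 8);
    for (int i = 0; i < 8; i++)
        key = (key << 8) | buf[i];      /* big-endian assembly */
    return key;
}
```

On a genuinely big-endian machine no byte shuffling is needed at all, since the in-memory representation already compares correctly as an integer.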
It turns out that the large binary key blobs produced by strxfrm(), while much larger than the original strings, have a significant concentration of entropy towards the start of the blob (assuming the use of the Unicode collation algorithm, or an algorithm with similar properties). This is because the representation consists of a series of "levels". The "primary weights", which appear first, represent primary alphabetical ordering when using Latin scripts, so whitespace, punctuation, and case differences are not represented at that level. Diacritics on accented Latin characters are represented at a subsequent level, and so the abbreviated key representation typically won't vary if accents are added to or removed from a text datum. This is important because languages that use accents extensively, like French or Spanish, will get a concentration of entropy in their 8-byte abbreviated keys that's about the same as if no accents were used, even though accented code points usually take 2 bytes of storage in UTF-8, rather than the 1 byte used by unaccented Latin-alphabet code points.
When I initially discussed the idea of abbreviated keys, there was a certain degree of skepticism from other Postgres hackers. What if most comparisons are not resolved by abbreviated comparisons, due to text datums with a lot of redundant or repeated information at the beginning? Could all the strxfrm() work go to waste when that happens? Well, for one thing, low cardinality sets (tuples with text columns that have a relatively low number of distinct values) are not a problem. That's because strcoll() is still a huge cost, and if we can have our authoritative tie-breaker comparator observe that the strings are identical, then no strcoll() comparison is ever needed - we can just exit early with a simple, cheap opportunistic binary comparison (memcmp()), which is almost as good. But what about when there are many different strings with differences towards the end of the string, past the 8th or so byte?
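The low-cardinality early exit can be sketched as follows (a hypothetical comparator, not the actual tuplesort code; NUL-terminated inputs assumed): the authoritative comparator tries a cheap binary equality check before ever calling strcoll():

```c
#include <string.h>

/* Authoritative comparator sketch.  Bitwise-equal strings never reach
 * strcoll() at all, which is what makes low-cardinality inputs cheap. */
static int
full_text_cmp(const char *a, const char *b)
{
    if (strcmp(a, b) == 0)      /* opportunistic binary fast path */
        return 0;

    int r = strcoll(a, b);      /* expensive locale-aware comparison */
    if (r != 0)
        return r;

    /* strcoll() ties are broken by binary order, because text equality
     * is strict binary equality. */
    return strcmp(a, b);
}
```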
CPU cache characteristics have presented complicated engineering trade-offs for sorting infrastructure for a long time. Database luminary Jim Gray proposed an abbreviation-like technique as early as 1994, in his AlphaSort paper (PDF). He describes a "key-prefix sort" in the paper. Even back in 1994, Gray observed that memory latency was the dominant cost by a wide margin. The underlying trends in CPU performance characteristics have continued apace since then. Before his death in 2007, Gray officiated the Sort Benchmark. Among the rules for the "Daytona sort" category, which concerns the sort performance of general-purpose algorithms (which is what I'm interested in), it states that Daytona sort entrants "must not be overly dependent on the uniform and random distribution of key values in the sort input". It's almost as if Gray was saying: "of course I expect you to use abbreviated keys, but don't push your luck!". And so it is for PostgreSQL. Some cases benefit much more than others, and some cases might even be slightly regressed.
Fundamentally, when you do a cost/benefit analysis, abbreviated keys are very compelling. The upsides are clearly very large, and the break-even point for switching to abbreviation is surprisingly far out. We cannot ignore the performance benefits of these techniques merely because some much rarer cases will be slightly regressed. But, as it happens, we have cheap worst-case insurance: HyperLogLog is used to cheaply and fairly accurately estimate the cardinality of both the abbreviated keys and the original text values. If abbreviated cardinality is an effective proxy for full cardinality, then most comparisons will either be resolved as abbreviated comparisons, or by a cheap memcmp() tie-breaker, which is almost as good. Otherwise, we abort abbreviation before the sort proper is underway.
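The abort decision can be sketched as a simple proxy test (the function name and the 5% threshold are illustrative only; PostgreSQL's actual cost model is more nuanced). The two cardinality estimates come from two HyperLogLog counters, one fed abbreviated keys and one fed full values:

```c
#include <stdbool.h>

/* Decide whether to abandon abbreviation mid-build, given cardinality
 * estimates from two HyperLogLog streaming estimators.  Thresholds here
 * are purely illustrative. */
static bool
should_abort_abbreviation(double abbrev_distinct, double full_distinct,
                          int tuples_seen)
{
    if (tuples_seen < 100)
        return false;               /* too early to judge */

    if (abbrev_distinct < 1.0)
        abbrev_distinct = 1.0;      /* clamp nonsense estimates */

    /* Abbreviated keys should be a reasonable proxy for the full keys;
     * if nearly all abbreviated keys collide, abbreviation won't pay. */
    return abbrev_distinct / full_distinct < 0.05;
}
```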
Abbreviated keys are just infrastructure. While text is the most compelling case, there are at least a few other datatypes that would greatly benefit from support for abbreviation. These include:
- numeric
- character(n)
- uuid
- bytea
- citext (case insensitive text, from contrib/citext)
I welcome others with an interest in making sorting faster to work on the relevant opclass support for each of these types, and possibly others. Other people may be able to come up with novel encoding schemes for these types that maximize the entropy within the finished abbreviated keys. Order-preserving compression is likely to be an area where text's support could be improved, by making comparisons resolved at the abbreviated key level more frequent. Hopefully the benefits of the abbreviated key infrastructure will not be limited to accelerating sorts on text.