I attended PGDay on January 31st in Brussels. The event was held just before FOSDEM, which I also attended (expect another blog post). Here are some of the notes I took during the conference.
Statistics in PostgreSQL, Heikki Linnakangas
Due to transit delays, I only caught the last half of that talk.
The main goal of this talk was to explain some of Postgres' per-column statistics. In a nutshell, Postgres needs to have some idea about tables' content in order to choose an appropriate query plan.
Heikki explained which sorts of statistics Postgres gathers, such as most common values and histograms. Another interesting statistic is the correlation between the physical order of rows on disk and the logical order of the column's values (see CLUSTER).
Column statistics are gathered when running ANALYZE and stored in the pg_statistic system catalog. The pg_stats view provides a human-readable version of these stats.
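For a quick look at those statistics, one can query pg_stats directly (the table and column names below are made up):

```sql
-- most common values, histogram and physical/logical correlation
-- for a hypothetical orders.status column
SELECT most_common_vals, most_common_freqs, histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
```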
Heikki also explained how to determine whether performance issues are due to out-of-date statistics. As it turns out, EXPLAIN ANALYZE shows, for each node of the query plan, how many rows the planner expected to process and how many were actually processed. The rule of thumb is that similar values (no more than an order of magnitude apart) mean that column statistics are doing their job. A wider margin between expected and actual rows means that the statistics are possibly preventing the query planner from picking a more optimized plan.
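For instance, with the same made-up table, one would compare the "rows=" value in each node's cost section (the estimate) with the "rows=" value in its "actual time" section (the real count):

```sql
-- compare the planner's estimated rows with the actual rows for each plan node
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'shipped';
```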
He noted, though, that statistics-related performance issues often happen on tables with very frequent modifications. Running ANALYZE manually, or increasing the frequency of the automatic ANALYZE runs, may help in those situations.
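A minimal sketch, still with the hypothetical orders table (the threshold value is only an example):

```sql
-- refresh the column statistics by hand...
ANALYZE orders;
-- ...or have autovacuum analyze this frequently-modified table more often
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.01);
```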
Advanced Extension Use Cases, Dimitri Fontaine
Dimitri used very simple examples to explain some of Postgres' lesser-known extensions and the overall extension mechanism.
Here's a grocery list of the extensions and types he introduced (a few usage sketches follow the list):
- the intarray extension, which adds operators and functions to the standard ARRAY type, specifically tailored for arrays of integers,
- the standard POINT type which provides basic 2D flat-earth geometry,
- the cube extension that can represent N-dimensional points and volumes,
- the earthdistance extension that builds on cube to provide distance functions on a sphere-shaped Earth (a close-enough approximation for many uses),
- the pg_trgm extension, which provides text similarity functions based on trigram matching (a much simpler and thus faster alternative to Levenshtein distance), especially useful for "typo-resistant" auto-completion suggestions,
- the hstore extension, which provides a simple but efficient key-value store that has everyone talking in the Postgres world (it's touted as the NoSQL killer),
- the hll extension, which implements the HyperLogLog algorithm and seems very well suited to storing and counting unique visitors on a web site, for example.
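A few one-line sketches to give a feel for some of them (the data and coordinates are made up):

```sql
CREATE EXTENSION pg_trgm;
-- trigram similarity is typo-tolerant: 1.0 means identical, 0.0 means unrelated
SELECT similarity('postgresql', 'postgrsql');

CREATE EXTENSION hstore;
-- simple key/value storage, looked up with the -> operator
SELECT 'color => blue, size => L'::hstore -> 'color';

CREATE EXTENSION cube;
CREATE EXTENSION earthdistance;  -- builds on cube
-- great-circle distance (in meters) between two (latitude, longitude) points
SELECT earth_distance(ll_to_earth(50.85, 4.35), ll_to_earth(48.86, 2.35));
```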
An all-around great talk with simple but meaningful examples.
Integrated cache invalidation for better hit ratios, Magnus Hagander
What Magnus presented almost amounted to a tutorial on caching strategies for busy web sites. He went through simple examples, using the ubiquitous Django framework for the web view part and Varnish for the HTTP caching part.
The whole talk revolved around adding private (X-prefixed) HTTP headers to responses, listing one or more "entity IDs", so that Varnish's cache can be purged whenever those entities change. The hard problem lies in how and when to call PURGE on Varnish.
The obvious solution is to override the save() method on Django's Model-derived objects. One can then use httplib (or better yet requests) to purge the cache. This solution can be slightly improved by using Django's signal mechanism instead, which sounds an awful lot like CubicWeb's hooks.
The problem with the above solution is that any DB modification not going through Django (and they will happen) will not invalidate the cached pages. So Magnus then showed how to write the same cache-invalidating code in PL/Python triggers.
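Here is a minimal sketch of what such a trigger could look like, assuming a hypothetical article table and a Varnish instance listening on localhost:6081 (my reconstruction, not Magnus' actual code):

```sql
CREATE OR REPLACE FUNCTION purge_article_cache() RETURNS trigger
LANGUAGE plpythonu AS $$
import urllib2
row = TD["new"] or TD["old"]
req = urllib2.Request("http://localhost:6081/articles/%s" % row["id"])
req.get_method = lambda: "PURGE"  # PURGE is a non-standard HTTP method
try:
    urllib2.urlopen(req, timeout=1)
except Exception:
    pass  # don't abort the transaction if the cache server is unreachable
$$;

CREATE TRIGGER article_cache_purge
AFTER INSERT OR UPDATE OR DELETE ON article
FOR EACH ROW EXECUTE PROCEDURE purge_article_cache();
```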
While this does solve that last issue, it introduces synchronous HTTP calls into the database, hurting write performance (or breaking writes entirely if the HTTP calls fail). The fix for those problems, at the cost of a little added latency, is to use SkyTools' PgQ, a simple message queue built on Postgres. Moving the HTTP calls out of the database and into a Consumer (a class provided by PgQ's Python bindings) makes the cache-invalidating trigger asynchronous, reducing write overhead.
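An asynchronous variant could look like this, assuming PgQ is installed, a 'cache_invalidation' queue has been created with pgq.create_queue(), and a separate consumer process reads the events and issues the actual PURGE calls:

```sql
CREATE OR REPLACE FUNCTION queue_cache_invalidation() RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
    row_id text;
BEGIN
    IF TG_OP = 'DELETE' THEN
        row_id := OLD.id::text;
    ELSE
        row_id := NEW.id::text;
    END IF;
    -- just queue an event; the HTTP call happens outside the database
    PERFORM pgq.insert_event('cache_invalidation', TG_TABLE_NAME, row_id);
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$;

CREATE TRIGGER article_queue_invalidation
AFTER INSERT OR UPDATE OR DELETE ON article
FOR EACH ROW EXECUTE PROCEDURE queue_cache_invalidation();
```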
A clear, concise and useful talk for any developer in charge of high-traffic web sites or applications.
SkyTools: http://wiki.postgresql.org/wiki/SkyTools
The Worst Day of Your Life, Christophe Pettus
Christophe humorously went back to that dreadful day in the collective Postgres memory: the release of 9.3.1 and the streaming replication chaos.
My overall impression of the talk: Thank $DEITY I'm not a DBA!
But Christophe also gave some valuable advice, even for non-DBAs:
- Provision 3 times the necessary disk space, in case you need to pg_dump or otherwise take a snapshot of your currently running database,
- Do backups and test them:
- give them to developers,
- use them for analytics,
- test the restore, make it foolproof, try to automate it,
- basic Postgres hygiene:
- fsync = on (on by default, DON'T TURN IT OFF, there are better ways)
- full_page_writes = on (on by default, don't turn it off)
- deploy minor versions as soon as possible,
- plan upgrade strategies before EOL,
- 9.3+ data checksums (an initdb option, performance cost is minimal),
- application-level consistency checks (don't wait for autovacuum to "discover" consistency errors).
Materialised views now and in the future, Thom Brown
Thom presented one of the new features of Postgres 9.3: materialized views.
In a nutshell, materialized views (MVs) are read-only snapshots of queried data, stored on disk mostly for performance reasons. An interesting feature of materialized views is that they can have indexes, just like regular tables.
The REFRESH MATERIALIZED VIEW command can be used to update an MV: it will simply run the original query again and store the new results.
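A minimal sketch with a made-up table:

```sql
CREATE MATERIALIZED VIEW monthly_sales AS
    SELECT date_trunc('month', ordered_at) AS month, sum(amount) AS total
    FROM orders
    GROUP BY 1;

-- materialized views can be indexed just like regular tables
CREATE INDEX ON monthly_sales (month);

-- re-run the underlying query and replace the stored results
REFRESH MATERIALIZED VIEW monthly_sales;
```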
There are a number of caveats with the current implementation of MVs:
- pg_dump never saves the data, only the query used to build it,
- REFRESH requires an exclusive lock,
- due to implementation details (frozen rows or pages IIRC), MVs may exhibit non-concurrent behavior with other running transactions.
Looking towards 9.4 and beyond, here are some of the upcoming MV features:
- 9.4 adds the CONCURRENTLY keyword (see the sketch after this list):
- + no longer needs an exclusive lock, doesn't block reads
- - requires a unique index
- - may require VACUUM
- roadmap (no guarantees):
- unlogged (disables the WAL),
- incremental refresh,
- lazy automatic refresh,
- planner awareness of MVs (would use MVs as cache/index).
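The 9.4 concurrent refresh mentioned above would look like this, reusing the made-up view from earlier (note the unique-index requirement):

```sql
-- CONCURRENTLY requires a unique index on the materialized view
CREATE UNIQUE INDEX ON monthly_sales (month);
REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_sales;
```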
Indexes: The neglected performance all-rounder, Markus Winand
Markus' goal with this talk was to show that very few people in the SQL world actually know - let alone really care - about indexes. According to his own experience and that of others (even with competing RDBMS), poorly written SQL is still a leading cause of production downtime (he puts the number at around 50% of downtime, though others he quoted put it even higher). SQL queries can indeed put enough stress on DB systems to bring them down.
One major issue, he argues, is poorly designed indexes. He went back in time to explain possible reasons for the lack of knowledge about indexes among both SQL developers and DBAs. One such reason may be that indexes are not part of the SQL standard and are left as implementation-specific details. As a result, many books about SQL barely cover indexes, if at all.
He then took us through a simple quiz he wrote on the topic, with only 5 questions. The questions and explanations were very insightful and I must admit my knowledge of indexes was not up to par. I think everyone in the room got his message loud and clear: indexes are part of the schema, devs should care about them too.
Try out the test: http://use-the-index-luke.com/3-minute-test
PostgreSQL - Community meets Business, Michael Meskes
For the last talk of the day, Michael went back over the history of the Postgres project and its community. Unlike other IT domains such as email, HTTP servers or even operating systems, RDBMS are still largely dominated by proprietary vendors such as Oracle, IBM and Microsoft. He argues that the reasons are not technical: from a developer standpoint, Postgres has all the features of the leading RDBMS (and many more), and the few missing administrative features related to scalability are being addressed.
Instead, he argues decision makers inside companies don't yet fully trust Postgres due to its (perceived) lack of corporate backers.
He went on to suggest ways to overcome those perceptions, for example with an "official" Postgres certification program.
A motivational talk for the Postgres community.