
Medium Data: things to try before abandoning SQL

Disclaimer: the following blog post is deeply imperfect. It’s not totally comprehensive, and it doesn’t justify every claim I make with quantitative data. It’s full of anecdotal evidence and subjectivity. But I’ve concluded that by the time I found the time to do all the research and write the footnotes needed to bullet-proof it, we’d be more worried about the sun going out than our data storage and manipulation technology. I hope it contains a few data points that you find useful, despite its flaws.

First, my thesis: a lot of less-experienced developers are using big data[1] and NoSQL technologies because they are new and cool, and because SQL is old and hard. A lot of these people would save themselves time and effort by learning more about SQL and tuning their databases and hardware just a little bit. Rather than just gripe about it, I’m going to suggest a grab-bag of basic techniques of varying complexity that can help a lot (especially if you are using MySQL), and later I’m going to follow up with another post about more complex strategies for more experienced devs.

My task has been made easier by some recent posts by other people that make most of my argument for me. Namely:

  1. Pick your battles by Zef Hemel

    The money quote is this:

    go and build amazing applications. Build them with the most boring technology you can find. The stuff that has been in use for years and years. Where every edge case has been covered. Where every library you will ever need has been in production for years. Where every part of the release cycle has been ironed out. Where the best practices on how to do testing are known.

    In essence: trying to solve a problem by using a technology stack you’ve never tried before gives you, at minimum, two problems. Big data tool chains are very different from more general purpose data-stores, and newer data stores are less reliable at scale, simply because fewer of their edge-cases have been met and managed.

  2. SQL is agile by Armin Ronacher

    Armin’s post has fewer pithy quotes, but the key point is that early in the life of your project, when your data is small[1] but your ideas and use-cases can change dramatically, the structure and flexibility of an RDBMS trumps the single-case-optimized performance of a document store.

  3. MySQL: choose something else by Owen Jacobson

    Opinionated and inflammatory as this post is, what I took away from it is: don’t judge all RDBMS by the failures of MySQL. MySQL is not the best database in the world, it’s just the easiest[2]. It’s getting better all the time, but Postgres is really good right now[3]. Check to make sure the problem you have with “SQL” is really just a problem with MySQL. Further disclaimer: I still mostly use MySQL myself, out of familiarity and habit, and all of my tips are from that perspective.

So if you’re not on board with the premise that new technologies require substantial up-front investment that will delay your actually building anything useful, that old and tested technologies can be great, and that SQL in particular is cruelly maligned, take it up with those guys. What I’m going to be sharing are some strategies for making MySQL work better in a lot of common use-cases.

Don’t use vanilla MySQL

A while back, Sun Microsystems bought MySQL AB, the MySQL company. Then Oracle, maker of a major competitor to MySQL, bought Sun Microsystems. Conspiracy theories abound, but I don’t believe they really bought the whole company just to kill MySQL.[4] Nevertheless, Oracle suffers from a clear conflict of interest as the steward of MySQL, and the company that has, in my opinion, taken up the torch as the true home of MySQL is Percona.

Percona Server is MySQL, just fully patched and cared for, and you can install it with apt-get just like Oracle’s version. I think switching to Percona MySQL has the highest ratio of performance improvement to effort of any of these strategies. If you try nothing else, try using Percona.

Use a great in-memory database: MySQL

If your data is small enough that it can fit in memory, then the sadly under-appreciated innodb_buffer_pool_size setting can be trivially changed to make MySQL behave like an in-memory database, much as MongoDB does. Don’t be put off by all the caveats in the documentation: if you have less than 50GB of data, buy a server with 64GB of RAM, set innodb_buffer_pool_size to 50GB, and watch it fly[5]. Of course, if your data all fits in memory, almost any database will fly, because you don’t have big data. You don’t even have medium data. You have no problems at all: memory is 100,000 times faster than disk access, so you can do really, really inefficient things without issue.
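
For example, a minimal my.cnf sketch (the 50GB figure matches the scenario above; tune it to your own hardware, and leave headroom for the OS and MySQL’s other buffers):

    # /etc/mysql/my.cnf — illustrative values, not a recommendation
    [mysqld]
    innodb_buffer_pool_size = 50G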

Cache it if you can

Caching is a solution so obvious I hesitate to include it, and yet also such a big, varied solution that I could never cover it in one post. At its most basic: if some subset of your data is more frequently accessed than the rest, for instance because it’s popular, or because it’s recent, and if that data changes less than about once per second, you can put that data in an in-memory cache and speed up reads by an order of magnitude or more (remember: 100,000 times faster).

For a huge variety of applications, caching can be enough to take you a long way: for example, if you have a list of your top articles, or recent posts, and you are serving 100 requests per second, implementing a cache that lasts only 1 second will reduce the number of reads to your database from 100 to 1, i.e. two orders of magnitude (but beware of thundering herds when that cache expires).

You will also need to solve the problem of cache invalidation, i.e. knowing when to use the cache and when not to. That’s a case with endless nuance, one of the two hard problems of computer science, but frequently solvable for simpler use-cases simply by automatically expiring cache entries after a certain amount of time (a feature built into memcached, the most popular in-memory cache).
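
To make that concrete, here is a minimal cache-aside sketch in Python using the python-memcached client (the one-second TTL matches the example above; fetch_top_articles_from_db is a hypothetical stand-in for your actual query):

    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])

    def get_top_articles():
        # Try the cache first; a miss returns None.
        articles = mc.get('top_articles')
        if articles is None:
            # Cache miss: do the expensive database read once...
            articles = fetch_top_articles_from_db()  # hypothetical query
            # ...and cache it with a 1-second expiry. At 100 requests per
            # second, ~100 database reads become 1.
            mc.set('top_articles', articles, time=1)
        return articles

Note that when the entry expires, every in-flight request misses at once: that is the thundering herd mentioned above.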

Replicate to handle more reads[6]

If your data-access pattern is not (or not entirely) amenable to caching, then a complementary solution for scaling reads may be replication. RDBMS like MySQL and Postgres have replication built-in, and while the details vary, the central idea is simple: instead of having one database that all your reads come from, have many “slaves”, each a continuously-updated copy of the primary, “master” database.

Replication works well when your read-queries are competing with each other for resources on your master database, because spreading them across slaves gives each query more memory, disk time and processing power. It can also help if you simply have too many incoming connections to your database[7].

Replication’s major complication is “replication delay”, i.e. the time it takes for a write on the master to turn up on one of the slaves. This means you can read an out-of-date value from a slave while the new value is en route from the master — particularly a problem if your app writes a record to the master and then immediately reads it back for confirmation, and that read goes to a slave.

Replication is also frequently used as a backup/high-availability solution. As the ill-fated Ma.gnolia famously learned, a replicated database is in no way a backup (though it can help in creating them, by making snapshots and dumps more convenient). However, it can help to some degree with availability, especially if your application can tolerate a period of read-only activity while the master is upgraded/repaired or a slave is promoted.
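
To illustrate the read/write split and the replication-delay caveat, here is a minimal routing sketch in Python (host names and credentials are hypothetical; MySQLdb is just one of several interchangeable DB-API drivers):

    import random
    import MySQLdb

    master = MySQLdb.connect(host='db-master', user='app', passwd='secret', db='app')
    slaves = [MySQLdb.connect(host=h, user='app', passwd='secret', db='app')
              for h in ('db-slave-1', 'db-slave-2')]

    def connection_for(sql, just_wrote=False):
        # Writes always go to the master. Reads that must see a row we just
        # wrote also go to the master, to sidestep replication delay;
        # everything else is spread across the slaves.
        if just_wrote or not sql.lstrip().upper().startswith('SELECT'):
            return master
        return random.choice(slaves)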

Use indexes, but carefully

This is another one that falls into the “duh” category for the experienced, but indexing is very poorly understood by new developers. In essence, an index is a subset of your data, held in memory like a cache, but ordered in such a way as to make a particular type of lookup quicker, and, unlike a cache, guaranteed to always be accurate. That sounds basic but can be incredibly powerful:

  • If your data is on a disk, locating the record via an in-memory index means that instead of hitting the disk dozens of times, your database needs to hit the disk exactly once: to retrieve the record itself. Once again: memory is 100,000 times faster than disk for this kind of access. Cutting out even a single disk read helps enormously.
  • If your query needs only a subset of your data, MySQL can use a “covering index”: for instance, if you just need to look up a username and password, put those two columns together in one index, and queries that need only those two values will be answered directly from that index in memory, without hitting disk at all (see the sketch after this list).
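
A minimal illustration of the covering-index case (table and column names are hypothetical):

    -- Composite index on exactly the columns the query needs.
    CREATE INDEX idx_login ON users (username, password);

    -- This query is satisfied entirely from the index: EXPLAIN reports
    -- "Using index" in the Extra column, and no table row is read at all.
    SELECT username, password
    FROM users
    WHERE username = 'seldo';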

The most common pitfall of indexing is to index every column individually. It is seldom the case that you do equally frequent lookups on all your database columns, and each index comes with a cost — it has to be held in memory. If you index too aggressively, some of your indexes will be written to disk instead of held in memory, greatly slowing access. Also remember that every time a column is changed, the related index must be updated, so indexing a frequently-updated value that is seldom used as a primary lookup (such as a user’s score, or last login date) will leave your database doing a lot of extra work for no reason.

The simplest first-pass rule for indexing is: find the columns you use in your WHERE clause, and put them together into a single index. But with the right indexing strategy you can go a lot further: Keith Murphy of MySQL magazine has a pretty good slide deck on index tuning and query optimization.
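
For example (again with hypothetical names):

    -- For a query like this:
    SELECT id, title
    FROM articles
    WHERE author_id = 42 AND published = 1;

    -- ...index the WHERE columns together, rather than one index per column.
    -- MySQL can also use the leftmost prefix of this index (author_id alone).
    CREATE INDEX idx_author_published ON articles (author_id, published);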

Shard your data, but only after you’ve tried everything else

The principle of data sharding is simple: instead of having one big table, split it into many tables across multiple boxes. This allows you to store more data than can fit on a single box, and can also speed up queries by limiting the size the index needs to be on any given box.
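
The routing principle, at least, fits in a few lines; a minimal Python sketch (the host list and the choice of user_id as the shard key are hypothetical):

    # Route each user's rows to one of N shard databases by hashing the key.
    SHARDS = ['db-shard-0', 'db-shard-1', 'db-shard-2', 'db-shard-3']

    def shard_for(user_id):
        # Modulo hashing is the simplest scheme; beware that changing the
        # number of shards later means re-homing most of your data.
        return SHARDS[user_id % len(SHARDS)]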

Sharding is a powerful, potentially very useful design pattern. It’s also difficult to get right, with lots of potential pitfalls. Unlike lots of the other patterns mentioned in this post, it’s also been quite extensively discussed and explained recently — almost everybody I’ve interviewed in the last three years has mentioned sharding to solve database problems, even people who hadn’t heard of caching or replication. So I won’t be covering it in this post. Eric Ries wrote a pretty good practical introduction to sharding over at Startup Lessons Learned; they also rely on sharding at Instagram.

Shard your queries

The name of this pattern is not, as far as I’m aware, an industry term — it’s just what we call it internally at awe.sm. In data sharding, you split your data across multiple boxes, and run all types of queries on these subsets of your data. In query sharding, you instead copy all of your data (for instance, via replication) to multiple boxes, and run subsets of your queries on each box.

This allows you to optimize each box for a given type of query: for instance, one could be optimized for quick key-value lookups (large numbers of connections, a small amount of memory for each one) and another for large aggregations (few connections, lots of memory). It also allows for more effective indexing: put exactly one index on each shard, tailored to the requirements of that query. Each box will be able to fit more, or even all, of its single index in memory, and it will also spend less time updating unused indexes.[8]
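
A minimal sketch of the idea in Python (the host names and query categories are ours, not an industry standard):

    # Every box holds a full copy of the data (e.g. via replication);
    # each box is tuned and indexed for one category of query.
    QUERY_SHARDS = {
        'kv_lookup':   'db-kv.internal',      # many connections, small per-connection buffers
        'aggregation': 'db-rollup.internal',  # few connections, big sort/join buffers
    }

    def host_for(query_type):
        return QUERY_SHARDS[query_type]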

Coming up later: more!

Because this post was already hella long, I decided to split the next batch of patterns into a second blog post, to follow a week or two after this one. More experienced users should find that post of more interest.


[1] In an early version of this post, I spent a long time defining big data. Suffice to say that it’s a tricky thing to nail down, but a good rule of thumb is that if you have less than 100GB of data, you definitely don’t have big data, and if you have less than a terabyte of actively-used data (not just logs) you probably still don’t. These claims are horribly unsubstantiated, but spending a lot of time on them distracts from my point.

[2] You read my disclaimer about subjective calls, right? But what I mean is: MySQL is a classic disruptive technology. When it first became popular 15 years ago, it was because it was cheaper than all commercial solutions, and a hell of a lot easier to get running than Postgres was (especially on Windows, where a lot of junior developers like myself got started, and in multi-tenant environments like shared hosting). In the subsequent decade and a half, hundreds of competitors have arisen and Postgres has done much to close the newbie-friendliness gap, but MySQL has remained “good enough” to stay the most popular open-source database.

[3] And if you can afford Oracle, then by all means do. Evil they may be, expensive as all hell they certainly are, but Oracle does everything you’ve ever dreamed of in a relational datastore and is unbelievably performant. If you’ve got the budget, it’s a magic wand of awesome.

[4] Apart from being a very expensive way to force startups using MySQL to upgrade to Oracle (which they can’t afford anyway), MySQL is open source, so buying the parent company doesn’t give you that kind of control. It seems Oracle wanted Java, and Sun’s hardware business, and MySQL just came along as a freebie, but that’s pure speculation, so let’s please not have a bunch of comments about it.

[5] Percona’s performance blog has a really useful post on tuning innodb_buffer_pool_size.

[6] I would have listed replication as “too obvious” as well, but after three years of interviewing junior developers, I’ve concluded that replication is neither widely known nor well-understood; it apparently falls into the “too scary/too hard” feature category.

[7] If too many connections is your issue and you’re using PHP, you may get further faster by looking into using persistent connections.

[8] This is one of those times when I wish I’d had time to gather some quantitative data on the effectiveness of query sharding. Instead, all I can say is that it works for us.

