Please log in to watch this conference skillscast.
We trust databases to store our data, but should we? Jepsen combines generative testing techniques with fault injection to verify the safety of distributed databases. We'll learn the basics of distributed systems testing, and show how those techniques found consistency errors in MongoDB, PostgreSQL, and Redis-Raft. Finally, we'll conclude with advice for testing your own systems.
Q&A
Question: What do you think of implementing Read Committed using snapshots instead of pessimistic concurrency locking?
Answer: So Postgres does this internally, and I think it's a natural fit--once you choose snapshot isolation it's relatively straightforward. I think it especially makes sense in a distributed systems context because locks generally require some sort of communication. Surprisingly, you can actually build RC in a totally coordination-free way. All you need to do is... not expose uncommitted values to clients. There's no requirement that you show them timely or even isolated values! See Peter Bailis' VLDB paper for more.
Question: Is it possible to plug tools like Elle or Jepsen into an existing cloud infrastructure and discover the inherent flaws in a system?
Answer: Sort of. Elle is intentionally very general: it needs "a concurrent history" which you can record any way you like.Jepsen records those histories. You could point Jepsen at a cloud service like S3 or Dynamo, and that would work just fine! But you wouldn't get fault injection.So you can only measure the happy case, unless you can find a way to either 1.) trigger faults, or 2.) wait long enough to see faults in the wild. (I've been thinking about doing this for S3 myself, actually)
Question: I've always been fascinated by these tools, but they are a bit daunting to use, and infinitely more difficult to go DIY/NIH and would it be easier to inject those faults yourself?
Answer: I'm not entirely sure. Like.... how do you figure out which nodes in S3 even store your data? How do you get into Amazon's network to mess with them? Generally, we don't have any access into cloud services, so fault injection is probably off the table. There are some cases where AWS exposes some APIs for triggering failovers, like in RDS, right? That might be a place to look at a fault. But you're kind of at the mercy of Amazon's designers there--having to trust that the fault is actually meaningful and maps to something... like an accident.
Question: That assumes AWS as the provider. Azure and GCP are two completely different beasts, as well
Answer: Quite right! You'd have to ask each one for permission to mess with their internals.
Question: Not surprised by these issues with NoSQL - Kyle do you have a comparison between Mongo and something like Dynamo?
Answer: I mean, if there's something I'd take away here, it's that SQL is, if anything, a harder problem! I've done several SQL databases and found safety violations in every one. In particular, SQL allows a very expressive range of operations with complex interactions, all (supposedly) at strong safety levels. Predicate reads, for example! Object stores are, I think, easier to implement. Mongo and Dynamo (the paper) are pretty much polar opposites: Mongo requires a majority quorum, Dynamo is totally available. Mongo aims for SI/linearizable, Dynamo is eventually consistent with vclocks. DynamoDB, though, is a whole other beast, right? I actually don't have a lot of insight into its behavior or design.
Question: Thanks Kyle - yeah I've always seen the push for noSQL usages in general to not worry as much about the data and possible data loss but other issues like crossing read flows sound particularly concerning if they'd happen in any database
Answer: Yeah. I mean... I think you had no choice in some early NoSQL systems but to give up things like serializability, and that's still very true in things like Cassandra today. But there's no reason why object stores can't be transactional, and we're seeing folks like Cockroach, Yugabyte, Mongo work on those problems today. One thing that bugs me is that the traditional SQL databases I love, like Postgres, have... AFAICT, no safe replication story. I'm hoping that gets addressed someday.
Question: You mention above you don’t have much insight into Dynamo. Do you test opaque cloud services like it at all? E.g. do you have more insight into Spanner or CosmosDb?
Answer: Not really, no! I could absolutely test them from the outside, but most of the interesting behavior in Jepsen comes when we do fault injection: partitions, crashes, etc. It's certainly possible that we'd find out that, I dunno, Spanner is actually violating linearizability on a regular basis. Might be worth trying!
Question: Any thoughts on Raft vs Paxos? Would you say one has inherently fewer issues?
Answer: Good question! So like... there's no question that Raft has been revolutionary, right? Diego did the industry a huge service: we've seen an explosion in systems built on Raft instead of winging their own consensus algorithm. That said, Raft isn't necessarily the end-all-be-all of consensus systems. It forces a total order over decisions, for instance, which is great for writing a simple state machine, but not great when those decisions are logically independent. For those, some kind of Paxos might actually be preferable.
What we see in practice is folks running hundreds of Raft instances, and then building a transaction protocol between them--that addresses the ordering bottleneck by sharding the order into many Raft instances, and writing transactions on top of linearizable/sequential components is a lot easier than doing them from first principles.
Another issue with Raft is that the reliance on a stable leader means paying an extra round-trip if you're not on the leader itself. If the leader is in Chicago and you're in Beirut, that might add up! But you can imagine that a sliiiight tweak to Paxos would allow you to get away with only one round trip, rather than the 2 that Raft (or Paxos with leaders) requires.
Heidi Howard clued me into this, and you should definitely look at her papers/talks--she knows way more than I ever will about Paxos, haha.
Question: “What we see in practice is folks running hundreds of Raft instances” - Would you be talking about Consul?
Answer: (Ah, yes, to be clear, these are generally hidden from the end user) Oh, um... if you're running hundreds of Consuls I also have questions, haha. I was thinking specifically about systems like YugaByte, Cockroach, and MongoDB, which run a Raft (or raft-alike) per shard, plus a txn protocol between shards. Users don't see the separate Raft instances--they're managed by the database.
Question: Do you think one day when every device has an atomic clock in it some of those issues could go away?
Answer: Atomic clocks help! But they aren't the end-all-be-all of consensus. Spanner still does Paxos! Truetime just lets them skip going to a timestamp oracle for a txn timestamp.
The Percolator paper might be good reading--you can kinda see how Spanner may have evolved from it.
One other thing that might be helpful is a talk I gave on distributed txn architectures: https://www.youtube.com/watch?v=wzYYF3-iSo - I am pretty sure I got the spanner part of this WRONG in some way, so take that with a grain of salt. It's based on reading the paper and DMing friends a bunch and playing telephone with members of the spanner team so like... ???? But it might be wrong in a somewhat helpful way, is what I'm saying, haha
Question: Have you run into any limitations of Clojure's transactions or other Clojure systems when writing Elle?
Answer: I don't use the Clojure STM (in-memory transactions)... pretty much ever. I do use atoms, promises, and futures heavily, plus some of j.u.concurrent's primitives. Those work really well! Elle is surprisingly fast: it'll do 100K transactions in a minute. If I were to rewrite it, though, I'd be looking to reduce allocation pressure on the GC, tighter structs, etc. I'm sure it could be more efficient, but there's a complexity tradeoff there. The stuff Elle does around provenance (WHY can I prove that these txns are a cycle) relies heavily on persistent data structures and some heavy libraries--they've been well-optimized, but maybe we could do better by going mutable. (I have ALL kinds of feelings about Elle's design BTW, happy to chat about that all day)
I didn't mention this in the talk, but if you're wondering about using Elle to test your own systems, take a look at the repo and paper:
<a href="https://github.com/jepsen-io/elle" target="blank">https://github.com/jepsen-io/elle
https://github.com/jepsen-io/elle/blob/master/paper/elle.pdf
And here's the reports on Mongo ( https://jepsen.io/analyses/mongodb-4.2.6),
Postgres ( https://jepsen.io/analyses/postgresql-12.3),
and Redis-Raft (https://jepsen.io/analyses/redis-raft-1b3fbf6)
YOU MAY ALSO LIKE:
Jepsen 13
Kyle Kingsbury
Kyle Kingsbury, a.k.a "Aphyr", is a computer safety researcher working as an independent consultant. He is the author of the Riemann monitoring system, the Clojure from the Ground Up introduction to programming, and the Jepsen series on distributed systems correctness. He grills databases in the American Midwest.