Systems performance studies the performance of computing systems, including all physical components and the full software stack, to help you find performance wins for your application and kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc.), overviews of complex areas including profiling (perf_events) and tracing (ftrace, bcc/BPF, and bpftrace/BPF), advice about what is and isn't important to learn, and case studies showing how it is applied. This talk is aimed at everyone (developers, operations, sysadmins, etc.) in any environment running Linux, bare metal or cloud.
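As a taste of those recipes, one widely shared starting point is the "first 60 seconds" checklist from Brendan Gregg's published Netflix material (run each command in turn for a quick whole-system triage):

    uptime                    # load averages: is load rising or falling?
    dmesg | tail              # recent kernel errors (OOM kills, TCP drops)
    vmstat 1                  # run queue, memory, swap, system-wide CPU
    mpstat -P ALL 1           # per-CPU balance: is a single CPU hot?
    pidstat 1                 # per-process CPU usage over time
    iostat -xz 1              # disk I/O: IOPS, throughput, await, %util
    free -m                   # memory usage, including buffers/cache
    sar -n DEV 1              # network interface throughput
    sar -n TCP,ETCP 1         # TCP connections and retransmits
    top                       # final overview, cross-checking the above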
Q&A
Question: How does tool installation work in containerised environments where runtime images are expected to be lean/distroless?
Answer: There are different ways to deal with the container problem:
One way is to install everything on the host and then debug the containers from there (which I currently do), but that doesn't help end users of containers, since they typically don't have host access.
Another way is to have a "debug container" image with all the tools, which you can spin up to share the same namespaces as the target container, and then give access to that debug container: a sidecar container.
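For example, a minimal sketch of that sidecar approach, assuming Kubernetes ephemeral containers (1.23+) or plain Docker; the pod, container, and image names here are placeholders:

    # Kubernetes: attach an ephemeral debug container to a running pod,
    # sharing the target container's namespaces
    kubectl debug -it mypod --image=nicolaka/netshoot --target=myapp

    # Plain Docker: start a tools image sharing the target container's
    # PID and network namespaces
    docker run -it --pid=container:myapp --net=container:myapp nicolaka/netshoot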
Question: How do you get started with flame graphs, and how do they compare to JVM monitoring tools (like connecting to the JVM and profiling CPU)?
Answer: JVM tools that go via JVMTI typically show Java methods only. They make it easier to see the full un-inlined stack, but miss other code paths, including GC, libraries, and the kernel.
So I prefer perf/BCC-based profiling, which lets me see all CPU consumers and code paths. As for getting started: that depends on the target language. I've posted instructions for targets like Java: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
If it's just a compiled language like C, C++, or golang, then it's easy.
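For reference, the standard perf(1) recipe from the FlameGraph repository looks like this (99 Hz, all CPUs, 30 seconds; adjust to taste):

    git clone https://github.com/brendangregg/FlameGraph
    perf record -F 99 -a -g -- sleep 30    # sample stacks system-wide
    perf script | ./FlameGraph/stackcollapse-perf.pl | \
        ./FlameGraph/flamegraph.pl > cpu.svg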
Things like Java get complex, since the JVM compiles methods into the heap without a standard symbol table (only JVMTI understands it), so you need a way to export the symbol table for regular profilers to see it.
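The page linked above covers this for Java using perf-map-agent; a sketch, assuming perf-map-agent is cloned and built locally and <pid> is your JVM's process ID:

    # run the JVM with frame pointers preserved so perf can walk Java stacks
    java -XX:+PreserveFramePointer ...

    # export the JIT symbol table (writes /tmp/perf-<pid>.map for perf to read)
    ./perf-map-agent/bin/create-java-perf-map.sh <pid>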
IntelliJ added flame graphs. So did YourKit. And there's a way to add it to JMC. If you have a profiler UI today, check if it supports flame graphs: either it does now or it will. It's just a visualization that's easy to add.
Question: What's the biggest curve-ball issue you've ever had to solve? I like war stories.
Answer: The problem is a lot of my issues don't sound complex once I've found the answer. :) One ongoing one is with one of our major microservices: some instances will be slightly slower (10% at most). CPU flame graphs point to a particular Java method, but there's no reason why it's slow on some instances and not others. I've used the hsdis package to dump its assembly and found it gets compiled to a massive method on the slow instances, and a tiny method on the fast ones -- the hotspot compiler is emitting different instructions. But we still don't know why, and why it's slow. That's gotta be my worst, because it's currently unsolved.
I suspect something triggers the hotspot compiler to massively inline methods when normally it doesn't, hence the big size. It might not be a curve ball; it might just be something dumb. Actual curve balls include a system where one PCI expansion bus was running at half speed due to a physical manufacturing error: I noticed storage cards had different max speeds based on which slot they were plugged into, then used PMCs to look at bus-level metrics. Another was a 1% performance issue that I debugged and found was due to the server being in a hotter part of the server room, so the CPUs couldn't turbo boost as much.
You can see a lot with PMCs and MSRs, but tooling for them has been lacking. I've published various shell scripts on GitHub that we use at Netflix (pmc-cloud-tools, msr-cloud-tools), but they may only work on the Netflix instances (since I've never tested them on other processors; e.g., ARM).
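For the curious, those repos are on GitHub; e.g. (tool names as published there; output depends on processor support):

    git clone https://github.com/brendangregg/msr-cloud-tools
    sudo ./msr-cloud-tools/showboost     # current CPU clock vs base: is turbo working?

    git clone https://github.com/brendangregg/pmc-cloud-tools
    sudo ./pmc-cloud-tools/pmcarch       # cycles, instructions, IPC, LLC hit ratio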
Oh, and another curve ball was disks that sometimes had bad write performance, and I debugged it by yelling at the disks. You might have seen the video…
Question: Have you tried running the short optimised code generated on the fast instances on the slower ones? Maybe it won't run? The hotspot compiler might be detecting a different instruction set, assuming there are instructions missing, and generating bigger code that it believes will work.
Answer: I don't know how it'd detect a different instruction set, since it's the same instance type... we get slow instances within the same ASG (Auto Scaling group).
Question: Brendan, flame graphs need some patches in libc to reach their full potential, so is musl/Alpine in the plans?
Answer: Well, we're rolling out our own patched libc. Plus I'm hoping to get Canonical to provide an official libc-fp. And I'm also hoping to convince the gcc developers (GNU tool maintainers) to revert the -fomit-frame-pointer default in gcc for x86-64, fixing this for everything (but they want to see performance numbers to understand the regression).
Question: I understand musl is a completely different beast, right? With the increased usage of Alpine Linux, are there any plans to support it?
Answer: Getting flame graphs (or any profiler) to work with musl depends on how it's compiled. gcc's default is -fomit-frame-pointer, which breaks stacks. I don't know how musl is packaged, but one solution is to get the package maintainers to include -fno-omit-frame-pointer in the build process.
We did that for the Netflix libc build, and we're trying to get Canonical/Debian to adopt it for the standard libc package (since we'd rather not do our own builds of things if it can be avoided -- we'd rather upstream and then consume).
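The flag itself is straightforward; a sketch of what a frame-pointer-friendly build looks like (file names are examples, and build-flag plumbing varies by distro):

    # compile with frame pointers preserved so stack walking works
    gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c

    # or, for a package build, add it to the build flags
    export CFLAGS="-O2 -fno-omit-frame-pointer"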
Question: Is there much difference in how Linux performance tools work vs Unix or are they pretty similar given the heritage?
Answer: vmstat/iostat/mpstat/top/sar are pretty much the same. The big difference is the newer tracing tools, based on BPF. But such differences don't come along often -- so once you learn the basics, it'll probably be the same for the next few decades.
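As a taste of that newer BPF side, here's a canonical bpftrace one-liner that counts syscalls by process name (requires a recent kernel with bpftrace installed):

    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'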
Question: When you train new performance engineers, how do you do it and how long does it take?
Answer: I've developed performance classes before, typically 5-day classes. Sun Microsystems once allowed me to create a 10-day class that brought anyone up to speed with systems performance. It was great. But nowadays training isn't the same as it was decades ago, and it's hard to convince companies and people to find the time. My recent class, a BPF perf analysis class at Netflix, is 4 hours -- and that's the longest class in the catalog.
As for how to do it: it's a mix of theory and practice. The best results come from setting up simulated performance issues and having the students try to solve them, without the answers. Such hands-on work is when you really have to think about things and engage the brain. I have a suite of programs that I install on systems as binaries only, and students run them and try to debug their performance.
Question: Having off-CPU flame graphs would be very useful for a full picture?
Answer: Right. There are a bunch of challenges with it. Off-CPU analysis shows that most threads are sleeping most of the time. Imagine a 100-thread Java application where only 4 threads are doing work: your off-CPU flame graph is now 96% stuff you don't care about. So it's important to zoom in to the threads doing work.
e.g., I'll find a Java method called something like "dorequest" and then filter on that.
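A sketch of that workflow using the bcc offcputime(8) tool plus the FlameGraph scripts; the "dorequest" method name and the 30-second duration are just examples, and the bcc tools path varies by distro:

    # record off-CPU stacks for a process, folded output, counts in microseconds
    /usr/share/bcc/tools/offcputime -df -p $(pgrep -nx java) 30 > out.stacks

    # keep only stacks through the request-handling method, then render
    grep dorequest out.stacks | ./FlameGraph/flamegraph.pl --colors=io \
        --countname=us --title="Off-CPU Time Flame Graph" > offcpu.svg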
Question: There is lots of discussion of using ML to build autonomic infrastructure. What are your thoughts on this?
Answer: I've seen ML tried for performance analysis using the system metrics as input, and I think it assumes the metrics are good and complete to begin with, which they aren't. I'm worried about garbage in / garbage out. I'd first fix/add metrics to the system, including using new tracing sources, so we had a complete set of metrics (complete in the USE method sense: utilization, saturation, and errors for every resource), and then feed that to ML.