Web Performance Calendar

The speed geek's favorite time of year
2017 Edition

Alex Podelko (@apodelko) has specialized in performance since 1997, working as a performance engineer and architect for several companies. Currently he is Consulting Member of Technical Staff at Oracle, responsible for performance testing and optimization of Enterprise Performance Management and Business Intelligence (a.k.a. Hyperion) products.

Alex periodically talks and writes about performance-related topics, advocating tearing down silo walls between different groups of performance professionals. His collection of performance-related links and documents (including his recent articles and presentations) can be found at alexanderpodelko.com. Alex currently serves as a director for the Computer Measurement Group (CMG), an organization of performance and capacity planning professionals.

We live in very exciting times from the performance point of view. The complexity and scale of the problems we are trying to solve are skyrocketing in almost every area (IoT, big data, AI, etc.), each bringing a new level of performance challenges. But we are also getting opportunities and technologies to address these challenges, allowing us to finally implement performance engineering as it was preached.

Shift Left

“Shift left” is a natural consequence of iterative development. With agile/iterative development, we got an opportunity to start performance work early, as we are supposed to get a working system (or at least some of its components) at each iteration. The idea that we need to test early, as the cost of fixing defects skyrockets later in the development lifecycle, may be traced at least to Barry Boehm’s paper “Software Engineering”, published in 1976, and was further developed in his book Software Engineering Economics.

However, although it was common wisdom, not much could be done early, as not much was available to test during waterfall development until the very end. Of course, we could define performance requirements and do architecture analysis (fully developed, for example, in Connie Smith’s Performance Engineering of Software Systems, published in 1990). But, without testing, we had very limited input from the system – until the very last moment, when the cost of fixing defects was very high. And now, finally, we can get performance feedback from the system from the first development iterations – so we may verify the architecture early and find defects as soon as they appear. That is a great thing – and we can indeed “shift left” performance testing and other performance-related activities.

It is interesting that “shift left” is usually associated with automation and continuous integration (CI). I tried to evaluate the current state of performance CI in my CMG imPACt 2017 presentation Continuous Performance Testing: Myths and Realities. Of course, it is a quickly developing area and it will become a reality for more and more projects soon. Automation/CI becomes necessary as we get to multiple iterations and shrinking times to verify performance. However, it is the more technical part of “shift left”. There is an opinion in the [functional] testing community that checking (including “traditional” automated testing) should be separated from testing – see, for example, Testing and Checking Refined by James Bach. While this separation has not been widely accepted, it makes an important point if we project it onto performance: automated tests are basically regression tests, that is, “checking” (at least in most cases for the moment – there are interesting ideas about how it may become more than that, usually utilizing AI).

Another side of “shift left” usually remains unmentioned (and, often, unexploited). Its blessing is the opportunity to test systems early – to find performance issues early. And, when you test a new system early, you don’t have all parts in place yet – so testing should be more flexible / agile. Early performance testing, of course, requires a different mindset and a different set of skills and tools. Automation complements it – offloading routine tasks from performance engineers. But testing early – which brings the most benefit by identifying problems when the cost of fixing them is low – usually does require sophisticated planning, research, and analysis; it is not a routine activity and can’t be easily formalized. The term ‘exploratory’, very popular in some circles in functional testing, fits well here.

The concept of exploratory performance testing is still rather alien. But the notion of exploring is much more important for performance testing than for functional testing. The functionality of systems is usually more or less defined (whether it is well documented is a separate question) and testing boils down to validating that it works properly. In performance testing, you will not have a clue how the system will behave until you try it. Having requirements – which in most cases are rather goals you want your system to meet – does not help you much here, because actual system behavior may not be even close to them. It is a performance engineering process (with tuning, optimization, troubleshooting and fixing multi-user issues) that eventually brings the system to the proper state, rather than just testing.

Shift Right

Another breakthrough happened on the operations side. We got lucky with many great tools providing insights into production environments – such as Real User/End User Monitoring (RUM/EUM) and Application Performance Monitoring/Management (APM). They give us a level of insight that was not available earlier in distributed systems. Plus, DevOps techniques allow us to deploy and remove changes quickly [for Web operations], thus decreasing the potential impact of introduced issues (but not eliminating it – and with the increased scale of operations, risks may still remain very high). Thus, we are getting “shift right” – relying more on operations tools and techniques. See, for example, Shift-right for ‘Performance Engineering’, a potent approach?

If we speak about “shift right” in testing, it may have a somewhat different meaning. It may include doing different kinds of performance testing in production environments – using either synthetic load (for example, traditional load testing, synthetic monitoring) or real load (for example, A/B testing, canary testing).
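
To make the real-load case a bit more concrete, here is a minimal Python sketch of a canary check: compare the response times seen by the small group of users on the new version with those seen by everyone else. The function name, the use of the median, and the 10% slowdown limit are assumptions made for illustration, and real canary analysis would normally look at more metrics and apply proper statistical tests.

```python
from statistics import median


def canary_verdict(baseline_ms, canary_ms, max_slowdown=1.10):
    """Compare median latency of the canary group against the baseline group.

    baseline_ms / canary_ms are lists of response times in milliseconds.
    max_slowdown is a hypothetical limit: 1.10 means the canary may be
    at most 10% slower than the baseline before we roll it back.
    """
    if not baseline_ms or not canary_ms:
        raise ValueError("both groups need at least one measurement")
    ratio = median(canary_ms) / median(baseline_ms)
    return "promote" if ratio <= max_slowdown else "rollback"
```

Note that such a check only says something about the load the canary actually received. It cannot tell you how the new version behaves under a load level that has not happened yet.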

Not that “shift right” is a completely new concept either – it is rather a return to the old. If it does not complement earlier-stage activities, it basically means a reactive approach to performance issues – addressing them in production as they happen (of course, preferably acting on early symptoms rather than a full collapse – when we have such an opportunity). There is nothing new in that approach – it has been the natural approach to performance ever since the first performance issue happened in the history of computing… And it has been developed into a serious methodology at least since 1966, when SMF (System Management Facilities) was introduced as a part of OS/360 – providing a level of performance insight still unavailable on many systems today.

Here we probably should mention another major trend – DevOps. It is a very popular term, but it is used very loosely. In my understanding, DevOps puts together all the other trends, including “shift left” (the Dev side) and “shift right” (the Ops side). It is supposed to drastically improve feedback from production to development and the free flow of performance information both ways – so a holistic approach to performance should be one of its main advantages. Some performance aspects of DevOps are considered, for example, in Performance Testing in a DevOps World by Stijn Schepers. But it doesn’t look like such a holistic approach happens often. Rather, it looks like “DevOps” teams just drop the more sophisticated parts of performance testing/engineering and rely more on a reactive approach to performance issues – concentrating more on quickly fixing issues than on preventing them.

My personal experience was rather centered around load testing. And I am still waiting to see a sophisticated enough system that would work fine under heavy load from the beginning. Usually a lot of issues are found even with rather cursory load testing. I always wonder how some projects appear to get away without load testing. I believe that the answer is in a once popular book Web Operations – where Jesse Robbins wrote in the foreword:

“Our experiences were universal: Our software crashed or couldn’t scale. The databases crashed and data were corrupted … And just when we got things working again, a new feature would be pushed out, traffic would spike, and everything would break all over again.”

Well, I am not saying that load testing is a panacea – but I wonder how many problems could be identified and fixed in advance even with simple load testing.

Performance Silos

I wrote Breaking Performance Silos in 2013 – and the situation has probably started to change, but not always in the way I hoped. Yes, traditional silos started to erode – in part because they don’t fit well into changing environments. But it looks like the “shift left” and “shift right” trends, if understood simplistically, lead to new silos, leaving conceptual performance gaps in between. For example, these silos may be around test automation (SDETs – Software Development Engineers in Test) and web operations (SREs – Site Reliability Engineers) – without much in between.

“Shift left” often ends up in “automation”, where (in cases where people pay attention) performance is one of the variables to check. The associated issues are well described, for example, in Scott Moore’s post Do You Need A SDET Or A SEAL? [Although I completely agree with the description of the problem, and that an end-to-end holistic approach is needed in performance, I do not completely agree with the proposed term / solution. I think we already have a decent term for the person who should do that – in places where people try to implement some kind of holistic approach to performance, such a position is usually named “performance architect”. I do not care much about the title (it is probably rather a question of personal taste) – but I see an important underlying issue here. If you name that person a software engineer, the assumption is that after he finds a problem, he will fix it in the code himself. While that may be the case in some situations (for example, in smaller projects), it is this notion that looks concerning to me. Even programmers tend to specialize in certain areas of the system (usually defined by functionality or by technology) – why would we expect a performance person, who has quite a lot of other things on his plate, to be able to fix issues equally easily across all parts of the system? I believe that the role of that performance person is rather to identify the root cause and work with developers to help them fix it.]

“Shift right” on top of the DevOps trend often results in assigning performance and capacity responsibilities to Site Reliability Engineers (SREs). As stated in the What is ‘Site Reliability Engineering’? interview with Ben Treynor, VP of Engineering, Google:

“In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.”

SRE, basically, is a new reincarnation of the system administrator role.

While these practices bring a lot of benefits, there is an inherent danger if performance is left only to the automation / administration silos. Neither group has an explicit focus on performance. For the old performance and capacity silos, performance was at least at the center of their activity – so we often saw attempts to introduce some kind of holistic approach (as far as was possible in the environments that existed). Scott Barber, for example, defined the performance lifecycle as “Conception to Headstone” (for example, in Performance Testing: Throughout the Application Life-Cycle). For both SDETs and SREs, performance – while listed – isn’t the main priority. Well, of course, until you get to a major performance issue – but you can hardly expect them to initiate a holistic approach to performance from “Conception to Headstone”.

The main problem with using one approach (or a few) to mitigate performance risk is that different approaches mitigate performance risks in their own ways and complement each other rather than being options to choose from. They overlap, so specific implementation details may vary – but each one helps to mitigate performance risks and, in most cases, the only reason not to use a specific approach is a high cost in comparison to possible losses (which, of course, should include all aspects such as brand damage and opportunity costs – which are rather difficult to calculate). No single approach guarantees you good performance and reliability. Moreover, even utilizing all of them does not guarantee you that, but it allows you to mitigate performance risks as much as possible.

Holistic Approach

To better mitigate performance risks, a holistic approach should be created from existing techniques and methods. It doesn’t mean that all of them should be included just to put a checkmark – it means that for each environment the most effective combination (from both the results and the cost points of view) should be identified and implemented across all areas and cycles.

Just to see what should be included, here is a cursory list of the most important parts of performance engineering (if we assume that this is the umbrella term for all activities to ensure proper performance and scalability of systems). While it is not absolutely necessary to include all of them, there should be rather important reasons for leaving any of them out completely, as doing so will open gaps in the performance strategy and create unnecessary performance risks.

  • Software Performance Engineering (SPE) or Performance Design. Everything that helps in selecting an appropriate architecture and design and proving that it will scale according to our needs. Starting with performance requirements (and overall non-functional requirements) and including, for example, performance patterns and anti-patterns, scalable architectures, and modeling. Algorithm efficiency and complexity, recently getting more attention again, is an important part here too – amongst many other, less easy to formalize, concepts. Proper instrumentation of systems to ensure their observability is another important part. The concept got some attention again recently – see, for example, Monitoring Isn’t Observability by Baron Schwartz.

  • Single-User Performance Engineering. Everything that helps to ensure that single-user response times match expectations. Including profiling, tracking and optimization of single-user performance, and Web Performance Optimization (WPO). Measuring performance during functional tests and CI is a good way to track single-user performance and catch regressions. While these methods may prevent a lot of performance issues – they don’t cover multi-user and scalability aspects at all.

  • Monitoring/APM/Log Analysis. Everything that provides insights into what is going on inside the working system and tracks down performance issues and trends. While APM and RUM (Real-User Monitoring) are real game-changers here, providing tons of detailed information, it still remains a mostly reactive approach. It helps to understand why issues happened – but only after they have happened (well, of course, it may help you to catch a trend early too – thus becoming somewhat pro-active). It is interesting that synthetic monitoring, in spite of its name, is single-user performance testing in production rather than monitoring – so it rather belongs in another category.

  • Load Testing. Everything used for testing systems under any multi-user load (including all other variations of multi-user testing, such as performance, concurrency, stress, endurance, longevity, scalability, reliability, and similar). To see what specific risks may be addressed only by synthetic load testing see, for example, my CMG imPACt 2016 paper Reinventing Performance Testing.

    We may separate traditional load testing, using synthetic workload, from testing using real workload (sometimes referred to as end-user load testing, canary testing, A/B testing, chaos testing/engineering, etc.). Using real load is still testing and does mitigate some risks – but not all of them. See some considerations, for example, in my virtual interview with Coburn Watson about these practices at Netflix.

  • Capacity Planning/Management. Everything that ensures that we will have enough resources for the system, including both people-driven approaches and automatic self-management such as auto-scaling. It somewhat differs for new systems (rather a part of performance design) and for systems in production (where it becomes a pro-active approach based on monitoring results and business input).

  • System Lifecycle. New system lifecycle trends help a lot too. Continuous Integration / Delivery / Deployment allows quick deployment and removal of changes, thus decreasing the impact of performance issues. Agile / iterative development helps to find issues early, thus decreasing the costs of fixing them.
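
As a rough illustration of the “measuring performance during functional tests and CI” idea from the Single-User Performance Engineering item above, here is a minimal Python sketch of a regression check that could run next to functional tests. The function names, the choice of the 95th percentile, and the 20% tolerance are assumptions made for illustration only. They are not taken from any specific tool.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(func, repetitions=20):
    """Call func repeatedly and return the elapsed seconds of each call."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def is_regression(timings, baseline_p95, tolerance=0.20):
    """True if the measured p95 exceeds the stored baseline by more than
    the (hypothetical) tolerance, here 20% by default."""
    return percentile(timings, 95) > baseline_p95 * (1.0 + tolerance)
```

A CI job could call time_calls on a key operation, compare the result with a baseline saved from an earlier build, and fail the build when is_regression returns True. This is “checking” in the sense discussed earlier: it catches single-user regressions, but says nothing about multi-user behavior or scalability.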

Of course, all of the above does not exist in a vacuum, but on top of high-priority functional requirements and resource constraints (including time, money, skills, etc.).

A great list of performance engineering trends / predictions, somewhat extending the list above, is given in Trades of a Performance Engineer in 2020! by Andreas Grabner. The post stresses changes, but if we put all of them together with performance basics (which are still here – you cannot do advanced stuff if you don’t understand and implement the basics) we also see an outline for a holistic approach.

On top of all that, it should be a process of continuous performance improvement – efforts not only to make sure that we avoid issues / keep them under control, but to constantly improve performance. Not only does adding functionality / improving interfaces / making the system more convenient usually have a negative impact on performance – overall performance standards (performance expectations) also increase with time. There has always been talk about the necessity of continuous performance improvement (sometimes in connection with TQM, Six Sigma, ITIL or other popular theories) – but not many specific examples have been published. Etsy was at some point a company sharing its culture of continuous performance improvement (although it appears that its last open site performance report is for Q1 2016 – and they mostly stressed single-user and reactive approaches).

By the way, references to history in this post are given not only to hint that it is not always necessary to re-invent the wheel – indeed, many things have changed drastically, and there is a lot of space for invention too. Rather, they are given to get a larger view of performance and the ways to handle it. I did an Ignite session at Velocity, A Short History of Performance Engineering, and want to repeat the punchline here: the feeling that we are close to solving performance problems has existed for the last 50+ years, and it will probably stay with us for a while. So do not think that the latest fashion will be your silver bullet for performance issues – better start to implement a holistic approach to cover as many bases as possible.