When doing performance analyzes you often would want to see 95 percentile, 99 percentile and similar values. The “average” is the evil of performance optimization and often as helpful as “average patient temperature in the hospital”.

Lets set you have 10000 page views or queries and have average response time of 1 second. What does it mean ? Really nothing – may be one page view was 10000 seconds and the rest was in low milliseconds or may be you had every single page view taking 1 second, which are completely different.

You also do not really care about average performance – the goal of good user experience is majority of users to have good experience and average is not a good fit here. Defining your response time goal in 95 or 99 percentile is much better. Say you say 99 percentile response time should be one second, this means only 1 percent of queries/page views are allowed to take more than that. For larger systems defining (increasing) response times for 99.9 or even 99.99 percentile numbers often make sense.

It also often makes sense to define response time goals separately for different transactions – the AJAX widget response time requirements may be very different from the slow search page.

So you have defined your response time in terms of 95/99 percentile and get your logs in the table, so how to get the data if MySQL only provides you the avg:

The average response time here is for example; the real data what we need is number of rows which matches for given query type.

Dividing the count by 100 we get our 1% of values and dividing by 20 5% of values, now we can get the response time we concerned about simply by running following order-by queries:

So for this system the 95 percentile is just over 5 sec (some 3 times more than the average) and 99% percentile is just a bit over 10 seconds (6 times more than average). The both numbers are horrible and system surely needs to be fixed.

These numbers are to illustrate – the percentile numbers can be quite different from average numbers (it is not rare to see 99 percentile to be order of magnitude different from the average) and this is what you really need to focus on.

Looking at the numbers from the business standpoint try to understand what these really are. In some cases I see rather bad percentile on the backend which are not really the problem for the business because there is a cache up front anyway. If 99% of requests are coming from the cache and you observe certain 99 percentile response time on the backend it is often 99.99 percentile response time which is a lot different – you often can afford 1/10000 requests to stall for few seconds, because things outside of your control (like packet loss at client side) would be responsible for larger amount of delays.

Be careful though – the “random” delays, for example if system was busy and delayed servicing request is one thing, “systematic” delays, when response time is always bad in given conditions can be much worse problems. You do not want your best client to suffer for example, even if he is the only one.

6 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Dale Lancaster

Thanks for the article. For me having a histogram of access/load times would be more meaningful. I am not sure what a 99 or 95 percentile ranking would mean to me. If anything it would lead me to generate more statistics and the one I would found most useful is a histogram of the counts of the load times. This would require normalizing the data a bit such that only the first couple decimal places of precision are used, otherwise you get a lot of noise.

To generate a histogram with SQL I usually just use the group by of a count on the load times and you are good to go. So in your example, if you did the group by with order by on the load time, I would be most interested in how often the really large load times occur.

Thanks.

Baron Schwartz

Histograms are nice, I agree. We’ve built histograms into Maatkit’s new log analysis tool, mk-log-parser. We are also going to add an indication of time-clustering, e.g. “this query always runs at 4AM on Tuesday”.

I might write up a post on histograms — with LOG() it is actually not too hard to do, though it’s not terribly efficient (filesort/temp table).

Mikhail Solovyev

Peter,

Is there any simple way to get 95% in one query?For example there is a SELECT TOP N PERCENT in MSSQL, is there going to be something similar in MySQL?

Mikhail Solovyev

Peter,

Is there any simple way to get 95% in one query?For example there is a SELECT TOP N PERCENT in MSSQL, is there going to be something similar in MySQL?

Pendyala

Is There any way to calculate 90%,95%,99% percentile in 1 million rows of IIS logs with response times (time-taken),calcaulating 90%,95%,99% percentile functions