When doing performance analysis you often want to look at the 95th percentile, 99th percentile, and similar values. The average is the evil of performance optimization, often about as helpful as the “average patient temperature in the hospital”.
Let’s say you have 10,000 page views or queries with an average response time of 1 second. What does that mean? Really nothing – maybe one page view took 10,000 seconds and the rest completed in low milliseconds, or maybe every single page view took exactly 1 second. These are completely different situations.
You also do not really care about average performance – the goal of good user experience is for the majority of users to have a good experience, and the average is not a good measure of that. Defining your response time goal at the 95th or 99th percentile is much better. If you say the 99th percentile response time should be one second, this means only 1 percent of queries/page views are allowed to take longer than that. For larger systems, defining response time goals at the 99.9th or even 99.99th percentile often makes sense.
It also often makes sense to define response time goals separately for different transactions – the response time requirements for an AJAX widget may be very different from those for a slow search page.
So you have defined your response time goals in terms of the 95th/99th percentile and you have your logs in a table. How do you get this data when MySQL only provides AVG()?
mysql> select count(*),avg(wtime) from performance_log_081128 where page_type='search';
+----------+-----------------+
| count(*) | avg(wtime)      |
+----------+-----------------+
|   106859 | 1.4469140766532 |
+----------+-----------------+
1 row in set (2.08 sec)
The average response time here is just for illustration; what we really need from this query is the number of rows matching the given page type.
Dividing the count by 100 gives us the top 1% of values, and dividing by 20 gives the top 5% (106859/100 ≈ 1068 and 106859/20 ≈ 5342). Now we can get the response times we care about simply by running the following ORDER BY queries:
mysql> select wtime from performance_log_081128 where page_type='search' order by wtime desc limit 1068,1;
+---------+
| wtime   |
+---------+
| 10.1007 |
+---------+
1 row in set (2.06 sec)

mysql> select wtime from performance_log_081128 where page_type='search' order by wtime desc limit 5342,1;
+---------+
| wtime   |
+---------+
| 5.09297 |
+---------+
1 row in set (2.06 sec)
So for this system the 95th percentile is just over 5 seconds (about 3.5 times the average) and the 99th percentile is just over 10 seconds (about 7 times the average). Both numbers are horrible, and the system surely needs to be fixed.
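If you do not want to compute the offset by hand each time, you can have MySQL do it. LIMIT does not accept expressions directly, so this sketch (against the same table as above) passes a computed offset into a prepared statement:

-- compute the offset for the 99th percentile (top 1% of values)
set @offset = (select count(*) div 100
               from performance_log_081128 where page_type='search');
prepare pctl from
  'select wtime from performance_log_081128
   where page_type=''search''
   order by wtime desc limit ?, 1';
execute pctl using @offset;
deallocate prepare pctl;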
These numbers illustrate the point – percentile values can be quite different from the average (it is not rare to see the 99th percentile an order of magnitude above the average), and they are what you really need to focus on.
Looking at the numbers from the business standpoint, try to understand what they really represent. In some cases I see rather bad percentiles on the backend which are not really a problem for the business because there is a cache in front anyway. If 99% of requests are served from the cache, the 99th percentile response time you observe on the backend is effectively the 99.99th percentile for users (only 1 in 100 requests reaches the backend, and only 1 in 100 of those is in the backend’s tail), which is a very different thing – you can often afford 1 in 10,000 requests stalling for a few seconds, because factors outside of your control (such as packet loss on the client side) will be responsible for a larger share of delays.
Be careful though – “random” delays, for example when the system was busy and delayed servicing a request, are one thing; “systematic” delays, when response time is always bad under given conditions, can be a much worse problem. You do not want your best client to suffer, for example, even if he is the only one.
Thanks for the article. For me, having a histogram of access/load times would be more meaningful. I am not sure what a 99th or 95th percentile ranking would mean to me. If anything it would lead me to generate more statistics, and the one I would find most useful is a histogram of the counts of the load times. This would require normalizing the data a bit so that only the first couple of decimal places of precision are used, otherwise you get a lot of noise.
To generate a histogram in SQL I usually just GROUP BY the load times with a COUNT and you are good to go. So in your example, if you did the GROUP BY with an ORDER BY on the load time, I would be most interested in how often the really large load times occur. A rough sketch of what I mean is below.
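Something like this, against the performance_log_081128 table from the post, bucketing by rounding wtime to one decimal place:

-- count how many requests fall into each 0.1-second bucket
select round(wtime, 1) as bucket, count(*) as hits
from performance_log_081128
where page_type='search'
group by bucket
order by bucket;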
Thanks.
Dale,
A histogram is good, though it takes more work to get in SQL, and it is harder to define goals with it. It is much easier to say the 95th percentile should be within a second and the 99th percentile within 3 seconds than to define goals in terms of a histogram.
Histograms are nice, I agree. We’ve built histograms into Maatkit’s new log analysis tool, mk-log-parser. We are also going to add an indication of time-clustering, e.g. “this query always runs at 4AM on Tuesday”.
I might write up a post on histograms — with LOG() it is actually not too hard to do, though it’s not terribly efficient (filesort/temp table).
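For example, something along these lines – just a sketch against the table from Peter’s post, not what mk-log-parser actually does – buckets response times by order of magnitude:

-- group response times into powers-of-ten buckets
select floor(log10(wtime)) as magnitude, count(*) as hits
from performance_log_081128
where page_type='search' and wtime > 0
group by magnitude
order by magnitude;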
Peter,
Is there any simple way to get the 95th percentile in one query? For example, there is SELECT TOP N PERCENT in MSSQL – is there going to be something similar in MySQL?
Is there any way to calculate the 90th, 95th, and 99th percentiles over 1 million rows of IIS logs with response times (time-taken)?