I feel like I’ve been seeing this a lot lately.
Occasionally, seemingly innocuous SELECTs take unacceptably long.
Or
Over the past few weeks, we’ve been having bizarre outages during which everything seems to grind to a halt… and then fixes itself within 5 minutes. We’ve got plenty of memory, we’re not running into swap, and we can’t find any queries that would seem to trigger outages: just tons of simple session requests all hung up for no obvious reason.
Problems like this are always hard to debug. If it happens twice a week for 5 minutes at a time, your chances of getting someone logged onto the machine to watch it in action are pretty slim. And of course, when they do look at it, they see nothing wrong on the surface; it takes some very clever, very fast work with OS-level debugging and tracing utilities to really prove what’s happening.
The two cases mentioned above were caused by scalability/concurrency/locking problems in the query cache. (One was on Windows, and we fixed it by guessing. The other was on GNU/Linux and Maciek isolated it with his elite skillz.) So if you’re having random lockups, you might try disabling the query cache, and see if that solves it.
Hopefully this blog post will show up on Google and save someone time and money!
This seems to be a 5.1 problem. I had plenty of 4.1 and 5.0 servers running the query cache with years of uptime. With 5.1 I had plenty of lock-ups, and disabling the query cache helped. Most of the time it is during an UPDATE, but I’ve seen it on INSERTs too. FLUSH TABLES will unblock a server with a query cache problem – most of the time.
There may be changes in 5.1, but it’s not a 5.1 problem in general. Both of the examples I quoted above were recent 5.0 versions. I’ve seen it on a variety of different server versions. It depends more on the server’s hardware and workload than on the MySQL version, IMO.
I think what would help more is for someone to go into sql_cache.cc and remove the silly “mutex trylock spin” code. If you want to have it, do just a single trylock, and if no joy, proceed as if there is a cache miss. The spin is very expensive, as it calls gettimeofday as well as repeatedly hitting the mutex.
(admittedly, the good outcome of that silly code is that it helped make the FreeBSD team improve the performance of their gettimeofday … but still – I am amazed that the code has lasted this long with no one allowed to clean it up)
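For illustration, a minimal sketch of the single-trylock idea, assuming a plain pthread mutex guards the cache (the names here are made up for the example, not the actual sql_cache.cc identifiers):

#include <pthread.h>

// Illustrative only: try the query-cache lock exactly once; if it is busy,
// behave as if the query were not cached instead of spinning and calling
// gettimeofday() on every iteration.
bool try_query_cache(pthread_mutex_t *qc_mutex) {
  if (pthread_mutex_trylock(qc_mutex) != 0)
    return false;                  // lock busy: treat it as a cache miss
  // ... look up the query / store the result here ...
  pthread_mutex_unlock(qc_mutex);
  return true;
}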
Interesting… this was supposed to have been fixed, eh.
There was the issue of the query cache pruning taking quite long; this was somewhat addressed in a bugfix, but my request for changing the state in the processlist to make any issue visible was declined because 5.0 was already frozen…
Oh yeah, I forgot to mention above… also move as much work as reasonably possible outside of the mutex, especially stuff like the malloc.
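As a small illustration of that point (again with made-up names, not the real query-cache code), the allocation and copy can happen before the lock is taken, so the critical section only covers linking the block into the cache:

#include <cstdlib>
#include <cstring>
#include <pthread.h>

// Illustrative only: do the expensive work (malloc + copy) without the lock,
// then hold the mutex just long enough to publish the block.
void store_result(pthread_mutex_t *qc_mutex, const void *data, size_t len) {
  void *block = malloc(len);
  if (block == nullptr) return;    // allocation failed: simply skip caching
  memcpy(block, data, len);

  pthread_mutex_lock(qc_mutex);
  // ... link `block` into the cache's index structures ...
  pthread_mutex_unlock(qc_mutex);
}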
Those are implementation problems. I think you can quite easily argue that for serious uses, not only is the implementation not great, but the architecture isn’t great. And not only that, but the whole concept isn’t great.
Still, it’s useful for some people… at least until random freezes start happening.
I smell a Percona patch. Antony, want to contribute?
Lately I have found the query cache less and less useful as a general rule – at least in proportion to one’s database size. We have found the “sweet spot” for the query cache’s effectiveness is around 64-128MB. YMMV. The argument is, though, that if someone needs more than 128MB worth of query cache, it’s time to start considering other caching mechanisms (such as memcached) and give the resources occupied by the query cache to someone else.
Makes sense when you think about it. The query cache, after all, caches the query (and the data associated with it), and the larger the database is, the more permutations need to be cached and the greater the chance that an update is going to flush all those cached queries involving that table anyway. So you need more RAM to make the cache useful but run into ugly, hidden locking; and at the same time a higher rate of updates invalidates the cache far more often, making it even less useful. Neither is an amazingly good scenario.
Granted, I DO like the query cache. It really works wonders in some cases. However, those cases tend to be centered around the smaller datasets – at least in my experience.
Currently, the query cache uses the actual query as the key to the cache.
This means bad things when you’re using generated SQL and are putting comments into the query so that you can examine the logs and work out where the query was generated.
What would work much better as the query cache key is simply a cryptographic digest of the tokens which the lexer encountered, possibly followed by the column names for the query to guard against the unlikely event of a digest key collision.
This does reorder how MySQL processes queries: right now, it checks the query cache before it parses the query. My proposal would move the check to occur after the parsing. The downside: you haven’t removed the cost of the parsing stage. On the plus side: you should achieve a significantly higher hit rate, and you don’t hit the query cache mutex lock so often because you only ever hit it when you really have encountered a SELECT.
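A toy sketch of that keying idea, assuming the token stream is already available from the lexer (FNV-1a is used here only to keep the example self-contained; a real implementation would presumably use a proper cryptographic digest as described above):

#include <cstdint>
#include <string>
#include <vector>

// Toy digest over the lexed tokens plus the result column names. Comments
// and whitespace never reach the token stream, so two generated queries that
// differ only in their embedded comments map to the same key.
uint64_t query_cache_key(const std::vector<std::string> &tokens,
                         const std::vector<std::string> &columns) {
  uint64_t h = 0xcbf29ce484222325ULL;              // FNV-1a offset basis
  auto mix = [&h](const std::string &s) {
    for (unsigned char c : s) { h ^= c; h *= 0x100000001b3ULL; }
    h ^= 0xFF; h *= 0x100000001b3ULL;              // separator between tokens
  };
  for (const auto &t : tokens)  mix(t);
  for (const auto &c : columns) mix(c);
  return h;
}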
We had problems similar to the second one that we solved by disabling the query cache. When it froze it often lasted more than 5 minutes, but it only happened once or twice a week.
We also see some issues with query cache usage, like setting query_cache_size=0 at runtime crashing mysqld.
Baron,
Percona could make a huge impact on improving the scalability of the query cache without a major investment in time. I don’t know about the stability though. Right now there is 1 mutex protecting the entire cache. For many users, a scalable and stable query cache is a good solution — it doesn’t require an extra process to manage and it doesn’t introduce additional authorization and access-control problems.
We also encountered this issue about half a year ago with the 5.1.26 release.
Depending on the MySQL load, the issues happened every 20 minutes.
After debugging MySQL, we also recognized that the query cache was involved in these issues.
It looked like a thread synchronization problem. After disabling the query cache all our locking problems went away.
We’ve opened a bug report; according to the developers, this issue has been fixed in 5.1.29.
But nevertheless we did not re-enable the query cache.
If you want to demonstrate your elite hacker skills on Linux, do this while the server is having trouble (lots of threads locked or freeing items, not responding to kills):
gdb -p <pid of mysqld>
set pagination off
thread apply all bt
and look for lots of threads waiting on
Query_cache::wait_while_table_flush_is_in_progress
If that’s what you see and the problem goes away after setting the query cache size to 0 then see http://bugs.mysql.com/bug.php?id=43758 .
There is also a recent bug that can be worked around by turning off the adaptive hash index, http://bugs.mysql.com/42400 . It is fixed in 5.1.33.
Arjen, FYI I’m using this bug to request an increase in the priority of the better status reporting.
Mark: “One mutex protecting the entire cache” isn’t really true. There are several mutexes involved in its design. One of the major issues is the custom memory allocator, which is a great idea for small embedded systems and for the initial targets of the QC. Changing the memory allocation model to something more standard which would allow for concurrent access would improve the situation a lot and reduce complexity. After all, it’s just a fancy container. Using a lock-free hash is also a good idea. We have a re-engineering team working on redesigning the QC from scratch. Brainstorming, wish-lists and requirements are of course fun and helpful. I’m sure they would appreciate it.
Anthony: The spin lock approach was a really silly idea. I’m the guilty guy behind it. Sorry 😉 At the time I recall facing some very strange requirements, and that is my only defense. Not my finest moment probably, but thanks to the power of open source, what doesn’t kill us sure makes us stronger. It is probably time to revisit that code with a new fix. (The spin lock approach isn’t in 5.1+)
Arjen: We always try to do our best. There is no conspiracy that I know of 🙂 Your contribution was appreciated!
In defense of the design of the QC I must say that it was initially created to solve some specific performance issues on a certain kind of hardware and it proved to be a good idea for many people at that time (and still can be if you understand what you do). Now as we move towards a new generation of hardware, things look different.
@Kristofer — What am I missing? structure_guard_mutex is locked in Query_cache::send_result_to_client() for any thread that queries the cache. It is also locked for much of the work done by Query_cache::store_query(). There are also read-write locks, but I have not spent the time to understand them.
What release will contain the performance fixes? 6.X is a long way off.
Antony:
What about keying off of *both* the actual query and a cryptographic digest of the tokens? I.e. attempt to hit the cache as the current implementation does. If that misses, go through the parsing stage & check again with the digest.
Upside: same cost in terms of parsing, higher hit rate.
Downside: storing 2x keys, performing 2x lookups for true misses.
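A rough sketch of that two-stage flow, using invented, simplified names (the helper functions are left as stubs; this is not MySQL’s actual interface):

#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Simplified stand-ins, just to show the two-stage control flow.
struct Result { std::string rows; };

std::unordered_map<std::string, Result> cache_by_text;  // stage 1: raw SQL text
std::unordered_map<uint64_t, Result> cache_by_digest;   // stage 2: token digest

uint64_t token_digest(const std::string &raw_sql);  // stub: e.g. the digest sketch above
bool is_select(const std::string &raw_sql);         // stub: known after parsing
Result execute_query(const std::string &raw_sql);   // stub

std::optional<Result> cached_or_execute(const std::string &raw_sql) {
  // Stage 1: cheap pre-parse lookup on the literal query text (current behaviour).
  auto hit = cache_by_text.find(raw_sql);
  if (hit != cache_by_text.end()) return hit->second;

  // Stage 2: parse first, then retry with the token digest, so a query that
  // differs only in comments or whitespace can still hit the cache.
  if (!is_select(raw_sql)) return std::nullopt;
  uint64_t d = token_digest(raw_sql);
  auto hit2 = cache_by_digest.find(d);
  if (hit2 != cache_by_digest.end()) return hit2->second;

  Result r = execute_query(raw_sql);
  cache_by_text.emplace(raw_sql, r);    // 2x keys stored on a true miss,
  cache_by_digest.emplace(d, r);        // 2x lookups only on a true miss
  return r;
}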
Mark: I don’t think you’re missing anything. I just meant that there are different mutexes and locks used in the query cache and not just one single mutex, and you seem to agree on that. I think the main focus should be on the memory allocator part, since it (1) contains a linear search, (2) puts too much work into getting the smallest memory block possible for the cache, (3) also has a problem with fragmentation and (4) causes increased code complexity where we could use standard components (maybe a google allocator?). Another thing which could easily be done is to start using a “lock-free” (== HW atomic compare and switch) hash to speed up searches. How much performance gain we get from this isn’t trivial to say, though, given all the integration constraints. Saying that we should remove struct_guard_mutex is OK, but how do you want it done instead if we don’t use that mutex to protect memory?
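For what it’s worth, a minimal, insert-only sketch of the lock-free-hash idea (invented names; a real query cache would also need eviction and memory reclamation, which this leaves out entirely):

#include <atomic>
#include <cstdint>

// Insert-only open-addressing table whose slots are claimed with
// compare-and-swap, so readers never take a mutex just to find an entry.
struct Node { uint64_t key; void *block; };

constexpr size_t kSlots = 1 << 16;                 // must be a power of two
std::atomic<Node *> table[kSlots];                 // zero-initialized to nullptr

void *lookup(uint64_t digest) {
  for (size_t n = 0, i = digest & (kSlots - 1); n < kSlots;
       ++n, i = (i + 1) & (kSlots - 1)) {
    Node *e = table[i].load(std::memory_order_acquire);
    if (e == nullptr) return nullptr;              // empty slot: not cached
    if (e->key == digest) return e->block;         // hit, no mutex taken
  }
  return nullptr;
}

bool insert(Node *fresh) {                         // fresh->key/block set by caller
  for (size_t n = 0, i = fresh->key & (kSlots - 1); n < kSlots;
       ++n, i = (i + 1) & (kSlots - 1)) {
    Node *expected = nullptr;
    if (table[i].compare_exchange_strong(expected, fresh,
                                         std::memory_order_acq_rel))
      return true;                                 // claimed an empty slot
    if (expected->key == fresh->key) return false; // already cached by someone else
  }
  return false;                                    // table full
}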
Hi,
I am working on a system which currently has around 20GB of data (increasing at a rate of ~200MB/day). We need to save complete documents in the system, so basically one column has around 70-80% of the data. Our server is a quad dual-core with 4GB RAM, Server 2003, and we are using MySQL 5.0. The query cache size is 156M and the limit is 8M. All tables except one are InnoDB, and its buffer pool size is 1024M.
The end-user client module is read-only, so there are lots and lots of repeated queries; therefore I think the query cache helps a lot, but one of the problems I am facing is somewhat similar to what’s discussed above. Sometimes the system halts for a few minutes. I fear that if I disable the query cache it will slow down my system, which is already just OK.
Apart from that I also want to handle the future size of the DB, which is increasing rapidly. I have got hold of High Performance MySQL, but any hints on where I should be looking first?
Any help will be highly appreciated. I can also give more details of my system in case it will help. Thanks a lot!
We are experiencing this freezing issue as well on FreeBSD 7.1 with MySQL 5.1.32. It’s a pretty high-traffic site, but I’ve never seen the DB server go above 20% CPU usage. Currently the master DB is freezing up. It’s replicating to 2 slave servers. When it freezes up, its top state shows “ucond” and CPU usage is 0% or very low, so somehow it’s still doing some sort of work. Trying to connect to it via the command-line mysql client returns error 34, Can’t create a new thread.
Also, issuing a /usr/local/etc/rc.d/mysql-server restart command just hangs as well; we have to issue a kill -9 pid to make the server stop and restart. We were running FreeBSD 7.0 and MySQL 5.0 and never saw this issue. Any help would be GREATLY appreciated; I honestly don’t know where to start.
Here is our my.cnf:
skip-name-resolve
max_connections = 2000
skip-innodb
log-bin = /usr/mysql/logs/mysql-bin.log
binlog-do-db=tp
server-id=1
wait_timeout = 15
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 128K
thread_cache_size = 64
query_cache_limit = 0
#query_cache_size = 64M
query_cache_size = 0
query_cache_type = 0
join_buffer_size = 512K
log_slow_queries = /usr/mysql/logs/mysql-slow.log
OK, just caught it when it had frozen, and I have some more info to post:
Here is what we get when we try to connect to the server via the command line client:
ERROR 1135 (HY000): Can’t create a new thread (errno 35); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug
Here is what top shows; the mysqld process is at the bottom:
last pid: 23574; load averages: 0.08, 0.19, 0.33 up 2+01:39:37 23:06:53
1522 processes:1 running, 1521 sleeping
CPU: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 234M Active, 4999M Inact, 403M Wired, 285M Cache, 214M Buf, 6000K Free
Swap: 4096M Total, 4096M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
592 root 1 44 0 24080K 5096K select 0 5:36 0.00% snmpd
630 root 1 44 0 10480K 2136K select 0 0:05 0.00% ntpd
742 root 1 53 0 22876K 2660K select 2 0:04 0.00% sshd
748 root 1 44 0 10696K 2880K select 0 0:02 0.00% sendmail
510 root 1 44 0 5688K 1060K select 2 0:01 0.00% syslogd
758 root 1 8 0 6744K 1184K nanslp 0 0:00 0.00% cron
21290 root 1 44 0 33768K 3500K select 0 0:00 0.00% sshd
752 smmsp 1 20 0 10696K 2704K pause 0 0:00 0.00% sendmail
21294 root 1 20 0 10104K 2284K pause 1 0:00 0.00% csh
20790 mysql 1 8 0 7060K 1388K wait 1 0:00 0.00% sh
23574 root 1 44 0 12208K 5116K CPU2 2 0:00 0.00% top
800 root 1 5 0 5684K 996K ttyin 0 0:00 0.00% getty
799 root 1 5 0 5684K 996K ttyin 0 0:00 0.00% getty
801 root 1 5 0 5684K 996K ttyin 2 0:00 0.00% getty
798 root 1 5 0 5684K 996K ttyin 2 0:00 0.00% getty
803 root 1 5 0 5684K 996K ttyin 2 0:00 0.00% getty
806 root 1 5 0 5684K 944K ttydcd 2 0:00 0.00% getty
797 root 1 5 0 5684K 996K ttyin 1 0:00 0.00% getty
804 root 1 5 0 5684K 996K ttyin 2 0:00 0.00% getty
802 root 1 5 0 5684K 996K ttyin 3 0:00 0.00% getty
805 root 1 5 0 5684K 944K ttydcd 2 0:00 0.00% getty
449 root 1 44 0 2180K 472K select 0 0:00 0.00% devd
20895 mysql 1500 44 0 409M 250M ucond 1 0:00 0.00% mysqld
Noah, please don’t use the comments as a support forum. We have forums at forums.percona.com and there are also MySQL’s mailing list, IRC channel, and forums. And if you need professional help, of course we’re here 🙂
ah…feel free to delete them, including this one
This bug turned out to be the query cache too: http://bugs.mysql.com/?id=39508
I was told it is fixed in 5.0.68
Thank you for this article – it saved me several frustrated hours of trying to understand why a simple UPDATE sometimes takes at least a minute to execute.
This problem still happens with the latest Percona Server in Linux containers. We had the query cache on with MySQL 5.7 and a 512MB query cache and didn’t have any problems. When we switched to Percona Server with a 4096MB query cache, the entire DB freezes from time to time.
mysql> select @@version;
+---------------+
| @@version     |
+---------------+
| 5.7.21-21-log |
+---------------+