There is the rare bug which I ran into every so often. Last time I’ve seen it about 3 years ago on MySQL 4.1 and I hoped it is long fixed since… but it looks like it is not. I now get to see MySQL 5.4.2 in the funny state.

When you see bug happening you would see MySQL log flooded with error messages like this:

091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable
091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable
091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable
091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable

filling out disk space

Depending on the case you may be able to connect to MySQL through Unix Socket or TCP/IP or neither.
It also looks like there is a correlation between having a lot of tables and such condition.

Previously I was unlucky with seeing this issue in production so we had to restart MySQL quickly currently I have a test MySQL showing some behavior.

Here is what strace tells me:

[percona@test9 msb_5_4_2]$ strace -f -p 19229
Process 19229 attached with 23 threads – interrupt to quit
[pid 19286] rt_sigtimedwait([HUP QUIT ALRM TERM TSTP],
[pid 19285] futex(0x165962ec, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19284] select(0, NULL, NULL, NULL, {0, 765000}
[pid 19283] select(0, NULL, NULL, NULL, {0, 412000}
[pid 19248] futex(0x1781193c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19247] futex(0x178118bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19246] futex(0x1781183c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19245] futex(0x178117bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19244] futex(0x1781173c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19232] futex(0x1781113c, FUTEX_WAIT_PRIVATE, 165, NULL
[pid 19229] accept(16392,
[pid 19241] futex(0x178115bc, FUTEX_WAIT_PRIVATE, 319, NULL
[pid 19243] futex(0x178116bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19242] futex(0x1781163c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19240] futex(0x1781153c, FUTEX_WAIT_PRIVATE, 3, NULL
[pid 19239] futex(0x178114bc, FUTEX_WAIT_PRIVATE, 7, NULL
[pid 19238] futex(0x1781143c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19237] futex(0x178113bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19236] futex(0x1781133c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19235] futex(0x178112bc, FUTEX_WAIT_PRIVATE, 7, NULL
[pid 19234] futex(0x1781123c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19233] futex(0x178111bc, FUTEX_WAIT_PRIVATE, 207, NULL
[pid 19231] futex(0x178110bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19229] <... accept resumed> 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] accept(16392, 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] accept(16392, 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] accept(16392, 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] fcntl(16392, F_SETFL, O_RDWR) = 0
[pid 19229] fcntl(16392, F_SETFL, O_RDWR) = 0
[pid 19229] select(16394, [1025 1040 1042 1044
….
8 9709 9714 9716 9717 15368 15369], NULL, NULL, NULL*** buffer overflow detected ***: strace terminated
======= Backtrace: =========
/lib64/libc.so.6(__chk_fail+0x2f)[0x35f36e6aff]
/lib64/libc.so.6[0x35f36e5ad3]
strace[0x408b60]

So as you can see accept gets pretty high socket number – probably because of large innodb_open_files I tested with in this case – which all can be used during recovery.

Note the process gets accept on the socket to fail with EAGAIN (which is not well described in the manual) and later getting call to select call with 16384 sockets. This does not seems to be the healthy number to work with SELECT call and It is quite possible something goes wrong because of it.

If you have any ideas why could be going wrong in this case.

7 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Kristian Nielsen

Ah, I think I see it now.

The select() and accept() calls in the trace are from
handle_connections_sockets() in sql/mysqld.cc. It is supposed to select() on
(at most) 2 file descriptors: the IP server socket and the unix server socket.

Now in this case, one of these somehow ended up with file descriptor number
16393.

From man 2 select:

“An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a
value of fd that is negative or is equal to or larger than FD_SETSIZE will
result in undefined behavior. Moreover, POSIX requires fd to be a valid
file descriptor.”

And FD_SETSIZE seems to be 1024 on Linux!

So basically mysqld is now passing random stack memory to select() for file
descriptors 1024-16393 :-(. This also explains why accept() fails with EAGAIN:
there isn’t any new connection pending, it is just that select() gets lots of
random file descriptors in the undefined memory, and one of them happens to
return ready.

So the code in handle_connections_sockets() is really broken, the only reason
it works at all is probably that the server sockets are usually created early
and hence get low file descriptors. But in this case for some reason at least
one of them got the very high number 16393.

Even if we do not exceed the FD_SETSIZE it can still be inefficient to
search a potentially big fd_set for just two descriptors. The code should be
changed to use poll(). I seem to remember seeing a patch for this, or I might
make on up for MariaDB.

Brant

I don’t have anything to add other than Kristian is awesome.

Ryan H

Thanks for this. We have this problem all the time on our production servers.

-Ryan

domas

yet more arguments for shared tablespaces, mwaha.

Antony

Or simply rearrange the server code so that the socket handles are created first, before any storage engine is initialized.