Rare evil MySQL Bug

There is the rare bug which I ran into every so often. Last time I’ve seen it about 3 years ago on MySQL 4.1 and I hoped it is long fixed since… but it looks like it is not. I now get to see MySQL 5.4.2 in the funny state.

When you see bug happening you would see MySQL log flooded with error messages like this:

091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable
091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable
091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable
091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable

filling out disk space

Depending on the case you may be able to connect to MySQL through Unix Socket or TCP/IP or neither.
It also looks like there is a correlation between having a lot of tables and such condition.

Previously I was unlucky with seeing this issue in production so we had to restart MySQL quickly currently I have a test MySQL showing some behavior.

Here is what strace tells me:

[percona@test9 msb_5_4_2]$ strace -f -p 19229
Process 19229 attached with 23 threads – interrupt to quit
[pid 19286] rt_sigtimedwait([HUP QUIT ALRM TERM TSTP],
[pid 19285] futex(0x165962ec, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19284] select(0, NULL, NULL, NULL, {0, 765000}
[pid 19283] select(0, NULL, NULL, NULL, {0, 412000}
[pid 19248] futex(0x1781193c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19247] futex(0x178118bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19246] futex(0x1781183c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19245] futex(0x178117bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19244] futex(0x1781173c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19232] futex(0x1781113c, FUTEX_WAIT_PRIVATE, 165, NULL
[pid 19229] accept(16392,
[pid 19241] futex(0x178115bc, FUTEX_WAIT_PRIVATE, 319, NULL
[pid 19243] futex(0x178116bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19242] futex(0x1781163c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19240] futex(0x1781153c, FUTEX_WAIT_PRIVATE, 3, NULL
[pid 19239] futex(0x178114bc, FUTEX_WAIT_PRIVATE, 7, NULL
[pid 19238] futex(0x1781143c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19237] futex(0x178113bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19236] futex(0x1781133c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19235] futex(0x178112bc, FUTEX_WAIT_PRIVATE, 7, NULL
[pid 19234] futex(0x1781123c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19233] futex(0x178111bc, FUTEX_WAIT_PRIVATE, 207, NULL
[pid 19231] futex(0x178110bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19229] <... accept resumed> 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] accept(16392, 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] accept(16392, 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] accept(16392, 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] fcntl(16392, F_SETFL, O_RDWR) = 0
[pid 19229] fcntl(16392, F_SETFL, O_RDWR) = 0
[pid 19229] select(16394, [1025 1040 1042 1044
….
8 9709 9714 9716 9717 15368 15369], NULL, NULL, NULL*** buffer overflow detected ***: strace terminated
======= Backtrace: =========
/lib64/libc.so.6(__chk_fail+0x2f)[0x35f36e6aff]
/lib64/libc.so.6[0x35f36e5ad3]
strace[0x408b60]

So as you can see accept gets pretty high socket number – probably because of large innodb_open_files I tested with in this case – which all can be used during recovery.

Note the process gets accept on the socket to fail with EAGAIN (which is not well described in the manual) and later getting call to select call with 16384 sockets. This does not seems to be the healthy number to work with SELECT call and It is quite possible something goes wrong because of it.

If you have any ideas why could be going wrong in this case.

7 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Kristian Nielsen

14 years ago

Ah, I think I see it now.

The select() and accept() calls in the trace are from
handle_connections_sockets() in sql/mysqld.cc. It is supposed to select() on
(at most) 2 file descriptors: the IP server socket and the unix server socket.

Now in this case, one of these somehow ended up with file descriptor number
16393.

From man 2 select:

“An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a
value of fd that is negative or is equal to or larger than FD_SETSIZE will
result in undefined behavior. Moreover, POSIX requires fd to be a valid
file descriptor.”

And FD_SETSIZE seems to be 1024 on Linux!

So basically mysqld is now passing random stack memory to select() for file
descriptors 1024-16393 :-(. This also explains why accept() fails with EAGAIN:
there isn’t any new connection pending, it is just that select() gets lots of
random file descriptors in the undefined memory, and one of them happens to
return ready.

So the code in handle_connections_sockets() is really broken, the only reason
it works at all is probably that the server sockets are usually created early
and hence get low file descriptors. But in this case for some reason at least
one of them got the very high number 16393.

Even if we do not exceed the FD_SETSIZE it can still be inefficient to
search a potentially big fd_set for just two descriptors. The code should be
changed to use poll(). I seem to remember seeing a patch for this, or I might
make on up for MariaDB.

Peter Zaitsev

Author

14 years ago

Kristian,

Thanks. Good catch. This matches my guess as well. select() is way to the trouble especially these days when we deal with a lot of connections, a lot of open files etc.

I filed a bug on it BTW:
http://bugs.mysql.com/bug.php?id=48929

Brant

14 years ago

I don’t have anything to add other than Kristian is awesome.

Ryan H

14 years ago

Thanks for this. We have this problem all the time on our production servers.

-Ryan

Peter Zaitsev

Author

14 years ago

Ryan,

Looks like this series of posts of workings with large amount of tables matches your envinronment significantly 🙂

domas

14 years ago

yet more arguments for shared tablespaces, mwaha.

Antony

14 years ago

Or simply rearrange the server code so that the socket handles are created first, before any storage engine is initialized.

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Rare evil MySQL Bug

Related

Related Blog Articles

RECOMMENDED ARTICLES

Did MyDumper LIKE Triggers?

Should You Deploy Your Databases on Kubernetes? And What Makes StatefulSet Worthwhile?

How to Improve Database Performance: The Ultimate Guide

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Rare evil MySQL Bug

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Did MyDumper LIKE Triggers?

Should You Deploy Your Databases on Kubernetes? And What Makes StatefulSet Worthwhile?

How to Improve Database Performance: The Ultimate Guide

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation