mysql 主备状态显示正常，数据未同步的bug

时间:2014-12-15 23:02 来源:linux.it.net.cn 作者:IT

今天看到这个Bug。

https://bugs.launchpad.net/percona-server/+bug/1070255

http://bugs.mysql.com/bug.php?id=67352

原文内容：

Bug Description

For a product application with 4 masters and 4 slaves of MySQL, which configured with BINLOG_FORMAT=ROW, 2 of the slaves can not replicate DML but the DDL works, seens two days ago. That is, all the events in the master binlogs could be sent to relay logs and applied by slave, everything seems OKay from "show slave status", but the inserted record can not be found in slave.

We spent most half day on this issue, and at last, we found the root reason.

In Table_map_log_event::do_apply_event, the m_table_id is correct, parsed from relay log event of Table_map_log_event, which is 5126232549. That value is assigned to table_list->table_id, but actually, it's overflowed and it's 831265253. Then, for the followed Rows_log_event::do_apply_event, table_list->table_id(831265253) is inserted into HASH m_table_ids, but later looked into HASH m_table_ids with entry of m_table_id for Rows_log_even, which is 5126232549. Obviously, it failed.

Here is part of GDB output:

-----------------------------------------------------

Breakpoint 2, table_mapping::set_table (this=0x65720df8, table_id=831265084, table=0x7fbe8c011320) at /home/xiyu.lh/alimysql/sql/rpl_tblmap.cc:109

109 entry *e= find_entry(table_id);

(gdb) c

Continuing.

Breakpoint 3, table_mapping::get_table (this=0x65720df8, table_id=5126232380) at /home/xiyu.lh/alimysql/sql/rpl_tblmap.cc:64

64 entry *e= find_entry(table_id);

(gdb) c

Continuing.

Breakpoint 1, Table_map_log_event::do_apply_event (this=0x7fbe8c00d6a0, rli=0x6571eb08) at /home/xiyu.lh/alimysql/sql/log_event.cc:8518

8518 thd->set_query_id(next_query_id());

(gdb) n

8520 if (!(memory= my_multi_malloc(MYF(MY_WME),

(gdb) n

8527 strmov(db_mem, rpl_filter->get_rewrite_db(m_dbnam, &dummy_len));

(gdb)

8528 strmov(tname_mem, m_tblnam);

(gdb)

8532 tname_mem, TL_WRITE);

(gdb)

8534 table_list->table_id= DBUG_EVALUATE_IF("inject_tblmap_same_id_maps_diff_tToable", 0, m_table_id);

(gdb) p table_list->table_id

$8 = 0

(gdb) p m_table_id

$9 = 5126232549

(gdb) s

8535 table_list->updating= 1;

(gdb) p (uint)m_table_id

$10 = 831265253

(gdb) p table_list->table_id

$11 = 831265253

(gdb) set var table_list->table_id=m_table_id

(gdb) p table_list->table_id

$12 = 831265253

-----------------------------------------------------

The proble is the routine below:

table_list->table_id= DBUG_EVALUATE_IF("inject_tblmap_same_id_maps_diff_table", 0, m_table_id);

as table_list->table is uint, while m_table_id is ulong, caused overflow for table_list->table.

This caused the following logic as unexpected:

-----------------------------------------------------

int Rows_log_event::do_apply_event(Relay_log_info const *rli)

{

for (TABLE_LIST *ptr= rli->tables_to_lock ; ptr ; ptr= ptr->next_global)

{

const_castlog_info*>(rli)->m_table_map.set_table(ptr->table_id, ptr->table);

}

#ifdef HAVE_QUERY_CACHE

query_cache.invalidate_locked_for_write(rli->tables_to_lock);

#endif

}

TABLE*

table=

m_table= const_cast<<wbr>Relay_log_info*>(rli)->m_table_map.get_table(m_table_id);

DBUG_PRINT("debug", ("m_table: 0x%lx, m_table_id: %lu", (ulong) m_table, m_table_id));

if (table)

{

-----------------------------------------------------

table_mapping::set_table insert into m_table_id with overflowed value(831265253), and then

table_mapping::get_table could not find m_table_id(5126232549) entry in HASH m_table_ids, caused the table==0x0 and left the core part do_apply_event.

Solution:

Change table_id from uint to ulong in sql/table.h to avoid overflow.

This bug should be exist in 5.1/5.5/5.6 with 64bit OS kernels.

这个bug非常严重，没有任何错误日志，或者状态上的不正常。

内部需要研讨一下，这个bug的应对方法了？

(责任编辑：IT)