线上多台服务器连续出现异常宕机重启,在OOB带外日志显示首先出现网卡bnx2NETDEV WATCHDOG报告传输队列超时(Call Trace),同时出现CPU [Hareware Error] 显示 Machine Check Exception: 4 Bank 5: be00000000800400:

2016-12-03 19:08:42    [14874902.828670] ------------[ cut here ]------------
2016-12-03 19:08:42    [14874902.833622] WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x263/0x270() (Tainted: GF          ---------------   )
2016-12-03 19:08:42    [14874902.844875] Hardware name: PowerEdge R510
2016-12-03 19:08:42    [14874902.849212] NETDEV WATCHDOG: slave0 (bnx2): transmit queue 5 timed out
2016-12-03 19:08:42    [14874902.856068] Modules linked in: kvm_intel_0 ksplice_etuw1mmg_vmlinux_new ksplice_etuw1mmg havs(F) kvm_intel_1 ksplice_ubnyr3k6_vmlinux_new ksplice_ubnyr3k6 sch_sfq act_police cls_u32 sch_ingress xt_mac xt_state mptctl mptbase nbd tcp_diag inet_diag nfnetlink_queue igb nfnetlink xt_conntrack ipt_REJECT ip6table_filter ip6_tables arpt_nfqueue cls_fw sch_htb ebt_mark ebt_arp arptable_filter arp_tables autofs4 ipmi_devintf ipmi_si ebtable_filter ebtable_nat flow_acl(F) iptable_filter ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 bonding ipv6 8021q garp ext4 jbd2 dm_mirror dm_multipath video output sbs sbshc acpi_pad acpi_ipmi ipmi_msghandler vhost_net(F) macvtap(F) macvlan(F) parport turbo_proxy tun slb_ctk_proxy(F) ruleset patch_cksum kvm flow_mark flow_filter nf_conntrack flow_ctk(F) vpc_session(F) classic_traceroute(F) ebt_fnat(F) ebtable_broute(F) ebtables(F) bridge stp llc avs_hotfix slb_ctk_session alivrouter(F) flow_qos(F) sg power_meter iTCO_wdt iTCO_vendor_support bnx2 serio_raw i7core_edac edac_core dcdbas lpc_ich mfd_core dm_raid45 dm_memcache dm_region_hash dm_log dm_mod shpchp ext3 jbd mbcache virtio_pci virtio_blk virtio virtio_ring raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid10 raid1 raid0 mpt2sas scsi_transport_sas raid_class hpsa cciss [last unloaded: kvm_intel_0]
2016-12-03 19:08:43    [14874902.975946] Pid: 0, comm: swapper Tainted: GF          ---------------    2.6.32-358.el5.x86_64 #1
2016-12-03 19:08:43    [14874902.986422] Call Trace:
2016-12-03 19:08:43    [14874902.989201]    [] ? dev_watchdog+0x263/0x270
2016-12-03 19:08:43    [14874902.995819]  [] ? dev_watchdog+0x263/0x270
2016-12-03 19:08:43    [14874903.001810]  [] ? warn_slowpath_common+0x98/0xc0
2016-12-03 19:08:43    [14874903.008319]  [] ? warn_slowpath_fmt+0x6e/0x70
2016-12-03 19:08:43    [14874903.014570]  [] ? strlcpy+0x4f/0x70
2016-12-03 19:08:43    [14874903.019952]  [] ? netdev_drivername+0x48/0x60
2016-12-03 19:08:43    [14874903.026208]  [] ? dev_watchdog+0x263/0x270
2016-12-03 19:08:43    [14874903.032201]  [] ? scheduler_tick+0xeb/0x250
2016-12-03 19:08:43    [14874903.038276]  [] ? dev_watchdog+0x0/0x270
2016-12-03 19:08:43    [14874903.044092]  [] ? run_timer_softirq+0x161/0x340
2016-12-03 19:08:43    [14874903.050519]  [] ? lapic_next_event+0x1d/0x30
2016-12-03 19:08:43    [14874903.056682]  [] ? __do_softirq+0xbf/0x220
2016-12-03 19:08:43    [14874903.062584]  [] ? hrtimer_interrupt+0x11d/0x230
2016-12-03 19:08:43    [14874903.069010]  [] ? call_softirq+0x1c/0x30
2016-12-03 19:08:43    [14874903.074827]  [] ? do_softirq+0x65/0xa0
2016-12-03 19:08:43    [14874903.080463]  [] ? irq_exit+0x7c/0x90
2016-12-03 19:08:43    [14874903.085932]  [] ? smp_apic_timer_interrupt+0x70/0x9d
2016-12-03 19:08:43    [14874903.092794]  [] ? apic_timer_interrupt+0x13/0x20
2016-12-03 19:08:43    [14874903.099306]    [] ? intel_idle+0xe1/0x180
2016-12-03 19:08:43    [14874903.105668]  [] ? intel_idle+0xc4/0x180
2016-12-03 19:08:43    [14874903.111399]  [] ? cpuidle_idle_call+0x99/0x140
2016-12-03 19:08:43    [14874903.117734]  [] ? cpu_idle+0xa8/0xe0
2016-12-03 19:08:43    [14874903.123197]  [] ? start_secondary+0x226/0x360
2016-12-03 19:08:43    [14874903.129451] ---[ end trace d5f89fd37f9190bc ]---
2016-12-03 19:08:43    [14874903.134393] bnx2 0000:01:00.0: slave0: <--- start FTQ dump --->
2016-12-03 19:08:43    [14874903.140642] bnx2 0000:01:00.0: slave0: RV2P_PFTQ_CTL 00010000
2016-12-03 19:08:43    [14874903.146715] bnx2 0000:01:00.0: slave0: RV2P_TFTQ_CTL 00020000
2016-12-03 19:08:43    [14874903.152785] bnx2 0000:01:00.0: slave0: RV2P_MFTQ_CTL 00004000
2016-12-03 19:08:43    [14874903.158858] bnx2 0000:01:00.0: slave0: TBDR_FTQ_CTL 00004002
2016-12-03 19:08:43    [14874903.164844] bnx2 0000:01:00.0: slave0: TSCH_FTQ_CTL 00020000
2016-12-03 19:08:43    [14874903.171056] bnx2 0000:01:00.0: slave0: TDMA_FTQ_CTL 00010002
2016-12-03 19:08:43    [14874903.177044] bnx2 0000:01:00.0: slave0: TXP_FTQ_CTL 00010002
2016-12-03 19:08:43    [14874903.182945] bnx2 0000:01:00.0: slave0: TPAT_FTQ_CTL 00010002
2016-12-03 19:08:43    [14874903.188933] bnx2 0000:01:00.0: slave0: TAS_FTQ_CTL 00010002
2016-12-03 19:08:43    [14874903.194830] bnx2 0000:01:00.0: slave0: RXP_CFTQ_CTL 00008000
2016-12-03 19:08:43    [14874903.200817] bnx2 0000:01:00.0: slave0: RXP_FTQ_CTL 00100000
2016-12-03 19:08:43    [14874903.206720] bnx2 0000:01:00.0: slave0: RLUP_FTQ_CTL 00008000
2016-12-03 19:08:43    [14874903.212704] bnx2 0000:01:00.0: slave0: COM_COMXQ_FTQ_CTL 00010000
2016-12-03 19:08:43    [14874903.219126] bnx2 0000:01:00.0: slave0: COM_COMTQ_FTQ_CTL 00020000
2016-12-03 19:08:43    [14874903.225545] bnx2 0000:01:00.0: slave0: COM_COMQ_FTQ_CTL 00010000
2016-12-03 19:08:43    [14874903.231881] bnx2 0000:01:00.0: slave0: CP_CPQ_FTQ_CTL 00004000
2016-12-03 19:08:43    [14874903.238039] bnx2 0000:01:00.0: slave0: RDMA_FTQ_CTL 00010000
2016-12-03 19:08:43    [14874903.244031] bnx2 0000:01:00.0: slave0: CSCH_CH_FTQ_CTL 00004000
2016-12-03 19:08:43    [14874903.250279] bnx2 0000:01:00.0: slave0: MCP_MCPQ_FTQ_CTL 00005000
2016-12-03 19:08:43    [14874903.256612] bnx2 0000:01:00.0: slave0: CPU states:
2016-12-03 19:08:43    [14874903.261737] bnx2 0000:01:00.0: slave0: 045000 mode b84c state 80001000 evt_mask 500 pc 8001288 pc 8001288 instr 8e030000
2016-12-03 19:08:43    [14874903.273005] bnx2 0000:01:00.0: slave0: 085000 mode b84c state 80009000 evt_mask 500 pc 8000a5c pc 8000a4c instr 1440fffc
2016-12-03 19:08:43    [14874903.284263] bnx2 0000:01:00.0: slave0: 0c5000 mode b84c state 80001000 evt_mask 500 pc 8004c14 pc 8004c14 instr 32070001
2016-12-03 19:08:43    [14874903.295524] bnx2 0000:01:00.0: slave0: 105000 mode b8cc state 80000000 evt_mask 500 pc 8000a9c pc 8000aa4 instr 8821
2016-12-03 19:08:43    [14874903.306444] bnx2 0000:01:00.0: slave0: 145000 mode b880 state 80004000 evt_mask 500 pc 80009d4 pc 800c6cc instr 0
2016-12-03 19:08:43    [14874903.317098] bnx2 0000:01:00.0: slave0: 185000 mode b8cc state 80000000 evt_mask 500 pc 8000c6c pc 8000c6c instr 1700ffe1
2016-12-03 19:08:43    [14874903.328351] bnx2 0000:01:00.0: slave0: <--- end FTQ dump --->
2016-12-03 19:08:43    [14874903.334425] bnx2 0000:01:00.0: slave0: <--- start TBDC dump --->
2016-12-03 19:08:43    [14874903.340766] bnx2 0000:01:00.0: slave0: TBDC free cnt: 32
2016-12-03 19:08:43    [14874903.346404] bnx2 0000:01:00.0: slave0: LINE     CID  BIDX   CMD  VALIDS
2016-12-03 19:08:43    [14874903.353351] bnx2 0000:01:00.0: slave0: 00    001100  a6d8   00    [0]
2016-12-03 19:08:43    [14874903.360127] bnx2 0000:01:00.0: slave0: 01    001100  a6d8   00    [0]
2016-12-03 19:08:43    [14874903.366896] bnx2 0000:01:00.0: slave0: 02    001000  c5f8   00    [0]
2016-12-03 19:08:43    [14874903.373667] bnx2 0000:01:00.0: slave0: 03    001200  bf80   00    [0]
2016-12-03 19:08:43    [14874903.380435] bnx2 0000:01:00.0: slave0: 04    001000  9f20   00    [0]
2016-12-03 19:08:43    [14874903.387208] bnx2 0000:01:00.0: slave0: 05    001000  2218   00    [0]
2016-12-03 19:08:43    [14874903.393980] bnx2 0000:01:00.0: slave0: 06    001000  2250   00    [0]
2016-12-03 19:08:43    [14874903.400750] bnx2 0000:01:00.0: slave0: 07    001000  2220   00    [0]
2016-12-03 19:08:43    [14874903.407524] bnx2 0000:01:00.0: slave0: 08    001000  2238   00    [0]
2016-12-03 19:08:43    [14874903.414302] bnx2 0000:01:00.0: slave0: 09    001000  2258   00    [0]
2016-12-03 19:08:43    [14874903.421076] bnx2 0000:01:00.0: slave0: 0a    001000  2228   00    [0]
2016-12-03 19:08:43    [14874903.427848] bnx2 0000:01:00.0: slave0: 0b    001000  2260   00    [0]
2016-12-03 19:08:43    [14874903.434628] bnx2 0000:01:00.0: slave0: 0c    001000  2148   00    [0]
2016-12-03 19:08:43    [14874903.441407] bnx2 0000:01:00.0: slave0: 0d    001000  2150   00    [0]
2016-12-03 19:08:43    [14874903.448180] bnx2 0000:01:00.0: slave0: 0e    001000  2158   00    [0]
2016-12-03 19:08:43    [14874903.454952] bnx2 0000:01:00.0: slave0: 0f    001000  2168   00    [0]
2016-12-03 19:08:43    [14874903.461730] bnx2 0000:01:00.0: slave0: 10    001000  2170   00    [0]
2016-12-03 19:08:43    [14874903.468501] bnx2 0000:01:00.0: slave0: 11    001000  2180   00    [0]
2016-12-03 19:08:43    [14874903.475273] bnx2 0000:01:00.0: slave0: 12    001000  2188   00    [0]
2016-12-03 19:08:43    [14874903.482044] bnx2 0000:01:00.0: slave0: 13    001000  2190   00    [0]
2016-12-03 19:08:43    [14874903.488816] bnx2 0000:01:00.0: slave0: 14    001000  1938   00    [0]
2016-12-03 19:08:43    [14874903.495588] bnx2 0000:01:00.0: slave0: 15    001000  1940   00    [0]
2016-12-03 19:08:43    [14874903.502360] bnx2 0000:01:00.0: slave0: 16    001000  1948   00    [0]
2016-12-03 19:08:43    [14874903.509131] bnx2 0000:01:00.0: slave0: 17    001000  1950   00    [0]
2016-12-03 19:08:43    [14874903.515902] bnx2 0000:01:00.0: slave0: 18    001000  17f0   00    [0]
2016-12-03 19:08:43    [14874903.522674] bnx2 0000:01:00.0: slave0: 19    001000  1800   00    [0]
2016-12-03 19:08:43    [14874903.529447] bnx2 0000:01:00.0: slave0: 1a    001000  0808   00    [0]
2016-12-03 19:08:43    [14874903.536220] bnx2 0000:01:00.0: slave0: 1b    001000  0810   00    [0]
2016-12-03 19:08:43    [14874903.542999] bnx2 0000:01:00.0: slave0: 1c    001000  b780   00    [0]
2016-12-03 19:08:43    [14874903.549777] bnx2 0000:01:00.0: slave0: 1d    001000  4200   00    [0]
2016-12-03 19:08:43    [14874903.556548] bnx2 0000:01:00.0: slave0: 1e    001000  b548   00    [0]
2016-12-03 19:08:43    [14874903.563322] bnx2 0000:01:00.0: slave0: 1f    001000  8040   00    [0]
2016-12-03 19:08:43    [14874903.570091] bnx2 0000:01:00.0: slave0: <--- end TBDC dump --->
2016-12-03 19:08:43    [14874903.576253] bnx2 0000:01:00.0: slave0: DEBUG: intr_sem[0] PCI_CMD[00100406]
2016-12-03 19:08:43    [14874903.583606] bnx2 0000:01:00.0: slave0: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
2016-12-03 19:08:43    [14874903.591827] bnx2 0000:01:00.0: slave0: DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
2016-12-03 19:08:47    [14874907.108434] bnx2 0000:01:00.0: slave0: NIC Copper Link is Up, 1000 Mbps full duplex
2016-12-03 19:08:43    [14874903.600908] bnx2 0000:01:00.0: slave0: DEBUG: RPM_MGMT_PKT_CTRL[40000088]
2016-12-03 19:08:43    [14874903.608021] bnx2 0000:01:00.0: slave0: DEBUG: HC_STATS_INTERRUPT_STATUS[015f00a0]
2016-12-03 19:08:43    [14874903.615886] bnx2 0000:01:00.0: slave0: DEBUG: PBA[00000000]
2016-12-03 19:08:43    [14874903.621791] bnx2 0000:01:00.0: slave0: DEBUG: MSIX table:
2016-12-03 19:08:43    [14874903.627518] bnx2 0000:01:00.0: slave0: DEBUG: [0]: fee00618 00000000 00000000 00000000
2016-12-03 19:08:43    [14874903.635818] bnx2 0000:01:00.0: slave0: DEBUG: [1]: fee00618 00000000 00000001 00000000
2016-12-03 19:08:43    [14874903.644131] bnx2 0000:01:00.0: slave0: DEBUG: [2]: fee00618 00000000 00000002 00000000
2016-12-03 19:08:43    [14874903.652433] bnx2 0000:01:00.0: slave0: DEBUG: [3]: fee00618 00000000 00000003 00000000
2016-12-03 19:08:43    [14874903.660739] bnx2 0000:01:00.0: slave0: DEBUG: [4]: fee00618 00000000 00000004 00000000
2016-12-03 19:08:43    [14874903.669270] bnx2 0000:01:00.0: slave0: DEBUG: [5]: fee00618 00000000 00000005 00000000
2016-12-03 19:08:43    [14874903.677572] bnx2 0000:01:00.0: slave0: DEBUG: [6]: fee00618 00000000 00000006 00000000
2016-12-03 19:08:43    [14874903.685876] bnx2 0000:01:00.0: slave0: DEBUG: [7]: fee00618 00000000 00000007 00000000
2016-12-03 19:08:43    [14874903.694180] bnx2 0000:01:00.0: slave0: <--- start MCP states dump --->
2016-12-03 19:08:43    [14874903.701040] bnx2 0000:01:00.0: slave0: DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
2016-12-03 19:08:43    [14874903.709776] bnx2 0000:01:00.0: slave0: DEBUG: MCP mode[0000b880] state[80000000] evt_mask[00000500]
2016-12-03 19:08:43    [14874903.719217] bnx2 0000:01:00.0: slave0: DEBUG: pc[0800b220] pc[0800b158] instr[30820800]
2016-12-03 19:08:43    [14874903.727604] bnx2 0000:01:00.0: slave0: DEBUG: shmem states:
2016-12-03 19:08:43    [14874903.733505] bnx2 0000:01:00.0: slave0: DEBUG: drv_mb[01030014] fw_mb[00000014] link_status[0000006f] drv_pulse_mb[00004a50]
2016-12-03 19:08:43    [14874903.745044] bnx2 0000:01:00.0: slave0: DEBUG: dev_info_signature[44564907] reset_type[01005254] condition[0003610e]
2016-12-03 19:08:43    [14874903.755891] bnx2 0000:01:00.0: slave0: DEBUG: 000001c0: 01005254 42530088 0003610e 00000000
2016-12-03 19:08:43    [14874903.764630] bnx2 0000:01:00.0: slave0: DEBUG: 000003cc: 00000000 00000000 00000000 00000000
2016-12-03 19:08:43    [14874903.773370] bnx2 0000:01:00.0: slave0: DEBUG: 000003dc: 00000000 00000000 00000000 00000000
2016-12-03 19:08:43    [14874903.782112] bnx2 0000:01:00.0: slave0: DEBUG: 000003ec: 00000000 00000000 00000000 00000000
2016-12-03 19:08:43    [14874903.792028] bnx2 0000:01:00.0: slave0: DEBUG: 0x3fc[00000000]
2016-12-03 19:08:43    [14874903.798098] bnx2 0000:01:00.0: slave0: <--- end MCP states dump --->
2016-12-03 19:08:44    [14874903.916087] bnx2 0000:01:00.0: slave0: NIC Copper Link is Down
2016-12-03 19:08:44    [14874903.922336] bonding: eth0: link status definitely down for interface slave0, disabling it
2016-12-03 19:08:52    [14874911.977687] [Hardware Error]: CPU 19: Machine Check Exception: 4 Bank 5: be00000000800400
2016-12-03 19:08:52    [14874911.977777] Clocksource tsc unstable (delta = -8589935096 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
2016-12-03 19:08:52    [14874911.999243] [Hardware Error]: TSC 7ebbc5c7e698b4 ADDR 1579aa6 
2016-12-03 19:08:52    [14874912.005430] [Hardware Error]: PROCESSOR 0:206c2 TIME 1480763330 SOCKET 0 APIC 11
2016-12-03 19:08:52    [14874912.013203] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 5: be00000000800400
2016-12-03 19:08:52    [14874912.021686] [Hardware Error]: TSC 7ebbc5c7e698c8 ADDR 1579aa6 
2016-12-03 19:08:52    [14874912.027876] [Hardware Error]: PROCESSOR 0:206c2 TIME 1480763330 SOCKET 0 APIC 10
2016-12-03 19:08:52    [14874912.035650] [Hardware Error]: Machine check: Processor context corrupt
2016-12-03 19:08:52    [14874912.042500] Kernel panic - not syncing: Fatal Machine check
2016-12-03 19:08:52    [14874912.048401] Pid: 0, comm: swapper Tainted: GF  M    W  ---------------    2.6.32-358.el5.x86_64 #1
2016-12-03 19:08:52    [14874912.058867] Call Trace:
2016-12-03 19:08:52    [14874912.061632]  <#MC>  [] ? panic+0xd6/0x1d0
2016-12-03 19:08:52    [14874912.067554]  [] ? notifier_call_chain+0x21/0x90
2016-12-03 19:08:52    [14874912.073971]  [] ? __ratelimit+0xbd/0xe0
2016-12-03 19:08:52    [14874912.079701]  [] ? print_mce+0xf1/0x150
2016-12-03 19:08:52    [14874912.085338]  [] ? mce_panic+0x20e/0x210
2016-12-03 19:08:52    [14874912.091062]  [] ? do_machine_check+0x9b8/0x9c0
2016-12-03 19:08:52    [14874912.097396]  [] ? machine_check+0x1c/0x30
2016-12-03 19:08:52    [14874912.103293]  [] ? intel_idle+0xb4/0x180
2016-12-03 19:08:52    [14874912.109015]  <>  [] ? menu_select+0xee/0x370
2016-12-03 19:08:52    [14874912.115627]  [] ? cpuidle_idle_call+0x99/0x140
2016-12-03 19:08:52    [14874912.121958]  [] ? cpu_idle+0xa8/0xe0
2016-12-03 19:08:52    [14874912.127422]  [] ? start_secondary+0x226/0x360

当出现machine check exception (MCE) 的时候,硬件出现问题,此时会设置一个taint state(污染状态),一旦设置了内核已经污染(tainted),则只能通过重启系统重新加载内核才能unset这个污染状态。

请注意,在这里有两次出现 Call Trace,在Call Trace之前,有一个标记 Tainted: 其后跟随的字符表示了tainted stats。在 Machine Check Exception (MCE) 之后的 Kernel panic中,tainted stats 是 GF M W。其中的 W 表示 warning ,也就是 "W: 如果前面内核已经提出过告警"。

仔细查看dmesg输出信息,可以看到在操作系统启动时,内核已经发出过警告信息

[    1.434178] [Hardware Error]: This system BIOS has enabled interrupt remapping
[    1.434179] on a chipset that contains an errata making that
[    1.434180] feature unstable.  Please reboot with nointremap
[    1.434180] added to the kernel command line and contact
[    1.434181] your BIOS vendor for an update

上述启动时的硬件错误信息是提示系统BIOS开启了中断重映射(这个功能是用于kvm passthrough虚拟化),但是主机的芯片存在bug会导致系统不稳定,所以建议关闭intremap并升级BIOS。详细分析见本文后述。

dev_watchdog

Kernel Panic原因

Interrupt remapping problems with Intel 5500, 5520 CPUs 汇总了有关Intel VT-d中断映射引擎影响Intel 5500/5520芯片的问题。

出现 NETDEV WATCHDOG: eth0 (bnx2): transmit queue 2 timed out 的原因是因为 Intel 55XX 芯片存在一个中断重映射(interrupt remapping)的bug,通常会导致网卡丢失,以及watchdog触发panic。

关闭interrupt remapping方法

服务器启动内核修改 /boot/grub/menu.lst 配置文件kernel行添加 intremap=off

sudo sed -i '/vmlinuz-2.6.32/ s/$/ intremap=off/' /boot/grub/grub.conf

修正后内核启动配置类似如下(这里借用了Kernel panic with "NETDEV WATCHDOG: eth0 (bnx2): transmit queue 2 timed out" error案例)

title Parallels (2.6.32-042stab090.2)
        root (hd0,2)
        kernel /boot/vmlinuz-2.6.32-042stab090.2 ro root=UUID=2b7b0e76-2383-4423-805a-376893fadcb4 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=es_ES.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=es crashkernel=auto rhgb quiet intremap=off
        initrd /boot/initramfs-2.6.32-042stab090.2.img

依然存在Kernel Panic

重启操作系统后,dmesg消息中不再出现系统警告[Hardware Error]: This system BIOS has enabled interrupt remapping,但是却发现内核参数添加了intremap=off依然会出现Panic,但是不再出现NETDEV WATCHDOG: slave0 (bnx2): transmit queue 5 timed out这样的信息。

依然出现的是Machine Check Exception: 4 Bank 5: be00000000800400:

2017-03-30 00:40:55    [1412408.649353] [Hardware Error]: CPU 20: Machine Check Exception: 4 Bank 5: be00000000800400
2017-03-30 00:40:55    [1412408.649370] Clocksource tsc unstable (delta = -8589936066 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
2017-03-30 00:40:55    [1412408.670723] [Hardware Error]: TSC c0912bd05e005 ADDR 164b547 
2017-03-30 00:40:55    [1412408.676741] [Hardware Error]: PROCESSOR 0:206c2 TIME 1490805655 SOCKET 1 APIC 33
2017-03-30 00:40:55    [1412408.684427] [Hardware Error]: CPU 8: Machine Check Exception: 4 Bank 5: be00000000800400
2017-03-30 00:40:55    [1412408.692814] [Hardware Error]: TSC c0912bd05df33 ADDR 164b547 
2017-03-30 00:40:55    [1412408.698822] [Hardware Error]: PROCESSOR 0:206c2 TIME 1490805655 SOCKET 1 APIC 32
2017-03-30 00:40:55    [1412408.706517] [Hardware Error]: Machine check: Processor context corrupt
2017-03-30 00:40:55    [1412408.713280] Kernel panic - not syncing: Fatal Machine check
2017-03-30 00:40:55    [1412408.719091] Pid: 0, comm: swapper Tainted: GF  M       ---------------    2.6.32-358.23.2.ali1233.el5.x86_64 #1
2017-03-30 00:40:55    [1412408.729479] Call Trace:
2017-03-30 00:40:55    [1412408.732158]  <#MC>  [] ? panic+0xd6/0x1d0
2017-03-30 00:40:55    [1412408.738002]  [] ? notifier_call_chain+0x21/0x90
2017-03-30 00:40:55    [1412408.744334]  [] ? __ratelimit+0xbd/0xe0
2017-03-30 00:40:55    [1412408.749973]  [] ? print_mce+0xf1/0x150
2017-03-30 00:40:55    [1412408.755522]  [] ? mce_panic+0x20e/0x210
2017-03-30 00:40:55    [1412408.761156]  [] ? do_machine_check+0x9b8/0x9c0
2017-03-30 00:40:55    [1412408.767402]  [] ? machine_check+0x1c/0x30
2017-03-30 00:40:55    [1412408.773214]  [] ? intel_idle+0xb4/0x180
2017-03-30 00:40:55    [1412408.778846]  <>  [] ? menu_select+0xee/0x370
2017-03-30 00:40:56    [1412408.785370]  [] ? cpuidle_idle_call+0x99/0x140
2017-03-30 00:40:56    [1412408.791616]  [] ? cpu_idle+0xa8/0xe0
2017-03-30 00:40:56    [1412408.796995]  [] ? start_secondary+0x226/0x360

参考

results matching ""

    No results matching ""