wataru yamamoto
wyama****@kke*****
Thu Nov 3 13:06:50 JST 2011
Dear Matsushima-san,

This is Yamamoto. Thank you very much for your prompt reply.

(2011/11/03 (Thu) 6:39), Takehiro Matsushima wrote:
> Nice to meet you; my name is Matsushima.
> I am also a beginner who started with Heartbeat V3 + Pacemaker a few months ago.
>
> A few points caught my attention, so please let me ask about them.
>
> This may be unrelated, but is there a reason why, on the eth0 side, the netmask
> differs between the router and the servers' NICs?
> On Router2's server-network side the addresses 193-198 form one network, but the
> servers seem to sit outside of it.

My apologies, that was a typo in the diagram. Router2's address (netmask) should be:

  wrong   => aa.bb.cc.193/29
  correct => aa.bb.cc.193/28

For reference, here is the corrected network diagram:

                        +---------+
                        | Router1 |
                        +---------+
                     (aa.bb.cc.241/29)
                             |
               +--------virtual IP1-------+
               |     (aa.bb.cc.242/29)    |
               |                          |
              eth2                       eth2
       (aa.bb.cc.243/29)          (aa.bb.cc.244/29)
           +--------+                 +--------+
  (dd.ee.ff.1/28)   |                 |        |
       eth3|        |eth1         eth1|        |eth3      +---------+
    +------| hostM1 |-----------------| hostB1 |------+---| Router3 |
    |      +--------+                 +--------+      |   +---------+
    |   (aa.bb.cc.202/28)        (aa.bb.cc.203/28)    |
    |      eth0                       eth0            |
    |        |                          |             |
    |        |     (aa.bb.cc.201/28)    |             |
    |        +--------virtual IP2-------+             |
    |                     |                           |
    |              (aa.bb.cc.193/28)                  |
    |                +---------+                      |
    |                | Router2 |                      |
    |                +---------+                      |
    |                                                 |
    +-------------------------------------------------+

>
>
>> hostM1's eth2 used to behave the same way. However, after moving hostM1 and
>> hostB1 to another environment, eth0 still fails over and back, but eth2 no
>> longer does.
>
> What do you mean by "another environment" here?
> Did you change the IP addresses or the network segments, or perhaps move the
> hosts into a virtualized environment?

Router1, 2 and 3 belong to the customer's environment; I brought in hostM1 and hostB1,
which I had configured and verified elsewhere. Before the move I had also placed
Router1, 2 and 3 as dummies in my own environment for testing. Although they were
dummies, they were real hosts and used the same addresses as at the customer site
above, and the same network addresses as well.

>
>
>> Specifically, when I run ifdown eth2 on hostM1, crm_mon briefly shows FATAL in
>> the virtual-IP row, but no failover occurs and crm_mon keeps showing hostM1.
>> Moreover, when I run ifconfig on hostM1 at that point, the virtual IP
>> (aa.bb.cc.242/29) has been assigned directly to that eth2. If I then run
>> service heartbeat stop in that state and check the interfaces with ifconfig,
>> eth1 ends up with no IP address assigned at all.
>
> What does ip addr show report at that point?
> On both the failed node and the failover target, if you do not mind.
> Please check both hosts, in the normal state and when the problem occurs.
> (This may be off the mark, though.)

I have left the customer site for now, so I cannot check this at the moment. I will
be there again next week and plan to check then. However, I actually still have one
more set on hand. Its addressing scheme is slightly different, but the network layout
is exactly the same as above. This set also works correctly, and of course the set
delivered to the customer also worked fine here. I tried the ip addr show command in
that local environment. So the virtual IP can indeed be confirmed with ip addr show
on the active host; I am sorry, I did not know that. Thank you.

>
>
>> As for the crucial /var/log/ha-log, I have not collected it properly, but when
>> this eth2 failover failure occurs, a WARNING(?) message along the lines of
>> "get_failcount: ip-s ..." was being printed.
>
> I believe the logs are fairly important for understanding the situation; if
> possible, an excerpt of the log around the time of the failure might give us
> a hint.
>
> Being a beginner, I would be embarrassed if I am completely off the mark, but I
> would appreciate your reply.

You are absolutely right about the need to show the logs; the embarrassment is all
mine. In fact, I have the ha-log files from both hostM1 and hostB1. However, things
were rushed and confused at the customer site, and I made a series of changes to
various Linux-HA related settings one after another, so it became very hard to tell
which event corresponds to which part of the logs, which is why I deliberately did
not attach them.

Still, let me make an effort to include them here. I ran ifdown eth2 in my local
environment to get a feel for the log output pattern, and below I show the parts of
the customer-site logs that seem relevant. Note that I have modified the log
slightly: I inserted two marker lines, "<<point1>>" and "<<point2>>", at the places
that seem to differ from my local environment. Host names and addresses have been
anonymized to match the diagram above. Here it is:

---------------- ha-log failover err start -----------------------------
Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_monitor_10000 (call=20, rc=7, cib-update=80, confirmed=false) not running Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_graph_event: Action ip-s_monitor_10000 arrived after a completed transition Nov 2 10:02:42 hostM1 crmd: [20254]: info: abort_transition_graph: process_graph_event:486 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ip-s_monitor_10000, magic=0:7;4:5:0:db95561c-6613-41e2-b7a0-17b8701178c3, cib=0.7.38) : Inactive graph Nov 2 10:02:42 hostM1 crmd: [20254]: WARN: update_failcount: Updating failcount for ip-s on hostm1 after failed monitor: rc=7 (update=value++, time=1320195762) Nov 2 10:02:42 hostM1 attrd: [20253]: info: find_hash_entry: Creating
hash entry for fail-count-ip-s Nov 2 10:02:42 hostM1 attrd: [20253]: info: attrd_local_callback: Expanded fail-count-ip-s=value++ to 1 Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Nov 2 10:02:42 hostM1 attrd: [20253]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ip-s (1) Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_state_transition: All 2 cluster nodes are eligible to run resources. Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_pe_invoke: Query 81: Requesting the current CIB: S_POLICY_ENGINE Nov 2 10:02:42 hostM1 attrd: [20253]: info: attrd_perform_update: Sent update 44: fail-count-ip-s=1 Nov 2 10:02:42 hostM1 attrd: [20253]: info: find_hash_entry: Creating hash entry for last-failure-ip-s Nov 2 10:02:42 hostM1 attrd: [20253]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ip-s (1320195762) Nov 2 10:02:42 hostM1 attrd: [20253]: info: attrd_perform_update: Sent update 47: last-failure-ip-s=1320195762 Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_pe_invoke_callback: Invoking the PE: query=81, ref=pe_calc-dc-1320195762-53, seq=2, quorate=1 Nov 2 10:02:42 hostM1 crmd: [20254]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair, id=status-396fe8ff-762a-41bb-a3e6-da8270fa144f-fail-count-ip-s, magic=NA, cib=0.7.39) : Transient attribute: update Nov 2 10:02:42 hostM1 crmd: [20254]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair, id=status-396fe8ff-762a-41bb-a3e6-da8270fa144f-last-failure-ip-s, magic=NA, cib=0.7.40) : Transient attribute: update Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_pe_invoke: Query 82: Requesting the current CIB: S_POLICY_ENGINE Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_pe_invoke: Query 83: Requesting the current CIB: S_POLICY_ENGINE Nov 2 10:02:42 hostM1 pengine: [20268]: notice: unpack_config: On loss of CCM Quorum: Ignore Nov 2 10:02:42 hostM1 pengine: [20268]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 Nov 2 10:02:42 hostM1 pengine: [20268]: info: determine_online_status: Node hostm1 is online Nov 2 10:02:42 hostM1 pengine: [20268]: info: determine_online_status: Node hostb1 is online Nov 2 10:02:42 hostM1 pengine: [20268]: WARN: unpack_rsc_op: Processing failed op ip-s_monitor_10000 on hostm1: not running (7) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: group_print: Resource Group: hoge-grp Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: ip-g#011(ocf::heartbeat:IPaddr2):#011Started hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: ip-s#011(ocf::heartbeat:IPaddr2):#011Started hostm1 FAILED Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: hoge-r#011(ocf::hoge:hoge_proxy.ra):#011Started hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: clone_print: Clone Set: clone_ping Nov 2 10:02:42 hostM1 pengine: [20268]: notice: short_print: Started: [ hostm1 hostb1 ] Nov 2 10:02:42 hostM1 pengine: [20268]: info: get_failcount: ip-g has failed 1 times on hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: common_apply_stickiness: ip-g can fail 999999 more times on hostm1 before being forced off Nov 2 10:02:42 hostM1 pengine: [20268]: notice: RecurringOp: Start recurring monitor (10s) for ip-s on hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ip-g#011(Started 
hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Recover resource ip-s#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Restart resource hoge-r#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:0#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:1#011(Started hostb1) Nov 2 10:02:42 hostM1 crmd: [20254]: info: handle_response: pe_calc calculation pe_calc-dc-1320195762-53 is obsolete Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_pe_invoke_callback: Invoking the PE: query=83, ref=pe_calc-dc-1320195762-54, seq=2, quorate=1 Nov 2 10:02:42 hostM1 pengine: [20268]: info: process_pe_message: Transition 6: PEngine Input stored in: /var/lib/pengine/pe-input-614.bz2 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: unpack_config: On loss of CCM Quorum: Ignore Nov 2 10:02:42 hostM1 pengine: [20268]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 Nov 2 10:02:42 hostM1 pengine: [20268]: info: determine_online_status: Node hostm1 is online Nov 2 10:02:42 hostM1 pengine: [20268]: info: determine_online_status: Node hostb1 is online Nov 2 10:02:42 hostM1 pengine: [20268]: WARN: unpack_rsc_op: Processing failed op ip-s_monitor_10000 on hostm1: not running (7) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: group_print: Resource Group: hoge-grp Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: ip-g#011(ocf::heartbeat:IPaddr2):#011Started hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: ip-s#011(ocf::heartbeat:IPaddr2):#011Started hostm1 FAILED Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: hoge-r#011(ocf::hoge:hoge_proxy.ra):#011Started hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: clone_print: Clone Set: clone_ping Nov 2 10:02:42 hostM1 pengine: [20268]: notice: short_print: Started: [ hostm1 hostb1 ] Nov 2 10:02:42 hostM1 pengine: [20268]: info: get_failcount: ip-g has failed 1 times on hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: common_apply_stickiness: ip-g can fail 999999 more times on hostm1 before being forced off Nov 2 10:02:42 hostM1 pengine: [20268]: info: get_failcount: ip-s has failed 1 times on hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: common_apply_stickiness: ip-s can fail 999999 more times on hostm1 before being forced off Nov 2 10:02:42 hostM1 pengine: [20268]: notice: RecurringOp: Start recurring monitor (10s) for ip-s on hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ip-g#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Recover resource ip-s#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Restart resource hoge-r#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:0#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:1#011(Started hostb1) Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] Nov 2 10:02:42 hostM1 crmd: [20254]: info: unpack_graph: Unpacked transition 7: 11 actions in 11 synapses Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_te_invoke: Processing graph 7 (ref=pe_calc-dc-1320195762-54) derived from /var/lib/pengine/pe-input-615.bz2 Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_pseudo_action: 
Pseudo action 18 fired and confirmed Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 14: stop hoge-r_stop_0 on hostm1 (local) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: cancel_op: operation monitor[22] on ocf::hoge_proxy.ra::hoge-r for client 20254, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart] CRM_meta_interval=[15000] CRM_meta_timeout=[20000] cancelled Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=14:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=hoge-r_stop_0 ) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: rsc:hoge-r:23: stop Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_monitor_15000 (call=22, status=1, cib-update=0, confirmed=true) Cancelled Nov 2 10:02:42 hostM1 hoge_proxy.ra[22228]: INFO: Stopping hoge_proxy ... Nov 2 10:02:42 hostM1 pengine: [20268]: info: process_pe_message: Transition 7: PEngine Input stored in: /var/lib/pengine/pe-input-615.bz2 Nov 2 10:02:42 hostM1 hoge_proxy.ra[22228]: INFO: hoge_proxy stopped Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_stop_0 (call=23, rc=0, cib-update=84, confirmed=true) ok Nov 2 10:02:42 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_stop_0 (14) confirmed on hostm1 (rc=0) Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 4: stop ip-s_stop_0 on hostm1 (local) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: cancel_op: operation monitor[20] on ocf::IPaddr2::ip-s for client 20254, its parameters: CRM_meta_name=[monitor] cidr_netmask=[29] crm_feature_set=[3.0.1] CRM_meta_timeout=[20000] CRM_meta_interval=[10000] nic=[eth2] ip=[aa.bb.cc.242] cancelled Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=4:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=ip-s_stop_0 ) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: rsc:ip-s:24: stop Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_monitor_10000 (call=20, status=1, cib-update=0, confirmed=true) Cancelled Nov 2 10:02:42 hostM1 IPaddr2[22259]: INFO: IP status = no, IP_CIP= Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_stop_0 (call=24, rc=0, cib-update=85, confirmed=true) ok Nov 2 10:02:42 hostM1 crmd: [20254]: info: match_graph_event: Action ip-s_stop_0 (4) confirmed on hostm1 (rc=0) Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 19 fired and confirmed Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 7 fired and confirmed Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 16 fired and confirmed Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 13: start ip-s_start_0 on hostm1 (local) Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=13:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=ip-s_start_0 ) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: rsc:ip-s:25: start Nov 2 10:02:42 hostM1 IPaddr2[22296]: INFO: ip -f inet addr add aa.bb.cc.242/29 brd aa.bb.cc.247 dev eth2 Nov 2 10:02:42 hostM1 IPaddr2[22296]: INFO: ip link set eth2 up Nov 2 10:02:42 hostM1 IPaddr2[22296]: INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-aa.bb.cc.242 eth2 aa.bb.cc.242 auto not_used not_used Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_start_0 (call=25, rc=0, cib-update=86, confirmed=true) ok Nov 2 10:02:42 hostM1 crmd: [20254]: info: match_graph_event: 
Action ip-s_start_0 (13) confirmed on hostm1 (rc=0) Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 3: monitor ip-s_monitor_10000 on hostm1 (local) Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=3:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=ip-s_monitor_10000 ) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: rsc:ip-s:26: monitor Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 15: start hoge-r_start_0 on hostm1 (local) Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=15:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=hoge-r_start_0 ) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: rsc:hoge-r:27: start Nov 2 10:02:42 hostM1 hoge_proxy.ra[22355]: INFO: Starting hoge_proxy ... Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_monitor_10000 (call=26, rc=0, cib-update=87, confirmed=false) ok Nov 2 10:02:42 hostM1 crmd: [20254]: info: match_graph_event: Action ip-s_monitor_10000 (3) confirmed on hostm1 (rc=0) Nov 2 10:02:43 hostM1 hoge_proxy.ra[22355]: INFO: hoge_proxy started Nov 2 10:02:43 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_start_0 (call=27, rc=0, cib-update=88, confirmed=true) ok Nov 2 10:02:43 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_start_0 (15) confirmed on hostm1 (rc=0) Nov 2 10:02:43 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 17 fired and confirmed Nov 2 10:02:43 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 5: monitor hoge-r_monitor_15000 on hostm1 (local) Nov 2 10:02:43 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=5:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=hoge-r_monitor_15000 ) Nov 2 10:02:43 hostM1 lrmd: [20251]: info: rsc:hoge-r:28: monitor Nov 2 10:02:43 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_monitor_15000 (call=28, rc=0, cib-update=89, confirmed=false) ok Nov 2 10:02:43 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_monitor_15000 (5) confirmed on hostm1 (rc=0) Nov 2 10:02:43 hostM1 crmd: [20254]: info: run_graph: ==================================================== Nov 2 10:02:43 hostM1 crmd: [20254]: notice: run_graph: Transition 7 (Complete=11, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-615.bz2): Complete Nov 2 10:02:43 hostM1 crmd: [20254]: info: te_graph_trigger: Transition 7 is now complete Nov 2 10:02:43 hostM1 crmd: [20254]: info: notify_crmd: Transition 7 status: done - <null> Nov 2 10:02:43 hostM1 crmd: [20254]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Nov 2 10:02:43 hostM1 crmd: [20254]: info: do_state_transition: Starting PEngine Recheck Timer <<point1>> Nov 2 10:02:45 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.241 is unreachable (read) Nov 2 10:02:46 hostM1 pingd: [20434]: info: ping_read: Retrying... 
Nov 2 10:02:46 hostM1 lrmd: [20251]: info: RA output: (ip-s:start:stderr) ARPING aa.bb.cc.242 from aa.bb.cc.242 eth2#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s) Nov 2 10:04:01 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:02 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:03 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:04 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:05 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:06 hostM1 attrd: [20253]: info: attrd_trigger_update: Sending flush op to all hosts for: default_ping_set (200) Nov 2 10:04:06 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:07 hostM1 attrd: [20253]: info: attrd_ha_callback: flush message from hostm1 Nov 2 10:04:07 hostM1 attrd: [20253]: info: attrd_perform_update: Sent update 49: default_ping_set=200 Nov 2 10:04:07 hostM1 crmd: [20254]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair, id=status-396fe8ff-762a-41bb-a3e6-da8270fa144f-default_ping_set, magic=NA, cib=0.7.47) : Transient attribute: update Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_state_transition: All 2 cluster nodes are eligible to run resources. Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_pe_invoke: Query 90: Requesting the current CIB: S_POLICY_ENGINE Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_pe_invoke_callback: Invoking the PE: query=90, ref=pe_calc-dc-1320195847-61, seq=2, quorate=1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: unpack_config: On loss of CCM Quorum: Ignore Nov 2 10:04:07 hostM1 pengine: [20268]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 Nov 2 10:04:07 hostM1 pengine: [20268]: info: determine_online_status: Node hostm1 is online Nov 2 10:04:07 hostM1 pengine: [20268]: info: determine_online_status: Node hostb1 is online Nov 2 10:04:07 hostM1 pengine: [20268]: notice: group_print: Resource Group: hoge-grp Nov 2 10:04:07 hostM1 pengine: [20268]: notice: native_print: ip-g#011(ocf::heartbeat:IPaddr2):#011Started hostm1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: native_print: ip-s#011(ocf::heartbeat:IPaddr2):#011Started hostm1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: native_print: hoge-r#011(ocf::hoge:hoge_proxy.ra):#011Started hostm1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: clone_print: Clone Set: clone_ping Nov 2 10:04:07 hostM1 pengine: [20268]: notice: short_print: Started: [ hostm1 hostb1 ] Nov 2 10:04:07 hostM1 pengine: [20268]: info: get_failcount: ip-g has failed 1 times on hostm1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: common_apply_stickiness: ip-g can fail 999999 more times on hostm1 before being forced off Nov 2 10:04:07 hostM1 pengine: [20268]: info: get_failcount: ip-s has failed 1 times on hostm1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: common_apply_stickiness: ip-s can fail 999999 more times on hostm1 before being forced off Nov 2 10:04:07 hostM1 pengine: [20268]: notice: RecurringOp: Start recurring monitor (10s) for ip-g on hostb1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: RecurringOp: Start recurring 
monitor (10s) for ip-s on hostb1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: RecurringOp: Start recurring monitor (15s) for hoge-r on hostb1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: LogActions: Move resource ip-g#011(Started hostm1 -> hostb1) Nov 2 10:04:07 hostM1 pengine: [20268]: notice: LogActions: Move resource ip-s#011(Started hostm1 -> hostb1) Nov 2 10:04:07 hostM1 pengine: [20268]: notice: LogActions: Move resource hoge-r#011(Started hostm1 -> hostb1) Nov 2 10:04:07 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:0#011(Started hostm1) Nov 2 10:04:07 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:1#011(Started hostb1) Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] Nov 2 10:04:07 hostM1 crmd: [20254]: info: unpack_graph: Unpacked transition 8: 14 actions in 14 synapses Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_te_invoke: Processing graph 8 (ref=pe_calc-dc-1320195847-61) derived from /var/lib/pengine/pe-input-616.bz2 Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 21 fired and confirmed Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 16: stop hoge-r_stop_0 on hostm1 (local) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: cancel_op: operation monitor[28] on ocf::hoge_proxy.ra::hoge-r for client 20254, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart] CRM_meta_interval=[15000] CRM_meta_timeout=[20000] cancelled Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=16:8:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=hoge-r_stop_0 ) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: rsc:hoge-r:29: stop Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_monitor_15000 (call=28, status=1, cib-update=0, confirmed=true) Cancelled Nov 2 10:04:07 hostM1 hoge_proxy.ra[23276]: INFO: Stopping hoge_proxy ... 
Nov 2 10:04:07 hostM1 pengine: [20268]: info: process_pe_message: Transition 8: PEngine Input stored in: /var/lib/pengine/pe-input-616.bz2 Nov 2 10:04:07 hostM1 hoge_proxy.ra[23276]: INFO: hoge_proxy stopped Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_stop_0 (call=29, rc=0, cib-update=91, confirmed=true) ok Nov 2 10:04:07 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_stop_0 (16) confirmed on hostm1 (rc=0) Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 13: stop ip-s_stop_0 on hostm1 (local) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: cancel_op: operation monitor[26] on ocf::IPaddr2::ip-s for client 20254, its parameters: CRM_meta_name=[monitor] cidr_netmask=[29] crm_feature_set=[3.0.1] CRM_meta_timeout=[20000] CRM_meta_interval=[10000] nic=[eth2] ip=[aa.bb.cc.242] cancelled Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=13:8:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=ip-s_stop_0 ) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: rsc:ip-s:30: stop Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_monitor_10000 (call=26, status=1, cib-update=0, confirmed=true) Cancelled Nov 2 10:04:07 hostM1 IPaddr2[23307]: INFO: IP status = ok, IP_CIP= Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_stop_0 (call=30, rc=0, cib-update=92, confirmed=true) ok Nov 2 10:04:07 hostM1 crmd: [20254]: info: match_graph_event: Action ip-s_stop_0 (13) confirmed on hostm1 (rc=0) Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 10: stop ip-g_stop_0 on hostm1 (local) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: cancel_op: operation monitor[18] on ocf::IPaddr2::ip-g for client 20254, its parameters: CRM_meta_name=[monitor] cidr_netmask=[28] crm_feature_set=[3.0.1] CRM_meta_timeout=[20000] CRM_meta_interval=[10000] nic=[eth0] ip=[aa.bb.cc.201] cancelled Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=10:8:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=ip-g_stop_0 ) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: rsc:ip-g:31: stop Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-g_monitor_10000 (call=18, status=1, cib-update=0, confirmed=true) Cancelled Nov 2 10:04:07 hostM1 IPaddr2[23347]: INFO: IP status = ok, IP_CIP= Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-g_stop_0 (call=31, rc=0, cib-update=93, confirmed=true) ok Nov 2 10:04:07 hostM1 crmd: [20254]: info: match_graph_event: Action ip-g_stop_0 (10) confirmed on hostm1 (rc=0) Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 22 fired and confirmed Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 6 fired and confirmed Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 19 fired and confirmed Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 11: start ip-g_start_0 on hostb1 <<point2>> Nov 2 10:04:07 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:07 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.241 is unreachable (write) Nov 2 10:04:08 hostM1 crmd: [20254]: info: match_graph_event: Action ip-g_start_0 (11) confirmed on hostb1 (rc=0) Nov 2 10:04:08 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 12: monitor ip-g_monitor_10000 on hostb1 Nov 2 10:04:08 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 14: start 
ip-s_start_0 on hostb1 Nov 2 10:04:08 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:08 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.241 is unreachable (write) Nov 2 10:04:09 hostM1 crmd: [20254]: info: match_graph_event: Action ip-g_monitor_10000 (12) confirmed on hostb1 (rc=0) Nov 2 10:04:09 hostM1 crmd: [20254]: info: match_graph_event: Action ip-s_start_0 (14) confirmed on hostb1 (rc=0) Nov 2 10:04:09 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 15: monitor ip-s_monitor_10000 on hostb1 Nov 2 10:04:09 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 17: start hoge-r_start_0 on hostb1 Nov 2 10:04:09 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:09 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.193 is unreachable (write) Nov 2 10:04:10 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:10 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.193 is unreachable (write) Nov 2 10:04:11 hostM1 crmd: [20254]: info: match_graph_event: Action ip-s_monitor_10000 (15) confirmed on hostb1 (rc=0) Nov 2 10:04:11 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_start_0 (17) confirmed on hostb1 (rc=0) Nov 2 10:04:11 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 20 fired and confirmed Nov 2 10:04:11 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 18: monitor hoge-r_monitor_15000 on hostb1 Nov 2 10:04:11 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:11 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (write) Nov 2 10:04:12 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_monitor_15000 (18) confirmed on hostb1 (rc=0) Nov 2 10:04:12 hostM1 crmd: [20254]: info: run_graph: ==================================================== Nov 2 10:04:12 hostM1 crmd: [20254]: notice: run_graph: Transition 8 (Complete=14, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-616.bz2): Complete Nov 2 10:04:12 hostM1 crmd: [20254]: info: te_graph_trigger: Transition 8 is now complete Nov 2 10:04:12 hostM1 crmd: [20254]: info: notify_crmd: Transition 8 status: done - <null> Nov 2 10:04:12 hostM1 crmd: [20254]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Nov 2 10:04:12 hostM1 crmd: [20254]: info: do_state_transition: Starting PEngine Recheck Timer Nov 2 10:04:12 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:12 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (write) Nov 2 10:04:12 hostM1 attrd: [20253]: info: attrd_trigger_update: Sending flush op to all hosts for: default_ping_set (100) Nov 2 10:04:13 hostM1 attrd: [20253]: info: attrd_ha_callback: flush message from hostm1 Nov 2 10:04:13 hostM1 attrd: [20253]: info: attrd_perform_update: Sent update 51: default_ping_set=100 Nov 2 10:04:13 hostM1 crmd: [20254]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair, id=status-396fe8ff-762a-41bb-a3e6-da8270fa144f-default_ping_set, magic=NA, cib=0.7.57) : Transient attribute: update Nov 2 10:04:13 hostM1 crmd: [20254]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
---------------- ha-log failover err end -----------------------------

■ About the log above

・Regarding <<point1>>:
>Nov 2 10:04:01 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read)
>Nov 2 10:04:02 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read)
>...
Here, pings to Node dd.ee.ff.1 become unreachable (read). Messages like these do not
appear in my local environment.

・Regarding <<point2>>:
>Nov 2 10:04:07 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101)
>Nov 2 10:04:07 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.241 is unreachable (write)
>Nov 2 10:04:08 hostM1 crmd: [20254]: info: match_graph_event: Action ip-g_start_0 (11) confirmed on hostb1 (rc=0)
>Nov 2 10:04:08 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 12: monitor ip-g_monitor_10000 on hostb1
>Nov 2 10:04:08 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 14: start ip-s_start_0 on hostb1
Here, pings to Node aa.bb.cc.241 become unreachable (write). From this point on, this
pattern seems to repeat in many cases.

■ My thoughts

It looks as if, after the failover, pings to the pingd targets stop getting through.
At least for Router1 (aa.bb.cc.241), which sits beyond eth2, this seems to happen in
many cases. For what it is worth, I have confirmed that pings to aa.bb.cc.241 do get
through while heartbeat is not running, and they were also OK at that time with every
interface other than eth2 brought down. After heartbeat is started they also get
through, as long as no failover occurs.

From here on I am only guessing, since I do not really understand this: with Linux-HA,
even when the virtual IP is kept across a failover, the MAC address behind those
packets changes; could it be that Router1 and the others dislike this and do not
answer pings (or at least not right away), or that some router setting causes this?
Router1, 2 and 3 belong to the customer, so I cannot check them right now, but
judging from their MAC addresses they appear to be Cisco routers or switches. I will
check the next time I visit the customer site, but if this rings a bell for anyone,
please let me know. (A rough way to check this from the host side is sketched in the
P.S. below.)

Also, in hindsight, I had "auto_failback on" in ha.cf, which may have made the
situation harder to grasp. I am thinking of turning it off and trying again without
failback (see the P.P.S. below).

That is all. I would be grateful if you could look into this.

--- yamamoto
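
P.S. A supplementary note on the MAC-address guess above. As the log shows, the
IPaddr2 agent runs send_arp after bringing the virtual IP up, so the routers should
normally learn the new MAC from those gratuitous ARPs. The following is only a rough
sketch of commands I could run from the hosts next time (it assumes arping from
iputils and tcpdump are installed; interface and address names follow the diagram
above):

  # On the node that currently holds the virtual IP (check with crm_mon):
  ip addr show dev eth2            # the VIP aa.bb.cc.242/29 should be listed here

  # In a second terminal, watch ARP traffic on eth2 while a failover/failback
  # is triggered; the gratuitous ARPs sent by send_arp for aa.bb.cc.242 should
  # show up here together with the sender MAC address.
  tcpdump -n -e -i eth2 arp

  # Afterwards, check whether Router1 still answers ARP and ICMP when the
  # requests originate from the virtual IP.
  arping -I eth2 -c 3 aa.bb.cc.241
  ping -I aa.bb.cc.242 -c 3 aa.bb.cc.241

If the gratuitous ARPs are visible on the wire but Router1 still stops answering for
a while, that would point at the router side (for example an ARP cache or
port-security setting) rather than at Pacemaker itself.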
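
P.P.S. On the auto_failback point: my understanding is that with Heartbeat V3 in
Pacemaker (crm) mode, auto_failback in ha.cf is not what decides failback; resource
stickiness on the Pacemaker side is. If I go ahead with suppressing failback, I am
thinking of something along the following lines, which is only a sketch and would
have to be adapted to the real configuration:

  # Keep resources on the node they failed over to instead of moving them
  # back automatically (cluster-wide default stickiness).
  crm configure rsc_defaults resource-stickiness=INFINITY

With INFINITY the group would stay on hostB1 after a failover until moved back by
hand, which should make it easier to see whether the ping problem is tied to the
failover itself or to the failback.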