wataru yamamoto
wyama****@kke*****
Thu Nov 3 13:06:50 JST 2011
Dear Matsushima-san,

This is Yamamoto. Thank you very much for your prompt reply.

(2011/11/03 (Thu) 6:39), Takehiro Matsushima wrote:
> Nice to meet you; my name is Matsushima.
> I am also a beginner who started with Heartbeat V3 + Pacemaker a few months ago.
>
> A few points caught my attention, so please let me ask about them.
>
> This may be unrelated, but is there a reason why, on the eth0 side, the netmask
> differs between the router and the servers' NICs?
> On Router2's server-network side the addresses 193-198 form one network, but the
> servers seem to sit outside of it.

My apologies, that was a typo in the diagram. Router2's address (netmask) should be:

  wrong   => aa.bb.cc.193/29
  correct => aa.bb.cc.193/28

For reference, here is the corrected network diagram:

                        +---------+
                        | Router1 |
                        +---------+
                     (aa.bb.cc.241/29)
                             |
               +--------virtual IP1-------+
               |     (aa.bb.cc.242/29)    |
               |                          |
              eth2                       eth2
       (aa.bb.cc.243/29)          (aa.bb.cc.244/29)
           +--------+                 +--------+
  (dd.ee.ff.1/28)   |                 |        |
       eth3|        |eth1         eth1|        |eth3      +---------+
    +------| hostM1 |-----------------| hostB1 |------+---| Router3 |
    |      +--------+                 +--------+      |   +---------+
    |   (aa.bb.cc.202/28)        (aa.bb.cc.203/28)    |
    |      eth0                       eth0            |
    |        |                          |             |
    |        |     (aa.bb.cc.201/28)    |             |
    |        +--------virtual IP2-------+             |
    |                     |                           |
    |              (aa.bb.cc.193/28)                  |
    |                +---------+                      |
    |                | Router2 |                      |
    |                +---------+                      |
    |                                                 |
    +-------------------------------------------------+

>
>
>> hostM1's eth2 used to behave the same way. However, after moving hostM1 and
>> hostB1 to another environment, eth0 still fails over and back, but eth2 no
>> longer does.
>
> What do you mean by "another environment" here?
> Did you change the IP addresses or the network segments, or perhaps move the
> hosts into a virtualized environment?

Router1, 2 and 3 belong to the customer's environment; I brought in hostM1 and hostB1,
which I had configured and verified elsewhere. Before the move I had also placed
Router1, 2 and 3 as dummies in my own environment for testing. Although they were
dummies, they were real hosts and used the same addresses as at the customer site
above, and the same network addresses as well.

>
>
>> Specifically, when I run ifdown eth2 on hostM1, crm_mon briefly shows FATAL in
>> the virtual-IP row, but no failover occurs and crm_mon keeps showing hostM1.
>> Moreover, when I run ifconfig on hostM1 at that point, the virtual IP
>> (aa.bb.cc.242/29) has been assigned directly to that eth2. If I then run
>> service heartbeat stop in that state and check the interfaces with ifconfig,
>> eth1 ends up with no IP address assigned at all.
>
> What does ip addr show report at that point?
> On both the failed node and the failover target, if you do not mind.
> Please check both hosts, in the normal state and when the problem occurs.
> (This may be off the mark, though.)

I have left the customer site for now, so I cannot check this at the moment. I will
be there again next week and plan to check then. However, I actually still have one
more set on hand. Its addressing scheme is slightly different, but the network layout
is exactly the same as above. This set also works correctly, and of course the set
delivered to the customer also worked fine here. I tried the ip addr show command in
that local environment. So the virtual IP can indeed be confirmed with ip addr show
on the active host; I am sorry, I did not know that. Thank you.

>
>
>> As for the crucial /var/log/ha-log, I have not collected it properly, but when
>> this eth2 failover failure occurs, a WARNING(?) message along the lines of
>> "get_failcount: ip-s ..." was being printed.
>
> I believe the logs are fairly important for understanding the situation; if
> possible, an excerpt of the log around the time of the failure might give us
> a hint.
>
> Being a beginner, I would be embarrassed if I am completely off the mark, but I
> would appreciate your reply.

You are absolutely right about the need to show the logs; the embarrassment is all
mine. In fact, I have the ha-log files from both hostM1 and hostB1. However, things
were rushed and confused at the customer site, and I made a series of changes to
various Linux-HA related settings one after another, so it became very hard to tell
which event corresponds to which part of the logs, which is why I deliberately did
not attach them.

Still, let me make an effort to include them here. I ran ifdown eth2 in my local
environment to get a feel for the log output pattern, and below I show the parts of
the customer-site logs that seem relevant. Note that I have modified the log
slightly: I inserted two marker lines, "<<point1>>" and "<<point2>>", at the places
that seem to differ from my local environment. Host names and addresses have been
anonymized to match the diagram above. Here it is:

---------------- ha-log failover err start -----------------------------
Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_monitor_10000 (call=20, rc=7, cib-update=80, confirmed=false) not running Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_graph_event: Action ip-s_monitor_10000 arrived after a completed transition Nov 2 10:02:42 hostM1 crmd: [20254]: info: abort_transition_graph: process_graph_event:486 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ip-s_monitor_10000, magic=0:7;4:5:0:db95561c-6613-41e2-b7a0-17b8701178c3, cib=0.7.38) : Inactive graph Nov 2 10:02:42 hostM1 crmd: [20254]: WARN: update_failcount: Updating failcount for ip-s on hostm1 after failed monitor: rc=7 (update=value++, time=1320195762) Nov 2 10:02:42 hostM1 attrd: [20253]: info: find_hash_entry: Creating
hash entry for fail-count-ip-s Nov 2 10:02:42 hostM1 attrd: [20253]: info: attrd_local_callback: Expanded fail-count-ip-s=value++ to 1 Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Nov 2 10:02:42 hostM1 attrd: [20253]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ip-s (1) Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_state_transition: All 2 cluster nodes are eligible to run resources. Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_pe_invoke: Query 81: Requesting the current CIB: S_POLICY_ENGINE Nov 2 10:02:42 hostM1 attrd: [20253]: info: attrd_perform_update: Sent update 44: fail-count-ip-s=1 Nov 2 10:02:42 hostM1 attrd: [20253]: info: find_hash_entry: Creating hash entry for last-failure-ip-s Nov 2 10:02:42 hostM1 attrd: [20253]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ip-s (1320195762) Nov 2 10:02:42 hostM1 attrd: [20253]: info: attrd_perform_update: Sent update 47: last-failure-ip-s=1320195762 Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_pe_invoke_callback: Invoking the PE: query=81, ref=pe_calc-dc-1320195762-53, seq=2, quorate=1 Nov 2 10:02:42 hostM1 crmd: [20254]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair, id=status-396fe8ff-762a-41bb-a3e6-da8270fa144f-fail-count-ip-s, magic=NA, cib=0.7.39) : Transient attribute: update Nov 2 10:02:42 hostM1 crmd: [20254]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair, id=status-396fe8ff-762a-41bb-a3e6-da8270fa144f-last-failure-ip-s, magic=NA, cib=0.7.40) : Transient attribute: update Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_pe_invoke: Query 82: Requesting the current CIB: S_POLICY_ENGINE Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_pe_invoke: Query 83: Requesting the current CIB: S_POLICY_ENGINE Nov 2 10:02:42 hostM1 pengine: [20268]: notice: unpack_config: On loss of CCM Quorum: Ignore Nov 2 10:02:42 hostM1 pengine: [20268]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 Nov 2 10:02:42 hostM1 pengine: [20268]: info: determine_online_status: Node hostm1 is online Nov 2 10:02:42 hostM1 pengine: [20268]: info: determine_online_status: Node hostb1 is online Nov 2 10:02:42 hostM1 pengine: [20268]: WARN: unpack_rsc_op: Processing failed op ip-s_monitor_10000 on hostm1: not running (7) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: group_print: Resource Group: hoge-grp Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: ip-g#011(ocf::heartbeat:IPaddr2):#011Started hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: ip-s#011(ocf::heartbeat:IPaddr2):#011Started hostm1 FAILED Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: hoge-r#011(ocf::hoge:hoge_proxy.ra):#011Started hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: clone_print: Clone Set: clone_ping Nov 2 10:02:42 hostM1 pengine: [20268]: notice: short_print: Started: [ hostm1 hostb1 ] Nov 2 10:02:42 hostM1 pengine: [20268]: info: get_failcount: ip-g has failed 1 times on hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: common_apply_stickiness: ip-g can fail 999999 more times on hostm1 before being forced off Nov 2 10:02:42 hostM1 pengine: [20268]: notice: RecurringOp: Start recurring monitor (10s) for ip-s on hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ip-g#011(Started 
hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Recover resource ip-s#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Restart resource hoge-r#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:0#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:1#011(Started hostb1) Nov 2 10:02:42 hostM1 crmd: [20254]: info: handle_response: pe_calc calculation pe_calc-dc-1320195762-53 is obsolete Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_pe_invoke_callback: Invoking the PE: query=83, ref=pe_calc-dc-1320195762-54, seq=2, quorate=1 Nov 2 10:02:42 hostM1 pengine: [20268]: info: process_pe_message: Transition 6: PEngine Input stored in: /var/lib/pengine/pe-input-614.bz2 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: unpack_config: On loss of CCM Quorum: Ignore Nov 2 10:02:42 hostM1 pengine: [20268]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 Nov 2 10:02:42 hostM1 pengine: [20268]: info: determine_online_status: Node hostm1 is online Nov 2 10:02:42 hostM1 pengine: [20268]: info: determine_online_status: Node hostb1 is online Nov 2 10:02:42 hostM1 pengine: [20268]: WARN: unpack_rsc_op: Processing failed op ip-s_monitor_10000 on hostm1: not running (7) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: group_print: Resource Group: hoge-grp Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: ip-g#011(ocf::heartbeat:IPaddr2):#011Started hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: ip-s#011(ocf::heartbeat:IPaddr2):#011Started hostm1 FAILED Nov 2 10:02:42 hostM1 pengine: [20268]: notice: native_print: hoge-r#011(ocf::hoge:hoge_proxy.ra):#011Started hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: clone_print: Clone Set: clone_ping Nov 2 10:02:42 hostM1 pengine: [20268]: notice: short_print: Started: [ hostm1 hostb1 ] Nov 2 10:02:42 hostM1 pengine: [20268]: info: get_failcount: ip-g has failed 1 times on hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: common_apply_stickiness: ip-g can fail 999999 more times on hostm1 before being forced off Nov 2 10:02:42 hostM1 pengine: [20268]: info: get_failcount: ip-s has failed 1 times on hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: common_apply_stickiness: ip-s can fail 999999 more times on hostm1 before being forced off Nov 2 10:02:42 hostM1 pengine: [20268]: notice: RecurringOp: Start recurring monitor (10s) for ip-s on hostm1 Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ip-g#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Recover resource ip-s#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Restart resource hoge-r#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:0#011(Started hostm1) Nov 2 10:02:42 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:1#011(Started hostb1) Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] Nov 2 10:02:42 hostM1 crmd: [20254]: info: unpack_graph: Unpacked transition 7: 11 actions in 11 synapses Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_te_invoke: Processing graph 7 (ref=pe_calc-dc-1320195762-54) derived from /var/lib/pengine/pe-input-615.bz2 Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_pseudo_action: 
Pseudo action 18 fired and confirmed Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 14: stop hoge-r_stop_0 on hostm1 (local) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: cancel_op: operation monitor[22] on ocf::hoge_proxy.ra::hoge-r for client 20254, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart] CRM_meta_interval=[15000] CRM_meta_timeout=[20000] cancelled Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=14:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=hoge-r_stop_0 ) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: rsc:hoge-r:23: stop Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_monitor_15000 (call=22, status=1, cib-update=0, confirmed=true) Cancelled Nov 2 10:02:42 hostM1 hoge_proxy.ra[22228]: INFO: Stopping hoge_proxy ... Nov 2 10:02:42 hostM1 pengine: [20268]: info: process_pe_message: Transition 7: PEngine Input stored in: /var/lib/pengine/pe-input-615.bz2 Nov 2 10:02:42 hostM1 hoge_proxy.ra[22228]: INFO: hoge_proxy stopped Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_stop_0 (call=23, rc=0, cib-update=84, confirmed=true) ok Nov 2 10:02:42 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_stop_0 (14) confirmed on hostm1 (rc=0) Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 4: stop ip-s_stop_0 on hostm1 (local) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: cancel_op: operation monitor[20] on ocf::IPaddr2::ip-s for client 20254, its parameters: CRM_meta_name=[monitor] cidr_netmask=[29] crm_feature_set=[3.0.1] CRM_meta_timeout=[20000] CRM_meta_interval=[10000] nic=[eth2] ip=[aa.bb.cc.242] cancelled Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=4:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=ip-s_stop_0 ) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: rsc:ip-s:24: stop Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_monitor_10000 (call=20, status=1, cib-update=0, confirmed=true) Cancelled Nov 2 10:02:42 hostM1 IPaddr2[22259]: INFO: IP status = no, IP_CIP= Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_stop_0 (call=24, rc=0, cib-update=85, confirmed=true) ok Nov 2 10:02:42 hostM1 crmd: [20254]: info: match_graph_event: Action ip-s_stop_0 (4) confirmed on hostm1 (rc=0) Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 19 fired and confirmed Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 7 fired and confirmed Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 16 fired and confirmed Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 13: start ip-s_start_0 on hostm1 (local) Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=13:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=ip-s_start_0 ) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: rsc:ip-s:25: start Nov 2 10:02:42 hostM1 IPaddr2[22296]: INFO: ip -f inet addr add aa.bb.cc.242/29 brd aa.bb.cc.247 dev eth2 Nov 2 10:02:42 hostM1 IPaddr2[22296]: INFO: ip link set eth2 up Nov 2 10:02:42 hostM1 IPaddr2[22296]: INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-aa.bb.cc.242 eth2 aa.bb.cc.242 auto not_used not_used Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_start_0 (call=25, rc=0, cib-update=86, confirmed=true) ok Nov 2 10:02:42 hostM1 crmd: [20254]: info: match_graph_event: 
Action ip-s_start_0 (13) confirmed on hostm1 (rc=0) Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 3: monitor ip-s_monitor_10000 on hostm1 (local) Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=3:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=ip-s_monitor_10000 ) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: rsc:ip-s:26: monitor Nov 2 10:02:42 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 15: start hoge-r_start_0 on hostm1 (local) Nov 2 10:02:42 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=15:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=hoge-r_start_0 ) Nov 2 10:02:42 hostM1 lrmd: [20251]: info: rsc:hoge-r:27: start Nov 2 10:02:42 hostM1 hoge_proxy.ra[22355]: INFO: Starting hoge_proxy ... Nov 2 10:02:42 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_monitor_10000 (call=26, rc=0, cib-update=87, confirmed=false) ok Nov 2 10:02:42 hostM1 crmd: [20254]: info: match_graph_event: Action ip-s_monitor_10000 (3) confirmed on hostm1 (rc=0) Nov 2 10:02:43 hostM1 hoge_proxy.ra[22355]: INFO: hoge_proxy started Nov 2 10:02:43 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_start_0 (call=27, rc=0, cib-update=88, confirmed=true) ok Nov 2 10:02:43 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_start_0 (15) confirmed on hostm1 (rc=0) Nov 2 10:02:43 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 17 fired and confirmed Nov 2 10:02:43 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 5: monitor hoge-r_monitor_15000 on hostm1 (local) Nov 2 10:02:43 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=5:7:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=hoge-r_monitor_15000 ) Nov 2 10:02:43 hostM1 lrmd: [20251]: info: rsc:hoge-r:28: monitor Nov 2 10:02:43 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_monitor_15000 (call=28, rc=0, cib-update=89, confirmed=false) ok Nov 2 10:02:43 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_monitor_15000 (5) confirmed on hostm1 (rc=0) Nov 2 10:02:43 hostM1 crmd: [20254]: info: run_graph: ==================================================== Nov 2 10:02:43 hostM1 crmd: [20254]: notice: run_graph: Transition 7 (Complete=11, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-615.bz2): Complete Nov 2 10:02:43 hostM1 crmd: [20254]: info: te_graph_trigger: Transition 7 is now complete Nov 2 10:02:43 hostM1 crmd: [20254]: info: notify_crmd: Transition 7 status: done - <null> Nov 2 10:02:43 hostM1 crmd: [20254]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Nov 2 10:02:43 hostM1 crmd: [20254]: info: do_state_transition: Starting PEngine Recheck Timer <<point1>> Nov 2 10:02:45 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.241 is unreachable (read) Nov 2 10:02:46 hostM1 pingd: [20434]: info: ping_read: Retrying... 
Nov 2 10:02:46 hostM1 lrmd: [20251]: info: RA output: (ip-s:start:stderr) ARPING aa.bb.cc.242 from aa.bb.cc.242 eth2#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s) Nov 2 10:04:01 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:02 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:03 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:04 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:05 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:06 hostM1 attrd: [20253]: info: attrd_trigger_update: Sending flush op to all hosts for: default_ping_set (200) Nov 2 10:04:06 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read) Nov 2 10:04:07 hostM1 attrd: [20253]: info: attrd_ha_callback: flush message from hostm1 Nov 2 10:04:07 hostM1 attrd: [20253]: info: attrd_perform_update: Sent update 49: default_ping_set=200 Nov 2 10:04:07 hostM1 crmd: [20254]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair, id=status-396fe8ff-762a-41bb-a3e6-da8270fa144f-default_ping_set, magic=NA, cib=0.7.47) : Transient attribute: update Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_state_transition: All 2 cluster nodes are eligible to run resources. Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_pe_invoke: Query 90: Requesting the current CIB: S_POLICY_ENGINE Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_pe_invoke_callback: Invoking the PE: query=90, ref=pe_calc-dc-1320195847-61, seq=2, quorate=1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: unpack_config: On loss of CCM Quorum: Ignore Nov 2 10:04:07 hostM1 pengine: [20268]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 Nov 2 10:04:07 hostM1 pengine: [20268]: info: determine_online_status: Node hostm1 is online Nov 2 10:04:07 hostM1 pengine: [20268]: info: determine_online_status: Node hostb1 is online Nov 2 10:04:07 hostM1 pengine: [20268]: notice: group_print: Resource Group: hoge-grp Nov 2 10:04:07 hostM1 pengine: [20268]: notice: native_print: ip-g#011(ocf::heartbeat:IPaddr2):#011Started hostm1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: native_print: ip-s#011(ocf::heartbeat:IPaddr2):#011Started hostm1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: native_print: hoge-r#011(ocf::hoge:hoge_proxy.ra):#011Started hostm1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: clone_print: Clone Set: clone_ping Nov 2 10:04:07 hostM1 pengine: [20268]: notice: short_print: Started: [ hostm1 hostb1 ] Nov 2 10:04:07 hostM1 pengine: [20268]: info: get_failcount: ip-g has failed 1 times on hostm1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: common_apply_stickiness: ip-g can fail 999999 more times on hostm1 before being forced off Nov 2 10:04:07 hostM1 pengine: [20268]: info: get_failcount: ip-s has failed 1 times on hostm1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: common_apply_stickiness: ip-s can fail 999999 more times on hostm1 before being forced off Nov 2 10:04:07 hostM1 pengine: [20268]: notice: RecurringOp: Start recurring monitor (10s) for ip-g on hostb1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: RecurringOp: Start recurring 
monitor (10s) for ip-s on hostb1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: RecurringOp: Start recurring monitor (15s) for hoge-r on hostb1 Nov 2 10:04:07 hostM1 pengine: [20268]: notice: LogActions: Move resource ip-g#011(Started hostm1 -> hostb1) Nov 2 10:04:07 hostM1 pengine: [20268]: notice: LogActions: Move resource ip-s#011(Started hostm1 -> hostb1) Nov 2 10:04:07 hostM1 pengine: [20268]: notice: LogActions: Move resource hoge-r#011(Started hostm1 -> hostb1) Nov 2 10:04:07 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:0#011(Started hostm1) Nov 2 10:04:07 hostM1 pengine: [20268]: notice: LogActions: Leave resource ping-r:1#011(Started hostb1) Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] Nov 2 10:04:07 hostM1 crmd: [20254]: info: unpack_graph: Unpacked transition 8: 14 actions in 14 synapses Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_te_invoke: Processing graph 8 (ref=pe_calc-dc-1320195847-61) derived from /var/lib/pengine/pe-input-616.bz2 Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 21 fired and confirmed Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 16: stop hoge-r_stop_0 on hostm1 (local) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: cancel_op: operation monitor[28] on ocf::hoge_proxy.ra::hoge-r for client 20254, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart] CRM_meta_interval=[15000] CRM_meta_timeout=[20000] cancelled Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=16:8:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=hoge-r_stop_0 ) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: rsc:hoge-r:29: stop Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_monitor_15000 (call=28, status=1, cib-update=0, confirmed=true) Cancelled Nov 2 10:04:07 hostM1 hoge_proxy.ra[23276]: INFO: Stopping hoge_proxy ... 
Nov 2 10:04:07 hostM1 pengine: [20268]: info: process_pe_message: Transition 8: PEngine Input stored in: /var/lib/pengine/pe-input-616.bz2 Nov 2 10:04:07 hostM1 hoge_proxy.ra[23276]: INFO: hoge_proxy stopped Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation hoge-r_stop_0 (call=29, rc=0, cib-update=91, confirmed=true) ok Nov 2 10:04:07 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_stop_0 (16) confirmed on hostm1 (rc=0) Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 13: stop ip-s_stop_0 on hostm1 (local) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: cancel_op: operation monitor[26] on ocf::IPaddr2::ip-s for client 20254, its parameters: CRM_meta_name=[monitor] cidr_netmask=[29] crm_feature_set=[3.0.1] CRM_meta_timeout=[20000] CRM_meta_interval=[10000] nic=[eth2] ip=[aa.bb.cc.242] cancelled Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=13:8:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=ip-s_stop_0 ) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: rsc:ip-s:30: stop Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_monitor_10000 (call=26, status=1, cib-update=0, confirmed=true) Cancelled Nov 2 10:04:07 hostM1 IPaddr2[23307]: INFO: IP status = ok, IP_CIP= Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-s_stop_0 (call=30, rc=0, cib-update=92, confirmed=true) ok Nov 2 10:04:07 hostM1 crmd: [20254]: info: match_graph_event: Action ip-s_stop_0 (13) confirmed on hostm1 (rc=0) Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 10: stop ip-g_stop_0 on hostm1 (local) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: cancel_op: operation monitor[18] on ocf::IPaddr2::ip-g for client 20254, its parameters: CRM_meta_name=[monitor] cidr_netmask=[28] crm_feature_set=[3.0.1] CRM_meta_timeout=[20000] CRM_meta_interval=[10000] nic=[eth0] ip=[aa.bb.cc.201] cancelled Nov 2 10:04:07 hostM1 crmd: [20254]: info: do_lrm_rsc_op: Performing key=10:8:0:db95561c-6613-41e2-b7a0-17b8701178c3 op=ip-g_stop_0 ) Nov 2 10:04:07 hostM1 lrmd: [20251]: info: rsc:ip-g:31: stop Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-g_monitor_10000 (call=18, status=1, cib-update=0, confirmed=true) Cancelled Nov 2 10:04:07 hostM1 IPaddr2[23347]: INFO: IP status = ok, IP_CIP= Nov 2 10:04:07 hostM1 crmd: [20254]: info: process_lrm_event: LRM operation ip-g_stop_0 (call=31, rc=0, cib-update=93, confirmed=true) ok Nov 2 10:04:07 hostM1 crmd: [20254]: info: match_graph_event: Action ip-g_stop_0 (10) confirmed on hostm1 (rc=0) Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 22 fired and confirmed Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 6 fired and confirmed Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 19 fired and confirmed Nov 2 10:04:07 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 11: start ip-g_start_0 on hostb1 <<point2>> Nov 2 10:04:07 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:07 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.241 is unreachable (write) Nov 2 10:04:08 hostM1 crmd: [20254]: info: match_graph_event: Action ip-g_start_0 (11) confirmed on hostb1 (rc=0) Nov 2 10:04:08 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 12: monitor ip-g_monitor_10000 on hostb1 Nov 2 10:04:08 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 14: start 
ip-s_start_0 on hostb1 Nov 2 10:04:08 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:08 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.241 is unreachable (write) Nov 2 10:04:09 hostM1 crmd: [20254]: info: match_graph_event: Action ip-g_monitor_10000 (12) confirmed on hostb1 (rc=0) Nov 2 10:04:09 hostM1 crmd: [20254]: info: match_graph_event: Action ip-s_start_0 (14) confirmed on hostb1 (rc=0) Nov 2 10:04:09 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 15: monitor ip-s_monitor_10000 on hostb1 Nov 2 10:04:09 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 17: start hoge-r_start_0 on hostb1 Nov 2 10:04:09 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:09 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.193 is unreachable (write) Nov 2 10:04:10 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:10 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.193 is unreachable (write) Nov 2 10:04:11 hostM1 crmd: [20254]: info: match_graph_event: Action ip-s_monitor_10000 (15) confirmed on hostb1 (rc=0) Nov 2 10:04:11 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_start_0 (17) confirmed on hostb1 (rc=0) Nov 2 10:04:11 hostM1 crmd: [20254]: info: te_pseudo_action: Pseudo action 20 fired and confirmed Nov 2 10:04:11 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 18: monitor hoge-r_monitor_15000 on hostb1 Nov 2 10:04:11 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:11 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (write) Nov 2 10:04:12 hostM1 crmd: [20254]: info: match_graph_event: Action hoge-r_monitor_15000 (18) confirmed on hostb1 (rc=0) Nov 2 10:04:12 hostM1 crmd: [20254]: info: run_graph: ==================================================== Nov 2 10:04:12 hostM1 crmd: [20254]: notice: run_graph: Transition 8 (Complete=14, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-616.bz2): Complete Nov 2 10:04:12 hostM1 crmd: [20254]: info: te_graph_trigger: Transition 8 is now complete Nov 2 10:04:12 hostM1 crmd: [20254]: info: notify_crmd: Transition 8 status: done - <null> Nov 2 10:04:12 hostM1 crmd: [20254]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Nov 2 10:04:12 hostM1 crmd: [20254]: info: do_state_transition: Starting PEngine Recheck Timer Nov 2 10:04:12 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101) Nov 2 10:04:12 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (write) Nov 2 10:04:12 hostM1 attrd: [20253]: info: attrd_trigger_update: Sending flush op to all hosts for: default_ping_set (100) Nov 2 10:04:13 hostM1 attrd: [20253]: info: attrd_ha_callback: flush message from hostm1 Nov 2 10:04:13 hostM1 attrd: [20253]: info: attrd_perform_update: Sent update 51: default_ping_set=100 Nov 2 10:04:13 hostM1 crmd: [20254]: info: abort_transition_graph: te_update_diff:150 - Triggered transition abort (complete=1, tag=nvpair, id=status-396fe8ff-762a-41bb-a3e6-da8270fa144f-default_ping_set, magic=NA, cib=0.7.57) : Transient attribute: update Nov 2 10:04:13 hostM1 crmd: [20254]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
---------------- ha-log failover err end -----------------------------

■ About the log above

・Regarding <<point1>>:
>Nov 2 10:04:01 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read)
>Nov 2 10:04:02 hostM1 pingd: [20434]: info: stand_alone_ping: Node dd.ee.ff.1 is unreachable (read)
>...
Here, pings to Node dd.ee.ff.1 become unreachable (read). Messages like these do not
appear in my local environment.

・Regarding <<point2>>:
>Nov 2 10:04:07 hostM1 pingd: [20434]: WARN: ping_write: Wrote -1 of 39 chars: Network is unreachable (101)
>Nov 2 10:04:07 hostM1 pingd: [20434]: info: stand_alone_ping: Node aa.bb.cc.241 is unreachable (write)
>Nov 2 10:04:08 hostM1 crmd: [20254]: info: match_graph_event: Action ip-g_start_0 (11) confirmed on hostb1 (rc=0)
>Nov 2 10:04:08 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 12: monitor ip-g_monitor_10000 on hostb1
>Nov 2 10:04:08 hostM1 crmd: [20254]: info: te_rsc_command: Initiating action 14: start ip-s_start_0 on hostb1
Here, pings to Node aa.bb.cc.241 become unreachable (write). From this point on, this
pattern seems to repeat in many cases.

■ My thoughts

It looks as if, after the failover, pings to the pingd targets stop getting through.
At least for Router1 (aa.bb.cc.241), which sits beyond eth2, this seems to happen in
many cases. For what it is worth, I have confirmed that pings to aa.bb.cc.241 do get
through while heartbeat is not running, and they were also OK at that time with every
interface other than eth2 brought down. After heartbeat is started they also get
through, as long as no failover occurs.

From here on I am only guessing, since I do not really understand this: with Linux-HA,
even when the virtual IP is kept across a failover, the MAC address behind those
packets changes; could it be that Router1 and the others dislike this and do not
answer pings (or at least not right away), or that some router setting causes this?
Router1, 2 and 3 belong to the customer, so I cannot check them right now, but
judging from their MAC addresses they appear to be Cisco routers or switches. I will
check the next time I visit the customer site, but if this rings a bell for anyone,
please let me know. (A rough way to check this from the host side is sketched in the
P.S. below.)

Also, in hindsight, I had "auto_failback on" in ha.cf, which may have made the
situation harder to grasp. I am thinking of turning it off and trying again without
failback (see the P.P.S. below).

That is all. I would be grateful if you could look into this.

--- yamamoto
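
P.S. A supplementary note on the MAC-address guess above. As the log shows, the
IPaddr2 agent runs send_arp after bringing the virtual IP up, so the routers should
normally learn the new MAC from those gratuitous ARPs. The following is only a rough
sketch of commands I could run from the hosts next time (it assumes arping from
iputils and tcpdump are installed; interface and address names follow the diagram
above):

  # On the node that currently holds the virtual IP (check with crm_mon):
  ip addr show dev eth2            # the VIP aa.bb.cc.242/29 should be listed here

  # In a second terminal, watch ARP traffic on eth2 while a failover/failback
  # is triggered; the gratuitous ARPs sent by send_arp for aa.bb.cc.242 should
  # show up here together with the sender MAC address.
  tcpdump -n -e -i eth2 arp

  # Afterwards, check whether Router1 still answers ARP and ICMP when the
  # requests originate from the virtual IP.
  arping -I eth2 -c 3 aa.bb.cc.241
  ping -I aa.bb.cc.242 -c 3 aa.bb.cc.241

If the gratuitous ARPs are visible on the wire but Router1 still stops answering for
a while, that would point at the router side (for example an ARP cache or
port-security setting) rather than at Pacemaker itself.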
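
P.P.S. On the auto_failback point: my understanding is that with Heartbeat V3 in
Pacemaker (crm) mode, auto_failback in ha.cf is not what decides failback; resource
stickiness on the Pacemaker side is. If I go ahead with suppressing failback, I am
thinking of something along the following lines, which is only a sketch and would
have to be adapted to the real configuration:

  # Keep resources on the node they failed over to instead of moving them
  # back automatically (cluster-wide default stickiness).
  crm configure rsc_defaults resource-stickiness=INFINITY

With INFINITY the group would stay on hostB1 after a failover until moved back by
hand, which should make it easier to see whether the ping problem is tied to the
failover itself or to the failback.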