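Start/stop script for Nginx and php-cgi

December 18, 2009 admin 2 comments

To make starting and stopping Nginx and php-cgi easier, I wrote the script below. Save it as /etc/init.d/nginxd; it supports service nginxd start|stop|restart|reload|status.

Note: some lines (highlighted in the original post) may need to be changed for your environment, for example the Nginx installation path.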

#!/bin/bash
# (bash is required: the script uses [[ ]], (( )) and &>)

# source function library
. /etc/rc.d/init.d/functions

# Source networking configuration.
. /etc/sysconfig/network

# Check that networking is up.
[ "${NETWORKING}" = "no" ] && exit 0

RETVAL=0
prog="nginx"

nginxDir=/usr/local/nginx
nginxd=$nginxDir/sbin/nginx
nginxConf=$nginxDir/conf/nginx.conf
nginxPid=$nginxDir/logs/nginx.pid

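nginx_check()
{
	if [[ -e $nginxPid ]]; then
		# Test the PID recorded in the PID file; grepping ps output
		# for "nginx" would also match this script (nginxd).
		if kill -0 "$(cat "$nginxPid")" 2>/dev/null; then
			echo "$prog already running..."
			# LSB: starting a service that already runs counts as success.
			exit 0
		else
			# Stale PID file: remove it and carry on.
			rm -f "$nginxPid" &> /dev/null
		fi
	fi
}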

phpcgi_check()
{
	netstat -tunlp |grep -q php-cgi
	if (( $? == 0 )); then
		echo "php-cgi already running..."
		return 1
	fi
}

phpcgi_start()
{
	phpcgi_check
	if (( $? == 0 )); then
		echo -n $"Starting php-cgi:"
		daemon /usr/bin/spawn-fcgi -a 127.0.0.1 -p 9000 -u nobody -g nobody -C 64 -f /usr/bin/php-cgi
		RETVAL=$?
		echo
                [ $RETVAL = 0 ] && touch /var/lock/subsys/php-cgi
                return $RETVAL
	fi
}

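phpcgi_stop()
{
	echo -n $"Stopping php-cgi:"
	# PID of the php-cgi process that owns the listening socket
	phpcgi_pid=`netstat -tnlp |grep php-cgi |awk '{print $7}' |awk -F'/' '{print $1}'`
	kill -9 $phpcgi_pid &>/dev/null
	RETVAL=$?
	# Kill the remaining php-cgi children
	killall -9 php-cgi &>/dev/null
	# Add the exit codes arithmetically; the string "0+0" would never
	# equal "0" in the [ $RETVAL = 0 ] test below.
	RETVAL=$(( RETVAL + $? ))
	if (( RETVAL == 0 )); then
		echo_success
	else
		echo_failure
	fi
	echo
	[ $RETVAL = 0 ] && rm -f /var/lock/subsys/php-cgi
}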

start()
{
	nginx_check
	if (( $? != 0 )); then
		true
	else
		echo -n $"Starting $prog:"
		daemon $nginxd -c $nginxConf
		RETVAL=$?
		echo
		[ $RETVAL = 0 ] && touch /var/lock/subsys/nginx
		return $RETVAL
	fi
}

stop()
{
	echo -n $"Stopping $prog:"
	killproc $nginxd
	RETVAL=$?
        echo
        [ $RETVAL = 0 ] && rm -f /var/lock/subsys/nginx $nginxPid
}

reload()
{
	echo -n $"Reloading $prog:"
	killproc $nginxd -HUP
	RETVAL=$?
        echo
}

case "$1" in
        start)
		phpcgi_start
                start
                ;;
        stop)
		phpcgi_stop
                stop
                ;;
        restart)
		phpcgi_stop
                stop
		phpcgi_start
                start
                ;;
        reload)
                reload
                ;;
        status)
                status $prog
                RETVAL=$?
                ;;
        *)
                echo $"Usage: $0 {start|stop|restart|reload|status}"
                RETVAL=1
esac
exit $RETVAL

MySQL-Nginx-Pacemaker-corosync(openais)-drbd active/passive cluster

December 11, 2009 admin No comments

System: CentOS 5.4
IP allocation:

HA1		eth0:192.168.0.66	eth1:192.168.10.1
HA2		eth0:192.168.0.69	eth1:192.168.10.2
VIP		192.168.0.120

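DRBD (Distributed Replicated Block Device), often described as "network RAID", is open-source software developed by LINBIT. DRBD is an implementation of a block device and is mainly used in high-availability (HA) setups on Linux. It consists of a kernel module and a set of userland tools, and mirrors an entire device over the network, much like a network RAID-1. When you write data to the file system on the local DRBD device, the data is also sent over the network to the other host and written to its backing device in exactly the same form. The local and remote nodes are kept in sync in real time and I/O consistency is preserved, so when the local node fails, the remote node still holds an identical copy of the data and can carry on, which is what makes the setup highly available.

I. Install DRBD

Install DRBD on both HA1 and HA2.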
wget http://oss.linbit.com/drbd/8.3/drbd-8.3.5.tar.gz

[root@HA1 ~]# tar xzvf drbd-8.3.5.tar.gz
[root@HA1 ~]# cd drbd-8.3.5
[root@HA1 drbd-8.3.5]# make clean all
[root@HA1 drbd-8.3.5]# make install
[root@HA1 drbd-8.3.5]# cd

[root@HA1 ~]# vi /etc/drbd.conf

global {
    usage-count yes;    # take part in the usage statistics (yes = participate)
}
common {
  syncer { rate 100M; }    # resync rate; adjust it to your actual network bandwidth
}

# One DRBD device (i.e. /dev/drbdX) is called a "resource".
resource "r0" {
  protocol C;    # replication protocol; with C a write is complete only after the peer has confirmed it
  startup {
  }
  disk {
    on-io-error detach;
  }
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root"; # mail root when a split brain is detected
  }
  net {
    # Peer authentication: HMAC algorithm and shared secret.
    cram-hmac-alg sha1;
    shared-secret "FooFunFactory";
    # Automatic split-brain recovery policies
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  syncer {
  }
  # Each host section starts with "on", followed by the host name
  on HA1 {
    device    /dev/drbd0;
    disk    /dev/sdb;
    # Address and port DRBD listens on to talk to the other host
    address    192.168.0.66:7789;
    # Where the metadata is stored;
    # internal puts it at the end of the disk or partition used by DRBD
    meta-disk    internal;
  }

  on HA2 {
    device     /dev/drbd0;
    disk       /dev/sdb;
    address    192.168.0.69:7789;
    meta-disk  internal;
  }
}

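DRBD keeps various pieces of information about the data in a dedicated area. This metadata includes:

a. the size of the DRBD device
b. the generation identifiers
c. the activity log
d. the quick-sync bitmap

Metadata can be stored internally or externally; which one is used is defined in the resource configuration.

Internal metadata: stored at the end of the same disk or partition as the data.

Advantage: metadata and data stay together. If the disk fails the metadata is lost with it, and when the data is restored the metadata is restored along with it.
Disadvantage: because metadata and data share one disk, write throughput can suffer. An application write may trigger a metadata update, which costs two additional head movements on the disk.

External metadata: stored on a dedicated block device, separate from the data disk.

Advantage: for some write workloads, latency can be somewhat better.
Disadvantage: metadata and data are not kept together, so when the data disk fails and is replaced, manual intervention is needed to start a full sync from the surviving node to the new disk.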
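
The space taken by internal metadata can be estimated before the device is created. The sketch below assumes the sizing formula from the DRBD 8.3 User's Guide as I recall it, Ms = ceil(Cs / 2^18) * 8 + 72, with both sizes in 512-byte sectors. The function name drbd_md_sectors is made up for this example.

```shell
# Estimate the size of DRBD internal metadata in 512-byte sectors.
# Assumed formula (DRBD 8.3 User's Guide): Ms = ceil(Cs / 2^18) * 8 + 72
# $1 = size of the backing device in 512-byte sectors
drbd_md_sectors() {
    cs=$1
    # (cs + 262143) / 262144 is integer arithmetic for ceil(cs / 2^18)
    echo $(( ($cs + 262143) / 262144 * 8 + 72 ))
}

# A backing device of exactly 6 GiB has 12582912 sectors.
drbd_md_sectors 12582912    # prints 456
```

456 sectors are 228 KiB. That would be consistent with the ns:6291228 counter (in KiB) in the /proc/drbd output further down, provided /dev/sdb in this setup is exactly 6 GiB (6291456 KiB), which is an assumption.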

[root@HA1 ~]# scp /etc/drbd.conf HA2:/etc/

Initialize and start the DRBD service on both systems:
[root@HA1 ~]# drbdadm create-md r0
[root@HA1 ~]# service drbd start
Starting DRBD resources: [ d(r0) s(r0) n(r0) ].

Make HA1 the primary node:
[root@HA1 ~]# drbdadm -- --overwrite-data-of-peer primary r0

The two devices start to synchronize:
[root@HA2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.3.5 (api:88/proto:86-91)
GIT-hash: ded8cdf09b0efa1460e8ce7a72327c60ff2210fb build by root@HA2, 2009-11-13 01:58:29
m:res  cs          ro                 ds                     p  mounted  fstype
...    sync'ed:    0.6%               (6108/6140)M
0:r0   SyncTarget  Secondary/Primary  Inconsistent/UpToDate  C

………

[root@HA2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.3.5 (api:88/proto:86-91)
GIT-hash: ded8cdf09b0efa1460e8ce7a72327c60ff2210fb build by root@HA2, 2009-11-13 01:58:29
m:res  cs          ro                 ds                     p  mounted  fstype
...    sync'ed:    45.6%              (3344/6140)M
0:r0   SyncTarget  Secondary/Primary  Inconsistent/UpToDate  C

………

Synchronization finishes:
[root@HA2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.3.5 (api:88/proto:86-91)
GIT-hash: ded8cdf09b0efa1460e8ce7a72327c60ff2210fb build by root@HA2, 2009-11-13 01:58:29
m:res  cs              ro                 ds                     p  mounted  fstype
...    sync'ed:100.0%  (4/6140)M
0:r0   SyncTarget      Secondary/Primary  Inconsistent/UpToDate  C
[root@HA2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.3.5 (api:88/proto:86-91)
GIT-hash: ded8cdf09b0efa1460e8ce7a72327c60ff2210fb build by root@HA2, 2009-11-13 01:58:29
m:res  cs         ro                 ds                 p  mounted  fstype
0:r0   Connected  Secondary/Primary  UpToDate/UpToDate  C

[root@HA1 ~]# cat /proc/drbd
version: 8.3.5 (api:88/proto:86-91)
GIT-hash: ded8cdf09b0efa1460e8ce7a72327c60ff2210fb build by root@HA1, 2009-11-13 01:53:51
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r---
ns:6291228 nr:0 dw:0 dr:6291228 al:0 bm:384 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
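
To use these status lines in a script, the cs (connection state), ro (roles) and ds (disk states) fields can be pulled out of /proc/drbd. This is a minimal sketch: drbd_field is a hypothetical helper name, and the sample line is a shortened copy of the status line shown above.

```shell
# Print the value of one "key:value" field (cs, ro or ds) from DRBD
# status text read on stdin. Only the first match is printed.
drbd_field() {
    key=$1
    awk -v k="$key" '{
        for (i = 1; i <= NF; i++)
            # a matching token starts with "<key>:"
            if (index($i, k ":") == 1) {
                print substr($i, length(k) + 2)
                exit
            }
    }'
}

# Demo on a sample status line
sample=' 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C'
printf '%s\n' "$sample" | drbd_field cs    # prints Connected
printf '%s\n' "$sample" | drbd_field ds    # prints UpToDate/UpToDate
```

On a live node the same helper would be fed from the real file, for example drbd_field ds < /proc/drbd, to wait until both sides report UpToDate before creating the file system.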

Once the block devices on both nodes are fully synchronized, format the DRBD device on the primary node with a file system such as ext3.
[root@HA1 ~]# mkfs.ext3 /dev/drbd0

Test the DRBD service:
Mount the DRBD device by hand and write a test file.
[root@HA1 ~]# mount -o rw /dev/drbd0 /data/
[root@HA1 ~]# echo "This is a test line." > /data/test.txt
Unmount the DRBD device and demote HA1 to secondary.
[root@HA1 ~]# umount /data/
[root@HA1 ~]# drbdadm secondary r0
Promote HA2 to primary.
[root@HA2 ~]# drbdadm primary r0
[root@HA2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.3.5 (api:88/proto:86-91)
GIT-hash: ded8cdf09b0efa1460e8ce7a72327c60ff2210fb build by root@HA2, 2009-11-13 01:58:29
m:res  cs         ro                 ds                 p  mounted  fstype
0:r0   Connected  Primary/Secondary  UpToDate/UpToDate  C
Mount the DRBD device and check that the file written on HA1 can be read.
[root@HA2 ~]# mount -o rw /dev/drbd0 /data/
[root@HA2 ~]# cat /data/test.txt
This is a test line.
Unmount the DRBD device and demote HA2 to secondary.
[root@HA2 ~]# umount /data/
[root@HA2 ~]# drbdadm secondary r0

Promote HA1 to primary.
[root@HA1 ~]# drbdadm primary r0

Check the DRBD status on HA2:
[root@HA2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.3.5 (api:88/proto:86-91)
GIT-hash: ded8cdf09b0efa1460e8ce7a72327c60ff2210fb build by root@HA2, 2009-11-13 01:58:29
m:res  cs         ro                 ds                 p  mounted  fstype
0:r0   Connected  Secondary/Primary  UpToDate/UpToDate  C

Configure /etc/hosts on HA1 and HA2:
[root@HA1 ~]# cat /etc/hosts

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 vpc localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
192.168.10.1 HA1
192.168.10.2 HA2

Set up time synchronization on HA1 and HA2:
[root@HA1 ~]# crontab -e

*/5     *       *       *       *       /usr/sbin/ntpdate ntp.api.bz


II. Install MySQL and Nginx on HA1 and HA2 and move the data to /data
[root@HA1 ~]# yum install -y mysql-server
[root@HA1 ~]# cat /etc/my.cnf

[mysqld]
datadir=/data/mysql
socket=/data/mysql/mysql.sock
user=mysql
bind-address=192.168.0.120

[root@HA1 ~]# cp -r /var/lib/mysql/ /data/
[root@HA1 ~]# cd /data/
[root@HA1 data]# chown -R mysql.mysql mysql/
[root@HA1 data]# service mysqld start
Starting MySQL:                                            [  OK  ]
[root@HA1 data]# service mysqld stop
Stopping MySQL:                                            [  OK  ]

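Note: the data only needs to be migrated on HA1, with the DRBD device mounted on /data so that the files end up on the replicated volume.

Installing Nginx is not covered here; see "Nginx 0.7.x + PHP 5.2.8 (FastCGI): building a web server ten times better than Apache" (http://blog.s135.com/post/366/)

[root@HA1 ~]# chkconfig --level 2345 mysqld off
[root@HA2 ~]# chkconfig --level 2345 mysqld off
Note: do not start cluster-managed resources outside of the cluster; leave all control to the HA stack.

Write the nginx LSB resource agent script (mind the nginx installation path):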
[root@HA1 ~]# cat /etc/init.d/nginxd

#!/bin/bash
# (bash is required: the script uses [[ ]], (( )) and &>)

# source function library
. /etc/rc.d/init.d/functions

# Source networking configuration.
. /etc/sysconfig/network

# Check that networking is up.
[ "${NETWORKING}" = "no" ] && exit 0

RETVAL=0
prog="nginx"

nginxDir=/usr/local/nginx
nginxd=$nginxDir/sbin/nginx
nginxConf=$nginxDir/conf/nginx.conf
nginxPid=$nginxDir/nginx.pid

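nginx_check()
{
    if [[ -e $nginxPid ]]; then
        # Test the PID recorded in the PID file; grepping ps output
        # for "nginx" would also match this script (nginxd).
        if kill -0 "$(cat "$nginxPid")" 2>/dev/null; then
            echo "$prog already running..."
            # LSB: starting a service that already runs counts as success.
            exit 0
        else
            # Stale PID file: remove it and carry on.
            rm -f "$nginxPid" &> /dev/null
        fi
    fi
}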

start()
{
    nginx_check
    if (( $? != 0 )); then
        true
    else
        echo -n $"Starting $prog:"
        daemon $nginxd -c $nginxConf
        RETVAL=$?
        echo
        [ $RETVAL = 0 ] && touch /var/lock/subsys/nginx
        return $RETVAL
    fi
}

stop()
{
    echo -n $"Stopping $prog:"
    killproc $nginxd
    RETVAL=$?
    echo
    [ $RETVAL = 0 ] && rm -f /var/lock/subsys/nginx $nginxPid
}

reload()
{
    echo -n $"Reloading $prog:"
    killproc $nginxd -HUP
    RETVAL=$?
    echo
}

case "$1" in
        start)
                start
                ;;
        stop)
                stop
                ;;
        restart)
                stop
                start
                ;;
        reload)
                reload
                ;;
        status)
                status $prog
                RETVAL=$?
                ;;
        *)
                echo $"Usage: $0 {start|stop|restart|reload|status}"
                RETVAL=1
esac
exit $RETVAL

[root@HA1 ~]# chmod +x  /etc/init.d/nginxd
[root@HA1 ~]# scp /etc/init.d/nginxd HA2:/etc/init.d/nginxd

III. Install and configure corosync and pacemaker

corosync is a cluster engine derived from OpenAIS; it can replace heartbeat for heartbeat detection.
The Corosync Cluster Engine is an open source project Licensed under the BSD License derived from the OpenAIS project. OpenAIS uses a UDP multicast based communication protocol to periodically check for node availability.

[root@HA1 ~]# wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo
[root@HA1 ~]# wget ftp://ftp.pbone.net/mirror/centos.karan.org/el5/extras/testing/i386/RPMS/libesmtp-1.0.4-6.el5.kb.i386.rpm
[root@HA1 ~]# rpm -ivh libesmtp-1.0.4-6.el5.kb.i386.rpm
[root@HA1 ~]# yum install -y pacemaker corosync

[root@HA1 ~]# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Writing corosync key to /etc/corosync/authkey.

[root@HA1 ~]# scp /etc/corosync/authkey HA2:/etc/corosync/
[root@HA1 ~]# cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
[root@HA1 ~]# vi !$

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
version: 2
secauth: off
threads: 0
interface {
ringnumber: 0
bindnetaddr: 192.168.10.0
mcastaddr: 226.94.1.1
mcastport: 5405
}
}

logging {
fileline: off
to_stderr: yes
to_logfile: yes
to_syslog: yes
logfile: /var/log/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

amf {
mode: disabled
}

service {
        # Load the Pacemaker Cluster Resource Manager
        ver:       0
        name:      pacemaker
        use_mgmtd: yes
}

[root@HA1 ~]# scp /etc/corosync/corosync.conf HA2:/etc/corosync/corosync.conf
[root@HA1 ~]# service corosync start
Starting Corosync Cluster Engine (corosync):               [  OK  ]
[root@HA1 ~]# chkconfig --level 2345 corosync on

Run on HA2:
[root@HA2 ~]# chown root:root /etc/corosync/authkey
[root@HA2 ~]# chmod 400 /etc/corosync/authkey
[root@HA2 ~]# service corosync start
Starting Corosync Cluster Engine (corosync):               [  OK  ]
[root@HA2 ~]# chkconfig --level 2345 corosync on

IV. Configure the CRM resources

[root@HA1 ~]# crm
crm(live)# configure
crm(live)configure# node HA1
crm(live)configure# node HA2
# Define the drbd primitive
crm(live)configure# primitive drbd ocf:linbit:drbd \
 params drbd_resource="r0" \
 meta migration-threshold="10"
# Add a monitor operation for drbd (interval 30s, timeout 20s)
crm(live)configure# monitor drbd 30s:20s
# Define the file system primitive
crm(live)configure# primitive fs ocf:heartbeat:Filesystem \
 params device="/dev/drbd0" directory="/data" fstype="ext3"
# Define the mysql primitive, using the LSB agent
crm(live)configure# primitive mysqld lsb:mysqld
# Define the nginx primitive, using the LSB agent
crm(live)configure# primitive nginxd lsb:nginxd
# Define the shared (virtual) IP primitive
crm(live)configure# primitive vip ocf:heartbeat:IPaddr2 \
 params ip="192.168.0.120" nic="eth0:0"
# Group the resources so that they start and stop in order on the same node
crm(live)configure# group mysql-group fs vip mysqld nginxd
# Define the master/slave set for drbd
crm(live)configure# ms ms-drbd-mysql drbd \
 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
# Colocation constraint: run mysql-group on the node where drbd is master
crm(live)configure# colocation mysql-on-drbd inf: mysql-group ms-drbd-mysql:Master
# Ordering constraint: start mysql-group only after drbd has been promoted
crm(live)configure# order mysql-after-drbd inf: ms-drbd-mysql:promote mysql-group:start
crm(live)configure# property $id="cib-bootstrap-options" \
 expected-quorum-votes="2" \
 stonith-enabled="false" \
 no-quorum-policy="ignore" \
 start-failure-is-fatal="false"
crm(live)configure# commit
crm(live)configure# end
crm(live)#

V. Testing

[root@HA1 ~]# crm status

============
Last updated: Fri Nov 20 22:47:51 2009
Stack: openais
Current DC: HA2 - partition with quorum
Version: 1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ HA1 HA2 ]

Master/Slave Set: ms-drbd-mysql
Masters: [ HA1 ]
Slaves: [ HA2 ]
Resource Group: mysql-group
fs    (ocf::heartbeat:Filesystem):    Started HA1
vip    (ocf::heartbeat:IPaddr2):    Started HA1
mysqld    (lsb:mysqld):    Started HA1
nginxd    (lsb:nginxd):    Started HA1

Shut down HA1 and check the cluster status on HA2:
[root@HA2 ~]# crm_mon -i1

============
Last updated: Sat Nov 21 01:31:13 2009
Stack: openais
Current DC: HA2    - partition WITHOUT quorum
Version: 1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ HA2 ]
OFFLINE: [ HA1 ]

Master/Slave Set: ms-drbd-mysql
Masters: [ HA2 ]
Stopped: [ drbd:1 ]
Resource Group: mysql-group
fs (ocf::heartbeat:Filesystem):    Started HA2
vip        (ocf::heartbeat:IPaddr2):    Started HA2
mysqld     (lsb:mysqld):   Started HA2
nginxd     (lsb:nginxd):   Started HA2

Start HA1 again; the resources move back to HA1 automatically:
[root@HA1 ~]# crm_mon -i1

============
Last updated: Mon Nov 23 15:42:52 2009
Stack: openais
Current DC: HA2    - partition with quorum
Version: 1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ HA1 HA2 ]

Master/Slave Set: ms-drbd-mysql
Masters: [ HA1 ]
Slaves: [ HA2 ]
Resource Group: mysql-group
fs (ocf::heartbeat:Filesystem):    Started HA1
vip        (ocf::heartbeat:IPaddr2):    Started HA1
mysqld     (lsb:mysqld):   Started HA1
nginxd     (lsb:nginxd):   Started HA1

Migrate the resources to HA2 by hand:
[root@HA1 ~]# crm resource migrate mysql-group HA2

[root@HA2 ~]# crm_mon -i1

============
Last updated: Mon Nov 23 15:43:42 2009
Stack: openais
Current DC: HA2    - partition with quorum
Version: 1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ HA1 HA2 ]

Master/Slave Set: ms-drbd-mysql
Masters: [ HA2 ]
Slaves: [ HA1 ]
Resource Group: mysql-group
fs (ocf::heartbeat:Filesystem):    Started HA2
vip        (ocf::heartbeat:IPaddr2):    Started HA2
mysqld     (lsb:mysqld):   Started HA2
nginxd     (lsb:nginxd):   Started HA2


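VI. Resolving split brain

In a two-node hot-standby HA system, when the heartbeat links between the two nodes are cut, what used to be one coordinated system splits into two independent nodes. Having lost contact, each assumes that the other has failed, and the HA software on both nodes, like a "split-brain patient", instinctively fights over the shared resources and tries to start the application services. The consequences are serious: either the shared resources get divided and the service comes up on neither side, or the service comes up on both sides and both write to the shared storage at the same time, corrupting data (a typical case is errors in a database's rotating online logs).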

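There are roughly three ways to deal with split brain in an HA system:

1) Add redundant heartbeat links, for example two separate lines, to make split brain less likely.

2) Use a disk lock. The node currently serving locks the shared disk, so that the other node cannot take it over when split brain occurs. Locking has a significant drawback: if the node holding the lock does not release it, the other node can never get the shared disk. If the serving node suddenly hangs or crashes, it cannot run the unlock command, and the standby node cannot take over the shared resources and services. Some HA products therefore implement a "smart" lock: the serving node locks the disk only when it finds that all heartbeat links are down (the peer is no longer visible), and leaves it unlocked the rest of the time.

3) Set up an arbitration mechanism, for example a reference IP such as the gateway. When the heartbeat links are completely down, each node pings the reference IP. A node that gets no reply knows that the fault is on its own side: not only the heartbeat but also its service-facing network link is down, so starting (or continuing) the service would be pointless. It gives up the contest and lets the node that can reach the reference IP run the service. To be safer still, the node that cannot reach the reference IP can simply reboot itself to release any shared resources it may still hold.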
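
The arbitration idea in 3) can be sketched in a few lines of shell. This is only an illustration under stated assumptions, not what Heartbeat or Pacemaker do internally: arbitrate is a hypothetical function name, and the probe command is passed in as arguments so the decision logic can be tried without a network.

```shell
# Decide what this node should do once all heartbeat links are down.
# "$@" is the probe command, typically a ping of the reference IP.
arbitrate() {
    if "$@" >/dev/null 2>&1; then
        # Reference IP reachable: our uplink works, keep or take the service.
        echo serve
    else
        # Reference IP unreachable: the fault is on our side, stand down.
        echo yield
    fi
}

# Example call with a hypothetical gateway address as the reference IP:
#   arbitrate ping -c 3 -W 2 192.168.0.1
```

A node that gets "yield" would stop its services, or reboot itself as described above, before the peer takes over.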

Resolving a DRBD split brain by hand:
[root@HA2 ~]# drbdadm down all
[root@HA2 ~]# drbdadm create-md all
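
As far as I can tell, recreating the metadata on HA2 this way leads to a full resync from HA1 once DRBD is brought up again on HA2. A lighter procedure documented for DRBD 8.3 is to demote the node whose changes are to be dropped and reconnect it with --discard-my-data. The sketch below wraps those commands in two hypothetical helper functions; the DRBDADM variable is an addition for this example, so that the commands can be printed (DRBDADM=echo) instead of executed.

```shell
# Command used to talk to DRBD; set DRBDADM=echo for a dry run.
DRBDADM=${DRBDADM:-drbdadm}

# Run on the node whose changes will be thrown away (the split-brain victim).
# $1 = resource name, e.g. r0
sb_discard_local() {
    $DRBDADM secondary "$1" &&
    $DRBDADM -- --discard-my-data connect "$1"
}

# Run on the node whose data is kept, if it is in the StandAlone state.
sb_keep_local() {
    $DRBDADM connect "$1"
}

# Example (dry run):
#   DRBDADM=echo sb_discard_local r0
```

Which node is the victim is a decision for the administrator; the helpers only run the commands.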

VII. References

Integrating DRBD with Pacemaker clusters
DRBD MySQL HowTo
Split brain notification and automatic recovery
Manual split brain recovery


Nginx load balancing with Heartbeat/corosync + pacemaker + ldirectord

December 10, 2009 admin No comments

System: CentOS 5.4
IP allocation:

HA1		eth0:192.168.0.66	eth1:192.168.10.1
HA2		eth0:192.168.0.69	eth1:192.168.10.2
VIP		192.168.0.120

1. Install pacemaker and heartbeat
[root@HA1 ~]# wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo
[root@HA1 ~]# wget ftp://ftp.pbone.net/mirror/centos.karan.org/el5/extras/testing/i386/RPMS/libesmtp-1.0.4-6.el5.kb.i386.rpm
[root@HA1 ~]# rpm -ivh libesmtp-1.0.4-6.el5.kb.i386.rpm
[root@HA1 ~]# yum install -y pacemaker heartbeat

2. Install ldirectord
[root@HA1 ~]# yum install -y ldirectord

3. Configuration
3.1 Configure Heartbeat
[root@HA1 ~]# cp /usr/share/doc/heartbeat-3.0.1/{ha.cf,authkeys} /etc/ha.d/

[root@HA1 ~]# cat /etc/ha.d/authkeys

auth 1
1 crc

[root@HA1 ~]# cat /etc/ha.d/ha.cf |grep -v "#"

logfile	/var/log/ha-log
logfacility	local0
keepalive 2
deadtime 30
warntime 10
initdead 120
udpport	695
ucast eth1 192.168.10.2     # on HA2 use instead: ucast eth1 192.168.10.1
auto_failback on
watchdog /dev/watchdog
node	HA1
node	HA2
ping 192.168.0.1
respawn hacluster /usr/lib/heartbeat/ipfail
apiauth ipfail gid=haclient uid=hacluster
crm	on

3.2 Replace heartbeat with corosync (optional)
corosync is a cluster engine derived from OpenAIS; it can replace heartbeat for heartbeat detection.
The Corosync Cluster Engine is an open source project Licensed under the BSD License derived from the OpenAIS project. OpenAIS uses a UDP multicast based communication protocol to periodically check for node availability.

[root@HA1 ~]# yum install -y corosync
[root@HA1 ~]# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Writing corosync key to /etc/corosync/authkey.

[root@HA1 ~]# scp /etc/corosync/authkey HA2:/etc/corosync/
[root@HA1 ~]# cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
[root@HA1 ~]# vi !$

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
version: 2
secauth: off
threads: 0
interface {
ringnumber: 0
bindnetaddr: 192.168.10.0
mcastaddr: 226.94.1.1
mcastport: 5405
}
}

logging {
fileline: off
to_stderr: yes
to_logfile: yes
to_syslog: yes
logfile: /var/log/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

amf {
mode: disabled
}

service {
        # Load the Pacemaker Cluster Resource Manager
        ver:       0
        name:      pacemaker
        use_mgmtd: yes
}

[root@HA1 ~]# scp /etc/corosync/corosync.conf HA2:/etc/corosync/corosync.conf
[root@HA1 ~]# service corosync start
Starting Corosync Cluster Engine (corosync):               [  OK  ]
[root@HA1 ~]# chkconfig --level 2345 corosync on
[root@HA1 ~]# chkconfig --level 2345 heartbeat off

Run on HA2:
[root@HA2 ~]# chown root:root /etc/corosync/authkey
[root@HA2 ~]# chmod 400 /etc/corosync/authkey
[root@HA2 ~]# service corosync start
Starting Corosync Cluster Engine (corosync):               [  OK  ]
[root@HA2 ~]# chkconfig --level 2345 corosync on
[root@HA2 ~]# chkconfig --level 2345 heartbeat off

3.3 Install and configure ldirectord
[root@HA1 ~]# cat /etc/ha.d/ldirectord.cf

checktimeout=5
checkinterval=7
autoreload=yes
logfile="/var/log/ldirectord.log"
quiescent=yes
emailalert=shidl@baihe.com
# A server with a page at the main root of the site that displays "Nginx"
virtual=192.168.0.120:80
	real=192.168.0.66:80 gate
	real=192.168.0.69:80 gate
	service=http
	request="/lb.html"    # create lb.html in the document root with the content "live"
	receive="live"
	scheduler=wlc
	protocol=tcp
	checktype=negotiate
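
With checktype=negotiate, ldirectord requests the page named in request= and looks for the receive= string in the response, so each real server needs that page. A small sketch of creating and checking it follows. The helper names are made up, and the document root is passed in as an argument because it depends on the nginx configuration.

```shell
# Create the health-check page that ldirectord requests (request="/lb.html")
# with the body it expects (receive="live").
# $1 = nginx document root (an assumption; check the "root" directive)
make_health_page() {
    printf 'live\n' > "$1/lb.html"
}

# Check a response body on stdin the way the negotiate check does:
# succeed if it contains the expected string.
body_matches() {
    grep -qF -- "$1"
}

# Example:
#   make_health_page /usr/local/nginx/html
#   body_matches live < /usr/local/nginx/html/lb.html && echo "check would pass"
```

Removing or renaming lb.html on one real server is then a simple way to take it out of the pool for maintenance.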

3.4 Configure /etc/hosts
[root@HA1 ~]# cat /etc/hosts

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1		vpc localhost.localdomain localhost
::1		localhost6.localdomain6 localhost6
192.168.10.1	HA1
192.168.10.2	HA2

3.5 Configure the lo:0 device

[root@HA1 ~]# cat >>/etc/sysconfig/network-scripts/ifcfg-lo:0<<EOF
DEVICE=lo:0
IPADDR=192.168.0.120
NETMASK="255.255.255.255"
ONBOOT=yes
NAME=loopback

EOF

3.6 Enable forwarding and restrict ARP
[root@HA1 ~]# vi /etc/sysctl.conf
Change net.ipv4.ip_forward = 0 to net.ipv4.ip_forward = 1
and add the following lines:

net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.eth0.arp_announce = 2

[root@HA1 ~]# sysctl -p

# Copy the configuration to HA2
[root@HA1 ~]# scp /etc/ha.d/{ha.cf,authkeys,ldirectord.cf} HA2:/etc/ha.d/
[root@HA1 ~]# scp /etc/{hosts,sysctl.conf} HA2:/etc/
[root@HA1 ~]# scp /etc/sysconfig/network-scripts/ifcfg-lo:0 HA2:/etc/sysconfig/network-scripts/

On HA2, edit /etc/ha.d/ha.cf
and change ucast eth1 192.168.10.2 to: ucast eth1 192.168.10.1
Then apply the sysctl.conf settings:
[root@HA2 ~]# sysctl -p
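
The ucast edit on HA2 can also be scripted right after the copy. This is a sketch with a hypothetical helper, set_ucast_peer, which rewrites the address on the ucast line for the given interface and leaves every other line untouched.

```shell
# Point the ucast directive in ha.cf at the peer's heartbeat address.
# Usage: set_ucast_peer <path to ha.cf> <interface> <peer IP>
set_ucast_peer() {
    conf=$1 nic=$2 peer=$3
    tmp=$(mktemp) || return 1
    # Replace only the address that follows "ucast <interface>".
    sed "s/^ucast[[:space:]]\{1,\}$nic[[:space:]]\{1,\}[0-9.]\{1,\}/ucast $nic $peer/" \
        "$conf" > "$tmp" && cat "$tmp" > "$conf"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Example, run on HA2:
#   set_ucast_peer /etc/ha.d/ha.cf eth1 192.168.10.1
```

Writing through a temporary file and back with cat keeps the ownership and permissions of the existing ha.cf.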

3.7 Install and configure nginx on HA1 and HA2
Write the nginx LSB resource agent script (mind the nginx installation path):
[root@HA1 ~]# cat /etc/init.d/nginxd

#!/bin/bash
# (bash is required: the script uses [[ ]], (( )) and &>)

# source function library
. /etc/rc.d/init.d/functions

# Source networking configuration.
. /etc/sysconfig/network

# Check that networking is up.
[ "${NETWORKING}" = "no" ] && exit 0

RETVAL=0
prog="nginx"

nginxDir=/usr/local/nginx
nginxd=$nginxDir/sbin/nginx
nginxConf=$nginxDir/conf/nginx.conf
nginxPid=$nginxDir/nginx.pid

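nginx_check()
{
    if [[ -e $nginxPid ]]; then
        # Test the PID recorded in the PID file; grepping ps output
        # for "nginx" would also match this script (nginxd).
        if kill -0 "$(cat "$nginxPid")" 2>/dev/null; then
            echo "$prog already running..."
            # LSB: starting a service that already runs counts as success.
            exit 0
        else
            # Stale PID file: remove it and carry on.
            rm -f "$nginxPid" &> /dev/null
        fi
    fi
}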

start()
{
    nginx_check
    if (( $? != 0 )); then
        true
    else
        echo -n $"Starting $prog:"
        daemon $nginxd -c $nginxConf
        RETVAL=$?
        echo
        [ $RETVAL = 0 ] && touch /var/lock/subsys/nginx
        return $RETVAL
    fi
}

stop()
{
    echo -n $"Stopping $prog:"
    killproc $nginxd
    RETVAL=$?
    echo
    [ $RETVAL = 0 ] && rm -f /var/lock/subsys/nginx $nginxPid
}

reload()
{
    echo -n $"Reloading $prog:"
    killproc $nginxd -HUP
    RETVAL=$?
    echo
}

case "$1" in
        start)
                start
                ;;
        stop)
                stop
                ;;
        restart)
                stop
                start
                ;;
        reload)
                reload
                ;;
        status)
                status $prog
                RETVAL=$?
                ;;
        *)
                echo $"Usage: $0 {start|stop|restart|reload|status}"
                RETVAL=1
esac
exit $RETVAL

[root@HA1 ~]# chmod +x  /etc/init.d/nginxd
[root@HA1 ~]# scp /etc/init.d/nginxd HA2:/etc/init.d/nginxd

[root@HA1 ~]# service network restart
[root@HA1 ~]# service heartbeat start

[root@HA2 ~]# service network restart
[root@HA2 ~]# service heartbeat start

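4. Configure the cluster resources:

The OCF agent scripts shipped with Heartbeat and other applications may contain errors. They can be debugged as follows:
To check an OCF script, first set the environment variables it needs. For example, when testing the IPaddr OCF script, you set the value of its ip parameter through an environment variable whose name carries the prefix OCF_RESKEY_. For this example, run: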

export OCF_RESKEY_ip=
/usr/lib/ocf/resource.d/heartbeat/IPaddr validate-all
/usr/lib/ocf/resource.d/heartbeat/IPaddr start
/usr/lib/ocf/resource.d/heartbeat/IPaddr stop

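If this does not work, you are most likely missing a required variable or have mistyped a parameter.

Debugging the ldirectord OCF agent script:
export OCF_ROOT=/usr/lib/ocf
Adjust the following two lines to your environment: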
[root@HA1 ~]# vi /usr/lib/ocf/resource.d/heartbeat/ldirectord

LDIRCONF=${OCF_RESKEY_configfile:-/etc/ha.d/ldirectord.cf}
LDIRECTORD=${OCF_RESKEY_ldirectord:-/usr/sbin/ldirectord}

[root@HA1 ~]# /usr/lib/ocf/resource.d/heartbeat/ldirectord monitor
[root@HA1 ~]# echo $?
7     # monitor returns 7 when ldirectord is not running and 0 when it is running normally

[root@HA1 ~]# crm
crm(live)# configure
crm(live)configure# node HA1
crm(live)configure# node HA2
crm(live)configure# primitive ldirectord ocf:heartbeat:ldirectord \
> params configfile="/etc/ha.d/ldirectord.cf" \
> op monitor interval="30s" timeout="20s" \
> meta migration-threshold="10" target-role="Started"
crm(live)configure# primitive vip ocf:heartbeat:IPaddr2 \
> params lvs_support="true" ip="192.168.0.120" cidr_netmask="24" broadcast="192.168.0.255" \
> op monitor interval="1m" timeout="20s" \
> meta migration-threshold="10"
crm(live)configure# primitive nginxd lsb:nginxd \
> op monitor interval="30s" timeout="20s" \
> meta migration-threshold="10" target-role="Started"
crm(live)configure# group load-balancing vip ldirectord
crm(live)configure# clone cl-nginxd nginxd
crm(live)configure# location prefer-ha1 load-balancing \
> rule $id="prefer-ha1-rule" 100: #uname eq HA1
crm(live)configure# property stonith-enabled="false" \
> no-quorum-policy="ignore" \
> start-failure-is-fatal="false" \
> expected-quorum-votes="2"
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# end
crm(live)# status

============
Last updated: Thu Nov 12 01:00:13 2009
Stack: Heartbeat
Current DC: HA2 - partition with quorum
Version: 1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ HA2 HA1 ]

Clone Set: cl-nginxd
Started: [ HA2 HA1 ]
Resource Group: load-balancing
vip    (ocf::heartbeat:IPaddr2):    Started HA1
ldirectord    (ocf::heartbeat:ldirectord):    Started HA1

crm(live)# quit
bye

5. Verification
[root@HA1 ~]# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.0.120:http wlc
-> 192.168.0.69:http            Route   1      0          0
-> 192.168.0.66:http            Local   1      0          0

Open the site in a browser and check that it works.

Disable the eth1 interface on HA1 and watch the failover on HA2.
[root@HA2 ~]# crm
crm(live)# status

============
Last updated: Thu Nov 12 18:40:54 2009
Stack: Heartbeat
Current DC: HA2 - partition WITHOUT quorum
Version: 1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ HA2 ]
OFFLINE: [ HA1 ]

Clone Set: cl-nginxd
Started: [ HA2 ]
Stopped: [ nginxd:0 ]
Resource Group: load-balancing
vip    (ocf::heartbeat:IPaddr2):    Started HA2
ldirectord    (ocf::heartbeat:ldirectord):    Started HA2

Enable the eth1 interface on HA1 again and watch the resources move back on HA1.

[root@HA1 ~]# crm status

============
Last updated: Thu Nov 12 18:42:55 2009
Stack: Heartbeat
Current DC: HA1 - partition with quorum
Version: 1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ HA2 HA1 ]

Clone Set: cl-nginxd
Started: [ HA1 HA2 ]
Resource Group: load-balancing
vip    (ocf::heartbeat:IPaddr2):    Started HA1
ldirectord    (ocf::heartbeat:ldirectord):    Started HA1

6. References:

Load Balanced MySQL Replicated Cluster
Debian Lenny HowTo

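Nginx high availability with Heartbeat (style 2.x)

December 9, 2009 admin No comments

Heartbeat 1.x cannot monitor the state of resources. To monitor them you either write your own monitoring script or use Mon to watch the service, and run service heartbeat stop to bring Heartbeat down whenever the resource (Nginx) is found to be down, which triggers a failover. Alternatively, use Heartbeat in style 2.x and configure the CRM (Cluster Resource Manager) to manage the resources.

I. Configure Heartbeat in 1.x style (see "Nginx high availability with Heartbeat (style 1.x)")

II. Convert the 1.x configuration to 2.x

1. Add the following lines to ha.cf
# Enable the cluster resource manager (Heartbeat 2.x mode)
crm on
# respawn lists commands that Heartbeat starts and monitors.
# Heartbeat runs the process as the given user (hacluster in this example),
# watches it, and restarts it if it dies.
# The ipfail plugin detects network failures and reacts to them,
# failing cluster resources over if necessary.
respawn hacluster /usr/lib/heartbeat/ipfail
apiauth ipfail gid=haclient uid=hacluster
respawn hacluster /usr/lib/heartbeat/cibmon -d
apiauth cibmon   uid=hacluster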

2. Convert the haresources file into cib.xml
Run the following commands:
mv /etc/ha.d/haresources /etc/ha.d/haresources.bak
/usr/lib/heartbeat/haresources2cib.py /etc/ha.d/haresources.bak
This generates cib.xml under /var/lib/heartbeat/crm.

Once heartbeat has run, it creates cib.xml.last, cib.xml.sig and cib.xml.sig.last in /var/lib/heartbeat/crm. To edit cib.xml after that, delete those three files first: rm -rf /var/lib/heartbeat/crm/cib.xml.*

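The CRM supports two resource types, OCF and LSB:
An LSB script must accept the three arguments start, stop and status.
An OCF script must accept the three arguments start, stop and monitor.
The status and monitor arguments are what the cluster uses to monitor a resource, so they are essential.
For an LSB-style script, when ./nginxd status is run, output containing OK or running means the resource is healthy, and output containing No or stopped means it is not.
For an OCF-style script, when ./nginxd monitor is run, an exit code of 0 means the resource is running normally and 7 means it is not running.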
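
The relation between the two conventions can be written down as a small mapping from the exit code of an LSB status call to the exit code an OCF monitor action should return. This is a sketch: lsb_to_ocf is a hypothetical helper, and treating every status code other than 0 and 3 as a generic error is a choice made for this example.

```shell
# Translate an LSB "status" exit code into an OCF "monitor" exit code.
lsb_to_ocf() {
    case $1 in
        0) echo 0 ;;    # running      -> OCF_SUCCESS
        3) echo 7 ;;    # not running  -> OCF_NOT_RUNNING
        *) echo 1 ;;    # anything else (dead with stale PID or lock file,
                        # unknown)     -> OCF_ERR_GENERIC
    esac
}

# Example inside an agent, assuming an init script at /etc/init.d/nginxd:
#   /etc/init.d/nginxd status >/dev/null 2>&1
#   exit "$(lsb_to_ocf $?)"
```

Returning 7 only for a cleanly stopped service lets the cluster tell "not running" apart from "broken".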

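OCF scripts live in /usr/lib/ocf/resource.d/heartbeat.
LSB scripts are usually in /etc/init.d/.
For example, IPaddr is controlled by an OCF script: /usr/lib/ocf/resource.d/heartbeat/IPaddr

Modify the style 1.x nginxd script so that it accepts the monitor argument and can be used as an OCF agent: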
[root@HA1 ~]# cat /usr/lib/ocf/resource.d/heartbeat/nginxd

#!/bin/bash
# (bash is required: the script uses [[ ]], (( )) and &>)

# source function library
. /etc/rc.d/init.d/functions

# Source networking configuration.
. /etc/sysconfig/network

# Check that networking is up.
[ "${NETWORKING}" = "no" ] && exit 0

RETVAL=0
prog="nginx"

nginxDir=/usr/local/nginx
nginxd=$nginxDir/sbin/nginx
nginxConf=$nginxDir/conf/nginx.conf
nginxPid=$nginxDir/nginx.pid

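nginx_check()
{
    if [[ -e $nginxPid ]]; then
        # Test the PID recorded in the PID file; grepping ps output
        # for "nginx" would also match this script (nginxd).
        if kill -0 "$(cat "$nginxPid")" 2>/dev/null; then
            echo "$prog already running..."
            # Starting a resource that already runs counts as success.
            exit 0
        else
            # Stale PID file: remove it and carry on.
            rm -f "$nginxPid" &> /dev/null
        fi
    fi
}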

start()
{
    nginx_check
    if (( $? != 0 )); then
        true
    else
        echo -n $"Starting $prog:"
        daemon $nginxd -c $nginxConf
        RETVAL=$?
        echo
        [ $RETVAL = 0 ] && touch /var/lock/subsys/nginx
        return $RETVAL
    fi
}

stop()
{
    echo -n $"Stopping $prog:"
    killproc $nginxd
    RETVAL=$?
    echo
    [ $RETVAL = 0 ] && rm -f /var/lock/subsys/nginx $nginxPid
}

reload()
{
    echo -n $"Reloading $prog:"
    killproc $nginxd -HUP
    RETVAL=$?
    echo
}

monitor()
{
    status $prog &> /dev/null
    if (( $? == 0  )); then
        RETVAL=0
    else
        RETVAL=7
    fi
}

case "$1" in
        start)
                start
                ;;
        stop)
                stop
                ;;
        restart)
                stop
                start
                ;;
        reload)
                reload
                ;;
        status)
                status $prog
                RETVAL=$?
                ;;
        monitor)
                monitor
                ;;
        *)
                echo $"Usage: $0 {start|stop|restart|reload|status|monitor}"
                RETVAL=1
esac
exit $RETVAL

Look at the nginxd resource definition in cib.xml:

<primitive class="ocf" id="nginxd_2" provider="heartbeat" type="nginxd">
    <operations>
        <op id="nginxd_2_mon" interval="20s" name="monitor" timeout="10s"/>
    </operations>
</primitive>

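Adjust the following values:
interval="20s"
timeout="10s"
With these settings the resource is checked every 20 seconds, and a check that takes longer than 10 s counts as failed. When the resource is found to be down, the cluster tries to start it again and, if that does not succeed, moves it to the other node. Both values can be made smaller still; with the default of 2 minutes it can look as though the service went down and was neither restarted nor failed over.

3. Create the user and group

Heartbeat needs the haclient group and the hacluster user. If they were not created when Heartbeat was built, carry out this step. Do the same on both nodes and make sure the haclient and hacluster IDs are identical on both.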

groupadd -g 500 haclient

useradd -u 500 -g haclient hacluster

Fix the ownership of the heartbeat directories:
find / -type d -name "heartbeat" -exec chown -R hacluster {} \;
find / -type d -name "heartbeat" -exec chgrp -R haclient {} \;

Without these accounts, starting heartbeat produces the error below and the system is rebooted:
EMERG: Rebooting system.  Reason: /usr/lib/heartbeat/cib

If nginxd is set to start at boot, turn that off:
chkconfig --level 2345 nginxd off

Start heartbeat on both nodes:
service heartbeat start

Start the nginxd resource on HA1:
crm_resource -r nginxd_2 -p target_role -v started

CRM monitoring output:
crm_mon -i1
Refresh in 1s…

============
Last updated: Sun Nov  8 03:20:15 2009
Current DC: ha2 (cc3f9eb0-22be-4b1a-b0c7-706ea75d932f)
2 Nodes configured.
1 Resources configured.
============

Node: ha2 (cc3f9eb0-22be-4b1a-b0c7-706ea75d932f): online
Node: ha1 (ad69968f-2db6-40a0-b71b-7433a689aab9): online

Resource Group: group_1
IPaddr_192_168_2_100        (ocf::heartbeat:IPaddr):        Started ha1
nginxd_2    (ocf::heartbeat:nginxd):        Started ha1
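When scripting around this output, the node that currently runs a resource can be pulled out with awk. This is a small sketch that assumes the line layout shown above (resource name first, then `Started <node>` at the end); `resource_node` is a hypothetical helper, and it reads the crm_mon text from standard input so it can be fed from a saved file or, assuming your crm_mon supports a one-shot mode, directly from the command.

```shell
# resource_node NAME   (hypothetical helper)
# Reads crm_mon-style text on stdin and prints the node on which
# resource NAME is reported as Started.
resource_node() {
    awk -v res="$1" '$1 == res && $(NF-1) == "Started" { print $NF }'
}

# Example using two lines copied from the output above.
printf '%s\n' \
    'IPaddr_192_168_2_100        (ocf::heartbeat:IPaddr):        Started ha1' \
    'nginxd_2    (ocf::heartbeat:nginxd):        Started ha1' |
    resource_node nginxd_2
```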

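III. CRM management

Start/stop a resource
#crm_resource -r nginxd_2 -p target_role -v started
#crm_resource -r nginxd_2 -p target_role -v stopped
Show which node a resource is running on
#crm_resource -W -r nginxd_2
Move a resource from the current node to the other node
#crm_resource -M -r nginxd_2
Move a resource to a specific node
#crm_resource -M -r nginxd_2 -H HA1
Allow a resource to return to its normal node
#crm_resource -U -r nginxd_2
Delete a resource from the CRM
#crm_resource -D -r nginxd_2 -t primitive
Take a resource out of CRM management
#crm_resource -p is_managed -r nginxd_2 -t primitive -v off
Put a resource back under CRM management
#crm_resource -p is_managed -r nginxd_2 -t primitive -v on
Restart a resource
#crm_resource -C -H HA1 -r nginxd_2
Check all nodes for resources not started by the CRM
#crm_resource -P
Check a specific node for resources not started by the CRM
#crm_resource -P -H HA1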

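IV. Testing

1. Manually stop nginx on HA1; heartbeat will try to restart it.
service nginxd stop

2. Rename the nginx configuration file on HA1; when heartbeat's restart attempt fails, it fails over automatically.
mv /usr/local/nginx/conf/nginx.conf /usr/local/nginx/conf/nginx.conf.bak
service nginxd stop

# The resource has failed over automatically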
crm_mon -i1
Refresh in 1s…

============
Last updated: Sun Nov  8 03:37:59 2009
Current DC: ha2 (cc3f9eb0-22be-4b1a-b0c7-706ea75d932f)
2 Nodes configured.
1 Resources configured.
============

Node: ha2 (cc3f9eb0-22be-4b1a-b0c7-706ea75d932f): online
Node: ha1 (ad69968f-2db6-40a0-b71b-7433a689aab9): online

Resource Group: group_1
IPaddr_192_168_2_100        (ocf::heartbeat:IPaddr):        Started ha2
nginxd_2    (ocf::heartbeat:nginxd):        Started ha2

Failed actions:
nginxd_2_monitor_20000 (node=ha1, call=7, rc=7): complete
nginxd_2_start_0 (node=ha1, call=9, rc=1): complete
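The rc values in the "Failed actions" lines are OCF return codes: rc=7 from the monitor operation is the "not running" code that the monitor function in the resource script returns, and rc=1 from start is a generic error. A small lookup table makes such lines easier to read; `ocf_rc_name` is a hypothetical helper name, and it covers only the basic codes 0 to 7.

```shell
# ocf_rc_name RC   (hypothetical helper)
# Translate a numeric OCF return code into its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        *) echo "unknown" ;;
    esac
}

# The two failures shown above:
ocf_rc_name 7
ocf_rc_name 1
```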

On HA1, restore the configuration so that the resource can move back to its normal node:

mv /usr/local/nginx/conf/nginx.conf.bak /usr/local/nginx/conf/nginx.conf
service heartbeat restart

3. Unplug the eth1 network cable on HA1 and see whether the resource fails over automatically

Check the resource status on HA2:
crm_mon -i1
Refresh in 1s…

============
Last updated: Sun Nov  8 04:02:01 2009
Current DC: ha2 (cc3f9eb0-22be-4b1a-b0c7-706ea75d932f)
2 Nodes configured.
1 Resources configured.
============

Node: ha2 (cc3f9eb0-22be-4b1a-b0c7-706ea75d932f): online
Node: ha1 (ad69968f-2db6-40a0-b71b-7433a689aab9): OFFLINE

Resource Group: group_1
IPaddr_192_168_2_100        (ocf::heartbeat:IPaddr):        Started ha2
nginxd_2    (ocf::heartbeat:nginxd):        Started ha2

The resource failed over automatically from HA1 to HA2.

Plug the eth1 cable back in on HA1; the resource moves back to HA1 automatically.
crm_mon -i1
Refresh in 1s…

============
Last updated: Sun Nov  8 04:05:16 2009
Current DC: ha2 (cc3f9eb0-22be-4b1a-b0c7-706ea75d932f)
2 Nodes configured.
1 Resources configured.
============

Node: ha2 (cc3f9eb0-22be-4b1a-b0c7-706ea75d932f): online
Node: ha1 (ad69968f-2db6-40a0-b71b-7433a689aab9): online

Resource Group: group_1
IPaddr_192_168_2_100        (ocf::heartbeat:IPaddr):        Started ha1
nginxd_2    (ocf::heartbeat:nginxd):        Started ha1

Troubleshooting: if something goes wrong, check the heartbeat log to resolve it.

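References:
1. Writing your own OCF Resource Agent Heartbeat Resource Agents
2. 用Heartbeat配置Linux高可用性集群 (Configuring a Linux high-availability cluster with Heartbeat)
3. heartbeat2.x的测试终结篇 (heartbeat 2.x testing, final part)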
4. crm_resource man page
5. Getting Started With Heartbeat

Category: High availability

Nginx high availability with Heartbeat (style 1.x)

December 8, 2009 admin

I. Preparation

1. System: two CentOS 5.4 virtual machines
2. Hostname: HA1, HA2
3. IP addresses: HA1   eth0: 192.168.2.10   eth1: 192.168.10.1
HA2   eth0: 192.168.2.20   eth1: 192.168.10.2
4. VIP: 192.168.2.100   (the IP address that moves on failover)

II. Installation

1. Build and install Nginx
tar xzvf pcre-7.9.tar.gz
cd pcre-7.9
./configure
make
make install
cd ..

tar xzvf nginx-0.7.63.tar.gz
cd nginx-0.7.63
./configure --user=nobody --group=nobody --prefix=/usr/local/nginx --with-http_stub_status_module --with-http_ssl_module
make
make install

The detailed Nginx configuration is omitted here.

2. Build and install Heartbeat

tar xzvf libnet-1.1.2.1.tar.gz
cd libnet
./configure
make
make install
cd ..

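Create the user and group

heartbeat needs the haclient group and the hacluster user. Do the same on both nodes, and make sure the haclient and hacluster IDs are identical on both.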

groupadd -g 500 haclient

useradd -u 500 -g haclient hacluster

tar jxvf STABLE-2.1.4.tar.bz2
cd Heartbeat-STABLE-2-1-STABLE-2.1.4/
./ConfigureMe configure
make
make install
# Copy the configuration files to the right directory
cp doc/ha.cf /etc/ha.d/
cp doc/haresources /etc/ha.d/
cp doc/authkeys /etc/ha.d/
cd /etc/ha.d/   # change to the /etc/ha.d/ directory

III. Configure Heartbeat

Configure the following in /etc/ha.d/:
1. vi authkeys   # node authentication method; the first one, crc, is used here
auth 1
1 crc
# Set the permissions of authkeys to 600
chmod 600 authkeys
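crc only detects corrupted packets and provides no real authentication, so on a network that is not fully trusted sha1 with a shared key is the usual alternative. The sketch below writes such a file; `write_authkeys` is a hypothetical helper, the key is simply a hash of random bytes, and the same generated file would have to be copied to both nodes.

```shell
# write_authkeys FILE   (hypothetical helper)
# Writes an authkeys file that uses sha1 with a random key, mode 600.
write_authkeys() {
    local file=$1 key
    # 32 random bytes hashed to a 40-character hex string
    key=$(head -c 32 /dev/urandom | sha1sum | cut -d' ' -f1)
    # umask in a subshell so the file is never created world-readable
    ( umask 077; printf 'auth 1\n1 sha1 %s\n' "$key" > "$file" )
    chmod 600 "$file"
}

# Example: write to a scratch file first, then inspect it before
# installing it as /etc/ha.d/authkeys on both nodes.
write_authkeys ./authkeys.new
```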

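2. Edit /etc/ha.d/ha.cf:
[root@HA1 ha.d]# cat ha.cf |sed '/^#/d'
# Enable the HA debug log; turning it off once debugging is done is recommended
debugfile /var/log/ha-debug
# Enable the HA log
logfile    /var/log/ha-log
# syslog facility used for logging
logfacility    local0
# Interval between heartbeats, in seconds
keepalive 2
# How long without a heartbeat before the peer is declared dead, in seconds
deadtime 30
# How long without a heartbeat before a warning is logged, in seconds
warntime 10
# Grace period after startup to allow for services restarting; the peer is not declared dead during this time
initdead 120
# The default port is UDP 694; I changed it to 695. If someone else on the LAN also runs Heartbeat with broadcast, you had better pick a different port,
# otherwise authentication may fail
udpport    695
# Use unicast; on HA2 change this to: ucast    eth1 192.168.10.1
ucast    eth1 192.168.10.2
# Whether resources move back once the primary node recovers
auto_failback on
# Watchdog settings
# A watchdog can be implemented as a hardware circuit or as a software timer; it reboots the system automatically when the system fails.
# Under the Linux kernel, the watchdog works as follows: once the watchdog is started (that is, once the /dev/watchdog device has been opened),
# if /dev/watchdog is not written to within a set interval,
# the hardware watchdog circuit or the software timer reboots the system.
watchdog /dev/watchdog
# Node list, primary node first; do not reverse the order
node    HA1
node    HA2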
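The timer directives only make sense in a certain order: keepalive below warntime, warntime below deadtime, and initdead comfortably above deadtime (at least twice deadtime is the commonly cited rule of thumb, treated here as an assumption). A quick sanity check could look like the sketch below; `ha_value` and `check_ha_timers` are hypothetical helpers, and the check assumes plain integer seconds rather than values such as `500ms`.

```shell
# ha_value FILE KEY   (hypothetical helper)
# Print the value of the first uncommented KEY directive in FILE.
ha_value() {
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# check_ha_timers FILE   (hypothetical helper)
# Verify keepalive < warntime < deadtime and initdead >= 2 * deadtime.
check_ha_timers() {
    local cf=$1 keepalive warntime deadtime initdead v
    keepalive=$(ha_value "$cf" keepalive)
    warntime=$(ha_value "$cf" warntime)
    deadtime=$(ha_value "$cf" deadtime)
    initdead=$(ha_value "$cf" initdead)
    for v in "$keepalive" "$warntime" "$deadtime" "$initdead"; do
        case "$v" in
            ''|*[!0-9]*) echo "missing or non-integer timer value"; return 1 ;;
        esac
    done
    [ "$keepalive" -lt "$warntime" ] || { echo "keepalive should be below warntime"; return 1; }
    [ "$warntime" -lt "$deadtime" ] || { echo "warntime should be below deadtime"; return 1; }
    [ "$initdead" -ge $((deadtime * 2)) ] || { echo "initdead should be at least 2 x deadtime"; return 1; }
    echo "timers look consistent"
}
```

On a configured node it would be called as `check_ha_timers /etc/ha.d/ha.cf`; the values used in this article (2, 10, 30, 120) pass the check.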

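3. [root@HA1 ha.d]# cat haresources

# Each line defines one resource group. Resources in a group are started from left to right and stopped from right to left.
# Resources within a group are separated by spaces; different resource groups are independent of each other.
# The first column of a resource group is one of the nodes listed in ha.cf, and it should be the node intended to be the primary.
# Each resource is a script, which can live in /etc/init.d or in /usr/local/etc/ha.d/resource.d.
# These scripts must support the start and stop arguments.
# Script arguments are separated by ::.
# Primary node   VIP      Resource name
HA1    192.168.2.100    nginxd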
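The start and stop ordering described in the comments can be made concrete with a few lines of shell. This is only an illustration of the ordering rule, not how heartbeat itself parses the file; `start_order` and `stop_order` are hypothetical helpers that take one haresources line as a single argument.

```shell
# start_order LINE   (hypothetical helper)
# Drop the node name and print the resources in start order (left to right).
start_order() {
    set -- $1        # intentional word splitting on whitespace
    shift            # discard the node name in the first column
    echo "$*"
}

# stop_order LINE   (hypothetical helper)
# Same resources in stop order (right to left).
stop_order() {
    local out="" r
    set -- $1
    shift
    for r in "$@"; do
        out="$r${out:+ $out}"    # prepend, which reverses the list
    done
    echo "$out"
}

start_order "HA1    192.168.2.100    nginxd"
stop_order  "HA1    192.168.2.100    nginxd"
```

For the line used here, the VIP is brought up before nginxd is started, and nginxd is stopped before the VIP is released.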

4. Write the nginxd resource script and place it in both /etc/rc.d/init.d/ and /etc/ha.d/resource.d/

#!/bin/sh

# source function library
. /etc/rc.d/init.d/functions

# Source networking configuration.
. /etc/sysconfig/network

# Check that networking is up.
[ "${NETWORKING}" = "no" ] && exit 0

RETVAL=0
prog="nginx"

nginxDir=/usr/local/nginx
nginxd=$nginxDir/sbin/nginx
nginxConf=$nginxDir/conf/nginx.conf
nginxPid=$nginxDir/nginx.pid

nginx_check()
{
    if [[ -e $nginxPid ]]; then
        ps aux |grep -v grep |grep -q nginx
        if (( $? == 0 )); then
            echo "$prog already running..."
            exit 1
        else
            rm -rf $nginxPid &> /dev/null
        fi
    fi
}

start()
{
    nginx_check
    if (( $? != 0 )); then
        true
    else
        echo -n $"Starting $prog:"
        daemon $nginxd -c $nginxConf
        RETVAL=$?
        echo
        [ $RETVAL = 0 ] && touch /var/lock/subsys/nginx
        return $RETVAL
    fi
}

stop()
{
    echo -n $"Stopping $prog:"
    killproc $nginxd
    RETVAL=$?
    echo
    [ $RETVAL = 0 ] && rm -f /var/lock/subsys/nginx $nginxPid
}

reload()
{
    echo -n $"Reloading $prog:"
    killproc $nginxd -HUP
    RETVAL=$?
    echo
}

case "$1" in
        start)
                start
                ;;
        stop)
                stop
                ;;
        restart)
                stop
                start
                ;;
        reload)
                reload
                ;;
        status)
                status $prog
                RETVAL=$?
                ;;
        *)
                echo $"Usage: $0 {start|stop|restart|reload|status}"
                RETVAL=1
esac
exit $RETVAL
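One thing to watch in nginx_check above: `ps aux | grep nginx` also matches the init script itself when it is named nginxd, so the check can report "already running" whenever a stale pid file exists. A possible alternative, sketched here under the assumption that the pid file contains a single process ID, is to test that exact process with `kill -0`; `pid_alive` is a hypothetical helper name.

```shell
# pid_alive PIDFILE   (hypothetical helper)
# Succeeds if PIDFILE is readable and the process it names exists.
# kill -0 sends no signal; it only checks that the process can be signalled,
# which is sufficient when the script runs as root, as init scripts do.
pid_alive() {
    local pidfile=$1 pid
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Example: report on the pid file location used by the script above.
if pid_alive /usr/local/nginx/nginx.pid; then
    echo "nginx already running..."
else
    echo "nginx not running"
fi
```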

5. Set up hosts
[root@HA1 ha.d]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1        vpc localhost.localdomain localhost
::1        localhost6.localdomain6 localhost6
192.168.10.1    HA1
192.168.10.2    HA2

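Note: perform steps II and III (installing and configuring heartbeat) on both HA1 and HA2

6. Start heartbeat
Note: keep the clocks of the primary and backup servers synchronized; heartbeat may malfunction if they differ too much.

service heartbeat restart
Check the startup messages in the heartbeat log (the log is very helpful for troubleshooting)
tail -100 /var/log/ha-log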
heartbeat[13821]: 2009/11/07_19:41:27 info: Configuration validated. Starting heartbeat 2.1.4
heartbeat[13822]: 2009/11/07_19:41:27 info: heartbeat: version 2.1.4
heartbeat[13822]: 2009/11/07_19:41:27 info: Heartbeat generation: 1257517561
heartbeat[13822]: 2009/11/07_19:41:27 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
heartbeat[13822]: 2009/11/07_19:41:27 info: glib: ucast: bound send socket to device: eth1
heartbeat[13822]: 2009/11/07_19:41:27 info: glib: ucast: bound receive socket to device: eth1
heartbeat[13822]: 2009/11/07_19:41:27 info: glib: ucast: started on port 695 interface eth1 to 192.168.10.2
heartbeat[13822]: 2009/11/07_19:41:27 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[13822]: 2009/11/07_19:41:27 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[13822]: 2009/11/07_19:41:27 notice: Using watchdog device: /dev/watchdog
heartbeat[13822]: 2009/11/07_19:41:27 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[13822]: 2009/11/07_19:41:27 info: Local status now set to: 'up'
heartbeat[13822]: 2009/11/07_19:41:29 info: Link ha2:eth1 up.
heartbeat[13822]: 2009/11/07_19:41:29 info: Status update for node ha2: status up
harc[13828]:    2009/11/07_19:41:29 info: Running /etc/ha.d/rc.d/status status
heartbeat[13822]: 2009/11/07_19:41:30 info: Comm_now_up(): updating status to active
heartbeat[13822]: 2009/11/07_19:41:30 info: Local status now set to: 'active'
heartbeat[13822]: 2009/11/07_19:41:30 info: Status update for node ha2: status active
harc[13845]:    2009/11/07_19:41:30 info: Running /etc/ha.d/rc.d/status status
heartbeat[13822]: 2009/11/07_19:41:45 info: local resource transition completed.
heartbeat[13822]: 2009/11/07_19:41:45 info: Initial resource acquisition complete (T_RESOURCES(us))
IPaddr[13900]:    2009/11/07_19:41:45 INFO:  Resource is stopped
heartbeat[13864]: 2009/11/07_19:41:45 info: Local Resource acquisition completed.
harc[13939]:    2009/11/07_19:41:45 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[13939]:    2009/11/07_19:41:45 received ip-request-resp 192.168.2.100 OK yes
ResourceManager[13960]:    2009/11/07_19:41:45 info: Acquiring resource group: ha1 192.168.2.100 nginxd
IPaddr[13987]:    2009/11/07_19:41:45 INFO:  Resource is stopped
ResourceManager[13960]:    2009/11/07_19:41:45 info: Running /etc/ha.d/resource.d/IPaddr 192.168.2.100 start
IPaddr[14063]:    2009/11/07_19:41:46 INFO: Using calculated nic for 192.168.2.100: eth0
IPaddr[14063]:    2009/11/07_19:41:46 INFO: Using calculated netmask for 192.168.2.100: 255.255.255.0
IPaddr[14063]:    2009/11/07_19:41:46 INFO: eval ifconfig eth0:0 192.168.2.100 netmask 255.255.255.0 broadcast 192.168.2.255
IPaddr[14046]:    2009/11/07_19:41:46 INFO:  Success
heartbeat[13822]: 2009/11/07_19:41:46 info: remote resource transition completed.

Check the network interface configuration; the VIP is now configured on HA1.
eth0:0    Link encap:Ethernet  HWaddr 00:0C:29:35:6F:D0
inet addr:192.168.2.100  Bcast:192.168.2.255  Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
Interrupt:67 Base address:0x2000
nginx is confirmed to have started as well.
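The VIP check can also be scripted by scanning the interface listing for the address. The sketch below reads the listing on standard input and accepts both the older ifconfig form (`inet addr:192.168.2.100`) and the `ip addr` form (`inet 192.168.2.100/24`); `has_vip` is a hypothetical helper name.

```shell
# has_vip ADDRESS   (hypothetical helper)
# Reads ifconfig or "ip addr" output on stdin; succeeds if ADDRESS
# appears as an IPv4 address on any interface.
has_vip() {
    awk -v vip="$1" '
        $1 == "inet" {
            addr = $2
            sub(/^addr:/, "", addr)   # old ifconfig prefix
            sub(/\/.*/, "", addr)     # prefix length printed by ip(8)
            if (addr == vip) found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Example with the listing shown above; on a live node the input could be
# the output of ifconfig instead.
printf '%s\n' \
    'eth0:0    Link encap:Ethernet  HWaddr 00:0C:29:35:6F:D0' \
    'inet addr:192.168.2.100  Bcast:192.168.2.255  Mask:255.255.255.0' |
    has_vip 192.168.2.100 && echo "VIP is configured"
```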

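If you see the log below, someone on the same network segment may be running a broadcast heartbeat on UDP port 694; switching to a different port may solve the problem.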

heartbeat[9966]: 2009/11/07_00:18:53 info: Configuration validated. Starting heartbeat 2.1.4
heartbeat[9967]: 2009/11/07_00:18:53 info: heartbeat: version 2.1.4
heartbeat[9967]: 2009/11/07_00:18:53 info: Heartbeat generation: 1257517538
heartbeat[9967]: 2009/11/07_00:18:53 info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth1
heartbeat[9967]: 2009/11/07_00:18:53 info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
heartbeat[9967]: 2009/11/07_00:18:53 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[9967]: 2009/11/07_00:18:53 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[9967]: 2009/11/07_00:18:53 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[9967]: 2009/11/07_00:18:53 info: Local status now set to: 'up'
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: process_status_message: bad node [master] in message
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG: Dumping message with 12 fields
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[0] : [t=status]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[1] : [st=active]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[2] : [dt=7530]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[3] : [protocol=1]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[4] : [src=master]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[5] : [(1)srcuuid=0x9696e70(36 27)]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[6] : [seq=1fed7]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[7] : [hg=4aee4ce7]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[8] : [ts=4af3a3d5]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[9] : [ld=0.11 0.03 0.01 1/107 30681]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[10] : [ttl=3]
heartbeat[9967]: 2009/11/07_00:18:55 ERROR: MSG[11] : [auth=1 ba81b6cc]
heartbeat[9967]: 2009/11/07_00:18:55 info: Link ha1:eth1 up.
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: process_status_message: bad node [slave] in message
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG: Dumping message with 12 fields
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[0] : [t=status]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[1] : [st=active]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[2] : [dt=7530]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[3] : [protocol=1]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[4] : [src=slave]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[5] : [(1)srcuuid=0x9696dc8(36 27)]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[6] : [seq=1f94b]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[7] : [hg=4aee4cf3]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[8] : [ts=4af3a3d6]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[9] : [ld=0.00 0.00 0.00 1/105 870]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[10] : [ttl=3]
heartbeat[9967]: 2009/11/07_00:18:56 ERROR: MSG[11] : [auth=1 bcd3be0a]
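In a log like this, the useful detail is which unknown node names are showing up (here master and slave, which belong to someone else's cluster). A short filter can list them; `foreign_nodes` is a hypothetical helper that reads the log on standard input and relies on the "bad node [name] in message" wording shown above.

```shell
# foreign_nodes   (hypothetical helper)
# Reads ha-log text on stdin and prints each unknown node name once.
foreign_nodes() {
    sed -n 's/.*bad node \[\([^]]*\)] in message.*/\1/p' | sort -u
}

# Example with two lines copied from the log above; on a real node:
#   foreign_nodes < /var/log/ha-log
printf '%s\n' \
    'heartbeat[9967]: 2009/11/07_00:18:55 ERROR: process_status_message: bad node [master] in message' \
    'heartbeat[9967]: 2009/11/07_00:18:56 ERROR: process_status_message: bad node [slave] in message' |
    foreign_nodes
```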

IV. Testing
1. Check that manual switchover works
Run /usr/share/heartbeat/hb_standby on HA1 and see whether the VIP moves to HA2
Check the heartbeat log
tail -100 /var/log/ha-log
heartbeat[13822]: 2009/11/07_19:44:33 info: ha1 wants to go standby [all]
heartbeat[13822]: 2009/11/07_19:44:33 info: standby: ha2 can take our all resources
heartbeat[14194]: 2009/11/07_19:44:33 info: give up all HA resources (standby).
ResourceManager[14207]:    2009/11/07_19:44:34 info: Releasing resource group: ha1 192.168.2.100 nginxd
ResourceManager[14207]:    2009/11/07_19:44:34 info: Running /etc/ha.d/resource.d/nginxd  stop
ResourceManager[14207]:    2009/11/07_19:44:34 info: Running /etc/ha.d/resource.d/IPaddr 192.168.2.100 stop
IPaddr[14295]:    2009/11/07_19:44:34 INFO: ifconfig eth0:0 down
IPaddr[14278]:    2009/11/07_19:44:34 INFO:  Success
heartbeat[14194]: 2009/11/07_19:44:34 info: all HA resource release completed (standby).
heartbeat[13822]: 2009/11/07_19:44:34 info: Local standby process completed [all].
heartbeat[13822]: 2009/11/07_19:44:36 WARN: 1 lost packet(s) for [ha2] [83:85]
heartbeat[13822]: 2009/11/07_19:44:36 info: remote resource transition completed.
heartbeat[13822]: 2009/11/07_19:44:36 info: No pkts missing from ha2!
heartbeat[13822]: 2009/11/07_19:44:36 info: Other node completed standby takeover of all resources.
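On HA2 the VIP is now configured and nginx has started.

2. Cut the heartbeat link between the primary and backup nodes and see whether the VIP moves
Bring down the eth1 interface on HA1, then check the heartbeat log on HA2
[root@HA2 ~]# tail -100 /var/log/ha-log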
heartbeat[3753]: 2009/11/07_19:59:36 WARN: node ha1: is dead
heartbeat[3753]: 2009/11/07_19:59:36 WARN: No STONITH device configured.
heartbeat[3753]: 2009/11/07_19:59:36 WARN: Shared disks are not protected.
heartbeat[3753]: 2009/11/07_19:59:36 info: Resources being acquired from ha1.
heartbeat[3753]: 2009/11/07_19:59:36 info: Link ha1:eth1 dead.
harc[4255]:    2009/11/07_19:59:36 info: Running /etc/ha.d/rc.d/status status
heartbeat[4256]: 2009/11/07_19:59:36 info: No local resources [/usr/share/heartbeat/ResourceManager listkeys ha2] to acquire.
mach_down[4276]:    2009/11/07_19:59:36 info: Taking over resource group 192.168.2.100
ResourceManager[4310]:    2009/11/07_19:59:36 info: Acquiring resource group: ha1 192.168.2.100 nginxd
IPaddr[4337]:    2009/11/07_19:59:37 INFO:  Resource is stopped
ResourceManager[4310]:    2009/11/07_19:59:37 info: Running /etc/ha.d/resource.d/IPaddr 192.168.2.100 start
IPaddr[4413]:    2009/11/07_19:59:37 INFO: Using calculated nic for 192.168.2.100: eth0
IPaddr[4413]:    2009/11/07_19:59:37 INFO: Using calculated netmask for 192.168.2.100: 255.255.255.0
IPaddr[4413]:    2009/11/07_19:59:37 INFO: eval ifconfig eth0:0 192.168.2.100 netmask 255.255.255.0 broadcast 192.168.2.255
IPaddr[4396]:    2009/11/07_19:59:37 INFO:  Success
ResourceManager[4310]:    2009/11/07_19:59:37 info: Running /etc/ha.d/resource.d/nginxd  start
mach_down[4276]:    2009/11/07_19:59:38 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[4276]:    2009/11/07_19:59:38 info: mach_down takeover complete for node ha1.
heartbeat[3753]: 2009/11/07_19:59:38 info: mach_down takeover complete.
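The resources moved from HA1 to HA2.

Bring HA1's eth1 interface back up; the resources move from HA2 back to HA1 automatically.

3. Shut down HA1, or stop heartbeat on HA1, and see whether the VIP moves to HA2
The resources moved from HA1 to HA2.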

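V. HA management

Start/stop heartbeat:
service heartbeat start/stop

Check heartbeat status:
[root@HA2 ~]# service heartbeat status
heartbeat OK [pid 4724 et al] is running on ha2 [ha2]…

Manual switchover (hand the local resources over to the remote host):
[root@HA1 ~]# /usr/share/heartbeat/hb_standby
2009/11/07_20:11:03 Going standby [all].

Manual takeover (take the resources over to the local host):
[root@HA2 ~]# /usr/share/heartbeat/hb_takeover

Summary: with the configuration above, when one node goes down the other node takes over the resources. When nginx itself goes down, however, there is no automatic failover; for that you must configure heartbeat style 2.x. See the post "Nginx high availability with Heartbeat (style 2.x)".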

VI. References
1. authkeys configuration reference: http://linux-ha.org/authkeys
2. ha.cf configuration reference: http://linux-ha.org/ha.cf
3. http://logzgh.itpub.net/post/3185/466910
4. http://linux.chinaunix.net/bbs/archiver/?tid-1051263.html

Category: High availability