RAC Attack - Oracle Cluster Database at Home/Clusterware and Fencing
The goal of this lab is to demonstrate Oracle Clusterware’s fencing ability by forcing a configuration that triggers its built-in fencing features. With Oracle Clusterware, fencing is handled at the node level by rebooting the non-responsive or failed node. This is similar to the Shoot The Other Machine In The Head (STOMITH) algorithm, except that the node reboots itself (a suicide) rather than being shot by another machine. There are many good sources for more information online.
Start with a normal, running cluster with the database instances up and running.
Monitor the Clusterware logfiles on each node. On each node, open a new window for each logfile and run the following commands:
[oracle@<node_name> ~]$ tail -f /u01/grid/oracle/product/11.2.0/grid_1/log/`hostname -s`/crsd/crsd.log
[oracle@<node_name> ~]$ tail -f /u01/grid/oracle/product/11.2.0/grid_1/log/`hostname -s`/cssd/ocssd.log
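If opening two windows per node is inconvenient, both logs can be followed from one window, since tail accepts several files at once. The following is a minimal sketch, assuming the same Grid home as in the commands above; the helper name clusterware_logs and the GRID_HOME variable are illustrative and not part of Oracle's tooling.

```shell
#!/bin/sh
# GRID_HOME is an assumption taken from the paths used in this lab; adjust it
# if your Grid Infrastructure home differs.
GRID_HOME="${GRID_HOME:-/u01/grid/oracle/product/11.2.0/grid_1}"

# Hypothetical helper: print the crsd and ocssd log paths for a node.
# With no argument it falls back to the local short hostname.
clusterware_logs() {
    node="${1:-$(hostname -s)}"
    printf '%s\n' \
        "$GRID_HOME/log/$node/crsd/crsd.log" \
        "$GRID_HOME/log/$node/cssd/ocssd.log"
}

# Usage: follow both logs in one window. tail prints a "==> file <==" header
# each time the output switches from one file to the other.
#   tail -f $(clusterware_logs)
```

Keeping the two logs in separate windows, as in the lab steps, is still easier to read when both files are busy at the same time.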
We will simulate “unplugging” the network interface by taking one of the private network interfaces down. On the collabn2 node, take the private network interface down by running the following command (as the root user):
[root@collabn2 ~]# ifconfig eth1 down
Alternatively, you can simulate this by taking the HostOnly network adapter offline in VMware.
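Before watching the logs, it can help to confirm that the interface really is down. One way is to read the kernel's view of the link from sysfs. This is a sketch only, assuming a Linux guest that exposes /sys/class/net; the function name iface_state is hypothetical, and its second argument exists purely so the function can be exercised against a test directory.

```shell
#!/bin/sh
# Hypothetical helper: print the operational state of a network interface
# ("up", "down", "unknown", ...) as reported by the kernel in sysfs.
# $1 = interface name, $2 = optional sysfs directory (defaults to /sys/class/net).
iface_state() {
    iface="$1"
    sysfs="${2:-/sys/class/net}"
    if [ -r "$sysfs/$iface/operstate" ]; then
        cat "$sysfs/$iface/operstate"
    else
        # The interface does not exist or its state cannot be read.
        echo "missing"
        return 1
    fi
}

# Usage on collabn2, before and after taking the interface down:
#   iface_state eth1
```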
After running this command, watch the logfiles you began monitoring earlier. You should see errors in those logfiles, and eventually (it can take a minute or two) you will observe one node reboot itself.
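To record when a node goes away and when it comes back, a small polling loop can complement the logfiles. The sketch below is an illustration rather than part of the lab: the function name watch_probe is hypothetical, and it accepts any command as the probe. It prints a timestamped line only when the probe's result changes.

```shell
#!/bin/sh
# Hypothetical helper: run a probe command COUNT times, INTERVAL seconds apart,
# and print a timestamped "up" or "down" line whenever the result changes.
# $1 = number of rounds, $2 = seconds between rounds, rest = probe command.
watch_probe() {
    count="$1"
    interval="$2"
    shift 2
    last=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # The probe's exit status decides the state; its output is discarded.
        if "$@" >/dev/null 2>&1; then now="up"; else now="down"; fi
        if [ "$now" != "$last" ]; then
            echo "$(date '+%H:%M:%S') $now"
            last="$now"
        fi
        i=$((i + 1))
        # Sleep between rounds, but not after the final one.
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}

# Usage: poll a node's public hostname for about four minutes.
#   watch_probe 120 2 ping -c 1 -W 1 <node_name>
```

Running one such loop per node from your host machine shows which node rebooted and roughly how long it stayed down, assuming the nodes answer ping on their public addresses.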
If you used ifconfig to trigger the failure, the node will rejoin the cluster after it reboots and the instance should start automatically.
If you used VMware to trigger the failure, the node will not rejoin the cluster.
- Which file has the error messages that indicate why the node is not rejoining the cluster?
- Is the node that reboots always the same as the node with the failure? Why or why not?