RAC Attack - Oracle Cluster Database at Home/Clusterware and Fencing



The goal of this lab is to demonstrate Oracle Clusterware's fencing ability by forcing a configuration that triggers Oracle Clusterware's built-in fencing features. With Oracle Clusterware, fencing is handled at the node level by rebooting the non-responsive or failed node. This is similar to the Shoot The Other Machine In The Head (STOMITH) algorithm, except that the failed node reboots itself (a "suicide") rather than being shot by another machine. There are many good sources for more information online.



  1. Start with a normal, running cluster with the database instances up and running.
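    A quick way to confirm the starting state is to ask srvctl, as the oracle user on either node. (This sketch assumes the lab database is named RAC, as created earlier in this book; substitute your own database name if it differs.)

        [oracle@collabn1 ~]$ srvctl status database -d RAC

    Both instances should be reported as running before you continue.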


  2. Monitor the logfiles for clusterware on each node. On each node, start a new window and run the following command:

    [oracle@<node_name> ~]$ tail -f \
    > /u01/grid/oracle/product/11.2.0/grid_1/log/`hostname -s`/crsd/crsd.log
    
    [oracle@<node_name> ~]$ tail -f \
    > /u01/grid/oracle/product/11.2.0/grid_1/log/`hostname -s`/cssd/ocssd.log
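
    If you prefer a single window per node, note that tail -f accepts multiple files, so both logs can be watched together (same paths as above):

        [oracle@<node_name> ~]$ GRID_HOME=/u01/grid/oracle/product/11.2.0/grid_1
        [oracle@<node_name> ~]$ tail -f \
        > $GRID_HOME/log/`hostname -s`/crsd/crsd.log \
        > $GRID_HOME/log/`hostname -s`/cssd/ocssd.log

    tail will label each chunk of output with the file it came from, which makes it easy to see whether crsd or ocssd reports a problem first.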
    


  3. We will simulate “unplugging” the network interface by taking one of the private network interfaces down. On the collabn2 node, take the private network interface down by running the following command (as the root user):

    [root@collabn2 ~]# ifconfig eth1 down
    

    Alternatively, you can simulate this by disconnecting the HostOnly network adapter in VMware.

    (Screenshot: RA-vmweb-network-edit.png)


    (Screenshot: RA-vmweb-network-disconnect.png)
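
    Either way, you can verify from the other node that the private interconnect really is unreachable before turning to the logs. (This sketch assumes the private hostname collabn2-priv resolves in /etc/hosts, as set up earlier in this book; substitute the private IP address if your naming differs.)

        [oracle@collabn1 ~]$ ping -c 3 collabn2-priv

    The pings should fail, while the public network remains reachable.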


  4. After taking the interface down, watch the logfiles you began monitoring in step 2 above. You should see errors appear in those logfiles, and eventually (it can literally take a minute or two) one node will reboot itself.

    If you used ifconfig to trigger a failure, then the node will rejoin the cluster and the instance should start automatically.

    If you used VMware to trigger the failure, then the node will not rejoin the cluster.

    • Which file has the error messages that indicate why the node is not rejoining the cluster?
    • Is the node that reboots always the same as the node with the failure? Why or why not?
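
    Once the rebooted node is back up (in the ifconfig case), you can confirm that it has rejoined and that its instance restarted. The crsctl commands below are standard in 11.2, run here as root from the grid home used elsewhere in this lab; the exact resource names shown will depend on your configuration.

        [root@collabn1 ~]# /u01/grid/oracle/product/11.2.0/grid_1/bin/crsctl check cluster -all
        [root@collabn1 ~]# /u01/grid/oracle/product/11.2.0/grid_1/bin/crsctl stat res -t

    The first command reports the CSS/CRS/EVM daemon state on every node; the second lists all clusterware resources, including the database instances, and their current hosts.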



Last modified on 2 July 2013, at 12:12