NetApp cDOT, RDB, & Epsilon

By Justin Parisi on 03/12/2015.

NetApp clustered Data ONTAP is designed to eliminate downtime and offer feature-rich, scale-out, cloud-ready storage. It does this by taking a series of HA pairs and connecting them via a cluster network backend to provide a single namespace, a single point of management, and “pay as you grow” functionality.


But how does a cluster know it’s a cluster?

This could qualify as an existential question, akin to Descartes’ “I think, therefore I am,” or for a more modern approach, any android developing sentient capabilities. But, I digress.

The answer is actually simpler than you’d expect…

“A cluster is a cluster because of a replicated database (aka, RDB).”

What is RDB?

RDB comprises 5 applications that have access to read from and write to the database files.

Management gateway (MGWD)

This is the “gateway” into the cluster. It is responsible for things like SSH, SNMP, ZAPI, and reporting cluster health/quorum. This application also communicates with the other applications to ensure configuration changes get added to the appropriate cluster database tables.
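
For example, the per-node health and eligibility reported by “cluster show” comes through mgwd (the output below is illustrative):

cluster::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
node1                 true    true
node2                 true    true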

VIF Manager (vifmgr)

This is the network configuration portion of the cluster. It interacts with mgwd to ensure the database is consistent with the underlying network configuration.
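
The LIF configuration you see with “network interface show” is what vifmgr keeps consistent with the database (the LIF name, address, and port below are just placeholders):

cluster::> network interface show -vserver SVM
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
SVM
            data_lif1  up/up      10.10.10.10/24     node1         e0c     true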

Volume Location Database (VLDB)

This contains records of where volumes live in the cluster. Since a cluster can be up to 24 nodes for NAS environments, it’s important to keep track of where volumes and qtrees are physically located. VLDB also contains cached information about data LIFs for features such as pNFS.
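
You can see the volume-to-node mapping that VLDB tracks with something like the following (illustrative output; the names match the volume created later in this post, and the node placement is hypothetical):

cluster::> volume show -vserver SVM -fields aggregate,node
vserver volume aggregate node
------- ------ --------- -----
SVM     test2  aggr2     node2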

Blocks Communication and Operations Manager (BCOM)

BCOM is the interface responsible for communication between the SAN kernel stack and RDB. It interacts with MGWD to provide updates to the cluster database tables regarding SAN specific information.
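
For example, the LUN information reported by “lun show” makes its way into the SAN-related cluster tables through bcomd (the path, type, and size below are just placeholders):

cluster::> lun show -vserver SVM
Vserver   Path                            State   Mapped   Type        Size
--------- ------------------------------- ------- -------- -------- --------
SVM       /vol/sanvol/lun1                online  mapped   linux       10GB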

Configuration Replication Service (CRS)

This is a new RDB application in 8.3 that is responsible for replicating configuration/operational data from an active cluster to a remote secondary cluster. Think MetroCluster.
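
If you are running MetroCluster in 8.3, you can see the configuration replication relationship at a high level with “metrocluster show” (output below is abbreviated and illustrative; the cluster names are placeholders):

cluster_A::> metrocluster show
Cluster                   Entry Name          State
------------------------- ------------------- -----------
 Local: cluster_A         Configuration state configured
                          Mode                normal
Remote: cluster_B         Configuration state configured
                          Mode                normal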

How does it work?

RDB replicates across every node in a cluster to maintain a consistent copy of the database. To guarantee that no inconsistencies are injected into the database, RDB relies on the notion of quorum, with “epsilon” acting as a tie-breaker. In a 2 node cluster, epsilon is shared across the nodes; in the case of a storage failover, the surviving node assumes epsilon. In clusters greater than 2 nodes, epsilon lives on a single node and acts as a tiebreaker in the cluster as long as half of the nodes (including the node that owns epsilon) are still up. The more nodes you add, the less likely you are to end up in a scenario where you lose quorum. This is covered in detail in NetApp product documentation.
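
In clusters larger than 2 nodes, you can check which node currently owns epsilon at the advanced privilege level (node names and output below are illustrative):

cluster::*> cluster show -fields epsilon
node  epsilon
----- -------
node1 true
node2 false
node3 false
node4 false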

RDB gets updated every time a configuration change is made in the cluster. For instance, if you create a new volume, the tables are updated and the database gets replicated. When a new version of the database is created, the epoch number increments. You can see this information with the advanced privilege command “cluster ring show.”

cluster::*> cluster ring show
Node   UnitName Epoch  DB Epoch DB Trnxs Master Online
------ -------- ------ -------- -------- ------ ---------
node1  mgmt     472    472      883      node1  master
node1  vldb     526    526      19       node1  master
node1  vifmgr   29306  29306    342      node1  master
node1  bcomd    323    323      11       node1  master
node1  crs      47     47       1        node1  master
node2  mgmt     472    472      883      node1  secondary
node2  vldb     526    526      19       node1  secondary
node2  vifmgr   29306  29306    342      node1  secondary
node2  bcomd    323    323      11       node1  secondary
node2  crs      47     47       1        node1  secondary

In the above example, there is an epoch, a DB epoch and DB transaction number for each application. When a change is made to that application’s RDB table, the transaction number will change. If I create a new volume, for instance, the vldb DB transaction number will change:

cluster::*> volume create -vserver SVM -volume test2 -aggregate aggr2 -size 1g
[Job 6610] Job succeeded: Successful
cluster::*> cluster ring show -unitname vldb
Node   UnitName Epoch  DB Epoch DB Trnxs Master Online
------ -------- ------ -------- -------- ------ ---------
node1  vldb     526    526      22       node1  master
node2  vldb     526    526      22       node1  secondary

If one of the applications needs to restart for whatever reason, the epoch number will change (creating a new version of the database), and the transaction number will reset to 1.

cluster::*> cluster ring show -unitname vldb
Node   UnitName Epoch  DB Epoch DB Trnxs Master Online
------ -------- ------ -------- -------- ------ ---------
node1  vldb     527    527      1        node1  master
node2  vldb     527    527      1        node1  secondary

You might also notice there is a “master” and a “secondary.” Each RDB application ring elects its own master, and that master can change if the original master goes down for any reason. There is a lot of resiliency built into clustered Data ONTAP!
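
For example, if the node hosting the vldb master were to go down, you would see the Master column move to another node once the ring re-forms (hypothetical output):

cluster::*> cluster ring show -unitname vldb
Node   UnitName Epoch  DB Epoch DB Trnxs Master Online
------ -------- ------ -------- -------- ------ ---------
node1  vldb     528    528      3        node2  secondary
node2  vldb     528    528      3        node2  master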

Why would an RDB application ever restart?

This can happen for several reasons. For one, if you reboot a node, *all* of the node’s RDB applications restart. Reboots happen for planned and unplanned reasons. But when they do, HA kicks in and data keeps getting served. RDB keeps chugging along, serving up cluster goodness.
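
You can check that HA is in place and ready to take over with “storage failover show” (the output below is illustrative):

cluster::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
node1          node2          true     Connected to node2
node2          node1          true     Connected to node1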

Applications can also fail for the same reasons any application fails: segmentation faults, memory issues, and so on. When this happens, a nanny process called spmctl ensures mission-critical applications restart and resume normal activities. In the event of an application failure, an EMS message gets logged and a core file for that application gets created. When that happens, open up a case and get the core analyzed.

cluster::*> event log show -messagename spm*
Time                Node   Severity Event
------------------- ------ -------- ------------------------------------------
3/10/2015 15:14:53  node1  ERROR    spm.vifmgr.process.exit: Logical Interface Manager(VifMgr) with ID 50822 aborted as a result of signal signal 6. The subsystem will attempt to restart.

cluster::*> system coredump show -type application -node node1
Node:Type         Core Name                                 Saved Panic Time
----------------- ----------------------------------------- ----- -------------------
node1:application vifmgr.50822.4042835970.1426014893.ucore  true  3/10/2015 15:14:53

What’s great about this design, however, is that in most cases a single application can restart and data operations can continue as normal, as opposed to an entire node rebooting. Clustered Data ONTAP is designed in a way that allows fewer disruptions even in unplanned events, keeping your data online and available and ultimately providing truly non-disruptive operations for your business.

For more information on clustered Data ONTAP, see TR-3982: NetApp Clustered Data ONTAP 8.3 and 8.2.x.

Justin Parisi
Tech Mktg Engineer at NetApp
Justin is a Tech Marketing Engineer for all-things NFS around Data ONTAP at NetApp. He is a VMware vExpert, Cisco Champion, and a member of the NetApp A-Team. He also enjoys comic books, video games, photography, music, film, and current events/politics.
