Yesterday, I was speaking with a customer who is a cloud provider. They were discussing how to use NFSv4 with clustered Data ONTAP for one of their customers. As we were talking, I brought up pNFS and its capabilities. They were genuinely excited about what pNFS could do for their particular use case.
What is pNFS?
pNFS is short for “parallel NFS,” which is a little bit of a misnomer as it doesn’t do parallel reads and writes across different object stores (i.e., striping). Instead, it establishes a metadata path to the NFS server and then splits off the data path to its own dedicated path. The client works with the NFS server to determine which path is local to the physical location of the object store. Think of it as ALUA for NFS.
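To make the split concrete, here's a toy sketch in Python. This is not a real NFS client; every class, method, and variable name here is invented purely to illustrate the idea of a fixed metadata path plus a dynamically chosen data path:

```python
# Toy model of the pNFS metadata/data path split. Hypothetical names only.

class PnfsServer:
    """Metadata server that knows where each file's data physically lives."""
    def __init__(self, layouts):
        # layouts maps a filehandle to the address of the node holding its data
        self.layouts = layouts

    def layoutget(self, filehandle):
        # Analogous to LAYOUTGET: tell the client where this file's data is
        return self.layouts[filehandle]

class PnfsClient:
    def __init__(self, metadata_ip, server):
        self.metadata_ip = metadata_ip   # fixed at mount time
        self.server = server

    def path_for_io(self, filehandle):
        # Reads/writes go to the node local to the data (like ALUA's
        # optimized path); metadata traffic stays on the mount address.
        return self.server.layoutget(filehandle)

server = PnfsServer({"/unix/file1": "10.63.57.237"})
client = PnfsClient("10.63.3.68", server)

assert client.metadata_ip == "10.63.3.68"                    # GETATTR, ACCESS, etc.
assert client.path_for_io("/unix/file1") == "10.63.57.237"   # READ/WRITE
```

The point of the sketch: the metadata address never changes after mount, while the I/O address is looked up per file and can differ from it.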
In the case of pNFS on clustered Data ONTAP, NetApp currently supports file-level pNFS, so the object store would be a flexible volume on an aggregate of physical disks. pNFS is covered in detail in NetApp TR-4063.
There are a number of things that make pNFS a cool new NFS feature, such as:
- All the benefits and security of standard NFSv4.1
- Metadata and data separation
- Transparent-to-client data migration – data agility!
Gotta keep ’em separated!
One thing that is beneficial about the design of pNFS is that the metadata paths are separated from the read/write paths. Once a mount is established, the metadata path is pinned to the IP address used for the mount and does not move without manual intervention. In clustered Data ONTAP, that path could live anywhere in the cluster. (Up to 24 physical nodes with multiple ports on each node!)
That buys you resiliency, as well as flexibility to control where the metadata will be served.
The data path, however, is only established on reads and writes. That path is determined in a conversation between the client and server, and it is dynamic. Any time the physical location of a volume changes, the data path changes automatically, with no intervention needed from the clients or the storage administrator. So, unlike NFSv3 or even NFSv4, you no longer need to break the TCP connection to move the path for reads and writes (via unmount or LIF migrations). And with NFSv4.x, the statefulness of the connection can be preserved.
That means more time for everyone. Data can be migrated in real time, non-disruptively, based on the storage needs of the client.
For example, I have a volume that lives on node cluster01 of my cDOT cluster:
```
cluster::> vol show -vserver SVM -volume unix -fields node
  (volume show)
vserver volume node
------- ------ --------------
SVM     unix   cluster01
```
I have data LIFs on each node in my cluster:
```
cluster::> net int show -vserver SVM
  (network interface show)
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
SVM
            data1      up/up      10.63.57.237/18    cluster01     e0c     true
            data2      up/up      10.63.3.68/18      cluster02     e0c     true
2 entries were displayed.
```
In the above list:
- 10.63.3.68 will be my metadata path, since that’s the address I mounted.
- 10.63.57.237 will be my data path, as it is local to the physical node cluster01, where the volume lives.
When I mount, the TCP connection is established to the node where the data LIF lives:
```
nfs-client# mount -o minorversion=1 10.63.3.68:/unix /unix
```

```
cluster::> network connections active show -remote-ip 10.228.225.140
Vserver     Interface              Remote
Name        Name:Local Port        Host:Port                    Protocol/Service
----------- ---------------------- ---------------------------- ----------------
Node: cluster02
SVM         data2:2049             nfs-client.domain.netapp.com:912
                                                                TCP/nfs
```
My metadata path is established to cluster02, but my data volume lives on cluster01.
On a basic cd and ls into the mount, all the traffic is seen on the metadata path (calls like GETATTR, ACCESS, etc.):
```
83  6.643253  10.228.225.140  10.63.3.68      NFS  270  V4 Call (Reply In 85) GETATTR
85  6.648161  10.63.3.68      10.228.225.140  NFS  354  V4 Reply (Call In 83) GETATTR
87  6.652024  10.228.225.140  10.63.3.68      NFS  278  V4 Call (Reply In 88) ACCESS
88  6.654977  10.63.3.68      10.228.225.140  NFS  370  V4 Reply (Call In 87) ACCESS
```
When I start I/O to that volume, the path gets updated to the local path by way of pNFS calls introduced in NFSv4.1 (RFC 5661):
```
28  2.096043  10.228.225.140  10.63.3.68      NFS  314  V4 Call (Reply In 29) LAYOUTGET
29  2.096363  10.63.3.68      10.228.225.140  NFS  306  V4 Reply (Call In 28) LAYOUTGET
30  2.096449  10.228.225.140  10.63.3.68      NFS  246  V4 Call (Reply In 31) GETDEVINFO
31  2.096676  10.63.3.68      10.228.225.140  NFS  214  V4 Reply (Call In 30) GETDEVINFO
```
- With LAYOUTGET, the client asks the server, “Where does this filehandle live?”
- The server responds with the device ID and physical location of the filehandle.
- Then, the client asks, via GETDEVICEINFO, “Which devices are available to me to access that physical data?”
- The server responds with the list of available devices/IP addresses.
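The two-step lookup above can be sketched as a couple of Python functions. This is a deliberate simplification (real pNFS layouts carry stateids, offsets, and more), and all the names and values here are made up for illustration:

```python
# Simplified model of the LAYOUTGET -> GETDEVICEINFO exchange.

LAYOUTS = {"/unix/testfile": "dev-01"}    # filehandle -> device ID
DEVICES = {"dev-01": ["10.63.57.237"]}    # device ID -> data addresses

def layoutget(filehandle):
    # "Where does this filehandle live?" -> a device ID
    return LAYOUTS[filehandle]

def getdeviceinfo(device_id):
    # "Which addresses can I use to reach that device?" -> list of IPs
    return DEVICES[device_id]

dev = layoutget("/unix/testfile")
addrs = getdeviceinfo(dev)
# The client now opens a new TCP connection to one of these addresses for
# READ/WRITE traffic, leaving the metadata connection untouched.
assert addrs == ["10.63.57.237"]
```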
Once that communication takes place (and note that the conversation occurs in sub-millisecond times), the client then establishes the new TCP connection for reads and writes:
```
32  2.098771  10.228.225.140  10.63.57.237    TCP  74  917 > nfs [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=937300318 TSecr=0 WS=128
33  2.098996  10.63.57.237    10.228.225.140  TCP  78  nfs > 917 [SYN, ACK] Seq=0 Ack=1 Win=33580 Len=0 MSS=1460 SACK_PERM=1 WS=128 TSval=2452178641 TSecr=937300318
34  2.099042  10.228.225.140  10.63.57.237    TCP  66  917 > nfs [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=937300318 TSecr=2452178641
```
And we can see connections established on the cluster for both the metadata and data locations:
```
cluster::> network connections active show -remote-ip 10.228.225.140
Vserver     Interface              Remote
Name        Name:Local Port        Host:Port                    Protocol/Service
----------- ---------------------- ---------------------------- ----------------
Node: cluster01
SVM         data1:2049             nfs-client.domain.netapp.com:917
                                                                TCP/nfs
Node: cluster02
SVM         data2:2049             nfs-client.domain.netapp.com:912
                                                                TCP/nfs
```
Then we start our data transfer on the new path (data path 10.63.57.237):
```
38   2.099798  10.228.225.140  10.63.57.237    NFS  250    V4 Call (Reply In 39) EXCHANGE_ID
39   2.100137  10.63.57.237    10.228.225.140  NFS  278    V4 Reply (Call In 38) EXCHANGE_ID
40   2.100194  10.228.225.140  10.63.57.237    NFS  298    V4 Call (Reply In 42) CREATE_SESSION
42   2.100537  10.63.57.237    10.228.225.140  NFS  194    V4 Reply (Call In 40) CREATE_SESSION
157  2.106388  10.228.225.140  10.63.57.237    NFS  15994  V4 Call (Reply In 178) WRITE StateID: 0x0d20 Offset: 196608 Len: 65536
163  2.106421  10.63.57.237    10.228.225.140  NFS  182    V4 Reply (Call In 127) WRITE
```
If I do a chmod later, the metadata path is used (10.63.3.68):
```
341  27.268975  10.228.225.140  10.63.3.68      NFS  310  V4 Call (Reply In 342) SETATTR FH: 0x098eaec9
342  27.273087  10.63.3.68      10.228.225.140  NFS  374  V4 Reply (Call In 341) SETATTR | ACCESS
```
How do I make sure connections don’t pile up?
With clustered Data ONTAP, you have two options to load balance TCP connections for metadata. You can use the tried-and-true DNS round-robin method, but the NFS server has no idea which IP addresses the DNS server has issued, so there is no guarantee that connections won’t pile up.
Another way to deal with connections is to leverage the clustered Data ONTAP feature for on-box DNS load balancing. This feature allows storage administrators to set up a DNS forwarding zone on a DNS server (BIND, Active Directory or otherwise) to forward requests to the clustered Data ONTAP data LIFs, which can act as DNS servers complete with SOA records! The cluster will determine which IP address to issue to a client based on the following factors:
- CPU load
- overall node throughput
This helps ensure that any TCP connection that is established is done so in a logical manner, based on the performance of the physical hardware.
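The actual selection algorithm is internal to Data ONTAP, but the idea of picking a LIF based on CPU load and node throughput might look something like this hedged Python sketch (the weighting here is purely illustrative, not NetApp's implementation):

```python
# Illustrative least-loaded LIF selection for on-box DNS load balancing.

def pick_lif(lifs):
    """lifs: list of (ip, cpu_load_pct, throughput_mbps). Lower score wins."""
    def score(lif):
        _, cpu, tput = lif
        # Combine the two factors the post lists; weights are made up.
        return cpu + tput / 100.0
    return min(lifs, key=score)[0]

lifs = [
    ("10.63.57.237", 80, 900),   # busy node
    ("10.63.3.68",   15, 120),   # quiet node
]
assert pick_lif(lifs) == "10.63.3.68"   # DNS answer points at the quiet node
```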
For more information regarding on-box DNS load balancing, see NetApp TR-4073 on page 27.
What about that data agility?
What’s great about pNFS is that it is a perfect fit for storage operating systems like clustered Data ONTAP. NetApp and Red Hat worked together closely on the protocol enhancement, and it shows in its overall implementation.
In clustered Data ONTAP, there is the concept of non-disruptive volume moves. This feature gives storage administrators agility and flexibility in their clusters, and gives service and cloud providers a way to charge based on tiers (pay as you grow!).
For example, if I am a cloud provider, I could have a 24-node cluster. Some HA pairs could be All-Flash FAS nodes for high-performance/low-latency workloads. Some HA pairs could use SATA or SAS drives for low-performance/high-capacity storage. If I am providing storage to a customer that wants to implement container-based applications (such as Docker), I could sell storage based on their specific needs. Maybe their applications are only going to run during the summer months. Super! Let’s non-disruptively move those volumes to our flash storage. After the jobs are complete, we can move them back to SATA/SAS drives for storage and even SnapMirror or SnapVault them off to a DR site for safekeeping. Once the job cycle comes back around, I can move the volumes back to flash.
What happens when a volume moves in pNFS?
When a volume move occurs, the client is notified of the change via pNFS calls. When the client attempts to OPEN the file for writing, the server responds, “That file is somewhere else now.”
```
220  24.971992  10.228.225.140  10.63.3.68      NFS  386  V4 Call (Reply In 221) OPEN DH: 0x76306a29/testfile3
221  24.981737  10.63.3.68      10.228.225.140  NFS  482  V4 Reply (Call In 220) OPEN StateID: 0x1077
```
The client says, “cool, where is it now?”
```
222  24.992860  10.228.225.140  10.63.3.68      NFS  314  V4 Call (Reply In 223) LAYOUTGET
223  25.005083  10.63.3.68      10.228.225.140  NFS  306  V4 Reply (Call In 222) LAYOUTGET
224  25.005268  10.228.225.140  10.63.3.68      NFS  246  V4 Call (Reply In 225) GETDEVINFO
225  25.005550  10.63.3.68      10.228.225.140  NFS  214  V4 Reply (Call In 224) GETDEVINFO
```
Then the client uses the new path to start writing.
```
251  25.007448  10.228.225.140  10.63.57.237  NFS  7306  V4 Call WRITE StateID: 0x15da Offset: 0     Len: 65536
275  25.007987  10.228.225.140  10.63.57.237  NFS  7306  V4 Call WRITE StateID: 0x15da Offset: 65536 Len: 65536
```
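Conceptually, a volume move just invalidates the old layout, so the next LAYOUTGET returns the new location. Here's a tiny hedged sketch of that (purely illustrative; real clients track layout stateids and recalls, and these device names and addresses are invented):

```python
# Toy model: a volume move changes what LAYOUTGET returns.

DEVICES = {"dev-01": "10.63.3.68", "dev-02": "10.63.57.237"}
layouts = {"/unix/testfile3": "dev-01"}

def layoutget(fh):
    # Resolve filehandle -> device ID -> data address
    return DEVICES[layouts[fh]]

before = layoutget("/unix/testfile3")
layouts["/unix/testfile3"] = "dev-02"   # volume move: data now on another node
after = layoutget("/unix/testfile3")

assert before == "10.63.3.68"
assert after == "10.63.57.237"          # writes now flow to the new node
```

The client never unmounts anything; it simply redirects its data connection after the next layout lookup.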
Where does pNFS fit into my cloud architecture?
pNFS fits in the cloud because administrators can serve NFS mounts through a variety of clients. For instance, VMware has added support for NFSv4.1 into vSphere 6. Think they aren’t looking ahead to the future to add pNFS support? And Red Hat KVM already supports it. Check out Captain KVM’s blog on setting it up!
That means, as a cloud service provider, you can serve NFS to your customers on tiered storage solutions and move stuff around non-disruptively (yes, ZERO downtime) while your customers never need to do anything except continue their day-to-day business operations. Pretty cool, eh?
What other things does pNFS give me?
- Stateful, guaranteed connections that NFSv3 doesn’t give you
- Integrated and guaranteed locking via a lease-based model to help eliminate the “stale locks” issues seen with databases on NFSv3 for years
- Increased security via ID domain strings and integrated Kerberos support
- NFSv4.x ACL support
Plus, with the upcoming NFSv4.2, we’ll start to see even more enhancements to the overall NFSv4.x protocol. Now that NFS in the cloud is squarely on the radar with the recent Amazon AWS announcement for EFS, it’s time to start thinking seriously about the best ways to implement and leverage file-level pNFS in cloud storage provider environments.