As some of you may have heard, there is currently a known issue with NFS-based datastores (including VSA NFS datastores) after upgrading to vSphere 5.5 Update 1. The issue causes NFS datastores to disconnect and go into an APD (All Paths Down) state. VMware is aware of the problem, and you can follow KB 2076392 for the latest updates.

While going through my Twitter stream this morning, I noticed an interesting question from fellow Blogger and friend Jase McCarty who asked the following:

[Image: vsphere55u1-nfs-apd-alarm-2]

I was quite surprised to hear that no vCenter Alarms were being triggered for this issue. I decided to take a look at the KB to better understand the symptoms and see if there was anything I could do to help. From what I can tell, the only way to identify this particular problem is by looking at the logs, and the KB includes an example of what you would see.

Once I took a look at the logs, I knew there were at least two ways to get alerts. One option would be to leverage vCenter Log Insight and create a query based on the particular string, but not every customer is using Log Insight and it does require a bit of setup. The second, more obvious option for me would be to key off of the VMkernel VOBs that are being generated, which I have written about in the past for detecting duplicate IP addresses for ESXi and the VSAN component threshold count.

Here are the steps to create the vCenter Alarm:

Step 1 - Create a new vCenter Alarm and give it a name. Select "Hosts" for "Monitor" and "Specific event occurring ..." for "Monitor for".

[Image: vsphere55u1-nfs-apd-alarm-0]

Step 2 - For the Trigger, add the following VOB entries (just copy/paste them in):

  • esx.problem.storage.apd.start
  • esx.problem.vmfs.nfs.server.disconnect
  • esx.problem.storage.apd.timeout

Note: The alarm will activate if ANY of the VOBs are seen, since it is an OR statement. It would have been nice to be able to group these together to generate the alarm.
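To make the OR behavior in the note concrete, here is a minimal Python sketch. It is not how vCenter evaluates alarms internally; the names `APD_TRIGGER_EVENTS` and `alarm_would_fire` are hypothetical and only the three event type IDs come from Step 2.

```python
# Event type IDs copied from Step 2; everything else here is illustrative.
APD_TRIGGER_EVENTS = frozenset({
    "esx.problem.storage.apd.start",
    "esx.problem.vmfs.nfs.server.disconnect",
    "esx.problem.storage.apd.timeout",
})


def alarm_would_fire(event_type_ids):
    """Return True if ANY observed event is one of the triggers (OR semantics)."""
    return any(event_id in APD_TRIGGER_EVENTS for event_id in event_type_ids)
```

A single matching event is enough to trip the alarm, which is why the triggers cannot be combined into an "all three must occur" condition.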

[Image: vsphere55u1-nfs-apd-alarm-1]

Once the alarm has been created, you will at least have a way to get notified if you are potentially affected by this problem. I would still highly recommend you subscribe to KB 2076392 for all the latest updates.

14 thoughts on “How to create vCenter Alarm to alert on ESXi 5.5u1 NFS APD issue?”

  1. Is there a way the alarm triggers are reported in the FAT client v/s web Client?
    I have the screenshots, not sure if I can attach to the comment.

  2. The alarms are nice, but I’ve noticed two things about them: 1) they never go from red to green after being tripped, and 2) there’s no information about the datastore that tripped the alarm.

    Yes, the instructions above indicate that there are limitations in the way the alarm trigger works (the “or vs and” factor), but it’s sort of weird to see these alarms tripped after upgrading to 5.5U2 _and_ removing NFS stores from the cluster…

    • Jim,

      1) I forget off hand if you could create an alarm that will send an alert but not stay red. For most cases, admins would want to see it and then ACK, else you never know when an alarm was fired off unless you were watching it.

      2) You’re right, this is an area we could improve in. I would guess that if you were using the API, you could pull more information about the object that tripped the alarm. I thought this was possible within the Events view when an alarm tripped, but I haven’t tested it myself.

    • Yes, you’ll need to identify which alarm environment variable contains that info. Some more details: https://pubs.vmware.com/vsphere-60/index.jsp?topic=%2Fcom.vmware.vsphere.monitoring.doc%2FGUID-AB74502C-5F01-478D-AF66-672AB5B8065C.html and https://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp?topic=/com.vmware.vsphere.bsa.doc_40/vc_admin_guide/working_with_alarms/r_alarm_environment_variables.html

      What I normally do is just print out all environment variables as part of the trigger, and then identify which variable I need for a given alarm. You may also want to check out this recent Reddit thread, which could be helpful: https://www.reddit.com/r/vmware/comments/4b6lpq/change_the_summary_line_of_email_sent_by_vcenter/
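As a rough illustration of that approach, the following Python sketch prints every `VMWARE_ALARM*` variable it finds. It assumes the script is run as the alarm's command action, so that those variables are present in its environment; the function names are hypothetical.

```python
import os


def alarm_variables(environ=None):
    """Return all VMWARE_ALARM* variables as a dict, ordered by name."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in sorted(env) if name.startswith("VMWARE_ALARM")}


def format_alarm_variables(environ=None):
    """Render the variables one per line as 'NAME = value' (blank values included)."""
    return "\n".join(f"{name} = {value}" for name, value in alarm_variables(environ).items())


if __name__ == "__main__":
    # Print to stdout; redirect to a file if the output needs to be kept.
    print(format_alarm_variables())
```

Blank values are printed rather than skipped, so it is easy to see which variables a given event leaves unpopulated.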

      • Hey William – We tried printing out all environmental variables as part of the trigger, however 4 of the variables were blank for this alarm, including the one I expected to contain the datastore name.
        VMWARE_ALARM_EVENT_VM
        VMWARE_ALARM_EVENT_NETWORK
        VMWARE_ALARM_EVENT_DATASTORE
        VMWARE_ALARM_EVENT_DVS

        The rest of the environmental variables we printed out did have info, but did not contain the datastore name 🙁

        • Not all VMWARE_ALARM* variables will always be populated; it depends on the event triggered. In this particular case, I suspect the “datastore” which the alarm triggered off of is stored in another variable …

          Would you mind sharing the other VMWARE_ALARM* properties that were returned?

          • Sure. Here is what we got (redacted):

            VMWARE_ALARM_NAME = [name we gave alarm]
            VMWARE_ALARM_ID = [alarm id]
            VMWARE_ALARM_TARGET_NAME = [host fqdn]
            VMWARE_ALARM_TARGET_ID = [host id]
            VMWARE_ALARM_OLDSTATUS = Gray
            VMWARE_ALARM_NEWSTATUS = Red
            VMWARE_ALARM_TRIGGERINGSUMMARY = Event: All paths are down
            Summary: Device or filesystem with identifier [***********] has entered the All Paths Down state.
            Date: [date alarm triggered]
            Host: [host fqdn]
            Resource pool: [cluster name]
            Data center: [datacenter name]
            Arguments:
            eventTypeId = esx.problem.storage.apd.start
            objectId = [host id]
            objectName = [host fqdn]
            1 = [datastore identifier]

            VMWARE_ALARM_DECLARINGSUMMARY = ([Event alarm expression: All paths are down; Status = Red] OR [Event alarm expression: All Paths Down timed out, I/Os will be fast failed; Status = Red] OR [Event alarm expression: Lost connection to NFS server; Status = Red])
            VMWARE_ALARM_ALARMVALUE = Event details
            VMWARE_ALARM_EVENTDESCRIPTION = Device or filesystem with identifier [***********] has entered the All Paths Down state.
            VMWARE_ALARM_EVENT_USERNAME =
            VMWARE_ALARM_EVENT_DATACENTER = [datacenter name]
            VMWARE_ALARM_EVENT_COMPUTERESOURCE = [cluster name]
            VMWARE_ALARM_EVENT_HOST = [host fqdn]
            VMWARE_ALARM_EVENT_VM =
            VMWARE_ALARM_EVENT_NETWORK =
            VMWARE_ALARM_EVENT_DATASTORE =
            VMWARE_ALARM_EVENT_DVS =

          • Hi William – Just checking in to see if you were able to figure out which variable the datastore name is stored in? Did the additional info below help at all? Thanks!

          • It looks like you may have to construct the Datastore Name from “1 = [datastore identifier]”, as it’s not included as part of the alarm.
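One possible way to pull that identifier out is to parse `VMWARE_ALARM_TRIGGERINGSUMMARY`. This Python sketch assumes the summary is laid out as in the redacted output above, with one `key = value` pair per line after an `Arguments:` line; the function names are hypothetical. Mapping the identifier to a datastore name is a separate lookup and is not shown.

```python
import os


def parse_arguments(summary):
    """Collect the 'key = value' lines that follow the 'Arguments:' marker."""
    arguments = {}
    in_arguments = False
    for raw_line in summary.splitlines():
        line = raw_line.strip()
        if line == "Arguments:":
            in_arguments = True
            continue
        if in_arguments and " = " in line:
            key, value = line.split(" = ", 1)
            arguments[key.strip()] = value.strip()
    return arguments


def datastore_identifier(summary):
    """Return positional argument '1' (the identifier), or None if it is absent."""
    return parse_arguments(summary).get("1")


if __name__ == "__main__":
    # Assumes the script runs as the alarm's command action.
    print(datastore_identifier(os.environ.get("VMWARE_ALARM_TRIGGERINGSUMMARY", "")))
```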
