Monitoring
Your monitoring system provides the following functions for you.
- Ensures that you are alerted to any pending problems
- Allows you to investigate the current and historical state of your environment to assist in troubleshooting
- Provides uptime and usage information for management reporting
- Provides capacity management projections
- Free space of Datastores
- Free space of Service Consoles
- List of orphaned snapshots
- List of long running snapshots
- Failed (automatic) VMotions
- VMware Tools running in guests
- Size of VC database
- Monitor CPU READY (ms) or CPU %READY per VM per host
- Monitor %CPU BUSY percentages per VM per host
- Monitor network and disk I/O usage per VM per host
- Monitor service console memory swap usage
- Monitor VM balloon memory and swap usage
- Host downtime reporting
- Server hardware faults (power supplies, fans, IO cards, disks, CPUs, RAM)
- SAN hardware faults (disks and vendor specific)
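The CPU READY item in the list above can be reported either in milliseconds or as a percentage, and the two are easy to confuse. As a rough sketch, the conversion is the ready time divided by the length of the sampling interval. The function name and the 20-second default (the interval of vCenter real-time charts) are assumptions for illustration; use the interval of whatever chart or rollup your figures come from.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU ready summation (ms) into a percentage of the sample interval.

    interval_s is an assumption: 20 seconds matches vCenter real-time charts,
    historical rollups use longer intervals.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Ready time as a share of the total time in the interval (in ms).
    return ready_ms * 100.0 / (interval_s * 1000.0)


# 1000 ms of ready time in a 20-second sample is 5% ready.
print(cpu_ready_percent(1000))
```

Note that a summation taken for a whole VM adds up all of its vCPUs, so you may want to divide by the vCPU count before comparing against a per-vCPU threshold.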
Your monitoring will certainly consist of VMware vCenter Server and your hardware monitoring platform. These are often supplemented by a VMware-specific product such as Vizioncore vFoglight, Veeam Monitor or Nimsoft.
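Whichever tool collects the data, checks such as datastore free space and long running snapshots come down to simple threshold tests. The sketch below is illustrative only: the `Datastore` and `Snapshot` records, the 20% free space floor and the 3 day snapshot age limit are all assumptions, and the inventory data would come from your own monitoring tool or scripts rather than from any API shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Datastore:  # hypothetical record, not a vCenter type
    name: str
    capacity_gb: float
    free_gb: float


@dataclass
class Snapshot:  # hypothetical record, not a vCenter type
    vm_name: str
    created: datetime


def low_space_datastores(datastores, min_free_pct=20.0):
    """Return names of datastores whose free space is below min_free_pct."""
    return [
        d.name
        for d in datastores
        if d.capacity_gb > 0 and d.free_gb * 100.0 / d.capacity_gb < min_free_pct
    ]


def long_running_snapshots(snapshots, now, max_age=timedelta(days=3)):
    """Return VM names whose snapshot is older than max_age at time 'now'."""
    return [s.vm_name for s in snapshots if now - s.created > max_age]


# Example with made-up inventory data.
stores = [Datastore("lun01", 500.0, 40.0), Datastore("lun02", 500.0, 250.0)]
snaps = [
    Snapshot("web01", datetime(2024, 1, 1)),
    Snapshot("db01", datetime(2024, 1, 9)),
]
print(low_space_datastores(stores))
print(long_running_snapshots(snaps, now=datetime(2024, 1, 10)))
```

Keeping the thresholds as parameters makes it easy to tighten them for production datastores and relax them for test environments.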
Your management processes and procedures provide the following functions for you.
- A list of maintenance activities to perform on a periodic basis
- Formal health check
- Update templates with patches and updates
- A list of operational procedures on how to perform standard maintenance and troubleshooting tasks.
- A change management impact matrix to detail the potential impact and risk of a particular type of change.
- The procedure to create a new virtual machine
- The procedure to place a new virtual machine within the virtual infrastructure into a Production state. This may be identical to the physical server commissioning procedure.
- The procedure to place an ESX server into and then out of maintenance mode, migrating the guests onto other ESX Server hosts.
- The procedure used to contact VMware for support. It should include contact information and specify contact methods as well as means of collecting information.
- The procedure to add a LUN to an existing ESX server cluster.
- The procedure to patch a template used for creating virtual machines.
- The procedure to create a snapshot of a virtual machine.
- The procedure to restore the virtual machine state to its previous state at the start of the snapshot.
- The procedure for investigating user reported virtual machine performance issues. What to check and how to respond.
- The procedure to add a disk to an existing virtual machine.
- The procedure to expand the size of an existing disk for a virtual machine.
- The procedure to shrink a disk used by a virtual machine.
- The procedure to remove a disk from a virtual machine.
- The procedure to decommission a virtual machine.
- The procedure to migrate (VMotion) a virtual machine between ESX Server hosts in the same ESX cluster.
- The procedure to build an ESX server.
- The procedure to add an ESX server into an existing ESX cluster.
- The procedure to migrate a virtual machine between ESX Server hosts in different ESX clusters (i.e. between datacenters).
- The procedure to confirm that a SAN link is active, to be used after a SAN link has failed and been restored.
- The procedure to confirm that a network link is active, to be used after a network link has failed and been restored.
- The procedure to enable the network group to troubleshoot user reported network / performance issues.
- The procedure for backing up/restoring VMs (VM-level and file-level).
- The procedure for backing up/restoring the VirtualCenter database.
- The procedure for backing up/restoring license server files (or keys).
- The procedure for restoring VirtualCenter Server.
- The procedure for restoring ESX hosts.
Do you have any elements you also find important for Operations? Post in the comments.
Rodos
Very nice synopsis of what is needed. I think people are surprised at the amount of rework that may be needed to operational procedures on the introduction of a Virtual Infrastructure.
Plus it has just saved me a job, cheers. LOL
And for provisioning VMs, think about a tiering model: how many CPUs, how much memory and storage. Simplify your environment and make it predictable.
One might also consider capacity planning in the virtual world. It would be nice to know when additional resources are needed. Maybe even do trending so you know what to expect within a year / three years from now.
Duncan
Great post, Just what I needed today.
Some additions:
- VLAN consistency across the cluster
- HA Level
- Failed (automatic) VMotions
- VMware Tools running
Another one for the list:
Keep your templates updated, e.g. fire them up once a month and run Windows Update.
Thanks for the comments peoples.
Duncan, in the monitoring section I mention capacity management.
Gabe, some of those things I would put inside a regular health check, but given the power of PowerShell scripts these days they could be automated as a daily check.
Have added some of these into the list.
When you use multiple VLANs you also need a procedure to add a VLAN to the ESX cluster.
Do you have these SOPs? Please share to: siva.esx@gmail.com
Thanks