Musings of Rodos: July 2010

Wednesday, July 28, 2010

Monitoring your UCS faults with syslog

When you deploy your UCS environment, one of the first things you will want to do is integrate it into your monitoring system. One way is to integrate it with syslog. Here are some notes and tips.

When problems occur in your UCS environment they will appear as Faults inside the Administration area.

One thing to know is that this page only shows you the current alerts; once they clear they disappear.

Here is an example alert exported from my system.

Severity | Code | ID | Affected object | Cause | Last Transition | Description
major | F0207 | 225741 | sys/chassis-1/blade-4/adaptor-1/host-fc-2/fault-F0207 | link-down | 2010-07-28T12:18:59 | Adapter  host interface 1/4/1/2 link state: down

One of the key bits of information you are looking for is the fault code; in the example above it's F0207. With that code you can look the fault up in the Cisco UCS Fault Reference.

If you search the reference for that code, here are the details presented.

fltAdaptorHostIfLink-down
Fault Code:F0207
Message
Adapter [transport] host interface [chassisId]/[slotId]/[id]/[id] link state: [linkState]
Explanation
This fault typically occurs as a result of one of the following issues:
• The fabric interconnect is in End-Host mode, and all uplink ports failed.
• The server port to which the adapter is pinned failed.
• A transient error that caused the link to fail.
Recommended Action
If you see this fault, take the following actions:
Step 1 If an uplink port is disabled, enable the port.
Step 2 If the server port to which the adapter is pinned is disabled, enable that port.
Step 3 Reacknowledge the server with the adapter that has the failed link.
Step 4 If the above actions did not resolve the issue, execute the show tech-support command and contact Cisco technical support.
Fault Details
Severity: major  
Cause: link-down  
mibFaultCode: 207  
mibFaultName: fltAdaptorHostIfLinkDown  
moClass: adaptor:HostIf  
Type: network

All codes are listed, and the fault reference may be valuable the first time you come across an error.

From here you will typically want to send these alerts to your management platform for automated monitoring. A great way to do this is via syslog. Cisco have a good guide, "Set up Syslog for Cisco UCS", that you can follow for doing the configuration. Here is a shot of the page where you set it up.

Once this is configured the alerts will appear in your syslog server.

Here is what our example above looks like as a syslog entry.

Jul 26 01:05:01 192.168.128.16 : 2010 Jul 26 01:08:54 EST: %LOCAL0-3-SYSTEM_MSG: [F0207][major][link-down][sys/chassis-1/blade-4/adaptor-1/host-fc-1] Adapter  host interface 1/4/1/1 link state: down - svc_sam_dme[3250]
Jul 26 01:05:14 192.168.128.16 : 2010 Jul 26 01:09:07 EST: %LOCAL0-3-SYSTEM_MSG: [F0207][cleared][link-down][sys/chassis-1/blade-4/adaptor-1/host-fc-1] Adapter  host interface 1/4/1/1 link state: down - svc_sam_dme[3250]

You can see the fault code F0207, which you can use as a reference. Notice also that I have copied in two entries. The first is the event where the fault occurred, with the severity level "major"; the second states "cleared". You will want to filter out the cleared ones or, if you have a smart system, get it to match the two so you know which events have been resolved.
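As a rough illustration of matching the two entries, here is a minimal Python sketch. It assumes the bracketed field order seen in the two sample lines above (code, severity, cause, affected object) and is not an official UCS log format definition; the names `FAULT_RE`, `parse_fault` and `open_faults` are my own.

```python
import re

# The two sample entries from above, kept verbatim.
RAISED = ("Jul 26 01:05:01 192.168.128.16 : 2010 Jul 26 01:08:54 EST: "
          "%LOCAL0-3-SYSTEM_MSG: [F0207][major][link-down]"
          "[sys/chassis-1/blade-4/adaptor-1/host-fc-1] "
          "Adapter  host interface 1/4/1/1 link state: down - svc_sam_dme[3250]")
CLEARED = ("Jul 26 01:05:14 192.168.128.16 : 2010 Jul 26 01:09:07 EST: "
           "%LOCAL0-3-SYSTEM_MSG: [F0207][cleared][link-down]"
           "[sys/chassis-1/blade-4/adaptor-1/host-fc-1] "
           "Adapter  host interface 1/4/1/1 link state: down - svc_sam_dme[3250]")

# Assumed layout: [code][severity][cause][object] description - process[pid]
FAULT_RE = re.compile(
    r"\[(?P<code>F\d+)\]\[(?P<severity>[^\]]+)\]\[(?P<cause>[^\]]+)\]"
    r"\[(?P<dn>[^\]]+)\]\s*(?P<description>.*?)\s+-\s+\S+\[\d+\]\s*$"
)

def parse_fault(line):
    """Return the fault fields as a dict, or None if the line does not match."""
    match = FAULT_RE.search(line)
    return match.groupdict() if match else None

def open_faults(lines):
    """Replay the log in order and return the faults that were never cleared."""
    active = {}
    for line in lines:
        fault = parse_fault(line)
        if fault is None:
            continue
        key = (fault["code"], fault["dn"])  # same fault on the same object
        if fault["severity"] == "cleared":
            active.pop(key, None)           # resolved, drop it
        else:
            active[key] = fault             # raised or severity changed
    return active

if __name__ == "__main__":
    print(open_faults([RAISED, CLEARED]))
```

Keying on the fault code plus the affected object is a design choice for this sketch; a real monitoring system may have its own correlation rules.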

Hopefully the examples assist some people.

Rodos

Tuesday, July 27, 2010

UCS Platform Emulator

Cisco have released an emulator for the Unified Computing System (UCS). If you are working with UCS you can now run UCSM from your desktop without needing hardware, making training, testing and documentation much easier.

To get started go to the download page at http://developer.cisco.com/web/unifiedcomputing/start and complete the registration form [update : the form validation is very painful, just keep trying, ensure you fill out all fields and maybe put in a valid phone number format]. You can then download the virtual machine which runs the emulated environment. The download is 2.16GB.

Open up the VMX file in your favourite VMware software (I use Fusion on my MacBook) and it will boot, giving itself an IP address. It only uses a single vCPU, 1GB of RAM and close to 6GB of disk.

Most of your activity will be via the GUI, but you can change what your emulated UCS environment looks like via the console of the machine. Log in with the username "config" and password "config" and you are presented with a simple menu.

It's handy being able to set the number of chassis and blades. You don't have a lot of flexibility; for example, all chassis have the same number of blades and you can't have 4 uplinks, only 1 or 2.

Once you have configured your environment, point your web browser to the allocated IP address. Click the "launch" button to load the Java management GUI.

Once you log in you have the standard interface and can interact with many of the elements.

Of course, as it's an emulated platform some things don't work: there is no data path, no SNMP, no KVM, no Telnet/SSH, no CLI, no RBAC and only limited HA functions. Also, the VMware Tools in the machine are out of date. It sounds like a lot but it's still quite functional.

If you or others in your company need to work with UCSM I recommend you check the emulator out.

Rodos

Tuesday, July 20, 2010

Gestalt IT TechFieldDay Seattle - Nimble Storage

Take a pile of smart people with backgrounds from Sun, Netapp and Data Domain, throw in a few PhDs (I assume) and see what falls out; that's Nimble Storage, who launched at Gestalt IT TechFieldDay Seattle.

The company was formed in 2008, based in San Jose. The two founders are

Varun Mehta (Sun, NetApp, Data Domain)
Umesh Maheshwari (Data Domain)

They have some interesting people on their board of directors as well

Suresh Vasudevan (Omneon, NetApp, McKinsey)
Kirk Bowman (Equallogic, VMware, Inktomi)

Nimble call their technology game-changing, taking what was available in separate products and putting it all into one. Nimble covers iSCSI primary storage, backup storage and disaster recovery in a new architecture that combines FLASH and high-capacity, low-cost SATA.

This brings FLASH into the range of many enterprises who would like to use it for more common workloads like Exchange, SQL and VMware. Their target is organisations with 200 to 2000 employees.

Nimble's competition in the iSCSI market, with market shares from IDC, is Equallogic with 35%, EMC with 15%, and HP and Netapp at around 10% each.

Nimble have done the brave thing and started with a clean sheet of paper to try and create something that no one else can deliver.

The problems they are trying to solve are delivering fast performance without all those expensive disks and how to efficiently back it all up plus replicate that data to a second site for continuity purposes.

Techniques include

capacity optimised snapshots rather than backups
FLASH is used to give great performance
replication that is efficient and based on the primary information so that the time to recover and use that data is very quick; you don't need to wait for a restore

A key thing that Nimble bring is their CASL architecture, which provides the following:

Inline Compression. A real time compression engine as data comes in. On primary datasets they are seeing about a 2:1 saving and on things like databases a 4:1 saving. Blocks are variable in size, and Nimble take advantage of current multi-core processors by having a highly threaded software architecture.
Large Adaptive Flash Cache. Flash as a caching layer, starting at 3/4 of a TB for the entry box. They store a copy of all frequently accessed data in flash, but all data is also stored on the cheaper SATA storage.
High-Capacity Disk storage. Using large SATA drives.
Integrated Backup. 60 to 90 days worth of "delta compressed incremental snapshots" can be stored on the system. They have put a lot of work into integration with Microsoft applications, integrating the VSS for ensuring consistency. The snapshot efficiency should remove the requirement for a secondary backup system outside of the primary storage. Combine this with replication to a remote site and you have a protected system.

Nimble showed the results of some testing they performed on an Exchange 2010 19GB database running snaps over 10 days. The other vendor (Equallogic at a guess) consumed over 100GB of data whereas Nimble only consumed 3GB. A 35x improvement was claimed. This then results in less to replicate. It's suspected that the reason for this difference is the smaller and variable block size that Nimble can use; the competitor has a large block size.
Replication. The replication is point in time snapshot replication. One nice thing that you can do is maintain different retention periods at each site. For example you might want to maintain a much higher frequency of snaps locally and a less frequent but longer tail of snaps over at DR, very nice. They have a VMware Site Recovery Manager (SRM) plugin in development but it has not been certified yet. Today you can't cascade replication but it will be coming in a future release. Cascading may be important for people who want to use the Nimble for backup, replicating locally and then offsite.

The benefits that result from CASL are:

Enhanced enterprise application performance
Instant local backups and restores with fast offsite DR
Eliminates high RPM drives, EFDs, separate disk-based backup solution
60%+ lower costs than existing solutions

When you create volumes they can be tuned for various application types, tweaking such things as page size or whether the volume should be cached. The Nimble ships with a set of predefined templates for popular applications. The same goes for snapshot policies, which can be templated, and a predefined set is provided.

Their pricing estimate is under $3 per GB for primary storage, at an entry price of around $50K.

Here are the specs of the units.

There is no 10Gb interface option yet but it will be considered on customer demand. The same goes for having a Fibre Channel interface. The controllers are active/passive on a system (not LUN) basis.

They currently have 10 to 12 beta accounts.

Umesh Maheshwari then gave some further details on the technology behind Nimble. A great discussion from someone who knows the industry and the technologies, as you would expect.

Nimble is all about having the

capacity to store backups (through hi-capacity disks, compression and block sharing) along with
random IO performance for primary storage (through Flash cache for random reads and sequentialized random writes)

This technique of sequentialized writes was developed by Mendel Rosenblum in his PhD thesis in 1991 (see paper). If you don't remember, Mendel was one of the founding brains behind VMware, so his ideas have a good track record. It's called a Log Structured File System.

So why has this not been done before? Well, it took technology a while to catch up to the idea. The original concept relies on the assumption that files are cached in main memory and that increasing memory sizes will make the caches more and more effective at satisfying read requests, hence the disk traffic will become dominated by writes. With only small amounts of RAM available this was a problem. Secondly, the process requires a background job to do garbage collection.

Nimble have created CASL, an implementation of the log based file system. It utilises a large amount of FLASH for the cache and it's integrated closely into the disk based file system. The index or metadata of the system is cached in the Flash and therefore the garbage collection can now work efficiently. Of course cache is a bit of a simple word for what it does. It's not an LRU; there is some complex metadata being tracked for performance.

The second element is the sequential layout of the data on the disks. How you store data on disk could be categorised into 3 different techniques.

1. Write in place. eg. EMC, EqualLogic

it's a very simple layout, you don't need lots of indexes.
reads can go quite well
poor at random writes
parity RAID makes it worse

2. Write anywhere. eg. Netapp WAFL (write anywhere file layout)

more write optimised
between full stripes and random writes
it writes a sequence of writes wherever there is free space, so when you start it is sequential, but after a while the free space becomes fragmented and you end up doing random writes

3. Write sequentially. eg. DataDomain, Nimble CASL

most write optimised
always do your writes in full stripes
good when writing to RAID
the blocks can now be variable size which is very efficient but it has a secondary effect that you now have room to store some metadata about the block such as a checksum
this requires the garbage collection process which runs in idle times to always ensure there is space available for writing full stripes, what makes this work is that the index is in Flash and the power of the current set of processors
the difference between what DataDomain do and CASL is that DD do their sharing based on hashes and CASL does it based on snapshots
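To make the third technique a little more concrete, here is a toy Python sketch of a write-sequentially layout. It is only an illustration of the general log-structured idea (buffer writes, flush full stripes, keep an index of the latest copy, garbage-collect old stripes); it is not how CASL or Data Domain are actually implemented, and the class name `SequentialLog` and its stripe size are my own inventions.

```python
class SequentialLog:
    """Toy log-structured store: writes only ever land as full stripes."""

    def __init__(self, stripe_blocks=4):
        self.stripe_blocks = stripe_blocks
        self.stripes = []   # "disk": each stripe is a list of (key, data)
        self.buffer = []    # writes waiting until a full stripe is available
        self.index = {}     # key -> (stripe_no, slot); stands in for the flash-resident index

    def write(self, key, data):
        self.index.pop(key, None)          # any older on-disk copy is now dead
        self.buffer.append((key, data))
        if len(self.buffer) == self.stripe_blocks:
            self._flush()

    def _flush(self):
        stripe_no = len(self.stripes)
        self.stripes.append(self.buffer)   # one sequential, full-stripe write
        for slot, (key, _) in enumerate(self.buffer):
            self.index[key] = (stripe_no, slot)
        self.buffer = []

    def read(self, key):
        for k, data in reversed(self.buffer):   # newest pending write wins
            if k == key:
                return data
        stripe_no, slot = self.index[key]
        return self.stripes[stripe_no][slot][1]

    def live_blocks(self, stripe_no):
        """Blocks in a stripe that the index still points at."""
        return [(k, d) for slot, (k, d) in enumerate(self.stripes[stripe_no])
                if self.index.get(k) == (stripe_no, slot)]

    def collect(self, stripe_no):
        """Garbage collection: rewrite live blocks at the log head, free the stripe."""
        live = self.live_blocks(stripe_no)
        self.stripes[stripe_no] = []       # stripe is now free for reuse
        for key, data in live:
            self.write(key, data)
```

The garbage collector only has to consult the index to know which blocks are still live, which is why having that index somewhere fast matters.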

Of course this makes you wonder what's the difference between the CASL cache and what many other providers are doing with a Tier of Flash?

Because the cache is backed by disk (the data is in the cache and on the disk) you don't need to protect the data in the cache. This means you can use cheaper flash drives and you don't need to do any parity or mirroring, giving you a saving of 1.3 to 2 times.
It's much easier to evict or throw away data in the cache than it is to demote data out of a Flash tier into a lower one; you don't have to copy any data.
You don't have to be so careful about putting things in cache as it's not an expensive operation, so all writes or reads can be put in cache for fast access if you need them again. Of course cache is a lot more effort to integrate into your file system than tiering, so if you are dealing with legacy it's much harder than when you are starting from scratch like Nimble have.
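The eviction versus demotion point can be shown with a small Python sketch. Both classes are hypothetical toys of my own, not anything from Nimble or a tiering vendor, and the oldest-first eviction is just a placeholder policy (the real cache is described above as smarter than a simple LRU).

```python
class DiskBackedCache:
    """Flash as a cache: every block also lives on disk, so eviction is a plain drop."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.disk = {}
        self.flash = {}   # dicts keep insertion order, oldest entry first

    def write(self, key, data):
        self.disk[key] = data          # the authoritative copy always goes to disk
        self._cache(key, data)

    def read(self, key):
        if key in self.flash:
            return self.flash[key]
        data = self.disk[key]
        self._cache(key, data)         # cheap to cache on read as well
        return data

    def _cache(self, key, data):
        self.flash.pop(key, None)
        self.flash[key] = data
        while len(self.flash) > self.capacity:
            oldest = next(iter(self.flash))
            del self.flash[oldest]     # no copy needed, disk already has it


class FlashTier:
    """Flash as a tier: a block lives in exactly one place, so demotion is a copy."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.disk = {}
        self.flash = {}
        self.demotion_copies = 0

    def write(self, key, data):
        self.disk.pop(key, None)
        self.flash.pop(key, None)
        self.flash[key] = data
        while len(self.flash) > self.capacity:
            oldest = next(iter(self.flash))
            self.disk[oldest] = self.flash.pop(oldest)  # data must move before flash is freed
            self.demotion_copies += 1

    def read(self, key):
        return self.flash[key] if key in self.flash else self.disk[key]
```

In the tier, the flash copy is the only copy, which is also why it needs parity or mirroring; in the cache, losing a flash block just means a slower read from disk.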

Thoughts?

I really got the feeling that Nimble are not trying to be everything to everyone. They are focused on a particular market segment, hitting their pain points and attempting to do it better than the incumbents are.

They have a few things to deliver in my opinion to reach the goal, such as

cascaded replication to offer true local and remote data protection
get the SRM module for VMware certified
it looks hard to scale out if you just need some further storage as you can't add disk shelves, you get what you get. Yet there is nothing in their architecture to preclude some changes here, which is good.

The big question will be whether it is different enough from the competitors for them to get into the market. If your only difference is doing something better (no matter how clever it is under the hood), how easy is it for your competitors to be "good enough" or at a much better price point? Some good marketing, sales force and channel are going to be key.

With CASL, Nimble certainly have some very nice technology, but nice technology does not always win in the market. It's certainly going to be great to see how their early adopters go and how they adjust the hardware range and feature set over the next 12 months!

Note that it's not available in Australia or EMEA yet.

Rodos

Note : Tech Field Day is a sponsored event. Although I receive no direct compensation and take personal leave to attend, all event expenses are paid by the sponsors through Gestalt IT Media LLC. No editorial control is exerted over me and I write what I want, if I want, when I want and how I want.

Monday, July 19, 2010

GestaltIT TechFieldDay Seattle - F5

A big vendor in the networking and Internet market is F5. We visited them on the Gestalt IT TechFieldDay Seattle.

As you can see the room was full of people.

Introduction

Kirby Wadsworth (VP of Global Marketing) gave an overview of who F5 are and what they do. F5 see themselves as the strategic point of control in your data center architecture, optimising the relationships between users and the applications and data that they need.

F5 have 44% of the general application delivery controller market, which includes things such as load balancing and some minor layer 2 to 7 functions. In the advanced market, where you go beyond layer 4 load balancing and take advantage of caching, rate shaping and other elements, the share is higher.

F5 have a broad set of products, most of which run on their BigIP, which is the hardware platform. The BigIP runs the TMOS OS. These products plug in or layer onto TMOS. The core business is certainly around Local Traffic Manager, where connections are balanced across servers. Global Traffic Manager does this across data centers. There are many products in the range:

Local Traffic Manager (LTM)
Global Traffic Manager (GTM)
Link Controller (LC)
Application Security Manager (ASM)
WebAccelerator (WA)
Edge Gateway
WAN Optimization Module (WOM)
Access Policy Manager (APM)

To me one of the most exciting things is that earlier this year F5 released a virtual edition of their Big-IP Local Traffic Manager. The LTM is a great device to run as a virtual machine and thankfully it's not limited in terms of features. Great to see vendors starting to deliver choice to customers in how they would like to run vendors' software! F5 did not make much of a deal about this, especially considering there were some virtualisation people attending. However there is probably not much you can say about it.

Long Distance VMotion

Next we had a demonstration of long distance VMotion. A really interesting part of this was that they use vOrchestrator to control the Big-IPs and the VMware tasks. It was great to see automation being done through Orchestrator workflows. It also shows the power of what you can do with F5 products when you start to pull multiple together and automate them.

I have seen this before at VMworld and it's a little difficult to describe in great detail. If you are interested in it, seek out F5 at VMworld or look for the videos of the event, which will come online at GestaltIT later. There are multiple elements at work, including adjusting the load balancing pools, performing layer 2 over layer 3 tunnels and acceleration of traffic, which is what makes the storage VMotion work in a much faster and more reliable way. The workflow did some nice things: for example, when starting, it removed the server being moved from the balancing pool and then waited for the number of connections to it to clear.
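The drain-before-move step of that workflow is simple to sketch. The Python below is a generic outline only: the four callables are hypothetical placeholders for whatever your load balancer and virtualisation APIs provide (they are not iControl or Orchestrator calls), and re-adding the server to the pool afterwards is my assumption rather than something shown in the demo.

```python
import time

def drain_then_migrate(remove_from_pool, active_connections, migrate, add_to_pool,
                       poll_seconds=1.0, timeout_seconds=300.0, sleep=time.sleep):
    """Take a server out of the pool, wait for its connections to clear, then move it.

    All four callables are supplied by the caller (hypothetical hooks).
    """
    remove_from_pool()                     # stop new connections arriving
    waited = 0.0
    while active_connections() > 0:        # let existing connections finish
        if waited >= timeout_seconds:
            add_to_pool()                  # give up and restore service
            raise TimeoutError("connections did not drain in time")
        sleep(poll_seconds)
        waited += poll_seconds
    migrate()                              # e.g. kick off the long distance VMotion
    add_to_pool()                          # assumption: server rejoins the pool afterwards
```

Passing `sleep` in as a parameter is only there so the function can be exercised without real waiting.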

Automation

Next we had Joe Pruitt (Sr. Strategic Architect, @joepruitt) do a great talk on automation and control through the APIs of F5 technologies. They were very early to support SOAP and cover a lot of languages as you can see below.

We looked at what the APIs covered, which is just about everything you could ever imagine doing. A number of examples were walked through which showed both the simplicity and the power of what you can achieve. They are split between iControl, which covers all of the admin style processes, and iRule, which is the rules for the traffic.

My only issue was that the code examples were not quite real as they contained comments, who comments their code in the real world!

Joe was one of the most enthusiastic presenters across the two days and his passion and joy for the technology really showed, it was great!

Remote Access

We then had a demo of joining some of the F5 products together to provide a bigger and more complex solution, being a global deployment of accelerated remote access. Using the global traffic director they could detect where the user was accessing from, align them with the appropriate entry point into the network (such as the local country) and then accelerate the resulting traffic. It was a good example of how, if you tie all these things together, you can do much more.

ARX

Next was looking at some storage technologies, being ARX. Data is growing and file servers need to become building blocks where you can have policies to place data. ARX does this through open standards, being NFS and CIFS. The ARX is a device that acts as an enterprise class proxy file system. The diagram shown shows the structure.

You can take any storage you want with the characteristics you want and then use policies to move the data around those as required. This is achieved by placing the ARX device in front as a proxy. The ARX appliance looks like a standard client to the lower tiers so will work with many storage systems. The example included Cloud storage but in my opinion this was a little bit of Cloudwashing. Sure, the use case was there but it relied on you using a Cloud provider who presented CIFS/NFS locally to your site; it's not that the ARX could transpose its requests to talk to a Cloud based service (such as S3) directly. It was not an invalid example, but it does rely on a specific bit of technology that is not part of ARX.

The way ARX works is to lay a namespace across all of your tiers, track which bit of data (file) is where, route/proxy the requests accordingly and move the data around the tiers as required. Maintaining the database for routing the requests in real time is a non-trivial problem to solve according to F5; their namespace can contain a billion objects.
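As a very rough mental model of that proxy idea, here is a toy Python sketch. It is in no way how ARX is built; the class `NamespaceProxy` and its methods are hypothetical, and plain dictionaries stand in for the NFS/CIFS tiers.

```python
class NamespaceProxy:
    """Toy global namespace: clients use one path, the proxy knows which tier holds it."""

    def __init__(self, tier_names):
        self.tiers = {name: {} for name in tier_names}  # each dict stands in for a file server
        self.location = {}                              # path -> tier name (the routing database)

    def write(self, path, data, tier):
        old = self.location.get(path)
        if old is not None and old != tier:
            del self.tiers[old][path]                   # keep exactly one copy
        self.tiers[tier][path] = data
        self.location[path] = tier

    def read(self, path):
        return self.tiers[self.location[path]][path]    # route to whichever tier has it

    def move(self, path, tier):
        """What a placement policy would call; the client-visible path never changes."""
        self.write(path, self.read(path), tier)
```

A short usage example, with made-up tier names:

```python
proxy = NamespaceProxy(["fast", "cheap"])
proxy.write("/projects/report.doc", b"draft", "fast")
proxy.move("/projects/report.doc", "cheap")       # policy decides the file has gone cold
data = proxy.read("/projects/report.doc")         # same path, now served from the cheap tier
```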

Curtis Preston discussed the issues around backup and restore with the way the data was laid out. The tiers supporting ARX are where you will probably need to back up, and they do not have all the knowledge. Backup is probably going to be okay but restore is going to be hard and it's not fully baked. If you need to restore something you are going to have to go and ask the ARX where to put the restored file, or where it was previously, so you can go and find it in your backup set.

F5 think the difference with ARX is that you can use multiple vendors on the backend and you do not have to use a stub based solution like some of the alternative technologies.

An interesting last thought on this was the prediction that in a year data traffic management will be better understood, data will be considered another piece of traffic and managed accordingly.

Tour

F5 have a well kitted lab with lots of their equipment along with specialist devices such as network emulation and testing devices. People enjoyed getting back into a server room after a long day.

Thoughts

F5 did a good job; they had some demos and the right technical people presenting who knew their stuff. There might have been a few too many F5 staff filling the room, but when TechFieldDay is in the building no one wants to miss out, right!

The core F5 technology is good and mature; this came through in the earlier presentations. You also got to see how the different products could be combined together. The interesting part was the ARX. I am sure it is a difficult problem to solve at the scales they discussed. However my feeling was it could do with its own interface into some Cloud APIs; maybe they are waiting for further standardisation. The backup and restore is a realistic problem and people will want to have resolved how they might handle it in their environment. Because they are integrating with the tiers as a client, the ability to leverage any great features of those tiers is abstracted or lost (but could be handled directly at that tier). I wonder if there would be any advantage for the ARX to be aware of certain elements to optimise its use of a particular tier's vendor implementation; for example, if it's doing proxy for a DataDomain device it may use a more efficient method or interface (not having a good example for what one might be). The ARX from what I could see only added the large namespace and tiering to the market. I am sure it's not an inexpensive solution, but I wonder if it needs some more tricks up its sleeve than those two to get some key adoption. Certainly something to keep an eye on.

Thanks F5 for an interesting and fruitful few hours.

Rodos

Saturday, July 17, 2010

Seattle TechFieldDay Compellent

Compellent presented this morning at Gestalt IT TechFieldDay in Seattle.

The Live Volume feature looks real interesting for Disaster Avoidance. I was trying to contrast it to VPLEX. Live Volume works at the LUN level, however as I understand it VPLEX might work at sub-LUN. Something to start digging into.

Rodos

Interview Nimble Storage

Launching at Gestalt IT TechFieldDay yesterday was Nimble Storage. Here is a quick video from their CTO and Founder Umesh.

Will post up some details about their storage technology when I get a chance to sit and write.

Rodos

Friday, July 16, 2010

TechFieldDay Seattle - Veeam Video

Veeam was the first vendor off the block today at the GestaltIT TechFieldDay in Seattle.

I caught up with Doug Hazelman later on in the day for a quick recap of what they presented.

Here is a great photo I got of Doug at the end of his presentation.

Rodos

Gestalt IT TechFieldDay Seattle Photos

Here are some photos from the GestaltIT TechFieldDay Seattle.

You can see them by clicking through to Flickr. I will add some more after tomorrow's events.

Rodos

Thursday, July 15, 2010

The aBlock

Back in March last year I posted on how Microsoft were saying that the Azure technology would not be for on premise deployments.

While Windows Azure isn’t something we will license for premises deployment, we will license many of the innovations via future versions of Windows Server and System Center.

The other day Microsoft changed their stance on this as Ballmer announced the Windows Azure Platform Appliance. This is quite a change in stance.

It's not really an appliance; from my reading it's more of a vBlock, so we can call it an aBlock. Microsoft partners with hardware vendors for a reference implementation that is very standardised. It's also interesting who it's targeted at, "designed for service providers, large enterprises and governments". When they say large I suspect they really mean LARGE; this is not something many are going to be able to deploy.

My take is that this is a good move in the right direction. Who knows if it's in preparation for VMware releasing its private/public Cloud software (codenamed Redwood) later this year? But does it go far enough?

For me the key thing is how does this benefit customers? For customers it really does not change anything, unless you are a massive enterprise or government. My gut feeling is that this is not something that is going to be delivered to the public market but rather for private internal Cloud. VMware should be able to deliver that customer experience of run it internally or externally.

It's also worth mentioning that the Azure Appliance is more than just IaaS; it includes SQL Azure. VMware have Zimbra, but databases/storage are key in the Cloud. What are VMware doing with Redis?

Rodos

Wednesday, July 14, 2010

Intro to TechFieldDay Seattle

The GestaltIT TechFieldDay Seattle kicks off tomorrow night and then runs for Thursday and Friday.

I got some good feedback on my past VMworld video diaries (I think mostly from people having a good laugh at me) so I figured I would do some for this event.

Here is an intro video about who's coming, the vendors participating and some initial thoughts.

Now it's off to bed to get some sleep!

Rodos

Disclosure : Tech Field Day is a sponsored event. Although I receive no direct compensation and I take personal leave to attend, all event expenses are paid by the sponsors through Gestalt IT Media LLC. No editorial control is exerted over me or the other delegates.
