Wednesday, February 17, 2010

Lab Manager 4.0 Best Practices & Design Considerations

New Version of this article here

Lab Manager is one of my favorite technologies in the market today, but before you install, beware of the limitations!

i. A maximum of 8 ESX hosts can connect to each VMFS3 datastore (LUN). You can use NFS to get around this, but for our use case NFS is not performant enough.

ii. There is a 2TB VMFS3 size limit, and don't start there: we started with 1.2TB LUNs so we could expand as needed. Avoid SSMOVE if at all possible (it works well, but it is slow and painful); if you fill up a LUN, create an extent and/or disable templates and move them to a new datastore.

iii. The only realistic backup option for us is SAN snapshots, and for this to be useful, see item 1 below.

1. It is recommended to run the vCenter and Lab Manager servers as VMs inside the cluster, on the SAN with the guests.

iv. vCenter limits (see the quick sanity check after this list)

1. 32-bit has a maximum of 2000 deployed machines and 3000 registered

2. 64-bit has a maximum of 3000 deployed machines and 4500 registered
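
To make those numbers easy to sanity-check against a planned build, here is a rough back-of-the-envelope script. This is plain Python arithmetic, not anything Lab Manager or vCenter enforces or exposes this way; the thresholds are just the ones listed above, and the example plan at the bottom is made up.

# Back-of-the-envelope check of a planned Lab Manager build against the
# limits listed above (hosts per VMFS3 LUN, VMFS3 LUN size, vCenter
# deployed/registered machine counts). Purely illustrative.
VCENTER_LIMITS = {
    "32-bit": {"deployed": 2000, "registered": 3000},
    "64-bit": {"deployed": 3000, "registered": 4500},
}
MAX_HOSTS_PER_VMFS3_LUN = 8
MAX_VMFS3_LUN_TB = 2.0

def check_plan(vcenter_arch, deployed, registered, hosts_per_lun, lun_size_tb):
    """Return a list of warnings for any limit the plan exceeds."""
    limits = VCENTER_LIMITS[vcenter_arch]
    warnings = []
    if deployed > limits["deployed"]:
        warnings.append(f"{deployed} deployed VMs exceeds the {vcenter_arch} "
                        f"vCenter limit of {limits['deployed']}")
    if registered > limits["registered"]:
        warnings.append(f"{registered} registered VMs exceeds the {vcenter_arch} "
                        f"vCenter limit of {limits['registered']}")
    if hosts_per_lun > MAX_HOSTS_PER_VMFS3_LUN:
        warnings.append(f"{hosts_per_lun} hosts on one VMFS3 LUN exceeds the "
                        f"limit of {MAX_HOSTS_PER_VMFS3_LUN}")
    if lun_size_tb > MAX_VMFS3_LUN_TB:
        warnings.append(f"{lun_size_tb}TB LUN exceeds the VMFS3 limit of "
                        f"{MAX_VMFS3_LUN_TB}TB")
    return warnings

# Example plan: 64-bit vCenter, 2500 deployed / 4000 registered VMs,
# 8 hosts per LUN, 1.2TB LUNs -- this one passes every check.
for warning in check_plan("64-bit", 2500, 4000, 8, 1.2):
    print("WARNING:", warning)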

Best Practices & What we’ve learned

i. Make LUNs 10x the size of your template(s)

ii. Shared Storage is generally the first bottleneck.

1. I used all RAID 1+0, since we observed this bottleneck on our first LBM deployment and our application is database-driven (disk I/O intensive)

iii. We have averaged 80-100 VMs per blade, which means our LBM 4.0 environment should top out at approximately 32 hosts (two full HP c7000 enclosures) in a single cluster; otherwise you lose the advantages of Host Spanning Transport Networks. The sketch after this list runs the numbers.

iv. Plan for LOTS of IP addresses. I recommend at least a /20 for LBM installs (4,096 addresses, as shown in the sketch below); you do not want to have to re-IP Lab Manager guests, and we've done that before.

v. Create Gold Master libraries for your users; this helps prevent the 30-disk chain limit from being hit as often.

vi. Encourage Libraries, not snapshots

vii. Do not allow users to create or import templates; allow export only.

viii. Do not allow users to modify disk, CPU, or memory on VMs.

ix. Storage and deployment leases are the best thing since sliced bread. We recommend 14-28 days for both.

x. Train your users. We even went as far as having two access levels, one for trained users and one for untrained, so the untrained are less dangerous; if they want the advanced features, it forces them to get training.
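
To put numbers on items iii and iv above, here is a small sketch using plain Python and the standard ipaddress module. The subnet is an invented example (pick your own), and the per-blade and host counts are just the figures from this list.

# Item iv: a /20 gives you 4,096 addresses to burn through.
import ipaddress

lab_subnet = ipaddress.ip_network("10.20.0.0/20")  # example subnet for illustration
print(lab_subnet.num_addresses)                    # 4096

# Item iii: 80-100 VMs per blade across a 32-host (2 x c7000) cluster.
hosts = 32
for vms_per_host in (80, 100):
    print(f"{vms_per_host} VMs/host x {hosts} hosts = {vms_per_host * hosts} VMs")
# Roughly 2,560-3,200 deployed VMs, which is why the 64-bit vCenter
# deployed-machine limit in the section above matters.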

16 comments:

philg said...

Hello

Is Lab Manager intended mainly for short-term testing and developers, or could we also use it for all our test servers? Also, do we have to back up the VMs the traditional way, with agents?

How many VMs do you have in your environment, and how many ESX hosts?

Thanks for your answer.

Brian Smith said...

Lab Manager is best for short testing scenarios, but it is just a layer on top of ESX, so you could use it as a regular VM environment depending on the settings you choose.
So yes, it will work fine for both requirements.
In our old environment we ran about 350 VMs simultaneously on 8 HP BL460 blades. We were disk-bound and could have run more, but our VMs are 4GB-RAM MS SQL VMs.

Dave said...

We're finding that we consume disk space very quickly, and it's hard to determine what's using it or even what to delete to free up large quantities of space. Have you run into these kinds of problems and found a solution? A form of reporting would be nice.

Brian Smith said...

I agree, reporting would be awesome, especially detailed reporting. You can get an idea by clicking on the datastore, but it could be much better. The biggest trick with LBM is not running out of disk space. You can always turn everything off and use SSMOVE.exe to move a disk and all of its linked clones to another LUN, but usually I can't get downtime, so I just unpublish a template and move it to a new LUN so that future deployments don't consume space on the original LUN.

Dave said...

Brian, this is a little off topic from your original post, but I have a question about something you mentioned. When you unpublish a template and move it to a new LUN, what happens to the VMs that were created from that template? If they're not affected, where do they get the base information that the template previously provided?

Brian Smith said...

When you delete a template, it stays on disk until all of its linked clones are gone. LBM is very smart about not freeing up any disk space until everything using it is also gone, so if the template is in a library with no expiration, it could stay there forever. SSMOVE is the only way to actually free up space; moving the template just slows growth. If users keep working from libraries, space will continue to be consumed. We had a number of bad LUNs with pre-canned libraries, and our disk space issues did not go away until we moved those to another LUN as well; since then, usage has slowly dropped over the past few months.

Anonymous said...

Brian, having so many VM's, how do you handle the Active Directory issue? Do you create a DC for every fenced environment?

Brian Smith said...

By the AD issue, I assume you mean two DCs with the same domain/computer name/IP address. If so, yes: we use fenced mode and run dozens of identical DCs simultaneously.

We have a number of prebuilt environments that are only allowed to be deployed in fenced mode and include their own DC, and we also have others that users join to a DC somewhere else. So yes, we use both methods without issue.

Anonymous said...

Good feedback! So what is your experience with having duplicate Fenced Linked Clones contacting the same Active Directory outside the fence for authentication?

Have you seen any hiccups with AD?

I have actually made changes to VMs inside the fence and overwritten the corresponding VM outside the fence, with no authentication issues.

I am curious to know your experience, good or bad, with, say, three identical VMs inside a fence contacting an AD server outside the fence.

Brian Smith said...

We don't do any of that intentionally, but the hiccup would be the workstations' machine-account password changes with the DC, unless you disable those changes via GPO before you create the library. Then the workstations will not keep updating their internal passwords and breaking sync with the DC. See http://technet.microsoft.com/en-us/library/cc785826(WS.10).aspx
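
For reference, the policy that controls this is "Domain member: Disable machine account password changes", which maps to the Netlogon DisablePasswordChange registry value. Below is a minimal sketch of flipping that value directly inside a template guest before you capture it to a library; this is an illustration of the same setting, not the procedure from the post, so treat it as a starting point.

# Minimal sketch: disable machine-account password changes in a Windows
# guest (run inside the guest with admin rights; Windows-only, uses winreg).
# Equivalent in effect to the GPO setting
# "Domain member: Disable machine account password changes".
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Netlogon\Parameters"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                    winreg.KEY_SET_VALUE) as key:
    # 1 = do not change the machine account password automatically
    winreg.SetValueEx(key, "DisablePasswordChange", 0, winreg.REG_DWORD, 1)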

Unknown said...

Greetings,

What is your experience with I/O and storage requirements? I see 10x the template size. Do you mean all templates together, or do you place each template on a different datastore?

It was mentioned to me that the disk space saved is lost to I/O and having to buy more disks.

Brian Smith said...

Linked clones (the disk savings) are still pretty performant, especially if you put a fast RAID behind them such as RAID 1+0; more spindles is better. I'm sure some performance is lost because of the disk savings, but honestly it is not noticeable. To answer your question about putting one template per LUN: sometimes we do that, but more often than not we put 5 to 10 templates on a LUN, choosing a few that will have a lot of disk I/O and a few that will have minimal disk I/O to even out the load. So if we had four 25GB VMs, that's 100GB, and we'd give that LUN 1000GB of space.
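
For what it's worth, that sizing arithmetic (the 10x-the-template-footprint rule from the post, using the four 25GB templates from this comment as the example) looks like this as a trivial sketch:

# Rough LUN sizing per the "10x the template footprint" rule of thumb.
SIZE_MULTIPLIER = 10

template_sizes_gb = [25, 25, 25, 25]          # four 25GB templates on one LUN
footprint_gb = sum(template_sizes_gb)         # 100GB of templates
lun_size_gb = footprint_gb * SIZE_MULTIPLIER  # 1000GB LUN for linked-clone growth

print(f"{footprint_gb}GB of templates -> provision a {lun_size_gb}GB LUN")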

Brian Smith said...

Updated guide for 4.0.3 released today
http://bsmith9999.blogspot.com/2011/03/lab-manager-4.html

4.0.3 Best Practices

Anonymous said...

Hi. I am deploying Windows 7 64-bit in Lab Manager in fenced mode. Every time I deploy the machines, the network location is set to Public; I need them set to Private, and every time I have to change them manually. Can you please tell me how I can resolve this?

Brian Smith said...

Dibesh, I have not had that exact issue. I believe GSS (VMware Tech Support) would probably know how to resolve that quickly; you should give them a call.