Monday, July 30, 2012

User’s AD account being locked out by vCenter 5.0 server, how I found the culprit

My vCenter is running on Windows 2008 R2.  I tried running the Microsoft Sysinternals tools to find out which host was locking out this user, and they kept pointing me to vCenter.  Of course, to me it didn’t make any sense why vCenter would be locking this user out every 60 minutes.  I also checked the standard stuff: no services running as the user, nothing under “Credential Manager” on vCenter.  There are also several VMware KBs for dealing with the issue; they want you to use TCPView and some other tools, but none of them helped me on this one.

I’ll start my troubleshooting steps below, picking up after the normal Sysinternals AD “find out which machine is locking the user out” steps.

1) I logged into vCenter and opened Server Manager / Diagnostics / Event Viewer / Windows Logs / Security.

I saw the following 3 Audit Failures happening every 60:16 (60 minutes, 16 seconds).  I’m not sure if that interval is important to the issue, but it gave me the ability to predict the next occurrence: since the last event was at 9:31:30am, I could predict the next one at 10:31:46am.   You can see that under “Account For Which Logon Failed:” the “Account Name:” is the user’s login name; this is key information we’ll use later.

[Screenshot: Security log Audit Failure events]
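
If you’d rather pull these events from a command prompt than click through Event Viewer, something like the following should work.  This is just a sketch: I’m assuming the Audit Failures are Event ID 4625 (“An account failed to log on”), which is what they normally are on a member server like vCenter.

    rem Dump the 5 most recent failed-logon events from the Security log, newest first
    wevtutil qe Security /q:"*[System[(EventID=4625)]]" /c:5 /rd:true /f:text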

2) Logging into vCenter, I see this under “Tasks & Events”; the timestamp is an exact match for the Event Log entry above.  I see nothing that correlates with this timestamp under vCenter Tasks, which makes me think this is being initiated from a source external to the vCenter service.

[Screenshot: matching entry under vCenter Tasks & Events]

3) Before I realized this was happening every 60:16, I had grabbed a large packet capture on my vCenter server.  Now that I could predict the exact second it would happen again, I decided to capture a smaller time window with a script.

I set up a single scheduled Wireshark dumpcap using Windows Task Scheduler.

[Screenshot: Windows Task Scheduler task running dumpcap]

The dumpcap -a parameter specifies a 4-second duration for the packet capture, and -w specifies what to name the output file, lo.pcap in this example.  I dumped the files into a directory called c:\output.  I didn’t use the default location because when I ran the script manually (from the command line, to test it) I received access denied; I was too lazy to troubleshoot that, but I’m pretty security minded, so rather than loosen permissions on the default location I set up a temp directory called output and gave Everyone full control of it.  I set the schedule to start the capture 2 seconds before the planned event, and it successfully gathered the data I needed.
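
For reference, the scheduled task’s action boiled down to a single dumpcap command line, roughly like this.  The Wireshark install path and the interface number (-i 1) are assumptions for illustration; pick the right interface with dumpcap -D on your own box.

    rem Run by the Task Scheduler job, triggered ~2 seconds before the predicted event
    rem -a duration:4 stops the capture after 4 seconds, -w names the output file
    "C:\Program Files\Wireshark\dumpcap.exe" -i 1 -a duration:4 -w C:\output\lo.pcap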

4) Now to analyze what I captured.  Because the user swore he was not manually doing anything to the systems, I thought a search within the file for the user’s name might give me some good data.  I did a Wireshark filter of tcp contains usersloginname.  What I saw was 4 packets containing his name: the first going from vCenter to the DC, then the response from the DC to vCenter, then the same thing repeated.

[Screenshot: Wireshark results for the username filter]

By itself this isn’t all that useful, but IF this problem wasn’t coming FROM vCenter (or a locally installed app), it must be coming over the network.  And if vCenter is talking to AD, most likely something is happening right before this event to cause it.  You can see this communication started with packet 316, 2.74 seconds after I began capturing.  This is a good correlation: since I started the capture 2 seconds early, I was right on time.  I was also able to verify in the Windows Event Log that the issue had happened again right on schedule.  So what was happening on vCenter’s network RIGHT BEFORE packet 316?  I removed my filter and scrolled up a bit; most of the traffic was from vShield Manager, the SQL DB, and misc vCloud items, but one IP stood out, and it was not from my vCloud farm of servers.  This mystery IP seemed like my most likely candidate.  I also did a ping -a 10.x.x.x hoping for a reverse DNS hit to get a hostname, without luck.

To confirm my suspicion that the mystery IP was somehow triggering the lockouts, I switched to analyzing my older, bigger packet capture and filtered by ip.src==10.x.x.x (the mystery IP).  This turned up traffic starting at packet 67701, 305.4 seconds into the capture.  Now, if my hypothesis was correct and this mystery IP was the culprit, then there should be some back and forth with my DC immediately following.  I did another search in this old capture with tcp contains usersloginname and found that traffic at packet 67714, 305.5 seconds into the capture!!!  This means the mystery IP was contacting vCenter and then locking out my user’s AD account, but since it was vCenter talking directly to AD, vCenter was getting the blame for locking the user out.
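
To recap, these are the two Wireshark display filters that did the heavy lifting; substitute the real account name and the real mystery IP (I’ve left them generic here, as above):

    tcp contains usersloginname
    ip.src==10.x.x.x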

5) I located the mystery IP’s VM: it was a VMware vCenter Operations VM using the user’s old password.  We fixed it and then waited until another 60:16 had passed; no new logs in vCenter, no new failures in AD, and the issue is resolved!

Friday, July 27, 2012

Can’t take ownership of a file on Windows 2008 Server

Trying to delete a folder (with files in it) on my server today, for whatever reason I could not enter, delete, or take ownership of this folder.  I tried the UI as well as the takeown command; both returned access denied when trying to take ownership.  Because I am a full administrator, my suspicion was that there was an open file.  Because this is a file server with hundreds of open files and folders, I could not easily identify the connection to close.  I rebooted the server, and without issuing any more commands my access to the files/folders had returned.  GO MICROSOFT!
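
For reference, here is the gist of what I ran (the folder path is made up); the takeown line is what actually failed, and the last two commands are what I would try next time to hunt for the open handle before resorting to a reboot:

    rem What I ran; both the GUI and this returned access denied
    takeown /f "D:\Data\StuckFolder" /r /d y

    rem What I'd try next time: look for open handles on the folder before rebooting
    openfiles /query /fo csv | findstr /i "StuckFolder"
    net file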

Friday, July 20, 2012

My Troubleshooting of an HP Flex 10 FCoE Connection to a Cisco Fibre Channel MDS Switch

First, some good links for documents about this:

Cisco, VMware, HP

The Problem:

After building a new HP chassis full of blades and hoping to connect the new blades to storage, I was building zoning rules to connect them to our EMC VNX.  Inside Cisco Fabric Manager I did not see the HP blades listed, so I could not build the zoning rules.  The other suspicious item was that inside HP Virtual Connect Manager, on the “SAN Fabrics” tab, the status showed a warning.

The Setup:

I had set this up as a fully meshed fibre layout with NPIV enabled everywhere: 1 HP chassis, two Flex 10 modules, two Cisco MDS switches, and one EMC VNX.  My Flex 10 modules were each connected directly to both of the Cisco MDS switches.   Each of the Cisco MDSs was connected directly to each of the SPs on the EMC VNX (fully meshed).

The Solution:

I worked on this for some time; then, after changing the connections on the Flex 10s to both go directly to the SAME Cisco MDS switch (removing the mesh from the compute side), HP Virtual Connect finally showed happy, the Cisco MDS began to see all the HP blades, and I was able to connect storage.  So what did I do wrong originally?  I am sure there is a great reason, but what part of Fibre Channel for Dummies did I miss?
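
For anyone hitting the same wall, these are the kinds of checks I would run on the MDS side next time (a sketch for NX-OS; the VSAN and interface numbers are made up).  The Virtual Connect uplink logs multiple blade WWPNs in through a single switch port, so NPIV has to be enabled and the blades should show up in the FLOGI and name server databases before zoning will ever see them:

    ! Confirm NPIV is enabled (it's turned on with "feature npiv" in config mode)
    show feature | include npiv
    ! Each blade WWPN should appear as a login on the uplink port
    show flogi database
    ! And should be registered with the name server before you can zone it
    show fcns database vsan 10
    ! Check the uplink port itself
    show interface fc1/1 brief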

Thursday, July 19, 2012

Cisco MDS Zoning

Single initiator with a single target is the most efficient approach to zoning (a quick example is below the links).

Just jotting down some links

http://www.cisco.com/en/US/docs/switches/datacenter/mds9000/sw/4_1/configuration/guides/fm_4_1/zone.html

http://routerjockey.com/2011/12/23/mds-fiber-channel-switching-basics-for-network-engineers/
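
For my own memory, a minimal single-initiator/single-target zone on an MDS looks something like this; the VSAN number, names, and WWPNs below are made up:

    ! One initiator (blade HBA) and one target (array port) per zone
    zone name ESX01_HBA0_VNX_SPA0 vsan 10
      member pwwn 10:00:00:00:c9:12:34:56
      member pwwn 50:06:01:60:12:34:56:78
    ! Add the zone to the fabric's zoneset and activate it
    zoneset name FABRIC_A vsan 10
      member ESX01_HBA0_VNX_SPA0
    zoneset activate name FABRIC_A vsan 10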

EMC VNX Storage Pool Design & Configuration

The EMC whitepaper on VNX best practices, h8268_VNX_Block_best_practices.pdf, is the way to go; I used version 31.5, as it is the most current one available.

Storage Pools vs. RAID Groups vs. MetaLUNs

From a design perspective, MetaLUNs were basically replaced by Storage Pools.  Storage Pools allow for the large striping across many drives that MetaLUNs offer, but with a lot less complexity; MetaLUNs are now generally used to combine/expand a traditional LUN.  RAID Groups have a maximum size of 16 disks, so for larger stripes they aren’t a viable option.  For situations where guaranteed performance isn’t critical, go with Storage Pools; use RAID Groups if you need deterministic (guaranteed) performance.  The reason behind this is that you are probably going to create multiple LUNs out of your Storage Pool, so one busy LUN could affect the others.
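
As an aside, and from memory (so double-check the switches against the naviseccli documentation), these are roughly the commands I use to see what pools, RAID groups, and LUNs already exist on the array before deciding; the SP address is a placeholder:

    naviseccli -h 10.0.0.1 storagepool -list
    naviseccli -h 10.0.0.1 getrg
    naviseccli -h 10.0.0.1 getlun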

RAID Level Selection

Assuming you’re going with a Storage Pool, your options are RAID 5, 6, or 1/0.  If you are using large drives (over 1TB), RAID 5 is not a good choice because of long rebuild times; RAID 6 is almost certainly the way to go.  Always use the suggested drive counts in the pools: RAID 5 is 5 disks, or a number evenly divisible by 5; RAID 6 and RAID 1/0 are 8 disks, or a number evenly divisible by 8.  If you use a count other than the recommended multiples you will be wasting space; for example, a 40-drive RAID 6 pool divides evenly into 6+2 private RAID groups, while 42 drives would not.

How Big of a Storage Pool do I start with?

Create a homogeneous storage pool with the largest practical number of drives.  Pools are designed for ease of use, and the pool dialog algorithmically implements many best practices.  It is better to start with a large pool: when you add disks to a pool, it does not (currently) restripe across them, so if you only added 5 disks to an existing 50-disk pool, the new LUNs would have much lower performance.   The smallest practical pool is 20 drives (four R5 raid groups).  It is recommended practice to segregate the storage system’s pool-based LUNs into two or more pools when availability or performance may benefit from the separation.

I am only covering a small portion of what you need to know.

When dealing with storage there are thousands of options: homogeneous vs. heterogeneous drives, thick vs. thin provisioning, FAST VP pools, drive speeds, FAST Cache and flash drives, storage tiering.  The whitepaper above does a great job of detailing all of that, and I won’t try to improve on what EMC has said.