Wednesday, September 30, 2020

What is SRE

SRE is an abbreviation for Service Reliability Engineering, also known as Site Reliability Engineering.  SRE can also be a job title for a Service (or Site) Reliability Engineer.  DevOps, by contrast, is a methodology (and should never be someone's title) where developers support the code they write in production.

Service Reliability Engineering exists to improve the reliability of services by writing software that keeps a service functioning properly and by driving improvement through feedback, primarily around reliability, availability, observability & serviceability.  SREs are also expected to do a significant amount of troubleshooting to find the root cause of an issue rather than focusing on the symptoms.  SREs often help developers with infrastructure, deployment, configuration, monitoring & metrics to make their software easy to update, manage and monitor.

SRE is a set of practices, metrics, and prescriptive ways to ensure reliability and uniformity to enable success at scale.

SRE is often the gate to production (granting access and making, approving, or authorizing production changes)

Characteristics of an SRE

  • Reliability
    • Focused heavily on meeting SLOs & SLAs
    • Error budgets are like money: they are meant to be spent, but spent wisely since the budget is limited; underspending and overspending are both bad (see the error-budget sketch after this list)
    • Embraces the fact that failures will happen and plans for them
    • Solve production issues and restore service
    • Actively participate in postmortems
    • Has a roadmap and prioritized backlog of things to automate
    • SREs are at war with:
      • toil (anything repetitive that is done by hand should be automated)
      • inconsistencies (snowflakes are a killer)
      • ignorance (visibility & data are key to managing things quickly at scale)
    • Improves monitoring with alert correlation to reduce noise and TTR (time to repair)
    • SRE's mandate is to continually push for product improvements
  • Automation/Code
    • Spends at least 51% of their time solving issues through code
    • All Incidents and Escalations should result in a runbook/workflow that eventually turns into automation (I like to call runbooks human-automation)
    • CI/CD Automation platforms to push out new code and fixes
    • Many small code pushes are far better and easier to roll back than fewer large ones
    • No Production software pushes on Friday or the weekend
    • Knows you can't test everything; unit tests are required, but things will still collide in production
    • Uses an automation platform such as StackStorm to orchestrate fixes
    • Write tools such as auto-triage for troubleshooting (gather the logs while the engineer logs in)
    • Idempotent actions are your new best friend 
  • People & Process
    • SREs need psychological safety; blameless postmortems are one example
    • SRE work is cognitively demanding and requires minimal context switching
    • Participates in on-call rotation
    • If SREs support a platform they do not entirely develop, they must be involved in project planning and execution with those teams.
    • SREs need time to dig into incidents, not only to fix them but to find the root cause and take preventive measures
  • Tooling for success
    • SREs need proper tooling, such as logs, time-series metrics, traces, etc. (it's virtually impossible to understand the true root cause of intermittent issues without these)
    • Proper monitoring to detect failures; you need to feel confident that if no alarms are firing, the infrastructure is healthy.  Customers should not be the ones to tell you that you have a problem.
    • Never create an alarm unless it leads to an action.  Warnings are useless at scale.
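
To make the error-budget idea concrete, here is a minimal sketch (my own illustration, not tied to any particular SRE toolchain) that turns an availability SLO into a monthly downtime budget:

# Minimal sketch: turn an availability SLO into a monthly downtime allowance.
# The 30-day month is an assumption for illustration.
def monthly_error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% SLO -> {monthly_error_budget_minutes(slo):.1f} minutes of downtime per month")

At 99.9% that works out to roughly 43 minutes a month, which is the budget you get to spend on risky changes, experiments, and the occasional bad day.
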
SRE Shared Vocabulary (my definitions, feel free to disagree)
  • Black Box Monitoring - All you know are the inputs and outputs: if wood goes in and chairs come out, thumbs up, things are good.  If chairs come out broken, incomplete, or not at all, you realize you have a problem.
  • White Box Monitoring - (I wish it were called transparent box) this is where you know what's happening inside the machine.  You see the first machine cut the wood, the second sand it, the third assemble the parts, and the fourth paint the chair.  You know which part of the machine is not functioning properly.
  • Observability - You need to make your production systems observable: expose signals that can be watched programmatically so operators can understand the health of the machine.
  • Serviceability - How easy/difficult is it to maintain this software, to get new software into production or upgrade to a new build.
  • Availability - How often a system operates properly, or at least within an SLO/SLA
  • Idempotent - Something that can be applied repeatedly and safely, and will only make a change the first time it is applied (see the sketch after this list).
  • Immutable - Something that can't be changed after it is built/created.
  • Heuristic - An approach to solving a problem that uses practical methods; it may not be optimal, but it will get you there.
  • Orthogonal - When one thing changes independently and does not affect the other.
  • Chaos Engineering - Intentionally breaking things in production to test your resiliency
  • Canary - Something built to test new features or configurations, often on a very small subset of production.  This way you can find issues with a limited scope of impact.
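
Since idempotency comes up twice above, here is a minimal sketch of the difference between an idempotent and a non-idempotent remediation step (the config file and setting are made up for illustration):

# Sketch: idempotent vs. non-idempotent remediation. Path and setting are hypothetical.
from pathlib import Path

CONF = Path("/tmp/app.conf")
LINE = "max_connections=500"

def append_setting_naive():
    # NOT idempotent: every run appends another copy of the line.
    with CONF.open("a") as f:
        f.write(LINE + "\n")

def ensure_setting():
    # Idempotent: safe to run repeatedly, only changes the file the first time.
    existing = CONF.read_text().splitlines() if CONF.exists() else []
    if LINE not in existing:
        with CONF.open("a") as f:
            f.write(LINE + "\n")

ensure_setting()
ensure_setting()  # second run is a no-op, which is exactly what you want in automation
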
Probably the biggest key to success for SRE & DevOps is the right mindset.  The business must see SRE as a trusted partner who prevents issues and keeps the service running properly.  If SRE runs a production environment where they do not write all of the software, developers sometimes think of them as operations people they can toss software "over the fence" to.  While SRE is fantastic at remediating issues at scale with automation, developers still need to be accountable for the code they write and maintain responsibility for improving it.

Friday, April 10, 2020

WD Red Price Per GB April 2020

If you are like me, you do this exercise every time you need to buy a new drive.

I always try to remember what it used to cost, so I'm just going to start posting the prices here.

WD RED
Size (TB)     Price        Cost per TB
     1      $   61.27      $  61.27
     2      $   78.92      $  39.46
     3      $   96.99      $  32.33
     4      $  101.99      $  25.50
     6      $  156.49      $  26.08
     8      $  224.99      $  28.12
    10      $  300.00      $  30.00
    12      $  357.93      $  29.83
    14      $  462.90      $  33.06
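
If you want to redo the math with current prices, it's just price divided by capacity; a quick sketch using the April 2020 numbers above:

# Quick sketch: cost per TB for the prices in the table above.
prices = {1: 61.27, 2: 78.92, 3: 96.99, 4: 101.99, 6: 156.49,
          8: 224.99, 10: 300.00, 12: 357.93, 14: 462.90}

for tb, price in prices.items():
    print(f"{tb:>2} TB  ${price:>7.2f}  ${price / tb:>6.2f}/TB")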

Friday, May 3, 2019

Blogs on VMware site

Recently, most of my blogging has been directly on the VMware site.

I thought I'd link to a couple of the more popular ones here.

Embracing a DevOps Mindset, this is all about leading a team through a cultural transformation
https://blogs.vmware.com/vov/2018/07/25/embracing-a-devops-mindset-in-vmware-it-cloud-operations/

Are we ready?, a post about how VMware makes sure its SaaS services are ready for primetime!
https://blogs.vmware.com/vov/2018/12/18/9374/

VMware's private cloud team represented at VMworld.
https://blogs.vmware.com/vmworld-archive-07-25-2017/2016/07/vmware-and-the-private-cloud-at-vmworld-2016.html

Thursday, August 23, 2018

Troubleshooting 101


Think of yourself as a doctor, but for computers.  Start with "DO NO HARM" as your credo.  Don't make things worse: take snapshots, GO SLOWLY, think before taking any action, and ask for a double check.
There are two basic approaches to troubleshooting: the stab-in-the-dark approach and the systematic approach. The stab-in-the-dark approach usually involves little knowledge of the technology involved and is completely random in nature. A systematic approach, on the other hand, follows a step-by-step process and requires in-depth knowledge of the technology.
1) When did it start? (almost always change related, planned or unplanned)
     Find an error message and try to find the starting time in the logs.
2) Isolate, isolate, isolate.
  How can I split this complex problem into several smaller problems?  Say packets go from A to Z but don't arrive.
First divide the problem in half: check if the packet makes it from A to M; if it does, then check M to Z.
If it didn't make it from M to Z, halve it again: check M to T, then T to Z, and keep dividing in half.  (There is a small sketch of this bisection idea after these steps.)
3) The WORST problems to troubleshoot are always two things that aggravate each other.
Sometimes you have one problem that, due to redundancy or other reasons, you don't even KNOW you have had for months.
Then another thing breaks, and suddenly you have a bizarre scenario that just doesn't add up.
4) Check the health of EVERYTHING
Log into switches and servers (consoles, people!); often errors don't show up in logs, but you'll see them sitting right in front of you.
5) Get creative and approach the problem from different angles.  Ask for help; a second point of view or skill set can really help.  Go play foosball, step back for 20 minutes and refresh your mind.
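
Step 2 is really just a binary search over the path.  A generic sketch of the idea, where check_link() is a stand-in for whatever test you actually run at each point (ping, packet capture, curl, and so on):

# Sketch of "isolate by halving": binary search for the first broken segment
# in a path. check_link(hop) is a placeholder for your real test and should
# return True if traffic still makes it to that hop.
def first_broken_hop(hops, check_link):
    lo, hi = 0, len(hops) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if check_link(hops[mid]):
            lo = mid + 1      # problem is further along the path
        else:
            hi = mid          # problem is at or before this hop
    return hops[lo]

path = ["A", "D", "H", "M", "Q", "T", "Z"]
print(first_broken_hop(path, lambda hop: hop < "T"))  # pretend things break at "T"

Instead of checking every hop one by one, you rule out half of the remaining path with each test.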

More Advice:
Look for workarounds or multiple paths to restore service.
If you have a known method to restore service but it may take hours or days, try to work both paths in parallel.

Saturday, March 19, 2016

Netgear VLAN & PVID making me doubt my sanity

Rebuilding my home lab tonight, I got stuck because every time I plugged a cable into my switch, everything died.



I came to realize that the reason for my problems was that I had been moving cables around in my Netgear GS748T v5 switch, and even though the VLAN configs seemed correct, somehow my old PVID (Advanced - Port PVID Configuration) settings were messing things up.  The scenario I have is 4 ESX hosts, one Synology array, plus one Internet link.  I have four VLANs: 1=Default/home network, 10=iSCSI, 20=Internet, 30=VSAN traffic.

I just upgraded my hosts to Intel NUCs (because I want to be like William Lam).  These Intel NUCs can only use the one onboard NIC with vSphere 6.0 U2 right now; hopefully someone will integrate a USB NIC driver soon.  So back to my challenge: the ESX hosts can ride on the default network and use VLAN tagging for access to the other 3 networks.  My Internet connection is a dumb device that can't use VLAN tagging, so I needed to find a way of integrating it.  Normally that would just be an untagged port, but that doesn't work on these Netgear switches.  To get it to work I had to set up PVIDs: I used port g1 for Internet, g48 for iSCSI, and g39-g42 for the ESXi hosts.  The key here is that in the PVID settings, the port must be a member of the VLAN, but not tagged.
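
Roughly, the port layout described above boils down to something like this (just a reference sketch; the untagged VLAN / PVID pairings follow from the VLAN numbering above rather than from a config export):

# Reference sketch of the port layout (values inferred from the description above).
port_plan = {
    "g1":      {"use": "Internet (device can't tag)",          "untagged_vlan": 20, "pvid": 20},
    "g48":     {"use": "iSCSI to the Synology",                "untagged_vlan": 10, "pvid": 10},
    "g39-g42": {"use": "ESXi hosts (tagging done in vSphere)", "untagged_vlan": 1,  "pvid": 1,
                "tagged_vlans": [10, 20, 30]},
}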

That seems to be working well.  On the VLAN membership tab, I left my default VLAN (1) everywhere except the two untagged ports that my storage and Internet connect to.  For the other 3 VLANs I mostly emptied out the membership and set it up like this:

If you have a similar setup and you get stuck, I hope this helps you!

Monday, December 9, 2013

VMware vSAN IOPS testing

Take this with a grain of salt; these are only initial figures.  I am using a combination of IOMeter for Windows and fio for Linux.

Baseline redundancy and caching, no storage profiles used, only using vSAN as a datastore (I’ll do the other options later)

My vSAN is made of 3 identical ESXi hosts, each with a single Samsung 840 250GB SSD and two Seagate 750GB SATA drives.  vSAN has a single dedicated 1Gb connection, no jumbo frames used.  (Yes, there could be bottlenecks at several spots; I haven't dug that deeply, this is just a 'first pass' test.)

The end result of this VERY BASIC test is this:

vSAN random reads were an average of 31 times faster than a single SATA disk

vSAN random writes were an average 9.1 times faster than a single SATA disk

 

More Details Below:

Regular single disk performance (just for a baseline before I begin vSAN testing)

Random Read (16k block size)

first test = 79 IOPS

second test = 79 IOPS

Random Write (16k block size)

first test = 127 IOPS

second test = 123 IOPS

vSAN disk performance with the same VM moved (Storage vMotion) to the vSAN datastore

Random Read (16k block size)

first test = 2440 IOPS

second test = 2472 IOPS

Random Write (16k block size)

first test = 1126 IOPS

second test = 1158 IOPS
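
For what it's worth, the 31x and 9.1x figures at the top are just the ratios of the averaged runs; a quick sketch of that arithmetic:

# The headline multipliers are the ratios of the averaged test runs above.
single = {"randread": (79, 79),     "randwrite": (127, 123)}
vsan   = {"randread": (2440, 2472), "randwrite": (1126, 1158)}

for workload in single:
    ratio = (sum(vsan[workload]) / 2) / (sum(single[workload]) / 2)
    print(f"{workload}: {ratio:.1f}x faster on vSAN")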

Commands used in fio:

sudo fio --directory=/mnt/volume --name=fio_test --direct=1 --rw=randread --bs=16k --size=1G --numjobs=3 --time_based --runtime=120 --group_reporting

sudo fio --directory=/mnt/volume --name=fio_test --direct=1 --rw=randwrite --bs=16k --size=1G --numjobs=3 --time_based --runtime=120 --group_reporting

As I mentioned, I did use IOMeter in Windows; the initial results were very similar to the fio results above.  I will post those once I have the time to try each solution and go deeper into identifying bottlenecks, getting more detailed, adding more hosts, etc.

Sunday, December 8, 2013

VMware vSphere 5.5 vSAN beta ineligible disks

While building my home lab to use vSAN and NSX following Cormac Hogan's great instructions, I've encountered an issue: the disks I am trying to use for vSAN are not showing as available.  Under Disk Groups in "Cluster/Manage/Virtual SAN/Disk Management", I see only one of my 3 hosts has 0/2 disks in use; the others show 0/1.  My setup is this: I purchased 3 new 250GB Samsung SSD drives (one for each host), and am trying to reuse 6 older Seagate 750GB SATA drives.

My first thought was, why does it only say 0/1 in use on two of the servers?  I have 4 drives in those servers: a 60GB boot drive, 1 SSD, & 2 SATA drives, so why doesn't it say 0/3 or 0/4?  I noticed that in the bottom pane I can choose to show ineligible drives, and there I see the 3 drives I can't use.  I understand why I can't use my Toshiba boot drive, but why do my 750GB Seagate drives also show ineligible?



I played with enabling AHCI, but knowing there is a bug in the beta, I wanted to avoid it.  See here: http://blogs.vmware.com/vsphere/2013/09/vsan-and-storage-controllers.html.  This unfortunately did not change the situation.  I finally realized that those drives possibly still had legacy partitions on them.  After nuking the partitions on those drives, the disks now show up as eligible.  I tried this first on my server smblab2, and you can see it now shows 0/3 in use, which is what I would have expected originally.  "Not in use" in this context basically means "eligible".


I was then able to Claim the disks for VSAN Use:


Then finally create the disk groups.


Many others suggest running vSAN in a virtual environment, which is great for learning; you can even get the experience by doing the Hands-on Labs (free 24/7 now!).  But I wanted to do some performance testing, and for that I needed a physical environment.  Now that I've gotten past my little problem, it's working great!