I have been successfully using a pfSense Community Edition firewall to protect my home test lab. My local ISP delivers a subnet of addresses directly, so I need a transparent firewall, or "bridge," to protect the lab. After rebuilding a few pieces of the lab, I restored my pfSense configuration to a new host/VM and found that no traffic was passing. I did a packet capture and saw no communication traffic at all, so I figured *something* must be blocking the traffic before it even reached the pfSense transparent firewall VM.

A lightbulb went off in my head, taking me back to my VMware architecture days and the three security settings you can set on a virtual network switch, promiscuous mode being the easiest to remember. I toggled these on and off one by one and found I needed both the Promiscuous Mode and Forged Transmits security features turned off (set to Accept) for this pfSense transparent firewall VM to operate correctly.

Obviously, turning these features off does open your ESX host up to accepting more (possibly malicious) packets, but the ESX host is simply passing the packets along to the VM(s) attached to that network on the host. You can limit your exposure by having only a single VM on the host's "raw internet" network and that same single VM attached to the inside "filtered internet" network. Assuming you trust pfSense to do its job, turning off these features should work for most home use cases.
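For reference, the same two settings can be flipped from the ESXi shell on a standard vSwitch. This is a sketch; the vSwitch name below is an assumption, so substitute whichever switch carries your "raw internet" port group:

```shell
# Allow promiscuous mode and forged transmits on the vSwitch carrying the
# pfSense bridge (vSwitch1 is an assumed name; substitute your own).
esxcli network vswitch standard policy security set \
  --vswitch-name=vSwitch1 \
  --allow-promiscuous=true \
  --allow-forged-transmits=true

# Verify the policy took effect:
esxcli network vswitch standard policy security get --vswitch-name=vSwitch1
```

If you set the policy in the vSphere client instead, these map to the Security tab's "Promiscuous mode" and "Forged transmits" set to Accept.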
Friday, August 19, 2022
Thursday, February 3, 2022
I use google cloud's smallest VM for hosting my own DNS servers. I use the f1-micro instances that are very limited in memory and cpu, but cheap!
During a regular yum update I received the following error and my instance (VM) failed to reboot.
During the Cleanup part of yum update the google-cloud-sdk gave me this error:
/var/tmp/rpm-tmp.rdz2f9: line 4: 11963 Killed gcloud components post-process --force-recompile
warning: %postun(google-cloud-sdk-360.0.0-1.x86_64) scriptlet failed, exit status 137
Non-fatal POSTUN scriptlet failure in rpm package google-cloud-sdk-360.0.0-1.x86_64
I read in this post - https://stackoverflow.com/questions/40163733/upgrading-google-cloud-sdk-fails-on-configure - that this person had the same issue due to using the smallest GCP instance size, but they chose to stop some processes to free up memory before running the update.
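"Killed" with exit status 137 generally means the kernel's OOM killer stopped the scriptlet. An alternative to stopping processes is to add a temporary swap file before updating. This is a sketch of that general technique, not the exact commands from the linked post; the path and size are my assumptions:

```shell
# Give the tiny f1-micro some breathing room before the update.
# (Temporary swap file; path and 1G size are arbitrary choices.)
sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# ...run the yum update here...

# Then remove the temporary swap afterwards:
sudo swapoff /swapfile
sudo rm /swapfile
```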
My resolution was to first build a new VM from a snapshot taken before the update messed things up.
Just for good measure, I did some yum cleanup with yum-utils before anything else, then updated the single google-cloud-sdk package, and then updated the rest without error.
sudo yum install yum-utils
sudo package-cleanup --dupes --noplugins
sudo yum clean all
sudo yum clean dbcache   # probably redundant
sudo yum update google-cloud-sdk   # this took a long time
sudo yum update   # update all the other pieces
After this everything was happy!
Wednesday, May 5, 2021
I recently purchased a USG-3P (UniFi Security Gateway), a Ubiquiti UniFi home router.
With its DHCP, everything points to this device as the DNS server for the house. I like this because I don't want basic name resolution (Internet) to depend on my home lab being up. However, I also want my xyz.home.lab domains to resolve. I did a bunch of googling and found I needed to modify the config.gateway.json file, but I couldn't find it on the appliance. I then found this article, so I just needed to SSH into my USG and run a command such as ---
set service dns forwarding options server=/lab.dns/172.16.1.10
This seemed great, but the command gave me "invalid command". It turns out you need to enter configure mode first on the USG: just type 'configure', hit enter, and then run the command above. After that, type commit, then save.
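Put together, the full session on the USG looks roughly like this (same example domain and server as above):

```shell
configure
set service dns forwarding options server=/lab.dns/172.16.1.10
commit
save
exit
```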
This almost makes my home setup work, but my primary Mac is always VPN'd into work, and all its DNS requests are sent there, so home.lab still doesn't resolve from that machine.
On my Mac I needed to mkdir /etc/resolver, then create a file named home.lab in the /etc/resolver directory, adding the following lines ----
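The lines themselves didn't survive the copy/paste, but the macOS resolver file format is just a nameserver entry. Assuming the lab DNS server at 172.16.1.10 from the USG example above (adjust to your own server's address), /etc/resolver/home.lab would contain:

```
# /etc/resolver/home.lab -- macOS uses this server for *.home.lab lookups
nameserver 172.16.1.10
```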
After I saved that, I can now resolve my home.lab DNS from everywhere, and I get the bonus of still being able to work just fine if the lab (DNS) is down.
Thursday, April 29, 2021
I dug this up from many years ago, just as good today as it was then!
Team Rules - by Brian Smith
- It's all about the customers! We must provide a reliable and cost-effective solution!
- If there are customers on it, it is in production and will be treated as production until the customers are off and it is NOT in production.
- You can't say "that's not my job"; you may direct someone to the appropriate person or group if you can't help them.
- All negative customer experiences must be escalated, no matter who you believe is responsible.
- If a project deadline is in jeopardy, you must escalate immediately.
- There is no excuse: use official methods (i.e., open an Incident in the official ticketing tool).
- When you are on call, you must be available to work on an issue.
- Document before you execute, and open a change for all changes (use official tools).
- Do your best not to delete anything directly; take it offline for 3-7 days when possible.
Wednesday, March 24, 2021
Here are some slides I've presented at a number of places; people have asked me to post them.
Lastly, I'd like to add that "Self Correcting Systems" are vital to the success of SRE. Of course we all hear about auto-remediation or self-healing technologies. While those are self-evident, I personally recommend you also think about your people and processes. Think about the motivations, rewards, and expected human behaviors. If you set a target of reducing false monitoring alarms, someone MIGHT decide to just disable the alarms instead of fixing them. If you focus on auto-healing too much, you may miss the fact that most things that can or should be fixed by auto-healing point to a design flaw. Unfortunately, we tend to ask how many fires we put out, not how many fires we prevented, because "fires put out" is easier to count. We have to educate our stakeholders and leadership that an ounce of prevention is worth a pound of cure!
Wednesday, September 30, 2020
SRE is an abbreviation for Service Reliability Engineering, also known as Site Reliability Engineering. SRE can also be a job title: Service (or Site) Reliability Engineer. DevOps is a methodology (and should never be someone's title) in which developers support the code they write in production.
Service Reliability Engineering exists to improve the reliability of services by writing software to keep a service functioning properly and by driving improvement through feedback, primarily around reliability, availability, observability, and serviceability. SREs are also expected to do a significant amount of troubleshooting to find the root cause of an issue rather than focus on the symptoms. SREs often help developers with infrastructure, deployment, configuration, monitoring, and metrics to make their software easy to update, manage, and monitor.
SRE is a set of practices, metrics, and prescriptive ways to ensure reliability and uniformity to enable success at scale.
SRE is often a gate to production (granting access; making, approving, and authorizing production changes).
Characteristics of an SRE
- Focused heavily on meeting SLOs & SLAs
- Error budgets are like money: they are meant to be spent, but wisely, since they are limited; underspending and overspending are both bad
- Embraces the fact that failures will happen, and plans for them
- Solves production issues and restores service
- Actively participates in postmortems
- Has a roadmap and prioritized backlog of things to automate
- SREs are at war with:
  - toil (anything repetitive that is done by hand should be automated)
  - inconsistencies (snowflakes are a killer)
  - ignorance (visibility & data are key to managing things quickly at scale)
- Improves monitoring with alert correlation to reduce noise and TTR (time to repair)
- An SRE's mandate is to continually push for product improvements
- Spends at least 51% of their time solving issues through code
- All incidents and escalations should result in a runbook/workflow that eventually turns into automation (I like to call runbooks human-automation)
- Uses CI/CD automation platforms to push out new code and fixes
  - Many small code pushes are far better and easier to backtrack than fewer large ones
  - No production software pushes on Friday or the weekend
- Knows you can't test everything; unit tests are required, but things will collide in production
- Uses an automation platform such as StackStorm to orchestrate fixes
- Writes tools such as auto-triage for troubleshooting (gather the logs while the engineer logs in)
- Idempotent actions are your new best friend
- People & Process
  - SREs need psychological safety; one example is blameless postmortems
  - SRE work is cognitively difficult and requires minimal context switching
  - Participates in an on-call rotation
  - If SREs support a platform they do not develop entirely, they must be involved in project planning and execution with those teams
  - SREs need time to dig into incidents, not only to fix them but to find the root cause and take preventative measures
- Tooling for success
  - SREs need proper tooling, such as logs, time-series metrics, traces, etc. (it's virtually impossible to understand the true root cause of intermittent issues without this)
  - Proper monitoring to detect failures: you need to feel confident that if no alarms are triggering, the infrastructure is healthy; customers should not be the ones to tell you that you have a problem
  - Never create an alarm unless it leads to an action; warnings are useless at scale
- Black Box Monitoring - all you know are the inputs and outputs. If wood goes in and chairs come out, thumbs up, things are good. If chairs come out broken, incomplete, or not at all, you know you have a problem.
- White Box Monitoring - (I wish it were called transparent box) this is where you know what's happening inside the machine. You see the first machine cut the wood, the second sand it, the third assemble the parts, and the fourth paint the chair. You know which part of the machine is not functioning properly.
- Observability - you need to make your production systems observable, exposing signals that can be watched programmatically so operators can understand the health of the machine.
- Serviceability - how easy or difficult it is to maintain the software, get new software into production, or upgrade to a new build.
- Availability - how frequently a system operates properly, or at least within an SLO/SLA.
- Idempotent - something that can be applied repeatedly and safely, and will only make a change the first time it is applied.
- Immutable - something that can't be changed after it is built/created.
- Heuristic - an approach to solving a problem that uses practical methods; it may not be optimal, but it will get you there.
- Orthogonal - when something changes, it does so independently and does not affect the other.
- Chaos Engineering - intentionally breaking things in production to test your resiliency.
- Canary - something built to test new features or configurations, possibly on a very small subset of production, so you can find issues with a limited scope of impact.
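Since idempotent actions come up twice above, here is a minimal shell sketch of the difference. The file and directory paths are made up for the demo:

```shell
# Idempotent vs. non-idempotent: an idempotent step only changes state the
# first time it runs; re-running it is always safe.
STATE_FILE=/tmp/demo_app_env

# Non-idempotent (commented out): every run appends another duplicate line.
#   echo 'export APP_ENV=prod' >> "$STATE_FILE"

# Idempotent: append the line only if it is not already present.
touch "$STATE_FILE"
grep -qxF 'export APP_ENV=prod' "$STATE_FILE" || \
  echo 'export APP_ENV=prod' >> "$STATE_FILE"

# mkdir -p is a classic idempotent action: it succeeds whether or not
# the directory already exists (plain mkdir errors the second time).
mkdir -p /tmp/demo_app_dir
```

Run it once or a hundred times and the end state is identical, which is exactly what you want from automation that retries on failure.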