Checklist
Incident Roles¶
| Role | Responsibility |
| --- | --- |
| Incident Responder(s) | Hands on keyboard |
| Incident Commander | Overseeing the incident |
| Incident Communicator | Communicating with stakeholders |
The steps below apply to Severity ("Sev") 1 and 2 incidents. See the Escalations section for more detailed steps and Severity Classification for a severity rubric.
The first responder automatically becomes the incident commander until the role is transferred:
Warning
You are still the commander until someone else says "I have command"
The incident commander is automatically the incident communicator until that role is transferred:
Warning
You are still the communicator until someone else says "I have comms"
Set up a communications channel on Slack, MS Teams, etc.
Communicate the new incident and channel details to the relevant email group(s), etc.
Severity Classification¶
| | Sev 1 | Sev 2 | Sev 3 |
| --- | --- | --- | --- |
| Kubernetes | Cannot schedule pods | Other performance issues | |
| Harbor | Cannot pull images | Cannot push images | Performance issues / Cannot use UI |
| Postgres | No replicas / HA lost | | |
| Vault | | | |
| Elasticsearch | | | |
Escalations¶
After receiving a new alert or ticket, spend a few minutes (< 5m) doing a preliminary investigation to identify the reproducibility and impact of the incident. Once confirmed as an issue, the formal incident response process is:
Observing the environment¶
Start by checking general cluster health (a quick first-pass sweep is sketched after this list)
Check recent changes that may have had an impact
Check recent trends in consumption metrics:
- Has a threshold been reached? Thresholds can be either fixed (e.g. memory) or cumulative (e.g. disk space).
Check performance metrics
- Are metrics within bounds?
- Are any metrics abnormal?
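
As a first-pass sweep, the following read-only commands give a quick picture of cluster health. This is a sketch: it assumes kubectl access to the affected cluster and that metrics-server is installed (for `kubectl top`).

```bash
# Node readiness, versions and addresses
kubectl get nodes -o wide

# Pods anywhere in the cluster that are not Running or Completed
kubectl get pods -A | grep -vE 'Running|Completed'

# Most recent cluster events, newest last
kubectl get events -A --sort-by=.lastTimestamp | tail -n 30

# CPU/memory saturation per node (requires metrics-server)
kubectl top nodes
```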
Hypothesis Development & Testing¶
- What do I think is the problem?
- What would it look like if I were right about what the problem is?
- Could any other problem present in this way?
- Of all the possible causes I've identified, are any much more likely than others?
- Of all the possible causes I've identified, are any much more costly to investigate and fix than others?
- How can I prove that, of all the problems that present in this way, the problem that I think is the problem is in fact the problem?
- How can I test my guesses in a way that, if I'm wrong, I still have a system that is meaningfully similar to the one that I was presented with at the start?
Why even talk about 'hypotheses' rather than 'ideas' or 'guesses'? Using more formal scientific language is a choice that signals a connection to – and creates a bridge for – a larger set of ideas and techniques about evidence and progress in knowledge. Because of the cachet scientific ideas hold, it's important to understand that much of theory is constructed in hindsight – science, like engineering, is a messy, human business and there is no perfect method for developing and testing hypotheses. Whatever is included here will necessarily leave much out, and there will always be cases that are not covered or in which the advice below misleads – nevertheless, it has been our experience that the following set of questions is a useful heuristic.
What do I think is the cause? – Forming the conjecture¶
A hypothesis, or conjecture, is a guess about what causes some feature of the world. A useful hypothesis can be tested – i.e., there is some way to tell whether it is true or false. Forming such guesses is a very ordinary thing that engineers do every day, during most hours of the day.
An engineer's capacity to form good, testable hypotheses about a cluster or piece of software is a function of their knowledge, experience and observation tooling – knowledge, as they say, is power. When the cause of a problem is unknown, surveying the appropriate data is a crucial first step in developing useful and clear hypotheses about it. See observing the environment for advice on gathering the relevant information.
Let's start with a problem/symptom:
The Symptom
All endpoints report as down on the monitoring tool.
Then we generate a hypothesis about what might be the cause:
Hypothesis 1.1
All the cluster pods are crashing.
What would it look like if I were right about what the problem is? – Making testable predictions¶
Useful hypotheses are testable, which means that if they are true, the world is one observable way, and if they are false, the world is another observable way. Hypothesis 1.1 is a guess about the state of the world, but it doesn't include any tests. The testable claim (Proposition 1.1) below states what would hold were Hypothesis 1.1 true:
Proposition 1.1
If all the cluster pods were crashing, I would not be able to curl any of the endpoints.
Let's stipulate for the example that there is an attempt to curl all the endpoints and all attempts fail to produce a response. What can we conclude from this?
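
As a concrete sketch, the test in Proposition 1.1 might be run as follows. The endpoint URLs here are hypothetical stand-ins for whatever your monitoring tool reports as down:

```bash
# Hypothetical endpoint list - substitute the endpoints the monitoring tool reports as down
ENDPOINTS="https://app.example.com/healthz https://api.example.com/healthz https://auth.example.com/healthz"

for url in $ENDPOINTS; do
  # -w '%{http_code}' prints the HTTP status; curl prints 000 when no response was received
  code=$(curl -sk -o /dev/null --max-time 5 -w '%{http_code}' "$url")
  echo "$url -> HTTP $code"
done
```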
Unfortunately, unless the only possible cause of being unable to curl the endpoints is that all the pods are down, we have not discovered the root cause of the symptom – we've merely noticed that the world is consistent with our hypothesis. While it remains possible that all the pods are crashing, we've not ruled out other hypotheses that are also consistent with what we've seen. If there are any such hypotheses, we say that our conclusion is underdetermined.
Here are some others:
Hypotheses 2.1
- cluster DNS is misconfigured
- services are misconfigured and no longer point correctly to pods
- a local firewall is blocking the cluster IP
Reasoning from inconsistency
Had we been able to curl any of the endpoints, we could infer that:
- some, but not all, of the pods have problems that prevent curling their endpoints
- we're not generally blocked by a firewall from hitting the cluster domain
Therefore, the following might hold:
Hypotheses from inconsistency
- there are problems with the monitoring tool
- there are problems with some service configurations
- some pods are crashing
Could any other problems present in this way? – Avoiding underdetermination¶
A conclusion is underdetermined by data when plausible rival hypotheses are reasonably likely to be true. For example, even if Proposition 1.1 ("If all the cluster pods were crashing, I would not be able to curl any of the endpoints.") were true and attempts to curl all the endpoints failed in each case, the following hypotheses are rival to Hypothesis 1.1 ("All the cluster pods are crashing"):
Hypotheses 3.1
- a firewall is blocking access
- cluster DNS is misconfigured
- all service configuration has been altered
- the cluster was deleted by an angry ex-employee
- there has been an earthquake and the data centre and backup data centres have been destroyed
Generally speaking, we want to run tests that eliminate competing hypotheses as well as tests that affirm specific hypotheses. Many symptoms/problems will, at start, have vast numbers of plausible competing hypotheses that could be tested. Tests that properly eliminate possibilities can therefore be extremely valuable, especially if those possibilities are both reasonably likely and cheap to check.
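
As an illustration, several of the rival hypotheses above are cheap to check. The commands below are a sketch: they assume you can run a temporary debug pod in the cluster, and the external URL is a hypothetical stand-in:

```bash
# Rival: cluster DNS is misconfigured - resolve a well-known service name from a throwaway pod
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local

# Rival: services no longer point at pods - look for services with no backing endpoints
kubectl get endpoints -A | awk '$3 == "<none>"'

# Rival: a local firewall blocks the cluster - repeat the curl test from a different network location
curl -sk -o /dev/null --max-time 5 -w '%{http_code}\n' https://app.example.com/healthz
```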
Of all the possible causes I've identified, are any much more likely than others? – Assessing the base rate¶
Of all the possible causes I've identified, are any much more costly (time or money) to investigate and fix than others? – Assessing costs of investigation¶
How can I prove that, of all the problems that present in this way, the problem that I think is the problem is in fact the problem? – Ruling out competing hypotheses¶
How can I test my guesses in a way that, if I'm wrong, I still have a system that is meaningfully similar to the one that I was presented with at the start? – Preserving test-retest reliability¶
Once you've developed a hypothesis for what has gone wrong, you need to validate whether your guess is correct.
Warning
In a production environment, it is essential to preserve what might be called (with a debt to statistics) test-retest reliability: the property that running the same test at different times yields consistent results.
In the case of a production cluster, configuration changes or changes to the underlying systems on which a cluster depends risk undermining test-retest reliability. By contrast, a staging cluster or local cluster is much more likely to preserve test-retest reliability, as a) it's generally possible to redeploy for testing and b) there is unlikely to be an accumulation of state.
The following strategies can be used to maximise test-retest reliability (to the degree possible and using your considered judgement):
- Be prepared to roll back any change to the initial state, and do so after validating/disconfirming any hypothesis (a snapshot-and-rollback sketch follows this list).
- Until a solution is realised, ensure all changes are rolled back before pursuing a new idea – hypothesis testing should all be done from the same base configuration.
- Log changes and rollbacks in the live incident log (see SRE Handbook: Effective Troubleshooting/Negative Results Are Magic).
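
A minimal snapshot-and-rollback sketch, assuming the change under test is to a hypothetical Deployment named `myapp` in namespace `myns`:

```bash
# Snapshot the current state before making a test change
kubectl get deployment myapp -n myns -o yaml > /tmp/myapp-before.yaml

# ... apply the change under test and record the observation in the incident log ...

# Roll back so the next hypothesis starts from the same base configuration
# (you may need to strip resourceVersion/status from the snapshot before re-applying)
kubectl apply -f /tmp/myapp-before.yaml

# For spec/image changes managed by a rollout, an alternative is:
kubectl rollout undo deployment/myapp -n myns
```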
Incident log¶
TODO
Mitigation¶
See Generic mitigations.
Kubernetes¶
Health Checks¶
- `karina status` (lists control plane/etcd versions/leaders/orphans)
- `karina status pods`
- control-plane logs [TODO - elastic query]
- karma, canary alerts
- kubectl-popeye
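
If karina is unavailable, a rough equivalent sweep with plain kubectl might look like this (a sketch, assuming a kubeadm-style cluster where the control plane runs as static pods in kube-system):

```bash
# Control plane component health
kubectl get pods -n kube-system -o wide | grep -E 'apiserver|controller-manager|scheduler|etcd'

# Node readiness and recent warning events
kubectl get nodes
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | tail -n 20
```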
Deployments
No Scheduling
Manually schedule by specifying the node name in the pod spec, as sketched below
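
Setting `nodeName` directly in the pod spec bypasses the scheduler entirely; the kubelet on the named node starts the pod itself. A sketch with hypothetical pod and node names:

```bash
# Bypass kube-scheduler by binding the pod to a node explicitly (names are hypothetical)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: manually-scheduled
  namespace: default
spec:
  nodeName: worker-01   # kubelet on worker-01 runs the pod without involving the scheduler
  containers:
    - name: app
      image: nginx:1.25
EOF
```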
Network Connectivity
See Guide to K8S Networking and Networking Model Packet-level Debugging
kubectl-sniff – tcpdump specific pods
kubectl-tap – expose services locally
tcpprobe – measure 60+ metrics for socket connections
Check node to node connectivity using goldpinger
Restart CNI controllers/agents
End User Access Denied
Temporarily increase the access level
Check access using rbac-matrix
Disk/Volume Space
Check PV usage using kubectl-df-pv
Remove old filebeat/journal logs
Scale down replicated storage
Reduce replicas from 3 → 2 → 1
DNS Latency
Check DNS request/failure counts in Grafana
Check pod ndots configuration and reduce it if possible
Check node-local cache hit rates
Scale coredns replicas and/or increase CPU limits on coredns and node-local-dns (see the sketch below)
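
A sketch of the ndots and scaling checks. It assumes CoreDNS runs as the `coredns` Deployment in `kube-system` with the standard `k8s-app=kube-dns` label, and uses a hypothetical application pod name:

```bash
# Inspect the effective ndots setting inside a (hypothetical) application pod
kubectl exec -n myns my-app-pod -- cat /etc/resolv.conf

# Scale up CoreDNS if it is saturated
kubectl -n kube-system scale deployment coredns --replicas=4

# Watch CoreDNS resource usage (requires metrics-server)
kubectl -n kube-system top pods -l k8s-app=kube-dns
```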
Failing Webhooks
Temporarily disable the webhooks by either deleting them or setting their failurePolicy to Ignore, as sketched below
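
For a hypothetical validating webhook named `my-webhook`, the two options look roughly like this (the same applies to mutatingwebhookconfigurations):

```bash
# Option 1: set failurePolicy to Ignore on the first webhook entry (repeat the patch for each index)
kubectl patch validatingwebhookconfiguration my-webhook --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'

# Option 2: save a copy, then delete the configuration entirely; re-create it once the backing service is healthy
kubectl get validatingwebhookconfiguration my-webhook -o yaml > /tmp/my-webhook.yaml
kubectl delete validatingwebhookconfiguration my-webhook
```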
Loss of Control Plane Access
Try to gain access to the master nodes and regenerate the certs under /etc/kubernetes/pki
Downscale the cluster to 1 master, regenerate the certs using kubeadm (sketched below), then scale the masters back up
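
On kubeadm-provisioned control planes, certificate regeneration is sketched below; run it on the remaining master (older kubeadm releases use `kubeadm alpha certs ...` instead):

```bash
# See which control plane certificates have expired
kubeadm certs check-expiration

# Renew the certificates under /etc/kubernetes/pki and the kubeconfig files in /etc/kubernetes
kubeadm certs renew all

# The kube-apiserver, controller-manager, scheduler and etcd static pods must be restarted
# afterwards so they pick up the renewed certificates.
```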
Failure During Rolling Update
Run `karina terminate-node` followed by `karina provision`
Worker Node failure
Run `karina terminate-node` followed by `karina provision`
Control Plane Node Failure
Run `karina terminate-node` followed by `karina provision`
Remove any failed etcd members using `karina etcd remove-member`
Namespace Overutilisation
TBD
Load Balancer Failure
TBD
Cluster Failure
Cordon the cluster by removing GSLB entries
Without PVCs: Run `karina terminate` followed by `karina provision` to reprovision the cluster
With PVCs: Try to take a backup first (`karina backup` or `velero backup`), provision a new cluster with a new name, then restore from the backup (`karina restore` or `velero restore`)
Cluster over-utilized
Increase capacity if possible (even temporarily)
Shed load starting with:
- moving workloads to other clusters
- reducing replica counts
- terminating non-critical workloads (dynamic namespaces, etc.)
Node¶
Health Checks¶
- Check CNI health (nsx-node-agent etc)
- Check performance
- Check network connectivity
- Check karma, canary alerts
- Review journalbeat logs/log counts by node
Node Performance
Check container resource usage using `crictl stats` and `crictl ps` (see the sketch below)
Check VM host CPU/Memory/IO saturation
See Performance Cookbook and USE
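
A sketch of the node-level checks, run on the affected node over SSH; `crictl` must be pointed at the node's container runtime socket and `iostat` requires the sysstat package:

```bash
# Per-container CPU and memory usage as reported by the container runtime
crictl ps
crictl stats

# Overall node saturation: CPU, memory and IO wait
top -b -n 1 | head -n 15
iostat -x 1 3
```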
Unable to SSH
CNI Failure
Run `karina terminate-node` followed by `karina provision`
Network Connectivity
kubectl-sniff - tcpdump specific pods
kubectl-tap – expose services locally
tcpprobe – measure 60+ metrics for socket connections
Check node-to-node connectivity using goldpinger
Etcd¶
Slow Performance
Check disk I/O
Reduce size of etcd cluster
Loss of Quorum
Key Exposure
TBD
DB Size Exceeded
Run etcd compaction and/or reduce version history retention
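
A sketch of manual compaction and defragmentation with etcdctl v3; the endpoint and certificate paths below assume a kubeadm-style layout and may differ in your cluster:

```bash
# Point etcdctl at the local member (cert paths are an assumption - adjust for your install)
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# Current DB size, latest revision and any active alarms
etcdctl endpoint status --write-out=table
etcdctl alarm list

# Compact away old revisions (substitute the revision printed above), defragment, clear the NOSPACE alarm
etcdctl compact <revision>
etcdctl defrag
etcdctl alarm disarm
```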
Postgresql¶
Health Checks¶
- Check replication status: `kubectl exec -it postgres-<db-name>-0 -- patronictl list`
- Check volume usage: `kubectl krew install df-pv; kubectl df-pv -n postgres-operator`
Disk Space Usage
Check the size of the pg_wal directory; if it is taking up more than 10% of the volume, the WALs are not getting archived
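
A sketch of that check, assuming the container sets `$PGDATA` (Zalando/Spilo-based images do); adjust the path for other deployments:

```bash
# Compare the WAL directory size against the data volume as a whole
kubectl exec -it postgres-<db-name>-0 -- sh -c 'du -sh "$PGDATA/pg_wal" && df -h "$PGDATA"'
```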
Slow Performance
TBD
Data Loss
Recover from backup
Key Exposure
TBD
Not healthy enough for leader race
Replicas are too out of sync to promote; restore from backup
Replica not following
`kubectl exec -it postgres-<db-name>-0 -- patronictl reinit postgres-<db-name> <pod-name>`
Failover (Safe)
Run `kubectl exec -it postgres-<db-name>-0 -- patronictl failover`
Delete config endpoint if failover is stuck without any master
Failover (Forced)
Run `kubectl delete endpoints postgres-<db-name> postgres-<db-name>-config` to force re-election
WAL Logs not getting archived
Check for standbys that are offline
Clean up manually using pg_archivecleanup. Note: cleaning up WAL logs will prevent standbys that are not up to date from catching up – they will need to be re-bootstrapped
Vault / Consul¶
Slow Performance
TBD
Data Loss
TBD
Key Exposure
TBD
Failover
TBD
Harbor¶
Crashlooping
Check health via /api/v2.0/health (see the sketch below)
If Harbor logs show `failed to migrate: please upgrade to version x first`, run `DELETE FROM schema_migrations WHERE version = 1;` against the Harbor database
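
The /api/v2.0/health endpoint mentioned above can be checked directly; the hostname below is a placeholder:

```bash
# Overall status plus per-component health (core, portal, registry, database, redis, jobservice, ...)
curl -sk https://harbor.example.com/api/v2.0/health | jq .
```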
Inaccessible
Compare accessibility via the UI, the API, and `docker login`
Reset CSRF_TOKEN in the harbor-core configmap
Reset the admin password in the harbor_users postgres table
Drop and recreate Redis PVC to flush all caches
Delete all running replication jobs via UI or via API
Slow Performance
Check performance of underlying registry storage (S3/disk etc)
Check CPU load/throttling on the postgres DB
Data Loss
Fail forward and ask dev-teams to rebuild and repush images
Recover images from running nodes
Key Exposure
TBD
Failover
TBD
Elasticsearch¶
Slow Performance
Check performance of underlying disks
Check CPU load/throttling on the elastic instance
Check memory saturation under elastic node health
Data Loss
TBD
Key Exposure
TBD
Failover
TBD