Checklist
Incident Roles¶
| Role | Responsibility |
| --- | --- |
| Incident Responder(s) | Hands on keyboard |
| Incident Commander | Overseeing the incident |
| Incident Communicator | Communicating with stakeholders |
The steps below apply to Severity ("Sev") 1 and 2 incidents. See the Escalations section for more detailed steps and Severity Classification for a severity rubric.
The first responder automatically becomes the incident commander until the role is transferred:
Warning
You are still the commander until someone else says "I have command"
The incident commander is automatically the incident communicator until that role is transferred:
Warning
You are still the communicator until someone else says "I have comms"
Set up a communications channel on Slack, MS Teams, etc.
Communicate the new incident and channel details to the relevant email group(s), etc.
Severity Classification¶
| | Sev 1 | Sev 2 | Sev 3 |
| --- | --- | --- | --- |
| Kubernetes | Cannot schedule pods | Other performance issues | |
| Harbor | Cannot pull images | Cannot push images | Performance issues / Cannot use UI |
| Postgres | No replicas / HA lost | | |
| Vault | | | |
| Elasticsearch | | | |
Escalations¶
After receiving a new alert or ticket, spend a few minutes (< 5m) doing a preliminary investigation to identify the reproducibility and impact of the incident. Once confirmed as an issue, the formal incident response process is:
Observing the environment¶
Start by checking general cluster health (a quick first-pass sweep is sketched after this list)
Check recent changes that may have had an impact
Check recent trends in consumption metrics:
- Has a threshold been reached? Thresholds can be either fixed (e.g. memory) or cumulative (e.g. disk space).
Check performance metrics
- Are metrics within bounds?
- Are any metrics abnormal?
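
As a first-pass sweep, the following read-only commands give a quick picture of cluster health. This is a sketch: it assumes kubectl access to the affected cluster and that metrics-server is installed (for `kubectl top`).

```bash
# Node readiness, versions and addresses
kubectl get nodes -o wide

# Pods anywhere in the cluster that are not Running or Completed
kubectl get pods -A | grep -vE 'Running|Completed'

# Most recent cluster events, newest last
kubectl get events -A --sort-by=.lastTimestamp | tail -n 30

# CPU/memory saturation per node (requires metrics-server)
kubectl top nodes
```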
Hypothesis Development & Testing¶
- What do I think is the problem?
- What would it look like if I were right about what the problem is?
- Could any other problem present in this way?
- Of all the possible causes I've identified, are any much more likely than others?
- Of all the possible causes I've identified, are any much more costly to investigate and fix than others?
- How can I prove that, of all the problems that present in this way, the problem that I think is the problem is in fact the problem?
- How can I test my guesses in a way that, if I'm wrong, I still have a system that is meaningfully similar to the one that I was presented with at the start?
Why even talk about 'hypotheses' rather than 'ideas' or 'guesses'? Using more formal scientific language is a choice that signals a connection to – and creates a bridge for – a larger set of ideas and techniques about evidence and progress in knowledge. Because of the cachet scientific ideas hold, it's important to understand that much of theory is constructed in hindsight – science, like engineering, is a messy, human business and there is no perfect method for developing and testing hypotheses. Whatever is included here will necessarily leave much out, and there will always be cases that are not covered or in which the advice below misleads – nevertheless, it has been our experience that the following set of questions is a useful heuristic.
What do I think is the cause? – Forming the conjecture¶
A hypothesis, or conjecture, is a guess about what causes some feature of the world. A useful hypothesis can be tested – i.e., there is some way to tell whether it is true or false. Forming such guesses is a very ordinary thing that engineers do every day, during most hours of the day.
An engineer's capacity to form good, testable hypotheses about a cluster or piece of software is a function of their knowledge, experience and observation tooling – knowledge, as they say, is power. When the cause of a problem is unknown, surveying the appropriate data is a crucial first step in developing useful and clear hypotheses about it. See observing the environment for advice on gathering the relevant information.
Let's start with a problem/symptom:
The Symptom
All endpoints report as down on the monitoring tool.
Then we generate a hypothesis about what might be the cause:
Hypothesis 1.1
All the cluster pods are crashing.
What would it look like if I were right about what the problem is? – Making testable predictions¶
Useful hypotheses are testable, which means that if they are true, the world is one observable way, and if they are false, the world is another observable way. Hypothesis 1.1 is a guess about the state of the world, but it doesn't include any tests. The testable claim (Proposition 1.1) below states what would hold were Hypothesis 1.1 true:
Proposition 1.1
If all the cluster pods were crashing, I would not be able to curl any of the endpoints.
Let's stipulate for the example that there is an attempt to curl all the endpoints and all attempts fail to produce a response. What can we conclude from this?
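
As a concrete sketch, the test in Proposition 1.1 might be run as follows. The endpoint URLs here are hypothetical stand-ins for whatever your monitoring tool reports as down:

```bash
# Hypothetical endpoint list - substitute the endpoints the monitoring tool reports as down
ENDPOINTS="https://app.example.com/healthz https://api.example.com/healthz https://auth.example.com/healthz"

for url in $ENDPOINTS; do
  # -w '%{http_code}' prints the HTTP status; curl prints 000 when no response was received
  code=$(curl -sk -o /dev/null --max-time 5 -w '%{http_code}' "$url")
  echo "$url -> HTTP $code"
done
```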
Unfortunately, unless the only possible cause of being unable to curl the endpoints is that all the pods are down, we have not discovered the root cause of the symptom – we've merely noticed that the world is consistent with our hypothesis. While it remains possible that all the pods are crashing, we've not ruled out other hypotheses that are also consistent with what we've seen. If there are any such hypotheses, we say that our conclusion is underdetermined.
Here are some others:
Hypotheses 2.1
- cluster DNS is misconfigured
- services are misconfigured and no longer point correctly to pods
- a local firewall is blocking the cluster IP
Reasoning from inconsistency
Had we been able to curl any of the endpoints, we could infer that:
- some, but not all, of the pods have problems that prevent curling their endpoints
- we're not generally blocked by a firewall from hitting the cluster domain
Therefore, the following might hold:
Hypotheses from inconsistency
- there are problems with the monitoring tool
- there are problems with some service configurations
- some pods are crashing
Could any other problems present in this way? – Avoiding underdetermination¶
A conclusion is underdetermined by data when plausible rival hypotheses are reasonably likely to be true. For example, even if Proposition 1.1 ("If all the cluster pods were crashing, I would not be able to curl any of the endpoints.") were true and attempts to curl all the endpoints failed in each case, the following hypotheses are rival to Hypothesis 1.1 ("All the cluster pods are crashing"):
Hypotheses 3.1
- a firewall is blocking access
- cluster DNS is misconfigured
- all service configuration has been altered
- the cluster was deleted by an angry ex-employee
- there has been an earthquake and the data centre and backup data centres have been destroyed
Generally speaking, we want to run tests that eliminate competing hypotheses as well as tests that affirm specific hypotheses. Many symptoms/problems will, at start, have vast numbers of plausible competing hypotheses that could be tested. Tests that properly eliminate possibilities can therefore be extremely valuable, especially if those possibilities are both reasonably likely and cheap to check.
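
As an illustration, several of the rival hypotheses above are cheap to check. The commands below are a sketch: they assume you can run a temporary debug pod in the cluster, and the external URL is a hypothetical stand-in:

```bash
# Rival: cluster DNS is misconfigured - resolve a well-known service name from a throwaway pod
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local

# Rival: services no longer point at pods - look for services with no backing endpoints
kubectl get endpoints -A | awk '$3 == "<none>"'

# Rival: a local firewall blocks the cluster - repeat the curl test from a different network location
curl -sk -o /dev/null --max-time 5 -w '%{http_code}\n' https://app.example.com/healthz
```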
Of all the possible causes I've identified, are any much more likely than others? – Assessing the base rate¶
Of all the possible causes I've identified, are any much more costly (time or money) to investigate and fix than others? – Assessing costs of investigation¶
How can I prove that, of all the problems that present in this way, the problem that I think is the problem is in fact the problem? – Ruling out competing hypotheses¶
How can I test my guesses in a way that, if I'm wrong, I still have a system that is meaningfully similar to the one that I was presented with at the start? – Preserving test-retest reliability¶
Once you've developed a hypothesis for what has gone wrong, you need to validate whether your guess is correct.
Warning
In a production environment, it is essential to preserve what might be called (with a debt to statistics) test-retest reliability: the property that running the same test at different times yields consistent results.
In the case of a production cluster, configuration changes or changes to the underlying systems on which a cluster depends risk undermining test-retest reliability. By contrast, a staging cluster or local cluster is much more likely to preserve test-retest reliability, as a) it's generally possible to redeploy for testing and b) there is unlikely to be an accumulation of state.
The following strategies can be used to maximise test-retest reliability (to the degree possible and using your considered judgement):
- Be prepared to roll back any change to the initial state, and do so after validating/disconfirming any hypothesis (a snapshot-and-rollback sketch follows this list).
- Until a solution is realised, ensure all changes are rolled back before pursuing a new idea – hypothesis testing should all be done from the same base configuration.
- Log changes and rollbacks in the live incident log (see SRE Handbook: Effective Troubleshooting/Negative Results Are Magic).
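
A minimal snapshot-and-rollback sketch, assuming the change under test is to a hypothetical Deployment named `myapp` in namespace `myns`:

```bash
# Snapshot the current state before making a test change
kubectl get deployment myapp -n myns -o yaml > /tmp/myapp-before.yaml

# ... apply the change under test and record the observation in the incident log ...

# Roll back so the next hypothesis starts from the same base configuration
# (you may need to strip resourceVersion/status from the snapshot before re-applying)
kubectl apply -f /tmp/myapp-before.yaml

# For spec/image changes managed by a rollout, an alternative is:
kubectl rollout undo deployment/myapp -n myns
```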
Incident log¶
TODO
Mitigation¶
See Generic mitigations.
Kubernetes¶
Health Checks¶
- `karina status` (lists control plane/etcd versions/leaders/orphans)
- `karina status pods`
- control-plane logs [TODO - elastic query]
- karma, canary alerts
- kubectl-popeye
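
If karina is unavailable, a rough equivalent sweep with plain kubectl might look like this (a sketch, assuming a kubeadm-style cluster where the control plane runs as static pods in kube-system):

```bash
# Control plane component health
kubectl get pods -n kube-system -o wide | grep -E 'apiserver|controller-manager|scheduler|etcd'

# Node readiness and recent warning events
kubectl get nodes
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | tail -n 20
```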
Deployments
No Scheduling
Manually schedule by specifying the node name in the pod spec, as sketched below
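
Setting `nodeName` directly in the pod spec bypasses the scheduler entirely; the kubelet on the named node starts the pod itself. A sketch with hypothetical pod and node names:

```bash
# Bypass kube-scheduler by binding the pod to a node explicitly (names are hypothetical)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: manually-scheduled
  namespace: default
spec:
  nodeName: worker-01   # kubelet on worker-01 runs the pod without involving the scheduler
  containers:
    - name: app
      image: nginx:1.25
EOF
```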
Network Connectivity
See Guide to K8S Networking and Networking Model Packet-level Debugging
kubectl-sniff – tcpdump specific pods
kubectl-tap – expose services locally
tcpprobe – measure 60+ metrics for socket connections
Check node to node connectivity using goldpinger
Restart CNI controllers/agents
End User Access Denied
Temporarily increase the access level
Check access using rbac-matrix
Disk/Volume Space
Check PV usage using kubectl-df-pv
Remove old filebeat/journal logs
Scale down replicated storage
Reduce replicas from 3 → 2 → 1
DNS Latency
Check DNS request/failure counts in Grafana
Check pod ndots configuration and reduce it if possible
Check node-local cache hit rates
Scale coredns replicas and/or increase CPU limits on coredns and node-local-dns (see the sketch below)
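
A sketch of the ndots and scaling checks. It assumes CoreDNS runs as the `coredns` Deployment in `kube-system` with the standard `k8s-app=kube-dns` label, and uses a hypothetical application pod name:

```bash
# Inspect the effective ndots setting inside a (hypothetical) application pod
kubectl exec -n myns my-app-pod -- cat /etc/resolv.conf

# Scale up CoreDNS if it is saturated
kubectl -n kube-system scale deployment coredns --replicas=4

# Watch CoreDNS resource usage (requires metrics-server)
kubectl -n kube-system top pods -l k8s-app=kube-dns
```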
Failing Webhooks
Temporarily disable the webhooks by either deleting them or setting their failurePolicy to Ignore, as sketched below
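
For a hypothetical validating webhook named `my-webhook`, the two options look roughly like this (the same applies to mutatingwebhookconfigurations):

```bash
# Option 1: set failurePolicy to Ignore on the first webhook entry (repeat the patch for each index)
kubectl patch validatingwebhookconfiguration my-webhook --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'

# Option 2: save a copy, then delete the configuration entirely; re-create it once the backing service is healthy
kubectl get validatingwebhookconfiguration my-webhook -o yaml > /tmp/my-webhook.yaml
kubectl delete validatingwebhookconfiguration my-webhook
```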
Loss of Control Plane Access
Try to gain access to the master nodes and regenerate the certs under /etc/kubernetes/pki
Downscale the cluster to 1 master, regenerate the certs using kubeadm (sketched below), then scale the masters back up
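
On kubeadm-provisioned control planes, certificate regeneration is sketched below; run it on the remaining master (older kubeadm releases use `kubeadm alpha certs ...` instead):

```bash
# See which control plane certificates have expired
kubeadm certs check-expiration

# Renew the certificates under /etc/kubernetes/pki and the kubeconfig files in /etc/kubernetes
kubeadm certs renew all

# The kube-apiserver, controller-manager, scheduler and etcd static pods must be restarted
# afterwards so they pick up the renewed certificates.
```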
Failure During Rolling Update
Run `karina terminate-node` followed by `karina provision`
Worker Node failure
Run `karina terminate-node` followed by `karina provision`
Control Plane Node Failure
Run `karina terminate-node` followed by `karina provision`
Remove any failed etcd members using `karina etcd remove-member`
Namespace Overutilisation
TBD
Load Balancer Failure
TBD
Cluster Failure
Cordon the cluster by removing GSLB entries
Without PVCs: Run `karina terminate` followed by `karina provision` to reprovision the cluster
With PVCs: Try to take a backup first (`karina backup` or `velero backup`), provision a new cluster with a new name, then restore from the backup (`karina restore` or `velero restore`)
Cluster over-utilized
Increase capacity if possible (even temporarily)
Shed load starting with:
- moving workloads to other clusters
- reducing replica counts
- terminating non-critical workloads (dynamic namespaces, etc.)
Node¶
Health Checks¶
- Check CNI health (nsx-node-agent etc)
- Check performance
- Check network connectivity
- Check karma, canary alerts
- Review journalbeat logs/log counts by node
Node Performance
Check container resource usage using `crictl stats` and `crictl ps` (see the sketch below)
Check VM host CPU/Memory/IO saturation
See Performance Cookbook and USE
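
A sketch of the node-level checks, run on the affected node over SSH; `crictl` must be pointed at the node's container runtime socket and `iostat` requires the sysstat package:

```bash
# Per-container CPU and memory usage as reported by the container runtime
crictl ps
crictl stats

# Overall node saturation: CPU, memory and IO wait
top -b -n 1 | head -n 15
iostat -x 1 3
```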
Unable to SSH
CNI Failure
Run `karina terminate-node` followed by `karina provision`
Network Connectivity
kubectl-sniff - tcpdump specific pods
kubectl-tap – expose services locally
tcpprobe – measure 60+ metrics for socket connections
Check node-to-node connectivity using goldpinger
Etcd¶
Slow Performance
Check disk I/O
Reduce size of etcd cluster
Loss of Quorum
Key Exposure
TBD
DB Size Exceeded
Run etcd compaction and/or reduce version history retention
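
A sketch of manual compaction and defragmentation with etcdctl v3; the endpoint and certificate paths below assume a kubeadm-style layout and may differ in your cluster:

```bash
# Point etcdctl at the local member (cert paths are an assumption - adjust for your install)
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# Current DB size, latest revision and any active alarms
etcdctl endpoint status --write-out=table
etcdctl alarm list

# Compact away old revisions (substitute the revision printed above), defragment, clear the NOSPACE alarm
etcdctl compact <revision>
etcdctl defrag
etcdctl alarm disarm
```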
Postgresql¶
Health Checks¶
- Check replication status: `kubectl exec -it postgres-<db-name>-0 -- patronictl list`
- Check volume usage: `kubectl krew install df-pv; kubectl df-pv -n postgres-operator`
Disk Space Usage
Check the size of the pg_wal directory; if it is taking up more than 10% of the volume, the WALs are not getting archived
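
A sketch of that check, assuming the container sets `$PGDATA` (Zalando/Spilo-based images do); adjust the path for other deployments:

```bash
# Compare the WAL directory size against the data volume as a whole
kubectl exec -it postgres-<db-name>-0 -- sh -c 'du -sh "$PGDATA/pg_wal" && df -h "$PGDATA"'
```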
Slow Performance
TBD
Data Loss
Recover from backup
Key Exposure
TBD
Not healthy enough for leader race
Replicas are too out of sync to promote; restore from backup
Replica not following
`kubectl exec -it postgres-<db-name>-0 -- patronictl reinit postgres-<db-name> <pod-name>`
Failover (Safe)
Run `kubectl exec -it postgres-<db-name>-0 -- patronictl failover`
Delete config endpoint if failover is stuck without any master
Failover (Forced)
Run `kubectl delete endpoints postgres-<db-name> postgres-<db-name>-config` to force re-election
WAL Logs not getting archived
Check for standbys that are offline
Clean up manually using pg_archivecleanup. Note: cleaning up WAL logs will prevent standbys that are not up to date from catching up – they will need to be re-bootstrapped
Vault / Consul¶
Slow Performance
TBD
Data Loss
TBD
Key Exposure
TBD
Failover
TBD
Harbor¶
Crashlooping
Check health via /api/v2.0/health (see the sketch below)
If Harbor logs show `failed to migrate: please upgrade to version x first`, run `DELETE FROM schema_migrations WHERE version = 1;` against the Harbor database
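
The /api/v2.0/health endpoint mentioned above can be checked directly; the hostname below is a placeholder:

```bash
# Overall status plus per-component health (core, portal, registry, database, redis, jobservice, ...)
curl -sk https://harbor.example.com/api/v2.0/health | jq .
```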
Inaccessible
Compare accessibility via the UI, the API, and `docker login`
Reset CSRF_TOKEN in the harbor-core configmap
Reset the admin password in the harbor_users postgres table
Drop and recreate Redis PVC to flush all caches
Delete all running replication jobs via UI or via API
Slow Performance
Check performance of underlying registry storage (S3/disk etc)
Check CPU load/throttling on the postgres DB
Data Loss
Fail forward and ask dev-teams to rebuild and repush images
Recover images from running nodes
Key Exposure
TBD
Failover
TBD
Elasticsearch¶
Slow Performance
Check performance of underlying disks
Check CPU load/throttling on the elastic instance
Check memory saturation under elastic node health
Data Loss
TBD
Key Exposure
TBD
Failover
TBD