fix: use library/ Harbor project, add skills, fix pipeline secrets

- .woodpecker.yaml: image paths -> library/autojanet-{agent,dispatcher}
- .woodpecker.yaml: secret names RS_HARBOR_USER / RS_HARBOR_PASS (global)
- container/Dockerfile: restore COPY skills/, skills/ populated from opencode config
- skills/: 84 opencode skills bundled into image
- k8s/manifests: update image refs to library/
Zoë 2026-05-30 15:43:14 -07:00
parent a3f25456e4
commit cc74ad0bd0
232 changed files with 34556 additions and 19 deletions

@@ -1,8 +1,8 @@
 ---
 # AutoJanet CI Pipeline
 # Builds and pushes two images to Harbor:
-# - registry.ctz.fyi/autojanet/agent:latest (+ git SHA tag)
-# - registry.ctz.fyi/autojanet/dispatcher:latest (+ git SHA tag)
+# - registry.ctz.fyi/library/autojanet-agent:latest (+ git SHA tag)
+# - registry.ctz.fyi/library/autojanet-dispatcher:latest (+ git SHA tag)
 # Triggered on push to mainline or semver tags.
 when:
@@ -17,17 +17,16 @@ steps:
 image: woodpeckerci/plugin-docker-buildx
 settings:
 registry: registry.ctz.fyi
-repo: registry.ctz.fyi/autojanet/agent
+repo: registry.ctz.fyi/library/autojanet-agent
 dockerfile: container/Dockerfile
 context: .
 username:
-from_secret: harbor_user
+from_secret: RS_HARBOR_USER
 password:
-from_secret: harbor_password
+from_secret: RS_HARBOR_PASS
 tags:
 - latest
 - "${CI_COMMIT_SHA:0:12}"
-cache_from: registry.ctz.fyi/autojanet/agent:latest
 platforms: linux/amd64
 when:
 - event: push
@@ -39,17 +38,16 @@ steps:
 image: woodpeckerci/plugin-docker-buildx
 settings:
 registry: registry.ctz.fyi
-repo: registry.ctz.fyi/autojanet/dispatcher
+repo: registry.ctz.fyi/library/autojanet-dispatcher
 dockerfile: container/Dockerfile.dispatcher
 context: .
 username:
-from_secret: harbor_user
+from_secret: RS_HARBOR_USER
 password:
-from_secret: harbor_password
+from_secret: RS_HARBOR_PASS
 tags:
 - latest
 - "${CI_COMMIT_SHA:0:12}"
-cache_from: registry.ctz.fyi/autojanet/dispatcher:latest
 platforms: linux/amd64
 when:
 - event: push
@@ -62,12 +60,12 @@ steps:
 commands:
 - trivy image --exit-code 1 --severity HIGH,CRITICAL
 --ignore-unfixed
-registry.ctz.fyi/autojanet/agent:${CI_COMMIT_SHA:0:12}
+registry.ctz.fyi/library/autojanet-agent:${CI_COMMIT_SHA:0:12}
 environment:
 TRIVY_USERNAME:
-from_secret: harbor_user
+from_secret: RS_HARBOR_USER
 TRIVY_PASSWORD:
-from_secret: harbor_password
+from_secret: RS_HARBOR_PASS
 when:
 - event: push
 branch: mainline
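
The pipeline above tags each image with `latest` and with `${CI_COMMIT_SHA:0:12}`, a substring expansion that keeps the first 12 characters of the commit SHA. A minimal Python sketch of the resulting image references, for checking what the Trivy step and the manifests should point at. The function name `image_refs` and the example SHA are hypothetical and not part of this commit.

```python
def image_refs(repo: str, commit_sha: str, short_len: int = 12) -> list[str]:
    """Return the image references one pipeline run would push for a commit."""
    if len(commit_sha) < short_len:
        raise ValueError("commit SHA is shorter than the requested tag length")
    # ${CI_COMMIT_SHA:0:12} means: start at offset 0, take 12 characters.
    short_sha = commit_sha[:short_len]
    return [f"{repo}:latest", f"{repo}:{short_sha}"]


# Hypothetical full-length SHA, used only for illustration.
example_sha = "0123456789abcdef0123456789abcdef01234567"
refs = image_refs("registry.ctz.fyi/library/autojanet-agent", example_sha)
print(refs)
```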

@@ -4,7 +4,7 @@
 # Role is determined at runtime via AGENT_ROLE env var.
 #
 # Build:
-# docker build -t registry.ctz.fyi/autojanet/agent:latest .
+# docker build -t registry.ctz.fyi/library/autojanet-agent:latest .
 #
 # The image bundles:
 # - opencode CLI (Node.js)
@@ -64,7 +64,7 @@ COPY container/entrypoint.py /app/entrypoint.py
 # All agent definition files
 COPY agents/ /app/agents/
-# Skills (read-only reference)
+# Skills from ~/.config/opencode/skills — copied into repo at skills/
 COPY skills/ /app/skills/
 USER agent

@@ -42,7 +42,7 @@ VIKUNJA_TODO_BUCKET_ID = int(os.environ.get("VIKUNJA_TODO_BUCKET_ID", "116"))
 VIKUNJA_IN_PROGRESS_BUCKET_ID = int(os.environ.get("VIKUNJA_IN_PROGRESS_BUCKET_ID", "117"))
 K8S_NAMESPACE = os.environ.get("K8S_NAMESPACE", "autojanet")
-AGENT_IMAGE = os.environ.get("AGENT_IMAGE", "registry.ctz.fyi/autojanet/agent:latest")
+AGENT_IMAGE = os.environ.get("AGENT_IMAGE", "registry.ctz.fyi/library/autojanet-agent:latest")
 VALID_ROLES = {
 "pm", "coder", "code-reviewer", "test-engineer", "devsecops", "secops",

@@ -25,7 +25,7 @@ spec:
 restartPolicy: Never
 containers:
 - name: dispatcher
-image: registry.ctz.fyi/autojanet/dispatcher:latest
+image: registry.ctz.fyi/library/autojanet-dispatcher:latest
 imagePullPolicy: Always
 env:
 - name: OPENBAO_ADDR
@@ -51,7 +51,7 @@ spec:
 - name: K8S_NAMESPACE
 value: "autojanet"
 - name: AGENT_IMAGE
-value: "registry.ctz.fyi/autojanet/agent:latest"
+value: "registry.ctz.fyi/library/autojanet-agent:latest"
 resources:
 requests:
 cpu: "100m"

@@ -32,7 +32,7 @@ spec:
 tolerations: []
 containers:
 - name: agent
-image: registry.ctz.fyi/autojanet/agent:latest
+image: registry.ctz.fyi/library/autojanet-agent:latest
 imagePullPolicy: Always
 env:
 - name: AGENT_ROLE

@@ -0,0 +1,185 @@
---
name: adding-keycloak-sso
description: Use when adding Keycloak SSO authentication to a service on the homelab cluster at ctz.fyi, whether via oauth2-proxy sidecar or native OIDC configuration.
---
# Adding Keycloak SSO
## Overview
Two patterns depending on whether the app supports OIDC natively. Both use Keycloak at `sso.ctz.fyi`, realm `ctz`, with secrets stored in OpenBao.
## Pattern Selection
| App type | Pattern |
|----------|---------|
| No auth or basic auth only | **A: oauth2-proxy sidecar** |
| Native OIDC/OAuth2 support (Grafana, Jellyfin, Open WebUI) | **B: Native OIDC** |
| SPA (React/Vue/etc) | **B: Public PKCE client** (`publicClient: true`, no secret) |

**Gotcha:** If an app already uses keycloak-js internally, do NOT also add oauth2-proxy, or you'll get double-auth. Pick one.
---
## Step 1: Create Keycloak Client
```bash
# Port-forward Keycloak
kubectl port-forward -n keycloak svc/keycloak 8080:80 &

# Get admin password from OpenBao
bao kv get secret/production/keycloak/keycloak-admin

# Get admin token
TOKEN=$(curl -s http://localhost:8080/realms/master/protocol/openid-connect/token \
  -d "client_id=admin-cli&grant_type=password&username=admin&password=<PASSWORD>" \
  | jq -r .access_token)

# Create client
curl -s -X POST http://localhost:8080/admin/realms/ctz/clients \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "clientId": "<service-name>",
    "enabled": true,
    "protocol": "openid-connect",
    "publicClient": false,
    "standardFlowEnabled": true,
    "directAccessGrantsEnabled": false,
    "redirectUris": ["https://<hostname>/oauth2/callback", "https://<hostname>/*"],
    "webOrigins": ["https://<hostname>"],
    "baseUrl": "https://<hostname>"
  }'

# Get the client's internal UUID (not its clientId), then fetch the secret
CLIENT_UUID=$(curl -s http://localhost:8080/admin/realms/ctz/clients \
  -H "Authorization: Bearer $TOKEN" | jq -r '.[] | select(.clientId=="<service-name>") | .id')
CLIENT_SECRET=$(curl -s "http://localhost:8080/admin/realms/ctz/clients/$CLIENT_UUID/client-secret" \
  -H "Authorization: Bearer $TOKEN" | jq -r .value)

kill %1 # Kill port-forward
```
**Redirect URI must include BOTH** `/oauth2/callback` AND `/*` wildcard — missing wildcard causes `redirect_uri_mismatch` for SPAs using keycloak-js.
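
The client definition from Step 1 can also be generated rather than hand-edited, which makes it harder to forget the `/*` wildcard. A minimal sketch, assuming a hypothetical helper named `build_client_payload` and placeholder service and host names; it only builds the JSON body and does not call Keycloak.

```python
import json


def build_client_payload(service_name: str, hostname: str, public: bool = False) -> dict:
    """Build the JSON body for POST /admin/realms/ctz/clients (fields as in Step 1)."""
    base = f"https://{hostname}"
    return {
        "clientId": service_name,
        "enabled": True,
        "protocol": "openid-connect",
        # SPAs use a public PKCE client; everything else is confidential.
        "publicClient": public,
        "standardFlowEnabled": True,
        "directAccessGrantsEnabled": False,
        # Both the oauth2-proxy callback and the wildcard are listed.
        "redirectUris": [f"{base}/oauth2/callback", f"{base}/*"],
        "webOrigins": [base],
        "baseUrl": base,
    }


# Placeholder names, for illustration only.
payload = build_client_payload("demo-app", "demo.example.test")
print(json.dumps(payload, indent=2))
```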
---
## Step 2: Write Secrets to OpenBao
**Pattern A only — generate cookie secret first:**
```bash
COOKIE_SECRET=$(python3 -c "import os,base64; print(base64.urlsafe_b64encode(os.urandom(32)).decode())")
bao kv put secret/production/<namespace>/<name>-oauth2proxy-secret \
  client-secret="$CLIENT_SECRET" \
  cookie-secret="$COOKIE_SECRET"
```
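
The one-liner above can be expanded into a small script that also checks the decoded length before the value is written to OpenBao. This is a sketch under one assumption: that oauth2-proxy accepts cookie secrets whose decoded size is a valid AES key size (16, 24, or 32 bytes). The helper names are hypothetical.

```python
import base64
import os

# Assumed valid decoded sizes (AES key lengths); verify against your oauth2-proxy version.
VALID_KEY_SIZES = (16, 24, 32)


def generate_cookie_secret(num_bytes: int = 32) -> str:
    """Return a URL-safe base64 string wrapping num_bytes of random data."""
    if num_bytes not in VALID_KEY_SIZES:
        raise ValueError(f"num_bytes must be one of {VALID_KEY_SIZES}")
    return base64.urlsafe_b64encode(os.urandom(num_bytes)).decode()


def decoded_length(secret: str) -> int:
    """Return how many raw bytes the base64 secret decodes to."""
    return len(base64.urlsafe_b64decode(secret))


secret = generate_cookie_secret()
assert decoded_length(secret) in VALID_KEY_SIZES
```

The encoded string is longer than the raw key, so length checks should be made on the decoded bytes rather than on the text stored in OpenBao.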
**Pattern B:** Store whatever the app needs (client secret, etc.) under an appropriate path.
---
## Step 3: Pattern A — oauth2-proxy Sidecar
### ExternalSecret
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: <name>-oauth2proxy-secret
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: openbao
    kind: ClusterSecretStore
  target:
    name: <name>-oauth2proxy-secret
    creationPolicy: Owner
  data:
    - secretKey: client-secret
      remoteRef:
        key: secret/production/<namespace>/<name>-oauth2proxy-secret
        property: client-secret
    - secretKey: cookie-secret
      remoteRef:
        key: secret/production/<namespace>/<name>-oauth2proxy-secret
        property: cookie-secret
```
### Deployment sidecar container
```yaml
- name: oauth2-proxy
  image: quay.io/oauth2-proxy/oauth2-proxy:v7.7.1
  args:
    - --provider=oidc
    - --oidc-issuer-url=https://sso.ctz.fyi/realms/ctz
    - --client-id=<service-name>
    - --redirect-url=https://<hostname>/oauth2/callback
    - --email-domain=*
    - --upstream=http://localhost:<app-port>
    - --cookie-secure=true
    - --cookie-samesite=lax
    - --skip-provider-button=true
    - --pass-authorization-header=true
    - --pass-access-token=true
    - --set-xauthrequest=true
    - --http-address=0.0.0.0:4180
  env:
    - name: OAUTH2_PROXY_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: <name>-oauth2proxy-secret
          key: client-secret
    - name: OAUTH2_PROXY_COOKIE_SECRET
      valueFrom:
        secretKeyRef:
          name: <name>-oauth2proxy-secret
          key: cookie-secret
  ports:
    - containerPort: 4180
```
### IngressRoute
Update the service port to `4180`. The app's own port no longer needs to be exposed externally.
---
## Step 4: Pattern B — Native OIDC
Configure the app using:
- **Issuer URL:** `https://sso.ctz.fyi/realms/ctz`
- **Client ID:** `<service-name>`
- **Client secret:** from OpenBao (via ExternalSecret or however the app ingests it)
- **Callback/redirect URL:** whatever the app expects (configure in Keycloak `redirectUris`)
For SPAs: set `"publicClient": true` in client creation, omit secret entirely.
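
Most apps with native OIDC support only need the issuer URL and read the remaining endpoints from the provider's discovery document, which the OIDC Discovery specification places at `/.well-known/openid-configuration` under the issuer. A minimal sketch that derives that URL; the function name `discovery_url` is hypothetical and no network request is made.

```python
from urllib.parse import urlsplit


def discovery_url(issuer: str) -> str:
    """Return the OIDC discovery document URL for an issuer."""
    parts = urlsplit(issuer)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError("issuer must be an absolute https URL")
    # The well-known path is appended to the issuer, including its realm path.
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


url = discovery_url("https://sso.ctz.fyi/realms/ctz")
print(url)
```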
---
## Step 5: Deploy and Verify
```bash
git add -A && git commit -m "feat(<service>): add Keycloak SSO"
git push
# Watch ArgoCD sync
```
Test the login flow manually. Check that:
- Unauthenticated requests redirect to Keycloak
- Successful login lands back on the app
- No double-auth prompts
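
The first check can be partly automated by inspecting the response to an unauthenticated request. A sketch of the comparison logic only, assuming you already have the status code and `Location` header from a tool such as `curl -I`; the function name `redirects_to_keycloak` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit

ISSUER = "https://sso.ctz.fyi/realms/ctz"
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def redirects_to_keycloak(status: int, location: Optional[str], issuer: str = ISSUER) -> bool:
    """True if the response is a redirect into the given Keycloak realm."""
    if status not in REDIRECT_STATUSES or not location:
        return False
    target = urlsplit(location)
    expected = urlsplit(issuer)
    # Host must match exactly; the path must sit under the realm path.
    return (
        target.scheme == expected.scheme
        and target.netloc == expected.netloc
        and target.path.startswith(expected.path + "/")
    )


ok = redirects_to_keycloak(
    302,
    "https://sso.ctz.fyi/realms/ctz/protocol/openid-connect/auth?client_id=demo-app",
)
print(ok)
```

This does not replace the manual login test; it only confirms that anonymous traffic is sent to the expected realm.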
## Common Mistakes
| Mistake | Fix |
|---------|-----|
| Missing `/*` wildcard in redirectUris | Add `"https://<hostname>/*"` alongside the callback URI |
| Cookie secret wrong length | Must decode to 16, 24, or 32 bytes; the `python3` command above generates 32 |
| Double-auth on apps with built-in keycloak-js | Remove app's internal auth OR remove oauth2-proxy, not both |
| IngressRoute still pointing at app port | Update to port `4180` for Pattern A |
| `directAccessGrantsEnabled: true` | Set to `false` — resource owner password grant is not needed |

@@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

@@ -0,0 +1,128 @@
---
name: ansible-convert
description: Use when converting shell scripts to Ansible playbooks. Use when migrating bash automation, manual procedures, or Dockerfiles to idempotent Ansible tasks.
---
# Shell to Ansible Conversion
## Overview
Shell scripts execute commands imperatively; Ansible declares desired state. Conversion means rethinking operations as state declarations, not translating commands line-by-line. The goal is idempotency: running twice produces identical results.
## When to Use
- Converting existing shell scripts to playbooks
- Migrating manual server setup procedures
- Replacing bash automation with Ansible
- Converting Dockerfile RUN commands
## Core Principle
**Don't wrap shell commands in Ansible's `shell` module.** Find the module that achieves the same end state declaratively.
```bash
# Shell: imperative
mkdir -p /opt/app
chown app:app /opt/app
```
```yaml
# Ansible: declarative
- ansible.builtin.file:
    path: /opt/app
    state: directory
    owner: app
    group: app
    mode: '0755'
```
## Conversion Table
| Shell Command | Ansible Module | Notes |
|---------------|----------------|-------|
| `mkdir -p` | `ansible.builtin.file` | `state: directory` |
| `cp` | `ansible.builtin.copy` | Static files |
| `cp` with variables | `ansible.builtin.template` | Use `.j2` templates |
| `rm -rf` | `ansible.builtin.file` | `state: absent` |
| `ln -s` | `ansible.builtin.file` | `state: link` |
| `chmod`, `chown` | Include in file/copy/template | `mode`, `owner`, `group` params |
| `apt-get install` | `ansible.builtin.apt` | `update_cache: yes` |
| `yum install` | `ansible.builtin.yum` | Or use `package` for cross-platform |
| `pip install` | `ansible.builtin.pip` | Specify `executable` if needed |
| `useradd` | `ansible.builtin.user` | Handles home, shell, groups |
| `systemctl start` | `ansible.builtin.service` | `state: started` |
| `systemctl enable` | `ansible.builtin.service` | `enabled: yes` |
| `curl -O` | `ansible.builtin.get_url` | Use `checksum` for verification |
| `tar -xzf` | `ansible.builtin.unarchive` | `remote_src: yes` if already on target |
| `echo >> file` | `ansible.builtin.lineinfile` | Ensures line exists |
| `cat > file` | `ansible.builtin.copy` | `content:` parameter |
## Control Flow Conversion
### Conditionals
```bash
# Shell
if [ -f /etc/debian_version ]; then
  apt-get install nginx
fi
```
```yaml
# Ansible
- ansible.builtin.apt:
    name: nginx
  when: ansible_os_family == "Debian"
```
### Loops
```bash
# Shell
for user in alice bob; do
  useradd $user
done
```
```yaml
# Ansible
- ansible.builtin.user:
    name: "{{ item }}"
  loop:
    - alice
    - bob
```
## When Shell Module is Necessary
Use `command` or `shell` only when no module exists. Always add proper change detection:
```yaml
- name: Run custom installer
  ansible.builtin.shell: /opt/app/install.sh
  args:
    creates: /opt/app/.installed  # Skip if file exists
  register: install_result
  changed_when: "'Installed' in install_result.stdout"
  failed_when: install_result.rc != 0 and 'already installed' not in install_result.stderr
```
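
The `changed_when` and `failed_when` expressions above are ordinary boolean conditions over the registered result. A small Python sketch that mirrors them can help when deciding which strings to match; the function name `evaluate_installer` is hypothetical and the sample outputs are invented.

```python
def evaluate_installer(rc: int, stdout: str, stderr: str) -> tuple[bool, bool]:
    """Return (changed, failed) using the same conditions as the task above."""
    # changed_when: "'Installed' in install_result.stdout"
    changed = "Installed" in stdout
    # failed_when: rc != 0 and 'already installed' not in stderr
    failed = rc != 0 and "already installed" not in stderr
    return changed, failed


# Invented sample results, for illustration only.
print(evaluate_installer(0, "Installed version 1.2.3", ""))
print(evaluate_installer(1, "", "package already installed"))
print(evaluate_installer(2, "", "disk full"))
```

A non-zero exit with "already installed" on stderr is treated as neither changed nor failed, which is what keeps the second run quiet.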
## Variable Extraction
Identify values to parameterize:
- Version numbers → `app_version: "1.2.3"`
- Paths → `app_dir: "/opt/app"`
- Usernames → `app_user: "appuser"`
- Ports → `app_port: 8080`
Place in `defaults/main.yml` for easy override.
## Conversion Workflow
1. Read entire script, identify major phases
2. Map each command to Ansible module
3. Extract hardcoded values as variables
4. Order tasks for dependencies (dirs before files)
5. Add handlers for service restarts
6. Test with `--check --diff`
7. Verify idempotency: second run shows no changes
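
Step 2 of the workflow (map each command to a module) can be roughed out mechanically before the manual rewrite. A minimal sketch, assuming a hypothetical lookup built from the conversion table above; anything it does not recognize falls back to `ansible.builtin.command` and needs human review.

```python
import shlex

# Subset of the conversion table; extend as needed.
MODULE_FOR_COMMAND = {
    "mkdir": "ansible.builtin.file",
    "rm": "ansible.builtin.file",
    "ln": "ansible.builtin.file",
    "chmod": "ansible.builtin.file",
    "chown": "ansible.builtin.file",
    "cp": "ansible.builtin.copy",
    "apt-get": "ansible.builtin.apt",
    "yum": "ansible.builtin.yum",
    "pip": "ansible.builtin.pip",
    "useradd": "ansible.builtin.user",
    "systemctl": "ansible.builtin.service",
    "curl": "ansible.builtin.get_url",
    "tar": "ansible.builtin.unarchive",
}


def suggest_modules(script: str) -> list[tuple[str, str]]:
    """Pair each non-comment script line with a candidate Ansible module."""
    suggestions = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        command = shlex.split(line)[0]
        module = MODULE_FOR_COMMAND.get(command, "ansible.builtin.command")
        suggestions.append((line, module))
    return suggestions


sample = "mkdir -p /opt/app\nchown app:app /opt/app\n# install\napt-get install nginx\nmake install"
for line, module in suggest_modules(sample):
    print(f"{module:28} <- {line}")
```

The output is only a starting point: adjacent `mkdir` and `chown` lines, for example, usually collapse into a single `file` task.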

@@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

@@ -0,0 +1,137 @@
---
name: ansible-debug
description: Use when playbooks fail with UNREACHABLE, permission denied, MODULE FAILURE, or undefined variable errors. Use when SSH connections fail or sudo password is missing.
---
# Ansible Debugging
## Overview
Ansible errors fall into four categories: connection, authentication, module, and syntax. Systematic diagnosis starts with identifying the category, then isolating the specific cause.
## When to Use
- UNREACHABLE errors (SSH/network issues)
- Permission denied or sudo password errors
- MODULE FAILURE messages
- Undefined variable errors
- Template rendering failures
- Slow playbook execution
## Error Categories
| Category | Symptoms | First Check |
|----------|----------|-------------|
| Connection | UNREACHABLE | `ssh -v user@host` |
| Authentication | Permission denied, Missing sudo password | SSH keys, sudoers config |
| Module | MODULE FAILURE | Module parameters, target state |
| Syntax | YAML parse error | Line number in error, indentation |
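
The table above amounts to a keyword lookup, sketched below in Python. The marker strings are assumptions drawn from the symptoms column, and `classify` is a hypothetical helper. Authentication is checked first because SSH key failures are often reported inside an UNREACHABLE message.

```python
# (category, lowercase markers) in the order they are checked.
CATEGORY_MARKERS = [
    ("authentication", ("permission denied", "missing sudo password")),
    ("connection", ("unreachable",)),
    ("module", ("module failure",)),
    ("syntax", ("syntax error", "yaml")),
]


def classify(message: str) -> str:
    """Return the first error category whose marker appears in the message."""
    lowered = message.lower()
    for category, markers in CATEGORY_MARKERS:
        if any(marker in lowered for marker in markers):
            return category
    return "unknown"


# Invented sample messages, for illustration only.
for sample in (
    "fatal: [web1]: UNREACHABLE! Connection timed out",
    "UNREACHABLE! Permission denied (publickey)",
    "MODULE FAILURE",
    "Syntax Error while loading YAML",
):
    print(classify(sample), "<-", sample)
```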
## Quick Diagnosis
### Connection Errors
```bash
# Test SSH directly
ssh -v -i /path/to/key user@hostname
# Test port connectivity
nc -zv hostname 22
# Verify inventory parsing
ansible-inventory --host hostname
```
**Common causes:**
- Wrong IP/hostname in inventory
- Firewall blocking port 22
- SSH key permissions (must be 600)
### Authentication Errors
```bash
# Test with explicit options
ansible hostname -m ping -u user --private-key /path/to/key
# For sudo password issues, either:
ansible-playbook playbook.yml --ask-become-pass
# Or configure NOPASSWD in /etc/sudoers
```
### Module Errors
```bash
# Check module documentation
ansible-doc ansible.builtin.copy
# Verify module parameters match your Ansible version
ansible --version
```
### Variable Errors
```yaml
# Use default filter for optional variables
{{ my_var | default('fallback') }}
# Debug variable values
- ansible.builtin.debug:
    var: problematic_variable
```
## Verbosity Levels
| Flag | Shows |
|------|-------|
| `-v` | Task results |
| `-vv` | Task input parameters |
| `-vvv` | SSH connection details |
| `-vvvv` | Full plugin internals |

Start with `-v`, increase only if needed.
## Debugging Commands
```bash
# Syntax check only
ansible-playbook --syntax-check playbook.yml
# Dry run
ansible-playbook --check playbook.yml
# Step through tasks
ansible-playbook --step playbook.yml
# Start at specific task
ansible-playbook --start-at-task "Task Name" playbook.yml
# Limit to specific host
ansible-playbook --limit hostname playbook.yml
```
## Common Error Patterns
| Error | Cause | Fix |
|-------|-------|-----|
| `Permission denied (publickey)` | SSH key not accepted | Check key permissions, verify authorized_keys |
| `Missing sudo password` | become=true without password | Use `--ask-become-pass` or configure NOPASSWD |
| `No such file or directory` | Path doesn't exist | Create parent directories first |
| `Unable to lock` (apt/yum) | Package manager locked | Wait for other process, remove stale lock |
| `undefined variable` | Variable not defined | Check spelling, use `default()` filter |
## Performance Debugging
```ini
# ansible.cfg
[defaults]
# Show task timing (this option was named callback_whitelist before ansible-core 2.11)
callbacks_enabled = profile_tasks
[ssh_connection]
# Faster SSH
pipelining = True
```
```yaml
# Skip fact gathering if not needed
- hosts: all
  gather_facts: no
```

@@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

@@ -0,0 +1,130 @@
---
name: ansible-interactive
description: Use when guiding someone through Ansible setup step-by-step. Use when starting a new Ansible project from scratch. Use when teaching Ansible through hands-on development.
---
# Interactive Ansible Development
## Overview
Interactive development builds automation incrementally with continuous validation. Each component is tested before adding the next. This catches errors early when they're easy to diagnose.
## When to Use
- Setting up Ansible for a new environment
- Teaching someone Ansible hands-on
- Building playbooks incrementally with validation
- Troubleshooting connectivity before automation
## Development Phases
### Phase 1: Environment Analysis
Gather before writing any code:
| Question | Why It Matters |
|----------|----------------|
| How many servers? | Affects inventory organization |
| IP addresses/hostnames? | Required for inventory |
| SSH user and key location? | Connection configuration |
| Password or key auth? | Determines SSH setup |
| Sudo with or without password? | Privilege escalation config |
| Server roles (web, db, app)? | Inventory grouping |
| Operating systems? | Module selection (apt vs yum) |

Verify Ansible is installed: `ansible --version`
### Phase 2: Project Setup
Create minimal structure:
```bash
mkdir ansible-project && cd ansible-project
```
**ansible.cfg:**
```ini
[defaults]
inventory = ./inventory
host_key_checking = False
stdout_callback = yaml
[privilege_escalation]
become = True
become_method = sudo
```
**inventory:**
```ini
[webservers]
web1 ansible_host=192.168.1.10 ansible_user=admin ansible_ssh_private_key_file=~/.ssh/id_rsa
[dbservers]
db1 ansible_host=192.168.1.20 ansible_user=admin ansible_ssh_private_key_file=~/.ssh/id_rsa
```
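
Once the Phase 1 answers are collected, the inventory can be rendered from them instead of typed by hand. A minimal sketch, assuming a hypothetical `render_inventory` helper and the same example hosts as above; it produces INI-style text and writes nothing to disk.

```python
def render_inventory(groups: dict[str, list[dict[str, str]]]) -> str:
    """Render INI inventory text from {group: [host dicts with a 'name' key]}."""
    lines = []
    for group, hosts in groups.items():
        lines.append(f"[{group}]")
        for host in hosts:
            # Every key except 'name' becomes a key=value host variable.
            variables = " ".join(f"{k}={v}" for k, v in host.items() if k != "name")
            lines.append(f"{host['name']} {variables}".rstrip())
        lines.append("")  # blank line between groups
    return "\n".join(lines).rstrip() + "\n"


inventory = render_inventory({
    "webservers": [{"name": "web1", "ansible_host": "192.168.1.10", "ansible_user": "admin"}],
    "dbservers": [{"name": "db1", "ansible_host": "192.168.1.20", "ansible_user": "admin"}],
})
print(inventory)
```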
### Phase 3: Connectivity Test
**Always test before writing playbooks:**
```bash
ansible all -m ping
```
| Result | Action |
|--------|--------|
| SUCCESS | Proceed to playbooks |
| UNREACHABLE | Check `ssh -v user@host` |
| Permission denied | Verify key path, permissions (600) |
| Sudo password required | Add `--ask-become-pass` or configure NOPASSWD |
### Phase 4: Incremental Playbook Development
Start simple, add one task at a time:
```yaml
# playbook.yml - start with facts
---
- hosts: all
  tasks:
    - name: Show OS info
      ansible.builtin.debug:
        msg: "{{ ansible_distribution }} {{ ansible_distribution_version }}"
```
Run: `ansible-playbook playbook.yml`
Then add tasks one by one, testing after each:
```yaml
- name: Ensure nginx installed
  ansible.builtin.package:
    name: nginx
    state: present
```
Run again. Fix any errors before adding more.
### Phase 5: Validation Cycle
After each change:
1. `ansible-playbook --syntax-check playbook.yml`
2. `ansible-playbook --check --diff playbook.yml`
3. `ansible-playbook playbook.yml`
4. Run again—verify `changed=0` (idempotency)
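
Step 4 can be checked from the PLAY RECAP lines of the second run. A sketch that parses the `key=value` counters, assuming the usual recap layout (`host : ok=3 changed=0 unreachable=0 failed=0 ...`); the helper names are hypothetical and the sample line is invented.

```python
import re

RECAP_FIELD = re.compile(r"(\w+)=(\d+)")


def parse_recap(line: str) -> dict[str, int]:
    """Extract the integer counters from one PLAY RECAP host line."""
    return {key: int(value) for key, value in RECAP_FIELD.findall(line)}


def is_idempotent(recap_lines: list[str]) -> bool:
    """True if every host reports changed=0 with no failures or unreachable hosts."""
    counters = [parse_recap(line) for line in recap_lines]
    return bool(counters) and all(
        c.get("changed") == 0 and c.get("failed", 0) == 0 and c.get("unreachable", 0) == 0
        for c in counters
    )


# Invented recap line, for illustration only.
second_run = ["web1 : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0"]
print(is_idempotent(second_run))
```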
## Red Flags - Stop and Debug
- Adding multiple untested tasks at once
- Skipping `--check` before real runs
- Ignoring "changed" on second run
- Not testing SSH before writing playbooks
## Communication Pattern
When guiding users:
- Explain what will happen before running commands
- After completion, summarize what was done
- When multiple approaches exist, present options with tradeoffs
- Acknowledge progress at milestones

@@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

@@ -0,0 +1,123 @@
---
name: ansible-playbook
description: Use when creating playbooks, roles, or inventory files. Use when automating infrastructure with Ansible. Use when encountering YAML syntax errors, module failures, or variable precedence issues.
---
# Ansible Playbook Development
## Overview
Ansible playbooks declare desired system state rather than imperative commands. The core principle is idempotency: running a playbook multiple times produces the same result without unintended changes.
## When to Use
- Creating new playbooks or roles
- Writing inventory files
- Debugging YAML syntax errors
- Troubleshooting module parameter issues
- Understanding variable precedence
- Converting shell scripts to Ansible
## Quick Reference
### Project Structure
```
project/
├── ansible.cfg # Configuration
├── inventory # Host definitions
├── group_vars/ # Group variables
├── host_vars/ # Host-specific vars
├── roles/ # Reusable roles
└── playbooks/ # Playbook files
```
### Essential ansible.cfg
```ini
[defaults]
inventory = ./inventory
roles_path = ./roles
host_key_checking = False
stdout_callback = yaml
[privilege_escalation]
become = True
become_method = sudo
```
### Module Patterns
| Operation | Module | Key Parameters |
|-----------|--------|----------------|
| Create directory | `ansible.builtin.file` | `state: directory`, `mode`, `owner` |
| Copy file | `ansible.builtin.copy` | `src`, `dest`, `mode` |
| Template | `ansible.builtin.template` | `src`, `dest`, variables in `.j2` |
| Install package | `ansible.builtin.package` | `name`, `state: present` |
| Manage service | `ansible.builtin.service` | `name`, `state`, `enabled` |
| Run command | `ansible.builtin.command` | `cmd`, register result, set `changed_when` |
### Variable Precedence (lowest to highest)
1. Role defaults (`defaults/main.yml`)
2. Inventory group_vars
3. Inventory host_vars
4. Playbook vars
5. Role vars (`vars/main.yml`)
6. Task vars
7. Extra vars (`-e`)
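
The list above is a simplification of Ansible's full precedence order, but the rule it expresses is simple: the highest layer that defines a variable wins. A Python sketch of that rule, assuming hypothetical layer names that mirror the seven entries; it does not model the additional levels Ansible actually has.

```python
# Lowest to highest, mirroring the simplified list above.
PRECEDENCE = [
    "role_defaults",
    "group_vars",
    "host_vars",
    "play_vars",
    "role_vars",
    "task_vars",
    "extra_vars",
]


def resolve(name: str, layers: dict[str, dict[str, object]]) -> tuple[object, str]:
    """Return (value, winning layer) for a variable defined in one or more layers."""
    value, source = None, None
    for level in PRECEDENCE:
        if name in layers.get(level, {}):
            # Later (higher) levels overwrite earlier ones.
            value, source = layers[level][name], level
    if source is None:
        raise KeyError(name)
    return value, source


layers = {
    "role_defaults": {"app_port": 8080},
    "host_vars": {"app_port": 9090},
}
print(resolve("app_port", layers))
```

This is why `defaults/main.yml` is the right home for overridable values: anything else in the list beats it.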
### Handlers
```yaml
tasks:
  - name: Update config
    ansible.builtin.template:
      src: app.conf.j2
      dest: /etc/app.conf
    notify: Restart app
handlers:
  - name: Restart app
    ansible.builtin.service:
      name: app
      state: restarted
```
### Error Handling
```yaml
- block:
    - name: Risky operation
      ansible.builtin.command: /opt/app/upgrade.sh
  rescue:
    - name: Handle failure
      ansible.builtin.debug:
        msg: "Upgrade failed, rolling back"
  always:
    - name: Cleanup
      ansible.builtin.file:
        path: /tmp/upgrade.lock
        state: absent
```
## Common Mistakes
| Mistake | Fix |
|---------|-----|
| Using short module names | Always use FQCN: `ansible.builtin.copy` not `copy` |
| Hardcoded values | Extract to variables in `defaults/main.yml` |
| Missing `changed_when` on commands | Add `changed_when: "'created' in result.stdout"` |
| Forgetting handler flush | Use `meta: flush_handlers` when needed before dependent tasks |
| YAML indentation errors | Use 2 spaces, never tabs |
| Colon in unquoted string | Quote values containing `: ` |
## Verification Commands
```bash
ansible-playbook --syntax-check playbook.yml # Check YAML
ansible-playbook --check playbook.yml # Dry run
ansible-playbook --check --diff playbook.yml # Show file changes
ansible-inventory --list # Verify inventory
ansible-inventory --host hostname # Check host vars
```

@@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

@@ -0,0 +1,444 @@
---
name: architecture-decision-records
description: "Write and maintain Architecture Decision Records (ADRs) following best practices for technical decision documentation. Use when documenting significant technical decisions, reviewing past architect..."
risk: unknown
source: community
date_added: "2026-02-27"
---
# Architecture Decision Records
Comprehensive patterns for creating, maintaining, and managing Architecture Decision Records (ADRs) that capture the context and rationale behind significant technical decisions.
## Use this skill when
- Making significant architectural decisions
- Documenting technology choices
- Recording design trade-offs
- Onboarding new team members
- Reviewing historical decisions
- Establishing decision-making processes
## Do not use this skill when
- You only need to document small implementation details
- The change is a minor patch or routine maintenance
- There is no architectural decision to capture
## Instructions
1. Capture the decision context, constraints, and drivers.
2. Document considered options with tradeoffs.
3. Record the decision, rationale, and consequences.
4. Link related ADRs and update status over time.
## Core Concepts
### 1. What is an ADR?
An Architecture Decision Record captures:
- **Context**: Why we needed to make a decision
- **Decision**: What we decided
- **Consequences**: What happens as a result
### 2. When to Write an ADR
| Write ADR | Skip ADR |
|-----------|----------|
| New framework adoption | Minor version upgrades |
| Database technology choice | Bug fixes |
| API design patterns | Implementation details |
| Security architecture | Routine maintenance |
| Integration patterns | Configuration changes |
### 3. ADR Lifecycle
```
Proposed → Accepted → Deprecated → Superseded
Proposed → Rejected
```
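
The lifecycle can be enforced as a small state machine, for example in a script that lints ADR status changes. A minimal sketch, assuming that Rejected branches off Proposed and that only the arrows in the diagram are legal; teams that allow Accepted to move straight to Superseded would add that edge. All names are hypothetical.

```python
# Allowed status transitions, taken from the lifecycle diagram.
ALLOWED_TRANSITIONS = {
    "Proposed": {"Accepted", "Rejected"},
    "Accepted": {"Deprecated"},
    "Deprecated": {"Superseded"},
    "Rejected": set(),     # terminal
    "Superseded": set(),   # terminal
}


def transition(current: str, new: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move an ADR from {current} to {new}")
    return new


status = "Proposed"
for step in ("Accepted", "Deprecated", "Superseded"):
    status = transition(status, step)
print(status)
```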
## Templates
### Template 1: Standard ADR (MADR Format)
```markdown
# ADR-0001: Use PostgreSQL as Primary Database
## Status
Accepted
## Context
We need to select a primary database for our new e-commerce platform. The system
will handle:
- ~10,000 concurrent users
- Complex product catalog with hierarchical categories
- Transaction processing for orders and payments
- Full-text search for products
- Geospatial queries for store locator
The team has experience with MySQL, PostgreSQL, and MongoDB. We need ACID
compliance for financial transactions.
## Decision Drivers
* **Must have ACID compliance** for payment processing
* **Must support complex queries** for reporting
* **Should support full-text search** to reduce infrastructure complexity
* **Should have good JSON support** for flexible product attributes
* **Team familiarity** reduces onboarding time
## Considered Options
### Option 1: PostgreSQL
- **Pros**: ACID compliant, excellent JSON support (JSONB), built-in full-text
search, PostGIS for geospatial, team has experience
- **Cons**: Slightly more complex replication setup than MySQL
### Option 2: MySQL
- **Pros**: Very familiar to team, simple replication, large community
- **Cons**: Weaker JSON support, less capable built-in full-text search (we would
likely need Elasticsearch), weaker geospatial support than PostGIS
### Option 3: MongoDB
- **Pros**: Flexible schema, native JSON, horizontal scaling
- **Cons**: No ACID for multi-document transactions (at decision time),
team has limited experience, requires schema design discipline
## Decision
We will use **PostgreSQL 15** as our primary database.
## Rationale
PostgreSQL provides the best balance of:
1. **ACID compliance** essential for e-commerce transactions
2. **Built-in capabilities** (full-text search, JSONB, PostGIS) reduce
infrastructure complexity
3. **Team familiarity** with SQL databases reduces learning curve
4. **Mature ecosystem** with excellent tooling and community support
The slight complexity in replication is outweighed by the reduction in
additional services (no separate Elasticsearch needed).
## Consequences
### Positive
- Single database handles transactions, search, and geospatial queries
- Reduced operational complexity (fewer services to manage)
- Strong consistency guarantees for financial data
- Team can leverage existing SQL expertise
### Negative
- Need to learn PostgreSQL-specific features (JSONB, full-text search syntax)
- Vertical scaling limits may require read replicas sooner
- Some team members need PostgreSQL-specific training
### Risks
- Full-text search may not scale as well as dedicated search engines
- Mitigation: Design for potential Elasticsearch addition if needed
## Implementation Notes
- Use JSONB for flexible product attributes
- Implement connection pooling with PgBouncer
- Set up streaming replication for read replicas
- Use pg_trgm extension for fuzzy search
## Related Decisions
- ADR-0002: Caching Strategy (Redis) - complements database choice
- ADR-0005: Search Architecture - may supersede if Elasticsearch needed
## References
- [PostgreSQL JSON Documentation](https://www.postgresql.org/docs/current/datatype-json.html)
- [PostgreSQL Full Text Search](https://www.postgresql.org/docs/current/textsearch.html)
- Internal: Performance benchmarks in `/docs/benchmarks/database-comparison.md`
```
### Template 2: Lightweight ADR
```markdown
# ADR-0012: Adopt TypeScript for Frontend Development
**Status**: Accepted
**Date**: 2024-01-15
**Deciders**: @alice, @bob, @charlie
## Context
Our React codebase has grown to 50+ components with increasing bug reports
related to prop type mismatches and undefined errors. PropTypes provide
runtime-only checking.
## Decision
Adopt TypeScript for all new frontend code. Migrate existing code incrementally.
## Consequences
**Good**: Catch type errors at compile time, better IDE support, self-documenting
code.
**Bad**: Learning curve for team, initial slowdown, build complexity increase.
**Mitigations**: TypeScript training sessions, allow gradual adoption with
`allowJs: true`.
```
### Template 3: Y-Statement Format
```markdown
# ADR-0015: API Gateway Selection
In the context of **building a microservices architecture**,
facing **the need for centralized API management, authentication, and rate limiting**,
we decided for **Kong Gateway**
and against **AWS API Gateway and custom Nginx solution**,
to achieve **vendor independence, plugin extensibility, and team familiarity with Lua**,
accepting that **we need to manage Kong infrastructure ourselves**.
```
### Template 4: ADR for Deprecation
```markdown
# ADR-0020: Deprecate MongoDB in Favor of PostgreSQL
## Status
Accepted (Supersedes ADR-0003)
## Context
ADR-0003 (2021) chose MongoDB for user profile storage due to schema flexibility
needs. Since then:
- MongoDB's multi-document transactions remain problematic for our use case
- Our schema has stabilized and rarely changes
- We now have PostgreSQL expertise from other services
- Maintaining two databases increases operational burden
## Decision
Deprecate MongoDB and migrate user profiles to PostgreSQL.
## Migration Plan
1. **Phase 1** (Week 1-2): Create PostgreSQL schema, dual-write enabled
2. **Phase 2** (Week 3-4): Backfill historical data, validate consistency
3. **Phase 3** (Week 5): Switch reads to PostgreSQL, monitor
4. **Phase 4** (Week 6): Remove MongoDB writes, decommission
## Consequences
### Positive
- Single database technology reduces operational complexity
- ACID transactions for user data
- Team can concentrate its expertise on PostgreSQL
### Negative
- Migration effort (~4 weeks)
- Risk of data issues during migration
- Lose some schema flexibility
## Lessons Learned
Lessons from the ADR-0003 experience:
- Schema flexibility benefits were overestimated
- Operational cost of multiple databases was underestimated
- Consider long-term maintenance in technology decisions
```
### Template 5: Request for Comments (RFC) Style
```markdown
# RFC-0025: Adopt Event Sourcing for Order Management
## Summary
Propose adopting event sourcing pattern for the order management domain to
improve auditability, enable temporal queries, and support business analytics.
## Motivation
Current challenges:
1. Audit requirements need complete order history
2. "What was the order state at time X?" queries are impossible
3. Analytics team needs event stream for real-time dashboards
4. Order state reconstruction for customer support is manual
## Detailed Design
### Event Store
~~~
OrderCreated { orderId, customerId, items[], timestamp }
OrderItemAdded { orderId, item, timestamp }
OrderItemRemoved { orderId, itemId, timestamp }
PaymentReceived { orderId, amount, paymentId, timestamp }
OrderShipped { orderId, trackingNumber, timestamp }
~~~
### Projections
- **CurrentOrderState**: Materialized view for queries
- **OrderHistory**: Complete timeline for audit
- **DailyOrderMetrics**: Analytics aggregation
### Technology
- Event Store: EventStoreDB (purpose-built, handles projections)
- Alternative considered: Kafka + custom projection service
## Drawbacks
- Learning curve for team
- Increased complexity vs. CRUD
- Need to design events carefully (immutable once stored)
- Storage growth (events never deleted)
## Alternatives
1. **Audit tables**: Simpler but doesn't enable temporal queries
2. **CDC from existing DB**: Complex, doesn't change data model
3. **Hybrid**: Event source only for order state changes
## Unresolved Questions
- [ ] Event schema versioning strategy
- [ ] Retention policy for events
- [ ] Snapshot frequency for performance
## Implementation Plan
1. Prototype with single order type (2 weeks)
2. Team training on event sourcing (1 week)
3. Full implementation and migration (4 weeks)
4. Monitoring and optimization (ongoing)
## References
- [Event Sourcing by Martin Fowler](https://martinfowler.com/eaaDev/EventSourcing.html)
- [EventStoreDB Documentation](https://www.eventstore.com/docs)
```
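The temporal query from the RFC template ("What was the order state at time X?") can be sketched as a replay over the event list. This is a minimal illustration, not EventStoreDB's API: events are assumed to be plain dicts with a `type` key, items are assumed to carry an `itemId` field, and `OrderState`, `apply_event`, and `replay` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OrderState:
    order_id: str = ""
    customer_id: str = ""
    items: dict = field(default_factory=dict)  # itemId -> item payload
    paid: float = 0.0
    tracking_number: Optional[str] = None


def apply_event(state, event):
    """Fold a single event into the current state."""
    kind = event["type"]
    if kind == "OrderCreated":
        state.order_id = event["orderId"]
        state.customer_id = event["customerId"]
        state.items = {item["itemId"]: item for item in event["items"]}
    elif kind == "OrderItemAdded":
        state.items[event["item"]["itemId"]] = event["item"]
    elif kind == "OrderItemRemoved":
        state.items.pop(event["itemId"], None)
    elif kind == "PaymentReceived":
        state.paid += event["amount"]
    elif kind == "OrderShipped":
        state.tracking_number = event["trackingNumber"]
    return state


def replay(events, as_of=None):
    """Rebuild order state, optionally ignoring events after `as_of`."""
    state = OrderState()
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if as_of is not None and event["timestamp"] > as_of:
            break
        apply_event(state, event)
    return state
```

Calling `replay(events)` yields the current projection, while `replay(events, as_of=t)` answers the point-in-time question without any extra storage.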
## ADR Management
### Directory Structure
```
docs/
├── adr/
│ ├── README.md # Index and guidelines
│ ├── template.md # Team's ADR template
│ ├── 0001-use-postgresql.md
│ ├── 0002-caching-strategy.md
│ ├── 0003-mongodb-user-profiles.md # [DEPRECATED]
│ └── 0020-deprecate-mongodb.md # Supersedes 0003
```
### ADR Index (README.md)
```markdown
# Architecture Decision Records
This directory contains Architecture Decision Records (ADRs) for [Project Name].
## Index
| ADR | Title | Status | Date |
|-----|-------|--------|------|
| 0001 | Use PostgreSQL as Primary Database | Accepted | 2024-01-10 |
| 0002 | Caching Strategy with Redis | Accepted | 2024-01-12 |
| 0003 | MongoDB for User Profiles | Deprecated | 2023-06-15 |
| 0020 | Deprecate MongoDB | Accepted | 2024-01-15 |
## Creating a New ADR
1. Copy `template.md` to `NNNN-title-with-dashes.md`
2. Fill in the template
3. Submit PR for review
4. Update this index after approval
## ADR Status
- **Proposed**: Under discussion
- **Accepted**: Decision made, implementing
- **Deprecated**: No longer relevant
- **Superseded**: Replaced by another ADR
- **Rejected**: Considered but not adopted
```
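The `NNNN-title-with-dashes.md` naming step can be automated with a few lines. The sketch below assumes ADR files start with a four-digit number and that the next number is simply the highest existing one plus one; `slugify` and `next_adr_filename` are hypothetical helper names.

```python
import re


def slugify(title):
    """Lower-case the title and join alphanumeric runs with dashes."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def next_adr_filename(existing, title):
    """Return the next NNNN-title-with-dashes.md name given existing file names."""
    numbers = []
    for name in existing:
        match = re.match(r"(\d{4})-", name)
        if match:  # skips README.md, template.md, and similar
            numbers.append(int(match.group(1)))
    return f"{max(numbers, default=0) + 1:04d}-{slugify(title)}.md"
```

Passing the output of a directory listing for `docs/adr/` as `existing` is enough; non-ADR files are ignored.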
### Automation (adr-tools)
```bash
# Install adr-tools
brew install adr-tools
# Initialize ADR directory
adr init docs/adr
# Create new ADR
adr new "Use PostgreSQL as Primary Database"
# Supersede an ADR
adr new -s 3 "Deprecate MongoDB in Favor of PostgreSQL"
# Generate table of contents
adr generate toc > docs/adr/README.md
# Link related ADRs
adr link 2 "Complements" 1 "Is complemented by"
```
## Review Process
```markdown
## ADR Review Checklist
### Before Submission
- [ ] Context clearly explains the problem
- [ ] All viable options considered
- [ ] Pros/cons balanced and honest
- [ ] Consequences (positive and negative) documented
- [ ] Related ADRs linked
### During Review
- [ ] At least 2 senior engineers reviewed
- [ ] Affected teams consulted
- [ ] Security implications considered
- [ ] Cost implications documented
- [ ] Reversibility assessed
### After Acceptance
- [ ] ADR index updated
- [ ] Team notified
- [ ] Implementation tickets created
- [ ] Related documentation updated
```
## Best Practices
### Do's
- **Write ADRs early** - Before implementation starts
- **Keep them short** - 1-2 pages maximum
- **Be honest about trade-offs** - Include real cons
- **Link related decisions** - Build decision graph
- **Update status** - Deprecate when superseded
### Don'ts
- **Don't change accepted ADRs** - Write new ones to supersede
- **Don't skip context** - Future readers need background
- **Don't hide failures** - Rejected decisions are valuable
- **Don't be vague** - Specific decisions, specific consequences
- **Don't forget implementation** - ADR without action is waste
## Resources
- [Documenting Architecture Decisions (Michael Nygard)](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions)
- [MADR Template](https://adr.github.io/madr/)
- [ADR GitHub Organization](https://adr.github.io/)
- [adr-tools](https://github.com/npryce/adr-tools)


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,310 @@
---
name: aws-cost-cleanup
description: "Automated cleanup of unused AWS resources to reduce costs"
risk: safe
source: community
date_added: "2026-02-27"
---
# AWS Cost Cleanup
Automate the identification and removal of unused AWS resources to eliminate waste.
## When to Use This Skill
Use this skill when you need to automatically clean up unused AWS resources to reduce costs and eliminate waste.
## Automated Cleanup Targets
**Storage**
- Unattached EBS volumes
- Old EBS snapshots (>90 days)
- Incomplete multipart S3 uploads
- Old S3 versions in versioned buckets
**Compute**
- Stopped EC2 instances (>30 days)
- Unused AMIs and associated snapshots
- Unused Elastic IPs
**Networking**
- Unused Elastic Load Balancers
- Unused NAT Gateways
- Orphaned ENIs
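The age thresholds in the list above (90 days for snapshots, 30 days for stopped instances) can be expressed as one small predicate. This is a sketch under assumptions: the `kind` labels and the `is_stale` helper are hypothetical, and the timestamp passed in is whatever marks the start of the idle period (for example a snapshot's start time).

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the target list; the keys are illustrative labels
MAX_AGE_DAYS = {"snapshot": 90, "stopped_instance": 30}


def is_stale(kind, timestamp, now=None):
    """True when a resource of `kind` is older than its cleanup threshold."""
    now = now or datetime.now(timezone.utc)
    return now - timestamp > timedelta(days=MAX_AGE_DAYS[kind])
```

Keeping the thresholds in one table makes it easier to review them with the team before any deletion script reads them.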
## Cleanup Scripts
### Safe Cleanup (Dry-Run First)
```bash
#!/bin/bash
# cleanup-unused-ebs.sh
echo "Finding unattached EBS volumes..."
VOLUMES=$(aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].VolumeId' \
--output text)
for vol in $VOLUMES; do
echo "Would delete: $vol"
# Uncomment to actually delete:
# aws ec2 delete-volume --volume-id $vol
done
```
```bash
#!/bin/bash
# cleanup-old-snapshots.sh
CUTOFF_DATE=$(date -d '90 days ago' --iso-8601)
aws ec2 describe-snapshots --owner-ids self \
--query "Snapshots[?StartTime<='$CUTOFF_DATE'].[SnapshotId,StartTime,VolumeSize]" \
--output text | while read snap_id start_time size; do
echo "Snapshot: $snap_id (Created: $start_time, Size: ${size}GB)"
# Uncomment to delete:
# aws ec2 delete-snapshot --snapshot-id $snap_id
done
```
```bash
#!/bin/bash
# release-unused-eips.sh
aws ec2 describe-addresses \
--query 'Addresses[?AssociationId==null].[AllocationId,PublicIp]' \
--output text | while read alloc_id public_ip; do
echo "Would release: $public_ip ($alloc_id)"
# Uncomment to release:
# aws ec2 release-address --allocation-id $alloc_id
done
```
### S3 Lifecycle Automation
```bash
# Apply lifecycle policy to transition old objects to cheaper storage
cat > lifecycle-policy.json <<EOF
{
"Rules": [
{
"Id": "Archive old objects",
"Status": "Enabled",
"Transitions": [
{
"Days": 90,
"StorageClass": "STANDARD_IA"
},
{
"Days": 180,
"StorageClass": "GLACIER"
}
],
"NoncurrentVersionExpiration": {
"NoncurrentDays": 30
},
"AbortIncompleteMultipartUpload": {
"DaysAfterInitiation": 7
}
}
]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
--bucket my-bucket \
--lifecycle-configuration file://lifecycle-policy.json
```
## Cost Impact Calculator
```python
#!/usr/bin/env python3
# calculate-savings.py
import boto3
from datetime import datetime, timedelta
ec2 = boto3.client('ec2')
# Calculate EBS volume savings
volumes = ec2.describe_volumes(
Filters=[{'Name': 'status', 'Values': ['available']}]
)
total_size = sum(v['Size'] for v in volumes['Volumes'])
monthly_cost = total_size * 0.10  # flat estimate: $0.10/GB-month is the gp2 rate in us-east-1 (gp3 is $0.08)
print(f"Unattached EBS Volumes: {len(volumes['Volumes'])}")
print(f"Total Size: {total_size} GB")
print(f"Monthly Savings: ${monthly_cost:.2f}")
# Calculate Elastic IP savings
addresses = ec2.describe_addresses()
unused = [a for a in addresses['Addresses'] if 'AssociationId' not in a]
eip_cost = len(unused) * 3.65 # $0.005/hour * 730 hours
print(f"\nUnused Elastic IPs: {len(unused)}")
print(f"Monthly Savings: ${eip_cost:.2f}")
print(f"\nTotal Monthly Savings: ${monthly_cost + eip_cost:.2f}")
print(f"Annual Savings: ${(monthly_cost + eip_cost) * 12:.2f}")
```
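The same estimate can be split into pure functions that take the `Volumes` and `Addresses` lists from the API responses, which makes the arithmetic testable without AWS access. The per-GB prices below are assumed us-east-1 list prices and should be checked against current pricing; the function and constant names are hypothetical.

```python
# Assumed us-east-1 list prices in USD per GB-month; verify before relying on them
PRICE_PER_GB_MONTH = {"gp2": 0.10, "gp3": 0.08}
DEFAULT_PRICE_PER_GB_MONTH = 0.10
EIP_HOURLY_PRICE = 0.005
HOURS_PER_MONTH = 730


def monthly_ebs_cost(volumes):
    """Sum the monthly storage cost of volumes shaped like describe_volumes output."""
    return sum(
        v["Size"] * PRICE_PER_GB_MONTH.get(v.get("VolumeType"), DEFAULT_PRICE_PER_GB_MONTH)
        for v in volumes
    )


def monthly_eip_cost(addresses):
    """Cost of Elastic IPs that have no AssociationId, as in the script above."""
    unused = [a for a in addresses if "AssociationId" not in a]
    return len(unused) * EIP_HOURLY_PRICE * HOURS_PER_MONTH
```

Pricing by `VolumeType` avoids overstating savings when most unattached volumes are gp3.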
## Automated Cleanup Lambda
```python
import boto3
from datetime import datetime, timedelta, timezone


def lambda_handler(event, context):
    ec2 = boto3.client('ec2')

    # Delete unattached volumes older than 7 days.
    # CreateTime is timezone-aware (UTC), so compare against an aware cutoff.
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    deleted = 0

    # Paginate so accounts with many volumes are fully covered
    paginator = ec2.get_paginator('describe_volumes')
    pages = paginator.paginate(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )
    for page in pages:
        for vol in page['Volumes']:
            if vol['CreateTime'] < cutoff:
                try:
                    ec2.delete_volume(VolumeId=vol['VolumeId'])
                    deleted += 1
                    print(f"Deleted volume: {vol['VolumeId']}")
                except Exception as e:
                    print(f"Error deleting {vol['VolumeId']}: {e}")

    return {
        'statusCode': 200,
        'body': f'Deleted {deleted} volumes'
    }
```
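Because the Lambda deletes immediately, it can help to separate "which volumes qualify" from "delete them", so the selection can be logged or reviewed as a dry run first. The sketch below is one possible shape, not part of the original function: `volumes_to_delete` and the `do-not-delete` tag key are hypothetical, and the input is assumed to be the `Volumes` list from `describe_volumes`.

```python
from datetime import datetime, timedelta, timezone


def volumes_to_delete(volumes, min_age_days=7, keep_tag="do-not-delete", now=None):
    """Return IDs of volumes older than `min_age_days` that lack the keep tag."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    selected = []
    for vol in volumes:
        tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
        if keep_tag in tags:
            continue  # owner explicitly opted this volume out of cleanup
        if vol["CreateTime"] < cutoff:
            selected.append(vol["VolumeId"])
    return selected
```

A handler could print this list when an environment flag requests a dry run and only call `delete_volume` otherwise.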
## Cleanup Workflow
1. **Discovery Phase** (Read-only)
- Run all describe commands
- Generate cost impact report
- Review with team
2. **Validation Phase**
- Verify resources are truly unused
- Check for dependencies
- Notify resource owners
3. **Execution Phase** (Dry-run first)
- Run cleanup scripts with dry-run
- Review proposed changes
- Execute actual cleanup
4. **Verification Phase**
- Confirm deletions
- Monitor for issues
- Document savings
## Safety Checklist
- [ ] Run in dry-run mode first
- [ ] Verify resources have no dependencies
- [ ] Check resource tags for ownership
- [ ] Notify stakeholders before deletion
- [ ] Create snapshots of critical data
- [ ] Test in non-production first
- [ ] Have rollback plan ready
- [ ] Document all deletions
## Example Prompts
**Discovery**
- "Find all unused resources and calculate potential savings"
- "Generate a cleanup report for my AWS account"
- "What resources can I safely delete?"
**Execution**
- "Create a script to cleanup unattached EBS volumes"
- "Delete all snapshots older than 90 days"
- "Release unused Elastic IPs"
**Automation**
- "Set up automated cleanup for old snapshots"
- "Create a Lambda function for weekly cleanup"
- "Schedule monthly resource cleanup"
## Integration with AWS Organizations
```bash
# Run cleanup across multiple accounts
for account in $(aws organizations list-accounts \
--query 'Accounts[*].Id' --output text); do
echo "Checking account: $account"
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--profile account-$account
done
```
## Monitoring and Alerts
```bash
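# Create a CloudWatch billing alarm on month-to-date estimated charges.
# Billing metrics are published only in us-east-1 and require the Currency
# dimension; billing alerts must be enabled in the account first.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name high-cost-alert \
  --alarm-description "Alert when month-to-date estimated charges exceed threshold" \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 86400 \
  --evaluation-periods 1 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold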
```
## Best Practices
- Schedule cleanup during maintenance windows
- Always create final snapshots before deletion
- Use resource tags to identify cleanup candidates
- Implement approval workflow for production
- Log all cleanup actions for audit
- Set up cost anomaly detection
- Review cleanup results weekly
## Risk Mitigation
**Medium Risk Actions:**
- Deleting unattached volumes (ensure no planned reattachment)
- Removing old snapshots (verify no compliance requirements)
- Releasing Elastic IPs (check DNS records)
**Always:**
- Maintain 30-day backup retention
- Use AWS Backup for critical resources
- Test restore procedures
- Document cleanup decisions
## Kiro CLI Integration
```bash
# Analyze and cleanup in one command
kiro-cli chat "Use aws-cost-cleanup to find and remove unused resources"
# Generate cleanup script
kiro-cli chat "Create a safe cleanup script for my AWS account"
# Schedule automated cleanup
kiro-cli chat "Set up weekly automated cleanup using aws-cost-cleanup"
```
## Additional Resources
- [AWS Resource Cleanup Best Practices](https://aws.amazon.com/blogs/mt/automate-resource-cleanup/)
- [AWS Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html)
- [AWS Config Rules for Compliance](https://docs.aws.amazon.com/config/latest/developerguide/managed-rules-by-aws-config.html)


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,193 @@
---
name: aws-cost-optimizer
description: "Comprehensive AWS cost analysis and optimization recommendations using AWS CLI and Cost Explorer"
risk: safe
source: community
date_added: "2026-02-27"
---
# AWS Cost Optimizer
Analyze AWS spending patterns, identify waste, and provide actionable cost reduction strategies.
## When to Use This Skill
Use this skill when you need to analyze AWS spending, identify cost optimization opportunities, or reduce cloud waste.
## Core Capabilities
**Cost Analysis**
- Parse AWS Cost Explorer data for trends and anomalies
- Break down costs by service, region, and resource tags
- Identify month-over-month spending increases
**Resource Optimization**
- Detect idle EC2 instances (low CPU utilization)
- Find unattached EBS volumes and old snapshots
- Identify unused Elastic IPs
- Locate underutilized RDS instances
- Find old S3 objects eligible for lifecycle policies
**Savings Recommendations**
- Suggest Reserved Instance/Savings Plans opportunities
- Recommend instance rightsizing based on CloudWatch metrics
- Identify resources in expensive regions
- Calculate potential savings with specific actions
## AWS CLI Commands
### Get Cost and Usage
```bash
# Last 30 days cost by service
aws ce get-cost-and-usage \
--time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
# Daily costs for current month
aws ce get-cost-and-usage \
--time-period Start=$(date +%Y-%m-01),End=$(date +%Y-%m-%d) \
--granularity DAILY \
--metrics UnblendedCost
```
### Find Unused Resources
```bash
# Unattached EBS volumes
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].[VolumeId,Size,VolumeType,CreateTime]' \
--output table
# Unused Elastic IPs
aws ec2 describe-addresses \
--query 'Addresses[?AssociationId==null].[PublicIp,AllocationId]' \
--output table
# Idle EC2 instances (requires CloudWatch)
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxxxx \
--start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 86400 \
--statistics Average
# Old EBS snapshots (>90 days)
aws ec2 describe-snapshots \
--owner-ids self \
--query 'Snapshots[?StartTime<=`'$(date -d '90 days ago' --iso-8601)'`].[SnapshotId,StartTime,VolumeSize]' \
--output table
```
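The `get-metric-statistics` call returns a `Datapoints` list, and deciding whether an instance is idle is a small post-processing step. The sketch below assumes daily `Average` datapoints as requested above and uses a 5% threshold; the `is_idle` helper and its `min_points` guard are hypothetical.

```python
def is_idle(datapoints, threshold_pct=5.0, min_points=7):
    """True when every daily CPU average is below the threshold.

    Returns False when there are too few datapoints to judge, so a newly
    launched instance is not flagged by mistake.
    """
    averages = [point["Average"] for point in datapoints]
    if len(averages) < min_points:
        return False
    return max(averages) < threshold_pct
```

Using the maximum of the daily averages rather than their mean keeps one busy day from being hidden by six quiet ones.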
### Rightsizing Analysis
```bash
# List EC2 instances with their types
aws ec2 describe-instances \
--query 'Reservations[*].Instances[*].[InstanceId,InstanceType,State.Name,Tags[?Key==`Name`].Value|[0]]' \
--output table
# Get RDS instance utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=mydb \
--start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 86400 \
--statistics Average,Maximum
```
## Optimization Workflow
1. **Baseline Assessment**
- Pull 3-6 months of cost data
- Identify top 5 spending services
- Calculate growth rate
2. **Quick Wins**
- Delete unattached EBS volumes
- Release unused Elastic IPs
- Stop/terminate idle EC2 instances
- Delete old snapshots
3. **Strategic Optimization**
- Analyze Reserved Instance coverage
- Review instance types vs. workload
- Implement S3 lifecycle policies
- Consider Spot instances for non-critical workloads
4. **Ongoing Monitoring**
- Set up AWS Budgets with alerts
- Enable Cost Anomaly Detection
- Tag resources for cost allocation
- Monthly cost review meetings
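The "calculate growth rate" step of the baseline assessment is simple arithmetic over monthly totals. A minimal sketch, assuming the totals are already extracted from Cost Explorer output into a list of numbers ordered oldest first (`month_over_month_growth` is a hypothetical helper):

```python
def month_over_month_growth(monthly_costs):
    """Return percentage change between consecutive months.

    A month following a zero-cost month yields None, since the
    percentage change is undefined.
    """
    growth = []
    for previous, current in zip(monthly_costs, monthly_costs[1:]):
        if previous == 0:
            growth.append(None)
        else:
            growth.append((current - previous) / previous * 100)
    return growth
```

Three to six months of totals, as suggested in the workflow, give two to five growth figures to compare.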
## Cost Optimization Checklist
- [ ] Enable AWS Cost Explorer
- [ ] Set up cost allocation tags
- [ ] Create AWS Budget with alerts
- [ ] Review and delete unused resources
- [ ] Analyze Reserved Instance opportunities
- [ ] Implement S3 Intelligent-Tiering
- [ ] Review data transfer costs
- [ ] Optimize Lambda memory allocation
- [ ] Use CloudWatch Logs retention policies
- [ ] Consider multi-region cost differences
## Example Prompts
**Analysis**
- "Show me AWS costs for the last 3 months broken down by service"
- "What are my top 10 most expensive resources?"
- "Compare this month's spending to last month"
**Optimization**
- "Find all unattached EBS volumes and calculate savings"
- "Identify EC2 instances with <5% CPU utilization"
- "Suggest Reserved Instance purchases based on usage"
- "Calculate savings from deleting snapshots older than 90 days"
**Implementation**
- "Create a script to delete unattached volumes"
- "Set up a budget alert for $1000/month"
- "Generate a cost optimization report for leadership"
## Best Practices
- Always test in non-production first
- Verify resources are truly unused before deletion
- Document all cost optimization actions
- Calculate ROI for optimization efforts
- Automate recurring optimization tasks
- Use AWS Trusted Advisor recommendations
- Enable AWS Cost Anomaly Detection
## Integration with Kiro CLI
This skill works seamlessly with Kiro CLI's AWS integration:
```bash
# Use Kiro to analyze costs
kiro-cli chat "Use aws-cost-optimizer to analyze my spending"
# Generate optimization report
kiro-cli chat "Create a cost optimization plan using aws-cost-optimizer"
```
## Safety Notes
- **Risk Level: Low** - Read-only analysis is safe
- **Deletion Actions: Medium Risk** - Always verify before deleting resources
- **Production Changes: High Risk** - Test rightsizing in dev/staging first
- Maintain backups before any deletion
- Use `--dry-run` flag when available
## Additional Resources
- [AWS Cost Optimization Best Practices](https://aws.amazon.com/pricing/cost-optimization/)
- [AWS Well-Architected Framework - Cost Optimization](https://docs.aws.amazon.com/wellarchitected/latest/cost-optimization-pillar/welcome.html)
- [AWS Cost Explorer API](https://docs.aws.amazon.com/cost-management/latest/APIReference/Welcome.html)


@ -0,0 +1,144 @@
---
name: aws-iam-debugging
description: Use when hitting AWS AccessDenied, authorization failures, IRSA/EKS pod permission errors, SSO session issues, cross-account AssumeRole failures, or MalformedPolicyDocument errors involving AWSReservedSSO_* principals in multi-account/Organizations environments.
---
# AWS IAM Debugging
## Overview
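IAM failures have predictable root causes: identify the caller, simulate or inspect the policy, and check SCPs in multi-account setups. For cross-account S3 access, both the IAM policy and the bucket policy must allow the request; within a single account either one can grant access, and an explicit Deny in either always blocks it.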
## Error Reference
| Error | Likely cause |
|-------|-------------|
| `is not authorized to perform: X on resource: Y` | Missing IAM policy statement |
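| `MalformedPolicyDocument: Invalid principal` | Trust policy names an `AWSReservedSSO_*` role by an ARN that does not exist (the real role lives under the `aws-reserved/sso.amazonaws.com/` path) |
| `Access Denied` (S3) | Cross-account access needs both bucket policy and IAM to allow; an explicit Deny or an SCP may also be blocking |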
| `AccessDenied` (STS AssumeRole) | Trust policy missing caller ARN, or SCP blocks |
| `InvalidClientTokenId` | Wrong region, expired credentials, wrong profile |
| `TokenRefreshRequired` | SSO session expired — run `aws sso login` |
| `Unable to locate credentials` | No credentials configured — check `~/.aws/credentials` or env vars |
## Diagnostic Flow
**Step 1: Who is calling?**
```bash
aws sts get-caller-identity
# Arn field tells you exactly what entity is making the call
```
**Step 2: Simulate the permission**
```bash
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::<account>:role/<role> \
--action-names s3:GetObject \
--resource-arns arn:aws:s3:::<bucket>/*
aws iam list-attached-role-policies --role-name <role>
aws iam list-role-policies --role-name <role> # inline policies
aws iam get-role-policy --role-name <role> --policy-name <policy>
```
**Step 3: Check SCPs (multi-account)**
```bash
aws organizations list-policies-for-target \
--target-id <account-id> --filter SERVICE_CONTROL_POLICY
aws organizations describe-policy --policy-id <policy-id>
```
## AWSReservedSSO_* Principal Gotcha
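`AWSReservedSSO_*` roles should not be named directly as principals in trust policies. The role lives under the `aws-reserved/sso.amazonaws.com/` path, so a short ARN without that path is rejected as an invalid principal, and the random suffix changes if the permission set is recreated. Trust the account root and match the role with a condition instead.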
```hcl
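# WRONG: short ARN without the aws-reserved path is rejected
principals {
  type        = "AWS"
  identifiers = ["arn:aws:iam::123456789012:role/AWSReservedSSO_Admin_abc"]
}

# CORRECT: trust the account root and match the SSO role via condition.
# aws:PrincipalArn holds the role ARN (not the assumed-role session ARN);
# the wildcard after sso.amazonaws.com/ also covers an optional region segment.
principals {
  type        = "AWS"
  identifiers = ["arn:aws:iam::123456789012:root"]
}
condition {
  test     = "StringLike"
  variable = "aws:PrincipalArn"
  values   = ["arn:aws:iam::123456789012:role/aws-reserved/sso.amazonaws.com/*AWSReservedSSO_Admin_*"]
}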
```
Alternatives: `aws:PrincipalOrgID` (if all callers are in the org), or `aws:PrincipalTag`.
## IRSA (EKS IAM Roles for Service Accounts)
```bash
# Check ServiceAccount annotation
kubectl get sa <name> -n <namespace> -o yaml | grep eks.amazonaws.com
# Verify OIDC provider is registered
aws iam list-open-id-connect-providers
# Inspect role trust policy condition (must match exactly)
aws iam get-role --role-name <role> \
| jq '.Role.AssumeRolePolicyDocument.Statement[].Condition'
# Required: "oidc.eks.<region>.amazonaws.com/id/<OIDC_ID>:sub":
# "system:serviceaccount:<namespace>:<sa-name>"
# Test from inside the pod
kubectl exec -n <ns> <pod> -- aws sts get-caller-identity
```
Common mistakes: namespace/SA name typo in trust policy; OIDC provider not registered.
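The exact-match requirement on the `:sub` condition is easy to check programmatically once the trust policy has been fetched as JSON. The sketch below assumes the policy is already a decoded dict (as in the `AssumeRolePolicyDocument` printed above) and only inspects `StringEquals`; wildcard `StringLike` conditions are not evaluated. `irsa_subject_matches` is a hypothetical helper.

```python
def irsa_subject_matches(trust_policy, oidc_provider, namespace, service_account):
    """True if some statement pins :sub to exactly this namespace and ServiceAccount.

    `oidc_provider` is the issuer without scheme, for example
    "oidc.eks.<region>.amazonaws.com/id/<OIDC_ID>".
    """
    expected = f"system:serviceaccount:{namespace}:{service_account}"
    key = f"{oidc_provider}:sub"
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    for statement in statements:
        value = statement.get("Condition", {}).get("StringEquals", {}).get(key)
        allowed = [value] if isinstance(value, str) else (value or [])
        if expected in allowed:
            return True
    return False
```

A `False` result for a pod that should have access usually points at one of the typos listed above.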
## S3 Access Denied
```bash
aws s3api get-bucket-policy --bucket <bucket>
aws s3api get-bucket-acl --bucket <bucket>
aws s3api get-public-access-block --bucket <bucket>
aws s3 ls s3://<bucket> --debug 2>&1 | grep "Final credentials"
```
## Cross-Account AssumeRole
```bash
# Try manually
aws sts assume-role \
--role-arn arn:aws:iam::<target-account>:role/<role> \
--role-session-name test-session
# If AccessDenied, check:
# 1. Trust policy of target role allows caller's ARN
# 2. Caller has sts:AssumeRole in their own account
# 3. No SCP blocks sts:AssumeRole in either account
aws iam get-role --role-name <role> | jq '.Role.AssumeRolePolicyDocument'
```
## SSO / Identity Center Sessions
```bash
aws sso login --profile <profile>
aws configure list-profiles
aws sts get-caller-identity --profile <profile>
# Clear stale tokens
rm ~/.aws/sso/cache/*.json && aws sso login --profile <profile>
```
## CloudTrail — Find What Was Denied
```bash
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
--start-time "2024-01-01T00:00:00Z" --max-results 10
# Filter by error code
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=Username,AttributeValue=<username> \
| jq '.Events[] | select(.CloudTrailEvent | fromjson | .errorCode != null)'
```
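The `jq` filter above can also be expressed in Python when the lookup output needs further processing. This sketch assumes the decoded JSON response of `lookup-events`, where each entry's `CloudTrailEvent` field is itself a JSON string; `denied_events` is a hypothetical helper.

```python
import json


def denied_events(lookup_response):
    """Return (eventName, errorCode, errorMessage) for events that failed."""
    denied = []
    for event in lookup_response.get("Events", []):
        detail = json.loads(event["CloudTrailEvent"])
        if detail.get("errorCode"):
            denied.append(
                (detail.get("eventName"), detail["errorCode"], detail.get("errorMessage"))
            )
    return denied
```

The `errorMessage` field often names the exact action and resource that were refused, which feeds directly into the simulation step of the diagnostic flow.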


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,23 @@
---
name: aws-skills
description: "AWS development with infrastructure automation and cloud architecture patterns"
risk: safe
source: "https://github.com/zxkane/aws-skills"
date_added: "2026-02-27"
---
# Aws Skills
## Overview
AWS development with infrastructure automation and cloud architecture patterns
## When to Use This Skill
Use this skill when you need to work with aws development with infrastructure automation and cloud architecture patterns.
## Instructions
This skill provides guidance and patterns for aws development with infrastructure automation and cloud architecture patterns.
For more information, see the [source repository](https://github.com/zxkane/aws-skills).


@ -0,0 +1,180 @@
---
name: azure-devops-pipeline
description: Generates Azure DevOps pipeline YAML using EKS-Pool with nonprod auto-deploy and prod manual approval gate. Always load this skill first, then load the type-specific skill before generating any YAML.
---
## What I do
Guide the generation of a complete `azure-pipelines.yml` file for a self-hosted EKS-Pool Azure DevOps agent pool. I define all shared standards. You MUST also load the appropriate type skill before generating YAML:
- Lambda deployments → load `azure-pipeline-lambda`
- Ansible playbooks → load `azure-pipeline-ansible`
- Docker builds → load `azure-pipeline-docker`
## IMPORTANT — do not generate YAML without loading a type skill
STOP. Before generating any pipeline YAML, you MUST load the type skill that matches the requested pipeline type:
- `azure-pipeline-lambda` for Lambda
- `azure-pipeline-ansible` for Ansible
- `azure-pipeline-docker` for Docker
Generate nothing until that skill is loaded.
## Required inputs — ask the user for these before generating
1. **Service/repo name** — used in display names and tags
2. **Pipeline type** — `lambda` | `ansible` | `docker`
3. **Target tier** — `nonprod` | `prod`
4. **Trigger branch** — branch that triggers auto-deploy (default: `main`)
5. **Secret sources** — which are in use: `ADO variable groups` | `AWS SSM/Secrets Manager` | `Vault/OpenBao` (can be multiple)
6. **ADO variable group name(s)** — if ADO variable groups selected
## Pipeline skeleton — always use this structure
```yaml
trigger:
  branches:
    include:
      - <trigger-branch>
pool: EKS-Pool
stages:
  - stage: Lint
    displayName: "Lint"
    jobs:
      - job: Lint
        pool: EKS-Pool
        timeoutInMinutes: 30
        continueOnError: false
        steps: [] # type skill fills this in
  - stage: SecurityScan
    displayName: "Security Scan"
    dependsOn: Lint
    condition: succeeded()
    jobs:
      - job: SecurityScan
        pool: EKS-Pool
        timeoutInMinutes: 30
        continueOnError: false
        steps: [] # type skill fills this in
  - stage: Build
    displayName: "Build"
    dependsOn: SecurityScan
    condition: succeeded()
    jobs:
      - job: Build
        pool: EKS-Pool
        timeoutInMinutes: 30
        continueOnError: false
        steps: [] # type skill fills this in
  - stage: DeployNonprod
    displayName: "Deploy — Nonprod"
    dependsOn: Build
    condition: succeeded()
    jobs:
      - deployment: DeployNonprod
        displayName: "Deploy to Nonprod"
        pool: EKS-Pool
        timeoutInMinutes: 30
        environment: nonprod
        strategy:
          runOnce:
            deploy:
              steps: [] # type skill fills this in
  - stage: DeployProd
    displayName: "Deploy — Prod"
    dependsOn: DeployNonprod
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/<trigger-branch>'))
    jobs:
      - deployment: DeployProd
        displayName: "Deploy to Prod"
        pool: EKS-Pool
        timeoutInMinutes: 30
        environment: prod # manual approval gate configured in ADO environment settings
        strategy:
          runOnce:
            deploy:
              steps: [] # type skill fills this in + git tag step below
```
## Prod tier pipelines
When `target tier` is `prod`, omit `DeployNonprod` entirely. The pipeline contains only `Lint` → `SecurityScan` → `Build` → `DeployProd` with the manual approval gate, and `DeployProd` must then depend on `Build` because `DeployNonprod` no longer exists.
When `target tier` is `nonprod`, omit `DeployProd` entirely.
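The tier rule can be stated as a small function that returns the stage list and the `dependsOn` chain for a given tier. This is an illustrative sketch only; `stages_for_tier` and `depends_on` are hypothetical helpers, and the stage names are the ones used in the skeleton.

```python
BASE_STAGES = ["Lint", "SecurityScan", "Build"]


def stages_for_tier(tier):
    """Return the ordered stage names for a target tier."""
    if tier == "nonprod":
        return BASE_STAGES + ["DeployNonprod"]
    if tier == "prod":
        return BASE_STAGES + ["DeployProd"]
    raise ValueError(f"unknown tier: {tier!r}")


def depends_on(stages):
    """Map each stage to the stage it depends on (the one before it)."""
    return dict(zip(stages[1:], stages[:-1]))
```

Deriving `dependsOn` from the list keeps the deploy stage pointing at `Build` whenever the other deploy stage is omitted.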
## Git tagging on prod deploy
Add this as the final step inside `DeployProd`'s steps (prod tier only):
```yaml
- script: |
git config user.email "azdo-pipeline@$(System.TeamProject)"
git config user.name "Azure DevOps Pipeline"
git remote set-url origin "https://x-token:$(System.AccessToken)@$(echo $BUILD_REPOSITORY_URI | sed 's|https://||')"
git tag $(Build.BuildNumber) $(Build.SourceVersion)
git push origin $(Build.BuildNumber)
displayName: "Tag commit with build number"
env:
SYSTEM_ACCESSTOKEN: $(System.AccessToken)
BUILD_REPOSITORY_URI: $(Build.Repository.Uri)
```
## Secret handling patterns
Emit the correct block(s) based on declared secret sources:
### ADO variable groups
```yaml
variables:
- group: <variable-group-name>
```
Reference values as `$(VAR_NAME)` throughout the pipeline.
### AWS SSM Parameter Store
```yaml
- script: |
VALUE=$(aws ssm get-parameter \
--name "/myapp/mykey" \
--with-decryption \
--query "Parameter.Value" \
--output text)
echo "##vso[task.setvariable variable=MY_VAR;issecret=true]$VALUE"
displayName: "Fetch secret from SSM"
```
### AWS Secrets Manager
```yaml
- script: |
VALUE=$(aws secretsmanager get-secret-value \
--secret-id "myapp/mykey" \
--query "SecretString" \
--output text)
echo "##vso[task.setvariable variable=MY_VAR;issecret=true]$VALUE"
displayName: "Fetch secret from Secrets Manager"
```
### Vault / OpenBao
```yaml
- script: |
VALUE=$(vault kv get -field=mykey secret/myapp/mykey)
echo "##vso[task.setvariable variable=MY_VAR;issecret=true]$VALUE"
displayName: "Fetch secret from Vault"
env:
VAULT_ADDR: $(VAULT_ADDR)
VAULT_TOKEN: $(VAULT_TOKEN)
```
## Hard rules — always follow these
- `pool: EKS-Pool` on every job — no exceptions
- `timeoutInMinutes: 30` on every job
- `continueOnError: false` at **job level** on every job (not step level). Step-level `continueOnError` may be omitted.
- No secrets hardcoded in YAML — all via variable groups or runtime fetch
- Every stage and job has a `displayName:` set
- `pool: EKS-Pool` must appear at job level, not stage level, to ensure it applies correctly
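The structural rules above lend themselves to an automated check on generated pipelines. The sketch below assumes the YAML has already been parsed into a dict by whatever YAML library is in use; `check_hard_rules` is a hypothetical helper and it does not cover the "no hardcoded secrets" rule.

```python
REQUIRED_POOL = "EKS-Pool"
REQUIRED_TIMEOUT = 30


def check_hard_rules(pipeline):
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    for stage in pipeline.get("stages", []):
        stage_name = stage.get("stage", "<unnamed stage>")
        if not stage.get("displayName"):
            problems.append(f"{stage_name}: stage has no displayName")
        for job in stage.get("jobs", []):
            # Deployment jobs use the `deployment` key instead of `job`
            job_name = job.get("job") or job.get("deployment") or "<unnamed job>"
            where = f"{stage_name}/{job_name}"
            if job.get("pool") != REQUIRED_POOL:
                problems.append(f"{where}: pool must be {REQUIRED_POOL}")
            if job.get("timeoutInMinutes") != REQUIRED_TIMEOUT:
                problems.append(f"{where}: timeoutInMinutes must be {REQUIRED_TIMEOUT}")
            if job.get("continueOnError") is not False:
                problems.append(f"{where}: continueOnError must be false at job level")
            if not job.get("displayName"):
                problems.append(f"{where}: job has no displayName")
    return problems
```

Running such a check before emitting YAML gives a concrete list of what to fix rather than relying on re-reading the rules.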


@ -0,0 +1,145 @@
---
name: azure-pipeline-ansible
description: Extends azure-devops-pipeline for Ansible playbook runs. Handles syntax check, galaxy install, vault passwords, SSH key injection, check mode on nonprod, and dynamic AWS EC2 inventory. Always load azure-devops-pipeline first.
---
## What I add
Type-specific steps for Ansible pipelines. Merge these into the skeleton from `azure-devops-pipeline`.
## Additional required inputs — ask the user
1. **Playbook path** — e.g. `playbooks/site.yml`
2. **Inventory source** — `static` | `dynamic-aws-ec2`
3. **Ansible Vault in use** — `yes` | `no`
4. **ADO secret variable name for vault password** — if vault in use, e.g. `ANSIBLE_VAULT_PASSWORD`
5. **ADO secret variable name for SSH private key** — e.g. `ANSIBLE_SSH_KEY`
6. **Ansible version to pin** — e.g. `9.2.0`
7. **Run --check mode on nonprod before real apply** — `yes` (default) | `no`
## Lint stage steps
```yaml
- script: |
    # ANSIBLE_VERSION comes from the env: mapping, so read it as a shell variable
    pip install "ansible==${ANSIBLE_VERSION}" ansible-lint
    ansible-lint <playbook-path> --profile production
  displayName: "Lint — ansible-lint"
  env:
    ANSIBLE_VERSION: <ansible-version>
```
## Security scan stage steps
```yaml
- script: |
    # ANSIBLE_VERSION comes from the env: mapping, so read it as a shell variable
    pip install "ansible==${ANSIBLE_VERSION}" ansible-lint
    # "safety" is the built-in ansible-lint profile that flags risky constructs
    ansible-lint <playbook-path> --profile safety \
      --sarif-file ansible-lint-security.sarif || true
    ansible-galaxy install -r requirements.yml --force
  displayName: "Security scan — ansible-lint safety profile"
  env:
    ANSIBLE_VERSION: <ansible-version>
- task: PublishBuildArtifacts@1
  inputs:
    pathToPublish: ansible-lint-security.sarif
    artifactName: security-scan
  displayName: "Publish scan results"
```
## Build stage steps
```yaml
- script: |
    # ANSIBLE_VERSION comes from the env: mapping, so read it as a shell variable
    pip install "ansible==${ANSIBLE_VERSION}"
    [ -f requirements.yml ] && ansible-galaxy install -r requirements.yml || true
    ansible-playbook <playbook-path> --syntax-check -i <inventory-file>
  displayName: "Validate — syntax check and galaxy install"
  env:
    ANSIBLE_VERSION: <ansible-version>
```
Note: for dynamic-aws-ec2 inventory, replace `-i <inventory-file>` with `-i aws_ec2.yml` and ensure `aws_ec2.yml` exists in the repo with the `amazon.aws.aws_ec2` plugin configured.
## Deploy stage steps
### Step order — always emit in this order
1. Write SSH key to temp file
2. Write vault password to temp file (if vault in use)
3. Check mode run (nonprod only, if enabled)
4. Real playbook run
5. Clean up SSH key (condition: always)
6. Clean up vault password (condition: always)
### SSH key injection (always include)
```yaml
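- script: |
    # Read the key from the mapped environment variable so the secret
    # is not expanded into the script text itself.
    printf '%s\n' "$ANSIBLE_SSH_KEY" > /tmp/ansible_ssh_key
    chmod 600 /tmp/ansible_ssh_key
  displayName: "Inject SSH key"
  env:
    ANSIBLE_SSH_KEY: $(ANSIBLE_SSH_KEY)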
```
### Vault password file (include only if vault in use)
```yaml
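- script: |
    # Read the password from the mapped environment variable so the secret
    # is not expanded into the script text itself.
    printf '%s\n' "$ANSIBLE_VAULT_PASSWORD" > /tmp/vault_pass
    chmod 600 /tmp/vault_pass
  displayName: "Write vault password file"
  env:
    ANSIBLE_VAULT_PASSWORD: $(ANSIBLE_VAULT_PASSWORD)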
```
### Check mode run (nonprod only, if enabled)
```yaml
- script: |
    VAULT_ARGS=""
    [ -f /tmp/vault_pass ] && VAULT_ARGS="--vault-password-file /tmp/vault_pass"
    ansible-playbook <playbook-path> \
      -i <inventory> \
      --check \
      --diff \
      --private-key /tmp/ansible_ssh_key \
      $VAULT_ARGS
  displayName: "Dry run — check mode"
```
### Real run
```yaml
- script: |
    VAULT_ARGS=""
    [ -f /tmp/vault_pass ] && VAULT_ARGS="--vault-password-file /tmp/vault_pass"
    ansible-playbook <playbook-path> \
      -i <inventory> \
      --diff \
      --private-key /tmp/ansible_ssh_key \
      $VAULT_ARGS
  displayName: "Apply playbook"
```
### Cleanup (always at end of deploy steps — condition: always())
```yaml
- script: rm -f /tmp/ansible_ssh_key
  displayName: "Clean up SSH key"
  condition: always()
- script: rm -f /tmp/vault_pass
  displayName: "Clean up vault password file"
  condition: always()
```
## Hard rules for Ansible
- Always pin the Ansible version with a quoted pip specifier such as `"ansible==<ansible-version>"`. Never install `latest`, and keep the quotes, because an unquoted `==` specifier may fail in some shells
- Always clean up SSH key and vault password files with `condition: always()` — they must be removed even if the playbook fails
- Always include `--diff` on real runs so changes are visible in pipeline logs
- SSH key file permissions must be `600`; OpenSSH, which Ansible uses for connections, rejects private keys with broader permissions
- Use shell variable expansion (`VAULT_ARGS=""`) rather than subshell substitution in the step script to avoid bash syntax issues in ADO agents
- For dynamic inventory, AWS credentials come from the OIDC service connection environment — same pattern as Lambda
- `requirements.yml` must exist in the repo if galaxy install step is included; if uncertain, wrap with `[ -f requirements.yml ] && ansible-galaxy install -r requirements.yml || true`
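
The cleanup rule above can also be sketched inside a single script with a `trap`, as a complement to the separate `condition: always()` steps. This is a minimal illustration, not part of the skill: `run_with_secret_file` and `record_and_fail` are hypothetical names, and the failing stand-in command takes the place of `ansible-playbook`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a secret to a private temp file, run a command
# with the file path appended as its last argument, and always delete the file.
run_with_secret_file() {
  local secret="$1"; shift
  (
    # Subshell so the EXIT trap fires when this helper finishes,
    # whether the wrapped command succeeded or failed.
    umask 077                                  # new files are private to the owner
    secret_file="$(mktemp)" || exit 1
    trap 'rm -f -- "$secret_file"' EXIT
    printf '%s\n' "$secret" > "$secret_file"
    "$@" "$secret_file"                        # the command receives the path last
  )
}

# Stand-in for the real command: records the path it was given, then fails.
PATH_LOG="$(mktemp)"
record_and_fail() { printf '%s' "$1" > "$PATH_LOG"; return 1; }

run_with_secret_file "dummy-key-material" record_and_fail \
  || echo "command failed, secret file already removed"
```

The subshell keeps the trap local to the helper, so the secret file disappears as soon as the wrapped command returns instead of lingering until the whole script exits.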

@ -0,0 +1,160 @@
---
name: azure-pipeline-docker
description: Extends azure-devops-pipeline for Docker image builds and pushes. Handles buildx with layer caching, Trivy scanning, ECR and ACR login, and a git-SHA/tag tagging strategy. Always load azure-devops-pipeline first.
---
## What I add
Type-specific steps for Docker image pipelines. Merge these into the skeleton from `azure-devops-pipeline`.
## Additional required inputs — ask the user
1. **Registry type**: `ECR` | `ACR`
2. **Registry URL**: e.g. `123456789.dkr.ecr.us-east-1.amazonaws.com` or `myregistry.azurecr.io`
3. **Image repository name**: e.g. `myapp/api`
4. **Dockerfile path**: default `./Dockerfile`
5. **AWS region**: required if ECR
6. **AWS service connection name**: required if ECR
7. **ACR service connection name**: required if ACR
## Lint stage steps
```yaml
- script: |
    docker run --rm -i hadolint/hadolint < <dockerfile-path>
  displayName: "Lint — hadolint Dockerfile"
```
## Security scan stage steps
The security scan builds the image locally and runs Trivy against it **before** pushing. This ensures vulnerabilities are caught pre-push.
```yaml
- script: |
    # buildx with --load keeps the image in the local Docker engine for scanning.
    docker buildx build --load \
      -t scan-target:$(Build.SourceVersion) \
      -f <dockerfile-path> \
      .
    # Mount the working directory so the JSON report lands on the agent,
    # not inside the throwaway Trivy container.
    docker run --rm \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v "$PWD":/work \
      aquasec/trivy:latest image \
      --exit-code 1 \
      --severity HIGH,CRITICAL \
      --format json \
      --output /work/trivy-results.json \
      scan-target:$(Build.SourceVersion)
  displayName: "Security scan — Trivy"
- task: PublishBuildArtifacts@1
  inputs:
    pathToPublish: trivy-results.json
    artifactName: security-scan
  condition: always()
  displayName: "Publish Trivy results"
```
Note: `condition: always()` on the publish step ensures results are available even when Trivy exits 1. The `--exit-code 1` on the scan step itself still fails the pipeline on HIGH/CRITICAL findings.
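The interplay between a failing scan and a report that is still published can be sketched in plain shell. This is an illustration under assumptions, not part of the skill: `fake_scanner` stands in for Trivy, and `scan_and_keep_report` is a hypothetical helper that records the scanner's exit code, leaves the report on disk, and only then propagates the failure.

```bash
#!/usr/bin/env bash
# Stand-in for a scanner: writes a report, then exits with the given code
# (1 means findings at or above the severity threshold).
fake_scanner() {
  local report="$1" findings_exit="$2"
  printf '{"Results": []}\n' > "$report"
  return "$findings_exit"
}

# Hypothetical helper: run the scan, keep the report, propagate the exit code.
scan_and_keep_report() {
  local report="$1" findings_exit="$2" rc=0
  fake_scanner "$report" "$findings_exit" || rc=$?
  if [ -s "$report" ]; then
    echo "report kept at $report (scanner exit code: $rc)"
  fi
  return "$rc"
}

report="$(mktemp)"
scan_and_keep_report "$report" 1 \
  || echo "pipeline would fail here, report is still available"
```

Capturing the exit code with `|| rc=$?` is what lets the helper finish its bookkeeping before failing, which mirrors what `condition: always()` does at the step level.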
## Build stage steps
### Step order — always emit in this order
1. Registry login
2. docker buildx build + push
### Registry login — ECR
```yaml
- script: |
    aws ecr get-login-password --region <aws-region> | \
      docker login --username AWS --password-stdin <registry-url>
  displayName: "Login — ECR"
  env:
    AWS_DEFAULT_REGION: <aws-region>
  # Wire the OIDC service connection at the job level, not inside the script step.
  # In the job or deployment job that contains this step, set:
  #
  #   job: Build
  #   pool: EKS-Pool
  #   container: {}   # omit if not containerised
  #   services:
  #     ...
  #
  # For OIDC federation, the AWSCLI task approach is preferred.
  # Alternatively, wrap with AWSShellScript@1:
  #
  #   - task: AWSShellScript@1
  #     inputs:
  #       awsCredentials: <aws-service-connection-name>
  #       regionName: <aws-region>
  #       scriptType: inline
  #       inlineScript: |
  #         aws ecr get-login-password --region <aws-region> | \
  #           docker login --username AWS --password-stdin <registry-url>
  #     displayName: "Login — ECR (via service connection)"
```
AWS credentials come from the OIDC service connection configured on the job — do not add any `AWS_ACCESS_KEY_ID` or `AWS_SECRET_ACCESS_KEY` env vars.
### Registry login — ACR
```yaml
- task: Docker@2
  inputs:
    command: login
    containerRegistry: <acr-service-connection-name>
  displayName: "Login — ACR"
```
### Build and push — nonprod
```yaml
- script: |
    docker buildx create --use --name pipeline-builder 2>/dev/null || \
      docker buildx use pipeline-builder
    docker buildx build \
      --cache-from type=registry,ref=<registry-url>/<image-repo>:cache \
      --cache-to type=registry,ref=<registry-url>/<image-repo>:cache,mode=max \
      --tag <registry-url>/<image-repo>:$(Build.SourceVersion) \
      --tag <registry-url>/<image-repo>:latest \
      --file <dockerfile-path> \
      --push \
      .
  displayName: "Build and push — nonprod"
```
### Build and push — prod
```yaml
- script: |
    docker buildx create --use --name pipeline-builder 2>/dev/null || \
      docker buildx use pipeline-builder
    docker buildx build \
      --cache-from type=registry,ref=<registry-url>/<image-repo>:cache \
      --cache-to type=registry,ref=<registry-url>/<image-repo>:cache,mode=max \
      --tag <registry-url>/<image-repo>:$(Build.SourceBranchName) \
      --tag <registry-url>/<image-repo>:$(Build.SourceVersion) \
      --file <dockerfile-path> \
      --push \
      .
  displayName: "Build and push — prod"
```
## Tagging strategy
| Tier | Tags applied |
|---------|---------------------------------------------------|
| Nonprod | `<git-sha>`, `latest` |
| Prod | `<git-tag / branch-name>`, `<git-sha>` |
Never tag prod images as `latest`.
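The tagging table can be expressed as a small shell helper, sketched here under assumptions: `image_tags` and its argument order are hypothetical, and the tier names simply mirror the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one tag per line for the given tier.
#   $1 = tier (nonprod | prod), $2 = git SHA, $3 = git tag or branch name (prod only)
image_tags() {
  local tier="$1" sha="$2" ref="${3:-}"
  case "$tier" in
    nonprod)
      printf '%s\n' "$sha" "latest"
      ;;
    prod)
      if [ -z "$ref" ]; then
        echo "ERROR: prod builds need a git tag or branch name" >&2
        return 1
      fi
      printf '%s\n' "$ref" "$sha"      # prod never receives "latest"
      ;;
    *)
      echo "ERROR: unknown tier: $tier" >&2
      return 1
      ;;
  esac
}

image_tags nonprod abc1234
image_tags prod abc1234 v1.4.0
```

Keeping the rule in one function means the prod branch has no code path that can emit `latest`, rather than relying on each pipeline author to remember it.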
## Hard rules for Docker
- Always use `docker buildx` — never plain `docker build`
- Trivy scan must run before push — the scan in SecurityScan stage uses a locally built image, not a registry pull
- `--exit-code 1` on Trivy is non-negotiable — HIGH and CRITICAL findings must fail the pipeline
- Never tag prod images as `latest` — prod tags use `$(Build.SourceBranchName)` and `$(Build.SourceVersion)` only
- Build args containing secrets must come from ADO variables injected via `env:` — never hardcoded in YAML
- Registry layer cache lives in the registry itself (not ADO pipeline cache) for reproducibility across EKS-Pool agents
- ECR login uses OIDC credentials only — never hardcode `AWS_ACCESS_KEY_ID` or `AWS_SECRET_ACCESS_KEY`
- The `docker buildx create --use ... || docker buildx use ...` pattern is required to handle re-use across runs without error

@ -0,0 +1,158 @@
---
name: azure-pipeline-lambda
description: Extends azure-devops-pipeline for AWS Lambda deployments. Handles zip and container packaging, OIDC credentials, function update and alias promotion. Always load azure-devops-pipeline first.
---
## What I add
Type-specific steps for AWS Lambda pipelines. Merge these into the skeleton from `azure-devops-pipeline`.
## Additional required inputs — ask the user
1. **Function name**: the Lambda function name in AWS
2. **AWS region**: e.g. `us-east-1`
3. **AWS service connection name**: the ADO AWS OIDC service connection name
4. **Packaging method**: `zip` | `container`
5. **Deployment method**: `aws-cli` | `SAM` | `CDK`
6. **Runtime**: `python3.x` | `nodejs20.x` | other (for linting tool selection)
7. **Alias to update**: e.g. `nonprod` or `prod` (matches target tier)
## Lint stage steps
### Python runtime
```yaml
- script: pip install pylint && pylint src/ --fail-under=7
  displayName: "Lint — pylint"
- script: |
    pip install cfn-lint
    cfn-lint template.yaml 2>/dev/null || true
  displayName: "Lint — cfn-lint (CloudFormation, if present)"
  continueOnError: true
```
### Node runtime
```yaml
- script: npm ci && npx eslint src/
  displayName: "Lint — eslint"
```
## Security scan stage steps
### Python runtime
```yaml
- script: |
    pip install pip-audit
    # --format selects the report format; --output names the report file.
    pip-audit -r requirements.txt --format json --output pip-audit-results.json
  displayName: "Security scan — pip-audit"
- task: PublishBuildArtifacts@1
  inputs:
    pathToPublish: pip-audit-results.json
    artifactName: security-scan
  displayName: "Publish scan results"
```
### Node runtime
```yaml
- script: |
    npm audit --json > npm-audit-results.json || true
    npm audit --audit-level=high
  displayName: "Security scan — npm audit"
- task: PublishBuildArtifacts@1
  inputs:
    pathToPublish: npm-audit-results.json
    artifactName: security-scan
  displayName: "Publish scan results"
```
## Build stage steps (zip packaging)
```yaml
- script: |
    mkdir -p package
    # Python: install deps into package dir
    pip install -r requirements.txt -t ./package
    # Copy handler (adjust filename as needed)
    cp *.py ./package/
    # Remove dev/test artifacts
    find ./package -name "*.pyc" -delete
    find ./package -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null || true
    find ./package -name "*.dist-info" -type d -exec rm -rf {} + 2>/dev/null || true
    cd package && zip -r ../$(Build.BuildNumber).zip .
  displayName: "Package Lambda — zip (Python)"
- task: PublishBuildArtifacts@1
  inputs:
    pathToPublish: $(Build.BuildNumber).zip
    artifactName: lambda-package
  displayName: "Publish Lambda artifact"
```
For Node runtime, replace the pip install/cp lines with:
```yaml
- script: |
    npm ci --omit=dev
    zip -r $(Build.BuildNumber).zip . \
      --exclude "*.git*" \
      --exclude "*node_modules/.cache*" \
      --exclude "*test*" \
      --exclude "*.spec.*" \
      --exclude "*.test.*"
  displayName: "Package Lambda — zip (Node)"
```
## Build stage steps (container packaging)
Use the full `azure-pipeline-docker` steps for the container build. Reference the resulting image URI in the Lambda deploy step by passing `--image-uri` instead of `--zip-file`.
## Deploy stage steps (aws-cli method)
```yaml
- task: AWSCLI@1
  inputs:
    awsCredentials: <aws-service-connection-name>
    regionName: <aws-region>
    awsCommand: lambda
    awsSubCommand: update-function-code
    awsArguments: >-
      --function-name <function-name>
      --zip-file fileb://$(Pipeline.Workspace)/lambda-package/$(Build.BuildNumber).zip
  displayName: "Deploy — update function code"
- task: AWSCLI@1
  inputs:
    awsCredentials: <aws-service-connection-name>
    regionName: <aws-region>
    awsCommand: lambda
    awsSubCommand: wait
    awsArguments: function-updated --function-name <function-name>
  displayName: "Deploy — wait for update"
- task: AWSCLI@1
  inputs:
    awsCredentials: <aws-service-connection-name>
    regionName: <aws-region>
    awsCommand: lambda
    awsSubCommand: publish-version
    awsArguments: --function-name <function-name>
  displayName: "Deploy — publish version"
- script: |
    VERSION=$(aws lambda list-versions-by-function \
      --function-name <function-name> \
      --query "Versions[-1].Version" \
      --output text)
    aws lambda update-alias \
      --function-name <function-name> \
      --name <alias-name> \
      --function-version "$VERSION"
  displayName: "Deploy — update alias"
  env:
    AWS_DEFAULT_REGION: <aws-region>
```
## Hard rules for Lambda
- Always use OIDC service connection — never hardcode `AWS_ACCESS_KEY_ID` or `AWS_SECRET_ACCESS_KEY` in the pipeline YAML
- Always wait for `function-updated` before publishing version — skipping this causes race conditions
- Always update alias after publishing version — direct function invocation without alias is not acceptable
- Zip packaging: always exclude `.git`, `__pycache__`, `*.pyc`, `node_modules/.cache`, test files
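- Multi-line `awsArguments` in the AWSCLI task should use the folded block scalar with strip chomping (`>-`) rather than plain `>`: both fold the lines into one, but `>` keeps a trailing newline in the argument string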
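
The zip exclusion rule above can be sketched as a path filter. This is only an illustration: `is_excluded` is a hypothetical helper, and the exact patterns treated as "test files" are an assumption rather than something the skill specifies.

```bash
#!/usr/bin/env bash
# Hypothetical filter: succeed (return 0) when a relative path should be
# left out of the Lambda zip, following the exclusion rule above.
is_excluded() {
  case "$1" in
    .git|.git/*|*/.git/*)           return 0 ;;
    __pycache__/*|*/__pycache__/*)  return 0 ;;
    *.pyc)                          return 0 ;;
    node_modules/.cache/*)          return 0 ;;
    # Assumed test-file conventions: tests/ directories, test_* files, *.test.* and *.spec.*
    tests/*|*/tests/*|test_*|*/test_*|*.test.*|*.spec.*) return 0 ;;
    *)                              return 1 ;;
  esac
}

for path in handler.py .git/config lib/__pycache__/x.pyc tests/test_handler.py src/util.spec.js; do
  if is_excluded "$path"; then echo "skip  $path"; else echo "keep  $path"; fi
done
```

In `case` patterns `*` also matches `/`, so `*.pyc` covers compiled files at any depth without a separate rule per directory.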

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

@ -0,0 +1,598 @@
---
name: backend-patterns
description: Backend architecture patterns, API design, database optimization, and server-side best practices for Node.js, Express, and Next.js API routes.
origin: ECC
---
# Backend Development Patterns
Backend architecture patterns and best practices for scalable server-side applications.
## When to Activate
- Designing REST or GraphQL API endpoints
- Implementing repository, service, or controller layers
- Optimizing database queries (N+1, indexing, connection pooling)
- Adding caching (Redis, in-memory, HTTP cache headers)
- Setting up background jobs or async processing
- Structuring error handling and validation for APIs
- Building middleware (auth, logging, rate limiting)
## API Design Patterns
### RESTful API Structure
```typescript
// ✅ Resource-based URLs
GET /api/markets # List resources
GET /api/markets/:id # Get single resource
POST /api/markets # Create resource
PUT /api/markets/:id # Replace resource
PATCH /api/markets/:id # Update resource
DELETE /api/markets/:id # Delete resource
// ✅ Query parameters for filtering, sorting, pagination
GET /api/markets?status=active&sort=volume&limit=20&offset=0
```
### Repository Pattern
```typescript
// Abstract data access logic
interface MarketRepository {
  findAll(filters?: MarketFilters): Promise<Market[]>
  findById(id: string): Promise<Market | null>
  findByIds(ids: string[]): Promise<Market[]> // batch lookup used by the service layer
  create(data: CreateMarketDto): Promise<Market>
  update(id: string, data: UpdateMarketDto): Promise<Market>
  delete(id: string): Promise<void>
}
class SupabaseMarketRepository implements MarketRepository {
async findAll(filters?: MarketFilters): Promise<Market[]> {
let query = supabase.from('markets').select('*')
if (filters?.status) {
query = query.eq('status', filters.status)
}
if (filters?.limit) {
query = query.limit(filters.limit)
}
const { data, error } = await query
if (error) throw new Error(error.message)
return data
}
// Other methods...
}
```
### Service Layer Pattern
```typescript
// Business logic separated from data access
class MarketService {
  constructor(private marketRepo: MarketRepository) {}

  async searchMarkets(query: string, limit: number = 10): Promise<Market[]> {
    // Business logic
    const embedding = await generateEmbedding(query)
    const results = await this.vectorSearch(embedding, limit)
    // Fetch full data
    const markets = await this.marketRepo.findByIds(results.map(r => r.id))
    // Sort by similarity, most similar first (assumes a higher score means a closer match)
    return markets.sort((a, b) => {
      const scoreA = results.find(r => r.id === a.id)?.score || 0
      const scoreB = results.find(r => r.id === b.id)?.score || 0
      return scoreB - scoreA
    })
  }

  private async vectorSearch(
    embedding: number[],
    limit: number
  ): Promise<{ id: string; score: number }[]> {
    // Vector search implementation (placeholder result)
    return []
  }
}
```
### Middleware Pattern
```typescript
// Request/response processing pipeline
export function withAuth(handler: NextApiHandler): NextApiHandler {
return async (req, res) => {
const token = req.headers.authorization?.replace('Bearer ', '')
if (!token) {
return res.status(401).json({ error: 'Unauthorized' })
}
try {
const user = await verifyToken(token)
req.user = user
return handler(req, res)
} catch (error) {
return res.status(401).json({ error: 'Invalid token' })
}
}
}
// Usage
export default withAuth(async (req, res) => {
// Handler has access to req.user
})
```
## Database Patterns
### Query Optimization
```typescript
// ✅ GOOD: Select only needed columns
const { data } = await supabase
.from('markets')
.select('id, name, status, volume')
.eq('status', 'active')
.order('volume', { ascending: false })
.limit(10)
// ❌ BAD: Select everything
const { data } = await supabase
.from('markets')
.select('*')
```
### N+1 Query Prevention
```typescript
// ❌ BAD: N+1 query problem
const markets = await getMarkets()
for (const market of markets) {
market.creator = await getUser(market.creator_id) // N queries
}
// ✅ GOOD: Batch fetch
const markets = await getMarkets()
const creatorIds = markets.map(m => m.creator_id)
const creators = await getUsers(creatorIds) // 1 query
const creatorMap = new Map(creators.map(c => [c.id, c]))
markets.forEach(market => {
market.creator = creatorMap.get(market.creator_id)
})
```
### Transaction Pattern
```typescript
async function createMarketWithPosition(
marketData: CreateMarketDto,
positionData: CreatePositionDto
) {
// Use Supabase transaction
const { data, error } = await supabase.rpc('create_market_with_position', {
market_data: marketData,
position_data: positionData
})
if (error) throw new Error('Transaction failed')
return data
}
// SQL function in Supabase
CREATE OR REPLACE FUNCTION create_market_with_position(
market_data jsonb,
position_data jsonb
)
RETURNS jsonb
LANGUAGE plpgsql
AS $$
BEGIN
-- Start transaction automatically
INSERT INTO markets VALUES (market_data);
INSERT INTO positions VALUES (position_data);
RETURN jsonb_build_object('success', true);
EXCEPTION
WHEN OTHERS THEN
-- Rollback happens automatically
RETURN jsonb_build_object('success', false, 'error', SQLERRM);
END;
$$;
```
## Caching Strategies
### Redis Caching Layer
```typescript
class CachedMarketRepository implements MarketRepository {
constructor(
private baseRepo: MarketRepository,
private redis: RedisClient
) {}
async findById(id: string): Promise<Market | null> {
// Check cache first
const cached = await this.redis.get(`market:${id}`)
if (cached) {
return JSON.parse(cached)
}
// Cache miss - fetch from database
const market = await this.baseRepo.findById(id)
if (market) {
// Cache for 5 minutes
await this.redis.setex(`market:${id}`, 300, JSON.stringify(market))
}
return market
}
async invalidateCache(id: string): Promise<void> {
await this.redis.del(`market:${id}`)
}
}
```
### Cache-Aside Pattern
```typescript
async function getMarketWithCache(id: string): Promise<Market> {
const cacheKey = `market:${id}`
// Try cache
const cached = await redis.get(cacheKey)
if (cached) return JSON.parse(cached)
// Cache miss - fetch from DB
const market = await db.markets.findUnique({ where: { id } })
if (!market) throw new Error('Market not found')
// Update cache
await redis.setex(cacheKey, 300, JSON.stringify(market))
return market
}
```
## Error Handling Patterns
### Centralized Error Handler
```typescript
import { NextResponse } from 'next/server'
import { z } from 'zod'
class ApiError extends Error {
constructor(
public statusCode: number,
public message: string,
public isOperational = true
) {
super(message)
Object.setPrototypeOf(this, ApiError.prototype)
}
}
export function errorHandler(error: unknown, req: Request): Response {
if (error instanceof ApiError) {
return NextResponse.json({
success: false,
error: error.message
}, { status: error.statusCode })
}
if (error instanceof z.ZodError) {
return NextResponse.json({
success: false,
error: 'Validation failed',
details: error.errors
}, { status: 400 })
}
// Log unexpected errors
console.error('Unexpected error:', error)
return NextResponse.json({
success: false,
error: 'Internal server error'
}, { status: 500 })
}
// Usage
export async function GET(request: Request) {
try {
const data = await fetchData()
return NextResponse.json({ success: true, data })
} catch (error) {
return errorHandler(error, request)
}
}
```
### Retry with Exponential Backoff
```typescript
async function fetchWithRetry<T>(
fn: () => Promise<T>,
maxRetries = 3
): Promise<T> {
let lastError: Error
for (let i = 0; i < maxRetries; i++) {
try {
return await fn()
} catch (error) {
lastError = error as Error
if (i < maxRetries - 1) {
// Exponential backoff between attempts: 1s, then 2s (no wait after the final attempt)
const delay = Math.pow(2, i) * 1000
await new Promise(resolve => setTimeout(resolve, delay))
}
}
}
throw lastError!
}
// Usage
const data = await fetchWithRetry(() => fetchFromAPI())
```
## Authentication & Authorization
### JWT Token Validation
```typescript
import jwt from 'jsonwebtoken'
interface JWTPayload {
userId: string
email: string
role: 'admin' | 'user'
}
export function verifyToken(token: string): JWTPayload {
try {
const payload = jwt.verify(token, process.env.JWT_SECRET!) as JWTPayload
return payload
} catch (error) {
throw new ApiError(401, 'Invalid token')
}
}
export async function requireAuth(request: Request) {
const token = request.headers.get('authorization')?.replace('Bearer ', '')
if (!token) {
throw new ApiError(401, 'Missing authorization token')
}
return verifyToken(token)
}
// Usage in API route
export async function GET(request: Request) {
const user = await requireAuth(request)
const data = await getDataForUser(user.userId)
return NextResponse.json({ success: true, data })
}
```
### Role-Based Access Control
```typescript
type Permission = 'read' | 'write' | 'delete' | 'admin'
interface User {
id: string
role: 'admin' | 'moderator' | 'user'
}
const rolePermissions: Record<User['role'], Permission[]> = {
admin: ['read', 'write', 'delete', 'admin'],
moderator: ['read', 'write', 'delete'],
user: ['read', 'write']
}
export function hasPermission(user: User, permission: Permission): boolean {
return rolePermissions[user.role].includes(permission)
}
export function requirePermission(permission: Permission) {
return (handler: (request: Request, user: User) => Promise<Response>) => {
return async (request: Request) => {
const user = await requireAuth(request)
if (!hasPermission(user, permission)) {
throw new ApiError(403, 'Insufficient permissions')
}
return handler(request, user)
}
}
}
// Usage - HOF wraps the handler
export const DELETE = requirePermission('delete')(
async (request: Request, user: User) => {
// Handler receives authenticated user with verified permission
return new Response('Deleted', { status: 200 })
}
)
```
## Rate Limiting
### Simple In-Memory Rate Limiter
```typescript
class RateLimiter {
private requests = new Map<string, number[]>()
async checkLimit(
identifier: string,
maxRequests: number,
windowMs: number
): Promise<boolean> {
const now = Date.now()
const requests = this.requests.get(identifier) || []
// Remove old requests outside window
const recentRequests = requests.filter(time => now - time < windowMs)
if (recentRequests.length >= maxRequests) {
return false // Rate limit exceeded
}
// Add current request
recentRequests.push(now)
this.requests.set(identifier, recentRequests)
return true
}
}
const limiter = new RateLimiter()
export async function GET(request: Request) {
const ip = request.headers.get('x-forwarded-for') || 'unknown'
const allowed = await limiter.checkLimit(ip, 100, 60000) // 100 req/min
if (!allowed) {
return NextResponse.json({
error: 'Rate limit exceeded'
}, { status: 429 })
}
// Continue with request
}
```
## Background Jobs & Queues
### Simple Queue Pattern
```typescript
class JobQueue<T> {
private queue: T[] = []
private processing = false
async add(job: T): Promise<void> {
this.queue.push(job)
if (!this.processing) {
this.process()
}
}
private async process(): Promise<void> {
this.processing = true
while (this.queue.length > 0) {
const job = this.queue.shift()!
try {
await this.execute(job)
} catch (error) {
console.error('Job failed:', error)
}
}
this.processing = false
}
private async execute(job: T): Promise<void> {
// Job execution logic
}
}
// Usage for indexing markets
interface IndexJob {
marketId: string
}
const indexQueue = new JobQueue<IndexJob>()
export async function POST(request: Request) {
const { marketId } = await request.json()
// Add to queue instead of blocking
await indexQueue.add({ marketId })
return NextResponse.json({ success: true, message: 'Job queued' })
}
```
## Logging & Monitoring
### Structured Logging
```typescript
interface LogContext {
userId?: string
requestId?: string
method?: string
path?: string
[key: string]: unknown
}
class Logger {
log(level: 'info' | 'warn' | 'error', message: string, context?: LogContext) {
const entry = {
timestamp: new Date().toISOString(),
level,
message,
...context
}
console.log(JSON.stringify(entry))
}
info(message: string, context?: LogContext) {
this.log('info', message, context)
}
warn(message: string, context?: LogContext) {
this.log('warn', message, context)
}
error(message: string, error: Error, context?: LogContext) {
this.log('error', message, {
...context,
error: error.message,
stack: error.stack
})
}
}
const logger = new Logger()
// Usage
export async function GET(request: Request) {
const requestId = crypto.randomUUID()
logger.info('Fetching markets', {
requestId,
method: 'GET',
path: '/api/markets'
})
try {
const markets = await fetchMarkets()
return NextResponse.json({ success: true, data: markets })
} catch (error) {
logger.error('Failed to fetch markets', error as Error, { requestId })
return NextResponse.json({ error: 'Internal error' }, { status: 500 })
}
}
```
**Remember**: Backend patterns enable scalable, maintainable server-side applications. Choose patterns that fit your complexity level.

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

@ -0,0 +1,46 @@
---
name: bash-defensive-patterns
description: "Master defensive Bash programming techniques for production-grade scripts. Use when writing robust shell scripts, CI/CD pipelines, or system utilities requiring fault tolerance and safety."
risk: unknown
source: community
date_added: "2026-02-27"
---
# Bash Defensive Patterns
Comprehensive guidance for writing production-ready Bash scripts using defensive programming techniques, error handling, and safety best practices to prevent common pitfalls and ensure reliability.
## Use this skill when
- Writing production automation scripts
- Building CI/CD pipeline scripts
- Creating system administration utilities
- Developing error-resilient deployment automation
- Writing scripts that must handle edge cases safely
- Building maintainable shell script libraries
- Implementing comprehensive logging and monitoring
- Creating scripts that must work across different platforms
## Do not use this skill when
- You need a single ad-hoc shell command, not a script
- The target environment requires strict POSIX sh only
- The task is unrelated to shell scripting or automation
## Instructions
1. Confirm the target shell, OS, and execution environment.
2. Enable strict mode and safe defaults from the start.
3. Validate inputs, quote variables, and handle files safely.
4. Add logging, error traps, and basic tests.
## Safety
- Avoid destructive commands without confirmation or dry-run flags.
- Do not run scripts as root unless strictly required.
Refer to `resources/implementation-playbook.md` for detailed patterns, checklists, and templates.
## Resources
- `resources/implementation-playbook.md` for detailed patterns, checklists, and templates.

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

@ -0,0 +1,517 @@
# Bash Defensive Patterns Implementation Playbook
This file contains detailed patterns, checklists, and code samples referenced by the skill.
## Core Defensive Principles
### 1. Strict Mode
Enable bash strict mode at the start of every script to catch errors early.
```bash
#!/bin/bash
set -Eeuo pipefail # Exit on error, unset variables, pipe failures
```
**Key flags:**
- `set -E`: Inherit ERR trap in functions
- `set -e`: Exit on any error (command returns non-zero)
- `set -u`: Exit on undefined variable reference
- `set -o pipefail`: Pipe fails if any command fails (not just last)
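
A small experiment makes the effect of `pipefail` concrete. The sketch below is illustrative only: it runs the same failing pipeline in two subshells, once without and once with `set -o pipefail`, and records the exit status each time.

```bash
#!/usr/bin/env bash
# A pipeline whose first command fails and whose last command succeeds.
# Without pipefail the pipeline reports the status of the last command.
status_without=0
( false | cat ) || status_without=$?

# With pipefail the failure of `false` becomes the pipeline's status.
status_with=0
( set -o pipefail; false | cat ) || status_with=$?

echo "without pipefail: $status_without"
echo "with pipefail:    $status_with"
```

Run under Bash, the first status is 0 and the second is 1, which is why a failing `curl` or `grep` early in a pipeline goes unnoticed unless `pipefail` is enabled.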
### 2. Error Trapping and Cleanup
Implement proper cleanup on script exit or error.
```bash
#!/bin/bash
set -Eeuo pipefail
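# Create the work directory before registering the EXIT trap so the trap never sees an unset variable.
# WORK_DIR is used instead of TMPDIR because mktemp and other tools read TMPDIR themselves.
WORK_DIR=$(mktemp -d)
trap 'echo "Error on line $LINENO" >&2' ERR
trap 'echo "Cleaning up..."; rm -rf -- "$WORK_DIR"' EXIT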
# Script code here
```
### 3. Variable Safety
Always quote variables to prevent word splitting and globbing issues.
```bash
# Wrong - unsafe
cp $source $dest
# Correct - safe
cp "$source" "$dest"
# Required variables - fail with message if unset
: "${REQUIRED_VAR:?REQUIRED_VAR is not set}"
```
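To see word splitting happen, it helps to count the arguments a command actually receives. The following sketch uses a hypothetical `count_args` helper and a file name containing a space; it assumes the default `IFS`.

```bash
#!/usr/bin/env bash
# Report how many arguments were passed.
count_args() { echo "$#"; }

name="quarterly report.txt"

# shellcheck disable=SC2086  # the unquoted expansion is the point of the demo
unquoted=$(count_args $name)     # split on the space into two arguments
quoted=$(count_args "$name")     # passed through as a single argument

echo "unquoted: $unquoted argument(s)"
echo "quoted:   $quoted argument(s)"
```

The unquoted call sees two arguments and the quoted call sees one, which is the difference between `cp` copying one file and `cp` failing on two nonexistent ones.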
### 4. Array Handling
Use arrays safely for complex data handling.
```bash
# Safe array iteration
declare -a items=("item 1" "item 2" "item 3")
for item in "${items[@]}"; do
echo "Processing: $item"
done
# Reading output into array safely
mapfile -t lines < <(some_command)
readarray -t numbers < <(seq 1 10)
```
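The reason to prefer `mapfile` over an unquoted command substitution shows up as soon as a line contains whitespace. This is a minimal sketch with a hypothetical `list_items` producer; it needs Bash 4 or newer, where `mapfile` is available.

```bash
#!/usr/bin/env bash
# Hypothetical producer: emits three lines, one of which contains a space.
list_items() { printf '%s\n' "alpha" "beta gamma" "delta"; }

# mapfile keeps each line as one element, spaces included.
mapfile -t safe < <(list_items)

# Unquoted command substitution splits on every run of whitespace.
# shellcheck disable=SC2207
unsafe=( $(list_items) )

echo "mapfile elements:  ${#safe[@]}"
echo "unquoted elements: ${#unsafe[@]}"
```

Here `safe` holds three elements while `unsafe` holds four, because "beta gamma" was broken in two.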
### 5. Conditional Safety
Use `[[ ]]` for Bash-specific features, `[ ]` for POSIX.
```bash
# Bash - safer
if [[ -f "$file" && -r "$file" ]]; then
content=$(<"$file")
fi
# POSIX - portable
if [ -f "$file" ] && [ -r "$file" ]; then
content=$(cat "$file")
fi
# Test for existence before operations
if [[ -z "${VAR:-}" ]]; then
echo "VAR is not set or is empty"
fi
```
## Fundamental Patterns
### Pattern 1: Safe Script Directory Detection
```bash
#!/bin/bash
set -Eeuo pipefail
# Correctly determine script directory
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)"
SCRIPT_NAME="$(basename -- "${BASH_SOURCE[0]}")"
echo "Script location: $SCRIPT_DIR/$SCRIPT_NAME"
```
### Pattern 2: Comprehensive Function Template
```bash
#!/bin/bash
set -Eeuo pipefail
# Prefix for functions: handle_*, process_*, check_*, validate_*
# Include documentation and error handling
validate_file() {
local -r file="$1"
local -r message="${2:-File not found: $file}"
if [[ ! -f "$file" ]]; then
echo "ERROR: $message" >&2
return 1
fi
return 0
}
process_files() {
local -r input_dir="$1"
local -r output_dir="$2"
# Validate inputs
[[ -d "$input_dir" ]] || { echo "ERROR: input_dir not a directory" >&2; return 1; }
# Create output directory if needed
mkdir -p "$output_dir" || { echo "ERROR: Cannot create output_dir" >&2; return 1; }
# Process files safely
while IFS= read -r -d '' file; do
echo "Processing: $file"
# Do work
done < <(find "$input_dir" -maxdepth 1 -type f -print0)
return 0
}
```
### Pattern 3: Safe Temporary File Handling
```bash
#!/bin/bash
set -Eeuo pipefail
# Create the temporary directory first, then register the trap that removes it.
# WORK_DIR is used instead of TMPDIR so the variable mktemp itself reads is left alone.
WORK_DIR=$(mktemp -d) || { echo "ERROR: Failed to create temp directory" >&2; exit 1; }
trap 'rm -rf -- "$WORK_DIR"' EXIT
# Create temporary files in directory
TMPFILE1="$WORK_DIR/temp1.txt"
TMPFILE2="$WORK_DIR/temp2.txt"
# Use temporary files
touch "$TMPFILE1" "$TMPFILE2"
echo "Temp files created in: $WORK_DIR"
```
### Pattern 4: Robust Argument Parsing
```bash
#!/bin/bash
set -Eeuo pipefail
# Default values
VERBOSE=false
DRY_RUN=false
OUTPUT_FILE=""
THREADS=4
usage() {
cat <<EOF
Usage: $0 [OPTIONS]
Options:
-v, --verbose Enable verbose output
-d, --dry-run Run without making changes
-o, --output FILE Output file path
-j, --jobs NUM Number of parallel jobs
-h, --help Show this help message
EOF
exit "${1:-0}"
}
# Parse arguments
while [[ $# -gt 0 ]]; do
case "$1" in
-v|--verbose)
VERBOSE=true
shift
;;
-d|--dry-run)
DRY_RUN=true
shift
;;
-o|--output)
OUTPUT_FILE="$2"
shift 2
;;
-j|--jobs)
THREADS="$2"
shift 2
;;
-h|--help)
usage 0
;;
--)
shift
break
;;
*)
echo "ERROR: Unknown option: $1" >&2
usage 1
;;
esac
done
# Validate required arguments
[[ -n "$OUTPUT_FILE" ]] || { echo "ERROR: -o/--output is required" >&2; usage 1; }
```
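One gap worth noting in the parser above: under `set -u`, an option such as `-o` given without a value fails with an unbound-variable error on `$2` instead of a usage message. A possible guard is sketched below; `require_value` and `parse_output` are hypothetical helpers, not part of the original pattern.

```bash
#!/usr/bin/env bash
set -Eeuo pipefail

# Hypothetical helper: succeed only when an option is followed by a value.
#   $1 = option name, $2 = number of remaining arguments (pass "$#")
require_value() {
  local -r opt="$1" remaining="$2"
  if (( remaining < 2 )); then
    echo "ERROR: $opt requires a value" >&2
    return 1
  fi
}

# Reduced parser that only understands -o/--output, to keep the sketch short.
parse_output() {
  local output=""
  while [[ $# -gt 0 ]]; do
    case "$1" in
      -o|--output)
        require_value "$1" "$#" || return 1
        output="$2"
        shift 2
        ;;
      *)
        echo "ERROR: Unknown option: $1" >&2
        return 1
        ;;
    esac
  done
  printf '%s\n' "$output"
}

parse_output -o result.txt
parse_output -o 2>/dev/null || echo "missing value detected"
```

Checking `$#` before touching `$2` turns a cryptic shell error into a message the caller can act on.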
### Pattern 5: Structured Logging
```bash
#!/bin/bash
set -Eeuo pipefail
# Logging functions
log_info() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] INFO: $*" >&2
}
log_warn() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] WARN: $*" >&2
}
log_error() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] ERROR: $*" >&2
}
log_debug() {
if [[ "${DEBUG:-0}" == "1" ]]; then
echo "[$(date +'%Y-%m-%d %H:%M:%S')] DEBUG: $*" >&2
fi
}
# Usage
log_info "Starting script"
log_debug "Debug information"
log_warn "Warning message"
log_error "Error occurred"
```
### Pattern 6: Process Orchestration with Signals
```bash
#!/bin/bash
set -Eeuo pipefail
# Track background processes
PIDS=()
cleanup() {
log_info "Shutting down..."
# Terminate all background processes
for pid in "${PIDS[@]}"; do
if kill -0 "$pid" 2>/dev/null; then
kill -TERM "$pid" 2>/dev/null || true
fi
done
# Wait for graceful shutdown
for pid in "${PIDS[@]}"; do
wait "$pid" 2>/dev/null || true
done
}
trap cleanup SIGTERM SIGINT
# Start background tasks
background_task &
PIDS+=($!)
another_task &
PIDS+=($!)
# Wait for all background processes
wait
```
### Pattern 7: Safe File Operations
```bash
#!/bin/bash
set -Eeuo pipefail
# Move a file, refusing to overwrite an existing destination
safe_move() {
local -r source="$1"
local -r dest="$2"
if [[ ! -e "$source" ]]; then
echo "ERROR: Source does not exist: $source" >&2
return 1
fi
if [[ -e "$dest" ]]; then
echo "ERROR: Destination already exists: $dest" >&2
return 1
fi
mv "$source" "$dest"
}
# Safe directory cleanup
safe_rmdir() {
local -r dir="$1"
if [[ ! -d "$dir" ]]; then
echo "ERROR: Not a directory: $dir" >&2
return 1
fi
# -I asks once before a recursive delete (supported by GNU and BSD rm, not required by POSIX)
rm -rI -- "$dir"
}
# Atomic file writes
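atomic_write() {
local -r target="$1"
# Not readonly: the value is assigned on the next line
local tmpfile
# Create the temp file next to the target so the rename stays on one filesystem
tmpfile=$(mktemp "${target}.XXXXXX") || return 1
# Write to temp file first
cat > "$tmpfile" || { rm -f -- "$tmpfile"; return 1; }
# Atomic rename
mv -- "$tmpfile" "$target"
}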
```
### Pattern 8: Idempotent Script Design
```bash
#!/bin/bash
set -Eeuo pipefail
# Check if resource already exists
ensure_directory() {
local -r dir="$1"
if [[ -d "$dir" ]]; then
log_info "Directory already exists: $dir"
return 0
fi
mkdir -p "$dir" || {
log_error "Failed to create directory: $dir"
return 1
}
log_info "Created directory: $dir"
}
# Ensure configuration state
ensure_config() {
local -r config_file="$1"
local -r default_value="$2"
if [[ ! -f "$config_file" ]]; then
echo "$default_value" > "$config_file"
log_info "Created config: $config_file"
fi
}
# Rerunning script multiple times should be safe
ensure_directory "/var/cache/myapp"
ensure_config "/etc/myapp/config" "DEBUG=false"
```
### Pattern 9: Safe Command Substitution
```bash
#!/bin/bash
set -Eeuo pipefail
# Use $() instead of backticks
name=$(<"$file") # Modern, safe variable assignment from file
output=$(command -v python3) # Get command location safely
# Handle command substitution with error checking
result=$(command -v node) || {
log_error "node command not found"
exit 1 # top-level code: `return` is only valid inside a function or sourced file
}
# For multiple lines
mapfile -t lines < <(grep "pattern" "$file")
# NUL-safe iteration
while IFS= read -r -d '' file; do
echo "Processing: $file"
done < <(find /path -type f -print0)
```
### Pattern 10: Dry-Run Support
```bash
#!/bin/bash
set -Eeuo pipefail
DRY_RUN="${DRY_RUN:-false}"
run_cmd() {
if [[ "$DRY_RUN" == "true" ]]; then
echo "[DRY RUN] Would execute: $*"
return 0
fi
"$@"
}
# Usage
run_cmd cp "$source" "$dest"
run_cmd rm "$file"
run_cmd chown "$owner" "$target"
```
## Advanced Defensive Techniques
### Named Parameters Pattern
```bash
#!/bin/bash
set -Eeuo pipefail
process_data() {
local input_file=""
local output_dir=""
local format="json"
# Parse named parameters
while [[ $# -gt 0 ]]; do
case "$1" in
--input=*)
input_file="${1#*=}"
;;
--output=*)
output_dir="${1#*=}"
;;
--format=*)
format="${1#*=}"
;;
*)
echo "ERROR: Unknown parameter: $1" >&2
return 1
;;
esac
shift
done
# Validate required parameters
[[ -n "$input_file" ]] || { echo "ERROR: --input is required" >&2; return 1; }
[[ -n "$output_dir" ]] || { echo "ERROR: --output is required" >&2; return 1; }
}
```
### Dependency Checking
```bash
#!/bin/bash
set -Eeuo pipefail
check_dependencies() {
local -a missing_deps=()
local -a required=("jq" "curl" "git")
for cmd in "${required[@]}"; do
if ! command -v "$cmd" &>/dev/null; then
missing_deps+=("$cmd")
fi
done
if [[ ${#missing_deps[@]} -gt 0 ]]; then
echo "ERROR: Missing required commands: ${missing_deps[*]}" >&2
return 1
fi
}
check_dependencies
```
## Best Practices Summary
1. **Always use strict mode** - `set -Eeuo pipefail`
2. **Quote all variables** - `"$variable"` prevents word splitting
3. **Use [[ ]] conditionals** - More robust than [ ]
4. **Implement error trapping** - Catch and handle errors gracefully
5. **Validate all inputs** - Check file existence, permissions, formats
6. **Use functions for reusability** - Prefix with meaningful names
7. **Implement structured logging** - Include timestamps and levels
8. **Support dry-run mode** - Allow users to preview changes
9. **Handle temporary files safely** - Use mktemp, cleanup with trap
10. **Design for idempotency** - Scripts should be safe to rerun
11. **Document requirements** - List dependencies and minimum versions
12. **Test error paths** - Ensure error handling works correctly
13. **Use `command -v`** - Safer than `which` for checking executables
14. **Prefer printf over echo** - More predictable across systems
## Resources
- **Bash Strict Mode**: http://redsymbol.net/articles/unofficial-bash-strict-mode/
- **Google Shell Style Guide**: https://google.github.io/styleguide/shellguide.html
- **Defensive BASH Programming**: https://www.lifepipe.net/


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

skills/bash-linux/SKILL.md

@ -0,0 +1,204 @@
---
name: bash-linux
description: "Bash/Linux terminal patterns. Critical commands, piping, error handling, scripting. Use when working on macOS or Linux systems."
risk: unknown
source: community
date_added: "2026-02-27"
---
# Bash Linux Patterns
> Essential patterns for Bash on Linux/macOS.
---
## 1. Operator Syntax
### Chaining Commands
| Operator | Meaning | Example |
|----------|---------|---------|
| `;` | Run sequentially | `cmd1; cmd2` |
| `&&` | Run if previous succeeded | `npm install && npm run dev` |
| `\|\|` | Run if previous failed | `npm test \|\| echo "Tests failed"` |
| `\|` | Pipe output | `ls \| grep ".js"` |
---
## 2. File Operations
### Essential Commands
| Task | Command |
|------|---------|
| List all | `ls -la` |
| Find files | `find . -name "*.js" -type f` |
| File content | `cat file.txt` |
| First N lines | `head -n 20 file.txt` |
| Last N lines | `tail -n 20 file.txt` |
| Follow log | `tail -f log.txt` |
| Search in files | `grep -r "pattern" --include="*.js"` |
| File size | `du -sh *` |
| Disk usage | `df -h` |
---
## 3. Process Management
| Task | Command |
|------|---------|
| List processes | `ps aux` |
| Find by name | `ps aux \| grep node` |
| Kill by PID | `kill -9 <PID>` |
| Find port user | `lsof -i :3000` |
| Kill port | `kill -9 $(lsof -t -i :3000)` |
| Background | `npm run dev &` |
| Jobs | `jobs -l` |
| Bring to front | `fg %1` |
---
## 4. Text Processing
### Core Tools
| Tool | Purpose | Example |
|------|---------|---------|
| `grep` | Search | `grep -rn "TODO" src/` |
| `sed` | Replace | `sed -i 's/old/new/g' file.txt` (GNU; BSD/macOS needs `sed -i ''`) |
| `awk` | Extract columns | `awk '{print $1}' file.txt` |
| `cut` | Cut fields | `cut -d',' -f1 data.csv` |
| `sort` | Sort lines | `sort -u file.txt` |
| `uniq` | Unique lines | `sort file.txt \| uniq -c` |
| `wc` | Count | `wc -l file.txt` |
---
## 5. Environment Variables
| Task | Command |
|------|---------|
| View all | `env` or `printenv` |
| View one | `echo "$PATH"` |
| Set for this shell and its children | `export VAR="value"` |
| Set for one command only | `VAR="value" command` |
| Add to PATH | `export PATH="$PATH:/new/path"` |
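
The difference between a one-command assignment and an exported variable can be seen with a small sketch. The variable name `GREETING` and both function names are hypothetical, chosen only for this illustration:

```bash
#!/usr/bin/env bash
# show_greeting: print what a child process sees for GREETING.
show_greeting() {
  bash -c 'printf "%s\n" "${GREETING:-unset}"'
}

# scoped_greeting VALUE: set GREETING for a single child command only.
scoped_greeting() {
  GREETING="$1" bash -c 'printf "%s\n" "${GREETING:-unset}"'
}
```

Calling `scoped_greeting hello` hands the value to that one child process; a later `show_greeting` still reports `unset` unless `GREETING` was exported in the calling shell.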
---
## 6. Network
| Task | Command |
|------|---------|
| Download | `curl -O https://example.com/file` |
| API request | `curl -X GET https://api.example.com` |
| POST JSON | `curl -X POST -H "Content-Type: application/json" -d '{"key":"value"}' URL` |
| Check port | `nc -zv localhost 3000` |
| Network info | `ifconfig` or `ip addr` |
---
## 7. Script Template
```bash
#!/bin/bash
set -euo pipefail # Exit on error, undefined var, pipe fail
# Colors (optional)
RED='\033[0;31m'
GREEN='\033[0;32m'
NC='\033[0m'
# Script directory
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Functions
log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
log_error() { echo -e "${RED}[ERROR]${NC} $1" >&2; }
# Main
main() {
log_info "Starting..."
# Your logic here
log_info "Done!"
}
main "$@"
```
---
## 8. Common Patterns
### Check if command exists
```bash
if command -v node &> /dev/null; then
echo "Node is installed"
fi
```
### Default variable value
```bash
NAME=${1:-"default_value"}
```
### Read file line by line
```bash
while IFS= read -r line; do
echo "$line"
done < file.txt
```
### Loop over files
```bash
for file in *.js; do
echo "Processing $file"
done
```
---
## 9. Differences from PowerShell
| Task | PowerShell | Bash |
|------|------------|------|
| List files | `Get-ChildItem` | `ls -la` |
| Find files | `Get-ChildItem -Recurse` | `find . -type f` |
| Environment | `$env:VAR` | `$VAR` |
| String concat | `"$a$b"` | `"$a$b"` (same) |
| Null check | `if ($x)` | `if [ -n "$x" ]` |
| Pipeline | Object-based | Text-based |
---
## 10. Error Handling
### Set options
```bash
set -e # Exit on error
set -u # Exit on undefined variable
set -o pipefail # Exit on pipe failure
set -x # Debug: print commands
```
### Trap for cleanup
```bash
cleanup() {
echo "Cleaning up..."
rm -f /tmp/tempfile
}
trap cleanup EXIT
```
---
> **Remember:** Bash is text-based. Use `&&` for success chains, `set -e` for safety, and quote your variables!
## When to Use
Use this skill when carrying out the workflows or actions described in the overview above.

skills/bash-pro/README.md

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

skills/bash-pro/SKILL.md

@ -0,0 +1,315 @@
---
name: bash-pro
description: 'Master of defensive Bash scripting for production automation, CI/CD
pipelines, and system utilities. Expert in safe, portable, and testable shell
scripts.
'
risk: unknown
source: community
date_added: '2026-02-27'
---
## Use this skill when
- Writing or reviewing Bash scripts for automation, CI/CD, or ops
- Hardening shell scripts for safety and portability
## Do not use this skill when
- You need POSIX-only shell without Bash features
- The task requires a higher-level language for complex logic
- You need Windows-native scripting (PowerShell)
## Instructions
1. Define script inputs, outputs, and failure modes.
2. Apply strict mode and safe argument parsing.
3. Implement core logic with defensive patterns.
4. Add tests and linting with Bats and ShellCheck.
## Safety
- Treat input as untrusted; avoid eval and unsafe globbing.
- Prefer dry-run modes before destructive actions.
## Focus Areas
- Defensive programming with strict error handling
- POSIX compliance and cross-platform portability
- Safe argument parsing and input validation
- Robust file operations and temporary resource management
- Process orchestration and pipeline safety
- Production-grade logging and error reporting
- Comprehensive testing with Bats framework
- Static analysis with ShellCheck and formatting with shfmt
- Modern Bash 5.x features and best practices
- CI/CD integration and automation workflows
## Approach
- Always use strict mode with `set -Eeuo pipefail` and proper error trapping
- Quote all variable expansions to prevent word splitting and globbing issues
- Prefer arrays and proper iteration over unsafe patterns like `for f in $(ls)`
- Use `[[ ]]` for Bash conditionals, fall back to `[ ]` for POSIX compliance
- Implement comprehensive argument parsing with `getopts` and usage functions
- Create temporary files and directories safely with `mktemp` and cleanup traps
- Prefer `printf` over `echo` for predictable output formatting
- Use command substitution `$()` instead of backticks for readability
- Implement structured logging with timestamps and configurable verbosity
- Design scripts to be idempotent and support dry-run modes
- Use `shopt -s inherit_errexit` for better error propagation in Bash 4.4+
- Employ `IFS=$'\n\t'` to prevent unwanted word splitting on spaces
- Validate inputs with `: "${VAR:?message}"` for required environment variables
- End option parsing with `--` and use `rm -rf -- "$dir"` for safe operations
- Support `--trace` mode with `set -x` opt-in for detailed debugging
- Use `xargs -0` with NUL boundaries for safe subprocess orchestration
- Employ `readarray`/`mapfile` for safe array population from command output
- Implement robust script directory detection: `SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)"`
- Use NUL-safe patterns: `find -print0 | while IFS= read -r -d '' file; do ...; done`
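
As a rough illustration of the `getopts` and `--` points above, here is a minimal sketch. The option set (`-v`, `-o FILE`) and the global names `VERBOSE`, `OUTPUT_FILE`, and `REMAINING_ARGS` are assumptions made for the example, not part of any existing script:

```bash
#!/usr/bin/env bash
# parse_opts ARGS...: parse -v and -o FILE, leaving positional arguments
# in the global array REMAINING_ARGS. Returns 2 on a usage error.
parse_opts() {
  local OPTIND=1 opt
  VERBOSE=false
  OUTPUT_FILE=""
  # The leading ':' selects silent mode so we can print our own messages.
  while getopts ':vo:' opt; do
    case "$opt" in
      v) VERBOSE=true ;;
      o) OUTPUT_FILE=$OPTARG ;;
      :) printf 'ERROR: -%s requires an argument\n' "$OPTARG" >&2; return 2 ;;
      \?) printf 'ERROR: unknown option -%s\n' "$OPTARG" >&2; return 2 ;;
    esac
  done
  # getopts stops at "--"; drop everything it consumed.
  shift $((OPTIND - 1))
  REMAINING_ARGS=("$@")
}
```

A caller would typically run `parse_opts "$@" || exit 2` near the top of `main`. `getopts` only handles short options, so long options such as `--verbose` still need a manual `case` loop.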
## Compatibility & Portability
- Use `#!/usr/bin/env bash` shebang for portability across systems
- Check Bash version at script start: `(( BASH_VERSINFO[0] > 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] >= 4) ))` for Bash 4.4+ features (comparing major and minor independently would wrongly reject 5.0)
- Validate required external commands exist: `command -v jq &>/dev/null || exit 1`
- Detect platform differences: `case "$(uname -s)" in Linux*) ... ;; Darwin*) ... ;; esac`
- Handle GNU vs BSD tool differences (e.g., `sed -i` vs `sed -i ''`)
- Test scripts on all target platforms (Linux, macOS, BSD variants)
- Document minimum version requirements in script header comments
- Provide fallback implementations for platform-specific features
- Use built-in Bash features over external commands when possible for portability
- Avoid bashisms when POSIX compliance is required, document when using Bash-specific features
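
The version and platform checks in this list could be wrapped in small helpers. The following is only a sketch; the function names `bash_at_least` and `platform_name` are hypothetical:

```bash
#!/usr/bin/env bash
# bash_at_least MAJOR MINOR: succeed when the running Bash is at least MAJOR.MINOR.
bash_at_least() {
  local -r major="$1" minor="$2"
  (( BASH_VERSINFO[0] > major || (BASH_VERSINFO[0] == major && BASH_VERSINFO[1] >= minor) ))
}

# platform_name: map `uname -s` output to a short label.
platform_name() {
  case "$(uname -s)" in
    Linux*)  echo "linux" ;;
    Darwin*) echo "macos" ;;
    *BSD)    echo "bsd" ;;
    *)       echo "unknown" ;;
  esac
}
```

A script might start with `bash_at_least 4 4 || { echo "Bash 4.4+ required" >&2; exit 1; }` and branch on `platform_name` where GNU and BSD tools differ.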
## Readability & Maintainability
- Use long-form options in scripts for clarity: `--verbose` instead of `-v`
- Employ consistent naming: snake_case for functions/variables, UPPER_CASE for constants
- Add section headers with comment blocks to organize related functions
- Keep functions under 50 lines; refactor larger functions into smaller components
- Group related functions together with descriptive section headers
- Use descriptive function names that explain purpose: `validate_input_file` not `check_file`
- Add inline comments for non-obvious logic, avoid stating the obvious
- Maintain consistent indentation (2 or 4 spaces, never tabs mixed with spaces)
- Place opening braces on same line for consistency: `function_name() {`
- Use blank lines to separate logical blocks within functions
- Document function parameters and return values in header comments
- Extract magic numbers and strings to named constants at top of script
## Safety & Security Patterns
- Declare constants with `readonly` to prevent accidental modification
- Use `local` keyword for all function variables to avoid polluting global scope
- Implement `timeout` for external commands: `timeout 30s curl ...` prevents hangs
- Validate file permissions before operations: `[[ -r "$file" ]] || exit 1`
- Use process substitution `<(command)` instead of temporary files when possible
- Sanitize user input before using in commands or file operations
- Validate numeric input with pattern matching: `[[ $num =~ ^[0-9]+$ ]]`
- Never use `eval` on user input; use arrays for dynamic command construction
- Set restrictive umask for sensitive operations: `(umask 077; touch "$secure_file")`
- Log security-relevant operations (authentication, privilege changes, file access)
- Use `--` to separate options from arguments: `rm -rf -- "$user_input"`
- Validate environment variables before using: `: "${REQUIRED_VAR:?not set}"`
- Check exit codes of all security-critical operations explicitly
- Use `trap` to ensure cleanup happens even on abnormal exit
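
Two of the patterns above, numeric validation and array-based command construction, can be combined as in this sketch. The names `is_uint`, `build_fetch_cmd`, and `FETCH_CMD` are hypothetical, and the `curl` flags are just one plausible choice:

```bash
#!/usr/bin/env bash
# is_uint VALUE: succeed only for a non-empty string of decimal digits.
is_uint() {
  [[ "${1-}" =~ ^[0-9]+$ ]]
}

# build_fetch_cmd URL TIMEOUT_SECS: fill the global array FETCH_CMD with a
# timeout-wrapped curl invocation. No eval and no string concatenation.
build_fetch_cmd() {
  local -r url="$1" secs="$2"
  is_uint "$secs" || { printf 'ERROR: timeout must be a whole number\n' >&2; return 1; }
  FETCH_CMD=(timeout "${secs}s" curl --fail --silent --show-error "$url")
}
```

The command is then run with `"${FETCH_CMD[@]}"`, which keeps each argument intact even when the URL contains spaces or shell metacharacters.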
## Performance Optimization
- Avoid subshells in loops; use `while read` instead of `for i in $(cat file)`
- Use Bash built-ins over external commands: `[[ ]]` instead of `test`, `${var//pattern/replacement}` instead of `sed`
- Batch operations instead of repeated single operations (e.g., one `sed` with multiple expressions)
- Use `mapfile`/`readarray` for efficient array population from command output
- Avoid repeated command substitutions; store result in variable once
- Use arithmetic expansion `$(( ))` instead of `expr` for calculations
- Prefer `printf` over `echo` for formatted output (faster and more reliable)
- Use associative arrays for lookups instead of repeated grepping
- Process files line-by-line for large files instead of loading entire file into memory
- Use `xargs -P` for parallel processing when operations are independent
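
To make the first few points concrete, here is a sketch that counts matching lines with only built-ins (`while read`, `[[ ]]`, arithmetic expansion) rather than `cat`, `grep`, or `expr`. The function name `count_matching` is hypothetical:

```bash
#!/usr/bin/env bash
# count_matching FILE GLOB: print how many lines of FILE match the shell pattern GLOB.
count_matching() {
  local -r file="$1" pattern="$2"
  local line count=0
  # The `|| [[ -n "$line" ]]` part also handles a final line without a newline.
  while IFS= read -r line || [[ -n "$line" ]]; do
    # $pattern is deliberately unquoted so it is treated as a glob.
    if [[ "$line" == $pattern ]]; then
      count=$(( count + 1 ))
    fi
  done < "$file"
  printf '%d\n' "$count"
}
```

For very large files an external `grep -c` may still be faster; the point of the sketch is avoiding one subprocess per line, not beating specialised tools.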
## Documentation Standards
- Implement `--help` and `-h` flags showing usage, options, and examples
- Provide `--version` flag displaying script version and copyright information
- Include usage examples in help output for common use cases
- Document all command-line options with descriptions of their purpose
- List required vs optional arguments clearly in usage message
- Document exit codes: 0 for success, 1 for general errors, specific codes for specific failures
- Include prerequisites section listing required commands and versions
- Add header comment block with script purpose, author, and modification date
- Document environment variables the script uses or requires
- Provide troubleshooting section in help for common issues
- Generate documentation with `shdoc` from special comment formats
- Create man pages using `shellman` for system integration
- Include architecture diagrams using Mermaid or GraphViz for complex scripts
## Modern Bash Features (5.x)
- **Bash 5.0**: Associative array improvements, `EPOCHSECONDS` and `EPOCHREALTIME` (microsecond precision) variables
- **Bash 5.1**: `${var@U}` uppercase and `${var@L}` lowercase conversion, further `${parameter@operator}` transformations, `compat` shopt options for compatibility
- **Bash 5.2**: `varredir_close` option, improved `exec` error handling
- Check version before using modern features: `(( BASH_VERSINFO[0] > 5 || (BASH_VERSINFO[0] == 5 && BASH_VERSINFO[1] >= 2) ))`
- Use `${parameter@Q}` for shell-quoted output (Bash 4.4+)
- Use `${parameter@E}` for escape sequence expansion (Bash 4.4+)
- Use `${parameter@P}` for prompt expansion (Bash 4.4+)
- Use `${parameter@A}` for assignment format (Bash 4.4+)
- Employ `wait -n` to wait for any background job (Bash 4.3+)
- Use `mapfile -d delim` for custom delimiters (Bash 4.4+)
## CI/CD Integration
- **GitHub Actions**: Use `shellcheck-problem-matchers` for inline annotations
- **Pre-commit hooks**: Configure `.pre-commit-config.yaml` with `shellcheck`, `shfmt`, `checkbashisms`
- **Matrix testing**: Test across Bash 4.4, 5.0, 5.1, 5.2 on Linux and macOS
- **Container testing**: Use official bash:5.2 Docker images for reproducible tests
- **CodeQL**: Enable shell script scanning for security vulnerabilities
- **Actionlint**: Validate GitHub Actions workflow files that use shell scripts
- **Automated releases**: Tag versions and generate changelogs automatically
- **Coverage reporting**: Track test coverage and fail on regressions
- Example workflow: `shellcheck *.sh && shfmt -d *.sh && bats test/`
## Security Scanning & Hardening
- **SAST**: Integrate Semgrep with custom rules for shell-specific vulnerabilities
- **Secrets detection**: Use `gitleaks` or `trufflehog` to prevent credential leaks
- **Supply chain**: Verify checksums of sourced external scripts
- **Sandboxing**: Run untrusted scripts in containers with restricted privileges
- **SBOM**: Document dependencies and external tools for compliance
- **Security linting**: Use ShellCheck with security-focused rules enabled
- **Privilege analysis**: Audit scripts for unnecessary root/sudo requirements
- **Input sanitization**: Validate all external inputs against allowlists
- **Audit logging**: Log all security-relevant operations to syslog
- **Container security**: Scan script execution environments for vulnerabilities
## Observability & Logging
- **Structured logging**: Output JSON for log aggregation systems
- **Log levels**: Implement DEBUG, INFO, WARN, ERROR with configurable verbosity
- **Syslog integration**: Use `logger` command for system log integration
- **Distributed tracing**: Add trace IDs for multi-script workflow correlation
- **Metrics export**: Output Prometheus-format metrics for monitoring
- **Error context**: Include stack traces, environment info in error logs
- **Log rotation**: Configure log file rotation for long-running scripts
- **Performance metrics**: Track execution time, resource usage, external call latency
- Example: `log_info() { logger -t "$SCRIPT_NAME" -p user.info "$*"; echo "[INFO] $*" >&2; }`
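
For the structured-logging point, a minimal JSON logger might look like the sketch below. The function names and the field names `ts`, `level`, and `msg` are assumptions, and the escaping covers only backslashes and double quotes; messages containing control characters would need a real encoder such as `jq`:

```bash
#!/usr/bin/env bash
# json_escape STRING: escape backslashes and double quotes (nothing else).
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  printf '%s' "$s"
}

# log_json LEVEL MESSAGE...: write one JSON object per line to stderr.
log_json() {
  local -r level="$1"
  shift
  local ts
  ts=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$ts" "$(json_escape "$level")" "$(json_escape "$*")" >&2
}
```

Writing to stderr keeps log lines out of any data the script emits on stdout, so the output can still be piped safely.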
## Quality Checklist
- Scripts pass ShellCheck static analysis with minimal suppressions
- Code is formatted consistently with shfmt using standard options
- Comprehensive test coverage with Bats including edge cases
- All variable expansions are properly quoted
- Error handling covers all failure modes with meaningful messages
- Temporary resources are cleaned up properly with EXIT traps
- Scripts support `--help` and provide clear usage information
- Input validation prevents injection attacks and handles edge cases
- Scripts are portable across target platforms (Linux, macOS)
- Performance is adequate for expected workloads and data sizes
## Output
- Production-ready Bash scripts with defensive programming practices
- Comprehensive test suites using bats-core or shellspec with TAP output
- CI/CD pipeline configurations (GitHub Actions, GitLab CI) for automated testing
- Documentation generated with shdoc and man pages with shellman
- Structured project layout with reusable library functions and dependency management
- Static analysis configuration files (.shellcheckrc, .shfmt.toml, .editorconfig)
- Performance benchmarks and profiling reports for critical workflows
- Security review with SAST, secrets scanning, and vulnerability reports
- Debugging utilities with trace modes, structured logging, and observability
- Migration guides for Bash 3→5 upgrades and legacy modernization
- Package distribution configurations (Homebrew formulas, deb/rpm specs)
- Container images for reproducible execution environments
## Essential Tools
### Static Analysis & Formatting
- **ShellCheck**: Static analyzer with `enable=all` and `external-sources=true` configuration
- **shfmt**: Shell script formatter with standard config (`-i 2 -ci -bn -sr -kp`)
- **checkbashisms**: Detect bash-specific constructs for portability analysis
- **Semgrep**: SAST with custom rules for shell-specific security issues
- **CodeQL**: GitHub's security scanning for shell scripts
### Testing Frameworks
- **bats-core**: Maintained fork of Bats with modern features and active development
- **shellspec**: BDD-style testing framework with rich assertions and mocking
- **shunit2**: xUnit-style testing framework for shell scripts
- **bashing**: Testing framework with mocking support and test isolation
### Modern Development Tools
- **bashly**: CLI framework generator for building command-line applications
- **basher**: Bash package manager for dependency management
- **bpkg**: Alternative bash package manager with npm-like interface
- **shdoc**: Generate markdown documentation from shell script comments
- **shellman**: Generate man pages from shell scripts
### CI/CD & Automation
- **pre-commit**: Multi-language pre-commit hook framework
- **actionlint**: GitHub Actions workflow linter
- **gitleaks**: Secrets scanning to prevent credential leaks
- **Makefile**: Automation for lint, format, test, and release workflows
## Common Pitfalls to Avoid
- `for f in $(ls ...)` causing word splitting/globbing bugs (use `find -print0 | while IFS= read -r -d '' f; do ...; done`)
- Unquoted variable expansions leading to unexpected behavior
- Relying on `set -e` without proper error trapping in complex flows
- Using `echo` for data output (prefer `printf` for reliability)
- Missing cleanup traps for temporary files and directories
- Unsafe array population (use `readarray`/`mapfile` instead of command substitution)
- Ignoring binary-safe file handling (always consider NUL separators for filenames)
## Dependency Management
- **Package managers**: Use `basher` or `bpkg` for installing shell script dependencies
- **Vendoring**: Copy dependencies into project for reproducible builds
- **Lock files**: Document exact versions of dependencies used
- **Checksum verification**: Verify integrity of sourced external scripts
- **Version pinning**: Lock dependencies to specific versions to prevent breaking changes
- **Dependency isolation**: Use separate directories for different dependency sets
- **Update automation**: Automate dependency updates with Dependabot or Renovate
- **Security scanning**: Scan dependencies for known vulnerabilities
- Example: `basher install username/repo@version` or `bpkg install username/repo -g`
## Advanced Techniques
- **Error Context**: Use `trap 'echo "Error at line $LINENO: exit $?" >&2' ERR` for debugging
- **Safe Temp Handling**: `trap 'rm -rf "$tmpdir"' EXIT; tmpdir=$(mktemp -d)`
- **Version Checking**: `(( BASH_VERSINFO[0] >= 5 ))` before using modern features
- **Binary-Safe Arrays**: `readarray -d '' files < <(find . -print0)`
- **Function Returns**: Use `declare -g result` for returning complex data from functions
- **Associative Arrays**: `declare -A config=([host]="localhost" [port]="8080")` for complex data structures
- **Parameter Expansion**: `${filename%.sh}` remove extension, `${path##*/}` basename, `${text//old/new}` replace all
- **Signal Handling**: `trap cleanup_function SIGHUP SIGINT SIGTERM` for graceful shutdown
- **Command Grouping**: `{ cmd1; cmd2; } > output.log` share redirection, `( cd dir && cmd )` use subshell for isolation
- **Co-processes**: `coproc proc { cmd; }; echo "data" >&"${proc[1]}"; read -u "${proc[0]}" result` for bidirectional pipes
- **Here-documents**: `cat <<-'EOF'` with `-` strips leading tabs, quotes prevent expansion
- **Process Management**: `wait $pid` to wait for background job, `jobs -p` list background PIDs
- **Conditional Execution**: `cmd1 && cmd2` run cmd2 only if cmd1 succeeds, `cmd1 || cmd2` run cmd2 if cmd1 fails
- **Brace Expansion**: `touch file{1..10}.txt` creates multiple files efficiently
- **Nameref Variables**: `declare -n ref=varname` creates reference to another variable (Bash 4.3+)
- **Improved Error Trapping**: `set -Eeuo pipefail; shopt -s inherit_errexit` for comprehensive error handling
- **Parallel Execution**: `xargs -P $(nproc) -n 1 command` for parallel processing with CPU core count
- **Structured Output**: `jq -n --arg key "$value" '{key: $key}'` for JSON generation
- **Performance Profiling**: Use GNU `/usr/bin/time -v` for detailed resource usage (the Bash `time` keyword has no `-v` option) or `TIMEFORMAT` for custom timing
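
The parameter-expansion forms listed above can replace `basename` and `sed` in simple cases. A small sketch, with hypothetical function names:

```bash
#!/usr/bin/env bash
# script_stem PATH: strip the directory part and a trailing ".sh".
script_stem() {
  local name="${1##*/}"        # drop everything up to the last "/"
  printf '%s\n' "${name%.sh}"  # drop a trailing ".sh" if present
}

# underscored TEXT: replace every space with an underscore.
underscored() {
  printf '%s\n' "${1// /_}"
}
```

Both helpers run entirely inside the shell, so they avoid a subprocess per call.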
## References & Further Reading
### Style Guides & Best Practices
- [Google Shell Style Guide](https://google.github.io/styleguide/shellguide.html) - Comprehensive style guide covering quoting, arrays, and when to use shell
- [Bash Pitfalls](https://mywiki.wooledge.org/BashPitfalls) - Catalog of common Bash mistakes and how to avoid them
- [Bash Hackers Wiki](https://wiki.bash-hackers.org/) - Comprehensive Bash documentation and advanced techniques
- [Defensive BASH Programming](https://www.kfirlavi.com/blog/2012/11/14/defensive-bash-programming/) - Modern defensive programming patterns
### Tools & Frameworks
- [ShellCheck](https://github.com/koalaman/shellcheck) - Static analysis tool and extensive wiki documentation
- [shfmt](https://github.com/mvdan/sh) - Shell script formatter with detailed flag documentation
- [bats-core](https://github.com/bats-core/bats-core) - Maintained Bash testing framework
- [shellspec](https://github.com/shellspec/shellspec) - BDD-style testing framework for shell scripts
- [bashly](https://bashly.dannyb.co/) - Modern Bash CLI framework generator
- [shdoc](https://github.com/reconquest/shdoc) - Documentation generator for shell scripts
### Security & Advanced Topics
- [PEASS-ng](https://github.com/carlospolop/PEASS-ng) - Privilege-escalation enumeration scripts (linPEAS), useful for auditing host hardening
- [Awesome Bash](https://github.com/awesome-lists/awesome-bash) - Curated list of Bash resources and tools
- [Pure Bash Bible](https://github.com/dylanaraps/pure-bash-bible) - Collection of pure bash alternatives to external commands


@ -0,0 +1,125 @@
---
name: bookstack-documentation
description: Use when completing any significant work — deploying services, fixing cluster issues, writing runbooks, finishing brainstorming sessions, or making architectural decisions — to determine whether and where to save it to BookStack at https://wiki.ctz.fyi
---
# BookStack Documentation
## Overview
Save durable knowledge to BookStack as part of normal work — not just specs and plans, but ops runbooks, architecture notes, troubleshooting outcomes, and session results. If future-you would need to look it up, write it down.
**Instance:** https://wiki.ctz.fyi (BookStack v26.03.5)
**MCP tools:** `litellm_bookstack-*`
## Decision Table — Where Does This Go?
| Content type | Location |
|---|---|
| Design spec (from brainstorming) | Specs book (ID 157) |
| Implementation plan | Plans book (ID 159) |
| Architecture decision / how a system works | Ansiblestack book (ID 79), find or create page |
| Ops runbook / "how to do X on the cluster" | Ansiblestack book, `playbook-reference` page or new dedicated page |
| Troubleshooting investigation outcome | Ansiblestack book, relevant service page (e.g., update `keycloak` page) |
| New service deployed | Ansiblestack book, create new page named after the service |
| Project-specific docs | New book in Infrastructure Docs shelf, or new chapter in Ansiblestack |
## Shelf and Book Structure
```
Shelf: Superpowers (ID 1)
Book: Specs (ID 157) — Design specs from brainstorming sessions
Book: Plans (ID 159) — Implementation plans
Shelf: Infrastructure Docs (ID 78)
Book: Ansiblestack (ID 79) — Cluster bootstrap, services, architecture docs
Existing pages: INDEX, addons, applications, architecture, argocd-consolidation,
cluster, crowdsec, dns, external-secrets, hacker-ethos, keycloak,
litellm-qdrant-memory, mcp-servers, missing-services, monitoring,
netbox, networking, openbao, pangolin-newt-troubleshooting,
playbook-reference, playwright-mcp, rabbitmq, scripts, tandoor,
terrakube, tofu
Shelf: Repo Documentation (ID 121)
Various per-repo books
Book: touchscreen — Family Room Dashboard (ID 162)
```
## When to Save
- After brainstorming session completes → spec to Specs book
- After plan is written → plan to Plans book
- After deploying a new service → create/update service page in Ansiblestack
- After investigating and fixing a cluster issue → document fix on the relevant service page
- After writing a runbook or procedure → Ansiblestack `playbook-reference` or dedicated page
- After any architectural decision that isn't obvious from the code
## When NOT to Save
- Temporary debug output or scratch work
- Q&A that belongs in chat history
- Anything immediately obsolete
## Page Naming Conventions
| Type | Format |
|---|---|
| Specs | `[Spec] YYYY-MM-DD: <topic>` |
| Plans | `[Plan] YYYY-MM-DD: <feature name>` |
| Service pages | lowercase service name (e.g., `rabbitmq`) |
| Runbooks | descriptive verb phrase: `Rotating OpenBao Unseal Keys` |
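
The dated formats for specs and plans can be generated rather than typed by hand. This is only a sketch; the helper name `page_title` is hypothetical and not part of the BookStack MCP tools:

```bash
#!/usr/bin/env bash
# page_title KIND TOPIC [DATE]: build a "[Spec] YYYY-MM-DD: <topic>" style name.
# KIND must be "Spec" or "Plan"; DATE defaults to today's date.
page_title() {
  local -r kind="$1" topic="$2"
  local -r day="${3:-$(date +%F)}"
  case "$kind" in
    Spec|Plan) printf '[%s] %s: %s\n' "$kind" "$day" "$topic" ;;
    *) printf 'ERROR: kind must be Spec or Plan\n' >&2; return 1 ;;
  esac
}
```

For example, `page_title Spec "Harbor migration" 2026-05-30` yields `[Spec] 2026-05-30: Harbor migration`.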
## Page Format (Markdown)
For service pages, use this structure:
```markdown
# Service Name
**Status:** Running / Deprecated
**Namespace:** `<ns>`
**URL:** https://<hostname>
**Chart:** `helm/charts/<name>/`
**ArgoCD App:** `helm/argocd/<name>-app.yaml`
**Secrets:** OpenBao path `secret/production/<ns>/...`
## Overview
What it is and why we run it.
## Architecture
How it's deployed, what it depends on.
## Configuration
Key config decisions, non-obvious settings.
## Operations
### How to restart
### How to update
### Common issues
```
For runbooks and procedures, use a clear numbered steps format. For troubleshooting outcomes, document: symptoms → investigation → root cause → fix.
## MCP Usage
```python
# Find an existing page (search or list book contents)
bookstack_books_read(id=79)  # lists pages in Ansiblestack
# Create a new page
bookstack_pages_create(
    book_id=79,
    name="my-service",
    markdown="# My Service\n..."
)
# Update an existing page: ALWAYS read first, because updates replace the entire content
page = bookstack_pages_read(id=<page_id>)
bookstack_pages_update(
    id=<page_id>,
    markdown="<updated full content>"
)
```
**Always read before updating.** `bookstack_pages_update` replaces the entire page.


@ -0,0 +1,122 @@
---
name: brainstorming
description: "You MUST use this before any creative work - creating features, building components, adding functionality, or modifying behavior. Explores user intent, requirements and design before implementation."
---
# Brainstorming Ideas Into Designs
Help turn ideas into fully formed designs and specs through natural collaborative dialogue.
Start by understanding the current project context, then ask questions one at a time to refine the idea. Once you understand what you're building, present the design and get user approval.
<HARD-GATE>
Do NOT invoke any implementation skill, write any code, scaffold any project, or take any implementation action until you have presented a design and the user has approved it. This applies to EVERY project regardless of perceived simplicity.
</HARD-GATE>
## Anti-Pattern: "This Is Too Simple To Need A Design"
Every project goes through this process. A todo list, a single-function utility, a config change — all of them. "Simple" projects are where unexamined assumptions cause the most wasted work. The design can be short (a few sentences for truly simple projects), but you MUST present it and get approval.
## Checklist
You MUST create a todo item for each of these and complete them in order:
1. **Explore project context** — check files, docs, recent commits
2. **Offer visual companion** (if topic will involve visual questions) — own message, not combined with a clarifying question
3. **Ask clarifying questions** — one at a time, understand purpose/constraints/success criteria
4. **Propose 2-3 approaches** — with trade-offs and your recommendation
5. **Present design** — in sections scaled to their complexity, get user approval after each section
6. **Save spec to BookStack** — create a page in the Specs book (https://wiki.ctz.fyi) with the full design doc
7. **Spec self-review** — quick inline check for placeholders, contradictions, ambiguity, scope
8. **User reviews spec** — ask user to review the BookStack page before proceeding
9. **Transition to implementation** — invoke `writing-plans` skill
## BookStack Spec Page
After the design is approved (step 6), save it to BookStack at https://wiki.ctz.fyi:
1. The **Specs** book already exists (book ID 157) under the Superpowers shelf.
2. Create the spec page via `bookstack_pages_create`:
- `book_id`: 157
- `name`: `[Spec] YYYY-MM-DD: <topic>`
- `markdown`: full design doc in markdown
3. Note the returned page URL for the user review gate: `https://wiki.ctz.fyi/books/specs-CdD/page/<slug>`
> If a project-specific chapter is appropriate (e.g., a named project has multiple specs), create or reuse a chapter inside the Specs book and use `chapter_id` instead of `book_id`.
## Vikunja Project Setup
Also create or identify the Vikunja project for implementation tracking:
1. Call `litellm_vikunja-vikunja_api` with operation `get_projects` to list all projects
2. Ask: "Which Vikunja project should tasks live in? Or I can create a new one cloned from the Template."
3. If creating a new project:
- Ask the user what to name it
- Call `put_projects_projectid_duplicate` with `projectID: 5`, body `{ "name": "<chosen name>" }`
4. Note the project ID for `writing-plans`
## The Process
**Understanding the idea:**
- Check out the current project state first (files, docs, recent commits)
- Before asking detailed questions, assess scope: if the request describes multiple independent subsystems, flag this immediately
- If the project is too large for a single spec, help the user decompose into sub-projects
- For appropriately-scoped projects, ask questions one at a time to refine the idea
- Prefer multiple choice questions when possible
- Only one question per message
- Focus on understanding: purpose, constraints, success criteria
**Exploring approaches:**
- Propose 2-3 different approaches with trade-offs
- Present options conversationally with your recommendation and reasoning
- Lead with your recommended option and explain why
**Presenting the design:**
- Once you believe you understand what you're building, present the design
- Scale each section to its complexity
- Ask after each section whether it looks right so far
- Cover: architecture, components, data flow, error handling, testing
**Design for isolation and clarity:**
- Break the system into smaller units that each have one clear purpose
- Can someone understand what a unit does without reading its internals?
**Working in existing codebases:**
- Explore the current structure before proposing changes. Follow existing patterns.
- Include targeted improvements but don't propose unrelated refactoring.
## Spec Self-Review (step 7)
Run this yourself — not a subagent:
1. **Placeholder scan:** Any "TBD", "TODO", incomplete sections, or vague requirements? Fix them.
2. **Internal consistency:** Do any sections contradict each other?
3. **Scope check:** Is this focused enough for a single implementation plan?
4. **Ambiguity check:** Could any requirement be interpreted two different ways?
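
The placeholder scan in item 1 lends itself to a quick mechanical first pass before the manual read. A minimal sketch, assuming the markers `TBD`, `TODO`, `FIXME`, `XXX` and `???` are the ones worth flagging (the marker list and the function name are illustrative choices, not part of the skill):

```python
import re

# Markers treated as unfinished content; extend the list to taste.
PLACEHOLDER = re.compile(r"\b(TBD|TODO|FIXME|XXX)\b|\?\?\?")


def find_placeholders(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines containing a placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(markdown.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]


spec = "# Design\nAuth: TBD\nStorage: nvme\nRollback: TODO decide\n"
for number, line in find_placeholders(spec):
    print(f"line {number}: {line}")
```

A scan like this only covers literal markers; vague requirements and contradictions still need the manual checks in items 2 to 4.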
## User Review Gate (step 8)
After saving to BookStack and completing the self-review, ask the user:
> "Spec saved to BookStack: https://wiki.ctz.fyi/books/specs-CdD/page/<slug>. Please review it and let me know if you want any changes before we start writing the implementation plan."
Wait for the user's response. Only proceed once the user approves.
## Implementation (step 9)
- Invoke the `writing-plans` skill to create a detailed implementation plan
- Do NOT invoke any other skill. `writing-plans` is the next and only step.
## Key Principles
- One question at a time
- Multiple choice preferred
- YAGNI ruthlessly
- Explore alternatives — always propose 2-3 approaches
- Incremental validation — present design section by section, get approval before moving on
- Be flexible — go back and clarify when something doesn't make sense
## Visual Companion
A browser-based companion for showing mockups, diagrams, and visual options. Offer it once for consent when visual questions are anticipated. This offer MUST be its own message — not combined with a clarifying question.
Per-question decision: use browser for layout/mockup/diagram content; use text for conceptual questions.


@ -0,0 +1,193 @@
---
name: cnpg-database
description: Use when deploying, configuring, or troubleshooting CloudNativePG PostgreSQL clusters on Zoe's k3s homelab, including bootstrapping, secrets, S3 backups, migrations, and common failure modes.
---
# CloudNativePG (CNPG) on k3s Homelab
## Overview
Deploy and operate CNPG PostgreSQL clusters on the production k3s cluster at `10.0.6.10`, which runs CNPG operator v1.28.1. Always use ArgoCD sync-waves to enforce creation order.
## Environment
| Setting | Value |
|---------|-------|
| CNPG operator | 1.28.1 |
| PostgreSQL image | `ghcr.io/cloudnative-pg/postgresql:18.1-system-trixie` (includes pgvector as `vector.so`) |
| Fast storage | `nvme` (NFS-NVMe) |
| Standard storage | `ssd` (NFS-SSD) |
| S3 endpoint | `https://s3.ctz.fyi` |
| S3 bucket | `cnpg-backups` |
| Secrets backend | External Secrets Operator → ClusterSecretStore `openbao` |
| OpenBao path | `secret/production/<namespace>/<cluster-name>` |
## Sync-Wave Order (Critical)
| Wave | Resource |
|------|----------|
| `-2` | CNPG `Cluster` |
| `-1` | `ExternalSecret` for DB credentials |
| `0` | App `Deployment` |
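
Argo CD applies lower sync-wave values first, so the table reads top to bottom in apply order. The sketch below sorts a few manifests by their `argocd.argoproj.io/sync-wave` annotation to make that order explicit. The manifest dictionaries are illustrative stand-ins, and a missing annotation is treated as wave 0, which is Argo CD's default.

```python
WAVE_KEY = "argocd.argoproj.io/sync-wave"


def sync_wave(manifest: dict) -> int:
    """Read the sync-wave annotation, defaulting to wave 0 when it is absent."""
    annotations = manifest.get("metadata", {}).get("annotations", {})
    return int(annotations.get(WAVE_KEY, "0"))


def apply_order(manifests: list[dict]) -> list[str]:
    """Return resource kinds in the order their waves would be applied."""
    return [m["kind"] for m in sorted(manifests, key=sync_wave)]


manifests = [
    {"kind": "Deployment", "metadata": {}},
    {"kind": "ExternalSecret", "metadata": {"annotations": {WAVE_KEY: "-1"}}},
    {"kind": "Cluster", "metadata": {"annotations": {WAVE_KEY: "-2"}}},
]
print(apply_order(manifests))
```

The annotation value must be a quoted string in YAML, which is why the sketch converts it with `int()` before sorting.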
## Step 1 — Write Secrets to OpenBao
Do this **before** deploying anything:
```bash
bao kv put secret/production/<namespace>/<app>-db \
  username=<app> \
  password=$(openssl rand -base64 32 | tr -d /=+ | head -c 32)
```
Also create the backup credentials secret once per namespace:
```bash
bao kv put secret/production/<namespace>/cnpg-backup-s3-credentials \
  ACCESS_KEY_ID=<key> \
  ACCESS_SECRET_KEY=<secret>
```
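Where `openssl` is not available, a comparable database password can be produced with Python's standard `secrets` module. A minimal sketch, assuming a 32-character alphanumeric password is acceptable (this is the same alphabet that the `tr -d /=+` filter leaves behind); the function name is illustrative:

```python
import secrets
import string

# Letters and digits only, so the value needs no quoting or URL escaping.
ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 32) -> str:
    """Return a random password drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


password = generate_password()
print(len(password))
```

The result would then be passed to `bao kv put` as the `password=` value in place of the `$(openssl ...)` substitution.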
## Step 2 — ExternalSecret (sync-wave -1)
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: <app>-db-credentials
  namespace: <app>
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: openbao
    kind: ClusterSecretStore
  target:
    name: <app>-db-credentials
    creationPolicy: Owner
  data:
    - secretKey: username
      remoteRef:
        key: secret/production/<namespace>/<app>-db
        property: username
    - secretKey: password
      remoteRef:
        key: secret/production/<namespace>/<app>-db
        property: password
```
## Step 3 — CNPG Cluster (sync-wave -2)
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <app>-db
  namespace: <app>
  annotations:
    argocd.argoproj.io/sync-wave: "-2"
spec:
  instances: 3 # Use 1 for dev/small workloads
  imageName: ghcr.io/cloudnative-pg/postgresql:18.1-system-trixie
  storage:
    size: 10Gi
    storageClass: nvme # or ssd
  bootstrap:
    initdb:
      database: <app>
      owner: <app>
      secret:
        name: <app>-db-credentials # MUST have keys 'username' and 'password' exactly
  backup:
    barmanObjectStore:
      destinationPath: s3://cnpg-backups/<app>
      endpointURL: https://s3.ctz.fyi
      s3Credentials:
        accessKeyId:
          name: cnpg-backup-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-backup-s3-credentials
          key: ACCESS_SECRET_KEY
    retentionPolicy: "30d"
```
## CRITICAL: Secret Key Names
> **The bootstrap secret MUST have keys named exactly `username` and `password`.**
> CNPG will appear healthy but the app cannot connect if keys are wrong (e.g., `user`, `pass`, `POSTGRES_USER`).
> CNPG does NOT create a separate `-app` secret when `bootstrap.initdb.secret` is provided.
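
Since a wrong key name is easy to miss, it can help to check the secret's keys before syncing. A minimal sketch, assuming `secret_data` holds the `data` mapping of the Kubernetes Secret (for example, parsed from `kubectl get secret -o json`); the function name is hypothetical:

```python
REQUIRED_KEYS = {"username", "password"}


def missing_bootstrap_keys(secret_data: dict) -> set:
    """Return the required keys that are absent from a secret's data mapping."""
    return REQUIRED_KEYS - set(secret_data)


# Keys such as 'user' and 'pass' do not satisfy the requirement.
print(missing_bootstrap_keys({"user": "app", "pass": "s3cret"}))
print(missing_bootstrap_keys({"username": "app", "password": "s3cret"}))
```

An empty set means both keys are present; anything else names the keys that still need to be added to the OpenBao entry or the `ExternalSecret` mapping.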
## Connecting from the App
CNPG auto-creates these services:
| Service | Use |
|---------|-----|
| `<cluster>-rw` | Read-write (primary) — **use this for app writes** |
| `<cluster>-ro` | Read-only (replicas) — use for read-heavy queries |
| `<cluster>-r` | Any instance |
```
postgresql://<username>:<password>@<app>-db-rw.<namespace>.svc.cluster.local:5432/<database>
```
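The connection string above can be assembled in code, which also takes care of percent-encoding credentials that contain characters such as `@` or `/`. A minimal sketch using only the standard library; the function name and argument order are illustrative, and the host follows the `<cluster>-rw` service naming from the table:

```python
from urllib.parse import quote


def build_dsn(username, password, cluster, namespace, database, port=5432):
    """Build a PostgreSQL URI that targets the cluster's read-write service."""
    host = f"{cluster}-rw.{namespace}.svc.cluster.local"
    return (
        f"postgresql://{quote(username, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


print(build_dsn("app", "p@ss/word", "app-db", "app", "app"))
```

Passing `safe=''` matters here: by default `quote` leaves `/` unescaped, which would corrupt a URI whose password contains a slash.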
## Manual Database Access
```bash
# psql on instance 1 (the primary on a freshly bootstrapped cluster)
kubectl exec -n <namespace> -it <cluster>-1 -- psql -U <username> <database>
# via cnpg plugin
kubectl cnpg psql <cluster> -n <namespace>
# pg_dump
kubectl exec -n <namespace> <cluster>-1 -- \
  pg_dump -U <username> <database> > dump.sql
# restore
kubectl exec -n <namespace> -i <cluster>-1 -- \
  psql -U <username> <database> < dump.sql
```
## Migrating from Docker/External Postgres
```bash
# 1. Dump from source
pg_dump -h <old-host> -U <user> <database> > dump.sql
# 2. Copy into pod
kubectl cp dump.sql <namespace>/<pod>:/tmp/dump.sql
# 3. Restore
kubectl exec -n <namespace> -it <pod> -- \
  psql -U <username> <database> -f /tmp/dump.sql
```
## Scheduled Backups (Optional)
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: <app>-db-backup
  namespace: <app>
spec:
  schedule: "0 0 2 * * *" # 02:00 daily; CNPG uses a six-field cron with seconds first
  backupOwnerReference: self
  cluster:
    name: <app>-db
```
## Common Issues
| Symptom | Cause | Fix |
|---------|-------|-----|
| Cluster stuck at "Setting up primary" | Secret missing or wrong key names | Check `<app>-db-credentials` exists and has `username`/`password` keys |
| Pod in `Pending` | PVC can't provision | Check `nvme`/`ssd` NFS provisioner is healthy |
| App can't connect | Using pod IP or wrong service | Use `<cluster>-rw` service, not pod IP |
| 2/3 instances after node failure | Normal self-healing | Wait — CNPG will recover automatically |
| Stale data after cluster recreation | Old PVCs still present | Delete PVCs manually before clean redeploy |


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,447 @@
---
name: code-review-checklist
description: "Comprehensive checklist for conducting thorough code reviews covering functionality, security, performance, and maintainability"
risk: unknown
source: community
date_added: "2026-02-27"
---
# Code Review Checklist
## Overview
Provide a systematic checklist for conducting thorough code reviews. This skill helps reviewers ensure code quality, catch bugs, identify security issues, and maintain consistency across the codebase.
## When to Use This Skill
- Use when reviewing pull requests
- Use when conducting code audits
- Use when establishing code review standards for a team
- Use when training new developers on code review practices
- Use when you want to ensure nothing is missed in reviews
- Use when creating code review documentation
## How It Works
### Step 1: Understand the Context
Before reviewing code, I'll help you understand:
- What problem does this code solve?
- What are the requirements?
- What files were changed and why?
- Are there related issues or tickets?
- What's the testing strategy?
### Step 2: Review Functionality
Check if the code works correctly:
- Does it solve the stated problem?
- Are edge cases handled?
- Is error handling appropriate?
- Are there any logical errors?
- Does it match the requirements?
### Step 3: Review Code Quality
Assess code maintainability:
- Is the code readable and clear?
- Are names descriptive?
- Is it properly structured?
- Are functions/methods focused?
- Is there unnecessary complexity?
### Step 4: Review Security
Check for security issues:
- Are inputs validated?
- Is sensitive data protected?
- Are there SQL injection risks?
- Is authentication/authorization correct?
- Are dependencies secure?
### Step 5: Review Performance
Look for performance issues:
- Are there unnecessary loops?
- Is database access optimized?
- Are there memory leaks?
- Is caching used appropriately?
- Are there N+1 query problems?
### Step 6: Review Tests
Verify test coverage:
- Are there tests for new code?
- Do tests cover edge cases?
- Are tests meaningful?
- Do all tests pass?
- Is test coverage adequate?
## Examples
### Example 1: Functionality Review Checklist
```markdown
## Functionality Review
### Requirements
- [ ] Code solves the stated problem
- [ ] All acceptance criteria are met
- [ ] Edge cases are handled
- [ ] Error cases are handled
- [ ] User input is validated
### Logic
- [ ] No logical errors or bugs
- [ ] Conditions are correct (no off-by-one errors)
- [ ] Loops terminate correctly
- [ ] Recursion has proper base cases
- [ ] State management is correct
### Error Handling
- [ ] Errors are caught appropriately
- [ ] Error messages are clear and helpful
- [ ] Errors don't expose sensitive information
- [ ] Failed operations are rolled back
- [ ] Logging is appropriate
### Example Issues to Catch:
**❌ Bad - Missing validation:**
\`\`\`javascript
function createUser(email, password) {
  // No validation!
  return db.users.create({ email, password });
}
\`\`\`
**✅ Good - Proper validation:**
\`\`\`javascript
function createUser(email, password) {
  if (!email || !isValidEmail(email)) {
    throw new Error('Invalid email address');
  }
  if (!password || password.length < 8) {
    throw new Error('Password must be at least 8 characters');
  }
  return db.users.create({ email, password });
}
\`\`\`
```
### Example 2: Security Review Checklist
```markdown
## Security Review
### Input Validation
- [ ] All user inputs are validated
- [ ] SQL injection is prevented (use parameterized queries)
- [ ] XSS is prevented (escape output)
- [ ] CSRF protection is in place
- [ ] File uploads are validated (type, size, content)
### Authentication & Authorization
- [ ] Authentication is required where needed
- [ ] Authorization checks are present
- [ ] Passwords are hashed (never stored plain text)
- [ ] Sessions are managed securely
- [ ] Tokens expire appropriately
### Data Protection
- [ ] Sensitive data is encrypted
- [ ] API keys are not hardcoded
- [ ] Environment variables are used for secrets
- [ ] Personal data follows privacy regulations
- [ ] Database credentials are secure
### Dependencies
- [ ] No known vulnerable dependencies
- [ ] Dependencies are up to date
- [ ] Unnecessary dependencies are removed
- [ ] Dependency versions are pinned
### Example Issues to Catch:
**❌ Bad - SQL injection risk:**
\`\`\`javascript
const query = \`SELECT * FROM users WHERE email = '\${email}'\`;
db.query(query);
\`\`\`
**✅ Good - Parameterized query:**
\`\`\`javascript
const query = 'SELECT * FROM users WHERE email = $1';
db.query(query, [email]);
\`\`\`
**❌ Bad - Hardcoded secret:**
\`\`\`javascript
const API_KEY = 'sk_live_abc123xyz';
\`\`\`
**✅ Good - Environment variable:**
\`\`\`javascript
const API_KEY = process.env.API_KEY;
if (!API_KEY) {
  throw new Error('API_KEY environment variable is required');
}
\`\`\`
```
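The parameterized-query rule is not specific to JavaScript. As a runnable illustration, the sketch below uses Python's standard `sqlite3` module with an in-memory database; the table and the `find_user` helper are made up for the example.

```python
import sqlite3

# Throwaway in-memory database with a single user row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.execute("INSERT INTO users VALUES ('a@example.com')")


def find_user(email: str) -> list:
    # The ? placeholder passes the value separately from the SQL text,
    # so quotes inside `email` cannot change the shape of the query.
    cursor = conn.execute("SELECT email FROM users WHERE email = ?", (email,))
    return cursor.fetchall()


print(find_user("a@example.com"))  # [('a@example.com',)]
print(find_user("' OR '1'='1"))    # []
```

With string interpolation, the second call would have matched every row; with a bound parameter it is compared as a literal string and matches nothing.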
### Example 3: Code Quality Review Checklist
```markdown
## Code Quality Review
### Readability
- [ ] Code is easy to understand
- [ ] Variable names are descriptive
- [ ] Function names explain what they do
- [ ] Complex logic has comments
- [ ] Magic numbers are replaced with constants
### Structure
- [ ] Functions are small and focused
- [ ] Code follows DRY principle (Don't Repeat Yourself)
- [ ] Proper separation of concerns
- [ ] Consistent code style
- [ ] No dead code or commented-out code
### Maintainability
- [ ] Code is modular and reusable
- [ ] Dependencies are minimal
- [ ] Changes are backwards compatible
- [ ] Breaking changes are documented
- [ ] Technical debt is noted
### Example Issues to Catch:
**❌ Bad - Unclear naming:**
\`\`\`javascript
function calc(a, b, c) {
  return a * b + c;
}
\`\`\`
**✅ Good - Descriptive naming:**
\`\`\`javascript
function calculateTotalPrice(quantity, unitPrice, tax) {
  return quantity * unitPrice + tax;
}
\`\`\`
**❌ Bad - Function doing too much:**
\`\`\`javascript
function processOrder(order) {
  // Validate order
  if (!order.items) throw new Error('No items');
  // Calculate total
  let total = 0;
  for (let item of order.items) {
    total += item.price * item.quantity;
  }
  // Apply discount
  if (order.coupon) {
    total *= 0.9;
  }
  // Process payment
  const payment = stripe.charge(total);
  // Send email
  sendEmail(order.email, 'Order confirmed');
  // Update inventory
  updateInventory(order.items);
  return { orderId: order.id, total };
}
\`\`\`
**✅ Good - Separated concerns:**
\`\`\`javascript
function processOrder(order) {
  validateOrder(order);
  const total = calculateOrderTotal(order);
  const payment = processPayment(total);
  sendOrderConfirmation(order.email);
  updateInventory(order.items);
  return { orderId: order.id, total };
}
\`\`\`
```
## Best Practices
### ✅ Do This
- **Review Small Changes** - Smaller PRs are easier to review thoroughly
- **Check Tests First** - Verify tests pass and cover new code
- **Run the Code** - Test it locally when possible
- **Ask Questions** - Don't assume, ask for clarification
- **Be Constructive** - Suggest improvements, don't just criticize
- **Focus on Important Issues** - Don't nitpick minor style issues
- **Use Automated Tools** - Linters, formatters, security scanners
- **Review Documentation** - Check if docs are updated
- **Consider Performance** - Think about scale and efficiency
- **Check for Regressions** - Ensure existing functionality still works
### ❌ Don't Do This
- **Don't Approve Without Reading** - Actually review the code
- **Don't Be Vague** - Provide specific feedback with examples
- **Don't Ignore Security** - Security issues are critical
- **Don't Skip Tests** - Untested code will cause problems
- **Don't Be Rude** - Be respectful and professional
- **Don't Rubber Stamp** - Every review should add value
- **Don't Review When Tired** - You'll miss important issues
- **Don't Forget Context** - Understand the bigger picture
## Complete Review Checklist
### Pre-Review
- [ ] Read the PR description and linked issues
- [ ] Understand what problem is being solved
- [ ] Check if tests pass in CI/CD
- [ ] Pull the branch and run it locally
### Functionality
- [ ] Code solves the stated problem
- [ ] Edge cases are handled
- [ ] Error handling is appropriate
- [ ] User input is validated
- [ ] No logical errors
### Security
- [ ] No SQL injection vulnerabilities
- [ ] No XSS vulnerabilities
- [ ] Authentication/authorization is correct
- [ ] Sensitive data is protected
- [ ] No hardcoded secrets
### Performance
- [ ] No unnecessary database queries
- [ ] No N+1 query problems
- [ ] Efficient algorithms used
- [ ] No memory leaks
- [ ] Caching used appropriately
### Code Quality
- [ ] Code is readable and clear
- [ ] Names are descriptive
- [ ] Functions are focused and small
- [ ] No code duplication
- [ ] Follows project conventions
### Tests
- [ ] New code has tests
- [ ] Tests cover edge cases
- [ ] Tests are meaningful
- [ ] All tests pass
- [ ] Test coverage is adequate
### Documentation
- [ ] Code comments explain why, not what
- [ ] API documentation is updated
- [ ] README is updated if needed
- [ ] Breaking changes are documented
- [ ] Migration guide provided if needed
### Git
- [ ] Commit messages are clear
- [ ] No merge conflicts
- [ ] Branch is up to date with main
- [ ] No unnecessary files committed
- [ ] .gitignore is properly configured
## Common Pitfalls
### Problem: Missing Edge Cases
**Symptoms:** Code works for happy path but fails on edge cases
**Solution:** Ask "What if...?" questions
- What if the input is null?
- What if the array is empty?
- What if the user is not authenticated?
- What if the network request fails?
### Problem: Security Vulnerabilities
**Symptoms:** Code exposes security risks
**Solution:** Use security checklist
- Run security scanners (npm audit, Snyk)
- Check OWASP Top 10
- Validate all inputs
- Use parameterized queries
- Never trust user input
### Problem: Poor Test Coverage
**Symptoms:** New code has no tests or inadequate tests
**Solution:** Require tests for all new code
- Unit tests for functions
- Integration tests for features
- Edge case tests
- Error case tests
### Problem: Unclear Code
**Symptoms:** Reviewer can't understand what code does
**Solution:** Request improvements
- Better variable names
- Explanatory comments
- Smaller functions
- Clear structure
## Review Comment Templates
### Requesting Changes
```markdown
**Issue:** [Describe the problem]
**Current code:**
\`\`\`javascript
// Show problematic code
\`\`\`
**Suggested fix:**
\`\`\`javascript
// Show improved code
\`\`\`
**Why:** [Explain why this is better]
```
### Asking Questions
```markdown
**Question:** [Your question]
**Context:** [Why you're asking]
**Suggestion:** [If you have one]
```
### Praising Good Code
```markdown
**Nice!** [What you liked]
This is great because [explain why]
```
## Related Skills
- `@requesting-code-review` - Prepare code for review
- `@receiving-code-review` - Handle review feedback
- `@systematic-debugging` - Debug issues found in review
- `@test-driven-development` - Ensure code has tests
## Additional Resources
- [Google Code Review Guidelines](https://google.github.io/eng-practices/review/)
- [OWASP Top 10](https://owasp.org/www-project-top-ten/)
- [Code Review Best Practices](https://github.com/thoughtbot/guides/tree/main/code-review)
- [How to Review Code](https://www.kevinlondon.com/2015/05/05/code-review-best-practices.html)
---
**Pro Tip:** Use a checklist template for every review to ensure consistency and thoroughness. Customize it for your team's specific needs!


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,43 @@
---
name: code-review-excellence
description: "Master effective code review practices to provide constructive feedback, catch bugs early, and foster knowledge sharing while maintaining team morale. Use when reviewing pull requests, establishing..."
risk: unknown
source: community
date_added: "2026-02-27"
---
# Code Review Excellence
Transform code reviews from gatekeeping to knowledge sharing through constructive feedback, systematic analysis, and collaborative improvement.
## Use this skill when
- Reviewing pull requests and code changes
- Establishing code review standards
- Mentoring developers through review feedback
- Auditing for correctness, security, or performance
## Do not use this skill when
- There are no code changes to review
- The task is a design-only discussion without code
- You need to implement fixes instead of reviewing
## Instructions
- Read context, requirements, and test signals first.
- Review for correctness, security, performance, and maintainability.
- Provide actionable feedback with severity and rationale.
- Ask clarifying questions when intent is unclear.
- If detailed checklists are required, open `resources/implementation-playbook.md`.
## Output Format
- High-level summary of findings
- Issues grouped by severity (blocking, important, minor)
- Suggestions and questions
- Test and coverage notes
## Resources
- `resources/implementation-playbook.md` for detailed review patterns and templates.


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,515 @@
# Code Review Excellence Implementation Playbook
This file contains detailed patterns, checklists, and code samples referenced by the skill.
## When to Use This Skill
- Reviewing pull requests and code changes
- Establishing code review standards for teams
- Mentoring junior developers through reviews
- Conducting architecture reviews
- Creating review checklists and guidelines
- Improving team collaboration
- Reducing code review cycle time
- Maintaining code quality standards
## Core Principles
### 1. The Review Mindset
**Goals of Code Review:**
- Catch bugs and edge cases
- Ensure code maintainability
- Share knowledge across team
- Enforce coding standards
- Improve design and architecture
- Build team culture
**Not the Goals:**
- Show off knowledge
- Nitpick formatting (use linters)
- Block progress unnecessarily
- Rewrite to your preference
### 2. Effective Feedback
**Good Feedback is:**
- Specific and actionable
- Educational, not judgmental
- Focused on the code, not the person
- Balanced (praise good work too)
- Prioritized (critical vs nice-to-have)
```markdown
❌ Bad: "This is wrong."
✅ Good: "This could cause a race condition when multiple users
access simultaneously. Consider using a mutex here."
❌ Bad: "Why didn't you use X pattern?"
✅ Good: "Have you considered the Repository pattern? It would
make this easier to test. Here's an example: [link]"
❌ Bad: "Rename this variable."
✅ Good: "[nit] Consider `userCount` instead of `uc` for
clarity. Not blocking if you prefer to keep it."
```
### 3. Review Scope
**What to Review:**
- Logic correctness and edge cases
- Security vulnerabilities
- Performance implications
- Test coverage and quality
- Error handling
- Documentation and comments
- API design and naming
- Architectural fit
**What Not to Review Manually:**
- Code formatting (use Prettier, Black, etc.)
- Import organization
- Linting violations
- Simple typos
## Review Process
### Phase 1: Context Gathering (2-3 minutes)
```markdown
Before diving into code, understand:
1. Read PR description and linked issue
2. Check PR size (>400 lines? Ask to split)
3. Review CI/CD status (tests passing?)
4. Understand the business requirement
5. Note any relevant architectural decisions
```
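The size check in step 2 is easy to automate. The sketch below counts added and removed lines in a unified diff and compares the total with the 400-line threshold. The function names are illustrative, and the header detection is a simplification: a removed line whose content itself begins with `--` would be mistaken for a file header.

```python
def changed_lines(unified_diff: str) -> int:
    """Count added and removed lines, skipping the +++/--- file headers."""
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header, not a content change
        if line.startswith(("+", "-")):
            count += 1
    return count


def should_split(unified_diff: str, limit: int = 400) -> bool:
    """Return True when the diff exceeds the review-size limit."""
    return changed_lines(unified_diff) > limit


sample = "--- a/app.py\n+++ b/app.py\n@@ -1,2 +1,2 @@\n-old = 1\n+new = 2\n unchanged\n"
print(changed_lines(sample), should_split(sample))
```

In practice the input would come from something like `git diff main...HEAD`, and generated files or lockfiles would usually be excluded before counting.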
### Phase 2: High-Level Review (5-10 minutes)
```markdown
1. **Architecture & Design**
- Does the solution fit the problem?
- Are there simpler approaches?
- Is it consistent with existing patterns?
- Will it scale?
2. **File Organization**
- Are new files in the right places?
- Is code grouped logically?
- Are there duplicate files?
3. **Testing Strategy**
- Are there tests?
- Do tests cover edge cases?
- Are tests readable?
```
### Phase 3: Line-by-Line Review (10-20 minutes)
```markdown
For each file:
1. **Logic & Correctness**
- Edge cases handled?
- Off-by-one errors?
- Null/undefined checks?
- Race conditions?
2. **Security**
- Input validation?
- SQL injection risks?
- XSS vulnerabilities?
- Sensitive data exposure?
3. **Performance**
- N+1 queries?
- Unnecessary loops?
- Memory leaks?
- Blocking operations?
4. **Maintainability**
- Clear variable names?
- Functions doing one thing?
- Complex code commented?
- Magic numbers extracted?
```
### Phase 4: Summary & Decision (2-3 minutes)
```markdown
1. Summarize key concerns
2. Highlight what you liked
3. Make clear decision:
- ✅ Approve
- 💬 Comment (minor suggestions)
- 🔄 Request Changes (must address)
4. Offer to pair if complex
```
## Review Techniques
### Technique 1: The Checklist Method
```markdown
## Security Checklist
- [ ] User input validated and sanitized
- [ ] SQL queries use parameterization
- [ ] Authentication/authorization checked
- [ ] Secrets not hardcoded
- [ ] Error messages don't leak info
## Performance Checklist
- [ ] No N+1 queries
- [ ] Database queries indexed
- [ ] Large lists paginated
- [ ] Expensive operations cached
- [ ] No blocking I/O in hot paths
## Testing Checklist
- [ ] Happy path tested
- [ ] Edge cases covered
- [ ] Error cases tested
- [ ] Test names are descriptive
- [ ] Tests are deterministic
```
### Technique 2: The Question Approach
Instead of stating problems, ask questions to encourage thinking:
```markdown
❌ "This will fail if the list is empty."
✅ "What happens if `items` is an empty array?"
❌ "You need error handling here."
✅ "How should this behave if the API call fails?"
❌ "This is inefficient."
✅ "I see this loops through all users. Have we considered
the performance impact with 100k users?"
```
### Technique 3: Suggest, Don't Command
````markdown
## Use Collaborative Language
❌ "You must change this to use async/await"
✅ "Suggestion: async/await might make this more readable:
```typescript
async function fetchUser(id: string) {
  const user = await db.query('SELECT * FROM users WHERE id = ?', id);
  return user;
}
```
What do you think?"
❌ "Extract this into a function"
✅ "This logic appears in 3 places. Would it make sense to
extract it into a shared utility function?"
````
### Technique 4: Differentiate Severity
```markdown
Use labels to indicate priority:
🔴 [blocking] - Must fix before merge
🟡 [important] - Should fix, discuss if disagree
🟢 [nit] - Nice to have, not blocking
💡 [suggestion] - Alternative approach to consider
📚 [learning] - Educational comment, no action needed
🎉 [praise] - Good work, keep it up!
Example:
"🔴 [blocking] This SQL query is vulnerable to injection.
Please use parameterized queries."
"🟢 [nit] Consider renaming `data` to `userData` for clarity."
"🎉 [praise] Excellent test coverage! This will catch edge cases."
```
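Consistent labels also make review comments easy to sort afterwards. A minimal sketch that groups comments by the bracketed label they carry, assuming the six labels listed above are the complete set; comments without a recognized label fall into an `unlabeled` bucket, which is a choice made for this example:

```python
import re
from collections import defaultdict

LABEL = re.compile(r"\[(blocking|important|nit|suggestion|learning|praise)\]")


def group_by_severity(comments: list[str]) -> dict[str, list[str]]:
    """Group review comments by the first recognized label in each one."""
    groups = defaultdict(list)
    for comment in comments:
        match = LABEL.search(comment)
        groups[match.group(1) if match else "unlabeled"].append(comment)
    return dict(groups)


comments = [
    "🔴 [blocking] This SQL query is vulnerable to injection.",
    "🟢 [nit] Consider renaming `data` to `userData`.",
    "Looks fine overall.",
]
for label, items in group_by_severity(comments).items():
    print(label, len(items))
```

Listing the `blocking` group first in a review summary keeps the must-fix items from being buried under nits.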
## Language-Specific Patterns
### Python Code Review
```python
# Check for Python-specific issues
# ❌ Mutable default arguments
def add_item(item, items=[]):  # Bug! Shared across calls
    items.append(item)
    return items
# ✅ Use None as default
def add_item(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items
# ❌ Catching too broad
try:
    result = risky_operation()
except:  # Catches everything, even KeyboardInterrupt!
    pass
# ✅ Catch specific exceptions
try:
    result = risky_operation()
except ValueError as e:
    logger.error(f"Invalid value: {e}")
    raise
# ❌ Using mutable class attributes
class User:
    permissions = []  # Shared across all instances!
# ✅ Initialize in __init__
class User:
    def __init__(self):
        self.permissions = []
```
```
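The first pitfall above is easiest to appreciate by running it. The sketch below renames the two variants (`add_item_shared` and `add_item_fresh` are names chosen for this example) so that both can live in one file and be called side by side:

```python
def add_item_shared(item, items=[]):  # the default list is created once, at definition time
    items.append(item)
    return items


def add_item_fresh(item, items=None):
    if items is None:
        items = []  # a new list on every call that omits `items`
    items.append(item)
    return items


first = add_item_shared("a")
second = add_item_shared("b")
print(second)           # ['a', 'b']
print(first is second)  # True

print(add_item_fresh("a"))  # ['a']
print(add_item_fresh("b"))  # ['b']
```

Both calls to `add_item_shared` return the very same list object, which is why the item from the first call shows up in the second result.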
### TypeScript/JavaScript Code Review
```typescript
// Check for TypeScript-specific issues
// ❌ Using any defeats type safety
function processData(data: any) { // Avoid any
  return data.value;
}
// ✅ Use proper types
interface DataPayload {
  value: string;
}
function processData(data: DataPayload) {
  return data.value;
}
// ❌ Not handling async errors
async function fetchUser(id: string) {
  const response = await fetch(`/api/users/${id}`);
  return response.json(); // What if network fails?
}
// ✅ Handle errors properly
async function fetchUser(id: string): Promise<User> {
  try {
    const response = await fetch(`/api/users/${id}`);
    if (!response.ok) {
      throw new Error(`HTTP ${response.status}`);
    }
    return await response.json();
  } catch (error) {
    console.error('Failed to fetch user:', error);
    throw error;
  }
}
// ❌ Mutation of props
function UserProfile({ user }: Props) {
  user.lastViewed = new Date(); // Mutating prop!
  return <div>{user.name}</div>;
}
// ✅ Don't mutate props
function UserProfile({ user, onView }: Props) {
  useEffect(() => {
    onView(user.id); // Notify parent to update
  }, [user.id]);
  return <div>{user.name}</div>;
}
```
## Advanced Review Patterns
### Pattern 1: Architectural Review
```markdown
When reviewing significant changes:
1. **Design Document First**
- For large features, request design doc before code
- Review design with team before implementation
- Agree on approach to avoid rework
2. **Review in Stages**
- First PR: Core abstractions and interfaces
- Second PR: Implementation
- Third PR: Integration and tests
- Easier to review, faster to iterate
3. **Consider Alternatives**
- "Have we considered using [pattern/library]?"
- "What's the tradeoff vs. the simpler approach?"
- "How will this evolve as requirements change?"
```
### Pattern 2: Test Quality Review
```typescript
// ❌ Poor test: Implementation detail testing
test('increments counter variable', () => {
  const component = render(<Counter />);
  const button = component.getByRole('button');
  fireEvent.click(button);
  expect(component.state.counter).toBe(1);  // Testing internal state
});

// ✅ Good test: Behavior testing
test('displays incremented count when clicked', () => {
  render(<Counter />);
  const button = screen.getByRole('button', { name: /increment/i });
  fireEvent.click(button);
  expect(screen.getByText('Count: 1')).toBeInTheDocument();
});

// Review questions for tests:
// - Do tests describe behavior, not implementation?
// - Are test names clear and descriptive?
// - Do tests cover edge cases?
// - Are tests independent (no shared state)?
// - Can tests run in any order?
```
### Pattern 3: Security Review
```markdown
## Security Review Checklist
### Authentication & Authorization
- [ ] Is authentication required where needed?
- [ ] Are authorization checks performed before every action?
- [ ] Is JWT validation proper (signature, expiry)?
- [ ] Are API keys/secrets properly secured?
### Input Validation
- [ ] All user inputs validated?
- [ ] File uploads restricted (size, type)?
- [ ] SQL queries parameterized?
- [ ] XSS protection (escape output)?
### Data Protection
- [ ] Passwords hashed (bcrypt/argon2)?
- [ ] Sensitive data encrypted at rest?
- [ ] HTTPS enforced for sensitive data?
- [ ] PII handled according to regulations?
### Common Vulnerabilities
- [ ] No eval() or similar dynamic execution?
- [ ] No hardcoded secrets?
- [ ] CSRF protection for state-changing operations?
- [ ] Rate limiting on public endpoints?
```
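For the "SQL queries parameterized?" item, it can help to show reviewers what a passing example looks like. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `users` table and the `find_user` helper are assumptions made for illustration only.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str):
    """Look up a user with a bound parameter instead of string formatting."""
    cursor = conn.execute(
        "SELECT id, username FROM users WHERE username = ?",
        (username,),
    )
    return cursor.fetchone()


# Hypothetical schema and data, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO users (username) VALUES (?)", ("alice",))

# A classic injection payload is bound as a literal string and matches no row.
injection_attempt = "alice' OR '1'='1"
```

Because the driver binds the value separately from the SQL text, the payload cannot change the structure of the query. In a review, any query built with string concatenation or f-strings deserves a closer look.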
## Giving Difficult Feedback
### Pattern: The Sandwich Method (Modified)
```markdown
Traditional: Praise + Criticism + Praise (feels fake)
Better: Context + Specific Issue + Helpful Solution
Example:
"[Context]
I noticed the payment processing logic is inline in the
controller. This makes it harder to test and reuse.
[Specific Issue]
The calculateTotal() function mixes tax calculation,
discount logic, and database queries, making it difficult
to unit test and reason about.
[Helpful Solution]
Could we extract this into a PaymentService class? That
would make it testable and reusable. I can pair with you
on this if helpful."
```
### Handling Disagreements
```markdown
When author disagrees with your feedback:
1. **Seek to Understand**
"Help me understand your approach. What led you to
choose this pattern?"
2. **Acknowledge Valid Points**
"That's a good point about X. I hadn't considered that."
3. **Provide Data**
"I'm concerned about performance. Can we add a benchmark
to validate the approach?"
4. **Escalate if Needed**
"Let's get [architect/senior dev] to weigh in on this."
5. **Know When to Let Go**
If it's working and not a critical issue, approve it.
Perfection is the enemy of progress.
```
## Best Practices
1. **Review Promptly**: Within 24 hours, ideally same day
2. **Limit PR Size**: 200-400 lines max for effective review
3. **Review in Time Blocks**: 60 minutes max, take breaks
4. **Use Review Tools**: GitHub, GitLab, or dedicated tools
5. **Automate What You Can**: Linters, formatters, security scans
6. **Build Rapport**: Emoji, praise, and empathy matter
7. **Be Available**: Offer to pair on complex issues
8. **Learn from Others**: Review others' review comments
## Common Pitfalls
- **Perfectionism**: Blocking PRs for minor style preferences
- **Scope Creep**: "While you're at it, can you also..."
- **Inconsistency**: Different standards for different people
- **Delayed Reviews**: Letting PRs sit for days
- **Ghosting**: Requesting changes then disappearing
- **Rubber Stamping**: Approving without actually reviewing
- **Bike Shedding**: Debating trivial details extensively
## Templates
### PR Review Comment Template
```markdown
## Summary
[Brief overview of what was reviewed]
## Strengths
- [What was done well]
- [Good patterns or approaches]
## Required Changes
🔴 [Blocking issue 1]
🔴 [Blocking issue 2]
## Suggestions
💡 [Improvement 1]
💡 [Improvement 2]
## Questions
❓ [Clarification needed on X]
❓ [Alternative approach consideration]
## Verdict
✅ Approve after addressing required changes
```
## Resources
- **references/code-review-best-practices.md**: Comprehensive review guidelines
- **references/common-bugs-checklist.md**: Language-specific bugs to watch for
- **references/security-review-guide.md**: Security-focused review checklist
- **assets/pr-review-template.md**: Standard review comment template
- **assets/review-checklist.md**: Quick reference checklist
- **scripts/pr-analyzer.py**: Analyze PR complexity and suggest reviewers


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,175 @@
---
name: code-reviewer
description: "Elite code review expert specializing in modern AI-powered code"
risk: unknown
source: community
date_added: "2026-02-27"
---
## Use this skill when
- Working on code review tasks or workflows
- Needing guidance, best practices, or checklists for code review
## Do not use this skill when
- The task is unrelated to code review
- You need a different domain or tool outside this scope
## Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.
You are an elite code review expert specializing in modern code analysis techniques, AI-powered review tools, and production-grade quality assurance.
## Expert Purpose
Master code reviewer focused on ensuring code quality, security, performance, and maintainability using cutting-edge analysis tools and techniques. Combines deep technical expertise with modern AI-assisted review processes, static analysis tools, and production reliability practices to deliver comprehensive code assessments that prevent bugs, security vulnerabilities, and production incidents.
## Capabilities
### AI-Powered Code Analysis
- Integration with modern AI review tools (Trag, Bito, Codiga, GitHub Copilot)
- Natural language pattern definition for custom review rules
- Context-aware code analysis using LLMs and machine learning
- Automated pull request analysis and comment generation
- Real-time feedback integration with CLI tools and IDEs
- Custom rule-based reviews with team-specific patterns
- Multi-language AI code analysis and suggestion generation
### Modern Static Analysis Tools
- SonarQube, CodeQL, and Semgrep for comprehensive code scanning
- Security-focused analysis with Snyk, Bandit, and OWASP tools
- Performance analysis with profilers and complexity analyzers
- Dependency vulnerability scanning with npm audit, pip-audit
- License compliance checking and open source risk assessment
- Code quality metrics with cyclomatic complexity analysis
- Technical debt assessment and code smell detection
### Security Code Review
- OWASP Top 10 vulnerability detection and prevention
- Input validation and sanitization review
- Authentication and authorization implementation analysis
- Cryptographic implementation and key management review
- SQL injection, XSS, and CSRF prevention verification
- Secrets and credential management assessment
- API security patterns and rate limiting implementation
- Container and infrastructure security code review
### Performance & Scalability Analysis
- Database query optimization and N+1 problem detection
- Memory leak and resource management analysis
- Caching strategy implementation review
- Asynchronous programming pattern verification
- Load testing integration and performance benchmark review
- Connection pooling and resource limit configuration
- Microservices performance patterns and anti-patterns
- Cloud-native performance optimization techniques
### Configuration & Infrastructure Review
- Production configuration security and reliability analysis
- Database connection pool and timeout configuration review
- Container orchestration and Kubernetes manifest analysis
- Infrastructure as Code (Terraform, CloudFormation) review
- CI/CD pipeline security and reliability assessment
- Environment-specific configuration validation
- Secrets management and credential security review
- Monitoring and observability configuration verification
### Modern Development Practices
- Test-Driven Development (TDD) and test coverage analysis
- Behavior-Driven Development (BDD) scenario review
- Contract testing and API compatibility verification
- Feature flag implementation and rollback strategy review
- Blue-green and canary deployment pattern analysis
- Observability and monitoring code integration review
- Error handling and resilience pattern implementation
- Documentation and API specification completeness
### Code Quality & Maintainability
- Clean Code principles and SOLID pattern adherence
- Design pattern implementation and architectural consistency
- Code duplication detection and refactoring opportunities
- Naming convention and code style compliance
- Technical debt identification and remediation planning
- Legacy code modernization and refactoring strategies
- Code complexity reduction and simplification techniques
- Maintainability metrics and long-term sustainability assessment
### Team Collaboration & Process
- Pull request workflow optimization and best practices
- Code review checklist creation and enforcement
- Team coding standards definition and compliance
- Mentor-style feedback and knowledge sharing facilitation
- Code review automation and tool integration
- Review metrics tracking and team performance analysis
- Documentation standards and knowledge base maintenance
- Onboarding support and code review training
### Language-Specific Expertise
- JavaScript/TypeScript modern patterns and React/Vue best practices
- Python code quality with PEP 8 compliance and performance optimization
- Java enterprise patterns and Spring framework best practices
- Go concurrent programming and performance optimization
- Rust memory safety and performance critical code review
- C# .NET Core patterns and Entity Framework optimization
- PHP modern frameworks and security best practices
- Database query optimization across SQL and NoSQL platforms
### Integration & Automation
- GitHub Actions, GitLab CI/CD, and Jenkins pipeline integration
- Slack, Teams, and communication tool integration
- IDE integration with VS Code, IntelliJ, and development environments
- Custom webhook and API integration for workflow automation
- Code quality gates and deployment pipeline integration
- Automated code formatting and linting tool configuration
- Review comment template and checklist automation
- Metrics dashboard and reporting tool integration
## Behavioral Traits
- Maintains constructive and educational tone in all feedback
- Focuses on teaching and knowledge transfer, not just finding issues
- Balances thorough analysis with practical development velocity
- Prioritizes security and production reliability above all else
- Emphasizes testability and maintainability in every review
- Encourages best practices while being pragmatic about deadlines
- Provides specific, actionable feedback with code examples
- Considers long-term technical debt implications of all changes
- Stays current with emerging security threats and mitigation strategies
- Champions automation and tooling to improve review efficiency
## Knowledge Base
- Modern code review tools and AI-assisted analysis platforms
- OWASP security guidelines and vulnerability assessment techniques
- Performance optimization patterns for high-scale applications
- Cloud-native development and containerization best practices
- DevSecOps integration and shift-left security methodologies
- Static analysis tool configuration and custom rule development
- Production incident analysis and preventive code review techniques
- Modern testing frameworks and quality assurance practices
- Software architecture patterns and design principles
- Regulatory compliance requirements (SOC2, PCI DSS, GDPR)
## Response Approach
1. **Analyze code context** and identify review scope and priorities
2. **Apply automated tools** for initial analysis and vulnerability detection
3. **Conduct manual review** for logic, architecture, and business requirements
4. **Assess security implications** with focus on production vulnerabilities
5. **Evaluate performance impact** and scalability considerations
6. **Review configuration changes** with special attention to production risks
7. **Provide structured feedback** organized by severity and priority
8. **Suggest improvements** with specific code examples and alternatives
9. **Document decisions** and rationale for complex review points
10. **Follow up** on implementation and provide continuous guidance
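
Step 7 (feedback organized by severity and priority) can be sketched in a few lines. This is only an illustration: the severity names in `SEVERITY_ORDER` and the shape of a finding (a dict with `severity` and `message` keys) are assumptions, not something this skill defines.

```python
from typing import Dict, List

# Assumed severity scale, most urgent first.
SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def order_findings(findings: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort findings so the most severe come first; unknown severities go last."""
    rank = {name: index for index, name in enumerate(SEVERITY_ORDER)}
    return sorted(
        findings,
        key=lambda finding: rank.get(finding["severity"], len(SEVERITY_ORDER)),
    )
```

`sorted` is stable, so findings with the same severity keep the order in which they were reported.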
## Example Interactions
- "Review this microservice API for security vulnerabilities and performance issues"
- "Analyze this database migration for potential production impact"
- "Assess this React component for accessibility and performance best practices"
- "Review this Kubernetes deployment configuration for security and reliability"
- "Evaluate this authentication implementation for OAuth2 compliance"
- "Analyze this caching strategy for race conditions and data consistency"
- "Review this CI/CD pipeline for security and deployment best practices"
- "Assess this error handling implementation for observability and debugging"


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,49 @@
---
name: comprehensive-review-pr-enhance
description: "You are a PR optimization expert specializing in creating high-quality pull requests that facilitate efficient code reviews. Generate comprehensive PR descriptions, automate review processes, and e..."
risk: unknown
source: community
date_added: "2026-02-27"
---
# Pull Request Enhancement
You are a PR optimization expert specializing in creating high-quality pull requests that facilitate efficient code reviews. Generate comprehensive PR descriptions, automate review processes, and ensure PRs follow best practices for clarity, size, and reviewability.
## Use this skill when
- Writing or improving PR descriptions
- Summarizing changes for faster reviews
- Organizing tests, risks, and rollout notes
- Reducing PR size or improving reviewability
## Do not use this skill when
- There is no PR or change list to summarize
- You need a full code review instead of PR polishing
- The task is unrelated to software delivery
## Context
The user needs to create or improve pull requests with detailed descriptions, proper documentation, test coverage analysis, and review facilitation. Focus on making PRs that are easy to review, well-documented, and include all necessary context.
## Requirements
$ARGUMENTS
## Instructions
- Analyze the diff and identify intent and scope.
- Summarize changes, tests, and risks clearly.
- Highlight breaking changes and rollout notes.
- Add checklists and reviewer guidance.
- If detailed templates are required, open `resources/implementation-playbook.md`.
## Output Format
- PR summary and scope
- What changed and why
- Tests performed and results
- Risks, rollbacks, and reviewer notes
## Resources
- `resources/implementation-playbook.md` for detailed templates and examples.


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,691 @@
# Pull Request Enhancement Implementation Playbook
This file contains detailed patterns, checklists, and code samples referenced by the skill.
## Instructions
### 1. PR Analysis
Analyze the changes and generate insights:
**Change Summary Generator**
```python
import subprocess
import re
from collections import defaultdict


class PRAnalyzer:
    def analyze_changes(self, base_branch='main'):
        """
        Analyze changes between current branch and base
        """
        analysis = {
            'files_changed': self._get_changed_files(base_branch),
            'change_statistics': self._get_change_stats(base_branch),
            'change_categories': self._categorize_changes(base_branch),
            'potential_impacts': self._assess_impacts(base_branch),
            'dependencies_affected': self._check_dependencies(base_branch)
        }
        return analysis

    def _get_changed_files(self, base_branch):
        """Get list of changed files with status and category"""
        cmd = f"git diff --name-status {base_branch}...HEAD"
        result = subprocess.run(cmd.split(), capture_output=True, text=True)
        files = []
        for line in result.stdout.strip().split('\n'):
            if line:
                status, filename = line.split('\t', 1)
                files.append({
                    'filename': filename,
                    'status': self._parse_status(status),
                    'category': self._categorize_file(filename)
                })
        return files

    def _get_change_stats(self, base_branch):
        """Get detailed change statistics"""
        cmd = f"git diff --shortstat {base_branch}...HEAD"
        result = subprocess.run(cmd.split(), capture_output=True, text=True)
        # Parse output like: "10 files changed, 450 insertions(+), 123 deletions(-)"
        stats_pattern = r'(\d+) files? changed(?:, (\d+) insertions?\(\+\))?(?:, (\d+) deletions?\(-\))?'
        match = re.search(stats_pattern, result.stdout)
        if match:
            files, insertions, deletions = match.groups()
            return {
                'files_changed': int(files),
                'insertions': int(insertions or 0),
                'deletions': int(deletions or 0),
                'net_change': int(insertions or 0) - int(deletions or 0)
            }
        return {'files_changed': 0, 'insertions': 0, 'deletions': 0, 'net_change': 0}

    def _categorize_file(self, filename):
        """Categorize file by type"""
        # Order matters: the first matching category wins, so the specific
        # categories come before the broad source extensions. Otherwise '.js'
        # would claim 'package.json' and '.py' would claim 'test_api.py'.
        categories = {
            'test': ['test', 'spec', '.test.', '.spec.'],
            'config': ['config', '.json', '.yml', '.yaml', '.toml'],
            'docs': ['.md', 'README', 'CHANGELOG', '.rst'],
            'styles': ['.css', '.scss', '.less'],
            'build': ['Makefile', 'Dockerfile', '.gradle', 'pom.xml'],
            'source': ['.js', '.ts', '.py', '.java', '.go', '.rs']
        }
        for category, patterns in categories.items():
            if any(pattern in filename for pattern in patterns):
                return category
        return 'other'
```
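The analyzer above calls `_parse_status` without showing it. One possible shape, written here as a standalone function named `parse_status` (a hypothetical name), maps the status letters that `git diff --name-status` prints to readable words. Rename and copy entries carry a similarity score such as `R100`, so only the first character is inspected.

```python
# Status letters printed by `git diff --name-status`.
STATUS_NAMES = {
    "A": "added",
    "C": "copied",
    "D": "deleted",
    "M": "modified",
    "R": "renamed",
    "T": "type changed",
    "U": "unmerged",
}


def parse_status(status: str) -> str:
    """Map a name-status code such as 'M' or 'R100' to a readable word."""
    return STATUS_NAMES.get(status[:1], "unknown")
```

Note that for renames and copies git prints both the old and the new path, separated by a tab, so a caller that splits only once on the tab will see both paths in the filename field.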
### 2. PR Description Generation
Create comprehensive PR descriptions:
**Description Template Generator**
```python
def generate_pr_description(analysis, commits):
    """
    Generate detailed PR description from analysis
    """
    description = f"""
## Summary
{generate_summary(analysis, commits)}
## What Changed
{generate_change_list(analysis)}
## Why These Changes
{extract_why_from_commits(commits)}
## Type of Change
{determine_change_types(analysis)}
## How Has This Been Tested?
{generate_test_section(analysis)}
## Visual Changes
{generate_visual_section(analysis)}
## Performance Impact
{analyze_performance_impact(analysis)}
## Breaking Changes
{identify_breaking_changes(analysis)}
## Dependencies
{list_dependency_changes(analysis)}
## Checklist
{generate_review_checklist(analysis)}
## Additional Notes
{generate_additional_notes(analysis)}
"""
    return description


def generate_summary(analysis, commits):
    """Generate executive summary"""
    stats = analysis['change_statistics']
    # Extract main purpose from commits
    main_purpose = extract_main_purpose(commits)
    summary = f"""
This PR {main_purpose}.
**Impact**: {stats['files_changed']} files changed ({stats['insertions']} additions, {stats['deletions']} deletions)
**Risk Level**: {calculate_risk_level(analysis)}
**Review Time**: ~{estimate_review_time(stats)} minutes
"""
    return summary


def generate_change_list(analysis):
    """Generate categorized change list"""
    changes_by_category = defaultdict(list)
    for file in analysis['files_changed']:
        changes_by_category[file['category']].append(file)
    change_list = ""
    icons = {
        'source': '🔧',
        'test': '✅',
        'docs': '📝',
        'config': '⚙️',
        'styles': '🎨',
        'build': '🏗️',
        'other': '📁'
    }
    for category, files in changes_by_category.items():
        change_list += f"\n### {icons.get(category, '📁')} {category.title()} Changes\n"
        for file in files[:10]:  # Limit to 10 files per category
            change_list += f"- {file['status']}: `{file['filename']}`\n"
        if len(files) > 10:
            change_list += f"- ...and {len(files) - 10} more\n"
    return change_list
```
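`generate_summary` relies on an `estimate_review_time` helper that the playbook does not define. The sketch below is one way it might look. The rates (20 changed lines per minute, one extra minute per file, a five-minute floor) are illustrative assumptions, not measured values, and should be tuned per team.

```python
def estimate_review_time(stats: dict,
                         lines_per_minute: int = 20,
                         minutes_per_file: int = 1,
                         minimum_minutes: int = 5) -> int:
    """Rough review-time estimate in minutes from change statistics.

    Expects the dict shape produced by the change statistics step:
    'files_changed', 'insertions', and 'deletions'.
    """
    changed_lines = stats["insertions"] + stats["deletions"]
    minutes = (changed_lines / lines_per_minute
               + stats["files_changed"] * minutes_per_file)
    return max(minimum_minutes, round(minutes))
```

Counting deletions as well as insertions is deliberate here: removed code still has to be read to confirm nothing depended on it.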
### 3. Review Checklist Generation
Create automated review checklists:
**Smart Checklist Generator**
```python
def generate_review_checklist(analysis):
    """
    Generate context-aware review checklist
    """
    checklist = ["## Review Checklist\n"]
    # General items
    general_items = [
        "Code follows project style guidelines",
        "Self-review completed",
        "Comments added for complex logic",
        "No debugging code left",
        "No sensitive data exposed"
    ]
    # Add general items
    checklist.append("### General")
    for item in general_items:
        checklist.append(f"- [ ] {item}")
    # File-specific checks
    file_types = {file['category'] for file in analysis['files_changed']}
    if 'source' in file_types:
        checklist.append("\n### Code Quality")
        checklist.extend([
            "- [ ] No code duplication",
            "- [ ] Functions are focused and small",
            "- [ ] Variable names are descriptive",
            "- [ ] Error handling is comprehensive",
            "- [ ] No performance bottlenecks introduced"
        ])
    if 'test' in file_types:
        checklist.append("\n### Testing")
        checklist.extend([
            "- [ ] All new code is covered by tests",
            "- [ ] Tests are meaningful and not just for coverage",
            "- [ ] Edge cases are tested",
            "- [ ] Tests follow AAA pattern (Arrange, Act, Assert)",
            "- [ ] No flaky tests introduced"
        ])
    if 'config' in file_types:
        checklist.append("\n### Configuration")
        checklist.extend([
            "- [ ] No hardcoded values",
            "- [ ] Environment variables documented",
            "- [ ] Backwards compatibility maintained",
            "- [ ] Security implications reviewed",
            "- [ ] Default values are sensible"
        ])
    if 'docs' in file_types:
        checklist.append("\n### Documentation")
        checklist.extend([
            "- [ ] Documentation is clear and accurate",
            "- [ ] Examples are provided where helpful",
            "- [ ] API changes are documented",
            "- [ ] README updated if necessary",
            "- [ ] Changelog updated"
        ])
    # Security checks
    if has_security_implications(analysis):
        checklist.append("\n### Security")
        checklist.extend([
            "- [ ] No SQL injection vulnerabilities",
            "- [ ] Input validation implemented",
            "- [ ] Authentication/authorization correct",
            "- [ ] No sensitive data in logs",
            "- [ ] Dependencies are secure"
        ])
    return '\n'.join(checklist)
```
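The checklist generator calls `has_security_implications`, which is left undefined above. A minimal sketch, assuming a filename keyword match is an acceptable first approximation; the keyword list is a hypothetical starting point, not an exhaustive one.

```python
# Hypothetical keywords that suggest a file touches security-sensitive code.
SECURITY_KEYWORDS = ("auth", "login", "password", "secret", "token", "crypto", "permission")


def has_security_implications(analysis: dict) -> bool:
    """Return True when any changed filename contains a security keyword."""
    return any(
        keyword in file["filename"].lower()
        for file in analysis["files_changed"]
        for keyword in SECURITY_KEYWORDS
    )
```

Substring matching is intentionally loose and will produce false positives (for example, `author.py` contains `auth`). For a checklist that only adds reminder items, erring on the side of inclusion is likely the safer trade-off.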
### 4. Code Review Automation
Automate common review tasks:
**Automated Review Bot**
```python
class ReviewBot:
    def perform_automated_checks(self, pr_diff):
        """
        Perform automated code review checks.
        pr_diff maps each filename to its diff text.
        """
        findings = []
        # Check for common issues. Only two checks are shown below;
        # the others follow the same shape.
        checks = [
            self._check_console_logs,
            self._check_commented_code,
            self._check_large_functions,
            self._check_todo_comments,
            self._check_hardcoded_values,
            self._check_missing_error_handling,
            self._check_security_issues
        ]
        for check in checks:
            findings.extend(check(pr_diff))
        return findings

    def _check_console_logs(self, diff):
        """Check for console.log statements on added lines"""
        findings = []
        # Anchor at line start so only added ("+") diff lines match
        pattern = r'^\+.*console\.(log|debug|info|warn|error)'
        for file, content in diff.items():
            matches = re.finditer(pattern, content, re.MULTILINE)
            for match in matches:
                findings.append({
                    'type': 'warning',
                    'file': file,
                    'line': self._get_line_number(match, content),
                    'message': 'Console statement found - remove before merging',
                    'suggestion': 'Use proper logging framework instead'
                })
        return findings

    def _check_large_functions(self, diff):
        """Check for functions that are too large"""
        findings = []
        # Simple heuristic: count lines between function start and end
        for file, content in diff.items():
            if file.endswith(('.js', '.ts', '.py')):
                functions = self._extract_functions(content)
                for func in functions:
                    if func['lines'] > 50:
                        findings.append({
                            'type': 'suggestion',
                            'file': file,
                            'line': func['start_line'],
                            'message': f"Function '{func['name']}' is {func['lines']} lines long",
                            'suggestion': 'Consider breaking into smaller functions'
                        })
        return findings
```
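The `_get_line_number` helper used by `_check_console_logs` is not shown above. A minimal sketch of one way to write it, as a standalone function with the hypothetical name `get_line_number`: count the newlines that precede the match.

```python
import re


def get_line_number(match: re.Match, content: str) -> int:
    """Return the 1-based line of a regex match within the searched text."""
    return content.count("\n", 0, match.start()) + 1
```

The result is a position within the diff text that was searched, not within the source file. Mapping it to a file line would additionally require parsing the `@@` hunk headers.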
### 5. PR Size Optimization
Help split large PRs:
**PR Splitter Suggestions**
````python
def suggest_pr_splits(analysis):
    """
    Suggest how to split large PRs
    """
    stats = analysis['change_statistics']
    # Check if PR is too large
    if stats['files_changed'] > 20 or stats['insertions'] + stats['deletions'] > 1000:
        suggestions = analyze_split_opportunities(analysis)
        return f"""
## ⚠️ Large PR Detected
This PR changes {stats['files_changed']} files with {stats['insertions'] + stats['deletions']} total changes.
Large PRs are harder to review and more likely to introduce bugs.
### Suggested Splits:
{format_split_suggestions(suggestions)}
### How to Split:
1. Create feature branch from current branch
2. Cherry-pick commits for first logical unit
3. Create PR for first unit
4. Repeat for remaining units
```bash
# Example split workflow
git checkout -b feature/part-1
git cherry-pick <commit-hashes-for-part-1>
git push origin feature/part-1
# Create PR for part 1
git checkout -b feature/part-2
git cherry-pick <commit-hashes-for-part-2>
git push origin feature/part-2
# Create PR for part 2
```
"""
    return ""


def analyze_split_opportunities(analysis):
    """Find logical units for splitting"""
    suggestions = []
    # Group by feature areas
    feature_groups = defaultdict(list)
    for file in analysis['files_changed']:
        feature = extract_feature_area(file['filename'])
        feature_groups[feature].append(file)
    # Suggest splits
    for feature, files in feature_groups.items():
        if len(files) >= 5:
            suggestions.append({
                'name': f"{feature} changes",
                'files': files,
                'reason': f"Isolated changes to {feature} feature"
            })
    return suggestions
````
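`analyze_split_opportunities` groups files with `extract_feature_area`, which the playbook leaves undefined. The following sketch assumes the feature area can be read from the directory layout. The container directory names (`src`, `lib`, `app`) are assumptions and should be adapted to the repository.

```python
from pathlib import PurePosixPath

# Assumed top-level directories that only wrap the real feature folders.
CONTAINER_DIRS = ("src", "lib", "app")


def extract_feature_area(filename: str) -> str:
    """Derive a coarse feature name from a repository-relative path."""
    parts = PurePosixPath(filename).parts
    if len(parts) < 2:
        return "root"  # file sits at the repository root
    if parts[0] in CONTAINER_DIRS and len(parts) > 2:
        return parts[1]  # e.g. src/payments/service.py -> payments
    return parts[0]
```

`PurePosixPath` is used because git reports paths with forward slashes on every platform.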
### 6. Visual Diff Enhancement
Generate visual representations:
**Mermaid Diagram Generator**
````python
def generate_architecture_diff(analysis):
    """
    Generate diagram showing architectural changes
    """
    if has_architectural_changes(analysis):
        return f"""
## Architecture Changes
```mermaid
graph LR
    subgraph "Before"
        A1[Component A] --> B1[Component B]
        B1 --> C1[Database]
    end
    subgraph "After"
        A2[Component A] --> B2[Component B]
        B2 --> C2[Database]
        B2 --> D2[New Cache Layer]
        A2 --> E2[New API Gateway]
    end
    style D2 fill:#90EE90
    style E2 fill:#90EE90
```
### Key Changes:
1. Added caching layer for performance
2. Introduced API gateway for better routing
3. Refactored component communication
"""
    return ""
````
### 7. Test Coverage Report
Include test coverage analysis:
**Coverage Report Generator**
```python
def generate_coverage_report(base_branch='main'):
    """
    Generate test coverage comparison
    """
    # Get coverage before and after
    before_coverage = get_coverage_for_branch(base_branch)
    after_coverage = get_coverage_for_branch('HEAD')
    # Dicts cannot be subtracted directly, so diff each metric separately
    coverage_diff = {
        metric: after_coverage[metric] - before_coverage[metric]
        for metric in ('lines', 'functions', 'branches')
    }
    report = f"""
## Test Coverage
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Lines | {before_coverage['lines']:.1f}% | {after_coverage['lines']:.1f}% | {format_diff(coverage_diff['lines'])} |
| Functions | {before_coverage['functions']:.1f}% | {after_coverage['functions']:.1f}% | {format_diff(coverage_diff['functions'])} |
| Branches | {before_coverage['branches']:.1f}% | {after_coverage['branches']:.1f}% | {format_diff(coverage_diff['branches'])} |
### Uncovered Files
"""
    # List files with low coverage
    for file in get_low_coverage_files():
        report += f"- `{file['name']}`: {file['coverage']:.1f}% coverage\n"
    return report


def format_diff(value):
    """Format coverage difference"""
    if value > 0:
        return f"<span style='color: green'>+{value:.1f}%</span> ✅"
    elif value < 0:
        return f"<span style='color: red'>{value:.1f}%</span> ⚠️"
    else:
        return "No change"
```
### 8. Risk Assessment
Evaluate PR risk:
**Risk Calculator**
```python
def calculate_pr_risk(analysis):
    """
    Calculate risk score for PR
    """
    risk_factors = {
        'size': calculate_size_risk(analysis),
        'complexity': calculate_complexity_risk(analysis),
        'test_coverage': calculate_test_risk(analysis),
        'dependencies': calculate_dependency_risk(analysis),
        'security': calculate_security_risk(analysis)
    }
    overall_risk = sum(risk_factors.values()) / len(risk_factors)
    risk_report = f"""
## Risk Assessment
**Overall Risk Level**: {get_risk_level(overall_risk)} ({overall_risk:.1f}/10)
### Risk Factors
| Factor | Score | Details |
|--------|-------|---------|
| Size | {risk_factors['size']:.1f}/10 | {get_size_details(analysis)} |
| Complexity | {risk_factors['complexity']:.1f}/10 | {get_complexity_details(analysis)} |
| Test Coverage | {risk_factors['test_coverage']:.1f}/10 | {get_test_details(analysis)} |
| Dependencies | {risk_factors['dependencies']:.1f}/10 | {get_dependency_details(analysis)} |
| Security | {risk_factors['security']:.1f}/10 | {get_security_details(analysis)} |
### Mitigation Strategies
{generate_mitigation_strategies(risk_factors)}
"""
    return risk_report


def get_risk_level(score):
    """Convert score to risk level"""
    if score < 3:
        return "🟢 Low"
    elif score < 6:
        return "🟡 Medium"
    elif score < 8:
        return "🟠 High"
    else:
        return "🔴 Critical"
```
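Each `calculate_*_risk` helper is expected to return a score on the same 0 to 10 scale, but none is defined above. As an illustration, here is a sketch of `calculate_size_risk` that scales linearly with the number of changed lines. The 1000-line ceiling mirrors the large-PR threshold used in the splitting section; the linear scaling itself is an assumption.

```python
def calculate_size_risk(analysis: dict, lines_for_max_risk: int = 1000) -> float:
    """Score PR size from 0 (empty) to 10 (at or above the line ceiling)."""
    stats = analysis["change_statistics"]
    changed_lines = stats["insertions"] + stats["deletions"]
    return min(10.0, 10.0 * changed_lines / lines_for_max_risk)
```

Capping at 10 keeps one very large PR from dominating the average that `calculate_pr_risk` computes across all factors.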
### 9. PR Templates
Generate context-specific templates:
```python
def generate_pr_template(pr_type, analysis):
    """
    Generate PR template based on type
    """
    templates = {
        'feature': f"""
## Feature: {extract_feature_name(analysis)}
### Description
{generate_feature_description(analysis)}
### User Story
As a [user type]
I want [feature]
So that [benefit]
### Acceptance Criteria
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
### Demo
[Link to demo or screenshots]
### Technical Implementation
{generate_technical_summary(analysis)}
### Testing Strategy
{generate_test_strategy(analysis)}
""",
        'bugfix': f"""
## Bug Fix: {extract_bug_description(analysis)}
### Issue
- **Reported in**: #[issue-number]
- **Severity**: {determine_severity(analysis)}
- **Affected versions**: {get_affected_versions(analysis)}
### Root Cause
{analyze_root_cause(analysis)}
### Solution
{describe_solution(analysis)}
### Testing
- [ ] Bug is reproducible before fix
- [ ] Bug is resolved after fix
- [ ] No regressions introduced
- [ ] Edge cases tested
### Verification Steps
1. Step to reproduce original issue
2. Apply this fix
3. Verify issue is resolved
""",
        'refactor': f"""
## Refactoring: {extract_refactor_scope(analysis)}
### Motivation
{describe_refactor_motivation(analysis)}
### Changes Made
{list_refactor_changes(analysis)}
### Benefits
- Improved {list_improvements(analysis)}
- Reduced {list_reductions(analysis)}
### Compatibility
- [ ] No breaking changes
- [ ] API remains unchanged
- [ ] Performance maintained or improved
### Metrics
| Metric | Before | After |
|--------|--------|-------|
| Complexity | X | Y |
| Test Coverage | X% | Y% |
| Performance | Xms | Yms |
"""
    }
    return templates.get(pr_type, templates['feature'])
```
### 10. Review Response Templates
Help with review responses:
```python
review_response_templates = {
'acknowledge_feedback': """
Thank you for the thorough review! I'll address these points.
""",
'explain_decision': """
Great question! I chose this approach because:
1. [Reason 1]
2. [Reason 2]
Alternative approaches considered:
- [Alternative 1]: [Why not chosen]
- [Alternative 2]: [Why not chosen]
Happy to discuss further if you have concerns.
""",
'request_clarification': """
Thanks for the feedback. Could you clarify what you mean by [specific point]?
I want to make sure I understand your concern correctly before making changes.
""",
'disagree_respectfully': """
I appreciate your perspective on this. I have a slightly different view:
[Your reasoning]
However, I'm open to discussing this further. What do you think about [compromise/middle ground]?
""",
'commit_to_change': """
Good catch! I'll update this to [specific change].
This should address [concern] while maintaining [other requirement].
"""
}
```
## Output Format
1. **PR Summary**: Executive summary with key metrics
2. **Detailed Description**: Comprehensive PR description
3. **Review Checklist**: Context-aware review items
4. **Risk Assessment**: Risk analysis with mitigation strategies
5. **Test Coverage**: Before/after coverage comparison
6. **Visual Aids**: Diagrams and visual diffs where applicable
7. **Size Recommendations**: Suggestions for splitting large PRs
8. **Review Automation**: Automated checks and findings
Focus on creating PRs that are a pleasure to review, with all the context and documentation needed for an efficient code review process.


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

skills/create-pr/SKILL.md

@ -0,0 +1,12 @@
---
name: create-pr
description: Alias for sentry-skills:pr-writer. Use when users explicitly ask for "create-pr" or reference the legacy skill name. Redirects to the canonical PR writing workflow.
---
# Alias: create-pr
This skill name is kept for compatibility.
Use `sentry-skills:pr-writer` as the canonical skill for creating and editing pull requests.
If invoked via `create-pr`, run the same workflow and conventions documented in `sentry-skills:pr-writer`.


@ -0,0 +1,119 @@
---
name: creating-grafana-dashboard
description: Use when adding a dashboard to Zoe's Grafana monitoring stack — whether importing from grafana.com or creating from scratch — including datasource UID patching, GitOps deployment via the grafana-dashboards repo, and verification.
---
# Creating a Grafana Dashboard
## Overview
Dashboards are delivered via GitOps from `git@git.ctz.fyi:zoe/grafana-dashboards.git`. Push to main → Woodpecker CI auto-deploys to Grafana at `grafana.monitoring.ctz.fyi`. The critical gotcha: any downloaded dashboard will have wrong datasource UIDs and must be patched before committing.
## Stack Reference
| Service | URL / Context |
|---------|--------------|
| Grafana | grafana.monitoring.ctz.fyi (v11.6.1, Postgres backend) |
| Cluster | k3s `monitoring` context |
| Mimir (metrics) | datasource UID: `mimir`, type: `prometheus` |
| Loki (logs) | datasource UID: `loki`, type: `loki` |
| Tempo (traces) | datasource UID: `tempo`, type: `tempo` |
| Pyroscope (profiling) | datasource UID: `pyroscope`, type: `grafana-pyroscope-datasource` |
| Grafana API key | `secret/production/grafana/api-key` in OpenBao |
## Datasource UID Mapping (ALWAYS CHECK THIS)
| What the dashboard JSON says | What to set |
|-----------------------------|-------------|
| `type: prometheus`, any UID | `uid: "mimir"` |
| `type: loki`, any UID | `uid: "loki"` |
| `type: tempo`, any UID | `uid: "tempo"` |
| `type: grafana-pyroscope-datasource`, any UID | `uid: "pyroscope"` |
| `${DS_PROMETHEUS}` template variable | set default to `mimir` |
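
The mapping above can also be applied programmatically. The following is a minimal sketch, not the repo's actual `update-dashboards.sh` logic: `UID_BY_TYPE`, `patch_datasources`, and the sample dashboard are hypothetical names and data. It walks the whole JSON tree, so panels nested inside rows are covered too. It does not set the default of a `${DS_PROMETHEUS}` template variable.

```python
import json

# Mapping taken from the table above: datasource type -> UID in this stack.
UID_BY_TYPE = {
    "prometheus": "mimir",
    "loki": "loki",
    "tempo": "tempo",
    "grafana-pyroscope-datasource": "pyroscope",
}


def patch_datasources(node):
    """Recursively rewrite {"type": ..., "uid": ...} datasource refs in place."""
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and ds.get("type") in UID_BY_TYPE:
            ds["uid"] = UID_BY_TYPE[ds["type"]]
        for value in node.values():
            patch_datasources(value)
    elif isinstance(node, list):
        for item in node:
            patch_datasources(item)
    return node


# Hypothetical minimal dashboard, including a panel nested inside a row.
dashboard = {
    "panels": [
        {
            "datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
            "targets": [{"datasource": {"type": "prometheus", "uid": "abc123"}}],
        },
        {
            "type": "row",
            "panels": [{"datasource": {"type": "loki", "uid": "P8E80F9AEF21F6940"}}],
        },
    ]
}
patched = patch_datasources(dashboard)
print(json.dumps(patched, indent=2))
```

Datasource references stored as plain strings (an older dashboard format) are left untouched by this sketch and would still need manual review.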
## Repo Structure
```
grafana-dashboards/
  dashboards/
    cilium/                 # Cilium CNI dashboards
    lgtm/                   # Mimir, Loki, Tempo, Pyroscope dashboards
    infra/                  # Node, k8s cluster dashboards
    apps/                   # Application-specific dashboards
  scripts/
    sources.sh              # upstream dashboard sources list
    update-dashboards.sh    # pull from upstream + patch UIDs
    push-to-grafana.sh      # push to live Grafana via API
  .woodpecker.yml
```
## Path A: Import from grafana.com
```bash
# 1. Download (DASH is quoted because <folder> and <name> are placeholders)
DASH="dashboards/<folder>/<name>.json"
curl -fsSL -o "$DASH" \
  "https://grafana.com/api/dashboards/<id>/revisions/latest/download"
# 2. Patch datasource UIDs in that same file (REQUIRED; panels will show "No data" otherwise)
jq '
  (.templating.list[] | select(.type == "datasource") | .query) = "prometheus" |
  (.panels[].datasource | select(.type == "prometheus") | .uid) = "mimir" |
  (.panels[].targets[]? | .datasource | select(.type == "prometheus") | .uid) = "mimir"
' "$DASH" > "$DASH.tmp" && mv "$DASH.tmp" "$DASH"
# Repeat for loki/tempo/pyroscope as needed
# 3. Set a unique explicit UID
jq '.uid = "descriptive-slug-here"' "$DASH" > "$DASH.tmp" && mv "$DASH.tmp" "$DASH"
# 4. Check for UID collisions before committing
jq -r '.uid' dashboards/**/*.json | sort | uniq -d # should output nothing
# 5. Add to sources.sh for future updates, then commit + push
```
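The collision check in step 4 can also be sketched in Python, which avoids depending on how the shell expands `**`. This is an illustrative sketch rather than part of the repo's scripts: `duplicate_uids` and the throwaway sample dashboards are hypothetical.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def duplicate_uids(root):
    """Return dashboard UIDs that appear in more than one JSON file under root."""
    counts = Counter()
    for path in Path(root).rglob("*.json"):
        uid = json.loads(path.read_text(encoding="utf-8")).get("uid")
        if uid:
            counts[uid] += 1
    return sorted(uid for uid, n in counts.items() if n > 1)


# Demo against a throwaway directory holding hypothetical dashboards.
with tempfile.TemporaryDirectory() as tmp:
    for folder, name, uid in [
        ("infra", "nodes", "node-overview"),
        ("apps", "api", "api-latency"),
        ("apps", "api-copy", "api-latency"),
    ]:
        target = Path(tmp) / folder
        target.mkdir(exist_ok=True)
        (target / f"{name}.json").write_text(json.dumps({"uid": uid}), encoding="utf-8")
    dupes = duplicate_uids(tmp)

print(dupes)
```

In the real repo the call would be `duplicate_uids("dashboards")`, and an empty list means there are no collisions.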
## Path B: Create from scratch in UI
1. Build panels at `grafana.monitoring.ctz.fyi`
2. Export: Dashboard → Share → Export → Save to file
3. Save to `dashboards/<folder>/<name>.json`
4. Verify `.uid` is set to a unique descriptive slug
5. Commit and push
For new app dashboards: check what metrics are exposed first.
```bash
# See what labels Alloy exposes for a service
kubectl --context monitoring exec -n monitoring ds/alloy -- alloy targets
# Or port-forward to the app's /metrics endpoint
# (port-forward blocks, so run it in a second terminal)
kubectl port-forward -n <namespace> svc/<app> 9090:9090
curl -s localhost:9090/metrics | grep -v '^#' | head -50
```
## Deployment
Push to main triggers Woodpecker automatically. To deploy manually:
```bash
cd grafana-dashboards
# export is required so the script's child process can read the variable
export GRAFANA_API_KEY=$(bao kv get -field=api-key secret/production/grafana/api-key)
./scripts/push-to-grafana.sh
```
Check pipeline status at `ci.ctz.fyi` → grafana-dashboards repo.
## Verification
- Go to `grafana.monitoring.ctz.fyi` → Dashboards → find the dashboard
- All panels should show data (no "No data" panels)
- If "No data": datasource UIDs weren't patched — re-run jq patch
## Common Issues
| Symptom | Cause | Fix |
|---------|-------|-----|
| "No data" on panels | Datasource UID not patched | Re-run jq patch for that datasource type |
| Dashboard import fails | Duplicate UID | `jq -r '.uid' dashboards/**/*.json \| sort \| uniq -d` then rename |
| Wrong data in panels | Wrong label matchers | Check `alloy targets` for actual label names |
| UID collision silently replaces existing dashboard | Forgot to set explicit UID | Always set `.uid` to unique slug before commit |


@ -0,0 +1,316 @@
---
name: deploying-new-k8s-service
description: Use when deploying a new service to Zoe's homelab k3s cluster (ansiblestack). Covers scaffolding Helm charts, writing ArgoCD app manifests, wiring ExternalSecrets via OpenBao, configuring Traefik IngressRoutes with cert-manager TLS, and watching GitOps sync to completion.
---
# Deploying a New k3s Service (ansiblestack)
## Overview
All services deploy via GitOps: Helm chart in `ansiblestack` repo → ArgoCD syncs → k3s cluster. Never `kubectl apply` workload manifests directly. Always commit and let ArgoCD drive.
## Cluster Quick Reference
| Thing | Value |
|---|---|
| Cluster | k3s at `10.0.6.10:6443` |
| GitOps repo | `git@git.ctz.fyi:zoe/ansiblestack.git` (GitHub mirror: `ZoesDev/ansiblestack`) |
| ArgoCD | `argocd.ctz.fyi` |
| Secrets | External Secrets Operator → OpenBao (`bao.ctz.fyi`); ClusterSecretStore: `openbao` |
| Ingress | Traefik IngressRoute CRDs |
| TLS | cert-manager, ClusterIssuer: `letsencrypt-production` |
| DNS | external-dns via annotation |
| Registry | Harbor at `registry.ctz.fyi`, project `library` |
| Storage | `ssd` (NFS-SSD, preferred for stateful), `local-path` (node-local) |
| Hostname convention | Public: `<svc>.ctz.fyi` · Internal: `<svc>.i.ctz.fyi` |
| OpenBao KV path | `secret/production/<namespace>/<secret-name>` |
---
## Workflow
### 1. Research the app
Before touching any file:
- Read the upstream GitHub repo or Docker Hub page
- Identify: **ports**, **required env vars**, **config file mounts**, **volume paths**, **default user/UID**
- Wrong env vars = silent failure. Don't skip this.
### 2. Check existing charts for patterns
```
helm/charts/
  jellyfin/    ← stateful reference
  tandoor/     ← stateful with DB reference
  crucix/      ← simple stateless reference
  convertx/    ← simple stateless reference
```
Match the pattern to your app type before scaffolding.
### 3. Scaffold chart files
Path: `helm/charts/<name>/`
```
Chart.yaml
values.yaml
templates/
  _helpers.tpl
  deployment.yaml
  service.yaml
  ingressroute.yaml
  external-secrets.yaml   # only if secrets needed
```
#### Chart.yaml
```yaml
apiVersion: v2
name: <name>
description: <one-liner>
version: 0.1.0
appVersion: "latest"
```
#### values.yaml (minimum)
```yaml
image:
  repository: registry.ctz.fyi/library/<name> # or upstream image
  tag: latest
  pullPolicy: IfNotPresent
service:
  hostname: <name>.ctz.fyi
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 512Mi
# persistence: # include for stateful apps
#   enabled: true
#   storageClass: ssd
#   size: 10Gi
#   mountPath: /data
```
#### templates/_helpers.tpl
```
{{- define "<name>.fullname" -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- end }}
```
#### templates/deployment.yaml
Standard Deployment. Key points:
- `namespace: {{ .Release.Namespace }}`
- Use `{{ include "<name>.fullname" . }}` for all name references
- Mount secrets from ExternalSecret-created Secret if needed
- For stateful: use `PersistentVolumeClaim` via `volumes` + `volumeMounts`, storageClass `ssd`
#### templates/service.yaml
```yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "<name>.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  type: ClusterIP
  selector:
    app: {{ include "<name>.fullname" . }}
  ports:
    - port: <port>
      targetPort: <port>
```
#### templates/ingressroute.yaml
**CRITICAL: You need BOTH objects. Do not omit either.**
```yaml
# 1. Traefik IngressRoute: actual routing
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: {{ include "<name>.fullname" . }}
  namespace: {{ .Release.Namespace }}
  annotations:
    external-dns.alpha.kubernetes.io/hostname: {{ .Values.service.hostname }}
spec:
  entryPoints: [websecure]
  routes:
    - match: Host(`{{ .Values.service.hostname }}`)
      kind: Rule
      services:
        - name: {{ include "<name>.fullname" . }}
          port: <port>
  tls:
    secretName: {{ include "<name>.fullname" . }}-tls
---
# 2. Companion Ingress: cert-manager TLS + external-dns ONLY (Traefik ignores this)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "<name>.fullname" . }}-cm
  namespace: {{ .Release.Namespace }}
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    external-dns.alpha.kubernetes.io/hostname: {{ .Values.service.hostname }}
    # Add this only for Pangolin/externally-tunneled services:
    # external-dns.alpha.kubernetes.io/target: "external"
spec:
  ingressClassName: traefik
  rules:
    - host: {{ .Values.service.hostname }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: placeholder
                port:
                  number: 80
  tls:
    - hosts: [{{ .Values.service.hostname }}]
      secretName: {{ include "<name>.fullname" . }}-tls
```
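Because a missing companion Ingress only shows up later as a certificate that never issues, it can help to lint the rendered manifests before committing. The sketch below is an assumption-laden illustration: `check_ingress_pair` is a hypothetical helper, it expects manifests already parsed into plain dicts (for example from `helm template` output loaded with a YAML parser), and the `demo` service is invented.

```python
def check_ingress_pair(manifests):
    """Return a list of problems for a chart's rendered manifests (plain dicts)."""
    routes = [m for m in manifests if m.get("kind") == "IngressRoute"]
    ingresses = [m for m in manifests if m.get("kind") == "Ingress"]
    problems = []
    if not routes:
        problems.append("missing IngressRoute")
    if not ingresses:
        problems.append("missing companion Ingress")

    # TLS secret names referenced by each kind of object (None entries dropped).
    route_secrets = {
        r.get("spec", {}).get("tls", {}).get("secretName") for r in routes
    } - {None}
    ingress_secrets = {
        t.get("secretName")
        for i in ingresses
        for t in i.get("spec", {}).get("tls", [])
    } - {None}
    if routes and ingresses and not (route_secrets & ingress_secrets):
        problems.append("IngressRoute and Ingress do not share a TLS secretName")

    for i in ingresses:
        annotations = i.get("metadata", {}).get("annotations", {})
        if "cert-manager.io/cluster-issuer" not in annotations:
            problems.append("companion Ingress lacks cert-manager.io/cluster-issuer")
    return problems


# Hypothetical rendered output for a service called "demo".
route = {"kind": "IngressRoute", "spec": {"tls": {"secretName": "demo-tls"}}}
ingress = {
    "kind": "Ingress",
    "metadata": {
        "annotations": {"cert-manager.io/cluster-issuer": "letsencrypt-production"}
    },
    "spec": {"tls": [{"hosts": ["demo.ctz.fyi"], "secretName": "demo-tls"}]},
}
ok = check_ingress_pair([route, ingress])
broken = check_ingress_pair([route])
print(ok)
print(broken)
```

An empty list means both objects are present and agree on the TLS secret. The check deliberately ignores the placeholder backend, since Traefik does not route through the companion Ingress.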
#### templates/external-secrets.yaml (only if secrets needed)
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: {{ include "<name>.fullname" . }}-secret
  namespace: {{ .Release.Namespace }}
  annotations:
    argocd.argoproj.io/sync-wave: "-1" # ← REQUIRED: must exist before Deployment
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: openbao
    kind: ClusterSecretStore
  target:
    name: {{ include "<name>.fullname" . }}-secret
    creationPolicy: Owner
  data:
    - secretKey: <key>
      remoteRef:
        key: secret/production/{{ .Release.Namespace }}/{{ include "<name>.fullname" . }}
        property: <key>
```
### 4. Write ArgoCD app manifest
Path: `helm/argocd/<name>-app.yaml`
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: <name>
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "10"
spec:
  project: default
  source:
    repoURL: https://git.ctz.fyi/zoe/ansiblestack
    targetRevision: main
    path: helm/charts/<name>
    helm:
      valueFiles: [values.yaml]
  destination:
    server: https://kubernetes.default.svc
    namespace: <name>
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions: [CreateNamespace=true]
```
### 5. Write secrets to OpenBao (if needed)
```bash
bao kv put secret/production/<namespace>/<name> \
  key1=value1 \
  key2=value2
```
Do this **before** applying the ArgoCD app. ExternalSecret will pull on first sync.
### 6. Commit and push
```bash
cd ansiblestack
git add helm/charts/<name>/ helm/argocd/<name>-app.yaml
git commit -m "feat: add <name> service"
git push
```
### 7. Apply the ArgoCD Application
```bash
kubectl apply -f helm/argocd/<name>-app.yaml
```
ArgoCD picks up the app and begins syncing.
### 8. Verify
```bash
# Watch sync status
kubectl get applications -n argocd <name>
# Check pods
kubectl get pods -n <name>
# Check logs
kubectl logs -n <name> -l app=<name>
# Smoke test
curl -I https://<name>.ctz.fyi
```
Or check the ArgoCD UI at `argocd.ctz.fyi`.
---
## Pangolin (external tunnel) services
Add these to the IngressRoute metadata annotations:
```yaml
annotations:
  pangolin.fossorial.io/enabled: "true"
  pangolin.fossorial.io/target-port: "<port>"
```
And add to the companion Ingress:
```yaml
external-dns.alpha.kubernetes.io/target: "external"
```
---
## Common Gotchas
| Gotcha | Fix |
|---|---|
| Deployment crashes on startup, missing secret | `sync-wave: "-1"` on ExternalSecret is required — it must exist before Deployment syncs |
| TLS cert never issues | Companion Ingress is missing — cert-manager needs it even though Traefik doesn't route through it |
| Service unreachable despite pod running | Check env vars against upstream docs; wrong vars often cause silent failure at startup |
| PVC stuck in Pending | Use `ssd` storageClass for NFS-backed volumes; `local-path` won't schedule if node is wrong |
| Harbor pull fails | Private Harbor projects need `imagePullSecrets` on the Deployment |
| DNS not registering | Check `external-dns.alpha.kubernetes.io/hostname` annotation is on both IngressRoute and companion Ingress |
| StatefulSet data not persisting | Use `volumeClaimTemplates` in StatefulSet spec, not a standalone PVC manifest |


@ -0,0 +1,172 @@
---
name: designing-alerts
description: Use when creating, reviewing, or debugging Prometheus/Grafana alert rules - when writing PromQL for alerts, choosing thresholds, deciding alert severity, writing PrometheusRule CRDs, or evaluating whether something should be an alert at all.
---
# Designing Alerts
## Overview
Bad alerts are worse than no alerts — they cause alert fatigue and get ignored.
Every alert must be actionable, symptom-based, and backed by real threshold data.
**Stack:** Mimir (datasource UID `mimir`) · Grafana at `grafana.monitoring.ctz.fyi` · Grafana alerting · PrometheusRule CRDs
## Cardinal Rules
1. **Actionable or bust**: if you can't do something about it right now, it's a dashboard, not an alert
2. **Symptoms, not causes**: "users can't reach service" > "CPU is high" > "pod restarted"
3. **Rates, not raw values**: `rate(errors[5m]) > 0.01`, not `errors_total > 100`
4. **Always add `for:`**: minimum 2-5 minutes; eliminates transient spikes
5. **Every alert needs a runbook**: `annotations.runbook_url` or at minimum a useful `description`
6. **Test your thresholds**: check p99 of historical data in Grafana Explore before picking a number
## Severity Levels
| Severity | Meaning | Response |
|---|---|---|
| `critical` | User-facing impact, wake someone up | Immediate |
| `warning` | Degraded but not down | Investigate within hours |
| `info` | FYI, no action required | Prefer dashboards instead |
## Workflow
```
1. Identify failure modes that matter for this service
2. Find the right metric (check dashboards, Explore, service docs)
3. Write PromQL — test in Grafana Explore using historical data
4. Pick threshold from p99 of normal values (not intuition)
5. Set for: duration (never < 2m)
6. Write description: what broke + current value + what to do first
7. Add runbook_url or BookStack link
8. Deploy as PrometheusRule CRD (preferred) or via Grafana UI
9. Verify alert appears, fires, and resolves correctly
```
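Step 4 of the workflow (pick the threshold from the p99 of normal values) can be sketched offline once historical samples are exported from Grafana Explore. This is an illustrative sketch: `p99`, `suggest_threshold`, the `headroom` margin, and the sample data are all assumptions, not values from the workflow.

```python
import statistics


def p99(samples):
    """99th percentile of observed samples (inclusive method, no extrapolation)."""
    return statistics.quantiles(samples, n=100, method="inclusive")[98]


def suggest_threshold(samples, headroom=1.2):
    """Threshold = p99 of normal behaviour times a safety margin.

    `headroom` is an illustrative assumption; tune it per alert.
    """
    return p99(samples) * headroom


# Hypothetical error ratios sampled during normal operation.
normal_error_ratios = [0.001 * (i % 10) for i in range(200)]
baseline = p99(normal_error_ratios)
threshold = suggest_threshold(normal_error_ratios)
print(f"p99={baseline:.4f} threshold={threshold:.4f}")
```

A threshold at or below the p99 of normal traffic would fire during ordinary operation, which is the alert fatigue the overview warns about.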
## PrometheusRule CRD Pattern
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <service>-alerts
  namespace: <namespace>
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: <service>.rules
      interval: 60s
      rules:
        - alert: ServiceDown
          expr: up{job="<service>"} == 0
          for: 5m
          labels:
            severity: critical
            team: infra
          annotations:
            summary: "{{ $labels.instance }} is down"
            description: "Service {{ $labels.job }} on {{ $labels.instance }} has been down > 5m. Check pod logs and events."
            runbook_url: "https://wiki.ctz.fyi/books/ansiblestack/page/runbook-<service>"
```
## Common Alert Patterns
```yaml
# Service availability
- alert: ServiceUnreachable
  expr: up{job=~"<service>.*"} == 0
  for: 5m
  labels: {severity: critical}
# High error rate (5% for 5m)
# Aggregate before dividing: without sum, each 5xx series is matched
# against itself and the ratio is always 1.
- alert: HighErrorRate
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels: {severity: critical}
# Pod crash looping
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels: {severity: warning}
# Node memory pressure
- alert: NodeMemoryPressure
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
  for: 10m
  labels: {severity: warning}
# Disk space
- alert: DiskSpaceLow
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
      / node_filesystem_size_bytes{fstype!="tmpfs"}) > 0.85
  for: 15m
  labels: {severity: warning}
# Certificate expiry
- alert: CertificateExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
  for: 1h
  labels: {severity: critical}
# OpenBao sealed
- alert: OpenBaoSealed
  expr: vault_core_unsealed == 0
  for: 2m
  labels: {severity: critical}
```
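One pitfall worth illustrating for ratio-style alerts: PromQL binary operators match series on identical label sets. Dividing a `status=~"5.."` selector by an unfiltered selector, without aggregating first, therefore pairs each 5xx series with itself. The simulation below is a simplified model of that matching, using hypothetical series and rates, not a PromQL engine.

```python
from collections import defaultdict

# Hypothetical per-series request rates (req/s), keyed by (job, status).
rates = {
    ("api", "200"): 95.0,
    ("api", "500"): 4.0,
    ("api", "503"): 1.0,
}


def one_to_one_ratio(series):
    """Mimic `errors / total` with no aggregation: each 5xx series is
    divided by the series with the identical label set, which is itself."""
    return {k: v / series[k] for k, v in series.items() if k[1].startswith("5")}


def aggregated_ratio(series):
    """Mimic `sum by (job) (errors) / sum by (job) (total)`."""
    errors, total = defaultdict(float), defaultdict(float)
    for (job, status), value in series.items():
        total[job] += value
        if status.startswith("5"):
            errors[job] += value
    return {job: errors[job] / total[job] for job in total}


naive = one_to_one_ratio(rates)
correct = aggregated_ratio(rates)
print(naive)    # every matched series has ratio 1.0
print(correct)  # true error ratio per job
```

In this model the unaggregated form reports a ratio of 1.0 whenever any 5xx traffic exists, while the aggregated form reports the real 5% error ratio.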
## SLO-Based Alerting (Advanced)
For a 99.9% SLO (0.1% error budget):
```yaml
# Fast burn: consuming budget 14x faster than sustainable
- alert: SLOBurnRateFast
  expr: |
    (sum(rate(requests_total{status=~"5.."}[1h]))
      / sum(rate(requests_total[1h]))) > 14 * 0.001
  for: 5m
  labels: {severity: critical}
  annotations:
    description: "Error budget burning 14x too fast. 1h rate: {{ $value | humanizePercentage }}"
# Slow burn: consuming budget 2x faster than sustainable
# (exhausts the budget in half the SLO window)
- alert: SLOBurnRateSlow
  expr: |
    (sum(rate(requests_total{status=~"5.."}[6h]))
      / sum(rate(requests_total[6h]))) > 2 * 0.001
  for: 30m
  labels: {severity: warning}
```
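The arithmetic behind burn-rate thresholds can be worked through in a few lines. This sketch assumes a 30-day SLO window, which the text above does not state; `burn_rate_threshold` and `days_to_exhaustion` are hypothetical helper names.

```python
def burn_rate_threshold(slo, burn_rate):
    """Error-ratio threshold for an alert at a given burn rate."""
    error_budget = 1.0 - slo
    return burn_rate * error_budget


def days_to_exhaustion(window_days, burn_rate):
    """Days until the whole budget is spent at a constant burn rate."""
    return window_days / burn_rate


SLO = 0.999        # 99.9% SLO, i.e. a 0.1% error budget
WINDOW_DAYS = 30   # assumption: the SLO window is not stated in the text

results = {}
for name, rate in [("fast", 14), ("slow", 2)]:
    results[name] = (
        burn_rate_threshold(SLO, rate),
        days_to_exhaustion(WINDOW_DAYS, rate),
    )
    threshold, days = results[name]
    print(f"{name}: error ratio > {threshold:.4f}, budget gone in {days:.1f} days")
```

Under that 30-day assumption, a 14x burn exhausts the budget in roughly two days and a 2x burn in fifteen. A shorter window scales both figures down proportionally.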
## Anti-Patterns
| ❌ Bad | ✅ Better |
|---|---|
| `cpu_usage > 80` | CPU sustained high AND latency degraded |
| `pod_restarts > 0` | `rate(restarts[15m]) > 0` with `for: 5m` |
| No `for:` duration | Always add `for:`, minimum 2m |
| `severity: critical` on everything | Reserve critical for user-facing impact |
| "high X" with no context | What's normal? What's the impact? What to do? |
| Fires in staging/dev | Add `env="production"` label filter |
| Alert for every metric | Not everything needs an alert; use dashboards |
## Writing Good Descriptions
Template: **"[What broke] on [where]. Current value: {{ $value }}. [What to check first]."**
```yaml
# ❌ Bad
description: "High error rate detected"
# ✅ Good
description: "Error rate on {{ $labels.job }} is {{ $value | humanizePercentage }}
  (threshold: 5%). Check recent deployments and downstream dependencies.
  Logs: kubectl logs -n {{ $labels.namespace }} -l app={{ $labels.job }} --tail=100"
```


@ -0,0 +1,157 @@
---
name: devops-troubleshooter
description: Expert DevOps troubleshooter specializing in rapid incident response, advanced debugging, and modern observability.
risk: unknown
source: community
date_added: '2026-02-27'
---
## Use this skill when
- Working on devops troubleshooter tasks or workflows
- Needing guidance, best practices, or checklists for devops troubleshooter
## Do not use this skill when
- The task is unrelated to devops troubleshooter
- You need a different domain or tool outside this scope
## Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.
You are a DevOps troubleshooter specializing in rapid incident response, advanced debugging, and modern observability practices.
## Purpose
Expert DevOps troubleshooter with comprehensive knowledge of modern observability tools, debugging methodologies, and incident response practices. Masters log analysis, distributed tracing, performance debugging, and system reliability engineering. Specializes in rapid problem resolution, root cause analysis, and building resilient systems.
## Capabilities
### Modern Observability & Monitoring
- **Logging platforms**: ELK Stack (Elasticsearch, Logstash, Kibana), Loki/Grafana, Fluentd/Fluent Bit
- **APM solutions**: DataDog, New Relic, Dynatrace, AppDynamics, Instana, Honeycomb
- **Metrics & monitoring**: Prometheus, Grafana, InfluxDB, VictoriaMetrics, Thanos
- **Distributed tracing**: Jaeger, Zipkin, AWS X-Ray, OpenTelemetry, custom tracing
- **Cloud-native observability**: OpenTelemetry collector, service mesh observability
- **Synthetic monitoring**: Pingdom, Datadog Synthetics, custom health checks
### Container & Kubernetes Debugging
- **kubectl mastery**: Advanced debugging commands, resource inspection, troubleshooting workflows
- **Container runtime debugging**: Docker, containerd, CRI-O, runtime-specific issues
- **Pod troubleshooting**: Init containers, sidecar issues, resource constraints, networking
- **Service mesh debugging**: Istio, Linkerd, Consul Connect traffic and security issues
- **Kubernetes networking**: CNI troubleshooting, service discovery, ingress issues
- **Storage debugging**: Persistent volume issues, storage class problems, data corruption
### Network & DNS Troubleshooting
- **Network analysis**: tcpdump, Wireshark, eBPF-based tools, network latency analysis
- **DNS debugging**: dig, nslookup, DNS propagation, service discovery issues
- **Load balancer issues**: AWS ALB/NLB, Azure Load Balancer, GCP Load Balancer debugging
- **Firewall & security groups**: Network policies, security group misconfigurations
- **Service mesh networking**: Traffic routing, circuit breaker issues, retry policies
- **Cloud networking**: VPC connectivity, peering issues, NAT gateway problems
### Performance & Resource Analysis
- **System performance**: CPU, memory, disk I/O, network utilization analysis
- **Application profiling**: Memory leaks, CPU hotspots, garbage collection issues
- **Database performance**: Query optimization, connection pool issues, deadlock analysis
- **Cache troubleshooting**: Redis, Memcached, application-level caching issues
- **Resource constraints**: OOMKilled containers, CPU throttling, disk space issues
- **Scaling issues**: Auto-scaling problems, resource bottlenecks, capacity planning
### Application & Service Debugging
- **Microservices debugging**: Service-to-service communication, dependency issues
- **API troubleshooting**: REST API debugging, GraphQL issues, authentication problems
- **Message queue issues**: Kafka, RabbitMQ, SQS, dead letter queues, consumer lag
- **Event-driven architecture**: Event sourcing issues, CQRS problems, eventual consistency
- **Deployment issues**: Rolling update problems, configuration errors, environment mismatches
- **Configuration management**: Environment variables, secrets, config drift
### CI/CD Pipeline Debugging
- **Build failures**: Compilation errors, dependency issues, test failures
- **Deployment troubleshooting**: GitOps issues, ArgoCD/Flux problems, rollback procedures
- **Pipeline performance**: Build optimization, parallel execution, resource constraints
- **Security scanning issues**: SAST/DAST failures, vulnerability remediation
- **Artifact management**: Registry issues, image corruption, version conflicts
- **Environment-specific issues**: Configuration mismatches, infrastructure problems
### Cloud Platform Troubleshooting
- **AWS debugging**: CloudWatch analysis, AWS CLI troubleshooting, service-specific issues
- **Azure troubleshooting**: Azure Monitor, PowerShell debugging, resource group issues
- **GCP debugging**: Cloud Logging, gcloud CLI, service account problems
- **Multi-cloud issues**: Cross-cloud communication, identity federation problems
- **Serverless debugging**: Lambda functions, Azure Functions, Cloud Functions issues
### Security & Compliance Issues
- **Authentication debugging**: OAuth, SAML, JWT token issues, identity provider problems
- **Authorization issues**: RBAC problems, policy misconfigurations, permission debugging
- **Certificate management**: TLS certificate issues, renewal problems, chain validation
- **Security scanning**: Vulnerability analysis, compliance violations, security policy enforcement
- **Audit trail analysis**: Log analysis for security events, compliance reporting
### Database Troubleshooting
- **SQL debugging**: Query performance, index usage, execution plan analysis
- **NoSQL issues**: MongoDB, Redis, DynamoDB performance and consistency problems
- **Connection issues**: Connection pool exhaustion, timeout problems, network connectivity
- **Replication problems**: Primary-replica lag, failover issues, data consistency
- **Backup & recovery**: Backup failures, point-in-time recovery, disaster recovery testing
### Infrastructure & Platform Issues
- **Infrastructure as Code**: Terraform state issues, provider problems, resource drift
- **Configuration management**: Ansible playbook failures, Chef cookbook issues, Puppet manifest problems
- **Container registry**: Image pull failures, registry connectivity, vulnerability scanning issues
- **Secret management**: Vault integration, secret rotation, access control problems
- **Disaster recovery**: Backup failures, recovery testing, business continuity issues
### Advanced Debugging Techniques
- **Distributed system debugging**: CAP theorem implications, eventual consistency issues
- **Chaos engineering**: Fault injection analysis, resilience testing, failure pattern identification
- **Performance profiling**: Application profilers, system profiling, bottleneck analysis
- **Log correlation**: Multi-service log analysis, distributed tracing correlation
- **Capacity analysis**: Resource utilization trends, scaling bottlenecks, cost optimization
## Behavioral Traits
- Gathers comprehensive facts first through logs, metrics, and traces before forming hypotheses
- Forms systematic hypotheses and tests them methodically with minimal system impact
- Documents all findings thoroughly for postmortem analysis and knowledge sharing
- Implements fixes with minimal disruption while considering long-term stability
- Adds proactive monitoring and alerting to prevent recurrence of issues
- Prioritizes rapid resolution while maintaining system integrity and security
- Thinks in terms of distributed systems and considers cascading failure scenarios
- Values blameless postmortems and continuous improvement culture
- Considers both immediate fixes and long-term architectural improvements
- Emphasizes automation and runbook development for common issues
## Knowledge Base
- Modern observability platforms and debugging tools
- Distributed system troubleshooting methodologies
- Container orchestration and cloud-native debugging techniques
- Network troubleshooting and performance analysis
- Application performance monitoring and optimization
- Incident response best practices and SRE principles
- Security debugging and compliance troubleshooting
- Database performance and reliability issues
## Response Approach
1. **Assess the situation** with urgency appropriate to impact and scope
2. **Gather comprehensive data** from logs, metrics, traces, and system state
3. **Form and test hypotheses** systematically with minimal system disruption
4. **Implement immediate fixes** to restore service while planning permanent solutions
5. **Document thoroughly** for postmortem analysis and future reference
6. **Add monitoring and alerting** to detect similar issues proactively
7. **Plan long-term improvements** to prevent recurrence and improve system resilience
8. **Share knowledge** through runbooks, documentation, and team training
9. **Conduct blameless postmortems** to identify systemic improvements
## Example Interactions
- "Debug high memory usage in Kubernetes pods causing frequent OOMKills and restarts"
- "Analyze distributed tracing data to identify performance bottleneck in microservices architecture"
- "Troubleshoot intermittent 504 gateway timeout errors in production load balancer"
- "Investigate CI/CD pipeline failures and implement automated debugging workflows"
- "Root cause analysis for database deadlocks causing application timeouts"
- "Debug DNS resolution issues affecting service discovery in Kubernetes cluster"
- "Analyze logs to identify security breach and implement containment procedures"
- "Troubleshoot GitOps deployment failures and implement automated rollback procedures"


@ -0,0 +1,214 @@
---
name: differential-review
description: >
Performs security-focused differential review of code changes (PRs, commits, diffs).
Adapts analysis depth to codebase size, uses git history for context, calculates
blast radius, checks test coverage, and generates comprehensive markdown reports.
Automatically...
---
# Differential Security Review
Security-focused code review for PRs, commits, and diffs.
## Core Principles
1. **Risk-First**: Focus on auth, crypto, value transfer, external calls
2. **Evidence-Based**: Every finding backed by git history, line numbers, attack scenarios
3. **Adaptive**: Scale to codebase size (SMALL/MEDIUM/LARGE)
4. **Honest**: Explicitly state coverage limits and confidence level
5. **Output-Driven**: Always generate comprehensive markdown report file
---
## Rationalizations (Do Not Skip)
| Rationalization | Why It's Wrong | Required Action |
|-----------------|----------------|-----------------|
| "Small PR, quick review" | Heartbleed was 2 lines | Classify by RISK, not size |
| "I know this codebase" | Familiarity breeds blind spots | Build explicit baseline context |
| "Git history takes too long" | History reveals regressions | Never skip Phase 1 |
| "Blast radius is obvious" | You'll miss transitive callers | Calculate quantitatively |
| "No tests = not my problem" | Missing tests = elevated risk rating | Flag in report, elevate severity |
| "Just a refactor, no security impact" | Refactors break invariants | Analyze as HIGH until proven LOW |
| "I'll explain verbally" | No artifact = findings lost | Always write report |
---
## Quick Reference
### Codebase Size Strategy
| Codebase Size | Strategy | Approach |
|---------------|----------|----------|
| SMALL (<20 files) | DEEP | Read all deps, full git blame |
| MEDIUM (20-200) | FOCUSED | 1-hop deps, priority files |
| LARGE (200+) | SURGICAL | Critical paths only |
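The size thresholds above can be expressed as a small lookup. This is only a sketch: the function name `size_strategy` is hypothetical, and because the table's "20-200" and "200+" rows overlap at exactly 200 files, the code assumes that 200 counts as LARGE.

```python
from typing import Tuple


def size_strategy(file_count: int) -> Tuple[str, str]:
    """Map a file count to (codebase size, review strategy).

    Assumption: exactly 200 files is treated as LARGE, because the
    table's MEDIUM and LARGE rows overlap at that boundary.
    """
    if file_count < 0:
        raise ValueError("file_count must be non-negative")
    if file_count < 20:
        return ("SMALL", "DEEP")
    if file_count < 200:
        return ("MEDIUM", "FOCUSED")
    return ("LARGE", "SURGICAL")


if __name__ == "__main__":
    for count in (5, 80, 450):
        print(count, size_strategy(count))
```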
### Risk Level Triggers
| Risk Level | Triggers |
|------------|----------|
| HIGH | Auth, crypto, external calls, value transfer, validation removal |
| MEDIUM | Business logic, state changes, new public APIs |
| LOW | Comments, tests, UI, logging |
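One way to apply the trigger table is to tag each changed file with the categories it touches and take the highest matching level. The tag names and the `risk_level` function below are hypothetical labels chosen for illustration; the table itself only names the categories. As a conservative assumption, any tag the sketch does not recognise is rated MEDIUM rather than LOW.

```python
from typing import Iterable

# Hypothetical tag names mirroring the categories in the table above.
HIGH_TAGS = {"auth", "crypto", "external-call", "value-transfer", "validation-removal"}
MEDIUM_TAGS = {"business-logic", "state-change", "public-api"}
LOW_TAGS = {"comments", "tests", "ui", "logging"}


def risk_level(tags: Iterable[str]) -> str:
    """Return the highest risk level triggered by any tag."""
    tag_set = set(tags)
    if tag_set & HIGH_TAGS:
        return "HIGH"
    if tag_set & MEDIUM_TAGS:
        return "MEDIUM"
    if tag_set <= LOW_TAGS:
        # Only known low-risk categories (or no tags at all).
        return "LOW"
    # Unknown tag: assume MEDIUM until someone classifies it.
    return "MEDIUM"


if __name__ == "__main__":
    print(risk_level(["tests", "auth"]))
    print(risk_level(["state-change", "logging"]))
    print(risk_level(["ui"]))
```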
---
## Workflow Overview
```
Pre-Analysis → Phase 0: Triage → Phase 1: Code Analysis → Phase 2: Test Coverage
↓ ↓ ↓ ↓
Phase 3: Blast Radius → Phase 4: Deep Context → Phase 5: Adversarial → Phase 6: Report
```
---
## Decision Tree
**Starting a review?**
```
├─ Need detailed phase-by-phase methodology?
│  └─ Read: methodology.md
│     (Pre-Analysis + Phases 0-4: triage, code analysis, test coverage, blast radius)
├─ Analyzing HIGH RISK change?
│  └─ Read: adversarial.md
│     (Phase 5: Attacker modeling, exploit scenarios, exploitability rating)
├─ Writing the final report?
│  └─ Read: reporting.md
│     (Phase 6: Report structure, templates, formatting guidelines)
├─ Looking for specific vulnerability patterns?
│  └─ Read: patterns.md
│     (Regressions, reentrancy, access control, overflow, etc.)
└─ Quick triage only?
   └─ Use Quick Reference above, skip detailed docs
```
---
## Quality Checklist
Before delivering:
- [ ] All changed files analyzed
- [ ] Git blame on removed security code
- [ ] Blast radius calculated for HIGH risk
- [ ] Attack scenarios are concrete (not generic)
- [ ] Findings reference specific line numbers + commits
- [ ] Report file generated
- [ ] User notified with summary
---
## Integration
**audit-context-building skill:**
- Pre-Analysis: Build baseline context
- Phase 4: Deep context on HIGH RISK changes
**issue-writer skill:**
- Transform findings into formal audit reports
- Command: `issue-writer --input DIFFERENTIAL_REVIEW_REPORT.md --format audit-report`
---
## Example Usage
### Quick Triage (Small PR)
```
Input: 5 file PR, 2 HIGH RISK files
Strategy: Use Quick Reference
1. Classify risk level per file (2 HIGH, 3 LOW)
2. Focus on 2 HIGH files only
3. Git blame removed code
4. Generate minimal report
Time: ~30 minutes
```
### Standard Review (Medium Codebase)
```
Input: 80 files, 12 HIGH RISK changes
Strategy: FOCUSED (see methodology.md)
1. Full workflow on HIGH RISK files
2. Surface scan on MEDIUM
3. Skip LOW risk files
4. Complete report with all sections
Time: ~3-4 hours
```
### Deep Audit (Large, Critical Change)
```
Input: 450 files, auth system rewrite
Strategy: SURGICAL + audit-context-building
1. Baseline context with audit-context-building
2. Deep analysis on auth changes only
3. Blast radius analysis
4. Adversarial modeling
5. Comprehensive report
Time: ~6-8 hours
```
---
## When NOT to Use This Skill
- **Greenfield code** (no baseline to compare)
- **Documentation-only changes** (no security impact)
- **Formatting/linting** (cosmetic changes)
- **User explicitly requests quick summary only** (they accept risk)
For these cases, use standard code review instead.
---
## Red Flags (Stop and Investigate)
**Immediate escalation triggers:**
- Removed code from "security", "CVE", or "fix" commits
- Access control modifiers removed (onlyOwner, internal → external)
- Validation removed without replacement
- External calls added without checks
- High blast radius (50+ callers) + HIGH risk change
These patterns require adversarial analysis even in quick triage.
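The escalation triggers above can be checked mechanically once the reviewer has recorded a few facts about a change. The sketch below is an illustration only: `ChangeSignals`, its field names, and `red_flags` are hypothetical, and the boolean inputs are assumed to come from the reviewer's own triage rather than from any automated detector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeSignals:
    """Facts a reviewer records about one change (all names hypothetical)."""
    risk: str = "LOW"                      # "HIGH", "MEDIUM" or "LOW"
    caller_count: int = 0                  # size of the blast radius
    removes_security_fix_code: bool = False
    removes_access_control: bool = False
    removes_validation: bool = False
    adds_unchecked_external_call: bool = False


def red_flags(signals: ChangeSignals) -> List[str]:
    """Return the escalation triggers that apply to this change."""
    flags = []
    if signals.removes_security_fix_code:
        flags.append("removed code from a security/CVE/fix commit")
    if signals.removes_access_control:
        flags.append("access control modifier removed")
    if signals.removes_validation:
        flags.append("validation removed without replacement")
    if signals.adds_unchecked_external_call:
        flags.append("external call added without checks")
    if signals.caller_count >= 50 and signals.risk == "HIGH":
        flags.append("high blast radius on a HIGH risk change")
    return flags


if __name__ == "__main__":
    change = ChangeSignals(risk="HIGH", caller_count=72, removes_validation=True)
    for flag in red_flags(change):
        print("ESCALATE:", flag)
```

Any non-empty result means the change should go through adversarial analysis, even during quick triage.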
---
## Tips for Best Results
**Do:**
- Start with git blame for removed code
- Calculate blast radius early to prioritize
- Generate concrete attack scenarios
- Reference specific line numbers and commits
- Be honest about coverage limitations
- Always generate the output file
**Don't:**
- Skip git history analysis
- Make generic findings without evidence
- Claim full analysis when time-limited
- Forget to check test coverage
- Miss high blast radius changes
- Output report only to chat (file required)
---
## Supporting Documentation
- **methodology.md** - Detailed phase-by-phase workflow (Phases 0-4)
- **adversarial.md** - Attacker modeling and exploit scenarios (Phase 5)
- **reporting.md** - Report structure and formatting (Phase 6)
- **patterns.md** - Common vulnerability patterns reference
---
**For first-time users:** Start with methodology.md to understand the complete workflow.
**For experienced users:** Use this page's Quick Reference and Decision Tree to navigate directly to needed content.


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,96 @@
---
name: docs-architect
description: Creates comprehensive technical documentation from existing codebases. Analyzes architecture, design patterns, and implementation details to produce long-form technical manuals and ebooks.
risk: unknown
source: community
date_added: '2026-02-27'
---
## Use this skill when
- Working on docs architect tasks or workflows
- Needing guidance, best practices, or checklists for docs architect
## Do not use this skill when
- The task is unrelated to docs architect
- You need a different domain or tool outside this scope
## Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.
You are a technical documentation architect specializing in creating comprehensive, long-form documentation that captures both the what and the why of complex systems.
## Core Competencies
1. **Codebase Analysis**: Deep understanding of code structure, patterns, and architectural decisions
2. **Technical Writing**: Clear, precise explanations suitable for various technical audiences
3. **System Thinking**: Ability to see and document the big picture while explaining details
4. **Documentation Architecture**: Organizing complex information into digestible, navigable structures
5. **Visual Communication**: Creating and describing architectural diagrams and flowcharts
## Documentation Process
1. **Discovery Phase**
- Analyze codebase structure and dependencies
- Identify key components and their relationships
- Extract design patterns and architectural decisions
- Map data flows and integration points
2. **Structuring Phase**
- Create logical chapter/section hierarchy
- Design progressive disclosure of complexity
- Plan diagrams and visual aids
- Establish consistent terminology
3. **Writing Phase**
- Start with executive summary and overview
- Progress from high-level architecture to implementation details
- Include rationale for design decisions
- Add code examples with thorough explanations
## Output Characteristics
- **Length**: Comprehensive documents (10-100+ pages)
- **Depth**: From bird's-eye view to implementation specifics
- **Style**: Technical but accessible, with progressive complexity
- **Format**: Structured with chapters, sections, and cross-references
- **Visuals**: Architectural diagrams, sequence diagrams, and flowcharts (described in detail)
## Key Sections to Include
1. **Executive Summary**: One-page overview for stakeholders
2. **Architecture Overview**: System boundaries, key components, and interactions
3. **Design Decisions**: Rationale behind architectural choices
4. **Core Components**: Deep dive into each major module/service
5. **Data Models**: Schema design and data flow documentation
6. **Integration Points**: APIs, events, and external dependencies
7. **Deployment Architecture**: Infrastructure and operational considerations
8. **Performance Characteristics**: Bottlenecks, optimizations, and benchmarks
9. **Security Model**: Authentication, authorization, and data protection
10. **Appendices**: Glossary, references, and detailed specifications
## Best Practices
- Always explain the "why" behind design decisions
- Use concrete examples from the actual codebase
- Create mental models that help readers understand the system
- Document both current state and evolutionary history
- Include troubleshooting guides and common pitfalls
- Provide reading paths for different audiences (developers, architects, operations)
## Output Format
Generate documentation in Markdown format with:
- Clear heading hierarchy
- Code blocks with syntax highlighting
- Tables for structured data
- Bullet points for lists
- Blockquotes for important notes
- Links to relevant code files (using file_path:line_number format)
Remember: Your goal is to create documentation that serves as the definitive technical reference for the system, suitable for onboarding new team members, architectural reviews, and long-term maintenance.


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,51 @@
---
name: documentation-generation-doc-generate
description: "You are a documentation expert specializing in creating comprehensive, maintainable documentation from code. Generate API docs, architecture diagrams, user guides, and technical references using AI..."
risk: unknown
source: community
date_added: "2026-02-27"
---
# Automated Documentation Generation
You are a documentation expert specializing in creating comprehensive, maintainable documentation from code. Generate API docs, architecture diagrams, user guides, and technical references using AI-powered analysis and industry best practices.
## Use this skill when
- Generating API, architecture, or user documentation from code
- Building documentation pipelines or automation
- Standardizing docs across a repository
## Do not use this skill when
- The project has no codebase or source of truth
- You only need ad-hoc explanations
- You cannot access code or requirements
## Context
The user needs automated documentation generation that extracts information from code, creates clear explanations, and maintains consistency across documentation types. Focus on creating living documentation that stays synchronized with code.
## Requirements
$ARGUMENTS
## Instructions
- Identify required doc types and target audiences.
- Extract information from code, configs, and comments.
- Generate docs with consistent terminology and structure.
- Add automation (linting, CI) and validate accuracy.
- If detailed examples are required, open `resources/implementation-playbook.md`.
## Safety
- Avoid exposing secrets, internal URLs, or sensitive data in docs.
## Output Format
- Documentation plan and artifacts to generate
- File paths and tooling configuration
- Assumptions, gaps, and follow-up tasks
## Resources
- `resources/implementation-playbook.md` for detailed examples and templates.


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,640 @@
# Automated Documentation Generation Implementation Playbook
This file contains detailed patterns, checklists, and code samples referenced by the skill.
## Instructions
Generate comprehensive documentation by analyzing the codebase and creating the following artifacts:
### 1. **API Documentation**
- Extract endpoint definitions, parameters, and responses from code
- Generate OpenAPI/Swagger specifications
- Create interactive API documentation (Swagger UI, Redoc)
- Include authentication, rate limiting, and error handling details
### 2. **Architecture Documentation**
- Create system architecture diagrams (Mermaid, PlantUML)
- Document component relationships and data flows
- Explain service dependencies and communication patterns
- Include scalability and reliability considerations
### 3. **Code Documentation**
- Generate inline documentation and docstrings
- Create README files with setup, usage, and contribution guidelines
- Document configuration options and environment variables
- Provide troubleshooting guides and code examples
### 4. **User Documentation**
- Write step-by-step user guides
- Create getting started tutorials
- Document common workflows and use cases
- Include accessibility and localization notes
### 5. **Documentation Automation**
- Configure CI/CD pipelines for automatic doc generation
- Set up documentation linting and validation
- Implement documentation coverage checks
- Automate deployment to hosting platforms
### Quality Standards
Ensure all generated documentation:
- Is accurate and synchronized with current code
- Uses consistent terminology and formatting
- Includes practical examples and use cases
- Is searchable and well-organized
- Follows accessibility best practices
## Reference Examples
### Example 1: Code Analysis for Documentation
**API Documentation Extraction**
```python
import ast
from typing import Dict, List


class APIDocExtractor:
    # NOTE: the helpers _is_route_decorator, _extract_method, _extract_path
    # and _extract_returns are not defined in this excerpt.

    def extract_endpoints(self, code_path):
        """Extract API endpoints and their documentation"""
        endpoints = []
        with open(code_path, 'r') as f:
            tree = ast.parse(f.read())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                for decorator in node.decorator_list:
                    if self._is_route_decorator(decorator):
                        endpoint = {
                            'method': self._extract_method(decorator),
                            'path': self._extract_path(decorator),
                            'function': node.name,
                            'docstring': ast.get_docstring(node),
                            'parameters': self._extract_parameters(node),
                            'returns': self._extract_returns(node)
                        }
                        endpoints.append(endpoint)
        return endpoints

    def _extract_parameters(self, func_node):
        """Extract function parameters with types"""
        params = []
        for arg in func_node.args.args:
            param = {
                'name': arg.arg,
                # ast.unparse requires Python 3.9+
                'type': ast.unparse(arg.annotation) if arg.annotation else None,
                'required': True
            }
            params.append(param)
        return params
```
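The extractor above calls several helpers that this excerpt does not define. A minimal sketch of what the decorator helpers might look like follows, written as standalone functions. It assumes routes are declared with decorators shaped like `@app.get("/path")` or `@app.route("/path", methods=["POST"])`; the function names, the `HTTP_METHODS` set, and the `SAMPLE` source are all hypothetical, and a real project may register routes differently.

```python
import ast
from typing import List, Optional, Tuple

# Assumed set of decorator attribute names that denote an HTTP verb.
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def is_route_decorator(decorator: ast.expr) -> bool:
    """True for calls such as @app.get(...) or @app.route(...)."""
    return (
        isinstance(decorator, ast.Call)
        and isinstance(decorator.func, ast.Attribute)
        and decorator.func.attr in HTTP_METHODS | {"route"}
    )


def extract_path(decorator: ast.Call) -> Optional[str]:
    """Return the first positional argument when it is a string literal."""
    if decorator.args:
        first = decorator.args[0]
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            return first.value
    return None


def extract_method(decorator: ast.Call) -> str:
    """Derive the HTTP method from the decorator name or its methods= list."""
    attr = decorator.func.attr
    if attr != "route":
        return attr.upper()
    for keyword in decorator.keywords:
        if keyword.arg == "methods" and isinstance(keyword.value, (ast.List, ast.Tuple)):
            for element in keyword.value.elts:
                if isinstance(element, ast.Constant) and isinstance(element.value, str):
                    return element.value.upper()
    # Assumption: a bare route() with no methods= means GET.
    return "GET"


def find_routes(source: str) -> List[Tuple[str, Optional[str], str]]:
    """Return (method, path, function name) for every decorated route."""
    routes = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            for decorator in node.decorator_list:
                if is_route_decorator(decorator):
                    routes.append(
                        (extract_method(decorator), extract_path(decorator), node.name)
                    )
    return routes


SAMPLE = '''
@app.get("/users")
def list_users():
    pass

@app.route("/login", methods=["POST"])
def login():
    pass
'''

if __name__ == "__main__":
    for route in find_routes(SAMPLE):
        print(route)
```

Only the syntax tree is inspected, so the sample source is never executed and `app` does not need to exist.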
**Schema Extraction**
```python
import ast


def extract_pydantic_schemas(file_path):
    """Extract Pydantic model definitions for API documentation"""
    schemas = []
    with open(file_path, 'r') as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Only matches classes that list BaseModel directly as a bare name
            if any(base.id == 'BaseModel' for base in node.bases if hasattr(base, 'id')):
                schema = {
                    'name': node.name,
                    'description': ast.get_docstring(node),
                    'fields': []
                }
                for item in node.body:
                    if isinstance(item, ast.AnnAssign):
                        field = {
                            'name': item.target.id,
                            'type': ast.unparse(item.annotation),
                            # A field with no default value is treated as required
                            'required': item.value is None
                        }
                        schema['fields'].append(field)
                schemas.append(schema)
    return schemas
```
### Example 2: OpenAPI Specification Generation
**OpenAPI Template**
```yaml
openapi: 3.0.0
info:
  title: ${API_TITLE}
  version: ${VERSION}
  description: |
    ${DESCRIPTION}

    ## Authentication
    ${AUTH_DESCRIPTION}
servers:
  - url: https://api.example.com/v1
    description: Production server
security:
  - bearerAuth: []
paths:
  /users:
    get:
      summary: List all users
      operationId: listUsers
      tags:
        - Users
      parameters:
        - name: page
          in: query
          schema:
            type: integer
            default: 1
        - name: limit
          in: query
          schema:
            type: integer
            default: 20
            maximum: 100
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                type: object
                properties:
                  data:
                    type: array
                    items:
                      $ref: '#/components/schemas/User'
                  pagination:
                    $ref: '#/components/schemas/Pagination'
        '401':
          $ref: '#/components/responses/Unauthorized'
components:
  schemas:
    User:
      type: object
      required:
        - id
        - email
      properties:
        id:
          type: string
          format: uuid
        email:
          type: string
          format: email
        name:
          type: string
        createdAt:
          type: string
          format: date-time
```
### Example 3: Architecture Diagrams
**System Architecture (Mermaid)**
```mermaid
graph TB
subgraph "Frontend"
UI[React UI]
Mobile[Mobile App]
end
subgraph "API Gateway"
Gateway[Kong/nginx]
Auth[Auth Service]
end
subgraph "Microservices"
UserService[User Service]
OrderService[Order Service]
PaymentService[Payment Service]
end
subgraph "Data Layer"
PostgresMain[(PostgreSQL)]
Redis[(Redis Cache)]
S3[S3 Storage]
end
UI --> Gateway
Mobile --> Gateway
Gateway --> Auth
Gateway --> UserService
Gateway --> OrderService
OrderService --> PaymentService
UserService --> PostgresMain
UserService --> Redis
OrderService --> PostgresMain
```
**Component Documentation**
````markdown
## User Service
**Purpose**: Manages user accounts, authentication, and profiles
**Technology Stack**:
- Language: Python 3.11
- Framework: FastAPI
- Database: PostgreSQL
- Cache: Redis
- Authentication: JWT
**API Endpoints**:
- `POST /users` - Create new user
- `GET /users/{id}` - Get user details
- `PUT /users/{id}` - Update user
- `POST /auth/login` - User login
**Configuration**:
```yaml
user_service:
  port: 8001
  database:
    host: postgres.internal
    name: users_db
  jwt:
    secret: ${JWT_SECRET}
    expiry: 3600
```
````
### Example 4: README Generation
**README Template**
````markdown
# ${PROJECT_NAME}
${BADGES}
${SHORT_DESCRIPTION}
## Features
${FEATURES_LIST}
## Installation
### Prerequisites
- Python 3.8+
- PostgreSQL 12+
- Redis 6+
### Using pip
```bash
pip install ${PACKAGE_NAME}
```
### From source
```bash
git clone https://github.com/${GITHUB_ORG}/${REPO_NAME}.git
cd ${REPO_NAME}
pip install -e .
```
## Quick Start
```python
${QUICK_START_CODE}
```
## Configuration
### Environment Variables
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| DATABASE_URL | PostgreSQL connection string | - | Yes |
| REDIS_URL | Redis connection string | - | Yes |
| SECRET_KEY | Application secret key | - | Yes |
## Development
```bash
# Clone and setup
git clone https://github.com/${GITHUB_ORG}/${REPO_NAME}.git
cd ${REPO_NAME}
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements-dev.txt
# Run tests
pytest
# Start development server
python manage.py runserver
```
## Testing
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=your_package
```
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the ${LICENSE} License - see the LICENSE file for details.
````
### Example 5: Function Documentation Generator
```python
import inspect


def _type_name(annotation):
    """Return a readable type name; some typing generics lack __name__."""
    return getattr(annotation, "__name__", str(annotation))


def generate_function_docs(func):
    """Generate comprehensive documentation for a function"""
    sig = inspect.signature(func)
    params = []
    args_doc = []
    for param_name, param in sig.parameters.items():
        param_str = param_name
        if param.annotation is not param.empty:
            param_str += f": {_type_name(param.annotation)}"
        if param.default is not param.empty:
            # repr() keeps quotes around string defaults
            param_str += f" = {param.default!r}"
        params.append(param_str)
        args_doc.append(f"{param_name}: Description of {param_name}")
    return_type = ""
    if sig.return_annotation is not sig.empty:
        return_type = f" -> {_type_name(sig.return_annotation)}"
    # Indent each argument line to sit under "Args:" in the docstring
    arg_lines = chr(10).join(f"        {arg}" for arg in args_doc)
    doc_template = f'''
def {func.__name__}({", ".join(params)}){return_type}:
    """
    Brief description of {func.__name__}

    Args:
{arg_lines}

    Returns:
        Description of return value

    Examples:
        >>> {func.__name__}(example_input)
        expected_output
    """
'''
    return doc_template
```
### Example 6: User Guide Template
```markdown
# User Guide
## Getting Started
### Creating Your First ${FEATURE}
1. **Navigate to the Dashboard**
Click on the ${FEATURE} tab in the main navigation menu.
2. **Click "Create New"**
You'll find the "Create New" button in the top right corner.
3. **Fill in the Details**
- **Name**: Enter a descriptive name
- **Description**: Add optional details
- **Settings**: Configure as needed
4. **Save Your Changes**
Click "Save" to create your ${FEATURE}.
### Common Tasks
#### Editing ${FEATURE}
1. Find your ${FEATURE} in the list
2. Click the "Edit" button
3. Make your changes
4. Click "Save"
#### Deleting ${FEATURE}
> ⚠️ **Warning**: Deletion is permanent and cannot be undone.
1. Find your ${FEATURE} in the list
2. Click the "Delete" button
3. Confirm the deletion
### Troubleshooting
| Error | Meaning | Solution |
|-------|---------|----------|
| "Name required" | The name field is empty | Enter a name |
| "Permission denied" | You don't have access | Contact admin |
| "Server error" | Technical issue | Try again later |
```
### Example 7: Interactive API Playground
**Swagger UI Setup**
```html
<!DOCTYPE html>
<html>
<head>
  <title>API Documentation</title>
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/swagger-ui-dist@latest/swagger-ui.css">
</head>
<body>
  <div id="swagger-ui"></div>
  <script src="https://cdn.jsdelivr.net/npm/swagger-ui-dist@latest/swagger-ui-bundle.js"></script>
  <!-- StandaloneLayout is provided by the standalone preset, so load it too -->
  <script src="https://cdn.jsdelivr.net/npm/swagger-ui-dist@latest/swagger-ui-standalone-preset.js"></script>
  <script>
    window.onload = function() {
      SwaggerUIBundle({
        url: "/api/openapi.json",
        dom_id: '#swagger-ui',
        deepLinking: true,
        presets: [SwaggerUIBundle.presets.apis, SwaggerUIStandalonePreset],
        layout: "StandaloneLayout"
      });
    }
  </script>
</body>
</html>
```
**Code Examples Generator**
```python
def generate_code_examples(endpoint):
    """Generate code examples for API endpoints in multiple languages"""
    examples = {}

    # Python
    examples['python'] = f'''
import requests
url = "https://api.example.com{endpoint['path']}"
headers = {{"Authorization": "Bearer YOUR_API_KEY"}}
response = requests.{endpoint['method'].lower()}(url, headers=headers)
print(response.json())
'''

    # JavaScript
    examples['javascript'] = f'''
const response = await fetch('https://api.example.com{endpoint['path']}', {{
  method: '{endpoint['method']}',
  headers: {{'Authorization': 'Bearer YOUR_API_KEY'}}
}});
const data = await response.json();
console.log(data);
'''

    # cURL
    examples['curl'] = f'''
curl -X {endpoint['method']} https://api.example.com{endpoint['path']} \\
  -H "Authorization: Bearer YOUR_API_KEY"
'''
    return examples
```
### Example 8: Documentation CI/CD
**GitHub Actions Workflow**
```yaml
name: Generate Documentation
on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'api/**'
jobs:
  generate-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements-docs.txt
          npm install -g @redocly/cli
      - name: Generate API documentation
        run: |
          python scripts/generate_openapi.py > docs/api/openapi.json
          redocly build-docs docs/api/openapi.json -o docs/api/index.html
      - name: Generate code documentation
        run: sphinx-build -b html docs/source docs/build
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./docs/build
```
### Example 9: Documentation Coverage Validation
```python
import ast
import glob


class DocCoverage:
    def check_coverage(self, codebase_path):
        """Check documentation coverage for codebase"""
        results = {
            'total_functions': 0,
            'documented_functions': 0,
            'total_classes': 0,
            'documented_classes': 0,
            'missing_docs': []
        }
        for file_path in glob.glob(f"{codebase_path}/**/*.py", recursive=True):
            # Use a context manager so each file handle is closed promptly
            with open(file_path) as f:
                module = ast.parse(f.read())
            for node in ast.walk(module):
                if isinstance(node, ast.FunctionDef):
                    results['total_functions'] += 1
                    if ast.get_docstring(node):
                        results['documented_functions'] += 1
                    else:
                        results['missing_docs'].append({
                            'type': 'function',
                            'name': node.name,
                            'file': file_path,
                            'line': node.lineno
                        })
                elif isinstance(node, ast.ClassDef):
                    results['total_classes'] += 1
                    if ast.get_docstring(node):
                        results['documented_classes'] += 1
                    else:
                        results['missing_docs'].append({
                            'type': 'class',
                            'name': node.name,
                            'file': file_path,
                            'line': node.lineno
                        })
        # Calculate coverage percentages (an empty codebase counts as 100%)
        results['function_coverage'] = (
            results['documented_functions'] / results['total_functions'] * 100
            if results['total_functions'] > 0 else 100
        )
        results['class_coverage'] = (
            results['documented_classes'] / results['total_classes'] * 100
            if results['total_classes'] > 0 else 100
        )
        return results
```
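A coverage report like the one above is typically used as a gate in CI. The sketch below shows one possible gate over a results dictionary with `function_coverage` and `class_coverage` keys; the function name `coverage_failures` and the 80% default threshold are assumptions for illustration, not values taken from this playbook.

```python
from typing import Dict, List


def coverage_failures(results: Dict[str, float], minimum: float = 80.0) -> List[str]:
    """Return one message per coverage metric that falls below `minimum`.

    `results` is assumed to hold percentage values under the keys
    'function_coverage' and 'class_coverage'. A missing key is treated
    as 100, meaning nothing of that kind was found to document.
    """
    failures = []
    for key in ("function_coverage", "class_coverage"):
        value = results.get(key, 100)
        if value < minimum:
            failures.append(f"{key} {value:.1f}% is below {minimum:.1f}%")
    return failures


if __name__ == "__main__":
    report = {"function_coverage": 62.5, "class_coverage": 100}
    problems = coverage_failures(report)
    for problem in problems:
        print("FAIL:", problem)
    # A CI job could exit non-zero here when `problems` is not empty.
```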
## Output Format
1. **API Documentation**: OpenAPI spec with interactive playground
2. **Architecture Diagrams**: System, sequence, and component diagrams
3. **Code Documentation**: Inline docs, docstrings, and type hints
4. **User Guides**: Step-by-step tutorials
5. **Developer Guides**: Setup, contribution, and API usage guides
6. **Reference Documentation**: Complete API reference with examples
7. **Documentation Site**: Deployed static site with search functionality
Focus on creating documentation that is accurate, comprehensive, and easy to maintain alongside code changes.


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,199 @@
---
name: documentation-templates
description: "Documentation templates and structure guidelines. README, API docs, code comments, and AI-friendly documentation."
risk: unknown
source: community
date_added: "2026-02-27"
---
# Documentation Templates
> Templates and structure guidelines for common documentation types.
---
## 1. README Structure
### Essential Sections (Priority Order)
| Section | Purpose |
|---------|---------|
| **Title + One-liner** | What is this? |
| **Quick Start** | Running in <5 min |
| **Features** | What can I do? |
| **Configuration** | How to customize |
| **API Reference** | Link to detailed docs |
| **Contributing** | How to help |
| **License** | Legal |
### README Template
```markdown
# Project Name
Brief one-line description.
## Quick Start
[Minimum steps to run]
## Features
- Feature 1
- Feature 2
## Configuration
| Variable | Description | Default |
|----------|-------------|---------|
| PORT | Server port | 3000 |
## Documentation
- API Reference
- Architecture
## License
MIT
```
---
## 2. API Documentation Structure
### Per-Endpoint Template
```markdown
## GET /users/:id
Get a user by ID.
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| id | string | Yes | User ID |
**Response:**
- 200: User object
- 404: User not found
**Example:**
[Request and response example]
```
---
## 3. Code Comment Guidelines
### JSDoc/TSDoc Template
```typescript
/**
* Brief description of what the function does.
*
* @param paramName - Description of parameter
* @returns Description of return value
* @throws ErrorType - When this error occurs
*
* @example
* const result = functionName(input);
*/
```
### When to Comment
| ✅ Comment | ❌ Don't Comment |
|-----------|-----------------|
| Why (business logic) | What (obvious) |
| Complex algorithms | Every line |
| Non-obvious behavior | Self-explanatory code |
| API contracts | Implementation details |
---
## 4. Changelog Template (Keep a Changelog)
```markdown
# Changelog
## [Unreleased]
### Added
- New feature
## [1.0.0] - 2025-01-01
### Added
- Initial release
### Changed
- Updated dependency
### Fixed
- Bug fix
```
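A release section in this layout can be rendered from structured data. The following is a minimal sketch, not part of the skill: `render_release` is a hypothetical helper, and it assumes changes arrive as a mapping from section name to a list of entries. The section order follows the change types used by Keep a Changelog.

```python
from typing import Dict, List

# Change types in the order Keep a Changelog lists them.
SECTION_ORDER = ["Added", "Changed", "Deprecated", "Removed", "Fixed", "Security"]


def render_release(version: str, date: str, changes: Dict[str, List[str]]) -> str:
    """Render one release block; empty sections are omitted."""
    lines = [f"## [{version}] - {date}"]
    for section in SECTION_ORDER:
        entries = changes.get(section, [])
        if entries:
            lines.append(f"### {section}")
            lines.extend(f"- {entry}" for entry in entries)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_release(
        "1.0.0",
        "2025-01-01",
        {"Fixed": ["Bug fix"], "Added": ["Initial release"]},
    ))
```

Sections are emitted in the fixed order regardless of the order in the input mapping, which keeps successive releases consistent.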
---
## 5. Architecture Decision Record (ADR)
```markdown
# ADR-001: [Title]
## Status
Accepted / Deprecated / Superseded
## Context
Why are we making this decision?
## Decision
What did we decide?
## Consequences
What are the trade-offs?
```
---
## 6. AI-Friendly Documentation (2025)
### llms.txt Template
For AI crawlers and agents:
```markdown
# Project Name
> One-line objective.
## Core Files
- [src/index.ts]: Main entry
- [src/api/]: API routes
- [docs/]: Documentation
## Key Concepts
- Concept 1: Brief explanation
- Concept 2: Brief explanation
```
### MCP-Ready Documentation
For RAG indexing:
- Clear H1-H3 hierarchy
- JSON/YAML examples for data structures
- Mermaid diagrams for flows
- Self-contained sections
---
## 7. Structure Principles
| Principle | Why |
|-----------|-----|
| **Scannable** | Headers, lists, tables |
| **Examples first** | Show, don't just tell |
| **Progressive detail** | Simple → Complex |
| **Up to date** | Outdated = misleading |
---
> **Remember:** Templates are starting points. Adapt to your project's needs.
## When to Use
Use this skill when you need to carry out the workflow or actions described in the overview.


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,260 @@
---
name: documentation
description: "Documentation generation workflow covering API docs, architecture docs, README files, code comments, and technical writing."
category: workflow-bundle
risk: safe
source: personal
date_added: "2026-02-27"
---
# Documentation Workflow Bundle
## Overview
Comprehensive documentation workflow for generating API documentation, architecture documentation, README files, code comments, and technical content from codebases.
## When to Use This Workflow
Use this workflow when:
- Creating project documentation
- Generating API documentation
- Writing architecture docs
- Documenting code
- Creating user guides
- Maintaining wikis
## Workflow Phases
### Phase 1: Documentation Planning
#### Skills to Invoke
- `docs-architect` - Documentation architecture
- `documentation-templates` - Documentation templates
#### Actions
1. Identify documentation needs
2. Choose documentation tools
3. Plan documentation structure
4. Define style guidelines
5. Set up documentation site
#### Copy-Paste Prompts
```
Use @docs-architect to plan documentation structure
```
```
Use @documentation-templates to set up documentation
```
### Phase 2: API Documentation
#### Skills to Invoke
- `api-documenter` - API documentation
- `api-documentation-generator` - Auto-generation
- `openapi-spec-generation` - OpenAPI specs
#### Actions
1. Extract API endpoints
2. Generate OpenAPI specs
3. Create API reference
4. Add usage examples
5. Set up auto-generation
#### Copy-Paste Prompts
```
Use @api-documenter to generate API documentation
```
```
Use @openapi-spec-generation to create OpenAPI specs
```
### Phase 3: Architecture Documentation
#### Skills to Invoke
- `c4-architecture-c4-architecture` - C4 architecture
- `c4-context` - Context diagrams
- `c4-container` - Container diagrams
- `c4-component` - Component diagrams
- `c4-code` - Code diagrams
- `mermaid-expert` - Mermaid diagrams
#### Actions
1. Create C4 diagrams
2. Document architecture
3. Generate sequence diagrams
4. Document data flows
5. Create deployment docs
#### Copy-Paste Prompts
```
Use @c4-architecture-c4-architecture to create C4 diagrams
```
```
Use @mermaid-expert to create architecture diagrams
```
### Phase 4: Code Documentation
#### Skills to Invoke
- `code-documentation-code-explain` - Code explanation
- `code-documentation-doc-generate` - Doc generation
- `documentation-generation-doc-generate` - Auto-generation
#### Actions
1. Extract code comments
2. Generate JSDoc/TSDoc
3. Create type documentation
4. Document functions
5. Add usage examples
#### Copy-Paste Prompts
```
Use @code-documentation-code-explain to explain code
```
```
Use @code-documentation-doc-generate to generate docs
```
### Phase 5: README and Getting Started
#### Skills to Invoke
- `readme` - README generation
- `environment-setup-guide` - Setup guides
- `tutorial-engineer` - Tutorial creation
#### Actions
1. Create README
2. Write getting started guide
3. Document installation
4. Add usage examples
5. Create troubleshooting guide
#### Copy-Paste Prompts
```
Use @readme to create project README
```
```
Use @tutorial-engineer to create tutorials
```
### Phase 6: Wiki and Knowledge Base
#### Skills to Invoke
- `wiki-architect` - Wiki architecture
- `wiki-page-writer` - Wiki pages
- `wiki-onboarding` - Onboarding docs
- `wiki-qa` - Wiki Q&A
- `wiki-researcher` - Wiki research
- `wiki-vitepress` - VitePress wiki
#### Actions
1. Design wiki structure
2. Create wiki pages
3. Write onboarding guides
4. Document processes
5. Set up wiki site
#### Copy-Paste Prompts
```
Use @wiki-architect to design wiki structure
```
```
Use @wiki-page-writer to create wiki pages
```
```
Use @wiki-onboarding to create onboarding docs
```
### Phase 7: Changelog and Release Notes
#### Skills to Invoke
- `changelog-automation` - Changelog generation
- `wiki-changelog` - Changelog from git
#### Actions
1. Extract commit history
2. Categorize changes
3. Generate changelog
4. Create release notes
5. Publish updates
#### Copy-Paste Prompts
```
Use @changelog-automation to generate changelog
```
```
Use @wiki-changelog to create release notes
```
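Actions 1 to 3 above (extract, categorize, generate) can be sketched in shell. This is only an illustration: it assumes commit subjects follow the Conventional Commits `type: subject` form, and the function name `categorize_commit` and the section titles are hypothetical, not part of either skill.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map one commit subject to a changelog section.
categorize_commit() {
  local subject="$1"
  case "$subject" in
    feat:*|"feat("*) echo "Added" ;;
    fix:*|"fix("*)   echo "Fixed" ;;
    docs:*|"docs("*) echo "Documentation" ;;
    *)               echo "Other" ;;
  esac
}

# Sample subjects stand in for real history here; in a repository the
# input would come from `git log --pretty=%s <range>`.
printf '%s\n' "feat: add login" "fix(api): handle nil" "chore: bump deps" |
while IFS= read -r subject; do
  printf '%s\t%s\n' "$(categorize_commit "$subject")" "$subject"
done
```

Grouping the tab-separated output by its first column gives the skeleton of a changelog section list.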
### Phase 8: Documentation Maintenance
#### Skills to Invoke
- `doc-coauthoring` - Collaborative writing
- `reference-builder` - Reference docs
#### Actions
1. Review documentation
2. Update outdated content
3. Fix broken links
4. Document new features
5. Gather feedback
#### Copy-Paste Prompts
```
Use @doc-coauthoring to collaborate on docs
```
## Documentation Types
### Code-Level
- JSDoc/TSDoc comments
- Function documentation
- Type definitions
- Example code
### API Documentation
- Endpoint reference
- Request/response schemas
- Authentication guides
- SDK documentation
### Architecture Documentation
- System overview
- Component diagrams
- Data flow diagrams
- Deployment architecture
### User Documentation
- Getting started guides
- User manuals
- Tutorials
- FAQs
### Process Documentation
- Runbooks
- Onboarding guides
- SOPs
- Decision records
## Quality Gates
- [ ] All APIs documented
- [ ] Architecture diagrams current
- [ ] README up to date
- [ ] Code comments helpful
- [ ] Examples working
- [ ] Links valid
## Related Workflow Bundles
- `development` - Development workflow
- `testing-qa` - Documentation testing
- `ai-ml` - AI documentation

View file

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

View file

@ -0,0 +1,54 @@
---
name: fix-review
description: "Verify fix commits address audit findings without new bugs"
risk: safe
source: "https://github.com/trailofbits/skills/tree/main/plugins/fix-review"
date_added: "2026-02-27"
---
# Fix Review
## Overview
Verify that fix commits properly address audit findings without introducing new bugs or security vulnerabilities.
## When to Use This Skill
Use this skill when:
- Reviewing commits that address security audit findings
- Verifying that fixes don't introduce new vulnerabilities
- Ensuring code changes properly resolve identified issues
- Validating that remediation efforts are complete and correct
## Instructions
This skill helps verify that fix commits properly address audit findings:
1. **Review Fix Commits**: Analyze commits that claim to fix audit findings
2. **Verify Resolution**: Ensure the original issue is properly addressed
3. **Check for Regressions**: Verify no new bugs or vulnerabilities are introduced
4. **Validate Completeness**: Ensure all aspects of the finding are resolved
## Review Process
When reviewing fix commits:
1. Compare the fix against the original audit finding
2. Verify the fix addresses the root cause, not just symptoms
3. Check for potential side effects or new issues
4. Validate that tests cover the fixed scenario
5. Ensure no similar vulnerabilities exist elsewhere
## Best Practices
- Review fixes in context of the full codebase
- Verify test coverage for the fixed issue
- Check for similar patterns that might need fixing
- Ensure fixes follow security best practices
- Document the resolution approach
## Resources
For more information, see the [source repository](https://github.com/trailofbits/skills/tree/main/plugins/fix-review).

View file

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

View file

@ -0,0 +1,36 @@
---
name: git-pushing
description: "Stage, commit, and push git changes with conventional commit messages. Use when user wants to commit and push changes, mentions pushing to remote, or asks to save and push their work. Also activate..."
risk: unknown
source: community
date_added: "2026-02-27"
---
# Git Push Workflow
Stage all changes, create a conventional commit, and push to the remote branch.
## When to Use
Automatically activate when the user:
- Explicitly asks to push changes ("push this", "commit and push")
- Mentions saving work to remote ("save to github", "push to remote")
- Completes a feature and wants to share it
- Says phrases like "let's push this up" or "commit these changes"
## Workflow
**ALWAYS use the script** - do NOT use manual git commands:
```bash
bash skills/git-pushing/scripts/smart_commit.sh
```
With custom message:
```bash
bash skills/git-pushing/scripts/smart_commit.sh "feat: add feature"
```
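The script stages all changes, commits with the given message (defaulting to `chore: update code`), and pushes with the `-u` flag. It does not generate a message or append a footer, so pass a conventional commit message explicitly.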
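Because the message is passed through as-is, it may help to check it before calling the script. The sketch below assumes the common Conventional Commits type list; the function name `is_conventional` is hypothetical and is not part of `smart_commit.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical pre-check: succeed only if the message starts with a
# conventional-commit prefix such as "feat:", "fix(scope):" or "feat!:".
is_conventional() {
  local re='^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)(\([a-z0-9._-]+\))?!?: .+'
  [[ "$1" =~ $re ]]
}

msg="feat: add feature"
if is_conventional "$msg"; then
  echo "message looks conventional: $msg"
else
  echo "message is not conventional: $msg" >&2
fi
```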

View file

@ -0,0 +1,19 @@
#!/bin/bash
set -e
# Default commit message if none provided
MESSAGE="${1:-chore: update code}"
# Add all changes
git add .
# Commit with the provided message
git commit -m "$MESSAGE"
# Get current branch name
BRANCH=$(git rev-parse --abbrev-ref HEAD)
# Push to remote, setting upstream if needed
git push -u origin "$BRANCH"
echo "✅ Successfully pushed to $BRANCH"

View file

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

View file

@ -0,0 +1,37 @@
---
name: helm-chart-scaffolding
description: "Design, organize, and manage Helm charts for templating and packaging Kubernetes applications with reusable configurations. Use when creating Helm charts, packaging Kubernetes applications, or impl..."
risk: unknown
source: community
date_added: "2026-02-27"
---
# Helm Chart Scaffolding
Comprehensive guidance for creating, organizing, and managing Helm charts for packaging and deploying Kubernetes applications.
## Use this skill when
Use this skill when you need to:
- Create new Helm charts from scratch
- Package Kubernetes applications for distribution
- Manage multi-environment deployments with Helm
- Implement templating for reusable Kubernetes manifests
- Set up Helm chart repositories
- Follow Helm best practices and conventions
## Do not use this skill when
- The task is unrelated to helm chart scaffolding
- You need a different domain or tool outside this scope
## Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.
## Resources
- `resources/implementation-playbook.md` for detailed patterns and examples.

View file

@ -0,0 +1,42 @@
apiVersion: v2
name: <chart-name>
description: <Chart description>
type: application
version: 0.1.0
appVersion: "1.0.0"
keywords:
  - <keyword1>
  - <keyword2>
home: https://github.com/<org>/<repo>
sources:
  - https://github.com/<org>/<repo>
maintainers:
  - name: <Maintainer Name>
    email: <maintainer@example.com>
    url: https://github.com/<username>
icon: https://example.com/icon.png
kubeVersion: ">=1.24.0"
dependencies:
  - name: postgresql
    version: "12.0.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
    tags:
      - database
  - name: redis
    version: "17.0.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
    tags:
      - cache
annotations:
  category: Application
  licenses: Apache-2.0

View file

@ -0,0 +1,185 @@
# Global values shared with subcharts
global:
  imageRegistry: docker.io
  imagePullSecrets: []
  storageClass: ""
# Image configuration
image:
  registry: docker.io
  repository: myapp/web
  tag: "" # Defaults to .Chart.AppVersion
  pullPolicy: IfNotPresent
# Override chart name
nameOverride: ""
fullnameOverride: ""
# Number of replicas
replicaCount: 3
revisionHistoryLimit: 10
# ServiceAccount
serviceAccount:
  create: true
  annotations: {}
  name: ""
# Pod annotations
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9090"
  prometheus.io/path: "/metrics"
# Pod security context
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault
# Container security context
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
# Service configuration
service:
  type: ClusterIP
  port: 80
  targetPort: http
  annotations: {}
  sessionAffinity: None
# Ingress configuration
ingress:
  enabled: false
  className: nginx
  annotations: {}
  hosts:
    - host: app.example.com
      paths:
        - path: /
          pathType: Prefix
  tls: []
# Resources
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi
# Liveness probe
livenessProbe:
  httpGet:
    path: /health/live
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10
# Readiness probe
readinessProbe:
  httpGet:
    path: /health/ready
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
# Autoscaling
autoscaling:
  enabled: false
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
# Pod Disruption Budget
podDisruptionBudget:
  enabled: true
  minAvailable: 1
# Node selection
nodeSelector: {}
tolerations: []
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - <chart-name> # Literal name: values.yaml is not rendered as a template
          topologyKey: kubernetes.io/hostname
# Environment variables
env: []
# - name: LOG_LEVEL
#   value: "info"
# ConfigMap data
configMap:
  enabled: true
  data: {}
  # APP_MODE: production
  # DATABASE_HOST: postgres.example.com
# Secrets (use external secret management in production)
secrets:
  enabled: false
  data: {}
# Persistent Volume
persistence:
  enabled: false
  storageClass: ""
  accessMode: ReadWriteOnce
  size: 10Gi
  annotations: {}
# PostgreSQL dependency
postgresql:
  enabled: false
  auth:
    database: myapp
    username: myapp
    password: changeme
  primary:
    persistence:
      enabled: true
      size: 10Gi
# Redis dependency
redis:
  enabled: false
  auth:
    enabled: false
  master:
    persistence:
      enabled: false
# ServiceMonitor for Prometheus Operator
serviceMonitor:
  enabled: false
  interval: 30s
  scrapeTimeout: 10s
  labels: {}
# Network Policy
networkPolicy:
  enabled: false
  policyTypes:
    - Ingress
    - Egress
  ingress: []
  egress: []

View file

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

View file

@ -0,0 +1,500 @@
# Helm Chart Structure Reference
Complete guide to Helm chart organization, file conventions, and best practices.
## Standard Chart Directory Structure
```
my-app/
├── Chart.yaml              # Chart metadata (required)
├── Chart.lock              # Dependency lock file (generated)
├── values.yaml             # Default configuration values (required)
├── values.schema.json      # JSON schema for values validation
├── .helmignore             # Patterns to ignore when packaging
├── README.md               # Chart documentation
├── LICENSE                 # Chart license
├── charts/                 # Chart dependencies (bundled)
│   └── postgresql-12.0.0.tgz
├── crds/                   # Custom Resource Definitions
│   └── my-crd.yaml
├── templates/              # Kubernetes manifest templates (required)
│   ├── NOTES.txt           # Post-install instructions
│   ├── _helpers.tpl        # Template helper functions
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── serviceaccount.yaml
│   ├── hpa.yaml
│   ├── pdb.yaml
│   ├── networkpolicy.yaml
│   └── tests/
│       └── test-connection.yaml
└── files/                  # Additional files to include
    └── config/
        └── app.conf
```
## Chart.yaml Specification
### API Version v2 (Helm 3+)
```yaml
apiVersion: v2 # Required: API version
name: my-application # Required: Chart name
version: 1.2.3 # Required: Chart version (SemVer)
appVersion: "2.5.0" # Application version
description: A Helm chart for my application # Optional: one-sentence description
type: application # Chart type: application or library
keywords: # Search keywords
  - web
  - api
  - backend
home: https://example.com # Project home page
sources: # Source code URLs
  - https://github.com/example/my-app
maintainers: # Maintainer list
  - name: John Doe
    email: john@example.com
    url: https://github.com/johndoe
icon: https://example.com/icon.png # Chart icon URL
kubeVersion: ">=1.24.0" # Compatible Kubernetes versions
deprecated: false # Mark chart as deprecated
annotations: # Arbitrary annotations
  example.com/release-notes: https://example.com/releases/v1.2.3
dependencies: # Chart dependencies
  - name: postgresql
    version: "12.0.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
    tags:
      - database
    import-values:
      - child: database
        parent: database
    alias: db
```
## Chart Types
### Application Chart
```yaml
type: application
```
- Standard Kubernetes applications
- Can be installed and managed
- Contains templates for K8s resources
### Library Chart
```yaml
type: library
```
- Shared template helpers
- Cannot be installed directly
- Used as dependency by other charts
- Defines only named templates (for example in `_helpers.tpl`); renders no manifests of its own
## Values Files Organization
### values.yaml (defaults)
```yaml
# Global values (shared with subcharts)
global:
  imageRegistry: docker.io
  imagePullSecrets: []
# Image configuration
image:
  registry: docker.io
  repository: myapp/web
  tag: "" # Defaults to .Chart.AppVersion
  pullPolicy: IfNotPresent
# Deployment settings
replicaCount: 1
revisionHistoryLimit: 10
# Pod configuration
podAnnotations: {}
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
# Container security
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
# Service
service:
  type: ClusterIP
  port: 80
  targetPort: http
  annotations: {}
# Resources
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
# Autoscaling
autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 100
  targetCPUUtilizationPercentage: 80
# Node selection
nodeSelector: {}
tolerations: []
affinity: {}
# Monitoring
serviceMonitor:
  enabled: false
  interval: 30s
```
### values.schema.json (validation)
```json
{
  "$schema": "https://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "replicaCount": {
      "type": "integer",
      "minimum": 1
    },
    "image": {
      "type": "object",
      "required": ["repository"],
      "properties": {
        "repository": {
          "type": "string"
        },
        "tag": {
          "type": "string"
        },
        "pullPolicy": {
          "type": "string",
          "enum": ["Always", "IfNotPresent", "Never"]
        }
      }
    }
  },
  "required": ["image"]
}
```
## Template Files
### Template Naming Conventions
- **Lowercase with hyphens**: `deployment.yaml`, `service-account.yaml`
- **Partial templates**: Prefix with underscore `_helpers.tpl`
- **Tests**: Place in `templates/tests/`
- **CRDs**: Place in `crds/` (not templated)
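The naming conventions above can be checked mechanically. The following is a minimal sketch; the function name `is_valid_template_name` is hypothetical, and treating `NOTES.txt` as the only uppercase exception is an assumption drawn from the directory layout shown earlier.

```bash
#!/usr/bin/env bash
# Hypothetical check: lowercase words joined by hyphens, an optional
# leading underscore for partials, and a .yaml or .tpl extension.
is_valid_template_name() {
  local name="$1"
  local re='^_?[a-z0-9]+(-[a-z0-9]+)*\.(yaml|tpl)$'
  if [[ "$name" == "NOTES.txt" ]]; then
    return 0
  fi
  [[ "$name" =~ $re ]]
}

for f in deployment.yaml service-account.yaml _helpers.tpl ServiceAccount.yaml; do
  if is_valid_template_name "$f"; then
    echo "ok   $f"
  else
    echo "bad  $f"
  fi
done
```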
### Common Templates
#### _helpers.tpl
```yaml
{{/*
Standard naming helpers
*/}}
{{- define "my-app.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{- define "my-app.fullname" -}}
{{- if .Values.fullnameOverride -}}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- $name := default .Chart.Name .Values.nameOverride -}}
{{- if contains $name .Release.Name -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- define "my-app.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{/*
Common labels
*/}}
{{- define "my-app.labels" -}}
helm.sh/chart: {{ include "my-app.chart" . }}
{{ include "my-app.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end -}}
{{- define "my-app.selectorLabels" -}}
app.kubernetes.io/name: {{ include "my-app.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end -}}
{{/*
Image name helper
*/}}
{{- define "my-app.image" -}}
{{- $registry := .Values.global.imageRegistry | default .Values.image.registry -}}
{{- $repository := .Values.image.repository -}}
{{- $tag := .Values.image.tag | default .Chart.AppVersion -}}
{{- printf "%s/%s:%s" $registry $repository $tag -}}
{{- end -}}
```
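The `my-app.fullname` helper is easier to follow when its branches are traced with concrete inputs. The sketch below mirrors that logic in plain shell (without the `fullnameOverride` branch); the function name `fullname` is hypothetical. The 63-character cut reflects the DNS label length limit that applies to many Kubernetes resource names.

```bash
#!/usr/bin/env bash
# Hypothetical shell mirror of the "my-app.fullname" template logic.
fullname() {
  local release="$1" chart="$2" full
  if [[ "$release" == *"$chart"* ]]; then
    full="$release"            # release name already contains the chart name
  else
    full="$release-$chart"     # otherwise join them with a hyphen
  fi
  full="${full:0:63}"          # equivalent of: trunc 63
  echo "${full%-}"             # equivalent of: trimSuffix "-"
}

fullname prod my-app
fullname my-app-prod my-app
```

The final trim matters because truncation can leave a trailing hyphen, which is not valid at the end of a DNS label.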
#### NOTES.txt
```
Thank you for installing {{ .Chart.Name }}.
Your release is named {{ .Release.Name }}.
To learn more about the release, try:
$ helm status {{ .Release.Name }}
$ helm get all {{ .Release.Name }}
{{- if .Values.ingress.enabled }}
Application URL:
{{- range $host := .Values.ingress.hosts }}
{{- range .paths }}
  http{{ if $.Values.ingress.tls }}s{{ end }}://{{ $host.host }}{{ .path }}
{{- end }}
{{- end }}
{{- else }}
Get the application URL by running:
  export POD_NAME=$(kubectl get pods --namespace {{ .Release.Namespace }} -l "app.kubernetes.io/name={{ include "my-app.name" . }}" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace {{ .Release.Namespace }} port-forward $POD_NAME 8080:80
  echo "Visit http://127.0.0.1:8080"
{{- end }}
```
## Dependencies Management
### Declaring Dependencies
```yaml
# Chart.yaml
dependencies:
  - name: postgresql
    version: "12.0.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled # Enable/disable via values
    tags: # Group dependencies
      - database
    import-values: # Import values from subchart
      - child: database
        parent: database
    alias: db # Reference as .Values.db
```
### Managing Dependencies
```bash
# Update dependencies
helm dependency update
# List dependencies
helm dependency list
# Build dependencies
helm dependency build
```
### Chart.lock
Generated automatically by `helm dependency update`:
```yaml
dependencies:
  - name: postgresql
    repository: https://charts.bitnami.com/bitnami
    version: 12.0.0
digest: sha256:abcd1234...
generated: "2024-01-01T00:00:00Z"
```
## .helmignore
Exclude files from chart package:
```
# Development files
.git/
.gitignore
*.md
docs/
# Build artifacts
*.swp
*.bak
*.tmp
*.orig
# CI/CD
.travis.yml
.gitlab-ci.yml
Jenkinsfile
# Testing
test/
*.test
# IDE
.vscode/
.idea/
*.iml
```
## Custom Resource Definitions (CRDs)
Place CRDs in `crds/` directory:
```
crds/
├── my-app-crd.yaml
└── another-crd.yaml
```
**Important CRD notes:**
- CRDs are installed before any templates
- CRDs are NOT templated (no `{{ }}` syntax)
- CRDs are NOT upgraded or deleted with chart
- Use `helm install --skip-crds` to skip installation
## Chart Versioning
### Semantic Versioning
- **Chart Version**: Increment when chart changes
  - MAJOR: Breaking changes
  - MINOR: New features, backward compatible
  - PATCH: Bug fixes
- **App Version**: Application version being deployed
  - Can be any string
  - Not required to follow SemVer
```yaml
version: 2.3.1 # Chart version
appVersion: "1.5.0" # Application version
```
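The MAJOR/MINOR/PATCH rules above can be sketched as a small bump helper. This is an illustration only: the function name `bump_version` is hypothetical, and it assumes a plain `X.Y.Z` chart version with no pre-release or build suffix.

```bash
#!/usr/bin/env bash
# Hypothetical helper: bump one part of an X.Y.Z version string.
bump_version() {
  local version="$1" part="$2" major minor patch
  IFS=. read -r major minor patch <<< "$version"
  case "$part" in
    major) echo "$((major + 1)).0.0" ;;                  # breaking change
    minor) echo "${major}.$((minor + 1)).0" ;;           # new feature
    patch) echo "${major}.${minor}.$((patch + 1))" ;;    # bug fix
    *) echo "unknown part: $part" >&2; return 1 ;;
  esac
}

bump_version 2.3.1 minor
```

The result would then be written back to the `version:` field in `Chart.yaml`; `appVersion` is changed independently.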
## Chart Testing
### Test Files
```yaml
# templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "my-app.fullname" . }}-test-connection"
  annotations:
    "helm.sh/hook": test
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  containers:
    - name: wget
      image: busybox
      command: ['wget']
      args: ['{{ include "my-app.fullname" . }}:{{ .Values.service.port }}']
  restartPolicy: Never
```
### Running Tests
```bash
helm test my-release
helm test my-release --logs
```
## Hooks
Helm hooks allow intervention at specific points:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "my-app.fullname" . }}-migration
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
```
### Hook Types
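- `pre-install`: After templates are rendered, before any resources are created
- `post-install`: After all resources are loaded into Kubernetes
- `pre-delete`: Before any resources are deleted
- `post-delete`: After all resources are deleted
- `pre-upgrade`: After templates are rendered, before any resources are updated
- `post-upgrade`: After all resources are upgraded
- `pre-rollback`: After templates are rendered, before any resources are rolled back
- `post-rollback`: After all resources are modified by the rollback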
- `test`: Run with `helm test`
### Hook Weight
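Sets hook execution order. A weight can be any positive or negative integer, written as a string; Helm runs hooks in ascending weight order, so lower weights run first.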
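The ordering can be illustrated with a numeric sort. In this sketch each input line is a hypothetical `weight name` pair, and the function name `order_hooks` is made up for the example; Helm itself also breaks ties by resource kind before name, which this sketch does not model.

```bash
#!/usr/bin/env bash
# Hypothetical illustration: order "weight name" lines the way hook
# weights are ordered, ascending by weight, then by name.
order_hooks() {
  sort -k1,1n -k2,2
}

printf '%s\n' "5 notify-job" "-5 migrate-job" "0 seed-job" | order_hooks
```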
### Hook Deletion Policies
- `before-hook-creation`: Delete previous hook before new one
- `hook-succeeded`: Delete after successful execution
- `hook-failed`: Delete if hook fails
## Best Practices
1. **Use helpers** for repeated template logic
2. **Quote strings** in templates: `{{ .Values.name | quote }}`
3. **Validate values** with values.schema.json
4. **Document all values** in values.yaml
5. **Use semantic versioning** for chart versions
6. **Pin dependency versions** exactly
7. **Include NOTES.txt** with usage instructions
8. **Add tests** for critical functionality
9. **Use hooks** for database migrations
10. **Keep charts focused** - one application per chart
## Chart Repository Structure
```
helm-charts/
├── index.yaml
├── my-app-1.0.0.tgz
├── my-app-1.1.0.tgz
├── my-app-1.2.0.tgz
└── another-chart-2.0.0.tgz
```
### Creating Repository Index
```bash
helm repo index . --url https://charts.example.com
```
## Related Resources
- [Helm Documentation](https://helm.sh/docs/)
- [Chart Template Guide](https://helm.sh/docs/chart_template_guide/)
- [Best Practices](https://helm.sh/docs/chart_best_practices/)

View file

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

View file

@ -0,0 +1,543 @@
# Helm Chart Scaffolding Implementation Playbook
This file contains detailed patterns, checklists, and code samples referenced by the skill.
# Helm Chart Scaffolding
Comprehensive guidance for creating, organizing, and managing Helm charts for packaging and deploying Kubernetes applications.
## Purpose
This skill provides step-by-step instructions for building production-ready Helm charts, including chart structure, templating patterns, values management, and validation strategies.
## When to Use This Skill
Use this skill when you need to:
- Create new Helm charts from scratch
- Package Kubernetes applications for distribution
- Manage multi-environment deployments with Helm
- Implement templating for reusable Kubernetes manifests
- Set up Helm chart repositories
- Follow Helm best practices and conventions
## Helm Overview
**Helm** is the package manager for Kubernetes that:
- Templates Kubernetes manifests for reusability
- Manages application releases and rollbacks
- Handles dependencies between charts
- Provides version control for deployments
- Simplifies configuration management across environments
## Step-by-Step Workflow
### 1. Initialize Chart Structure
**Create new chart:**
```bash
helm create my-app
```
**Standard chart structure:**
```
my-app/
├── Chart.yaml              # Chart metadata
├── values.yaml             # Default configuration values
├── charts/                 # Chart dependencies
├── templates/              # Kubernetes manifest templates
│   ├── NOTES.txt           # Post-install notes
│   ├── _helpers.tpl        # Template helpers
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── serviceaccount.yaml
│   ├── hpa.yaml
│   └── tests/
│       └── test-connection.yaml
└── .helmignore             # Files to ignore
```
### 2. Configure Chart.yaml
**Chart metadata defines the package:**
```yaml
apiVersion: v2
name: my-app
description: A Helm chart for My Application
type: application
version: 1.0.0 # Chart version
appVersion: "2.1.0" # Application version
# Keywords for chart discovery
keywords:
  - web
  - api
  - backend
# Maintainer information
maintainers:
  - name: DevOps Team
    email: devops@example.com
    url: https://github.com/example/my-app
# Source code repository
sources:
  - https://github.com/example/my-app
# Homepage
home: https://example.com
# Chart icon
icon: https://example.com/icon.png
# Dependencies
dependencies:
  - name: postgresql
    version: "12.0.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
  - name: redis
    version: "17.0.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
```
**Reference:** See `assets/Chart.yaml.template` for complete example
### 3. Design values.yaml Structure
**Organize values hierarchically:**
```yaml
# Image configuration
image:
  repository: myapp
  tag: "1.0.0"
  pullPolicy: IfNotPresent
# Number of replicas
replicaCount: 3
# Service configuration
service:
  type: ClusterIP
  port: 80
  targetPort: 8080
# Ingress configuration
ingress:
  enabled: false
  className: nginx
  hosts:
    - host: app.example.com
      paths:
        - path: /
          pathType: Prefix
# Resources
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
# Autoscaling
autoscaling:
  enabled: false
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
# Environment variables
env:
  - name: LOG_LEVEL
    value: "info"
# ConfigMap data
configMap:
  data:
    APP_MODE: production
# Dependencies
postgresql:
  enabled: true
  auth:
    database: myapp
    username: myapp
redis:
  enabled: false
```
**Reference:** See `assets/values.yaml.template` for complete structure
### 4. Create Template Files
**Use Go templating with Helm functions:**
**templates/deployment.yaml:**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "my-app.fullname" . }}
  labels:
    {{- include "my-app.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "my-app.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "my-app.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.targetPort }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          env:
            {{- toYaml .Values.env | nindent 12 }}
```
### 5. Create Template Helpers
**templates/_helpers.tpl:**
```yaml
{{/*
Expand the name of the chart.
*/}}
{{- define "my-app.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Create a default fully qualified app name.
*/}}
{{- define "my-app.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
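{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "my-app.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Common labels
*/}}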
{{- define "my-app.labels" -}}
helm.sh/chart: {{ include "my-app.chart" . }}
{{ include "my-app.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
{{/*
Selector labels
*/}}
{{- define "my-app.selectorLabels" -}}
app.kubernetes.io/name: {{ include "my-app.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
```
### 6. Manage Dependencies
**Add dependencies in Chart.yaml:**
```yaml
dependencies:
  - name: postgresql
    version: "12.0.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
```
**Update dependencies:**
```bash
helm dependency update
helm dependency build
```
**Override dependency values:**
```yaml
# values.yaml
postgresql:
  enabled: true
  auth:
    database: myapp
    username: myapp
    password: changeme
  primary:
    persistence:
      enabled: true
      size: 10Gi
```
### 7. Test and Validate
**Validation commands:**
```bash
# Lint the chart
helm lint my-app/
# Dry-run installation
helm install my-app ./my-app --dry-run --debug
# Template rendering
helm template my-app ./my-app
# Template with values
helm template my-app ./my-app -f ./my-app/values-prod.yaml
# Show the chart's default values
helm show values ./my-app
```
**Validation script:**
```bash
#!/bin/bash
set -e
echo "Linting chart..."
helm lint .
echo "Testing template rendering..."
helm template test-release . > /dev/null
echo "Validating rendered manifests against the cluster API (requires cluster access)..."
helm template test-release . --validate > /dev/null
echo "All validations passed!"
```
**Reference:** See `scripts/validate-chart.sh`
### 8. Package and Distribute
**Package the chart:**
```bash
helm package my-app/
# Creates: my-app-1.0.0.tgz
```
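`helm package` names the archive `<name>-<version>.tgz`, taking both fields from `Chart.yaml`. The sketch below predicts that file name, which can be handy in upload scripts; the function name `chart_package_name` is hypothetical, and it assumes `name:` and `version:` appear as unindented top-level keys.

```bash
#!/usr/bin/env bash
# Hypothetical helper: predict the archive name `helm package` will produce.
chart_package_name() {
  local chart_yaml="$1" name version
  # Match only top-level keys so nested "- name:" entries are ignored.
  name=$(awk '/^name:/ {gsub(/"/, "", $2); print $2; exit}' "$chart_yaml")
  version=$(awk '/^version:/ {gsub(/"/, "", $2); print $2; exit}' "$chart_yaml")
  echo "${name}-${version}.tgz"
}

# Usage, assuming the chart lives in ./my-app:
if [ -f my-app/Chart.yaml ]; then
  chart_package_name my-app/Chart.yaml
fi
```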
**Create chart repository:**
```bash
# Create index
helm repo index .
# Upload to repository
# AWS S3 example
aws s3 sync . s3://my-helm-charts/ --exclude "*" --include "*.tgz" --include "index.yaml"
```
**Use the chart:**
```bash
helm repo add my-repo https://charts.example.com
helm repo update
helm install my-app my-repo/my-app
```
### 9. Multi-Environment Configuration
**Environment-specific values files:**
```
my-app/
├── values.yaml # Defaults
├── values-dev.yaml # Development
├── values-staging.yaml # Staging
└── values-prod.yaml # Production
```
**values-prod.yaml:**
```yaml
replicaCount: 5
image:
  tag: "2.1.0"
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1000m"
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
ingress:
  enabled: true
  hosts:
    - host: app.example.com
      paths:
        - path: /
          pathType: Prefix
postgresql:
  enabled: true
  primary:
    persistence:
      size: 100Gi
```
**Install with environment:**
```bash
helm install my-app ./my-app -f ./my-app/values-prod.yaml --namespace production --create-namespace
```
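Selecting the values file from an environment name can be wrapped in a small helper so that a typo fails early instead of silently installing with defaults. This is a sketch under assumptions: the function name `values_file_for_env` is hypothetical, and the three environment names simply follow the file layout shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: resolve an environment name to its values file.
values_file_for_env() {
  local env="$1" chart_dir="${2:-./my-app}"
  case "$env" in
    dev|staging|prod) echo "${chart_dir}/values-${env}.yaml" ;;
    *) echo "unknown environment: $env" >&2; return 1 ;;
  esac
}

# Print the install command rather than running it.
env_name="prod"
echo "helm install my-app ./my-app -f $(values_file_for_env "$env_name")"
```

Values passed with `-f` are layered on top of the chart's own `values.yaml`, so each environment file only needs the settings that differ.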
### 10. Implement Hooks and Tests
**Pre-install hook:**
```yaml
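# templates/pre-install-job.yaml
# Note: pre-install hooks run before any of this release's resources
# (subcharts included) are created, so the database server must already exist.
# Connection settings (for example PGHOST) are omitted here.
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "my-app.fullname" . }}-db-setup
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      containers:
        - name: db-setup
          image: postgres:15
          command: ["psql", "-c", "CREATE DATABASE myapp"]
      restartPolicy: Never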
```
**Test connection:**
```yaml
# templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "my-app.fullname" . }}-test-connection"
  annotations:
    "helm.sh/hook": test
spec:
  containers:
    - name: wget
      image: busybox
      command: ['wget']
      args: ['{{ include "my-app.fullname" . }}:{{ .Values.service.port }}']
  restartPolicy: Never
```
**Run tests:**
```bash
helm test my-app
```
## Common Patterns
### Pattern 1: Conditional Resources
```yaml
{{- if .Values.ingress.enabled }}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "my-app.fullname" . }}
spec:
  # ...
{{- end }}
```
### Pattern 2: Iterating Over Lists
```yaml
env:
  {{- range .Values.env }}
  - name: {{ .name }}
    value: {{ .value | quote }}
  {{- end }}
```
### Pattern 3: Including Files
```yaml
data:
  config.yaml: |
    {{- .Files.Get "config/application.yaml" | nindent 4 }}
```
### Pattern 4: Global Values
```yaml
global:
  imageRegistry: docker.io
  imagePullSecrets:
    - name: regcred
# Use in templates:
image: {{ .Values.global.imageRegistry }}/{{ .Values.image.repository }}
```
## Best Practices
1. **Use semantic versioning** for chart and app versions
2. **Document all values** in values.yaml with comments
3. **Use template helpers** for repeated logic
4. **Validate charts** before packaging
5. **Pin dependency versions** explicitly
6. **Use conditions** for optional resources
7. **Follow naming conventions** (lowercase, hyphens)
8. **Include NOTES.txt** with usage instructions
9. **Add labels** consistently using helpers
10. **Test installations** in all environments
## Troubleshooting
**Template rendering errors:**
```bash
helm template my-app ./my-app --debug
```
**Dependency issues:**
```bash
helm dependency update
helm dependency list
```
**Installation failures:**
```bash
helm install my-app ./my-app --dry-run --debug
kubectl get events --sort-by='.lastTimestamp'
```
## Reference Files
- `assets/Chart.yaml.template` - Chart metadata template
- `assets/values.yaml.template` - Values structure template
- `scripts/validate-chart.sh` - Validation script
- `references/chart-structure.md` - Detailed chart organization
## Related Skills
- `k8s-manifest-generator` - For creating base Kubernetes manifests
- `gitops-workflow` - For automated Helm chart deployments


@ -0,0 +1,244 @@
#!/bin/bash
set -e
CHART_DIR="${1:-.}"
RELEASE_NAME="test-release"
echo "═══════════════════════════════════════════════════════"
echo " Helm Chart Validation"
echo "═══════════════════════════════════════════════════════"
echo ""
# Colors
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m' # No Color
success() {
    echo -e "${GREEN}✓${NC} $1"
}
warning() {
    echo -e "${YELLOW}⚠${NC} $1"
}
error() {
    echo -e "${RED}✗${NC} $1"
}
# Check if Helm is installed
if ! command -v helm &> /dev/null; then
error "Helm is not installed"
exit 1
fi
echo "📦 Chart directory: $CHART_DIR"
echo ""
# 1. Check chart structure
echo "1⃣ Checking chart structure..."
if [ ! -f "$CHART_DIR/Chart.yaml" ]; then
error "Chart.yaml not found"
exit 1
fi
success "Chart.yaml exists"
if [ ! -f "$CHART_DIR/values.yaml" ]; then
error "values.yaml not found"
exit 1
fi
success "values.yaml exists"
if [ ! -d "$CHART_DIR/templates" ]; then
error "templates/ directory not found"
exit 1
fi
success "templates/ directory exists"
echo ""
# 2. Lint the chart
echo "2⃣ Linting chart..."
if helm lint "$CHART_DIR"; then
success "Chart passed lint"
else
error "Chart failed lint"
exit 1
fi
echo ""
# 3. Check Chart.yaml
echo "3⃣ Validating Chart.yaml..."
CHART_NAME=$(grep "^name:" "$CHART_DIR/Chart.yaml" | awk '{print $2}')
CHART_VERSION=$(grep "^version:" "$CHART_DIR/Chart.yaml" | awk '{print $2}')
APP_VERSION=$(grep "^appVersion:" "$CHART_DIR/Chart.yaml" | awk '{print $2}' | tr -d '"')
if [ -z "$CHART_NAME" ]; then
error "Chart name not found"
exit 1
fi
success "Chart name: $CHART_NAME"
if [ -z "$CHART_VERSION" ]; then
error "Chart version not found"
exit 1
fi
success "Chart version: $CHART_VERSION"
if [ -z "$APP_VERSION" ]; then
warning "App version not specified"
else
success "App version: $APP_VERSION"
fi
echo ""
# 4. Test template rendering
echo "4⃣ Testing template rendering..."
if helm template "$RELEASE_NAME" "$CHART_DIR" > /dev/null 2>&1; then
success "Templates rendered successfully"
else
error "Template rendering failed"
helm template "$RELEASE_NAME" "$CHART_DIR"
exit 1
fi
echo ""
# 5. Dry-run installation
echo "5⃣ Testing dry-run installation..."
if helm install "$RELEASE_NAME" "$CHART_DIR" --dry-run --debug > /dev/null 2>&1; then
success "Dry-run installation successful"
else
error "Dry-run installation failed"
exit 1
fi
echo ""
# 6. Check for required Kubernetes resources
echo "6⃣ Checking generated resources..."
MANIFESTS=$(helm template "$RELEASE_NAME" "$CHART_DIR")
if echo "$MANIFESTS" | grep -q "kind: Deployment"; then
success "Deployment found"
else
warning "No Deployment found"
fi
if echo "$MANIFESTS" | grep -q "kind: Service"; then
success "Service found"
else
warning "No Service found"
fi
if echo "$MANIFESTS" | grep -q "kind: ServiceAccount"; then
success "ServiceAccount found"
else
warning "No ServiceAccount found"
fi
echo ""
# 7. Check for security best practices
echo "7⃣ Checking security best practices..."
if echo "$MANIFESTS" | grep -q "runAsNonRoot: true"; then
success "Running as non-root user"
else
warning "Not explicitly running as non-root"
fi
if echo "$MANIFESTS" | grep -q "readOnlyRootFilesystem: true"; then
success "Using read-only root filesystem"
else
warning "Not using read-only root filesystem"
fi
if echo "$MANIFESTS" | grep -q "allowPrivilegeEscalation: false"; then
success "Privilege escalation disabled"
else
warning "Privilege escalation not explicitly disabled"
fi
echo ""
# 8. Check for resource limits
echo "8⃣ Checking resource configuration..."
if echo "$MANIFESTS" | grep -q "resources:"; then
    if echo "$MANIFESTS" | grep -q "limits:"; then
        success "Resource limits defined"
    else
        warning "No resource limits defined"
    fi
    if echo "$MANIFESTS" | grep -q "requests:"; then
        success "Resource requests defined"
    else
        warning "No resource requests defined"
    fi
else
    warning "No resources defined"
fi
echo ""
# 9. Check for health probes
echo "9⃣ Checking health probes..."
if echo "$MANIFESTS" | grep -q "livenessProbe:"; then
success "Liveness probe configured"
else
warning "No liveness probe found"
fi
if echo "$MANIFESTS" | grep -q "readinessProbe:"; then
success "Readiness probe configured"
else
warning "No readiness probe found"
fi
echo ""
# 10. Check dependencies
if [ -f "$CHART_DIR/Chart.yaml" ] && grep -q "^dependencies:" "$CHART_DIR/Chart.yaml"; then
    echo "🔟 Checking dependencies..."
    if helm dependency list "$CHART_DIR" > /dev/null 2>&1; then
        success "Dependencies valid"
        if [ -f "$CHART_DIR/Chart.lock" ]; then
            success "Chart.lock file present"
        else
            warning "Chart.lock file missing (run 'helm dependency update')"
        fi
    else
        error "Dependencies check failed"
        # Stop here so the summary does not report success after a failed check
        exit 1
    fi
    echo ""
fi
# 11. Check for values schema
if [ -f "$CHART_DIR/values.schema.json" ]; then
    echo "1⃣1⃣ Validating values schema..."
    success "values.schema.json present"
    # Validate schema if jq is available
    if command -v jq &> /dev/null; then
        if jq empty "$CHART_DIR/values.schema.json" 2>/dev/null; then
            success "values.schema.json is valid JSON"
        else
            error "values.schema.json contains invalid JSON"
            exit 1
        fi
    fi
    echo ""
fi
# Summary
echo "═══════════════════════════════════════════════════════"
echo " Validation Complete!"
echo "═══════════════════════════════════════════════════════"
echo ""
echo "Chart: $CHART_NAME"
echo "Version: $CHART_VERSION"
if [ -n "$APP_VERSION" ]; then
echo "App Version: $APP_VERSION"
fi
echo ""
success "All validations passed!"
echo ""
echo "Next steps:"
echo " • helm package $CHART_DIR"
echo " • helm install my-release $CHART_DIR"
echo " • helm test my-release"
echo ""


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->


@ -0,0 +1,228 @@
---
name: historical-pattern-analysis
description: Use when analyzing git history and past changes to identify patterns, recurring issues, and lessons learned from infrastructure changes.
---
# Historical Pattern Analysis
## Overview
Analyze git history and memory to learn from past infrastructure changes. Identify patterns, recurring issues, and apply lessons learned to current work.
**Announce at start:** "I'm using the historical-pattern-analysis skill to learn from past changes."
## When to Use
- Before making changes similar to past changes
- When investigating recurring issues
- To understand why infrastructure is configured a certain way
- To identify change patterns and team practices
## Process
### Step 1: Define Search Scope
Determine what history to analyze:
- Specific resources being changed
- Time period (last month, quarter, year)
- Specific team members or patterns
### Step 2: Git Archaeology
#### Find Related Commits
```bash
# Commits touching specific files
git log --oneline -20 -- "path/to/module/*.tf"
# Commits mentioning resource types
git log --oneline -20 --grep="aws_security_group"
# Commits by pattern in message
git log --oneline -20 --grep="fix\|rollback\|revert"
# Commits in date range
git log --oneline --since="2024-01-01" --until="2024-06-01" -- "*.tf"
```
#### Analyze Commit Patterns
```bash
# Most frequently changed files
git log --pretty=format: --name-only -- "*.tf" | sed '/^$/d' | sort | uniq -c | sort -rn | head -20
# Authors and their focus areas
git shortlog -sn -- "environments/prod/"
# Change frequency by day/time
git log --format="%ad" --date=format:"%A %H:00" -- "*.tf" | sort | uniq -c
```
#### Find Reverts and Fixes
```bash
# Revert commits
git log --oneline --grep="revert\|Revert"
# Fix commits following changes
git log --oneline --grep="fix\|hotfix\|Fix"
# Commits with "URGENT" or "EMERGENCY"
git log --oneline --grep="urgent\|emergency" -i
```
### Step 3: Analyze Change Patterns
#### Coupling Analysis
Which files change together?
```bash
# For a specific file, what else changes with it?
git log --pretty=format:"%H" -- "modules/vpc/main.tf" | \
  xargs -I {} git show --name-only --pretty=format: {} | \
  sed '/^$/d' | sort | uniq -c | sort -rn | head -20
```
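The pipeline above yields raw co-occurrence counts. To turn those into a coupling percentage like the ones used in the report template, a minimal Python sketch follows. It assumes the per-commit file lists have already been extracted (for example from `git log --name-only`); the function name `coupling_ratios` and the sample history are hypothetical.

```python
from collections import Counter


def coupling_ratios(commits, target):
    """Return {other_file: share of target's commits that also touched other_file}.

    `commits` is a list of file-path lists, one list per commit (assumed input).
    """
    touching = [set(files) for files in commits if target in files]
    if not touching:
        return {}
    counts = Counter()
    for files in touching:
        counts.update(files - {target})
    return {path: n / len(touching) for path, n in counts.items()}


# Hypothetical history: four commits, three of which touch the VPC module.
sample = [
    ["modules/vpc/main.tf", "modules/sg/main.tf"],
    ["modules/vpc/main.tf", "modules/sg/main.tf", "README.md"],
    ["modules/vpc/main.tf"],
    ["modules/rds/main.tf"],
]
ratios = coupling_ratios(sample, "modules/vpc/main.tf")
for path, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{path}: {ratio:.0%}")
```

A ratio is only meaningful when the target file has enough commits behind it; with a handful of commits, treat the percentage as a hint rather than a measurement.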
#### Change Sequences
Common sequences of changes:
1. VPC changes → followed by security group changes
2. IAM role changes → followed by policy attachments
3. RDS changes → followed by parameter group changes
#### Time Patterns
- Are prod changes clustered on certain days?
- Are there "risky" times based on past incidents?
- How long between staging and prod deployments?
### Step 4: Query Memory
Check stored patterns:
```
memory/projects/<hash>/patterns.json
memory/projects/<hash>/incidents.json
```
Look for:
- Similar past changes and outcomes
- Known issues with these resources
- User preferences for this type of change
### Step 5: Identify Lessons
#### From Incidents
For each past incident:
- What was the trigger?
- How was it detected?
- What was the fix?
- What could have prevented it?
#### From Patterns
- What changes tend to cause problems?
- What practices lead to success?
- What review processes work well?
### Step 6: Generate Report
```markdown
## Historical Pattern Analysis
### Search Scope
- Resources: [resources being analyzed]
- Time period: [date range]
- Related commits found: [count]
### Change Frequency
| Resource/File | Changes (90d) | Last Changed | Primary Authors |
|--------------|---------------|--------------|-----------------|
| modules/vpc/main.tf | 12 | 2024-01-10 | alice, bob |
| environments/prod/main.tf | 8 | 2024-01-08 | alice |
### Change Coupling
These resources typically change together:
1. `aws_security_group.web` and `aws_instance.web` (85% correlation)
2. `aws_iam_role.app` and `aws_iam_policy.app` (100% correlation)
### Past Incidents Related to These Resources
#### Incident: [Date] - [Title]
- **Trigger:** [What caused it]
- **Impact:** [What happened]
- **Resolution:** [How it was fixed]
- **Lesson:** [What we learned]
- **Relevance:** [How this applies to current change]
### Patterns Identified
#### Pattern: [Pattern Name]
- **Observation:** [What we see in history]
- **Frequency:** [How often]
- **Implication:** [What this means for current change]
### Risk Indicators
Based on historical data:
| Indicator | Current Change | Historical Issues |
|-----------|---------------|-------------------|
| Similar to past incident | [Yes/No] | [Details] |
| Frequently problematic resource | [Yes/No] | [Details] |
| Changed by unfamiliar author | [Yes/No] | [Details] |
### Recommendations
Based on historical patterns:
1. [Recommendation 1]
2. [Recommendation 2]
### Questions Raised
[Questions that history suggests we should answer]
```
### Step 7: Update Memory
Store new patterns discovered:
```json
{
  "patterns": [
    {
      "name": "vpc-sg-coupling",
      "description": "VPC changes often require SG updates",
      "confidence": 0.85,
      "last_seen": "2024-01-15"
    }
  ]
}
```
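One way to keep the stored patterns free of duplicates is to replace an existing entry when a pattern with the same `name` is seen again. The sketch below assumes that replace-by-name policy, which the skill itself does not specify; `upsert_pattern` is a hypothetical helper that works on the parsed JSON structure shown above.

```python
import json


def upsert_pattern(memory, pattern):
    """Return a copy of `memory` with `pattern` added, replacing any same-named entry."""
    kept = [p for p in memory.get("patterns", []) if p["name"] != pattern["name"]]
    return {**memory, "patterns": kept + [pattern]}


# Hypothetical earlier state of patterns.json, already loaded with json.load().
memory = {
    "patterns": [
        {
            "name": "vpc-sg-coupling",
            "description": "VPC changes often require SG updates",
            "confidence": 0.80,
            "last_seen": "2024-01-01",
        }
    ]
}
updated = upsert_pattern(
    memory,
    {
        "name": "vpc-sg-coupling",
        "description": "VPC changes often require SG updates",
        "confidence": 0.85,
        "last_seen": "2024-01-15",
    },
)
print(json.dumps(updated, indent=2))
```

Returning a new dict instead of mutating the input keeps the previous state available for comparison before anything is written back to disk.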
## Common Patterns to Look For
### Positive Patterns
- Consistent naming conventions
- Regular, small changes vs. big-bang updates
- Changes preceded by plan review
- Post-change validation
### Warning Patterns
- Frequent reverts
- Emergency fixes following changes
- Clustered failures in specific areas
- "Temporary" changes that persist
### Anti-Patterns
- Direct prod changes without staging
- Large changes without incremental steps
- Missing documentation on complex changes
- Recurring manual interventions
## Integration with Other Skills
This skill feeds into:
- **terraform-plan-review**: Provides historical context for risk assessment
- **terraform-drift-detection**: Identifies if drift matches past patterns
- **provider-upgrade-analysis**: Shows past upgrade experiences


@ -0,0 +1,145 @@
---
name: home-assistant-automation
description: Use when writing, editing, or debugging Home Assistant automations or scripts for Zoe's HA instance at 10.0.2.6:8123. Covers entity discovery, modern YAML syntax, automation/script patterns, and live MCP testing.
---
# Home Assistant Automation
## Overview
Write automations and scripts for Zoe's HA instance. You have live MCP access — use it. **Never guess entity IDs.** Always discover them first.
## HARD REQUIREMENT: Discover Entities Before Writing YAML
```
GetLiveContext BEFORE any YAML. No exceptions.
```
```python
# By domain
GetLiveContext(domain="light")
GetLiveContext(domain="media_player")
GetLiveContext(domain="siren")
# By area
GetLiveContext(area="living room")
GetLiveContext(area="office")
# By name (specific)
GetLiveContext(name="doorbell")
GetLiveContext(name="chime")
```
Entity IDs drift and vary. If you write YAML without checking, it will break.
## Known Devices (verify with GetLiveContext before use)
| Device | Domain hint | Notes |
|--------|------------|-------|
| Amcrest AD410 doorbell | `binary_sensor` | Button press trigger |
| Living room chime | `siren.living_room_chime_play_tone` | Use `siren.turn_on` |
| Office chime | `siren.office_chime_play_tone` | Use `siren.turn_on` |
| Side door lock | `select` | Lock timing entity |
| Apple TV | `media_player` | Used for kiosk display dimming |
| Raspberry Pi kiosk | family room dashboard | |
| Season sensor | `sensor.season` | |
## Modern YAML Syntax (2024.10+)
Use **plural keys** for all top-level blocks:
```yaml
alias: "Descriptive name"
description: "What this does"
triggers:        # NOT trigger:
  - ...
conditions:      # NOT condition:
  - ...
actions:         # NOT action:
  - action: ...  # service calls inside actions use "action:" key, NOT "service:"
mode: single
```
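When editing an older automation, it can help to check mechanically for the legacy keys listed above. The following is a minimal sketch, assuming the automation has already been parsed from YAML into a Python dict; the helper names and the sample automation are hypothetical, and only top-level action steps are inspected.

```python
# Legacy singular keys and their modern plural replacements (from the block above).
LEGACY_TO_MODERN = {"trigger": "triggers", "condition": "conditions", "action": "actions"}


def legacy_top_level_keys(automation):
    """Map each legacy top-level key found in `automation` to its modern name."""
    return {old: new for old, new in LEGACY_TO_MODERN.items() if old in automation}


def steps_using_service(automation):
    """Return indexes of top-level action steps that still use the `service:` key."""
    return [i for i, step in enumerate(automation.get("actions", [])) if "service" in step]


# Hypothetical automation, already parsed from YAML into a dict.
automation = {
    "alias": "Doorbell chime",
    "trigger": [{"trigger": "state", "entity_id": "binary_sensor.doorbell_button", "to": "on"}],
    "actions": [{"service": "siren.turn_on"}, {"delay": "00:00:02"}],
}
print(legacy_top_level_keys(automation))
print(steps_using_service(automation))
```

Nested sequences (for example inside `choose`) are not walked here; a fuller check would recurse into them.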
## Common Trigger Patterns
```yaml
# State change with debounce
- trigger: state
  entity_id: binary_sensor.doorbell_button
  to: "on"
  for: "00:00:02"
# Time
- trigger: time
  at: "07:00:00"
# Sun offset
- trigger: sun
  event: sunset
  offset: "+00:30:00"
# Template
- trigger: template
  value_template: "{{ states('sensor.season') == 'winter' }}"
```
## Common Action Patterns
```yaml
# Light with brightness/color temp
- action: light.turn_on
  target:
    entity_id: light.living_room
  data:
    brightness_pct: 80
    color_temp_kelvin: 3000
# Play chime (siren domain, turn_on action)
- action: siren.turn_on
  target:
    entity_id: siren.living_room_chime_play_tone
# Conditional branch
- choose:
    - conditions:
        - condition: state
          entity_id: sun.sun
          state: above_horizon
      sequence:
        - action: light.turn_on
          target:
            area_id: living_room
  default:
    - action: light.turn_off
      target:
        area_id: living_room
# Delay
- delay: "00:05:00"
# Notify
- action: notify.notify
  data:
    message: "Someone at the door"
```
## Automation vs Script
- **Automation:** triggered by events/state/time — reactive behavior
- **Script:** called manually or from other automations — reusable action sequences
## Testing
1. **Verify entity exists:** `GetLiveContext(name="whatever")` — confirm state and ID
2. **Quick device test:** Use MCP action tools directly before writing YAML
- `HassTurnOn`, `HassLightSet`, `HassSetVolume`, etc.
3. **Test automation:** Paste YAML in HA UI → Settings → Automations → + → Edit in YAML → Run
## Gotchas
- Entity IDs are case-sensitive, use underscores
- `area_id` in `target:` works for lights; not reliable for all domains
- Chimes use `siren` domain — action is `siren.turn_on`, not `siren.play_tone`
- `mode: single` blocks re-entry; use `restart` if you want it to restart mid-run
- Apple TV dimming: check `media_player` state before acting on it
- Template syntax: `{{ states('sensor.foo') }}` — never `states.sensor.foo`


@ -0,0 +1,168 @@
---
name: incident-response
description: Use when responding to production outages, data loss events, security incidents, or major service degradations in homelab (k3s/ansiblestack) or professional (AWS/EKS) environments. Applies at any severity — P1 complete outages to P4 minor issues.
---
# Incident Response
## Overview
Structured response for production incidents. Severity scales the rigor. Homelab P3 is not work P1.
**Core principle:** Stabilize user impact FIRST. Understand why SECOND. Never diagnose in silence.
## Severity
| Severity | Definition | Response SLA | Examples |
|----------|------------|--------------|---------|
| P1 | Complete outage OR data loss OR security breach | Immediate (minutes) | Prod DB down, credentials leaked, all users blocked |
| P2 | Major degradation, SLA at risk, significant user impact | Urgent (< 30 min) | 50%+ error rate, primary feature broken |
| P3 | Partial degradation, workaround exists | Same day | One region/service slow, single feature broken |
| P4 | Minor issue, no user impact | Within days | Monitoring gap, cosmetic issue |
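
The table can be read as a top-down decision: the first row whose definition matches wins. A minimal sketch of that reading follows. The flag names and the function `classify_severity` are hypothetical, and the mapping is one interpretation of the table rather than a fixed rule.

```python
def classify_severity(complete_outage=False, data_loss=False, security_breach=False,
                      major_degradation=False, user_impact=False):
    """Return "P1".."P4" using one possible reading of the severity table."""
    if complete_outage or data_loss or security_breach:
        return "P1"
    if major_degradation:
        return "P2"
    if user_impact:
        return "P3"  # partial degradation; a workaround is assumed to exist
    return "P4"      # minor issue, no user impact


print(classify_severity(security_breach=True))    # leaked credentials
print(classify_severity(major_degradation=True))  # 50%+ error rate
print(classify_severity(user_impact=True))        # single feature broken
print(classify_severity())                        # monitoring gap
```

When in doubt between two levels, the higher one is the safer starting point, since a declared incident can always be downgraded later.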
## Phase 1: Triage (first 5-10 minutes)
Goal: confirm the incident, assess severity, start communication.
```
1. CONFIRM — is this actually broken?
- Check from multiple locations/devices
- Check AWS Status / DigitalOcean Status / upstream providers
- Ask: is anyone else seeing this?
2. SCOPE — who/what is affected?
- Which services? Which regions? Which users?
- Is data being lost RIGHT NOW?
- Stable or getting worse?
3. DECLARE — P1/P2: declare immediately, don't wait to diagnose
- Work: post in incident channel, page on-call, open incident ticket
- Homelab: create Vikunja task, start BookStack incident page
4. ASSIGN ROLES (work P1/P2)
- Incident Commander: coordinates, communicates, makes calls
- Tech Lead: root cause investigation
- Comms Lead: stakeholder updates
- (Homelab: you're all three)
```
## Phase 2: Stabilize (before root cause)
Fix user impact first. Common actions:
```bash
# Roll back last deployment
kubectl rollout undo deployment/<name> -n <ns>
# Scale up healthy replicas
kubectl scale deploy/<name> --replicas=5 -n <ns>
# Check rollout history
kubectl rollout history deployment/<name> -n <ns>
```
Other mitigations:
- Route traffic away from broken region/AZ
- Disable the broken feature flag
- Restore from backup (data loss)
- Rotate credentials (security incident)
**A rollback that takes 5 minutes beats a fix that takes 2 hours.**
## Phase 3: Investigate (root cause)
Now that users are unblocked:
```bash
# Recent events
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -30
# Logs (kubectl)
kubectl logs -n <ns> deploy/<name> --since=1h
# Logs (Grafana Loki)
{namespace="<ns>"}
# Describe node for resource pressure
kubectl describe node <name>
```
For AWS: CloudTrail, CloudWatch Logs, ALB access logs, X-Ray traces.
Check Grafana Mimir for the anomaly timestamp — find the inflection point.
## Phase 4: Resolve
1. Deploy actual fix (not just the stabilization mitigation)
2. Verify service is healthy — not just "pods are running":
- Check error rates in Grafana
- Check latency is normal
- Spot-check actual user flows
3. Monitor 15-30 minutes before declaring resolved
## Phase 5: Communicate
**During incident (P1/P2 — every 15-30 minutes):**
```
[14:32 UTC] INCIDENT UPDATE — <service> degradation
Status: Investigating
Impact: <X users/services affected>
Last action: Rolled back deployment v1.2.3
Next update: 14:47 UTC
```
**On resolution:**
```
[15:10 UTC] RESOLVED — <service> is operational
Duration: 38 minutes (14:32-15:10 UTC)
Root cause: <brief description>
Fix applied: <what was done>
Postmortem: <link or "to follow within 48h">
```
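The "Next update" and "Duration" fields in these templates are simple time arithmetic, and they are easy to get wrong under pressure. A small sketch, assuming a 15-minute update cadence; the helper names are hypothetical and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta


def next_update(now, interval_minutes=15):
    """Return the time at which the next status update is due."""
    return now + timedelta(minutes=interval_minutes)


def duration_minutes(start, end):
    """Return the incident duration in whole minutes."""
    return int((end - start).total_seconds() // 60)


# Times taken from the templates above; the date itself is arbitrary.
started = datetime(2024, 1, 1, 14, 32)
resolved = datetime(2024, 1, 1, 15, 10)

print(next_update(started).strftime("Next update: %H:%M UTC"))
print(f"Duration: {duration_minutes(started, resolved)} minutes")
```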
**Work P1: never go silent for > 15 minutes. Communicate first, diagnose second.**
## Phase 6: Post-Incident
- Within 24-48h: write postmortem (use `writing-postmortem` skill if available)
- Update runbooks with anything that was missing
- Create Vikunja tasks for action items
- Save incident timeline to BookStack
## Security Incidents: Extra Steps
Order matters — don't skip ahead:
1. **ISOLATE** — kill or network-isolate the compromised resource before investigating
2. **PRESERVE** — snapshot, export logs before destroying anything
3. **ROTATE** — all potentially exposed credentials immediately
4. **NOTIFY** — security team, CISO, legal as appropriate
5. **SCOPE before disclosing** — do not announce publicly until you understand blast radius
GDPR: personal data breaches generally must be reported to the supervisory authority within 72 hours of becoming aware of them.
## Homelab Specifics
- Create Vikunja task in relevant project when declaring
- Document timeline in BookStack: `Ansiblestack` book → new page `Incident YYYY-MM-DD: <title>`
- No stakeholder comms needed, but still write the postmortem — future-you will thank you
## Common Homelab Incidents
| Incident | Quick fix |
|----------|-----------|
| OpenBao sealed | `kubectl exec -n openbao openbao-0 -- bao status` — should auto-unseal via OCI KMS; check OCI KMS key status if not |
| ArgoCD all apps OutOfSync | Check Forgejo is reachable; check ArgoCD repo credentials |
| cert-manager not issuing | Check DNS propagation; check DigitalOcean token; check cert-manager pod logs |
| NFS storage unavailable | Check NFS server at 10.0.6.2; check pods in `nfs-provisioner` namespace |
| All pods evicted | Node disk pressure — `kubectl describe node <name>`, check disk usage |
## Common Mistakes
| Mistake | Reality |
|---------|---------|
| Diagnosing in silence for 30+ minutes | Communicate first, even with "investigating" |
| Fixing before declaring | Declaration triggers backup/support; don't skip it |
| Declaring resolved before monitoring | Check error rates and latency, not just pod status |
| Investigating before stabilizing | Users are down while you read logs. Roll back first. |
| Skipping postmortem on homelab | You will hit this again. Write it down. |


@ -0,0 +1,228 @@
---
name: investigating-cluster-issue
description: Use when debugging Kubernetes issues on Zoe's homelab k3s cluster (k3s v1.35, Cilium, Traefik, ArgoCD, OpenBao, Grafana stack) or on AWS EKS clusters — pod failures, sync errors, networking problems, storage issues, node failures, or any unexpected cluster behavior.
---
# Investigating Cluster Issues
## Overview
Systematic triage for Kubernetes problems. Always run Level 1 first to establish ground truth before narrowing down. Resist the urge to jump straight to logs — node and pod status often reveals the real problem faster.
## Environment Reference
**k3s homelab:**
- Nodes: master-01/02/03, worker-01/02, gpu-node
- CNI: Cilium | Ingress: Traefik | GitOps: ArgoCD (`argocd.ctz.fyi`)
- Secrets: External Secrets Operator + OpenBao (`bao.ctz.fyi`)
- Monitoring: Grafana (`grafana.monitoring.ctz.fyi`) — Mimir, Loki, Tempo
- Storage: `ssd` (NFS), `local-path`
- Registry: Harbor (`registry.ctz.fyi`)
- Key namespaces: `argocd`, `monitoring`, `keycloak`, `external-secrets`, `cert-manager`, `traefik`, `openbao`
**EKS:**
- Addons: aws-load-balancer-controller, external-dns, cluster-autoscaler, kube-prometheus-stack
- Storage: EBS CSI (`gp3` preferred), EFS for shared
- Auth: IRSA for pod AWS access
- Networking: aws-vpc-cni or Cilium + Calico network policies
---
## Quick Reference: Symptom → First Command
| Symptom | First command |
|---------|--------------|
| Pod stuck `Pending` | `kubectl describe pod <pod> -n <ns>` → check Events |
| `CrashLoopBackOff` | `kubectl logs <pod> -n <ns> --previous` |
| `ImagePullBackOff` | `kubectl describe pod <pod> -n <ns>` → check image + secret |
| Secret not available | `kubectl get externalsecret -n <ns>` |
| ArgoCD sync failing | `kubectl get application <name> -n argocd -o yaml` → `.status.conditions` |
| TLS cert not issuing | `kubectl get certificate -n <ns>` |
| Node not Ready | `kubectl describe node <name>` → Events + Conditions |
| EKS ALB not creating | `kubectl describe ingress <name> -n <ns>` → check controller logs |
| Cluster-wide chaos | `kubectl get events -A --sort-by='.lastTimestamp' \| tail -30` |
| Not sure where to start | Run all three Level 1 commands |
---
## Level 1 — Immediate Triage (always run first)
```bash
kubectl get nodes -o wide
kubectl get pods -A | grep -Ev '(Running|Completed)'
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
```
Read the events output carefully — it frequently names the exact problem.
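
If the pod listing is long, the same filter can be applied to captured output. The sketch below assumes the default column layout of `kubectl get pods -A` (namespace, name, ready, status) and matches on the STATUS column only, which is slightly stricter than the `grep` above. The function name and the sample output are hypothetical.

```python
def unhealthy_pods(kubectl_output):
    """Return (namespace, name, status) for pods that are not Running or Completed."""
    rows = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 4:
            continue
        namespace, name, _ready, status = cols[:4]
        if status not in ("Running", "Completed"):
            rows.append((namespace, name, status))
    return rows


# Hypothetical captured output of `kubectl get pods -A`.
sample_output = """
NAMESPACE    NAME       READY   STATUS             RESTARTS   AGE
argocd       server-0   1/1     Running            0          3d
monitoring   loki-0     0/1     CrashLoopBackOff   12         1h
default      job-abc    0/1     Completed          0          2h
"""
for namespace, name, status in unhealthy_pods(sample_output):
    print(f"{namespace}/{name}: {status}")
```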
---
## Level 2 — Narrow to Failing Resource
```bash
kubectl describe pod <name> -n <ns> # Events section is the most useful part
kubectl logs <pod> -n <ns> --previous # If pod restarted
kubectl logs <pod> -n <ns> -c <container> # Multi-container pods
```
---
## Level 3 — Root Causes by Symptom
### Pod stuck `Pending`
1. Check describe events for `FailedScheduling` — resource constraints, taints/tolerations, affinity rules
2. Check PVCs: `kubectl get pvc -n <ns>`
- **k3s:** If PVC Pending, check NFS provisioner: `kubectl get pods -n nfs-provisioner`
- **EKS:** Check EBS CSI driver: `kubectl get pods -n kube-system -l app=ebs-csi-controller`; verify IRSA annotation on ServiceAccount
### `CrashLoopBackOff`
1. `kubectl logs <pod> --previous` — look for panic, missing env var, missing file, bad config
2. Check ExternalSecret synced: `kubectl get externalsecret -n <ns>` (`SecretSyncedError` is common)
3. Check dependent services (DB, cache, upstream API)
4. **k3s ArgoCD:** Check sync-wave ordering — ExternalSecret must have lower wave number than Deployment
### ArgoCD sync failing (k3s)
```bash
kubectl get application <name> -n argocd -o yaml # .status.conditions
kubectl get application <name> -n argocd -o jsonpath='{.status.operationState.message}'
```
- **OutOfSync on immutable field** → manually delete the resource, then re-sync
- **ExternalSecret missing** → check OpenBao (see below)
- Force refresh without sync: ArgoCD UI → hard refresh, or:
```bash
kubectl annotate application <name> -n argocd argocd.argoproj.io/refresh=hard
```
### External Secrets not syncing
```bash
kubectl describe externalsecret <name> -n <ns> # .status.conditions
kubectl get clustersecretstore openbao -o yaml # check Ready condition
kubectl exec -n openbao openbao-0 -- bao status # check sealed/unsealed
```
- **OpenBao sealed:** Normally auto-unseals via OCI KMS. If stuck:
```bash
kubectl exec -it -n openbao openbao-0 -- bao operator unseal   # -it: the command prompts for a key
```
- **ClusterSecretStore not Ready:** Check the ESO controller logs:
```bash
kubectl logs -n external-secrets deploy/external-secrets -f
```
### `ImagePullBackOff`
```bash
kubectl describe pod <name> -n <ns> # look for "401 Unauthorized" or "not found"
```
- Wrong image tag → fix in manifest/values
- Missing `imagePullSecret` → verify secret exists: `kubectl get secret -n <ns>`
- **k3s Harbor auth:** Ensure secret references `registry.ctz.fyi` and is attached to ServiceAccount or pod spec
- Registry unreachable → check Harbor pod health: `kubectl get pods -n harbor`
### IngressRoute / TLS not working (k3s)
```bash
kubectl get certificate -n <ns> # Ready=False = problem
kubectl describe certificate <name> -n <ns> # check Events
kubectl get ingressroute -n <ns>
kubectl get ingress -n <ns> # cert-manager needs a standard Ingress to issue
```
- cert-manager needs a standard `Ingress` resource alongside `IngressRoute` — if missing, cert won't issue
- Check Traefik pods: `kubectl get pods -n traefik`
### EKS — Node not joining
```bash
kubectl get configmap aws-auth -n kube-system -o yaml # verify node IAM role mapped
# On the node:
journalctl -u kubelet -n 100
```
- Check security groups: nodes need port 443 outbound to control plane endpoint
- Check node IAM role has `AmazonEKSWorkerNodePolicy`, `AmazonEKS_CNI_Policy`, `AmazonEC2ContainerRegistryReadOnly`
### EKS — ALB/NLB not creating
```bash
kubectl describe ingress <name> -n <ns>
kubectl logs -n kube-system deploy/aws-load-balancer-controller | tail -50
```
- Verify the ingress class: `spec.ingressClassName: alb`, or the older `kubernetes.io/ingress.class: alb` annotation
- Check IRSA: ServiceAccount must have `eks.amazonaws.com/role-arn` annotation
- Check controller has correct IAM permissions (policy document)
---
## Level 4 — System-Level Checks
```bash
# k3s control plane (componentstatuses is deprecated since Kubernetes v1.19; output may be incomplete)
kubectl get componentstatuses
# On master nodes:
systemctl status k3s
# Cilium (k3s)
kubectl -n kube-system exec ds/cilium -- cilium status
kubectl -n kube-system get pods -l k8s-app=cilium
# Resource pressure (both environments)
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
# EKS cluster info
aws eks describe-cluster --name <cluster> --region <region>
```
---
## Level 5 — Logs via Grafana (k3s)
Grafana: `grafana.monitoring.ctz.fyi`
**Loki log queries:**
```
{namespace="<ns>"}
{namespace="<ns>", app="<name>"} |= "error"
{namespace="<ns>"} | logfmt | level="error"
```
**Mimir (metrics):** Check CPU/memory graphs around the time of failure — spikes often correlate with OOMKills or throttling that don't appear in kubectl describe.
---
## Live Debugging Inside a Container
```bash
kubectl exec -it <pod> -n <ns> -- /bin/sh
# or if bash available:
kubectl exec -it <pod> -n <ns> -- bash
# multi-container:
kubectl exec -it <pod> -n <ns> -c <container> -- /bin/sh
```
Use for: verifying env vars, testing connectivity (`curl`, `wget`, `nslookup`), checking mounted files.
---
## Restart vs Dig Deeper
**Restart first when:**
- Pod is in unknown/evicted state with no clear cause
- You've already identified the root cause and fixed it
- OOMKilled and you're about to bump memory limits
**Dig deeper first when:**
- CrashLoopBackOff with no obvious cause (logs will be lost on restart)
- Data loss risk
- Same pod keeps restarting after restart → there's a real problem, not a transient one
- Multiple pods affected → likely systemic, not pod-specific
**Never restart ArgoCD-managed resources directly** — ArgoCD will re-sync to desired state. Fix the underlying cause (secret, config, image) and let ArgoCD reconcile, or trigger a manual sync.


@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

skills/iterate-pr/SKILL.md

@ -0,0 +1,187 @@
---
name: iterate-pr
description: Iterate on a PR until CI passes. Use when you need to fix CI failures, address review feedback, or continuously push fixes until all checks are green. Automates the feedback-fix-push-wait cycle.
risk: unknown
source: community
---
# Iterate on PR Until CI Passes
Continuously iterate on the current branch until all CI checks pass and review feedback is addressed.
**Requires**: GitHub CLI (`gh`) authenticated.
**Important**: All scripts must be run from the repository root directory (where `.git` is located), not from the skill directory. Use the full path to the script via `${CLAUDE_SKILL_ROOT}`.
## Bundled Scripts
### `scripts/fetch_pr_checks.py`
Fetches CI check status and extracts failure snippets from logs.
```bash
uv run "${CLAUDE_SKILL_ROOT}/scripts/fetch_pr_checks.py" [--pr NUMBER]
```
Returns JSON:
```json
{
  "pr": {"number": 123, "branch": "feat/foo"},
  "summary": {"total": 5, "passed": 3, "failed": 2, "pending": 0},
  "checks": [
    {"name": "tests", "status": "fail", "log_snippet": "...", "run_id": 123},
    {"name": "lint", "status": "pass"}
  ]
}
```
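A consumer of this output mostly needs two things: whether everything is green, and which checks failed. The sketch below relies only on the fields shown in the sample JSON above; the helper names `is_green` and `failed_checks` are hypothetical and not part of the bundled script.

```python
import json


def is_green(result):
    """True when no check has failed and none is still pending."""
    summary = result["summary"]
    return summary["failed"] == 0 and summary["pending"] == 0


def failed_checks(result):
    """Return the check entries whose status is "fail"."""
    return [check for check in result["checks"] if check["status"] == "fail"]


# Sample payload in the shape documented above.
payload = json.loads("""
{
  "pr": {"number": 123, "branch": "feat/foo"},
  "summary": {"total": 5, "passed": 3, "failed": 2, "pending": 0},
  "checks": [
    {"name": "tests", "status": "fail", "log_snippet": "...", "run_id": 123},
    {"name": "lint", "status": "pass"}
  ]
}
""")

print("green" if is_green(payload) else "not green")
for check in failed_checks(payload):
    print(check["name"], check.get("run_id"))
```

Using `check.get("run_id")` rather than indexing reflects the sample, where passing checks omit that field; whether failed checks always include it is an assumption.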
### `scripts/fetch_pr_feedback.py`
Fetches and categorizes PR review feedback using the [LOGAF scale](https://develop.sentry.dev/engineering-practices/code-review/#logaf-scale).
```bash
uv run "${CLAUDE_SKILL_ROOT}/scripts/fetch_pr_feedback.py" [--pr NUMBER]
```
Returns JSON with feedback categorized as:
- `high` - Must address before merge (`h:`, blocker, changes requested)
- `medium` - Should address (`m:`, standard feedback)
- `low` - Optional (`l:`, nit, style, suggestion)
- `bot` - Informational automated comments (Codecov, Dependabot, etc.)
- `resolved` - Already resolved threads
Review bot feedback (from Sentry, Warden, Cursor, Bugbot, CodeQL, etc.) appears in `high`/`medium`/`low` with `review_bot: true` — it is NOT placed in the `bot` bucket.
Each feedback item may also include:
- `thread_id` - GraphQL node ID for inline review comments (used for replies)
## Workflow
### 1. Identify PR
```bash
gh pr view --json number,url,headRefName
```
Stop if no PR exists for the current branch.
### 2. Gather Review Feedback
Run `${CLAUDE_SKILL_ROOT}/scripts/fetch_pr_feedback.py` to get categorized feedback already posted on the PR.
### 3. Handle Feedback by LOGAF Priority
**Auto-fix (no prompt):**
- `high` - must address (blockers, security, changes requested)
- `medium` - should address (standard feedback)
When fixing feedback:
- Understand the root cause, not just the surface symptom
- Check for similar issues in nearby code or related files
- Fix all instances, not just the one mentioned
This includes review bot feedback (items with `review_bot: true`). Treat it the same as human feedback:
- Real issue found → fix it
- False positive → skip, but explain why in a brief comment
- Never silently ignore review bot feedback — always verify the finding
**Prompt user for selection:**
- `low` - present numbered list and ask which to address:
```
Found 3 low-priority suggestions:
1. [l] "Consider renaming this variable" - @reviewer in api.py:42
2. [nit] "Could use a list comprehension" - @reviewer in utils.py:18
3. [style] "Add a docstring" - @reviewer in models.py:55
Which would you like to address? (e.g., "1,3" or "all" or "none")
```
**Skip silently:**
- `resolved` threads
- `bot` comments (informational only — Codecov, Dependabot, etc.)
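
The handling rules above amount to a small dispatch table. The sketch below assumes the feedback JSON is a mapping from bucket name to a list of items, which is a guess about the script output rather than a documented shape; `plan_actions` and the action names are hypothetical.

```python
# Bucket -> action, as described in this step.
ACTIONS = {
    "high": "auto_fix",
    "medium": "auto_fix",
    "low": "ask_user",
    "resolved": "skip",
    "bot": "skip",
}


def plan_actions(feedback: dict) -> dict:
    """Group feedback items by the action this workflow takes on them."""
    plan = {"auto_fix": [], "ask_user": [], "skip": []}
    for bucket, action in ACTIONS.items():
        # Missing buckets are treated as empty; unknown keys are ignored.
        plan[action].extend(feedback.get(bucket, []))
    return plan
```

Items flagged `review_bot: true` already sit in `high`, `medium`, or `low`, so they need no special case here.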
#### Replying to Comments
After processing each inline review comment, reply on the PR thread to acknowledge the action taken. Only reply to items with a `thread_id` (inline review comments).
**When to reply:**
- `high` and `medium` items — whether fixed or determined to be false positives
- `low` items — whether fixed or declined by the user
**How to reply:** Use the `addPullRequestReviewThreadReply` GraphQL mutation with `pullRequestReviewThreadId` and `body` inputs.
**Reply format:**
- 1-2 sentences: what was changed, why it's not an issue, or acknowledgment of declined items
- End every reply with `\n\n*— Claude Code*`
- Before replying, check if the thread already has a reply ending with `*- Claude Code*` or `*— Claude Code*` to avoid duplicates on re-loops
- If the `gh api` call fails, log and continue — do not block the workflow
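
The reply rules can be sketched as three small helpers. The function names are hypothetical, and the shape of `thread_bodies` (a list of existing reply texts) is an assumption. The command is only built, not executed; it relies on `gh api graphql` passing `-f` fields other than `query` as GraphQL variables.

```python
# Both signature variants mentioned above; the first is the one to emit.
SIGNATURES = ("*— Claude Code*", "*- Claude Code*")


def format_reply(message: str) -> str:
    """Append the required signature to a short reply."""
    return f"{message.strip()}\n\n{SIGNATURES[0]}"


def already_replied(thread_bodies: list[str]) -> bool:
    """True if any existing reply in the thread ends with a known signature."""
    return any(body.rstrip().endswith(SIGNATURES) for body in thread_bodies)


def reply_command(thread_id: str, message: str) -> list[str]:
    """Build the argv for replying to a review thread via `gh api graphql`."""
    query = (
        "mutation($threadId: ID!, $body: String!) {"
        " addPullRequestReviewThreadReply(input: {"
        "pullRequestReviewThreadId: $threadId, body: $body}) {"
        " comment { id } } }"
    )
    return [
        "gh", "api", "graphql",
        "-f", f"query={query}",
        "-f", f"threadId={thread_id}",
        "-f", f"body={format_reply(message)}",
    ]
```

A caller would run the command with `subprocess.run(..., check=False)` and log a non-zero exit code instead of raising, in line with the last rule above.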
### 4. Check CI Status
Run `${CLAUDE_SKILL_ROOT}/scripts/fetch_pr_checks.py` to get structured failure data.
**Wait if pending:** If review bot checks (sentry, warden, cursor, bugbot, seer, codeql) are still running, wait before proceeding—they post actionable feedback that must be evaluated. Informational bots (codecov) are not worth waiting for.
### 5. Fix CI Failures
For each failure in the script output:
1. Read the `log_snippet` and trace backwards from the error to understand WHY it failed — not just what failed
2. Read the relevant code and check for related issues (e.g., if a type error in one call site, check other call sites)
3. Fix the root cause with minimal, targeted changes
4. Find existing tests for the affected code and run them. If the fix introduces behavior not covered by existing tests, extend them to cover it (add a test case, not a whole new test file)
Do NOT assume what failed based on check name alone—always read the logs. Do NOT "quick fix and hope" — understand the failure thoroughly before changing code.
### 6. Verify Locally, Then Commit and Push
Before committing, verify your fixes locally:
- If you fixed a test failure: re-run that specific test locally
- If you fixed a lint/type error: re-run the linter or type checker on affected files
- For any code fix: run existing tests covering the changed code
If local verification fails, fix before proceeding — do not push known-broken code.
```bash
git add <files>
git commit -m "fix: <descriptive message>"
git push
```
### 7. Monitor CI and Address Feedback
Poll CI status and review feedback in a loop instead of blocking:
1. Run `uv run ${CLAUDE_SKILL_ROOT}/scripts/fetch_pr_checks.py` to get current CI status
2. If all checks passed → proceed to exit conditions
3. If any checks failed (none pending) → return to step 5
4. If checks are still pending:
a. Run `uv run ${CLAUDE_SKILL_ROOT}/scripts/fetch_pr_feedback.py` for new review feedback
b. Address any new high/medium feedback immediately (same as step 3)
c. If changes were needed, commit and push (this restarts CI), then continue polling
d. Sleep 30 seconds, then repeat from sub-step 1
5. After all checks pass, do a final feedback check: `sleep 10`, then run `uv run ${CLAUDE_SKILL_ROOT}/scripts/fetch_pr_feedback.py`. Address any new high/medium feedback — if changes are needed, return to step 6.
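
The branching in this loop can be written as a pure function of the `summary` object returned by `fetch_pr_checks.py`. This is only a sketch: the function name and the returned labels are hypothetical, and the sleeping, committing, and feedback handling are left to the caller.

```python
def next_action(summary: dict) -> str:
    """Decide the next polling step from the CI summary counts."""
    if summary.get("pending", 0) > 0:
        # Sub-step 4: check for new feedback, sleep 30 seconds, poll again.
        return "poll_again"
    if summary.get("failed", 0) > 0:
        # Sub-step 3: failures with nothing pending go back to step 5.
        return "fix_ci_failures"
    # Sub-steps 2 and 5: everything passed, do the final feedback check.
    return "final_feedback_check"
```

Pending checks take precedence over failures, matching the "none pending" condition in sub-step 3.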
### 8. Repeat
If step 7 required code changes (from new feedback after CI passed), return to step 2 for a fresh cycle. CI failures during monitoring are already handled within step 7's polling loop.
## Exit Conditions
**Success:** All checks pass, post-CI feedback re-check is clean (no new unaddressed high/medium feedback including review bot findings), user has decided on low-priority items.
**Ask for help:** Same failure after 2 attempts, feedback needs clarification, infrastructure issues.
**Stop:** No PR exists, branch needs rebase.
## Fallback
If scripts fail, use `gh` CLI directly:
- `gh pr checks --json name,state,bucket,link`
- `gh run view <run-id> --log-failed`
- `gh api repos/{owner}/{repo}/pulls/{number}/comments`
## When to Use
Use this skill when tackling tasks related to its primary domain or functionality as described above.

View file

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

View file

@ -0,0 +1,38 @@
---
name: k8s-manifest-generator
description: "Create production-ready Kubernetes manifests for Deployments, Services, ConfigMaps, and Secrets following best practices and security standards. Use when generating Kubernetes YAML manifests, creat..."
risk: unknown
source: community
date_added: "2026-02-27"
---
# Kubernetes Manifest Generator
Step-by-step guidance for creating production-ready Kubernetes manifests including Deployments, Services, ConfigMaps, Secrets, and PersistentVolumeClaims.
## Use this skill when
Use this skill when you need to:
- Create new Kubernetes Deployment manifests
- Define Service resources for network connectivity
- Generate ConfigMap and Secret resources for configuration management
- Create PersistentVolumeClaim manifests for stateful workloads
- Follow Kubernetes best practices and naming conventions
- Implement resource limits, health checks, and security contexts
- Design manifests for multi-environment deployments
## Do not use this skill when
- The task is unrelated to generating Kubernetes manifests
- You need a different domain or tool outside this scope
## Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.
## Resources
- `resources/implementation-playbook.md` for detailed patterns and examples.

View file

@ -0,0 +1,296 @@
# Kubernetes ConfigMap Templates
---
# Template 1: Simple Key-Value Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: <app-name>-config
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
app.kubernetes.io/instance: <instance-name>
data:
# Simple key-value pairs
APP_ENV: "production"
LOG_LEVEL: "info"
DATABASE_HOST: "db.example.com"
DATABASE_PORT: "5432"
CACHE_TTL: "3600"
MAX_CONNECTIONS: "100"
---
# Template 2: Configuration File
apiVersion: v1
kind: ConfigMap
metadata:
name: <app-name>-config-file
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
data:
# Application configuration file
application.yaml: |
server:
port: 8080
host: 0.0.0.0
logging:
level: INFO
format: json
database:
host: db.example.com
port: 5432
pool_size: 20
timeout: 30
cache:
enabled: true
ttl: 3600
max_entries: 10000
features:
new_ui: true
beta_features: false
---
# Template 3: Multiple Configuration Files
apiVersion: v1
kind: ConfigMap
metadata:
name: <app-name>-multi-config
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
data:
# Nginx configuration
nginx.conf: |
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
worker_connections 1024;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log /var/log/nginx/access.log main;
sendfile on;
keepalive_timeout 65;
include /etc/nginx/conf.d/*.conf;
}
# Default site configuration
default.conf: |
server {
listen 80;
server_name _;
location / {
proxy_pass http://backend:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
location /health {
access_log off;
return 200 "healthy\n";
}
}
---
# Template 4: JSON Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: <app-name>-json-config
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
data:
config.json: |
{
"server": {
"port": 8080,
"host": "0.0.0.0",
"timeout": 30
},
"database": {
"host": "postgres.example.com",
"port": 5432,
"database": "myapp",
"pool": {
"min": 2,
"max": 20
}
},
"redis": {
"host": "redis.example.com",
"port": 6379,
"db": 0
},
"features": {
"auth": true,
"metrics": true,
"tracing": true
}
}
---
# Template 5: Environment-Specific Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: <app-name>-prod-config
namespace: production
labels:
app.kubernetes.io/name: <app-name>
environment: production
data:
APP_ENV: "production"
LOG_LEVEL: "warn"
DEBUG: "false"
RATE_LIMIT: "1000"
CACHE_TTL: "3600"
DATABASE_POOL_SIZE: "50"
FEATURE_FLAG_NEW_UI: "true"
FEATURE_FLAG_BETA: "false"
---
# Template 6: Script Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: <app-name>-scripts
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
data:
# Initialization script
init.sh: |
#!/bin/bash
set -e
echo "Running initialization..."
# Wait for database
until nc -z "$DATABASE_HOST" "$DATABASE_PORT"; do
echo "Waiting for database..."
sleep 2
done
echo "Database is ready!"
# Run migrations
if [ "$RUN_MIGRATIONS" = "true" ]; then
echo "Running database migrations..."
./migrate up
fi
echo "Initialization complete!"
# Health check script
healthcheck.sh: |
#!/bin/bash
# Check application health endpoint
response=$(curl -sf http://localhost:8080/health)
if [ $? -eq 0 ]; then
echo "Health check passed"
exit 0
else
echo "Health check failed"
exit 1
fi
---
# Template 7: Prometheus Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
labels:
app.kubernetes.io/name: prometheus
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'us-west-2'
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
rule_files:
- /etc/prometheus/rules/*.yml
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
---
# Usage Examples:
#
# 1. Mount as environment variables:
# envFrom:
# - configMapRef:
# name: <app-name>-config
#
# 2. Mount as files:
# volumeMounts:
# - name: config
# mountPath: /etc/app
# volumes:
# - name: config
# configMap:
# name: <app-name>-config-file
#
# 3. Mount specific keys as files:
# volumes:
# - name: nginx-config
# configMap:
# name: <app-name>-multi-config
# items:
# - key: nginx.conf
# path: nginx.conf
#
# 4. Use individual environment variables:
# env:
# - name: LOG_LEVEL
# valueFrom:
# configMapKeyRef:
# name: <app-name>-config
# key: LOG_LEVEL

View file

@ -0,0 +1,203 @@
# Production-Ready Kubernetes Deployment Template
# Replace all <placeholders> with actual values
apiVersion: apps/v1
kind: Deployment
metadata:
name: <app-name>
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
app.kubernetes.io/instance: <instance-name>
app.kubernetes.io/version: "<version>"
app.kubernetes.io/component: <component> # backend, frontend, database, cache
app.kubernetes.io/part-of: <system-name>
app.kubernetes.io/managed-by: kubectl
annotations:
description: "<application description>"
contact: "<team-email>"
spec:
replicas: 3 # Minimum 3 for production HA
revisionHistoryLimit: 10
selector:
matchLabels:
app.kubernetes.io/name: <app-name>
app.kubernetes.io/instance: <instance-name>
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # Zero-downtime deployment
minReadySeconds: 10
progressDeadlineSeconds: 600
template:
metadata:
labels:
app.kubernetes.io/name: <app-name>
app.kubernetes.io/instance: <instance-name>
app.kubernetes.io/version: "<version>"
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: <app-name>
# Pod-level security context
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
# Init containers (optional)
initContainers:
- name: init-wait
image: busybox:1.36
command: ['sh', '-c', 'echo "Initializing..."']
securityContext:
allowPrivilegeEscalation: false
runAsNonRoot: true
runAsUser: 1000
containers:
- name: <container-name>
image: <registry>/<image>:<tag> # Never use :latest
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8080
protocol: TCP
- name: metrics
containerPort: 9090
protocol: TCP
# Environment variables
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
# Load from ConfigMap and Secret
envFrom:
- configMapRef:
name: <app-name>-config
- secretRef:
name: <app-name>-secret
# Resource limits
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
# Startup probe (for slow-starting apps)
startupProbe:
httpGet:
path: /health/startup
port: http
initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 30 # 5 minutes to start
# Liveness probe
livenessProbe:
httpGet:
path: /health/live
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe
readinessProbe:
httpGet:
path: /health/ready
port: http
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
# Volume mounts
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /app/cache
# - name: data
# mountPath: /var/lib/app
# Container security context
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop:
- ALL
# Lifecycle hooks
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"] # Graceful shutdown
# Volumes
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir:
sizeLimit: 1Gi
# - name: data
# persistentVolumeClaim:
# claimName: <app-name>-data
# Scheduling
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/name: <app-name>
topologyKey: kubernetes.io/hostname
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app.kubernetes.io/name: <app-name>
terminationGracePeriodSeconds: 30
# Image pull secrets (if using private registry)
# imagePullSecrets:
# - name: regcred

View file

@ -0,0 +1,171 @@
# Kubernetes Service Templates
---
# Template 1: ClusterIP Service (Internal Only)
apiVersion: v1
kind: Service
metadata:
name: <app-name>
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
app.kubernetes.io/instance: <instance-name>
annotations:
description: "Internal service for <app-name>"
spec:
type: ClusterIP
selector:
app.kubernetes.io/name: <app-name>
app.kubernetes.io/instance: <instance-name>
ports:
- name: http
port: 80
targetPort: http # Named port from container
protocol: TCP
sessionAffinity: None
---
# Template 2: LoadBalancer Service (External Access)
apiVersion: v1
kind: Service
metadata:
name: <app-name>-lb
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
annotations:
# AWS NLB annotations
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
# SSL certificate (optional)
# service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:..."
spec:
type: LoadBalancer
externalTrafficPolicy: Local # Preserves client IP
selector:
app.kubernetes.io/name: <app-name>
ports:
- name: http
port: 80
targetPort: http
protocol: TCP
- name: https
port: 443
targetPort: https
protocol: TCP
# Restrict access to specific IPs (optional)
# loadBalancerSourceRanges:
# - 203.0.113.0/24
---
# Template 3: NodePort Service (Direct Node Access)
apiVersion: v1
kind: Service
metadata:
name: <app-name>-np
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
spec:
type: NodePort
selector:
app.kubernetes.io/name: <app-name>
ports:
- name: http
port: 80
targetPort: 8080
nodePort: 30080 # Optional, 30000-32767 range
protocol: TCP
---
# Template 4: Headless Service (StatefulSet)
apiVersion: v1
kind: Service
metadata:
name: <app-name>-headless
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
spec:
clusterIP: None # Headless
selector:
app.kubernetes.io/name: <app-name>
ports:
- name: client
port: 9042
targetPort: 9042
publishNotReadyAddresses: true # Include not-ready pods in DNS
---
# Template 5: Multi-Port Service with Metrics
apiVersion: v1
kind: Service
metadata:
name: <app-name>-multi
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
type: ClusterIP
selector:
app.kubernetes.io/name: <app-name>
ports:
- name: http
port: 80
targetPort: 8080
protocol: TCP
- name: https
port: 443
targetPort: 8443
protocol: TCP
- name: grpc
port: 9090
targetPort: 9090
protocol: TCP
- name: metrics
port: 9091
targetPort: 9091
protocol: TCP
---
# Template 6: Service with Session Affinity
apiVersion: v1
kind: Service
metadata:
name: <app-name>-sticky
namespace: <namespace>
labels:
app.kubernetes.io/name: <app-name>
spec:
type: ClusterIP
selector:
app.kubernetes.io/name: <app-name>
ports:
- name: http
port: 80
targetPort: 8080
protocol: TCP
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 10800 # 3 hours
---
# Template 7: ExternalName Service (External Service Mapping)
apiVersion: v1
kind: Service
metadata:
name: external-db
namespace: <namespace>
spec:
type: ExternalName
externalName: db.example.com
ports:
- port: 5432
targetPort: 5432
protocol: TCP

View file

@ -0,0 +1,25 @@
<!-- BEGIN_TF_DOCS -->
## Requirements
No requirements.
## Providers
No providers.
## Modules
No modules.
## Resources
No resources.
## Inputs
No inputs.
## Outputs
No outputs.
<!-- END_TF_DOCS -->

View file

@ -0,0 +1,753 @@
# Kubernetes Deployment Specification Reference
Comprehensive reference for Kubernetes Deployment resources, covering all key fields, best practices, and common patterns.
## Overview
A Deployment provides declarative updates for Pods and ReplicaSets. It manages the desired state of your application, handling rollouts, rollbacks, and scaling operations.
## Complete Deployment Specification
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
namespace: production
labels:
app.kubernetes.io/name: my-app
app.kubernetes.io/version: "1.0.0"
app.kubernetes.io/component: backend
app.kubernetes.io/part-of: my-system
annotations:
description: "Main application deployment"
contact: "backend-team@example.com"
spec:
# Replica management
replicas: 3
revisionHistoryLimit: 10
# Pod selection
selector:
matchLabels:
app: my-app
version: v1
# Update strategy
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
# Minimum time for pod to be ready
minReadySeconds: 10
# Deployment will fail if it doesn't progress in this time
progressDeadlineSeconds: 600
# Pod template
template:
metadata:
labels:
app: my-app
version: v1
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
spec:
# Service account for RBAC
serviceAccountName: my-app
# Security context for the pod
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
# Init containers run before main containers
initContainers:
- name: init-db
image: busybox:1.36
command: ['sh', '-c', 'until nc -z db-service 5432; do sleep 1; done']
securityContext:
allowPrivilegeEscalation: false
runAsNonRoot: true
runAsUser: 1000
# Main containers
containers:
- name: app
image: myapp:1.0.0
imagePullPolicy: IfNotPresent
# Container ports
ports:
- name: http
containerPort: 8080
protocol: TCP
- name: metrics
containerPort: 9090
protocol: TCP
# Environment variables
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-credentials
key: url
# ConfigMap and Secret references
envFrom:
- configMapRef:
name: app-config
- secretRef:
name: app-secrets
# Resource requests and limits
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
# Liveness probe
livenessProbe:
httpGet:
path: /health/live
port: http
httpHeaders:
- name: Custom-Header
value: Awesome
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
# Readiness probe
readinessProbe:
httpGet:
path: /health/ready
port: http
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 3
# Startup probe (for slow-starting containers)
startupProbe:
httpGet:
path: /health/startup
port: http
initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 30
# Volume mounts
volumeMounts:
- name: data
mountPath: /var/lib/app
- name: config
mountPath: /etc/app
readOnly: true
- name: tmp
mountPath: /tmp
# Security context for container
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop:
- ALL
# Lifecycle hooks
lifecycle:
postStart:
exec:
command: ["/bin/sh", "-c", "echo Container started > /tmp/started"]
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
# Volumes
volumes:
- name: data
persistentVolumeClaim:
claimName: app-data
- name: config
configMap:
name: app-config
- name: tmp
emptyDir: {}
# DNS configuration
dnsPolicy: ClusterFirst
dnsConfig:
options:
- name: ndots
value: "2"
# Scheduling
nodeSelector:
disktype: ssd
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- my-app
topologyKey: kubernetes.io/hostname
tolerations:
- key: "app"
operator: "Equal"
value: "my-app"
effect: "NoSchedule"
# Termination
terminationGracePeriodSeconds: 30
# Image pull secrets
imagePullSecrets:
- name: regcred
```
## Field Reference
### Metadata Fields
#### Required Fields
- `apiVersion`: `apps/v1` (current stable version)
- `kind`: `Deployment`
- `metadata.name`: Unique name within namespace
#### Recommended Metadata
- `metadata.namespace`: Target namespace (defaults to `default`)
- `metadata.labels`: Key-value pairs for organization
- `metadata.annotations`: Non-identifying metadata
### Spec Fields
#### Replica Management
**`replicas`** (integer, default: 1)
- Number of desired pod instances
- Best practice: Use 3+ for production high availability
- Can be scaled manually or via HorizontalPodAutoscaler
**`revisionHistoryLimit`** (integer, default: 10)
- Number of old ReplicaSets to retain for rollback
- Set to 0 to disable rollback capability
- Lower values reduce the number of old ReplicaSet objects kept for frequently updated deployments
#### Update Strategy
**`strategy.type`** (string)
- `RollingUpdate` (default): Gradual pod replacement
- `Recreate`: Delete all pods before creating new ones
**`strategy.rollingUpdate.maxSurge`** (int or percent, default: 25%)
- Maximum pods above desired replicas during update
- Example: With 3 replicas and maxSurge=1, up to 4 pods during update
**`strategy.rollingUpdate.maxUnavailable`** (int or percent, default: 25%)
- Maximum pods below desired replicas during update
- Set to 0 for zero-downtime deployments
- Cannot be 0 if maxSurge is 0
**Best practices:**
```yaml
# Zero-downtime deployment
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
# Fast deployment (capacity may dip by one pod during the rollout)
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2
maxUnavailable: 1
# Complete replacement
strategy:
type: Recreate
```
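
To see how `maxSurge` and `maxUnavailable` bound the pod count during a rollout, the following sketch resolves both values for a given replica count. Percentages of `maxSurge` round up and percentages of `maxUnavailable` round down. The function names are hypothetical, and the sketch simply rejects any combination that resolves to zero for both values, which is a simplification of what the real controller does.

```python
def _resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an int or a percent string such as "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        scaled = replicas * percent
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%"):
    """Return (minimum available pods, maximum total pods) during a rollout."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Simplification: the rollout could never make progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas - unavailable, replicas + surge


print(rollout_bounds(3, max_surge=1, max_unavailable=0))
print(rollout_bounds(10))
```

With 3 replicas, `maxSurge: 1`, and `maxUnavailable: 0`, the bounds are 3 available pods and at most 4 pods in total, which is the zero-downtime configuration shown above.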
#### Pod Template
**`template.metadata.labels`**
- Must include labels matching `spec.selector.matchLabels`
- Add version labels for blue/green deployments
- Include standard Kubernetes labels
**`template.spec.containers`** (required)
- Array of container specifications
- At least one container required
- Each container needs unique name
#### Container Configuration
**Image Management:**
```yaml
containers:
- name: app
image: registry.example.com/myapp:1.0.0
imagePullPolicy: IfNotPresent # or Always, Never
```
Image pull policies:
- `IfNotPresent`: Pull if not cached (default for tagged images)
- `Always`: Always pull (default for `:latest` or untagged images)
- `Never`: Never pull, fail if not cached
**Port Declarations:**
```yaml
ports:
- name: http # Named for referencing in Service
containerPort: 8080
protocol: TCP # TCP (default), UDP, or SCTP
hostPort: 8080 # Optional: Bind to host port (rarely used)
```
#### Resource Management
**Requests vs Limits:**
```yaml
resources:
requests:
memory: "256Mi" # Guaranteed resources
cpu: "250m" # 0.25 CPU cores
limits:
memory: "512Mi" # Maximum allowed
cpu: "500m" # 0.5 CPU cores
```
**QoS Classes (determined automatically):**
1. **Guaranteed**: requests = limits for both CPU and memory in every container
- Highest priority
- Last to be evicted
2. **Burstable**: at least one container sets a request or limit, but the Guaranteed criteria are not met
- Medium priority
- Evicted before Guaranteed
3. **BestEffort**: No requests or limits set
- Lowest priority
- First to be evicted
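
The classification rules can be expressed as a short function over a pod's container list. This is a sketch with a hypothetical name, `qos_class`, and one deliberate simplification: quantities are compared as plain strings, so `"500m"` and `"0.5"` would not be treated as equal here even though Kubernetes considers them the same.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if requests or limits:
            any_set = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # A request that is not set defaults to the limit.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

The resources block shown earlier (256Mi/250m requests, 512Mi/500m limits) comes out as Burstable.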
**Best practices:**
- Always set requests in production
- Set limits to prevent resource monopolization
- Memory limits should be 1.5-2x requests
- CPU limits can be higher for bursty workloads
#### Health Checks
**Probe Types:**
1. **startupProbe** - For slow-starting applications
```yaml
startupProbe:
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 0
periodSeconds: 10
failureThreshold: 30 # 5 minutes to start (10s * 30)
```
2. **livenessProbe** - Restarts unhealthy containers
```yaml
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3 # Restart after 3 failures
```
3. **readinessProbe** - Controls traffic routing
```yaml
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3 # Remove from service after 3 failures
```
**Probe Mechanisms:**
```yaml
# HTTP GET
httpGet:
path: /health
port: 8080
httpHeaders:
- name: Authorization
value: Bearer token
# TCP Socket
tcpSocket:
port: 3306
# Command execution
exec:
command:
- cat
- /tmp/healthy
# gRPC (Kubernetes 1.24+)
grpc:
port: 9090
service: my.service.health.v1.Health
```
**Probe Timing Parameters:**
- `initialDelaySeconds`: Wait before first probe
- `periodSeconds`: How often to probe
- `timeoutSeconds`: Probe timeout
- `successThreshold`: Successes needed to mark healthy (must be 1 for liveness and startup probes)
- `failureThreshold`: Failures before taking action
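
A quick way to sanity-check probe settings is to compute the approximate time a container has before the probe gives up. The sketch below uses the rule of thumb `initialDelaySeconds + periodSeconds * failureThreshold` and ignores `timeoutSeconds` and scheduling jitter, so treat the result as an estimate. The function name is hypothetical; the default arguments mirror the Kubernetes defaults (0, 10, 3).

```python
def probe_budget_seconds(initial_delay: int = 0, period: int = 10,
                         failure_threshold: int = 3) -> int:
    """Approximate seconds until a continuously failing probe triggers action."""
    return initial_delay + period * failure_threshold


# Startup probe from the example above: 10s * 30 failures.
print(probe_budget_seconds(initial_delay=0, period=10, failure_threshold=30))

# Liveness probe from the example above: 30s delay, then 10s * 3 failures.
print(probe_budget_seconds(initial_delay=30, period=10, failure_threshold=3))
```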
#### Security Context
**Pod-level security context:**
```yaml
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
fsGroupChangePolicy: OnRootMismatch
seccompProfile:
type: RuntimeDefault
```
**Container-level security context:**
```yaml
containers:
- name: app
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE # Only if needed
```
**Security best practices:**
- Always run as non-root (`runAsNonRoot: true`)
- Drop all capabilities and add only needed ones
- Use read-only root filesystem when possible
- Enable seccomp profile
- Disable privilege escalation
#### Volumes
**Volume Types:**
```yaml
volumes:
# PersistentVolumeClaim
- name: data
persistentVolumeClaim:
claimName: app-data
# ConfigMap
- name: config
configMap:
name: app-config
items:
- key: app.properties
path: application.properties
# Secret
- name: secrets
secret:
secretName: app-secrets
defaultMode: 0400
# EmptyDir (ephemeral)
- name: cache
emptyDir:
sizeLimit: 1Gi
# HostPath (avoid in production)
- name: host-data
hostPath:
path: /data
type: DirectoryOrCreate
```
#### Scheduling
**Node Selection:**
```yaml
# Simple node selector
nodeSelector:
disktype: ssd
zone: us-west-1a
# Node affinity (more expressive)
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/arch
operator: In
values:
- amd64
- arm64
```
**Pod Affinity/Anti-Affinity:**
```yaml
# Spread pods across nodes
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: my-app
topologyKey: kubernetes.io/hostname
# Co-locate with database
affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: database
topologyKey: kubernetes.io/hostname
```
**Tolerations:**
```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 30
- key: "dedicated"
operator: "Equal"
value: "database"
effect: "NoSchedule"
```
## Common Patterns
### High Availability Deployment
```yaml
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
template:
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: my-app
topologyKey: kubernetes.io/hostname
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: my-app
```
### Sidecar Container Pattern
```yaml
spec:
template:
spec:
containers:
- name: app
image: myapp:1.0.0
volumeMounts:
- name: shared-logs
mountPath: /var/log
- name: log-forwarder
image: fluent/fluent-bit:2.0
volumeMounts:
- name: shared-logs
mountPath: /var/log
readOnly: true
volumes:
- name: shared-logs
emptyDir: {}
```
### Init Container for Dependencies
```yaml
spec:
template:
spec:
initContainers:
- name: wait-for-db
image: busybox:1.36
command:
- sh
- -c
- |
until nc -z database-service 5432; do
echo "Waiting for database..."
sleep 2
done
- name: run-migrations
image: myapp:1.0.0
command: ["./migrate", "up"]
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-credentials
key: url
containers:
- name: app
image: myapp:1.0.0
```
## Best Practices
### Production Checklist
- [ ] Set resource requests and limits
- [ ] Implement all three probe types (startup, liveness, readiness)
- [ ] Use specific image tags (not :latest)
- [ ] Configure security context (non-root, read-only filesystem)
- [ ] Set replica count >= 3 for HA
- [ ] Configure pod anti-affinity for spread
- [ ] Set appropriate update strategy (maxUnavailable: 0 for zero-downtime)
- [ ] Use ConfigMaps and Secrets for configuration
- [ ] Add standard labels and annotations
- [ ] Configure graceful shutdown (preStop hook, terminationGracePeriodSeconds)
- [ ] Set revisionHistoryLimit for rollback capability
- [ ] Use ServiceAccount with minimal RBAC permissions
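
Part of this checklist can be verified mechanically. The sketch below checks a Deployment that has already been parsed into a Python dict (for example with a YAML loader) against a few of the items: replica count, pinned image tags, resources, probes, and security context. The function name and finding messages are hypothetical, and the checks are deliberately shallow.

```python
PROBES = ("startupProbe", "livenessProbe", "readinessProbe")


def lint_deployment(manifest: dict) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    spec = manifest.get("spec", {})
    if spec.get("replicas", 1) < 3:
        findings.append("replicas is below 3")

    pod = spec.get("template", {}).get("spec", {})
    pod_non_root = pod.get("securityContext", {}).get("runAsNonRoot", False)

    for container in pod.get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        # The tag is whatever follows ":" in the last path segment, so a
        # registry port such as example.com:5000/app is not mistaken for one.
        tag = image.rsplit("/", 1)[-1].partition(":")[2]
        if "@" not in image and tag in ("", "latest"):
            findings.append(f"{name}: image tag is missing or latest")

        resources = container.get("resources", {})
        if not resources.get("requests") or not resources.get("limits"):
            findings.append(f"{name}: resource requests and limits not both set")

        for probe in PROBES:
            if probe not in container:
                findings.append(f"{name}: {probe} missing")

        ctx = container.get("securityContext", {})
        # runAsNonRoot may be inherited from the pod-level security context.
        if not ctx.get("runAsNonRoot", pod_non_root):
            findings.append(f"{name}: runAsNonRoot not enabled")
        if not ctx.get("readOnlyRootFilesystem", False):
            findings.append(f"{name}: root filesystem is writable")
    return findings
```

Items such as RBAC scope or graceful shutdown behavior need context outside the manifest and are not covered.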
### Performance Tuning
**Fast startup:**
```yaml
spec:
  minReadySeconds: 5
  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
```
**Zero-downtime updates:**
```yaml
spec:
  minReadySeconds: 10
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```
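The effect of `maxSurge` and `maxUnavailable` on a rollout comes down to simple arithmetic. The sketch below is illustrative only and not part of Kubernetes: the helper name `rollout_bounds` is hypothetical, and the replica counts are assumed for the example. It follows the documented rounding rules, where a percentage `maxSurge` rounds up and a percentage `maxUnavailable` rounds down.

```python
import math


def _resolve(value, replicas, round_up):
    """Turn an int or a percentage string such as "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        fraction = int(value[:-1]) * replicas / 100
        return math.ceil(fraction) if round_up else math.floor(fraction)
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (min_available_pods, max_total_pods) during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


# Zero-downtime settings on an assumed 3 replicas:
# capacity never drops below 3 and at most 4 pods exist at once.
print(rollout_bounds(3, 1, 0))           # (3, 4)
# Percentage values on an assumed 10 replicas.
print(rollout_bounds(10, "25%", "25%"))  # (8, 13)
```

With `maxUnavailable: 0` the first number always equals the replica count, which is what makes the update zero-downtime as long as readiness probes are accurate.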
**Graceful shutdown:**
```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
              # Delay shutdown so endpoints can be removed first; the kubelet
              # sends SIGTERM itself once the preStop hook returns.
              command: ["/bin/sh", "-c", "sleep 15"]
```
## Troubleshooting
### Common Issues
**Pods not starting:**
```bash
kubectl describe deployment <name>
kubectl get pods -l app=<app-name>
kubectl describe pod <pod-name>
kubectl logs <pod-name>
```
**ImagePullBackOff:**
- Check image name and tag
- Verify imagePullSecrets
- Check registry credentials
**CrashLoopBackOff:**
- Check container logs
- Verify liveness probe is not too aggressive
- Check resource limits
- Verify application dependencies
**Deployment stuck in progress:**
- Check progressDeadlineSeconds
- Verify readiness probes
- Check resource availability
## Related Resources
- [Kubernetes Deployment API Reference](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#deployment-v1-apps)
- [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
- [Resource Management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)


@ -0,0 +1,724 @@
# Kubernetes Service Specification Reference
Comprehensive reference for Kubernetes Service resources, covering service types, networking, load balancing, and service discovery patterns.
## Overview
A Service provides stable network endpoints for accessing Pods. Services enable loose coupling between microservices by providing service discovery and load balancing.
## Service Types
### 1. ClusterIP (Default)
Exposes the service on an internal cluster IP. Only reachable from within the cluster.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend-service
  namespace: production
spec:
  type: ClusterIP
  selector:
    app: backend
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
  sessionAffinity: None
```
**Use cases:**
- Internal microservice communication
- Database services
- Internal APIs
- Message queues
### 2. NodePort
Exposes the service on each Node's IP at a static port (30000-32767 range).
```yaml
apiVersion: v1
kind: Service
metadata:
  name: frontend-service
spec:
  type: NodePort
  selector:
    app: frontend
  ports:
  - name: http
    port: 80
    targetPort: 8080
    nodePort: 30080 # Optional, auto-assigned if omitted
    protocol: TCP
```
**Use cases:**
- Development/testing external access
- Small deployments without load balancer
- Direct node access requirements
**Limitations:**
- Limited port range (30000-32767)
- Must handle node failures
- No built-in load balancing across nodes
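As a small illustration of the port-range limitation, the check below validates a requested `nodePort` before it goes into a manifest. It is a sketch that assumes the default range; clusters can change the range with the kube-apiserver `--service-node-port-range` flag, and the function name `is_valid_node_port` is hypothetical.

```python
# Default NodePort range (inclusive on both ends); an assumption of this sketch,
# since the API server can be configured with a different range.
DEFAULT_NODE_PORT_RANGE = range(30000, 32768)


def is_valid_node_port(port, allowed=DEFAULT_NODE_PORT_RANGE):
    """Return True if the port can be requested as a nodePort."""
    return port in allowed


print(is_valid_node_port(30080))  # True
print(is_valid_node_port(8080))   # False
```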
### 3. LoadBalancer
Exposes the service using a cloud provider's load balancer.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: public-api
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
  - name: https
    port: 443
    targetPort: 8443
    protocol: TCP
  loadBalancerSourceRanges:
  - 203.0.113.0/24
```
**Cloud-specific annotations:**
**AWS:**
```yaml
annotations:
  service.beta.kubernetes.io/aws-load-balancer-type: "nlb" # or "external"
  service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
  service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
  service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:..."
  service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
```
**Azure:**
```yaml
annotations:
  service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  service.beta.kubernetes.io/azure-pip-name: "my-public-ip"
```
**GCP:**
```yaml
annotations:
  cloud.google.com/load-balancer-type: "Internal"
  cloud.google.com/backend-config: '{"default": "my-backend-config"}'
```
### 4. ExternalName
Maps service to external DNS name (CNAME record).
```yaml
apiVersion: v1
kind: Service
metadata:
  name: external-db
spec:
  type: ExternalName
  externalName: db.external.example.com
  ports:
  - port: 5432
```
**Use cases:**
- Accessing external services
- Service migration scenarios
- Multi-cluster service references
## Complete Service Specification
```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: production
  labels:
    app: my-app
    tier: backend
  annotations:
    description: "Main application service"
    prometheus.io/scrape: "true"
spec:
  # Field reference: several fields below are only valid for specific
  # service types, so this is not an applyable manifest as written.
  # Service type
  type: ClusterIP
  # Pod selector
  selector:
    app: my-app
    version: v1
  # Ports configuration
  ports:
  - name: http
    port: 80          # Service port
    targetPort: 8080  # Container port (or named port)
    protocol: TCP     # TCP, UDP, or SCTP
  # Session affinity
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  # IP configuration
  clusterIP: 10.0.0.10 # Optional: specific IP
  clusterIPs:
  - 10.0.0.10
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  # External traffic policy (type NodePort or LoadBalancer only)
  externalTrafficPolicy: Local
  # Internal traffic policy
  internalTrafficPolicy: Local
  # Health check (type LoadBalancer with externalTrafficPolicy: Local only)
  healthCheckNodePort: 30000
  # Load balancer config (for type: LoadBalancer)
  loadBalancerIP: 203.0.113.100 # Deprecated since v1.24; prefer provider annotations
  loadBalancerSourceRanges:
  - 203.0.113.0/24
  # External IPs
  externalIPs:
  - 80.11.12.10
  # Publishing strategy
  publishNotReadyAddresses: false
```
## Port Configuration
### Named Ports
Use named ports in Pods for flexibility:
**Deployment:**
```yaml
spec:
  template:
    spec:
      containers:
      - name: app
        ports:
        - name: http
          containerPort: 8080
        - name: metrics
          containerPort: 9090
```
**Service:**
```yaml
spec:
  ports:
  - name: http
    port: 80
    targetPort: http # References named port
  - name: metrics
    port: 9090
    targetPort: metrics
```
### Multiple Ports
```yaml
spec:
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
  - name: https
    port: 443
    targetPort: 8443
    protocol: TCP
  - name: grpc
    port: 9090
    targetPort: 9090
    protocol: TCP
```
## Session Affinity
### None (Default)
No stickiness: with kube-proxy in iptables mode, each new connection is sent to a randomly chosen backend pod.
```yaml
spec:
  sessionAffinity: None
```
### ClientIP
Routes requests from same client IP to same pod.
```yaml
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # 3 hours
```
**Use cases:**
- Stateful applications
- Session-based applications
- WebSocket connections
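The behaviour of `ClientIP` affinity can be pictured as a sticky table keyed by client address. The sketch below is a simplified model, not kube-proxy's implementation: the class name `ClientIPAffinity` is hypothetical, the backend names are made up, and refreshing the timer on every new connection is an assumption of the model.

```python
import random


class ClientIPAffinity:
    """Toy model of sessionAffinity: ClientIP with a timeout."""

    def __init__(self, backends, timeout_seconds=10800, rng=None):
        self.backends = list(backends)
        self.timeout = timeout_seconds
        self.rng = rng or random.Random()
        self._sticky = {}  # client_ip -> (backend, last_seen)

    def pick(self, client_ip, now):
        """Return the backend for a new connection arriving at time `now`."""
        entry = self._sticky.get(client_ip)
        if entry and now - entry[1] <= self.timeout:
            backend = entry[0]  # still within the affinity window
        else:
            backend = self.rng.choice(self.backends)  # new or expired client
        self._sticky[client_ip] = (backend, now)
        return backend


lb = ClientIPAffinity(["pod-a", "pod-b", "pod-c"], timeout_seconds=10800)
first = lb.pick("203.0.113.7", now=0)
# A later connection from the same IP inside the window lands on the same pod.
print(lb.pick("203.0.113.7", now=3600) == first)  # True
```

Clients behind a shared NAT address all map to the same pod under this scheme, which is one reason affinity can skew load.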
## Traffic Policies
### External Traffic Policy
**Cluster (Default):**
```yaml
spec:
  externalTrafficPolicy: Cluster
```
- Load balances across all nodes
- May add extra network hop
- Client source IP is replaced by a node IP (SNAT)
**Local:**
```yaml
spec:
  externalTrafficPolicy: Local
```
- Traffic goes only to pods on receiving node
- Preserves client source IP
- Better performance (no extra hop)
- May cause imbalanced load
### Internal Traffic Policy
```yaml
spec:
  internalTrafficPolicy: Local # or Cluster
```
Controls traffic routing for cluster-internal clients.
## Headless Services
Service without cluster IP for direct pod access.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: database
spec:
  clusterIP: None # Headless
  selector:
    app: database
  ports:
  - port: 5432
    targetPort: 5432
```
**Use cases:**
- StatefulSet pod discovery
- Direct pod-to-pod communication
- Custom load balancing
- Database clusters
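**DNS returns:**
- A/AAAA records for the individual pod IPs instead of a single service IP
- Per-pod names of the form `<pod-name>.<service-name>.<namespace>.svc.cluster.local` for pods that set `hostname` and `subdomain` (StatefulSet pods do this automatically)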
## Service Discovery
### DNS
**ClusterIP Service:**
```
<service-name>.<namespace>.svc.cluster.local
```
Example:
```bash
curl http://backend-service.production.svc.cluster.local
```
**Within same namespace:**
```bash
curl http://backend-service
```
**Headless Service (per-pod record; the service name itself resolves to all pod IPs):**
```
<pod-name>.<service-name>.<namespace>.svc.cluster.local
```
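The naming scheme above is easy to generate programmatically, for example when templating client configuration. This is a minimal sketch: the helper names `service_fqdn` and `pod_fqdn` are hypothetical, `cluster.local` is assumed as the cluster domain, and the `cassandra-0` pod in the `default` namespace is an illustrative value.

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Fully qualified DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def pod_fqdn(pod, service, namespace, cluster_domain="cluster.local"):
    """Per-pod DNS name behind a headless Service (e.g. a StatefulSet pod)."""
    return f"{pod}.{service_fqdn(service, namespace, cluster_domain)}"


print(service_fqdn("backend-service", "production"))
# backend-service.production.svc.cluster.local
print(pod_fqdn("cassandra-0", "cassandra", "default"))
# cassandra-0.cassandra.default.svc.cluster.local
```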
### Environment Variables
Kubernetes injects service info into pods:
```bash
# Service host and port
BACKEND_SERVICE_SERVICE_HOST=10.0.0.100
BACKEND_SERVICE_SERVICE_PORT=80
# For named ports
BACKEND_SERVICE_SERVICE_PORT_HTTP=80
```
**Note:** These variables are set only when a container starts, so the Service must exist before the pod is created. DNS-based discovery has no such ordering requirement.
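The variable names follow a mechanical rule: the service name is upper-cased and dashes become underscores. The sketch below reproduces only the variables listed in this section; the function name `service_env` and its argument shape are hypothetical, and it assumes the first port in the list is the one exposed as `_SERVICE_PORT`.

```python
def _env_name(value):
    """Upper-case a name and replace dashes, as done for service env vars."""
    return value.upper().replace("-", "_")


def service_env(service, cluster_ip, ports):
    """Build the env vars for a Service.

    ports is a list of (name_or_None, port) tuples; the first entry is
    treated as the default port.
    """
    prefix = _env_name(service)
    env = {
        f"{prefix}_SERVICE_HOST": cluster_ip,
        f"{prefix}_SERVICE_PORT": str(ports[0][1]),
    }
    for name, port in ports:
        if name:  # only named ports get a per-port variable
            env[f"{prefix}_SERVICE_PORT_{_env_name(name)}"] = str(port)
    return env


for key, value in service_env("backend-service", "10.0.0.100", [("http", 80)]).items():
    print(f"{key}={value}")
```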
## Load Balancing
### Algorithms
kube-proxy picks a backend at random for each new connection in iptables mode (IPVS mode defaults to round-robin). For advanced load balancing:
**Service Mesh (Istio example):**
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-destination-rule
spec:
  host: my-service
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST # or ROUND_ROBIN, RANDOM, PASSTHROUGH
    connectionPool:
      tcp:
        maxConnections: 100
```
### Connection Limits
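A Service has no built-in connection limit. Cap per-pod load with resource limits or a mesh connection pool, and use a PodDisruptionBudget to keep enough backends available during voluntary disruptions such as node drains: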
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```
## Service Mesh Integration
### Istio Virtual Service
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - match:
    - headers:
        version:
          exact: v2
    route:
    - destination:
        host: my-service
        subset: v2
  - route:
    - destination:
        host: my-service
        subset: v1
      weight: 90
    - destination:
        host: my-service
        subset: v2
      weight: 10
```
## Common Patterns
### Pattern 1: Internal Microservice
```yaml
apiVersion: v1
kind: Service
metadata:
  name: user-service
  namespace: backend
  labels:
    app: user-service
    tier: backend
spec:
  type: ClusterIP
  selector:
    app: user-service
  ports:
  - name: http
    port: 8080
    targetPort: http
    protocol: TCP
  - name: grpc
    port: 9090
    targetPort: grpc
    protocol: TCP
```
### Pattern 2: Public API with Load Balancer
```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:..."
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: api-gateway
  ports:
  - name: https
    port: 443
    targetPort: 8443
    protocol: TCP
  loadBalancerSourceRanges:
  - 0.0.0.0/0
```
### Pattern 3: StatefulSet with Headless Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: cassandra
spec:
  clusterIP: None
  selector:
    app: cassandra
  ports:
  - port: 9042
    targetPort: 9042
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:4.0
```
### Pattern 4: External Service Mapping
```yaml
apiVersion: v1
kind: Service
metadata:
  name: external-database
spec:
  type: ExternalName
  externalName: prod-db.cxyz.us-west-2.rds.amazonaws.com
---
# Or with Endpoints for IP-based external service
apiVersion: v1
kind: Service
metadata:
  name: external-api
spec:
  ports:
  - port: 443
    targetPort: 443
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-api
subsets:
- addresses:
  - ip: 203.0.113.100
  ports:
  - port: 443
```
### Pattern 5: Multi-Port Service with Metrics
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
spec:
  type: ClusterIP
  selector:
    app: web-app
  ports:
  - name: http
    port: 80
    targetPort: 8080
  - name: metrics
    port: 9090
    targetPort: 9090
```
## Network Policies
Control traffic to services:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
```
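To reason about what the policy above permits, the ingress rules can be evaluated by hand or with a small model. The sketch below is a deliberately simplified evaluator, not how a CNI plugin implements policies: `ingress_allowed` is a hypothetical helper, it only understands `podSelector.matchLabels` peers in the same namespace, and it assumes the policy already selects the destination pod.

```python
def ingress_allowed(policy, source_labels, port, protocol="TCP"):
    """Return True if any ingress rule allows this source pod and port."""
    for rule in policy["ingress"]:
        # A peer matches when every label in its selector is on the source pod.
        peer_ok = any(
            all(
                source_labels.get(key) == value
                for key, value in peer["podSelector"]["matchLabels"].items()
            )
            for peer in rule["from"]
        )
        port_ok = any(
            p["port"] == port and p.get("protocol", "TCP") == protocol
            for p in rule["ports"]
        )
        if peer_ok and port_ok:
            return True
    return False


# The ingress section of allow-frontend-to-backend as a Python dict.
policy = {
    "ingress": [
        {
            "from": [{"podSelector": {"matchLabels": {"app": "frontend"}}}],
            "ports": [{"protocol": "TCP", "port": 8080}],
        }
    ]
}

print(ingress_allowed(policy, {"app": "frontend"}, 8080))  # True
print(ingress_allowed(policy, {"app": "batch"}, 8080))     # False
print(ingress_allowed(policy, {"app": "frontend"}, 9090))  # False
```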
## Best Practices
### Service Configuration
1. **Use named ports** for flexibility
2. **Set appropriate service type** based on exposure needs
3. **Use labels and selectors consistently** across Deployments and Services
4. **Configure session affinity** for stateful apps
5. **Set external traffic policy to Local** for IP preservation
6. **Use headless services** for StatefulSets
7. **Implement network policies** for security
8. **Add monitoring annotations** for observability
### Production Checklist
- [ ] Service type appropriate for use case
- [ ] Selector matches pod labels
- [ ] Named ports used for clarity
- [ ] Session affinity configured if needed
- [ ] Traffic policy set appropriately
- [ ] Load balancer annotations configured (if applicable)
- [ ] Source IP ranges restricted (for public services)
- [ ] Health check configuration validated
- [ ] Monitoring annotations added
- [ ] Network policies defined
### Performance Tuning
**For high traffic:**
```yaml
spec:
  externalTrafficPolicy: Local
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
```
**For WebSocket/long connections:**
```yaml
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 86400 # 24 hours
```
## Troubleshooting
### Service not accessible
```bash
# Check service exists
kubectl get service <service-name>
# Check endpoints (should show pod IPs)
kubectl get endpoints <service-name>
# Describe service
kubectl describe service <service-name>
# Check if pods match selector
kubectl get pods -l app=<app-name>
```
**Common issues:**
- Selector doesn't match pod labels
- No pods running (endpoints empty)
- Ports misconfigured
- Network policy blocking traffic
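The most frequent cause, a selector that does not match the pod labels, follows a simple rule: every key/value pair in the Service selector must be present on the pod, while extra pod labels are ignored. A minimal sketch of that rule, with the hypothetical helper name `selector_matches` (an empty selector is treated as no match here, since a Service without a selector gets no automatically managed endpoints):

```python
def selector_matches(selector, pod_labels):
    """Return True if the pod carries every label the selector requires."""
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


selector = {"app": "my-app", "version": "v1"}

# Extra labels on the pod do not matter.
print(selector_matches(selector, {"app": "my-app", "version": "v1", "tier": "backend"}))  # True
# One differing value is enough to leave the endpoints list empty.
print(selector_matches(selector, {"app": "my-app", "version": "v2"}))  # False
```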
### DNS resolution failing
```bash
# Test DNS from pod
kubectl run debug --rm -it --image=busybox -- nslookup <service-name>
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
```
### Load balancer issues
```bash
# Check load balancer status
kubectl describe service <service-name>
# Check events
kubectl get events --sort-by='.lastTimestamp'
# Verify cloud provider configuration
kubectl describe node
```
## Related Resources
- [Kubernetes Service API Reference](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#service-v1-core)
- [Service Networking](https://kubernetes.io/docs/concepts/services-networking/service/)
- [DNS for Services and Pods](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/)

Some files were not shown because too many files have changed in this diff