autojanet/skills/terrashark/docs/guides/quick-ops.md
Zoë cfec11bb46
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
fix: convert skill submodules to plain directories
stop-slop, taste-skill, terrashark had embedded .git dirs causing
Woodpecker clone to fail on submodule update.
2026-05-30 15:44:44 -07:00

150 lines
3.5 KiB
Markdown

# Terraform Quick Ops and Troubleshooting
Fast command recall and common failure handling for Terraform and OpenTofu operations.
## Core Command Sequence
### Terraform
```bash
terraform fmt -check
terraform init
terraform validate
terraform plan -out=plan.bin
terraform show -json plan.bin > plan.json
```
### OpenTofu
```bash
tofu fmt -check
tofu init
tofu validate
tofu plan -out=plan.bin
tofu show -json plan.bin > plan.json
```
## Common Failures and Fixes
### CI Passes Locally but Fails in Runner
**Causes:**
- Mismatch in runtime/provider versions
- Missing lockfile updates
- Environment variables present locally but missing in CI
**Fix:**
- Pin runtime and providers
- Commit lockfile
- Make required env vars explicit in pipeline
### Large Unexpected Replacements in Plan
**Causes:**
- Unstable iteration keys
- Hidden rename without `moved` mapping
- Data source drift feeding identity fields
**Fix:**
- Stabilize keys
- Add `moved` blocks
- Separate identity from mutable attributes
### AWS RDS Identifier Validation Errors
**Symptoms:** `InvalidParameterValue` for `identifier` or `final_snapshot_identifier`, names include dots.
**Fix:** Normalize to lowercase letters, numbers, and hyphens:
```javascript
locals {
rds_base = regexreplace(lower("${var.project}-prod"), "[^a-z0-9-]", "-")
}
resource "aws_db_instance" "main" {
identifier = local.rds_base
final_snapshot_identifier = "${local.rds_base}-final"
}
```
### Apply Contention on Shared State
**Cause:** Concurrent pipelines targeting same backend key.
**Fix:**
- Serialize applies for that stack
- Use lock timeout and per-stack concurrency guard
### Tests Are Too Costly
**Fix:**
- Tag tests by risk (`fast`, `integration`, `destructive`)
- Run full suite nightly, risk-tier suite on PRs
- Auto-clean ephemeral infra with TTL tags
### State Lock Stuck
**Symptom:** `Error: Error acquiring the state lock`
**Fix:**
```bash
# Identify the lock holder from the error message (lock ID shown)
terraform force-unlock LOCK_ID
# OpenTofu equivalent:
tofu force-unlock LOCK_ID
```
Only force-unlock when you are certain no other apply is running. Check CI pipelines and team activity first.
### State Corruption or Lost State
**Fix:**
- Restore from versioned state backend (S3 versioning, GCS versioning)
- If no backup: re-import resources using `import` blocks
- Never manually edit state JSON unless absolutely no alternative and with peer review
```bash
# Pull current state for inspection
terraform state pull > state-backup.json
# List all tracked resources
terraform state list
```
### Backend Migration
When changing state backends (e.g., local to S3, or S3 to different bucket):
```bash
# Update backend config in code, then:
terraform init -migrate-state
```
- Always backup state before migration
- Verify resource count matches after migration
- Test plan shows no changes after migration
### Provider Authentication Failures in CI
**Symptom:** `Error: No valid credential sources found`
**Fix:**
- Verify environment variables are set in CI runner
- Prefer workload identity federation over static keys
- Check credential expiry for short-lived tokens
- Ensure CI runner IAM role/service account has required permissions
### `null_resource` vs `terraform_data`
Use `terraform_data` (TF 1.4+) instead of `null_resource` + `null` provider:
```javascript
# Prefer this (no extra provider needed):
resource "terraform_data" "bootstrap" {
triggers_replace = [var.config_hash]
provisioner "local-exec" {
command = "bootstrap.sh"
}
}
```