Automation Center

Overview

Automation Center is the fleet-wide operations dashboard at /automation. It provides a single view of automation health, queue status, job history, and system controls across all sites — not just one.

This is distinct from the per-site Content Automation settings inside each Client Profile. Client Profile handles one site's automation configuration (topics, schedule, frequency). Automation Center is the control room that monitors everything at once — queue depth, worker status, failure rates, dead letter queue, and system mode — with admin-only controls for pausing, resuming, and managing the queue.

Hierarchy: - /automation — Fleet-wide dashboard (this feature) - /site/<id>/profile → Automations tab — Per-site automation settings

What It Does

System Status

The top of the dashboard shows the current state of the automation system at a glance:

Indicator	What It Shows
Mode	Current system mode — LIVE (active), DRY-RUN (running but not publishing), SHADOW (monitoring only), CANARY (gradual rollout), OFF, or OFFLINE
Leader	Hostname of the active automation worker
Queue	Current queue depth and in-progress job count
DLQ	Dead letter queue depth (red warning if > 0)
Cap	Daily publication cap status

Below the status row:

Tile	What It Shows
Next Run Windows	The 3 daily publication slots (8am, 2pm, 8pm ET) with next occurrence in UTC
WordPress Backfill	Status and pending/processed counts for WordPress content backfill
WordPress Upsert	Status and 24-hour update count for WordPress content sync
Post Verification	Pass/fail counts for post verification checks (24h)
IndexNow	Submitted/failed counts for search engine submissions (24h)

Automation Insights KPIs

Five primary KPIs displayed as real-time counters:

KPI	What It Measures
Posts (24h)	Total posts created across all sites in the last 24 hours
Verify (24h)	Post verification success/failure ratio
IndexNow (24h)	Search engine submission success/failure ratio
Queue	Current queue depth (jobs waiting to run)
DLQ	Current dead letter queue depth (jobs that failed permanently)

Recent Jobs

A searchable, filterable table of all automation jobs across all sites.

Columns: Job ID, Site, Status, Topic, Created, Started, Finished, Actions

Filters: - By status: All, Queued, Running, Finished, Failed - By site: All sites or a specific site - By search term (URL, site name, topic)

Features: - Auto-refreshes every 10 seconds - Paginated (up to 100 per page) - Export to CSV - Color-coded status badges: green (finished), red (failed), blue (running), amber (queued)

Dead Letter Queue (DLQ)

Jobs that have failed permanently — exceeded retry attempts or hit catastrophic errors — are moved to the dead letter queue. The DLQ prevents failed jobs from clogging the main queue while retaining them for manual inspection.

What the DLQ panel shows: - Job ID, site, topic, creation date, number of attempts, error message - Site filter and pagination - Export to CSV

Admin actions on DLQ items: - Retry — moves the job back to "queued" with attempts reset to 0 - Delete — permanently removes the job

Per-Site Automation Cards

A grid of cards showing every site with automation configured. Each card displays:

Site name and enabled/disabled status
Automation frequency (daily, weekly, biweekly, monthly)
Last run timestamp and next scheduled run
Queue depth for that site
Color-coded status (idle, queued, running, finished, failed)
Run Now button (admin-only) to trigger an immediate manual run

Clicking a site card navigates to that site's Client Profile → Automations tab for per-site configuration.

Queue Management (Admin-Only)

Four queue control actions available to administrators:

Action	What It Does
Pause	Stops all job processing without clearing the queue. Sets the `global_enable` flag to false.
Resume	Restarts job processing from where it was paused.
Retry All Failed	Moves all failed jobs back to "queued" with attempts reset to 0.
Clear Pending	Permanently deletes all queued jobs. Destructive — cannot be undone.

All admin actions are logged with the actor's email for audit trail.

Why It Matters

Fleet-wide visibility — instead of checking automation status site by site, see every site's queue health, job status, and failure rates on one page.
Catch failures early — the DLQ and failure rate indicators surface problems before they compound. A spike in failed jobs across multiple sites often indicates an upstream issue (API rate limits, OpenAI outages, WordPress connectivity) that needs attention.
Admin controls prevent cascading problems — pausing the queue during an API outage stops failed jobs from piling up. Retrying after the issue resolves is one click.
Capacity planning — queue depth, daily caps, and publication rate KPIs show whether the system is keeping up with the configured publishing schedules across all sites.
Accountability — every admin action (pause, resume, retry, delete) is logged with the actor's identity and timestamp.

How to Use It

Monitoring Day-to-Day Operations

Navigate to Automation Center (/automation) from the sidebar.
Check the System Status row — confirm the mode is LIVE and the leader worker is active.
Review the KPI tiles — Posts (24h) shows production volume; DLQ > 0 means failures need attention.
Scan the Per-Site Cards — look for sites with failed or stuck jobs (red status).

Investigating Failures

Check the Recent Jobs panel — filter by status "Failed" to see all failures.
Click into a failed job to see the error message and attempt count.
If the failure is transient (API timeout, rate limit), use Retry All Failed to re-queue.
If the failure is permanent (bad configuration, invalid topic), check the DLQ panel.
In the DLQ, decide per-job: Retry (if the root cause is fixed) or Delete (if the job should be discarded).

Responding to System Issues

Scenario	Action
OpenAI API outage	Pause the queue → wait for resolution → Resume
High failure rate spike	Check error messages for pattern → fix root cause → Retry All Failed
DLQ building up	Review each item → fix site-level configuration → Retry or Delete
Queue growing faster than processing	Check worker health → verify leader is active → check daily caps
Need to stop all automation	Pause the queue (preserves all jobs for later)

Triggering Manual Runs

Find the target site's card in the Per-Site grid.
Click Run Now (admin-only).
The system inserts a job with an idempotency key — duplicate rapid clicks are prevented (one run per site per minute).

Key Settings / Options

System Modes

Mode	Behavior	Badge Color
LIVE	Active, processing and publishing jobs normally	Green
DRY-RUN	Running pipeline but not publishing to WordPress	Amber
SHADOW	Monitoring only, no job execution	Blue
CANARY	Gradual rollout — processing a subset of sites	Blue
OFF	Automation disabled globally	Gray
OFFLINE	Worker unreachable, operations halted	Gray

Guardrails & Safety

Guardrail	What It Prevents
Daily caps	Per-site configurable limit (1-10 posts/day, default 1). Prevents over-publishing.
Circuit breaker	Trips automatically if failure rate exceeds 20% across last 20+ jobs, or if p95 job duration exceeds 120 seconds. Disables processing until manually reset.
Idempotency keys	Prevents duplicate manual runs — one run per site per action type per minute.
Rate limiting	3-second minimum between admin actions per user.
Duplicate prevention	Same job cannot be re-enqueued while already in queue.
Topic guards	Per-site topic whitelist/blacklist prevents forbidden topics from being automated.

Access Control

Capability	Admin	Non-Admin
View dashboard, KPIs, status	Yes	Yes
View recent jobs and DLQ	Yes	Yes (read-only)
Pause / Resume queue	Yes	No
Retry All Failed	Yes	No
Clear Pending	Yes	No
Retry / Delete DLQ items	Yes	No
Run Now (per-site)	Yes	No
Edit runtime flags	Yes	No

Publication Schedule

The automation worker runs on a fixed daily schedule with 3 publication windows:

Window	Time (ET)
Morning	8:00 AM
Afternoon	2:00 PM
Evening	8:00 PM

Per-site frequency settings (daily, weekly, biweekly, monthly) determine which windows a site participates in.

Notes / Edge Cases

Worker lock — only one automation worker runs at a time, enforced by a database lock (automation_locks table) with heartbeat monitoring. If the heartbeat is older than 5 minutes, the worker is considered dead.
Circuit breaker auto-trips — if the failure rate exceeds 20% or p95 duration exceeds 120 seconds, the circuit breaker trips automatically. This protects against cascading failures but means automation will stop until an admin investigates and resets.
DLQ is not automatic cleanup — jobs in the dead letter queue stay there indefinitely until an admin manually retries or deletes them. They don't auto-expire.
Clear Pending is destructive — unlike Pause (which preserves jobs), Clear Pending permanently deletes all queued jobs. There is no undo.
All admin actions are logged — pause, resume, retry, delete, and manual run actions include the actor's email in the log for accountability.
Per-site automation is configured separately — Automation Center is for monitoring and fleet-level controls. To change a site's frequency, daily cap, topics, or schedule, go to that site's Client Profile → Automations tab.