Chapter 25: Monitoring & Alerts

Sync Monitoring

Synchronization is the core operational concern for AdPriority. If syncs fail, custom labels go stale, and merchants’ Google Ads campaigns operate on outdated priority data. The monitoring system tracks every sync event and alerts on failures.

Sync Types

  Sync Type               Direction                   Trigger                                 Frequency
  ----------------------  --------------------------  --------------------------------------  ------------------------------
  Product import          Shopify -> AdPriority       Install, manual, webhook                On demand
  Priority recalculation  Internal                    Rule change, season transition, manual  On demand
  Sheet write             AdPriority -> Google Sheet  Scheduled, manual, post-recalculation   Configurable (hourly to daily)
  GMC fetch               Google Sheet -> GMC         GMC schedule                            Daily (managed by Google)

Sync Tracking

Every sync event creates a record in the sync_logs table:

-- Example sync log entry
INSERT INTO sync_logs (
  store_id, sync_type, trigger_source, status,
  started_at, completed_at,
  products_total, products_success, products_failed, products_skipped,
  error_message
) VALUES (
  'store-uuid', 'sheet_write', 'scheduled_cron', 'completed',
  '2026-02-10 10:30:00', '2026-02-10 10:30:45',
  15284, 15284, 0, 0,
  NULL
);

Success Rate Tracking

The system computes a rolling success rate over the last 24 hours, the last 7 days, and the last 30 days:

SYNC SUCCESS RATE
=================

  Last 24 Hours:
    Total syncs:      4
    Successful:       4
    Failed:           0
    Success rate:     100%

  Last 7 Days:
    Total syncs:      28
    Successful:       27
    Failed:           1  (Feb 7, Sheets API quota exceeded)
    Success rate:     96.4%

  Last 30 Days:
    Total syncs:      120
    Successful:       118
    Failed:           2
    Success rate:     98.3%
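
The rolling rates can be computed directly from sync_logs rows. A minimal TypeScript sketch, where the SyncLogRow shape and the window handling are illustrative rather than the app's exact implementation:

// Rolling sync success rate over a time window (sketch).
interface SyncLogRow {
  status: 'completed' | 'failed';
  startedAt: Date;
}

function successRate(logs: SyncLogRow[], windowMs: number, now = new Date()): number | null {
  const cutoff = now.getTime() - windowMs;
  const inWindow = logs.filter((log) => log.startedAt.getTime() >= cutoff);
  if (inWindow.length === 0) return null; // no syncs in the window: rate is undefined
  const successful = inWindow.filter((log) => log.status === 'completed').length;
  return (successful / inWindow.length) * 100;
}

// Usage for the windows shown above:
// const rate24h = successRate(logs, 24 * 60 * 60 * 1000);
// const rate7d  = successRate(logs, 7 * 24 * 60 * 60 * 1000);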

Alert on Failures

  Condition                               Alert Level  Action
  --------------------------------------  -----------  ----------------------------------
  Single sync failure                     Warning      Log error, retry in 5 minutes
  2 consecutive failures                  Error        Email notification to merchant
  3+ consecutive failures                 Critical     Email notification + in-app banner
  Success rate below 95% (7-day rolling)  Warning      Review sync logs for patterns
  No sync in 24+ hours                    Error        Check worker health, email alert
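
The escalation path reduces to a small mapping from consecutive failure count to alert level. A sketch, with the names illustrative:

// Map consecutive sync failures to an alert level (mirrors the table above).
type AlertLevel = 'none' | 'warning' | 'error' | 'critical';

function alertLevelForFailures(consecutiveFailures: number): AlertLevel {
  if (consecutiveFailures >= 3) return 'critical'; // email + in-app banner
  if (consecutiveFailures === 2) return 'error';   // email notification to merchant
  if (consecutiveFailures === 1) return 'warning'; // log error, retry in 5 minutes
  return 'none';
}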

Error Tracking

Error Categories

  Category           Examples                                                 Severity  Response
  -----------------  -------------------------------------------------------  --------  ------------------------------------------------
  Shopify API        Rate limit (429), token expired, scope revoked           High      Backoff and retry, re-auth if token invalid
  Shopify Webhook    Delivery failure, HMAC mismatch, payload parse error     Medium    Shopify retries 19 times over 48h automatically
  Google Sheets API  Quota exceeded, permission denied, sheet deleted         High      Retry with backoff, alert merchant if persistent
  Database           Connection refused, query timeout, constraint violation  Critical  Auto-reconnect, alert on repeated failures
  Scoring Engine     Unmapped product type, invalid tag format                Low       Log warning, use default priority, continue
  Billing            Subscription expired, charge declined                    Medium    Downgrade to free tier, notify merchant
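
The "backoff and retry" responses for the API categories can share one retry wrapper. A sketch; the retryable status codes, attempt cap, and delays are assumptions, not the app's exact policy:

// Exponential backoff with a retry cap (sketch). Retries rate-limit and
// transient server errors; anything else is rethrown to the category handler.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const retryable = err?.status === 429 || err?.status >= 500;
      if (!retryable || attempt >= maxAttempts) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}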

Error Log Format

All errors are logged as structured JSON to facilitate parsing and alerting:

{
  "timestamp": "2026-02-10T10:30:45.123Z",
  "level": "error",
  "service": "sync",
  "store_id": "abc-123",
  "sync_type": "sheet_write",
  "error_code": "SHEETS_QUOTA_EXCEEDED",
  "error_message": "Google Sheets API daily quota exceeded",
  "context": {
    "products_processed": 8742,
    "products_remaining": 6542,
    "retry_attempt": 2,
    "next_retry_at": "2026-02-10T10:35:45.123Z"
  },
  "stack_trace": "..."
}
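
A sketch of emitting these entries from the Node services, one JSON object per line so the Docker log driver and alerting can parse them; the ErrorLogEntry type mirrors the fields above, and the helper name is illustrative:

// Structured error log entry matching the JSON format above (sketch).
interface ErrorLogEntry {
  timestamp: string;
  level: 'error' | 'warning';
  service: string;
  store_id: string;
  sync_type?: string;
  error_code: string;
  error_message: string;
  context?: Record<string, unknown>;
  stack_trace?: string;
}

function logError(entry: Omit<ErrorLogEntry, 'timestamp'>): void {
  const line: ErrorLogEntry = { timestamp: new Date().toISOString(), ...entry };
  console.error(JSON.stringify(line)); // one line per entry, captured by the log driver
}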

Error Aggregation

Errors are aggregated by category and time window for the monitoring dashboard:

ERROR SUMMARY (Last 24 Hours)
==============================

  Category            Count   Last Occurrence      Status
  ------------------  -----   ------------------   ------
  Shopify API           0     --                   OK
  Google Sheets API     1     Feb 10, 04:15 AM     Recovered
  Database              0     --                   OK
  Scoring Engine        3     Feb 10, 10:30 AM     Active
  Webhooks              0     --                   OK
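
A sketch of the aggregation behind this summary, assuming each logged error has already been mapped to one of the categories above (the category field is an assumption; in practice it would be derived from error_code):

// Aggregate error entries by category over a time window (sketch).
interface CategorizedError {
  category: string; // e.g. 'Shopify API', 'Google Sheets API'
  timestamp: Date;
}

function summarizeErrors(errors: CategorizedError[], windowMs: number, now = new Date()) {
  const cutoff = now.getTime() - windowMs;
  const summary = new Map<string, { count: number; lastOccurrence: Date }>();
  for (const err of errors) {
    if (err.timestamp.getTime() < cutoff) continue;
    const entry = summary.get(err.category) ?? { count: 0, lastOccurrence: err.timestamp };
    entry.count += 1;
    if (err.timestamp > entry.lastOccurrence) entry.lastOccurrence = err.timestamp;
    summary.set(err.category, entry);
  }
  return summary; // category -> { count, lastOccurrence }
}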

Health Checks

API Health Endpoint

The backend exposes a /health endpoint that checks all dependencies:

GET /health

Response (healthy):
{
  "status": "healthy",
  "timestamp": "2026-02-10T10:30:45.123Z",
  "uptime_seconds": 86400,
  "checks": {
    "database": { "status": "connected", "latency_ms": 3 },
    "redis": { "status": "connected", "latency_ms": 1 },
    "google_sheets_api": { "status": "reachable", "latency_ms": 120 },
    "shopify_api": { "status": "reachable", "latency_ms": 85 }
  }
}

Response (degraded):
{
  "status": "degraded",
  "timestamp": "2026-02-10T10:30:45.123Z",
  "checks": {
    "database": { "status": "connected", "latency_ms": 3 },
    "redis": { "status": "connected", "latency_ms": 1 },
    "google_sheets_api": { "status": "error", "error": "quota exceeded" },
    "shopify_api": { "status": "reachable", "latency_ms": 85 }
  }
}
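
A sketch of how the endpoint could be assembled, assuming an Express backend; the probe bodies, the port, and the simplified status strings ("ok"/"error") are placeholders rather than the app's exact implementation:

import express from 'express';

// Per-dependency probes: resolve on success, throw on failure.
// The bodies are placeholders; the real app calls its own clients here.
const probes: Record<string, () => Promise<void>> = {
  database: async () => { /* e.g. SELECT 1 via the DB client */ },
  redis: async () => { /* e.g. PING via the Redis client */ },
  google_sheets_api: async () => { /* e.g. metadata read on a test sheet */ },
  shopify_api: async () => { /* e.g. GET shop.json for a test store */ },
};

const startedAt = Date.now();
const app = express();

app.get('/health', async (_req, res) => {
  const checks: Record<string, { status: string; latency_ms?: number; error?: string }> = {};
  let healthy = true;

  for (const [name, probe] of Object.entries(probes)) {
    const t0 = Date.now();
    try {
      await probe();
      checks[name] = { status: 'ok', latency_ms: Date.now() - t0 };
    } catch (err) {
      healthy = false;
      checks[name] = { status: 'error', error: err instanceof Error ? err.message : String(err) };
    }
  }

  // Degraded responses use 503 so external monitors can alert on the status code alone.
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    timestamp: new Date().toISOString(),
    uptime_seconds: Math.floor((Date.now() - startedAt) / 1000),
    checks,
  });
});

app.listen(3010);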

Health Check Schedule

  Check                  Frequency         Method                                             Alert Threshold
  ---------------------  ----------------  -------------------------------------------------  ----------------------
  Database connectivity  Every 30 seconds  SELECT 1 query                                     1 consecutive failure
  Redis connectivity     Every 30 seconds  PING command                                       1 consecutive failure
  Google Sheets API      Every 5 minutes   Metadata read on test sheet                        3 consecutive failures
  Shopify API            Every 5 minutes   GET /admin/api/2024-01/shop.json (one test store)  3 consecutive failures
  Worker process         Every 60 seconds  Bull queue heartbeat                               2 consecutive failures
  Disk usage             Every 15 minutes  docker system df                                   > 80% usage
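
A sketch of how these checks could be driven on their schedules, with a per-check consecutive-failure threshold; the probe and alert callbacks are stand-ins for the app's own implementations:

// Run a health check on an interval and alert after N consecutive failures (sketch).
function scheduleCheck(
  name: string,
  probe: () => Promise<void>,
  intervalMs: number,
  failureThreshold: number,
  onAlert: (check: string, failures: number) => void,
): void {
  let consecutiveFailures = 0;
  setInterval(async () => {
    try {
      await probe();
      consecutiveFailures = 0; // a success resets the streak
    } catch {
      consecutiveFailures += 1;
      if (consecutiveFailures >= failureThreshold) onAlert(name, consecutiveFailures);
    }
  }, intervalMs);
}

// e.g. database every 30 seconds, alerting on the first failure (per the table above):
// scheduleCheck('database', pingDatabase, 30_000, 1, raiseAlert);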

Docker Health Checks

The production Docker Compose configuration includes container-level health checks:

CONTAINER HEALTH
================

  adpriority-backend:
    Check: curl -f http://localhost:3010/health
    Interval: 30s
    Timeout: 10s
    Retries: 3
    Start period: 15s

  adpriority-redis:
    Check: redis-cli ping
    Interval: 10s
    Timeout: 5s
    Retries: 3

  adpriority-worker:
    Check: node healthcheck.js (checks Bull queue connection)
    Interval: 30s
    Timeout: 10s
    Retries: 3

Docker marks a container unhealthy once its check fails the configured number of retries; because restart: always is set in the compose file, containers that crash or exit are restarted automatically. The retry thresholds give transient failures (a network blip, temporary memory pressure) room to resolve without flagging the container or requiring manual intervention.
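
A sketch of what the worker's health check script might look like, probing the Redis connection that backs the Bull queue via ioredis; the REDIS_URL variable and the timeout are assumptions:

// healthcheck.ts (sketch): exit 0 if the queue's Redis backend answers, 1 otherwise.
// Docker treats a non-zero exit code as a failed health check.
import Redis from 'ioredis';

async function main(): Promise<void> {
  const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
    connectTimeout: 5000,
    lazyConnect: true,
  });
  try {
    await redis.connect();
    await redis.ping();
    redis.disconnect();
    process.exit(0); // healthy
  } catch {
    process.exit(1); // unhealthy: counts toward the container's retry limit
  }
}

main();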


Metrics

Operational Metrics

  Metric                    Type        Description                         Collection
  ------------------------  ----------  ----------------------------------  -----------------------
  sync.products_synced      Counter     Total products synced (cumulative)  Incremented per sync
  sync.duration_seconds     Histogram   Time to complete a sync             Per sync event
  sync.success_rate         Gauge       Rolling success rate (7-day)        Computed hourly
  sync.error_count          Counter     Total sync errors                   Incremented per error
  priority.changes_per_day  Counter     Products that changed priority      Daily aggregation
  priority.distribution     Gauge (x6)  Products at each priority level     Computed on demand
  api.request_count         Counter     Total API requests served           Per request
  api.latency_p95           Histogram   95th percentile response time       Per request
  api.error_rate            Gauge       Percentage of 5xx responses         Rolling 5-minute window
  worker.queue_depth        Gauge       Jobs waiting in Bull queue          Polled every 30s
  worker.job_duration       Histogram   Time to process a queue job         Per job

Business Metrics

  Metric                         Description                                  Collection
  -----------------------------  -------------------------------------------  -------------------------
  stores.active                  Number of stores with active subscriptions   Daily count
  stores.trial                   Number of stores in trial period             Daily count
  stores.churned                 Stores that uninstalled in last 30 days      Monthly count
  products.total_managed         Sum of products across all stores            Daily count
  revenue.mrr                    Monthly recurring revenue                    From subscriptions table
  onboarding.time_to_first_sync  Time from install to first Sheet sync        Per store

Metric Storage

For the initial deployment, metrics are stored in the PostgreSQL database as aggregated daily snapshots rather than introducing a dedicated time-series database. The sync_logs and audit_logs tables serve as the primary metric source.

-- Example: Daily metrics aggregation query
SELECT
  DATE(created_at) AS day,
  COUNT(*) AS total_syncs,
  COUNT(*) FILTER (WHERE status = 'completed') AS successful,
  COUNT(*) FILTER (WHERE status = 'failed') AS failed,
  AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) AS avg_duration_seconds,
  SUM(products_success) AS total_products_synced
FROM sync_logs
WHERE store_id = $1
  AND created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY day DESC;

If the deployment scales beyond 50 stores, consider migrating metrics to a lightweight time-series solution or an external monitoring service.


Alerting

Alert Channels

  Channel           Use Case                           Configuration
  ----------------  ---------------------------------  -----------------------------------------
  Email             Sync failures, critical errors     Merchant notification email from Settings
  In-app Banner     Degraded service, action required  Polaris Banner on Dashboard
  Docker logs       All operational events             docker logs adpriority-backend
  Console (stdout)  Structured JSON logs               Captured by Docker log driver

Alert Rules

ALERT RULES
============

  Rule: sync_consecutive_failures
    Condition: 3 consecutive sync failures for a store
    Severity: Critical
    Action: Email merchant, show in-app banner
    Message: "AdPriority sync has failed 3 times. Your Google Merchant
             Center labels may be out of date. Check Settings > Sync."

  Rule: worker_down
    Condition: Worker health check fails for 2 minutes
    Severity: Critical
    Action: Docker auto-restart, log to console
    Message: (internal) "Worker process unresponsive, auto-restarting"

  Rule: database_connection_lost
    Condition: Database health check fails
    Severity: Critical
    Action: Prisma auto-reconnect, log to console
    Message: (internal) "Database connection lost, reconnecting"

  Rule: sheet_api_quota
    Condition: Google Sheets API returns 429 (quota exceeded)
    Severity: Warning
    Action: Queue retry for next hour, email merchant if persists
    Message: "Google Sheets API quota exceeded. Sync will retry in 1 hour."

  Rule: high_error_rate
    Condition: API error rate > 5% over 5-minute window
    Severity: Warning
    Action: Log to console
    Message: (internal) "API error rate elevated: X% in last 5 minutes"

  Rule: stale_sync
    Condition: No successful sync in 24+ hours for an active store
    Severity: Warning
    Action: Email merchant
    Message: "AdPriority has not synced in over 24 hours.
             Check your sync settings."

Alert Suppression

To avoid alert fatigue:

  Rule                            Suppression
  ------------------------------  --------------------------------------------------------------
  Same alert type for same store  Suppress for 1 hour after first firing
  Maintenance window              Suppress all non-critical alerts during scheduled maintenance
  Inactive stores                 Do not alert for stores with is_active = false
  Trial stores                    Lighter alerting (email only, no paging)
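
A sketch of the one-hour suppression window for repeated alerts, using a Redis key set with NX + EX so only the first firing per store and alert type sends a notification; the key naming and TTL handling are assumptions:

// Return true only for the first firing of (store, alert type) in a one-hour window (sketch).
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

async function shouldSendAlert(storeId: string, alertType: string): Promise<boolean> {
  const key = `alert-suppress:${storeId}:${alertType}`;
  // SET ... EX 3600 NX succeeds only if the key does not already exist.
  const result = await redis.set(key, '1', 'EX', 3600, 'NX');
  return result === 'OK';
}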