Chapter 25: Monitoring & Alerts

Sync Monitoring

Synchronization is the core operational concern for AdPriority. If syncs fail, custom labels go stale, and merchants’ Google Ads campaigns operate on outdated priority data. The monitoring system tracks every sync event and alerts on failures.

Sync Types

  Sync Type               Direction                   Trigger                                 Frequency
  ----------------------  --------------------------  --------------------------------------  ------------------------------
  Product import          Shopify -> AdPriority       Install, manual, webhook                On demand
  Priority recalculation  Internal                    Rule change, season transition, manual  On demand
  Sheet write             AdPriority -> Google Sheet  Scheduled, manual, post-recalculation   Configurable (hourly to daily)
  GMC fetch               Google Sheet -> GMC         GMC schedule                            Daily (managed by Google)

Sync Tracking

Every sync event creates a record in the sync_logs table:

-- Example sync log entry
INSERT INTO sync_logs (
  store_id, sync_type, trigger_source, status,
  started_at, completed_at,
  products_total, products_success, products_failed, products_skipped,
  error_message
) VALUES (
  'store-uuid', 'sheet_write', 'scheduled_cron', 'completed',
  '2026-02-10 10:30:00', '2026-02-10 10:30:45',
  15284, 15284, 0, 0,
  NULL
);

Success Rate Tracking

The system computes a rolling success rate over the last 24 hours, the last 7 days, and the last 30 days:

SYNC SUCCESS RATE
=================

  Last 24 Hours:
    Total syncs:      4
    Successful:       4
    Failed:           0
    Success rate:     100%

  Last 7 Days:
    Total syncs:      28
    Successful:       27
    Failed:           1  (Feb 7, Sheets API quota exceeded)
    Success rate:     96.4%

  Last 30 Days:
    Total syncs:      120
    Successful:       118
    Failed:           2
    Success rate:     98.3%
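
The rolling rates can be computed directly from sync_logs rows. A minimal TypeScript sketch, where the SyncLogRow shape and the window handling are illustrative rather than the app's exact implementation:

// Rolling sync success rate over a time window (sketch).
interface SyncLogRow {
  status: 'completed' | 'failed';
  startedAt: Date;
}

function successRate(logs: SyncLogRow[], windowMs: number, now = new Date()): number | null {
  const cutoff = now.getTime() - windowMs;
  const inWindow = logs.filter((log) => log.startedAt.getTime() >= cutoff);
  if (inWindow.length === 0) return null; // no syncs in the window: rate is undefined
  const successful = inWindow.filter((log) => log.status === 'completed').length;
  return (successful / inWindow.length) * 100;
}

// Usage for the windows shown above:
// const rate24h = successRate(logs, 24 * 60 * 60 * 1000);
// const rate7d  = successRate(logs, 7 * 24 * 60 * 60 * 1000);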

Alert on Failures

  Condition                               Alert Level  Action
  --------------------------------------  -----------  ----------------------------------
  Single sync failure                     Warning      Log error, retry in 5 minutes
  2 consecutive failures                  Error        Email notification to merchant
  3+ consecutive failures                 Critical     Email notification + in-app banner
  Success rate below 95% (7-day rolling)  Warning      Review sync logs for patterns
  No sync in 24+ hours                    Error        Check worker health, email alert
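
The escalation path reduces to a small mapping from consecutive failure count to alert level. A sketch, with the names illustrative:

// Map consecutive sync failures to an alert level (mirrors the table above).
type AlertLevel = 'none' | 'warning' | 'error' | 'critical';

function alertLevelForFailures(consecutiveFailures: number): AlertLevel {
  if (consecutiveFailures >= 3) return 'critical'; // email + in-app banner
  if (consecutiveFailures === 2) return 'error';   // email notification to merchant
  if (consecutiveFailures === 1) return 'warning'; // log error, retry in 5 minutes
  return 'none';
}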

Error Tracking

Error Categories

  Category           Examples                                                 Severity  Response
  -----------------  -------------------------------------------------------  --------  ------------------------------------------------
  Shopify API        Rate limit (429), token expired, scope revoked           High      Backoff and retry, re-auth if token invalid
  Shopify Webhook    Delivery failure, HMAC mismatch, payload parse error     Medium    Shopify retries 19 times over 48h automatically
  Google Sheets API  Quota exceeded, permission denied, sheet deleted         High      Retry with backoff, alert merchant if persistent
  Database           Connection refused, query timeout, constraint violation  Critical  Auto-reconnect, alert on repeated failures
  Scoring Engine     Unmapped product type, invalid tag format                Low       Log warning, use default priority, continue
  Billing            Subscription expired, charge declined                    Medium    Downgrade to free tier, notify merchant
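
The "backoff and retry" responses for the API categories can share one retry wrapper. A sketch; the retryable status codes, attempt cap, and delays are assumptions, not the app's exact policy:

// Exponential backoff with a retry cap (sketch). Retries rate-limit and
// transient server errors; anything else is rethrown to the category handler.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const retryable = err?.status === 429 || err?.status >= 500;
      if (!retryable || attempt >= maxAttempts) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}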

Error Log Format

All errors are logged as structured JSON to facilitate parsing and alerting:

{
  "timestamp": "2026-02-10T10:30:45.123Z",
  "level": "error",
  "service": "sync",
  "store_id": "abc-123",
  "sync_type": "sheet_write",
  "error_code": "SHEETS_QUOTA_EXCEEDED",
  "error_message": "Google Sheets API daily quota exceeded",
  "context": {
    "products_processed": 8742,
    "products_remaining": 6542,
    "retry_attempt": 2,
    "next_retry_at": "2026-02-10T10:35:45.123Z"
  },
  "stack_trace": "..."
}
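
A sketch of emitting these entries from the Node services, one JSON object per line so the Docker log driver and alerting can parse them; the ErrorLogEntry type mirrors the fields above, and the helper name is illustrative:

// Structured error log entry matching the JSON format above (sketch).
interface ErrorLogEntry {
  timestamp: string;
  level: 'error' | 'warning';
  service: string;
  store_id: string;
  sync_type?: string;
  error_code: string;
  error_message: string;
  context?: Record<string, unknown>;
  stack_trace?: string;
}

function logError(entry: Omit<ErrorLogEntry, 'timestamp'>): void {
  const line: ErrorLogEntry = { timestamp: new Date().toISOString(), ...entry };
  console.error(JSON.stringify(line)); // one line per entry, captured by the log driver
}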

Error Aggregation

Errors are aggregated by category and time window for the monitoring dashboard:

ERROR SUMMARY (Last 24 Hours)
==============================

  Category            Count   Last Occurrence      Status
  ------------------  -----   ------------------   ------
  Shopify API           0     --                   OK
  Google Sheets API     1     Feb 10, 04:15 AM     Recovered
  Database              0     --                   OK
  Scoring Engine        3     Feb 10, 10:30 AM     Active
  Webhooks              0     --                   OK
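
A sketch of the aggregation behind this summary, assuming each logged error has already been mapped to one of the categories above (the category field is an assumption; in practice it would be derived from error_code):

// Aggregate error entries by category over a time window (sketch).
interface CategorizedError {
  category: string; // e.g. 'Shopify API', 'Google Sheets API'
  timestamp: Date;
}

function summarizeErrors(errors: CategorizedError[], windowMs: number, now = new Date()) {
  const cutoff = now.getTime() - windowMs;
  const summary = new Map<string, { count: number; lastOccurrence: Date }>();
  for (const err of errors) {
    if (err.timestamp.getTime() < cutoff) continue;
    const entry = summary.get(err.category) ?? { count: 0, lastOccurrence: err.timestamp };
    entry.count += 1;
    if (err.timestamp > entry.lastOccurrence) entry.lastOccurrence = err.timestamp;
    summary.set(err.category, entry);
  }
  return summary; // category -> { count, lastOccurrence }
}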

Health Checks

API Health Endpoint

The backend exposes a /health endpoint that checks all dependencies:

GET /health

Response (healthy):
{
  "status": "healthy",
  "timestamp": "2026-02-10T10:30:45.123Z",
  "uptime_seconds": 86400,
  "checks": {
    "database": { "status": "connected", "latency_ms": 3 },
    "redis": { "status": "connected", "latency_ms": 1 },
    "google_sheets_api": { "status": "reachable", "latency_ms": 120 },
    "shopify_api": { "status": "reachable", "latency_ms": 85 }
  }
}

Response (degraded):
{
  "status": "degraded",
  "timestamp": "2026-02-10T10:30:45.123Z",
  "checks": {
    "database": { "status": "connected", "latency_ms": 3 },
    "redis": { "status": "connected", "latency_ms": 1 },
    "google_sheets_api": { "status": "error", "error": "quota exceeded" },
    "shopify_api": { "status": "reachable", "latency_ms": 85 }
  }
}
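
A sketch of how the endpoint could be assembled, assuming an Express backend; the probe bodies, the port, and the simplified status strings ("ok"/"error") are placeholders rather than the app's exact implementation:

import express from 'express';

// Per-dependency probes: resolve on success, throw on failure.
// The bodies are placeholders; the real app calls its own clients here.
const probes: Record<string, () => Promise<void>> = {
  database: async () => { /* e.g. SELECT 1 via the DB client */ },
  redis: async () => { /* e.g. PING via the Redis client */ },
  google_sheets_api: async () => { /* e.g. metadata read on a test sheet */ },
  shopify_api: async () => { /* e.g. GET shop.json for a test store */ },
};

const startedAt = Date.now();
const app = express();

app.get('/health', async (_req, res) => {
  const checks: Record<string, { status: string; latency_ms?: number; error?: string }> = {};
  let healthy = true;

  for (const [name, probe] of Object.entries(probes)) {
    const t0 = Date.now();
    try {
      await probe();
      checks[name] = { status: 'ok', latency_ms: Date.now() - t0 };
    } catch (err) {
      healthy = false;
      checks[name] = { status: 'error', error: err instanceof Error ? err.message : String(err) };
    }
  }

  // Degraded responses use 503 so external monitors can alert on the status code alone.
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    timestamp: new Date().toISOString(),
    uptime_seconds: Math.floor((Date.now() - startedAt) / 1000),
    checks,
  });
});

app.listen(3010);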

Health Check Schedule

  Check                  Frequency         Method                                             Alert Threshold
  ---------------------  ----------------  -------------------------------------------------  ----------------------
  Database connectivity  Every 30 seconds  SELECT 1 query                                     1 consecutive failure
  Redis connectivity     Every 30 seconds  PING command                                       1 consecutive failure
  Google Sheets API      Every 5 minutes   Metadata read on test sheet                        3 consecutive failures
  Shopify API            Every 5 minutes   GET /admin/api/2024-01/shop.json (one test store)  3 consecutive failures
  Worker process         Every 60 seconds  Bull queue heartbeat                               2 consecutive failures
  Disk usage             Every 15 minutes  docker system df                                   > 80% usage
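
A sketch of how these checks could be driven on their schedules, with a per-check consecutive-failure threshold; the probe and alert callbacks are stand-ins for the app's own implementations:

// Run a health check on an interval and alert after N consecutive failures (sketch).
function scheduleCheck(
  name: string,
  probe: () => Promise<void>,
  intervalMs: number,
  failureThreshold: number,
  onAlert: (check: string, failures: number) => void,
): void {
  let consecutiveFailures = 0;
  setInterval(async () => {
    try {
      await probe();
      consecutiveFailures = 0; // a success resets the streak
    } catch {
      consecutiveFailures += 1;
      if (consecutiveFailures >= failureThreshold) onAlert(name, consecutiveFailures);
    }
  }, intervalMs);
}

// e.g. database every 30 seconds, alerting on the first failure (per the table above):
// scheduleCheck('database', pingDatabase, 30_000, 1, raiseAlert);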

Docker Health Checks

The production Docker Compose configuration includes container-level health checks:

CONTAINER HEALTH
================

  adpriority-backend:
    Check: curl -f http://localhost:3010/health
    Interval: 30s
    Timeout: 10s
    Retries: 3
    Start period: 15s

  adpriority-redis:
    Check: redis-cli ping
    Interval: 10s
    Timeout: 5s
    Retries: 3

  adpriority-worker:
    Check: node healthcheck.js (checks Bull queue connection)
    Interval: 30s
    Timeout: 10s
    Retries: 3

Docker marks a container unhealthy once its check fails the configured number of retries; because restart: always is set in the compose file, containers that crash or exit are restarted automatically. The retry thresholds give transient failures (a network blip, temporary memory pressure) room to resolve without flagging the container or requiring manual intervention.
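
A sketch of what the worker's health check script might look like, probing the Redis connection that backs the Bull queue via ioredis; the REDIS_URL variable and the timeout are assumptions:

// healthcheck.ts (sketch): exit 0 if the queue's Redis backend answers, 1 otherwise.
// Docker treats a non-zero exit code as a failed health check.
import Redis from 'ioredis';

async function main(): Promise<void> {
  const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
    connectTimeout: 5000,
    lazyConnect: true,
  });
  try {
    await redis.connect();
    await redis.ping();
    redis.disconnect();
    process.exit(0); // healthy
  } catch {
    process.exit(1); // unhealthy: counts toward the container's retry limit
  }
}

main();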


Metrics

Operational Metrics

  Metric                    Type        Description                         Collection
  ------------------------  ----------  ----------------------------------  -----------------------
  sync.products_synced      Counter     Total products synced (cumulative)  Incremented per sync
  sync.duration_seconds     Histogram   Time to complete a sync             Per sync event
  sync.success_rate         Gauge       Rolling success rate (7-day)        Computed hourly
  sync.error_count          Counter     Total sync errors                   Incremented per error
  priority.changes_per_day  Counter     Products that changed priority      Daily aggregation
  priority.distribution     Gauge (x6)  Products at each priority level     Computed on demand
  api.request_count         Counter     Total API requests served           Per request
  api.latency_p95           Histogram   95th percentile response time       Per request
  api.error_rate            Gauge       Percentage of 5xx responses         Rolling 5-minute window
  worker.queue_depth        Gauge       Jobs waiting in Bull queue          Polled every 30s
  worker.job_duration       Histogram   Time to process a queue job         Per job

Business Metrics

  Metric                         Description                                  Collection
  -----------------------------  -------------------------------------------  -------------------------
  stores.active                  Number of stores with active subscriptions   Daily count
  stores.trial                   Number of stores in trial period             Daily count
  stores.churned                 Stores that uninstalled in last 30 days      Monthly count
  products.total_managed         Sum of products across all stores            Daily count
  revenue.mrr                    Monthly recurring revenue                    From subscriptions table
  onboarding.time_to_first_sync  Time from install to first Sheet sync        Per store

Metric Storage

For the initial deployment, metrics are stored in the PostgreSQL database as aggregated daily snapshots rather than introducing a dedicated time-series database. The sync_logs and audit_logs tables serve as the primary metric source.

-- Example: Daily metrics aggregation query
SELECT
  DATE(created_at) AS day,
  COUNT(*) AS total_syncs,
  COUNT(*) FILTER (WHERE status = 'completed') AS successful,
  COUNT(*) FILTER (WHERE status = 'failed') AS failed,
  AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) AS avg_duration_seconds,
  SUM(products_success) AS total_products_synced
FROM sync_logs
WHERE store_id = $1
  AND created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY day DESC;

If the deployment scales beyond 50 stores, consider migrating metrics to a lightweight time-series solution or an external monitoring service.


Alerting

Alert Channels

  Channel           Use Case                           Configuration
  ----------------  ---------------------------------  -----------------------------------------
  Email             Sync failures, critical errors     Merchant notification email from Settings
  In-app Banner     Degraded service, action required  Polaris Banner on Dashboard
  Docker logs       All operational events             docker logs adpriority-backend
  Console (stdout)  Structured JSON logs               Captured by Docker log driver

Alert Rules

ALERT RULES
============

  Rule: sync_consecutive_failures
    Condition: 3 consecutive sync failures for a store
    Severity: Critical
    Action: Email merchant, show in-app banner
    Message: "AdPriority sync has failed 3 times. Your Google Merchant
             Center labels may be out of date. Check Settings > Sync."

  Rule: worker_down
    Condition: Worker health check fails for 2 minutes
    Severity: Critical
    Action: Docker auto-restart, log to console
    Message: (internal) "Worker process unresponsive, auto-restarting"

  Rule: database_connection_lost
    Condition: Database health check fails
    Severity: Critical
    Action: Prisma auto-reconnect, log to console
    Message: (internal) "Database connection lost, reconnecting"

  Rule: sheet_api_quota
    Condition: Google Sheets API returns 429 (quota exceeded)
    Severity: Warning
    Action: Queue retry for next hour, email merchant if persists
    Message: "Google Sheets API quota exceeded. Sync will retry in 1 hour."

  Rule: high_error_rate
    Condition: API error rate > 5% over 5-minute window
    Severity: Warning
    Action: Log to console
    Message: (internal) "API error rate elevated: X% in last 5 minutes"

  Rule: stale_sync
    Condition: No successful sync in 24+ hours for an active store
    Severity: Warning
    Action: Email merchant
    Message: "AdPriority has not synced in over 24 hours.
             Check your sync settings."

Alert Suppression

To avoid alert fatigue:

  Rule                            Suppression
  ------------------------------  --------------------------------------------------------------
  Same alert type for same store  Suppress for 1 hour after first firing
  Maintenance window              Suppress all non-critical alerts during scheduled maintenance
  Inactive stores                 Do not alert for stores with is_active = false
  Trial stores                    Lighter alerting (email only, no paging)
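
A sketch of the one-hour suppression window for repeated alerts, using a Redis key set with NX + EX so only the first firing per store and alert type sends a notification; the key naming and TTL handling are assumptions:

// Return true only for the first firing of (store, alert type) in a one-hour window (sketch).
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

async function shouldSendAlert(storeId: string, alertType: string): Promise<boolean> {
  const key = `alert-suppress:${storeId}:${alertType}`;
  // SET ... EX 3600 NX succeeds only if the key does not already exist.
  const result = await redis.set(key, '1', 'EX', 3600, 'NX');
  return result === 'OK';
}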