Chapter 25: Monitoring & Alerts
Sync Monitoring
Synchronization is the core operational concern for AdPriority. If syncs fail, custom labels go stale, and merchants’ Google Ads campaigns operate on outdated priority data. The monitoring system tracks every sync event and alerts on failures.
Sync Types
| Sync Type | Direction | Trigger | Frequency |
|---|---|---|---|
| Product import | Shopify -> AdPriority | Install, manual, webhook | On demand |
| Priority recalculation | Internal | Rule change, season transition, manual | On demand |
| Sheet write | AdPriority -> Google Sheet | Scheduled, manual, post-recalculation | Configurable (hourly to daily) |
| GMC fetch | Google Sheet -> GMC | GMC schedule | Daily (managed by Google) |
Sync Tracking
Every sync event creates a record in the sync_logs table:
-- Example sync log entry
INSERT INTO sync_logs (
store_id, sync_type, trigger_source, status,
started_at, completed_at,
products_total, products_success, products_failed, products_skipped,
error_message
) VALUES (
'store-uuid', 'sheet_write', 'scheduled_cron', 'completed',
'2026-02-10 10:30:00', '2026-02-10 10:30:45',
15284, 15284, 0, 0,
NULL
);
Success Rate Tracking
The system computes rolling success rates over the last 24 hours, 7 days, and 30 days; a sketch of the computation follows the example report:
SYNC SUCCESS RATE
=================
Last 24 Hours:
Total syncs: 4
Successful: 4
Failed: 0
Success rate: 100%
Last 7 Days:
Total syncs: 28
Successful: 27
Failed: 1 (Feb 7, Sheet API quota exceeded)
Success rate: 96.4%
Last 30 Days:
Total syncs: 120
Successful: 118
Failed: 2
Success rate: 98.3%
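A minimal sketch of how these rolling rates could be computed from sync_logs rows. The SyncLogRow shape and computeSuccessRate helper are illustrative names, not part of the schema beyond the columns already shown above.

```typescript
// Sketch: compute a rolling sync success rate from sync_logs rows.
interface SyncLogRow {
  status: "completed" | "failed";
  started_at: Date;
}

function computeSuccessRate(rows: SyncLogRow[], windowDays: number, now = new Date()) {
  const cutoff = new Date(now.getTime() - windowDays * 24 * 60 * 60 * 1000);
  const inWindow = rows.filter((r) => r.started_at >= cutoff);
  const successful = inWindow.filter((r) => r.status === "completed").length;
  return {
    total: inWindow.length,
    successful,
    failed: inWindow.length - successful,
    // Percentage rounded to one decimal place; null when no syncs ran in the window.
    successRate:
      inWindow.length === 0 ? null : Math.round((successful / inWindow.length) * 1000) / 10,
  };
}

// Usage: the three windows shown in the report above.
// const report = [1, 7, 30].map((days) => computeSuccessRate(rows, days));
```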
Alert on Failures
| Condition | Alert Level | Action |
|---|---|---|
| Single sync failure | Warning | Log error, retry in 5 minutes |
| 2 consecutive failures | Error | Email notification to merchant |
| 3+ consecutive failures | Critical | Email notification + in-app banner |
| Success rate below 95% (7-day rolling) | Warning | Review sync logs for patterns |
| No sync in 24+ hours | Error | Check worker health, email alert |
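The consecutive-failure thresholds in the table above could be evaluated after each sync roughly as follows; the AlertAction shape and function name are illustrative, not the shipped implementation.

```typescript
// Sketch: map a store's consecutive sync failure count to the alert levels above.
type AlertLevel = "warning" | "error" | "critical";

interface AlertAction {
  level: AlertLevel;
  retryInMinutes?: number;
  emailMerchant?: boolean;
  showBanner?: boolean;
}

function escalateSyncFailure(consecutiveFailures: number): AlertAction | null {
  if (consecutiveFailures <= 0) return null; // last sync succeeded, nothing to do
  if (consecutiveFailures === 1) return { level: "warning", retryInMinutes: 5 };
  if (consecutiveFailures === 2) return { level: "error", emailMerchant: true };
  return { level: "critical", emailMerchant: true, showBanner: true }; // 3+ failures
}
```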
Error Tracking
Error Categories
| Category | Examples | Severity | Response |
|---|---|---|---|
| Shopify API | Rate limit (429), token expired, scope revoked | High | Backoff and retry, re-auth if token invalid |
| Shopify Webhook | Delivery failure, HMAC mismatch, payload parse error | Medium | Shopify retries 19 times over 48h automatically |
| Google Sheets API | Quota exceeded, permission denied, sheet deleted | High | Retry with backoff, alert merchant if persistent |
| Database | Connection refused, query timeout, constraint violation | Critical | Auto-reconnect, alert on repeated failures |
| Scoring Engine | Unmapped product type, invalid tag format | Low | Log warning, use default priority, continue |
| Billing | Subscription expired, charge declined | Medium | Downgrade to free tier, notify merchant |
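The "backoff and retry" response for Shopify and Google Sheets API errors could look roughly like this exponential backoff helper; the attempt limit and delays are assumptions for illustration, not documented quota behaviour.

```typescript
// Sketch: retry a rate-limited API call with exponential backoff and jitter.
async function withBackoff<T>(
  call: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      // Exponential backoff with a little jitter: ~1s, 2s, 4s, 8s, ...
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: retry only on 429 (rate limit / quota exceeded) responses.
// await withBackoff(() => writeSheet(), (e: any) => e?.response?.status === 429);
```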
Error Log Format
All errors are logged as structured JSON to facilitate parsing and alerting:
{
"timestamp": "2026-02-10T10:30:45.123Z",
"level": "error",
"service": "sync",
"store_id": "abc-123",
"sync_type": "sheet_write",
"error_code": "SHEETS_QUOTA_EXCEEDED",
"error_message": "Google Sheets API daily quota exceeded",
"context": {
"products_processed": 8742,
"products_remaining": 6542,
"retry_attempt": 2,
"next_retry_at": "2026-02-10T10:35:45.123Z"
},
"stack_trace": "..."
}
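A small helper could emit errors in exactly this shape to stdout, where the Docker log driver picks them up. The field names match the example above; the logError function itself is illustrative.

```typescript
// Sketch: emit a structured error log matching the format above to stdout.
interface ErrorLog {
  timestamp: string;
  level: "error";
  service: string;
  store_id: string;
  sync_type?: string;
  error_code: string;
  error_message: string;
  context?: Record<string, unknown>;
  stack_trace?: string;
}

function logError(entry: Omit<ErrorLog, "timestamp" | "level">, err?: Error): void {
  const log: ErrorLog = {
    timestamp: new Date().toISOString(),
    level: "error",
    stack_trace: err?.stack,
    ...entry,
  };
  // One JSON object per line so the Docker log driver and log parsers can ingest it.
  console.log(JSON.stringify(log));
}
```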
Error Aggregation
Errors are aggregated by category and time window for the monitoring dashboard:
ERROR SUMMARY (Last 24 Hours)
==============================
Category Count Last Occurrence Status
------------------ ----- ------------------ ------
Shopify API 0 -- OK
Google Sheets API 1 Feb 10, 04:15 AM Recovered
Database 0 -- OK
Scoring Engine 3 Feb 10, 10:30 AM Active
Webhooks 0 -- OK
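One way to produce this summary from the structured error logs; the ErrorLogEntry fields (category, recovered) are assumptions layered on top of the log format shown earlier.

```typescript
// Sketch: aggregate structured error logs into the per-category summary above.
interface ErrorLogEntry {
  timestamp: string;   // ISO-8601, as in the error log format
  category: string;    // e.g. "Shopify API", "Google Sheets API"
  recovered: boolean;  // true once a later retry or sync succeeded
}

interface CategorySummary {
  count: number;
  lastOccurrence: string | null;
  status: "OK" | "Active" | "Recovered";
}

function summarizeErrors(entries: ErrorLogEntry[], windowHours = 24): Map<string, CategorySummary> {
  const cutoff = Date.now() - windowHours * 60 * 60 * 1000;
  const summary = new Map<string, CategorySummary>();
  for (const entry of entries) {
    if (Date.parse(entry.timestamp) < cutoff) continue;
    const current: CategorySummary =
      summary.get(entry.category) ?? { count: 0, lastOccurrence: null, status: "OK" };
    current.count += 1;
    // ISO-8601 timestamps in the same zone compare correctly as strings.
    if (!current.lastOccurrence || entry.timestamp > current.lastOccurrence) {
      current.lastOccurrence = entry.timestamp;
      current.status = entry.recovered ? "Recovered" : "Active";
    }
    summary.set(entry.category, current);
  }
  return summary;
}
```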
Health Checks
API Health Endpoint
The backend exposes a /health endpoint that checks all dependencies:
GET /health
Response (healthy):
{
"status": "healthy",
"timestamp": "2026-02-10T10:30:45.123Z",
"uptime_seconds": 86400,
"checks": {
"database": { "status": "connected", "latency_ms": 3 },
"redis": { "status": "connected", "latency_ms": 1 },
"google_sheets_api": { "status": "reachable", "latency_ms": 120 },
"shopify_api": { "status": "reachable", "latency_ms": 85 }
}
}
Response (degraded):
{
"status": "degraded",
"timestamp": "2026-02-10T10:30:45.123Z",
"checks": {
"database": { "status": "connected", "latency_ms": 3 },
"redis": { "status": "connected", "latency_ms": 1 },
"google_sheets_api": { "status": "error", "error": "quota exceeded" },
"shopify_api": { "status": "reachable", "latency_ms": 85 }
}
}
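A sketch of how the /health handler could assemble these responses, assuming an Express app; the individual check functions are placeholders for the real dependency probes listed in the schedule below.

```typescript
// Sketch: /health endpoint that probes each dependency and reports overall status.
import express from "express";

type CheckResult =
  | { status: "connected" | "reachable"; latency_ms: number }
  | { status: "error"; error: string };

async function timeCheck(
  probe: () => Promise<void>,
  okStatus: "connected" | "reachable",
): Promise<CheckResult> {
  const start = Date.now();
  try {
    await probe();
    return { status: okStatus, latency_ms: Date.now() - start };
  } catch (err) {
    return { status: "error", error: err instanceof Error ? err.message : String(err) };
  }
}

const app = express();
const startedAt = Date.now();

app.get("/health", async (_req, res) => {
  const checks = {
    database: await timeCheck(checkDatabase, "connected"),
    redis: await timeCheck(checkRedis, "connected"),
    google_sheets_api: await timeCheck(checkSheetsApi, "reachable"),
    shopify_api: await timeCheck(checkShopifyApi, "reachable"),
  };
  const degraded = Object.values(checks).some((c) => c.status === "error");
  res.status(degraded ? 503 : 200).json({
    status: degraded ? "degraded" : "healthy",
    timestamp: new Date().toISOString(),
    uptime_seconds: Math.floor((Date.now() - startedAt) / 1000),
    checks,
  });
});

// Placeholder probes; each would wrap the method from the Health Check Schedule table
// (SELECT 1, PING, metadata read on a test sheet, GET shop.json on a test store).
async function checkDatabase(): Promise<void> { /* SELECT 1 via the DB client */ }
async function checkRedis(): Promise<void> { /* PING via the Redis client */ }
async function checkSheetsApi(): Promise<void> { /* metadata read on a test sheet */ }
async function checkShopifyApi(): Promise<void> { /* GET shop.json on a test store */ }
```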
Health Check Schedule
| Check | Frequency | Method | Alert Threshold |
|---|---|---|---|
| Database connectivity | Every 30 seconds | SELECT 1 query | 1 consecutive failure |
| Redis connectivity | Every 30 seconds | PING command | 1 consecutive failure |
| Google Sheets API | Every 5 minutes | Metadata read on test sheet | 3 consecutive failures |
| Shopify API | Every 5 minutes | GET /admin/api/2024-01/shop.json (one test store) | 3 consecutive failures |
| Worker process | Every 60 seconds | Bull queue heartbeat | 2 consecutive failures |
| Disk usage | Every 15 minutes | docker system df | > 80% usage |
Docker Health Checks
The production Docker Compose configuration includes container-level health checks:
CONTAINER HEALTH
================
adpriority-backend:
Check: curl -f http://localhost:3010/health
Interval: 30s
Timeout: 10s
Retries: 3
Start period: 15s
adpriority-redis:
Check: redis-cli ping
Interval: 10s
Timeout: 5s
Retries: 3
adpriority-worker:
Check: node healthcheck.js (checks Bull queue connection)
Interval: 30s
Timeout: 10s
Retries: 3
With restart: always set in the compose file, Docker automatically restarts containers that exit or crash. Standalone Compose does not restart a container solely because its health check reports unhealthy, so persistent health-check failures surface through the worker and API alerts above. Either way, transient failures (a network blip, temporary memory pressure) resolve without manual intervention.
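A TypeScript sketch of what the worker's healthcheck.js (referenced above) might check: open the Bull queue, confirm Redis responds, and exit non-zero on failure so Docker marks the container unhealthy. The queue name ("sync") and the REDIS_URL fallback are assumptions for illustration.

```typescript
// Sketch of the worker health check: verify the Bull queue and Redis are reachable.
import Queue from "bull";

async function main(): Promise<number> {
  const queue = new Queue("sync", process.env.REDIS_URL ?? "redis://localhost:6379");
  try {
    // getJobCounts throws if Redis is unreachable and proves the queue responds.
    const counts = await queue.getJobCounts();
    console.log(`healthcheck ok, waiting jobs: ${counts.waiting}`);
    return 0;
  } catch (err) {
    console.error("healthcheck failed:", err);
    return 1;
  } finally {
    await queue.close();
  }
}

// A non-zero exit code marks the container unhealthy.
main().then((code) => process.exit(code));
```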
Metrics
Operational Metrics
| Metric | Type | Description | Collection |
|---|---|---|---|
| sync.products_synced | Counter | Total products synced (cumulative) | Incremented per sync |
| sync.duration_seconds | Histogram | Time to complete a sync | Per sync event |
| sync.success_rate | Gauge | Rolling success rate (7-day) | Computed hourly |
| sync.error_count | Counter | Total sync errors | Incremented per error |
| priority.changes_per_day | Counter | Products that changed priority | Daily aggregation |
| priority.distribution | Gauge (x6) | Products at each priority level | Computed on demand |
| api.request_count | Counter | Total API requests served | Per request |
| api.latency_p95 | Histogram | 95th percentile response time | Per request |
| api.error_rate | Gauge | Percentage of 5xx responses | Rolling 5-minute window |
| worker.queue_depth | Gauge | Jobs waiting in Bull queue | Polled every 30s |
| worker.job_duration | Histogram | Time to process a queue job | Per job |
Business Metrics
| Metric | Description | Collection |
|---|---|---|
| stores.active | Number of stores with active subscriptions | Daily count |
| stores.trial | Number of stores in trial period | Daily count |
| stores.churned | Stores that uninstalled in last 30 days | Monthly count |
| products.total_managed | Sum of products across all stores | Daily count |
| revenue.mrr | Monthly recurring revenue | From subscriptions table |
| onboarding.time_to_first_sync | Time from install to first Sheet sync | Per store |
Metric Storage
For the initial deployment, metrics are stored in the PostgreSQL database as aggregated daily snapshots rather than introducing a dedicated time-series database. The sync_logs and audit_logs tables serve as the primary metric source.
-- Example: Daily metrics aggregation query
SELECT
DATE(created_at) AS day,
COUNT(*) AS total_syncs,
COUNT(*) FILTER (WHERE status = 'completed') AS successful,
COUNT(*) FILTER (WHERE status = 'failed') AS failed,
AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) AS avg_duration_seconds,
SUM(products_success) AS total_products_synced
FROM sync_logs
WHERE store_id = $1
AND created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY day DESC;
If the deployment scales beyond 50 stores, consider migrating metrics to a lightweight time-series solution or an external monitoring service.
Alerting
Alert Channels
| Channel | Use Case | Configuration |
|---|---|---|
| Email | Sync failures, critical errors | Merchant notification email from Settings |
| In-app Banner | Degraded service, action required | Polaris Banner on Dashboard |
| Docker logs | All operational events | docker logs adpriority-backend |
| Console (stdout) | Structured JSON logs | Captured by Docker log driver |
Alert Rules
ALERT RULES
============
Rule: sync_consecutive_failures
Condition: 3 consecutive sync failures for a store
Severity: Critical
Action: Email merchant, show in-app banner
Message: "AdPriority sync has failed 3 times. Your Google Merchant
Center labels may be out of date. Check Settings > Sync."
Rule: worker_down
Condition: Worker health check fails for 2 minutes
Severity: Critical
Action: Docker auto-restart, log to console
Message: (internal) "Worker process unresponsive, auto-restarting"
Rule: database_connection_lost
Condition: Database health check fails
Severity: Critical
Action: Prisma auto-reconnect, log to console
Message: (internal) "Database connection lost, reconnecting"
Rule: sheet_api_quota
Condition: Google Sheets API returns 429 (quota exceeded)
Severity: Warning
Action: Queue retry for next hour, email merchant if persists
Message: "Google Sheets API quota exceeded. Sync will retry in 1 hour."
Rule: high_error_rate
Condition: API error rate > 5% over 5-minute window
Severity: Warning
Action: Log to console
Message: (internal) "API error rate elevated: X% in last 5 minutes"
Rule: stale_sync
Condition: No successful sync in 24+ hours for an active store
Severity: Warning
Action: Email merchant
Message: "AdPriority has not synced in over 24 hours.
Check your sync settings."
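These rules could be represented as data plus a predicate that the worker evaluates on its monitoring tick. The StoreSnapshot fields and the two sample rules below are illustrative, not the complete set defined above.

```typescript
// Sketch: alert rules as data, evaluated against a per-store monitoring snapshot.
interface StoreSnapshot {
  consecutiveSyncFailures: number;
  hoursSinceLastSuccessfulSync: number;
  isActive: boolean;
}

interface AlertRule {
  name: string;
  severity: "warning" | "critical";
  condition: (s: StoreSnapshot) => boolean;
  message: (s: StoreSnapshot) => string;
}

const rules: AlertRule[] = [
  {
    name: "sync_consecutive_failures",
    severity: "critical",
    condition: (s) => s.consecutiveSyncFailures >= 3,
    message: () =>
      "AdPriority sync has failed 3 times. Your Google Merchant Center labels may be out of date. Check Settings > Sync.",
  },
  {
    name: "stale_sync",
    severity: "warning",
    condition: (s) => s.isActive && s.hoursSinceLastSuccessfulSync >= 24,
    message: () => "AdPriority has not synced in over 24 hours. Check your sync settings.",
  },
];

// Returns the rules that currently fire for a store.
function evaluateRules(snapshot: StoreSnapshot): AlertRule[] {
  return rules.filter((rule) => rule.condition(snapshot));
}
```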
Alert Suppression
To avoid alert fatigue:
| Rule | Suppression |
|---|---|
| Same alert type for same store | Suppress for 1 hour after first firing |
| Maintenance window | Suppress all non-critical alerts during scheduled maintenance |
| Inactive stores | Do not alert for stores with is_active = false |
| Trial stores | Lighter alerting (email only, no paging) |
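The one-hour suppression per store and alert type can be implemented with a Redis key that only the first firing is able to set. This sketch assumes ioredis and an illustrative key naming scheme; the maintenance-window and inactive-store rules would be additional checks before this one.

```typescript
// Sketch: suppress repeat alerts for one hour per (store, alert type) using Redis.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function shouldSendAlert(storeId: string, alertType: string): Promise<boolean> {
  const key = `alert-suppress:${storeId}:${alertType}`;
  // SET ... EX 3600 NX only succeeds for the first firing inside the window.
  const firstFiring = await redis.set(key, "1", "EX", 3600, "NX");
  return firstFiring === "OK";
}

// Usage: only send email / show a banner when shouldSendAlert returns true;
// critical alerts or scheduled maintenance can bypass or extend this check.
```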