Grafana + Loki Query Catalog (SpeakTrue)

This file tracks the LogQL queries we use for continuous monitoring of SpeakTrue logs.

Scope

Production app host logs, shipped via Promtail from the Docker VM (host="hl-dockervmla2-speaktrue").
Dev app host logs, shipped via Promtail from the LA dev VM (host="hl-dockervmla2-speaktrue-dev", env="dev").
Loki backend running on the Grafana VM (192.168.1.118).
Covers HTTP status monitoring, strict-mode policy failures, Supabase edge failures, ElevenLabs/provider errors, speech pipeline failures, and ingestion health.

Time windows to run

Use these windows for all trend/count queries:

10s
30s
1m
5m
30m
1h
6h
12h

In the queries below, replace <RANGE> with one of the windows above, so that [<RANGE>] becomes, for example, [5m].
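
As an illustration of the substitution, here is query 2.1 (defined below) with a fixed 5m window, followed by a variant that defers the window to Grafana. The second form is an assumption about how the panel is built: `$__interval` is a Grafana variable, so it resolves only when the query is run through Grafana, not in direct Loki API calls.

```logql
# 2.1 with <RANGE> replaced by a fixed window
count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge" [5m])

# Same query, letting Grafana choose the window per panel (Grafana-only variable)
count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge" [$__interval])
```

Short windows such as 10s or 30s suit live troubleshooting; for dashboards that cover hours, the longer windows keep the number of returned points manageable.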

Label assumptions

Current baseline labels (already configured):

job="docker"
host="hl-dockervmla2-speaktrue" on production VM logs
host="hl-dockervmla2-speaktrue-dev" on dev VM logs
env="dev" on dev VM logs
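
Every query in this catalog is written against the production host. Under the baseline labels above, a dev variant should only need a different stream selector. The sketch below reuses the error-like filter from 0.3 and assumes the dev VM also ships with job="docker", which the label list implies but does not state outright.

```logql
# Dev variant: same line filter, dev stream selector
count_over_time({job="docker",host="hl-dockervmla2-speaktrue-dev",env="dev"} |~ "(?i)error|exception|traceback|failed|fatal|panic" [5m])

# Production and dev side by side, one series per host
sum by (host) (count_over_time({job="docker",host=~"hl-dockervmla2-speaktrue(-dev)?"} |~ "(?i)error|exception|traceback|failed|fatal|panic" [5m]))
```

The `=~` matcher is fully anchored in Loki, so the pattern matches exactly the two host names and nothing longer.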

If you later enable Docker metadata relabeling, also use:

container
compose_service
compose_project
log_type

0) Global health / ingestion

0.1 Total log lines seen

sum(count_over_time({job="docker",host="hl-dockervmla2-speaktrue"}[<RANGE>]))

0.2 Logs by container (if `container` label exists)

sum by (container) (count_over_time({job="docker",host="hl-dockervmla2-speaktrue"}[<RANGE>]))

0.3 Error-like lines (generic)

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "(?i)error|exception|traceback|failed|fatal|panic" [<RANGE>])

1) HTTP status coverage (all major status classes and codes)

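These match status strings in the raw log text: either a status key (status=XYZ, status:XYZ, or status XYZ, as in app logs) or a bare three-digit code surrounded by whitespace (as is common in access logs). The bare-code pattern can also match unrelated three-digit numbers, so treat the counts as approximate.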

1.1 2xx

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]2[0-9][0-9]|\\s2[0-9][0-9]\\s" [<RANGE>])

1.2 3xx

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]3[0-9][0-9]|\\s3[0-9][0-9]\\s" [<RANGE>])

1.3 4xx

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]4[0-9][0-9]|\\s4[0-9][0-9]\\s" [<RANGE>])

1.4 5xx

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]5[0-9][0-9]|\\s5[0-9][0-9]\\s" [<RANGE>])

1.5 Specific status code queries (recommended panel set)

400

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]400|\\s400\\s" [<RANGE>])

401

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]401|\\s401\\s|unauthorized" [<RANGE>])

403

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]403|\\s403\\s|STRICT_MODE_POLICY_DENIED|strict_mode_policy_denied" [<RANGE>])

404

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]404|\\s404\\s" [<RANGE>])

405

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]405|\\s405\\s|METHOD_NOT_ALLOWED_IN_SUPABASE_STRICT_MODE|method_not_allowed_in_supabase_strict_mode" [<RANGE>])

409

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]409|\\s409\\s" [<RANGE>])

422

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]422|\\s422\\s|preprocessing_failed|INVALID_PREPROCESSING_" [<RANGE>])

429

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]429|\\s429\\s|quota|rate limit|too many requests" [<RANGE>])

500

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]500|\\s500\\s|internal_error|SPEECH_RUNTIME_ERROR" [<RANGE>])

502

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]502|\\s502\\s|SUPABASE_EDGE_STRICT_FAILURE|provider_error|Bad Gateway" [<RANGE>])

503

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]503|\\s503\\s|service unavailable|provider unavailable" [<RANGE>])

504

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]504|\\s504\\s|timeout|deadline exceeded" [<RANGE>])
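
The queries in this section match raw text. Where lines are logfmt-formatted and carry a numeric status key (an assumption; the catalog only confirms logfmt parsing for the edge telemetry in section 2), the status can be compared as a number instead. A minimal sketch for 5xx:

```logql
# Assumes logfmt lines with a numeric status key, e.g. status=502
sum(count_over_time(
  {job="docker",host="hl-dockervmla2-speaktrue"}
  | logfmt
  | status >= 500
  | __error__=""
[5m]))
```

The trailing `__error__=""` filter drops lines that failed logfmt parsing as well as lines whose status is not numeric (for example the status=failed job lines matched in section 6), which Loki would otherwise keep with an error label attached.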

2) Supabase edge monitoring

2.1 Edge telemetry volume

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge" [<RANGE>])

2.2 Edge failures (`outcome=error`)

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge" |= "outcome=error" [<RANGE>])

2.3 Edge fallbacks used

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge" |= "fallback_used=true" [<RANGE>])

2.4 Edge strict failures

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "SUPABASE_EDGE_STRICT_FAILURE" [<RANGE>])
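
Queries 2.1 and 2.2 can be combined into a single error ratio, which is often easier to put a threshold on than a raw count. This is a sketch, not an existing panel:

```logql
# Fraction of edge telemetry lines with outcome=error (0.0 to 1.0)
sum(count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge" |= "outcome=error" [5m]))
/
sum(count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge" [5m]))
```

When there are no error lines in the window, the numerator has no series and the panel shows no data rather than 0.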

2.5 Edge operation latency (avg ms)

avg_over_time(
  {job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge"
  | logfmt
  | unwrap latency_ms
  | __error__=""
[<RANGE>]) by (host)

2.6 Edge operation latency (p95 ms)

quantile_over_time(
  0.95,
  {job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge"
  | logfmt
  | unwrap latency_ms
  | __error__=""
[<RANGE>]) by (host)

2.7 Edge errors by operation (if operation is logged)

sum by (operation) (
  count_over_time(
    {job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge" |= "outcome=error"
    | logfmt
  [<RANGE>])
)
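
The same grouping idea as 2.7 can be applied to latency. The sketch below assumes, like 2.7, that edge telemetry lines carry an operation key alongside latency_ms:

```logql
# p95 edge latency per operation (assumes operation=... is logged)
quantile_over_time(
  0.95,
  {job="docker",host="hl-dockervmla2-speaktrue"} |= "backend=supabase_edge"
  | logfmt
  | unwrap latency_ms
  | __error__=""
[5m]) by (operation)
```

Grouping with `by (operation)` keeps one series per operation; without a grouping clause, every distinct combination of logfmt-extracted labels would produce its own series.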

3) ElevenLabs / provider monitoring

3.1 Provider errors (global)

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "provider_error|provider_validation_error|voice_authorization_provider_error" [<RANGE>])

3.2 ElevenLabs-specific error hints

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "ElevenLabs|xi-api-key|speech-to-text|text-to-speech" |~ "(?i)error|failed|status=" [<RANGE>])

3.3 Supabase edge -> provider failures in speech paths

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "Supabase (stt-transcribe|tts-generate|voices-list|models-list) failed status=" [<RANGE>])

4) STT / STS / TTS pipeline monitoring

4.1 STT preprocess applied

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "operation=stt_preprocess" |= "outcome=applied" [<RANGE>])

4.2 STT preprocess bypassed

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "operation=stt_preprocess" |= "outcome=bypassed" [<RANGE>])

4.3 STT preprocess failed

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "operation=stt_preprocess" |~ "outcome=attempt_failed|outcome=failed_non_strict|preprocessing_failed" [<RANGE>])
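
Queries 4.1 to 4.3 each count one outcome. If the preprocess lines parse as logfmt (an assumption based on the operation=... and outcome=... keys they contain), a single query can break the counts down by outcome, following the pattern of 2.7:

```logql
# One series per outcome value (applied, bypassed, attempt_failed, ...)
sum by (outcome) (
  count_over_time(
    {job="docker",host="hl-dockervmla2-speaktrue"} |= "operation=stt_preprocess"
    | logfmt
  [5m])
)
```

This also surfaces outcome values that the fixed filters in 4.1 to 4.3 do not list.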

4.4 STT endpoint failures

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "stt-transcribe|speech-to-text" |~ "(?i)error|failed|status=" [<RANGE>])

4.5 STS endpoint failures

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "speech-to-speech" |~ "(?i)error|failed|status=" [<RANGE>])

4.6 TTS endpoint failures

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "tts-generate|text-to-speech" |~ "(?i)error|failed|status=" [<RANGE>])

5) Strict mode / policy / auth monitoring

5.1 Strict policy denied

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "STRICT_MODE_POLICY_DENIED|strict_mode_policy_denied" [<RANGE>])

5.2 Strict method not allowed

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "METHOD_NOT_ALLOWED_IN_SUPABASE_STRICT_MODE|method_not_allowed_in_supabase_strict_mode" [<RANGE>])

5.3 Unauthorized/auth lookup failures

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "unauthorized|INVALID_CREDENTIALS|SUPABASE_AUTH_UNAVAILABLE|AuthLookupFailed" [<RANGE>])

6) Soundboard async jobs monitoring

6.1 Job failures

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status=failed|combine job failed|regeneration.*failed" [<RANGE>])

6.2 Job success/failure ratio (two panels)

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "status=succeeded" [<RANGE>])

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "status=failed" [<RANGE>])
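
If a single ratio panel is preferred over two count panels, the two queries can be divided. This is a sketch; it assumes status=succeeded and status=failed appear only on job-result lines, which the catalog does not guarantee:

```logql
# Fraction of finished jobs that failed over the last hour
sum(count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "status=failed" [1h]))
/
sum(count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status=(succeeded|failed)" [1h]))
```

With no failures in the window the numerator is empty, so the result is no data rather than 0.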

7) Logs by type (after `log_type` label rollout)

7.1 App logs only

{job="docker",host="hl-dockervmla2-speaktrue",log_type="app"}

7.2 Access logs only

{job="docker",host="hl-dockervmla2-speaktrue",log_type="access"}

7.3 Error rate by log type

sum by (log_type) (count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "(?i)error|failed|exception" [<RANGE>]))

8) Dashboard panel starter set (recommended)

Use these on one dashboard with 10s auto-refresh:

Total log volume (0.1, 1m)
5xx count (1.4, 1m)
502 count (1.5 / 502, 1m)
Edge failures (2.2, 1m)
Edge p95 latency (2.6, 5m)
Provider errors (3.1, 5m)
STT failures (4.4, 5m)
Strict policy denied (5.1, 5m)
Raw logs ({job="docker",host="hl-dockervmla2-speaktrue"})

9) Alert-rule queries (copy-ready)

9.1 STT 502 spike

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "stt-transcribe|speech-to-text" |~ "status[=: ]502|SUPABASE_EDGE_STRICT_FAILURE|provider_error" [5m]) > 5

9.2 Global 5xx spike

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]5[0-9][0-9]|\\s5[0-9][0-9]\\s" [5m]) > 20

9.3 Edge strict failures present

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |= "SUPABASE_EDGE_STRICT_FAILURE" [5m]) > 0

9.4 Ingestion gap (no logs)

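absent_over_time({job="docker",host="hl-dockervmla2-speaktrue"}[5m])

When no lines arrive, count_over_time returns no series at all (rather than 0), so a == 0 comparison never fires; absent_over_time returns 1 in exactly that case.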

10) Additional requested operational queries

10.1 TTS failures

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "tts-generate|text-to-speech|/api/text-to-speech" |~ "(?i)error|failed|status=|provider_error" [<RANGE>])

10.2 API failures (broad)

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]4[0-9][0-9]|status[=: ]5[0-9][0-9]|\\s4[0-9][0-9]\\s|\\s5[0-9][0-9]\\s|(?i:error|failed|exception)" [<RANGE>])

10.3 Number of API requests

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "/api/|speech-to-text|speech-to-speech|text-to-speech|soundboard|settings|voice-clone|models-list|voices-list|quota-status" [<RANGE>])

10.4 Successful API responses

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]2[0-9][0-9]|\\s2[0-9][0-9]\\s|\"success\":true|\"ok\":true" [<RANGE>])

10.5 API limits / quota events

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "status[=: ]429|\\s429\\s|quota|rate limit|limit reached|usage|audio_payload_too_large|CATEGORY_LIMIT|category_limit" [<RANGE>])

10.6 Soundboard activity

count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "soundboard|/api/soundboard|category|categories|clip|clips|combine|reorder|copy|delete|regeneration|save-clip" [<RANGE>])

11) Notes

If a query returns nothing, verify that the message pattern actually appears in your logs, then adjust the regex.
Prefer label selectors over regex once the metadata labels (container, compose_service, log_type) are in place.
Keep this file updated whenever logging formats change, especially the status and error key names.
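
To illustrate the label-selector note, the sketch below narrows a bare-code 5xx match to access logs. It assumes the log_type label has been rolled out with the value "access", as used in 7.2:

```logql
# Current approach: regex over every stream on the host
count_over_time({job="docker",host="hl-dockervmla2-speaktrue"} |~ "\\s5[0-9][0-9]\\s" [5m])

# With log_type available: the selector narrows the streams before the regex runs
count_over_time({job="docker",host="hl-dockervmla2-speaktrue",log_type="access"} |~ "\\s5[0-9][0-9]\\s" [5m])
```

Restricting the stream selector reduces both the data Loki has to scan and the chance of the bare-code pattern matching unrelated numbers in app logs.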