Alerting
Introduction
The PortX platform provides built-in alerting that monitors PortX-managed services running in your cluster — this includes platform components like Grafana, ArgoCD, Karavan, Istio, and the underlying infrastructure (nodes, pods, storage). If any PortX-managed service has an issue, our platform engineering team is automatically notified and will respond.
Your applications are not monitored by PortX by default. We do not create alerts for tenant-deployed workloads — that is your responsibility. However, we provide the tooling and a working template to make it straightforward. Every tenant has the ability to create their own custom alerts through two paths:
- Metric-based alerts (Prometheus) — configured via your tenant GitOps repository using the tenant-alerts application and the Prometheus Operator
- Log-based alerts (Loki) — configured through the Grafana UI
Purpose
This document provides information and step-by-step guidance for the following topics:
- Understanding what the platform monitors out of the box
- Creating custom metric-based alerts with PrometheusRules
- Routing alert notifications to your team (email, Slack, OpsGenie, webhooks)
- Creating log-based alerts through Grafana
- Reusing existing platform alerts for your namespace
Initialisms
| Initialism | Definition |
|---|---|
| HPA | Horizontal Pod Autoscaler |
| OOM | Out of Memory |
| PVC | Persistent Volume Claim |
| LogQL | Loki Query Language |
| PromQL | Prometheus Query Language |
What the Platform Monitors
The following alerts are active on every tenant cluster. These are managed by PortX — no configuration is required on your part.
Application and Pod Health
| Condition | Severity | Description |
|---|---|---|
| Crash Looping | Critical | A container is repeatedly crashing and restarting |
| Pod Not Ready | Warning | A pod has been unable to start for more than 15 minutes |
| Out of Memory | Warning | A container was terminated because it exceeded its memory limit |
| Image Pull Failure | Warning | A container image could not be pulled from the registry |
| Deployment Replicas Mismatch | Warning | Running pods do not match the desired count for more than 15 minutes |
| Rollout Stuck | Critical | A deployment update has stalled and is not progressing |
Infrastructure
| Condition | Severity | Description |
|---|---|---|
| Node Not Ready | Critical | A cluster node is unresponsive |
| Memory / Disk Pressure | Warning | A node is running low on memory or disk |
| PVC Above 90% | Warning | A persistent volume is nearly full |
| PVC Above 95% | Critical | A persistent volume is critically full |
Platform Services
Core services including Grafana, ArgoCD, Prometheus, Loki, Tempo, Istio, and Karpenter are all monitored. If any of these become unavailable, the platform team is alerted immediately.
Autoscaling
| Condition | Severity | Description |
|---|---|---|
| HPA Maxed Out | Warning | The autoscaler has been at maximum replicas for more than 15 minutes |
| HPA Not Scaling | Warning | The autoscaler cannot reach the desired replica count |
Creating Metric-Based Alerts (Prometheus)
Every tenant GitOps repository includes a tenant-alerts application under the apps/ directory. This is where you define custom Prometheus alerts using the Prometheus Operator.
The tenant-alerts app uses the prom-alert-rules Helm chart, which creates two Kubernetes resources:
- PrometheusRule — defines the alert conditions using PromQL
- AlertmanagerConfig — defines where notifications are sent and how alerts are routed
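Under the hood, these map to standard Prometheus Operator custom resources. As a point of reference, here is a minimal sketch of the PrometheusRule manifest the chart would generate — the apiVersion, kind, and spec layout follow the Prometheus Operator CRDs, but the exact rendered output of the prom-alert-rules chart may differ:

```yaml
# Sketch of the generated resource (Prometheus Operator CRD shape;
# the exact output of the prom-alert-rules chart may differ).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  labels:
    release: portx-monitoring   # required for the operator to pick up the rule
spec:
  groups:
    - name: deployment
      rules:
        - alert: DeploymentAt0Replicas
          expr: sum(kube_deployment_status_replicas) by (deployment, namespace) < 1
          for: 1m
```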
Step 1. Define Your Alert Rules
Edit apps/tenant-alerts/values.yaml in your tenant GitOps repository. The prometheusrule section is where you define what conditions should trigger an alert.
Example: Alert when a deployment has zero running pods
```yaml
prometheusrule:
  enabled: true
  name: my-app-alerts
  labels:
    release: portx-monitoring
    application: my-app
  groups:
    - name: deployment
      rules:
        - alert: DeploymentAt0Replicas
          expr: |
            sum(kube_deployment_status_replicas{
              pod_template_hash=""
            }) by (deployment, namespace) < 1
          for: 1m
          labels:
            app: my-app
          annotations:
            summary: "Deployment {{$labels.deployment}} has no running pods"
            description: |
              Cluster Name: {{$externalLabels.cluster}}
              Namespace: {{$labels.namespace}}
              Deployment name: {{$labels.deployment}}
```
Example: Alert when request error rate exceeds 5%
```yaml
prometheusrule:
  enabled: true
  name: my-app-alerts
  labels:
    release: portx-monitoring
    application: my-app
  groups:
    - name: http-errors
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{
              response_code=~"5.*",
              destination_service_namespace="prod"
            }[5m])) by (destination_service_name)
            /
            sum(rate(istio_requests_total{
              destination_service_namespace="prod"
            }[5m])) by (destination_service_name)
            > 0.05
          for: 5m
          labels:
            severity: critical
            app: my-app
          annotations:
            summary: "High 5xx error rate on {{$labels.destination_service_name}}"
            description: "Error rate is above 5% for the last 5 minutes."
```
Example: Alert when response latency is too high
```yaml
prometheusrule:
  enabled: true
  name: my-app-alerts
  labels:
    release: portx-monitoring
    application: my-app
  groups:
    - name: latency
      rules:
        - alert: HighP99Latency
          expr: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_namespace="prod"
              }[5m])) by (le, destination_service_name)
            ) > 2000
          for: 10m
          labels:
            severity: warning
            app: my-app
          annotations:
            summary: "P99 latency above 2s on {{$labels.destination_service_name}}"
```
PrometheusRules must include release: portx-monitoring in their labels to be picked up by the Prometheus Operator. Without this label, your rules will be ignored.
Key Fields
| Field | Description |
|---|---|
| expr | The PromQL expression that defines the alert condition. When this expression returns results, the alert fires. |
| for | How long the condition must be true before the alert fires. Prevents flapping on brief spikes. |
| labels | Labels attached to the alert. Use severity: critical or severity: warning. Use app to tag your application. |
| annotations.summary | Short description shown in notifications. Supports Go template variables like {{$labels.deployment}}. |
| annotations.description | Detailed description. Include cluster, namespace, and relevant context. |
Step 2. Configure Notification Routing
The alertmanager section in the same values.yaml file defines where alert notifications are delivered and how they are grouped.
Example: Route alerts to your team via email
```yaml
alertmanager:
  enabled: true
  name: my-app-alerts
  labels:
    alertmanager: portx-alertmanager
  receivers:
    - name: 'my-team'
      emailConfigs:
        - to: your-team@example.com
          from: noreply@portx.io
          smarthost: smtp.sendgrid.net:587
          authUsername: apikey
          authPassword:
            name: grafana-client-secret
            key: GF_SMTP_PASSWORD
  route:
    groupBy: [job]        # alerts sharing these labels are bundled into one notification
    groupWait: 30s        # wait before sending the first notification for a new group
    groupInterval: 5m     # wait before notifying about new alerts added to an existing group
    repeatInterval: 12h   # how often a still-firing alert is re-sent
    global_receiver: 'my-team'
    routes:
      - matchers:
          - matchType: =
            name: alertname
            value: DeploymentAt0Replicas
        receiver: 'my-team'
```
Example: Route alerts to a Slack channel
```yaml
receivers:
  - name: 'my-team-slack'
    slackConfigs:
      - apiURL:
          name: my-slack-secret
          key: webhook-url
        channel: '#my-app-alerts'
        sendResolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: |
          *Namespace:* {{ .CommonLabels.namespace }}
          *Severity:* {{ .CommonLabels.severity }}
          {{ range .Alerts }}*Description:* {{ .Annotations.description }}
          {{ end }}
```
Example: Route alerts to OpsGenie
```yaml
receivers:
  - name: 'my-team-opsgenie'
    opsgenieConfigs:
      - sendResolved: true
        apiKey:
          name: my-opsgenie-secret
          key: api-key
        apiURL: "https://api.opsgenie.com"
        message: "{{ .CommonLabels.alertname }}"
        priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
```
AlertmanagerConfig resources must include alertmanager: portx-alertmanager in their labels. Without this label, the platform AlertManager will not recognize your routing configuration.
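The name/key pairs in the receiver examples above refer to Kubernetes Secrets in your namespace. As an illustrative sketch — the secret name matches the Slack example, while the namespace and webhook URL are placeholders you must replace — such a secret could be created with:

```shell
# Hypothetical example: create the Secret referenced by slackConfigs.apiURL.
# Substitute your own namespace and webhook URL.
kubectl create secret generic my-slack-secret \
  --namespace my-namespace \
  --from-literal=webhook-url='https://hooks.slack.com/services/XXX'
```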
Routing to Existing Platform Alerts
You do not need to create new PrometheusRules to get notified about common issues. The platform already fires alerts like KubePodNotReady and PodCrashLoopBackOff. You can route these existing alerts to your own receivers by matching on the alert name and your namespace:
```yaml
alertmanager:
  enabled: true
  name: my-app-alerts
  labels:
    alertmanager: portx-alertmanager
  receivers:
    - name: 'my-team'
      emailConfigs:
        - to: your-team@example.com
          from: noreply@portx.io
          smarthost: smtp.sendgrid.net:587
          authUsername: apikey
          authPassword:
            name: grafana-client-secret
            key: GF_SMTP_PASSWORD
  route:
    groupBy: [job]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    global_receiver: 'my-team'
    routes:
      - matchers:
          - matchType: =
            name: alertname
            value: KubePodNotReady
          - matchType: =
            name: namespace
            value: prod
          - matchType: =~
            name: pod
            value: my-app-.*   # regex match: any pod whose name starts with my-app-
        receiver: 'my-team'
```

This sends you an email whenever any pod whose name starts with my-app- in the prod namespace is not ready — using the platform's built-in alert, routed to your team.
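Once ArgoCD has synced the tenant-alerts application, you can confirm the generated resources exist in the cluster. A sketch assuming kubectl access to your tenant namespace (the full CRD names come from the Prometheus Operator; substitute your own namespace):

```shell
# List the custom resources created by the tenant-alerts application
kubectl get prometheusrules.monitoring.coreos.com -n my-namespace
kubectl get alertmanagerconfigs.monitoring.coreos.com -n my-namespace
```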
Creating Log-Based Alerts (Grafana + Loki)
For alerts based on log content (error messages, specific log patterns, log volume), you create alert rules through the Grafana UI. These alerts query Loki using LogQL and are evaluated by the Grafana alerting engine.
Step 1. Open Grafana Alerting
Navigate to your Grafana instance at:
https://tools.<your-tenant>.tenants.portx.io/grafana/alerting/list
In the left sidebar, click Alerting (bell icon), then Alert rules.
Step 2. Create a New Alert Rule
- Click + New alert rule
- Give the rule a name (e.g., "Error log spike — my-app")
- In the Define query and alert condition section:
  - Select the logs data source (this is your Loki instance)
  - Write a LogQL query
Example: Alert on error log volume
```logql
sum(count_over_time({namespace="prod", app="my-app"} |= "ERROR" [5m])) > 10
```
This fires when more than 10 error logs appear in a 5-minute window.
Example: Alert on a specific error message
```logql
count_over_time({namespace="prod", app="my-app"} |= "database connection refused" [5m]) > 0
```
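If your application emits structured JSON logs, you can alert on parsed fields rather than raw substrings. A sketch assuming each log line is JSON with a level field:

```logql
sum(count_over_time({namespace="prod", app="my-app"} | json | level="error" [5m])) > 10
```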
Example: Alert on high log volume (possible log storm)
```logql
sum(rate({namespace="prod", app="my-app"}[5m])) > 100
```
This fires when your app is producing more than 100 log lines per second.
Step 3. Set Evaluation Behavior
- Evaluate every: How often the query runs (e.g., 1m)
- For: How long the condition must be true before firing (e.g., 5m)
- Folder and Group: Organize your alerts into folders (e.g., "My App Alerts")
Step 4. Configure Notifications
In the Notifications section:
- Select an existing contact point or create a new one
- Contact points support: email, Slack, OpsGenie, PagerDuty, webhooks, and more
- Add labels (e.g., severity=warning, team=my-team) for routing
Step 5. Save and Enable
Click Save rule and exit. The alert will begin evaluating immediately based on your schedule.
Use Grafana's Explore view to test your LogQL queries before creating alert rules. Navigate to Explore, select the logs data source, and run your query to verify it returns the expected results.
Log-based alerts are evaluated by Grafana, not by the Prometheus Operator. This means they are managed entirely through the Grafana UI and are not stored in your GitOps repository. If you need version-controlled, GitOps-managed alerting, use metric-based Prometheus alerts instead.
Summary
| Alert Type | Where to Configure | Query Language | GitOps Managed |
|---|---|---|---|
| Metric-based (Prometheus) | apps/tenant-alerts/values.yaml in your GitOps repo | PromQL | Yes |
| Log-based (Loki) | Grafana UI → Alerting → Alert rules | LogQL | No |
| Platform built-in | No configuration needed — active by default | — | — |