Alerting
Introduction
The PortX platform provides built-in alerting that monitors PortX-managed services running in your cluster — this includes platform components like Grafana, ArgoCD, Karavan, Istio, and the underlying infrastructure (nodes, pods, storage). If any PortX-managed service has an issue, our platform engineering team is automatically notified and will respond.
Your applications are not monitored by PortX by default. We do not create alerts for tenant-deployed workloads — that is your responsibility. However, we provide the tooling and a working template to make it straightforward. Every tenant has the ability to create their own custom alerts through two paths:
- Metric-based alerts (Prometheus) — configured via your tenant GitOps repository using the tenant-alerts application and the Prometheus Operator
- Log-based alerts (Loki) — configured through the Grafana UI
Purpose
This document provides information and step-by-step guidance for the following topics:
- Understanding what the platform monitors out of the box
- Creating custom metric-based alerts with PrometheusRules
- Routing alert notifications to your team (email, Slack, OpsGenie, webhooks)
- Creating log-based alerts through Grafana
- Reusing existing platform alerts for your namespace
Initialisms
| Initialism | Definition |
|---|---|
| HPA | Horizontal Pod Autoscaler |
| OOM | Out of Memory |
| PVC | Persistent Volume Claim |
| LogQL | Loki Query Language |
| PromQL | Prometheus Query Language |
What the Platform Monitors
The following alerts are active on every tenant cluster. These are managed by PortX — no configuration is required on your part.
Application and Pod Health
| Condition | Severity | Description |
|---|---|---|
| Crash Looping | Critical | A container is repeatedly crashing and restarting |
| Pod Not Ready | Warning | A pod has been unable to start for more than 15 minutes |
| Out of Memory | Warning | A container was terminated because it exceeded its memory limit |
| Image Pull Failure | Warning | A container image could not be pulled from the registry |
| Deployment Replicas Mismatch | Warning | Running pods do not match the desired count for more than 15 minutes |
| Rollout Stuck | Critical | A deployment update has stalled and is not progressing |
Infrastructure
| Condition | Severity | Description |
|---|---|---|
| Node Not Ready | Critical | A cluster node is unresponsive |
| Memory / Disk Pressure | Warning | A node is running low on memory or disk |
| PVC Above 90% | Warning | A persistent volume is nearly full |
| PVC Above 95% | Critical | A persistent volume is critically full |
Platform Services
Core services including Grafana, ArgoCD, Prometheus, Loki, Tempo, Istio, and Karpenter are all monitored. If any of these become unavailable, the platform team is alerted immediately.
Autoscaling
| Condition | Severity | Description |
|---|---|---|
| HPA Maxed Out | Warning | The autoscaler has been at maximum replicas for more than 15 minutes |
| HPA Not Scaling | Warning | The autoscaler cannot reach the desired replica count |
Creating Metric-Based Alerts (Prometheus)
Every tenant GitOps repository includes a tenant-alerts application under the apps/ directory. This is where you define custom Prometheus alerts using the Prometheus Operator.
The tenant-alerts app uses the prom-alert-rules Helm chart, which creates two Kubernetes resources:
- PrometheusRule — defines the alert conditions using PromQL
- AlertmanagerConfig — defines where notifications are sent and how alerts are routed
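Under the hood, these map to standard Prometheus Operator custom resources. As a point of reference, here is a minimal sketch of the PrometheusRule manifest the chart would generate — the apiVersion, kind, and spec layout follow the Prometheus Operator CRDs, but the exact rendered output of the prom-alert-rules chart may differ:

```yaml
# Sketch of the generated resource (Prometheus Operator CRD shape;
# the exact output of the prom-alert-rules chart may differ).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  labels:
    release: portx-monitoring   # required for the operator to pick up the rule
spec:
  groups:
    - name: deployment
      rules:
        - alert: DeploymentAt0Replicas
          expr: sum(kube_deployment_status_replicas) by (deployment, namespace) < 1
          for: 1m
```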
Step 1. Define Your Alert Rules
Edit apps/tenant-alerts/values.yaml in your tenant GitOps repository. The prometheusrule section is where you define what conditions should trigger an alert.
Example: Alert when a deployment has zero running pods
```yaml
prometheusrule:
  enabled: true
  name: my-app-alerts
  labels:
    release: portx-monitoring
    application: my-app
  groups:
    - name: deployment
      rules:
        - alert: DeploymentAt0Replicas
          expr: |
            sum(kube_deployment_status_replicas{
              pod_template_hash=""
            }) by (deployment, namespace) < 1
          for: 1m
          labels:
            app: my-app
          annotations:
            summary: "Deployment {{$labels.deployment}} has no running pods"
            description: |
              Cluster Name: {{$externalLabels.cluster}}
              Namespace: {{$labels.namespace}}
              Deployment name: {{$labels.deployment}}
```
Example: Alert when request error rate exceeds 5%
```yaml
prometheusrule:
  enabled: true
  name: my-app-alerts
  labels:
    release: portx-monitoring
    application: my-app
  groups:
    - name: http-errors
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{
              response_code=~"5.*",
              destination_service_namespace="prod"
            }[5m])) by (destination_service_name)
            /
            sum(rate(istio_requests_total{
              destination_service_namespace="prod"
            }[5m])) by (destination_service_name)
            > 0.05
          for: 5m
          labels:
            severity: critical
            app: my-app
          annotations:
            summary: "High 5xx error rate on {{$labels.destination_service_name}}"
            description: "Error rate is above 5% for the last 5 minutes."
```
Example: Alert when response latency is too high
```yaml
prometheusrule:
  enabled: true
  name: my-app-alerts
  labels:
    release: portx-monitoring
    application: my-app
  groups:
    - name: latency
      rules:
        - alert: HighP99Latency
          expr: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_namespace="prod"
              }[5m])) by (le, destination_service_name)
            ) > 2000
          for: 10m
          labels:
            severity: warning
            app: my-app
          annotations:
            summary: "P99 latency above 2s on {{$labels.destination_service_name}}"
```
PrometheusRules must include release: portx-monitoring in their labels to be picked up by the Prometheus Operator. Without this label, your rules will be ignored.
Key Fields
| Field | Description |
|---|---|
| expr | The PromQL expression that defines the alert condition. When this expression returns results, the alert fires. |
| for | How long the condition must be true before the alert fires. Prevents flapping on brief spikes. |
| labels | Labels attached to the alert. Use severity: critical or severity: warning. Use app to tag your application. |
| annotations.summary | Short description shown in notifications. Supports Go template variables like {{$labels.deployment}}. |
| annotations.description | Detailed description. Include cluster, namespace, and relevant context. |
Step 2. Configure Notification Routing
The alertmanager section in the same values.yaml file defines where alert notifications are delivered and how they are grouped.
Example: Route alerts to your team via email
```yaml
alertmanager:
  enabled: true
  name: my-app-alerts
  labels:
    alertmanager: portx-alertmanager
  receivers:
    - name: 'my-team'
      emailConfigs:
        - to: your-team@example.com
          from: noreply@portx.io
          smarthost: smtp.sendgrid.net:587
          authUsername: apikey
          authPassword:
            name: grafana-client-secret
            key: GF_SMTP_PASSWORD
  route:
    groupBy: [job]        # alerts sharing these labels are bundled into one notification
    groupWait: 30s        # wait before sending the first notification for a new group
    groupInterval: 5m     # wait before notifying about new alerts added to an existing group
    repeatInterval: 12h   # how often a still-firing alert is re-sent
    global_receiver: 'my-team'
    routes:
      - matchers:
          - matchType: =
            name: alertname
            value: DeploymentAt0Replicas
        receiver: 'my-team'
```
Example: Route alerts to a Slack channel
```yaml
receivers:
  - name: 'my-team-slack'
    slackConfigs:
      - apiURL:
          name: my-slack-secret
          key: webhook-url
        channel: '#my-app-alerts'
        sendResolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: |
          *Namespace:* {{ .CommonLabels.namespace }}
          *Severity:* {{ .CommonLabels.severity }}
          {{ range .Alerts }}*Description:* {{ .Annotations.description }}
          {{ end }}
```
Example: Route alerts to OpsGenie
```yaml
receivers:
  - name: 'my-team-opsgenie'
    opsgenieConfigs:
      - sendResolved: true
        apiKey:
          name: my-opsgenie-secret
          key: api-key
        apiURL: "https://api.opsgenie.com"
        message: "{{ .CommonLabels.alertname }}"
        priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
```
AlertmanagerConfig resources must include alertmanager: portx-alertmanager in their labels. Without this label, the platform AlertManager will not recognize your routing configuration.
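The name/key pairs in the receiver examples above refer to Kubernetes Secrets in your namespace. As an illustrative sketch — the secret name matches the Slack example, while the namespace and webhook URL are placeholders you must replace — such a secret could be created with:

```shell
# Hypothetical example: create the Secret referenced by slackConfigs.apiURL.
# Substitute your own namespace and webhook URL.
kubectl create secret generic my-slack-secret \
  --namespace my-namespace \
  --from-literal=webhook-url='https://hooks.slack.com/services/XXX'
```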
Routing to Existing Platform Alerts
You do not need to create new PrometheusRules to get notified about common issues. The platform already fires alerts like KubePodNotReady and PodCrashLoopBackOff. You can route these existing alerts to your own receivers by matching on the alert name and your namespace:
```yaml
alertmanager:
  enabled: true
  name: my-app-alerts
  labels:
    alertmanager: portx-alertmanager
  receivers:
    - name: 'my-team'
      emailConfigs:
        - to: your-team@example.com
          from: noreply@portx.io
          smarthost: smtp.sendgrid.net:587
          authUsername: apikey
          authPassword:
            name: grafana-client-secret
            key: GF_SMTP_PASSWORD
  route:
    groupBy: [job]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    global_receiver: 'my-team'
    routes:
      - matchers:
          - matchType: =
            name: alertname
            value: KubePodNotReady
          - matchType: =
            name: namespace
            value: prod
          - matchType: =~
            name: pod
            value: my-app-.*   # regex match: any pod whose name starts with my-app-
        receiver: 'my-team'
```

This sends you an email whenever any pod whose name starts with my-app- in the prod namespace is not ready — using the platform's built-in alert, routed to your team.
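Once ArgoCD has synced the tenant-alerts application, you can confirm the generated resources exist in the cluster. A sketch assuming kubectl access to your tenant namespace (the full CRD names come from the Prometheus Operator; substitute your own namespace):

```shell
# List the custom resources created by the tenant-alerts application
kubectl get prometheusrules.monitoring.coreos.com -n my-namespace
kubectl get alertmanagerconfigs.monitoring.coreos.com -n my-namespace
```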
Creating Log-Based Alerts (Grafana + Loki)
For alerts based on log content (error messages, specific log patterns, log volume), you create alert rules through the Grafana UI. These alerts query Loki using LogQL and are evaluated by the Grafana alerting engine.
Step 1. Open Grafana Alerting
Navigate to your Grafana instance at:
https://tools.<your-tenant>.tenants.portx.io/grafana/alerting/list
In the left sidebar, click Alerting (bell icon), then Alert rules.
Step 2. Create a New Alert Rule
- Click + New alert rule
- Give the rule a name (e.g., "Error log spike — my-app")
- In the Define query and alert condition section:
  - Select the logs data source (this is your Loki instance)
  - Write a LogQL query
Example: Alert on error log volume
```logql
sum(count_over_time({namespace="prod", app="my-app"} |= "ERROR" [5m])) > 10
```
This fires when more than 10 error logs appear in a 5-minute window.
Example: Alert on a specific error message
```logql
count_over_time({namespace="prod", app="my-app"} |= "database connection refused" [5m]) > 0
```
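If your application emits structured JSON logs, you can alert on parsed fields rather than raw substrings. A sketch assuming each log line is JSON with a level field:

```logql
sum(count_over_time({namespace="prod", app="my-app"} | json | level="error" [5m])) > 10
```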
Example: Alert on high log volume (possible log storm)
```logql
sum(rate({namespace="prod", app="my-app"}[5m])) > 100
```
This fires when your app is producing more than 100 log lines per second.
Step 3. Set Evaluation Behavior
- Evaluate every: How often the query runs (e.g., 1m)
- For: How long the condition must be true before firing (e.g., 5m)
- Folder and Group: Organize your alerts into folders (e.g., "My App Alerts")
Step 4. Configure Notifications
In the Notifications section:
- Select an existing contact point or create a new one
- Contact points support: email, Slack, OpsGenie, PagerDuty, webhooks, and more
- Add labels (e.g., severity=warning, team=my-team) for routing
Step 5. Save and Enable
Click Save rule and exit. The alert will begin evaluating immediately based on your schedule.
Use Grafana's Explore view to test your LogQL queries before creating alert rules. Navigate to Explore, select the logs data source, and run your query to verify it returns the expected results.
Log-based alerts are evaluated by Grafana, not by the Prometheus Operator. This means they are managed entirely through the Grafana UI and are not stored in your GitOps repository. If you need version-controlled, GitOps-managed alerting, use metric-based Prometheus alerts instead.
Summary
| Alert Type | Where to Configure | Query Language | GitOps Managed |
|---|---|---|---|
| Metric-based (Prometheus) | apps/tenant-alerts/values.yaml in your GitOps repo | PromQL | Yes |
| Log-based (Loki) | Grafana UI → Alerting → Alert rules | LogQL | No |
| Platform built-in | No configuration needed — active by default | — | — |