Quick start

Get Tracegrid running in under 5 minutes.

Prerequisites

  • A Slack workspace where you are an admin
  • A Linux server (Ubuntu 20.04+) or Kubernetes cluster
  • Your Tracegrid API key (from your account page)

Step 1 — Get your API key

After signing up, you receive an email with your API key. It looks like: gw_a1b2c3d4e5f6...

Your API key is shown once. Store it securely.
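
Because the key is shown only once, one option is to save it to a file that only your user can read. The sketch below is a minimal example, not a Tracegrid convention: the helper name store_api_key and the default path ~/.config/tracegrid/api_key are assumptions chosen for illustration. TRACEGRID_API_KEY is the variable listed under Environment variables.

```bash
# Hypothetical helper: the function name and default file location are
# illustrative assumptions, not something Tracegrid prescribes.
store_api_key() {
    local key="$1"
    local key_file="${2:-$HOME/.config/tracegrid/api_key}"
    mkdir -p "$(dirname "$key_file")"
    # Write inside a subshell with a restrictive umask so a newly created
    # file is never readable by other users.
    ( umask 077 && printf '%s\n' "$key" > "$key_file" )
    # Tighten permissions in case the file already existed with wider ones.
    chmod 600 "$key_file"
    printf '%s\n' "$key_file"
}

# Usage (bash; reads the key without echoing it to the terminal):
#   read -rs -p "Tracegrid API key: " key && store_api_key "$key"
#   export TRACEGRID_API_KEY="$(cat ~/.config/tracegrid/api_key)"
```

Reading the key with a silent prompt rather than passing it on the command line keeps it out of your shell history.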

Step 2 — Connect Slack

Create a Slack Incoming Webhook:

  1. Go to api.slack.com/apps
  2. Create New App → From scratch → name it "Tracegrid"
  3. Incoming Webhooks → Activate → Add New Webhook
  4. Select your #incidents channel
  5. Copy the webhook URL
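
Before handing the webhook URL to Tracegrid, it can help to confirm that the webhook itself works. The sketch below posts a plain message using Slack's standard incoming-webhook body (a JSON object with a "text" field). The variable name SLACK_WEBHOOK_URL and both function names are assumptions made for this example.

```bash
# Builds a minimal incoming-webhook body. This sketch does no JSON escaping,
# so the message must not contain double quotes or backslashes.
slack_test_payload() {
    printf '{"text": "%s"}' "$1"
}

# Posts the test message. SLACK_WEBHOOK_URL is a hypothetical variable that
# holds the URL copied in the last step above.
send_slack_test() {
    curl -sS -X POST \
        -H "Content-Type: application/json" \
        -d "$(slack_test_payload "Tracegrid webhook test")" \
        "$SLACK_WEBHOOK_URL"
}

# Usage:
#   export SLACK_WEBHOOK_URL="https://hooks.slack.com/..."
#   send_slack_test    # Slack answers with the body "ok" on success
```

If the message does not show up in the channel, recheck the copied URL before moving on to the agent install.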

Step 3 — Install the agent

Follow the Linux or Kubernetes installation guides.

How Tracegrid works

Tracegrid uses AI to turn raw infrastructure signals into incident explanations and postmortems delivered to Slack.

[Infrastructure] -- (Metrics/Logs) --> [Tracegrid Agent]
                                   |
                            (Anomaly Data)
                                   v
                        [Tracegrid Backend AI]
                                   |
                            (Intelligence)
                                   v
                       [Slack / Postmortems]

  1. Agent installs: Deploys on your infrastructure as a systemd service or Kubernetes DaemonSet.
  2. Agent collects: Gathers metrics, process states, and container events every 15 seconds, and ships anomalies as soon as they are detected.
  3. Backend AI analysis: The AI correlates signals, identifies the root cause, and generates a plain-English explanation and fix steps.
  4. Knowledge delivery: Slack receives the incident card. When the incident is resolved, a postmortem appears automatically in the thread.

Triggering your first incident

Trigger a test incident

After installing the agent, verify everything works by triggering a demo incident:

    curl -X GET https://api.tracegrid.app/internal/demo-incident \
      -H "X-Internal-Key: your_internal_key"

Check your Slack channel within 10 seconds.

Trigger a real incident (Linux)

Simulate high CPU:

    # Install stress (Ubuntu)
    sudo apt-get install stress -y

    # Spike CPU for 60 seconds
    stress --cpu 4 --timeout 60

Tracegrid will alert when CPU exceeds 90% for 30 seconds.
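
A rule like "above 90% for 30 seconds" differs from a single-sample check: a brief spike should not fire. How Tracegrid evaluates the window internally is not documented here. The sketch below assumes one plausible reading, namely that the most recent consecutive samples must all exceed the threshold. With the 15-second collection interval, requiring two consecutive samples is an assumption for illustration, and the function name sustained_breach is hypothetical.

```bash
# Succeeds when the run of samples ending at the most recent one contains at
# least `required` values strictly above `threshold`.
# Arguments: threshold, required count, then integer samples (oldest first).
sustained_breach() {
    local threshold="$1"
    local required="$2"
    shift 2
    local streak=0
    local sample
    for sample in "$@"; do
        if [ "$sample" -gt "$threshold" ]; then
            streak=$((streak + 1))
        else
            # Any sample at or below the threshold resets the run.
            streak=0
        fi
    done
    [ "$streak" -ge "$required" ]
}

# Usage:
#   sustained_breach 90 2 40 95 97 && echo "would alert"
#   sustained_breach 90 2 95 97 40 || echo "recovered, no alert"
```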

Trigger a real incident (Kubernetes)

Deploy a pod that crashes:

    kubectl run crash-test --image=busybox --restart=Always \
      -n default -- sh -c "exit 1"

Tracegrid detects CrashLoopBackOff after 3 restarts.

Clean up: kubectl delete pod crash-test -n default

Linux / VM / EC2 Installation

    curl -sSL https://tracegrid.app/install.sh | bash

The installer will ask for your API key, your backend URL, and a name for this host. It sets up a systemd service that restarts automatically on failure.

Kubernetes Installation

Deploy as a DaemonSet to monitor every node in your cluster.

    curl -O https://tracegrid.app/daemonset.yaml
    # Edit daemonset.yaml: set your API key and cluster name
    kubectl apply -f daemonset.yaml

Use kubectl logs -n tracegrid -l app=tracegrid-agent to check status.

Environment variables

Variable                Description
TRACEGRID_API_KEY       Your tenant API key (required)
TRACEGRID_BACKEND_URL   Backend URL: https://api.tracegrid.app
TRACEGRID_MODE          vm or kubernetes (default: vm)
TRACEGRID_HOSTNAME      Override hostname shown in alerts
TRACEGRID_LOG_LEVEL     debug, info, warn, error (default: info)
TRACEGRID_CLUSTER_NAME  Cluster name shown in K8s alerts
TRACEGRID_NAMESPACE     K8s namespace to watch (default: all)
HOST_PROC               Host /proc path in K8s (default: /host/proc)
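
A quick pre-flight check can catch a missing key or a mistyped value before the agent starts. The sketch below only encodes what the table states (the required key and the allowed values for mode and log level). The function name validate_tracegrid_env is an assumption, and the agent may well perform its own validation.

```bash
# Hypothetical pre-flight check based only on the variable table above.
validate_tracegrid_env() {
    if [ -z "${TRACEGRID_API_KEY:-}" ]; then
        echo "TRACEGRID_API_KEY is required" >&2
        return 1
    fi
    # Unset values fall back to the documented defaults (vm, info).
    case "${TRACEGRID_MODE:-vm}" in
        vm|kubernetes) ;;
        *) echo "TRACEGRID_MODE must be vm or kubernetes" >&2; return 1 ;;
    esac
    case "${TRACEGRID_LOG_LEVEL:-info}" in
        debug|info|warn|error) ;;
        *) echo "TRACEGRID_LOG_LEVEL must be debug, info, warn, or error" >&2; return 1 ;;
    esac
}

# Usage:
#   export TRACEGRID_API_KEY="gw_your_key_here"
#   validate_tracegrid_env && echo "environment looks valid"
```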

Agent config file

The agent reads from /etc/tracegrid/agent.yaml (Linux) or ./agent.yaml (local development). Environment variables override config file values.

Example config:

    api_key: gw_your_key_here
    backend_url: https://api.tracegrid.app
    hostname: prod-web-01
    collection_interval_seconds: 15
    log_level: info

For Kubernetes, use the ConfigMap and Secret in daemonset.yaml instead of a config file.
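
The precedence rule (environment variables override config file values) can be sketched as a small lookup. This is an illustration, not the agent's actual loader: the function names are hypothetical, the pairing of a variable such as TRACEGRID_LOG_LEVEL with the file key log_level is inferred from the names, and the file lookup is a line-based match rather than a real YAML parser.

```bash
# Prints the value of a top-level "key: value" line from a flat YAML file.
# Line-based sketch only: it ignores nesting, quoting, and comments.
config_value() {
    sed -n "s/^$1:[[:space:]]*//p" "$2" | head -n 1
}

# Prints the environment value when it is non-empty, otherwise the value
# found in the config file.
effective_setting() {
    local env_value="$1"
    local key="$2"
    local file="$3"
    if [ -n "$env_value" ]; then
        printf '%s\n' "$env_value"
    else
        config_value "$key" "$file"
    fi
}

# Usage:
#   effective_setting "${TRACEGRID_LOG_LEVEL:-}" log_level /etc/tracegrid/agent.yaml
```

With log_level: info in the file, the call prints info unless TRACEGRID_LOG_LEVEL is set, in which case the environment value is printed instead.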

Alert thresholds

Default thresholds (Growth plan allows customization):

Metric         Warning   Critical
CPU usage      > 90%     > 95%
Memory usage   > 85%     > 95%
Disk usage     > 85%     > 95%
K8s restarts   >= 3      >= 5
K8s pending    > 5 min   > 10 min
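
Note that the percentage rows use strict comparisons while the restart row is inclusive. The sketch below makes that difference concrete by mapping a reading to a severity using the default values from the table. The function names are hypothetical, inputs are assumed to be integers, and the pending-time row is left out for brevity.

```bash
# Maps an integer percentage to ok, warning, or critical using strict (>)
# comparisons against the given warning and critical cutoffs.
classify_percent() {
    local value="$1"
    local warning="$2"
    local critical="$3"
    if [ "$value" -gt "$critical" ]; then
        echo critical
    elif [ "$value" -gt "$warning" ]; then
        echo warning
    else
        echo ok
    fi
}

# Defaults taken from the table above.
classify_cpu()    { classify_percent "$1" 90 95; }
classify_memory() { classify_percent "$1" 85 95; }
classify_disk()   { classify_percent "$1" 85 95; }

# Restart counts use inclusive (>=) cutoffs: 3 for warning, 5 for critical.
classify_restarts() {
    if [ "$1" -ge 5 ]; then
        echo critical
    elif [ "$1" -ge 3 ]; then
        echo warning
    else
        echo ok
    fi
}
```

Under these rules a CPU reading of exactly 90 is still ok, while a pod with exactly 3 restarts already counts as a warning.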

Custom thresholds (Growth plan)

Custom thresholds are coming in the next release and will be configurable via the Tracegrid dashboard or API.

Slack integration

Option 1 — Incoming Webhook (recommended)

  1. Go to api.slack.com/apps
  2. Create New App → From scratch
  3. Name: Tracegrid, select your workspace
  4. Incoming Webhooks → Activate → Add New Webhook to Workspace
  5. Select your #incidents channel → Allow
  6. Copy the webhook URL (starts with https://hooks.slack.com)
  7. Provide this URL via API:

      curl -X PATCH https://api.tracegrid.app/v1/tenants/YOUR_ID \
        -H "X-Internal-Key: YOUR_KEY" \
        -H "Content-Type: application/json" \
        -d '{"slack_webhook_url": "https://hooks.slack.com/..."}'

Option 2 — Slack App (interactive buttons)

See the full Slack App setup guide in docs/slack-app-setup.md for Acknowledge, Escalate, and Dismiss button support.

AI provider setup

Tracegrid supports three AI providers. Set AI_PROVIDER in your backend environment to switch between them.

Groq (recommended — fastest, generous free tier)

  1. Sign up at console.groq.com (free, no credit card)
  2. Create API key → copy it
  3. Set: AI_PROVIDER=groq, GROQ_API_KEY=your_key

Model used: llama-3.3-70b-versatile

Google Gemini

  1. Go to aistudio.google.com → Get API key
  2. Set: AI_PROVIDER=gemini, GEMINI_API_KEY=your_key

Model used: gemini-2.0-flash

Anthropic Claude

  1. Sign up at console.anthropic.com → add credits
  2. Set: AI_PROVIDER=anthropic, ANTHROPIC_API_KEY=your_key

Model used: claude-sonnet-4-20250514

Note: requires paid credits — use Groq for free testing
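
Since each provider needs its own key variable, a small check can confirm that AI_PROVIDER and the matching key are set together before the backend starts. This is a minimal sketch that uses only the variable names listed above. The function name check_ai_provider is an assumption, and the backend may report misconfiguration differently.

```bash
# Hypothetical check: verifies that AI_PROVIDER names a supported provider
# and that the API key variable for that provider is non-empty.
check_ai_provider() {
    local key name
    case "${AI_PROVIDER:-}" in
        groq)      key="${GROQ_API_KEY:-}";      name=GROQ_API_KEY ;;
        gemini)    key="${GEMINI_API_KEY:-}";    name=GEMINI_API_KEY ;;
        anthropic) key="${ANTHROPIC_API_KEY:-}"; name=ANTHROPIC_API_KEY ;;
        *) echo "AI_PROVIDER must be groq, gemini, or anthropic" >&2; return 1 ;;
    esac
    if [ -z "$key" ]; then
        echo "$name is not set" >&2
        return 1
    fi
    echo "AI provider '$AI_PROVIDER' is configured"
}

# Usage:
#   export AI_PROVIDER=groq GROQ_API_KEY=your_key
#   check_ai_provider
```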