Account

A grouping of Chkk resources, such as clusters and subscriptions. Accounts often map to a specific billable entity boundary in your organization.

Actions

An action is a single, side‑effect‑bearing operation executed by the Workflow—e.g., invoking a REST API, calling an internal micro‑service, running a shell command, or posting to Slack.

Add-on Upgrade Plan

Similar to a Cluster Upgrade Plan but specific to a particular Add-on, Application Service, or Kubernetes Operator instance in a given cluster.

Created from an approved Add-on Upgrade Template.

Add-on Upgrade Template

An AI-curated workflow for a specific Add-on, Application Service, or Kubernetes Operator across multiple clusters.

While simpler Add-ons and Application Services can be upgraded as part of the Cluster Upgrade Template, stateful or datapath Add-ons and Application Services have their own templates for specialized handling.

AI Agent

A software entity that perceives its environment, reasons about what to do next, and autonomously executes actions (often by calling tools) to achieve a user-specified goal.

AI‑Driven ETL

AI‑driven ETL pipelines use machine learning or LLMs to infer schema, cleanse anomalies, redact PII, map fields, and generate transformation code, replacing brittle hand‑coded mappings.

Application

An Application is a type of Project that implements business logic or end-user-facing functionality and is deployed on Kubernetes as part of an Application Stack. It typically consists of one or more services and depends on both Application Services and Kubernetes Add-ons.

Application Service

An Application Service is a type of Project that provides essential services to the rest of an application stack. An application stack is one or more Projects that provide related non-Kubernetes functionality — in other words, application-specific functionality.

Application Services do not extend or implement cluster-level Kubernetes functionality.

Examples of Application Services include:

  • MySQL, PostgreSQL and Redis

    These provide database services to Application Stacks.

  • Kafka and NATS

    These provide message queueing and event streaming services to Application Stacks.

  • ArgoCD and FluxCD

    Both are Kubernetes Custom Controllers that provide GitOps and CI automation services to Application Stacks running on Kubernetes.

  • ArgoRollouts and Flagger

    Both are Kubernetes Custom Controllers that provide progressive delivery services for Application Stacks running on Kubernetes.

  • Crossplane and ACK S3 Controller

    Both are Kubernetes Custom Controllers that provide cloud resources and service integrations to Application Stacks.

Dive deeper into Chkk’s selection of covered Application Services.

Application Stack

An Application Stack is one or more Projects that provide related non-Kubernetes functionality — in other words, application-specific functionality.

Blast Radius

The potential scope of impact an error, failure, or disruption may have on your system. It represents how far negative consequences can propagate within your environment.
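One way to make this concrete is to treat the environment as a dependency graph and collect everything reachable from a failing component. The sketch below is illustrative only: the graph, the component names, and the `blast_radius` helper are assumptions made for this example and do not describe how Chkk computes impact.

```python
from collections import deque


def blast_radius(dependents, failed):
    """Return every component that may be affected when `failed` breaks.

    `dependents` maps a component to the components that depend on it.
    The data structure is hypothetical and used for illustration only.
    """
    visited = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in visited:
                visited.add(downstream)
                queue.append(downstream)
    # The failing component itself is not counted as part of its own radius.
    return visited - {failed}


# Hypothetical dependency graph: a DNS failure reaches everything that resolves names.
dependents = {
    "coredns": ["ingress-nginx", "checkout-service"],
    "ingress-nginx": ["storefront"],
    "checkout-service": [],
}
```

With this graph, a failure in `coredns` reaches all three downstream components, while a failure in `storefront` reaches none.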

Chkk Cloud Connector

Chkk Cloud Connector is a secure, read-only integration that fetches relevant resource data from your cloud environment and correlates it with your Kubernetes Clusters. By focusing on resources that affect, or are affected by, your clusters (e.g., security groups, IAM roles, networking settings), the Connector facilitates a unified view of your infrastructure. Cloud Connector metadata is used to detect Operational Risks that span beyond the Kubernetes layers. Examples of such cross-layer risks include security group misconfigurations, risks that are only latent for certain kernel versions, and risks that only trigger when using certain AWS services (e.g., Route53 records with external-dns).

Details of Chkk Cloud Connector can be found here.

Chkk Dashboard

Chkk Dashboard is a UI for you to interact with Chkk.

Details of Chkk Dashboard can be found here.

Chkk Integrations

A set of pre-built connectors which allow Chkk to fit smoothly with your existing tools. Integrations include operational tools (GitHub, Jira, Slack, etc.) and SSO (Okta, PingIdentity).

Chkk Kubernetes Connector

The Chkk Kubernetes Connector is composed of two main components:

  • Chkk Operator
  • Chkk Agent

Working together, these components periodically (or on-demand) extract cluster metadata and ingest it into the Chkk SaaS platform. Once ingestion is complete, Chkk scans and analyzes your environment for potential risks or helpful insights (e.g., add-on instances running in your cluster).

Details of Chkk Kubernetes Connector can be found here.

Chkk Proxy Filter

Redacts any private or sensitive data before it leaves your cluster. It automatically excludes Kubernetes secrets and can be configured to exclude additional data as needed.
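The behavior described above can be pictured as a filter applied to resources before they are sent. The following is a minimal sketch under stated assumptions: the `filter_resources` function, its `extra_excluded_kinds` parameter, and the dictionary shape of a resource are all hypothetical and are not the Proxy Filter's actual configuration or API.

```python
def filter_resources(resources, extra_excluded_kinds=()):
    """Drop resources whose kind must never leave the cluster.

    Secrets are always excluded; callers may exclude additional kinds.
    (Hypothetical helper for illustration, not Chkk's implementation.)
    """
    excluded = {"Secret", *extra_excluded_kinds}
    return [r for r in resources if r.get("kind") not in excluded]


# Hypothetical sample of cluster resources.
resources = [
    {"kind": "Secret", "metadata": {"name": "db-password"}},
    {"kind": "Deployment", "metadata": {"name": "storefront"}},
    {"kind": "ConfigMap", "metadata": {"name": "app-settings"}},
]
```

Calling `filter_resources(resources)` keeps the Deployment and ConfigMap; passing `extra_excluded_kinds=["ConfigMap"]` keeps only the Deployment.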

Chkk Research Team

A dedicated team of Kubernetes experts that reviews and curates Operational Risks to make them actionable.

Classifier

The Classifier is Chkk’s reasoning engine that transforms raw customer artifacts into richly-typed objects linked to Chkk’s Knowledge Graph. It runs a pipeline of specialized mappers (digest, deduction, ruleset, and release) to resolve each resource and container to its most likely Deployment System, Package, Project, Component, and specific Release.

Contextualizer

The Contextualizer takes the Classifier’s matched entities and enriches them with situation-specific guidance. It filters change logs, synthesizes readiness checks, flags application-client actions, and authors upgrade steps that make sense for the customer’s exact Deployment System, Package version, OCI repository, cluster topology, and so on. In effect, the Contextualizer translates generic release knowledge into high-fidelity, environment-aware instructions that operators can execute with confidence.

Cluster Metadata

Information about a Kubernetes Cluster—versions, Pods, Add-ons, cloud-provided components, node configs, etc. Secrets and config maps are redacted by default, and additional filters can be configured.

Cluster Upgrade Plan

An instantiation of a Cluster Upgrade Template, customized for a specific Kubernetes Cluster. An instantiated Upgrade Plan inherits all the information present in the Upgrade Template, plus additional cluster-specific information such as:

  • The cluster’s name and region
  • Node Group details
  • Which Applications are using deprecated/removed APIs?
  • Are there any Application client changes in Add-ons?
  • Are there any Application misconfigurations, such as incorrect Pod Disruption Budgets (PDBs), that can cause the upgrade to fail?

Upgrade Plans move through these statuses:

  1. Waiting for Plan - Being generated.
  2. In Progress - Ready for execution.
  3. Completed - Cluster upgraded successfully.
  4. Canceled - User-initiated cancellation.
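The four statuses above can be modeled as a small state machine. This is a sketch only: the status names come from the list, but the set of allowed transitions (for example, that a plan can be canceled while it is still being generated) is an assumption made for illustration and is not confirmed by this glossary.

```python
from enum import Enum


class PlanStatus(Enum):
    WAITING_FOR_PLAN = "Waiting for Plan"
    IN_PROGRESS = "In Progress"
    COMPLETED = "Completed"
    CANCELED = "Canceled"


# Assumed transitions (hypothetical): Completed and Canceled are terminal.
ALLOWED_TRANSITIONS = {
    PlanStatus.WAITING_FOR_PLAN: {PlanStatus.IN_PROGRESS, PlanStatus.CANCELED},
    PlanStatus.IN_PROGRESS: {PlanStatus.COMPLETED, PlanStatus.CANCELED},
    PlanStatus.COMPLETED: set(),
    PlanStatus.CANCELED: set(),
}


def can_transition(current, target):
    """Return True if moving from `current` to `target` is allowed in this sketch."""
    return target in ALLOWED_TRANSITIONS[current]
```

For example, `can_transition(PlanStatus.IN_PROGRESS, PlanStatus.COMPLETED)` is true, while no transition out of `PlanStatus.COMPLETED` is permitted.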

More details here

Cluster Upgrade Template

A Cluster Upgrade Template is an AI-curated workflow containing a tested and structured sequence of steps and stages to safely upgrade your Kubernetes Clusters. A Cluster Upgrade Template is generated on-demand and is scoped to an Environment (e.g. dev, staging or prod). Cluster Upgrade Templates support three commonly-used upgrade patterns: In-Place, Blue-Green, and Rolling/Surge.

A template includes steps for each stage of an upgrade, preverified via a Digital Twin. Once an Upgrade Template is generated, you can perform the following actions:

  • Action: Add custom markdown steps before or after any step.
  • Action: Request regenerations for refined versions.
  • Action: Approve the template so other team members can instantiate it as an Upgrade Plan for specific clusters.

Upgrade Templates move through these statuses:

  1. Waiting for Template - Being generated.
  2. Available - Ready for review/customization.
  3. Approved For Use - May now be used to create Upgrade Plans.
  4. Environment Upgraded - All clusters in the environment have used this template to upgrade.

More details here

Collective Learning

Collective Learning comprises a suite of technologies that codify operational wisdom to prevent incidents, breakages, and disruptions.

Collective Learning has two main parts:

  1. Operational Risk Signature Database (RSig DB)
  2. Knowledge Graph

These components are defined on the Understanding Chkk page, with additional details provided on the Technology page.

Deactivated Clusters

Kubernetes Clusters explicitly offboarded from Chkk via a “Deactivate Cluster” action in the Chkk Dashboard. Deactivation in the dashboard doesn’t remove all Chkk components; you can find instructions for complete removal on the Troubleshooting page.

Digital Twin

A Digital Twin is a virtual replica of your infrastructure, simulating how it runs and interacts. There are four levels of Digital Twins:

  • Level 1: Basic replica using the same cluster version, node versions, and Add-on versions with default config.
  • Level 2: Extends Level 1 by including custom configs of all Add-ons.
  • Level 3: Adds dummy Applications to emulate functionality on top of Level 2.
  • Level 4: Fully functioning staging environment, an exact replica of your cluster (including real Applications), typically running within your own cloud account.

Disconnected Clusters

Kubernetes Clusters that have not been Deactivated but are not sending metadata to Chkk. They appear with an alert icon in the Chkk Dashboard so you can diagnose why the Chkk Kubernetes Connector isn’t sending data.

Grounding

Grounding constrains an AI system’s outputs to verifiable facts and policies.

Grounding Layer

A curated corpus that integrates all authoritative sources consumed by the Chkk Knowledge Engine. AI pipelines ingest and normalize each source, while the Chkk Research Team continuously reviews and validates the content. The layer models clouds, open source projects, add-ons, and application services so that agents, workflows, and tools can reference the same trusted schema, avoid hallucinations, and maintain provable accuracy for every knowledge attribute.

Guardrails

Rules and policies from a cloud provider, add-on vendor, Kubernetes distribution, or the open source community. Whether or not to follow a Guardrail is a business/team decision.

Helm Chart

A Helm Chart is a type of Package that uses the helm Package System to format its artifacts (called “charts”).

Knowledge Base Article (KBA)

A single page explaining an RSig or Guardrail—covering its severity, impact, trigger conditions, remediation steps, and potentially code snippets. All RSigs and Guardrails have an associated KBA.

KBAs support multiple actions on Operational Risks and Guardrails, such as:

  1. Action: Create Ticket - Generates a Jira ticket for tracking.
  2. Action: Mark - Mark as False Positive, By Design, or other reasons if not fixing.
  3. Action: Ignore - Stop receiving notifications for this risk (optionally ignoring only specific resources).

Knowledge Graph

Knowledge Graph models and stores AI-curated data and relationships across hundreds of open-source Projects and Add-ons in the Kubernetes ecosystem, modeling their impact and identifying the safest upgrade paths. Oversight is provided for AI-curated data and relationships by the Chkk Research Team.

Knowledge Graph covers Kubernetes releases of all major clouds and distributions: EKS, GKE, AKS, VMware Tanzu, OpenShift, Rancher RKE1/RKE2, and Nutanix. We also support DIY and Self-Hosted Kubernetes. Chkk also covers 250+ Add-ons, and coverage for a new Add-on can be extended within 48 hours.

Kubernetes Add-on

A Kubernetes Add-on is a type of Project that extends Kubernetes cluster functionality but is not part of the Kubernetes core. Kubernetes Add-ons typically run inside the cluster as regular workloads (e.g., Deployments, DaemonSets) and provide services like networking, monitoring, logging, and DNS.

Kubernetes Add-ons typically modify or enhance cluster-level behavior.

Examples of Kubernetes Add-ons include:

  • CoreDNS and kube-dns

    They provide cluster DNS services, a critical but optional functionality of the Kubernetes cluster used for service discovery.

  • Amazon VPC CNI, Cilium and Calico

    They provide implementations of Kubernetes data plane networking and Container Networking Interface (CNI) plugins.

  • Amazon EBS CSI Driver and AzureDisk CSI Driver

    They provide implementations of the Container Storage Interface (CSI) plugin for plumbing Kubernetes Volumes to a cloud provider-specific storage backend.

  • Ingress NGINX controller and AWS Load Balancer Controller

    They provide implementations of Kubernetes Service and Ingress functionality, which are core Kubernetes Resources.

  • Kubernetes Metrics Server and kube-state-metrics

    They provide cluster-level aggregation of resource usage and resource health metrics that are used by other Kubernetes components like Horizontal Pod Autoscaler and Vertical Pod Autoscaler.

  • External Secrets Operator

    Despite being called an “Operator”, the External Secrets Operator is not a Kubernetes Operator because it does not install or manage the lifecycle of some other Kubernetes Add-on or Application Service. However, it is a Kubernetes Add-on because it implements storage functionality for Kubernetes Secrets and therefore extends the Kubernetes cluster’s functionality.

For a complete list of Kubernetes Add-ons, see covered Kubernetes Add-ons.

Kubernetes Cluster

A Kubernetes Cluster (or just “Cluster”) is a group of physical or virtual machines (Kubernetes Nodes) that run containerized applications.

Kubernetes Controller

A Kubernetes Controller is software that follows the controller design pattern. This design pattern features a control loop (also called a “reconciliation loop”) that repeatedly attempts to make the actual state of some resource match the desired state of that resource.
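The control loop can be illustrated with a toy example that converges a replica count toward a desired value. This is a simplified sketch, not how any real controller is implemented: the function names are hypothetical, state is held in plain integers rather than read from the Kubernetes API, and the loop stops after convergence whereas a real controller keeps watching.

```python
def reconcile(desired_replicas, actual_replicas):
    """One pass of the loop: decide what change moves actual toward desired."""
    if actual_replicas < desired_replicas:
        return "scale_up"
    if actual_replicas > desired_replicas:
        return "scale_down"
    return "noop"


def run_control_loop(desired, actual, max_iterations=10):
    """Repeatedly reconcile, applying one unit of change per pass."""
    history = []
    for _ in range(max_iterations):
        action = reconcile(desired, actual)
        history.append(action)
        if action == "noop":
            break  # actual state matches desired state
        actual += 1 if action == "scale_up" else -1
    return actual, history
```

Starting from one replica with a desired count of three, the loop scales up twice and then reports that nothing is left to do.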

Kubernetes Custom Controller

A Kubernetes Custom Controller is a Kubernetes Controller that tracks and reconciles Kubernetes Custom Resources.

Kubernetes Custom Resource

A Kubernetes Custom Resource is a special type of Kubernetes Resource that falls outside of the core Kubernetes API groups.

Kubernetes DaemonSet

A Kubernetes DaemonSet (or just “DaemonSet”) is a type of Kubernetes Resource that describes a related set of Pods that will run on every Node in the Kubernetes Cluster.

Kubernetes Deployment

A Kubernetes Deployment (or just “Deployment”) is a type of Kubernetes Resource that describes a related set of Pods that run an application workload.

Kubernetes Deployment System

A Kubernetes Deployment System is something that manages the rollout of Kubernetes Resources like Deployments, DaemonSets and StatefulSets.

Kubernetes Operator

A Kubernetes Operator is a type of Project that is responsible for installing and managing the lifecycle of another Kubernetes Add-on or Application Service.

Generally, Kubernetes Operators encode domain-specific knowledge to manage complex stateful software on Kubernetes.

A Kubernetes Operator is often confused with a Kubernetes Controller. Almost all Kubernetes Operators use the Kubernetes Controller design pattern, but not all software Projects that use the Kubernetes Controller design pattern are Kubernetes Operators!

A Kubernetes Operator is sometimes mistakenly defined as a Kubernetes Controller that uses Kubernetes Custom Resources. The more accurate term for that concept is [“Kubernetes Custom Controller”](#kubernetes-custom-controller). What differentiates a Kubernetes Operator is the focus on managing complex lifecycle operations for some other piece of software.

Examples of Kubernetes Operators include:

  • Postgres Operator

    A Kubernetes Custom Controller that manages the installation and lifecycle of PostgreSQL database servers, database schemas and database users.

  • Prometheus Operator

    Manages the installation and lifecycle of the Prometheus metrics database and associated Application Services like Prometheus Alertmanager.

  • OpenTelemetry (OTEL) Operator

    Manages the installation, lifecycle and configuration of OpenTelemetry collectors and auto-instrumentation libraries.

For a complete list of Kubernetes Operators, see covered Kubernetes Operators.

Kubernetes Pod

A Kubernetes Pod (or just “Pod”) is a type of Kubernetes Resource that describes a group of containers with shared storage and network resources.

Kubernetes Node

A Kubernetes Node (or just “Node”) is a physical or virtual machine that comprises part of a Kubernetes Cluster.

Kubernetes Resource

A Kubernetes Resource is a representation of the desired and observed state of some object. Common Kubernetes Resources are Deployments, DaemonSets and StatefulSets.

Kubernetes StatefulSet

A Kubernetes StatefulSet (or just “StatefulSet”) is a type of Kubernetes Resource that describes a related set of Pods that run an application workload. The difference between a StatefulSet and a Deployment is that the Pods in a StatefulSet typically have an ordered initialization and the application workload maintains some form of persistent state.

Mitigation

Mitigation is a short‑term workaround that lowers the probability or impact of a risk until full remediation can occur. Typical mitigations include disabling a feature flag, throttling traffic, or rolling back a change.

Mitigation Workflow

A mitigation workflow is a durable workflow that automates or guides the application of mitigations.

Notifications

Notifications inform you about team invites, cluster onboarding, newly detected Operational Risks, and published Add-on Upgrade Templates, Cluster Upgrade Templates, Add-on Upgrade Plans, and Cluster Upgrade Plans.

Chkk supports Email, Slack, and in-app notifications.

Operational Risk (OR)

Operational Risk refers to any known or potential defect, misconfiguration, or incompatibility in Kubernetes clusters and Add-ons that can cause incidents, disruptions, or breakages. These risks, which may include known defects or issues stemming from unsupported versions, deprecated APIs, and software nearing end-of-life, are categorized by severity—Critical, High, Medium, or Low. An Operational Risk is detected by scanning for at-risk components, identifying trigger conditions, and assessing availability impact, root cause, remediation steps, and possible mitigations. In Chkk, these risks are codified as Risk Signatures (RSigs) that continuously scan customer environments to proactively uncover and address Operational Risks before they cause breakages or outages.

Operational Risk Signature (RSig)

An RSig is the logic used to detect the presence of a specific Operational Risk in your environment.

Operational Risk Signature Database (RSig DB)

RSig DB takes inspiration from cybersecurity, where security vulnerabilities are reported publicly in the CVE Database. We extended this idea to operational safety: if an Operational Risk (e.g., an error, failure, or disruption) has happened anywhere in the world, Chkk AI aggregators and data connectors learn about it, convert it into an Operational Risk Signature (similar to a virus signature), and store it in the RSig DB. Any new Operational Risk Signature is streamed to all our customers, whose environments are then scanned for it. That way, our customers can proactively detect, identify, and remediate Operational Risks before they cause breakages and disruptions, much like antivirus software detects and removes viruses before they start causing harm.

Organization

A grouping of multiple Chkk Accounts under a common ownership, typically matching an entire company or large department.

Package

A Package is a named bundle of software that is bundled (and identified) in the format of a specific Package System.

Package System

A Package System identifies a packaging ecosystem.

Preverification Engines

Preverification Engines create a Digital Twin of your Kubernetes environment and simulate an Add-on Upgrade Template or Cluster Upgrade Template before it is published to you, ensuring that all steps are verified to execute without errors.

Project

A Project is software that provides some functionality.

Project Release Series

A Project may have one or more Project Release Series. A Project Release Series is a single release series for the Project.

Project Release Series are identified by a simple string that must be unique for the Project. Typically the Project Release Series name will be a major or major.minor version series, e.g. “4” or “1.28”; however, this is not universally true. Some Projects use date-based names for the release series, like “2024.04”.
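For Projects that follow the common major or major.minor convention, a series name can be derived from a version string by keeping its leading components. The helper below is a hypothetical illustration that assumes dot-separated versions with an optional leading "v"; it is not Chkk's algorithm, and Projects with other naming schemes would need their own rules.

```python
def release_series(version, parts=2):
    """Derive a series name such as "1.28" from a version such as "1.28.3".

    `parts` selects how many leading components form the series
    (1 for a major series, 2 for major.minor). Hypothetical helper.
    """
    core = version.lstrip("v")  # tolerate tags like "v1.28.3"
    return ".".join(core.split(".")[:parts])
```

With this helper, `release_series("1.28.3")` yields "1.28", `release_series("v4.2.0", parts=1)` yields "4", and a date-based version such as "2024.04.1" yields "2024.04".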

Project Release

A Project may have one or more Project Releases. A Project Release is the coordinated publication of one or more Release Artifacts for the Project having the same Version string.

Remediation

Remediation is the permanent correction of a defect or risk, aimed at eliminating the root cause rather than merely reducing impact. Remediation is the long-term fix.

Remediation Workflow

A sequence of actions stitched together into a workflow that rolls out the remediation of a risk or defect (for instance, code changes or package upgrades) and validates success via post‑conditions and metrics.

Revalidation

Revalidation starts as soon as a new cluster is onboarded. The first pass of Revalidation leverages multiple Classifiers (part of Collective Learning) to catalog what’s running in your Kubernetes fleet and where.

Once Classifiers finish, a secondary pass is kicked off to extend coverage and remove false positives.

This process can take up to 24 hours. You can also report false positives via the Action: Feedback.

Details of Chkk’s coverage dimensions here

Risk Scan (aka RSig Scan)

An RSig Scan matches your running Kubernetes environment against relevant RSigs in the RSig DB.

Scanning an RSig has two stages:

  1. Contextualize - Check if the RSig’s components/versions are present in your environment.
  2. Test - If the context is relevant, compare version numbers and conditions to determine if the RSig is present.
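The two stages can be sketched as a pair of checks. Everything in this example is an assumption made for illustration: the `RSig` fields, the "affected below a fixed version" condition, and the result strings are hypothetical, and real RSigs may encode far richer trigger conditions.

```python
from dataclasses import dataclass


@dataclass
class RSig:
    name: str
    component: str         # component the signature applies to
    fixed_in: tuple        # versions lower than this are treated as affected


def parse_version(text):
    """Turn "1.8.4" into (1, 8, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def scan(rsig, environment):
    """`environment` maps component names to installed version strings."""
    # Stage 1: Contextualize. Is the component present at all?
    installed = environment.get(rsig.component)
    if installed is None:
        return "not_applicable"
    # Stage 2: Test. Does the installed version meet the risk condition?
    if parse_version(installed) < rsig.fixed_in:
        return "detected"
    return "not_detected"


# Hypothetical signature: an issue fixed in version 1.9.0 of a DNS add-on.
example_rsig = RSig(name="example-risk", component="coredns", fixed_in=(1, 9, 0))
```

Comparing parsed tuples rather than raw strings avoids ordering "1.10.1" before "1.9.0".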

RSig Scans can be periodic (default: every 12 hours), on-demand, or event-driven (triggered by CI/CD).

You can also set the frequency of the scans—see instructions here

Scan Engines

Scan Engines identify Operational Risks at various layers of Kubernetes and the underlying cloud infrastructure. Multiple engines run in parallel, each focusing on a subset of risks.

Source‑Grounded AI

Source‑grounded AI produces answers that link directly back to the underlying asset record or source, enabling users to verify every factual claim and trace the agent’s chain of thought.

Task‑Specific AI Agent

A task‑specific AI agent is an agent pre‑loaded with exactly one core skill and an intentionally narrow toolset, making its behavior easier to predict, govern, and audit.

Team

A group of users or Cloud Identities (such as AWS Accounts, IAM users, and IAM roles) within a Chkk Organization. Team members may be assigned ownership of certain resources, like clusters or certain Add-ons across the fleet of clusters.

Tool

A tool is any external capability (software function, API endpoint, database query, CLI command, or workflow) that an AI Agent can invoke to extend its own reasoning and perception.

Trigger

A trigger is the initiating event, schedule, or external signal that instantiates a workflow run. It can be time‑based (cron, interval), state‑based (configuration drift detected), or externally invoked.

Workflow

A workflow is a sequence of coordinated steps that binds AI agents, tools, humans, and even nested workflows into a single deterministic process to achieve a defined operational outcome. Each step’s intent, inputs, outputs, and side-effects are durably recorded, so the entire flow can be replayed, resumed, or audited—guaranteeing that every decision and external action is applied exactly once, in order, and with full lineage.
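The idea of durably recording each step so that a run can be resumed without repeating side effects can be sketched with a simple journal. This is a minimal illustration under stated assumptions: `run_workflow` and the step names are hypothetical, and an in-memory dictionary stands in for the durable storage a real workflow engine would use.

```python
def run_workflow(steps, journal):
    """Run steps in order, skipping any step already recorded in `journal`.

    `steps` is a list of (name, callable) pairs. `journal` maps step names
    to recorded results and stands in for durable storage in this sketch.
    """
    for name, action in steps:
        if name in journal:
            continue  # applied in an earlier run; do not repeat the side effect
        journal[name] = action()  # record the outcome before moving on
    return journal


# Hypothetical steps that log their own execution.
executed = []


def make_step(name):
    def action():
        executed.append(name)
        return f"{name}-done"
    return action


steps = [("drain-nodes", make_step("drain-nodes")), ("upgrade", make_step("upgrade"))]
```

If a previous run already recorded `drain-nodes`, resuming with that journal executes only `upgrade`, and running again afterwards executes nothing.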

Workflow Engine

Workflow Engine persists state for long-running multi-step operations such as Upgrade Templates and Upgrade Plans.

Upgrade paths are pre-verified on a Digital Twin of the underlying cluster, ensuring predictable and safe execution before changes are applied.
After pre-verification, Workflow Engine generates long-running, durable upgrade workflows that are highly contextualized and pre-verified to prevent failures.