Understanding Chkk
Upgrading Kubernetes clusters and add-ons is complex, error-prone, and time-consuming. The process is highly manual, with teams sifting through scattered release notes and documentation for every component. Small changes in one layer can introduce hidden incompatibilities in another, risking outages and forcing expert-level troubleshooting at every step. Because it’s so complex, it can’t be easily delegated or automated, creating bottlenecks, long delays, and added costs—including extended support fees and mounting technical debt.
This is the problem Chkk is designed to solve. Chkk is an AI-curated Operational Safety Platform for Kubernetes and its Add-ons. It leverages Knowledge Graphs, Risk Signature Databases, and Agentic AI systems to proactively identify hidden dependencies, unknown incompatibilities, and potential risks before they cause failures. By generating pre-verified, AI-curated upgrade workflows on demand, Chkk Upgrade Copilot provides teams with structured, safe, and efficient upgrade plans—speeding up the upgrades by 3x to 5x while eliminating last-minute surprises and disruptions.
Building Blocks
Upgrade Templates
An Upgrade Template is an AI-curated workflow containing a tested and structured sequence of steps and stages to safely upgrade your Kubernetes clusters. An Upgrade Template is generated on-demand and is scoped to an Environment (e.g. dev, staging or prod). Upgrade Templates support three commonly-used upgrade patterns: In-Place, Blue-Green, and Rolling/Surge.
Upgrade Templates answer all the questions that Platform Teams have to address in an upgrade. It starts with “Research”:
- What versions and configuration of control planes, nodes, add-ons and applications are running?
- Have any of these versions reached EOL?
- Are any of these versions incompatible with each other?
- What Operational Risks should I be aware of before upgrading? For example:
  - Are there open defects in specific versions that have caused breakages or disruptions?
  - Are there any misconfigurations that I should fix prior to upgrading?
- Which next version of control planes and add-ons should we upgrade to? More specifically:
  - What’s the version matrix where all Add-ons are compatible with the next Kubernetes version?
  - What breaking changes (CRDs, configuration, features, behavioral changes, …) will we encounter when executing the upgrade?
  - Are there any hidden dependencies that can break Add-ons or applications?
  - Can I upgrade Add-ons to the desired version directly, or do we need to do it as a sequence of upgrade hops to avoid schema breakages and/or incompatibilities?
Then comes “Preparation”:
- What changes in Add-ons’ Helm charts and CRDs must we cater for before executing upgrades?
- What preflight checks for control plane and Add-ons should we run to ensure it’s safe to execute upgrades?
- What exact steps (code diffs) should be applied to upgrade Add-ons, control plane, and nodes, and in which order?
- What postflight checks should be run to ensure everything is healthy after the upgrade is complete?
An Upgrade Template answers all the above questions and its entire workflow of steps and stages is pre-verified to work without failures on a Digital Twin of your Kubernetes environment.
While simple add-ons (e.g. VPC CNI, cert-manager, External Secrets Operator, etc.) can be upgraded with the cluster, complex add-ons (e.g. Istio, Contour, Consul, etc.) generally require a dedicated Add-on Upgrade Template with steps and stages specific to that add-on. Add-on Upgrade Templates ensure upgrade safety by enabling you to execute complex add-ons’ upgrades before or after cluster upgrades.
Your team reviews and customizes an Upgrade Template by collaborating through comments inside the Template, adding your own custom steps, and finally approving the Upgrade Template for execution.
And now you are ready for “Upgrade Execution”: This is where you instantiate Upgrade Plans for each cluster in the Environment. The instantiated Upgrade Plans inherit all the information present in Upgrade Templates, plus additional cluster-specific information like:
- Which applications are using deprecated/removed APIs?
- Are there any application client changes in Add-ons?
- Are there any application misconfigurations, such as incorrect Pod Disruption Budgets (PDBs), that can cause the upgrade to fail?
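To make the PDB question concrete: a PDB that allows zero voluntary disruptions will block node drains during an upgrade. The sketch below is only an illustration of that check, not Chkk's implementation. The function name and parameters are hypothetical, it handles integer values only (percentages are out of scope), and it assumes all replicas are currently healthy.

```python
from typing import Optional


def pdb_blocks_drain(replicas: int,
                     min_available: Optional[int] = None,
                     max_unavailable: Optional[int] = None) -> bool:
    """Return True if the PDB allows zero voluntary disruptions.

    Hypothetical helper for illustration: integer values only, and all
    replicas are assumed to be healthy.
    """
    if max_unavailable is not None:
        # maxUnavailable of 0 means no pod may ever be evicted.
        return max_unavailable <= 0
    if min_available is not None:
        # minAvailable >= replicas leaves no room for any eviction.
        return min_available >= replicas
    # Neither field set: nothing in this sketch blocks the drain.
    return False
```

For example, `pdb_blocks_drain(3, min_available=3)` flags a PDB that would stall a node drain, while `pdb_blocks_drain(3, min_available=2)` does not.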
All activities performed on the Upgrade Templates and Upgrade Plans are stored in long-running, durable workflows, ensuring safe, structured, and repeatable upgrades.
Upgrade Templates get all the information they need from two foundational technology components: Risk Signature Database (RSig DB) and Knowledge Graph.
Operational Risks
Operational Risk refers to any known or potential defect, misconfiguration, or incompatibility in Kubernetes clusters and add-ons that can cause incidents, disruptions, or breakages. These risks, which may include known defects or issues stemming from unsupported versions, deprecated APIs, and software nearing end-of-life, are categorized by severity—Critical, High, Medium, or Low. An Operational Risk is detected by scanning for at-risk components, identifying trigger conditions, and assessing availability impact, root cause, remediation steps, and possible mitigations. In Chkk, these risks are codified as Risk Signatures (RSigs) that continuously scan customer environments to proactively uncover and address Operational Risks before they cause breakages or outages.
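One way to picture a Risk Signature is as a record that names the at-risk component, a trigger condition, a severity, and remediation guidance. The sketch below is an assumption made for illustration: every class, field, and component name is hypothetical and does not reflect Chkk's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass
class RiskSignature:
    # All field names are illustrative, not Chkk's schema.
    name: str
    severity: Severity
    at_risk_component: str
    trigger: Callable[[Dict[str, str]], bool]  # trigger condition
    remediation: str
    mitigations: List[str] = field(default_factory=list)

    def matches(self, component: Dict[str, str]) -> bool:
        """True if the component is the at-risk one and the trigger fires."""
        return (component.get("name") == self.at_risk_component
                and self.trigger(component))


# Made-up signature for a made-up add-on and version.
example_rsig = RiskSignature(
    name="example-defect",
    severity=Severity.HIGH,
    at_risk_component="example-addon",
    trigger=lambda c: c.get("version") == "1.2.0",
    remediation="Upgrade example-addon past the affected version.",
)
```

Scanning an environment then amounts to calling `matches` on each discovered component and reporting the signatures that fire, grouped by severity.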
Chkk Service
Chkk is a secure, scalable multi-regional solution offered in the US and EU. It is built using a cell-based architecture and supports multi-tenant and single-tenant instances. It comprises multiple services that run APIs and microservices to serve product modules and maintain inventory records. At the heart of the Chkk Service are Classifiers and Engines, working in tandem to provide real-time insights and operational resilience for Kubernetes environments.
Classifiers leverage references from the Knowledge Graph and RSig DB, either extracting image digests (hash-based) or applying RuleSets (rule-based) to identify Kubernetes resources and their relationships. Engines then use these classification results to detect latent Operational Risks, ensure guardrail conformance, and generate Upgrade Templates. They also create Digital Twins to pre-verify proposed Upgrade Plans, helping ensure that all steps can be executed without failures, and enabling smoother, more reliable operations.
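The two classification modes can be sketched as a digest lookup and a list of rules. This is a simplified illustration under stated assumptions, not Chkk's Classifier: the reference table, the rule, the resource shape, and the choice to try the hash-based lookup before the rules are all hypothetical. The only real identifiers are the Kubernetes recommended labels `app.kubernetes.io/name` and `app.kubernetes.io/version`.

```python
from typing import Callable, Dict, List, Optional, Tuple

Identity = Tuple[str, str]  # (add-on name, version)

# Hypothetical reference data: image digest -> identity.
KNOWN_DIGESTS: Dict[str, Identity] = {
    "sha256:aaaa": ("example-addon", "1.2.0"),
}

Rule = Callable[[dict], Optional[Identity]]


def label_rule(resource: dict) -> Optional[Identity]:
    """Rule-based: identify a resource from its recommended labels."""
    labels = resource.get("labels", {})
    name = labels.get("app.kubernetes.io/name")
    version = labels.get("app.kubernetes.io/version")
    if name and version:
        return (name, version)
    return None


RULESET: List[Rule] = [label_rule]


def classify(resource: dict) -> Optional[Identity]:
    """Try the hash-based lookup first, then each rule in order."""
    hit = KNOWN_DIGESTS.get(resource.get("image_digest", ""))
    if hit is not None:
        return hit
    for rule in RULESET:
        result = rule(resource)
        if result is not None:
            return result
    return None  # unclassified
```

A digest match is exact, so it is a natural first choice in this sketch; rules cover resources whose images were rebuilt or mirrored and therefore carry an unknown digest.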
Risk Signature Database (RSig DB)
RSig DB takes inspiration from cybersecurity, where security vulnerabilities are reported publicly in the CVE Database. We extended this idea to operational safety: If there’s an Operational Risk (e.g. an error, failure, or disruption) that has happened anywhere in the world, Chkk AI aggregators and data connectors learn about it, convert it into a Risk Signature (similar to a virus signature), and store it in the RSig DB. Any new Risk Signature is streamed to all our customers, whose environments are then scanned for it. That way, our customers can proactively detect, identify, and remediate Operational Risks before they cause breakages and disruptions, much like antivirus software detects and removes viruses before they start causing harm.
RSig DB’s AI aggregators and data connectors gather Operational Risks from the following sources of information:
- Release schedules
- Upstream Tickets / Issues
- Release notes / Changelogs
- Pull Requests
- Cloud provider knowledge bases and issue trackers
Customers can also voluntarily opt in to share Operational Risks with Chkk. Otherwise, we don’t learn anything from the customer directly.
RSig DB also codifies Guardrails, which represent operational best practices from across the Kubernetes ecosystem (upstream communities, cloud providers, and vendors). Platform Teams use these Guardrails to ensure Application developers are conforming to their Platform’s operational excellence standards.
Knowledge Graph
Knowledge Graph stores AI-curated data and relationships across hundreds of open-source projects and add-ons in the Kubernetes ecosystem, modeling their impact and identifying the safest upgrade paths. The Chkk Research Team provides oversight of the AI-curated data and relationships.
Knowledge Graph covers Kubernetes releases of all major clouds and distributions: EKS, GKE, AKS, VMware Tanzu, OpenShift, Rancher RKE1/RKE2, Nutanix. We also support DIY and Self-Hosted Kubernetes. Chkk also covers 250+ add-ons, and coverage for a new add-on can be extended within 48 hours.
Coverage Extension is done on a continual basis by an agentic AI architecture, with multiple task-specific AI agents identifying and curating the following information:
- Release Notes are curated with emphasis on breaking changes and upgrade considerations
- Image hashes from upstream registries
- Components are modeled, where an add-on may package other add-ons (e.g. Redis-OSS is packaged inside ArgoCD)
- Package Systems (Helm, Kustomize, etc.)
- Package Sources (HelmCharts, KustomizeSources, KubeSources, etc.)
- Deployment Modes (e.g. Istio has two Deployment Modes: Sidecar, Ambient)
- EOL policies
- Version Compatibility:
  - With upstream Kubernetes versions
  - With the cloud substrate (e.g. Amazon EKS versions and Amazon AMI versions)
  - With other add-ons (e.g. a Contour version being compatible with certain Envoy versions)
- Safety, Health and Readiness Checks, which include per-version preflight, inflight and postflight checks packaged in single-click, ready-to-run containers
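The Version Compatibility data above can be pictured as a lookup table that is filtered by the target Kubernetes version. The sketch below assumes a hypothetical table shape and made-up add-on names and versions; it does not describe how the Knowledge Graph actually stores or queries this data.

```python
from typing import Dict, List, Set

# Hypothetical data: add-on -> add-on version -> supported Kubernetes minors.
COMPAT: Dict[str, Dict[str, Set[str]]] = {
    "addon-a": {"1.0": {"1.29", "1.30"}, "1.1": {"1.30", "1.31"}},
    "addon-b": {"2.4": {"1.29"}, "2.5": {"1.30", "1.31"}},
}


def compatible_versions(target_k8s: str) -> Dict[str, List[str]]:
    """List, per add-on, the versions that support the target Kubernetes minor.

    Versions are sorted as plain strings, which is good enough for this
    toy data but is not a real semantic-version ordering.
    """
    return {
        addon: sorted(v for v, k8s in versions.items() if target_k8s in k8s)
        for addon, versions in COMPAT.items()
    }
```

With this toy data, `compatible_versions("1.31")` leaves exactly one candidate per add-on, which is the kind of version matrix an upgrade to that Kubernetes version would need.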
Using the Knowledge Graph, Chkk identifies the safest Upgrade Paths for clusters and add-ons. A discovered path may contain multiple upgrade hops.
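Multi-hop path discovery can be illustrated as a search over a graph of supported direct upgrades. The sketch below uses breadth-first search to find the path with the fewest hops; this is a stand-in chosen for simplicity, since the source does not say how Chkk ranks paths for safety. The hop table and version numbers are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical supported direct upgrades for a single add-on.
SUPPORTED_HOPS: Dict[str, List[str]] = {
    "1.0": ["1.1"],
    "1.1": ["1.2", "1.3"],
    "1.2": ["1.3"],
    "1.3": ["2.0"],
}


def shortest_upgrade_path(current: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for the upgrade path with the fewest hops."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in SUPPORTED_HOPS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no supported path exists
```

Here `shortest_upgrade_path("1.0", "2.0")` returns `["1.0", "1.1", "1.3", "2.0"]`: there is no direct upgrade, so the result is a sequence of three hops.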
Chkk Dashboard
Chkk Dashboard is a UI for you to interact with Chkk. This interaction includes, but isn’t limited to, the following actions:
- Onboarding clusters, managing access tokens
- Inviting team members
- Operational Risks: viewing Risks latent in clusters, reading Knowledge Base articles about these Risks, and performing actions (like ignoring a Risk or leaving comments for team members)
- Guardrails: viewing Guardrails that are not followed, reading Knowledge Base articles about these Guardrails, and exposing these Guardrails to Application Teams through an API integration
- Upgrade Templates: requesting, customizing, reviewing, and approving Upgrade Templates
- Upgrade Plans: instantiating and executing Upgrade Plans
- Artifacts: viewing inventory extracted from running components, container images, repositories, and tools across multiple clusters, clouds, and layers of infrastructure
- Integrations: connecting with internal tools like GitHub and Slack, as well as SSO integration