44 changes: 44 additions & 0 deletions docs/04-For Operators/03-Deployment/01_guide.md
@@ -0,0 +1,44 @@
---
slug: /deployment-guide
title: Guide
sidebar_position: 1
---

# Deployment Guide

metal-stack is a Metal-as-a-Service (MaaS) platform that turns on-premises bare-metal servers into an elastic, self-managed cloud infrastructure. At its core, metal-stack consists of two components:

- **Control Plane** — The central management layer (APIs, databases, scheduling)
- **Partitions** — The data center infrastructure (servers, leaf switches, BMCs)

These two components alone provide a fully functional **Metal-as-a-Service (MaaS)** platform. You can allocate machines, manage networks, configure firewalls, and operate servers via REST/gRPC APIs and the `metalctl` CLI, without any Kubernetes cluster lifecycle management.

## Kubernetes Cluster Lifecycle Management

If you need to provision **Kubernetes clusters** on top of your bare-metal infrastructure, metal-stack integrates with Kubernetes Cluster Lifecycle Management (KCLM) solutions like [Gardener](../../05-Concepts/04-Kubernetes/01-gardener.md) or [Cluster-API](../../05-Concepts/04-Kubernetes/02-cluster-api.md). These are **optional layers** built on top of the MaaS foundation, not prerequisites.

:::tip
You can use metal-stack as a pure MaaS platform without any KCLM integration. The KCLM layer is only needed if you want to automate Kubernetes cluster provisioning on your bare-metal infrastructure.
:::

## Deployment Approaches

We bootstrap the [metal control plane](../../05-Concepts/01-architecture.mdx#metal-control-plane) as well as our [partitions](../../05-Concepts/01-architecture.mdx#partitions) with [Ansible](https://www.ansible.com/) through CI.

We recommend using the same Ansible roles that we use to deploy metal-stack in our own environments. They are available in the [metal-roles](https://github.com/metal-stack/metal-roles) repository.

To simplify dependency management, we provide a dedicated [deployment base image](https://github.com/metal-stack/metal-deployment-base/pkgs/container/metal-deployment-base) on GitHub Container Registry. This Docker image bundles all required tools and libraries, keeping your deployment environment lean and easy to maintain.

The remainder of this guide assumes you are using our Ansible roles and the deployment base image. Make sure you have [Docker](https://docs.docker.com/get-started/get-docker/) installed. Every deployment environment is unique, so this guide describes how we set up metal-stack in our own infrastructure. You will need to adapt the steps to match your specific requirements.
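
As a rough sketch of how the deployment base image might be used, the following Docker Compose file runs a playbook inside the container. The image path is derived from the GitHub Container Registry link above. The environment variable `DEPLOYMENT_BASE_IMAGE_TAG`, the mount point `/workdir`, the inventory path, and the playbook name are hypothetical and only serve as illustration. It also assumes that `ansible-playbook` is available in the image.

```yaml
# docker-compose.yaml (hypothetical file in your own deployment repository)
services:
  deploy:
    # Image path taken from the GitHub Container Registry link above.
    # The tag is read from an environment variable (an assumption of this
    # sketch); set it to a released tag of the deployment base image.
    image: ghcr.io/metal-stack/metal-deployment-base:${DEPLOYMENT_BASE_IMAGE_TAG}
    # Mount the deployment repository into the container (path is an assumption).
    volumes:
      - .:/workdir
    working_dir: /workdir
    # Inventory and playbook names are placeholders for your own files.
    command:
      - ansible-playbook
      - -i
      - inventories/control-plane.yaml
      - deploy_control_plane.yaml
```

With a file like this, `docker compose run --rm deploy` would execute the playbook with the tooling bundled in the image, so that only Docker is needed on the host.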

:::warning
You will probably need to learn how to write Ansible playbooks to deploy metal-stack as presented in this documentation. That said, it should be possible to follow these docs even without prior Ansible knowledge. For further explanations regarding Ansible, please refer to [docs.ansible.com](https://docs.ansible.com/).
:::

:::info
If you do not want to use Ansible for deployment, you need to build your own deployment mechanism. You can still learn from our [metal-roles](https://github.com/metal-stack/metal-roles) repository and [Helm charts](https://github.com/metal-stack/helm-charts/), but be aware that the Ansible roles tie everything together (variable management, dependency ordering, environment-specific configurations), so building an equivalent from scratch will require significant effort.
:::
Comment on lines +38 to +40

Contributor

Am I missing something here? What can be re-used from the metal-roles repo when not using ansible?

For example the metal-control-plane helm chart is in https://github.com/metal-stack/helm-charts/ not metal-roles.

Contributor Author

Made it more explicit, please re-check!


:::tip
You can use the [mini-lab](https://github.com/metal-stack/mini-lab) as a template project for your own deployment. It uses the same approach as described in this document.
:::
91 changes: 91 additions & 0 deletions docs/04-For Operators/03-Deployment/02_initial-cluster.md
@@ -0,0 +1,91 @@
---
slug: /deployment/initial-cluster
title: Initial Cluster
sidebar_position: 2
---

# Initial Cluster

An initial Kubernetes cluster is always required for metal-stack deployments using this guide, as the [metal control plane](../../05-Concepts/01-architecture.mdx#metal-control-plane) is deployed on Kubernetes.

The initial cluster(s) serve as the bootstrap infrastructure for the **metal-stack control plane**.

The number and placement of initial clusters depend on whether you use a KCLM solution, as well as your availability and autonomy requirements.

If you only need **Bare-Metal as a Service** (allocating machines, managing networks, configuring firewalls via API) without KCLM, you need at least **one cluster** for the [Control Plane](./03_control-plane.mdx).

## KCLM Solutions

metal-stack supports three Kubernetes Cluster Lifecycle Management solutions, which differ in maturity and capabilities.
See the [Kubernetes Concepts Section](../../05-Concepts/04-Kubernetes/01-gardener.md) for a detailed view of the solutions.

### Gardener (Recommended)

[Gardener](../../05-Concepts/04-Kubernetes/01-gardener.md) is the **recommended** path for Kubernetes cluster lifecycle management. It is battle-tested in production for over seven years at financial-sector customers and bundles more day-2 capabilities natively (DNS, backup, audit). Gardener manages entire clusters as Kubernetes-native resources with a strong separation between platform operators and end-users.

:::tip
Gardener is the recommended solution for production environments. See the [Gardener concept doc](../../05-Concepts/04-Kubernetes/01-gardener.md) for terminology and architecture details.
:::

### Cluster-API

[Cluster-API](../../05-Concepts/04-Kubernetes/02-cluster-api.md) is a CNCF project maintained by a Kubernetes SIG that provides declarative cluster management through a management cluster. The metal-stack provider (CAPMS) is **under development** and not yet production-ready.

:::warning
Cluster-API with metal-stack is in development and not advised for production use. Please use Gardener for production workloads.
:::

#### Kamaji

[Kamaji](../../05-Concepts/04-Kubernetes/02-cluster-api.md#kamaji) offers a control plane hosting model similar to Gardener's, where the control plane runs on dedicated infrastructure separate from the worker nodes.
To do so, Kamaji uses Cluster-API to support different infrastructure and control plane providers.
However, Kamaji integrations with metal-stack **have not been evaluated in production-grade scenarios** by metal-stack.

:::warning
Kamaji with metal-stack is in development and not advised for production use. Please use Gardener for production workloads.
:::

## Deployment Options

There are three supported approaches for hosting the initial cluster:

### Option 1: Shared Initial Cluster

It is possible to use a **single initial cluster** for both metal-stack and the KCLM solution. This approach is technically feasible but **not recommended** for production environments. Sharing a single cluster mixes platform infrastructure with lifecycle management, which can complicate operational boundaries and failure isolation.

### Option 2: Dedicated Clusters

We recommend using **dedicated (initial) clusters** for metal-stack and the KCLM solution — one cluster for the metal-stack control plane and a separate cluster for the KCLM.

This approach provides clearer operational boundaries and better failure isolation.

### Option 3: Autonomous Control Plane

For self-hosted deployments, metal-stack can be set up with an [Autonomous Control Plane](/community/MEP-18-autonomous-control-plane) cluster. This approach is the best choice for organizations that require full digital sovereignty and autonomy over their entire infrastructure stack.

The autonomous control-plane cluster serves as a minimal control plane whose sole purpose is to host the production control plane cluster (the "Matryoshka principle").
This brings several advantages, such as failure isolation, separate operational responsibility, minimal resource requirements, and full control and ownership.

The only requirement from metal-stack is that your partitions can establish network connections to the metal control plane.

## Suggestions for the Initial Cluster

### For Options 1 & 2: Cloud-Hosted Clusters

For the shared and dedicated cluster approaches, the initial cluster can be hosted anywhere — a hyperscaler, metalstack.cloud, or any other managed Kubernetes provider. Some common options:

- **metalstack.cloud** — A Kubernetes cluster can be created via [UI](https://metalstack.cloud/de/documentation/UserManual#creating-a-cluster), CLI, or Terraform.
- **GCP/GKE** — A GCP account is required. The Ansible [gcp-auth role](https://github.com/metal-stack/ansible-common/tree/master/roles/gcp-auth) can be used for authentication, and the [gcp-create role](https://github.com/metal-stack/ansible-common/tree/master/roles/gcp-create) for creating a GKE cluster.
- Suggested defaults: `gcp_machine_type`: e2-standard-8, `gcp_autoscaling_min_nodes`: 1, `gcp_autoscaling_max_nodes`: 3
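
The suggested defaults above could be captured as Ansible variables, for example in a group variables file. The variable names and values are taken from the list item above. The file path is an assumption of this sketch, and the role documentation should be checked for further required variables.

```yaml
# group_vars/control-plane/gcp.yaml (hypothetical path)
# Suggested defaults for the GKE cluster hosting the metal control plane.
gcp_machine_type: e2-standard-8
gcp_autoscaling_min_nodes: 1
gcp_autoscaling_max_nodes: 3
```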

:::tip
For metal-stack, it does not matter where your control plane Kubernetes cluster is located. You can of course use a cluster managed by a hyperscaler. This has the advantage that you do not have to set up Kubernetes yourself, and it can even be beneficial in terms of fail-safe operation. If you are interested, you can find the reasoning behind this deployment decision [here](../../05-Concepts/01-architecture.mdx#target-deployment-platforms).
:::

### For Option 3: Autonomous Control Plane with k3s

For the autonomous control plane approach, [MEP-18](/community/MEP-18-autonomous-control-plane) proposes using [k3s](https://k3s.io/) as the initial cluster. This is because KCLM solutions are not yet able to create an initial cluster themselves (though this may change with implementations like [GEP-28](https://github.com/gardener/gardener/blob/master/docs/proposals/28-autonomous-shoot-clusters.md) for Gardener).

The k3s nodes can be either bare-metal machines or virtual machines. For a minimal setup, a single node with 8–16 cores, 64 GB RAM, and two 1 TB NVMe drives is a good starting point. For high availability, a clustered k3s configuration across multiple nodes is recommended, with etcd replication and backup-restore mechanisms configured for metal-stack and KCLM components.
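
As a minimal sketch of the clustered setup, the first k3s server node could be configured with embedded etcd and scheduled snapshots through the k3s configuration file. The schedule and retention values are illustrative assumptions, not recommendations from MEP-18.

```yaml
# /etc/rancher/k3s/config.yaml on the first server node
# Initialize a new cluster with embedded etcd instead of the default SQLite.
cluster-init: true
# Take an etcd snapshot every six hours (illustrative value).
etcd-snapshot-schedule-cron: "0 */6 * * *"
# Keep the most recent 10 snapshots (illustrative value).
etcd-snapshot-retention: 10
```

Additional server nodes would join this cluster using the `server` and `token` settings in their own configuration files. Note that this only covers snapshots of the k3s datastore itself. Backup-restore for the metal-stack and KCLM components needs to be configured separately.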

See the [Autonomous Control Plane](/community/MEP-18-autonomous-control-plane) proposal for detailed architecture, failure scenarios, and implementation guidance.