Automated Ceph Cluster Deployment with Cephadm-Ansible and Kubernetes Integration

Complete automation workflow for Ceph deployment using cephadm-ansible and seamless integration with Kubernetes via Ceph-CSI

Overview

In this article, we’ll cover the complete process of installing Ceph clusters in an automated manner using cephadm-ansible and integrating them with Kubernetes environments.

Cephadm is a container-based Ceph management tool, while cephadm-ansible provides complementary Ansible automation for repetitive tasks that cephadm does not handle directly, such as initial host setup, cluster management, and purge operations.

The installation process involves provisioning VMs with Terraform, building a Kubernetes cluster with Kubespray, and then using cephadm and Ansible scripts to bootstrap the Ceph cluster and configure the MON/MGR and OSD nodes.

We’ll also install Ceph-CSI drivers to enable Kubernetes integration with Ceph RBD storage, and configure StorageClass and test Pods to demonstrate the complete workflow.



What is Cephadm?

Cephadm is Ceph’s latest deployment and management tool, introduced starting with the Ceph Octopus release.

It’s designed to simplify deploying, configuring, managing, and scaling Ceph clusters. It can bootstrap a cluster with a single command and deploys Ceph services using container technology.
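
Bootstrapping really is a single command; a minimal example on the node that will host the first monitor looks like this (using the monitor IP from this environment; the full invocation used by the automation script appears later in the article):

# Bootstrap a minimal single-node cluster; cephadm pulls the Ceph container images
sudo cephadm bootstrap --mon-ip 10.77.101.47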

Cephadm doesn’t rely on external configuration tools like Ansible, Rook, or Salt. However, these external configuration tools can be used to automate tasks not performed by cephadm itself.



What is Cephadm-ansible?

Cephadm-ansible is a collection of Ansible playbooks designed to simplify workflows not covered by cephadm itself.

The workflows it handles include:

  1. Preflight: prepares hosts before bootstrap by configuring repositories and installing cephadm and other prerequisites (cephadm-preflight.yml)
  2. Client configuration: distributes the Ceph configuration file and keyrings to client nodes (cephadm-clients.yml)
  3. SSH key distribution: copies the cephadm public key to the cluster hosts (cephadm-distribute-ssh-key.yml)
  4. Purge: removes an existing Ceph cluster from all hosts (cephadm-purge-cluster.yml)


Kubernetes Installation

Refer to the linked article for Kubernetes installation. Important note: Storage nodes must have at least 32GB of memory.


Infrastructure Configuration

Master Node (Control Plane)

Component     IP             CPU   Memory
test-server   10.77.101.47   16    32G

Worker Nodes

Component             IP             CPU   Memory
test-server-agent     10.77.101.43   16    32G
test-server-storage   10.77.101.48   16    32G


Terraform Configuration for Storage Node

Add the following configuration:


Verify Kubernetes Installation

kubectl get nodes
NAME                STATUS   ROLES           AGE     VERSION
test-server         Ready    control-plane   6m27s   v1.29.1
test-server-agent   Ready    <none>          5m55s   v1.29.1



Cephadm-Ansible Installation


Clone Repository and Setup Environment

git clone https://github.com/ceph/cephadm-ansible

VENVDIR=cephadm-venv
CEPHADMDIR=cephadm-ansible
python3.10 -m venv $VENVDIR
source $VENVDIR/bin/activate
cd $CEPHADMDIR

pip install -U -r requirements.txt


Verify Storage Disks

Check disk configuration on the test-server-storage node:

lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0     7:0    0  63.9M  1 loop /snap/core20/2105
loop1     7:1    0 368.2M  1 loop /snap/google-cloud-cli/207
loop2     7:2    0  40.4M  1 loop /snap/snapd/20671
loop3     7:3    0  91.9M  1 loop /snap/lxd/24061
sda       8:0    0    50G  0 disk
├─sda1    8:1    0  49.9G  0 part /
├─sda14   8:14   0     4M  0 part
└─sda15   8:15   0   106M  0 part /boot/efi
sdb       8:16   0    50G  0 disk
sdc       8:32   0    50G  0 disk
sdd       8:48   0    50G  0 disk



Configuration Files Setup


Create Inventory File

# inventory.ini
[all]
test-server ansible_host=10.77.101.47
test-server-agent ansible_host=10.77.101.43
test-server-storage ansible_host=10.77.101.48

# Ceph Client Nodes (Kubernetes nodes that require access to Ceph storage)
[clients]
test-server
test-server-agent
test-server-storage

# Admin Node (Usually the first monitor node)
[admin]
test-server
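
Before running any playbooks, it is worth confirming that Ansible can reach every host in the inventory. The user and key below match the variables defined later in ceph_vars.sh:

ansible -i inventory.ini all -m ping -u somaz --private-key ~/.ssh/id_rsa_ansible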


Update cephadm-distribute-ssh-key.yml

Fix the file attribute check:

# Change this line:
# or not cephadm_pubkey_path_stat.stat.isfile | bool
# To:
or not cephadm_pubkey_path_stat.stat.isreg | bool


Complete corrected file:


Update cephadm-preflight.yml

Add Ubuntu-specific tasks:

- name: Ubuntu related tasks
  when: ansible_facts['distribution'] == 'Ubuntu'
  block:
    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600
      changed_when: false

    - name: Install container engine
      block:
        - name: Install Docker
          # ... existing docker installation tasks
        - name: Ensure Docker is running
          service:
            name: docker
            state: started
            enabled: true

    - name: Install jq package
      apt:
        name: jq
        state: present
        update_cache: yes
      register: result
      until: result is succeeded



Automation Scripts


Variables Configuration (ceph_vars.sh)

#!/bin/bash

# Define variables (Modify as needed)
SSH_KEY="/home/somaz/.ssh/id_rsa_ansible" # SSH KEY Path
INVENTORY_FILE="inventory.ini" # Inventory Path
CEPHADM_PREFLIGHT_PLAYBOOK="cephadm-preflight.yml"
CEPHADM_CLIENTS_PLAYBOOK="cephadm-clients.yml"
CEPHADM_DISTRIBUTE_SSHKEY_PLAYBOOK="cephadm-distribute-ssh-key.yml"
HOST_GROUP=(test-server test-server-agent test-server-storage) # All host group
ADMIN_HOST="test-server" # Admin host name
OSD_HOST="test-server-storage" # OSD host name
HOST_IPS=("10.77.101.47" "10.77.101.43" "10.77.101.48") # Corresponding IPs
OSD_DEVICES=("sdb" "sdc" "sdd") # OSD devices, without /dev/ prefix
CLUSTER_NETWORK="10.77.101.0/24" # Cluster network CIDR
SSH_USER="somaz" # SSH user
CLEANUP_CEPH="false" # Reset based on user input


Function Library (ceph_functions.sh)
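
The full function library is not reproduced here. The sketch below shows one way the functions called by setup_ceph_cluster.sh could be implemented; the bodies are illustrative assumptions built from standard ansible-playbook, cephadm, and ceph orch commands, not the exact original file.

#!/bin/bash
# ceph_functions.sh - helper functions sourced by setup_ceph_cluster.sh (illustrative sketch)

# Run a cephadm-ansible playbook against the inventory with optional extra arguments
run_ansible_playbook() {
    local playbook=$1
    local extra_args=$2
    ansible-playbook -i "$INVENTORY_FILE" "$playbook" \
        -u "$SSH_USER" --private-key "$SSH_KEY" $extra_args
}

# Add a host key to known_hosts so later SSH and cephadm steps are non-interactive
add_to_known_hosts() {
    local ip=$1
    ssh-keyscan -H "$ip" >> ~/.ssh/known_hosts 2>/dev/null
}

# Purge an existing cluster if the user confirmed cleanup
cleanup_ceph_cluster() {
    if [ "$CLEANUP_CEPH" = "true" ]; then
        local fsid
        fsid=$(sudo ceph fsid 2>/dev/null) || return 0
        [ -n "$fsid" ] || return 0
        echo "Purging existing Ceph cluster $fsid..."
        sudo cephadm rm-cluster --fsid "$fsid" --force --zap-osds
    fi
}

# Register every host with the orchestrator and label the OSD host
add_host_and_label() {
    for i in "${!HOST_GROUP[@]}"; do
        sudo ceph orch host add "${HOST_GROUP[$i]}" "${HOST_IPS[$i]}"
    done
    sudo ceph orch host label add "$OSD_HOST" osd
}

# Create one OSD per device on the OSD host
add_osds_and_wait() {
    for device in "${OSD_DEVICES[@]}"; do
        sudo ceph orch daemon add osd "$OSD_HOST:/dev/$device"
    done
    sleep 30
}

# Print cluster and OSD status so the operator can confirm the result
check_osd_creation() {
    sudo ceph -s
    sudo ceph osd tree
}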


Main Setup Script (setup_ceph_cluster.sh)

#!/bin/bash

# Load functions from ceph_functions.sh and ceph_vars.sh
source ceph_vars.sh
source ceph_functions.sh

read -p "Do you want to cleanup existing Ceph cluster? (yes/no): " user_confirmation
if [[ "$user_confirmation" == "yes" ]]; then
    CLEANUP_CEPH="true"
else
    CLEANUP_CEPH="false"
fi

echo "Starting Ceph cluster setup..."

# Check for existing SSH key and generate if it does not exist
if [ ! -f "$SSH_KEY" ]; then
    echo "Generating SSH key..."
    ssh-keygen -f "$SSH_KEY" -N ''
    echo "SSH key generated successfully."
else
    echo "SSH key already exists. Skipping generation."
fi

# Copy SSH key to each host in the group
for host in "${HOST_GROUP[@]}"; do
    echo "Copying SSH key to $host..."
    ssh-copy-id -i "${SSH_KEY}.pub" -o StrictHostKeyChecking=no "$host"
done

# Cleanup existing Ceph setup if confirmed
cleanup_ceph_cluster

# Wipe OSD devices
echo "Wiping OSD devices on $OSD_HOST..."
for device in "${OSD_DEVICES[@]}"; do
    if ssh "$OSD_HOST" "sudo wipefs --all /dev/$device"; then
        echo "Wiped $device successfully."
    else
        echo "Failed to wipe $device."
    fi
done

# Run cephadm-ansible preflight playbook
echo "Running cephadm-ansible preflight setup..."
run_ansible_playbook $CEPHADM_PREFLIGHT_PLAYBOOK ""

# Create a temporary Ceph configuration file for initial settings
TEMP_CONFIG_FILE=$(mktemp)
echo "[global]
osd crush chooseleaf type = 0
osd_pool_default_size = 1" > $TEMP_CONFIG_FILE

# Bootstrap the Ceph cluster
MON_IP="${HOST_IPS[0]}"
echo "Bootstrapping Ceph cluster with MON_IP: $MON_IP"
add_to_known_hosts $MON_IP
sudo cephadm bootstrap --mon-ip $MON_IP --cluster-network $CLUSTER_NETWORK --ssh-user $SSH_USER -c $TEMP_CONFIG_FILE --allow-overwrite --log-to-file
rm -f $TEMP_CONFIG_FILE

# Distribute Cephadm SSH keys to all hosts
echo "Distributing Cephadm SSH keys to all hosts..."
run_ansible_playbook $CEPHADM_DISTRIBUTE_SSHKEY_PLAYBOOK "-e cephadm_ssh_user=$SSH_USER -e admin_node=$ADMIN_HOST -e cephadm_pubkey_path=$SSH_KEY.pub"

# Fetch FSID of the Ceph cluster
FSID=$(sudo ceph fsid)
echo "Ceph FSID: $FSID"

# Add and label hosts in the Ceph cluster
add_host_and_label

# Prepare and add OSDs
sleep 60
add_osds_and_wait

# Check Ceph cluster status and OSD creation
check_osd_creation

echo "Ceph cluster setup and client configuration completed successfully."

Execute the Setup

# Make script executable
chmod +x setup_ceph_cluster.sh
./setup_ceph_cluster.sh



Verification and Management


Ceph Dashboard Access

After successful installation, you’ll see:

Ceph Dashboard is now available at:
             URL: https://test-server:8443/
            User: admin
        Password: 9m16nzu1h7
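
If you want to replace the generated password, the dashboard admin user can be updated from the admin node; for example:

# Set a new dashboard password for the admin user (read from a file, not the command line)
echo -n 'MyNewPassword1!' > /tmp/dashboard_password
sudo ceph dashboard ac-user-set-password admin -i /tmp/dashboard_password
rm -f /tmp/dashboard_password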


Cluster Status Commands

# Check Ceph hosts
sudo ceph orch host ls
test-server          10.77.101.47  _admin
test-server-agent    10.77.101.43
test-server-storage  10.77.101.48  osd
3 hosts in cluster

# Check Ceph status
sudo ceph -s
  cluster:
    id:     43d4ca77-cf91-11ee-8e5d-831aa89df15f
    health: HEALTH_WARN
            1 pool(s) have no replicas configured

  services:
    mon: 1 daemons, quorum test-server (age 7m)
    mgr: test-server.nckhts(active, since 6m)
    osd: 3 osds: 3 up (since 3m), 3 in (since 3m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   872 MiB used, 149 GiB / 150 GiB avail
    pgs:     1 active+clean

# Check OSD tree
sudo ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                     STATUS  REWEIGHT  PRI-AFF
-1         0.14639  root default
-3         0.14639      host test-server-storage
 0    ssd  0.04880          osd.0                     up   1.00000  1.00000
 1    ssd  0.04880          osd.1                     up   1.00000  1.00000
 2    ssd  0.04880          osd.2                     up   1.00000  1.00000

# Check service status
sudo ceph orch ls --service-type mon
sudo ceph orch ls --service-type mgr
sudo ceph orch ls --service-type osd


Pool Management

# Check current pools
ceph df

# Create a new pool
ceph osd pool create kube 128

# Check pool replica settings
ceph osd pool get kube size

# Modify pool settings
ceph osd pool set kube size 3

# Set placement groups
ceph osd pool set kube pg_num 256

# Delete pool (if needed)
# ceph osd pool delete {pool-name} {pool-name} --yes-i-really-really-mean-it


Understanding Placement Groups (PGs)

Placement Groups (PGs) are logical units for data distribution and management within Ceph clusters. Ceph stores data as objects and assigns these objects to PGs, which are then distributed across various OSDs in the cluster. PGs optimize cluster scalability, performance, and resilience.
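
As a rough sizing guide, a common rule of thumb targets about 100 PGs per OSD: pg_num = (OSD count x 100) / replica size, rounded to a power of two. For this lab cluster that works out as follows; alternatively, the pg_autoscaler (enabled by default in recent releases) can manage pg_num for you:

# 3 OSDs, replica size 1  ->  (3 * 100) / 1 = 300  ->  nearest power of two: 256
ceph osd pool set kube pg_num 256

# Or let the autoscaler adjust pg_num automatically
ceph osd pool set kube pg_autoscale_mode on
ceph osd pool autoscale-status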



Kubernetes Integration with Ceph-CSI


Install Ceph-CSI Driver

To use Ceph as a storage solution for Kubernetes Pods, install Ceph-CSI (Container Storage Interface). Ceph-CSI exposes Ceph as persistent storage to Kubernetes, supporting both block storage (RBD) and file storage (CephFS).

# Add Helm repository
helm repo add ceph-csi https://ceph.github.io/csi-charts
helm repo update

# Search available charts
helm search repo ceph-csi
NAME                            CHART VERSION   APP VERSION     DESCRIPTION
ceph-csi/ceph-csi-cephfs        3.10.2          3.10.2          Container Storage Interface (CSI) driver, provi...
ceph-csi/ceph-csi-rbd           3.10.2          3.10.2          Container Storage Interface (CSI) driver, provi...

# Install RBD driver
helm install ceph-csi-rbd ceph-csi/ceph-csi-rbd --namespace ceph-csi --create-namespace

# Install CephFS driver (optional)
helm install ceph-csi-cephfs ceph-csi/ceph-csi-cephfs --namespace ceph-csi --create-namespace


Create Ceph-CSI Values Configuration

Since only one worker node is schedulable in this cluster, set the provisioner replicaCount to 1:

# Get Ceph cluster ID
sudo ceph fsid
afdfd487-cef1-11ee-8e5d-831aa89df15f

# Check monitor endpoint
ss -nlpt | grep 6789
LISTEN  0        512         10.77.101.47:6789           0.0.0.0:*

Create ceph-csi-values.yaml:

csiConfig:
- clusterID: "afdfd487-cef1-11ee-8e5d-831aa89df15f" # ceph cluster id
  monitors:
  - "10.77.101.47:6789"
provisioner:
  replicaCount: 1


Install Ceph-CSI Driver with Custom Values

# Create namespace
kubectl create namespace ceph-csi

# Install with custom values
helm install -n ceph-csi ceph-csi ceph-csi/ceph-csi-rbd -f ceph-csi-values.yaml

# Verify installation
kubectl get all -n ceph-csi



StorageClass Configuration


Create Secret and StorageClass

First, get the Ceph authentication information:

sudo ceph auth list
# Look for client.admin key
client.admin
        key: AQC899JlcL6CKBAAQsBOJqWw/CVTQKUD+2FbyQ==
        caps: [mds] allow *
        caps: [mgr] allow *
        caps: [mon] allow *
        caps: [osd] allow *
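
Before the CSI driver can provision images in the kube pool created earlier, the pool should also be initialized for RBD use, if this has not been done already:

sudo rbd pool init kube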

Create the StorageClass configuration:

# ceph-csi-storageclass.yaml
apiVersion: v1
kind: Secret
metadata:
  name: csi-rbd-secret
  namespace: kube-system
stringData:
  userID: admin # Ceph user ID ('admin' from client.admin)
  userKey: "AQC899JlcL6CKBAAQsBOJqWw/CVTQKUD+2FbyQ==" # client.admin key
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbd
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
    storageclass.kubesphere.io/supported-access-modes: '["ReadWriteOnce","ReadOnlyMany","ReadWriteMany"]'
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: "afdfd487-cef1-11ee-8e5d-831aa89df15f"
  pool: "kube"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: kube-system
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: kube-system
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
  - discard

Apply the configuration:

kubectl apply -f ceph-csi-storageclass.yaml

# Verify StorageClass
kubectl get storageclass
NAME            PROVISIONER        RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
rbd (default)   rbd.csi.ceph.com   Delete          Immediate           true                   21s



Testing the Integration


Deploy Test Pod with Persistent Volume

# test-pod.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rbd
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-using-ceph-rbd
spec:
  containers:
  - name: my-container
    image: nginx
    volumeMounts:
    - mountPath: "/var/lib/www/html"
      name: mypd
  volumes:
  - name: mypd
    persistentVolumeClaim:
      claimName: ceph-rbd-pvc


Deploy and verify:

# Apply test configuration
kubectl apply -f test-pod.yaml

# Verify resources
kubectl get pod,pv,pvc
NAME                     READY   STATUS    RESTARTS   AGE
pod/pod-using-ceph-rbd   1/1     Running   0          16s

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                  STORAGECLASS   REASON   AGE
persistentvolume/pvc-83fd673e-077c-4d24-b9c9-290118586bd3   1Gi        RWO            Delete           Bound    default/ceph-rbd-pvc   rbd                     16s

NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/ceph-rbd-pvc   Bound    pvc-83fd673e-077c-4d24-b9c9-290118586bd3   1Gi        RWO            rbd            16s
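
To confirm the volume is really backed by Ceph, list the RBD images in the kube pool on the admin node (ceph-csi names them csi-vol-<uuid>) and check the mount inside the Pod:

# On the Ceph admin node
sudo rbd ls -p kube

# Inside the Pod: the RBD-backed filesystem should be mounted at the declared path
kubectl exec pod-using-ceph-rbd -- df -h /var/lib/www/html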



Best Practices and Recommendations


Infrastructure as Code Benefits

  1. Automation: Cephadm-ansible eliminates manual configuration steps
  2. Repeatability: Scripts ensure consistent deployments across environments
  3. Version Control: Infrastructure configurations can be versioned and tracked
  4. Disaster Recovery: Automated procedures enable rapid cluster recovery


Production Considerations

  1. Multi-Node Setup: Use multiple MON and MGR nodes for high availability (see the example after this list)
  2. Storage Replication: Configure appropriate replication levels for data durability
  3. Monitoring: Implement comprehensive monitoring for cluster health
  4. Backup Strategy: Establish regular backup procedures for critical data
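
For item 1, cephadm can scale out the monitor and manager daemons once more hosts are available; for example, using the three hosts from this article:

# Run three MONs and two MGRs; cephadm schedules them on available hosts
sudo ceph orch apply mon 3
sudo ceph orch apply mgr 2

# Or pin the MON daemons to specific hosts
sudo ceph orch apply mon --placement="test-server test-server-agent test-server-storage"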


Troubleshooting Tips

  1. Log Analysis: Check cephadm and Ansible logs for deployment issues
  2. Network Connectivity: Ensure proper network configuration between nodes
  3. Disk Preparation: Verify disk wiping and LVM cleanup before deployment
  4. Service Dependencies: Ensure container runtime is properly configured



Conclusion

Cephadm-ansible provides a powerful automation framework for Ceph cluster deployment, significantly simplifying the complex process of distributed storage setup. The integration with Kubernetes through Ceph-CSI creates a robust storage solution for cloud-native applications.


Key Achievements:

  1. Automated Deployment: Complete Ceph cluster automation using scripts
  2. Infrastructure as Code: Terraform-based VM provisioning
  3. Kubernetes Integration: Seamless storage integration via Ceph-CSI
  4. Production Ready: Scalable architecture with monitoring capabilities


Operational Benefits:

Future Enhancements:

Learning to configure clusters directly through the Cephadm and Ansible integration, understanding the OSD configuration flow, and wiring Ceph into Kubernetes via StorageClasses provides invaluable experience for automating enterprise-level storage operations.

“Mastering automation tools like cephadm-ansible is essential for building reliable, scalable infrastructure in modern cloud environments.”



References