Setting up GPU Server on Google Cloud Platform


Overview

Learn how to set up and configure NVIDIA A100 GPU servers on Google Cloud Platform (GCP) for AI/ML workloads.


GPU Availability Check

The GPU platforms available on GCP can be found at the link below.

The available GPUs differ from zone to zone within each region. This guide uses the NVIDIA A100 as the example.

# Check A100 availability in Seoul region
gcloud compute accelerator-types list --filter="name:nvidia-tesla-a100 AND zone:asia-northeast3"
NAME               ZONE               DESCRIPTION
nvidia-tesla-a100  asia-northeast3-a  NVIDIA A100 40GB
nvidia-tesla-a100  asia-northeast3-b  NVIDIA A100 40GB

# Check A100 availability in all regions
gcloud compute accelerator-types list | grep a100
nvidia-a100-80gb       us-central1-a              NVIDIA A100 80GB
nvidia-tesla-a100      us-central1-a              NVIDIA A100 40GB
nvidia-tesla-a100      us-central1-b              NVIDIA A100 40GB
nvidia-a100-80gb       us-central1-c              NVIDIA A100 80GB
nvidia-tesla-a100      us-central1-c              NVIDIA A100 40GB
nvidia-tesla-a100      us-central1-f              NVIDIA A100 40GB
nvidia-tesla-a100      us-west1-b                 NVIDIA A100 40GB
nvidia-tesla-a100      us-east1-a                 NVIDIA A100 40GB
nvidia-tesla-a100      us-east1-b                 NVIDIA A100 40GB
nvidia-tesla-a100      asia-northeast1-a          NVIDIA A100 40GB
nvidia-tesla-a100      asia-northeast1-c          NVIDIA A100 40GB
nvidia-tesla-a100      asia-southeast1-b          NVIDIA A100 40GB
nvidia-a100-80gb       asia-southeast1-c          NVIDIA A100 80GB
nvidia-tesla-a100      asia-southeast1-c          NVIDIA A100 40GB
nvidia-a100-80gb       us-east4-c                 NVIDIA A100 80GB
nvidia-tesla-a100      europe-west4-b             NVIDIA A100 40GB
nvidia-a100-80gb       europe-west4-a             NVIDIA A100 80GB
nvidia-tesla-a100      europe-west4-a             NVIDIA A100 40GB
nvidia-tesla-a100      asia-northeast3-a          NVIDIA A100 40GB
nvidia-tesla-a100      asia-northeast3-b          NVIDIA A100 40GB
nvidia-tesla-a100      us-west3-b                 NVIDIA A100 40GB
nvidia-tesla-a100      us-west4-b                 NVIDIA A100 40GB
nvidia-a100-80gb       us-east7-a                 NVIDIA A100 80GB
nvidia-tesla-a100      us-east7-b                 NVIDIA A100 40GB
nvidia-a100-80gb       us-east5-b                 NVIDIA A100 80GB
nvidia-tesla-a100      me-west1-b                 NVIDIA A100 40GB
nvidia-tesla-a100      me-west1-c                 NVIDIA A100 40GB
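
On GCP, A100 GPUs attach only to A2 machine types, so it is also worth confirming that the matching machine type is offered in the target zone. A quick check (the zone is an example):

# Confirm the A2 machine type exists in the zone
gcloud compute machine-types list --filter="name=a2-highgpu-2g AND zone:asia-northeast3-a"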


GPU Quota Increase Request

The NVIDIA A100 80GB is quota-limited by default, so you need to request a quota increase before you can create an instance.

After identifying the quota name of the desired A100 GPU by referring to document [1], you can request the increase in the same manner as described in document [2].
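
As a quick check before filing the request, the current limit and usage of GPU quotas in a region can be read with gcloud (the region is an example; quota metric names such as NVIDIA_A100_GPUS appear in the output):

# Show GPU-related quota limits and usage for a region
gcloud compute regions describe asia-northeast3 | grep -B1 -A1 NVIDIA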


GPU Price

Current GPU pricing can be checked at the link below.


Terraform Configuration Example

## ai_server ##
resource "google_compute_address" "ai_server_ip" {
  name   = var.ai_server_ip
  region = var.region
}

resource "google_compute_instance" "ai_server" {
  name                      = var.ai_server
  machine_type              = "a2-highgpu-2g" # a2-ultragpu-2g = A100 80G 2 / a2-highgpu-2g = A100 40G 2
  labels                    = local.default_labels
  zone                      = "${var.region}-a"
  allow_stopping_for_update = true

  tags = [var.nfs_client]

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2204-lts"
      size  = 100
    }
  }

  metadata = {
    ssh-keys              = "somaz:${file("../../key/ai-server.pub")}"
    install-nvidia-driver = "true"
  }

  network_interface {
    network    = "projects/${var.host_project}/global/networks/${var.shared_vpc}"
    subnetwork = "projects/${var.host_project}/regions/${var.region}/subnetworks/${var.subnet_share}-ai-b"

    access_config {
      ## Include this section to give the VM an external ip ##
      nat_ip = google_compute_address.ai_server_ip.address
    }
  }

  scheduling {
    on_host_maintenance = "TERMINATE" # or use "RESTART" instead of "MIGRATE"
    automatic_restart   = true
    preemptible         = false
  }

  guest_accelerator {
    type  = "nvidia-tesla-a100" # nvidia-a100-80gb = A100 80G / nvidia-tesla-a100 = A100 40G
    count = 2
  }

  depends_on = [google_compute_address.ai_server_ip]

}
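
A minimal sketch of applying this configuration, assuming the referenced variables (var.region, var.ai_server, etc.) and the SSH key file are defined in the surrounding module:

terraform init
terraform plan
terraform apply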


GPU Server Setup

# OS version
lsb_release -a

# Confirm GPU
sudo lspci | grep -i nvidia
sudo lshw -c display

# Install required packages
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo apt install -y nvidia-driver-535
sudo apt install -y nvidia-cuda-toolkit

## Install cuDNN (download the archive from the NVIDIA Developer site first; it requires a login)
tar xvf cudnn-linux-x86_64-8.6.0.163_cuda11-archive.tar.xz
cd cudnn-linux-x86_64-8.6.0.163_cuda11-archive

# Copy header files
sudo cp include/cudnn*.h /usr/include

# Copy library files
sudo cp lib/libcudnn* /usr/lib/x86_64-linux-gnu

# Set permissions and update library cache
sudo chmod a+r /usr/include/cudnn*.h /usr/lib/x86_64-linux-gnu/libcudnn*
sudo ldconfig

# Verify installation

## nvidia version
nvidia-smi

## cuda version
nvcc --version
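
## cudnn version (reads the header copied above; cudnn_version.h is present in cuDNN 8.x)
grep -A2 CUDNN_MAJOR /usr/include/cudnn_version.h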

## cudnn test
cat <<EOF > cudnn_test.cpp
#include <cudnn.h>
#include <iostream>

int main() {
    cudnnHandle_t cudnn;
    cudnnCreate(&cudnn);
    std::cout << "CuDNN version: " << CUDNN_VERSION << std::endl;
    cudnnDestroy(cudnn);
    return 0;
}
EOF

# Compile
nvcc -o cudnn_test cudnn_test.cpp -lcudnn

# Run
./cudnn_test
CuDNN version: 8600

(Optional) nvidia-docker

sudo apt install docker.io

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt update
sudo apt-get install -y nvidia-docker2

sudo systemctl restart docker

# test
sudo docker run --rm --gpus all ubuntu:18.04 nvidia-smi
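
# Alternatively, test with a CUDA base image (the tag is an example and may change)
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi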


Application Setup

Stable Diffusion WebUI

wget -q https://raw.githubusercontent.com/AUTOMATIC1111/stable-diffusion-webui/master/webui.sh
chmod +x webui.sh
./webui.sh --listen

Kohya_ss

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
./setup.sh
./gui.sh --listen=0.0.0.0 --headless

ComfyUI

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
python3 main.py --listen 0.0.0.0
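
These UIs listen on all interfaces, so reaching them from outside requires a firewall rule. A sketch with gcloud (the rule name, network, and source range are placeholders; 7860 is the usual Gradio port for the WebUI and Kohya_ss, 8188 for ComfyUI):

# Hypothetical rule: open the web UI ports to a trusted source IP only
gcloud compute firewall-rules create allow-ai-webui \
  --network=<shared_vpc> --direction=INGRESS \
  --allow=tcp:7860,tcp:8188 \
  --source-ranges=<trusted-ip>/32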


Pricing Information

Machine Type     GPU Configuration   Monthly Cost   Daily Cost
a2-highgpu-2g    2x A100 40GB        ~₩7.9M         ~₩240K
a2-ultragpu-2g   2x A100 80GB        ~₩12.4M        ~₩410K


Best Practices


1. GPU Selection:
   - Choose the appropriate GPU type for the workload
   - Consider memory requirements
   - Check zone availability

2. Cost Optimization:
   - Monitor usage
   - Use preemptible instances for interruption-tolerant workloads
   - Schedule workloads

3. Security:
   - Configure firewall rules
   - Use service accounts
   - Enable monitoring

4. Performance:
   - Install the latest drivers
   - Optimize CUDA settings
   - Monitor GPU utilization (see the sketch below)
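
For the utilization monitoring above, a minimal sketch using the tools that ship with the NVIDIA driver:

# Stream per-GPU utilization once per second
nvidia-smi dmon -s u

# Or refresh the full device status every 5 seconds
watch -n 5 nvidia-smi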



Reference