Setting up GPU Server on Google Cloud Platform

Overview
Learn how to set up and configure NVIDIA A100 GPU servers on Google Cloud Platform (GCP) for AI/ML workloads.
GPU Availability Check
The available GPU platforms can be found in the link below.
The GPUs on offer differ by zone within each region. This guide uses the NVIDIA A100 as its example.
# Check A100 availability in Seoul region
gcloud compute accelerator-types list --filter="name:nvidia-tesla-a100 AND zone:asia-northeast3"
NAME ZONE DESCRIPTION
nvidia-tesla-a100 asia-northeast3-a NVIDIA A100 40GB
nvidia-tesla-a100 asia-northeast3-b NVIDIA A100 40GB
# Check A100 availability in all regions
gcloud compute accelerator-types list |grep a100
nvidia-a100-80gb us-central1-a NVIDIA A100 80GB
nvidia-tesla-a100 us-central1-a NVIDIA A100 40GB
nvidia-tesla-a100 us-central1-b NVIDIA A100 40GB
nvidia-a100-80gb us-central1-c NVIDIA A100 80GB
nvidia-tesla-a100 us-central1-c NVIDIA A100 40GB
nvidia-tesla-a100 us-central1-f NVIDIA A100 40GB
nvidia-tesla-a100 us-west1-b NVIDIA A100 40GB
nvidia-tesla-a100 us-east1-a NVIDIA A100 40GB
nvidia-tesla-a100 us-east1-b NVIDIA A100 40GB
nvidia-tesla-a100 asia-northeast1-a NVIDIA A100 40GB
nvidia-tesla-a100 asia-northeast1-c NVIDIA A100 40GB
nvidia-tesla-a100 asia-southeast1-b NVIDIA A100 40GB
nvidia-a100-80gb asia-southeast1-c NVIDIA A100 80GB
nvidia-tesla-a100 asia-southeast1-c NVIDIA A100 40GB
nvidia-a100-80gb us-east4-c NVIDIA A100 80GB
nvidia-tesla-a100 europe-west4-b NVIDIA A100 40GB
nvidia-a100-80gb europe-west4-a NVIDIA A100 80GB
nvidia-tesla-a100 europe-west4-a NVIDIA A100 40GB
nvidia-tesla-a100 asia-northeast3-a NVIDIA A100 40GB
nvidia-tesla-a100 asia-northeast3-b NVIDIA A100 40GB
nvidia-tesla-a100 us-west3-b NVIDIA A100 40GB
nvidia-tesla-a100 us-west4-b NVIDIA A100 40GB
nvidia-a100-80gb us-east7-a NVIDIA A100 80GB
nvidia-tesla-a100 us-east7-b NVIDIA A100 40GB
nvidia-a100-80gb us-east5-b NVIDIA A100 80GB
nvidia-tesla-a100 me-west1-b NVIDIA A100 40GB
nvidia-tesla-a100 me-west1-c NVIDIA A100 40GB
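The listing above can be filtered programmatically when you need to pick a zone. A minimal sketch, using sample lines copied from the `gcloud compute accelerator-types list` output above (columns: NAME, ZONE, DESCRIPTION):

```python
# Sample lines mirroring the `gcloud compute accelerator-types list` output above.
sample_output = """\
nvidia-a100-80gb  us-central1-a     NVIDIA A100 80GB
nvidia-tesla-a100 us-central1-a     NVIDIA A100 40GB
nvidia-a100-80gb  us-east4-c        NVIDIA A100 80GB
nvidia-tesla-a100 asia-northeast3-a NVIDIA A100 40GB
"""

def zones_for(accelerator: str, output: str) -> list[str]:
    """Return the zones whose NAME column matches the given accelerator type."""
    zones = []
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0] == accelerator:
            zones.append(fields[1])
    return zones

print(zones_for("nvidia-a100-80gb", sample_output))
# ['us-central1-a', 'us-east4-c']
```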
How to Lift the GPU Quota and Select Compute Engine
The NVIDIA A100 80GB has a default quota of 0, so you must apply for a quota increase before you can use it.
After identifying the quota name of the desired A100 GPU by referring to document [1], you can request the quota increase by following the procedure in document [2].
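To check whether a region's quota is still at the default of 0, you can inspect the `quotas` field of `gcloud compute regions describe <region> --format=json`. A minimal sketch, assuming the JSON shape shown in the sample below (the metric names and values here are illustrative, not real quota data):

```python
import json

# Shaped like the `quotas` field of
# `gcloud compute regions describe <region> --format=json`.
# Values are illustrative only.
region_json = json.loads("""
{
  "quotas": [
    {"metric": "NVIDIA_A100_GPUS",      "limit": 2.0, "usage": 0.0},
    {"metric": "NVIDIA_A100_80GB_GPUS", "limit": 0.0, "usage": 0.0}
  ]
}
""")

def gpu_quotas(region: dict) -> dict[str, float]:
    """Map each GPU quota metric to its limit."""
    return {
        q["metric"]: q["limit"]
        for q in region["quotas"]
        if "GPUS" in q["metric"]
    }

limits = gpu_quotas(region_json)
print(limits)  # a limit of 0.0 means a quota increase request is required
```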
GPU Price
You can check it in the link below.
Terraform Configuration Example
## ai_server ##
resource "google_compute_address" "ai_server_ip" {
  name   = var.ai_server_ip
  region = var.region
}

resource "google_compute_instance" "ai_server" {
  name                      = var.ai_server
  machine_type              = "a2-highgpu-2g" # a2-ultragpu-2g = 2x A100 80GB / a2-highgpu-2g = 2x A100 40GB
  labels                    = local.default_labels
  zone                      = "${var.region}-a"
  allow_stopping_for_update = true
  tags                      = [var.nfs_client]

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2204-lts"
      size  = 100
    }
  }

  metadata = {
    ssh-keys              = "somaz:${file("../../key/ai-server.pub")}"
    install-nvidia-driver = "true"
  }

  network_interface {
    network    = "projects/${var.host_project}/global/networks/${var.shared_vpc}"
    subnetwork = "projects/${var.host_project}/regions/${var.region}/subnetworks/${var.subnet_share}-ai-b"

    access_config {
      ## Include this section to give the VM an external IP ##
      nat_ip = google_compute_address.ai_server_ip.address
    }
  }

  scheduling {
    on_host_maintenance = "TERMINATE" # GPU instances cannot use "MIGRATE"
    automatic_restart   = true
    preemptible         = false
  }

  guest_accelerator {
    type  = "nvidia-tesla-a100" # nvidia-a100-80gb = A100 80GB / nvidia-tesla-a100 = A100 40GB
    count = 2
  }

  depends_on = [google_compute_address.ai_server_ip]
}
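The `machine_type` and `guest_accelerator` settings above have to be consistent: each A2 machine type comes with a fixed A100 configuration. A small helper sketch of that mapping (a partial list for illustration; verify against the current GCP machine-type documentation):

```python
# Common A2 machine types and the A100 configuration they imply.
# (Partial list for illustration; check GCP docs for the full set.)
A2_SHAPES = {
    "a2-highgpu-1g":  ("nvidia-tesla-a100", 1),  # 1x A100 40GB
    "a2-highgpu-2g":  ("nvidia-tesla-a100", 2),  # 2x A100 40GB
    "a2-highgpu-4g":  ("nvidia-tesla-a100", 4),
    "a2-highgpu-8g":  ("nvidia-tesla-a100", 8),
    "a2-ultragpu-1g": ("nvidia-a100-80gb", 1),   # 1x A100 80GB
    "a2-ultragpu-2g": ("nvidia-a100-80gb", 2),   # 2x A100 80GB
}

def accelerator_for(machine_type: str) -> tuple[str, int]:
    """Return the (accelerator type, count) implied by an A2 machine type."""
    return A2_SHAPES[machine_type]

# The Terraform above uses a2-highgpu-2g, which implies:
print(accelerator_for("a2-highgpu-2g"))  # ('nvidia-tesla-a100', 2)
```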
GPU Server Setup
# OS version
lsb_release -a
# Confirm GPU
sudo lspci | grep -i nvidia
sudo lshw -c display
# Install required packages
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo apt install -y nvidia-driver-535
sudo apt install -y nvidia-cuda-toolkit
## Install cuDNN
tar xvf cudnn-linux-x86_64-8.6.0.163_cuda11-archive.tar.xz
cd cudnn-linux-x86_64-8.6.0.163_cuda11-archive
# Copy header files
sudo cp include/cudnn*.h /usr/include
# Copy library files
sudo cp lib/libcudnn* /usr/lib/x86_64-linux-gnu
# Set permissions and update library cache
sudo chmod a+r /usr/include/cudnn*.h /usr/lib/x86_64-linux-gnu/libcudnn*
sudo ldconfig
# Verify installation
## nvidia version
nvidia-smi
## cuda version
nvcc --version
## cudnn test
cat <<EOF > cudnn_test.cpp
#include <cudnn.h>
#include <iostream>

int main() {
    cudnnHandle_t cudnn;
    cudnnCreate(&cudnn);
    std::cout << "CuDNN version: " << CUDNN_VERSION << std::endl;
    cudnnDestroy(cudnn);
    return 0;
}
EOF
# Compile
nvcc -o cudnn_test cudnn_test.cpp -lcudnn
# Run
./cudnn_test
CuDNN version: 8600
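The `8600` printed above is cuDNN's packed version integer, defined in the cuDNN headers as `CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL`. It can be unpacked like this:

```python
def decode_cudnn_version(packed: int) -> tuple[int, int, int]:
    """Unpack CUDNN_VERSION = major*1000 + minor*100 + patchlevel."""
    major, rest = divmod(packed, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch

print(decode_cudnn_version(8600))  # (8, 6, 0) -> cuDNN 8.6.0
```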
(Optional) nvidia-docker
sudo apt install docker.io
distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
  && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
  && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
# test
sudo docker run --rm --gpus all ubuntu:18.04 nvidia-smi
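If you launch GPU containers from scripts, it can help to assemble the `docker run` arguments rather than hard-code them. A small sketch that builds the same invocation as the test above (the helper and its defaults are illustrative, not part of Docker's API):

```python
import shlex

def docker_gpu_cmd(image: str, command: str, gpus: str = "all") -> list[str]:
    """Build a `docker run` argv that exposes GPUs to the container."""
    return ["docker", "run", "--rm", f"--gpus={gpus}", image] + shlex.split(command)

cmd = docker_gpu_cmd("ubuntu:18.04", "nvidia-smi")
print(" ".join(cmd))  # docker run --rm --gpus=all ubuntu:18.04 nvidia-smi
```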
Application Setup
Stable Diffusion WebUI
wget -q https://raw.githubusercontent.com/AUTOMATIC1111/stable-diffusion-webui/master/webui.sh
chmod +x webui.sh
./webui.sh --listen
Kohya_ss
git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
./setup.sh
./gui.sh --listen=0.0.0.0 --headless
ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
python3 main.py --listen 0.0.0.0
Pricing Information
| Machine Type   | GPU Configuration | Monthly Cost | Daily Cost |
|----------------|-------------------|--------------|------------|
| a2-highgpu-2g  | 2x A100 40GB      | ~₩7.9M       | ~₩240K     |
| a2-ultragpu-2g | 2x A100 80GB      | ~₩12.4M      | ~₩410K     |
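The monthly figures translate back into daily and hourly costs. A quick sketch, assuming 730 hours per month (GCP's usual monthly billing approximation) and using the rough KRW figure from the table, not an official price:

```python
HOURS_PER_MONTH = 730  # GCP's usual monthly billing approximation

def breakdown(monthly_krw: float) -> tuple[float, float]:
    """Return (daily, hourly) cost derived from a monthly cost."""
    hourly = monthly_krw / HOURS_PER_MONTH
    return hourly * 24, hourly

daily, hourly = breakdown(7_900_000)  # rough a2-highgpu-2g monthly KRW, from the table
print(f"daily ~₩{daily:,.0f}, hourly ~₩{hourly:,.0f}")
```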
Best Practices
1. GPU Selection: - Choose appropriate GPU type
- Consider memory requirements
- Check zone availability
2. Cost Optimization: - Monitor usage
- Use preemptible instances
- Schedule workloads
3. Security: - Configure firewall rules
- Use service accounts
- Enable monitoring
4. Performance: - Install latest drivers
- Optimize CUDA settings
- Monitor GPU utilization