Debugging Terraform State Errors

How to resolve hanging Terraform apply operations caused by state conflicts


Overview

A terraform apply can occasionally hang without making progress or fail before it completes. A common cause is a mismatch between Terraform’s state file and the resources that actually exist in the cloud provider (GCP, AWS, and so on).

This guide explains how to enable debug logging with the TF_LOG environment variable, how to read the logs to find the failing request, and how to fix the mismatch by removing entries for resources that no longer exist with the terraform state rm command.


Terraform State Error

When Terraform hangs during execution or gets stuck in a state operation, you can enable detailed logging to diagnose the issue. Terraform provides logging through the TF_LOG environment variable:

export TF_LOG=DEBUG
terraform apply -var-file="devqa.tfvars"

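With TF_LOG=DEBUG set, Terraform writes detailed log output to stderr. In this case it includes the API requests and responses made by the provider, which helps show where the run is getting stuck.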
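
Debug output is verbose and goes to the terminal by default, so it can be easier to send it to a file with the TF_LOG_PATH environment variable. The helper below is a minimal sketch; the function name tf_debug_run and the log file name are illustrative and not part of Terraform.

```shell
# tf_debug_run LOG_FILE COMMAND [ARGS...]
# Runs COMMAND with TF_LOG=DEBUG and TF_LOG_PATH=LOG_FILE set for that
# command only, so the current shell is not left with debug logging enabled.
tf_debug_run() {
  tf_debug_log_file="$1"
  shift
  TF_LOG=DEBUG TF_LOG_PATH="$tf_debug_log_file" "$@"
}

# Example (not executed here):
#   tf_debug_run ./terraform-debug.log terraform apply -var-file="devqa.tfvars"
```

Setting the variables per command rather than with export keeps later Terraform runs in the same shell quiet.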


Analyzing the Error Logs

With debug logging enabled, search the output for the request that fails. The provider first logs the outgoing request:

2024-05-13T18:55:35.041+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5: ---[ REQUEST ]---------------------------------------
2024-05-13T18:55:35.041+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5: GET /storage/v1/b/cdn.somaz.link?alt=json&prettyPrint=false HTTP/1.1
2024-05-13T18:55:35.041+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5: Host: storage.googleapis.com
2024-05-13T18:55:35.041+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5: User-Agent: google-api-go-client/0.5 Terraform/1.6.2 (+https://www.terraform.io) Terraform-Plugin-SDK/2.31.0 terraform-provider-google/5.21.0

The critical information is in the response that follows:

2024-05-13T18:55:35.410+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5: HTTP/2.0 404 Not Found
2024-05-13T18:55:35.410+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5:  "error": {
2024-05-13T18:55:35.410+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5:   "code": 404,
2024-05-13T18:55:35.410+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5:   "message": "The specified bucket does not exist.",
2024-05-13T18:55:35.410+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5:   "errors": [
2024-05-13T18:55:35.410+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5:    {
2024-05-13T18:55:35.410+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5:     "message": "The specified bucket does not exist.",
2024-05-13T18:55:35.410+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5:     "domain": "global",
2024-05-13T18:55:35.410+0900 [DEBUG] provider.terraform-provider-google_v5.21.0_x5:     "reason": "notFound"

The response is a 404 Not Found: the provider asked the Cloud Storage API for the bucket cdn.somaz.link, and the API reports that it does not exist. This suggests the bucket was deleted manually, outside of Terraform, so the state file still records a resource that is no longer there. In this case Terraform kept trying to read the missing resource, which is why the apply never finished.

The specific request failing is:

GET /storage/v1/b/cdn.somaz.link?alt=json&prettyPrint=false HTTP/1.1
...
HTTP/2.0 404 Not Found
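
In a long debug log, pairing each 404 response with the request that preceded it can be done with a small awk filter. This is a rough sketch that assumes the log layout shown above and that requests are not interleaved; the function name find_404_requests is hypothetical.

```shell
# find_404_requests LOG_FILE
# Prints the request line that came before each "404" response in LOG_FILE.
find_404_requests() {
  awk '
    # Remember the most recent HTTP request line, minus the log prefix.
    / (GET|POST|PUT|PATCH|DELETE) \/[^ ]* HTTP\// {
      req = $0
      sub(/^.*: /, "", req)
    }
    # When a 404 status line appears, print the remembered request once.
    / HTTP\/[0-9.]+ 404 / && req != "" {
      print req
      req = ""
    }
  ' "$1"
}

# Example (not executed here):
#   find_404_requests ./terraform-debug.log
```

If the provider issues requests concurrently, a response may not directly follow its own request, so treat the output as a hint and confirm it against the surrounding log lines.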


Resolving State Inconsistencies

To resolve this issue, we need to identify which resources in the Terraform state no longer exist in the cloud infrastructure and remove them from the state file.

First, list all resources in the state:

terraform state list

The output lists the address of every resource tracked in the state file:

data.google_client_config.default
google_compute_address.bastion_ip
google_compute_backend_bucket.dev_sm_cdn_bucket_backend
google_compute_global_address.dev1_adam_lb_ip
google_compute_global_address.dev1_multipath_lb_ip
google_compute_global_address.dev_sm_cdn_lb_ip
google_compute_global_address.qa1_adam_lb_ip
google_compute_global_address.qa1_multipath_lb_ip
google_compute_global_address.qa_sm_cdn_lb_ip
google_compute_global_address.review_multipath_lb_ip
google_storage_bucket.dev_sm_cdn_somaz_run
module.gcs_buckets.google_storage_bucket.buckets["devqa-sm-terraform-remote-tfstate"]
module.gke_autopilot.google_container_cluster.primary

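The failing request targets a Cloud Storage bucket, so the candidates are the google_storage_bucket entries. Comparing the bucket name in the error log with the state list identifies google_storage_bucket.dev_sm_cdn_somaz_run as the resource that was manually deleted.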
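
One way to narrow the comparison is to filter the state list by resource type, since the failing request hit the Cloud Storage bucket API. A small sketch, where bucket_candidates is a hypothetical helper name:

```shell
# Reads `terraform state list` output on stdin and keeps only
# google_storage_bucket resources (including ones inside modules).
bucket_candidates() {
  grep -F 'google_storage_bucket.'
}

# Example (not executed here):
#   terraform state list | bucket_candidates
#   terraform state show google_storage_bucket.dev_sm_cdn_somaz_run
```

terraform state show prints the attributes recorded for one address, so the recorded bucket name can be checked against the name in the failing request.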

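Remove this resource from the state file. The terraform state rm command only makes Terraform stop tracking the resource; it does not delete anything in the cloud: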

terraform state rm google_storage_bucket.dev_sm_cdn_somaz_run
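
Because terraform state rm changes the state immediately, it may be worth saving a copy of the state and previewing the removal first. The sketch below uses terraform state pull and the -dry-run flag of terraform state rm; the function name preview_state_rm and the backup file naming are assumptions made for illustration.

```shell
# preview_state_rm ADDRESS
# Saves the current state to a timestamped file, then shows what
# `terraform state rm` would remove without modifying the state.
preview_state_rm() {
  rm_addr="$1"
  rm_backup="state-backup-$(date +%Y%m%d-%H%M%S).tfstate"

  # state pull prints the current state as JSON, for local or remote backends.
  terraform state pull > "$rm_backup" || return 1

  # -dry-run prints the matching resources and leaves the state untouched.
  terraform state rm -dry-run "$rm_addr" || return 1

  printf 'State backup written to %s\n' "$rm_backup"
}

# Example (not executed here):
#   preview_state_rm google_storage_bucket.dev_sm_cdn_somaz_run
```

If the preview lists only the intended address, the real terraform state rm can be run afterwards.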

After removing the problematic resource from the state file, run the Terraform apply command again:

terraform apply -var-file="devqa.tfvars"

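The operation should now complete. If the bucket is still declared in the configuration, Terraform will plan to create it again; remove the resource block as well if the bucket is no longer needed.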


Checking Resource Dependencies

If you suspect dependency issues between resources or modules, Terraform provides a way to visualize these relationships:

terraform graph

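The command prints the dependency graph in DOT format. To render it as an SVG image, pipe the output through the dot command from Graphviz, which must be installed separately: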

terraform graph | dot -Tsvg > graph.svg

This graph helps identify resource dependencies that might be causing issues or conflicts in your infrastructure.
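
For a single resource, the DOT text can also be filtered directly instead of rendering the whole graph. A minimal sketch, assuming each edge is printed on one line containing ->; graph_edges_for is a hypothetical helper name, and node labels differ between Terraform versions.

```shell
# graph_edges_for ADDRESS
# Reads DOT text on stdin and prints only the edge lines that mention ADDRESS.
graph_edges_for() {
  grep -F -- '->' | grep -F -- "$1"
}

# Example (not executed here):
#   terraform graph | graph_edges_for google_storage_bucket.dev_sm_cdn_somaz_run
```

Any edge that points at a resource you plan to remove from the state is worth checking before the next apply.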


Conclusion

Terraform is a powerful tool for declarative infrastructure management, but mismatches between the state file and actual resources can cause unexpected errors. These issues often occur when resources are deleted manually, outside of Terraform’s control.

You can resolve these problems by:

  1. Enabling detailed logging with TF_LOG=DEBUG to identify the source of errors
  2. Looking for 404 or “notFound” errors in the logs
  3. Using terraform state rm to clean up the state file

For complex dependency relationships, the terraform graph command can help visualize and debug state management issues.

When working in production environments, always back up your state file before modifying it manually, and follow your team’s shared guidelines when making such changes.


