Blog Post
Vatsal Shah
June 23, 2026
13 min read

Terraform and OpenTofu for Multi-Cloud AI: One Module, Three Hyperscalers

By Vatsal Shah | June 23, 2026 | 26 min read

Table of Contents

  1. The Multi-Cloud AI Infrastructure Blueprint: Core Components
  2. Multi-Cloud Module Architecture: Variables, Providers, and SKUs
  3. Stateful Security: Remote Backends, Workspaces, and Vault Integration
  4. Automated Drift Detection & Self-Healing Cloud Resources
  5. Comparative Analysis: Terraform vs. OpenTofu vs. Pulumi for AI Stacks
  6. Step-by-Step: Deploying a Multi-Cloud AI Stack in HCL
  7. Pitfalls & Industrial Anti-Patterns in AI IaC
  8. Futuristic Horizon: 2027-2030 Roadmap
  9. Key Takeaways
  10. Frequently Asked Questions
  11. About the Author
  12. Conclusion + CTA

AI SUMMARY
  • Unified Interface: Platform teams deploy a single, standardized IaC module interface to coordinate model endpoints across AWS Bedrock, Google Cloud Vertex AI, and Azure OpenAI.
  • State Security: OpenTofu's native state encryption and HashiCorp Vault integrations prevent the leakage of sensitive model credentials and API keys in plain-text state files.
  • Drift Remediation: Automated cron-based drift detection catches manual adjustments made in cloud consoles, restoring the validated architecture baseline automatically.
  • Cost Tagging Sovereignty: Enforcing a universal tagging matrix across all provisioned cloud resources isolates AI model costs for centralized billing dashboards.

GLOSSARY OF TERMS
  • OpenTofu: The open-source, community-driven fork of Terraform created under the Linux Foundation to maintain a neutral, highly extensible IaC engine.
  • Hyperscaler: Large-scale public cloud providers (primarily AWS, GCP, and Microsoft Azure) offering globally distributed utility computing.
  • IaC (Infrastructure as Code): The management and provisioning of infrastructure through machine-readable definition files rather than manual interactive configuration.
  • State File: The database file used by Terraform/OpenTofu to map real-world resources to active configuration declarations.
  • Drift: The delta between the declared infrastructure state in the definition files and the actual configurations of running resources in the cloud console.

Figure: A unified Terraform and OpenTofu module architecture distributing AI resources across AWS, Google Cloud, and Microsoft Azure.

The Multi-Cloud AI Infrastructure Blueprint: Core Components

In 2026, enterprise AI architectures have moved beyond single-cloud lock-in. A resilient, high-performance AI platform draws on the strengths of several public clouds (hyperscalers) at once:

  • AWS Bedrock for zero-operational-overhead access to foundation models (like Claude and Llama).
  • Google Cloud Vertex AI for low-latency training, custom pipelines, and Gemini-based agents.
  • Microsoft Azure OpenAI Service for secure, high-availability deployments of proprietary enterprise models.

Managing this multi-cloud sprawl manually via cloud consoles is an operational nightmare. It introduces configuration inconsistencies, leaves endpoints exposed, and makes cost tracking nearly impossible.

To build a stable platform, organizations require a unified Infrastructure as Code (IaC) blueprint. This blueprint abstracts the unique APIs, resource schemas, and networking parameters of each hyperscaler into a single, queryable control plane.

By standardizing model deployments, vector databases, API gateways, and observability layers into modular IaC definitions, engineers can spin up identical dev, staging, and production environments across all three clouds in minutes.


Multi-Cloud Module Architecture: Variables, Providers, and SKUs

The core architectural block of this approach is the Unified Multi-Cloud AI Module. Rather than writing separate Terraform code bases for each cloud, platform teams design a single module that accepts standardized inputs and translates them into cloud-specific resources.

This is achieved using conditional logic and feature flag variables:

  • Variable Inputs: The module accepts a list of target models, region maps, and tenant access rules.
  • Provider Blocks: The module initializes the AWS, Google Cloud, and Azure providers concurrently.
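  • Resource Maps: Based on the flags (e.g. enable_aws = true), the engine provisions only the resources for the enabled clouds by setting count to 0 on everything else. The provider blocks themselves are still initialized, so a disabled cloud may still need valid credentials at plan time.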
Code
                  ┌──────────────────────────────┐
                  │   Unified AI Module (Input)  │
                  └──────────────┬───────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         ▼                       ▼                       ▼
   [enable_aws]            [enable_gcp]            [enable_azure]
         │                       │                       │
         ▼                       ▼                       ▼
  AWS Bedrock Config      GCP Vertex AI Config     Azure OpenAI Config

This abstraction isolates model SKU differences. The parent module presents a simple, clean interface to developer teams, while hiding the complex network configurations, IAM policies, and VPC endpoints inside its internal files.
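As a minimal sketch of what that interface could look like from a consuming team's side, the root configuration below calls the module with only the feature flags and naming inputs. The local path ./modules/unified-ai and the module label ai_platform are hypothetical; the input names match the variables defined in the step-by-step section later in this post.

```hcl
# Root configuration owned by an application team.
# NOTE: "./modules/unified-ai" is a hypothetical local path used for illustration.
module "ai_platform" {
  source = "./modules/unified-ai"

  environment  = "staging"
  project_name = "sovereign-ai-mesh"

  # Feature flags: only the Azure leg is provisioned in this environment.
  enable_azure_openai = true
  enable_aws_bedrock  = false

  azure_region = "eastus"
  aws_region   = "us-east-1"
}
```

Inside the module, each flag would drive a count expression such as count = var.enable_aws_bedrock ? 1 : 0, so the caller never has to touch cloud-specific resource arguments.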

The following module dependency blueprint illustrates how the parent module aggregates the cloud providers to coordinate model endpoints:

Figure: The Multi-Cloud Module Dependency Graph showing the flow from unified input variables down to specific hyperscaler API resources.


Stateful Security: Remote Backends, Workspaces, and Vault Integration

Deploying AI infrastructure involves handling highly sensitive keys, including vendor API credentials, database connection strings, and TLS certificates. If these keys leak, malicious actors can exploit your endpoints, driving up token bills or scraping corporate data.

Platform teams must secure two critical layers:

1. State File Encryption

The IaC state file contains a full mapping of your resources, including sensitive values stored in plain text. OpenTofu addresses this with native, client-side state encryption, available since OpenTofu 1.7.

Using a key provider such as AWS KMS or GCP KMS (check the OpenTofu documentation for the key providers your version supports), OpenTofu encrypts the state file locally before uploading it to remote backends (like Amazon S3 or Google Cloud Storage). Even if a malicious actor gains access to the storage bucket, they cannot decrypt the state without permissions on the KMS key.
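A minimal sketch of such an encryption block, assuming OpenTofu 1.7 or later and an existing AWS KMS key, could look like the following. The key alias alias/tofu-state and the labels state_key and state_method are hypothetical placeholders.

```hcl
terraform {
  encryption {
    # Key provider: derives data keys from an AWS KMS key.
    # "alias/tofu-state" is a hypothetical alias; substitute your own key.
    key_provider "aws_kms" "state_key" {
      kms_key_id = "alias/tofu-state"
      region     = "us-east-1"
      key_spec   = "AES_256"
    }

    # Encryption method that uses the key provider above.
    method "aes_gcm" "state_method" {
      keys = key_provider.aws_kms.state_key
    }

    # Encrypt the state file and refuse to write it unencrypted.
    state {
      method   = method.aes_gcm.state_method
      enforced = true
    }

    # Saved plan files can contain the same sensitive values, so encrypt them too.
    plan {
      method   = method.aes_gcm.state_method
      enforced = true
    }
  }
}
```

With enforced = true, a misconfigured pipeline fails instead of silently falling back to plain-text state.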

2. Vault Integration

To keep passwords and API keys out of code repositories, integrate the IaC pipeline directly with HashiCorp Vault.

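During the plan and apply phases, the pipeline authenticates to Vault using short-lived OIDC tokens issued by the CI system (for example, GitHub Actions). It fetches dynamic database credentials and model API keys, holds them in memory during execution, and discards them when the run completes. Values read through ordinary data sources are still recorded in the state file, which is one more reason to keep that file encrypted.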

Code
[GitHub Actions OIDC] ──► [HashiCorp Vault] ──► [Dynamic Keys] ──► [OpenTofu Run] ──► [Zero State Leakage]
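The flow above can be sketched in HCL with the Vault provider. This is an illustration under stated assumptions, not a hardened setup: the JWT auth mount at auth/jwt, the role name tofu-ci, the KV v2 mount secret, the secret name ai-platform/model-api, and the api_key field are all hypothetical, and the CI job is assumed to pass its OIDC token in through TF_VAR_vault_jwt.

```hcl
terraform {
  required_providers {
    vault = {
      source  = "hashicorp/vault"
      version = "~> 4.0"
    }
  }
}

variable "vault_addr" {
  type        = string
  description = "Vault server URL"
}

variable "vault_jwt" {
  type        = string
  description = "Short-lived OIDC token issued by the CI system"
  sensitive   = true
}

provider "vault" {
  address = var.vault_addr

  # Exchange the CI-issued JWT for a Vault token at login time.
  # The mount path and role name are hypothetical.
  auth_login {
    path = "auth/jwt/login"
    parameters = {
      role = "tofu-ci"
      jwt  = var.vault_jwt
    }
  }
}

# Read the model API key from a KV v2 mount at run time.
# Data source results are written to state, so keep the state encrypted.
data "vault_kv_secret_v2" "model_api" {
  mount = "secret"
  name  = "ai-platform/model-api"
}

locals {
  model_api_key = data.vault_kv_secret_v2.model_api.data["api_key"]
}
```

Nothing secret lives in the repository here: the only credential is the JWT, which expires shortly after the run.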

Additionally, partition environments (dev, staging, production) using isolated workspaces with distinct backend configurations. This prevents a configuration bug in a development workspace from corrupting production state.
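One way to express that isolation, sketched here for a production environment on the S3 backend, is a dedicated bucket key and lock table per environment. The bucket and table names are hypothetical, and newer releases also offer S3-native locking, so check the backend documentation for your version.

```hcl
terraform {
  backend "s3" {
    # Hypothetical names: use a separate bucket or key prefix per environment.
    bucket = "example-ai-tfstate-prod"
    key    = "ai-platform/prod/terraform.tfstate"
    region = "us-east-1"

    # DynamoDB table used for state locking, so two runs cannot write at once.
    dynamodb_table = "example-ai-tfstate-locks"

    # Server-side encryption at rest for the state object.
    encrypt = true
  }
}
```

A dev or staging configuration would point at its own key (and ideally its own bucket), so a bad apply in one environment cannot overwrite another environment's state.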

The following pipeline flow shows how secrets are injected dynamically during workspace promotions without ever touching code bases:

Figure: The Secure Secret Injection Pipeline showing dynamic authentication and runtime key injection into OpenTofu workspaces.


Automated Drift Detection & Self-Healing Cloud Resources

One of the most common issues in cloud operations is configuration drift. A developer needs to test a model version quickly, logs into the Azure Console, and manually updates an endpoint scaling range. They forget to update the IaC code, leaving a gap between declared state and running reality.

In AI environments, drift is particularly dangerous. If an automated agent begins routing user traffic to a model configuration modified manually, it can lead to unexpected billing spikes or security boundary breaches.

The Drift Remediation Loop

To combat drift, platform teams implement an automated, self-healing remediation loop:

  1. Scheduled Audit: A GitHub Actions or Argo Workflows cron job executes tofu plan -detailed-exitcode every 6 hours.
  2. Drift Detection: An exit code of 0 means no changes and 1 means the plan itself failed. An exit code of 2 indicates that changes exist, and the pipeline triggers an alert.
  3. Auto-Remediation: For non-destructive drift (e.g. scaling limits, logging configurations), the pipeline executes tofu apply -auto-approve, immediately overwriting manual changes and restoring the verified configuration baseline.
Code
[Cloud Console Drift] ──► [tofu plan Audit] ──► [Delta Found] ──► [tofu apply Auto-Apply] ──► [Baseline Restored]

Figure: The Drift Remediation Loop illustrating automated reconciliation of manual console edits back to the git source of truth.

By enforcing this loop, git remains the single source of truth for the entire multi-cloud AI infrastructure, preventing undocumented modifications from degrading security or budget boundaries.


Comparative Analysis: Terraform vs. OpenTofu vs. Pulumi for AI Stacks

Selecting the right IaC tool for managing modern AI clusters requires comparing capabilities across state security, licensing, and provider ecosystems:

| IaC Engine Dimension | HashiCorp Terraform | Linux Foundation OpenTofu | Pulumi (Code-First) | Architectural Verdict |
| --- | --- | --- | --- | --- |
| Licensing & Governance | Business Source License (BSL 1.1), source-available; limits commercial wrapper platforms. | Fully open-source (MPL 2.0 under the Linux Foundation). | Apache 2.0 (open-source engine, commercial SaaS backend). | OpenTofu eliminates licensing risks for enterprise wrapper platforms. |
| State File Security | Relies on backend encryption at rest and bucket policies for access control. | Native client-side state encryption backed by KMS key providers. | SaaS-backend encryption with custom KMS key integrations. | OpenTofu delivers stronger offline-first and private network security. |
| Programming Syntax | Declarative HashiCorp Configuration Language (HCL). | Standard declarative HCL (largely compatible with existing Terraform configurations). | General-purpose languages (Python, TypeScript, Go, C#). | HCL provides deterministic planning; Pulumi fits software developer teams. |
| Provider Ecosystem | Access to the massive HashiCorp Registry. | OpenTofu Registry with backward compatibility hooks. | Native provider mappings and Terraform bridge adapters. | All platforms support broadly equivalent AWS, GCP, and Azure resource sets. |

The table demonstrates that while Pulumi offers a code-first approach, OpenTofu combines the deterministic safety of declarative HCL with modern open-source state encryption.


Step-by-Step: Deploying a Multi-Cloud AI Stack in HCL

Let's write a working multi-cloud module in HCL. For brevity, this version provisions an Azure OpenAI instance and AWS Bedrock invocation logging, applying a standardized tag schema for cross-cloud cost tracking. A Google Cloud Vertex AI leg follows the same flag-and-count pattern.

1. Define variables (variables.tf)

Define inputs to configure SKUs, regions, and cost tagging profiles:

Hcl
variable "environment" {
  type        = string
  description = "Target deployment environment (dev, staging, prod)"
  default     = "dev"
}

variable "project_name" {
  type        = string
  description = "Name of the parent AI project"
  default     = "sovereign-ai-mesh"
}

variable "enable_azure_openai" {
  type        = bool
  description = "Toggle to provision Azure OpenAI resources"
  default     = true
}

variable "enable_aws_bedrock" {
  type        = bool
  description = "Toggle to provision AWS Bedrock configurations"
  default     = true
}

variable "azure_region" {
  type    = string
  default = "eastus"
}

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

2. Configure Providers and Core Resources (main.tf)

Initialize the provider endpoints and configure the resources conditionally:

Hcl
terraform {
  required_version = ">= 1.6.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

provider "azurerm" {
  features {}
}

# Local tags mapped to universal billing schemas
locals {
  billing_tags = {
    Environment = var.environment
    Project     = var.project_name
    CostCenter  = "AI-Infrastructure"
    ManagedBy   = "IaC-OpenTofu"
  }
}

# -----------------------------------------------------------------------
# Microsoft Azure OpenAI Resources
# -----------------------------------------------------------------------
resource "azurerm_resource_group" "ai_rg" {
  count    = var.enable_azure_openai ? 1 : 0
  name     = "${var.project_name}-${var.environment}-rg"
  location = var.azure_region
  tags     = local.billing_tags
}

resource "azurerm_cognitive_account" "openai" {
  count               = var.enable_azure_openai ? 1 : 0
  name                = "${var.project_name}-${var.environment}-openai"
  location            = azurerm_resource_group.ai_rg[0].location
  resource_group_name = azurerm_resource_group.ai_rg[0].name
  kind                = "OpenAI"
  sku_name            = "S0"
  tags                = local.billing_tags
}

resource "azurerm_cognitive_deployment" "gpt4" {
  count                = var.enable_azure_openai ? 1 : 0
  name                 = "gpt-4o-deployment"
  cognitive_account_id = azurerm_cognitive_account.openai[0].id
  model {
    format  = "OpenAI"
    name    = "gpt-4o"
    version = "2024-05-13"
  }
  scale {
    type = "Standard"
  }
}

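# -----------------------------------------------------------------------
# AWS Bedrock Foundation Model Access
# -----------------------------------------------------------------------
# Log group that receives Bedrock invocation logs. It must exist before
# the logging configuration can point at it.
resource "aws_cloudwatch_log_group" "bedrock" {
  count = var.enable_aws_bedrock ? 1 : 0
  name  = "/aws/bedrock/${var.project_name}-${var.environment}"
  tags  = local.billing_tags
}

resource "aws_bedrock_model_invocation_logging_configuration" "logging" {
  count = var.enable_aws_bedrock ? 1 : 0

  # Make sure the role can write to the log group before logging is enabled.
  depends_on = [aws_iam_role_policy.bedrock_logging]

  logging_config {
    embedding_data_delivery_enabled = true
    image_data_delivery_enabled     = true
    text_data_delivery_enabled      = true

    cloudwatch_config {
      log_group_name = aws_cloudwatch_log_group.bedrock[0].name
      role_arn       = aws_iam_role.bedrock_logging[0].arn
    }
  }
}

# Role that the Bedrock service assumes to deliver invocation logs.
resource "aws_iam_role" "bedrock_logging" {
  count = var.enable_aws_bedrock ? 1 : 0
  name  = "${var.project_name}-${var.environment}-bedrock-logging-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "bedrock.amazonaws.com"
        }
      }
    ]
  })
  tags = local.billing_tags
}

# Allow the role to create log streams and write events to the log group.
resource "aws_iam_role_policy" "bedrock_logging" {
  count = var.enable_aws_bedrock ? 1 : 0
  name  = "bedrock-cloudwatch-write"
  role  = aws_iam_role.bedrock_logging[0].id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = "${aws_cloudwatch_log_group.bedrock[0].arn}:*"
      }
    ]
  })
}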

Under this module structure, running tofu apply deploys an isolated, tagged resource group in Microsoft Azure hosting a GPT-4o deployment, and configures centralized CloudWatch invocation logging for AWS Bedrock.

All resources are tagged under a single unified tracking schema, mapping billing reports to centralized enterprise cost-management views.
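The module above covers two of the three clouds. As a hedged sketch of how a Google Cloud leg could follow the same pattern, the fragment below adds a flag-gated Vertex AI endpoint. The variable names enable_gcp_vertex, gcp_project_id, gcp_region, and billing_tags, the endpoint ID 1001, and the display name are hypothetical; inside the real module, billing_tags would come from local.billing_tags rather than a variable. Google Cloud labels only accept lowercase keys and values, so the tag map is normalized first.

```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

variable "enable_gcp_vertex" {
  type    = bool
  default = false
}

variable "gcp_project_id" {
  type = string
}

variable "gcp_region" {
  type    = string
  default = "us-central1"
}

# Stand-in for local.billing_tags so this fragment is self-contained.
variable "billing_tags" {
  type = map(string)
  default = {
    Environment = "dev"
    Project     = "sovereign-ai-mesh"
    CostCenter  = "AI-Infrastructure"
    ManagedBy   = "IaC-OpenTofu"
  }
}

provider "google" {
  project = var.gcp_project_id
  region  = var.gcp_region
}

locals {
  # GCP labels must be lowercase, unlike AWS and Azure tags.
  gcp_labels = { for k, v in var.billing_tags : lower(k) => lower(v) }
}

resource "google_vertex_ai_endpoint" "inference" {
  count        = var.enable_gcp_vertex ? 1 : 0
  name         = "1001" # endpoint IDs are numeric; this value is a placeholder
  display_name = "sovereign-ai-mesh-dev-endpoint"
  location     = var.gcp_region
  labels       = local.gcp_labels
}
```

Normalizing the map in one place keeps the billing schema identical across clouds while respecting each provider's naming rules.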


Pitfalls & Industrial Anti-Patterns in AI IaC

When provisioning distributed AI architectures using IaC tools, platform teams must watch for these common anti-patterns:

  1. State Leakage of Model Keys: Storing raw API tokens or connection strings as default string values in variables files. This causes keys to land in git logs and state databases in plain-text. Always utilize Vault integration or environment variables prefixed with TF_VAR_ injected at run-time.
  2. Hardcoded Regional SKUs: Attempting to provision GPU nodes (such as GCP a2-highgpu-1g instances hosting A100s) in regions that do not physically possess the hardware capacity. This results in plan successes but runtime deployment failures. Always utilize variable regional maps that point only to active GPU regions.
  3. Weak State Locking Configurations: Failing to configure locking on shared remote backends, such as a DynamoDB lock table for the S3 backend (the azurerm backend locks through blob leases). If two automated agents or developer pipelines run apply sequences concurrently without a lock, they can corrupt the state database. Always enforce state locking.
  4. Ignoring Cloud Provider Quotas: Provisioning massive GPU pools or model endpoints without requesting limits increases first. The IaC pipeline will crash mid-apply due to API quota failures, leaving state databases in a partially applied state. Request quota overrides prior to running deployments.
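The first two anti-patterns can be guarded against at the variable level. The sketch below is illustrative only: the variable names model_api_key and gpu_region are hypothetical, and the region list is a placeholder that should be checked against the provider's current GPU availability.

```hcl
# Anti-pattern 1: never give a secret a default value.
# Marking it sensitive redacts it from plan output; supply it at run time,
# for example through the TF_VAR_model_api_key environment variable.
variable "model_api_key" {
  type        = string
  description = "API key for the model endpoint, injected at run time"
  sensitive   = true
}

# Anti-pattern 2: reject regions that are not on the approved GPU list
# at plan time instead of failing mid-apply.
variable "gpu_region" {
  type        = string
  description = "Region for GPU-backed nodes"
  default     = "us-central1"

  validation {
    # Placeholder list: verify against current GPU availability.
    condition     = contains(["us-central1", "europe-west4"], var.gpu_region)
    error_message = "gpu_region must be one of the approved GPU regions."
  }
}
```

A failed validation stops the run before any provider API call is made, which is cheaper than discovering missing capacity halfway through an apply.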

By building security gates, enforcing state locks, and structuring regional inputs dynamically, platform teams can safely deploy highly resilient, multi-cloud AI infrastructure.


Futuristic Horizon: 2027-2030 Roadmap

The evolution of AI infrastructure code is shifting from static resource provisioning to dynamic, intent-based orchestration fabrics:

Code
2026: Multi-cloud HCL modules & remote state locks
   │
   ├──► 2027: Intent-driven IaC (natural language declarations translated to provider graphs)
   │
   └──► 2028-2030: MCP-driven dynamic scheduling meshes across decentralized GPU pools

Between 2026 and 2030, we will see the emergence of Model Context Protocol (MCP) integrations for Infrastructure. In this model, autonomous agents running in local clusters will query physical compute metrics, check real-time billing tariff markets, and dynamically generate and apply OpenTofu delta configurations, shifting model workloads globally to capture optimal pricing.

Engineering groups that standardize on clean, variable-driven, and securely locked HCL modules today will be prepared to interface with these autonomous scheduling meshes as they mature.


Key Takeaways

  • Standardize on Modules: Use a single, variable-driven HCL module to coordinate resources across AWS Bedrock, GCP Vertex AI, and Azure OpenAI.
  • Encrypt the State: Use OpenTofu's native state encryption to protect sensitive model endpoints and access configurations from state database leakage.
  • Inject Secrets Dynamically: Prevent credentials from hitting code files by reading them at runtime from Vault.
  • Enforce Drift Remediation: Schedule automated drift checks (every 6 hours in the loop described above) to detect and overwrite manual console configuration changes.
  • Tag Everything: Apply a unified tagging and labeling schema across all cloud resources to build centralized billing views.

Frequently Asked Questions

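How does OpenTofu handle backward compatibility with old Terraform state files?

OpenTofu 1.6 was released as a drop-in replacement for Terraform 1.6-era configurations and state files. Migrating from those versions typically means swapping the CLI binary and running tofu init, with no HCL syntax modifications required. The two tools have diverged in later releases, so check the OpenTofu migration guide for your Terraform version before switching.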

Can I use OpenTofu to provision local, on-premise GPU clusters?

Yes. Using providers like Libvirt, Nutanix, or Kubernetes device plugins, you can manage on-premise hardware nodes alongside public hyperscalers under a single unified IaC model.

How do cloud cost tagging schemas map to accounting balance sheets?

Under ASC 350-40, costs directly associated with provisioning model endpoints and dedicated database clusters are capitalized as development assets, while run-time query processing fees are expensed as operational costs.

What is the impact of provider update lags on AI services?

When hyperscalers release new model APIs, it can take weeks for official providers to update. To deploy new models immediately, use a generic API provider (for example, the AzAPI provider on Azure or the AWS Cloud Control provider, awscc) or custom REST API call resources within OpenTofu until native resource support is merged.

How do I protect private network connections to AI endpoints in HCL?

Configure private endpoints (such as AWS PrivateLink or Azure Private Link) within your VPC/VNet structures using HCL resources, disabling public internet ingress paths to block external access.
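For the Azure side, a minimal sketch could look like the following. It assumes the virtual network, subnet, and Azure OpenAI account already exist and are passed in by ID; all four variable names and the two resource names are hypothetical.

```hcl
variable "resource_group_name" {
  type = string
}

variable "location" {
  type = string
}

variable "subnet_id" {
  type        = string
  description = "ID of the subnet that hosts the private endpoint"
}

variable "cognitive_account_id" {
  type        = string
  description = "ID of the Azure OpenAI (Cognitive Services) account"
}

# Private endpoint that exposes the OpenAI account inside the VNet only.
resource "azurerm_private_endpoint" "openai" {
  name                = "openai-private-endpoint"
  location            = var.location
  resource_group_name = var.resource_group_name
  subnet_id           = var.subnet_id

  private_service_connection {
    name                           = "openai-connection"
    private_connection_resource_id = var.cognitive_account_id
    subresource_names              = ["account"]
    is_manual_connection           = false
  }
}

# To close the public path as well, set
#   public_network_access_enabled = false
# on the azurerm_cognitive_account resource itself.
```

The private endpoint alone does not block public traffic; the account-level setting noted in the final comment is what disables public ingress.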


About the Author

Vatsal Shah is a technology executive, system architect, and sovereign founder specializing in enterprise AI adoption, digital business transformation, and stateful agentic system integration. Over his career, he has guided global engineering organizations, scaled enterprise software platforms, and designed high-throughput distributed systems that align business operations with emerging technology trends.


Conclusion + CTA

Managing multi-cloud AI infrastructure requires robust, modular, and secure IaC pipelines. By standardizing on unified HCL modules, locking state databases, and automating drift audits, engineering teams can safely provision and cost-optimize their global model resources.

Are you looking to design and automate a multi-cloud AI platform for your enterprise? Get in touch today to schedule a technical architecture session.

Want to work together on business transformation?

Visit my personal hub for advisory scope, or connect on LinkedIn. Every engagement is principal-led with measurable outcomes.
