1.5.2.1. End-to-End Testing

In the process of large-scale maintenance and governance of Terraform modules, End-to-End (E2E) testing is a critical link for ensuring the quality of infrastructure code. This chapter will delve into how to use Terratest and the Azure terraform-module-test-helper tool to conduct comprehensive end-to-end testing on Terraform modules.

1.5.2.1.1. What is End-to-End Testing?

End-to-end testing is a testing method that simulates the actual deployment process, designed to verify that the entire infrastructure deployment meets expectations. Unlike unit testing or integration testing, end-to-end testing focuses on the overall behavior of the system, ensuring that various components work together correctly in a real operating environment.

In the context of Terraform, end-to-end testing typically involves the following steps:

Deploy Test Environment: Use Terraform to deploy the module into an isolated test environment.
Validate Deployment Results (Optional): Verify the functionality and configuration of resources through actual operations (such as network connection tests, API calls, etc.).
Clean Up Test Environment: After testing is complete, destroy all deployed resources to ensure the environment is kept clean.

This testing method helps developers discover potential issues before pushing code to the production environment, thereby reducing risk.
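The three steps above map naturally onto a function that always runs cleanup, even when deployment or validation fails. The following is a minimal sketch using only the Go standard library. `runE2E` and `errDemo` are hypothetical names, and the three callbacks stand in for real Terraform calls.

```go
package main

import (
    "errors"
    "fmt"
)

// errDemo is a made-up failure used to exercise the validation path.
var errDemo = errors.New("demo validation failure")

// runE2E (hypothetical helper) deploys, optionally validates, and always
// destroys, so a failed check never leaves resources behind.
func runE2E(deploy, validate, destroy func() error) (err error) {
    if err = deploy(); err != nil {
        // A failed deployment can still leave partial resources, so clean up.
        return errors.Join(fmt.Errorf("deploy: %w", err), destroy())
    }
    defer func() {
        // Combine a cleanup failure with any earlier validation error.
        err = errors.Join(err, destroy())
    }()
    if validate == nil {
        return nil // validation is optional
    }
    if vErr := validate(); vErr != nil {
        return fmt.Errorf("validate: %w", vErr)
    }
    return nil
}

func main() {
    var steps []string
    record := func(name string, result error) func() error {
        return func() error {
            steps = append(steps, name)
            return result
        }
    }
    err := runE2E(record("deploy", nil), record("validate", errDemo), record("destroy", nil))
    fmt.Println(steps, err)
}
```

The deferred cleanup is the important design choice: the destroy step runs on every path once deployment has succeeded, which mirrors how E2E tests should behave in a shared subscription.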

1.5.2.1.2. Why Choose Terratest?

Terratest is an open-source Go testing library developed by Gruntwork, designed for testing Infrastructure as Code (IaC) tools such as Terraform. It supports the automated deployment, validation, and destruction of infrastructure resources.

1.5.2.1.2.1. Advantages of Terratest

Automated Testing Workflow: Terratest can automatically execute commands like terraform init, apply, and destroy, simplifying the testing process.
Flexible Validation Mechanism: By writing test logic in Go, complex validation operations can be implemented, such as API calls, port checks, etc.
Integration Test Stage Control: Using the test_structure module, tests can be divided into multiple stages, facilitating debugging and reuse.

1.5.2.1.2.2. Typical Terratest Structure

A typical Terratest test file usually includes the following structure:

package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    test_structure "github.com/gruntwork-io/terratest/modules/test-structure"
)

func TestTerraformModule(t *testing.T) {
    t.Parallel()

    // Set test directory
    exampleDir := "../examples/basic"

    // Deploy stage
    test_structure.RunTestStage(t, "deploy", func() {
        terraformOptions := &terraform.Options{
            TerraformDir: exampleDir,
        }
        test_structure.SaveTerraformOptions(t, exampleDir, terraformOptions)
        terraform.InitAndApply(t, terraformOptions)
    })

    // Validate stage
    test_structure.RunTestStage(t, "validate", func() {
        terraformOptions := test_structure.LoadTerraformOptions(t, exampleDir)
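        output := terraform.Output(t, terraformOptions, "resource_id")
        // Add further validation logic here, e.g., checking that the resource exists.
        // The value must be used, because Go rejects unused variables.
        if output == "" {
            t.Error("expected a non-empty resource_id output")
        }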
    })

    // Teardown stage
    test_structure.RunTestStage(t, "destroy", func() {
        terraformOptions := test_structure.LoadTerraformOptions(t, exampleDir)
        terraform.Destroy(t, terraformOptions)
    })
}

Through the structure above, developers can clearly divide the test into various stages, making it easier to maintain and extend.
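Stage control is what makes this layout convenient for debugging: Terratest's `RunTestStage` skips a stage when the environment variable `SKIP_<stage name>` is set. The sketch below imitates that behavior with the standard library only. It is not Terratest's implementation; `shouldSkipStage` and `runStage` are hypothetical names, and the injectable `getenv` parameter exists purely to make the logic easy to test.

```go
package main

import (
    "fmt"
    "os"
)

// shouldSkipStage reports whether SKIP_<name> is set to a non-empty value.
func shouldSkipStage(name string, getenv func(string) string) bool {
    return getenv("SKIP_"+name) != ""
}

// runStage runs stage unless it is skipped, and reports whether it ran.
func runStage(name string, getenv func(string) string, stage func()) bool {
    if shouldSkipStage(name, getenv) {
        fmt.Printf("skipping stage %q\n", name)
        return false
    }
    stage()
    return true
}

func main() {
    // Example: run with SKIP_destroy=true to keep resources for inspection.
    for _, name := range []string{"deploy", "validate", "destroy"} {
        stageName := name
        runStage(stageName, os.Getenv, func() { fmt.Println("running", stageName) })
    }
}
```

With this pattern, a developer can deploy once, then rerun only the validate stage repeatedly while iterating on the checks.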

1.5.2.1.3. Azure's terraform-module-test-helper Tool

In the development of Azure Terraform modules, terraform-module-test-helper is a Go library specifically designed to simplify end-to-end testing. It encapsulates common testing logic provided by Terratest and, combined with the structure of Azure Verified Modules, offers a higher level of abstraction, reducing testing complexity.

1.5.2.1.3.1. The RunE2ETest Function

RunE2ETest is the core function in this library for executing end-to-end tests. Its typical usage is as follows:

func TestExample(t *testing.T) {

    test_helper.RunE2ETest(t, "../../", "examples/startup", terraform.Options{
        Upgrade: true,
    }, func(t *testing.T, output test_helper.TerraformOutput) {
        // Add validation logic, e.g., checking if the output meets expectations
        resourceID, ok := output["resource_id"].(string)
        assert.True(t, ok)
        assert.Contains(t, resourceID, "/subscriptions/")
    })
}

This function accepts the root path of the module, the subdirectory of the example code, Terraform configuration options, and a callback function for validating the output. It automatically executes terraform init, apply, and destroy, and calls the callback function for verification after deployment.
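The callback is where module-specific checks live. As a minimal sketch, the check on `resource_id` can be factored into a plain function that is testable without deploying anything. This assumes the outputs arrive as a `map[string]interface{}`, as the type assertion in the snippet above suggests; `checkResourceID` is a hypothetical helper, not part of the library.

```go
package main

import (
    "fmt"
    "strings"
)

// checkResourceID (hypothetical) validates that the "resource_id" output
// exists, is a string, and looks like an Azure resource ID.
func checkResourceID(output map[string]interface{}) error {
    raw, present := output["resource_id"]
    if !present {
        return fmt.Errorf("output %q is missing", "resource_id")
    }
    id, ok := raw.(string)
    if !ok {
        return fmt.Errorf("output %q has type %T, want string", "resource_id", raw)
    }
    // Azure resource IDs have the form /subscriptions/<subscription-id>/...
    parts := strings.Split(id, "/")
    if len(parts) < 3 || parts[0] != "" ||
        !strings.EqualFold(parts[1], "subscriptions") || parts[2] == "" {
        return fmt.Errorf("%q does not look like an Azure resource ID", id)
    }
    return nil
}

func main() {
    output := map[string]interface{}{
        // Placeholder subscription ID, for illustration only.
        "resource_id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/example",
    }
    fmt.Println("validation error:", checkResourceID(output))
}
```

Returning an error instead of asserting directly keeps the helper reusable from both a `RunE2ETest` callback and ordinary unit tests.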

1.5.2.1.4. Parallelization of E2E Tests and Global Configuration Management

In the process of large-scale maintenance and governance of Terraform modules, the efficiency and stability of end-to-end testing are paramount. GitHub Actions imposes a hard limit that matters here: each job may run for at most 6 hours. If we test all examples serially in a single job, the job will time out when there are many examples or when individual examples are slow to deploy. To speed up testing and ensure environmental consistency, we have adopted a parallel testing strategy and introduced a global configuration management mechanism.

1.5.2.1.4.1. Strategy for Parallel Testing

In AVM (Azure Verified Modules) projects, each module usually contains multiple usage examples located in different subdirectories under the examples/ directory. To improve testing efficiency, we utilize the matrix strategy of GitHub Actions to assign an independent Runner instance for each example to conduct parallel testing.

The specific process is as follows:

Get List of Examples: First, execute a task named get-examples to scan the examples/ directory and identify subdirectories containing .tf files.
Generate Test Matrix: Pass the identified list of subdirectories as output to subsequent test tasks to form the GitHub Actions matrix configuration.
Execute Tests in Parallel: For each example in the matrix, GitHub Actions will start an independent Job and use the terraform-module-test-helper tool to execute the test.

This parallel testing strategy significantly shortens the total testing time and improves test coverage and efficiency.
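The first two steps (finding example directories and emitting them as a matrix) can be sketched in Go as follows. This is an illustration of the idea, not the actual `get-examples` job; `exampleDirs` and `matrixJSON` are hypothetical names, and the sketch assumes that an example counts only if it has a `.tf` file directly inside its own folder.

```go
package main

import (
    "encoding/json"
    "fmt"
    "io/fs"
    "os"
    "path"
    "sort"
    "strings"
)

// exampleDirs returns the sorted names of the direct subdirectories that
// contain at least one .tf file at their top level. Paths are
// slash-separated and relative to the examples/ directory.
func exampleDirs(paths []string) []string {
    seen := map[string]bool{}
    for _, p := range paths {
        dir, file := path.Split(p)
        dir = strings.TrimSuffix(dir, "/")
        // Keep files like "<example>/main.tf"; skip root-level and nested files.
        if dir == "" || strings.Contains(dir, "/") || !strings.HasSuffix(file, ".tf") {
            continue
        }
        seen[dir] = true
    }
    dirs := make([]string, 0, len(seen))
    for d := range seen {
        dirs = append(dirs, d)
    }
    sort.Strings(dirs)
    return dirs
}

// matrixJSON renders the list as a JSON array that a workflow could
// consume as a matrix dimension.
func matrixJSON(dirs []string) string {
    b, _ := json.Marshal(dirs) // marshalling a []string cannot fail
    return string(b)
}

func main() {
    var paths []string
    // Walk ./examples if present; walk errors are ignored to keep the sketch short.
    _ = fs.WalkDir(os.DirFS("examples"), ".", func(p string, d fs.DirEntry, err error) error {
        if err == nil && !d.IsDir() {
            paths = append(paths, p)
        }
        return nil
    })
    fmt.Println(matrixJSON(exampleDirs(paths)))
}
```

Sorting the result keeps the matrix stable between runs, which makes job names predictable in the CI interface.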

1.5.2.1.4.2. Global Configuration and Teardown Mechanism

Before executing multiple test tasks in parallel, global configuration of the test environment may be required, such as resetting certain global resources or setting shared environment variables. To this end, we introduced global setup and teardown mechanisms to ensure the consistency and cleanliness of the test environment.

Global Setup

Before executing tests, GitHub Actions checks if a setup.sh script exists in the examples/ directory. If it exists, a Job named globalsetup is executed to run the script and complete the necessary global configuration.

Global Teardown

After tests are completed, GitHub Actions checks if a teardown.sh script exists in the examples/ directory. If it exists, a Job named globalteardown is executed to run the script and clean up global resources created during the testing process.

By placing global cleanup operations in an independent Job and setting dependencies, we ensure that cleanup operations are performed after all test tasks are completed, keeping the environment clean.

1.5.2.1.4.3. Best Practice Recommendations

When implementing parallel testing and global configuration management, it is recommended to follow these best practices:

Idempotency: Ensure that the setup.sh and teardown.sh scripts are idempotent, meaning that running them multiple times leaves the environment in the same state as running them once.
Error Handling: For tests that are known to hit unstable APIs, configure automatic retries using the Retryable Errors feature provided by Terratest. Store the retryable errors alongside the example code in Terragrunt configuration format. This alerts users that running the example may produce retryable errors, and it lets them get the same automatic retries by running the example through Terragrunt with the provided configuration file.
Resource Isolation: In parallel testing, ensure that each test task uses independent resources to avoid resource conflicts.
Logging: Add detailed logging in scripts and tests to facilitate troubleshooting and debugging.

By following the above best practices, the reliability and maintainability of end-to-end testing can be further enhanced.
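To make the error-handling recommendation concrete, the sketch below shows the general shape of "retry only when the error matches a known pattern". Terratest's `terraform.Options` offers this through its `RetryableTerraformErrors`, `MaxRetries`, and `TimeBetweenRetries` fields; the code here is a simplified standard-library imitation, not Terratest's implementation. `retryOnMatch`, `matches`, and `errTransient` are hypothetical names, and the pause between attempts is omitted for brevity.

```go
package main

import (
    "errors"
    "fmt"
    "regexp"
)

// errTransient is a made-up throttling error used for the demonstration.
var errTransient = errors.New("StatusCode=429: too many requests")

// matches reports whether err matches any retryable pattern.
// The map goes from regular expression to a human-readable explanation.
func matches(retryable map[string]string, err error) bool {
    for pattern, reason := range retryable {
        if ok, _ := regexp.MatchString(pattern, err.Error()); ok {
            fmt.Printf("retrying: %s\n", reason)
            return true
        }
    }
    return false
}

// retryOnMatch runs action up to maxRetries+1 times and retries only when
// the returned error matches one of the retryable patterns.
func retryOnMatch(retryable map[string]string, maxRetries int, action func() error) (attempts int, err error) {
    for attempts = 1; ; attempts++ {
        err = action()
        if err == nil || attempts > maxRetries || !matches(retryable, err) {
            return attempts, err
        }
    }
}

func main() {
    calls := 0
    attempts, err := retryOnMatch(
        map[string]string{"StatusCode=429": "API throttling is transient"},
        3,
        func() error {
            calls++
            if calls < 3 {
                return errTransient
            }
            return nil
        },
    )
    fmt.Println("attempts:", attempts, "error:", err)
}
```

Matching on explicit patterns, rather than retrying every failure, keeps genuine module bugs from being hidden behind retries.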

1.5.2.1.5. Challenges from Configuration Drift

In the maintenance and governance of large-scale Terraform modules, ensuring infrastructure configuration consistency is vital. Configuration Drift can lead to security vulnerabilities, compliance issues, and runtime errors. Therefore, integrating drift detection mechanisms into end-to-end testing has become a key step in guaranteeing infrastructure stability.

Below, we will continue to delve into the e2etest.go and upgradetest.go files in the Azure official Terraform module test library terraform-module-test-helper, analyzing how functions like RunE2ETest and initAndPlanAndIdempotentAtEasyMode implement automated drift detection to ensure example code maintains configuration consistency after deployment.

Configuration drift refers to the difference between the actual state of deployed infrastructure and the expected state recorded in the Terraform state file (terraform.tfstate). This difference can be caused by the following reasons:

1. Manual Changes: Operations personnel or developers directly modifying resource configurations in the cloud platform console.
2. External System Intervention: Other automation tools or scripts modifying resources.
3. Automatic Resource Changes: Certain resources automatically adjusting configurations during operation, such as changes in the number of instances in an auto-scaling group.

These drifts can cause inconsistency between infrastructure and code definitions, increasing the risk of failure. Therefore, timely detection and remediation of drift is an important measure to ensure stable system operation.

In our Terraform module maintenance work, we use a dedicated test subscription that no external system touches and that is used solely for testing Terraform modules. It is therefore unaffected by the first two causes (manual changes and external system intervention), but we still need additional test validation to ensure that our modules are not affected by the third (automatic resource changes).

1.5.2.1.6. The initAndPlanAndIdempotentAtEasyMode Function in terraform-module-test-helper

In the upgradetest.go file, the initAndPlanAndIdempotentAtEasyMode function is used to simplify the module initialization, planning, and idempotency verification process. Its core logic includes:

Initialize Terraform Environment: Execute terraform init to ensure the environment is ready.
Generate Execution Plan: Run terraform plan to generate a resource change plan.
Idempotency Verification: Run terraform plan multiple times to ensure that the plan results remain consistent without code changes, verifying the module's idempotency.

Through these steps, the function ensures that the module does not introduce unexpected changes during multiple applications, enhancing the module's stability and reliability. The following is the function definition used to determine whether there is configuration drift in the plan obtained by executing the plan command immediately after apply:

func noChange(changes map[string]*tfjson.ResourceChange) bool {
    if len(changes) == 0 {
        return true
    }
    return linq.From(changes).Select(func(i interface{}) interface{} {
        return i.(linq.KeyValue).Value
    }).All(func(i interface{}) bool {
        change := i.(*tfjson.ResourceChange).Change
        if change == nil {
            return true
        }
        if change.Actions == nil {
            return true
        }
        return change.Actions.NoOp()
    })
}

The purpose of this function is to determine whether the Terraform resource change collection (changes) indicates "no configuration changes" (i.e., no configuration drift).
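For comparison, a coarser way to ask "does a plan right after apply still show changes?" is the Terraform CLI's `-detailed-exitcode` flag, where exit code 0 means an empty diff, 1 means an error, and 2 means changes are present. The sketch below is an alternative illustration and not how the helper library works (it inspects the parsed plan, as shown above); `planHasChanges` is a hypothetical name, and `main` assumes a `terraform` binary on the PATH.

```go
package main

import (
    "errors"
    "fmt"
    "os/exec"
)

// planHasChanges interprets the exit code of
// `terraform plan -detailed-exitcode`.
func planHasChanges(exitCode int) (bool, error) {
    switch exitCode {
    case 0:
        return false, nil // plan succeeded, diff is empty
    case 2:
        return true, nil // plan succeeded, changes are pending
    default:
        return false, fmt.Errorf("terraform plan failed with exit code %d", exitCode)
    }
}

func main() {
    cmd := exec.Command("terraform", "plan", "-detailed-exitcode", "-input=false")
    err := cmd.Run()
    code := 0
    var exitErr *exec.ExitError
    switch {
    case errors.As(err, &exitErr):
        code = exitErr.ExitCode()
    case err != nil:
        fmt.Println("could not run terraform:", err)
        return
    }
    changed, planErr := planHasChanges(code)
    fmt.Println("changes pending:", changed, "error:", planErr)
}
```

The exit code only says whether anything changed; inspecting each resource change, as `noChange` does, also shows which resources are affected.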

1.5.2.1.6.1. Function Input

The changes parameter is a map where the key is the unique identifier of the resource and the value is the corresponding tfjson.ResourceChange object. This represents the resource change details parsed by Terraform after executing terraform plan.

1.5.2.1.6.2. Function Implementation Logic Analysis

The logic of the function is divided into two main parts:

Fast Path Check

First check:

if len(changes) == 0 {
    return true
}

When changes is empty (i.e., there are no changes to any resources), the function immediately returns true, indicating no drift.

In-depth Check of Each Resource Change

If there are resource change records (i.e., len(changes) != 0), a more in-depth check is performed:

return linq.From(changes).Select(func(i interface{}) interface{} {
    return i.(linq.KeyValue).Value
}).All(func(i interface{}) bool {
    change := i.(*tfjson.ResourceChange).Change
    if change == nil {
        return true
    }
    if change.Actions == nil {
        return true
    }
    return change.Actions.NoOp()
})

A change entry counts as "no change" when it satisfies one of the following conditions:

1. If `ResourceChange.Change` is `nil`, there is no specific change content, so the check safely returns `true`.
```go
if change == nil {
    return true
}
```

2. If `ResourceChange.Change.Actions` is `nil`, Terraform detected no actions, which can also be viewed as no change (safely returns `true`).
```go
if change.Actions == nil {
    return true
}
```

3. If `ResourceChange.Change.Actions.NoOp()` returns `true`, the resource explicitly shows no action in the Terraform execution plan (i.e., no create, update, or delete operations). This is the most critical part of judging configuration drift:
```go
return change.Actions.NoOp()
```

Only when **all** resources satisfy one of the three conditions above does the function return `true` overall. If any resource has actual change operations (such as `create`, `update`, or `delete`), `NoOp()` returns `false`, causing the entire function to return `false`, which indicates the presence of configuration drift.

### Explanation of Terraform's `NoOp()` Method

`NoOp()` is a method on the `Actions` type of the `tfjson` package (HashiCorp's terraform-json library, which models Terraform's JSON plan output). It indicates whether the plan contains no actual actions for a resource:

* When `Actions` is `["no-op"]` (explicitly no operation), `NoOp()` returns `true`.
* If other actions exist (e.g., `"create"`, `"update"`, `"delete"`), it returns `false`.

Therefore, `change.Actions.NoOp()` can accurately identify whether the resource has undergone actual changes.
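To make these semantics easy to experiment with, here is a self-contained sketch with local stand-in types that mirror only the fields used by `noChange`. The stand-ins are assumptions for illustration; the real types live in the `tfjson` package. The sketch also rewrites `noChange` as a plain loop, which behaves the same as the linq version for the cases discussed.

```go
package main

import "fmt"

// Actions is a stand-in for the tfjson type of the same name.
type Actions []string

// NoOp reports true only when the list is exactly ["no-op"].
// A nil or empty list therefore yields false, which is why noChange
// checks for nil separately.
func (a Actions) NoOp() bool {
    return len(a) == 1 && a[0] == "no-op"
}

// Change and ResourceChange are stand-ins with only the fields needed here.
type Change struct{ Actions Actions }
type ResourceChange struct{ Change *Change }

// noChange is a loop-based equivalent of the linq version above.
func noChange(changes map[string]*ResourceChange) bool {
    for _, rc := range changes {
        if rc.Change == nil || rc.Change.Actions == nil {
            continue // nothing recorded for this resource
        }
        if !rc.Change.Actions.NoOp() {
            return false // a real create/update/delete was planned
        }
    }
    return true
}

func main() {
    clean := map[string]*ResourceChange{
        "azurerm_resource_group.this": {Change: &Change{Actions: Actions{"no-op"}}},
    }
    drifted := map[string]*ResourceChange{
        "azurerm_resource_group.this": {Change: &Change{Actions: Actions{"update"}}},
    }
    fmt.Println(noChange(clean), noChange(drifted))
}
```

The loop version also makes the early exit explicit: the function stops at the first resource with a real action.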

### Why Not Write End-to-End Tests Using Terraform's Native Test Framework and Commands?

Terratest is chosen because when writing test code in Go, functionality tests can be performed against the created infrastructure, such as accessing an HTTP service to see if the expected return value is received. The Terraform native test framework can only verify if the Terraform plan or state meets expectations and cannot execute other custom validation logic.

Of course, if we only need to verify the Terraform state file, using the native test framework is acceptable. The issue with Terratest is that writing tests requires mastery of the Go language, and not all module maintainers have mastered Go.

## Breaking Change Detection Tests in Terraform Module Upgrades

In the maintenance and governance of Terraform modules, ensuring module stability and backward compatibility is crucial. Especially when managing modules at scale, any untested breaking change can severely impact the many projects that rely on that module. Establishing an effective breaking change detection mechanism is therefore key to guaranteeing module quality.

### Definition and Impact of Breaking Changes

Breaking Changes refer to changes that cause the configurations of existing module users to fail to work properly. These changes may include:

* Removing or renaming existing input or output variables;
* Modifying key attributes of resources, causing resources to be destroyed and recreated;
* Changing the default behavior or dependencies of resources;
* Modifying the file structure or path of the module, affecting how the module is referenced.

According to the principles of [Semantic Versioning](../large-scale-terraform-module-governance/semantic-versioning.md), breaking changes are only allowed during a Major Version upgrade. Therefore, introducing breaking changes in a Minor Version or Patch Version violates version control conventions and may cause users to encounter configuration errors or resource interruptions unknowingly.

### Strategies for Detecting Breaking Changes

To discover breaking changes in time during the module upgrade process, the following strategies can be adopted:

1. **Version Comparison Testing**: Identify potential breaking changes by comparing the behavioral differences between the current module version and the previous stable version.
2. **Automated Test Integration**: Incorporate automated testing steps into the Continuous Integration (CI) process to ensure every commit undergoes thorough validation.

### How to Implement Breaking Change Detection

The following is the code implementing breaking change detection in the `terraform-module-test-helper` library:

```go
func moduleUpgrade(t *T, owner string, repo string, moduleFolderRelativeToRoot string, newModulePath string, opts terraform.Options, currentMajorVer int) error {
    if currentMajorVer == 0 {
        return SkipV0Error
    }
    latestTag, err := getLatestTag(owner, repo, currentMajorVer)
    if err != nil {
        return err
    }
    if semver.Major(latestTag) == "v0" {
        return SkipV0Error
    }
    tmpDirForTag, err := cloneGithubRepo(owner, repo, &latestTag)
    if err != nil {
        return err
    }

    fullTerraformModuleFolder := filepath.Join(tmpDirForTag, moduleFolderRelativeToRoot)

    exists := files.FileExists(fullTerraformModuleFolder)
    if !exists {
        return CannotTestError
    }
    tmpTestDir := test_structure.CopyTerraformFolderToTemp(t, tmpDirForTag, moduleFolderRelativeToRoot)
    defer func() {
        _ = os.RemoveAll(filepath.Clean(tmpTestDir))
    }()
    return diffTwoVersions(t, opts, tmpTestDir, newModulePath)
}

func diffTwoVersions(t *T, opts terraform.Options, originTerraformDir string, newModulePath string) error {
    opts.TerraformDir = originTerraformDir
    defer destroy(t, opts)
    initAndApply(t, &opts)
    overrideModuleSourceToCurrentPath(t, originTerraformDir, newModulePath)
    return initAndPlanAndIdempotentAtEasyMode(t, opts)
}
```

The process works as follows:

1. **Get the version number of the previous stable version**: Retrieve the module's latest stable version tag via the GitHub API as a comparison baseline.
2. **Clone the code of the specified version**: Use the go-getter tool to clone the specified version of the module code to a local temporary directory.
3. **Apply the previous version**: Inside the cloned module directory, execute terraform init and terraform apply to deploy the previous version and confirm that its configuration is usable.
4. **Replace the module source path**: Replace the module reference path with the local path of the current development version to simulate a module upgrade scenario.
5. **Execute terraform plan again**: Execute terraform plan and analyze the output to determine whether there are resource destruction and recreation operations.
6. **Judge and report the result**: If resource destruction and recreation are detected, and the current version number has not undergone a major version upgrade, the change is marked as a breaking change and its merge is blocked.

1.5.2.1.6.3. When Should Breaking Change Tests Be Skipped?

Breaking change tests should be skipped in the following specific situations:

When the current module's major version is still at v0, the module is considered to be in an exploratory phase, and requiring the avoidance of all breaking changes is impractical.
When the next version we intend to release is a major version update, we aim to fit as many breaking changes as possible into that single update, because it is arguably the only window of opportunity to introduce them.
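The two skip rules can be expressed as a small decision function. This is a sketch only: `majorOf` and `shouldSkipUpgradeTest` are hypothetical names, the tag format `vMAJOR.MINOR.PATCH` is assumed, and the idea of passing the planned next tag explicitly is an assumption made for illustration (the library code above takes the current major version as a parameter instead).

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// majorOf extracts the major version from a tag such as "v1.4.2".
func majorOf(tag string) (int, error) {
    core := strings.TrimPrefix(tag, "v")
    majorStr, _, _ := strings.Cut(core, ".")
    major, err := strconv.Atoi(majorStr)
    if err != nil || major < 0 {
        return 0, fmt.Errorf("invalid version tag %q", tag)
    }
    return major, nil
}

// shouldSkipUpgradeTest applies both rules: skip while the module is still
// at v0, and skip when the planned release bumps the major version.
func shouldSkipUpgradeTest(latestTag, nextTag string) (bool, error) {
    latest, err := majorOf(latestTag)
    if err != nil {
        return false, err
    }
    next, err := majorOf(nextTag)
    if err != nil {
        return false, err
    }
    return latest == 0 || next > latest, nil
}

func main() {
    for _, pair := range [][2]string{
        {"v0.3.1", "v0.4.0"},
        {"v1.4.2", "v1.5.0"},
        {"v1.4.2", "v2.0.0"},
    } {
        skip, err := shouldSkipUpgradeTest(pair[0], pair[1])
        fmt.Println(pair[0], "->", pair[1], "skip:", skip, "error:", err)
    }
}
```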

1.5.2.1.6.4. Application in Continuous Integration

Integrating the above detection mechanism into the continuous integration process allows for automated detection of breaking changes. The specific steps are as follows:

CI Trigger Conditions: Trigger the breaking change detection test on every commit or Pull Request.
Environment Preparation: Set necessary environment variables in the CI environment, such as GITHUB_TOKEN, PREVIOUS_MAJOR_VERSION, etc.
Execute Test Script: Run the breaking change detection test script to automatically complete version comparison, plan execution, and result analysis.
Result Feedback: Based on the test results, decide whether to allow the current change to be merged or prompt the developer to perform a major version upgrade.

Through the strategies and practices above, breaking changes can be effectively detected and prevented, improving the stability and reliability of Terraform modules and ensuring the continued healthy operation of infrastructure.
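As a small illustration of the environment preparation step, a test entry point can validate its configuration up front and fail with a clear message instead of failing halfway through a deployment. The sketch below is an assumption-laden example: `ciConfig` and `loadCIConfig` are hypothetical names, and the accepted format for `PREVIOUS_MAJOR_VERSION` (an integer with an optional `v` prefix) is assumed rather than taken from the library.

```go
package main

import (
    "errors"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// ciConfig holds the settings the breaking change test needs.
type ciConfig struct {
    GitHubToken          string
    PreviousMajorVersion int
}

// loadCIConfig reads and validates the configuration. getenv is injectable
// so the function can be tested without touching the real environment.
func loadCIConfig(getenv func(string) string) (ciConfig, error) {
    cfg := ciConfig{GitHubToken: getenv("GITHUB_TOKEN")}
    if cfg.GitHubToken == "" {
        return ciConfig{}, errors.New("GITHUB_TOKEN is not set")
    }
    raw := strings.TrimPrefix(getenv("PREVIOUS_MAJOR_VERSION"), "v")
    major, err := strconv.Atoi(raw)
    if err != nil || major < 0 {
        return ciConfig{}, fmt.Errorf("PREVIOUS_MAJOR_VERSION must be a non-negative integer, got %q", raw)
    }
    cfg.PreviousMajorVersion = major
    return cfg, nil
}

func main() {
    cfg, err := loadCIConfig(os.Getenv)
    if err != nil {
        fmt.Println("configuration error:", err)
        return
    }
    // Never print the token itself.
    fmt.Println("previous major version:", cfg.PreviousMajorVersion)
}
```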

1.5.2.1.7. Summary

End-to-end testing is not only a key line of defense for ensuring Terraform module quality but also the core mechanism for achieving "automated reliability" in large-scale module governance. This chapter centered around two tools, Terratest and terraform-module-test-helper, explaining how to build a complete E2E testing framework, implement parallel testing of examples, and achieve configuration drift detection and breaking change monitoring capabilities.

With Terratest, we saw the flexibility and programmability of verifying infrastructure behavior via the Go language, allowing us to validate the actual deployment effects of modules from a perspective closest to the user. With Azure's terraform-module-test-helper tool, through encapsulated functions like RunE2ETest and initAndPlanAndIdempotentAtEasyMode, we can automate and standardize the module testing process and implement it within daily CI workflows.

Through parallel testing strategies and the design of global setup/teardown mechanisms, we significantly reduced testing duration and enhanced test controllability, avoiding the timeout risks associated with serial execution. In actual production maintenance, configuration drift detection and breaking change testing further strengthen the robustness of our modules against environmental evolution and version upgrades.

In short, a Terraform module without end-to-end testing cannot be trusted for use in a production environment. Especially when the number of modules reaches dozens or hundreds, end-to-end testing serves not just as a quality assurance mechanism, but as a fundamental asset for maintaining scalable and long-term sustainable infrastructure.
