3 months after: Intro to Terraform and Infrastructure as Code (IaC).
This is a before-and-after type of post. I’m embarking on a project where I’ll be providing infrastructure support for a team of about 15 developers. I have some previous experience but need much more, so I’ll have to learn the ins and outs of Terraform and IaC. Hopefully I’ll be able to share both the hair-pulling and the woohoo moments. I’ll be back after 3 ̶w̶e̶e̶k̶s̶ months.
I assume the reader is in the same boat as me. So let’s recap. We know that Terraform is a tool for managing software infrastructure and we’ve used it lightly. We’re pretty sure we want to learn about infra and IaC in general (that’s why you’re reading this post?) so our slant will be to explore quickly and find out how to get it done properly. This is our list of demands:
- Why Terraform, why IaC?
- When to not use Terraform and IaC?
- How do You use Terraform?
- One painfully obvious tip?
- Any Woohoo moments?
- What tools do you always have on hand?
- What is just as important to learn in parallel?
- IaC and the mechanics of working together in large teams.
- What is the ideal devops for Terraform and IaC?
- Three things that you’ll be doing in the next 3 weeks.
3 months after
So 3 months have flown by instead of 3 weeks, a testament to my poor discipline and how busy I was. Someone was talking about needing 20 hours to learn a new skill, and I think it’s still a fair proposition to learn Terraform in 20 hours… on the condition that you already know a few topics beforehand, such as pipelines, devops, the terminal, file storage, networking, applications and APIs **cough**. My general learning has been something like this:
- Read books, then experiment.
- Face a problem: take tutorials, then troubleshoot the problem. Refactor.
- Need to build a feature: read articles, then build the feature. Refactor.
- Noticed that the pipelines need to be improved: read discussions, then experiment with the pipelines. Refactor.
- Read books, then experiment. Refactor.
I’ll leave you to draw your own conclusions from that. With someone hand-holding me through the process, it would probably have gone differently. So let’s get on with our initial demands:
- Why Terraform, why IaC?
Brikman has very good explanations in his book, Terraform: Up & Running: Writing Infrastructure as Code, comparing Terraform with other tools. The main selling points are its level of integration with cloud providers, immutable infrastructure and its declarative language. Until another tool offers the same quality of integration with cloud providers, this is probably the tool to use. IaC is probably the only sane way to manage infrastructure in this day and age: managing infrastructure manually invites disaster the moment a team member goes on leave or, worse, leaves!
- When to not use Terraform and IaC (for reasonably large projects)?
There are certainly limits to the level of integration, and Terraform just doesn’t keep up, particularly with the more advanced offerings from cloud providers. One example is configuring the data sources for logs captured from VMs, which isn’t offered through Terraform. If you’re using some of the latest features, you’ll have to weigh provisioning that infra via the API, or even accepting some manual steps.
- How do You use Terraform?
We currently use Terraform to manage about 80% of our infra. Our compute resources run in managed clusters such as Kubernetes, so there’s additional code to configure those. We have a lower environment that we can validate Terraform plans against by running our scripts locally, but every Terraform apply runs in the pipeline.
- One painfully obvious tip?
A colleague advised me early on: “Once you start using Terraform, don’t edit the infrastructure manually.” If, for whatever reason, a piece of infrastructure has to be edited via the API or the management portal, make sure the Terraform state is kept up to date.
- Any Woohoo moments?
When Terraform is working well, every deployment is amazing: cloud resources work together seamlessly, as they should. It’s still a thrill to see deployments succeed and cross dependencies get applied. For example, we could provision a Kubernetes cluster and a devops machine while defining that the k8s cluster must whitelist the devops machine, and that the devops machine can only access k8s (besides the other networks it already has access to).
- What tools do you always have on hand?
iTerm for viewing and printing API output concisely in separate panes, VSCode with the ANSI Colors extension for viewing Terraform plans from pipelines, and tfsec for checking vulnerabilities.
- What is just as important to learn in parallel?
The cloud provider’s resources for compute, storage, logging, monitoring, security and networking are another continuously evolving topic. Whether it’s AWS, Azure or GCP, each has its own way to provision, say, compute resources so that they fit neatly into your network infrastructure, keeping them easy to access while still ensuring sufficient isolation.
- IaC and the mechanics of working together in large teams.
There are two far ends of a spectrum: 1) a core infra team with no participation from application devs, and 2) a team with full knowledge of infra. I reckon it’s healthiest to be somewhere in the middle, where application devs have a working knowledge of the infra but core infra anchors are present to radiate good practices. This clears up misconceptions about infra and helps developers reason about it more effectively, while the team keeps a deeper context of infra.
- What is the ideal devops for Terraform and IaC?
Infra pipelines should certainly be similar to most application pipelines: they should run reliably, run full layers of tests and be used to continuously promote code from lower to higher environments. What I found different with IaC is that we need an opportunity to review Terraform plans before applying them. This is because the state may differ across environments, whether due to a backlog of versions waiting to be pushed, a manual change that was never committed or simple tampering, so we want to review the changes before they are applied.
- Three things that you’ll be doing in the next 3 weeks.
Seeing that this post took 3 months instead of 3 weeks, I’m not sure I’ll get through this list any time soon. Items on the improvement list would be:
- to move towards system identities for the devops machine,
- to use cleaner code, such as the Stack Configuration File pattern described in Infrastructure as Code by Kief Morris, and
- to have a clearer separation between setup infrastructure and main infrastructure.
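To make the Stack Configuration File idea a bit more concrete: each environment gets its own checked-in parameter file, and the same stack code is planned against whichever file matches the target environment. Below is a minimal Python sketch of a pipeline helper doing that, assuming a hypothetical environments/<env>.tfvars layout and hypothetical environment names; -input, -var-file and -out are standard terraform plan options, but this is not code from the book.

```python
from pathlib import PurePosixPath
from typing import List

# Hypothetical environment names; adjust to whatever your pipeline promotes through.
ENVIRONMENTS = ("dev", "test", "prod")


def plan_command(env: str, config_dir: str = "environments") -> List[str]:
    """Build a `terraform plan` invocation that reads one checked-in config file per environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    # Hypothetical layout: environments/dev.tfvars, environments/test.tfvars, ...
    var_file = PurePosixPath(config_dir) / f"{env}.tfvars"
    return [
        "terraform", "plan",
        "-input=false",           # never prompt for variables inside a pipeline
        f"-var-file={var_file}",  # the stack configuration file for this environment
        f"-out={env}.tfplan",     # save the plan so the reviewed plan is what gets applied
    ]


if __name__ == "__main__":
    print(" ".join(plan_command("test")))
```

For "test" this prints terraform plan -input=false -var-file=environments/test.tfvars -out=test.tfplan, and a later terraform apply test.tfplan step would apply exactly that saved plan. Keeping the values in files rather than in pipeline variables means they are versioned and reviewed along with the code.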
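Going back to “Why Terraform, why IaC?”: the declarative part means you describe the end state and the tool works out the steps to get there. The toy Python sketch below shows only that idea by diffing a hypothetical desired set of resources against what currently exists. It is not how Terraform is implemented, and it ignores state files, providers and the replace-instead-of-update behaviour that comes with immutable infrastructure.

```python
from typing import Dict, List

# Toy model (hypothetical): resource name -> its attributes.
Resources = Dict[str, Dict[str, str]]


def plan_changes(desired: Resources, actual: Resources) -> Dict[str, List[str]]:
    """Work out which actions would make `actual` match `desired`."""
    plan: Dict[str, List[str]] = {"create": [], "update": [], "delete": []}
    for name in sorted(desired):
        if name not in actual:
            plan["create"].append(name)  # declared but missing
        elif desired[name] != actual[name]:
            plan["update"].append(name)  # exists but differs from the declaration
    for name in sorted(actual):
        if name not in desired:
            plan["delete"].append(name)  # exists but is no longer declared
    return plan


if __name__ == "__main__":
    desired = {"k8s_cluster": {"node_count": "3"}, "devops_vm": {"size": "small"}}
    actual = {"k8s_cluster": {"node_count": "2"}, "old_vm": {"size": "large"}}
    print(plan_changes(desired, actual))
    # {'create': ['devops_vm'], 'update': ['k8s_cluster'], 'delete': ['old_vm']}
```

The point of the exercise: nobody writes “create devops_vm, then resize the cluster, then delete old_vm”. Those steps fall out of comparing the declaration with reality.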
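Related to the tip about never editing infrastructure by hand: one way to catch manual changes is a scheduled plan against each environment, since a plan doesn’t change any infrastructure. terraform plan -detailed-exitcode exits with 0 when nothing would change, 1 on error and 2 when changes are present. The sketch below wraps that in Python; the working directory is a hypothetical per-environment folder, and an exit code of 2 can mean either drift or code that simply hasn’t been applied yet, so it only tells you where to look.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a status label."""
    if code == 0:
        return "in-sync"          # no changes
    if code == 2:
        return "changes-present"  # drift, or code not applied yet
    return "error"                # 1, or anything unexpected


def check_environment(workdir: str) -> str:
    """Run a plan in `workdir` (needs terraform on PATH and an initialised directory)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

A nightly job could call check_environment on each environment folder and raise an alert on anything other than "in-sync".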
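On the cross dependencies from the woohoo moment: Terraform builds a dependency graph from the references between resources and creates independent resources in parallel. The sketch below imitates that ordering with Python’s standard graphlib module (3.9+) over a hypothetical four-resource graph, where the allow-list rule can only be created once both the cluster and the devops machine exist. It illustrates topological ordering and is not Terraform’s own graph walker.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical resources, each mapped to the resources it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "network": set(),
    "k8s_cluster": {"network"},
    "devops_vm": {"network"},
    # The allow-list rule references both machines, so it has to come last.
    "k8s_allow_devops": {"k8s_cluster", "devops_vm"},
}


def apply_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group resources into batches; everything within one batch could be created in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    print(apply_batches(DEPENDENCIES))
    # [['network'], ['devops_vm', 'k8s_cluster'], ['k8s_allow_devops']]
```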
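About reading pipeline plans with the ANSI Colors extension: the colour codes in Terraform’s output can also be stripped before a log is archived or pasted into a review. Here is a minimal Python sketch, assuming the log only contains standard CSI escape sequences. If you control the pipeline, passing -no-color to terraform plan avoids the codes altogether.

```python
import re

# CSI sequence: ESC '[', parameter bytes (0x30-0x3F), intermediate bytes (0x20-0x2F),
# final byte (0x40-0x7E). Covers colour, bold and reset codes such as "\x1b[32m" and "\x1b[0m".
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove ANSI CSI sequences so a plan log reads cleanly as plain text."""
    return ANSI_CSI.sub("", text)
```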
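For the networking side of learning the cloud provider in parallel: much of the isolation work starts with carving an address space into non-overlapping subnets per workload and then allow-listing only what should talk to what. The Python sketch below uses the standard ipaddress module with a hypothetical 10.0.0.0/16 network and hypothetical subnet names. Inside Terraform, the built-in cidrsubnet function does the same arithmetic; for example, cidrsubnet("10.0.0.0/16", 8, 1) gives 10.0.1.0/24.

```python
import ipaddress
from typing import Dict, List


def carve_subnets(address_space: str, names: List[str], new_prefix: int) -> Dict[str, str]:
    """Assign consecutive, non-overlapping subnets of length `new_prefix` to each name."""
    network = ipaddress.ip_network(address_space)
    candidates = network.subnets(new_prefix=new_prefix)
    plan: Dict[str, str] = {}
    for name in names:
        subnet = next(candidates, None)
        if subnet is None:
            raise ValueError(f"{address_space} has no room for another /{new_prefix} for {name!r}")
        plan[name] = str(subnet)
    return plan


def is_allowed(source_ip: str, allowed_cidr: str) -> bool:
    """True if `source_ip` falls inside an allow-listed CIDR, e.g. the devops subnet."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(allowed_cidr)
```

With carve_subnets("10.0.0.0/16", ["k8s", "devops", "data"], 24), the devops subnet comes out as 10.0.1.0/24, and a k8s allow-list that contains only that CIDR admits 10.0.1.15 but rejects 10.0.2.15.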
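On reviewing plans before applying them: in a pipeline that saves the plan with terraform plan -out=tfplan, running terraform show -json tfplan produces a machine-readable version whose resource_changes entries list the planned actions for each resource address. A small Python sketch, assuming that JSON has already been captured, could group addresses by action and flag deletions and replacements for a closer human review. The resource addresses in the test data are hypothetical, and the gating rule is just one possible policy.

```python
import json
from typing import Dict, List


def summarize_plan(plan_json: str) -> Dict[str, List[str]]:
    """Group resource addresses by planned action, from `terraform show -json <planfile>` output."""
    plan = json.loads(plan_json)
    summary: Dict[str, List[str]] = {}
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if actions == ["no-op"]:
            continue  # nothing will happen to this resource
        # ["delete", "create"] or ["create", "delete"] means the resource gets replaced.
        action = "replace" if sorted(actions) == ["create", "delete"] else actions[0]
        summary.setdefault(action, []).append(change["address"])
    return summary


def needs_closer_review(summary: Dict[str, List[str]]) -> bool:
    """Example policy (an assumption): anything destroyed or replaced gets a second pair of eyes."""
    return bool(summary.get("delete") or summary.get("replace"))
```

In the pipeline, the summary could be posted next to the approval step, with the apply stage only running the saved tfplan once someone has signed off.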