In my previous articles, I’ve discussed the DevOps strategy/culture and application/development considerations for DevOps goodness. As the next article in this series, I will delve into the infrastructure perspective and describe some foundational best practices to win in the cloud.
But first, a brief look at how infrastructure has evolved. Not so long ago, we had to manage physical servers and deal with pagers and NOCs: measuring uptime, writing SOPs for what happens when something fails, managing those boxes, tracking MTBF, and so on. Routine tasks like updating an application or changing network passwords took a long time and a lot of communication overhead. Some of us started to add some automation (Perl, anyone?), bash scripting, and so forth. Then, around 2008, virtualization took hold at about the same time as the economic downturn, and we had to make do with smaller teams, figuring out how to get the same amount of work done with half the staff. More and more automation kicked in, and server and network virtualization and cloud services evolved to offer elastic compute and elastic storage. These trends have made it possible for us to move at a faster pace in support of the business.
To do this on the infrastructure side, DevOps provides the means to implement Lean principles and improve agility, reliability and performance. How is this possible? The key is automation. If a small DNS update is still being done manually somewhere, or if there is manual testing after deployment, ask yourself whether you can automate it.
Containers, Serverless and beyond
The rise of containers in application development and deployment, and more recently of serverless computing, brings real advantages in agility and portability, but it also brings very real new challenges in delivery and operations.
You need to think about things like:
- evolving your release management processes to support delivering microservices multiple times a day;
- giving developers and IT Ops teams end-to-end visibility through logging and monitoring, because there are now many more transient events;
- adapting your security posture to the new attack surfaces that container orchestration and serverless paradigms bring with them;
- updating skills and tools so that teams can focus on the application, and not on the servers that the app runs on;
- addressing the technical debt that builds faster while these newer technologies and platforms are still evolving and stabilizing;
- DR for your serverless environment.
While the advancements in infrastructure and the best practices stated in this article (see below) should enable us to address these challenges, we need further maturity and fail-fast approaches in every organization’s technology/processes, especially when it comes to Containers and Serverless apps. Here are some guidelines we have found to help us along our journey towards DevOps maturity in the infrastructure space.
- Shoot your pets
Well, not literally. In the spirit of reusable infrastructure and repeatable deployments, your ‘pet’ server needs to be treated as cattle, i.e., it needs to be versioned and treated as an artifact that can be included in your delivery workflow. Treat servers as a ‘herd’ and not as pets. If a server is down or has an issue, that’s fine: you are not going to ssh in and repair it. All you care about is that the components around the server know it is down, so that a new one can be spun up. No fixing of servers, just building around them. To implement this, look at auto-scaling and other elastic capabilities.
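The "replace, don't repair" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any provider's auto-scaling implementation: the `Instance` type, the `launch` stand-in and the `reconcile` function are hypothetical names, and a real setup would delegate this work to the cloud provider's auto-scaling service.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Instance:
    instance_id: str
    image_version: str
    healthy: bool


# Hypothetical ID generator; a real platform assigns instance IDs itself.
_ids = count(1)


def launch(image_version: str) -> Instance:
    """Stand-in for a real launch call: returns a fresh, healthy instance."""
    return Instance(f"i-{next(_ids):04d}", image_version, True)


def reconcile(fleet: list, desired: int, image_version: str) -> list:
    """Discard unhealthy instances and launch replacements up to `desired`.

    Nothing is repaired in place. This sketch only replaces; it does not
    scale in when more than `desired` healthy instances exist.
    """
    survivors = [i for i in fleet if i.healthy]
    missing = max(0, desired - len(survivors))
    return survivors + [launch(image_version) for _ in range(missing)]
```

Because replacements always come from the current image version, a failed server is also an opportunity to roll forward to the latest versioned artifact.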
- No manual intervention
Consider a regular schedule (weekly, or whatever cadence suits your organization) for baking a new AMI that contains all the application code and patches you want to roll out. These pre-baked base AMIs, with patches, security updates and application changes rolled in together, become your versioned artifacts (the deployment units) that let you implement elasticity and reliability at speed.
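One way to make each baked image a versioned artifact is to derive its name from the application, the bake date and a patch level. The sketch below assumes that naming scheme, which is not from the original article, and the function name `image_build_request` is hypothetical. The dictionary keys follow the parameter names of the EC2 CreateImage call (for example boto3's `create_image`), but the function only builds the request; nothing is sent to AWS.

```python
from datetime import date


def image_build_request(app: str, builder_instance_id: str,
                        build_date: date, patch_level: str) -> dict:
    """Build the parameters for baking one versioned image.

    Assumed naming scheme: <app>-<YYYYMMDD>-<patch_level>.
    """
    name = f"{app}-{build_date:%Y%m%d}-{patch_level}"
    return {
        "InstanceId": builder_instance_id,  # instance the image is captured from
        "Name": name,                       # the version identifier of the artifact
        "Description": f"Pre-baked image for {app}, patch level {patch_level}",
        "NoReboot": False,                  # allow a reboot for a consistent snapshot
    }
```

A scheduled job could call this once per cadence and hand the resulting name to the deployment pipeline, so that every rollout refers to an exact image version.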
- Access management – Define users & roles – Federation
Taking a step back, when we think of leveraging the public cloud, one of the first things to define is who the different users of the account will be. What are their roles, and how are they going to use it? Getting your current Identity Management solution to federate the users in your organization into the cloud is the way to go. The key here is to issue temporary credentials (lasting ~15 minutes to 36 hours based on your needs) to the users. As for application access to infrastructure, remember that applications have traditionally had keys embedded in the application or its config files. This needs to go away, and you should leverage role-based access instead.
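A small sketch of enforcing the credential lifetime mentioned above (roughly 15 minutes to 36 hours) before a federation request is made. The function name `federation_request` is hypothetical. The keys `Name` and `DurationSeconds` are modeled on the AWS STS GetFederationToken parameters; treat that mapping as an assumption to check against the STS documentation for the call you actually use.

```python
MIN_SECONDS = 15 * 60        # 15 minutes, the lower bound quoted in the text
MAX_SECONDS = 36 * 60 * 60   # 36 hours, the upper bound quoted in the text


def federation_request(user_name: str, hours: float) -> dict:
    """Build a request for temporary credentials with a bounded lifetime."""
    seconds = int(hours * 3600)
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(
            f"lifetime must be between {MIN_SECONDS} and {MAX_SECONDS} seconds, "
            f"got {seconds}"
        )
    return {"Name": user_name, "DurationSeconds": seconds}
```

Rejecting out-of-range lifetimes up front keeps long-lived credentials from creeping back in through configuration.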
- Billing & account management
In the cloud, sprawl occurs in almost every organization. Shadow IT can be exacerbated in the cloud: there is a tendency to just whip out the credit card, ‘buy’ your infrastructure/services and completely bypass IT/Ops. You could end up with multiple accounts. It’s highly beneficial to have a well-thought-out governance strategy: consolidating accounts under a master account, understanding who is going to pay the bill (which P&L all this rolls up under), etc. You can consider having multiple accounts and linking them to one payer account. This also helps you get the most out of your investments in reserved instance purchases. If you are a Managed Services Provider (MSP), the management of multiple accounts for your customers in multiple regions should be automated.
- Separation of environments
There are many ways to separate your virtual networks. You can consider setting up individual VPCs (virtual private clouds) per application, per tier (web, app, DB) and/or per environment (Development, Test, Production). This enables us to have clear processes for each of the environments. From a security perspective, we don’t want developers to have standing access to production. We can set this up as follows: a person in the ‘Development VPC’ can jump into ‘Production’ to perform a deployment by assuming a temporary role in production just for the duration of the deployment, and then dropping that role. To keep networking/security even simpler and provide for easier access management, we recommend using roles and having separate accounts per environment and/or per application (and, based on compliance/security needs, you can have individual VPCs for different app tiers within each account).
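With separate accounts per environment, the temporary jump into production is typically expressed as a role in the production account whose trust policy names the development account. The sketch below generates such a trust policy in the standard IAM JSON policy format. The function name and the example account ID are hypothetical placeholders; in practice you would likely narrow the principal to specific roles and add conditions.

```python
import json


def deploy_role_trust_policy(trusted_account_id: str) -> dict:
    """Trust policy letting principals in one account assume a deploy role."""
    if not (trusted_account_id.isdigit() and len(trusted_account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account root principal delegates the decision to that
                # account's own IAM policies; narrow this for real use.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # 111111111111 is a placeholder for the development account ID.
    print(json.dumps(deploy_role_trust_policy("111111111111"), indent=2))
```

The credentials obtained by assuming this role expire on their own, which is what keeps production access limited to the duration of the deployment.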
- Logging & audit trails
Governance and enforcement of policies based on real-time events is critical, and having a rigorous practice around log management and audit trails is the way to go. Look at leveraging CloudWatch Logs, CloudWatch Events and CloudTrail in AWS, and/or the various third-party vendors in this space, to help implement this. It is also good to have alerting mechanisms in place that fire on configured alarms or on anomalies in normal operation.
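To make the idea of a configured alarm concrete, here is a simplified "M out of N datapoints" evaluation in Python. The state names borrow CloudWatch's alarm states, but the logic is an illustrative sketch and not CloudWatch's actual evaluation algorithm; the function and parameter names are assumptions.

```python
def alarm_state(datapoints: list, threshold: float,
                datapoints_to_alarm: int, evaluation_periods: int) -> str:
    """Evaluate the most recent `evaluation_periods` datapoints.

    Returns "ALARM" if at least `datapoints_to_alarm` of them exceed
    `threshold`, "OK" otherwise, and "INSUFFICIENT_DATA" when there are
    fewer datapoints than `evaluation_periods`.
    """
    if evaluation_periods < 1 or datapoints_to_alarm < 1:
        raise ValueError("periods and datapoints_to_alarm must be at least 1")
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    breaches = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaches >= datapoints_to_alarm else "OK"
```

Requiring several breaching datapoints rather than one reduces noise from the transient events that containers and serverless workloads tend to produce.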
- Treat everything as code
All of what we talked about above can be codified/templated. The tried and tested techniques, practices, and tools from software development are being applied to create versioned, reusable, maintainable, extensible, and testable infrastructure. So, provisioning, testing, updating and deletion of environments all happen via a code repository. Infrastructure as code is here to stay, and these versioned artifacts become the key to agility and reliability in DevOps.
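As a small illustration of infrastructure as code, the Python below renders a CloudFormation template for one VPC per environment. The CIDR ranges, the logical resource name `EnvironmentVpc` and the function name are assumptions made for this example. The output is plain JSON text that can be committed to a repository, reviewed and diffed like any other source file.

```python
import json

# Assumed, non-overlapping address ranges; pick ranges that fit your network plan.
ENVIRONMENT_CIDRS = {
    "development": "10.0.0.0/16",
    "test": "10.1.0.0/16",
    "production": "10.2.0.0/16",
}


def vpc_template(environment: str) -> str:
    """Render a CloudFormation template (as JSON text) for one environment."""
    cidr = ENVIRONMENT_CIDRS[environment]  # KeyError for unknown environments
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"VPC for the {environment} environment",
        "Resources": {
            "EnvironmentVpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {
                    "CidrBlock": cidr,
                    "EnableDnsSupport": True,
                    "Tags": [{"Key": "Environment", "Value": environment}],
                },
            }
        },
    }
    # sort_keys makes the output deterministic, which keeps diffs clean.
    return json.dumps(template, indent=2, sort_keys=True)
```

Generating all three environments from the same function is what keeps Development, Test and Production structurally identical.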
- Service Catalog
Think of creating a product catalog of your standard setup: network, security, identity, AMI, etc. A user, say a developer who needs a Drupal website for development, can then choose from the preconfigured set of products in the catalog and create the environment through self-service. This empowers employees and speeds up provisioning/deployment while ensuring compliance, best practices and corporate standards. You can have your CloudFormation (or Terraform) architectures and configurations cataloged, helping you move towards centralized control over the portfolio of services that end users can browse and launch.
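The self-service flow can be sketched as a lookup plus validation against what the catalog allows. Everything here is hypothetical and for illustration only: the product name, the template path, the size options and the `launch_request` function are not from the article or from any particular catalog product.

```python
# Hypothetical catalog: each product points at a versioned template and
# lists the only options a requester is allowed to choose from.
CATALOG = {
    "drupal-dev-site": {
        "template": "templates/drupal-dev.json",  # hypothetical repository path
        "allowed_instance_sizes": {"small", "medium"},
    },
}


def launch_request(product: str, instance_size: str, requested_by: str) -> dict:
    """Validate a self-service request against the catalog.

    Returns the launch description that a provisioning pipeline could act on.
    """
    if product not in CATALOG:
        raise KeyError(f"{product!r} is not in the catalog")
    entry = CATALOG[product]
    if instance_size not in entry["allowed_instance_sizes"]:
        raise ValueError(f"{instance_size!r} is not allowed for {product!r}")
    return {
        "template": entry["template"],
        "parameters": {"InstanceSize": instance_size},
        "requested_by": requested_by,
    }
```

Users get speed because nothing needs a ticket, and the organization keeps control because only cataloged templates and approved options can be launched.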
The picture below helps summarize the above points. You could use it in your DevOps teams when considering infrastructure options.
This brings us to the end of our Infrastructure recommendations. In the next article, we will be delving into best practices for your Continuous Integration and Continuous Delivery/Deployment (CI/CD).