Even an old dog can learn something new

Effortless account management with Active Directory on AWS

Even though Active Directory is as old as the dinosaurs, it will be with us for a while longer, so let us find out how to use it in the age of the cloud for automatic server management. Below I describe how to seamlessly join servers to Active Directory and remove them again.

This blog lives as a GitHub repository where I keep track of the actual infrastructure configuration and this text. I would be very grateful if you considered sending pull requests there with either infrastructure configuration or textual changes. It would be great if this initial attempt grew into something anybody could effortlessly use in their infrastructure.

This blog’s ultimate goal is to preserve all the knowledge about managing an Active Directory (AD) domain in the cloud that I have gathered over the past couple of months. Most of the information, e.g., about GNU/Linux and AD, is transferable to on-premise deployments; the parts specific to AWS Directory Service are not.

So here we go.

AD is a Microsoft technology for managing resources inside a company, theoretically a company of any size, ranging from a small business to an international enterprise. There are plenty of good sources dealing with AD-specific topics, from the basics to deep dives. For this text, the reader can look at AD as server-side software where computers can be registered and user accounts can be centrally managed. With such a setup, users can be granted rights to log in to specific servers and so on. Such a user account then logs in with a single password that is managed and known by the user only, no matter which server the user is logging in to. Additionally, single sign-on (SSO) can be used, so the user provides his/her credentials just once, and an authentication token is used for logging in to the server instead of the credentials. Basically, such a configuration allows removing all public SSH keys from use, forming one of the Zero Trust pillars.

Because AD was created by Microsoft around the year 2000, when Microsoft literally hated Linux, the server registration process, AKA joining, used to be a bit clumsy on Linux. Historically, joining was a very manual process requiring configuration of LDAP, Kerberos, some Samba components, and DNS, so that Linux’s authentication layer could work with AD. Joining a Linux server felt more like hacking than a simple task. On top of that, this layer could differ slightly depending on the Linux distribution used. In other words, the Linux joining process was no piece of cake. Luckily, with the advance of technologies such as RealmD and SystemD, which are de facto standards in all modern distributions, the joining process has become relatively straightforward. Nevertheless, an administrator still needs to log in to the server and run through a few steps to join an AD domain.

Similarly, setting up an AD installation requires some knowledge. Here comes Amazon AWS with AWS Directory Service, which delivers AD as a service. AWS offers several flavors of AD, and all of them deliver a fully functional AD installation within a matter of minutes. The AD can be created through the AWS web console, AWS CloudFormation, or Terraform. The advantage of AD delivered by AWS is its integration, which allows SSO or server joining through the Launch instance form.

Launch instance — join form

This joining method greatly accelerates joining servers to AD. This functionality used to be available for Windows servers only, and has recently been enabled for GNU/Linux distributions with AWS SSM agent VERSION (TODO) installed. Unsurprisingly, this embedded joining process uses an AWS Systems Manager (SSM) document in the background. Specifically, AWS SSM provides the managed document AWS-JoinDirectoryServiceDomain, which can be easily used. An example of joining a Windows server is attached in ec2-management.tf.

Bear in mind that an AWS SSM document is only triggered by creating an association with an instance, which is either a manual operation or is done by a third-party component such as the Launch instance wizard. Consequently, no AWS SSM document removes servers from AD once they are deleted. Technically, that is not even possible without direct interaction with the operating system. The inability to deregister computers is a very significant disadvantage because the cloud is subject to change, and servers can be started and deleted many times a day. In other words, the integration provided by AWS is great for cloud deployments that mimic on-premise deployments with semi-permanent servers. Elsewhere, stale computer objects pester admins, who have to remove plenty of servers from AD manually or with periodically triggered scripts.

A closer look at DNS records resolution

The joining procedure requires some DNS records so that the joining mechanism can properly discover all necessary components, e.g., the LDAP and Kerberos interfaces. Therefore, the VPC DHCP options need to be extended so that all computers use the domain controllers’ IP addresses for DNS resolution.

resource "aws_vpc_dhcp_options" "dns_resolver" {
  domain_name_servers = aws_directory_service_directory.simpleds.dns_ip_addresses
}

resource "aws_vpc_dhcp_options_association" "dns_resolver" {
  vpc_id          = module.vpc.vpc_id
  dhcp_options_id = aws_vpc_dhcp_options.dns_resolver.id
}
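To make the discovery step more tangible, here is a minimal Python sketch that lists the DNS SRV names typically looked up when locating AD services. The function name is hypothetical and the list is illustrative rather than exhaustive; the example domain is the one used in this post.

```python
from typing import List


def discovery_srv_names(domain: str) -> List[str]:
    """Return SRV record names commonly queried to locate AD services.

    The selection below is an illustrative assumption, not a complete list.
    """
    domain = domain.strip(".").lower()
    return [
        f"_ldap._tcp.{domain}",           # LDAP interface
        f"_kerberos._tcp.{domain}",       # Kerberos over TCP
        f"_kerberos._udp.{domain}",       # Kerberos over UDP
        f"_ldap._tcp.dc._msdcs.{domain}"  # domain controller locator
    ]


if __name__ == "__main__":
    for name in discovery_srv_names("ad.domain.test"):
        print(name)
```

If any of these names does not resolve from an instance, the DHCP options above are the first thing worth checking.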

Note that this DHCP configuration is suitable for deployments in one region or in geographically close regions, e.g., multiple VPCs in the same region. Such DHCP options can cause issues for cross-ocean deployments. The main problem is that all DNS traffic is centralized on the domain controllers, which means, among other things, that Route 53 zone associations of remote regions are not considered. In other words, all VPCs resolve the same Route 53 zones as the VPC where the AWS Directory Service is deployed. Additionally, a failure of the domain controllers can paralyze the whole network. Ideally, every EC2 instance would use its regional DNS resolver, i.e., the VPC’s .2 address or 169.254.169.253, and only the domain-specific traffic would be forwarded to the domain controllers.
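The split resolution idea can be illustrated with a small Python sketch. It only models the routing decision (which resolvers should answer a given name); the function name and the sample IP address are assumptions made for illustration, and a real setup delegates this decision to the resolver itself.

```python
from typing import List, Sequence

VPC_RESOLVER = "169.254.169.253"  # link-local address of the VPC DNS resolver


def pick_resolvers(name: str, ad_domain: str, dc_ips: Sequence[str],
                   vpc_resolver: str = VPC_RESOLVER) -> List[str]:
    """Route AD-domain names to the domain controllers, everything else to the VPC resolver."""
    name = name.rstrip(".").lower()
    ad_domain = ad_domain.rstrip(".").lower()
    # Match the domain itself or any name below it, but not look-alike suffixes.
    if name == ad_domain or name.endswith("." + ad_domain):
        return list(dc_ips)
    return [vpc_resolver]


if __name__ == "__main__":
    dcs = ["10.0.0.10"]  # hypothetical domain controller address
    print(pick_resolvers("_ldap._tcp.ad.domain.test", "ad.domain.test", dcs))
    print(pick_resolvers("example.com", "ad.domain.test", dcs))
```

With such a split, a domain controller outage only affects lookups inside the AD domain, while all other names keep resolving through the regional resolver.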

Such functionality can be delivered via Route 53 Resolver. This is probably the most scalable solution, but it comes at a price and puts a strain on the VPC networking.

Alternatively, for Linux-centric deployments (Windows does NOT support split DNS resolution), the Linux instances can be configured to use DNSmasq or systemd-resolved for DNS traffic routing. DNSmasq is not as widely used for local DNS caching in Linux distributions as systemd-resolved. An example configuration looks like this:

root@server:~# cat /etc/systemd/resolved.conf
[Resolve]
DNS=IP1 IP2
#FallbackDNS=
Domains=ad.domain.test
#LLMNR=no
#MulticastDNS=no
#DNSSEC=no
#DNSOverTLS=no
Cache=yes
#DNSStubListener=yes
#ReadEtcHosts=yes

Subsequently, the configuration can be verified using the systemd-resolve --status command (resolvectl status on newer SystemD versions).

This piece uses the last option from here on. To incorporate this functionality into the joining process, the official AWS SSM document AWS-JoinDirectoryServiceDomain needs to be altered. The modification is stored in the ssm.tf file. The modified joining document also contains the steps necessary for the password-less joining process.

Effortless computer management

All the previous sections covered the technological basics leading to the moment when a Linux system (Ubuntu specifically, but the joining document can be updated for any SystemD-based Linux distribution) can be seamlessly joined. The join, however, is triggered by a user who either selects the directory from the dropdown menu during Launch instance or assigns an AWS SSM document to an EC2 instance. The downside of this AWS SSM-backed process is that it is manual. Manual processes should be pruned from cloud-based deployments as much as possible. Additionally, computer objects remain in AD even after EC2 instances are terminated; hence more manual cleanup work is necessary. The automation of EC2 instance registration and removal is the topic of the following sections.

EC2 instance joining should be controlled by EC2 tags and should be as simple as possible, e.g., by assigning the tag Domain:Join = True/False.

Automation of EC2 registration and deregistration from AD

First, let’s define expectations and put them on the table. Expected behavior:

  • When a new EC2 instance is started and the tag “Domain:Join” is set to True, the EC2 instance is joined to the AD.
  • When a new EC2 instance is started, but the tag “Domain:Join” is missing or set to False, the EC2 instance does not join AD.
  • When an already joined EC2 instance is terminated, the computer object is removed from AD.
  • When an EC2 instance that is not joined to AD is terminated, no objects are removed from AD.
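The four expectations above can be captured in a small decision function. This is only a sketch: the function names are hypothetical, the tag list uses the Key/Value shape returned by the EC2 API, and whether an instance is already present in AD is passed in as a plain boolean.

```python
from typing import Dict, List, Optional

JOIN_TAG = "Domain:Join"


def wants_join(tags: Optional[List[Dict[str, str]]]) -> bool:
    """True only when the Domain:Join tag exists and is set to 'True' (case-insensitive)."""
    for tag in tags or []:
        if tag.get("Key") == JOIN_TAG:
            return str(tag.get("Value", "")).strip().lower() == "true"
    return False


def decide(state: str, tags: Optional[List[Dict[str, str]]], in_ad: bool) -> str:
    """Map an instance state change to 'join', 'remove' or 'none'."""
    state = state.lower()
    if state in ("pending", "running"):
        # Join only tagged instances that AD does not know yet.
        return "join" if wants_join(tags) and not in_ad else "none"
    if state == "terminated":
        # Remove only computer objects that actually exist in AD.
        return "remove" if in_ad else "none"
    return "none"


if __name__ == "__main__":
    print(decide("pending", [{"Key": JOIN_TAG, "Value": "True"}], in_ad=False))
```

Keeping the decision free of any AWS calls makes it easy to test each of the four cases in isolation.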

AWS does not provide such functionality out of the box, but the services reviewed in the previous sections cover the individual operations. It means all these components need to be somehow “glued” together. The best general-purpose glue in the AWS environment is AWS Lambda, so let’s use it. AWS Lambda allows implementing logic as plain code without worrying about where the code will run. The only missing piece is a trigger that fires when an EC2 instance changes state. The simple answer here is AWS CloudWatch Events.
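As a rough illustration of the glue, the following sketch shows how a Lambda handler might dispatch on an EC2 state-change event. The event fields used here ("detail-type", "detail.instance-id", "detail.state") follow the shape of EC2 Instance State-change Notification events; the make_handler factory and the register/deregister callbacks are hypothetical placeholders, not the code from the repository.

```python
from typing import Callable

STATE_CHANGE = "EC2 Instance State-change Notification"


def make_handler(register: Callable[[str], None],
                 deregister: Callable[[str], None]):
    """Build a Lambda-style handler around two injected callbacks."""

    def handler(event, context=None):
        if event.get("detail-type") != STATE_CHANGE:
            return "ignored"
        detail = event.get("detail", {})
        instance_id = detail.get("instance-id")
        state = detail.get("state")
        if not instance_id:
            return "ignored"
        if state in ("pending", "running"):
            register(instance_id)
            return "register"
        if state == "terminated":
            deregister(instance_id)
            return "deregister"
        return "ignored"

    return handler


if __name__ == "__main__":
    handler = make_handler(lambda i: print("join", i), lambda i: print("remove", i))
    handler({"detail-type": STATE_CHANGE,
             "detail": {"instance-id": "i-0123456789abcdef0", "state": "terminated"}})
```

Injecting the two callbacks keeps the dispatch logic testable without any AWS access.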

The whole chain is depicted in the following diagram.

Service chain necessary for automatic registration and deregistration

Notably, the path performing the computer deregistration ends with a step that uses the LDAP protocol directly. The deregistration is performed neither from the EC2 instance nor via an AWS API call, but from the AWS Lambda function’s runtime environment, as AWS does not provide direct interaction with AD objects via its API. In other words, the AWS Lambda function needs an elastic network interface (ENI) attached to a subnet that has IP connectivity with AD.

Removing a computer object is a privileged operation, so the LDAP call needs to be authenticated. Storing credentials in the AWS Lambda function itself is bad practice, so the credentials should be kept in a secret store, e.g., AWS Secrets Manager. To simplify integration with other AWS processes, this process sticks to the naming convention used by AWS, which stores the credentials in AWS Secrets Manager under the path aws/directory-services/d-xxxxxxxxx/seamless-domain-join.
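Reading such a secret from the Lambda function might look like the sketch below. The client is passed in (in practice it would typically be created with boto3.client("secretsmanager")), and the JSON key names awsSeamlessDomainUsername and awsSeamlessDomainPassword are an assumption based on the AWS seamless domain join convention; verify them against the secret actually stored in your account.

```python
import json
from typing import Tuple


def secret_name(directory_id: str) -> str:
    """Build the secret path following the naming convention described above."""
    return f"aws/directory-services/{directory_id}/seamless-domain-join"


def load_join_credentials(secrets_client, directory_id: str) -> Tuple[str, str]:
    """Fetch (username, password) for the given directory from the secret store.

    `secrets_client` only needs a get_secret_value(SecretId=...) method.
    """
    response = secrets_client.get_secret_value(SecretId=secret_name(directory_id))
    payload = json.loads(response["SecretString"])
    # Key names are assumed; adjust if your secret uses different ones.
    return payload["awsSeamlessDomainUsername"], payload["awsSeamlessDomainPassword"]
```

The credentials stay out of the deployment package and the function's environment variables, and they can be rotated without redeploying the function.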

The computer registration operation is straightforward, as the AWS Lambda function only needs to retrieve the EC2 instance’s metadata using the DescribeInstances API call. If the EC2 tag Domain:Join=True is present, a new AWS SSM association is created. There is a tiny catch here. No specific AWS CloudWatch event indicates that an EC2 instance has been created; only state changes are propagated. The meaningful states for triggering EC2 registration are PENDING and RUNNING. The catch is that AD might already contain the computer object, because the same event is also triggered when an EC2 instance created a long time ago is stopped and started again. Therefore, the registration process needs to look into AD to decide whether to start the registration. The process might look like this:

procedure Registration
    event <- AWS CloudWatch event is received

    if event[state] != PENDING then STOP
    if tag Domain:Join of event[InstanceId] != True then STOP
    if event[InstanceId] is registered in AD then STOP

    Create SSM association object

END
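The last step of the registration procedure, creating the SSM association, could be sketched as follows. The request uses the parameter names of the managed AWS-JoinDirectoryServiceDomain document (directoryId, directoryName, dnsIpAddresses); the helper names are hypothetical, and a modified joining document may expect different parameters. The client is injected and would normally be boto3.client("ssm").

```python
from typing import Any, Dict, Sequence


def association_request(instance_id: str, directory_id: str, directory_name: str,
                        dns_ips: Sequence[str],
                        document: str = "AWS-JoinDirectoryServiceDomain") -> Dict[str, Any]:
    """Build the keyword arguments for an SSM create_association call."""
    return {
        "Name": document,
        "Targets": [{"Key": "InstanceIds", "Values": [instance_id]}],
        # SSM document parameters are always lists of strings.
        "Parameters": {
            "directoryId": [directory_id],
            "directoryName": [directory_name],
            "dnsIpAddresses": list(dns_ips),
        },
    }


def create_join_association(ssm_client, instance_id: str, directory_id: str,
                            directory_name: str, dns_ips: Sequence[str], **kwargs):
    """Create the association that makes SSM run the joining document on the instance."""
    request = association_request(instance_id, directory_id, directory_name,
                                  dns_ips, **kwargs)
    return ssm_client.create_association(**request)
```

Separating the request construction from the API call makes it possible to inspect the request in tests without touching AWS.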

Similarly, the deregistration process might look like this:

procedure DeRegistration
    event <- AWS CloudWatch event is received

    if event[state] != TERMINATED then STOP
    if event[InstanceId] is not registered in AD then STOP

    Delete the computer object from AD via LDAP protocol

END
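The LDAP part of the deregistration might be sketched like this. The connection object is assumed to behave like an already bound ldap3 Connection (search(), entries, delete()); how an instance ID maps to a computer name depends on the joining document, so computer_name is simply taken as an input here. All function names are hypothetical.

```python
def escape_filter(value: str) -> str:
    """Escape characters that are special in LDAP search filters (RFC 4515)."""
    special = "\\*()\x00"
    return "".join("\\%02x" % ord(ch) if ch in special else ch for ch in value)


def delete_computer(conn, base_dn: str, computer_name: str) -> bool:
    """Delete the computer object named `computer_name` below `base_dn`.

    Returns False when the object is missing or the match is ambiguous,
    so nothing is deleted by accident.
    """
    search_filter = "(&(objectClass=computer)(cn=%s))" % escape_filter(computer_name)
    conn.search(search_base=base_dn, search_filter=search_filter, attributes=["cn"])
    entries = list(conn.entries)
    if len(entries) != 1:
        return False
    return bool(conn.delete(entries[0].entry_dn))
```

Refusing to delete on an ambiguous match is a deliberate, conservative choice for a privileged operation.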

That is all.

The actual code is implemented in Python in function/main.py, and its deployment to AWS is again handled by Terraform in lambda.tf. Due to the necessary connectivity with AD via the LDAP protocol, an ENI has to be attached, resulting in extra configuration of security groups, etc. The AWS CloudWatch event triggers and their integration with the Lambda code are defined in lambda_ec2trigger.tf. Additionally, there are a few more lambda_xxx.tf files that tackle some related technicalities.

This Lambda function is implemented and delivered as a standard deployment package, but as of re:Invent 2020 it is also possible to deliver the functionality as a container image.

A tiny extra functionality is hidden in lambda_cron.tf, which uses the AWS CloudWatch cron-like service for triggering a cleanup. This cleanup is practically just a workaround for the brittle LDAP interface to AD. Sometimes the deregistration via LDAP times out, so the computer object stays in AD. The periodic cleanup scans all EC2 instances and cross-references them with the computer objects in AD. Deletion of the forgotten computer objects is then simply retried.
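The cross-referencing step of such a cleanup boils down to a set difference. The sketch below assumes that computer names in AD and the names derived from live EC2 instances are directly comparable (case-insensitively), which depends on how the joining document names the hosts; the function name and the protected parameter are hypothetical.

```python
from typing import Iterable, List


def stale_computers(ad_computers: Iterable[str], live_instances: Iterable[str],
                    protected: Iterable[str] = ()) -> List[str]:
    """Return AD computer names that have no matching live EC2 instance.

    Names listed in `protected` (e.g. machines that are not EC2 instances
    managed by this automation) are never reported as stale.
    """
    live = {name.lower() for name in live_instances}
    keep = {name.lower() for name in protected}
    return sorted(
        name for name in ad_computers
        if name.lower() not in live and name.lower() not in keep
    )


if __name__ == "__main__":
    print(stale_computers(["web1", "web2", "db1"], ["web1", "db1"]))
```

The returned names are the candidates for a retried LDAP delete; a protected list guards against removing computer objects that never corresponded to an EC2 instance.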

Conclusion and potential improvements

This topic has been quite crucial for me recently, but during the work, I have come across other modern technologies, such as Okta. I am afraid AD is nearing its end, but we will have it here for a few more years, so I believe it is vital to know it and to be able to use it in the cloud.

Bear in mind this implementation is just a proof of concept, so the code is ugly and not recommended for production usage, even though it can be downloaded and applied, and will work just fine. Nevertheless, there is room for improvement:

  • This code should be put under module structure
  • Runtime can be converted to either AWS Lambda layer or container image for better reuse
  • For multi-region use, the AWS Lambda function should be deployed only once in the same region as AD. Further, the AWS Lambda function should be attached to AWS SQS where other regions would be dropping the registration/deregistration requests.
  • Use default_domain_suffix in sssd.conf for simpler login names.

A former academic, doctor (Ph.D.) for networks. Operational research, distributed systems, and Linux lover. I am having fun with DevOps and my newborn daughter