Ansible best practices

April 24, 2017

Ansible can be summarized as a tool for running automated tasks on servers that requires nothing but Python installed on the remote side. Typically used as a configuration management framework, Ansible comes with a set of key benefits:

Has simple configuration with YAML, avoiding copy-paste by applying customizable "roles"
Uses inventories to scope and define the set of servers
Fosters repeatable "playbook" runs, i.e. applying the same configuration to a server twice should be idempotent
Doesn’t suffer from feature matrix issues because by design it is a framework, not a full-fledged solution for configuration management. You cannot say "it supports only web servers X and Y, but not Z", as in principle Ansible allows you to do anything that is possible through manual server configuration.

For a full introduction to Ansible, read the documentation first. This article assumes you have already made yourself familiar with the concepts and have made some attempts at getting Ansible working for a certain use case, but want some guidance on improving the way you are working with Ansible.

The company behind Ansible gives some official guidelines which mostly relate to file structure, naming and other common rules. While these are helpful, as they are not immediately common sense for beginners, only a fraction of Ansible’s features and complexity of larger setups are touched by that small set of guidelines.

I would like to present what I have learned in a little over two years of using Ansible, during which I have used it for a test environment at work (allowing developers to test systems like in production), for configuring my laptop and eventually for setting up this server and web application, and also my home server (a Raspberry Pi).

Why Ansible over other frameworks?

Honestly, I did not compare many alternatives because the Ansible environment at work already existed when I joined and soon I believed Ansible to be the best option. The usual suspects Chef and Puppet did not really please me because the recipes do not really look like "infrastructure as code", but are too declarative and hard to understand in detail without looking at many files — while in a typical Ansible playbook, the actions taken can be read top-down like code.
Many years ago, I built my own solution to deploy my personal web applications ("Site Deploy"; UI-based). As a hobby project, it never became popular or sophisticated enough, and eventually I learned that it suffers from the aforementioned feature matrix problem. Essentially it only supported the features relevant to me 🙄, without providing a framework to support anything on any server. Nevertheless, Site Deploy already had support for configuring hosts with their connection data and services, with the help of variable substitution in most places. Or in other words: the very basic concepts of Ansible.
Size of the user-base says a lot (cf. their 2016 recap)
Ansible aims at simple design, and becomes powerful by all the open-source modules to support services, applications, hardware, network, connections, etc.
No server-side, persistent component required. Only Python needed to execute modules. Usual connection type is SSH, but custom modules are available for other types.
Flat learning curve: once you understand the basic concepts (define hosts in inventory, set variables on different levels, write tasks in playbooks) and you know the commands/steps to configure a host manually, it’s easy to get started writing the same steps down in Ansible’s YAML format.
Put simply, Ansible combines a set of hosts (inventory) with a list of applicable tasks (playbooks & roles), customizable with variables (at different places), allowing you to use pre-defined or own task modules and plugins (connection, value lookup, etc.). If you rolled your own generic configuration management, you probably could not implement its principles much more simply. Since the concepts are so clearly separated, the source code (Python) is easy enough to read, if ever needed. Usually there are only two situations in which you need to look into the Ansible source code: learning how modules should be implemented and finding out about changed behavior when upgrading Ansible. The latter is not common and only happened to me when switching from Ansible 1.8/1.9.x to 2.2.x, which was quite a big step in features, deprecations and the architecture of the Ansible source code itself.
Change detection and idempotency. Whenever a task is run, there may be distinct outcomes: successfully changed, failed, skipped, unchanged. After running a playbook, you will have an overview of which tasks actually made changes on the target hosts. Usually, one would design playbooks in a way that running it a second time only gives "unchanged" outcomes, and Ansible’s modules support this idea of idempotency — for example, a command task can be marked as "already done that before, no changes required" by specifying creates: /file/created/by/command → once the file was successfully created, a repeated execution of the task module will not run the command again.
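A minimal sketch of the creates idea mentioned above. The command, its option and the marker file path are made up for illustration; the assumption is that the command itself writes the marker file when it succeeds.

```yaml
# Hypothetical initialization command. On the first run it is executed and
# reported as "changed". Once the marker file exists, Ansible does not run
# the command again and reports the task as unchanged.
- name: Initialize application data directory
  command: /usr/local/bin/myapp-init --data-dir /var/db/myapp
  args:
    creates: /var/db/myapp/.initialized
```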

Choose your type of environment

Before we jump into practice, we must first consider what kind of Ansible-based setup we want to achieve, which greatly depends on the environment: work/personal, production/staging/testing, mixture of those…

Testing

A test environment could have many faces: for instance, at my company we manage a separate Git repo for the test environment, unrelated to any production configuration and therefore very quick to modify for developers without lengthy code reviews or approval by devops, as no production system can be affected. Ansible is used to fully configure the system and our software within a virtual machine.

To spin up a VM, many solutions exist already — for instance Vagrant with a small provisioning script that installs everything required for Ansible (only Python 😉) in the VM. We use a small Fabric script to bootstrap a FreeBSD VM and networking before continuing with Ansible.

Staging/production

You should keep separate inventories for staging and production. If you don’t have staging, you should probably aim at automating staging setup with Ansible, since you already develop the production configuration in playbooks. But if you have both, the below recommendations apply.

Both non-production and production with one Ansible setup

When deploying both non-production and production environments from the same roles/playbooks, you must take care they don’t interfere with each other. For instance, staging must not send real e-mails to customers and should use different domain names. The main way to decide on applying non-production vs. production properties should be your use of inventories and variables. An example will be discussed below (dynamic inventory).
Careful: developers should not have live credentials such as SSH access to a production server, but should probably be able to manage testing/staging systems.
Use GPG encryption of sensitive files or other protection to keep unprivileged people from accessing production machines at all (mentioned in section Storing sensitive files).
A safe default choice for inventories is required, and the default should most probably not be production. This is described below in the section Ansible configuration.

Careful when mixing manual and automated configuration

If you already have a production system manually set up — which is almost always the case, at least for initial OS installation steps which cannot be done via Ansible on physical servers — making the switch to fully automated configuration via Ansible is not easy. You may want to introduce automation step-by-step.

There are many imaginable ways to achieve that migration. I want to propose what I would do, admittedly without any real-world experience because I do not manage any production systems as a developer.

Develop playbooks and maintain check mode and the --diff option. This is not always easy and sometimes unnerving because you have to think both in normal mode (read-write) and check mode (read-only) when writing tasks, and apply appropriate options for modules that can’t handle it themselves (like command):
- check_mode: no (previously called always_run: yes)
- changed_when
- If you use tags: apply tags: [ always ] to tasks that e.g. provide results for subsequent tasks
Take care when making manual changes to servers. While often okay and necessary to react quickly, ensure the responsible people (e.g. devops team) can reproduce the setup with playbooks sooner rather than later.
Use {{ ansible_managed }} to mark auto-generated files as such, so nobody unknowingly edits them manually
Automate as much setup as you can, but only the parts that you are able to implement via Ansible without risk. For example, if you fear that an automatic database setup could go horribly wrong (like overwrite the existing production database), then rely on your distrust and do those steps manually.
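To illustrate the options listed above, here is a small sketch of a read-only task that should also run in check mode. The program path, its options and the variable myapp_expected_schema_version are hypothetical.

```yaml
# Read-only query: safe to run even with --check, never reports "changed",
# and always runs because the next task depends on the registered result.
- name: Read currently installed schema version
  command: /usr/local/bin/myapp --print-schema-version
  register: _schema_version
  check_mode: no
  changed_when: false
  tags: [ always ]

# Modifying task: the command module cannot simulate this, so Ansible skips
# it in check mode; in normal mode it only runs if the versions differ.
- name: Upgrade schema
  command: /usr/local/bin/myapp --upgrade-schema
  when: _schema_version.stdout != myapp_expected_schema_version
```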

Directory structure

Some common directory layouts are already part of the official documentation. In addition, you may want to separate your playbooks into subdirectories of playbooks/ once your content grows too large. This cannot really be covered by best practices because the size and purpose of each project vary, so I leave it to you to decide when the time comes to "clean up". Note that if you use several playbook (sub-)directories and files relative to them (such as a custom library folder), you may have to symlink those into each directory containing playbooks.

Basic setup

It should be clear that Ansible uses text files and therefore should be versioned in a VCS like Git. Make sure you ignore files that should not be committed (for example in .gitignore: *.retry).
Add something like alias apl=ansible-playbook in your shell. Or do you want to type ansible-playbook all the time?
Require users to use at least a certain Ansible version, e.g. the latest version available in OS package managers at the time of starting your endeavors. You could have a little role check-preconditions doing this:

# Check and require certain Ansible version. You should document why that
# version is required, for instance:
#
# We require Ansible 2.2.1 or newer, see changelog
# (https://github.com/ansible/ansible/blob/devel/CHANGELOG.md#221-the-battle-of-evermore---2017-01-16):
# > Fixes a bug where undefined variables in with_* loops would cause a task
# > failure even if the when condition would cause the task to be skipped.
- name: Check Ansible version
  assert:
    that: '(ansible_version.major, ansible_version.minor, ansible_version.revision) >= (2, 2, 1)'
    msg: 'Please install the recommended version 2.2.1+. You have Ansible {{ ansible_version.string }}.'
  run_once: true

Ansible configuration

ansible.cfg allows you to tweak many settings to be a little saner than the defaults.

I recommend the following:

[defaults]
# Default to no fact gathering because it's slow and "explicit is better
# than implicit". Depending how you use variables, you may rather explicitly
# define variables instead of relying on facts. You can enable this on
# a per-playbook basis with `gather_facts: yes`.
gathering = explicit
# You should default either 1) to a non-risky inventory (not production)
# or 2) point to a nonexistent one so that the person explicitly needs to
# specify which one to use. I find the alternative 1) the least risky,
# because 2) may lead to people creating shortcuts to deploy to live machines
# which defeats the purpose of having a safer default here.
inventory = inventories/test
# Cows are scared of playbook developers
nocows = 1

# Point to your local collection of extras, e.g. roles
roles_path = ./roles

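[ssh_connection]
# Pipelining reduces the number of SSH operations needed per task. With sudo,
# it requires that `requiretty` is disabled in sudoers on the managed hosts.
pipelining = True
# Control socket path used for SSH connection multiplexing (ControlPersist)
control_path = /tmp/ansible-ssh-%%h-%%p-%%r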

Choosing a safe default for the inventory is obviously important, thinking about recent catastrophic events like the Amazon S3 outage that originated from a typo. Inventory names should not be confusable with each other, e.g. avoid using a prefix (inv_live, inv_test) because people hastily using tab completion may quickly introduce a typo.

If you are annoyed by *.retry files being created next to playbooks which hinders filename tab completion, an environment variable ANSIBLE_RETRY_FILES_SAVE_PATH lets you put them in a different place. For myself, I never use them as I’m not working with hundreds of hosts matching per playbook, so I just disable them with ANSIBLE_RETRY_FILES_ENABLED=no. Since that is a per-person decision, it should be an environment variable and not go into ansible.cfg.

Name tasks

While already outlined in the mentioned best practices article, I’d like to stress this point: names, comments and readability enable you and others to understand playbooks and roles later on. Ansible output on its own is too concise to really tell you the exact spot which is currently executing, and sometimes in large setups you will be searching for the spot where you canceled (Ctrl+C) or a task failed fatally. Naming every single task comes in handy here. Tooling like ARA may help as well, but I personally did not try it yet (overkill for me). After all we’re doing programming, and no reasonable language would allow you to make public functions unnamed/anonymous.

- name: 'Create directories for service {{ daemontools_service_name }}'
  file:
    state: directory
    dest: '{{ item }}'
    owner: '{{ daemontools_service_user }}'
  with_items: '{{ daemontools_service_directories }}'

In recent versions of Ansible, variables in the task name will be correctly substituted by their value in the console output, giving you visual feedback which part of the play is executing. That will be especially important once your configuration management project is growing and you run large collections of playbooks that execute a certain role (this example: daemontools_service) multiple times, for example to create a couple of permanent services.

Another advantage of this technique is that you can start where a play canceled/failed previously using the --start-at-task="Task name" option. That might not always work, e.g. if a task depends on a previously register:-ed variable, but is often helpful to save time by skipping all previously succeeded tasks. If you use static task names like "Install packages", then --start-at-task="Install packages" will start at the first occurrence of that task name in the play instead of a specific one ("Install dependencies for service XYZ").

Avoid skipping items

…because it might hurt idempotency. What if your Ansible playbook adds a cronjob based on a boolean variable, and later you change the value to false? Using when: my_bool (value now changed to no) will skip the task, leaving the cronjob intact even though you expected it to be removed or disabled.
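A minimal sketch of the alternative, assuming a hypothetical cleanup script and the boolean my_bool from above: instead of skipping the task, the variable decides the desired state, so switching it to false removes the cronjob on the next run.

```yaml
- name: Manage nightly cleanup cronjob
  cron:
    # The name identifies the crontab entry, so Ansible can find and remove it.
    name: nightly_cleanup
    job: /usr/local/bin/cleanup
    special_time: daily
    # `|bool` also handles string values like "no" coming from INI inventories.
    state: '{{ "present" if my_bool|bool else "absent" }}'
```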

Here’s a slightly more complicated example: I had to set up a service that should be disabled by default until the developer enables it (because it would log error messages all the time unless the developer had established a required, manual SSH tunnel). Considerations:

When configuring that service (let’s call the role daemontools_service; daemontools are great to set up and manage services on *nix), we cannot simply enable/disable the service conditionally: the service should only be disabled initially (first playbook run = service created for the first time on remote machine) and on boot, but its state should be untouched if the developer had already enabled the service manually. Or in other words (since that fact is not easy to find out), leave state untouched if the service was already configured by a previous playbook run (= idempotency).
You might also want an option to toggle enabling/disabling the service by default, so I’ll show that as well

- hosts: xyz

  vars:
    xyz_service_name: xyz-daemon

    # Knob to enable/disable service by default (on reboot, and after
    # initial configuration)
    xyz_always_enabled: true

  roles:
    - role: daemontools_service
      daemontools_service_name: '{{ xyz_service_name }}'
      # Contrived variable, leaving state untouched should be the default
      # behavior unless you want to risk in production that services are
      # unintentionally enabled or disabled by a playbook run.
      daemontools_service_enabled: 'do_not_change_state'
      daemontools_service_other_variables: ...

  tasks:
    - name: Disable XYZ service on boot
      cron:
        # We know that the role will symlink into /var/service,
        # as usual for daemontools
        job: "svc -d /var/service/{{ xyz_service_name }}"
        name: "xyz_default_disabled"
        special_time: "reboot"
        disabled: "{{ xyz_always_enabled }}"
        # ...or...
        # state: "{{ 'absent' if xyz_always_enabled else 'present' }}"
      tags: [ cron ]

    - name: Disable XYZ service initially
      # After *all* initial configuration steps succeeded, take the service
      # down (`svc -d`) and mark the service as created so we...
      shell: "svc -d /var/service/{{ xyz_service_name }} && touch /var/service/{{ xyz_service_name }}/.created"
      args:
        # ...don't disable the service again if playbook is run again
        # (as someone may have enabled the service manually in the meantime).
        creates: "/var/service/{{ xyz_service_name }}/.created"
      when: not xyz_always_enabled
      tags: [ cron ]

Use and abuse of variables

The most important principle for variables is that you should know which variables are used when looking at a portion of "Ansible code" (YAML). As an Ansible beginner, you might have 1) wondered a few times, or looked up, in which order of precedence variables are taken into account. Or 2) you might have just given up and asked the author what is happening there. As in software development, both 1) and 2) are fatal mistakes that hamper productivity: code must be readable (ideally top-down or by looking within the surrounding 100 lines) and understandable by colleagues and other contributors. The fact that you had to check the precedence at all shows the problem in the first place! Variables should be specified in exactly one place (or two places if a variable has a reasonable, overridable default value), as close as possible to their usage while still being at the relevant location. Most variables should ultimately be mandatory, so that Ansible loudly complains if a variable is missing. Let us look at a few examples to see what these basic rules mean.

[exampleservers]
192.168.178.34

[all:vars]
# Global helper variables.
#
# I tend to use these specific ones because when inside a role, Ansible 1.9.x
# did not correctly find files/templates in some cases (if called from playbook
# or dependency of other role). Not sure if that is still required for 2.x,
# so don't copy-paste without understanding the need! These are really
# just examples.
my_playbooks_dir={{ inventory_dir + "/../playbooks" }}
my_roles_dir={{ inventory_dir + "/../roles" }}

# With dynamic inventories, you can structure your per-host and per-group
# variables in a nicer way than this INI file top-down format. If you use
# INI files, at least try to create some structure, like alphabetical sorting
# for hosts and groups.
[exampleservers:vars]
# Here, put only variables that belong to matching servers in general,
# not to a functional component
ansible_ssh_user=dog

Let’s look at an example role "mysql" which installs a MySQL server, optionally creates a database and then optionally gives privileges to the database (also allows value * for all databases) to a user:

# ...contrived excerpt...
- name: Ensure database {{ database_name }} exists
  mysql_db:
    name: 'ourprefix_{{ database_name }}'
  when: database_name is defined and database_name != "*"

- name: Ensure database user {{ database_user }} exists and has access to {{ database_name }}
  mysql_user:
    name: '{{ database_user }}'
    password: '{{ database_password }}'
    priv: '{{ database_name }}.*:ALL'
    host: '%'
  when: database_user is defined and database_user
# ...

The good parts first:

Once database_user is given, the required variable database_password is mandatory, i.e. not checked with another database_password is defined.
Variables used in task names, so that Ansible output clearly tells you what exactly is currently happening

But many things should be fixed here:

Role (I called this example role "mysql") is doing way too many things at once without having a proper name. It should be split up into several roles: MySQL server installation, database creation, user & privilege setup. If you really find yourself doing these three things together repeatedly, you can still create an uber-role "mysql" that depends on the others.
Role variables should be prefixed with the role name (e.g. mysql_database_name) because Ansible has no concept of namespaces or scoping these variables only to the role. This helps finding out quickly where a variable comes from. In contrast, host groups in Ansible are a way to scope variables so they are only available to a certain set of hosts.
The database name prefix ourprefix_ seems to be a hardcoded string. First of all, this led to a bug — privileges are not correctly applied to the user in the second task because the prefix was forgotten. The hardcoded string could be an internal variable (mark those with an underscore!) defined in the defaults file roles/mysql/defaults/main.yml: _database_name_prefix: 'ourprefix_' # comment describing why it’s hardcoded, and must be used wherever applicable. Whenever the value needs changing, you only need to touch one location.
The special value database_name: '*' must be considered. Because the role has more than one responsibility (remember software engineering best practices?!), the variables have too many meanings. As said, there had better be a role "mysql_user" that only handles user creation and privileges — inside such a scoped role, using one special value turns out to be less bug-prone.
database_user is defined and database_user is again only necessary because the role is doing too much. In general, you should almost never use such a conditional. For no real reason, an empty value is allowed in principle, and the task is skipped in that case, and also if the variable is not specified. Once you decide to rename the variable and forget to replace one occurrence, you suddenly always skip the task. Whenever you can, let Ansible complain loudly when a variable is undefined, instead of e.g. skipping a task conditionally. In this example, splitting up the role is the solution to immediately make the variables mandatory. In other cases, you could introduce a default value for a role variable and allow users to override that value.
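As a sketch of how the database creation part could look after applying these fixes: the role name mysql_database and its variable names are my assumptions for illustration, following the naming rules above.

```yaml
# roles/mysql_database/tasks/main.yml (hypothetical role split out of "mysql")
#
# Assumes roles/mysql_database/defaults/main.yml defines the internal variable
#   _mysql_database_name_prefix: 'ourprefix_'
# and that callers must pass mysql_database_name. It has no default and no
# `is defined` check, so Ansible fails loudly if it is missing.
- name: 'Ensure database {{ _mysql_database_name_prefix }}{{ mysql_database_name }} exists'
  mysql_db:
    name: '{{ _mysql_database_name_prefix }}{{ mysql_database_name }}'
```

The prefix is defined once and would be used by every task that needs the full database name, which avoids the forgotten-prefix bug from the excerpt.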

Other practices regarding variables and their values and inline templates:

Consistently name your variables. Just like code, Ansible plays should be grep-able. A simple text search through your Ansible setup repo should immediately find the source of a variable and other places where it is used.
Avoid indirections like includes or vars_files if possible to keep relevant variables close to their use. In some cases, these helpers can shorten repeated code, but usually they just add one more level of having to jump around between files to grasp where a value comes from.
Don’t use the special one-line dictionary syntax mysql_db: name="{{ database_name }}" state="present" encoding="utf8mb4". YAML is very readable per se, so why use Ansible’s crippled syntax instead? It’s okay to use for single-variable tasks, though.
On the same note, remove defaults which are obvious, such as the usual state: present. The "official" blog post on best practices recommends otherwise, but I like to keep code short and boilerplate-less.
Decide on one quoting style and use it consistently: double quotes (dest: "/etc/some.conf") or single quotes (dest: '/etc/some.conf'), plus the decision whether you quote things that don’t need it (dest: /etc/some.conf). Keep in mind that dest: {{ var }} is not possible (must be quoted), and that an unquoted mode: 755 (chmod) will give an unexpected result because it is read as a decimal number, so the recommended practice is of course mode: '0755'.
Also decide on one style for spacing and writing Jinja templates. I prefer dest: '{{ var|int + 5 }}' over dest: '{{var | int + 5}}' but only staying consistent is key, not the style you choose.
You don’t need --- at the top of YAML files. Just leave it out unless you know what it means.

More rules can be shown best in a playbook example:

- hosts: web-analytics-database

  vars:
    # Under `vars`, only put variables that really must be available in several
    # roles and tasks below. They have high precedence and therefore are prone
    # to clash with other variables of the same name (if you didn't follow
    # the principle of only one definition), or may set a value in one of the
    # below roles that you didn't want to be set! Therefore the role name
    # prefix is so important (`mysql_user_name` instead of `username` because
    # the latter might also be used in many other places and is hard to grep
    # for if used all over the place).

    # When writing many playbooks, you probably don't want to hardcode your
    # DBA's username everywhere, but define a variable `database_admin_username`.
    # The rule of putting it as close as possible to its use tells you to
    # create a group "database-servers" containing all database hosts and put
    # the variable into `group_vars/database-servers.yml` so it's only available
    # in the limited scope.
    # Using variable name prefix `wa_` for "web analytics" as example.
    wa_mysql_user_name_prefix: '{{ database_admin_username }}'

  roles:
    - role: mysql_server

      # [Comment describing why we chose MySQL 5.5...]
      # Alternatively (but more risky than requiring it to be defined explicitly),
      # this might have a default value in the role, stating the version you
      # normally use in production.
      mysql_server_version: '5.5'

    # Admin with full privileges
    - role: mysql_user
      mysql_user_name: '{{ wa_mysql_user_name_prefix }}_admin'

      # This should not have a default. Defaulting to `ALL` means that on a
      # playbook mistake, a new user may get all privileges!
      mysql_user_privileges: 'ALL'

      # Production passwords should not be committed to version control
      # in plaintext. See article section "Storing sensitive files".
      mysql_user_password: '{{ lookup("gpgfile", "secure/web-analytics-database.password") }}'

    # Read-only access
    - role: mysql_user
      mysql_user_name: '{{ wa_mysql_user_name_prefix }}_readonly'
      mysql_user_privileges: 'SELECT'
      mysql_user_password: '{{ lookup("gpgfile", "secure/web-analytics-database.readonly.password") }}'

  tasks:
    # With well-developed roles, you don't need extra {pre_}tasks!

sudo only where necessary

The command failed, so I used sudo command and it worked fine. I’m now doing that everywhere because it’s easier.

It should be obvious to devops people, and hopefully also software developers, how very wrong this is. Just like you would not do that for manual commands, you also should not use become: yes globally for a whole playbook. Better only use it for tasks that actually need root rights. The become flag can be assigned to task blocks, avoiding repetition.

Another downside of "sudo everywhere" is that you have to take care of owner/group membership of directories and files you create, instead of defaulting to creating files owned by the connecting user.
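A short sketch of limiting become to a block. The host group is taken from the inventory example above; the file names and the nginx service are placeholders.

```yaml
- hosts: exampleservers

  tasks:
    # Runs as the connecting user, so the directory is owned by that user.
    - name: Create application directory in home
      file:
        state: directory
        dest: '~/app'

    # Only these tasks need root rights; `become` is set once for the block.
    - block:
        - name: Install web server configuration
          copy:
            src: nginx.conf
            dest: /etc/nginx/nginx.conf

        - name: Ensure web server is running
          service:
            name: nginx
            state: started
      become: yes
```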

Assertions

If you ever had to debug a case where a YAML dictionary was missing a key, you will know how bad Ansible is at telling you where an error came from (it does not even tell you the dictionary variable name). I have found my own way to deal with that: assert a condition before actually running into the default error message. Only a very simple plugin is required. I opened a pull request already but the maintainers did not like the approach. Still I will recommend it here because of practical experience.

In ansible.cfg, ensure you have:

filter_plugins = ./plugins/filter

Then add the plugin plugins/filter/assert.py:

from ansible import errors


def _assert(value, msg=''):
    # You can leave this condition away if you think it's too strict.
    # It's supposed to help find typos and type mistakes in assertion conditions.
    if not isinstance(value, bool):
        raise errors.AnsibleFilterError('assert filter requires boolean as input, got %s' % type(value))

    if not value:
        raise errors.AnsibleFilterError('assertion failed: %s' % (msg or '<no message given>',))
    return ''


class FilterModule(object):
    filter_map = {
        'assert': _assert,
    }

    def filters(self):
        return self.filter_map

And use it like so:

- name: My task
  command: 'somecommand {{ (somevar|int > 5)|assert("somevar must be number > 5") }}{{ somevar }}'

This will only be able to test Jinja expressions, which are mostly but not 100% Python, but that should be enough.

Less code by using repetition primitives

Ever written something like this?

- name: Do something with A
  command: dosomething A
  args:
    creates: /etc/somethingA
  when: '{{ is_admin_user["A"] }}'

- name: Do something with B
  command: dosomething --a-little-different B
  args:
    creates: /etc/somethingB
  when: '{{ is_admin_user["B"] }}'

A little exaggerated, but chances are that you suffered from copy-pasting too much Ansible code a few times in your configuration management career, and had the usual share of copy-paste mistakes and typos. Use with_items and friends to your advantage:

- name: Do something with {{ item.name }}
  # At a task-level scope, it's totally okay to use non-mandatory variables
  # because you have to read only these few lines to understand what it's
  # doing. Use quoting if you want to support e.g. whitespace in values - just
  # saying, of course it's unusual on *nix...
  command: 'dosomething {{ item.args|default("") }} "{{ item.name }}"'
  args:
    creates: '/etc/something{{ item.name }}'
  # This is again following the rule of mandatory variables: making dictionary
  # keys mandatory protects you from typos and, in this case, from forgetting
  # to add people to a list. Get a good error message instead of just
  # `KeyError: B` by using the aforementioned assert filter. The parentheses
  # are required because a Jinja filter binds more tightly than `in`.
  when: '{{ (item.name in is_admin_user)|assert("User " + item.name + " missing in is_admin_user") }}{{ is_admin_user[item.name] }}'
  with_items:
    - name: A
    - name: B
      args: '--a-little-different'

More readable (once it gets bigger than my contrived example), and still does the same thing without being prone to copy-paste mistakes and complexity.

Idempotency done right

This term was already mentioned a few times above. I want to give more hints on how to achieve repeatable playbook runs. "Idempotent" effectively means that on the second run, everything is green and no actual changes happened, which Ansible calls "ok" but in a well-developed setup means "unchanged" or "read-only action was performed".

The advantages should be pretty clear: not only can you see the exact --diff of what would happen on remote servers but also it gives visual feedback of what has really changed (even if you don’t use diff mode).

Only a few considerations are necessary when writing tasks and playbooks, and you can get perfect idempotency in most cases:

Avoid skipping items in certain cases (explained above)
Often you need a command or shell task to perform very specific work. These tasks are always considered "changed" unless you define e.g. the creates argument or use changed_when.
Example: changed_when: _previously_registered_process_result.stdout == ''
On the same note, you may want to use failed_when in special cases, for example if a program exits with code 0 even on errors.
Always use the same inputs. For example, don’t write a new timestamp into a file at every task run, but detect that the file is already up-to-date and does not need to be changed.
Use built-in modules like lineinfile, file, synchronize, copy and template, which support the relevant arguments to get idempotency if used right. They also typically fully support check mode and other features that are hard to achieve yourself. Avoid command/shell if built-ins can be used instead.
The argument force: no can be used with some modules to ensure that a task only takes effect once. For instance, if you want a configuration template copied once if it does not exist yet, but want to manage it manually or with other tools afterwards, use copy with force: no. The file is then only uploaded if it is missing, and repeated runs leave the existing remote file untouched. This is not exactly related to idempotency, but is sometimes a valid use case.
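
The "same inputs, detect the up-to-date state" rule can also be sketched outside of Ansible. The following is only an illustration in Python, not how lineinfile or any other module is implemented; the function name ensure_line and the file name are made up for this example. The function changes the file only if the desired line is missing and reports whether it changed anything, which is the same information Ansible shows as "changed" versus "ok".

```python
import os
import tempfile


def ensure_line(path, line):
    """Make sure `line` is present in the file at `path`.

    Returns True if the file had to be modified ("changed") and False
    if it was already in the desired state ("ok").
    """
    content = ''
    if os.path.exists(path):
        with open(path) as f:
            content = f.read()
    if line in content.splitlines():
        # Already up to date: do not touch the file at all.
        return False
    with open(path, 'a') as f:
        # Do not glue the new line onto an unterminated last line.
        if content and not content.endswith('\n'):
            f.write('\n')
        f.write(line + '\n')
    return True


if __name__ == '__main__':
    # Hypothetical file, created in a temporary directory for the demo.
    demo_path = os.path.join(tempfile.mkdtemp(), 'example.conf')
    print(ensure_line(demo_path, 'max_connections = 100'))  # first run
    print(ensure_line(demo_path, 'max_connections = 100'))  # second run
```

The first call reports a change and the second one does not, because the input is the same and the target state has already been reached. A task that appended a fresh timestamp on every run could never reach that second state.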

Leverage dynamic inventory

Who wants to fiddle around carefully in check mode every time they change a production system, when there’s a staging environment that can tolerate downtime if something goes wrong? Dynamic inventories can help separate staging and production in the most readable and (you guessed it) dynamic way.

Separate environments like test, staging or production of course have different properties like

IP addresses and networks
Host and domain names (FQDN)
Set of hosts. Production software may be distributed to multiple servers, while your staging may simply be installed on one server or virtual machine.
Other values

Ideally, all of these should be specified in variables, so that you can use different values for each environment in the respective inventory, but with consistent variable names. In your roles and playbooks, you can then mostly ignore the fact that you have different environments. The exception is tasks that should run only in production, or never in production, and even that should be decided by a variable (→ when: not is_production).

Check the official introduction to Dynamic Inventories and Developing Dynamic Inventory Sources to understand my example inventory script. It forces the domain suffix .test for the "test" environment, and no suffix for the "live" environment.

#!/usr/bin/env python
from __future__ import print_function
import argparse
import json
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

# One way to go "dynamic": decide inventory type (test, staging, production)
# based on inventory directory. Remember that Ansible reads the inventory
# sources it finds if you specify a directory as inventory. Symlinking the
# same script into different directories allows you to use one inventory
# script for several environments. This relies on `os.path.abspath` above
# not resolving symlinks (unlike `os.path.realpath`).
IS_LIVE = {'live': True, 'test': False}[os.path.basename(SCRIPT_DIR)]
DOMAIN_SUFFIX = '' if IS_LIVE else '.test'


host_to_vars = {
    'first': {
        'public_ip': '1.2.3.4',
        'public_hostname': 'first.mystuff.example.com',
    },
    'second': {
        'public_ip': '1.2.3.5',
        'public_hostname': 'second.mystuff.example.com',
    },
}
groups = {
    'webservers': ['first', 'second'],
}


# Avoid human mistakes by applying test settings everywhere at once (instead
# of inline per-variable)
for host, variables in host_to_vars.items():
    if 'public_hostname' in variables:
        # Just an example. Realistically you may want to change `public_ip`
        # as well, plus other variables that differ between test and production.
        variables['public_hostname'] += DOMAIN_SUFFIX


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--debug', action='store_true', default=False)
    parser.add_argument('--host')
    parser.add_argument('--list', action='store_true', default=False)
    args = parser.parse_args()

    def printJson(v):
        print(json.dumps(v, sort_keys=True, indent=4 if args.debug else None, separators=(',', ': ' if args.debug else ':')))

    if args.host is not None:
        printJson(host_to_vars.get(args.host, {}))
    elif args.list:
        # Allow Ansible to only make one call to this script instead
        # of one per host.
        # See https://docs.ansible.com/ansible/dev_guide/developing_inventory.html#tuning-the-external-inventory-script
        groups['_meta'] = {
            'hostvars': host_to_vars,
        }
        printJson(groups)
    else:
        parser.print_usage(sys.stderr)
        print('Use either --host or --list', file=sys.stderr)
        sys.exit(1)

Much more customization is possible with dynamic inventories. Another example: in my company, we use FreeBSD servers with our software installed and managed in jails. For developer testing, we have an Ansible setup to roughly resemble the production configuration. Unfortunately, at the time of writing, Ansible does not directly support configuration of jails or a concept of "child hosts". Therefore, we simply created an SSH connection plugin to connect to jails. Each jail looks like a regular host to Ansible, with the special naming pattern jailname@servername. Our dynamic inventory allows us to easily configure the hierarchy of groups > servers > jails and all their variables.
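
To give an idea of how such a hierarchy could be flattened for Ansible, here is a rough sketch. It is not our actual inventory script: the data layout, the function name build_inventory, the variables jail_name and jail_server, and all host names are invented for illustration. Only the jailname@servername pattern is taken from the description above, and a custom connection plugin would still be needed to actually reach the jails.

```python
import json

# Hypothetical hierarchy: group > server > jails (all names are made up).
HIERARCHY = {
    'webservers': {
        'server1': ['web', 'cache'],
    },
    'databases': {
        'server2': ['db'],
    },
}


def build_inventory(hierarchy):
    """Flatten groups > servers > jails into `--list` style data.

    Every jail becomes its own inventory host named `jailname@servername`.
    """
    inventory = {'_meta': {'hostvars': {}}}
    for group, servers in sorted(hierarchy.items()):
        hosts = []
        for server, jails in sorted(servers.items()):
            for jail in jails:
                name = '{}@{}'.format(jail, server)
                hosts.append(name)
                # Hypothetical variables that a connection plugin or a
                # playbook could read to find the jail and its server.
                inventory['_meta']['hostvars'][name] = {
                    'jail_name': jail,
                    'jail_server': server,
                }
        inventory[group] = {'hosts': hosts}
    return inventory


if __name__ == '__main__':
    print(json.dumps(build_inventory(HIERARCHY), sort_keys=True, indent=4))
```

Keeping the hierarchy as plain nested data means that adding a jail is a one-line change, while the naming pattern is applied in exactly one place.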

For personal and simple setups, in which only a few servers are involved, you might as well just use the INI-style inventory file format that Ansible uses by default. For the above example inventory, that would mean splitting it into two files, test.ini and live.ini, and managing them separately.

Dynamic inventories have one major downside compared to INI files: they don’t allow text diffs. In other words, you see the script change when looking at your VCS history, not the inventory diff. If you want a more explicit history, you may want a different setup: auto-generate INI inventory files with some script or template, then commit the INI files whenever you change something. Of course, you will have to make sure to actually re-generate the files (potential for human mistakes!). I will leave that decision to you as an exercise.
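
A generator for such INI files could look roughly like the following sketch. It reuses the example data from the inventory script above; the function name render_ini is made up, and the code assumes that variable values contain no whitespace (otherwise they would need quoting in the INI format).

```python
# Example data copied from the inventory script above.
HOST_TO_VARS = {
    'first': {
        'public_ip': '1.2.3.4',
        'public_hostname': 'first.mystuff.example.com',
    },
    'second': {
        'public_ip': '1.2.3.5',
        'public_hostname': 'second.mystuff.example.com',
    },
}
GROUPS = {
    'webservers': ['first', 'second'],
}


def render_ini(groups, host_to_vars):
    """Render an INI-style inventory as text.

    Groups, hosts and variables are sorted so that repeated runs with
    the same data produce identical text, which keeps VCS diffs small.
    Assumption: values contain no whitespace and need no quoting.
    """
    lines = []
    for group in sorted(groups):
        lines.append('[{}]'.format(group))
        for host in sorted(groups[group]):
            variables = host_to_vars.get(host, {})
            pairs = ['{}={}'.format(key, value)
                     for key, value in sorted(variables.items())]
            lines.append(' '.join([host] + pairs))
        lines.append('')
    return '\n'.join(lines)


if __name__ == '__main__':
    print(render_ini(GROUPS, HOST_TO_VARS))
```

Writing the result to test.ini and live.ini and committing both files would then give you a readable inventory diff next to every script change.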

Modern Ansible features

You may have introduced Ansible years back when it was still at v1.x or earlier, but the framework is under very active development by both Red Hat and the community. Ansible 2.0 introduced many powerful features and preparations for future improvements:

Task blocks (try-except-finally, spelled block, rescue and always in Ansible): useful to perform cleanups if a block of tasks should be applied "either all or none of the tasks". They can also reduce repeated code because you can apply when, become and other flags to a whole block.
Dynamic includes: you can now use variables in includes, e.g. - include: 'server-setup-{{ environment_name }}.yml'
Conditional roles are nothing new. I had some trouble with related bugs in 1.8.x, but those are apparently resolved, and role: […] when: somecondition can help in some use cases to make code cleaner (similar to task blocks).
Plugins were refactored to cater for clean, more maintainable APIs, and more changes will come in 2.x updates (like the persistent connections framework). Migrating your own library to 2.x should be simple in most cases.

Off-topic: storing sensitive files

For this special use case, I don’t have a recommendation since I never compared different approaches.

Vault support seems to be a good start, but appears to only support protection by a single password, which you then have to share among the team.

Several built-in lookups exist for password retrieval and storage, such as "password" (only supports plaintext) and Ansible 2.3’s "passwordstore".

In my company, we store somewhat sensitive files (such as passwords for financial test systems) in our developers' Ansible test environment repository, but in GPG-encrypted form. A script contains a list of files and people and encrypts the files. The encrypted .gpg files are committed, while original files should be in .gitignore. Within playbooks, we use a lookup plugin to decrypt the respective files. That way, access can be limited to a "need to know" group of people. While this is not tested for production use, it may be an idea to try and incorporate this extra level of security if you are dealing with sensitive information.
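
The encryption script described above might be sketched as follows. This is not our real script: the file path, the recipients and the function names are invented, and only standard gpg options (--batch, --yes, --output, --recipient, --encrypt) are used. The decrypting lookup plugin is not shown.

```python
import subprocess

# Hypothetical mapping: plaintext file -> GPG recipients allowed to read it.
RECIPIENTS = {
    'secrets/payment-test.password': ['alice@example.com', 'bob@example.com'],
}


def gpg_encrypt_command(path, recipients):
    """Build the gpg command line that encrypts `path` into `path`.gpg."""
    command = ['gpg', '--batch', '--yes', '--output', path + '.gpg']
    for recipient in recipients:
        command += ['--recipient', recipient]
    # Options first, then the command and the file it operates on.
    command += ['--encrypt', path]
    return command


def encrypt_all(mapping, run=subprocess.check_call):
    """Encrypt every file in `mapping` for its list of recipients.

    `run` can be replaced (e.g. by `print`) to inspect the commands
    without executing gpg.
    """
    for path, recipients in sorted(mapping.items()):
        run(gpg_encrypt_command(path, recipients))


if __name__ == '__main__':
    # Dry run: only show what would be executed.
    encrypt_all(RECIPIENTS, run=print)
```

Because every file has its own recipient list, removing a person from one list and re-encrypting is enough to take away their access to future versions of that file.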

Conclusion

Ansible can become complex and overwhelming after developing playbooks the wrong way for a long time. Just like for source code, readability, simplicity and common practices do not come naturally and yet are important to keep your Ansible code base lean and understandable. I’ve shown basic and advanced principles and some examples to structure your setup. Many things are left out of this general article, either because I have no experience with them yet (like Ansible Galaxy) or because they would just be too much for an introductory article.

Happy automation!