As many of you already know, Ansible AWX is a great and powerful tool for managing your infrastructure. However, even one unreachable host results in the entire AWX job being reported as failed, regardless of the outcome on all other hosts. If you would like to ignore unreachable hosts and prevent jobs from being marked as failed just because of host reachability, keep reading…

The problem

When you run jobs against your entire large inventory (for example to roll out compliance settings, update a CMDB, or run some checks), you will very often target some hosts that are simply down for whatever reason.

A single host that is powered down or not reachable is enough to make ansible-playbook return exit code 4 (BTW, here you can find an excellent description of all Ansible exit codes and when each one is produced).

AWX decides solely on the basis of that exit code whether the entire job was successful or not. In the case of unreachable hosts, the entire job is reported as failed.

Because of this, such an AWX job will very often end up failed. It can therefore be hard to tell whether there is really a problem with your Ansible code (and the playbook fails on all hosts) or whether the failure is just due to unreachable hosts.

Of course you can analyze the task output each time, but especially when working with slices you would need to analyze the output of each slice manually… only to find out that the cause is, once again, at least one unreachable host.

Solution

There is a quite simple solution that uses only Ansible means, as this is where the problem comes from. Prepend the main play in the playbook with another one that checks the reachability of the hosts, then execute the main play only on the reachable ones. That way, if you get an error for the entire job, you know for sure that there is a problem that needs to be addressed.

Here is what such a playbook might look like:

- name: Check host reachability
  gather_facts: false
  hosts: "{{ target_hosts }}"
  tasks:
    - name: Check connection via explicit fact gathering
      ansible.builtin.setup:
      register: reachable_check
    - meta: clear_host_errors
    - name: Add to un-/reachable group
      ansible.builtin.group_by:
        key: "{{ reachable_check is unreachable | ternary('unreachable','reachable') }}_hosts"
             
- name: MAIN PLAY - Actions with reachable hosts
  hosts: reachable_hosts
  gather_facts: true
  tasks:
    - ....
  roles:
    - ...

- name: SUMMARY Play (optional)
  hosts: localhost
  gather_facts: false
  tasks:
    - name: SUMMARY
      ansible.builtin.debug:
        msg: "Playbook executed on {{ groups['reachable_hosts'] | default([]) | length }}/{{ q('ansible.builtin.inventory_hostnames', target_hosts) | length }} hosts that were reachable."
    - name: Unreachable hosts with reason for unreachability
      ansible.builtin.debug:
        msg: "{{ hostvars[item]['reachable_check']['msg'] }}"
      # default([]) avoids an undefined-variable error when every host was reachable
      loop: "{{ groups['unreachable_hosts'] | default([]) }}"

With this extra play in front, each host is assigned to either the reachable_hosts or the unreachable_hosts virtual inventory group. Because the next play runs only against reachable_hosts, the main playbook code is executed only on hosts that were reachable at the time the job started. As a bonus, you get a list of the hosts that are not reachable, should you need it, for example for the summary in the very last play.
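The playbook above expects target_hosts to be supplied from outside, since it is used in the hosts: line of the first play. A minimal sketch of what the extra variables of an AWX job template could contain, assuming a hypothetical inventory group named webservers:

```yaml
# Hypothetical extra variables for the job template.
# "webservers" is an assumed group name; any valid host pattern should work here.
target_hosts: "webservers"
```

On the command line, the equivalent would be passing the same variable with -e to ansible-playbook.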

Also, if you use clear_host_errors, be aware that at the time of writing there was a still-open bug in Ansible that results in unexpected behavior when there are many tasks between the failing one and the meta task. Make sure that the bug has already been fixed, or that there is just one task before the meta task that might fail due to an unreachable host.

If calling ansible.builtin.setup is not wanted (for example when using fact caching and explicit fact collection takes too much time or resources), you may use any other task, such as ansible.builtin.ping (or ansible.windows.win_ping on Windows), as long as it really connects to the host (some do not) and is able to produce the unreachable status (here again, not all do).
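As a rough sketch of that variant, the check play could use ansible.builtin.ping in place of the fact gathering. Everything else follows the first play of the playbook above; the task names are only illustrative, and for Windows targets the ping task would have to be swapped for ansible.windows.win_ping (assuming the ansible.windows collection is installed):

```yaml
# Sketch only: same reachability check, but without collecting facts.
- name: Check host reachability
  hosts: "{{ target_hosts }}"
  gather_facts: false
  tasks:
    - name: Check connection without collecting facts
      ansible.builtin.ping:
      register: reachable_check

    # Keep this directly after the single task that may hit an unreachable host.
    - name: Bring unreachable hosts back into the play
      ansible.builtin.meta: clear_host_errors

    - name: Add to un-/reachable group
      ansible.builtin.group_by:
        key: "{{ (reachable_check is unreachable) | ternary('unreachable', 'reachable') }}_hosts"
```

Since ping returns only a tiny result, the registered variable stays small compared to a full set of facts.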

Solution 2

If you do not need extra information about unreachable hosts in the summary, you may skip the registered variable and just declare the reachable_hosts group like this:

...
  tasks:
    - name: Check connection via explicit fact gathering
      ansible.builtin.setup:
    - name: Add to reachable group
      ansible.builtin.group_by:
        key: "reachable_hosts"
    - meta: clear_host_errors
...

This way you save the runtime inventory some bytes by not duplicating all collected facts. This might seem like nothing, but at a scale of thousands of hosts memory can become an issue.

Another solution

This is of course not the only working solution (in Ansible there are always multiple ways to achieve the same result, some more performant than others). It is also possible to use block: with rescue: instead of meta: clear_host_errors, like this:

- name: Check host reachability
  gather_facts: false
  hosts: "{{ target_hosts }}"
  tasks:
    - block:
        - name: Check connection
          ansible.builtin.wait_for_connection:
            timeout: 10
          register: reachable_check
          # ignore_unreachable: true
        - name: Add to reachable hosts
          ansible.builtin.group_by:
            key: "reachable_hosts"
      rescue:
        - name: Add to unreachable hosts
          ansible.builtin.group_by:
            key: "unreachable_hosts"

- name: MAIN PLAY - Actions with reachable hosts
  hosts: reachable_hosts
  gather_facts: true
  tasks:
...

Here ansible.builtin.wait_for_connection has been used instead of ansible.builtin.setup to save some space in the runtime inventory (especially when using fact caching, where explicit fact gathering is not wanted). This approach, however, has some drawbacks:

  • ansible.builtin.wait_for_connection does not return the unreachable status, only a failure. Changing it to something that returns unreachable would require either ignore_unreachable: true on the task to convert unreachable into failed (as rescue does not heal unreachable hosts), or meta: clear_host_errors as the last task outside the block (do not use this meta inside the block, as weird things happen).
  • The rescue: block heals failed tasks, and there is no information about the failure or unreachability (only rescued in the standard Ansible summary at the very bottom).
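One possible way to soften the second drawback is to store the failure message inside rescue:, where Ansible exposes the result of the failed task as ansible_failed_result. The following is only a sketch under that assumption: unreachable_reason is a hypothetical fact name, and I have not checked the behavior against every Ansible version.

```yaml
# Sketch only: unreachable_reason is a hypothetical fact name.
- name: Check host reachability
  hosts: "{{ target_hosts }}"
  gather_facts: false
  tasks:
    - name: Sort hosts into reachable and unreachable groups
      block:
        - name: Check connection
          ansible.builtin.wait_for_connection:
            timeout: 10
        - name: Add to reachable hosts
          ansible.builtin.group_by:
            key: "reachable_hosts"
      rescue:
        # set_fact runs on the controller, so it needs no connection to the host
        - name: Remember why the connection check failed
          ansible.builtin.set_fact:
            unreachable_reason: "{{ ansible_failed_result.msg | default('no message returned') }}"
        - name: Add to unreachable hosts
          ansible.builtin.group_by:
            key: "unreachable_hosts"

- name: SUMMARY Play (optional)
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Unreachable hosts with the recorded reason
      ansible.builtin.debug:
        msg: "{{ item }}: {{ hostvars[item]['unreachable_reason'] | default('unknown') }}"
      loop: "{{ groups['unreachable_hosts'] | default([]) }}"
```

The hosts still show up only as rescued in the standard recap, but the summary play at least prints the reason per host.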

Summary

The first proposed solution works in such a way that AWX shows the number of unreachable hosts next to the number of total and/or failed hosts at the very top of the job summary, and I personally like it better.

[Figure: number of unreachable hosts in the AWX job summary]

All solutions work with Windows and Linux target hosts, which may also need to be taken into account when choosing the proper check task. The final choice is of course yours; feel free to experiment (as long as it is not in a production environment :))

See also other Ansible or AWX Related articles on CattleCrew.

All posts by Bartlomiej Sowa

Solution Architect specialized in Cloud Solutions, Dev(Sec)Ops and large scale automation using Ansible/AWX. Additionally Oracle DBA with very strong background as System Administrator with excellent Unix/Linux OS knowledge covering aspects from system architecture, InfrastructureAsCode (Ansible, Terraform, Pipelines), to network security, shell scripting, Perl, ANSI C, VCS (primary Git, but worked with Clearcase, CVS, SVN). With very high security standards regarding data-, network and even physical security.
