As many of you already know, Ansible AWX is a great and powerful tool for managing your infrastructure. However, even one unreachable host results in the entire AWX job being reported as failed, regardless of the outcome on all other hosts. If you would like to ignore unreachable hosts and keep jobs from being marked as failed just because of host reachability, keep reading…
The problem
When you run jobs against your entire large inventory (for example to roll out compliance settings, update the CMDB, or run some checks), you very often target some hosts that are simply down for one reason or another.
Having only one host powered down or unreachable results in ansible-playbook returning exit code 4 (BTW, here you can find an excellent description of all Ansible exit codes and when each one is produced).
AWX decides solely on the basis of the exit code whether the entire job was successful or not. In the case of unreachable hosts, the entire job is reported as failed.
Because of this, such an AWX job will very often be marked as failed. It can therefore be hard to tell whether there is really a problem with your Ansible code (and the playbook fails for all hosts) or whether the failure is just due to unreachable hosts.
Of course you can analyze the task output each time, but especially when working with slices you would need to analyze the output of each slice manually… just to find out that the failure is still due to at least one unreachable host.
Solution
There is a quite simple solution that uses only Ansible means, as this is where the problem comes from. Prepend the main play in the playbook with another one that checks the reachability of the hosts. Then execute the main play only on the reachable ones. That way, if you get an error for the entire job, you know for sure that there is a problem that needs to be addressed.
Here is what such a playbook might look like:
    - name: Check host reachability
      gather_facts: false
      hosts: "{{ target_hosts }}"
      tasks:
        - name: Check connection via explicit fact gathering
          ansible.builtin.setup:
          register: reachable_check

        # Bring unreachable hosts back into the play so group_by runs for them too
        - meta: clear_host_errors

        - name: Add to un-/reachable group
          ansible.builtin.group_by:
            key: "{{ reachable_check is unreachable | ternary('unreachable','reachable') }}_hosts"

    - name: MAIN PLAY - Actions with reachable hosts
      hosts: reachable_hosts
      gather_facts: true
      tasks:
        - ....
      roles:
        - ...

    - name: SUMMARY Play (optional)
      hosts: localhost
      gather_facts: false
      tasks:
        - name: SUMMARY
          ansible.builtin.debug:
            # default([]) keeps the summary working when no host was reachable
            msg: "Playbook executed on {{ groups['reachable_hosts'] | default([]) | length }}/{{ q('ansible.builtin.inventory_hostnames',target_hosts) | length }} hosts that were reachable."

        - name: Unreachable hosts with reason for unreachability
          ansible.builtin.debug:
            msg: "{{ hostvars[item]['reachable_check']['msg'] }}"
          # default([]) avoids an undefined-group error when every host was reachable
          with_items: "{{ groups['unreachable_hosts'] | default([]) }}"
With this extra play in front, each host is assigned to either the reachable_hosts or the unreachable_hosts virtual inventory group. Because the next play runs only against reachable_hosts, the main playbook code is executed only on hosts that were reachable at the time the job started. As a bonus, you get a list of hosts that are not reachable, should this be needed, as in the summary in the very last play.
Also, if you use clear_host_errors, be aware that at the time of writing there was a still-open bug in Ansible that results in unexpected behavior when there are many tasks between the failing one and the meta task. Make sure that the bug is already fixed, or that you have just one task before the meta task that might fail due to an unreachable host.
If calling ansible.builtin.setup is not wanted (for example when using fact caching, where explicit fact collection takes too much time or too many resources), you may use any other task such as ansible.builtin.ping (or ansible.windows.win_ping), as long as it really connects to the host (some do not) and is able to produce the unreachable status (here again, not all do).
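As a minimal sketch of that swap, the check play could look like the following. It assumes Linux/Unix targets and reuses the target_hosts variable and the reachable_check register from the playbook above; for Windows targets the ping task would be replaced by ansible.windows.win_ping.

```yaml
# Sketch: reachability check without fact gathering.
# Assumes Linux/Unix targets; swap in ansible.windows.win_ping for Windows hosts.
- name: Check host reachability
  hosts: "{{ target_hosts }}"
  gather_facts: false
  tasks:
    - name: Check connection via ping (collects no facts)
      ansible.builtin.ping:
      register: reachable_check

    # Keep exactly one task in front of the meta task (see the note on the open bug)
    - meta: clear_host_errors

    - name: Add to un-/reachable group
      ansible.builtin.group_by:
        key: "{{ reachable_check is unreachable | ternary('unreachable', 'reachable') }}_hosts"
```

Since ping returns almost no data, the registered result stays small, which is the main reason to prefer it over setup here.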
Solution 2
If you do not need extra information about unreachable hosts in the summary, you may skip the register variable and just declare the reachable_hosts group like this:
    ...
      tasks:
        - name: Check connection via explicit fact gathering
          ansible.builtin.setup:

        - name: Add to un-/reachable group
          ansible.builtin.group_by:
            key: "reachable_hosts"

        - meta: clear_host_errors
    ...
This way you save the runtime inventory some bytes by not storing all collected facts twice. This might seem like nothing, but at a scale of thousands of hosts memory can start becoming an issue.
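With this variant there is no unreachable_hosts group and no reachable_check variable, so a summary play has to derive the unreachable hosts itself. The following is only a sketch of one way to do that; the variable names all_targets and reachable are made up for illustration, and target_hosts is assumed to be passed in as before.

```yaml
# Sketch: summary play for the variant without register / unreachable_hosts.
- name: SUMMARY Play (optional)
  hosts: localhost
  gather_facts: false
  vars:
    # Hypothetical helper: all hosts matched by the pattern
    all_targets: "{{ q('ansible.builtin.inventory_hostnames', target_hosts) }}"
    # Hypothetical helper: default([]) covers the case where no host was reachable
    reachable: "{{ groups['reachable_hosts'] | default([]) }}"
  tasks:
    - name: List hosts that were skipped as unreachable
      ansible.builtin.debug:
        msg: "Unreachable: {{ all_targets | difference(reachable) }}"
```

The reason for unreachability is not available this way; if you need it, the first solution with register is the better fit.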
Another solution
This is of course not the only working solution (in Ansible there are always multiple ways to achieve the same result, some more performant than others). There is also the possibility of using block: with rescue: and not using meta: clear_host_errors, like this:
    - name: Check host reachability
      gather_facts: false
      hosts: "{{ target_hosts }}"
      tasks:
        - block:
            - name: Check connection
              ansible.builtin.wait_for_connection:
                timeout: 10
              register: reachable_check
              # ignore_unreachable: true

            - name: Add to reachable hosts
              ansible.builtin.group_by:
                key: "reachable_hosts"
          rescue:
            - name: Add to unreachable hosts
              ansible.builtin.group_by:
                key: "unreachable_hosts"

    - name: MAIN PLAY - Actions with reachable hosts
      hosts: reachable_hosts
      gather_facts: true
      tasks:
        ...
Here ansible.builtin.wait_for_connection has been used instead of ansible.builtin.setup to save some space in the runtime inventory (especially when using fact_caching, where explicit fact gathering is not wanted). This approach, however, has some drawbacks:
- ansible.builtin.wait_for_connection does not return the unreachable status, only a failure. Changing it to something that returns unreachable would require either using ignore_unreachable: true on the task to convert unreachable into failed (as rescue does not heal unreachable hosts) or using meta: clear_host_errors as the last task outside the block (do not use this meta inside the block, as weird things happen)
- the rescue: block heals failed tasks, and there is no information about failure/unreachability (only rescued in the standard Ansible summary at the very bottom)
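For comparison, here is an untested sketch of how ignore_unreachable: true could be combined with a task that does report unreachable, this time without block/rescue and without the meta task. It assumes that the host stays in the play and that the registered result still carries the unreachable flag; please verify both, and the resulting job status in AWX, in your own environment before relying on it.

```yaml
# Sketch: ignore_unreachable instead of block/rescue or meta: clear_host_errors.
- name: Check host reachability
  hosts: "{{ target_hosts }}"
  gather_facts: false
  tasks:
    - name: Check connection
      ansible.builtin.ping:
      register: reachable_check
      # Assumption: the host stays in the play and the registered result
      # still carries the unreachable flag
      ignore_unreachable: true

    - name: Add to un-/reachable group
      ansible.builtin.group_by:
        key: "{{ reachable_check is unreachable | ternary('unreachable', 'reachable') }}_hosts"
```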
Summary
The first proposed solution works in such a way that AWX shows the number of unreachable hosts next to the number of total and/or failed hosts at the very top of the job summary, and I personally like it better.
All solutions work with Windows and Linux target hosts, which may also need to be taken into account when choosing a proper check task. The final choice is of course yours; feel free to experiment (as long as it is not in a production environment :))
See also other Ansible- or AWX-related articles on CattleCrew.