Jeff Li


A Pitfall of Vagrant When Using Ansible's wait_for Module

Introduction

I use Vagrant and Ansible to manage my development environment, and the combination works wonderfully. Recently, I wrote an Ansible playbook that reboots the VM during Vagrant provisioning (check here to see how Vagrant works together with Ansible). The remaining tasks are expected to continue after the VM comes back, and it is easy to find a solution in Ansible:

---
- name: reboot the system
  shell: sleep 2 && shutdown -r now "Ansible updates triggered"
  async: 1
  poll: 0
  become: yes
  ignore_errors: true

- name: waiting for server to come back
  wait_for:
    port: "{{ ansible_ssh_port | default(22) }}"
    host: "{{ ansible_ssh_host | default(inventory_hostname) }}"
    state: started
    delay: 10
    timeout: 60
  delegate_to: localhost
  become: no

This is the standard way in Ansible to reboot a host and wait for it to come back by polling. However, it doesn't work as expected. Technically, it works with the libvirt provider but not with the built-in VirtualBox provider:

  • Ansible's output shows that the wait_for task succeeds
  • An SSH connection failure occurs in the task that follows wait_for
  • The VirtualBox UI shows that the VM is still booting when the wait_for task finishes

Triage

What confused me is that VirtualBox reported the VM was still booting even though the wait_for module had finished successfully. The SSH connection failure in the following task means the VM was not ready yet. It seemed that something was wrong with the wait_for module.

As a first try, the delay option of the wait_for task was set to a longer value such as 30, which means Ansible blocks for 30 seconds before it starts checking whether the host is back. Every task then executed successfully, and the VirtualBox UI showed that the VM was up. Although everything seemed to be OK, this workaround didn't make sense, because that is not how the wait_for module works: it is supposed to poll the VM's status and wait until port 22 is ready. The result suggests that the wait_for module was returning immediately as soon as it started checking.

Next, enable Ansible's verbose mode in Vagrant and inspect the output of the wait_for module.
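Verbose mode can be turned on in the Vagrantfile's Ansible provisioner block; a sketch, where the playbook path is a placeholder for your own:

```ruby
# Vagrantfile (excerpt) -- enable verbose Ansible output during provisioning.
# "playbook.yml" is a placeholder for your actual playbook path.
Vagrant.configure("2") do |config|
  config.vm.provision "ansible" do |ansible|
    ansible.playbook = "playbook.yml"
    ansible.verbose  = "v"   # passes -v to ansible-playbook; use "vvv" for more detail
  end
end
```

With this in place, ansible-playbook prints each module's invocation and return values, including the wait_for output below.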

{
   "changed":false,
   "elapsed":10,
   "invocation":{
      "module_args":{
         "connect_timeout":5,
         "delay":10,
         "exclude_hosts":null,
         "host":"127.0.0.1",
         "path":null,
         "port":2222,
         "search_regex":null,
         "state":"started",
         "timeout":60
      },
      "module_name":"wait_for"
   },
   "path":null,
   "port":2222,
   "search_regex":null,
   "state":"started"
}

Wait, why are the values of host and port 127.0.0.1 and 2222? Aren't they supposed to be the VM's IP and 22? The output means that two inventory variables, ansible_ssh_port and ansible_ssh_host, are being interpreted unexpectedly. If no inventory file is specified, Vagrant generates one automatically, and its content looks like this:

# Generated by Vagrant

VM_NAME ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_private_key_file=VM_DIRECTORY/.vagrant/machines/VM_NAME/virtualbox/private_key

Because Vagrant uses a forwarded port to create the SSH connection with VirtualBox, the host is 127.0.0.1 and the port is 2222 rather than 22.

So, can we instruct Vagrant to generate the inventory file with the VM's own address and port 22 by disabling port forwarding? Unfortunately, no! You can disable any forwarded port except the SSH one in VirtualBox. That is because Vagrant requires a way to access and configure the VM through SSH, and with the VirtualBox provider this is done through port forwarding. The libvirt provider uses a private network for the SSH connection instead, which is why the playbook works there.
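For the VirtualBox provider, the implicit SSH forward Vagrant sets up is roughly equivalent to declaring the following yourself; a sketch, noting that the host port may be auto-corrected if 2222 is already taken:

```ruby
# Vagrantfile (excerpt) -- the SSH port forward the VirtualBox provider
# creates implicitly: guest port 22 is reachable as 127.0.0.1:2222 on the host.
Vagrant.configure("2") do |config|
  config.vm.network "forwarded_port",
    guest: 22, host: 2222, id: "ssh", auto_correct: true
end
```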

The reason the wait_for module finishes successfully is that the forwarded port is created before the VM is ready. Every time the wait_for module checks the VM, it finds the port open. In fact, only the host's port is open; the guest's port is not.

Conclusion

Now that the root cause has been found, there are several ways to fix the problem.

  1. Provide a customized inventory file to Vagrant. This is not good practice, because we should let Vagrant manage as much as possible.
  2. Pass the VM's address and port to the playbook. Not good enough either, since hardcoded parameters should be avoided.
  3. Use the search_regex option of the wait_for module. With this option, wait_for does not just check whether the port is open; it also reads some data from the port and checks whether it matches search_regex. After setting search_regex to OpenSSH, everyone is happy.
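Applied to the playbook above, option 3 only changes the wait_for task; the search_regex makes it wait until the guest's sshd actually answers on the forwarded port with its version banner (OpenSSH servers send a banner beginning "SSH-2.0-OpenSSH", which the regex matches):

```yaml
- name: waiting for server to come back
  wait_for:
    port: "{{ ansible_ssh_port | default(22) }}"
    host: "{{ ansible_ssh_host | default(inventory_hostname) }}"
    search_regex: OpenSSH   # succeed only once sshd sends its banner
    state: started
    delay: 10
    timeout: 60
  delegate_to: localhost
  become: no
```

Now the task no longer passes merely because the host side of the port forward is listening.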

Some lessons

  • Networking is must-know knowledge for Vagrant users.
  • Ansible's verbose mode is informative and should be turned on when something is abnormal.
  • Let Vagrant manage the environment as much as possible.
