Description
Most Grid5000 servers are very slow to reboot, so the boot.timeout setting of the nodes should be raised when the walt platform is deployed.
For instance, in the trace below, chifflot-2 took 4min15 to boot, even though it is on the same site as the server (Lille).
Six minutes is probably an appropriate value (the default boot.timeout is 3min).
The trace shows that hard reboots fail when the timeout is reached, because we only support hard reboots using PoE. That is a good thing until we increase the timeout: nodes that are merely slow (chifflot-2, chifflot-4 and taurus-10 here) are left alone and finish booting on their own.
However, taurus-1 really did fail to boot properly, and I ran walt node reboot --hard taurus-1 in another terminal, which eventually allowed it to boot. With the G5K plugin, this manual hard reboot relies on the kareboot3 command.
So when boot.timeout (6 minutes) is reached on G5K, we should allow hard-rebooting through this kareboot3-based path, since we don't have access to the PoE switches.
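The proposed behaviour can be sketched as a chain of hard-reboot methods tried in order. This is only an illustration, not the actual walt-server code: `hard_reboot`, `poe_reboot` and `g5k_reboot` are hypothetical names, and the two reboot functions are stand-ins that do not touch any switch or call kareboot3.

```python
from typing import Callable, Optional, Sequence

# Hypothetical convention: a reboot method takes a node name and returns
# None on success, or a short failure reason otherwise.
RebootMethod = Callable[[str], Optional[str]]


def hard_reboot(
    node: str, methods: Sequence[tuple[str, RebootMethod]]
) -> tuple[bool, list[str]]:
    """Try each hard-reboot method in order and stop at the first success."""
    failures = []
    for name, method in methods:
        reason = method(node)
        if reason is None:
            return True, failures
        failures.append(f"Node {node}: failed {name} ({reason})")
    return False, failures


def poe_reboot(node: str) -> Optional[str]:
    # Stand-in: on G5K the PoE switches are not accessible, so this fails
    # the same way as in the trace.
    return "unknown lldp network position"


def g5k_reboot(node: str) -> Optional[str]:
    # Stand-in for the G5K plugin path, which would rely on kareboot3.
    return None


ok, failures = hard_reboot(
    "taurus-1", [("poe-reboot", poe_reboot), ("g5k-reboot", g5k_reboot)]
)
print(ok, failures)
```

With such a chain, the PoE failure is still logged, but the node gets rebooted through the next available method instead of waiting for the following timeout.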
Jun 25 08:34:28 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI set_image ('all-nodes', 'pc-x86-64-default') {}
Jun 25 08:34:29 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI wait_for_nodes ('all-nodes',) {}
Jun 25 08:37:09 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: NSAPI sync_clock () {}
Jun 25 08:37:22 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: NSAPI sync_clock () {}
Jun 25 08:37:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: chifflot-4: boot timeout reached, trying hard-reboot (8 retries left).
Jun 25 08:37:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: Node chifflot-4: failed poe-reboot (unknown lldp network position)
Jun 25 08:37:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: chifflot-2: boot timeout reached, trying hard-reboot (8 retries left).
Jun 25 08:37:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: Node chifflot-2: failed poe-reboot (unknown lldp network position)
Jun 25 08:37:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: taurus-1: boot timeout reached, trying hard-reboot (8 retries left).
Jun 25 08:37:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: Node taurus-1: failed poe-reboot (unknown lldp network position)
Jun 25 08:37:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: taurus-10: boot timeout reached, trying hard-reboot (8 retries left).
Jun 25 08:37:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: Node taurus-10: failed poe-reboot (unknown lldp network position)
Jun 25 08:37:24 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: node taurus-10 is booted
Jun 25 08:37:25 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: node chifflot-4 is booted
Jun 25 08:37:45 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: NSAPI report_lldp_neighbor ('00:01:e8:8b:41:81', 'TenGigabitEthernet 0>
Jun 25 08:38:35 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: NSAPI sync_clock () {}
Jun 25 08:38:38 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: node chifflot-2 is booted
Jun 25 08:38:54 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: NSAPI report_lldp_neighbor ('a0:3d:6f:7f:96:69', 'Ethernet1/2') {}
Jun 25 08:39:17 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI shell_autocomplete ('walt-g5k', ['NODE', 'walt', 'node', 'shell'>
Jun 25 08:39:21 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI shell_autocomplete ('walt-g5k', ['NODE', 'walt', 'node', 'shell'>
Jun 25 08:39:22 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI shell_autocomplete ('walt-g5k', ['NODE', 'walt', 'node', 'shell'>
Jun 25 08:39:24 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI shell_autocomplete ('walt-g5k', ['NODE', 'walt', 'node', 'shell'>
Jun 25 08:39:25 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI filter_ownership ('chifflot-2',) {}
Jun 25 08:39:25 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI get_nodes_ip ('chifflot-2',) {}
Jun 25 08:39:25 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI wait_for_nodes ('chifflot-2',) {}
Jun 25 08:39:59 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI shell_autocomplete ('walt-g5k', ['NODE', 'walt', 'node', 'shell'>
Jun 25 08:40:01 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI filter_ownership ('taurus-10',) {}
Jun 25 08:40:01 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI get_nodes_ip ('taurus-10',) {}
Jun 25 08:40:01 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI wait_for_nodes ('taurus-10',) {}
Jun 25 08:40:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: taurus-1: boot timeout reached, trying hard-reboot (7 retries left).
Jun 25 08:40:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: Node taurus-1: failed poe-reboot (unknown lldp network position)
Jun 25 08:42:16 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI show_nodes ('eduble', False, False) {}
Jun 25 08:42:24 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI show_nodes ('eduble', True, False) {}
Jun 25 08:43:17 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI filter_ownership ('taurus-1',) {}
Jun 25 08:43:17 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI reboot_nodes ('taurus-1',) {'hard_only': True}
Jun 25 08:45:57 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: NSAPI sync_clock () {}
Jun 25 08:46:02 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: node taurus-1 is booted
Jun 25 08:46:18 chiclet-2.lille.grid5000.fr walt-server-daemon[3802]: hub api_call: CSAPI show_nodes ('walt-g5k', False, False) {}
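To choose a new boot.timeout value, boot durations can be estimated from such a trace. The sketch below is a rough helper, not part of WalT: it assumes that the boot started exactly one default timeout (3min) before the node's first "boot timeout reached" message, and it uses four lines copied from the trace above as sample input. `boot_durations` is a hypothetical name.

```python
import re
from datetime import datetime, timedelta

# Assumption: the first timeout message appears one default boot.timeout
# (3min) after the boot started.
DEFAULT_BOOT_TIMEOUT = timedelta(minutes=3)

TRACE = """\
Jun 25 08:37:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: chifflot-2: boot timeout reached, trying hard-reboot (8 retries left).
Jun 25 08:37:23 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: taurus-10: boot timeout reached, trying hard-reboot (8 retries left).
Jun 25 08:37:24 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: node taurus-10 is booted
Jun 25 08:38:38 chiclet-2.lille.grid5000.fr walt-server-daemon[3800]: node chifflot-2 is booted
"""

LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ \S+: (.*)$")
TIMEOUT_RE = re.compile(r"^(\S+): boot timeout reached")
BOOTED_RE = re.compile(r"^node (\S+) is booted")


def boot_durations(trace: str) -> dict[str, int]:
    """Return the estimated boot duration in seconds for each booted node."""
    first_timeout = {}
    durations = {}
    for line in trace.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue
        # The trace has no year; that is fine for same-day differences.
        stamp = datetime.strptime(m.group(1), "%b %d %H:%M:%S")
        message = m.group(2)
        timeout = TIMEOUT_RE.match(message)
        booted = BOOTED_RE.match(message)
        if timeout:
            # Keep only the first timeout message of each node.
            first_timeout.setdefault(timeout.group(1), stamp)
        elif booted and booted.group(1) in first_timeout:
            node = booted.group(1)
            start = first_timeout[node] - DEFAULT_BOOT_TIMEOUT
            durations[node] = int((stamp - start).total_seconds())
    return durations


durations = boot_durations(TRACE)
for node, seconds in durations.items():
    print(f"{node}: {seconds // 60}min{seconds % 60:02d}")
```

On the sample lines, this estimate gives 4min15 for chifflot-2, matching the figure quoted in the description. Nodes that were rebooted manually in the meantime (taurus-1 here) would need separate handling.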