Commit 970a16f

dgchinner authored and richm committed
tests: verify SKU customisation scripts
Add a script to enable both manual and automated testing of the Azure SKU customisation scripts.

When running the tests manually, it will exercise all the different supported SKU types via mocking and check that the appropriate links are installed. It will not check that the customisation service is active and running, as manual mode is expected to be used on dev machines that are unsupported SKU types.

Manual testing like this may throw some warnings or errors because the hardware is not directly supported. For example, testing on a VM type that does not have GPUs that are supported by the fabric manager will result in warnings that the service failed to start:

    $ sudo /opt/hpc/azure/tests/test-sku-setup.sh --manual
    Testing standard_nc96ads_a100_v4
    Test Passed: standard_nc96ads_a100_v4
    Testing standard_nd40rs_v2
    Test Passed: standard_nd40rs_v2
    Testing standard_nd96asr_v4
    Job for nvidia-fabricmanager.service failed because the control process exited with error code.
    See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
    NVIDIA Fabric Manager Inactive!
    Test Passed: standard_nd96asr_v4
    Testing standard_hb176rs_v4
    Test Passed: standard_hb176rs_v4
    Testing standard_nc80adis_h100_v5
    Check NVLink status after reloading NVIDIA kernel modules...
    NVLink is Active.
    Test Passed: standard_nc80adis_h100_v5
    Testing standard_nd96isr_h200_v5
    Job for nvidia-fabricmanager.service failed because the control process exited with error code.
    See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
    NVIDIA Fabric Manager Inactive!
    Test Passed: standard_nd96isr_h200_v5
    $

Such warnings are fine.

When not in manual mode, the test expects that it is running on a supported SKU VM (e.g. in the CI system) and will query the current SKU type. If the SKU is unsupported, it will check that no files are currently installed. It will fail in the case where stale config files are found:

    $ sudo /opt/hpc/azure/tests/test-sku-setup.sh
    Unknown SKU
    Failed: Standard_NC8as_T4_v3: /etc/nccl.conf not empty
    $

If the SKU is supported, it will check that the appropriate files are installed and the service is running.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
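The automated mode described above depends on asking the Azure Instance Metadata Service (IMDS) for the VM size and retrying while the endpoint is not yet answering. The following is a minimal sketch of that retry pattern, not the code from this commit: the function names `retry_sku_query` and `imds_vm_size` and the `RETRIES` and `DELAY` variables are hypothetical, and treating a literal `null` (what `jq -r` prints for a missing key) as "no answer" is an assumption added here. The endpoint URL and the `jq` filter are the ones used by the test script.

```shell
# Hypothetical helper: run the given command up to $RETRIES times (default 5),
# waiting $DELAY seconds (default 30) between attempts, until it prints a
# non-empty answer. The answer is lower-cased, as the test script does
# before matching SKU patterns.
retry_sku_query()
{
    local retries="${RETRIES:-5}" delay="${DELAY:-30}"
    local attempt=0 out=

    while [ "$attempt" -lt "$retries" ]; do
        out=$("$@" 2>/dev/null) || out=
        # Assumption: treat jq's literal "null" as "no answer yet".
        if [ -n "$out" ] && [ "$out" != "null" ]; then
            echo "$out" | tr '[:upper:]' '[:lower:]'
            return 0
        fi
        attempt=$((attempt + 1))
        # Only sleep if another attempt is going to follow.
        if [ "$attempt" -lt "$retries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# The query itself, using the endpoint and filter from the test script.
# Requires curl and jq, and only works from inside an Azure VM.
imds_vm_size()
{
    curl -s -H Metadata:true \
        "http://169.254.169.254/metadata/instance?api-version=2019-06-04" |
        jq -r ".compute.vmSize"
}
```

A caller could then write `sku=$(retry_sku_query imds_vm_size) || echo "Could not retrieve VM size"`. Passing the query as a command keeps the retry logic testable on machines that have no IMDS endpoint.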
1 parent bc97a03 commit 970a16f

3 files changed

+163
-1
lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -195,7 +195,7 @@ Type: `bool`
 Whether to install the hardware tuning files for different Azure VM types (SKUs).

 This will install definitions for optimal hardware configurations for the different types of high performance VMs that are typically used for HPC workloads in the Azure environment.
-These include Infiniband and GPU/NVLink and NCCL customisations, as well as any workarounds for specific hardware problems that may be needed.
+These include InfiniBand and GPU/NVLink and NCCL customisations, as well as any workarounds for specific hardware problems that may be needed.

 Default: `true`

tasks/main.yml

Lines changed: 8 additions & 0 deletions
@@ -774,6 +774,14 @@
     name: sku_customisation.service
     enabled: true

+- name: Install tests
+  template:
+    src: sku/test-sku-setup.sh
+    dest: "{{ __hpc_azure_tests_dir }}/"
+    owner: root
+    group: root
+    mode: '0755'
+
 - name: Remove build dependencies
   vars:
     __hpc_dependencies: >-
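After the role has run, it can be useful to confirm that the test script was installed with the mode requested by the task above. The following is a small sketch of such a check and is not part of the commit: `check_mode` is a hypothetical helper, and it assumes GNU `stat` (the `-c '%a'` format), which is the usual case on the Linux distributions these VMs run.

```shell
# Hypothetical check: succeed only if FILE exists and its permission bits,
# as printed by GNU stat in octal, are exactly MODE (for example 755).
check_mode()
{
    local file="$1" want="$2" have

    [ -e "$file" ] || return 1
    have=$(stat -c '%a' "$file") || return 1
    [ "$have" = "$want" ]
}
```

For example, `check_mode /opt/hpc/azure/tests/test-sku-setup.sh 755` uses the install path shown in the commit message. The task's `'0755'` corresponds to `755` in `stat` output because no setuid, setgid, or sticky bit is set.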

templates/sku/test-sku-setup.sh

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
+#!/bin/bash -eu
+# This is a template, not an actual shell script, so tell shellcheck to
+# ignore the problematic templated parts
+# shellcheck disable=all
+{{ ansible_managed | comment }}
+{{ "system_role:hpc" | comment(prefix="", postfix="") }}
+# shellcheck enable=all
+
+# Script for testing SKU customisation.
+#
+# This can be run in two ways, determined by the CLI parameter '--manual'
+# being specified.
+#
+# When run in manual mode, the test will mock the SKU string and run the setup
+# script, check the install, remove it and check that it is empty. It will
+# iterate through all supported SKU types and an invalid type to exercise the
+# failure path.
+#
+# When run without the manual CLI parameter, it is assumed that we are being run
+# from a CI system using real Azure VMs and we are doing whole system testing.
+# This means the service scripts will be installing the SKU files at startup,
+# and so we do an API query to determine what the current SKU is and expect
+# the system to already be set up appropriately.
+
+NCCL_CONF="/etc/nccl.conf"
+
+# define the expected runtime location for the topology and graph
+# customisation files
+TOPOLOGY_RUNTIME_DIR="{{ __hpc_azure_runtime_dir }}/topology"
+TOPOLOGY_GRAPH="${TOPOLOGY_RUNTIME_DIR}/graph.xml"
+TOPOLOGY_FILE="${TOPOLOGY_RUNTIME_DIR}/topo.xml"
+
+MANUAL_TEST=
+SKU_LIST="standard_nc96ads_a100_v4 \
+    standard_nd40rs_v2 \
+    standard_nd96asr_v4 \
+    standard_hb176rs_v4 \
+    standard_nc80adis_h100_v5 \
+    standard_nd96isr_h200_v5 \
+    standard_nd128isr_gb300_v6 \
+    some_unknown_sku_for_testing"
+
+fail()
+{
+    echo Failed: "$1"
+    exit 1
+}
+
+usage()
+{
+    echo "$1"
+    echo "$0 [--manual] [--help|-h|-?]"
+    echo
+    echo "Run SKU customisation tests. Options:"
+    echo "--manual Exercise all SKU types via mocking."
+    echo "--help Print this usage message."
+
+    exit 1
+}
+
+while [ $# -gt 0 ]; do
+
+    case "$1" in
+    --manual)
+        MANUAL_TEST=1 ;;
+    --help|-h|-?)
+        usage "Help requested" ;;
+    *)  usage "Unknown Option" ;;
+    esac
+    shift
+done
+
+
+if [ -z "$MANUAL_TEST" ]; then
+    metadata_endpoint="http://169.254.169.254/metadata/instance?api-version=2019-06-04"
+
+    retry_count=0
+    while (( retry_count++ < 5 )); do
+        SKU_LIST=$(curl -s -H Metadata:true "$metadata_endpoint" | jq -r ".compute.vmSize")
+        [ -z "$SKU_LIST" ] || break
+        sleep 30
+    done
+fi
+
+if [ -z "$SKU_LIST" ]; then
+    fail "Could not retrieve VM Size from IMDS endpoint"
+fi
+
+SKU_LIST=$(echo "$SKU_LIST" | awk '{print tolower($0)}')
+
+## Topo file setup based on SKU
+for sku in $SKU_LIST; do
+    unknown_sku=
+
+    echo
+    echo "Testing $sku"
+    if [ -n "$MANUAL_TEST" ]; then
+        __MOCK_SKU="$sku" "{{ __hpc_azure_resource_dir }}"/bin/setup_sku_customisations.sh
+    fi
+
+    case "$sku" in
+    standard_hb176*v4 | \
+    standard_nc80adis_h100_v5 | \
+    standard_nd128is*_gb[2-3]00_v6)
+        # No topology or graph file, nccl.conf configured
+        [ -e "$TOPOLOGY_FILE" ] && fail "$sku: unexpected topology file found"
+        [ -e "$TOPOLOGY_GRAPH" ] && fail "$sku: unexpected graph file found"
+        [ -s "$NCCL_CONF" ] || fail "$sku: $NCCL_CONF empty or does not exist"
+        ;;
+
+    standard_nc96ads_a100_v4)
+        # Both topology and graph file, nccl.conf configured
+        [ -e "$TOPOLOGY_FILE" ] || fail "$sku: topology file not found"
+        [ -e "$TOPOLOGY_GRAPH" ] || fail "$sku: graph file not found"
+        [ -s "$NCCL_CONF" ] || fail "$sku: $NCCL_CONF empty or does not exist"
+        ;;
+
+    standard_nd40rs_v2 | \
+    standard_nd*v4 | \
+    standard_nd96is*_h[1-2]00_v5)
+        # Only topology file, nccl.conf configured
+        [ -e "$TOPOLOGY_FILE" ] || fail "$sku: topology file not found"
+        [ -e "$TOPOLOGY_GRAPH" ] && fail "$sku: unexpected graph file found"
+        [ -s "$NCCL_CONF" ] || fail "$sku: $NCCL_CONF empty or does not exist"
+        ;;
+
+    *)
+        # No topology or graph file, nccl.conf missing or zero length
+        echo "Unknown SKU: $sku"
+        [ -e "$TOPOLOGY_FILE" ] && fail "$sku: unexpected topology file found"
+        [ -e "$TOPOLOGY_GRAPH" ] && fail "$sku: unexpected graph file found"
+        [ -s "$NCCL_CONF" ] && fail "$sku: $NCCL_CONF not empty"
+        # turn off the service running check
+        unknown_sku="$sku"
+        ;;
+    esac
+
+    if [ -n "$MANUAL_TEST" ]; then
+        "{{ __hpc_azure_resource_dir }}"/bin/remove_sku_customisations.sh
+
+        # No topology or graph file, nccl.conf missing or zero length
+        [ -e "$TOPOLOGY_FILE" ] && fail "$sku: topology file not removed"
+        [ -e "$TOPOLOGY_GRAPH" ] && fail "$sku: graph file not removed"
+        [ -s "$NCCL_CONF" ] && fail "$sku: $NCCL_CONF not empty"
+    elif [ -z "$unknown_sku" ]; then
+        # check that the customisation service is running
+        if ! systemctl is-active --quiet sku_customisation ; then
+            fail "$sku: customisation service not running"
+        fi
+    fi
+    echo Test Passed: "$sku"
+done
+
+exit 0
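The `case` block in the script encodes which files each SKU family is expected to have installed. As a reading aid, the same glob patterns can be wrapped in a function that only reports the expectation, without touching the filesystem. This is a sketch under stated assumptions: `expected_files` and its output strings are hypothetical names introduced here, and only the patterns are taken from the script.

```shell
# Hypothetical helper: print which customisation files the test expects for
# a lower-cased SKU name. The patterns are copied from the test script; the
# output words ("topology", "graph", "nccl", "none") are made up for this
# sketch.
expected_files()
{
    case "$1" in
    standard_hb176*v4 | \
    standard_nc80adis_h100_v5 | \
    standard_nd128is*_gb[2-3]00_v6)
        echo "nccl" ;;
    standard_nc96ads_a100_v4)
        echo "topology graph nccl" ;;
    standard_nd40rs_v2 | \
    standard_nd*v4 | \
    standard_nd96is*_h[1-2]00_v5)
        echo "topology nccl" ;;
    *)
        echo "none" ;;
    esac
}
```

For instance, `expected_files standard_nd40rs_v2` prints `topology nccl`, while any name that matches no pattern falls through to `none`, which corresponds to the script's unknown-SKU branch. Because `case` takes the first matching clause, the order of the clauses matters whenever patterns could overlap.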
