slurm: set containerd root to EBS #914
Thanks @maekawataiki for reporting the issue and creating this PR. The suggested fix did not take effect as expected because the original
$ bash easy-ssh.sh -r us-east-1 hyperpod-after-20241216
=================================================
==== 🚀 HyperPod Cluster Easy SSH Script! 🚀 ====
=================================================
Cluster id: jhroxiiv5v3e
Instance id: i-05060251d0a782283
Node Group: controller-machine
SSH User: ubuntu
1. Detected hyperpod-after-20241216 in ~/.ssh/config. Skipping adding...
2. Detected SSH public key ~/.ssh/id_rsa.pub for user ubuntu on the cluster. Skipping adding...
Now you can run:
$ ssh hyperpod-after-20241216
Starting session with SessionId: i-0f5934b931601f25a-epjgelyqlb4aq6l44epdkeuo4q
$ srun cat /etc/containerd/config.toml
# Copyright 2018-2022 Docker Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
disabled_plugins = ["cri"]
#root = "/opt/dlami/nvme/docker/containerd" # Here
#state = "/run/containerd"
#subreaper = true
#oom_score = 0
#[grpc]
# address = "/run/containerd/containerd.sock"
# uid = 0
# gid = 0
#[debug]
# address = "/run/containerd/debug.sock"
# uid = 0
# gid = 0
# level = "info"
$
For reference, here's the original
$ bash easy-ssh.sh -r us-east-1 hyperpod-before-20241216
=================================================
==== 🚀 HyperPod Cluster Easy SSH Script! 🚀 ====
=================================================
Cluster id: 69q9l3vgs5iv
Instance id: i-0105da2ccc9eae353
Node Group: controller-machine
SSH User: ubuntu
1. Detected hyperpod-before-20241216 in ~/.ssh/config. Skipping adding...
2. Detected SSH public key ~/.ssh/id_rsa.pub for user ubuntu on the cluster. Skipping adding...
Now you can run:
$ ssh hyperpod-before-20241216
Starting session with SessionId: i-0f5934b931601f25a-dab2bng46eqgfk9a3vyx8pesdq
$ srun cat /etc/containerd/config.toml
# Copyright 2018-2022 Docker Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
disabled_plugins = ["cri"]
#root = "/var/lib/containerd"
#state = "/run/containerd"
#subreaper = true
#oom_score = 0
#[grpc]
# address = "/run/containerd/containerd.sock"
# uid = 0
# gid = 0
#[debug]
# address = "/run/containerd/debug.sock"
# uid = 0
# gid = 0
# level = "info"
$
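In both dumps the `root` line is still commented out, so the value shown there is not in effect and containerd uses its built-in default (`/var/lib/containerd`). As a minimal sketch, assuming a config file laid out like the ones above, a helper such as the following prints only an uncommented top-level `root` setting. The name `effective_root` is hypothetical and not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the uncommented top-level `root = ...` line of a
# containerd config file, or "(default)" when it is absent or commented out.
effective_root() {
  # Anchoring at ^ skips commented lines such as `#root = ...`.
  grep -E '^root *=' "$1" || echo "(default)"
}
```

On a cluster node this could be run as `effective_root /etc/containerd/config.toml`; for both configs shown above it would print `(default)`.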
In the original version,
Workaround on existing clusters:
Removed state configuration from containerd setup for both paths.
containerd config default | sudo tee /etc/containerd/config.toml >/dev/null
fi
sudo sed -i \
-e 's|^#\\?root *=.*|root = "/opt/dlami/nvme/docker/containerd"|' \
I tested these changes, but they didn't work; the command below did:
sudo sed -i -e 's|^#\?root *=.*|root = "/opt/sagemaker/docker/containerd"|'
containerd config default | sudo tee /etc/containerd/config.toml >/dev/null
fi
sudo sed -i \
-e 's|^#\\?root *=.*|root = "/opt/sagemaker/docker/containerd"|' \
I tested these changes, but they didn't work; the command below did:
sudo sed -i -e 's|^#\?root *=.*|root = "/opt/sagemaker/docker/containerd"|'
Could you use /opt/sagemaker/containerd/data-root instead of /opt/sagemaker/docker/containerd, for consistency with the HyperPod EKS side?
Hi team, when will this PR be merged to main? Our team currently has a workaround, which is simply using
## Summary

Fixes sed regex in containerd root configuration to use correct single backslash `\?` instead of double backslash `\\?`.

## Problem

The double backslash `\\?` in the sed pattern looks for a literal backslash character, not an optional `#`. This prevents the containerd root configuration from being uncommented and updated.

## Solution

Changed sed pattern from `^#\\?root` to `^#\?root` to correctly match optional `#` character.

## Testing

Verified on live cluster that:

- Double backslash `\\?` fails: `#root = "/var/lib/containerd"` (stays commented)
- Single backslash `\?` works: `root = "/opt/sagemaker/containerd/data-root"` (uncommented and updated)

## Changes

- Line 84: Containerd config for `/opt/sagemaker` with correct sed
- Line 101: Containerd config for `/opt/dlami/nvme` with correct sed

Fixes the issue reported in PR #914.
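The difference between the two patterns can be reproduced without a cluster. The sketch below assumes GNU sed (where `\?` in a basic regular expression means "zero or one of the preceding item") and works on a scratch file containing a few lines copied from the config dumps in this thread; the variable names are illustrative only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch copy of the relevant lines (not the real /etc/containerd/config.toml).
cfg="$(mktemp)"
printf '%s\n' \
  'disabled_plugins = ["cri"]' \
  '#root = "/var/lib/containerd"' \
  '#state = "/run/containerd"' > "$cfg"

new_root='/opt/sagemaker/containerd/data-root'

# Double backslash: inside single quotes sed receives `\\?`, which it reads as a
# literal backslash followed by a literal `?`, so the line never matches.
double="$(sed -e 's|^#\\?root *=.*|root = "'"$new_root"'"|' "$cfg" | grep 'root *=')"

# Single backslash: GNU sed reads `#\?` as an optional `#`, so the line matches.
single="$(sed -e 's|^#\?root *=.*|root = "'"$new_root"'"|' "$cfg" | grep 'root *=')"

echo "double backslash: $double"
echo "single backslash: $single"
rm -f "$cfg"
```

With the double backslash the line stays `#root = "/var/lib/containerd"`; with the single backslash it becomes `root = "/opt/sagemaker/containerd/data-root"`. An alternative that avoids the escaping question altogether is extended-regex mode, `sed -E 's|^#?root *=.*|...|'`, though that is not what the PR uses.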
- Use consistent naming: containerd/data-root for both paths
- Fix sed regex: use `\?` instead of `\\?` to match optional `#`
- Minimal changes: just sed commands without extra logic
- Addresses feedback from PR #914
* Update install_docker.sh for containerd configuration

  ## Summary

  Fixes sed regex in containerd root configuration to use correct single backslash `\?` instead of double backslash `\\?`.

  ## Problem

  The double backslash `\\?` in the sed pattern looks for a literal backslash character, not an optional `#`. This prevents the containerd root configuration from being uncommented and updated.

  ## Solution

  Changed sed pattern from `^#\\?root` to `^#\?root` to correctly match optional `#` character.

  ## Testing

  Verified on live cluster that:

  - Double backslash `\\?` fails: `#root = "/var/lib/containerd"` (stays commented)
  - Single backslash `\?` works: `root = "/opt/sagemaker/containerd/data-root"` (uncommented and updated)

  ## Changes

  - Line 84: Containerd config for `/opt/sagemaker` with correct sed
  - Line 101: Containerd config for `/opt/dlami/nvme` with correct sed

  Fixes the issue reported in PR #914.

* Fix containerd path naming consistency and sed regex

  - Use consistent naming: containerd/data-root for both paths
  - Fix sed regex: use `\?` instead of `\\?` to match optional `#`
  - Minimal changes: just sed commands without extra logic
  - Addresses feedback from PR #914

* Add containerd restart to apply config changes

  Containerd only reads config at startup, so restart is required to apply the new root path. Verified containerd data is written to the new location after restart.
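The edit-then-restart sequence described in the last commit could look roughly like the sketch below. This is not the PR's `install_docker.sh`: the function name `apply_containerd_root` and its third argument are hypothetical, GNU sed is assumed for `sed -i -E`, and containerd is assumed to run as a systemd unit named `containerd`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the PR's install_docker.sh): rewrite the top-level
# `root` setting in CONFIG, verify the edit, then optionally restart containerd.
apply_containerd_root() {
  local config="$1" new_root="$2" restart="${3:-yes}"

  # -E selects extended regexes, where a bare `#?` means "optional #".
  sed -i -E 's|^#?root *=.*|root = "'"$new_root"'"|' "$config"

  # Fail loudly if the substitution did not produce the expected line.
  if ! grep -qxF "root = \"$new_root\"" "$config"; then
    echo "root was not updated in $config" >&2
    return 1
  fi

  # containerd reads its config only at startup, so a restart applies the change.
  # Assumption: containerd runs as a systemd unit named `containerd`.
  if [ "$restart" = yes ]; then
    systemctl restart containerd
  fi
}
```

On a node this would be run as root, for example `apply_containerd_root /etc/containerd/config.toml /opt/sagemaker/containerd/data-root`. Passing `no` as the third argument skips the restart, which is useful for trying the edit on a copy of the file first.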
Issue #, if available:
#913 (related #127)
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.