fix: Retry installation of packages #23
Reviewer's Guide
This PR introduces retry loops for critical package installation tasks by registering each installation result and looping until success, improving resilience against intermittent repository failures.
Sequence diagram for package installation with retry logic
sequenceDiagram
participant Ansible
participant Repository
loop Retry until success
Ansible->>Repository: Install package
Repository-->>Ansible: Success/Failure
alt Failure
Ansible->>Repository: Retry install
end
end
Class diagram for registered package installation tasks
classDiagram
class PackageInstallTask {
+name: string
+register: string
+until: condition
}
PackageInstallTask <|-- DkmsPackagesInstall
PackageInstallTask <|-- NvidiaDriverPackagesInstall
PackageInstallTask <|-- CudaDriverPackagesInstall
PackageInstallTask <|-- CudaToolkitPackagesInstall
PackageInstallTask <|-- NvidiaNcclPackagesInstall
PackageInstallTask <|-- NvidiaFabricManagerPackagesInstall
PackageInstallTask <|-- RdmaPackagesInstall
PackageInstallTask <|-- OpenmpiCommonPackagesInstall
PackageInstallTask <|-- SystemOpenmpiPackagesInstall
DkmsPackagesInstall: register = __hpc_dkms_packages_install
NvidiaDriverPackagesInstall: register = __hpc_nvidia_driver_packages_install
CudaDriverPackagesInstall: register = __hpc_cuda_driver_packages_install
CudaToolkitPackagesInstall: register = __hpc_cuda_toolkit_packages_install
NvidiaNcclPackagesInstall: register = __hpc_nvidia_nccl_packages_install
NvidiaFabricManagerPackagesInstall: register = __hpc_nvidia_fabric_manager_packages_install
RdmaPackagesInstall: register = __hpc_rdma_packages_install
OpenmpiCommonPackagesInstall: register = __hpc_openmpi_common_packages_install
SystemOpenmpiPackagesInstall: register = __hpc_system_openmpi_packages_install
PackageInstallTask: until = <register> is success
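As a rough illustration of the pattern the diagrams describe, one retried install task might look like the sketch below. The task name and the `__hpc_dkms_packages` variable are assumptions made for the example and do not appear in this excerpt; the `use` expression and the `register`/`until` names are taken from the diff context in the review comments.

```yaml
# Sketch only: the task name and package list variable are hypothetical.
- name: Install dkms packages
  ansible.builtin.package:
    name: "{{ __hpc_dkms_packages }}"  # hypothetical variable
    state: present
    use: "{{ (__hpc_server_is_ostree | d(false)) |
      ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
  # Store the module result so the until condition can inspect it.
  register: __hpc_dkms_packages_install
  # Re-run the task while the registered result is not a success.
  until: __hpc_dkms_packages_install is success
  # Optional: when omitted, Ansible uses 3 retries with a 5-second delay.
  retries: 5
  delay: 10
```

The same shape would apply to the other eight tasks, each with its own register variable.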
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `tasks/main.yml:198` </location>
<code_context>
use: "{{ (__hpc_server_is_ostree | d(false)) |
ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+ register: __hpc_dkms_packages_install
+ until: __hpc_dkms_packages_install is success
- name: Install NVIDIA driver
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Consider adding a 'retries' parameter to control the number of attempts.
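When 'retries' and 'delay' are not set, an 'until' loop falls back to Ansible's defaults of 3 retries with a 5-second delay, so it does not run indefinitely, but the retry budget is implicit and may be too small for a flaky repository. Defining these parameters makes the number of attempts and the wait between them explicit and tunable.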
Suggested implementation:
```
register: __hpc_dkms_packages_install
until: __hpc_dkms_packages_install is success
retries: 5
delay: 10
```
```
register: __hpc_nvidia_driver_packages_install
until: __hpc_nvidia_driver_packages_install is success
retries: 5
delay: 10
```
</issue_to_address>
### Comment 2
<location> `tasks/main.yml:206-207` </location>
<code_context>
use: "{{ (__hpc_server_is_ostree | d(false)) |
ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+ register: __hpc_nvidia_driver_packages_install
+ until: __hpc_nvidia_driver_packages_install is success
# This makes the role not idempotent.
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Explicitly set 'retries' and 'delay' for the 'until' loop.
Specifying 'retries' and 'delay' replaces Ansible's implicit defaults (3 retries, 5 seconds apart) with an explicit retry budget and allows for controlled failure after a defined number of attempts.
```suggestion
register: __hpc_nvidia_driver_packages_install
until: __hpc_nvidia_driver_packages_install is success
retries: 5
delay: 10
```
</issue_to_address>
### Comment 3
<location> `tasks/main.yml:228-229` </location>
<code_context>
use: "{{ (__hpc_server_is_ostree | d(false)) |
ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+ register: __hpc_cuda_driver_packages_install
+ until: __hpc_cuda_driver_packages_install is success
- name: Enable nvidia-persistenced.service
</code_context>
<issue_to_address>
**suggestion:** Add 'retries' and 'delay' to the 'until' statement for reliability.
Setting 'retries' and 'delay' makes the retry budget explicit instead of relying on Ansible's defaults (3 retries, 5 seconds apart) and allows for controlled failure handling.
```suggestion
register: __hpc_cuda_driver_packages_install
until: __hpc_cuda_driver_packages_install is success
retries: 5
delay: 10
```
</issue_to_address>
### Comment 4
<location> `tasks/main.yml:248-249` </location>
<code_context>
use: "{{ (__hpc_server_is_ostree | d(false)) |
ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+ register: __hpc_cuda_toolkit_packages_install
+ until: __hpc_cuda_toolkit_packages_install is success
- name: Prevent update of CUDA Toolkit packages
</code_context>
<issue_to_address>
**suggestion:** Include 'retries' and 'delay' to make the retry budget explicit.
Without them, Ansible uses its defaults of 3 retries with a 5-second delay, which may not be enough to ride out a repository outage during installation.
```suggestion
register: __hpc_cuda_toolkit_packages_install
until: __hpc_cuda_toolkit_packages_install is success
retries: 5
delay: 30
```
</issue_to_address>
### Comment 5
<location> `tasks/main.yml:269-270` </location>
<code_context>
use: "{{ (__hpc_server_is_ostree | d(false)) |
ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+ register: __hpc_nvidia_nccl_packages_install
+ until: __hpc_nvidia_nccl_packages_install is success
- name: Prevent update of NVIDIA NCCL packages
</code_context>
<issue_to_address>
**suggestion:** Specify 'retries' and 'delay' for the 'until' loop.
This replaces Ansible's default of 3 retries with a 5-second delay and ensures the task fails after a chosen number of attempts.
```suggestion
register: __hpc_nvidia_nccl_packages_install
until: __hpc_nvidia_nccl_packages_install is success
retries: 5
delay: 10
```
</issue_to_address>
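Since the review asks for the same `retries` and `delay` values on several tasks, one possible way to keep them consistent is to define them once and reference them from each task. This is a sketch under stated assumptions, not what the PR does: the variable names `hpc_install_retries`, `hpc_install_delay`, and `__hpc_nvidia_nccl_packages`, the file locations, and the task names are hypothetical.

```yaml
# defaults/main.yml (hypothetical variable names)
hpc_install_retries: 5
hpc_install_delay: 10
---
# tasks/main.yml
- name: Install NVIDIA NCCL packages
  ansible.builtin.package:
    name: "{{ __hpc_nvidia_nccl_packages }}"  # hypothetical variable
    state: present
  register: __hpc_nvidia_nccl_packages_install
  until: __hpc_nvidia_nccl_packages_install is success
  # Shared retry budget, defined once in defaults.
  retries: "{{ hpc_install_retries }}"
  delay: "{{ hpc_install_delay }}"

# A result registered on a task with until carries an attempts counter.
- name: Report how many attempts the install needed
  ansible.builtin.debug:
    msg: "NCCL install finished after {{ __hpc_nvidia_nccl_packages_install.attempts }} attempt(s)"
```

Keeping the values in one place would let a user raise the budget for an unreliable mirror without editing every task.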
[citest]
Third-party repositories that we pull packages from sometimes fail
324bee6 to ab9ce0c
[citest]
[citest]
c2a1819 to 018ec3f
[citest]
6851d57 to c4a6b4b
c4a6b4b to f60b595
[citest]
Third-party repositories that we pull packages from sometimes fail
Authored with assistance from Cursor AI.
Summary by Sourcery
Bug Fixes: