Skip to content

Comments

fix: Retry installation of packages#23

Merged
spetrosi merged 2 commits intolinux-system-roles:mainfrom
spetrosi:retry-package-download
Oct 6, 2025
Merged

fix: Retry installation of packages#23
spetrosi merged 2 commits intolinux-system-roles:mainfrom
spetrosi:retry-package-download

Conversation

@spetrosi
Copy link
Collaborator

@spetrosi spetrosi commented Oct 3, 2025

Third-party repositories from where we pull packages fail sometimes

Authored with assistance from Cursor AI.

Summary by Sourcery

Bug Fixes:

  • Retry package installations for DKMS, NVIDIA drivers, CUDA components, NCCL, fabric manager, RDMA, and OpenMPI until they succeed to handle transient errors

@spetrosi spetrosi requested a review from richm as a code owner October 3, 2025 13:17
@sourcery-ai
Copy link

sourcery-ai bot commented Oct 3, 2025

Reviewer's guide (collapsed on small PRs)

Reviewer's Guide

This PR introduces retry loops for critical package installation tasks by registering each installation result and looping until success, improving resilience against intermittent repository failures.

Sequence diagram for package installation with retry logic

sequenceDiagram
    participant Ansible
    participant Repository
    loop Retry until success
        Ansible->>Repository: Install package
        Repository-->>Ansible: Success/Failure
        alt Failure
            Ansible->>Repository: Retry install
        end
    end
Loading

Class diagram for registered package installation tasks

classDiagram
    class PackageInstallTask {
        +name: string
        +register: string
        +until: condition
    }
    PackageInstallTask <|-- DkmsPackagesInstall
    PackageInstallTask <|-- NvidiaDriverPackagesInstall
    PackageInstallTask <|-- CudaDriverPackagesInstall
    PackageInstallTask <|-- CudaToolkitPackagesInstall
    PackageInstallTask <|-- NvidiaNcclPackagesInstall
    PackageInstallTask <|-- NvidiaFabricManagerPackagesInstall
    PackageInstallTask <|-- RdmaPackagesInstall
    PackageInstallTask <|-- OpenmpiCommonPackagesInstall
    PackageInstallTask <|-- SystemOpenmpiPackagesInstall
    DkmsPackagesInstall: register = __hpc_dkms_packages_install
    NvidiaDriverPackagesInstall: register = __hpc_nvidia_driver_packages_install
    CudaDriverPackagesInstall: register = __hpc_cuda_driver_packages_install
    CudaToolkitPackagesInstall: register = __hpc_cuda_toolkit_packages_install
    NvidiaNcclPackagesInstall: register = __hpc_nvidia_nccl_packages_install
    NvidiaFabricManagerPackagesInstall: register = __hpc_nvidia_fabric_manager_packages_install
    RdmaPackagesInstall: register = __hpc_rdma_packages_install
    OpenmpiCommonPackagesInstall: register = __hpc_openmpi_common_packages_install
    SystemOpenmpiPackagesInstall: register = __hpc_system_openmpi_packages_install
    PackageInstallTask: until = <register> is success
Loading

File-Level Changes

Change Details Files
Add retry loops to package installations
  • Register and retry DKMS package installation until success
  • Register and retry NVIDIA driver installation until success
  • Register and retry CUDA driver installation until success
  • Register and retry CUDA toolkit installation until success
  • Register and retry NVIDIA NCCL package installation until success
  • Register and retry NVIDIA Fabric Manager installation until success
  • Register and retry RDMA package installation until success
  • Register and retry OpenMPI common package installation until success
  • Register and retry system OpenMPI package installation until success
tasks/main.yml

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `tasks/main.yml:198` </location>
<code_context>
         use: "{{ (__hpc_server_is_ostree | d(false)) |
           ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+      register: __hpc_dkms_packages_install
+      until: __hpc_dkms_packages_install is success

     - name: Install NVIDIA driver
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Consider adding a 'retries' parameter to control the number of attempts.

Not setting 'retries' and 'delay' can cause the loop to run indefinitely if the condition fails. Defining these parameters ensures the task does not hang and improves error handling.

Suggested implementation:

```
      register: __hpc_dkms_packages_install
      until: __hpc_dkms_packages_install is success
      retries: 5
      delay: 10

```

```
      register: __hpc_nvidia_driver_packages_install
      until: __hpc_nvidia_driver_packages_install is success
      retries: 5
      delay: 10

```
</issue_to_address>

### Comment 2
<location> `tasks/main.yml:206-207` </location>
<code_context>
         use: "{{ (__hpc_server_is_ostree | d(false)) |
           ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+      register: __hpc_nvidia_driver_packages_install
+      until: __hpc_nvidia_driver_packages_install is success

     # This makes the role not idempotent.
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Explicitly set 'retries' and 'delay' for the 'until' loop.

Specifying 'retries' and 'delay' prevents the loop from running indefinitely and allows for controlled failure after a defined number of attempts.

```suggestion
      register: __hpc_nvidia_driver_packages_install
      until: __hpc_nvidia_driver_packages_install is success
      retries: 5
      delay: 10
```
</issue_to_address>

### Comment 3
<location> `tasks/main.yml:228-229` </location>
<code_context>
         use: "{{ (__hpc_server_is_ostree | d(false)) |
           ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+      register: __hpc_cuda_driver_packages_install
+      until: __hpc_cuda_driver_packages_install is success

     - name: Enable nvidia-persistenced.service
</code_context>

<issue_to_address>
**suggestion:** Add 'retries' and 'delay' to the 'until' statement for reliability.

Setting 'retries' and 'delay' ensures the task does not retry endlessly and allows for controlled failure handling.

```suggestion
      register: __hpc_cuda_driver_packages_install
      until: __hpc_cuda_driver_packages_install is success
      retries: 5
      delay: 10
```
</issue_to_address>

### Comment 4
<location> `tasks/main.yml:248-249` </location>
<code_context>
         use: "{{ (__hpc_server_is_ostree | d(false)) |
           ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+      register: __hpc_cuda_toolkit_packages_install
+      until: __hpc_cuda_toolkit_packages_install is success

     - name: Prevent update of CUDA Toolkit packages
</code_context>

<issue_to_address>
**suggestion:** Include 'retries' and 'delay' to avoid endless retries.

This will help prevent the playbook from hanging during repeated installation failures.

```suggestion
      register: __hpc_cuda_toolkit_packages_install
      until: __hpc_cuda_toolkit_packages_install is success
      retries: 5
      delay: 30
```
</issue_to_address>

### Comment 5
<location> `tasks/main.yml:269-270` </location>
<code_context>
         use: "{{ (__hpc_server_is_ostree | d(false)) |
           ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+      register: __hpc_nvidia_nccl_packages_install
+      until: __hpc_nvidia_nccl_packages_install is success

     - name: Prevent update of NVIDIA NCCL packages
</code_context>

<issue_to_address>
**suggestion:** Specify 'retries' and 'delay' for the 'until' loop.

This prevents infinite retries and ensures the task fails after a set number of attempts.

```suggestion
      register: __hpc_nvidia_nccl_packages_install
      until: __hpc_nvidia_nccl_packages_install is success
      retries: 5
      delay: 10
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +228 to +229
register: __hpc_cuda_driver_packages_install
until: __hpc_cuda_driver_packages_install is success
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Add 'retries' and 'delay' to the 'until' statement for reliability.

Setting 'retries' and 'delay' ensures the task does not retry endlessly and allows for controlled failure handling.

Suggested change
register: __hpc_cuda_driver_packages_install
until: __hpc_cuda_driver_packages_install is success
register: __hpc_cuda_driver_packages_install
until: __hpc_cuda_driver_packages_install is success
retries: 5
delay: 10

Comment on lines +269 to +270
register: __hpc_nvidia_nccl_packages_install
until: __hpc_nvidia_nccl_packages_install is success
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Specify 'retries' and 'delay' for the 'until' loop.

This prevents infinite retries and ensures the task fails after a set number of attempts.

Suggested change
register: __hpc_nvidia_nccl_packages_install
until: __hpc_nvidia_nccl_packages_install is success
register: __hpc_nvidia_nccl_packages_install
until: __hpc_nvidia_nccl_packages_install is success
retries: 5
delay: 10

@spetrosi
Copy link
Collaborator Author

spetrosi commented Oct 3, 2025

[citest]

Thirdparty repositories from where we pull packages fail sometimes
@spetrosi spetrosi force-pushed the retry-package-download branch from 324bee6 to ab9ce0c Compare October 3, 2025 14:10
@spetrosi
Copy link
Collaborator Author

spetrosi commented Oct 3, 2025

[citest]

1 similar comment
@spetrosi
Copy link
Collaborator Author

spetrosi commented Oct 3, 2025

[citest]

@spetrosi spetrosi force-pushed the retry-package-download branch from c2a1819 to 018ec3f Compare October 3, 2025 15:55
@spetrosi
Copy link
Collaborator Author

spetrosi commented Oct 3, 2025

[citest]

@spetrosi spetrosi force-pushed the retry-package-download branch 2 times, most recently from 6851d57 to c4a6b4b Compare October 3, 2025 18:02
@spetrosi spetrosi force-pushed the retry-package-download branch from c4a6b4b to f60b595 Compare October 6, 2025 10:56
@spetrosi
Copy link
Collaborator Author

spetrosi commented Oct 6, 2025

[citest]

@spetrosi spetrosi merged commit 7ce5299 into linux-system-roles:main Oct 6, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants