Add DataClean CRD for cache management #5634

Draft
Nakshatra480 wants to merge 1 commit into fluid-cloudnative:master from Nakshatra480:feature/dataclean-crd1

@Nakshatra480

Fixes #2177

This PR introduces a new DataClean CRD that allows users to clean dataset cache on demand. This complements the existing DataLoad feature by providing the opposite operation: clearing cached data when it's no longer needed.

What does this add?

  • New CRD: DataClean
    • A new custom resource that lets users trigger cache cleaning operations on datasets.
    • The CRD includes:
      • Target dataset specification: Point to any dataset that has a bound acceleration runtime
      • TTL support: Optional automatic deletion of the DataClean resource after the operation completes
      • Status tracking: Phase, duration, and conditions to monitor operation progress

Controller Implementation

The controller handles the full lifecycle of cache cleaning:

  1. Validates the target dataset - Ensures the dataset exists and is in the same namespace.
  2. Locates the acceleration runtime - Finds the Alluxio or Jindo runtime bound to the dataset.
  3. Executes cache cleaning - Calls the runtime-specific cache cleaning operation:
    • Alluxio: Uses alluxio fs free -f / to free cached data.
    • Jindo: Uses jindo jfs -formatCache -force to clear cache.
  4. Updates status - Sets phase to Complete or Failed with detailed conditions.
  5. Handles TTL - Automatically deletes the DataClean resource after the specified TTL.

Usage Example

apiVersion: data.fluid.io/v1alpha1
kind: DataClean
metadata:
  name: clean-my-dataset
spec:
  dataset:
    name: my-dataset
  ttlSecondsAfterFinished: 300

After creating this resource, the controller will clean the cache for my-dataset and automatically remove the DataClean resource 5 minutes after completion.

Implementation Details

  • Reuses existing CleanCache() methods from Alluxio and Jindo operations packages.
  • Follows the same controller pattern as DataLoad and DataMigrate.
  • Includes proper RBAC permissions for accessing datasets, pods, and statefulsets.
  • Handles edge cases like missing runtime, unready master pods, and namespace mismatches.

Why is this useful?

Users can now manage cache lifecycle more effectively:

  • Free up storage when cached data is no longer needed.
  • Clean cache before reloading fresh data.
  • Automate cache cleanup workflows with TTL.
  • Get visibility into cache cleaning operations through status tracking.

This completes the cache management story alongside DataLoad, giving users full control over their dataset caching strategy.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Nakshatra480, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses issue #2177 by introducing the DataClean Custom Resource Definition (CRD), which empowers users to initiate on-demand cache cleaning for their datasets. This new capability complements existing data management features by allowing efficient freeing of storage, preparing for new data loads, and automating cleanup workflows. The accompanying controller orchestrates the entire cleaning process, supporting Alluxio and Jindo runtimes, and provides robust status tracking and automatic resource deletion via Time-To-Live (TTL) settings.

Highlights

  • New DataClean CRD: Introduces a new DataClean Custom Resource Definition (CRD) to enable on-demand cleaning of dataset caches, addressing issue #2177 ([FEATURES]need suport clean dataset cache).
  • CRD Capabilities: The DataClean CRD allows specifying a target dataset, an optional ttlSecondsAfterFinished for automatic cleanup, and provides comprehensive status tracking for the operation's phase, duration, and conditions.
  • Controller Implementation: A new controller manages the full lifecycle of DataClean resources, including validation of the target dataset, locating the appropriate runtime (Alluxio or Jindo), executing runtime-specific cache cleaning commands, updating the operation status, and handling TTL-based resource deletion.
  • Technical Details: The implementation reuses existing CleanCache() methods from runtime-specific operation packages, follows the established Fluid controller pattern, includes proper RBAC permissions, and handles various edge cases such as missing runtimes or namespace mismatches.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new DataClean CRD and its controller, a useful addition for managing dataset caches. The implementation is well structured and follows existing patterns in the project. My feedback covers four points: propagating context (critical for robust controllers), removing an unused parameter, turning a hardcoded value into a constant, and making a status check more robust. These changes will improve the maintainability and reliability of the new controller.

name := dataset.Name

masterName := fmt.Sprintf("%s-jindofs-master", name)
master, err := kubeclient.GetStatefulSet(r.Client, masterName, namespace)

high

The underlying kubeclient.GetStatefulSet function appears to use context.TODO(). For better context propagation, cancellation, and tracing, it's recommended to refactor GetStatefulSet to accept a context.Context and pass the reconciler's context from here. This same issue applies to the call on line 302.

Comment on lines 27 to 37
func GetDataClean(c client.Client, name, namespace string) (*datav1alpha1.DataClean, error) {
	key := types.NamespacedName{
		Namespace: namespace,
		Name:      name,
	}
	var dataClean datav1alpha1.DataClean
	if err := c.Get(context.TODO(), key, &dataClean); err != nil {
		return nil, err
	}
	return &dataClean, nil
}

high

The function GetDataClean uses context.TODO(). It's a best practice to propagate the context from the caller instead of creating a new one with context.TODO(). Please update the function signature to accept a context.Context and use it in the c.Get call. The caller in DataCleanReconciler will need to be updated to pass its context.

Suggested change
- func GetDataClean(c client.Client, name, namespace string) (*datav1alpha1.DataClean, error) {
- 	key := types.NamespacedName{
- 		Namespace: namespace,
- 		Name:      name,
- 	}
- 	var dataClean datav1alpha1.DataClean
- 	if err := c.Get(context.TODO(), key, &dataClean); err != nil {
- 		return nil, err
- 	}
- 	return &dataClean, nil
- }
+ func GetDataClean(ctx context.Context, c client.Client, name, namespace string) (*datav1alpha1.DataClean, error) {
+ 	key := types.NamespacedName{
+ 		Namespace: namespace,
+ 		Name:      name,
+ 	}
+ 	var dataClean datav1alpha1.DataClean
+ 	if err := c.Get(ctx, key, &dataClean); err != nil {
+ 		return nil, err
+ 	}
+ 	return &dataClean, nil
+ }

}

start := time.Now()
cleanErr := r.cleanCacheForRuntime(dataClean, dataset, boundedRuntime.Type)

medium

The dataClean parameter is passed to cleanCacheForRuntime and then to cleanJindoCache/cleanAlluxioCache, but it's never used. You can remove this parameter from all three function signatures (cleanCacheForRuntime, cleanJindoCache, cleanAlluxioCache) to simplify the code.

Suggested change
- cleanErr := r.cleanCacheForRuntime(dataClean, dataset, boundedRuntime.Type)
+ cleanErr := r.cleanCacheForRuntime(dataset, boundedRuntime.Type)

return ctrl.Result{}, nil
}

completionCond := dataClean.Status.Conditions[0]

medium

Accessing the completion condition with dataClean.Status.Conditions[0] is brittle. It assumes the desired condition is always the first element and that the slice is not empty (which is checked before). A more robust approach would be to iterate through the conditions and find the one with Type as common.Complete or common.Failed.
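
A minimal sketch of the suggested lookup, using a simplified stand-in type: `Condition` and `findCompletionCondition` are hypothetical, and the string values "Complete" and "Failed" are assumed here in place of the project's `common.Complete` and `common.Failed` constants.

```go
package main

// Condition is a simplified, hypothetical stand-in for the status
// condition type used by the DataClean resource.
type Condition struct {
	Type    string
	Message string
}

// findCompletionCondition returns the first condition whose type is
// "Complete" or "Failed", wherever it sits in the slice. The boolean is
// false when no such condition exists, including for an empty slice.
func findCompletionCondition(conds []Condition) (Condition, bool) {
	for _, c := range conds {
		if c.Type == "Complete" || c.Type == "Failed" {
			return c, true
		}
	}
	return Condition{}, false
}
```

Unlike indexing with `[0]`, this does not depend on the ordering of conditions and needs no separate emptiness check.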

return nil
}

const defaultTimeoutSeconds int32 = 60

medium

This timeout is hardcoded within the function. It's better to define this as a package-level constant for better visibility and maintainability. For example: const alluxioCleanCacheTimeoutSeconds = 60. Also, 60 seconds might be too short for very large caches; consider making it configurable in the future.

Nakshatra480 force-pushed the feature/dataclean-crd1 branch from 559440c to 0d36653 on January 29, 2026 13:45
fluid-e2e-bot bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ronggu for approval by writing /assign @ronggu in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This commit introduces a new DataClean CRD that allows users to clean
dataset cache on demand, complementing the existing DataLoad feature.

Changes:
- Add DataClean CRD definition with target dataset and TTL support
- Implement controller to handle cache cleaning lifecycle
- Support both Alluxio and Jindo runtimes
- Integrate with existing CleanCache operations

Signed-off-by: Nakshatra Sharma <nakshatra.sharma3012@gmail.com>
Nakshatra480 force-pushed the feature/dataclean-crd1 branch from fcb208c to ff1a740 on January 29, 2026 14:17
Development

Successfully merging this pull request may close these issues.

[FEATURES]need suport clean dataset cache

1 participant