Add DataClean CRD for cache management #5634
Nakshatra480 wants to merge 1 commit into fluid-cloudnative:master
Summary of Changes
Hello @Nakshatra480, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses issue #2177.
Code Review
This pull request introduces a new DataClean CRD and its controller, a useful addition for managing dataset caches. The implementation is well structured and follows existing patterns in the project. My feedback covers four points: propagating context, which is critical for robust controllers; removing an unused parameter; turning a hardcoded value into a constant; and making a status check more robust. These changes will improve the maintainability and reliability of the new controller.
    name := dataset.Name

    masterName := fmt.Sprintf("%s-jindofs-master", name)
    master, err := kubeclient.GetStatefulSet(r.Client, masterName, namespace)
The underlying kubeclient.GetStatefulSet function appears to use context.TODO(). For better context propagation, cancellation, and tracing, it's recommended to refactor GetStatefulSet to accept a context.Context and pass the reconciler's context from here. This same issue applies to the call on line 302.
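A minimal sketch of the context-aware shape this comment asks for. It is not the project's actual helper: `objectGetter`, `statefulSet`, `fakeGetter`, and `runExample` are hypothetical stand-ins so the example compiles with the standard library alone, and the real `kubeclient.GetStatefulSet` signature may differ.

```go
package main

import (
	"context"
	"fmt"
)

// statefulSet is a hypothetical stand-in for the Kubernetes StatefulSet type.
type statefulSet struct {
	Name      string
	Namespace string
}

// objectGetter is a hypothetical stand-in for the read side of a Kubernetes client.
type objectGetter interface {
	Get(ctx context.Context, namespace, name string) (*statefulSet, error)
}

// getStatefulSet forwards the caller's context instead of creating context.TODO(),
// so cancellation and deadlines set by the reconciler reach the lookup.
func getStatefulSet(ctx context.Context, c objectGetter, name, namespace string) (*statefulSet, error) {
	if err := ctx.Err(); err != nil {
		return nil, fmt.Errorf("get statefulset %s/%s: %w", namespace, name, err)
	}
	return c.Get(ctx, namespace, name)
}

// fakeGetter returns a fixed object unless the context is already done.
type fakeGetter struct{}

func (fakeGetter) Get(ctx context.Context, namespace, name string) (*statefulSet, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return &statefulSet{Name: name, Namespace: namespace}, nil
}

// runExample looks up the master once with a live context and once after
// cancellation. It returns the name found and the error from the cancelled call.
func runExample() (string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	sts, err := getStatefulSet(ctx, fakeGetter{}, "demo-jindofs-master", "default")
	if err != nil {
		return "", err
	}

	cancel() // simulate the reconcile request being cancelled
	_, cancelledErr := getStatefulSet(ctx, fakeGetter{}, "demo-jindofs-master", "default")
	return sts.Name, cancelledErr
}
```

With `context.TODO()` inside the helper, the second call in `runExample` would have no way to observe the cancellation; passing `ctx` through is what makes it fail fast.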
pkg/utils/dataclean.go
Outdated
    func GetDataClean(c client.Client, name, namespace string) (*datav1alpha1.DataClean, error) {
        key := types.NamespacedName{
            Namespace: namespace,
            Name:      name,
        }
        var dataClean datav1alpha1.DataClean
        if err := c.Get(context.TODO(), key, &dataClean); err != nil {
            return nil, err
        }
        return &dataClean, nil
    }
The function GetDataClean uses context.TODO(). It's a best practice to propagate the context from the caller instead of creating a new one with context.TODO(). Please update the function signature to accept a context.Context and use it in the c.Get call. The caller in DataCleanReconciler will need to be updated to pass its context.
Suggested change:
    func GetDataClean(ctx context.Context, c client.Client, name, namespace string) (*datav1alpha1.DataClean, error) {
        key := types.NamespacedName{
            Namespace: namespace,
            Name:      name,
        }
        var dataClean datav1alpha1.DataClean
        if err := c.Get(ctx, key, &dataClean); err != nil {
            return nil, err
        }
        return &dataClean, nil
    }
    }

    start := time.Now()
    cleanErr := r.cleanCacheForRuntime(dataClean, dataset, boundedRuntime.Type)
The dataClean parameter is passed to cleanCacheForRuntime and then to cleanJindoCache/cleanAlluxioCache, but it's never used. You can remove this parameter from all three function signatures (cleanCacheForRuntime, cleanJindoCache, cleanAlluxioCache) to simplify the code.
Suggested change:
    cleanErr := r.cleanCacheForRuntime(dataset, boundedRuntime.Type)
        return ctrl.Result{}, nil
    }

    completionCond := dataClean.Status.Conditions[0]
Accessing the completion condition with dataClean.Status.Conditions[0] is brittle. It assumes the desired condition is always the first element and that the slice is not empty (which is checked before). A more robust approach would be to iterate through the conditions and find the one with Type as common.Complete or common.Failed.
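One way the lookup could be written, as a sketch only. The `conditionType` and `condition` types below are hypothetical stand-ins for the project's API types, and the string values `"Complete"` and `"Failed"` are assumptions standing in for `common.Complete` and `common.Failed`.

```go
package main

// conditionType and condition are hypothetical stand-ins for the project's
// condition types; the real struct likely carries more fields.
type conditionType string

const (
	complete conditionType = "Complete" // assumed value of common.Complete
	failed   conditionType = "Failed"   // assumed value of common.Failed
)

type condition struct {
	Type   conditionType
	Reason string
}

// findTerminalCondition returns the first condition whose type is complete
// or failed. The boolean is false when the slice is empty or contains no
// terminal condition, so callers never index into an empty slice.
func findTerminalCondition(conds []condition) (condition, bool) {
	for _, c := range conds {
		if c.Type == complete || c.Type == failed {
			return c, true
		}
	}
	return condition{}, false
}
```

Searching by type keeps the check correct even if another condition is later prepended to the slice.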
        return nil
    }

    const defaultTimeoutSeconds int32 = 60
This timeout is hardcoded within the function. It's better to define this as a package-level constant for better visibility and maintainability. For example: const alluxioCleanCacheTimeoutSeconds = 60. Also, 60 seconds might be too short for very large caches; consider making it configurable in the future.
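A sketch of how the constant and a future override might fit together. The names `defaultCleanCacheTimeoutSeconds` and `cleanCacheTimeout` are hypothetical, and where an override would come from (a spec field, a flag, an environment variable) is left open.

```go
package main

import "time"

// defaultCleanCacheTimeoutSeconds is a hypothetical package-level default.
// 60 is the value hardcoded in the reviewed function.
const defaultCleanCacheTimeoutSeconds int32 = 60

// cleanCacheTimeout returns the override when it is positive and falls back
// to the package default otherwise.
func cleanCacheTimeout(overrideSeconds int32) time.Duration {
	if overrideSeconds > 0 {
		return time.Duration(overrideSeconds) * time.Second
	}
	return time.Duration(defaultCleanCacheTimeoutSeconds) * time.Second
}
```

Treating zero as "not set" lets an optional field default cleanly without a pointer type, at the cost of not being able to request a zero timeout.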
559440c to 0d36653
[APPROVALNOTIFIER] This PR is NOT APPROVED
This commit introduces a new DataClean CRD that allows users to clean dataset cache on demand, complementing the existing DataLoad feature.

Changes:
- Add DataClean CRD definition with target dataset and TTL support
- Implement controller to handle cache cleaning lifecycle
- Support both Alluxio and Jindo runtimes
- Integrate with existing CleanCache operations

Signed-off-by: Nakshatra Sharma <nakshatra.sharma3012@gmail.com>
fcb208c to ff1a740
Fixes #2177
This PR introduces a new DataClean CRD that allows users to clean dataset cache on demand. This complements the existing DataLoad feature by providing the opposite operation: clearing cached data when it is no longer needed.
What does this add?
Controller Implementation
The controller handles the full lifecycle of cache cleaning:
- alluxio fs free -f / to free cached data.
- jindo jfs -formatCache -force to clear cache.
Usage Example
After creating this resource, the controller will clean the cache for my-dataset and automatically remove the DataClean resource 5 minutes after completion.
Implementation Details
CleanCache() methods from Alluxio and Jindo operations packages.
Why is this useful?
Users can now manage cache lifecycle more effectively:
This completes the cache management story alongside DataLoad, giving users full control over their dataset caching strategy.
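The TTL behaviour mentioned in the usage example (removing the DataClean resource 5 minutes after completion) comes down to a time comparison. The sketch below illustrates that check under assumptions; it is not the controller's code, and `ttlRemaining`, `shouldDelete`, and `exampleTTL` are hypothetical names.

```go
package main

import "time"

// ttlRemaining reports how long to wait before a finished resource may be
// deleted. A zero or negative result means the TTL has already expired.
func ttlRemaining(completedAt time.Time, ttl time.Duration, now time.Time) time.Duration {
	return completedAt.Add(ttl).Sub(now)
}

// shouldDelete is true once the TTL has elapsed since completion.
func shouldDelete(completedAt time.Time, ttl time.Duration, now time.Time) bool {
	return ttlRemaining(completedAt, ttl, now) <= 0
}

// exampleTTL evaluates a 5 minute TTL two minutes and six minutes after a
// fixed completion time.
func exampleTTL() (early, late bool) {
	done := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	ttl := 5 * time.Minute
	early = shouldDelete(done, ttl, done.Add(2*time.Minute))
	late = shouldDelete(done, ttl, done.Add(6*time.Minute))
	return early, late
}
```

A reconciler following this pattern could requeue after the positive `ttlRemaining` value instead of polling, then delete the resource once `shouldDelete` returns true.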