Consolidate repo_name and repo_path into a Single Repository Identifier #517#526
moksha-hub wants to merge 9 commits into potpie-ai:main from
Conversation
Walkthrough

Centralizes repository identification with a new RepositoryResolver and shifts the parsing flow to accept a single repository identifier, resolved into a RepoDetails object (including is_local, repo_name, repo_path, branch_name, commit_id). Controller, helper, service, schema, validator, and task layers consume the resolved RepoDetails; worktree/clone helpers were added.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Controller as ParsingController
    participant Resolver as RepositoryResolver
    participant Helper as ParsingHelper
    participant RepoMgr as RepoManager/CodeProvider
    Client->>Controller: POST parse (repository, branch?, commit?)
    Controller->>Resolver: resolve(repository, branch, commit)
    Resolver-->>Controller: RepoDetails(is_local?, repo_name, repo_path, ...)
    alt repo is local
        Controller->>Helper: setup local project (repo_path, repo_name)
        Helper-->>Controller: local files prepared / worktree created
    else repo is remote
        Controller->>RepoMgr: authorize/fetch repo (credentials or tarball)
        RepoMgr-->>Controller: tarball / clone URL / metadata
        Controller->>Helper: _clone_repository_with_auth(repo_details, target_dir)
        Helper-->>Controller: files prepared / worktree created
    end
    Controller->>Controller: persist project / enqueue parsing task (repo_details.model_dump())
    Controller-->>Client: 202 Accepted (task queued)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
app/modules/parsing/graph_construction/parsing_helper.py (3)
63-108: Local repo open needs explicit error mapping (InvalidGitRepositoryError / permissions).

Repo(repo_details.repo_path) can raise (invalid repo, permissions). Right now that'll bubble up as a 500 later. Consider catching git.InvalidGitRepositoryError / GitCommandError and returning a 400 with a clear message.
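A minimal, stdlib-only sketch of the suggested mapping. HTTPError and InvalidGitRepositoryError below are stand-ins for fastapi.HTTPException and git.InvalidGitRepositoryError, and the opener callback stands in for GitPython's Repo constructor:

```python
# Sketch only: stand-ins for fastapi.HTTPException and git.InvalidGitRepositoryError.
class HTTPError(Exception):
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

class InvalidGitRepositoryError(Exception):
    """Stand-in for git.InvalidGitRepositoryError."""

def open_local_repo(path, opener):
    """Map low-level git/OS errors to a client-facing 400 instead of a late 500."""
    try:
        return opener(path)
    except (InvalidGitRepositoryError, PermissionError) as e:
        raise HTTPError(400, f"Cannot open local repository at {path}: {e}") from e

def bad_opener(path):
    raise InvalidGitRepositoryError("not a git repository")

try:
    open_local_repo("/tmp/not-a-repo", bad_opener)
except HTTPError as err:
    print(err.status_code)  # → 400
```

The same wrapper can sit around any Repo construction site so callers only ever see an HTTP-shaped error.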
697-767: Fix isinstance(repo_details, Repo) to isinstance(repo, Repo) and add local repository checkout logic.

The check at line 745 will never be true because repo_details is a RepoDetails Pydantic model, not a GitPython Repo object. This is dead code. The condition should use the repo parameter instead (which contains the actual Repo object for local repositories).

Additionally, the local repository early return at lines 741-744 skips critical logic:

- No checkout of commit_id or branch for local repos
- latest_commit_sha is not computed
- ProjectService.update_project() at lines 811-817 is never called, leaving project metadata stale

The local path should determine the target commit/branch, compute latest_commit_sha, and call ProjectService.update_project() before returning, maintaining parity with the remote repository flow.

```diff
- if isinstance(repo_details, Repo):
+ if isinstance(repo, Repo):
```
358-368: Blocking git clone is called from an async method; move the clone to a thread (or revert to async).

download_and_extract_tarball() is async and falls back to _clone_repository_with_auth() (sync, heavy IO). That can block the event loop running parsing (especially in your Celery "long-lived event loop" setup).

```diff
-    return self._clone_repository_with_auth(
-        repo, branch, target_dir, user_id
-    )
+    import asyncio
+    return await asyncio.to_thread(
+        self._clone_repository_with_auth,
+        repo, branch, target_dir, user_id,
+    )
```

app/modules/parsing/graph_construction/parsing_validator.py (1)
7-27: Guard blocks all repos when it should block only remote ones.

The condition at line 22 doesn't check repo_details.is_local before raising the 403 exception. This means local repository parsing is incorrectly denied when user_id == defaultUsername, contradicting the error message, which states "Cannot parse remote repository without auth token". Since RepoDetails.is_local is a required field, gate the auth check on remote repositories:

```diff
-    # Check if user is trying to parse remote repository without proper auth
-    if user_id == os.getenv("defaultUsername"):
+    # Check if user is trying to parse remote repository without proper auth
+    if not repo_details.is_local and user_id == os.getenv("defaultUsername"):
         raise HTTPException(
```
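The suggested guard can be sketched with stdlib stand-ins (the dataclass mimics the Pydantic model, and PermissionError stands in for the HTTP 403):

```python
import os
from dataclasses import dataclass

@dataclass
class RepoDetails:  # minimal stand-in for the Pydantic model
    is_local: bool

def check_parse_auth(repo_details: RepoDetails, user_id: str) -> None:
    # Only *remote* repositories require an auth token; local parsing passes.
    if not repo_details.is_local and user_id == os.getenv("defaultUsername"):
        raise PermissionError("Cannot parse remote repository without auth token")

os.environ["defaultUsername"] = "anonymous"
check_parse_auth(RepoDetails(is_local=True), "anonymous")  # local repo: allowed
try:
    check_parse_auth(RepoDetails(is_local=False), "anonymous")
except PermissionError as e:
    print(e)  # → Cannot parse remote repository without auth token
```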
🧹 Nitpick comments (11)
app/modules/parsing/graph_construction/repository_resolver.py (1)
192-271: resolve() behavior looks consistent with the RepoDetails contract; consider structured logs.

Minor: most logs here interpolate user input via f-strings; if you're aiming for structured/parameterized logging, switch to %s args to avoid eager formatting and keep log tooling happier.

app/modules/parsing/graph_construction/parsing_schema.py (2)
6-29: ParsingRequest.validate_repository() OK; consider validating obviously invalid remote identifiers earlier (optional).

Right now you only check non-emptiness; all other validation happens later (RepositoryResolver). That's fine, but if you want earlier 422s for clearly invalid input (e.g. URL forms), you could add lightweight checks here.
27-28: Ruff TRY003: don't contort exceptions just to satisfy it.

The current ValueError("Repository identifier cannot be empty") is fine for request validation. If you want to appease TRY003, consider disabling it for this line rather than reducing message quality.
507-543: _copy_text_files() / _cleanup_temp_dir(): address Ruff issues + narrow exception handling.

for root, dirs, files → rename dirs to _dirs (B007). _cleanup_temp_dir() should avoid except Exception (BLE001) and use logger.exception(...) when cleanup fails.

```diff
-    for root, dirs, files in os.walk(temp_clone_dir):
+    for root, _dirs, files in os.walk(temp_clone_dir):
     ...
-        except Exception as e:
-            logger.warning(f"ParsingHelper: Failed to clean up temp directory: {e}")
+        except OSError:
+            logger.exception("ParsingHelper: Failed to clean up temp directory: %s", temp_dir)
```
586-593: Prefer logger.exception inside exception handlers and de-duplicate cleanup.

You already have e in hand; switching to logger.exception(...) preserves stack traces and matches Ruff TRY400. Also consider a single finally: self._cleanup_temp_dir(...) to avoid repeating cleanup calls.
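The combined shape (one cleanup site in `finally`, logger.exception for tracebacks) can be sketched like this; `make_clone` is an illustrative callback standing in for the real clone-and-copy work:

```python
import logging
import os
import shutil
import tempfile

logger = logging.getLogger("parsing_helper_demo")

def clone_and_filter(make_clone):
    """Sketch: single cleanup in `finally`, logger.exception for tracebacks."""
    temp_dir = tempfile.mkdtemp(prefix="clone-")
    try:
        return make_clone(temp_dir)
    except OSError:
        # logger.exception logs at ERROR level *with* the active traceback,
        # so nothing is lost versus logger.error(f"...{e}").
        logger.exception("ParsingHelper: clone failed")
        raise
    finally:
        # One cleanup call instead of one per except branch.
        shutil.rmtree(temp_dir, ignore_errors=True)

marker = clone_and_filter(lambda d: d)
print(os.path.exists(marker))  # → False (temp dir removed by the finally block)
```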
819-834: Sync _copy_repo_to_repo_manager() is fine, but be mindful it's heavy IO in an async method.

If the repo-manager copy/worktree is expensive, consider running it in a thread as well (same rationale as the clone fallback).

Also applies to: 837-845
957-1035: Worktree helpers: add minimal logging on the silent except, and confirm git-worktree args via GitPython.

_create_worktree_for_branch(): except Exception: pass around the commit loses debuggability (Sonar S110). _remove_existing_worktree() / _create_worktree_for_commit() rely on GitPython argument translation; worth confirming with a quick smoke test in CI.

app/celery/tasks/parsing_tasks.py (1)
25-46: Catch pydantic.ValidationError for clearer task failures.

If the repo_details dict is malformed, RepoDetails(**repo_details) will raise; logging str(e) loses field-level context. Consider catching ValidationError and logging e.errors() (or e.json()), then re-raising.

Also applies to: 57-59
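A stdlib-only sketch of the idea (the ValidationError class below mimics the shape of pydantic's, with a .errors() method; build_repo_details is a hypothetical validator, not the real model):

```python
# Stand-in sketch: pydantic is not imported here; this class mirrors the API shape.
class ValidationError(Exception):
    def __init__(self, errors):
        super().__init__("validation failed")
        self._errors = errors
    def errors(self):
        return self._errors

def build_repo_details(payload: dict) -> dict:
    """Hypothetical validator for the Celery task input."""
    if "repo_name" not in payload:
        raise ValidationError([{"loc": ("repo_name",), "msg": "field required"}])
    return payload

def run_task(payload: dict) -> dict:
    try:
        return build_repo_details(payload)
    except ValidationError as e:
        # e.errors() keeps per-field detail that str(e) would flatten away.
        print(e.errors())
        raise

print(run_task({"repo_name": "owner/repo"}))  # → {'repo_name': 'owner/repo'}
```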
app/modules/parsing/graph_construction/parsing_controller.py (3)
52-68: Prefer logger.exception + exception chaining in the new resolution block.

Keeps stack traces and avoids losing the root cause when returning HTTP 400.

```diff
     try:
         resolved_repo = RepositoryResolver.resolve(
             repository=repo_details.repository,
             branch_name=repo_details.branch_name,
             commit_id=repo_details.commit_id,
         )
     except HTTPException:
         raise
     except Exception as e:
-        logger.error(f"Failed to resolve repository identifier: {e}")
-        raise HTTPException(
+        logger.exception("Failed to resolve repository identifier")
+        raise HTTPException(
             status_code=400,
             detail=f"Invalid repository identifier: {str(e)}"
-        )
+        ) from e
```

Also applies to: 70-73
75-82: Dev-mode gating for local repos: consider aligning on config_provider.get_is_development_mode() (single source of truth).

Right now the new resolved_repo.is_local branch relies on os.getenv("isDevelopmentMode"), while other parts use config_provider.get_is_development_mode(), which can diverge.
228-236: Type handle_new_project explicitly as RepoDetails to lock in the new contract.

Helps mypy/IDE tooling and prevents accidental reintroduction of raw request objects.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- app/celery/tasks/parsing_tasks.py (3 hunks)
- app/modules/parsing/graph_construction/parsing_controller.py (13 hunks)
- app/modules/parsing/graph_construction/parsing_helper.py (11 hunks)
- app/modules/parsing/graph_construction/parsing_schema.py (2 hunks)
- app/modules/parsing/graph_construction/parsing_service.py (2 hunks)
- app/modules/parsing/graph_construction/parsing_validator.py (1 hunks)
- app/modules/parsing/graph_construction/repository_resolver.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
app/celery/tasks/parsing_tasks.py (4)
- app/modules/parsing/graph_construction/parsing_schema.py (1)
  - RepoDetails (38-62)
- app/modules/parsing/graph_construction/parsing_controller.py (1)
  - parse_directory (39-226)
- app/modules/parsing/graph_construction/parsing_service.py (1)
  - parse_directory (53-168)
- app/api/router.py (1)
  - parse_directory (111-116)

app/modules/parsing/graph_construction/parsing_controller.py (2)
- app/modules/parsing/graph_construction/repository_resolver.py (2)
  - RepositoryResolver (24-271)
  - resolve (193-271)
- app/modules/projects/projects_service.py (3)
  - get_project_from_db (169-222)
  - get_global_project_from_db (224-265)
  - duplicate_project (123-144)

app/modules/parsing/graph_construction/parsing_service.py (1)
- app/modules/parsing/graph_construction/parsing_schema.py (1)
  - RepoDetails (38-62)

app/modules/parsing/graph_construction/repository_resolver.py (2)
- app/modules/parsing/graph_construction/parsing_schema.py (1)
  - RepoDetails (38-62)
- app/modules/parsing/utils/repo_name_normalizer.py (1)
  - normalize_repo_name (11-55)

app/modules/parsing/graph_construction/parsing_helper.py (2)
- app/modules/parsing/utils/encoding_detector.py (1)
  - is_text_file (103-115)
- app/modules/projects/projects_service.py (2)
  - get_project_from_db (169-222)
  - register_project (59-121)
🪛 GitHub Check: SonarCloud Code Analysis
app/modules/parsing/graph_construction/parsing_helper.py
[failure] 507-507: Refactor this function to reduce its Cognitive Complexity from 16 to the 15 allowed.
[warning] 638-638: Remove this redundant Exception class; it derives from another which is already caught.
🪛 Ruff (0.14.8)
app/modules/parsing/graph_construction/parsing_controller.py
63-63: Do not catch blind exception: Exception
(BLE001)
64-64: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
65-68: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
67-67: Use explicit conversion flag
Replace with conversion flag
(RUF010)
app/modules/parsing/graph_construction/parsing_helper.py
512-512: Loop control variable dirs not used within loop body
Rename unused dirs to _dirs
(B007)
530-530: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
541-541: Do not catch blind exception: Exception
(BLE001)
584-584: Consider moving this statement to an else block
(TRY300)
587-587: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
589-589: Avoid specifying long messages outside the exception class
(TRY003)
591-591: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
593-593: Avoid specifying long messages outside the exception class
(TRY003)
1001-1002: try-except-pass detected, consider logging the exception
(S110)
1001-1001: Do not catch blind exception: Exception
(BLE001)
app/modules/parsing/graph_construction/parsing_schema.py
28-28: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (6)
app/modules/parsing/graph_construction/repository_resolver.py (2)
136-191: Remote parsing requires "exactly 2 segments"; verify that matches all supported providers.

If you support GitLab subgroups (group/subgroup/repo) or other multi-segment namespaces, len(parts) != 2 will reject valid repos. If you only support owner/repo, consider renaming the docstrings/messages to state that explicitly (and/or validating provider-specific formats).
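The segment rule under discussion can be sketched in isolation (is_owner_repo is an illustrative helper, not the resolver's actual function name):

```python
def is_owner_repo(identifier: str) -> bool:
    """Sketch of the 'exactly 2 segments' rule the resolver applies."""
    parts = [p for p in identifier.strip("/").split("/") if p]
    return len(parts) == 2

print(is_owner_repo("owner/repo"))           # → True
print(is_owner_repo("group/subgroup/repo"))  # → False (GitLab subgroup rejected)
```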
91-135: [rewritten comment]
[classification tag]

app/modules/parsing/graph_construction/parsing_schema.py (1)
38-62: RepoDetails shape aligns with the new “resolved” flow.

Good separation between the “original identifier” (repository), the “normalized lookup key” (repo_name), and the “local path” (repo_path).

app/modules/parsing/graph_construction/parsing_helper.py (1)
129-134: Good: treat OS/IO errors as non-text.
Covers permission/file-not-found cases cleanly.

app/modules/parsing/graph_construction/parsing_service.py (2)
28-28: Type swap to RepoDetails looks consistent with the consolidation goal.

A small but good step toward making downstream logic consume only resolved repository state.
53-70: Docstring + RepoDetails parameterization is clear and matches the new contract.

No functional changes here; the typing/doc helps ensure callers pass a resolved identifier rather than raw request input.
Actionable comments posted: 0
♻️ Duplicate comments (2)
app/modules/parsing/graph_construction/parsing_helper.py (1)
488-506: URL-encode credentials in the clone URL to prevent authentication failures and exception leakage.

This was flagged in a previous review. Unencoded credentials in {username}:{password}@host will break with special characters (@, :, /, #, ?, %, etc.) and may leak in GitPython exception messages.

Apply this diff to fix credential encoding:

```diff
-from urllib.parse import urlparse, urlunparse
+from urllib.parse import quote, urlparse, urlunparse
```

Then update _build_clone_url:

```python
def _build_clone_url(self, repo, username, password):
    """Build authenticated clone URL for GitBucket."""
    base_url = os.getenv("CODE_PROVIDER_BASE_URL", "http://localhost:8080")
    if base_url.endswith("/api/v3"):
        base_url = base_url[:-7]
    parsed = urlparse(base_url)
    base_path = parsed.path.rstrip("/")
    repo_path = f"{base_path}/{repo.full_name}.git" if base_path else f"/{repo.full_name}.git"
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    clone_url = urlunparse((
        parsed.scheme,
        f"{userinfo}@{parsed.netloc}",
        repo_path,
        "", "", ""
    ))
    safe_url = urlunparse((parsed.scheme, parsed.netloc, repo_path, "", "", ""))
    return clone_url, safe_url
```

app/modules/parsing/graph_construction/parsing_controller.py (1)
114-119: Bug: get_global_project_from_db called with a positional argument mismatch.

This was flagged in a previous review and is still present. The function signature is get_global_project_from_db(repo_name, branch_name, repo_path=None, commit_id=None), but line 118 passes resolved_repo.commit_id as the third positional argument, which will be interpreted as repo_path.

```diff
 if not project and resolved_repo.repository in demo_repos:
     existing_project = await project_manager.get_global_project_from_db(
         resolved_repo.repo_name,
         resolved_repo.branch_name,
-        resolved_repo.commit_id,
+        repo_path=resolved_repo.repo_path,
+        commit_id=resolved_repo.commit_id,
     )
```
🧹 Nitpick comments (6)
app/modules/parsing/graph_construction/parsing_helper.py (4)
513-534: Rename the unused loop variable and use logging.exception for errors.

Per static analysis hints (B007, TRY400):

```diff
 def _copy_text_files(self, temp_clone_dir, final_dir):
     """Copy only text files from temp clone to final directory."""
     os.makedirs(final_dir, exist_ok=True)
     text_files_count = 0

-    for root, dirs, files in os.walk(temp_clone_dir):
+    for root, _dirs, files in os.walk(temp_clone_dir):
         # Skip .git and hidden directories
         if ".git" in root.split(os.sep) or any(part.startswith(".") for part in root.split(os.sep)):
             continue
         for file in files:
             if file.startswith("."):
                 continue
             file_path = os.path.join(root, file)
             if self.is_text_file(file_path):
                 try:
                     relative_path = os.path.relpath(file_path, temp_clone_dir)
                     dest_path = os.path.join(final_dir, relative_path)
                     os.makedirs(os.path.dirname(dest_path), exist_ok=True)
                     shutil.copy2(file_path, dest_path)
                     text_files_count += 1
                 except OSError as e:
-                    logger.error(f"ParsingHelper: Error copying file {file_path}: {e}")
+                    logger.exception(f"ParsingHelper: Error copying file {file_path}")
```
536-543: Avoid catching a blind Exception in cleanup.

Per static analysis (BLE001), catch more specific exceptions like OSError or shutil.Error:

```diff
 def _cleanup_temp_dir(self, temp_dir):
     """Clean up temporary directory."""
     if os.path.exists(temp_dir):
         try:
             shutil.rmtree(temp_dir)
             logger.info(f"ParsingHelper: Cleaned up temporary directory: {temp_dir}")
-        except Exception as e:
+        except OSError as e:
             logger.warning(f"ParsingHelper: Failed to clean up temp directory: {e}")
```
574-594: Move the return statement to an else block and avoid long exception messages.

Per static analysis hints (TRY300, TRY003), the return on line 585 should be in an else block. Also consider defining custom exception classes for cleaner messages.

```diff
 try:
     # Clone repository
     Repo.clone_from(clone_url_with_auth, temp_clone_dir, branch=branch, depth=1)
     logger.info(f"ParsingHelper: Successfully cloned repository to temporary directory: {temp_clone_dir}")

     # Copy text files
     logger.info(f"ParsingHelper: Filtering text files from clone to {final_dir}")
     self._copy_text_files(temp_clone_dir, final_dir)

     # Cleanup
     self._cleanup_temp_dir(temp_clone_dir)
-    return final_dir
 except GitCommandError as e:
     logger.exception("ParsingHelper: Git clone failed")
     self._cleanup_temp_dir(temp_clone_dir)
-    raise ParsingFailedError(f"Failed to clone repository: {e}") from e
+    raise ParsingFailedError("Failed to clone repository") from e
 except Exception as e:
     logger.exception("ParsingHelper: Unexpected error during git clone")
     self._cleanup_temp_dir(temp_clone_dir)
-    raise ParsingFailedError(f"Unexpected error during repository clone: {e}") from e
+    raise ParsingFailedError("Unexpected error during repository clone") from e
+else:
+    return final_dir
```
987-1005: Log the exception instead of silently passing.

Per static analysis (S110), the try-except-pass on lines 1002-1003 silently swallows errors. Consider logging a warning:

```diff
 worktree_repo = Repo.init(worktree_path)
 worktree_repo.git.add(A=True)
 try:
     worktree_repo.index.commit(f"Initial commit for {ref}")
-except Exception:
-    pass
+except Exception as e:
+    logger.warning(f"Could not create initial commit for {ref}: {e}")
```

app/modules/parsing/graph_construction/parsing_controller.py (2)
52-68: Improve exception handling with proper chaining and logging.

Per static analysis hints (BLE001, TRY400, B904, RUF010):

```diff
 try:
     resolved_repo = RepositoryResolver.resolve(
         repository=repo_details.repository,
         branch_name=repo_details.branch_name,
         commit_id=repo_details.commit_id,
     )
 except HTTPException:
     raise
 except Exception as e:
-    logger.error(f"Failed to resolve repository identifier: {e}")
+    logger.exception("Failed to resolve repository identifier")
     raise HTTPException(
         status_code=400,
-        detail=f"Invalid repository identifier: {str(e)}"
-    )
+        detail=f"Invalid repository identifier: {e!s}"
+    ) from e
```
228-250: Add a type hint for the repo_details parameter.

The comment says it now accepts RepoDetails instead of ParsingRequest, but the type hint is missing. Adding it would improve code clarity and enable better IDE support.

```diff
+from app.modules.parsing.graph_construction.parsing_schema import RepoDetails
+
 @staticmethod
 async def handle_new_project(
-    repo_details,  # Now accepts RepoDetails instead of ParsingRequest
+    repo_details: RepoDetails,
     user_id: str,
     user_email: str,
     new_project_id: str,
     project_manager: ProjectService,
     db: AsyncSession,
 ):
```

Note: RepoDetails is already imported in parsing_service.py. You'll need to add the import at the top of parsing_controller.py as well, or import it from the schema module:

```python
from app.modules.parsing.graph_construction.parsing_schema import ParsingRequest, RepoDetails
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- app/celery/tasks/parsing_tasks.py (3 hunks)
- app/modules/parsing/graph_construction/parsing_controller.py (13 hunks)
- app/modules/parsing/graph_construction/parsing_helper.py (11 hunks)
- app/modules/parsing/graph_construction/parsing_service.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- app/celery/tasks/parsing_tasks.py
🧰 Additional context used
🧬 Code graph analysis (2)
app/modules/parsing/graph_construction/parsing_controller.py (2)
- app/modules/parsing/graph_construction/repository_resolver.py (2)
  - RepositoryResolver (24-271)
  - resolve (193-271)
- app/modules/projects/projects_service.py (3)
  - get_project_from_db (169-222)
  - get_global_project_from_db (224-265)
  - duplicate_project (123-144)
app/modules/parsing/graph_construction/parsing_helper.py (2)
- app/modules/parsing/utils/encoding_detector.py (1)
  - is_text_file (103-115)
- app/modules/projects/projects_service.py (2)
  - get_project_from_db (169-222)
  - register_project (59-121)
🪛 Ruff (0.14.8)
app/modules/parsing/graph_construction/parsing_controller.py
63-63: Do not catch blind exception: Exception
(BLE001)
64-64: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
65-68: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
67-67: Use explicit conversion flag
Replace with conversion flag
(RUF010)
app/modules/parsing/graph_construction/parsing_helper.py
513-513: Loop control variable dirs not used within loop body
Rename unused dirs to _dirs
(B007)
531-531: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
542-542: Do not catch blind exception: Exception
(BLE001)
585-585: Consider moving this statement to an else block
(TRY300)
590-590: Avoid specifying long messages outside the exception class
(TRY003)
594-594: Avoid specifying long messages outside the exception class
(TRY003)
1002-1003: try-except-pass detected, consider logging the exception
(S110)
1002-1002: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (5)
app/modules/parsing/graph_construction/parsing_helper.py (2)
63-108: LGTM - Simplified repository handling using resolved RepoDetails.

The refactored clone_or_copy_repository correctly uses repo_details.is_local to distinguish local vs remote repositories. The logic is cleaner and relies on pre-validated data from RepositoryResolver.
742-748: The isinstance(repo_details, Repo) check is not dead code and serves a legitimate purpose.

While the code structure appears suspicious at first glance, analysis reveals the control flow is intentional: after the early return for local repositories (line 745), the code needs to distinguish between two possible types when handling remote repositories. In the clone_or_copy_repository method, when repo_details.is_local is False, the repo_details parameter can be either a Repo object (for local clones converted to remote context) or a RepoDetails Pydantic model (for actual remote repositories from the code provider). The isinstance check on line 746 correctly branches to Repo-specific code (accessing .working_tree_dir, .git, .head, etc.) in one path and Pydantic model handling in the else clause. This design appropriately handles the refactored flow where both repository types may reach this point.

Likely an incorrect or invalid review comment.
app/modules/parsing/graph_construction/parsing_service.py (3)
27-27: LGTM - Import updated to use RepoDetails.

The import change aligns with the broader refactoring to use the centralized RepoDetails model.
50-67: LGTM - Method signature and docstring properly updated.

The parse_directory method now accepts RepoDetails instead of ParsingRequest, and the docstring clearly documents the parameters.
98-123: Intentional but undocumented type inconsistency for the repo_details parameter.

In dev mode (line 106), repo_details receives a RepoDetails object, while in non-dev mode (line 119), it receives repo (a git.Repo object). This inconsistency is intentional; both objects are used for different purposes: repo_details provides path/name attributes and local repository info, while repo is used for Git operations like get_branch(). The function relies on duck typing and handles both types correctly (lines 747-767 in parsing_helper.py).

Consider adding type hints (Union[RepoDetails, Repo]) to the function signature to document this behavior and prevent future confusion.
Even with this fix, SonarCloud Security Hotspots are often designed to force a human review. If the check still fails, please leave a comment explaining why so the fix can be reworked.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
app/modules/parsing/graph_construction/parsing_controller.py (3)
75-92: Use the same dev-mode switch as the rest of the parsing stack.

This file gates local parsing via os.getenv("isDevelopmentMode"), while other code paths use config_provider.get_is_development_mode(). This can easily drift/misconfigure.

```diff
-    if resolved_repo.is_local:
-        if os.getenv("isDevelopmentMode") != "enabled":
+    if resolved_repo.is_local:
+        if not config_provider.get_is_development_mode():
             raise HTTPException(
                 status_code=400,
                 detail="Parsing local repositories is only supported in development mode",
             )
```
93-120: Bug risk: the demo repo check uses resolved_repo.repository (raw input) instead of the normalized repo_name.

If the user supplies a URL, different casing, or extra prefixes, resolved_repo.repository in demo_repos won't match and demo behavior will silently break.

```diff
- if not project and resolved_repo.repository in demo_repos:
+ if not project and resolved_repo.repo_name in demo_repos:
```
229-265: Avoid positional params in register_project(...) (prevents future "wrong slot" bugs).

It is safer and more readable to pass commit_id / repo_path by name.

```diff
-await project_manager.register_project(
-    repo_details.repo_name,
-    repo_details.branch_name,
-    user_id,
-    new_project_id,
-    repo_details.commit_id,
-    repo_details.repo_path,
-)
+await project_manager.register_project(
+    repo_name=repo_details.repo_name,
+    branch_name=repo_details.branch_name,
+    user_id=user_id,
+    project_id=new_project_id,
+    commit_id=repo_details.commit_id,
+    repo_path=repo_details.repo_path,
+)
```

app/modules/parsing/graph_construction/parsing_helper.py (1)
369-379: Wrap the synchronous git clone in asyncio.to_thread() when called from an async context.

download_and_extract_tarball() is async and called with await (lines 809, 819), but internally calls the synchronous _clone_repository_with_auth() without threading. The fallback executes Repo.clone_from() (line 615) and file operations via _copy_text_files() (line 618), both blocking the event loop and stalling server workers under load.

At line 376, replace the direct call with:

```python
return await asyncio.to_thread(self._clone_repository_with_auth, repo, branch, target_dir, user_id)
```

Alternatively, move the git clone and file operations to a worker/thread pool or a dedicated synchronous handler.
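The to_thread pattern recommended above can be demonstrated in isolation; blocking_clone is an illustrative stand-in for the synchronous, IO-heavy clone:

```python
import asyncio
import time

def blocking_clone(repo: str, target_dir: str) -> str:
    # Stand-in for the synchronous, IO-heavy clone (names are illustrative).
    time.sleep(0.05)
    return f"{target_dir}/{repo}"

async def download_with_fallback(repo: str, target_dir: str) -> str:
    # to_thread (Python 3.9+) runs the blocking call in a worker thread, so
    # the event loop keeps serving other coroutines while the clone runs.
    return await asyncio.to_thread(blocking_clone, repo, target_dir)

result = asyncio.run(download_with_fallback("owner/repo", "/tmp/clones"))
print(result)  # → /tmp/clones/owner/repo
```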
🧹 Nitpick comments (1)
app/modules/parsing/graph_construction/parsing_helper.py (1)
735-806: Stale branch: if isinstance(repo_details, Repo) is likely dead/incorrect after switching to RepoDetails.

This reads like leftover pre-refactor logic and makes the control flow harder to reason about. If you meant "local git repo object", check repo instead (or delete this branch if it's now unreachable).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- app/modules/parsing/graph_construction/parsing_controller.py (13 hunks)
- app/modules/parsing/graph_construction/parsing_helper.py (18 hunks)
- app/modules/parsing/graph_construction/parsing_service.py (2 hunks)
- app/modules/parsing/graph_construction/repository_resolver.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- app/modules/parsing/graph_construction/repository_resolver.py
🧰 Additional context used
🧬 Code graph analysis (1)
app/modules/parsing/graph_construction/parsing_helper.py (4)
- app/modules/parsing/graph_construction/parsing_schema.py (1)
  - RepoDetails (38-62)
- app/modules/projects/projects_schema.py (1)
  - RepoDetails (15-16)
- app/modules/parsing/utils/encoding_detector.py (1)
  - is_text_file (103-115)
- app/modules/projects/projects_service.py (1)
  - get_project_from_db (169-222)
🪛 Ruff (0.14.8)
app/modules/parsing/graph_construction/parsing_controller.py
63-63: Do not catch blind exception: Exception
(BLE001)
64-64: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
65-68: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
67-67: Use explicit conversion flag
Replace with conversion flag
(RUF010)
app/modules/parsing/graph_construction/parsing_service.py
66-66: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
144-144: Consider moving this statement to an else block
(TRY300)
148-148: Do not catch blind exception: Exception
(BLE001)
app/modules/parsing/graph_construction/parsing_helper.py
67-67: Avoid specifying long messages outside the exception class
(TRY003)
70-70: Avoid specifying long messages outside the exception class
(TRY003)
74-74: Unused method argument: user_id
(ARG002)
441-441: Unused function argument: members
(ARG001)
445-445: Avoid specifying long messages outside the exception class
(TRY003)
547-547: Consider moving this statement to an else block
(TRY300)
549-549: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
557-557: Loop control variable dirs not used within loop body
Rename unused dirs to _dirs
(B007)
579-579: Do not catch blind exception: Exception
(BLE001)
622-622: Consider moving this statement to an else block
(TRY300)
627-627: Avoid specifying long messages outside the exception class
(TRY003)
631-631: Avoid specifying long messages outside the exception class
(TRY003)
1039-1040: try-except-pass detected, consider logging the exception
(S110)
1039-1039: Do not catch blind exception: Exception
(BLE001)
```python
try:
    resolved_repo = RepositoryResolver.resolve(
        repository=repo_details.repository,
        branch_name=repo_details.branch_name,
        commit_id=repo_details.commit_id,
    )
except HTTPException:
    raise
except Exception as e:
    logger.error(f"Failed to resolve repository identifier: {e}")
    raise HTTPException(
        status_code=400,
        detail=f"Invalid repository identifier: {str(e)}"
    )
```
Prefer logger.exception + raise ... from e in resolver failure path.
Keeps stack trace and preserves exception context (Ruff TRY400/B904).
```diff
-    except Exception as e:
-        logger.error(f"Failed to resolve repository identifier: {e}")
+    except Exception as e:
+        logger.exception("Failed to resolve repository identifier")
         raise HTTPException(
             status_code=400,
-            detail=f"Invalid repository identifier: {str(e)}"
-        )
+            detail=f"Invalid repository identifier: {e}",
+        ) from e
```
+ ) from e🧰 Tools
🪛 Ruff (0.14.8)
63-63: Do not catch blind exception: Exception
(BLE001)
64-64: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
65-68: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
67-67: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🤖 Prompt for AI Agents
In app/modules/parsing/graph_construction/parsing_controller.py around lines 55
to 68, the exception handler for RepositoryResolver.resolve currently logs with
logger.error and re-raises a new HTTPException without preserving the original
traceback; change this to call logger.exception(...) to capture the stack trace
and then raise the HTTPException using "raise HTTPException(... ) from e" so the
original exception context is preserved and the log contains the full traceback.
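The behavioral difference the linters point at can be shown in isolation: `raise ... from e` records the original error on `__cause__`, and `logger.exception` logs the full traceback. A minimal sketch with plain exceptions (no FastAPI; `ResolveError` stands in for `HTTPException`):

```python
import logging

logger = logging.getLogger(__name__)

class ResolveError(Exception):
    """Stand-in for the controller's HTTPException."""

def resolve(identifier: str):
    # Stand-in for RepositoryResolver.resolve failing on bad input
    raise ValueError(f"unknown repository: {identifier}")

def handle(identifier: str):
    try:
        return resolve(identifier)
    except Exception as e:
        # logger.exception records the traceback; logger.error would not
        logger.exception("Failed to resolve repository identifier")
        # "from e" chains the original error onto __cause__
        raise ResolveError(f"Invalid repository identifier: {e}") from e

try:
    handle("not-a-repo")
except ResolveError as err:
    # The resolver's original ValueError survives on the chained exception
    assert isinstance(err.__cause__, ValueError)
```

Without `from e`, the `ValueError` would only appear as implicit context (`__context__`), and tracebacks logged at the HTTP layer would say "During handling of the above exception, another exception occurred" instead of a deliberate chain.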
```python
@staticmethod
def _validate_url(url: str):
    """Validate URL scheme to prevent basic SSRF."""
    if not url or not isinstance(url, str):
        raise ValueError("Invalid URL")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Invalid URL scheme: {parsed.scheme}. Only http/https allowed.")
    return True
```
🛠️ Refactor suggestion | 🟠 Major
SSRF hardening: scheme-only validation is not enough (redirects/localhost/private IPs).
Right now _validate_url only checks http/https. If tarball_url can be influenced (or redirected), it can still hit internal hosts. At minimum: block localhost + private IP ranges and validate the final response.url (or disable redirects and follow manually with validation).
Also applies to: 262-357
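A version of the hardening this comment asks for can be sketched with the standard library alone. The helper name `validate_outbound_url` and its error messages are illustrative, not part of this PR:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def validate_outbound_url(url: str) -> None:
    """Reject non-http(s) URLs and URLs that resolve to internal addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Invalid URL scheme: {parsed.scheme}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    # Resolve every A/AAAA record; any internal address fails the check.
    for info in socket.getaddrinfo(parsed.hostname, None):
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_loopback or addr.is_private or addr.is_link_local or addr.is_reserved:
            raise ValueError(f"URL resolves to internal address: {addr}")

# Loopback and private addresses are rejected before any request is made
for url in ("http://127.0.0.1/tar", "https://10.0.0.5/tar", "ftp://host/x"):
    try:
        validate_outbound_url(url)
    except ValueError as exc:
        print(f"rejected: {url} ({exc})")
```

This check alone does not cover redirects: the request itself should also be made with redirects disabled (or followed manually), re-validating each `Location` target, since the initial check cannot see where a 302 points.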
```python
# Safe extraction (Zip Slip protection)
def is_within_directory(directory, target):
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    prefix = os.path.commonprefix([abs_directory, abs_target])
    return prefix == abs_directory

def safe_extract(tar, path=".", members=None, *, numeric_owner=False):
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ParsingFailedError(f"Attempted Path Traversal in Tar File: {member.name}")
        tar.extract(member, path, numeric_owner=numeric_owner)

safe_extract(tar, path=temp_dir)
logger.info(
```
Critical: tar extraction guard is bypassable (commonprefix ≠ path containment).
os.path.commonprefix is string-based and can be tricked (e.g., /tmp/proj vs /tmp/proj_evil). Use os.path.commonpath (or pathlib) and consider rejecting symlinks/hardlinks.
```diff
 def is_within_directory(directory, target):
     abs_directory = os.path.abspath(directory)
     abs_target = os.path.abspath(target)
-    prefix = os.path.commonprefix([abs_directory, abs_target])
-    return prefix == abs_directory
+    return os.path.commonpath([abs_directory, abs_target]) == abs_directory

 def safe_extract(tar, path=".", members=None, *, numeric_owner=False):
     for member in tar.getmembers():
+        if member.issym() or member.islnk():
+            raise ParsingFailedError(
+                f"Symlinks/links not allowed in tar: {member.name}"
+            )
         member_path = os.path.join(path, member.name)
         if not is_within_directory(path, member_path):
             raise ParsingFailedError(f"Attempted Path Traversal in Tar File: {member.name}")
         tar.extract(member, path, numeric_owner=numeric_owner)
```
🤖 Prompt for AI Agents
In app/modules/parsing/graph_construction/parsing_helper.py around lines 433 to
449, the current tar extraction guard uses os.path.commonprefix which can be
bypassed; replace it with a robust containment check using os.path.commonpath or
pathlib.Path.resolve() to compare resolved absolute paths, and additionally
disallow extraction of symlinks and hardlinks (skip members with type
islnk/issym) or explicitly sanitize their targets before extracting; iterate
only over the provided members (if any), validate each member’s resolved
destination is within the temp_dir, create parent directories as needed, and use
tar.extractfile/extract only after these checks to prevent path traversal and
link-based escapes.
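The bypass this comment describes is easy to reproduce. `posixpath` is used below so the demonstration is platform-independent:

```python
import posixpath

base = "/tmp/proj"
evil = "/tmp/proj_evil/payload"

# String comparison: "/tmp/proj" is a character prefix of the evil path,
# so the old commonprefix-based guard lets the escape through.
assert posixpath.commonprefix([base, evil]) == "/tmp/proj"

# Path-aware comparison: the deepest shared directory is /tmp, not /tmp/proj,
# so a commonpath-based guard correctly rejects the escape.
assert posixpath.commonpath([base, evil]) == "/tmp"
```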
```python
def _cleanup_graph_if_needed(self, project_id, user_id, cleanup_graph):
    """Clean up existing graph if requested."""
    if not cleanup_graph:
        return

    neo4j_config = config_provider.get_neo4j_config()
    try:
        code_graph_service = CodeGraphService(
            neo4j_config["uri"],
            neo4j_config["username"],
            neo4j_config["password"],
            self.db,
        )
        code_graph_service.cleanup_graph(project_id)
    except Exception:
        logger.exception("Error in cleanup_graph", project_id=project_id, user_id=user_id)
        raise HTTPException(status_code=500, detail="Internal server error")
```
Consider closing CodeGraphService + preserve exception context.
If CodeGraphService holds a driver/session, this helper may leak it. Also, raising HTTPException(...) inside except Exception should generally be raise ... from e (Ruff B904) for debuggability.
```diff
-    try:
+    code_graph_service = None
+    try:
         code_graph_service = CodeGraphService(
             neo4j_config["uri"],
             neo4j_config["username"],
             neo4j_config["password"],
             self.db,
         )
         code_graph_service.cleanup_graph(project_id)
-    except Exception:
+    except Exception as e:
         logger.exception("Error in cleanup_graph", project_id=project_id, user_id=user_id)
-        raise HTTPException(status_code=500, detail="Internal server error")
+        raise HTTPException(status_code=500, detail="Internal server error") from e
+    finally:
+        if code_graph_service is not None and hasattr(code_graph_service, "close"):
+            code_graph_service.close()
```
🪛 Ruff (0.14.8)
66-66: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
In app/modules/parsing/graph_construction/parsing_service.py around lines 50 to
67, the helper may leak resources because CodeGraphService is not closed and the
exception handler loses the original exception context; update the code to
instantiate CodeGraphService within a context manager or ensure its
close/dispose is always called (use a try/finally or implement/use
CodeGraphService.__enter__/__exit__), call cleanup_graph inside that safe block,
and when re-raising wrap the HTTPException with the original exception (raise
HTTPException(...) from e) so the exception context is preserved.
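An alternative to a manual `finally` block is `contextlib.closing`, which calls `.close()` on exit even when the body raises. A sketch against a stand-in class (whether the real `CodeGraphService` exposes a `close()` method is an assumption here):

```python
from contextlib import closing

class FakeGraphService:
    """Stand-in for CodeGraphService: holds a 'driver' that must be closed."""
    def __init__(self):
        self.closed = False

    def cleanup_graph(self, project_id):
        # Simulate a Neo4j failure mid-cleanup
        raise RuntimeError("neo4j unavailable")

    def close(self):
        self.closed = True

svc = FakeGraphService()
try:
    with closing(svc) as service:
        service.cleanup_graph("proj-123")
except RuntimeError:
    pass

assert svc.closed  # close() ran despite the exception
```

If `CodeGraphService` grows its own `__enter__`/`__exit__`, the wrapper becomes unnecessary and a plain `with CodeGraphService(...) as service:` would express the same guarantee.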
```python
async def _setup_project(self, repo, repo_details, auth, user_id, project_id):
    """Setup project directory based on environment mode."""
    if config_provider.get_is_development_mode():
        return await self.parse_helper.setup_project_directory(
            repo, repo_details.branch_name, auth, repo_details,
            user_id, project_id, commit_id=repo_details.commit_id
        )
    else:
        return await self.parse_helper.setup_project_directory(
            repo, repo_details.branch_name, auth, repo,
            user_id, project_id, commit_id=repo_details.commit_id
        )
```
Fix: _setup_project() passes the wrong type in prod (will break parsing).
In the else branch you pass repo where ParseHelper.setup_project_directory(..., repo_details, ...) expects a RepoDetails (it uses .repo_name/.repo_path/.is_local). This will likely crash for non-dev mode.
```diff
 async def _setup_project(self, repo, repo_details, auth, user_id, project_id):
     """Setup project directory based on environment mode."""
-    if config_provider.get_is_development_mode():
-        return await self.parse_helper.setup_project_directory(
-            repo, repo_details.branch_name, auth, repo_details,
-            user_id, project_id, commit_id=repo_details.commit_id
-        )
-    else:
-        return await self.parse_helper.setup_project_directory(
-            repo, repo_details.branch_name, auth, repo,
-            user_id, project_id, commit_id=repo_details.commit_id
-        )
+    return await self.parse_helper.setup_project_directory(
+        repo,
+        repo_details.branch_name,
+        auth,
+        repo_details,
+        user_id,
+        project_id,
+        commit_id=repo_details.commit_id,
+    )
```
🤖 Prompt for AI Agents
+ )🤖 Prompt for AI Agents
In app/modules/parsing/graph_construction/parsing_service.py around lines 68 to
80, the else branch of _setup_project incorrectly passes the Repo object where
ParseHelper.setup_project_directory expects a RepoDetails (it needs attributes
like repo_name/repo_path/is_local); change the fourth argument in the non-dev
branch from repo to repo_details so the call matches the development branch
signature and preserve the commit_id and other args as-is.
```python
def _cleanup_extracted_dir(self, extracted_dir):
    """Clean up extracted directory if it's in the project path."""
    if (extracted_dir and isinstance(extracted_dir, str) and
            os.path.exists(extracted_dir) and
            extracted_dir.startswith(os.getenv("PROJECT_PATH"))):
        shutil.rmtree(extracted_dir, ignore_errors=True)
```
Fix: _cleanup_extracted_dir() can throw when PROJECT_PATH is unset + path prefix check is unsafe.
extracted_dir.startswith(os.getenv("PROJECT_PATH")) raises TypeError if the env var is missing, and string-prefix checks are not a safe “is under directory” test.
```diff
 def _cleanup_extracted_dir(self, extracted_dir):
     """Clean up extracted directory if it's in the project path."""
-    if (extracted_dir and isinstance(extracted_dir, str) and
-            os.path.exists(extracted_dir) and
-            extracted_dir.startswith(os.getenv("PROJECT_PATH"))):
-        shutil.rmtree(extracted_dir, ignore_errors=True)
+    project_path = os.getenv("PROJECT_PATH")
+    if not (project_path and extracted_dir and isinstance(extracted_dir, str)):
+        return
+    if not os.path.exists(extracted_dir):
+        return
+    try:
+        proj = os.path.realpath(project_path)
+        target = os.path.realpath(extracted_dir)
+        if os.path.commonpath([proj, target]) == proj:
+            shutil.rmtree(extracted_dir, ignore_errors=True)
+    except ValueError:
+        # Different drives / invalid paths
+        return
```
🤖 Prompt for AI Agents
In app/modules/parsing/graph_construction/parsing_service.py around lines 116 to
122, _cleanup_extracted_dir currently calls os.getenv("PROJECT_PATH") directly
and uses startswith for containment which can raise TypeError if the env var is
unset and is not a safe "is under" check; change it to: read PROJECT_PATH into a
local variable and if falsy return/do nothing, resolve/abspath both PROJECT_PATH
and extracted_dir (or use pathlib.Path.resolve()) to eliminate symlinks, then
use a robust containment check such as comparing
os.path.commonpath([project_path, extracted_dir]) == project_path (wrapping in
try/except to handle ValueError on different drives) before calling
shutil.rmtree; ensure ignore_errors=True remains and handle any exceptions
gracefully.
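On Python 3.9+, the containment check the prompt describes can also be written with `pathlib`, which resolves symlinks and avoids the string-prefix pitfall entirely. The function below is a hypothetical standalone sketch, not the PR's method:

```python
import os
import shutil
import tempfile
from pathlib import Path

def cleanup_extracted_dir(extracted_dir: str) -> bool:
    """Remove extracted_dir only if it lies inside PROJECT_PATH; return True if removed."""
    project_path = os.getenv("PROJECT_PATH")
    if not project_path or not extracted_dir:
        return False
    target = Path(extracted_dir).resolve()
    if not target.exists():
        return False
    # resolve() flattens symlinks on both sides before the containment test
    if not target.is_relative_to(Path(project_path).resolve()):
        return False
    shutil.rmtree(target, ignore_errors=True)
    return True

# Demo: a directory inside PROJECT_PATH is removed, one outside is left alone
with tempfile.TemporaryDirectory() as proj, tempfile.TemporaryDirectory() as other:
    os.environ["PROJECT_PATH"] = proj
    inside = Path(proj) / "extract"
    inside.mkdir()
    assert cleanup_extracted_dir(str(inside)) is True
    assert cleanup_extracted_dir(other) is False
```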
Summary by CodeRabbit
- New Features
- Bug Fixes
- Documentation