Plan: Remote Repository Import for vcspull
Summary
Add a vcspull import command to search and import repositories from GitHub, GitLab, Codeberg/Gitea/Forgejo, and AWS CodeCommit.
Key Decisions:
- Command name: `import` (action-oriented, mirrors common data tool patterns)
- Workspace: `--workspace` required (no guessing)
- HTTP library: stdlib `urllib` only (no new dependencies)
API Research (Verified via curl)
GitHub, GitLab, and Codeberg provide REST APIs for searching repositories; AWS CodeCommit repositories are listed through the AWS CLI instead:
| Service | Endpoint | Auth | Rate Limit |
|---|---|---|---|
| GitHub | `/search/repositories?q=...` | `GITHUB_TOKEN` | 60/hr unauth, higher with auth |
| GitLab | `/api/v4/search?scope=projects&search=...` | `GITLAB_TOKEN` | Requires auth for search |
| Codeberg | `/api/v1/repos/search?q=...` (Gitea API) | `CODEBERG_TOKEN` | Works without auth |
| CodeCommit | `aws codecommit list-repositories` | AWS CLI/boto3 | AWS service limits |
API Response Field Mappings
GitHub → RemoteRepo:
name = data["name"]
clone_url = data["clone_url"] # "https://github.com/user/repo.git"
html_url = data["html_url"]
description = data.get("description")
language = data.get("language")
topics = data.get("topics", []) # Only in detailed response
stars = data["stargazers_count"]
is_fork = data["fork"]
is_archived = data["archived"]
default_branch = data["default_branch"]
owner = data["owner"]["login"]
GitLab → RemoteRepo:
name = data["path"] # Use path, not name (name can have spaces)
clone_url = data["http_url_to_repo"] # "https://gitlab.com/group/repo.git"
html_url = data["web_url"]
description = data.get("description")
language = None # GitLab doesn't return language in list
topics = data.get("topics", [])
stars = data["star_count"]
is_fork = data.get("forked_from_project") is not None
is_archived = data["archived"]
default_branch = data["default_branch"]
owner = data["namespace"]["path"]
Codeberg/Gitea → RemoteRepo:
name = data["name"]
clone_url = data["clone_url"] # "https://codeberg.org/user/repo.git"
html_url = data["html_url"]
description = data.get("description")
language = data.get("language")
topics = data.get("topics", [])
stars = data["stars_count"] # Note: different from GitHub!
is_fork = data["fork"]
is_archived = data["archived"]
default_branch = data["default_branch"]
owner = data["owner"]["login"]
AWS CodeCommit → RemoteRepo (via AWS CLI or boto3):
# From GetRepository API response:
name = data["repositoryMetadata"]["repositoryName"]
clone_url = data["repositoryMetadata"]["cloneUrlHttp"] # or cloneUrlSsh
html_url = f"https://{region}.console.aws.amazon.com/codecommit/repositories/{name}"
description = data["repositoryMetadata"].get("repositoryDescription")
language = None # CodeCommit doesn't track language
topics = [] # CodeCommit doesn't have topics
stars = 0 # CodeCommit doesn't have stars
is_fork = False # CodeCommit doesn't have forks
is_archived = False # CodeCommit doesn't have archived state
default_branch = data["repositoryMetadata"].get("defaultBranch", "main")
owner = data["repositoryMetadata"]["accountId"] # AWS account ID
CodeCommit Clone URL Pattern:
HTTPS: https://git-codecommit.{region}.amazonaws.com/v1/repos/{RepoName}
SSH: ssh://git-codecommit.{region}.amazonaws.com/v1/repos/{RepoName}
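As an illustration of the field mappings above, the GitLab case could be sketched as follows. The `RemoteRepo` dataclass mirrors the one defined under Key Components; the helper name `remote_repo_from_gitlab` and the sample payload are hypothetical and only serve this example.

```python
from __future__ import annotations

import typing as t
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteRepo:
    """Service-neutral repository record (fields as listed in this plan)."""

    name: str
    clone_url: str
    html_url: str
    description: str | None
    language: str | None
    topics: list[str]
    stars: int
    is_fork: bool
    is_archived: bool
    default_branch: str
    owner: str


def remote_repo_from_gitlab(data: dict[str, t.Any]) -> RemoteRepo:
    """Map one GitLab project payload to RemoteRepo (hypothetical helper)."""
    return RemoteRepo(
        name=data["path"],  # path, not name: name may contain spaces
        clone_url=data["http_url_to_repo"],
        html_url=data["web_url"],
        description=data.get("description"),
        language=None,  # not present in GitLab list responses
        topics=data.get("topics", []),
        stars=data["star_count"],
        is_fork=data.get("forked_from_project") is not None,
        is_archived=data["archived"],
        default_branch=data["default_branch"],
        owner=data["namespace"]["path"],
    )


# Made-up sample payload, trimmed to the fields used above.
sample = {
    "name": "My Project",
    "path": "my-project",
    "http_url_to_repo": "https://gitlab.com/ns/my-project.git",
    "web_url": "https://gitlab.com/ns/my-project",
    "namespace": {"path": "ns"},
    "archived": False,
    "star_count": 5,
    "default_branch": "main",
}
repo = remote_repo_from_gitlab(sample)
print(repo.name, repo.is_fork)
```

The other services would follow the same shape, differing only in the key names listed above (for example `stargazers_count` on GitHub versus `stars_count` on Gitea).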
Error Response Formats
GitHub 404:
{"message": "Not Found", "documentation_url": "...", "status": "404"}
GitLab 401 (search without auth):
{"message": "401 Unauthorized"}
Codeberg 404:
{"message": "user redirect does not exist [name: ...]", "url": "..."}
Rate Limit Headers (GitHub)
x-ratelimit-limit: 60
x-ratelimit-remaining: 58
x-ratelimit-reset: 1769964016 # Unix timestamp
x-ratelimit-used: 2
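One possible way to act on these headers is sketched below. The function name `check_rate_limit` and the `warn_below` threshold are assumptions made for this example, and the sketch assumes the header names have already been lowercased by the caller.

```python
from __future__ import annotations

import datetime as dt
import typing as t


def check_rate_limit(
    headers: t.Mapping[str, str],
    warn_below: int = 10,
) -> str | None:
    """Return a warning string when few requests remain, else None.

    ``warn_below`` is an arbitrary threshold chosen for this sketch.
    """
    remaining = headers.get("x-ratelimit-remaining")
    if remaining is None or int(remaining) >= warn_below:
        return None
    message = f"GitHub rate limit low: {remaining} requests remaining"
    reset = headers.get("x-ratelimit-reset")
    if reset is not None:
        # The reset header is a Unix timestamp in seconds (UTC).
        reset_at = dt.datetime.fromtimestamp(int(reset), tz=dt.timezone.utc)
        message += f" (resets at {reset_at.isoformat()})"
    return message


ok = check_rate_limit({"x-ratelimit-remaining": "58"})
low = check_rate_limit(
    {"x-ratelimit-remaining": "3", "x-ratelimit-reset": "1769964016"}
)
print(ok)
print(low)
```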
Command Design
vcspull import <service> <target> -w <workspace> [options]
# Examples:
vcspull import github torvalds -w ~/repos/linux --mode user
vcspull import github django -w ~/study/python --mode org
vcspull import github "machine learning" -w ~/ml-repos --mode search --min-stars 1000
vcspull import gitlab myuser -w ~/work --url https://gitlab.company.com
vcspull import codeberg user -w ~/oss --dry-run
# AWS CodeCommit (uses AWS CLI, requires aws configure)
vcspull import codecommit -w ~/work/aws --region us-east-1
vcspull import codecommit "MyProject" -w ~/work/aws # filter by name
Arguments & Options
Positional:
service github | gitlab | codeberg | gitea | forgejo | codecommit
target User, org name, or search query (for codecommit: optional filter)
Required:
-w, --workspace Workspace root directory (REQUIRED)
Options:
--mode, -m user (default) | org | search
--url Base URL for self-hosted instances
--token API token (overrides env var)
--region AWS region for CodeCommit (default: from AWS config)
--profile AWS profile for CodeCommit (default: from AWS config)
Filtering:
--language, -l Filter by programming language
--topics Filter by topics (comma-separated)
--min-stars Minimum stars (search mode)
--archived Include archived repos
--forks Include forked repos
--limit Max repos to fetch (default: 100)
Output:
-f, --file Config file to write to (default: ~/.vcspull.yaml)
--dry-run, -n Preview without writing
--yes, -y Skip confirmation
--json / --ndjson Machine-readable output
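The options above could be expressed with `argparse` roughly as follows. This is a standalone sketch rather than the actual vcspull subparser wiring; the default of `0` for `--min-stars` and the choice to make `--json` and `--ndjson` mutually exclusive are assumptions.

```python
import argparse

SERVICES = ["github", "gitlab", "codeberg", "gitea", "forgejo", "codecommit"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the options listed above (sketch, not the final CLI)."""
    parser = argparse.ArgumentParser(prog="vcspull import")
    parser.add_argument("service", choices=SERVICES)
    # target is optional so that `codecommit` can be called without a filter
    parser.add_argument("target", nargs="?", default=None)
    parser.add_argument("-w", "--workspace", required=True)
    parser.add_argument(
        "-m", "--mode", choices=["user", "org", "search"], default="user"
    )
    parser.add_argument("--url")
    parser.add_argument("--token")
    parser.add_argument("--region")
    parser.add_argument("--profile")
    parser.add_argument("-l", "--language")
    parser.add_argument("--topics")
    parser.add_argument("--min-stars", type=int, default=0)  # assumed default
    parser.add_argument("--archived", action="store_true")
    parser.add_argument("--forks", action="store_true")
    parser.add_argument("--limit", type=int, default=100)
    parser.add_argument("-f", "--file", default="~/.vcspull.yaml")
    parser.add_argument("-n", "--dry-run", action="store_true")
    parser.add_argument("-y", "--yes", action="store_true")
    # Assumption: only one machine-readable format at a time.
    fmt = parser.add_mutually_exclusive_group()
    fmt.add_argument("--json", action="store_true")
    fmt.add_argument("--ndjson", action="store_true")
    return parser


args = build_parser().parse_args(
    ["github", "django", "-w", "~/study/python", "--mode", "org", "--dry-run"]
)
print(args.service, args.target, args.mode, args.dry_run)
```

Because `--workspace` is declared with `required=True`, omitting it makes `argparse` exit with status 2, which is the behavior the `missing-workspace-fails` CLI fixture expects.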
File Structure
src/vcspull/
  cli/
    import_repos.py      # CLI handler (avoid `import.py` - Python keyword)
    __init__.py          # Register command
  _internal/
    remotes/             # New package for remote service abstraction
      __init__.py
      base.py            # BaseImporter + RemoteRepo dataclass
      github.py          # GitHub implementation
      gitlab.py          # GitLab implementation
      gitea.py           # Gitea/Forgejo/Codeberg implementation
      codecommit.py      # AWS CodeCommit implementation (via AWS CLI)
tests/
  cli/
    test_import_repos.py
  _internal/
    remotes/
      test_github.py
      test_gitlab.py
      test_gitea.py
      test_codecommit.py
      conftest.py
Key Components
1. RemoteRepo Dataclass
import typing as t
from dataclasses import dataclass

@dataclass(frozen=True)
class RemoteRepo:
    name: str
    clone_url: str
    html_url: str
    description: str | None
    language: str | None
    topics: list[str]
    stars: int
    is_fork: bool
    is_archived: bool
    default_branch: str
    owner: str
2. BaseImporter Protocol
class RemoteImporter(t.Protocol):
    service_name: str

    def Import(self, options: ImportOptions) -> t.Iterator[RemoteRepo]: ...
    def authenticate(self, token: str | None = None) -> None: ...
    @property
    def is_authenticated(self) -> bool: ...
3. Service Implementations
- GitHubImporter: Uses `/users/{user}/repos`, `/orgs/{org}/repos`, `/search/repositories`
- GitLabImporter: Uses `/api/v4/groups/{group}/projects` (works without auth), `/api/v4/search` (requires auth)
- GiteaImporter: Uses `/api/v1/users/{user}/repos`, `/api/v1/orgs/{org}/repos`, `/api/v1/repos/search`
- CodeCommitImporter: Uses AWS CLI `aws codecommit list-repositories` + `get-repository`
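For the CodeCommit case, the AWS CLI invocations might be assembled as in the sketch below. It uses `batch-get-repositories`, which this plan mentions as the more efficient alternative to per-repository `get-repository` calls. The batch size of 25 is the figure stated in this plan, and the helper names `chunked` and `batch_get_command` are hypothetical. The sketch only builds argument lists; it does not run the CLI.

```python
from __future__ import annotations

import typing as t

BATCH_SIZE = 25  # per-call limit as stated in this plan


def chunked(items: t.Sequence[str], size: int) -> t.Iterator[list[str]]:
    """Yield consecutive slices of at most ``size`` items."""
    for start in range(0, len(items), size):
        yield list(items[start : start + size])


def batch_get_command(
    names: t.Sequence[str],
    region: str | None = None,
    profile: str | None = None,
) -> list[str]:
    """Build the argv for one `aws codecommit batch-get-repositories` call."""
    cmd = ["aws", "codecommit", "batch-get-repositories", "--output", "json"]
    if region:
        cmd += ["--region", region]
    if profile:
        cmd += ["--profile", profile]
    return [*cmd, "--repository-names", *names]


names = [f"Repo{i}" for i in range(60)]
commands = [
    batch_get_command(batch, region="us-east-1")
    for batch in chunked(names, BATCH_SIZE)
]
print(len(commands))
```

Each argument list would then be handed to `subprocess.run(cmd, capture_output=True, text=True)` and its stdout parsed with `json.loads`.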
Important API Quirks Discovered
- GitLab search requires authentication - the `/api/v4/search?scope=projects` endpoint returns 401 without a token. Show a helpful error message.
- Codeberg search response wraps data - unlike user/org endpoints that return `[...]`, search returns `{"ok": true, "data": [...]}`. Handle both formats.
- GitLab uses `path` not `name` - project `name` can contain spaces, use `path` for filesystem-safe names.
- GitHub rate limits - Unauthenticated: 60 req/hr. Check `x-ratelimit-remaining` header and warn when low.
- AWS CodeCommit requires AWS CLI - Uses `subprocess` to call `aws codecommit` commands. Auth handled by AWS CLI profiles/env vars/IAM roles.
- CodeCommit needs two API calls - `list-repositories` only returns names; must call `get-repository` for each to get clone URL. Can use `batch-get-repositories` for efficiency (up to 25 at a time).
- CodeCommit is region-specific - Must specify `--region` or use default from AWS config.
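The Codeberg/Gitea quirk about wrapped search responses could be handled with a small normalizing helper, sketched here. The name `extract_repos` is hypothetical, and raising `ValueError("search failed")` when `ok` is false is one possible choice, borrowed from the wording of the `codeberg-search-not-ok` test fixture.

```python
from __future__ import annotations

import json
import typing as t


def extract_repos(payload: t.Any) -> list[dict[str, t.Any]]:
    """Normalize Gitea-style responses to a plain list of repo dicts.

    Hypothetical helper: user/org endpoints return a bare list, while
    search wraps the list as {"ok": true, "data": [...]}.
    """
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        if not payload.get("ok", False):
            raise ValueError("search failed")
        return list(payload.get("data", []))
    raise TypeError(f"unexpected response type: {type(payload).__name__}")


user_body = b'[{"name": "repo1"}]'
search_body = b'{"ok": true, "data": [{"name": "repo2"}]}'
user_repos = extract_repos(json.loads(user_body))
search_repos = extract_repos(json.loads(search_body))
print(user_repos, search_repos)
```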
4. Authentication Priority
1. `--token` CLI argument
2. Service-specific env var (`GITHUB_TOKEN`, `GITLAB_TOKEN`, `CODEBERG_TOKEN`)
3. Generic fallback (`GITEA_TOKEN` for Gitea-based)
4. Unauthenticated (lowest rate limits)
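A minimal sketch of this lookup order is shown below. The function name `resolve_token` and the exact per-service grouping of environment variables (for instance, which variables `gitea` and `forgejo` consult) are assumptions; only the variable names themselves come from this plan.

```python
from __future__ import annotations

import typing as t

# Env var names come from the priority list above; the per-service
# grouping is an assumption made for this sketch.
TOKEN_ENV_VARS: dict[str, tuple[str, ...]] = {
    "github": ("GITHUB_TOKEN",),
    "gitlab": ("GITLAB_TOKEN",),
    "codeberg": ("CODEBERG_TOKEN", "GITEA_TOKEN"),
    "gitea": ("GITEA_TOKEN",),
    "forgejo": ("GITEA_TOKEN",),
}


def resolve_token(
    service: str,
    cli_token: str | None,
    environ: t.Mapping[str, str],
) -> str | None:
    """Pick a token: CLI argument first, then env vars, else None."""
    if cli_token:
        return cli_token
    for var in TOKEN_ENV_VARS.get(service, ()):
        value = environ.get(var)
        if value:
            return value
    return None  # unauthenticated


from_cli = resolve_token("github", "cli", {"GITHUB_TOKEN": "env"})
from_env = resolve_token("github", None, {"GITHUB_TOKEN": "env"})
fallback = resolve_token("codeberg", None, {"GITEA_TOKEN": "gt"})
anonymous = resolve_token("github", None, {})
print(from_cli, from_env, fallback, anonymous)
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the lookup easy to exercise from parametrized tests.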
Implementation Steps
1. Create remotes package (`src/vcspull/_internal/remotes/`)
   - `base.py` - Protocol, dataclasses, base HTTP handling with `urllib`
   - `github.py` - GitHub API implementation
   - `gitlab.py` - GitLab API implementation
   - `gitea.py` - Gitea/Forgejo/Codeberg implementation
2. Create CLI command (`src/vcspull/cli/import_repos.py`)
   - Argument parser following `discover.py` patterns
   - Require `--workspace` argument (no default guessing)
   - Output formatting using existing `_output.py` utilities
   - Config writing reusing `discover.py` patterns
3. Register command in `src/vcspull/cli/__init__.py`
   - Add `import` subparser pointing to `import_repos` module
4. Add comprehensive tests (see Testing Strategy below)
Testing Strategy (Target: 90%+ Coverage)
Test File Structure
tests/
  _internal/
    remotes/
      conftest.py          # Shared fixtures, mock response factories
      test_base.py         # Base importer tests
      test_github.py       # GitHub-specific tests
      test_gitlab.py       # GitLab-specific tests
      test_gitea.py        # Gitea/Codeberg tests
  cli/
    test_import_repos.py   # CLI integration tests
Unit Tests for Importers (Mocked HTTP)
All tests use functional style (no classes) with NamedTuple + test_id pattern.
tests/_internal/remotes/conftest.py - Shared Fixtures
import json
import typing as t
import urllib.request

import pytest


@pytest.fixture
def mock_urlopen(monkeypatch: pytest.MonkeyPatch) -> t.Callable[..., None]:
    """Factory fixture to mock urllib.request.urlopen responses."""

    def _mock(responses: list[tuple[bytes, dict[str, str]]]) -> None:
        call_count = 0

        def urlopen_side_effect(request: urllib.request.Request, timeout: int = 30):
            nonlocal call_count
            # Cycle through the canned responses, one per call.
            body, headers = responses[call_count % len(responses)]
            call_count += 1
            # MockResponse is a test helper that this plan does not define yet.
            return MockResponse(body, headers)

        monkeypatch.setattr("urllib.request.urlopen", urlopen_side_effect)

    return _mock


@pytest.fixture
def github_user_repos_response() -> bytes:
    """Standard GitHub user repos API response."""
    return json.dumps([...]).encode()
tests/_internal/remotes/test_github.py
class GitHubUserFixture(t.NamedTuple):
    test_id: str
    response_json: list[dict[str, t.Any]]
    options: ImportOptions
    expected_count: int
    expected_names: list[str]


GITHUB_USER_FIXTURES: list[GitHubUserFixture] = [
    GitHubUserFixture(
        test_id="single-repo-user",
        response_json=[{"name": "repo1", "clone_url": "...", ...}],
        options=ImportOptions(mode=ImportMode.USER, target="testuser"),
        expected_count=1,
        expected_names=["repo1"],
    ),
    GitHubUserFixture(
        test_id="multiple-repos-with-forks-excluded",
        response_json=[
            {"name": "repo1", "fork": False, ...},
            {"name": "forked", "fork": True, ...},
        ],
        options=ImportOptions(mode=ImportMode.USER, target="testuser", include_forks=False),
        expected_count=1,
        expected_names=["repo1"],
    ),
    GitHubUserFixture(
        test_id="archived-repos-excluded-by-default",
        ...
    ),
    GitHubUserFixture(
        test_id="language-filter-applied",
        ...
    ),
    GitHubUserFixture(
        test_id="empty-response-returns-empty-list",
        response_json=[],
        options=ImportOptions(mode=ImportMode.USER, target="emptyuser"),
        expected_count=0,
        expected_names=[],
    ),
]


@pytest.mark.parametrize(
    list(GitHubUserFixture._fields),
    GITHUB_USER_FIXTURES,
    ids=[f.test_id for f in GITHUB_USER_FIXTURES],
)
def test_github_import_user(
    test_id: str,
    response_json: list[dict[str, t.Any]],
    options: ImportOptions,
    expected_count: int,
    expected_names: list[str],
    mock_urlopen: t.Callable[..., None],
) -> None:
    """Test GitHub user repository importing with various scenarios."""
    mock_urlopen([(json.dumps(response_json).encode(), {"x-ratelimit-remaining": "100"})])
    importer = GitHubImporter()
    repos = list(importer.Import(options))
    assert len(repos) == expected_count
    assert [r.name for r in repos] == expected_names
Edge Cases to Test
HTTP/Network Errors
class HTTPErrorFixture(t.NamedTuple):
    test_id: str
    error_code: int
    response_body: bytes
    expected_exception: type[Exception]
    expected_message_contains: str


HTTP_ERROR_FIXTURES = [
    HTTPErrorFixture(
        "github-auth-failure-401",
        401,
        b'{"message": "Bad credentials"}',
        AuthenticationError,
        "credentials",
    ),
    HTTPErrorFixture(
        "github-rate-limit-403",
        403,
        b'{"message": "API rate limit exceeded"}',
        RateLimitError,
        "rate limit",
    ),
    HTTPErrorFixture(
        "github-not-found-404",
        404,
        b'{"message": "Not Found", "status": "404"}',
        NotFoundError,
        "not found",
    ),
    HTTPErrorFixture(
        "gitlab-auth-required-401",
        401,
        b'{"message": "401 Unauthorized"}',
        AuthenticationError,
        "unauthorized",
    ),
    HTTPErrorFixture(
        "codeberg-user-not-found",
        404,
        b'{"message": "user redirect does not exist [name: xyz]"}',
        NotFoundError,
        "does not exist",
    ),
    HTTPErrorFixture(
        "server-error-500",
        500,
        b'{"error": "Internal Server Error"}',
        ServiceUnavailableError,
        "unavailable",
    ),
    HTTPErrorFixture(
        "rate-limit-429",
        429,
        b'{"message": "Too Many Requests"}',
        RateLimitError,
        "rate limit",
    ),
]
Missing API Key Cases
class MissingAuthFixture(t.NamedTuple):
    test_id: str
    service: str
    env_vars: dict[str, str]
    endpoint_requires_auth: bool
    expected_behavior: str


MISSING_AUTH_FIXTURES = [
    MissingAuthFixture(
        "github-no-token-works",
        "github",
        {},
        False,
        "success_with_lower_rate_limit",
    ),
    MissingAuthFixture(
        "gitlab-search-requires-auth",
        "gitlab",
        {},
        True,
        "raises_authentication_error",
    ),
    MissingAuthFixture(
        "gitlab-groups-no-auth-works",
        "gitlab",
        {},
        False,  # /groups/{id}/projects works without auth
        "success",
    ),
    MissingAuthFixture(
        "codeberg-no-token-works",
        "codeberg",
        {},
        False,
        "success",
    ),
]


@pytest.mark.parametrize(
    list(MissingAuthFixture._fields),
    MISSING_AUTH_FIXTURES,
    ids=[f.test_id for f in MISSING_AUTH_FIXTURES],
)
def test_missing_auth_behavior(...) -> None:
    """Test behavior when API tokens are missing."""
Accurate vs Inaccurate API Responses
class APIResponseFixture(t.NamedTuple):
    test_id: str
    response_json: dict[str, t.Any]
    is_valid: bool
    expected_error: str | None


API_RESPONSE_FIXTURES = [
    APIResponseFixture(
        "github-valid-repo",
        {
            "name": "repo",
            "clone_url": "https://github.com/user/repo.git",
            "html_url": "https://github.com/user/repo",
            "owner": {"login": "user"},
            "fork": False,
            "archived": False,
            "stargazers_count": 10,
            "default_branch": "main",
        },
        True,
        None,
    ),
    APIResponseFixture(
        "github-missing-clone-url",
        {"name": "repo", "owner": {"login": "user"}},
        False,
        "missing required field: clone_url",
    ),
    APIResponseFixture(
        "github-missing-owner",
        {"name": "repo", "clone_url": "..."},
        False,
        "missing required field: owner",
    ),
    APIResponseFixture(
        "gitlab-valid-project",
        {
            "path": "project",
            "http_url_to_repo": "https://gitlab.com/ns/project.git",
            "web_url": "https://gitlab.com/ns/project",
            "namespace": {"path": "ns"},
            "archived": False,
            "star_count": 5,
            "default_branch": "main",
        },
        True,
        None,
    ),
    APIResponseFixture(
        "codeberg-valid-search-response",
        {
            "ok": True,
            "data": [{"name": "repo", "clone_url": "...", "owner": {"login": "u"}}],
        },
        True,
        None,
    ),
    APIResponseFixture(
        "codeberg-search-not-ok",
        {"ok": False, "data": []},
        False,
        "search failed",
    ),
]
Authentication Edge Cases
class AuthFixture(t.NamedTuple):
    test_id: str
    env_vars: dict[str, str]
    cli_token: str | None
    expected_token_used: str | None


AUTH_FIXTURES = [
    AuthFixture("cli-token-overrides-env", {"GITHUB_TOKEN": "env"}, "cli", "cli"),
    AuthFixture("env-token-used-when-no-cli", {"GITHUB_TOKEN": "env"}, None, "env"),
    AuthFixture("no-auth-when-no-token", {}, None, None),
    AuthFixture("codeberg-specific-env", {"CODEBERG_TOKEN": "cb"}, None, "cb"),
    AuthFixture("gitea-fallback-token", {"GITEA_TOKEN": "gt"}, None, "gt"),
]
Pagination Edge Cases
PAGINATION_FIXTURES = [
    ("single-page-under-limit", 50, 100, 1),  # 50 repos, limit 100, 1 API call
    ("exact-page-boundary", 100, 100, 1),  # 100 repos, limit 100, 1 call
    ("multi-page-over-limit", 150, 100, 2),  # 150 repos, limit 100, stops at 100
    ("empty-second-page", 30, 100, 2),  # First page 30, second empty
]
Filter Combinations
FILTER_FIXTURES = [
    ("language-python-only", {"language": "Python"}, 5, 2),  # 5 total, 2 Python
    ("topics-filter", {"topics": ["cli", "tool"]}, 5, 1),
    ("min-stars-filter", {"min_stars": 100}, 5, 3),
    ("combined-filters", {"language": "Python", "min_stars": 50}, 10, 1),
    ("include-archived", {"include_archived": True}, 5, 5),
    ("include-forks", {"include_forks": True}, 5, 5),
]
AWS CodeCommit-Specific Tests
class CodeCommitFixture(t.NamedTuple):
    test_id: str
    list_repos_output: str  # JSON output from aws codecommit list-repositories
    get_repo_output: str  # JSON output from aws codecommit get-repository
    expected_count: int
    expected_names: list[str]


CODECOMMIT_FIXTURES = [
    CodeCommitFixture(
        "single-repo",
        '{"repositories": [{"repositoryName": "MyRepo", "repositoryId": "abc123"}]}',
        '{"repositoryMetadata": {"repositoryName": "MyRepo", "cloneUrlHttp": "https://git-codecommit.us-east-1.amazonaws.com/v1/repos/MyRepo"}}',
        1,
        ["MyRepo"],
    ),
    CodeCommitFixture(
        "multiple-repos",
        '{"repositories": [{"repositoryName": "Repo1"}, {"repositoryName": "Repo2"}]}',
        ...,
        2,
        ["Repo1", "Repo2"],
    ),
    CodeCommitFixture(
        "empty-account",
        '{"repositories": []}',
        "",
        0,
        [],
    ),
]


class CodeCommitErrorFixture(t.NamedTuple):
    test_id: str
    returncode: int
    stderr: str
    expected_exception: type[Exception]
    expected_message_contains: str


CODECOMMIT_ERROR_FIXTURES = [
    CodeCommitErrorFixture(
        "aws-cli-not-found",
        127,
        "aws: command not found",
        DependencyError,
        "AWS CLI not installed",
    ),
    CodeCommitErrorFixture(
        "aws-credentials-missing",
        255,
        "Unable to locate credentials",
        AuthenticationError,
        "credentials",
    ),
    CodeCommitErrorFixture(
        "invalid-region",
        255,
        "Could not connect to the endpoint URL",
        ConfigurationError,
        "region",
    ),
]
CLI Integration Tests
tests/cli/test_import_repos.py
class ImportCLIFixture(t.NamedTuple):
    test_id: str
    cli_args: list[str]
    mock_repos: list[dict[str, t.Any]]
    expected_exit_code: int
    expected_output_contains: list[str]
    expected_config_repos: int


IMPORT_CLI_FIXTURES = [
    ImportCLIFixture(
        test_id="basic-user-import-dry-run",
        cli_args=["import", "github", "testuser", "-w", "~/test", "--dry-run"],
        mock_repos=[...],
        expected_exit_code=0,
        expected_output_contains=["Found", "repositories", "Dry run"],
        expected_config_repos=0,
    ),
    ImportCLIFixture(
        test_id="missing-workspace-fails",
        cli_args=["import", "github", "testuser"],  # No -w
        mock_repos=[],
        expected_exit_code=2,  # argparse error
        expected_output_contains=["--workspace"],
        expected_config_repos=0,
    ),
    ImportCLIFixture(
        test_id="json-output-format",
        cli_args=["import", "github", "testuser", "-w", "~/test", "--json"],
        mock_repos=[...],
        expected_exit_code=0,
        expected_output_contains=['"name":', '"clone_url":'],
        expected_config_repos=0,
    ),
    ImportCLIFixture(
        test_id="gitea-requires-url",
        cli_args=["import", "gitea", "user", "-w", "~/test"],  # No --url
        mock_repos=[],
        expected_exit_code=1,
        expected_output_contains=["--url is required"],
        expected_config_repos=0,
    ),
]


@pytest.mark.parametrize(
    list(ImportCLIFixture._fields),
    IMPORT_CLI_FIXTURES,
    ids=[f.test_id for f in IMPORT_CLI_FIXTURES],
)
def test_import_cli(
    test_id: str,
    cli_args: list[str],
    mock_repos: list[dict[str, t.Any]],
    expected_exit_code: int,
    expected_output_contains: list[str],
    expected_config_repos: int,
    tmp_path: pathlib.Path,
    monkeypatch: pytest.MonkeyPatch,
    capsys: pytest.CaptureFixture[str],
) -> None:
    """Test CLI argument handling and output."""
Coverage Requirements
Target: 90%+ line coverage
| Module | Required Coverage | Key Areas |
|---|---|---|
| `_internal/remotes/base.py` | 95% | HTTP handling, error mapping, dataclasses |
| `_internal/remotes/github.py` | 90% | All 3 modes, pagination, filtering |
| `_internal/remotes/gitlab.py` | 90% | All 3 modes, self-hosted URL handling |
| `_internal/remotes/gitea.py` | 90% | All 3 modes, service variants |
| `_internal/remotes/codecommit.py` | 90% | AWS CLI subprocess, batch-get, region handling |
| `cli/import_repos.py` | 85% | Arg parsing, output modes, config writing |
Mock Strategy
- `monkeypatch` for:
  - `urllib.request.urlopen` - All HTTP calls
  - `os.environ` - Environment variable tests
  - File system operations via `tmp_path`
- Snapshot testing (syrupy) for:
  - JSON/NDJSON output format
  - Human-readable output format
- `capsys` for:
  - Capturing stdout/stderr
  - Verifying colored output
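The `mock_urlopen` fixture in the shared conftest returns a `MockResponse`, which this plan does not spell out. One possible shape is sketched below; the attribute and method set is an assumption, chosen to cover the parts of the `urlopen` result that a `urllib`-based importer is likely to touch (`read()`, header lookup, and use as a context manager).

```python
from __future__ import annotations

import json


class MockResponse:
    """Minimal stand-in for the object returned by urllib.request.urlopen."""

    def __init__(
        self, body: bytes, headers: dict[str, str], status: int = 200
    ) -> None:
        self._body = body
        self.headers = headers
        self.status = status

    def read(self) -> bytes:
        return self._body

    def getheader(self, name: str, default: str | None = None) -> str | None:
        return self.headers.get(name, default)

    # urlopen() results are used as context managers, so support `with`.
    def __enter__(self) -> "MockResponse":
        return self

    def __exit__(self, *exc_info: object) -> None:
        return None


with MockResponse(b'[{"name": "repo1"}]', {"x-ratelimit-remaining": "100"}) as resp:
    data = json.loads(resp.read())
    remaining = resp.getheader("x-ratelimit-remaining")
print(data, remaining)
```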
Dependencies
Use stdlib only - no new dependencies needed:
- `urllib.request` for HTTP APIs (GitHub, GitLab, Codeberg)
- `subprocess` for AWS CLI (CodeCommit)
- `json` for parsing responses
External requirement for CodeCommit: AWS CLI must be installed and configured (`aws configure`).
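A stdlib-only HTTP helper along these lines might look like the sketch below. The exception names match those used in the test fixtures of this plan, but the `RemoteError` base class, the function names `error_for_status` and `get_json`, and the exact status-to-exception rules are assumptions. Only the pure mapping function is exercised at the end, so the example makes no network calls.

```python
from __future__ import annotations

import json
import typing as t
import urllib.error
import urllib.request


class RemoteError(Exception):
    """Base class for this sketch's error types (hypothetical)."""


class AuthenticationError(RemoteError):
    """Credentials are missing or rejected."""


class RateLimitError(RemoteError):
    """The service is throttling requests."""


class NotFoundError(RemoteError):
    """The user, org, or repository does not exist."""


class ServiceUnavailableError(RemoteError):
    """The service returned a 5xx response."""


def error_for_status(code: int, body: bytes) -> RemoteError:
    """Translate an HTTP status and JSON error body into an exception."""
    try:
        message = str(json.loads(body).get("message", ""))
    except (ValueError, AttributeError):
        message = ""  # body was empty, not JSON, or not a JSON object
    if code == 401:
        return AuthenticationError(message or "unauthorized")
    if code == 429 or (code == 403 and "rate limit" in message.lower()):
        return RateLimitError(message or "rate limit exceeded")
    if code == 404:
        return NotFoundError(message or "not found")
    if code >= 500:
        return ServiceUnavailableError(f"service unavailable (HTTP {code})")
    return RemoteError(f"HTTP {code}: {message}")


def get_json(
    url: str,
    headers: t.Mapping[str, str] | None = None,
    timeout: int = 30,
) -> t.Any:
    """GET a URL and decode the JSON body, raising mapped errors."""
    request = urllib.request.Request(url, headers=dict(headers or {}))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read())
    except urllib.error.HTTPError as exc:
        raise error_for_status(exc.code, exc.read()) from exc


err = error_for_status(403, b'{"message": "API rate limit exceeded"}')
print(type(err).__name__)
```

Keeping the status mapping separate from the network call means the HTTP error fixtures can test it directly, without patching `urlopen`.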
Critical Files to Reference
- `src/vcspull/cli/discover.py` - Pattern for config writing, dry-run, confirmation
- `src/vcspull/cli/search.py` - Pattern for output formatting, JSON modes
- `src/vcspull/cli/_output.py` - OutputFormatter to reuse
- `tests/cli/test_discover.py` - Testing patterns
Verification
- Run linting: `uv run ruff check . --fix`
- Run type checking: `uv run mypy`
- Run tests: `uv run pytest tests/_internal/remotes/ tests/cli/test_import_repos.py -v`
- Check coverage: `uv run pytest --cov=vcspull._internal.remotes --cov=vcspull.cli.import_repos --cov-report=term-missing`
- Manual testing:
# GitHub - dry-run to verify API calls work
vcspull import github django -w ~/study/python --mode org --dry-run
# JSON output for inspection
vcspull import github torvalds -w ~/repos/linux --mode user --json | jq
# Full import to config
vcspull import github pallets -w ~/study/python -f ~/.vcspull.yaml
# AWS CodeCommit (requires AWS CLI configured)
vcspull import codecommit -w ~/work/aws --region us-east-1 --dry-run