Import remote repos, ghorg-like fetching behavior #416

@tony

Plan: Remote Repository Import for vcspull

Summary

Add a vcspull import command to search and import repositories from GitHub, GitLab, Codeberg/Gitea/Forgejo, and AWS CodeCommit.

Key Decisions:

  • Command name: import (action-oriented, mirrors common data tool patterns)
  • Workspace: --workspace required (no guessing)
  • HTTP library: stdlib urllib only (no new dependencies)

API Research (Verified via curl)

GitHub, GitLab, and Codeberg provide REST APIs for searching repositories; CodeCommit repositories are listed through the AWS CLI:

Service    | Endpoint                                 | Auth           | Rate Limit
-----------|------------------------------------------|----------------|-------------------------------
GitHub     | /search/repositories?q=...               | GITHUB_TOKEN   | 60/hr unauth, higher with auth
GitLab     | /api/v4/search?scope=projects&search=... | GITLAB_TOKEN   | Requires auth for search
Codeberg   | /api/v1/repos/search?q=... (Gitea API)   | CODEBERG_TOKEN | Works without auth
CodeCommit | aws codecommit list-repositories         | AWS CLI/boto3  | AWS service limits

API Response Field Mappings

GitHub → RemoteRepo:

name = data["name"]
clone_url = data["clone_url"]  # "https://github.com/user/repo.git"
html_url = data["html_url"]
description = data.get("description")
language = data.get("language")
topics = data.get("topics", [])  # Only in detailed response
stars = data["stargazers_count"]
is_fork = data["fork"]
is_archived = data["archived"]
default_branch = data["default_branch"]
owner = data["owner"]["login"]

GitLab → RemoteRepo:

name = data["path"]  # Use path, not name (name can have spaces)
clone_url = data["http_url_to_repo"]  # "https://gitlab.com/group/repo.git"
html_url = data["web_url"]
description = data.get("description")
language = None  # GitLab doesn't return language in list
topics = data.get("topics", [])
stars = data["star_count"]
is_fork = data.get("forked_from_project") is not None
is_archived = data["archived"]
default_branch = data["default_branch"]
owner = data["namespace"]["path"]

Codeberg/Gitea → RemoteRepo:

name = data["name"]
clone_url = data["clone_url"]  # "https://codeberg.org/user/repo.git"
html_url = data["html_url"]
description = data.get("description")
language = data.get("language")
topics = data.get("topics", [])
stars = data["stars_count"]  # Note: different from GitHub!
is_fork = data["fork"]
is_archived = data["archived"]
default_branch = data["default_branch"]
owner = data["owner"]["login"]

AWS CodeCommit → RemoteRepo (via AWS CLI or boto3):

# From GetRepository API response:
name = data["repositoryMetadata"]["repositoryName"]
clone_url = data["repositoryMetadata"]["cloneUrlHttp"]  # or cloneUrlSsh
html_url = f"https://{region}.console.aws.amazon.com/codecommit/repositories/{name}"
description = data["repositoryMetadata"].get("repositoryDescription")
language = None  # CodeCommit doesn't track language
topics = []  # CodeCommit doesn't have topics
stars = 0  # CodeCommit doesn't have stars
is_fork = False  # CodeCommit doesn't have forks
is_archived = False  # CodeCommit doesn't have archived state
default_branch = data["repositoryMetadata"].get("defaultBranch", "main")
owner = data["repositoryMetadata"]["accountId"]  # AWS account ID

CodeCommit Clone URL Pattern:

HTTPS: https://git-codecommit.{region}.amazonaws.com/v1/repos/{RepoName}
SSH:   ssh://git-codecommit.{region}.amazonaws.com/v1/repos/{RepoName}

Error Response Formats

GitHub 404:

{"message": "Not Found", "documentation_url": "...", "status": "404"}

GitLab 401 (search without auth):

{"message": "401 Unauthorized"}

Codeberg 404:

{"message": "user redirect does not exist [name: ...]", "url": "..."}

Rate Limit Headers (GitHub)

x-ratelimit-limit: 60
x-ratelimit-remaining: 58
x-ratelimit-reset: 1769964016  # Unix timestamp
x-ratelimit-used: 2
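
A minimal sketch of how these headers could be turned into a warning. The function name check_rate_limit and the warn_below threshold are assumptions for illustration, not part of the plan; the mapping is expected to use lowercase header names as shown above.

```python
from __future__ import annotations

import datetime as dt
from collections.abc import Mapping


def check_rate_limit(headers: Mapping[str, str], warn_below: int = 10) -> str | None:
    """Return a warning string when few requests remain, otherwise None.

    `warn_below` is an assumed threshold, not a value from the plan.
    """
    remaining = headers.get("x-ratelimit-remaining")
    if remaining is None or int(remaining) >= warn_below:
        return None
    message = f"rate limit low: {remaining} requests remaining"
    reset = headers.get("x-ratelimit-reset")
    if reset is not None:
        # x-ratelimit-reset is a Unix timestamp (seconds since the epoch).
        reset_at = dt.datetime.fromtimestamp(int(reset), tz=dt.timezone.utc)
        message += f", resets at {reset_at.isoformat()}"
    return message
```

With the sample headers above (58 remaining) the function returns None; a warning appears only once the remaining count drops below the threshold.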

Command Design

vcspull import <service> <target> -w <workspace> [options]

# Examples:
vcspull import github torvalds -w ~/repos/linux --mode user
vcspull import github django -w ~/study/python --mode org
vcspull import github "machine learning" -w ~/ml-repos --mode search --min-stars 1000
vcspull import gitlab myuser -w ~/work --url https://gitlab.company.com
vcspull import codeberg user -w ~/oss --dry-run

# AWS CodeCommit (uses AWS CLI, requires aws configure)
vcspull import codecommit -w ~/work/aws --region us-east-1
vcspull import codecommit "MyProject" -w ~/work/aws  # filter by name

Arguments & Options

Positional:
  service              github | gitlab | codeberg | gitea | forgejo | codecommit
  target               User, org name, or search query (for codecommit: optional filter)

Required:
  -w, --workspace     Workspace root directory (REQUIRED)

Options:
  --mode, -m          user (default) | org | search
  --url               Base URL for self-hosted instances
  --token             API token (overrides env var)
  --region            AWS region for CodeCommit (default: from AWS config)
  --profile           AWS profile for CodeCommit (default: from AWS config)

Filtering:
  --language, -l      Filter by programming language
  --topics            Filter by topics (comma-separated)
  --min-stars         Minimum stars (search mode)
  --archived          Include archived repos
  --forks             Include forked repos
  --limit             Max repos to fetch (default: 100)

Output:
  -f, --file          Config file to write to (default: ~/.vcspull.yaml)
  --dry-run, -n       Preview without writing
  --yes, -y           Skip confirmation
  --json / --ndjson   Machine-readable output

File Structure

src/vcspull/
  cli/
    import_repos.py     # CLI handler (avoid `import.py` - Python keyword)
    __init__.py         # Register command

  _internal/
    remotes/            # New package for remote service abstraction
      __init__.py
      base.py           # BaseImporter + RemoteRepo dataclass
      github.py         # GitHub implementation
      gitlab.py         # GitLab implementation
      gitea.py          # Gitea/Forgejo/Codeberg implementation
      codecommit.py     # AWS CodeCommit implementation (via AWS CLI)

tests/
  cli/
    test_import_repos.py
  _internal/
    remotes/
      test_github.py
      test_gitlab.py
      test_gitea.py
      test_codecommit.py
      conftest.py

Key Components

1. RemoteRepo Dataclass

@dataclass(frozen=True)
class RemoteRepo:
    name: str
    clone_url: str
    html_url: str
    description: str | None
    language: str | None
    topics: list[str]
    stars: int
    is_fork: bool
    is_archived: bool
    default_branch: str
    owner: str
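
To illustrate how the field mappings feed this dataclass, here is a sketch that builds a RemoteRepo from a GitHub payload. The dataclass is repeated so the block is self-contained; the function name repo_from_github is hypothetical.

```python
from __future__ import annotations

import typing as t
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteRepo:
    name: str
    clone_url: str
    html_url: str
    description: str | None
    language: str | None
    topics: list[str]
    stars: int
    is_fork: bool
    is_archived: bool
    default_branch: str
    owner: str


def repo_from_github(data: dict[str, t.Any]) -> RemoteRepo:
    """Apply the GitHub field mapping; a missing required key raises KeyError."""
    return RemoteRepo(
        name=data["name"],
        clone_url=data["clone_url"],
        html_url=data["html_url"],
        description=data.get("description"),
        language=data.get("language"),
        topics=list(data.get("topics", [])),
        stars=data["stargazers_count"],
        is_fork=data["fork"],
        is_archived=data["archived"],
        default_branch=data["default_branch"],
        owner=data["owner"]["login"],
    )
```

The optional fields use .get() so that list responses without description, language, or topics still map cleanly.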

2. BaseImporter Protocol

class RemoteImporter(t.Protocol):
    service_name: str

    def Import(self, options: ImportOptions) -> t.Iterator[RemoteRepo]: ...
    def authenticate(self, token: str | None = None) -> None: ...
    @property
    def is_authenticated(self) -> bool: ...

3. Service Implementations

  • GitHubImporter: Uses /users/{user}/repos, /orgs/{org}/repos, /search/repositories
  • GitLabImporter: Uses /api/v4/groups/{group}/projects (works without auth), /api/v4/search (requires auth)
  • GiteaImporter: Uses /api/v1/users/{user}/repos, /api/v1/orgs/{org}/repos, /api/v1/repos/search
  • CodeCommitImporter: Uses AWS CLI aws codecommit list-repositories + get-repository

Important API Quirks Discovered

  1. GitLab search requires authentication - the /api/v4/search?scope=projects endpoint returns 401 without a token. Show helpful error message.

  2. Codeberg search response wraps data - unlike user/org endpoints that return [...], search returns {"ok": true, "data": [...]}. Handle both formats.

  3. GitLab uses path not name - project name can contain spaces, use path for filesystem-safe names.

  4. GitHub rate limits - Unauthenticated: 60 req/hr. Check x-ratelimit-remaining header and warn when low.

  5. AWS CodeCommit requires AWS CLI - Uses subprocess to call aws codecommit commands. Auth handled by AWS CLI profiles/env vars/IAM roles.

  6. CodeCommit needs two API calls - list-repositories only returns names; must call get-repository for each to get clone URL. Can use batch-get-repositories for efficiency (up to 25 at a time).

  7. CodeCommit is region-specific - Must specify --region or use default from AWS config.
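
Quirk 2 (Codeberg wrapping search results) could be handled with a small normalizer like the one below. This is a sketch; unwrap_gitea_repos is a hypothetical name, and raising ValueError for an "ok": false payload is an assumption.

```python
from __future__ import annotations

import typing as t


def unwrap_gitea_repos(payload: t.Any) -> list[dict[str, t.Any]]:
    """Accept both Gitea response shapes and return a plain list of repos."""
    if isinstance(payload, list):
        # User/org endpoints return a bare JSON array.
        return payload
    if isinstance(payload, dict):
        # Search returns {"ok": true, "data": [...]}.
        if not payload.get("ok", False):
            raise ValueError("search failed")
        return list(payload.get("data", []))
    raise TypeError(f"unexpected response type: {type(payload).__name__}")
```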

4. Authentication Priority

  1. --token CLI argument
  2. Service-specific env var (GITHUB_TOKEN, GITLAB_TOKEN, CODEBERG_TOKEN)
  3. Generic fallback (GITEA_TOKEN for Gitea-based)
  4. Unauthenticated (lowest rate limits)
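
The priority order above can be sketched as a single lookup function. The name resolve_token and the two lookup tables are assumptions for illustration; the environment is passed in explicitly so the function is easy to test.

```python
from __future__ import annotations

from collections.abc import Mapping

# Service-specific variables named in the plan.
SERVICE_ENV_VARS = {
    "github": "GITHUB_TOKEN",
    "gitlab": "GITLAB_TOKEN",
    "codeberg": "CODEBERG_TOKEN",
}
# Services that fall back to the generic GITEA_TOKEN.
GITEA_BASED = {"codeberg", "gitea", "forgejo"}


def resolve_token(
    service: str, cli_token: str | None, environ: Mapping[str, str]
) -> str | None:
    """Return the token to use, or None for unauthenticated access."""
    if cli_token:
        return cli_token  # 1. --token always wins
    env_name = SERVICE_ENV_VARS.get(service)
    if env_name and environ.get(env_name):
        return environ[env_name]  # 2. service-specific env var
    if service in GITEA_BASED and environ.get("GITEA_TOKEN"):
        return environ["GITEA_TOKEN"]  # 3. generic Gitea fallback
    return None  # 4. unauthenticated
```

In the CLI handler this would be called with os.environ as the third argument.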

Implementation Steps

  1. Create remotes package (src/vcspull/_internal/remotes/)

    • base.py - Protocol, dataclasses, base HTTP handling with urllib
    • github.py - GitHub API implementation
    • gitlab.py - GitLab API implementation
    • gitea.py - Gitea/Forgejo/Codeberg implementation
    • codecommit.py - AWS CodeCommit implementation (via AWS CLI)
  2. Create CLI command (src/vcspull/cli/import_repos.py)

    • Argument parser following discover.py patterns
    • Require --workspace argument (no default guessing)
    • Output formatting using existing _output.py utilities
    • Config writing reusing discover.py patterns
  3. Register command in src/vcspull/cli/__init__.py

    • Add import subparser pointing to import_repos module
  4. Add comprehensive tests (see Testing Strategy below)

Testing Strategy (Target: 90%+ Coverage)

Test File Structure

tests/
  _internal/
    remotes/
      conftest.py           # Shared fixtures, mock response factories
      test_base.py          # Base Importer tests
      test_github.py        # GitHub-specific tests
      test_gitlab.py        # GitLab-specific tests
      test_gitea.py         # Gitea/Codeberg tests
      test_codecommit.py    # AWS CodeCommit tests
  cli/
    test_import_repos.py    # CLI integration tests

Unit Tests for Importers (Mocked HTTP)

All tests use functional style (no classes) with NamedTuple + test_id pattern.

tests/_internal/remotes/conftest.py - Shared Fixtures

import json
import typing as t
import urllib.request

import pytest


@pytest.fixture
def mock_urlopen(monkeypatch: pytest.MonkeyPatch) -> t.Callable[..., None]:
    """Factory fixture to mock urllib.request.urlopen responses."""
    def _mock(responses: list[tuple[bytes, dict[str, str]]]) -> None:
        call_count = 0
        def urlopen_side_effect(request: urllib.request.Request, timeout: int = 30):
            # Cycle through the canned (body, headers) pairs, one per call.
            nonlocal call_count
            body, headers = responses[call_count % len(responses)]
            call_count += 1
            # MockResponse is a response test double that is not shown here.
            return MockResponse(body, headers)
        monkeypatch.setattr("urllib.request.urlopen", urlopen_side_effect)
    return _mock

@pytest.fixture
def github_user_repos_response() -> bytes:
    """Standard GitHub user repos API response."""
    return json.dumps([...]).encode()
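
The fixture above returns a MockResponse that the plan does not define. One possible shape, assuming the importers use the response as a context manager and read .status, .headers, and .read(), is sketched below; the status parameter is an addition for illustration.

```python
from __future__ import annotations

import email.message


class MockResponse:
    """Minimal stand-in for the object returned by urllib.request.urlopen."""

    def __init__(self, body: bytes, headers: dict[str, str], status: int = 200) -> None:
        self._body = body
        self.status = status
        # email.message.Message gives case-insensitive header lookup, like
        # the http.client.HTTPMessage found on real responses.
        self.headers = email.message.Message()
        for key, value in headers.items():
            self.headers[key] = value

    def read(self) -> bytes:
        return self._body

    def __enter__(self) -> MockResponse:
        return self

    def __exit__(self, *exc_info: object) -> None:
        return None
```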

tests/_internal/remotes/test_github.py

class GitHubUserFixture(t.NamedTuple):
    test_id: str
    response_json: list[dict[str, t.Any]]
    options: ImportOptions
    expected_count: int
    expected_names: list[str]

GITHUB_USER_FIXTURES: list[GitHubUserFixture] = [
    GitHubUserFixture(
        test_id="single-repo-user",
        response_json=[{"name": "repo1", "clone_url": "...", ...}],
        options=ImportOptions(mode=ImportMode.USER, target="testuser"),
        expected_count=1,
        expected_names=["repo1"],
    ),
    GitHubUserFixture(
        test_id="multiple-repos-with-forks-excluded",
        response_json=[
            {"name": "repo1", "fork": False, ...},
            {"name": "forked", "fork": True, ...},
        ],
        options=ImportOptions(mode=ImportMode.USER, target="testuser", include_forks=False),
        expected_count=1,
        expected_names=["repo1"],
    ),
    GitHubUserFixture(
        test_id="archived-repos-excluded-by-default",
        ...
    ),
    GitHubUserFixture(
        test_id="language-filter-applied",
        ...
    ),
    GitHubUserFixture(
        test_id="empty-response-returns-empty-list",
        response_json=[],
        options=ImportOptions(mode=ImportMode.USER, target="emptyuser"),
        expected_count=0,
        expected_names=[],
    ),
]

@pytest.mark.parametrize(
    list(GitHubUserFixture._fields),
    GITHUB_USER_FIXTURES,
    ids=[f.test_id for f in GITHUB_USER_FIXTURES],
)
def test_github_Import_user(
    test_id: str,
    response_json: list[dict[str, t.Any]],
    options: ImportOptions,
    expected_count: int,
    expected_names: list[str],
    mock_urlopen: t.Callable[..., None],
) -> None:
    """Test GitHub user repository importing with various scenarios."""
    mock_urlopen([(json.dumps(response_json).encode(), {"x-ratelimit-remaining": "100"})])
    importer = GitHubImporter()
    repos = list(importer.Import(options))
    assert len(repos) == expected_count
    assert [r.name for r in repos] == expected_names

Edge Cases to Test

HTTP/Network Errors

class HTTPErrorFixture(t.NamedTuple):
    test_id: str
    error_code: int
    response_body: bytes
    expected_exception: type[Exception]
    expected_message_contains: str

HTTP_ERROR_FIXTURES = [
    HTTPErrorFixture(
        "github-auth-failure-401",
        401,
        b'{"message": "Bad credentials"}',
        AuthenticationError,
        "credentials",
    ),
    HTTPErrorFixture(
        "github-rate-limit-403",
        403,
        b'{"message": "API rate limit exceeded"}',
        RateLimitError,
        "rate limit",
    ),
    HTTPErrorFixture(
        "github-not-found-404",
        404,
        b'{"message": "Not Found", "status": "404"}',
        NotFoundError,
        "not found",
    ),
    HTTPErrorFixture(
        "gitlab-auth-required-401",
        401,
        b'{"message": "401 Unauthorized"}',
        AuthenticationError,
        "unauthorized",
    ),
    HTTPErrorFixture(
        "codeberg-user-not-found",
        404,
        b'{"message": "user redirect does not exist [name: xyz]"}',
        NotFoundError,
        "does not exist",
    ),
    HTTPErrorFixture(
        "server-error-500",
        500,
        b'{"error": "Internal Server Error"}',
        ServiceUnavailableError,
        "unavailable",
    ),
    HTTPErrorFixture(
        "rate-limit-429",
        429,
        b'{"message": "Too Many Requests"}',
        RateLimitError,
        "rate limit",
    ),
]
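
The fixtures above imply a status-to-exception mapping. A sketch of such a mapping follows; the exception names come from the fixtures, while the RemoteImportError base class, the error_for_status function, the message wording, and treating a non-rate-limit 403 as an authentication failure are assumptions.

```python
from __future__ import annotations

import json


class RemoteImportError(Exception):
    """Hypothetical common base class for the errors below."""


class AuthenticationError(RemoteImportError): ...
class RateLimitError(RemoteImportError): ...
class NotFoundError(RemoteImportError): ...
class ServiceUnavailableError(RemoteImportError): ...


def _detail(body: bytes) -> str:
    """Extract the service's own message from a JSON error body, if any."""
    try:
        payload = json.loads(body)
    except ValueError:
        return ""
    if isinstance(payload, dict):
        return str(payload.get("message") or payload.get("error") or "")
    return ""


def error_for_status(status: int, body: bytes) -> RemoteImportError:
    """Map an HTTP error status and body to one of the exception types."""
    detail = _detail(body)
    # The fixtures report rate limiting as 403 (GitHub) as well as 429.
    if status == 429 or (status == 403 and "rate limit" in detail.lower()):
        return RateLimitError(f"rate limit exceeded: {detail}")
    if status in (401, 403):
        return AuthenticationError(f"authentication failed: {detail}")
    if status == 404:
        return NotFoundError(f"not found: {detail}")
    if status >= 500:
        return ServiceUnavailableError(f"service unavailable (HTTP {status}): {detail}")
    return RemoteImportError(f"HTTP {status}: {detail}")
```

Returning the exception instead of raising it keeps the mapping easy to parametrize; the caller decides where to raise.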

Missing API Key Cases

class MissingAuthFixture(t.NamedTuple):
    test_id: str
    service: str
    env_vars: dict[str, str]
    endpoint_requires_auth: bool
    expected_behavior: str

MISSING_AUTH_FIXTURES = [
    MissingAuthFixture(
        "github-no-token-works",
        "github",
        {},
        False,
        "success_with_lower_rate_limit",
    ),
    MissingAuthFixture(
        "gitlab-search-requires-auth",
        "gitlab",
        {},
        True,
        "raises_authentication_error",
    ),
    MissingAuthFixture(
        "gitlab-groups-no-auth-works",
        "gitlab",
        {},
        False,  # /groups/{id}/projects works without auth
        "success",
    ),
    MissingAuthFixture(
        "codeberg-no-token-works",
        "codeberg",
        {},
        False,
        "success",
    ),
]

@pytest.mark.parametrize(
    list(MissingAuthFixture._fields),
    MISSING_AUTH_FIXTURES,
    ids=[f.test_id for f in MISSING_AUTH_FIXTURES],
)
def test_missing_auth_behavior(...) -> None:
    """Test behavior when API tokens are missing."""

Accurate vs Inaccurate API Responses

class APIResponseFixture(t.NamedTuple):
    test_id: str
    response_json: dict[str, t.Any]
    is_valid: bool
    expected_error: str | None

API_RESPONSE_FIXTURES = [
    APIResponseFixture(
        "github-valid-repo",
        {
            "name": "repo",
            "clone_url": "https://github.com/user/repo.git",
            "html_url": "https://github.com/user/repo",
            "owner": {"login": "user"},
            "fork": False,
            "archived": False,
            "stargazers_count": 10,
            "default_branch": "main",
        },
        True,
        None,
    ),
    APIResponseFixture(
        "github-missing-clone-url",
        {"name": "repo", "owner": {"login": "user"}},
        False,
        "missing required field: clone_url",
    ),
    APIResponseFixture(
        "github-missing-owner",
        {"name": "repo", "clone_url": "..."},
        False,
        "missing required field: owner",
    ),
    APIResponseFixture(
        "gitlab-valid-project",
        {
            "path": "project",
            "http_url_to_repo": "https://gitlab.com/ns/project.git",
            "web_url": "https://gitlab.com/ns/project",
            "namespace": {"path": "ns"},
            "archived": False,
            "star_count": 5,
            "default_branch": "main",
        },
        True,
        None,
    ),
    APIResponseFixture(
        "codeberg-valid-search-response",
        {
            "ok": True,
            "data": [{"name": "repo", "clone_url": "...", "owner": {"login": "u"}}],
        },
        True,
        None,
    ),
    APIResponseFixture(
        "codeberg-search-not-ok",
        {"ok": False, "data": []},
        False,
        "search failed",
    ),
]

Authentication Edge Cases

class AuthFixture(t.NamedTuple):
    test_id: str
    env_vars: dict[str, str]
    cli_token: str | None
    expected_token_used: str | None

AUTH_FIXTURES = [
    AuthFixture("cli-token-overrides-env", {"GITHUB_TOKEN": "env"}, "cli", "cli"),
    AuthFixture("env-token-used-when-no-cli", {"GITHUB_TOKEN": "env"}, None, "env"),
    AuthFixture("no-auth-when-no-token", {}, None, None),
    AuthFixture("codeberg-specific-env", {"CODEBERG_TOKEN": "cb"}, None, "cb"),
    AuthFixture("gitea-fallback-token", {"GITEA_TOKEN": "gt"}, None, "gt"),
]

Pagination Edge Cases

PAGINATION_FIXTURES = [
    ("single-page-under-limit", 50, 100, 1),  # 50 repos, limit 100, 1 API call
    ("exact-page-boundary", 100, 100, 1),     # 100 repos, limit 100, 1 call
    ("multi-page-over-limit", 150, 100, 2),   # 150 repos, limit 100, stops at 100
    ("empty-second-page", 30, 100, 2),        # First page 30, second empty
]

Filter Combinations

FILTER_FIXTURES = [
    ("language-python-only", {"language": "Python"}, 5, 2),  # 5 total, 2 Python
    ("topics-filter", {"topics": ["cli", "tool"]}, 5, 1),
    ("min-stars-filter", {"min_stars": 100}, 5, 3),
    ("combined-filters", {"language": "Python", "min_stars": 50}, 10, 1),
    ("include-archived", {"include_archived": True}, 5, 5),
    ("include-forks", {"include_forks": True}, 5, 5),
]
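
A client-side filter matching these option names might look like the sketch below. The Repo class is a hypothetical stand-in carrying only the fields the filters read; case-insensitive language matching and requiring every requested topic are assumptions the plan does not specify.

```python
from __future__ import annotations

from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    """Hypothetical stand-in with only the fields the filters read."""

    name: str
    language: str | None = None
    topics: tuple[str, ...] = ()
    stars: int = 0
    is_fork: bool = False
    is_archived: bool = False


def keep_repo(
    repo: Repo,
    *,
    language: str | None = None,
    topics: Iterable[str] = (),
    min_stars: int = 0,
    include_archived: bool = False,
    include_forks: bool = False,
) -> bool:
    """Return True when the repo passes every active filter."""
    if repo.is_archived and not include_archived:
        return False
    if repo.is_fork and not include_forks:
        return False
    if language is not None and (repo.language or "").lower() != language.lower():
        return False
    if repo.stars < min_stars:
        return False
    # Assumption: every requested topic must be present on the repo.
    return set(topics) <= set(repo.topics)
```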

AWS CodeCommit-Specific Tests

class CodeCommitFixture(t.NamedTuple):
    test_id: str
    list_repos_output: str  # JSON output from aws codecommit list-repositories
    get_repo_output: str    # JSON output from aws codecommit get-repository
    expected_count: int
    expected_names: list[str]

CODECOMMIT_FIXTURES = [
    CodeCommitFixture(
        "single-repo",
        '{"repositories": [{"repositoryName": "MyRepo", "repositoryId": "abc123"}]}',
        '{"repositoryMetadata": {"repositoryName": "MyRepo", "cloneUrlHttp": "https://git-codecommit.us-east-1.amazonaws.com/v1/repos/MyRepo"}}',
        1,
        ["MyRepo"],
    ),
    CodeCommitFixture(
        "multiple-repos",
        '{"repositories": [{"repositoryName": "Repo1"}, {"repositoryName": "Repo2"}]}',
        ...,
        2,
        ["Repo1", "Repo2"],
    ),
    CodeCommitFixture(
        "empty-account",
        '{"repositories": []}',
        "",
        0,
        [],
    ),
]
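
The two-step CodeCommit flow (list names, then fetch metadata in batches of up to 25) could be prepared as sketched below. The helper names are hypothetical, and the block only builds the argument list for the AWS CLI without running it.

```python
from __future__ import annotations

import json
from collections.abc import Iterator


def repo_names(list_output: str) -> list[str]:
    """Extract repository names from `aws codecommit list-repositories` JSON."""
    return [repo["repositoryName"] for repo in json.loads(list_output)["repositories"]]


def chunked(names: list[str], size: int = 25) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` names."""
    for start in range(0, len(names), size):
        yield names[start : start + size]


def batch_get_argv(names: list[str]) -> list[str]:
    """Build the argv for one batch-get-repositories call (not executed here)."""
    return ["aws", "codecommit", "batch-get-repositories", "--repository-names", *names]
```

Each argv could then be passed to subprocess.run with capture_output=True, and the returned JSON parsed for clone URLs.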

class CodeCommitErrorFixture(t.NamedTuple):
    test_id: str
    returncode: int
    stderr: str
    expected_exception: type[Exception]
    expected_message_contains: str

CODECOMMIT_ERROR_FIXTURES = [
    CodeCommitErrorFixture(
        "aws-cli-not-found",
        127,
        "aws: command not found",
        DependencyError,
        "AWS CLI not installed",
    ),
    CodeCommitErrorFixture(
        "aws-credentials-missing",
        255,
        "Unable to locate credentials",
        AuthenticationError,
        "credentials",
    ),
    CodeCommitErrorFixture(
        "invalid-region",
        255,
        "Could not connect to the endpoint URL",
        ConfigurationError,
        "region",
    ),
]

CLI Integration Tests

tests/cli/test_import_repos.py

class ImportCLIFixture(t.NamedTuple):
    test_id: str
    cli_args: list[str]
    mock_repos: list[dict[str, t.Any]]
    expected_exit_code: int
    expected_output_contains: list[str]
    expected_config_repos: int

IMPORT_CLI_FIXTURES = [
    ImportCLIFixture(
        test_id="basic-user-import-dry-run",
        cli_args=["import", "github", "testuser", "-w", "~/test", "--dry-run"],
        mock_repos=[...],
        expected_exit_code=0,
        expected_output_contains=["Found", "repositories", "Dry run"],
        expected_config_repos=0,
    ),
    ImportCLIFixture(
        test_id="missing-workspace-fails",
        cli_args=["import", "github", "testuser"],  # No -w
        mock_repos=[],
        expected_exit_code=2,  # argparse error
        expected_output_contains=["--workspace"],
        expected_config_repos=0,
    ),
    ImportCLIFixture(
        test_id="json-output-format",
        cli_args=["import", "github", "testuser", "-w", "~/test", "--json"],
        mock_repos=[...],
        expected_exit_code=0,
        expected_output_contains=['"name":', '"clone_url":'],
        expected_config_repos=0,
    ),
    ImportCLIFixture(
        test_id="gitea-requires-url",
        cli_args=["import", "gitea", "user", "-w", "~/test"],  # No --url
        mock_repos=[],
        expected_exit_code=1,
        expected_output_contains=["--url is required"],
        expected_config_repos=0,
    ),
]

@pytest.mark.parametrize(
    list(ImportCLIFixture._fields),
    IMPORT_CLI_FIXTURES,
    ids=[f.test_id for f in IMPORT_CLI_FIXTURES],
)
def test_import_cli(
    test_id: str,
    cli_args: list[str],
    mock_repos: list[dict[str, t.Any]],
    expected_exit_code: int,
    expected_output_contains: list[str],
    expected_config_repos: int,
    tmp_path: pathlib.Path,
    monkeypatch: pytest.MonkeyPatch,
    capsys: pytest.CaptureFixture[str],
) -> None:
    """Test CLI argument handling and output."""

Coverage Requirements

Target: 90%+ line coverage

Module                          | Required Coverage | Key Areas
--------------------------------|-------------------|-----------------------------------------------
_internal/remotes/base.py       | 95%               | HTTP handling, error mapping, dataclasses
_internal/remotes/github.py     | 90%               | All 3 modes, pagination, filtering
_internal/remotes/gitlab.py     | 90%               | All 3 modes, self-hosted URL handling
_internal/remotes/gitea.py      | 90%               | All 3 modes, service variants
_internal/remotes/codecommit.py | 90%               | AWS CLI subprocess, batch-get, region handling
cli/import_repos.py             | 85%               | Arg parsing, output modes, config writing

Mock Strategy

  1. monkeypatch for:

    • urllib.request.urlopen - All HTTP calls
    • os.environ - Environment variable tests
    • File system operations via tmp_path
  2. Snapshot testing (syrupy) for:

    • JSON/NDJSON output format
    • Human-readable output format
  3. capsys for:

    • Capturing stdout/stderr
    • Verifying colored output

Dependencies

Use stdlib only - no new dependencies needed:

  • urllib.request for HTTP APIs (GitHub, GitLab, Codeberg)
  • subprocess for AWS CLI (CodeCommit)
  • json for parsing responses

External requirement for CodeCommit: AWS CLI must be installed and configured (aws configure).

Critical Files to Reference

  • src/vcspull/cli/discover.py - Pattern for config writing, dry-run, confirmation
  • src/vcspull/cli/search.py - Pattern for output formatting, JSON modes
  • src/vcspull/cli/_output.py - OutputFormatter to reuse
  • tests/cli/test_discover.py - Testing patterns

Verification

  1. Run linting: uv run ruff check . --fix
  2. Run type checking: uv run mypy
  3. Run tests: uv run pytest tests/_internal/remotes/ tests/cli/test_import_repos.py -v
  4. Check coverage: uv run pytest --cov=vcspull._internal.remotes --cov=vcspull.cli.import_repos --cov-report=term-missing
  5. Manual testing:
    # GitHub - dry-run to verify API calls work
    vcspull import github django -w ~/study/python --mode org --dry-run
    
    # JSON output for inspection
    vcspull import github torvalds -w ~/repos/linux --mode user --json | jq
    
    # Full import to config
    vcspull import github pallets -w ~/study/python -f ~/.vcspull.yaml
    
    # AWS CodeCommit (requires AWS CLI configured)
    vcspull import codecommit -w ~/work/aws --region us-east-1 --dry-run
