---
layout: default
title: Quick Start Guide
description: Get up and running with the Dataproc MCP Server in just 5 minutes
permalink: /QUICK_START/
---

# Quick Start Guide 🚀

Get up and running with the Dataproc MCP Server in just 5 minutes!

## Prerequisites

- **Node.js 18+** - download from [nodejs.org](https://nodejs.org/)
- **Google Cloud Project** with the Dataproc API enabled
- **Authentication** - a service account key or the gcloud CLI

## 🎯 5-Minute Setup

### Step 1: Install the Package

```bash
# Install globally for easy access
npm install -g @dataproc/mcp-server

# Or install locally in your project
npm install @dataproc/mcp-server
```

### Step 2: Quick Setup

```bash
# Run the interactive setup
dataproc-mcp --setup

# This will create:
# - config/server.json (server configuration)
# - config/default-params.json (default parameters)
# - profiles/ (cluster profile directory)
```

### Step 3: Configure Authentication

For detailed authentication setup, refer to the Authentication Implementation Guide.

### Step 4: Configure Your Project

Edit `config/default-params.json`:

```json
{
  "defaultEnvironment": "development",
  "parameters": [
    {"name": "projectId", "type": "string", "required": true},
    {"name": "region", "type": "string", "required": true, "defaultValue": "us-central1"}
  ],
  "environments": [
    {
      "environment": "development",
      "parameters": {
        "projectId": "your-project-id",
        "region": "us-central1"
      }
    }
  ]
}
```
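As a quick sanity check, you can confirm that every required parameter has a value in the default environment. This is a sketch that assumes `python3` is on your PATH; it runs against an inline copy of the JSON above so it works anywhere, but in practice you would point the script at `config/default-params.json`:

```shell
# Write an inline copy of the config shown above
# (use config/default-params.json in practice)
cat > /tmp/default-params.json <<'EOF'
{
  "defaultEnvironment": "development",
  "parameters": [
    {"name": "projectId", "type": "string", "required": true},
    {"name": "region", "type": "string", "required": true, "defaultValue": "us-central1"}
  ],
  "environments": [
    {
      "environment": "development",
      "parameters": {"projectId": "your-project-id", "region": "us-central1"}
    }
  ]
}
EOF

# Report any required parameter missing from the default environment
python3 - /tmp/default-params.json <<'PY'
import json, sys

cfg = json.load(open(sys.argv[1]))
required = [p["name"] for p in cfg["parameters"] if p.get("required")]
env = next(e for e in cfg["environments"]
           if e["environment"] == cfg["defaultEnvironment"])
missing = [n for n in required if n not in env["parameters"]]
print("missing:", ", ".join(missing) if missing else "none")
PY
```

If this prints anything other than `missing: none`, fill in the listed parameters before starting the server.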

### Step 5: Optional - Enable Semantic Search

For enhanced natural language queries (optional):

```bash
# Install and start the Qdrant vector database
# (maps host port 6334 to Qdrant's REST port 6333)
docker run -p 6334:6333 qdrant/qdrant

# Verify Qdrant is running
curl http://localhost:6334/health
```

**Benefits of Semantic Search:**

- Natural language cluster queries: "show me clusters with pip packages"
- Intelligent data extraction and filtering
- Enhanced search capabilities with confidence scoring

> **Note:** This is completely optional - all core functionality works without Qdrant.

### Step 6: Start the Server

```bash
# Start the MCP server
dataproc-mcp

# Or run directly with Node.js
node /path/to/dataproc-mcp/build/index.js
```
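Under the hood, MCP servers communicate over stdio using JSON-RPC 2.0, and the first message a client sends is an `initialize` request. The snippet below only builds and inspects such a request for illustration (the field shape follows the MCP specification; the client name and version are placeholder values). Piping `$INIT` into `dataproc-mcp` would begin a real handshake:

```shell
# A minimal MCP "initialize" request, as any MCP client sends on startup
INIT='{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"quickstart","version":"0.0.1"}}}'

# Parse the request and print its method field to confirm it is well-formed JSON
echo "$INIT" | python3 -c 'import json, sys; print(json.load(sys.stdin)["method"])'
```

In normal use you never craft these messages by hand; your MCP client (Claude Desktop, Roo, etc.) does this for you.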

๐ŸŒ Claude.ai Web App Integration

NEW: Full Claude.ai compatibility is now available!

For Claude.ai web app integration, see our dedicated guides:

Key Features:

  • โœ… All 22 MCP tools available in Claude.ai
  • โœ… HTTPS tunneling with Cloudflare
  • โœ… OAuth authentication with GitHub
  • โœ… Secure WebSocket connections

## 🔧 MCP Client Integration

### Claude Desktop

Add to your Claude Desktop configuration:

**File:** `~/Library/Application Support/Claude/claude_desktop_config.json`

```json
{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": [
        "@dipseth/dataproc-mcp-server@latest"
      ],
      "env": {
        "LOG_LEVEL": "info",
        "DATAPROC_CONFIG_PATH": "/path/to/your/config/server.json"
      }
    }
  }
}
```

### Roo (VS Code)

Add to your Roo MCP settings:

**File:** `.roo/mcp.json`

```json
{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": [
        "@dipseth/dataproc-mcp-server@latest"
      ],
      "env": {
        "LOG_LEVEL": "info",
        "DATAPROC_CONFIG_PATH": "/path/to/your/config/server.json"
      },
      "alwaysAllow": []
    }
  }
}
```

## 🎮 First Commands

Once connected, try these commands in your MCP client:

### List Available Tools

```
What Dataproc tools are available?
```

### Create a Simple Cluster

```
Create a small Dataproc cluster named "test-cluster" in my project
```

### List Clusters

```
Show me all my Dataproc clusters
```

### Submit a Spark Job

```
Submit a Spark job to process data from gs://my-bucket/data.csv
```

### Cancel a Running Job

```
Cancel the job with ID "my-long-running-job-12345"
```

### Monitor Job Status

```
Check the status of job "my-job-67890"
```

### Try Semantic Search (if Qdrant enabled)

```
Show me clusters with machine learning packages installed
Find clusters using high-memory configurations
```

## 📋 Example Cluster Profile

Create a custom cluster profile in `profiles/my-cluster.yaml`:

```yaml
my-project-dev-cluster:
  region: us-central1
  tags:
    - development
    - testing
  labels:
    environment: dev
    team: data-engineering
  cluster_config:
    master_config:
      num_instances: 1
      machine_type_uri: n1-standard-4
      disk_config:
        boot_disk_type: pd-standard
        boot_disk_size_gb: 100
    worker_config:
      num_instances: 2
      machine_type_uri: n1-standard-4
      disk_config:
        boot_disk_type: pd-standard
        boot_disk_size_gb: 100
      is_preemptible: true  # Cost savings for dev
    software_config:
      image_version: 2.1.1-debian10
      optional_components:
        - JUPYTER
      properties:
        dataproc:dataproc.allow.zero.workers: "true"
    lifecycle_config:
      idle_delete_ttl:
        seconds: 1800  # 30 minutes
```
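The dev profile above leans on preemptible workers and a short idle TTL to save cost; for production you would typically flip both. Here is a sketch of a companion profile following the same schema (the profile name, machine types, and TTL are illustrative assumptions, not recommended values):

```yaml
my-project-prod-cluster:
  region: us-central1
  labels:
    environment: prod
    team: data-engineering
  cluster_config:
    master_config:
      num_instances: 1
      machine_type_uri: n1-standard-8
    worker_config:
      num_instances: 4
      machine_type_uri: n1-standard-8
      is_preemptible: false  # Stable capacity for production workloads
    software_config:
      image_version: 2.1.1-debian10
    lifecycle_config:
      idle_delete_ttl:
        seconds: 7200  # 2 hours - more forgiving than the dev profile
```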

๐Ÿ” Verification

Test Your Setup

# Check if the server starts correctly
dataproc-mcp --test

# Verify authentication
dataproc-mcp --verify-auth

# List available profiles
dataproc-mcp --list-profiles

Health Check

# Run comprehensive health check
npm run pre-flight  # If installed from source

# Or basic connectivity test
curl -X POST http://localhost:3000/health  # If running as HTTP server

## 🚨 Troubleshooting

### Common Issues

#### Authentication Errors

```bash
# Check your credentials
gcloud auth list
gcloud config list project

# Verify service account permissions
gcloud projects get-iam-policy YOUR_PROJECT_ID
```

#### Permission Errors

```bash
# Enable required APIs
gcloud services enable dataproc.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable storage.googleapis.com
```

#### Connection Issues

```bash
# Check network connectivity
ping google.com

# Verify firewall rules
gcloud compute firewall-rules list
```

### Getting Help

1. **Check the logs:** Look for error messages in the console output
2. **Verify configuration:** Ensure all required fields are filled in
3. **Test authentication:** Run `gcloud auth application-default print-access-token`
4. **Check permissions:** Verify your service account has the Dataproc Admin role (`roles/dataproc.admin`)

## 📚 Next Steps

### Learn More

### Advanced Features

- **Multi-environment setup** for dev/staging/production
- **Custom cluster profiles** for different workloads
- **Automated job scheduling** with cron-like syntax
- **Performance monitoring** and alerting
- **Cost optimization** with preemptible instances

### Community

## 🎉 You're Ready!

Your Dataproc MCP Server is now configured and ready to use. Start by creating your first cluster and exploring the available tools through your MCP client.

Happy data processing! 🚀


Need help? Check our testing guide or open an issue.