
🚀 Quick Start Guide

Get your Dataproc MCP Server up and running in under 5 minutes!

Prerequisites

  • Node.js 18.0.0 or higher
  • Google Cloud Project with Dataproc API enabled
  • Service account with appropriate permissions
  • MCP Client (Claude Desktop, Roo, or other MCP-compatible client)

📋 Required GCP APIs

Enable these APIs in your Google Cloud Project:

gcloud services enable dataproc.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable iam.googleapis.com

🔑 Service Account Permissions

Your service account needs these roles:

  • roles/dataproc.editor - For cluster management
  • roles/storage.objectViewer - For accessing job outputs
  • roles/iam.serviceAccountUser - For impersonation (if used)
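If any of these roles are missing, they can be granted with `gcloud`. The sketch below only *prints* the grant commands so you can review them first (pipe the output to `bash` to apply); the project ID and service account email are placeholders, not values from this repo:

```shell
# Print the IAM grant commands for review before applying them.
# PROJECT_ID and SA_EMAIL are placeholders -- substitute your own values.
PROJECT_ID="my-dataproc-project"
SA_EMAIL="dataproc-sa@${PROJECT_ID}.iam.gserviceaccount.com"

for role in roles/dataproc.editor roles/storage.objectViewer roles/iam.serviceAccountUser; do
  echo gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SA_EMAIL}" --role="$role"
done
```

Printing the commands before running them makes it easy to audit exactly which bindings will change on the project.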

Installation

Option 1: Clone and Build (Current)

git clone https://github.com/dipseth/dataproc-mcp.git
cd dataproc-mcp
npm install
npm run build

Option 2: NPM Install (Future)

# When published to npm
npm install -g @dataproc/mcp-server

🛠️ Setup

1. Interactive Setup (Recommended)

npm run setup

What this does:

  • ✅ Creates necessary directories (config/, state/, output/)
  • ✅ Guides you through project configuration
  • ✅ Sets up authentication with your service account
  • ✅ Creates MCP client configuration template
  • ✅ Validates your setup

Example interaction:

🚀 Dataproc MCP Server Setup
=============================

📁 Creating necessary directories...
  ✅ Created config/
  ✅ Created state/
  ✅ Created output/

🔧 Setting up default parameters...
Enter your GCP Project ID: my-dataproc-project
Enter your preferred region (default: us-central1): us-central1
Enter your environment name (default: production): production

🔐 Setting up authentication...
Do you want to use service account impersonation? (y/n): y
Enter the service account email to impersonate: dataproc-sa@my-project.iam.gserviceaccount.com
Enter the path to your source service account key file: /path/to/source-key.json

2. Manual Setup (Alternative)

# Create directories
mkdir -p config profiles state output

# Copy configuration templates
cp templates/default-params.json.template config/default-params.json
cp templates/server.json.template config/server.json
cp templates/mcp-settings.json.template mcp-settings.json

# Edit configurations with your details
nano config/default-params.json
nano config/server.json
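Before moving on to validation, a quick shell check (a sketch using only the directories and files named in this guide) reports whether the expected layout is in place:

```shell
# Report which of the expected directories and config files are present.
for d in config profiles state output; do
  [ -d "$d" ] && echo "ok: $d/" || echo "missing: $d/"
done
for f in config/default-params.json config/server.json; do
  [ -f "$f" ] && echo "ok: $f" || echo "missing: $f"
done
```

Any `missing:` line points at a manual-setup step that was skipped.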

3. Validate Setup

npm run validate

This checks:

  • ✅ Directory structure
  • ✅ Configuration files
  • ✅ Service account credentials
  • ✅ Build status
  • ✅ Profile availability

Configuration

Default Parameters (config/default-params.json)

{
  "defaultEnvironment": "production",
  "parameters": [
    {"name": "projectId", "type": "string", "required": true},
    {"name": "region", "type": "string", "required": true, "defaultValue": "us-central1"}
  ],
  "environments": [
    {
      "environment": "production",
      "parameters": {
        "projectId": "your-project-id",
        "region": "us-central1"
      }
    }
  ]
}

Authentication (config/server.json)

{
  "authentication": {
    "impersonateServiceAccount": "your-sa@your-project.iam.gserviceaccount.com",
    "fallbackKeyPath": "/path/to/your/service-account-key.json",
    "preferImpersonation": true,
    "useApplicationDefaultFallback": false
  }
}

Add to MCP Client

Add this configuration to your MCP client settings:

{
  "dataproc-server": {
    "command": "node",
    "args": ["/path/to/dataproc-mcp-server/build/index.js"],
    "disabled": false,
    "timeout": 60,
    "alwaysAllow": ["*"],
    "env": {
      "LOG_LEVEL": "error"
    }
  }
}
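A common failure mode is an `args` path that doesn't point at a real file. This quick check (the path below is the placeholder from the example above, not a real install location) confirms the compiled entry point exists before you restart your MCP client:

```shell
# Verify the compiled entry point referenced by the MCP client config exists.
SERVER_JS="/path/to/dataproc-mcp-server/build/index.js"   # replace with your real path
if [ -f "$SERVER_JS" ]; then
  echo "found: $SERVER_JS"
else
  echo "not found: $SERVER_JS (fix the 'args' path in your MCP settings)"
fi
```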

Validation

Verify your setup:

npm run validate

Test with MCP Inspector

npm run inspector

First Cluster

Once configured, you can create your first cluster:

# Using the MCP client or inspector
{
  "tool": "start_dataproc_cluster",
  "arguments": {
    "clusterName": "my-first-cluster"
  }
}

The server will automatically use your configured project ID and region!

🎯 Common Use Cases

1. Quick Data Analysis Cluster

Create a small cluster for data exploration:

{
  "tool": "create_cluster_from_profile",
  "arguments": {
    "profileName": "development/small",
    "clusterName": "analysis-cluster-001"
  }
}

What this creates:

  • 1 master node (n1-standard-2)
  • 2 worker nodes (n1-standard-2)
  • Preemptible instances for cost savings
  • Standard Spark/Hadoop configuration

2. Production ETL Pipeline

For production workloads with high memory requirements:

{
  "tool": "create_cluster_from_profile",
  "arguments": {
    "profileName": "production/high-memory/analysis",
    "clusterName": "etl-production-cluster"
  }
}

Features:

  • High-memory instances
  • Persistent disks
  • Auto-scaling enabled
  • Production-grade networking

3. Run Hive Query

Execute SQL queries on your data:

{
  "tool": "submit_hive_query",
  "arguments": {
    "clusterName": "analysis-cluster-001",
    "query": "SELECT COUNT(*) FROM my_table WHERE date >= '2024-01-01'"
  }
}

4. Monitor Job Progress

Check the status of running jobs:

{
  "tool": "get_job_status",
  "arguments": {
    "jobId": "your-job-id-here"
  }
}

5. Get Query Results

Retrieve results from completed queries:

{
  "tool": "get_job_results",
  "arguments": {
    "jobId": "your-job-id-here",
    "maxResults": 100
  }
}

Available Tools

The server provides 16 comprehensive tools:

Cluster Management

  • start_dataproc_cluster - Create a new cluster
  • list_clusters - List all clusters
  • get_cluster - Get cluster details
  • delete_cluster - Delete a cluster

Job Execution

  • submit_hive_query - Run Hive queries
  • submit_dataproc_job - Submit any Dataproc job
  • get_job_status - Check job status
  • get_job_results - Get job results

Profile Management

  • create_cluster_from_profile - Use predefined profiles
  • list_profiles - See available profiles
  • get_profile - Get profile details

And more!

🔧 Troubleshooting

Common Issues & Solutions

1. Authentication Problems

Error: Authentication failed or Permission denied

Solutions:

# Check service account permissions
gcloud projects get-iam-policy YOUR_PROJECT_ID

# Verify API is enabled
gcloud services list --enabled | grep dataproc

# Test authentication
gcloud auth application-default login

Required permissions:

  • dataproc.clusters.create
  • dataproc.clusters.delete
  • dataproc.jobs.create
  • compute.instances.create

2. Profile Not Found

Error: Profile 'development/small' not found

Solutions:

# List available profiles
npm run validate

# Check profile directory
ls -la profiles/

# Verify profile syntax
cat profiles/development/small.yaml

3. Cluster Creation Fails

Error: Cluster creation failed or Quota exceeded

Solutions:

# Check quotas
gcloud compute project-info describe --project=YOUR_PROJECT

# Verify region availability
gcloud compute zones list --filter="region:us-central1"

# Check firewall rules
gcloud compute firewall-rules list

4. Build Issues

Error: TypeScript compilation errors

Solutions:

# Clean and rebuild
rm -rf build/ node_modules/
npm install
npm run build

# Check Node.js version
node --version  # Should be >= 18.0.0

# Update dependencies
npm update
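The Node.js version check above can be automated. This sketch compares the major version against the 18.x minimum; the version string is hardcoded as an example, so substitute `$(node --version)` in practice:

```shell
# Compare a Node.js version string against the 18.x minimum.
NODE_VERSION="v18.17.1"          # example value; use NODE_VERSION="$(node --version)"
MAJOR="${NODE_VERSION#v}"        # strip the leading "v"
MAJOR="${MAJOR%%.*}"             # keep only the major component
if [ "$MAJOR" -ge 18 ]; then
  echo "Node.js version OK ($NODE_VERSION)"
else
  echo "Node.js $NODE_VERSION is too old; 18.0.0 or higher is required"
fi
```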

5. Rate Limiting

Error: Rate limit exceeded

Solutions:

# Wait for rate limit reset (1 minute)
# Or adjust rate limits in configuration

# Check current limits
grep -r "rate" config/

6. Network Connectivity

Error: Connection timeout or Network unreachable

Solutions:

# Test connectivity
curl -I https://dataproc.googleapis.com/

# Check proxy settings
echo $HTTP_PROXY $HTTPS_PROXY

# Verify DNS resolution
nslookup dataproc.googleapis.com

Debug Mode

Enable detailed logging for troubleshooting:

# Set debug log level
export LOG_LEVEL=debug

# Run with verbose output
npm start 2>&1 | tee debug.log

Configuration Validation

Run comprehensive validation:

npm run validate

What it checks:

  • ✅ Node.js version compatibility
  • ✅ Required dependencies
  • ✅ Directory structure
  • ✅ Configuration file syntax
  • ✅ Service account credentials
  • ✅ Profile availability
  • ✅ Build status

Emergency Procedures

Stop All Clusters

# List all clusters
{
  "tool": "list_clusters",
  "arguments": {}
}

# Stop specific cluster
{
  "tool": "delete_cluster",
  "arguments": {
    "clusterName": "your-cluster-name"
  }
}

# Emergency stop all server instances
npm run stop

Reset Configuration

# Backup current config
cp -r config/ config.backup/

# Reset to defaults
rm -rf config/
npm run setup
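To keep more than one restore point, the backup step can use a timestamped directory instead of a single `config.backup/` (guarded so it is a no-op when `config/` is absent):

```shell
# Back up config/ to a timestamped directory before resetting.
STAMP="$(date +%Y%m%d-%H%M%S)"
if [ -d config ]; then
  cp -r config/ "config.backup-${STAMP}/"
  echo "backed up to config.backup-${STAMP}/"
else
  echo "no config/ directory to back up"
fi
```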

Get Help

Next Steps

Happy clustering! 🎉