Get your Dataproc MCP Server up and running in under 5 minutes!
- Node.js 18.0.0 or higher
- Google Cloud Project with Dataproc API enabled
- Service account with appropriate permissions
- MCP Client (Claude Desktop, Roo, or other MCP-compatible client)
Enable these APIs in your Google Cloud Project:

```bash
gcloud services enable dataproc.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable iam.googleapis.com
```

Your service account needs these roles:

- `roles/dataproc.editor` - For cluster management
- `roles/storage.objectViewer` - For accessing job outputs
- `roles/iam.serviceAccountUser` - For impersonation (if used)
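If the roles are not yet bound, they can be granted with `gcloud`. This is a minimal sketch; the project ID and service account email below are placeholders, so substitute your own values.

```shell
# Sketch: bind the three roles above to your service account.
# PROJECT_ID and SA_EMAIL are placeholders; substitute your own values.
PROJECT_ID="my-dataproc-project"
SA_EMAIL="dataproc-sa@${PROJECT_ID}.iam.gserviceaccount.com"

for ROLE in roles/dataproc.editor roles/storage.objectViewer roles/iam.serviceAccountUser; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="$ROLE"
done
```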
```bash
git clone https://github.com/dipseth/dataproc-mcp.git
cd dataproc-mcp
npm install
npm run build
```

```bash
# When published to npm
npm install -g @dataproc/mcp-server
```

Then run the guided setup:

```bash
npm run setup
```

What this does:
- ✅ Creates necessary directories (`config/`, `state/`, `output/`)
- ✅ Guides you through project configuration
- ✅ Sets up authentication with your service account
- ✅ Creates MCP client configuration template
- ✅ Validates your setup
Example interaction:
```text
🚀 Dataproc MCP Server Setup
=============================

📁 Creating necessary directories...
✅ Created config/
✅ Created state/
✅ Created output/

🔧 Setting up default parameters...
Enter your GCP Project ID: my-dataproc-project
Enter your preferred region (default: us-central1): us-central1
Enter your environment name (default: production): production

🔐 Setting up authentication...
Do you want to use service account impersonation? (y/n): y
Enter the service account email to impersonate: dataproc-sa@my-project.iam.gserviceaccount.com
Enter the path to your source service account key file: /path/to/source-key.json
```
```bash
# Create directories
mkdir -p config profiles state output

# Copy configuration templates
cp templates/default-params.json.template config/default-params.json
cp templates/server.json.template config/server.json
cp templates/mcp-settings.json.template mcp-settings.json

# Edit configurations with your details
nano config/default-params.json
nano config/server.json
```

Then validate:

```bash
npm run validate
```

This checks:
- ✅ Directory structure
- ✅ Configuration files
- ✅ Service account credentials
- ✅ Build status
- ✅ Profile availability
`config/default-params.json`:

```json
{
  "defaultEnvironment": "production",
  "parameters": [
    {"name": "projectId", "type": "string", "required": true},
    {"name": "region", "type": "string", "required": true, "defaultValue": "us-central1"}
  ],
  "environments": [
    {
      "environment": "production",
      "parameters": {
        "projectId": "your-project-id",
        "region": "us-central1"
      }
    }
  ]
}
```

`config/server.json`:

```json
{
  "authentication": {
    "impersonateServiceAccount": "your-sa@your-project.iam.gserviceaccount.com",
    "fallbackKeyPath": "/path/to/your/service-account-key.json",
    "preferImpersonation": true,
    "useApplicationDefaultFallback": false
  }
}
```

Add this configuration to your MCP client settings:
```json
{
  "dataproc-server": {
    "command": "node",
    "args": ["/path/to/dataproc-mcp-server/build/index.js"],
    "disabled": false,
    "timeout": 60,
    "alwaysAllow": ["*"],
    "env": {
      "LOG_LEVEL": "error"
    }
  }
}
```

Verify your setup:

```bash
npm run validate
npm run inspector
```

Once configured, you can create your first cluster:
Using the MCP client or inspector:

```json
{
  "tool": "start_dataproc_cluster",
  "arguments": {
    "clusterName": "my-first-cluster"
  }
}
```

The server will automatically use your configured project ID and region!
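To double-check the result outside the MCP client, the cluster can also be inspected with the gcloud CLI. This is a sketch: the region assumes the `us-central1` default chosen during setup, and the cluster name matches the example above.

```shell
# Cross-check outside the MCP client. REGION assumes the setup default;
# CLUSTER matches the example cluster created above.
REGION="us-central1"
CLUSTER="my-first-cluster"
gcloud dataproc clusters describe "$CLUSTER" --region="$REGION"
```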
Create a small cluster for data exploration:
```json
{
  "tool": "create_cluster_from_profile",
  "arguments": {
    "profileName": "development/small",
    "clusterName": "analysis-cluster-001"
  }
}
```

What this creates:
- 1 master node (n1-standard-2)
- 2 worker nodes (n1-standard-2)
- Preemptible instances for cost savings
- Standard Spark/Hadoop configuration
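For comparison, here is a rough gcloud equivalent of what this profile provisions. This is a sketch inferred from the list above, not the profile itself: the actual profile YAML is authoritative, and Dataproc secondary workers are preemptible by default.

```shell
# Sketch only: approximate gcloud equivalent of the development/small profile.
# Machine types and counts are taken from the list above; the profile YAML
# in profiles/ is the real definition.
CLUSTER="analysis-cluster-001"
REGION="us-central1"
MACHINE_TYPE="n1-standard-2"

gcloud dataproc clusters create "$CLUSTER" \
  --region="$REGION" \
  --master-machine-type="$MACHINE_TYPE" \
  --worker-machine-type="$MACHINE_TYPE" \
  --num-workers=2 \
  --num-secondary-workers=2
```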
For production workloads with high memory requirements:
```json
{
  "tool": "create_cluster_from_profile",
  "arguments": {
    "profileName": "production/high-memory/analysis",
    "clusterName": "etl-production-cluster"
  }
}
```

Features:
- High-memory instances
- Persistent disks
- Auto-scaling enabled
- Production-grade networking
Execute SQL queries on your data:
```json
{
  "tool": "submit_hive_query",
  "arguments": {
    "clusterName": "analysis-cluster-001",
    "query": "SELECT COUNT(*) FROM my_table WHERE date >= '2024-01-01'"
  }
}
```

Check the status of running jobs:
```json
{
  "tool": "get_job_status",
  "arguments": {
    "jobId": "your-job-id-here"
  }
}
```

Retrieve results from completed queries:
```json
{
  "tool": "get_job_results",
  "arguments": {
    "jobId": "your-job-id-here",
    "maxResults": 100
  }
}
```

The server provides 16 comprehensive tools:
Cluster management:

- `start_dataproc_cluster` - Create a new cluster
- `list_clusters` - List all clusters
- `get_cluster` - Get cluster details
- `delete_cluster` - Delete a cluster

Job execution:

- `submit_hive_query` - Run Hive queries
- `submit_dataproc_job` - Submit any Dataproc job
- `get_job_status` - Check job status
- `get_job_results` - Get job results

Profile management:

- `create_cluster_from_profile` - Use predefined profiles
- `list_profiles` - See available profiles
- `get_profile` - Get profile details
Error: `Authentication failed` or `Permission denied`

Solutions:

```bash
# Check service account permissions
gcloud projects get-iam-policy YOUR_PROJECT_ID

# Verify API is enabled
gcloud services list --enabled | grep dataproc

# Test authentication
gcloud auth application-default login
```

Required permissions:

- `dataproc.clusters.create`
- `dataproc.clusters.delete`
- `dataproc.jobs.create`
- `compute.instances.create`
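To confirm which of these permissions your current credentials actually hold, one option is the Cloud Resource Manager `testIamPermissions` endpoint. This is a sketch; `PROJECT_ID` is a placeholder, and the response lists only the permissions you actually have.

```shell
# Sketch: ask Resource Manager which of the required permissions the current
# application-default credentials hold. PROJECT_ID is a placeholder.
PROJECT_ID="my-dataproc-project"
TOKEN=$(gcloud auth application-default print-access-token 2>/dev/null)

curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "https://cloudresourcemanager.googleapis.com/v1/projects/${PROJECT_ID}:testIamPermissions" \
  -d '{"permissions": ["dataproc.clusters.create", "dataproc.jobs.create"]}'
```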
Error: `Profile 'development/small' not found`

Solutions:

```bash
# List available profiles
npm run validate

# Check profile directory
ls -la profiles/

# Verify profile syntax
cat profiles/development/small.yaml
```

Error: `Cluster creation failed` or `Quota exceeded`
Solutions:

```bash
# Check quotas
gcloud compute project-info describe --project=YOUR_PROJECT

# Verify region availability
gcloud compute zones list --filter="region:us-central1"

# Check firewall rules
gcloud compute firewall-rules list
```

Error: `TypeScript compilation errors`
Solutions:

```bash
# Clean and rebuild
rm -rf build/ node_modules/
npm install
npm run build

# Check Node.js version
node --version  # Should be >= 18.0.0

# Update dependencies
npm update
```

Error: `Rate limit exceeded`
Solutions:

```bash
# Wait for rate limit reset (1 minute)
# Or adjust rate limits in configuration

# Check current limits
grep -r "rate" config/
```

Error: `Connection timeout` or `Network unreachable`
Solutions:

```bash
# Test connectivity
curl -I https://dataproc.googleapis.com/

# Check proxy settings
echo $HTTP_PROXY $HTTPS_PROXY

# Verify DNS resolution
nslookup dataproc.googleapis.com
```

Enable detailed logging for troubleshooting:
```bash
# Set debug log level
export LOG_LEVEL=debug

# Run with verbose output
npm start 2>&1 | tee debug.log
```

Run comprehensive validation:

```bash
npm run validate
```

What it checks:
- ✅ Node.js version compatibility
- ✅ Required dependencies
- ✅ Directory structure
- ✅ Configuration file syntax
- ✅ Service account credentials
- ✅ Profile availability
- ✅ Build status
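For a quick manual spot-check of the first and third items, here is a small standalone shell sketch. It is not part of the project's scripts; `npm run validate` remains the authoritative check.

```shell
# Standalone preflight sketch: check the Node.js major version and the
# expected directory layout. npm run validate is the real check.
REQUIRED_NODE_MAJOR=18
NODE_MAJOR=$(node --version 2>/dev/null | sed 's/^v//' | cut -d. -f1)
if [ "${NODE_MAJOR:-0}" -ge "$REQUIRED_NODE_MAJOR" ]; then
  echo "Node.js OK (v${NODE_MAJOR})"
else
  echo "Node.js missing or too old: need >= ${REQUIRED_NODE_MAJOR}" >&2
fi

for DIR in config state output profiles; do
  if [ -d "$DIR" ]; then echo "Found $DIR/"; else echo "Missing $DIR/" >&2; fi
done
```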
List all clusters:

```json
{
  "tool": "list_clusters",
  "arguments": {}
}
```

Stop a specific cluster:

```json
{
  "tool": "delete_cluster",
  "arguments": {
    "clusterName": "your-cluster-name"
  }
}
```

Emergency stop all server instances:

```bash
npm run stop
```

Reset your configuration:

```bash
# Backup current config
cp -r config/ config.backup/

# Reset to defaults
rm -rf config/
npm run setup
```

- Explore the example profiles
- Read the Configuration Guide
- Check out the Production Readiness Plan document in the project root
Happy clustering! 🎉