Commit d86c35e

Merge pull request #155 from firecrawl/feat/json-format-default
feat(scrape): make JSON format the default, markdown for full content only
2 parents 2269638 + ec42a7a commit d86c35e

2 files changed: +220 additions, −53 deletions
README.md

Lines changed: 168 additions & 22 deletions
@@ -310,23 +310,30 @@ The server utilizes Firecrawl's built-in rate limiting and batch processing capa
 Use this guide to select the right tool for your task:
 
 - **If you know the exact URL(s) you want:**
-  - For one: use **scrape**
+  - For one: use **scrape** (with JSON format for structured data)
   - For many: use **batch_scrape**
 - **If you need to discover URLs on a site:** use **map**
 - **If you want to search the web for info:** use **search**
-- **If you want to extract structured data:** use **extract**
+- **If you need complex research across multiple unknown sources:** use **agent**
 - **If you want to analyze a whole site or section:** use **crawl** (with limits!)
 
 ### Quick Reference Table
 
-| Tool         | Best for                            | Returns         |
-| ------------ | ----------------------------------- | --------------- |
-| scrape       | Single page content                 | markdown/html   |
-| batch_scrape | Multiple known URLs                 | markdown/html[] |
-| map          | Discovering URLs on a site          | URL[]           |
-| crawl        | Multi-page extraction (with limits) | markdown/html[] |
-| search       | Web search for info                 | results[]       |
-| extract      | Structured data from pages          | JSON            |
+| Tool         | Best for                            | Returns                        |
+| ------------ | ----------------------------------- | ------------------------------ |
+| scrape       | Single page content                 | JSON (preferred) or markdown   |
+| batch_scrape | Multiple known URLs                 | JSON (preferred) or markdown[] |
+| map          | Discovering URLs on a site          | URL[]                          |
+| crawl        | Multi-page extraction (with limits) | markdown/html[]                |
+| search       | Web search for info                 | results[]                      |
+| agent        | Complex multi-source research       | JSON (structured data)         |
+
+### Format Selection Guide
+
+When using `scrape` or `batch_scrape`, choose the right format:
+
+- **JSON format (recommended for most cases):** Use when you need specific data from a page. Define a schema based on what you need to extract. This keeps responses small and avoids context window overflow.
+- **Markdown format (use sparingly):** Only when you genuinely need the full page content, such as reading an entire article for summarization or analyzing page structure.
 
 ## Available Tools
 
@@ -342,38 +349,75 @@ Scrape content from a single URL with advanced options.
 
 - Extracting content from multiple pages (use batch_scrape for known URLs, or map + batch_scrape to discover URLs first, or crawl for full page content)
 - When you're unsure which page contains the information (use search)
-- When you need structured data (use extract)
 
 **Common mistakes:**
 
 - Using scrape for a list of URLs (use batch_scrape instead).
+- Using markdown format by default (use JSON format to extract only what you need).
+
+**Choosing the right format:**
+
+- **JSON format (preferred):** For most use cases, use JSON format with a schema to extract only the specific data needed. This keeps responses focused and prevents context window overflow.
+- **Markdown format:** Only when the task genuinely requires full page content (e.g., summarizing an entire article, analyzing page structure).
 
 **Prompt Example:**
 
-> "Get the content of the page at https://example.com."
+> "Get the product details from https://example.com/product."
 
-**Usage Example:**
+**Usage Example (JSON format - preferred):**
 
 ```json
 {
   "name": "firecrawl_scrape",
   "arguments": {
-    "url": "https://example.com",
+    "url": "https://example.com/product",
+    "formats": [{
+      "type": "json",
+      "prompt": "Extract the product information",
+      "schema": {
+        "type": "object",
+        "properties": {
+          "name": { "type": "string" },
+          "price": { "type": "number" },
+          "description": { "type": "string" }
+        },
+        "required": ["name", "price"]
+      }
+    }]
+  }
+}
+```
+
+**Usage Example (markdown format - when full content needed):**
+
+```json
+{
+  "name": "firecrawl_scrape",
+  "arguments": {
+    "url": "https://example.com/article",
     "formats": ["markdown"],
-    "onlyMainContent": true,
-    "waitFor": 1000,
-    "timeout": 30000,
-    "mobile": false,
-    "includeTags": ["article", "main"],
-    "excludeTags": ["nav", "footer"],
-    "skipTlsVerification": false
+    "onlyMainContent": true
   }
 }
 ```
 
+**Usage Example (branding format - extract brand identity):**
+
+```json
+{
+  "name": "firecrawl_scrape",
+  "arguments": {
+    "url": "https://example.com",
+    "formats": ["branding"]
+  }
+}
+```
+
+**Branding format:** Extracts comprehensive brand identity (colors, fonts, typography, spacing, logo, UI components) for design analysis or style replication.
+
 **Returns:**
 
-- Markdown, HTML, or other formats as specified.
+- JSON structured data, markdown, branding profile, or other formats as specified.
 
 ### 2. Batch Scrape Tool (`firecrawl_batch_scrape`)
 
@@ -667,6 +711,108 @@ When using a self-hosted instance, the extraction will use your configured LLM.
 }
 ```
 
+### 9. Agent Tool (`firecrawl_agent`)
+
+Autonomous web research agent. This is a separate AI agent layer that independently browses the internet, searches for information, navigates through pages, and extracts structured data based on your query.
+
+**How it works:**
+
+The agent performs web searches, follows links, reads pages, and gathers data autonomously. This runs **asynchronously** - it returns a job ID immediately, and you poll `firecrawl_agent_status` to check when complete and retrieve results.
+
+**Async workflow:**
+
+1. Call `firecrawl_agent` with your prompt/schema → returns job ID
+2. Do other work while the agent researches (can take minutes for complex queries)
+3. Poll `firecrawl_agent_status` with the job ID to check progress
+4. When status is "completed", the response includes the extracted data
+
+**Best for:**
+
+- Complex research tasks where you don't know the exact URLs
+- Multi-source data gathering
+- Finding information scattered across the web
+- Tasks where you can do other work while waiting for results
+
+**Not recommended for:**
+
+- Simple single-page scraping where you know the URL (use scrape with JSON format - faster and cheaper)
+
+**Arguments:**
+
+- `prompt`: Natural language description of the data you want (required, max 10,000 characters)
+- `urls`: Optional array of URLs to focus the agent on specific pages
+- `schema`: Optional JSON schema for structured output
+
+**Prompt Example:**
+
+> "Find the founders of Firecrawl and their backgrounds"
+
+**Usage Example (start agent, then poll for results):**
+
+```json
+{
+  "name": "firecrawl_agent",
+  "arguments": {
+    "prompt": "Find the top 5 AI startups founded in 2024 and their funding amounts",
+    "schema": {
+      "type": "object",
+      "properties": {
+        "startups": {
+          "type": "array",
+          "items": {
+            "type": "object",
+            "properties": {
+              "name": { "type": "string" },
+              "funding": { "type": "string" },
+              "founded": { "type": "string" }
+            }
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+Then poll with `firecrawl_agent_status` using the returned job ID.
+
+**Usage Example (with URLs - agent focuses on specific pages):**
+
+```json
+{
+  "name": "firecrawl_agent",
+  "arguments": {
+    "urls": ["https://docs.firecrawl.dev", "https://firecrawl.dev/pricing"],
+    "prompt": "Compare the features and pricing information from these pages"
+  }
+}
+```
+
+**Returns:**
+
+- Job ID for status checking. Use `firecrawl_agent_status` to poll for results.
+
+### 10. Check Agent Status (`firecrawl_agent_status`)
+
+Check the status of an agent job and retrieve results when complete. Use this to poll for results after starting an agent.
+
+**Polling pattern:** Agent research can take minutes for complex queries. Poll this endpoint periodically (e.g., every 10-30 seconds) until status is "completed" or "failed".
+
+```json
+{
+  "name": "firecrawl_agent_status",
+  "arguments": {
+    "id": "550e8400-e29b-41d4-a716-446655440000"
+  }
+}
+```
+
+**Possible statuses:**
+
+- `processing`: Agent is still researching - check back later
+- `completed`: Research finished - response includes the extracted data
+- `failed`: An error occurred
+
 ## Logging System
 
 The server includes comprehensive logging:
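The start-then-poll workflow documented in this README diff can be sketched as a small client-side loop. This is a sketch under assumptions: `call_tool` and `poll_agent_status` are hypothetical names standing in for however your MCP client invokes tools; only the tool name `firecrawl_agent_status`, its `id` argument, the suggested 10-30 second interval, and the three status values come from the diff.

```python
import time


def poll_agent_status(call_tool, job_id: str,
                      interval_s: float = 15.0,
                      timeout_s: float = 600.0) -> dict:
    """Poll firecrawl_agent_status until the job is 'completed' or 'failed'.

    call_tool(name, arguments) is a hypothetical MCP-client helper that
    returns the tool's response as a dict with at least a "status" key.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = call_tool("firecrawl_agent_status", {"id": job_id})
        if result.get("status") in ("completed", "failed"):
            # Completed responses include the extracted data.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"agent job {job_id} still {result.get('status')} "
                f"after {timeout_s}s"
            )
        time.sleep(interval_s)
```

In practice you would call `firecrawl_agent` first, take the job ID from its response, do other work, and then hand the ID to a loop like this with the 10-30 second interval the docs suggest.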

src/index.ts

Lines changed: 52 additions & 31 deletions
@@ -271,22 +271,12 @@ This is the most powerful, fastest and most reliable scraper tool, if available
 **Not recommended for:** Multiple pages (use batch_scrape), unknown page (use search).
 **Common mistakes:** Using scrape for a list of URLs (use batch_scrape instead). If batch scrape doesnt work, just use scrape and call it multiple times.
 **Other Features:** Use 'branding' format to extract brand identity (colors, fonts, typography, spacing, UI components) for design analysis or style replication.
-**Prompt Example:** "Get the content of the page at https://example.com."
-**Usage Example:**
-\`\`\`json
-{
-  "name": "firecrawl_scrape",
-  "arguments": {
-    "url": "https://example.com",
-    "formats": ["markdown"],
-    "maxAge": 172800000
-  }
-}
-\`\`\`
-**Performance:** Add maxAge parameter for 500% faster scrapes using cached data.
-**Returns:** Markdown, HTML, or other formats as specified.
-**Token Limit Issues:** If you encounter "tokens exceeds maximum allowed tokens" errors or the scraped content is too large, use the JSON format with a schema to extract only the specific data you need. This dramatically reduces output size by returning structured data instead of the full page content.
-**JSON Format Example:**
+
+**IMPORTANT - Choosing the right format:**
+- **Use JSON format (default):** For most use cases, use the JSON format with a schema to extract only the specific data needed. This keeps responses small and focused. Analyze the user's query to determine what fields to extract.
+- **Use markdown format (rare):** Only when the task genuinely requires the full page content, such as: reading an entire article for summarization, analyzing the full structure of a page, or when the user needs to see all the content. This is uncommon.
+
+**Usage Example (JSON format - preferred):**
 \`\`\`json
 {
   "name": "firecrawl_scrape",
@@ -308,6 +298,30 @@ This is the most powerful, fastest and most reliable scraper tool, if available
   }
 }
 \`\`\`
+**Usage Example (markdown format - when full content needed):**
+\`\`\`json
+{
+  "name": "firecrawl_scrape",
+  "arguments": {
+    "url": "https://example.com/article",
+    "formats": ["markdown"],
+    "onlyMainContent": true
+  }
+}
+\`\`\`
+**Usage Example (branding format - extract brand identity):**
+\`\`\`json
+{
+  "name": "firecrawl_scrape",
+  "arguments": {
+    "url": "https://example.com",
+    "formats": ["branding"]
+  }
+}
+\`\`\`
+**Branding format:** Extracts comprehensive brand identity (colors, fonts, typography, spacing, logo, UI components) for design analysis or style replication.
+**Performance:** Add maxAge parameter for 500% faster scrapes using cached data.
+**Returns:** JSON structured data, markdown, branding profile, or other formats as specified.
 ${
   SAFE_MODE
     ? '**Safe Mode:** Read-only content extraction. Interactive actions (click, write, executeJavascript) are disabled for security.'
@@ -646,23 +660,26 @@ Extract structured information from web pages using LLM capabilities. Supports b
 server.addTool({
   name: 'firecrawl_agent',
   description: `
-Autonomous web data gathering agent. Describe what data you want, and the agent searches, navigates, and extracts it from anywhere on the web.
+Autonomous web research agent. This is a separate AI agent layer that independently browses the internet, searches for information, navigates through pages, and extracts structured data based on your query. You describe what you need, and the agent figures out where to find it.
 
-**Best for:** Complex data gathering tasks where you don't know the exact URLs; research tasks requiring multiple sources; finding data in hard-to-reach places.
-**Not recommended for:** Simple single-page scraping (use scrape); when you already know the exact URL (use scrape or extract).
-**Key advantages over extract:**
-- No URLs required - just describe what you need
-- Autonomously searches and navigates the web
-- Faster and more cost-effective for complex tasks
-- Higher reliability for varied queries
+**How it works:** The agent performs web searches, follows links, reads pages, and gathers data autonomously. This runs **asynchronously** - it returns a job ID immediately, and you poll \`firecrawl_agent_status\` to check when complete and retrieve results.
+
+**Async workflow:**
+1. Call \`firecrawl_agent\` with your prompt/schema → returns job ID
+2. Do other work while the agent researches (can take minutes for complex queries)
+3. Poll \`firecrawl_agent_status\` with the job ID to check progress
+4. When status is "completed", the response includes the extracted data
+
+**Best for:** Complex research tasks where you don't know the exact URLs; multi-source data gathering; finding information scattered across the web; tasks where you can do other work while waiting.
+**Not recommended for:** Simple single-page scraping where you know the URL (use scrape with JSON format instead - faster and cheaper).
 
 **Arguments:**
 - prompt: Natural language description of the data you want (required, max 10,000 characters)
 - urls: Optional array of URLs to focus the agent on specific pages
 - schema: Optional JSON schema for structured output
 
 **Prompt Example:** "Find the founders of Firecrawl and their backgrounds"
-**Usage Example (no URLs):**
+**Usage Example (start agent, then poll for results):**
 \`\`\`json
 {
   "name": "firecrawl_agent",
@@ -687,7 +704,9 @@ Autonomous web data gathering agent. Describe what data you want, and the agent
   }
 }
 \`\`\`
-**Usage Example (with URLs):**
+Then poll with \`firecrawl_agent_status\` using the returned job ID.
+
+**Usage Example (with URLs - agent focuses on specific pages):**
 \`\`\`json
 {
   "name": "firecrawl_agent",
@@ -697,7 +716,7 @@ Autonomous web data gathering agent. Describe what data you want, and the agent
   }
 }
 \`\`\`
-**Returns:** Extracted data matching your prompt/schema, plus credits used.
+**Returns:** Job ID for status checking. Use \`firecrawl_agent_status\` to poll for results.
 `,
   parameters: z.object({
     prompt: z.string().min(1).max(10000),
@@ -719,7 +738,7 @@ Autonomous web data gathering agent. Describe what data you want, and the agent
       urls: a.urls as string[] | undefined,
       schema: (a.schema as Record<string, unknown>) || undefined,
     });
-    const res = await (client as any).agent({
+    const res = await (client as any).startAgent({
       ...agentBody,
       origin: ORIGIN,
     });
@@ -730,7 +749,9 @@ Autonomous web data gathering agent. Describe what data you want, and the agent
 server.addTool({
   name: 'firecrawl_agent_status',
   description: `
-Check the status of an agent job.
+Check the status of an agent job and retrieve results when complete. Use this to poll for results after starting an agent with \`firecrawl_agent\`.
+
+**Polling pattern:** Agent research can take minutes for complex queries. Poll this endpoint periodically (e.g., every 10-30 seconds) until status is "completed" or "failed".
 
 **Usage Example:**
 \`\`\`json
@@ -742,8 +763,8 @@ Check the status of an agent job.
 }
 \`\`\`
 **Possible statuses:**
-- processing: Agent is still working
-- completed: Extraction finished successfully
+- processing: Agent is still researching - check back later
+- completed: Research finished - response includes the extracted data
 - failed: An error occurred
 
 **Returns:** Status, progress, and results (if completed) of the agent job.
