Commit d86c35e

Merge pull request #155 from firecrawl/feat/json-format-default
feat(scrape): make JSON format the default, markdown for full content only
2 parents 2269638 + ec42a7a commit d86c35e

2 files changed: +220 additions, −53 deletions
README.md

Lines changed: 168 additions & 22 deletions
@@ -310,23 +310,30 @@ The server utilizes Firecrawl's built-in rate limiting and batch processing capa
 Use this guide to select the right tool for your task:
 
 - **If you know the exact URL(s) you want:**
-  - For one: use **scrape**
+  - For one: use **scrape** (with JSON format for structured data)
   - For many: use **batch_scrape**
 - **If you need to discover URLs on a site:** use **map**
 - **If you want to search the web for info:** use **search**
-- **If you want to extract structured data:** use **extract**
+- **If you need complex research across multiple unknown sources:** use **agent**
 - **If you want to analyze a whole site or section:** use **crawl** (with limits!)
 
 ### Quick Reference Table
 
-| Tool         | Best for                            | Returns         |
-| ------------ | ----------------------------------- | --------------- |
-| scrape       | Single page content                 | markdown/html   |
-| batch_scrape | Multiple known URLs                 | markdown/html[] |
-| map          | Discovering URLs on a site          | URL[]           |
-| crawl        | Multi-page extraction (with limits) | markdown/html[] |
-| search       | Web search for info                 | results[]       |
-| extract      | Structured data from pages          | JSON            |
+| Tool         | Best for                            | Returns                        |
+| ------------ | ----------------------------------- | ------------------------------ |
+| scrape       | Single page content                 | JSON (preferred) or markdown   |
+| batch_scrape | Multiple known URLs                 | JSON (preferred) or markdown[] |
+| map          | Discovering URLs on a site          | URL[]                          |
+| crawl        | Multi-page extraction (with limits) | markdown/html[]                |
+| search       | Web search for info                 | results[]                      |
+| agent        | Complex multi-source research       | JSON (structured data)         |
+
+### Format Selection Guide
+
+When using `scrape` or `batch_scrape`, choose the right format:
+
+- **JSON format (recommended for most cases):** Use when you need specific data from a page. Define a schema based on what you need to extract. This keeps responses small and avoids context window overflow.
+- **Markdown format (use sparingly):** Only when you genuinely need the full page content, such as reading an entire article for summarization or analyzing page structure.
 
 ## Available Tools
 
@@ -342,38 +349,75 @@ Scrape content from a single URL with advanced options.
 
 - Extracting content from multiple pages (use batch_scrape for known URLs, or map + batch_scrape to discover URLs first, or crawl for full page content)
 - When you're unsure which page contains the information (use search)
-- When you need structured data (use extract)
 
 **Common mistakes:**
 
 - Using scrape for a list of URLs (use batch_scrape instead).
+- Using markdown format by default (use JSON format to extract only what you need).
+
+**Choosing the right format:**
+
+- **JSON format (preferred):** For most use cases, use JSON format with a schema to extract only the specific data needed. This keeps responses focused and prevents context window overflow.
+- **Markdown format:** Only when the task genuinely requires full page content (e.g., summarizing an entire article, analyzing page structure).
 
 **Prompt Example:**
 
-> "Get the content of the page at https://example.com."
+> "Get the product details from https://example.com/product."
 
-**Usage Example:**
+**Usage Example (JSON format - preferred):**
 
 ```json
 {
   "name": "firecrawl_scrape",
   "arguments": {
-    "url": "https://example.com",
+    "url": "https://example.com/product",
+    "formats": [{
+      "type": "json",
+      "prompt": "Extract the product information",
+      "schema": {
+        "type": "object",
+        "properties": {
+          "name": { "type": "string" },
+          "price": { "type": "number" },
+          "description": { "type": "string" }
+        },
+        "required": ["name", "price"]
+      }
+    }]
+  }
+}
+```
+
+**Usage Example (markdown format - when full content needed):**
+
+```json
+{
+  "name": "firecrawl_scrape",
+  "arguments": {
+    "url": "https://example.com/article",
     "formats": ["markdown"],
-    "onlyMainContent": true,
-    "waitFor": 1000,
-    "timeout": 30000,
-    "mobile": false,
-    "includeTags": ["article", "main"],
-    "excludeTags": ["nav", "footer"],
-    "skipTlsVerification": false
+    "onlyMainContent": true
   }
 }
 ```
 
+**Usage Example (branding format - extract brand identity):**
+
+```json
+{
+  "name": "firecrawl_scrape",
+  "arguments": {
+    "url": "https://example.com",
+    "formats": ["branding"]
+  }
+}
+```
+
+**Branding format:** Extracts comprehensive brand identity (colors, fonts, typography, spacing, logo, UI components) for design analysis or style replication.
+
 **Returns:**
 
-- Markdown, HTML, or other formats as specified.
+- JSON structured data, markdown, branding profile, or other formats as specified.
 
 ### 2. Batch Scrape Tool (`firecrawl_batch_scrape`)
 
@@ -667,6 +711,108 @@ When using a self-hosted instance, the extraction will use your configured LLM.
 }
 ```
 
+### 9. Agent Tool (`firecrawl_agent`)
+
+Autonomous web research agent. This is a separate AI agent layer that independently browses the internet, searches for information, navigates through pages, and extracts structured data based on your query.
+
+**How it works:**
+
+The agent performs web searches, follows links, reads pages, and gathers data autonomously. This runs **asynchronously** - it returns a job ID immediately, and you poll `firecrawl_agent_status` to check when complete and retrieve results.
+
+**Async workflow:**
+
+1. Call `firecrawl_agent` with your prompt/schema → returns job ID
+2. Do other work while the agent researches (can take minutes for complex queries)
+3. Poll `firecrawl_agent_status` with the job ID to check progress
+4. When status is "completed", the response includes the extracted data
+
+**Best for:**
+
+- Complex research tasks where you don't know the exact URLs
+- Multi-source data gathering
+- Finding information scattered across the web
+- Tasks where you can do other work while waiting for results
+
+**Not recommended for:**
+
+- Simple single-page scraping where you know the URL (use scrape with JSON format - faster and cheaper)
+
+**Arguments:**
+
+- `prompt`: Natural language description of the data you want (required, max 10,000 characters)
+- `urls`: Optional array of URLs to focus the agent on specific pages
+- `schema`: Optional JSON schema for structured output
+
+**Prompt Example:**
+
+> "Find the founders of Firecrawl and their backgrounds"
+
+**Usage Example (start agent, then poll for results):**
+
+```json
+{
+  "name": "firecrawl_agent",
+  "arguments": {
+    "prompt": "Find the top 5 AI startups founded in 2024 and their funding amounts",
+    "schema": {
+      "type": "object",
+      "properties": {
+        "startups": {
+          "type": "array",
+          "items": {
+            "type": "object",
+            "properties": {
+              "name": { "type": "string" },
+              "funding": { "type": "string" },
+              "founded": { "type": "string" }
+            }
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+Then poll with `firecrawl_agent_status` using the returned job ID.
+
+**Usage Example (with URLs - agent focuses on specific pages):**
+
+```json
+{
+  "name": "firecrawl_agent",
+  "arguments": {
+    "urls": ["https://docs.firecrawl.dev", "https://firecrawl.dev/pricing"],
+    "prompt": "Compare the features and pricing information from these pages"
+  }
+}
+```
+
+**Returns:**
+
+- Job ID for status checking. Use `firecrawl_agent_status` to poll for results.
+
+### 10. Check Agent Status (`firecrawl_agent_status`)
+
+Check the status of an agent job and retrieve results when complete. Use this to poll for results after starting an agent.
+
+**Polling pattern:** Agent research can take minutes for complex queries. Poll this endpoint periodically (e.g., every 10-30 seconds) until status is "completed" or "failed".
+
+```json
+{
+  "name": "firecrawl_agent_status",
+  "arguments": {
+    "id": "550e8400-e29b-41d4-a716-446655440000"
+  }
+}
+```
+
+**Possible statuses:**
+
+- `processing`: Agent is still researching - check back later
+- `completed`: Research finished - response includes the extracted data
+- `failed`: An error occurred
+
 ## Logging System
 
 The server includes comprehensive logging:
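The start-then-poll workflow documented in this README diff can be sketched as a small client-side loop. This is a sketch under assumptions: `call_tool` and `poll_agent_status` are hypothetical names standing in for however your MCP client invokes tools; only the tool name `firecrawl_agent_status`, its `id` argument, the suggested 10-30 second interval, and the three status values come from the diff.

```python
import time


def poll_agent_status(call_tool, job_id: str,
                      interval_s: float = 15.0,
                      timeout_s: float = 600.0) -> dict:
    """Poll firecrawl_agent_status until the job is 'completed' or 'failed'.

    call_tool(name, arguments) is a hypothetical MCP-client helper that
    returns the tool's response as a dict with at least a "status" key.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = call_tool("firecrawl_agent_status", {"id": job_id})
        if result.get("status") in ("completed", "failed"):
            # Completed responses include the extracted data.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"agent job {job_id} still {result.get('status')} "
                f"after {timeout_s}s"
            )
        time.sleep(interval_s)
```

In practice you would call `firecrawl_agent` first, take the job ID from its response, do other work, and then hand the ID to a loop like this with the 10-30 second interval the docs suggest.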

src/index.ts

Lines changed: 52 additions & 31 deletions
@@ -271,22 +271,12 @@ This is the most powerful, fastest and most reliable scraper tool, if available
 **Not recommended for:** Multiple pages (use batch_scrape), unknown page (use search).
 **Common mistakes:** Using scrape for a list of URLs (use batch_scrape instead). If batch scrape doesnt work, just use scrape and call it multiple times.
 **Other Features:** Use 'branding' format to extract brand identity (colors, fonts, typography, spacing, UI components) for design analysis or style replication.
-**Prompt Example:** "Get the content of the page at https://example.com."
-**Usage Example:**
-\`\`\`json
-{
-  "name": "firecrawl_scrape",
-  "arguments": {
-    "url": "https://example.com",
-    "formats": ["markdown"],
-    "maxAge": 172800000
-  }
-}
-\`\`\`
-**Performance:** Add maxAge parameter for 500% faster scrapes using cached data.
-**Returns:** Markdown, HTML, or other formats as specified.
-**Token Limit Issues:** If you encounter "tokens exceeds maximum allowed tokens" errors or the scraped content is too large, use the JSON format with a schema to extract only the specific data you need. This dramatically reduces output size by returning structured data instead of the full page content.
-**JSON Format Example:**
+
+**IMPORTANT - Choosing the right format:**
+- **Use JSON format (default):** For most use cases, use the JSON format with a schema to extract only the specific data needed. This keeps responses small and focused. Analyze the user's query to determine what fields to extract.
+- **Use markdown format (rare):** Only when the task genuinely requires the full page content, such as: reading an entire article for summarization, analyzing the full structure of a page, or when the user needs to see all the content. This is uncommon.
+
+**Usage Example (JSON format - preferred):**
 \`\`\`json
 {
   "name": "firecrawl_scrape",
@@ -308,6 +298,30 @@ This is the most powerful, fastest and most reliable scraper tool, if available
   }
 }
 \`\`\`
+**Usage Example (markdown format - when full content needed):**
+\`\`\`json
+{
+  "name": "firecrawl_scrape",
+  "arguments": {
+    "url": "https://example.com/article",
+    "formats": ["markdown"],
+    "onlyMainContent": true
+  }
+}
+\`\`\`
+**Usage Example (branding format - extract brand identity):**
+\`\`\`json
+{
+  "name": "firecrawl_scrape",
+  "arguments": {
+    "url": "https://example.com",
+    "formats": ["branding"]
+  }
+}
+\`\`\`
+**Branding format:** Extracts comprehensive brand identity (colors, fonts, typography, spacing, logo, UI components) for design analysis or style replication.
+**Performance:** Add maxAge parameter for 500% faster scrapes using cached data.
+**Returns:** JSON structured data, markdown, branding profile, or other formats as specified.
 ${
   SAFE_MODE
     ? '**Safe Mode:** Read-only content extraction. Interactive actions (click, write, executeJavascript) are disabled for security.'
@@ -646,23 +660,26 @@ Extract structured information from web pages using LLM capabilities. Supports b
 server.addTool({
   name: 'firecrawl_agent',
   description: `
-Autonomous web data gathering agent. Describe what data you want, and the agent searches, navigates, and extracts it from anywhere on the web.
+Autonomous web research agent. This is a separate AI agent layer that independently browses the internet, searches for information, navigates through pages, and extracts structured data based on your query. You describe what you need, and the agent figures out where to find it.
 
-**Best for:** Complex data gathering tasks where you don't know the exact URLs; research tasks requiring multiple sources; finding data in hard-to-reach places.
-**Not recommended for:** Simple single-page scraping (use scrape); when you already know the exact URL (use scrape or extract).
-**Key advantages over extract:**
-- No URLs required - just describe what you need
-- Autonomously searches and navigates the web
-- Faster and more cost-effective for complex tasks
-- Higher reliability for varied queries
+**How it works:** The agent performs web searches, follows links, reads pages, and gathers data autonomously. This runs **asynchronously** - it returns a job ID immediately, and you poll \`firecrawl_agent_status\` to check when complete and retrieve results.
+
+**Async workflow:**
+1. Call \`firecrawl_agent\` with your prompt/schema → returns job ID
+2. Do other work while the agent researches (can take minutes for complex queries)
+3. Poll \`firecrawl_agent_status\` with the job ID to check progress
+4. When status is "completed", the response includes the extracted data
+
+**Best for:** Complex research tasks where you don't know the exact URLs; multi-source data gathering; finding information scattered across the web; tasks where you can do other work while waiting.
+**Not recommended for:** Simple single-page scraping where you know the URL (use scrape with JSON format instead - faster and cheaper).
 
 **Arguments:**
 - prompt: Natural language description of the data you want (required, max 10,000 characters)
 - urls: Optional array of URLs to focus the agent on specific pages
 - schema: Optional JSON schema for structured output
 
 **Prompt Example:** "Find the founders of Firecrawl and their backgrounds"
-**Usage Example (no URLs):**
+**Usage Example (start agent, then poll for results):**
 \`\`\`json
 {
   "name": "firecrawl_agent",
@@ -687,7 +704,9 @@ Autonomous web data gathering agent. Describe what data you want, and the agent
   }
 }
 \`\`\`
-**Usage Example (with URLs):**
+Then poll with \`firecrawl_agent_status\` using the returned job ID.
+
+**Usage Example (with URLs - agent focuses on specific pages):**
 \`\`\`json
 {
   "name": "firecrawl_agent",
@@ -697,7 +716,7 @@ Autonomous web data gathering agent. Describe what data you want, and the agent
   }
 }
 \`\`\`
-**Returns:** Extracted data matching your prompt/schema, plus credits used.
+**Returns:** Job ID for status checking. Use \`firecrawl_agent_status\` to poll for results.
 `,
   parameters: z.object({
     prompt: z.string().min(1).max(10000),
@@ -719,7 +738,7 @@ Autonomous web data gathering agent. Describe what data you want, and the agent
       urls: a.urls as string[] | undefined,
       schema: (a.schema as Record<string, unknown>) || undefined,
     });
-    const res = await (client as any).agent({
+    const res = await (client as any).startAgent({
       ...agentBody,
       origin: ORIGIN,
     });
@@ -730,7 +749,9 @@ Autonomous web data gathering agent. Describe what data you want, and the agent
 server.addTool({
   name: 'firecrawl_agent_status',
   description: `
-Check the status of an agent job.
+Check the status of an agent job and retrieve results when complete. Use this to poll for results after starting an agent with \`firecrawl_agent\`.
+
+**Polling pattern:** Agent research can take minutes for complex queries. Poll this endpoint periodically (e.g., every 10-30 seconds) until status is "completed" or "failed".
 
 **Usage Example:**
 \`\`\`json
@@ -742,8 +763,8 @@ Check the status of an agent job.
 }
 \`\`\`
 **Possible statuses:**
-- processing: Agent is still working
-- completed: Extraction finished successfully
+- processing: Agent is still researching - check back later
+- completed: Research finished - response includes the extracted data
 - failed: An error occurred
 
 **Returns:** Status, progress, and results (if completed) of the agent job.
