Crawl

Start Crawl Job

Starts a crawl job for a given URL.

Method: client.crawl.start(params: StartCrawlJobParams): StartCrawlJobResponse

Endpoint: POST /api/crawl

Parameters:

StartCrawlJobParams:
- url: string - URL to start crawling from
- max_pages?: number - Max number of pages to crawl
- follow_links?: boolean - Follow links on the page
- ignore_sitemap?: boolean - Ignore sitemap when finding links to crawl
- exclude_patterns?: string[] - Patterns for paths to exclude from crawl
- include_patterns?: string[] - Patterns for paths to include in the crawl
- session_options?:
- scrape_options?:

Response: StartCrawlJobResponse

Example:

response = client.crawl.start(StartCrawlJobParams(url="https://example.com"))
# StartCrawlJobResponse only carries the job ID; use it to look up the job later
print(response.job_id)

Get Crawl Job

Retrieves details of a specific crawl job.

Method: client.crawl.get(id: str): CrawlJobResponse

Endpoint: GET /api/crawl/{id}

Parameters:

id: string - Crawl job ID

Example:

response = client.crawl.get(
  "182bd5e5-6e1a-4fe4-a799-aa6d9a6ab26e"
)
print(response.status)
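
A job returned by client.crawl.start may still be "pending" or "running" when it is first fetched, so callers that do not use start_and_wait typically poll until the status is terminal. The following is a minimal sketch of such a loop, not the SDK's own implementation. wait_for_crawl, its parameters, and fake_get are hypothetical names introduced for illustration; fake_get stands in for client.crawl.get so the sketch runs without network access.

```python
import time
from types import SimpleNamespace

# Statuses after which the job will not change again (taken from CrawlJobStatus).
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_crawl(get_job, job_id, poll_interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll get_job(job_id) until the job reaches a terminal status.

    get_job is any callable returning an object with a .status attribute,
    for example client.crawl.get. This helper is hypothetical, not part of the SDK.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = get_job(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl job {job_id} still {job.status} after {timeout}s")
        sleep(poll_interval)


# Stand-in for client.crawl.get: reports a different status on each call.
_fake_statuses = iter(["pending", "running", "completed"])


def fake_get(job_id):
    return SimpleNamespace(job_id=job_id, status=next(_fake_statuses))


job = wait_for_crawl(fake_get, "182bd5e5-6e1a-4fe4-a799-aa6d9a6ab26e", sleep=lambda _: None)
print(job.status)
```

With a real client, the call would look like wait_for_crawl(client.crawl.get, response.job_id), where response comes from client.crawl.start.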

Start Crawl Job and Wait

Starts a crawl job and waits for it to complete.

Method: client.crawl.start_and_wait(params: StartCrawlJobParams): CrawlJobResponse

Parameters:

StartCrawlJobParams:
- url: string - URL to start crawling from
- max_pages?: number - Max number of pages to crawl
- follow_links?: boolean - Follow links on the page
- ignore_sitemap?: boolean - Ignore sitemap when finding links to crawl
- exclude_patterns?: string[] - Patterns for paths to exclude from crawl
- include_patterns?: string[] - Patterns for paths to include in the crawl

Example:

response = client.crawl.start_and_wait(StartCrawlJobParams(url="https://example.com"))
print(response.status)

Types

CrawlPageStatus

CrawlPageStatus = Literal["completed", "failed"]

CrawlJobStatus

CrawlJobStatus = Literal["pending", "running", "completed", "failed"]

StartCrawlJobResponse

class StartCrawlJobResponse(BaseModel):
    job_id: str = Field(alias="jobId")

CrawledPage

class CrawledPage(BaseModel):
    metadata: Optional[dict[str, Union[str, list[str]]]] = None
    html: Optional[str] = None
    markdown: Optional[str] = None
    links: Optional[List[str]] = None
    url: str
    status: CrawlPageStatus
    error: Optional[str] = None
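
Because each page carries its own status and error, a job can complete while individual pages fail. A minimal sketch of separating the two groups follows. split_pages is a hypothetical helper, and the sample pages are made-up objects shaped like CrawledPage, not real crawl output.

```python
from types import SimpleNamespace


def split_pages(pages):
    """Return (completed, failed) lists based on each page's status."""
    completed = [p for p in pages if p.status == "completed"]
    failed = [p for p in pages if p.status == "failed"]
    return completed, failed


# Hypothetical sample data with the same attribute names as CrawledPage.
pages = [
    SimpleNamespace(url="https://example.com/", status="completed", markdown="# Home", error=None),
    SimpleNamespace(url="https://example.com/missing", status="failed", markdown=None, error="not found"),
]

completed, failed = split_pages(pages)
for page in failed:
    # Report which URLs could not be crawled and why.
    print(f"{page.url}: {page.error}")
```

With a real job, pages would be the data list of a CrawlJobResponse.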

CrawlJobResponse

class CrawlJobResponse(BaseModel):
    job_id: str = Field(alias="jobId")
    status: CrawlJobStatus
    error: Optional[str] = None
    data: List[CrawledPage] = Field(alias="data")
    total_crawled_pages: int = Field(alias="totalCrawledPages")
    total_page_batches: int = Field(alias="totalPageBatches")
    current_page_batch: int = Field(alias="currentPageBatch")
    batch_size: int = Field(alias="batchSize")
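
The Field aliases above suggest that the API payload uses camelCase keys while the Python models expose snake_case attributes. The sketch below mirrors that mapping using only the standard library, assuming the payload keys match the aliases exactly. CrawlJobSummary, ALIASES, parse_job, and the sample payload are hypothetical and are not part of the SDK, which handles this mapping itself through its models.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# camelCase JSON key -> snake_case attribute, copied from the Field aliases above.
ALIASES = {
    "jobId": "job_id",
    "totalCrawledPages": "total_crawled_pages",
    "totalPageBatches": "total_page_batches",
    "currentPageBatch": "current_page_batch",
    "batchSize": "batch_size",
}


@dataclass
class CrawlJobSummary:
    """Stdlib stand-in for CrawlJobResponse (hypothetical, for illustration)."""
    job_id: str
    status: str
    total_crawled_pages: int
    total_page_batches: int
    current_page_batch: int
    batch_size: int
    error: Optional[str] = None
    data: List[Dict[str, Any]] = field(default_factory=list)


def parse_job(payload: Dict[str, Any]) -> CrawlJobSummary:
    """Rename aliased keys; keys without an alias (status, error, data) pass through."""
    renamed = {ALIASES.get(key, key): value for key, value in payload.items()}
    return CrawlJobSummary(**renamed)


# Hypothetical payload; the values are made up for the example.
payload = {
    "jobId": "182bd5e5-6e1a-4fe4-a799-aa6d9a6ab26e",
    "status": "completed",
    "data": [],
    "totalCrawledPages": 0,
    "totalPageBatches": 0,
    "currentPageBatch": 0,
    "batchSize": 20,
}

job = parse_job(payload)
print(job.job_id, job.status, job.batch_size)
```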

Last updated 3 months ago
