Simple Content Scraping with Crawl4AI

Published: 2025-01-08


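When browsing the web, I sometimes come across content I like but have no good way to save it, and even when I do save it, the result is rarely in the clean format I want. People who can program can write code to scrape and clean the content, but with Crawl4AI even those with little programming background can extract what they want fairly easily. I'll walk through the basics of Crawl4AI with a few examples.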

Installation

Crawl4AI is written in Python, and the author has published it to PyPI. On Windows, the simplest way to install it is with pip.

pip install crawl4ai # install the crawl4ai library

crawl4ai-setup # set up the browser
[INIT].... → Running post-installation setup...
[INIT].... → Installing Playwright browsers...
[COMPLETE] ● Playwright installation completed successfully.
[INIT].... → Starting database initialization...
[COMPLETE] ● Database initialization completed successfully.
[COMPLETE] ● Post-installation setup completed!

After these two commands, Crawl4AI and its dependencies are installed.
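To confirm the install before writing any crawling code, one option is to ask Python which version of the package it can see. This is a minimal sketch that uses only the standard library; the helper name `installed_version` is my own and is not part of Crawl4AI.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("crawl4ai")
    print(f"crawl4ai {found}" if found else "crawl4ai is not installed")
```

If this reports that the package is missing, the usual cause is that pip installed it into a different Python interpreter than the one running the script.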

Basic usage

Crawl an article and output it as markdown

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://blog.chendi.link/posts/2025-new-chapter/"
        )
        print(result.markdown)  # print the markdown

if __name__ == "__main__":
    asyncio.run(main())

As shown below, when the program runs successfully it prints the content of the article.

PS C:\Users\EDY\Desktop\python-craw14ai> & C:/Users/EDY/AppData/Local/Programs/Python/Python312/python.exe c:/Users/EDY/Desktop/python-craw14ai/app.py
[INIT].... → Crawl4AI 0.4.247
[FETCH]... ↓ https://blog.chendi.link/posts/2025-new-chapter/... | Status: True | Time: 3.71s
[SCRAPE].. ◆ Processed https://blog.chendi.link/posts/2025-new-chapter/... | Time: 13ms
[COMPLETE] ● https://blog.chendi.link/posts/2025-new-chapter/... | Status: True | Total: 3.75s
[Skip to main content](https://blog.chendi.link/posts/2025-new-chapter/<#main>)
[ ![Nordlys logo, a drawing of two gray mountains with green northern lights in the background](https://blog.chendi.link/_astro/logo.Dhnon3_S_ZXiHo1.svg) 陈迪の自留地 ](https://blog.chendi.link/posts/2025-new-chapter/</>)

···

### 消费降级[#](https://blog.chendi.link/posts/2025-new-chapter/<#消费降级>)
  1. 咖啡，奶茶及其它饮品的消费支出要控制在 1000元以内。(2024: 3340元)
  2. 服装优先购买经典款式，不追求个性。整年支出控制在 3000 元以内。(2024: 6200元)
  3. 每年花费在电子产品上的钱是最多的，也是感觉最不值得的。2024 年新买了一台 MacBook，配置了一台游 戏主机，这两个大件的支出就近 20000元。今年不买新的电子产品，维持现有设备的使用。
  4. 在吃的方面一向是不吝啬的，2024 年在吃上一共花费了 26000 元，平均下来每天花费 71 元。今年要控制在 20000 元以内，将日均消费控制在 55 元以内。


### 持续输出[#](https://blog.chendi.link/posts/2025-new-chapter/<#持续输出>)
  1. 每周保底一篇博客文章。
  2. 想要通过播客的方式来提升自己的表达能力，只是目前只有一个初步的想法，内容，题材都没有一个方向。

···

Markdown is not the only output Crawl4AI provides; the result object exposes other fields as well:

print(result.html)         # raw HTML
print(result.cleaned_html) # with ads, popups, etc. removed
print(result.fit_html)     # most relevant HTML
print(result.markdown)     # Markdown
print(result.fit_markdown) # most relevant markdown content
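Since the goal is to keep the content rather than just print it, here is a small sketch that writes the markdown to a file named after the last segment of the URL. The helper names `filename_for` and `save_markdown` and the default folder `clips` are assumptions of mine, not part of Crawl4AI.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url: str) -> str:
    """Derive a file name such as '2025-new-chapter.md' from the last URL path segment."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    stem = segments[-1] if segments else "index"
    # Replace anything that is not a letter, digit, dot, underscore or hyphen.
    stem = re.sub(r"[^\w.-]", "_", stem)
    return f"{stem}.md"


def save_markdown(markdown: str, url: str, out_dir: str = "clips") -> Path:
    """Write the markdown text into out_dir and return the path of the new file."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / filename_for(url)
    target.write_text(markdown, encoding="utf-8")
    return target
```

Inside `main()`, a call like `save_markdown(str(result.markdown), url)` could then replace the `print` call, where `url` is the address passed to `crawler.arun`.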

Crawl the content inside a specific CSS selector

To scrape only the content inside a specific CSS selector, Crawl4AI provides the css_selector parameter, which filters out elements you don't need.

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://blog.chendi.link/posts/my-2024-year-summary/",
            css_selector="article"  # only keep the content of the article element
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Advanced usage (powering Crawl4AI with an LLM)

In some scenarios the data is complex or unstructured, and a simple CSS selector cannot capture what we want. In that case, an LLM can give Crawl4AI more intelligent parsing.

If your data is highly structured, JsonCssExtractionStrategy or JsonXPathExtractionStrategy is the better choice, since both parse much more efficiently. LLM-based extraction costs more and is slower.
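For comparison, a schema-based extraction might look like the sketch below. The CSS selectors (`div.product`, `h2`, `.price`) and the function name `extract_products` are hypothetical placeholders and have to be adapted to the markup of the page you actually crawl.

```python
import json

# Hypothetical schema: the CSS selectors below are placeholders and must be
# adapted to the markup of the page you actually crawl.
PRODUCT_SCHEMA = {
    "name": "Products",
    "baseSelector": "div.product",  # assumed to match one element per product
    "fields": [
        {"name": "name", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}


async def extract_products(url: str) -> list:
    """Crawl `url` and return the items matched by PRODUCT_SCHEMA."""
    # Imported here so the schema above can be inspected without Crawl4AI installed.
    from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(PRODUCT_SCHEMA),
        cache_mode=CacheMode.BYPASS,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
    if not result.success:
        raise RuntimeError(result.error_message)
    return json.loads(result.extracted_content)
```

To run it, import `asyncio` and call `asyncio.run(extract_products("https://example.com/products"))`. No model is called here, so there is no API key and no token cost.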

Below is a complete example of LLM-based extraction:

import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: str

async def main():
    # 1. Define the LLM extraction strategy
    llm_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",            # "ollama/llama2" also works
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=Product.schema_json(),            # or Product.model_json_schema()
        extraction_type="schema",
        instruction="Extract all product objects with 'name' and 'price' from the content.",  # the prompt
        chunk_token_threshold=1000,
        overlap_rate=0.0,
        apply_chunking=True,
        input_format="markdown",   # content format sent to the LLM; "html" or "fit_markdown" also work
        extra_args={"temperature": 0.0, "max_tokens": 800}
    )

    # 2. Build the crawler run config
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )

    # 3. Optionally create a browser config
    browser_cfg = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. Crawl the target page
        result = await crawler.arun(
            url="https://example.com/products",
            config=crawl_config
        )

        if result.success:
            # 5. Parse the extracted content as JSON
            data = json.loads(result.extracted_content)
            print("Extracted items:", data)

            # 6. Show usage stats
            llm_strategy.show_usage()  # prints token usage
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
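LLM output is not guaranteed to match the schema exactly, so it can be worth filtering the parsed JSON before using it. The following is a sketch under the assumption that `result.extracted_content` is a JSON string holding either one object or a list of objects; the helper name `parse_products` and the sample data are mine.

```python
import json
from typing import List, Tuple


def parse_products(extracted_content: str) -> List[Tuple[str, str]]:
    """Return (name, price) pairs, skipping items that lack either field."""
    data = json.loads(extracted_content)
    # Tolerate a single object as well as a list of objects.
    items = data if isinstance(data, list) else [data]
    products = []
    for item in items:
        if isinstance(item, dict) and "name" in item and "price" in item:
            products.append((str(item["name"]), str(item["price"])))
    return products


# Made-up sample input: one valid item and one item without the expected fields.
sample = '[{"name": "Widget", "price": "$9.99"}, {"content": "no fields here"}]'
print(parse_products(sample))  # prints [('Widget', '$9.99')]
```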

For more advanced usage, see the official documentation.
