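From Plain Text to Multimodal: Clawdbot's Perception Upgrade

On January 15, 2026, Clawdbot v2026.1.15 shipped a significant update:

"Inbound media understanding for images, audio, and video with provider and CLI fallbacks"

With this release, Clawdbot is no longer an AI that can only read text; it is a multimodal agent that can also see and hear.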

Technical Implementation

Image Understanding

How it worked before:

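// User sends an image
const message = await telegram.getMessage();

if (message.photo) {
  await reply('I received an image, but I cannot understand its content. Please describe it.');
}

The bot could only tell that an image had arrived, not what it showed.

How it works now: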

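// User sends an image
const message = await telegram.getMessage();

if (message.photo) {
  // Download the image. In the Telegram Bot API, `photo` is an array of
  // sizes; the last entry is the largest.
  const largest = message.photo[message.photo.length - 1];
  const imageBuffer = await telegram.downloadFile(largest.file_id);
  const base64Image = imageBuffer.toString('base64');

  // Call the Vision API
  const response = await claude.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,  // required by the Messages API
    messages: [{
      role: 'user',
      content: [
        {
          type: 'image',
          source: {
            type: 'base64',
            media_type: 'image/jpeg',
            data: base64Image
          }
        },
        {
          type: 'text',
          text: 'Describe this image'
        }
      ]
    }]
  });

  // The first content block is expected to be text
  const first = response.content[0];
  await reply(first.type === 'text' ? first.text : '');
}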

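Real-World Scenarios

Scenario 1: OCR Document Extraction

User photographs a business card → sends it to the bot

Bot recognizes:
  Name: Zhang San
  Title: Technical Director
  Company: XYZ Tech
  Phone: +86 138 0013 8000
  Email: zhang@xyz.com

Bot: "Added to your contacts. Would you like me to send an introduction email?"

Scenario 2: Troubleshooting from a Screenshot

User screenshots an error screen → sends it to the bot

Bot recognizes:
  Error type: TypeError: Cannot read property 'map' of undefined
  File: src/components/List.tsx:45
  Stack trace: ...

Bot: "This happens because `items` is undefined. Please check:
     1. Whether the API response is correct
     2. Whether the default value is set to []
     
     Suggested change:
     const items = data?.items || [];"

Scenario 3: Image Content Moderation

Someone posts an image in a group chat → the bot checks it automatically

Bot recognizes:
  Content: QR code
  Type: phishing link

Bot: "⚠️ Warning: suspicious QR code detected; it may lead to a phishing site. Scan with caution."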

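Audio Understanding

Speech Transcription

// User sends a voice message (.ogg/.mp3)
const message = await telegram.getMessage();

if (message.voice) {
  // Download the audio
  const audioBuffer = await telegram.downloadFile(message.voice.file_id);

  // Call the Whisper API. The SDK expects a named file rather than a raw
  // Buffer; toFile is exported by the 'openai' package.
  const transcription = await openai.audio.transcriptions.create({
    file: await toFile(audioBuffer, 'voice.ogg'),
    model: 'whisper-1',
    language: 'zh'  // Chinese
  });

  // Handle the transcribed text
  const response = await handleMessage(transcription.text);

  // Optional: reply with speech (Telegram voice notes use OGG/Opus)
  const tts = await openai.audio.speech.create({
    model: 'tts-1',
    voice: 'alloy',
    input: response,
    response_format: 'opus'
  });

  // The SDK returns an HTTP response; convert its body to a Buffer
  await telegram.sendVoice(message.chat.id, Buffer.from(await tts.arrayBuffer()));
}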

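Audio Analysis

// Extract audio features
const analysis = await analyzeAudio(audioBuffer);

if (analysis.backgroundNoise > 0.7) {
  await reply('Your surroundings seem noisy, so I may have misheard. Could you send that again?');
}

if (analysis.speechRate < 0.3) {
  // Very slow speech: possibly an older user or someone new to voice input
  await reply('I heard you clearly. Take your time, there is no rush.');
}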
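The article does not define `analyzeAudio`, so the sketch below shows one possible way a `backgroundNoise` score between 0 and 1 could be computed. It assumes decoded mono PCM samples in the range [-1, 1]. The function name, the frame size, and the percentile heuristic are all illustrative assumptions.

```typescript
// Hypothetical helper: estimate background noise from mono PCM samples in [-1, 1].
// Returns a value in [0, 1]: the RMS of the quietest frames divided by the RMS of
// the loudest frames. Near 0 means pauses are silent; near 1 means constant noise.
function estimateBackgroundNoise(samples: Float32Array, frameSize = 1024): number {
  const energies: number[] = [];
  for (let start = 0; start + frameSize <= samples.length; start += frameSize) {
    let sumSquares = 0;
    for (let i = start; i < start + frameSize; i++) {
      sumSquares += samples[i] * samples[i];
    }
    energies.push(Math.sqrt(sumSquares / frameSize));
  }
  if (energies.length === 0) return 0;

  // Compare the 10th-percentile frame with the 90th-percentile frame
  energies.sort((a, b) => a - b);
  const at = (p: number) =>
    energies[Math.min(energies.length - 1, Math.floor(p * energies.length))];
  const quiet = at(0.1);
  const loud = at(0.9);
  return loud === 0 ? 0 : quiet / loud;
}
```

A score near 1 means the quiet moments are almost as loud as the speech, which is the situation the 0.7 threshold above is meant to catch.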

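Video Understanding

Frame Extraction

import ffmpeg from 'fluent-ffmpeg';
import { mkdtemp, readdir } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

async function processVideo(videoPath: string) {
  // Sample frames with FFmpeg (uniform sampling, not true keyframe detection)
  const frames = await extractKeyFrames(videoPath, {
    fps: 1,  // 1 frame per second
    maxFrames: 10  // at most 10 frames, so only the first 10 seconds are covered
  });

  // Describe each frame with the Vision API
  const descriptions = await Promise.all(
    frames.map(frame => describeFrame(frame))
  );

  // Generate the video summary
  const summary = await claude.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,  // required by the Messages API
    messages: [{
      role: 'user',
      content: `Summarize the video based on the following frame descriptions:
      ${descriptions.join('\n---\n')}`
    }]
  });

  const first = summary.content[0];
  return first.type === 'text' ? first.text : '';
}

async function extractKeyFrames(
  videoPath: string,
  options: { fps: number; maxFrames: number }
): Promise<string[]> {
  // Write frames into a fresh temp directory so concurrent runs do not collide
  const outDir = await mkdtemp(join(tmpdir(), 'frames-'));

  await new Promise<void>((resolve, reject) => {
    ffmpeg(videoPath)
      .fps(options.fps)
      .frames(options.maxFrames)
      .on('end', () => resolve())
      .on('error', reject)
      .save(join(outDir, 'frame-%04d.jpg'));
  });

  // Return the paths FFmpeg actually wrote; zero-padded names sort in order
  const files = await readdir(outDir);
  return files.sort().map(name => join(outDir, name));
}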
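Sampling at 1 frame per second with a cap of 10 frames only covers the start of a long video. A minimal sketch of one alternative follows: spread the frame budget evenly over the whole duration. The helper name `sampleTimestamps` is hypothetical, and each returned timestamp would then be passed to FFmpeg as a seek position, which is not shown here.

```typescript
// Hypothetical helper: pick up to `maxFrames` timestamps (in seconds) spread
// evenly across a video, sampling the midpoint of each equal-length segment.
function sampleTimestamps(durationSec: number, maxFrames: number): number[] {
  if (durationSec <= 0 || maxFrames <= 0) return [];
  // Never request more frames than there are whole seconds of video
  const count = Math.min(maxFrames, Math.max(1, Math.floor(durationSec)));
  const segment = durationSec / count;
  return Array.from({ length: count }, (_, i) => (i + 0.5) * segment);
}

// A 10-minute video with a 10-frame budget: one frame per minute.
console.log(sampleTimestamps(600, 10));
```

Sampling segment midpoints avoids the first and last instants of the video, which are often black frames or title cards.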

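Real-World Scenarios

Scenario 1: Meeting Recording Summary

User uploads a Zoom meeting recording (1 hour)
  ↓
Bot extracts 60 frames (1 frame per minute)
  ↓
Recognizes:
  - 00:05 - Title slide: Q1 earnings review
  - 00:15 - Chart: revenue up 25%
  - 00:30 - Discussion: 5 people online
  - 00:45 - Decision: budget approved
  ↓
Generates summary:
  "Topic: Q1 earnings
   Key figure: revenue up 25%
   Decision: Q2 budget of $500k approved
   Attendees: Zhang San, Li Si, Wang Wu..."

Scenario 2: Security Camera

Scheduled task:
  Capture a camera frame every 10 minutes
  ↓
  Detect anomalies:
    - A stranger appears
    - A pet is not in its usual place
    - Smoke or flames
  ↓
  Send an alert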

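Performance and Cost Analysis

Token Consumption

Text message (50 characters):
  About 100 tokens
  Cost: $0.0015 (Opus, which implies $15 per million tokens)

Image (1080x1920):
  About 1,500 tokens
  Cost: $0.0225 (Opus)

Voice (1 minute):
  Transcription: $0.006 (Whisper)
  Text handling: $0.001 (Haiku)
  Cost: $0.007

Video (10 minutes):
  Extract 10 frames: 10 × 1,500 = 15,000 tokens
  Summary: 2,000 tokens
  Cost: $0.255 (Opus)

Comparison:

100 text messages: $0.15
10 images: $0.23
10 voice clips: $0.07
1 video: $0.26

Conclusion: per item, video costs the most, images come next, and voice and text are comparatively cheap.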
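The arithmetic behind these figures can be reproduced with a small estimator. This is a sketch that hard-codes the rates implied by the numbers above; the constants and the `estimateCostUsd` name are assumptions for illustration, not official pricing.

```typescript
// Rates implied by the figures above; assumptions, not official pricing.
const USD_PER_TOKEN = 15 / 1_000_000;  // implied by $0.0015 per 100 tokens
const TOKENS_PER_IMAGE = 1500;         // rough figure for a 1080x1920 image
const USD_PER_VOICE_MINUTE = 0.007;    // transcription plus text handling

interface Usage {
  textTokens: number;
  images: number;
  voiceMinutes: number;
}

// Estimated cost in USD for a mix of text, images, and voice.
function estimateCostUsd(usage: Usage): number {
  const tokens = usage.textTokens + usage.images * TOKENS_PER_IMAGE;
  return tokens * USD_PER_TOKEN + usage.voiceMinutes * USD_PER_VOICE_MINUTE;
}

// A 10-minute video: 10 sampled frames plus a 2,000-token summary.
const videoCost = estimateCostUsd({ textTokens: 2000, images: 10, voiceMinutes: 0 });
console.log(videoCost.toFixed(3));
```

Treating a video as a batch of images plus summary text keeps the model simple. Changing the frame count is then the main lever on video cost.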

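Processing Latency

Text: 200-500 ms (API call)
Image: 1-2 seconds (download + Vision API)
Voice: 2-5 seconds (download + Whisper + handling)
Video: 30-60 seconds (download + FFmpeg + Vision API × 10)

User experience:

Text: instant response
Image/voice: acceptable
Video: needs a "processing" notice

Bandwidth Consumption

Text message: < 1 KB
Image (compressed): 100-500 KB
Voice (1 minute): 500 KB - 1 MB
Video (10 minutes): 20-50 MB

If Clawdbot runs in a bandwidth-constrained environment (a phone hotspot or a satellite link), video processing will be slow.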

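Comparison with ChatGPT Vision

| Feature | ChatGPT Vision | Clawdbot |
|------|---------------|----------|
| Image recognition | ✅ Upload via web UI | ✅ Send directly in a messaging platform |
| OCR | ✅ High accuracy | ✅ High accuracy |
| Video understanding | ❌ Not supported | ✅ Supported (via frame extraction) |
| Real-time processing | ❌ Requires manual upload | ✅ Processes incoming messages automatically |
| Integration with other tools | ❌ None | ✅ Can call tools |
| Privacy | ⚠️ Uploaded to OpenAI | ⚠️ Uploaded to Anthropic/OpenAI |

Clawdbot's advantages:

Built into the message flow (no separate upload step)
Can chain into other features (recognize an image → save it to Notion)

ChatGPT's advantages:

Friendly interface
No self-hosting required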

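Privacy Considerations

Data Flow

User sends an image (WhatsApp)
  ↓
Clawdbot downloads it locally
  ↓
Uploads it to the Anthropic API
  ↓
Vision model processes it
  ↓
Response is returned

The image content is uploaded to Anthropic's servers.

Anthropic's privacy policy:

API requests are not used for training (by default)
Data is deleted after a 30-day retention period
There is a "safety review exception" (if illegal content is detected)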

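Local Processing

For sensitive images (ID cards, medical reports), use a local model:

// Note: 'llava-node' and its LLaVA class are illustrative names;
// substitute whichever local inference binding you actually use.
import { LLaVA } from 'llava-node';

const localVision = new LLaVA({
  modelPath: './models/llava-v1.6-34b.gguf'
});

async function processImage(imagePath: string) {
  // Run the vision model locally
  const description = await localVision.describe(imagePath);
  return description;
}

Cost comparison:

Cloud (Claude):
  - Latency: 1-2 seconds
  - Cost: $0.0225 per image
  - Privacy: uploaded to Anthropic

Local (LLaVA):
  - Latency: 10-30 seconds
  - Cost: electricity + hardware depreciation
  - Privacy: never leaves the machine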

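Future Directions

Live Video Streams

Today: users send pre-recorded video.

Future: connect directly to a live video stream.

import ffmpeg from 'fluent-ffmpeg';

// vision.describe and notifyUser stand in for application-level helpers
function processLiveStream(streamUrl: string) {
  // Emit 1 frame per second as a stream of concatenated JPEG images
  const output = ffmpeg(streamUrl)
    .fps(1)
    .videoCodec('mjpeg')
    .format('image2pipe')
    .pipe();
  const JPEG_END = Buffer.from([0xff, 0xd9]);  // JPEG end-of-image marker
  let pending = Buffer.alloc(0);
  output.on('data', async (chunk: Buffer) => {
    pending = Buffer.concat([pending, chunk]);
    let end: number;
    while ((end = pending.indexOf(JPEG_END)) !== -1) {
      const frame = pending.subarray(0, end + 2);
      pending = pending.subarray(end + 2);
      const description = await vision.describe(frame);
      // Detect specific events
      if (description.includes('raising a hand')) {
        await notifyUser('Someone in the video is raising a hand to ask a question');
      }
    }
  });
}

Use cases:

Live captions and translation for online meetings
Anomaly detection for security cameras
Live-stream content moderation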

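Voice Conversation

Today: the user sends a voice message → it is transcribed → processed → answered in text.

Future: a fully spoken conversation.

import { Readable } from 'node:stream';

// whisper, agent, elevenLabs, and playAudio stand in for application-level clients
async function voiceConversation(audioStream: Readable) {
  // Real-time transcription
  const transcription = await whisper.transcribe(audioStream, {
    realtime: true
  });
  
  // Generate the reply
  const response = await agent.chat(transcription.text);
  
  // Speech synthesis
  const tts = await elevenLabs.synthesize(response, {
    voice: 'Adam',
    model: 'eleven_turbo_v2'
  });
  
  // Playback
  await playAudio(tts);
}

Latency optimization:

Traditional mode:
  Voice → transcription (2s) → processing (3s) → TTS (2s) = 7s

Optimized mode (streaming):
  Voice → real-time transcription (0.5s) → streamed processing (1s) → streamed TTS (1s) = 2.5s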

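Multimodal Fusion

// User sends an image, a voice note, and text together
const message = await getComplexMessage();

const inputs = [
  {type: 'image', data: message.photo},
  {type: 'audio', data: message.voice},
  {type: 'text', data: message.text}
];

// Transcribe up front (transcribe is assumed to be async) so the prompt
// contains text rather than a pending Promise
const transcript = await transcribe(inputs[1].data);

// Fused understanding
const response = await claude.messages.create({
  model: 'claude-opus-4-5',
  max_tokens: 1024,  // required by the Messages API
  messages: [{
    role: 'user',
    content: [
      {
        type: 'image',
        // inputs[0].data is assumed to be a base64-encoded JPEG string
        source: {type: 'base64', media_type: 'image/jpeg', data: inputs[0].data}
      },
      {
        type: 'text',
        text: `Voice transcript: ${transcript}\n\n` +
              `Additional text: ${inputs[2].data}\n\n` +
              `Please answer using the image, the voice note, and the text together.`
      }
    ]
  }]
});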

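Use case:

Scenario: the user is inspecting a house on site

User:
  - Takes photos of the room (5 photos)
  - Records a voice note: "This room faces south and gets good light, but there are cracks in the wall"
  - Text: "The price is $350k. Is that acceptable?"

Bot's combined analysis:
  - Image recognition: living room, 30 m², cracks present
  - Speech recognition: south-facing, good light, cracks present
  - Text: price inquiry
  
Bot's reply:
  "Based on the photos and your description:
   Pros: good light, suitable size
   Risks: the wall cracks need a professional inspection and may require repairs (budget $5k-10k)
   Price: $350k is mid-to-high for this area
   
   Suggestion: ask the seller to repair the cracks first, or to lower the price by $10k as compensation"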

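Technical Challenges

Challenge 1: Token Consumption Explosion

Images and video consume far more tokens than text:

Plain conversation (100 turns):
  100 × 1,000 tokens = 100,000 tokens
  Cost: $1.5

With images (50 text turns + 50 images):
  50 × 1,000 + 50 × 1,500 = 125,000 tokens
  Cost: $1.875

With video (30 text turns + 5 videos):
  30 × 1,000 + 5 × 15,000 = 105,000 tokens
  Cost: $1.575

But if each 10-minute video is sampled once every 10 seconds (60 frames × 1,500 = 90,000 tokens):
  30 × 1,000 + 5 × 90,000 = 480,000 tokens
  Cost: $7.2

Mitigations:

Lower the sampling rate (1 frame every 5 seconds instead of every second)
Preprocess with a cheaper model (Haiku)
Run Opus only on "key frames"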
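The link between sampling interval and token spend can be made concrete with a short sketch. It reuses the 1,500-tokens-per-frame estimate from this article, and both function names are hypothetical. It shows how an interval might be derived from a token budget rather than picked by hand.

```typescript
const TOKENS_PER_FRAME = 1500;  // per-frame estimate used in this article

// Tokens spent on frames when sampling one frame every `intervalSec` seconds.
function videoFrameTokens(durationSec: number, intervalSec: number): number {
  const frames = Math.floor(durationSec / intervalSec);
  return frames * TOKENS_PER_FRAME;
}

// A whole-second sampling interval that keeps a video within `budgetTokens`.
function intervalForBudget(durationSec: number, budgetTokens: number): number {
  const maxFrames = Math.floor(budgetTokens / TOKENS_PER_FRAME);
  if (maxFrames < 1) throw new Error('Budget is too small for even one frame');
  return Math.max(1, Math.ceil(durationSec / maxFrames));
}

// A 10-minute video sampled every 10 seconds uses 60 frames.
console.log(videoFrameTokens(600, 10));
// Interval needed to keep the same video within 15,000 tokens.
console.log(intervalForBudget(600, 15_000));
```

Deriving the interval from a budget means a longer video is sampled more sparsely instead of costing more.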

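Challenge 2: Storage Space

Each user sends per day:
  - 50 messages (< 1 MB)
  - 20 images (10 MB)
  - 5 voice clips (5 MB)
  - 1 video (50 MB)

Daily storage: about 65 MB
Monthly storage: 1.95 GB
Yearly storage: 23.4 GB

With 10 active users, a year of media needs 234 GB of storage.

Cloud storage cost:

AWS S3: $0.023/GB/month
234 GB × $0.023 = $5.38/month

+ data transfer fees
+ API request fees

Total: ~$10/month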
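The projection above is simple enough to script, which makes it easy to rerun with different user counts. This sketch assumes 30-day months, 1 GB = 1000 MB, and no deletion of old media; the function names are illustrative.

```typescript
const MB_PER_USER_PER_DAY = 65;     // daily total from the breakdown above
const S3_USD_PER_GB_MONTH = 0.023;  // S3 rate quoted in this article

// Accumulated storage in GB, assuming 30-day months and no deletion.
function storedGb(users: number, months: number): number {
  return (users * MB_PER_USER_PER_DAY * 30 * months) / 1000;
}

// Monthly storage bill once `gb` gigabytes have accumulated.
function monthlyStorageCostUsd(gb: number): number {
  return gb * S3_USD_PER_GB_MONTH;
}

const afterOneYear = storedGb(10, 12);
console.log(afterOneYear, monthlyStorageCostUsd(afterOneYear).toFixed(2));
```

Because nothing is deleted, the monthly bill keeps growing. The $5.38 figure is the bill at the end of the first year, not a flat rate.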

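Challenge 3: Format Compatibility

Media formats differ by platform:

| Platform | Image | Audio | Video |
|------|------|------|------|
| Telegram | JPG, PNG, WebP | OGG, MP3 | MP4, MOV |
| WhatsApp | JPG, PNG | OGG, AAC | MP4 |
| Discord | JPG, PNG, GIF | MP3, WAV | MP4, WebM |

Everything needs to be transcoded to a common format:

import sharp from 'sharp';
import ffmpeg from 'fluent-ffmpeg';
import { Readable } from 'node:stream';

// fluent-ffmpeg takes a path or stream as input and has no toBuffer(), so feed
// the Buffer in as a stream and collect the piped output.
function transcode(
  file: Buffer,
  configure: (cmd: ffmpeg.FfmpegCommand) => ffmpeg.FfmpegCommand
): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    const output = configure(ffmpeg(Readable.from([file]))).on('error', reject).pipe();
    output.on('data', (chunk: Buffer) => chunks.push(chunk));
    output.on('end', () => resolve(Buffer.concat(chunks)));
  });
}

async function normalizeMedia(file: Buffer, type: string): Promise<Buffer> {
  if (type === 'image') {
    // Convert everything to JPEG
    return sharp(file).jpeg({quality: 80}).toBuffer();
  }
  if (type === 'audio') {
    // Convert everything to MP3
    return transcode(file, cmd => cmd.audioCodec('libmp3lame').format('mp3'));
  }
  if (type === 'video') {
    // Convert everything to MP4 (H.264); fragmented so it can be written to a pipe
    return transcode(file, cmd => cmd
      .videoCodec('libx264')
      .audioCodec('aac')
      .format('mp4')
      .outputOptions('-movflags', 'frag_keyframe+empty_moov'));
  }
  throw new Error(`Unsupported media type: ${type}`);
}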
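Before transcoding, it helps to know what a file actually is, since a platform's declared type is not always reliable. A minimal sketch follows: detect the container from its leading bytes for the formats in the table above. The function name is hypothetical, and the check covers only a handful of well-known signatures.

```typescript
// Identify a media container from its leading "magic" bytes.
// Returns null for anything this sketch does not recognize.
function sniffMediaType(file: Uint8Array): string | null {
  const startsWith = (bytes: number[], offset = 0) =>
    bytes.every((b, i) => file[offset + i] === b);
  const ascii = (s: string) => s.split('').map(c => c.charCodeAt(0));

  if (startsWith([0xff, 0xd8, 0xff])) return 'image/jpeg';
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return 'image/png';
  if (startsWith(ascii('GIF8'))) return 'image/gif';
  if (startsWith(ascii('RIFF')) && startsWith(ascii('WEBP'), 8)) return 'image/webp';
  if (startsWith(ascii('OggS'))) return 'audio/ogg';
  // ISO base media files (MP4, MOV, M4A) all carry 'ftyp' at offset 4
  if (startsWith(ascii('ftyp'), 4)) return 'video/mp4';
  return null;
}

console.log(sniffMediaType(Uint8Array.from([0xff, 0xd8, 0xff, 0xe0])));
```

Because a Node.js `Buffer` is a `Uint8Array`, a downloaded file can be passed in directly, and files that are already in the target format can skip transcoding.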

Final Recommendations

Multimodal capability is an important step forward for AI agents, but it calls for care in three areas:

Cost Control

{
  "media": {
    "image": {
      "maxSize": 5242880,  // 5 MB
      "allowedTypes": ["image/jpeg", "image/png"],
      "compressionQuality": 80
    },
    "video": {
      "maxSize": 52428800,  // 50 MB
      "maxDuration": 600,  // 10 minutes
      "samplingRate": 0.2  // 1 frame every 5 seconds
    },
    "dailyQuota": {
      "images": 100,
      "videos": 10
    }
  }
}
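A configuration like this only helps if something enforces it. The sketch below shows how the image limits and daily quota might be checked before any API call is made. The types mirror the fields above, while `checkImage` and the usage counter are hypothetical.

```typescript
interface ImageLimits {
  maxSize: number;  // bytes
  allowedTypes: string[];
}

interface DailyUsage {
  images: number;
}

// Returns a rejection reason, or null when the image may be processed.
function checkImage(
  sizeBytes: number,
  mimeType: string,
  usedToday: DailyUsage,
  limits: ImageLimits,
  dailyImageQuota: number,
): string | null {
  if (!limits.allowedTypes.includes(mimeType)) return 'unsupported type';
  if (sizeBytes > limits.maxSize) return 'file too large';
  if (usedToday.images >= dailyImageQuota) return 'daily quota reached';
  return null;
}

const imageLimits: ImageLimits = {
  maxSize: 5_242_880,
  allowedTypes: ['image/jpeg', 'image/png'],
};
console.log(checkImage(6_000_000, 'image/jpeg', { images: 3 }, imageLimits, 100));
```

Returning a reason rather than a boolean lets the bot tell the user why a file was skipped.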

Privacy Protection

{
  "privacy": {
    "localProcessing": {
      "enabled": true,
      "models": {
        "vision": "llava-1.6-34b",
        "stt": "whisper-medium"
      }
    },
    "cloudFallback": {
      "enabled": true,
      "requireApproval": true
    }
  }
}

Prefer local models, and use the cloud only as a fallback.
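The local-first policy can be expressed as a small routing decision. This is a sketch under the assumption that the caller already knows whether the local model is available and whether the user has approved a cloud upload. The `chooseRoute` function and the `Route` type are illustrative names.

```typescript
interface PrivacyConfig {
  localProcessing: { enabled: boolean };
  cloudFallback: { enabled: boolean; requireApproval: boolean };
}

type Route = 'local' | 'cloud' | 'blocked';

// Decide where a media item is processed: local first, cloud only as a fallback.
function chooseRoute(
  config: PrivacyConfig,
  localAvailable: boolean,
  userApproved: boolean,
): Route {
  if (config.localProcessing.enabled && localAvailable) return 'local';
  if (!config.cloudFallback.enabled) return 'blocked';
  if (config.cloudFallback.requireApproval && !userApproved) return 'blocked';
  return 'cloud';
}

const privacyConfig: PrivacyConfig = {
  localProcessing: { enabled: true },
  cloudFallback: { enabled: true, requireApproval: true },
};
console.log(chooseRoute(privacyConfig, false, false));
```

With `requireApproval` set, a local model outage never silently sends sensitive media to a cloud provider; the item stays blocked until the user agrees.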

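Performance Optimization

import sharp from 'sharp';

// Preprocess to lower the fidelity the model has to handle
async function preprocessImage(buffer: Buffer): Promise<Buffer> {
  return sharp(buffer)
    .resize(1024, 1024, {fit: 'inside', withoutEnlargement: true})  // reduce resolution, never upscale
    .jpeg({quality: 70})  // reduce quality
    .toBuffer();
}

Multimodal agents are the future, but today's cost and performance still need work.

Use them with care, and monitor consumption.

References:

Claude Vision: https://docs.anthropic.com/claude/docs/vision
Whisper API: https://platform.openai.com/docs/guides/speech-to-text
FFmpeg: https://ffmpeg.org/documentation.html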
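To see what downscaling buys, the sketch below computes the dimensions an "inside" fit would produce and a rough token estimate for the result. The width × height / 750 rule of thumb is taken from Anthropic's vision documentation as best understood here and should be treated as an assumption. The helper names are hypothetical.

```typescript
interface Size {
  width: number;
  height: number;
}

// Dimensions after shrinking to fit inside a box while keeping the aspect ratio.
// Images already inside the box are left unchanged in this sketch.
function fitInside(size: Size, maxWidth: number, maxHeight: number): Size {
  const scale = Math.min(1, maxWidth / size.width, maxHeight / size.height);
  return {
    width: Math.round(size.width * scale),
    height: Math.round(size.height * scale),
  };
}

// Rule-of-thumb token estimate (assumption): width × height / 750.
function estimateImageTokens(size: Size): number {
  return Math.ceil((size.width * size.height) / 750);
}

const resized = fitInside({ width: 1080, height: 1920 }, 1024, 1024);
console.log(resized, estimateImageTokens(resized));
```

Under this estimate, a 1080x1920 photo shrunk to fit a 1024-pixel box costs roughly half of the 1,500 tokens assumed earlier.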