AI Claude Code WhatsApp Telegram BuildInPublic

I Taught My AI Agent to See Images

How I built a full image vision pipeline for my WhatsApp and Telegram bot — from downloading encrypted media to feeding base64 to Claude's multimodal API.

8 min read
I Taught My AI Agent to See Images

The Problem Was Obvious

Noru has been my AI agent for a few weeks now. It lives on a server, connected to WhatsApp and Telegram. I can text it and get Claude-powered responses. Cool.

But I’d send a screenshot and get silence. Or worse, a confused response about “the image you mentioned.” There was no image processing. Every photo I sent was just metadata. A notification that something arrived, with no way to actually look at it.

The whole system was text-only. In 2026, with multimodal models that can read handwriting off a napkin photo, my bot couldn’t tell you what color a button was.

The Pipeline

The fix wasn’t one change. It was a pipeline that touches almost every layer of the system. Here’s what had to happen:

  1. WhatsApp/Telegram receives an image message
  2. Download the actual image bytes
  3. Save to disk, organized by group
  4. Store the file path in SQLite alongside the message
  5. When the agent wakes up, read the files and base64-encode them
  6. Pass them to Claude as ImageBlockParam content blocks
  7. Claude sees the image. Responds about it.

Seven steps. Most of them were straightforward. A few weren’t.

Downloading from WhatsApp

WhatsApp encrypts media. You can’t just fetch a URL. The Baileys library (which handles WhatsApp Web protocol) gives you downloadMediaMessage. Pass it the raw message object, get back a Buffer.

import { downloadMediaMessage } from '@whiskeysockets/baileys';

const buffer = await downloadMediaMessage(msg, 'buffer', {});

That’s it. One line to decrypt and download. The file lands in memory as raw bytes. I save it to data/media/{groupFolder}/{messageId}.{ext} and move on.

I capped it at 5MB. Claude can handle larger images but there’s no reason to let someone send a 20MB raw photo through a bot pipeline. Anything over 5MB gets silently skipped.

const MAX_IMAGE_SIZE = 5 * 1024 * 1024;

if (buffer.length > MAX_IMAGE_SIZE) {
  logger.warn(`Image too large: ${buffer.length} bytes, skipping`);
  return;
}

Downloading from Telegram

Telegram is different. Files aren’t encrypted end-to-end like WhatsApp. The Bot API gives you a file_id, you call getFile to get a file_path, then fetch from Telegram’s servers.

const photo = ctx.message.photo[ctx.message.photo.length - 1];
const file = await ctx.api.getFile(photo.file_id);
const url = `https://api.telegram.org/file/bot${token}/${file.file_path}`;
const response = await fetch(url);
const buffer = Buffer.from(await response.arrayBuffer());

The photo array is sorted by size. Last element is highest resolution. Same 5MB cap, same save-to-disk pattern.

The existing handler had been using a generic storeNonText(ctx, '[Photo]') function, so the test mocks didn’t include a photo array. First test run: Cannot read properties of undefined (reading 'length'). Null guard:

const photos = ctx.message.photo || [];
const largestPhoto = photos.length > 0 ? photos[photos.length - 1] : undefined;

The Unified Type

Both channels populate the same interface:

interface MediaAttachment {
  path: string;
  mimeType: string;
  filename: string;
}

Downstream code doesn’t know or care whether the image came from WhatsApp or Telegram. It just gets a list of MediaAttachment objects on the message. This is the part I’m happiest with. Adding a third channel later (email, Discord, whatever) means implementing one download function and stuffing the result into the same shape.

Storing in SQLite

I needed to persist media references alongside messages. The message table already existed. I didn’t want to create a separate media table with foreign keys and joins. Overkill for this.

Instead: one new column.

ALTER TABLE messages ADD COLUMN media_json TEXT;

The media array gets JSON.stringify’d on write and JSON.parse’d on read. SQLite doesn’t care. The migration is wrapped in a try/catch so it’s idempotent — if the column already exists, the ALTER TABLE fails silently and we move on.

try {
  db.exec('ALTER TABLE messages ADD COLUMN media_json TEXT');
} catch {
  // column already exists
}

Is this the “right” way to do schema migrations? No. Is it a single try/catch that works every time without a migration framework? Yes. Moving on.

The Container Boundary

The agent doesn’t run in the main Node.js process. It runs in an isolated container. The main process talks to it via stdin/stdout with a JSON protocol.

Images live on the host filesystem. The container can’t read /home/ubuntu/nanoclaw/data/media/main/600.jpg. So the base64 encoding has to happen before we cross the container boundary.

interface ContainerImage {
  base64: string;
  mimeType: string;
}

function extractImages(messages: NewMessage[]): ContainerImage[] {
  const images: ContainerImage[] = [];
  for (const msg of messages) {
    if (!msg.media) continue;
    for (const attachment of msg.media) {
      const data = fs.readFileSync(attachment.path);
      images.push({
        base64: data.toString('base64'),
        mimeType: attachment.mimeType,
      });
    }
  }
  return images;
}

Raw files stay on disk. Base64 encoding happens at the last possible moment, right before the container call. If I ever need to resize or reprocess images, the originals are still there.

Feeding Claude

Inside the container, the agent-runner receives images as part of its input JSON. Claude’s SDK accepts ContentBlockParam[], an array that can mix text and image blocks.

type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'image'; source: { type: 'base64'; media_type: string; data: string } };

The message stream builds the content array:

push(text: string, images?: ContainerImage[]) {
  if (images && images.length > 0) {
    const content: ContentBlock[] = images.map(img => ({
      type: 'image',
      source: { type: 'base64', media_type: img.mimeType, data: img.base64 },
    }));
    content.push({ type: 'text', text });
    // push as multimodal message
  } else {
    // push as plain text
  }
}

Images go first, text after. Claude sees the images, reads the text, responds about both. Images are only sent on the first query in a conversation. If the agent loops (tool calls, follow-ups), subsequent queries are text-only. No point re-sending the same photos.

The Bug That Wasn’t Mine

The whole implementation was planned in Claude Code’s plan mode. When I finished the plan and tried to exit, the ExitPlanMode command hung. Called it three times. Same result every time.

Turns out it’s a known bug in Claude Code’s TypeScript SDK (issue #4251). The workaround? Just ignore plan mode and make edits anyway. The guard rail was broken but the editing tools still worked fine.

I edited 8 files while technically stuck in plan mode. Everything compiled. The door was locked. The windows were open.

Testing It

Built both TypeScript projects. 358 out of 360 tests passed. The two failures were pre-existing — something in the container runtime tests that’s been failing since before I touched anything.

Restarted the service. Sent a photo through Telegram — a screenshot of a React Native app showing a baby milestone tracker.

Noru described it: the app layout, the milestone count (2 out of 40), the baby’s age (9 months), the UI elements. First try. No tweaks needed.

Cleanup

Images pile up. A bot that’s been running for months would have thousands of photos in data/media/. I added a cleanup function that runs on startup. Anything older than 7 days gets deleted. The paths in SQLite go stale but by that point the messages are old enough that nobody’s re-querying them.

What I’d Do Differently

The media_json column works but it’s a bit ugly. If I were starting from scratch, I’d probably use a separate media table with a message ID foreign key. The JSON stringify/parse is fine for now but it means you can’t query by media type or do any SQL operations on the media data directly.

Also, I’d add image resizing before base64 encoding. Right now a 4MB photo gets base64’d into ~5.3MB of text in the JSON payload. That’s wasteful. A quick sharp/jimp resize to 1024px wide would cut that significantly without losing detail Claude needs.

But both of those are optimizations for later. The pipeline works. The agent can see.


Week 3 of building with Noru. Previous: We Audited Our App Before Apple Could Reject It. @itskritix