Cursor, Claude Code, Codex — they can read your codebase, write code, run shell commands, and even browse the web. But hand them a video file and they’re blind.
This isn’t a minor gap. Screen recordings of bugs, product demo videos, YouTube tutorials you want to reference while coding — video is everywhere in modern development workflows. Yet when you drop a .mp4 into a conversation, your AI agent has no idea what to do with it.
I ran into this exact limitation and went looking for a solution. Here’s what I found, what broke, and how I fixed it.
The Problem: AI Agents Can’t See Videos
Most AI coding agents — including those powered by Claude — support image inputs natively. You can screenshot your UI bug and ask “what’s wrong here?” and get a useful answer.
But video? Nothing. The Read tool in Cursor supports JPEG, PNG, GIF, and WebP. No MP4. No MOV. No video URLs. If you ask an agent to “watch this video,” it’ll politely tell you it can’t.
The workaround people suggest is extracting frames with ffmpeg and feeding them as images. That works for visual-only content, but you lose audio, context, and temporal understanding. A series of screenshots doesn’t tell you what someone said or in what order things happened.
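To make that workaround concrete, here is a minimal sketch of driving ffmpeg from Python — the file names, output pattern, and one-frame-per-second rate are illustrative, not from the skill itself:

```python
import os
import shutil
import subprocess

def frame_extraction_cmd(video: str, out_dir: str, fps: int = 1) -> list:
    """Build an ffmpeg command that samples `fps` frames per second as PNGs."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps={fps}",
        os.path.join(out_dir, "frame_%04d.png"),
    ]

cmd = frame_extraction_cmd("bug-recording.mp4", "frames")
# Run only if ffmpeg and the input file are actually present
if shutil.which("ffmpeg") and os.path.exists("bug-recording.mp4"):
    subprocess.run(cmd, check=True)
```

Even when this runs cleanly, the result is a pile of silent images — which is exactly the audio and temporal context gap described above.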
The Solution: The video-understand Skill
The open agent skills ecosystem (via skills.sh) has a growing collection of skills that extend what agents can do. I found one that solves the video problem elegantly.
video-understand by jrusso1020 is a multi-provider video understanding skill that gives AI agents the ability to analyze videos — both visual content and audio.
The clever part: it doesn’t try to make the agent itself process video. Instead, it routes the video to external models that can handle it natively (like Google’s Gemini), and returns the structured analysis as text that the agent can work with.
How It Works
The skill includes Python scripts that:
- Auto-detect available providers based on which API keys you have set
- Upload and process video through the best available provider
- Return structured JSON with the analysis, transcript, and metadata
It supports 9 providers with automatic fallback:
| Priority | Provider | What It Does | Cost |
|---|---|---|---|
| 1 | Gemini (Google AI Studio) | Full video understanding — visual + audio | Free tier available |
| 2 | Vertex AI | Same as Gemini, enterprise tier | Pay-as-you-go |
| 3 | OpenRouter | Routes to Gemini models | Free tier available |
| 4 | FFMPEG + Whisper | Frame extraction + audio transcription | Free, runs locally |
| 5–9 | OpenAI, AssemblyAI, Deepgram, Groq, Local Whisper | Audio transcription only | Varies |
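The auto-detection logic can be pictured as a walk down that priority list, checking which prerequisites are present. This is a simplified sketch, not the skill's actual code — the Gemini and OpenRouter env-var names match the real keys, while the other checks and provider identifiers are illustrative:

```python
import os
import shutil

# Priority order mirrors the table above
PROVIDER_CHECKS = [
    ("gemini",         lambda: bool(os.environ.get("GEMINI_API_KEY"))),
    ("vertex",         lambda: bool(os.environ.get("GOOGLE_CLOUD_PROJECT"))),
    ("openrouter",     lambda: bool(os.environ.get("OPENROUTER_API_KEY"))),
    ("ffmpeg_whisper", lambda: shutil.which("ffmpeg") is not None),
]

def detect_provider():
    """Return the first provider whose prerequisites are satisfied, else None."""
    for name, available in PROVIDER_CHECKS:
        if available():
            return name
    return None
```

The fallback behavior comes for free: if no Gemini key is set, the walk simply continues down the list until something local like ffmpeg is found.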
You can also pass custom prompts — ask specific questions about the video, request timestamps, or extract particular information.
Installation
```shell
npx skills add jrusso1020/video-understand-skills@video-understand -g -y
```
Set up at least one provider. The simplest is Gemini:
- Go to aistudio.google.com
- Click Get API Key → Create API Key
- Add to your shell config:
```shell
echo 'export GEMINI_API_KEY="your-key-here"' >> ~/.zshrc
source ~/.zshrc
```
Install the Python SDK and CLI tools:
```shell
pip install google-genai
brew install ffmpeg yt-dlp
```
Verify everything works:
```shell
python3 ~/.agents/skills/video-understand/scripts/check_providers.py
```
Usage
Process a local video:
```shell
python3 ~/.agents/skills/video-understand/scripts/process_video.py /path/to/video.mp4 \
  -p "Describe what happens in this video"
```
Process a YouTube video (download first, then analyze):
```shell
yt-dlp -f "best[ext=mp4]" -o /tmp/video.mp4 "https://youtube.com/watch?v=..."
python3 ~/.agents/skills/video-understand/scripts/process_video.py /tmp/video.mp4 \
  -p "Summarize the key points"
```
The output is clean JSON:
```json
{
  "source": {
    "type": "local",
    "path": "/tmp/video.mp4",
    "duration_seconds": 19.13,
    "size_mb": 0.3
  },
  "provider": "gemini",
  "model": "gemini-3-flash-preview",
  "capability": "full_video",
  "response": "The video shows a young man standing in front of..."
}
```
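Because the skill emits plain JSON, any downstream tooling (or the agent itself) can consume the result directly with the standard library. The field names below are taken from the sample output above; the string is just a stand-in for the script's real stdout:

```python
import json

# Sample output shaped like the skill's response above
raw = """{
  "source": {"type": "local", "path": "/tmp/video.mp4",
             "duration_seconds": 19.13, "size_mb": 0.3},
  "provider": "gemini",
  "model": "gemini-3-flash-preview",
  "capability": "full_video",
  "response": "The video shows a young man standing in front of..."
}"""

analysis = json.loads(raw)
# One-line summary an agent could fold back into its context
summary = f'[{analysis["provider"]}] {analysis["response"]}'
```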
The Bug: Deprecated SDK Breaks Everything
Here’s where things got interesting. I installed the skill, set my Gemini API key, and ran the test. It failed immediately:
```
googleapiclient.errors.HttpError: <HttpError 400 when requesting
https://generativelanguage.googleapis.com/upload/v1beta/files?...
returned "API key expired. Please renew the API key.">
```
My key was brand new. I had just generated it 30 seconds ago.
The real issue was buried in a warning that appeared before the error:
```
FutureWarning: All support for the `google.generativeai` package has ended.
It will no longer be receiving updates or bug fixes.
Please switch to the `google.genai` package as soon as possible.
```
The skill was using google-generativeai — the old, deprecated Python SDK for Gemini. Google has fully sunset this package and replaced it with google-genai. The old package’s file upload API no longer works with current API keys, producing a misleading “API key expired” error even with valid keys.
The Fix
The core change was in the process_with_gemini() function. Here’s what the old code looked like:
```python
# Old — broken (deprecated SDK)
import google.generativeai as genai

genai.configure(api_key=api_key)
genai_model = genai.GenerativeModel(model_name)
video_file = genai.upload_file(source)
response = genai_model.generate_content([prompt, video_file])
```
And the updated version using the new SDK:
```python
# New — working (current SDK)
from google import genai
from google.genai import types

client = genai.Client(api_key=api_key)
video_file = client.files.upload(file=source)
response = client.models.generate_content(
    model=model_name,
    contents=[
        types.Content(
            parts=[
                types.Part.from_uri(
                    file_uri=video_file.uri,
                    mime_type=video_file.mime_type,
                ),
                types.Part.from_text(text=prompt),
            ]
        )
    ],
)
```
The new google.genai SDK uses a Client-based architecture instead of the old module-level configuration. Content is constructed with typed Part objects rather than raw dicts.
I updated all 7 files — the main script, setup checker, SKILL.md, README, requirements.txt, and the reference docs — to use the new SDK throughout.
The Forked Repo
I’ve submitted a PR to the original repo with the fix. Until that’s merged, you can install directly from my fork, which carries the fix on its main branch:
```shell
npx skills add sarvesh-ghl/video-understand-skills@video-understand -g -y
```
Forked repo: github.com/sarvesh-ghl/video-understand-skills
Testing It
To verify it works, I tested with the first video ever uploaded to YouTube — “Me at the zoo” by Jawed Karim:
```shell
yt-dlp -f "worst[ext=mp4]" -o /tmp/test.mp4 "https://www.youtube.com/watch?v=jNQXAC9IVRw"
python3 ~/.agents/skills/video-understand/scripts/process_video.py /tmp/test.mp4 \
  -p "What is happening in this video? Who is the person?"
```
Gemini’s response:
“The man in this video is Jawed Karim, one of the co-founders of YouTube. In the video, Karim is standing in front of two elephants at the San Diego Zoo. He’s talking about how cool the elephants are, specifically pointing out their ‘really, really, really long trunks.’ This video, titled ‘Me at the zoo,’ was the first video ever uploaded to YouTube.”
Full video understanding — visual identification, audio transcription, and even historical context — all from an AI agent that couldn’t process video 20 minutes earlier.
Why This Matters
Video is becoming a primary medium for technical communication. Screen recordings for bug reports. Loom videos for async standups. YouTube tutorials for onboarding. Product demos for stakeholders.
If your AI agent can’t process video, it’s missing a significant chunk of the context it needs to be genuinely useful. This skill bridges that gap — not perfectly, not natively, but practically.
The open skills ecosystem is what makes this possible. Someone built a skill, shared it publicly, and now any agent — Cursor, Claude Code, Codex, Gemini CLI — can understand video. When the SDK broke, I fixed it and contributed back. That’s how open source is supposed to work.
Credits:
- video-understand skill by jrusso1020 — the original author who built the multi-provider video understanding system
- skills.sh — the open agent skills ecosystem where these extensions are discovered and shared
- Google Gemini — the underlying vision model that makes full video understanding possible
Links:
- Original skill: github.com/jrusso1020/video-understand-skills
- My fork (with SDK fix): github.com/sarvesh-ghl/video-understand-skills
- PR with the fix: github.com/jrusso1020/video-understand-skills/pull/1
- Skills ecosystem: skills.sh
