Vision Eval — AI That Checks AI (Using Gemini Vision to QA AI-Generated Images)
AI Workflow · March 22, 2026 · 4 min read


We generate 20-30 AI images daily but never QA them — covers miss safe zones, images too dark, text gets blocked. We built vision-eval.py with Gemini Vision: 8 criteria, scored /80, 3 presets, compare mode.

Tor Supakit

AI × Digital Marketing Agency

The Problem: Generating 20-30 Images a Day, QA-ing Zero

Our content pipeline generates a lot of AI images — single images, carousel backgrounds, manga scenes, video covers. Easily 20-30 per day.

But for the past two months, nobody ever "checked" a single one.

The result:

  • Cover images with text outside the safe zone — cropped on feed
  • Images too dark — text overlay unreadable
  • Text on images half-blocked by other elements
  • Manga scenes beautifully drawn but poor composition for vertical scroll
  • Colors that don't match DopeLab's brand identity

Everything looked fine at generation time. But the moment we posted and opened it on mobile — problems everywhere.

The issue is that "human eyes" reviewing each image doesn't scale.

The Solution: Let AI Check AI with Gemini Vision

The concept is dead simple: if AI can generate images, AI should be able to check them too.

We built vision-eval.py — a Python script that sends images to the Gemini Vision API for quality analysis against defined criteria, returning scores.

Why Gemini?

  • Excellent Vision API — in our testing, Gemini 2.0 Flash understands composition, text readability, and visual hierarchy better than GPT-4V for social media evaluation
  • Cheap — Flash model pricing is very low. Evaluating 30 images a day costs pennies
  • Fast — Responses come back in 2-3 seconds per image

8 Scoring Criteria (Social Preset)

We designed 8 criteria specifically for social media content. Each criterion is scored out of 10, totaling 80:

  1. Scroll-Stopping (/10): Does it stop the thumb scroll? Bold colors? Compelling composition?
  2. Visual Hierarchy (/10): Clear focal point? Where does the eye go first?
  3. Text Readability (/10): Can you read text on the image? Enough contrast? Font size OK?
  4. Mobile-Friendly (/10): Does it look good on small screens? No tiny elements?
  5. Brand Consistency (/10): Do colors, fonts, and style match brand identity?
  6. Emotional Impact (/10): What emotion does it trigger? Strong enough?
  7. CTA Clarity (/10): Is the CTA clear and visible?
  8. Composition (/10): Rule of thirds? White space? Balance?

Total score is 80 — minimum passing score is 50/80.
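In code, the pass/fail decision is a small function over the scores Gemini returns. A minimal sketch, assuming the JSON shape the prompt requests later in this post (`scores`, `total`); the `verdict` helper is ours, not part of vision-eval.py:

```python
# Minimal sketch: turn Gemini's JSON evaluation into a PASS/FAIL verdict.
# Field names ("scores", "total") follow the JSON shape requested in the prompt.

PASS_THRESHOLD = 50  # minimum passing score out of 80

def verdict(result: dict) -> str:
    """Return 'PASS' or 'FAIL' for an 8-criteria /80 evaluation."""
    total = result["total"] if "total" in result else sum(result["scores"])
    return "PASS" if total >= PASS_THRESHOLD else "FAIL"
```

For example, `verdict({"scores": [8, 7, 6, 7, 8, 7, 8, 8], "total": 59})` passes, while an all-4s evaluation (32/80) fails.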

3 Presets for Different Use Cases

Not every image should be judged by the same criteria. We built 3 presets:

Preset 1: --eval social (Default)

The 8 criteria above, scored /80. Best for single images, carousels, and ads.

python vision-eval.py image.png --eval social

Preset 2: --eval thumbnail

12 Yes/No criteria following the Karpathy Pattern — no scores, just PASS/FAIL. Designed for video thumbnails that need to drive clicks.

python vision-eval.py thumbnail.png --eval thumbnail

12 Yes/No criteria include:

  • Does it contain a face/person?
  • Is text readable at small size?
  • Does it have high contrast?
  • Is it free from clutter?
  • ... and 8 more checks

Preset 3: --eval artwork

8 criteria scored /80 like social, but calibrated for manga/illustration — emphasizing artistic quality, character design, and scene composition over marketing effectiveness.

python vision-eval.py manga-scene.png --eval artwork
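Internally, the three presets boil down to a small registry keyed by name. A sketch under stated assumptions: the preset names and criteria counts come from this post, while the artwork pass threshold mirroring social's 50/80 is our assumption:

```python
# Sketch of a preset registry for --eval. Names and criteria counts are from
# the article; the artwork threshold matching social's 50/80 is an assumption.

PRESETS = {
    "social":    {"type": "scored",    "criteria": 8,  "max": 80, "pass": 50},
    "artwork":   {"type": "scored",    "criteria": 8,  "max": 80, "pass": 50},
    "thumbnail": {"type": "checklist", "criteria": 12},  # Yes/No per check
}

def select_preset(name: str) -> dict:
    """Look up a preset, failing loudly on a typo in --eval."""
    if name not in PRESETS:
        raise ValueError(f"unknown preset: {name!r} (choose from {sorted(PRESETS)})")
    return PRESETS[name]
```

Keeping thresholds in one table means adding a fourth preset is a one-line change rather than another branch in the evaluation code.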

--compare: Pit Two Images Against Each Other

Sometimes we generate 2-3 versions of the same image and can't decide which is better.

python vision-eval.py imageA.png --compare imageB.png

Gemini analyzes both images, scores them independently, picks a winner, and explains why:

═══ Vision Eval: Compare Mode ═══

Image A: cover-v1.png
  Score: 62/80
  Weaknesses: text readability (6/10), brand consistency (5/10)

Image B: cover-v2.png
  Score: 71/80
  Strengths: scroll-stopping (9/10), composition (9/10)

Winner: Image B (+9 points)
  Reason: Higher contrast text, better visual hierarchy,
  stronger brand color usage

No need to ask a human anymore — AI decides. (Though when scores are close, we still eyeball it.)
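The winner-selection logic itself is trivial once both images have independent totals. A sketch; the "close enough to eyeball" margin of 3 points is our assumption, not a documented parameter of the script:

```python
# Sketch of compare-mode winner selection from two independent /80 totals.
# The 3-point "too close, eyeball it" margin is an assumption.

def pick_winner(score_a: int, score_b: int, margin: int = 3):
    """Return ('A' | 'B' | 'tie', point difference). A tie means a human decides."""
    diff = score_a - score_b
    if abs(diff) < margin:
        return ("tie", abs(diff))
    return ("A" if diff > 0 else "B", abs(diff))
```

With the example above, `pick_winner(62, 71)` returns `("B", 9)`, matching the compare-mode output.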

Real Results: Cover DL-109 from FAIL to PASS in 3 Rounds

A real case from yesterday:

Round 1: Generated cover for DL-109 → vision-eval scored 43/80 (FAIL)

  • Text readability: 4/10 — text blending into background
  • Mobile-friendly: 5/10 — elements too small

Round 2: Re-generated with better contrast + larger text → 51/80 (PASS... but borderline)

  • Scroll-stopping: 5/10 — still not compelling enough

Round 3: Complete re-gen with new composition → 69/80 (PASS!)

  • Every criterion >= 7/10

Without vision-eval, we would have posted the round 1 image and wondered why engagement was low.

Aggregate Results Across the Pipeline:

Type            Images   Average Score   Status
Single images   12       67/80           All passed
Manga scenes    48       28/30           All passed
Covers          8        63/80           3 re-generated
Thumbnails      8        11/12 PASS      1 re-generated

The Validation Loop: gen → eval → fail → re-gen → pass

The key isn't just "checking" — it's the feedback loop back to regeneration:

generate image
    ↓
vision-eval (score it)
    ↓
score >= 50/80? ──── YES → good to go → continue
    ↓ NO
analyze weaknesses
    ↓
re-generate (fix weak points)
    ↓
vision-eval again
    ↓
(loop until pass — max 3 attempts)

If 3 attempts still fail → flag for human review, because the issue might be beyond what prompt/model tuning can fix.
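The loop above can be sketched in a few lines. Here `generate` and `evaluate` are stand-ins for the image model and vision-eval; feeding the weaknesses back into the next generation prompt is the part that makes the loop converge:

```python
# Sketch of the gen -> eval -> re-gen loop with the 3-attempt cap.
# `generate` and `evaluate` are stand-ins for the image model and vision-eval.

def validate_loop(generate, evaluate, threshold=50, max_attempts=3):
    """Regenerate until the score passes; flag for human review after 3 tries."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        image = generate(feedback)            # feedback steers the re-gen prompt
        result = evaluate(image)              # e.g. Gemini vision-eval, scored /80
        if result["total"] >= threshold:
            return {"status": "pass", "attempts": attempt, "score": result["total"]}
        feedback = result.get("weaknesses")   # feed weak points back into re-gen
    return {"status": "needs_human_review", "attempts": max_attempts}
```

With the DL-109 scores (43, then 51), this returns a pass on attempt 2; three straight failures return `needs_human_review` instead of looping forever.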

Under the Hood: The Prompt Sent to Gemini

Nothing complicated — just a structured prompt telling Gemini to analyze against the criteria:

import json
import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

# Plain string — nothing is interpolated, so no f-prefix or escaped braces needed
prompt = """
Analyze this social media image for quality.
Score each criterion 1-10:

1. Scroll-Stopping Power
2. Visual Hierarchy
3. Text Readability
4. Mobile-Friendly
5. Brand Consistency
6. Emotional Impact
7. CTA Clarity
8. Composition

Return JSON: {"scores": [...], "total": N, "weaknesses": [...], "suggestion": "..."}
"""

image = Image.open("image.png")
response = model.generate_content([prompt, image])
result = json.loads(response.text)  # assumes a bare-JSON reply; see caveats below

Send the image as bytes alongside the prompt → Gemini responds with JSON → parse and display.

Things to watch out for:

  • Must use structured output — without specifying format, Gemini returns long prose that's impossible to parse
  • Must calibrate — first time we used it, Gemini scored everything 70+. Had to add "be strict, 8+ means exceptional" to the prompt
  • Must cache the prompt — sending 30 images with the same prompt wastes tokens unnecessarily
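On the structured-output point: even with "Return JSON" in the prompt, the model sometimes wraps the payload in markdown fences. A small defensive parser (our addition, not part of vision-eval.py) keeps one stray fence from killing a 30-image run:

```python
import json
import re

def extract_json(text: str) -> dict:
    """Parse a model reply, tolerating ```json ... ``` fences around the payload."""
    match = re.search(r"\{.*\}", text, re.DOTALL)  # grab the outermost JSON object
    if not match:
        raise ValueError("no JSON object found in model response")
    return json.loads(match.group(0))
```

So `extract_json('```json\n{"total": 67}\n```')` and a bare `'{"total": 67}'` both parse to the same dict.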

Vision Eval Inside the Validation Harness

Vision-eval doesn't work alone — it's one of 32 checks inside content-harness.py:

Phase 7: Quality Gates
  ✅ Vision eval: single image 67/80 (>= 50)
  ✅ Cover passes safe zone analysis
  ✅ Manga scenes avg 28/30 (>= 25)

If vision-eval fails → harness fails → content cannot be marked as "done."
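A quality gate like Phase 7 reduces to "every check meets its minimum." A sketch; the check names and thresholds echo the harness output above, but the function itself is our illustration:

```python
# Sketch of a harness quality gate: each check maps name -> (actual, minimum),
# and one failure blocks the whole phase. Names/thresholds echo Phase 7 above.

def quality_gate(checks: dict[str, tuple[float, float]]) -> bool:
    """Return True only if every check meets or beats its minimum."""
    return all(actual >= minimum for actual, minimum in checks.values())
```

With the Phase 7 numbers, `quality_gate({"single_image": (67, 50), "manga_avg": (28, 25)})` is True; drop any score below its floor and the content stays unmarked.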

This is AI checking AI in a form that actually works in production — not just a concept.

Key Takeaways

Before                                               After
Generate and post immediately                        Generate → eval → pass → then post
Don't know there's a problem until low engagement    Know problems instantly, before posting
Pick images by eye (biased + tiring)                 Compare with --compare (objective)
1 version per image                                  1-3 versions → pick the best

Remember:

  • Generating images is easy. Checking them matters more.
  • Gemini Vision is cheap and fast enough to check every image, every day
  • 8 criteria /80 — 50 is minimum pass, we aim for 65+
  • The validation loop (gen → eval → re-gen) is the pattern that actually improves quality
  • If AI can generate it, AI should be able to check it — don't rely only on human eyes
Tags: vision-eval, gemini, image-quality, validation, ai-tools