Transcriptor: From a Meeting Recording to Structured Minutes, Psychological Analysis, and an AI Retrospective Chat
Gonzalo Monzón
Founder & Lead Architect
What happens after a meeting ends? Usually, nothing. Maybe someone sends a partial summary. Action items get forgotten. That one crucial decision gets attributed to the wrong person a week later. We built Transcriptor to fix that — a tool that takes a raw audio recording and outputs structured minutes, per-participant psychological analysis, a retrospective AI chat, summary images, and narrated conclusions. All powered by ElevenLabs Scribe, Gemini 2.5, and FLUX, running in 4,500 lines of zero-dependency vanilla JavaScript.
The Pipeline: Audio In, Documentation Out
Transcriptor runs a 7-stage pipeline on every meeting recording:
| Stage | What It Does | Powered By |
|---|---|---|
| 1. Transcription | Speech-to-text with speaker diarization — who said what, when | ElevenLabs Scribe |
| 2. Structured Minutes | Formal meeting minutes: attendees, topics, decisions, action items with owners and deadlines | Gemini 2.5 |
| 3. Psychological Analysis | Communication style, participation level, emotional tone, and influence analysis per participant | Gemini 2.5 |
| 4. Conclusions Module | Key takeaways, identified risks, detected opportunities | Gemini 2.5 |
| 5. Retrospective Chat | WhatsApp-style interface to ask anything about the meeting — AI responds with full context | Gemini 2.5 |
| 6. Summary Image | AI-generated visual representation of meeting key points | FLUX |
| 7. TTS Narration | Audio narration of the executive summary with smart chunking for long texts | ElevenLabs |
Upload an audio file. Get a complete meeting documentation package. Every stage feeds into the next — the transcription feeds the minutes, the minutes feed the analysis, everything feeds the retrospective chat.
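That stage-to-stage flow can be sketched as a small sequential runner where each stage reads the accumulated context and writes its output back into it. This is an illustrative sketch, not Transcriptor's actual internals; stage names and signatures are assumptions:

```javascript
// Minimal sequential pipeline: each stage sees everything produced
// before it, and its output is added to the shared context under its name.
async function runPipeline(stages, initialContext) {
  let context = { ...initialContext };
  for (const { name, run } of stages) {
    const output = await run(context); // a stage can read any earlier output
    context = { ...context, [name]: output };
  }
  return context;
}

// Example wiring mirroring the first stages (bodies stubbed out):
const stages = [
  { name: 'transcription', run: async (ctx) => `transcript of ${ctx.audioFile}` },
  { name: 'minutes',       run: async (ctx) => `minutes from ${ctx.transcription}` },
  { name: 'analysis',      run: async (ctx) => `analysis of ${ctx.transcription}` },
];
```

Because every stage appends to the same context object, the final stage (the retrospective chat) gets the transcript, minutes, and analysis for free.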
Stage 1: Transcription with Speaker Diarization
Getting words from audio is the easy part. Knowing who said them — that's the hard part. Transcriptor uses ElevenLabs Scribe as the primary STT engine because it handles diarization natively:
- Speaker identification — each segment is tagged with a speaker ID (Speaker 1, Speaker 2, etc.)
- Timestamps per segment — precise timing for every utterance
- Multi-language — Spanish, English, and more
- Whisper fallback — OpenAI Whisper handles languages or edge cases Scribe doesn't support
The diarization quality matters because everything downstream depends on it. When the psychological analysis says "Speaker 2 dominated the conversation," it needs to actually be Speaker 2.
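A minimal sketch of the Scribe call and a helper that collapses word-level diarization into per-speaker segments. The endpoint, form fields, and response shape follow the ElevenLabs speech-to-text docs as we understand them; treat them as assumptions, and the grouping logic as illustrative:

```javascript
// Sketch of a Scribe transcription request (field names assumed from the
// ElevenLabs speech-to-text API; verify against current docs).
async function transcribeWithScribe(audioFile, apiKey) {
  const form = new FormData();
  form.append('file', audioFile);
  form.append('model_id', 'scribe_v1'); // Scribe model
  form.append('diarize', 'true');       // enable native speaker diarization
  const res = await fetch('https://api.elevenlabs.io/v1/speech-to-text', {
    method: 'POST',
    headers: { 'xi-api-key': apiKey },
    body: form,
  });
  if (!res.ok) throw new Error(`Scribe request failed: ${res.status}`);
  return res.json(); // assumed shape: { text, words: [{ text, start, end, speaker_id }] }
}

// Collapse consecutive same-speaker words into timestamped segments —
// the "who said what, when" view the rest of the pipeline consumes.
function groupBySpeaker(words) {
  const segments = [];
  for (const w of words) {
    const last = segments[segments.length - 1];
    if (last && last.speaker === w.speaker_id) {
      last.text += ' ' + w.text;
      last.end = w.end;
    } else {
      segments.push({ speaker: w.speaker_id, text: w.text, start: w.start, end: w.end });
    }
  }
  return segments;
}
```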
Stage 2: Structured Minutes via AI
Gemini 2.5 takes the full diarized transcription and outputs formal meeting minutes:
- Attendees — identified from speech patterns and mentions, with participation level
- Topics discussed — automatically grouped and numbered
- Decisions made — extracted from conversational context ("So we agreed to...")
- Action items — task, responsible person, deadline — pulled from natural conversation
- Next steps — follow-up meeting dates, pending reviews
The minutes are editable. If the AI attributes a decision to the wrong person, you fix it inline. But in practice, with good diarization, the accuracy is surprisingly high.
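One way to force strictly structured minutes out of Gemini is a JSON response schema, shown below; the production pipeline uses function calling, which is a sibling mechanism with the same goal. All field names here are illustrative, not Transcriptor's actual schema:

```javascript
// Build a generateContent request body that constrains Gemini's output
// to valid minutes JSON via a response schema (schema fields illustrative).
function buildMinutesRequest(diarizedTranscript) {
  return {
    contents: [{
      role: 'user',
      parts: [{ text: `Extract formal meeting minutes from this diarized transcript:\n\n${diarizedTranscript}` }],
    }],
    generationConfig: {
      responseMimeType: 'application/json',
      responseSchema: {
        type: 'object',
        properties: {
          attendees: { type: 'array', items: { type: 'string' } },
          topics:    { type: 'array', items: { type: 'string' } },
          decisions: { type: 'array', items: { type: 'string' } },
          actionItems: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                task:     { type: 'string' },
                owner:    { type: 'string' },
                deadline: { type: 'string' },
              },
            },
          },
        },
        required: ['attendees', 'topics', 'decisions', 'actionItems'],
      },
    },
  };
}
```

Constraining the output this way means the UI can render attendees, decisions, and action items directly, with no brittle parsing of free-form model text.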
Stage 3: Psychological Analysis
This is the feature that makes team leads lean forward. For each participant, Gemini analyzes:
| Dimension | What's Measured |
|---|---|
| Communication Style | Direct, collaborative, passive, dominant — how does this person express ideas? |
| Participation Level | % of speaking time, frequency of interventions, initiative vs. reactive |
| Emotional Tone | Positive, neutral, negative, anxious, enthusiastic — per topic and overall |
| Interactions | Who responds to whom, alliances, tensions, who gets interrupted |
| Influence | Who generates most agreement/disagreement, who shifts the conversation direction |
The psychological module doesn't diagnose — it surfaces patterns. A manager might discover that a quieter team member actually introduces the ideas that get adopted, they just don't fight for credit. Or that two people consistently talk past each other on project timeline topics. These patterns are invisible in real-time but become obvious when the AI maps them out.
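The participation dimension is the most mechanical of the five: it falls straight out of the diarized segments before any LLM is involved. A sketch of those raw numbers, assuming segments shaped as `{ speaker, start, end }`:

```javascript
// Per-speaker participation metrics from diarized segments: share of
// speaking time and number of interventions (segment shape assumed).
function participationStats(segments) {
  const totals = new Map();
  let meetingTime = 0;
  for (const s of segments) {
    const dur = s.end - s.start;
    meetingTime += dur;
    const t = totals.get(s.speaker) ?? { speakingTime: 0, interventions: 0 };
    t.speakingTime += dur;
    t.interventions += 1;
    totals.set(s.speaker, t);
  }
  return [...totals].map(([speaker, t]) => ({
    speaker,
    interventions: t.interventions,
    sharePct: Math.round((t.speakingTime / meetingTime) * 100),
  }));
}
```

Feeding these hard numbers to the model alongside the transcript keeps qualitative claims like "Speaker 2 dominated the conversation" anchored to measurable speaking time.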
Stage 5: The Retrospective Chat
A WhatsApp-style chat interface where you can ask anything about the meeting after it's over:
- "What did María say about the budget?"
- "Was any decision made about the launch?"
- "Who proposed the partnership idea?"
- "Summarize the 3 most important points"
- "What was the emotional tone during the timeline discussion?"
The AI has full context: the raw transcription, the structured minutes, and the psychological analysis. So it can answer both factual questions ("What was decided?") and analytical ones ("Was there tension between Juan and María?"). It's like having a perfect memory of every meeting you've ever had.
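Assembling that full context is mostly prompt plumbing: every artifact from the earlier stages is concatenated into the system message ahead of the user's question. A sketch with illustrative section labels:

```javascript
// Build the message list for the retrospective chat. The model sees the
// transcript, minutes, and analysis before the question, so it can answer
// both factual and analytical queries. Labels are illustrative.
function buildChatContext({ transcript, minutes, analysis }, question) {
  const system = [
    'You are a meeting assistant. Answer only from the material below.',
    '## Diarized transcript', transcript,
    '## Structured minutes', minutes,
    '## Psychological analysis', analysis,
  ].join('\n\n');
  return [
    { role: 'system', content: system },
    { role: 'user', content: question },
  ];
}
```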
Stages 6-7: Summary Image & TTS Narration
Two output formats for different consumption styles:
- Summary image — FLUX generates a visual infographic-style representation of the meeting's key points. Useful for sharing in Slack or embedding in documentation
- TTS narration — ElevenLabs narrates the executive summary with a professional voice. Smart chunking splits long summaries into manageable audio segments. Downloadable as MP3 for commute listening
The Interface: 7 Panels
| Panel | Function |
|---|---|
| Upload | Drop or select audio recording |
| Transcription | Timeline view with speakers color-coded |
| Minutes | Structured document — editable |
| Analysis | Cards per participant with metrics and insights |
| Chat | WhatsApp-style retrospective interface |
| Image | AI-generated summary visual |
| Audio | TTS narration player with download |
Technical Details
| Metric | Value |
|---|---|
| Codebase | 4,500+ lines of vanilla JavaScript — zero framework dependencies |
| STT Providers | 2 — ElevenLabs Scribe (primary), OpenAI Whisper (fallback) |
| AI for Minutes | Gemini 2.5 — structured output via function calling |
| Image Generation | FLUX via Workers AI |
| TTS | ElevenLabs — professional voice, MP3 output |
| Dependencies | Zero — vanilla JS + CSS only |
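The Scribe-primary / Whisper-fallback routing in the table above reduces to a small wrapper: try each provider in order, keep the last error in case all fail. Provider entries here are stand-ins, not real client code:

```javascript
// Try STT providers in priority order; fall back on any failure.
// The provider list and shapes are illustrative stand-ins.
async function transcribeWithFallback(audio, providers) {
  let lastError;
  for (const { name, transcribe } of providers) {
    try {
      return { provider: name, result: await transcribe(audio) };
    } catch (err) {
      lastError = err; // remember why this provider failed, try the next
    }
  }
  throw new Error(`All STT providers failed: ${lastError?.message}`);
}
```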
Key Takeaways
1. Diarization is the foundation everything else depends on. Speaker identification quality determines the accuracy of minutes, psychological analysis, and retrospective answers. ElevenLabs Scribe's native diarization was the breakthrough — previous attempts with Whisper-only pipelines required a separate diarization step that introduced errors.
2. Psychological analysis from meeting transcripts surfaces invisible team dynamics. Managers are often surprised by what the analysis reveals: who actually introduces the ideas that get adopted, who dominates unproductively, which topics create tension. These patterns are invisible in real-time but obvious when mapped by AI.
3. Retrospective chat turns meetings into searchable knowledge. The ability to ask "What did we decide about X three meetings ago?" and get an accurate answer transforms how teams track decisions. No more scrolling through Slack or searching email threads.
4. Multi-format output matches different consumption styles. Some people read minutes. Some prefer a visual summary to share. Some listen to the audio narration during their commute. Generating all formats automatically means the meeting documentation actually gets consumed.
5. 4,500 lines of vanilla JS proves frameworks aren't always the answer. No React, no Vue, no build step. The entire tool is vanilla JavaScript and CSS. For an internal tool with well-defined scope, framework overhead would add complexity without proportional benefit. Fast to build, fast to iterate, zero dependency maintenance.
About the Author
Gonzalo Monzón
Founder & Lead Architect
Gonzalo Monzón is a Senior Solutions Architect & AI Engineer with over 26 years building mission-critical systems in Healthcare, Industrial Automation, and enterprise AI. Founder of Cadences Lab, he specializes in bridging legacy infrastructure with cutting-edge technology.
Related Articles
Why We Use 7 AI Providers (Not Just One) — And How We Track Every Cent
Vendor lock-in is a trap. Here's how our AI Gateway routes 11,200+ calls/month between Gemini, GPT-4o, Claude, DeepSeek, Groq, and more — with automatic fallback, cost tracking to the cent, and a ~$184/month total AI bill across 7 providers.
From 4-Hour Response Time to Instant: How Our AI Voice Agents Make Real Phone Calls
Twilio for calls, Gemini Flash for real-time conversation, ElevenLabs for 15+ natural voices. We built AI agents that confirm appointments in 35 seconds, qualify leads with 3 questions, and switch between Spanish, English, and Catalan mid-call. Plus: God Mode lets humans supervise and intervene live.
Synapse Studio: A 2D Virtual Office Where AI Agents Do the Real Work
We built a SimTower-style animated office where AI agents with multimodal capabilities — vision, image generation, web search, iterative image evolution — collaborate on real tasks. Zero dependencies, pure Vanilla JS, running on Cloudflare.