Transcriptor: From a Meeting Recording to Structured Minutes, Psychological Analysis, and an AI Retrospective Chat
Gonzalo Monzón
Founder & Lead Architect
What happens after a meeting ends? Usually, nothing. Maybe someone sends a partial summary. Action items get forgotten. That one crucial decision gets attributed to the wrong person a week later. We built Transcriptor to fix that — a tool that takes a raw audio recording and outputs structured minutes, per-participant psychological analysis, a retrospective AI chat, summary images, and narrated conclusions. All powered by ElevenLabs Scribe, Gemini 2.5, and FLUX, running in 4,500 lines of zero-dependency vanilla JavaScript.
The Pipeline: Audio In, Documentation Out
Transcriptor runs a 7-stage pipeline on every meeting recording:
| Stage | What It Does | Powered By |
|---|---|---|
| 1. Transcription | Speech-to-text with speaker diarization — who said what, when | ElevenLabs Scribe |
| 2. Structured Minutes | Formal meeting minutes: attendees, topics, decisions, action items with owners and deadlines | Gemini 2.5 |
| 3. Psychological Analysis | Communication style, participation level, emotional tone, and influence analysis per participant | Gemini 2.5 |
| 4. Conclusions Module | Key takeaways, identified risks, detected opportunities | Gemini 2.5 |
| 5. Retrospective Chat | WhatsApp-style interface to ask anything about the meeting — AI responds with full context | Gemini 2.5 |
| 6. Summary Image | AI-generated visual representation of meeting key points | FLUX |
| 7. TTS Narration | Audio narration of the executive summary with smart chunking for long texts | ElevenLabs |
Upload an audio file. Get a complete meeting documentation package. Every stage feeds into the next — the transcription feeds the minutes, the minutes feed the analysis, everything feeds the retrospective chat.
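That stage-to-stage flow can be sketched as a small sequential runner where each stage reads the accumulated context and writes its output back into it. This is an illustrative sketch, not Transcriptor's actual internals; stage names and signatures are assumptions:

```javascript
// Minimal sequential pipeline: each stage sees everything produced
// before it, and its output is added to the shared context under its name.
async function runPipeline(stages, initialContext) {
  let context = { ...initialContext };
  for (const { name, run } of stages) {
    const output = await run(context); // a stage can read any earlier output
    context = { ...context, [name]: output };
  }
  return context;
}

// Example wiring mirroring the first stages (bodies stubbed out):
const stages = [
  { name: 'transcription', run: async (ctx) => `transcript of ${ctx.audioFile}` },
  { name: 'minutes',       run: async (ctx) => `minutes from ${ctx.transcription}` },
  { name: 'analysis',      run: async (ctx) => `analysis of ${ctx.transcription}` },
];
```

Because every stage appends to the same context object, the final stage (the retrospective chat) gets the transcript, minutes, and analysis for free.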
Stage 1: Transcription with Speaker Diarization
Getting words from audio is the easy part. Knowing who said them — that's the hard part. Transcriptor uses ElevenLabs Scribe as the primary STT engine because it handles diarization natively:
- Speaker identification — each segment is tagged with a speaker ID (Speaker 1, Speaker 2, etc.)
- Timestamps per segment — precise timing for every utterance
- Multi-language — Spanish, English, and more
- Whisper fallback — OpenAI Whisper handles languages or edge cases Scribe doesn't support
The diarization quality matters because everything downstream depends on it. When the psychological analysis says "Speaker 2 dominated the conversation," it needs to actually be Speaker 2.
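A minimal sketch of the Scribe call and a helper that collapses word-level diarization into per-speaker segments. The endpoint, form fields, and response shape follow the ElevenLabs speech-to-text docs as we understand them; treat them as assumptions, and the grouping logic as illustrative:

```javascript
// Sketch of a Scribe transcription request (field names assumed from the
// ElevenLabs speech-to-text API; verify against current docs).
async function transcribeWithScribe(audioFile, apiKey) {
  const form = new FormData();
  form.append('file', audioFile);
  form.append('model_id', 'scribe_v1'); // Scribe model
  form.append('diarize', 'true');       // enable native speaker diarization
  const res = await fetch('https://api.elevenlabs.io/v1/speech-to-text', {
    method: 'POST',
    headers: { 'xi-api-key': apiKey },
    body: form,
  });
  if (!res.ok) throw new Error(`Scribe request failed: ${res.status}`);
  return res.json(); // assumed shape: { text, words: [{ text, start, end, speaker_id }] }
}

// Collapse consecutive same-speaker words into timestamped segments —
// the "who said what, when" view the rest of the pipeline consumes.
function groupBySpeaker(words) {
  const segments = [];
  for (const w of words) {
    const last = segments[segments.length - 1];
    if (last && last.speaker === w.speaker_id) {
      last.text += ' ' + w.text;
      last.end = w.end;
    } else {
      segments.push({ speaker: w.speaker_id, text: w.text, start: w.start, end: w.end });
    }
  }
  return segments;
}
```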
Stage 2: Structured Minutes via AI
Gemini 2.5 takes the full diarized transcription and outputs formal meeting minutes:
- Attendees — identified from speech patterns and mentions, with participation level
- Topics discussed — automatically grouped and numbered
- Decisions made — extracted from conversational context ("So we agreed to...")
- Action items — task, responsible person, deadline — pulled from natural conversation
- Next steps — follow-up meeting dates, pending reviews
The minutes are editable. If the AI attributes a decision to the wrong person, you fix it inline. But in practice, with good diarization, the accuracy is surprisingly high.
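One way to force strictly structured minutes out of Gemini is a JSON response schema, shown below; the production pipeline uses function calling, which is a sibling mechanism with the same goal. All field names here are illustrative, not Transcriptor's actual schema:

```javascript
// Build a generateContent request body that constrains Gemini's output
// to valid minutes JSON via a response schema (schema fields illustrative).
function buildMinutesRequest(diarizedTranscript) {
  return {
    contents: [{
      role: 'user',
      parts: [{ text: `Extract formal meeting minutes from this diarized transcript:\n\n${diarizedTranscript}` }],
    }],
    generationConfig: {
      responseMimeType: 'application/json',
      responseSchema: {
        type: 'object',
        properties: {
          attendees: { type: 'array', items: { type: 'string' } },
          topics:    { type: 'array', items: { type: 'string' } },
          decisions: { type: 'array', items: { type: 'string' } },
          actionItems: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                task:     { type: 'string' },
                owner:    { type: 'string' },
                deadline: { type: 'string' },
              },
            },
          },
        },
        required: ['attendees', 'topics', 'decisions', 'actionItems'],
      },
    },
  };
}
```

Constraining the output this way means the UI can render attendees, decisions, and action items directly, with no brittle parsing of free-form model text.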
Stage 3: Psychological Analysis
This is the feature that makes team leads lean forward. For each participant, Gemini analyzes:
| Dimension | What's Measured |
|---|---|
| Communication Style | Direct, collaborative, passive, dominant — how does this person express ideas? |
| Participation Level | % of speaking time, frequency of interventions, initiative vs. reactive |
| Emotional Tone | Positive, neutral, negative, anxious, enthusiastic — per topic and overall |
| Interactions | Who responds to whom, alliances, tensions, who gets interrupted |
| Influence | Who generates most agreement/disagreement, who shifts the conversation direction |
The psychological module doesn't diagnose — it surfaces patterns. A manager might discover that a quieter team member actually introduces the ideas that get adopted, they just don't fight for credit. Or that two people consistently talk past each other on project timeline topics. These patterns are invisible in real-time but become obvious when the AI maps them out.
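The participation dimension is the most mechanical of the five: it falls straight out of the diarized segments before any LLM is involved. A sketch of those raw numbers, assuming segments shaped as `{ speaker, start, end }`:

```javascript
// Per-speaker participation metrics from diarized segments: share of
// speaking time and number of interventions (segment shape assumed).
function participationStats(segments) {
  const totals = new Map();
  let meetingTime = 0;
  for (const s of segments) {
    const dur = s.end - s.start;
    meetingTime += dur;
    const t = totals.get(s.speaker) ?? { speakingTime: 0, interventions: 0 };
    t.speakingTime += dur;
    t.interventions += 1;
    totals.set(s.speaker, t);
  }
  return [...totals].map(([speaker, t]) => ({
    speaker,
    interventions: t.interventions,
    sharePct: Math.round((t.speakingTime / meetingTime) * 100),
  }));
}
```

Feeding these hard numbers to the model alongside the transcript keeps qualitative claims like "Speaker 2 dominated the conversation" anchored to measurable speaking time.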
Stage 5: The Retrospective Chat
A WhatsApp-style chat interface where you can ask anything about the meeting after it's over:
- "What did María say about the budget?"
- "Was any decision made about the launch?"
- "Who proposed the partnership idea?"
- "Summarize the 3 most important points"
- "What was the emotional tone during the timeline discussion?"
The AI has full context: the raw transcription, the structured minutes, and the psychological analysis. So it can answer both factual questions ("What was decided?") and analytical ones ("Was there tension between Juan and María?"). It's like having a perfect memory of every meeting you've ever had.
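Assembling that full context is mostly prompt plumbing: every artifact from the earlier stages is concatenated into the system message ahead of the user's question. A sketch with illustrative section labels:

```javascript
// Build the message list for the retrospective chat. The model sees the
// transcript, minutes, and analysis before the question, so it can answer
// both factual and analytical queries. Labels are illustrative.
function buildChatContext({ transcript, minutes, analysis }, question) {
  const system = [
    'You are a meeting assistant. Answer only from the material below.',
    '## Diarized transcript', transcript,
    '## Structured minutes', minutes,
    '## Psychological analysis', analysis,
  ].join('\n\n');
  return [
    { role: 'system', content: system },
    { role: 'user', content: question },
  ];
}
```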
Stages 6-7: Summary Image & TTS Narration
Two output formats for different consumption styles:
- Summary image — FLUX generates a visual infographic-style representation of the meeting's key points. Useful for sharing in Slack or embedding in documentation
- TTS narration — ElevenLabs narrates the executive summary with a professional voice. Smart chunking splits long summaries into manageable audio segments. Downloadable as MP3 for commute listening
The Interface: 7 Panels
| Panel | Function |
|---|---|
| Upload | Drop or select audio recording |
| Transcription | Timeline view with speakers color-coded |
| Minutes | Structured document — editable |
| Analysis | Cards per participant with metrics and insights |
| Chat | WhatsApp-style retrospective interface |
| Image | AI-generated summary visual |
| Audio | TTS narration player with download |
Technical Details
| Metric | Value |
|---|---|
| Codebase | 4,500+ lines of vanilla JavaScript — zero framework dependencies |
| STT Providers | 2 — ElevenLabs Scribe (primary), OpenAI Whisper (fallback) |
| AI for Minutes | Gemini 2.5 — structured output via function calling |
| Image Generation | FLUX via Workers AI |
| TTS | ElevenLabs — professional voice, MP3 output |
| Dependencies | Zero — vanilla JS + CSS only |
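The Scribe-primary / Whisper-fallback routing in the table above reduces to a small wrapper: try each provider in order, keep the last error in case all fail. Provider entries here are stand-ins, not real client code:

```javascript
// Try STT providers in priority order; fall back on any failure.
// The provider list and shapes are illustrative stand-ins.
async function transcribeWithFallback(audio, providers) {
  let lastError;
  for (const { name, transcribe } of providers) {
    try {
      return { provider: name, result: await transcribe(audio) };
    } catch (err) {
      lastError = err; // remember why this provider failed, try the next
    }
  }
  throw new Error(`All STT providers failed: ${lastError?.message}`);
}
```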
Key Takeaways
1. Diarization is the foundation everything else depends on. Speaker identification quality determines the accuracy of minutes, psychological analysis, and retrospective answers. ElevenLabs Scribe's native diarization was the breakthrough — previous attempts with Whisper-only pipelines required a separate diarization step that introduced errors.
2. Psychological analysis from meeting transcripts surfaces invisible team dynamics. Managers are often surprised by what the analysis reveals: who actually introduces the ideas that get adopted, who dominates unproductively, which topics create tension. These patterns are invisible in real-time but obvious when mapped by AI.
3. Retrospective chat turns meetings into searchable knowledge. The ability to ask "What did we decide about X three meetings ago?" and get an accurate answer transforms how teams track decisions. No more scrolling through Slack or searching email threads.
4. Multi-format output matches different consumption styles. Some people read minutes. Some prefer a visual summary to share. Some listen to the audio narration during their commute. Generating all formats automatically means the meeting documentation actually gets consumed.
5. 4,500 lines of vanilla JS proves frameworks aren't always the answer. No React, no Vue, no build step. The entire tool is vanilla JavaScript and CSS. For an internal tool with well-defined scope, framework overhead would add complexity without proportional benefit. Fast to build, fast to iterate, zero dependency maintenance.
About the Author
Gonzalo Monzón
Founder & Lead Architect
Gonzalo Monzón is a Senior Solutions Architect & AI Engineer with over 26 years building mission-critical systems in Healthcare, Industrial Automation, and enterprise AI. Founder of Cadences Lab, he specializes in bridging legacy infrastructure with cutting-edge technology.
Related Articles
Why We Use 7 AI Providers (Not Just One) — And How We Track Every Cent
Vendor lock-in is a trap. Here's how our AI Gateway routes 11,200+ calls/month between Gemini, GPT-4o, Claude, DeepSeek, Groq, and more — with automatic fallback, cost tracking to the cent, and a ~$184/month total AI bill across 7 providers.
From 4-Hour Response Time to Instant: How Our AI Voice Agents Make Real Phone Calls
Twilio for calls, Gemini Flash for real-time conversation, ElevenLabs for 15+ natural voices. We built AI agents that confirm appointments in 35 seconds, qualify leads with 3 questions, and switch between Spanish, English, and Catalan mid-call. Plus: God Mode lets humans supervise and intervene live.
Synapse Studio: A 2D Virtual Office Where AI Agents Do the Real Work
We built a SimTower-style animated office where AI agents with multimodal capabilities — vision, image generation, web search, iterative image evolution — collaborate on real tasks. Zero dependencies, pure Vanilla JS, running on Cloudflare.