OpenAI GPT-4o Multimodal Updates: Voice, Vision, and Text Combined

June 2024 improvements to GPT-4o's vision make charts, dashboards, and presentations more accurately interpreted. Here is how to combine voice and vision in a single session.

June 19, 2024 3 min read

Quick Scan

What matters today

June 2024 improvements to GPT-4o's vision make charts, dashboards, and presentations more accurately interpreted. Here is how to combine voice and vision in a single session.

Format TOP UPDATE

Audience Executives using AI at work

Time 3 min read

Topic Top Update

Key points

The June 2024 Vision Improvements
Three Multimodal Use Cases with High Executive Value
Combining Voice and Vision in a Single Session
API Configuration for Vision

What You'll Learn

The specific June 2024 improvements to GPT-4o's vision capabilities
How to combine voice and vision in a single GPT-4o session
Which multimodal use cases produce the most executive value

GPT-4o's defining characteristic is that it handles voice, vision, and text as a single integrated model. The June 2024 updates improved the visual understanding layer: better accuracy on complex charts and financial dashboards, improved handling of technical diagrams, and lower latency on vision API calls for enterprise teams.

Most executives receive information in visual form: PowerPoint slides, financial dashboards, organizational charts, competitor screenshots. Until recently, AI tools required converting these visuals to text before analysis. GPT-4o processes visuals directly, which means executives can ask questions about what they see rather than what they have described in text.

This article covers the specific June improvements, the three multimodal use cases that produce the most executive value, and how to combine voice and vision in a single session.

SUBSCRIBER BREAK -- Premium Content Below

The June 2024 Vision Improvements

Improved chart and graph interpretation. GPT-4o is now more accurate at reading values from complex charts: multi-series line charts, waterfall charts, heat maps, and clustered bar charts. The June updates improved accuracy particularly on financial data visualizations.
Better handling of technical diagrams. Organizational charts, process flow diagrams, and architectural diagrams are now interpreted more accurately. GPT-4o can trace relationships and hierarchies in complex diagrams more reliably.
Lower vision API latency. The time from image upload to response decreased, relevant for enterprise teams building applications that process large volumes of visual documents.

Three Multimodal Use Cases with High Executive Value

Financial dashboard review. Upload a screenshot of your BI dashboard or financial reporting tool. Ask specific questions: "What were the three biggest variances from plan this month?" GPT-4o reads the dashboard and answers based on what it sees, saving the step of extracting data into text before analysis.
Presentation critique. Share a slide deck (screenshot each slide or use file upload for PDFs). Ask: "Critique this presentation for an executive audience. What is the clearest slide? What is the most confusing? Where does the narrative break down?"
Competitor material analysis. Upload competitor marketing materials, product screenshots, or pricing page captures and ask for structured analysis: positioning, messaging, pricing signals, and target audience cues.

Combining Voice and Vision in a Single Session

The most powerful GPT-4o use case combines both inputs via the ChatGPT mobile app:

Start a voice mode session.
When you want to analyze something visual, hold the document or screen up to the phone camera.
Say "I am showing you [describe what you are showing]" and ask your question.
GPT-4o sees the image through the camera, hears your voice, and responds verbally.

This mode is most useful for executives reviewing printed documents, physical whiteboards, or screens they cannot screenshot easily. It converts physical information into AI-analyzable input in real time.

API Configuration for Vision

Model: gpt-4o. Image formats: PNG, JPEG, GIF, or WEBP. Resolution: 512x512 is sufficient for most charts; use higher resolution for dense text documents. Token cost: each image incurs a base cost plus per-tile cost depending on resolution. A standard 1024x1024 chart costs approximately 765 tokens.

Bottom line

The useful move with OpenAI GPT-4o Multimodal Updates: Voice, Vision, and Text Combined is to run one narrow test this week, then keep only the workflow that saves time, improves a decision, or gives your team clearer output. Treat the announcement as raw material, not the win itself.

About the author

Pierre Bradshaw Founder, PromptHacker.ai

Pierre has spent 25+ years building growth systems across fintech, real estate, lending, campaigns, and AI workflows, with machine-learning work dating back to 2012.

If you have any questions or comments about OpenAI GPT-4o Multimodal Updates: Voice, Vision, and Text Combined feel free to reach out. I'd love to hear from you.

Contact Pierre