OpenAI GPT-4o Multimodal Updates: Voice, Vision, and Text Combined
June 2024 improvements to GPT-4o's vision make charts, dashboards, and presentations more accurately interpreted. Here is how to combine voice and vision in a single session.
What matters today
June 2024 improvements to GPT-4o's vision make charts, dashboards, and presentations more accurately interpreted. Here is how to combine voice and vision in a single session.
Key points
- The June 2024 Vision Improvements
- Three Multimodal Use Cases with High Executive Value
- Combining Voice and Vision in a Single Session
- API Configuration for Vision
What You'll Learn
- The specific June 2024 improvements to GPT-4o's vision capabilities
- How to combine voice and vision in a single GPT-4o session
- Which multimodal use cases produce the most executive value
GPT-4o's defining characteristic is that it handles voice, vision, and text as a single integrated model. The June 2024 updates improved the visual understanding layer: better accuracy on complex charts and financial dashboards, improved handling of technical diagrams, and lower latency on vision API calls for enterprise teams.
Most executives receive information in visual form: PowerPoint slides, financial dashboards, organizational charts, competitor screenshots. Until recently, AI tools required converting these visuals to text before analysis. GPT-4o processes visuals directly, which means executives can ask questions about what they see rather than what they have described in text.
This article covers the specific June improvements, the three multimodal use cases that produce the most executive value, and how to combine voice and vision in a single session.
SUBSCRIBER BREAK -- Premium Content Below
The June 2024 Vision Improvements
- Improved chart and graph interpretation. GPT-4o is now more accurate at reading values from complex charts: multi-series line charts, waterfall charts, heat maps, and clustered bar charts. The June updates improved accuracy particularly on financial data visualizations.
- Better handling of technical diagrams. Organizational charts, process flow diagrams, and architectural diagrams are now interpreted more accurately. GPT-4o can trace relationships and hierarchies in complex diagrams more reliably.
- Lower vision API latency. The time from image upload to response decreased, relevant for enterprise teams building applications that process large volumes of visual documents.
Three Multimodal Use Cases with High Executive Value
- Financial dashboard review. Upload a screenshot of your BI dashboard or financial reporting tool. Ask specific questions: "What were the three biggest variances from plan this month?" GPT-4o reads the dashboard and answers based on what it sees, saving the step of extracting data into text before analysis.
- Presentation critique. Share a slide deck (screenshot each slide or use file upload for PDFs). Ask: "Critique this presentation for an executive audience. What is the clearest slide? What is the most confusing? Where does the narrative break down?"
- Competitor material analysis. Upload competitor marketing materials, product screenshots, or pricing page captures and ask for structured analysis: positioning, messaging, pricing signals, and target audience cues.
Combining Voice and Vision in a Single Session
The most powerful GPT-4o use case combines both inputs via the ChatGPT mobile app:
- Start a voice mode session.
- When you want to analyze something visual, hold the document or screen up to the phone camera.
- Say "I am showing you [describe what you are showing]" and ask your question.
- GPT-4o sees the image through the camera, hears your voice, and responds verbally.
This mode is most useful for executives reviewing printed documents, physical whiteboards, or screens they cannot screenshot easily. It converts physical information into AI-analyzable input in real time.
API Configuration for Vision
Model: gpt-4o. Image formats: PNG, JPEG, GIF, or WEBP. Resolution: 512x512 is sufficient for most charts; use higher resolution for dense text documents. Token cost: each image incurs a base cost plus per-tile cost depending on resolution. A standard 1024x1024 chart costs approximately 765 tokens.
Three deep dives. Four useful moves. One email worth opening.
PromptHacker turns the AI firehose into practical next steps for work, health, family, and everything time keeps trying to steal.