GPT-5.4: 1M Context, 75% Computer Use, Three Variants - The Full Enterprise Breakdown
The full GPT-5.4 architecture: three variants, context window structure, pricing, and what changed from GPT-5.3, plus what 75% on OSWorld actually means and why it is the most enterprise-relevant benchmark in the release.
Key points
- Three Variants
- Benchmark Performance
- The 1M Context Window in Practice
- Computer Use at 75%: What OSWorld Actually Measures
- Four Enterprise Implications
What You'll Learn
- The full GPT-5.4 architecture: three variants, context window structure, pricing, and what changed from GPT-5.3
- What 75% OSWorld actually means - and why it is the most enterprise-relevant benchmark in the release
- The 1M token context window: which document workloads it changes and which it does not
- Benchmark performance across OSWorld, SWE-bench Pro, GDPval, and HealthBench
- Four enterprise implications including the variant selection problem and the HealthBench clinical positioning
## The Launch
OpenAI released GPT-5.4 on March 5, 2026. The model ships in three variants - Standard, Thinking, and Pro - with a 1 million token extended context window, native computer use, and a pricing structure that positions Standard as the default enterprise production tier.
The context window is the most consequential part of the release. Standard context is 272,000 tokens, consistent with GPT-5.3. Extended context reaches 1 million tokens, billed at 2x the per-token rate above the 272K threshold. At Standard pricing of $2.50 per million input tokens, loading a full 1M context costs approximately $5. That is the practical number: $5 per full-load synthesis call.
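The $5 figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the article does not say whether the 2x rate applies just to the tokens above 272K or to the whole request once the threshold is crossed, so both readings are modeled as explicit assumptions. The function names are hypothetical, and output tokens are ignored.

```python
# Figures taken from the article; everything else is an assumption for illustration.
STANDARD_INPUT_USD_PER_M = 2.50    # Standard input price per million tokens
STANDARD_CONTEXT_TOKENS = 272_000  # standard context threshold
EXTENDED_MULTIPLIER = 2.0          # extended-context billing multiplier


def input_cost_marginal(tokens: int) -> float:
    """Assumption A: only tokens above the threshold are billed at 2x."""
    rate = STANDARD_INPUT_USD_PER_M / 1_000_000
    base = min(tokens, STANDARD_CONTEXT_TOKENS)
    extra = max(tokens - STANDARD_CONTEXT_TOKENS, 0)
    return base * rate + extra * rate * EXTENDED_MULTIPLIER


def input_cost_whole_request(tokens: int) -> float:
    """Assumption B: crossing the threshold bills the whole request at 2x."""
    rate = STANDARD_INPUT_USD_PER_M / 1_000_000
    if tokens > STANDARD_CONTEXT_TOKENS:
        rate *= EXTENDED_MULTIPLIER
    return tokens * rate


if __name__ == "__main__":
    full_load = 1_000_000
    print(round(input_cost_marginal(full_load), 2))       # reading A
    print(round(input_cost_whole_request(full_load), 2))  # reading B
```

Under reading A a full 1M load comes to about $4.32 of input; under reading B it is $5.00. Either way, "approximately $5" holds as a planning number for input tokens alone.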
Three Variants
**GPT-5.4 Standard** is the general-purpose production tier. Priced at $2.50 per million input tokens, $15 per million output tokens. This is a 47% cost reduction versus equivalent GPT-5.3 tool-use workflows, attributed to built-in tool search routing that reduces unnecessary token generation. Existing GPT-5.3 API customers were migrated automatically at launch.
**GPT-5.4 Thinking** adds visible chain-of-thought reasoning - the model shows its reasoning steps before producing a final answer. Priced at $4 per million input tokens, $20 per million output tokens. Comparable in architecture to the o3 reasoning model but integrated natively into the GPT-5.4 base rather than deployed as a separate model family.
**GPT-5.4 Pro** is maximum capability with usage limits. Available in ChatGPT Pro and via API with rate caps. Designed for the hardest multi-step reasoning and research tasks where throughput is not the constraint.
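To make the price gap between Standard and Thinking concrete, here is a minimal sketch that prices one call under each variant. The per-million rates come from the text above. The workload size (100,000 input tokens, 5,000 output tokens) is a hypothetical example, and the sketch assumes both variants consume the same number of tokens, which may not hold in practice. Pro is left out because the article gives no price for it.

```python
# Per-million-token prices quoted in the article. Pro pricing is not stated,
# so it is deliberately omitted rather than guessed.
PRICES_USD_PER_M = {
    "Standard": {"input": 2.50, "output": 15.00},
    "Thinking": {"input": 4.00, "output": 20.00},
}


def call_cost(variant: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, ignoring any extended-context surcharge."""
    price = PRICES_USD_PER_M[variant]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 100K tokens in, 5K tokens out.
    for variant in PRICES_USD_PER_M:
        print(variant, round(call_cost(variant, 100_000, 5_000), 3))
```

For this illustrative workload the call costs about $0.33 on Standard and $0.50 on Thinking, a gap that compounds quickly at production volume.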
Benchmark Performance
OSWorld: 75%. OSWorld measures the model's ability to complete real software tasks - opening applications, navigating file systems, filling forms, operating spreadsheets, browsing the web - based on screenshot observation and input actions. OpenAI's prior best score was GPT-5.3 at 52%. Claude 3.7 Sonnet currently scores 70.3% and Gemini 2.5 Pro scores 68%. The resulting lead of roughly 5 to 7 points over the nearest competitors is the most significant capability gap in the release.
SWE-bench Pro: 57.7%. Software engineering task resolution on real-world GitHub issues. This is the hardest version of the SWE-bench evaluation, requiring end-to-end code changes across multi-file repositories with no scaffolding.
GDPval: 83%. The economic value generation benchmark, which measures automated business task completion across finance, legal, HR, and operations categories. 83% represents a 12-point improvement over GPT-5.3.
MMLU Pro: 87.4%. HumanEval: 97.1%.
HealthBench: GPT-5.4 scored in the top decile versus physician panels on 7 of 9 task categories across 10,000 de-identified clinical case vignettes. OpenAI released the HealthBench methodology alongside the model. The benchmark measures structured clinical reasoning in controlled scenarios - not clinical practice.
The 1M Context Window in Practice
The context window shift is a workflow architecture change, not a capability improvement. Here is what fits:
- Full year of board meeting minutes: approximately 50,000 tokens
- Complete vendor contract package (master agreement + SOW + amendments): 60,000-100,000 tokens
- Full regulatory filing with exhibits: 100,000-200,000 tokens
- Complete M&A due diligence package: 250,000-400,000 tokens
- Full annual report with footnotes: 80,000-120,000 tokens
All of these fit within 1 million tokens. The operational implication: document analysis workflows that previously required chunking, summarizing, and assembling partial views can now run as a single synthesis call against the complete document set.
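A quick way to see which of these packages actually need the extended window is to compare each estimate against the two limits. The sketch below uses the token ranges listed above. The tier names and the `context_tier` helper are hypothetical, and the check ignores the tokens taken up by the prompt itself and by the model's output.

```python
STANDARD_CONTEXT = 272_000    # standard window, from the article
EXTENDED_CONTEXT = 1_000_000  # extended window, from the article

# (low, high) token estimates as listed in the article.
PACKAGES = {
    "Board meeting minutes (full year)": (50_000, 50_000),
    "Vendor contract package": (60_000, 100_000),
    "Regulatory filing with exhibits": (100_000, 200_000),
    "M&A due diligence package": (250_000, 400_000),
    "Annual report with footnotes": (80_000, 120_000),
}


def context_tier(tokens: int) -> str:
    """Classify a token count by the smallest window that holds it."""
    if tokens <= STANDARD_CONTEXT:
        return "standard"
    if tokens <= EXTENDED_CONTEXT:
        return "extended"
    return "too large"


if __name__ == "__main__":
    for name, (low, high) in PACKAGES.items():
        print(f"{name}: {context_tier(low)} (low), {context_tier(high)} (high)")

    # Loading every package at its high estimate in one call.
    combined_high = sum(high for _, high in PACKAGES.values())
    print(combined_high, context_tier(combined_high))
```

On these estimates only the M&A package at its upper bound crosses 272K on its own. The extended window matters most when several packages are loaded together: all five at their high estimates total 870,000 tokens, which still fits in a single call.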
The prior constraint was not the model's reasoning capability - it was the context limit that forced the analyst to make selection decisions before the model could begin. Those selection decisions introduced bias and missed material. Eliminating them changes the quality of the output, not just the speed.
The practical cost: $5-$10 per full 1M context run at Standard rates. The practical time: under three minutes per synthesis. Against $15,000-$40,000 in professional services time for equivalent document review scope, the efficiency case is straightforward for the synthesis layer.
Computer Use at 75%: What OSWorld Actually Measures
OSWorld is not a synthetic benchmark. The evaluation runs the model against real software environments - macOS, Windows, and Ubuntu - with actual applications installed. Tasks include: "Export this spreadsheet column to a new CSV file," "Find the unread email from last Tuesday and forward it to this address," "Install this package, run the script, and paste the output into this document."
75% means the model completes 3 out of 4 real software tasks correctly based on screenshot observation and input sequences. At 52% (GPT-5.3), computer use was a proof of concept. At 75%, it is approaching production viability for structured, defined workflows.
What this means operationally: computer use is now viable for workflows with defined steps, observable outcomes, and human review of the final result. It is not yet viable for workflows that require real-time judgment, unstructured interfaces, or high error-cost outcomes.
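The human review requirement can be sized with simple arithmetic. The sketch below assumes, purely for illustration, that the OSWorld success rate carries over unchanged to your own tasks. That is a strong assumption and real workflows may behave very differently. The batch size and function name are hypothetical.

```python
def expected_failures(tasks: int, success_pct: int) -> float:
    """Expected number of tasks needing correction at a given success rate.

    Assumes the benchmark success rate applies uniformly to every task.
    """
    return tasks * (100 - success_pct) / 100


if __name__ == "__main__":
    batch = 1_000  # hypothetical batch of structured tasks
    print(expected_failures(batch, 52))  # GPT-5.3 rate from the article
    print(expected_failures(batch, 75))  # GPT-5.4 rate from the article
```

On a 1,000-task batch that is 480 expected failures at 52% against 250 at 75%. One task in four still fails, which is why review of the final result stays in the loop.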
Four Enterprise Implications
1. The 1M context window eliminates the document selection bottleneck in analyst workflows. Any executive function that currently involves receiving a document package and then deciding what to read is a candidate for load-and-synthesize. Legal, finance, strategy, compliance, and investor relations teams all have this workflow pattern. The change is not incremental - it restructures the front end of every document-intensive decision process.
2. Three-variant architecture requires explicit variant assignment at the workflow level. Organizations running GPT-5.4 on a single-variant basis will either overpay (using Thinking or Pro where Standard is sufficient) or underperform (using Standard where Thinking is required). Variant selection is now a design decision for every AI workflow, not a default setting. IT and AI teams that do not document this will generate significant cost variance.
3. Computer use at 75% OSWorld restarts the ROI case for automation pilots. Every organization that ran a computer use evaluation on GPT-5.3 or Claude 3.5 Sonnet and deprioritized it should re-evaluate at GPT-5.4. The 23-point improvement over GPT-5.3 is not a marginal upgrade - it crosses the threshold from "interesting demo" to "testable production candidate" for defined, structured workflows.
4. HealthBench positions OpenAI as a direct player in clinical AI. The simultaneous release of HealthBench and GPT-5.4, with a top-decile clinical reasoning result, is a deliberate positioning signal. Healthcare organizations using AI for clinical support should treat HealthBench as the opening move in a vendor claim that will require careful evaluation - both for what the benchmark demonstrates and for what it explicitly does not measure.
Action Steps
- Run a 1M context test this week using a real document package from your current workflow. Do not test with a toy example. Load an actual vendor contract set, a regulatory filing, or a board minute archive, and run the synthesis prompt from this week's Pro Tip. The result will tell you more than any benchmark.
- Map your current AI workflows to the Standard/Thinking/Pro selection criteria. Standard: high-volume, general reasoning, document generation. Thinking: complex multi-step analysis, visible reasoning required, lower volume. Pro: maximum-difficulty tasks with usage tolerance. Document the mapping before your next infrastructure review.
- Restart your computer use pilot evaluation with GPT-5.4. If your organization evaluated computer use and deprioritized it, the 75% OSWorld score justifies a new evaluation cycle. Define three candidate workflows - structured, observable, human-reviewed - and run structured tests against the current baseline before Q2 planning.
- Review the HealthBench methodology if your organization operates in healthcare. OpenAI published the full benchmark structure. Understanding what HealthBench measures - structured clinical reasoning in controlled vignettes - and what it does not measure is the prerequisite for any conversation about GPT-5.4 in clinical contexts.
- Plan for the 2x billing rate on extended context. If your workflows regularly exceed 272K tokens, cost modeling for Q2 should account for the extended context billing structure. The $5 per full 1M load figure is a useful planning anchor for document synthesis use cases.
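The variant mapping in the second step can be written down as an explicit rule so it is documented rather than implied. This is one possible sketch of the Standard/Thinking/Pro criteria above, not an official selection policy. The `Workflow` fields, the rule ordering, and the example workflows are all hypothetical, and the returned strings are tier names rather than API model identifiers.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    high_volume: bool
    needs_visible_reasoning: bool
    complex_multi_step: bool
    maximum_difficulty: bool
    tolerates_usage_limits: bool


def select_variant(w: Workflow) -> str:
    """Map a workflow to a tier using the article's selection criteria."""
    # Pro: maximum-difficulty tasks where rate caps are acceptable.
    if w.maximum_difficulty and w.tolerates_usage_limits and not w.high_volume:
        return "Pro"
    # Thinking: complex multi-step analysis or visible reasoning required.
    if w.needs_visible_reasoning or w.complex_multi_step:
        return "Thinking"
    # Standard: high-volume, general reasoning, document generation.
    return "Standard"


if __name__ == "__main__":
    examples = [
        Workflow("contract summary drafts", True, False, False, False, False),
        Workflow("covenant compliance analysis", False, True, True, False, False),
        Workflow("open-ended research synthesis", False, True, True, True, True),
    ]
    for w in examples:
        print(f"{w.name}: {select_variant(w)}")
```

Keeping the rule in code or config gives the infrastructure review a single place to audit which workflows are paying Thinking or Pro rates and why.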