Artificial intelligence

OpenAI Introduces Vision Fine-Tuning on GPT-4o

OpenAI has introduced vision fine-tuning on GPT-4o.

Takeaway Points

  • OpenAI introduces vision to the fine-tuning API.
  • Developers can customize the model to have stronger image understanding capabilities, which enable applications like enhanced visual search functionality.
  • Vision fine-tuning capabilities are available today for all developers on paid usage tiers.
  • OpenAI is offering 1M training tokens per day for free through Oct 31, 2024, to fine-tune GPT-4o with images.
  • After Oct 31, 2024, GPT-4o fine-tuning training will cost $25 per 1M tokens and inference will cost $3.75 per 1M input tokens and $15 per 1M output tokens.

What did OpenAI introduce?

OpenAI on Tuesday said it’s introducing vision fine-tuning on GPT-4o, making it possible to fine-tune with images in addition to text. Developers can customize the model to have stronger image understanding capabilities, which enable applications like enhanced visual search functionality, improved object detection for autonomous vehicles or smart cities, and more accurate medical image analysis.

The ChatGPT maker said that since it first introduced fine-tuning on GPT-4o, hundreds of thousands of developers have customized its models using text-only datasets to improve performance on specific tasks. However, in many cases, fine-tuning models on text alone does not provide the performance boost expected.

How it works

Vision fine-tuning follows a process similar to fine-tuning with text: developers prepare their image datasets in the required format and upload them to OpenAI's platform. They can improve GPT-4o's performance on vision tasks with as few as 100 images, and drive even higher performance with larger volumes of text and image data, the company said.
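
As a rough illustration, a single training example might look like the JSONL record below, assuming the same chat-message format the article says is shared with the Chat endpoints. The file name, prompt text, and image URL are placeholders, not taken from OpenAI's documentation.

```python
# A minimal sketch of one vision fine-tuning training example and its upload,
# assuming the chat-message JSONL format used by the Chat Completions API.
import json
from openai import OpenAI

example = {
    "messages": [
        {"role": "system", "content": "You are a traffic-sign localization assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Where is the speed limit sign in this image?"},
                # Images can be referenced by URL (or embedded as base64 data URLs).
                {"type": "image_url", "image_url": {"url": "https://example.com/street-view-001.jpg"}},
            ],
        },
        {"role": "assistant", "content": "The speed limit sign is on the right, above the second lane divider."},
    ]
}

# Write one JSON object per line, then upload the file for fine-tuning.
with open("training_data.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

client = OpenAI()  # reads OPENAI_API_KEY from the environment
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)
```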

Grab improves image detection and understanding on the road

OpenAI explained that Grab, a leading food delivery and rideshare company, turns street-level imagery collected from its drivers into mapping data that powers GrabMaps, a mapping service enabling all of its Southeast Asia operations. Using vision fine-tuning with only 100 examples, Grab taught GPT-4o to correctly localize traffic signs and count lane dividers to refine its mapping data. As a result, Grab improved lane-count accuracy by 20% and speed-limit-sign localization by 13% over a base GPT-4o model, enabling it to better automate a previously manual mapping process.

Availability & pricing

According to OpenAI, vision fine-tuning capabilities are available today for all developers on paid usage tiers. These capabilities are supported on the latest GPT-4o model snapshot, ‘gpt-4o-2024-08-06’. Developers can extend existing fine-tuning training data for images using the same format as OpenAI's Chat endpoints.

“We’re offering 1M training tokens per day for free through October 31, 2024 to fine-tune GPT-4o with images. After October 31, 2024, GPT-4o fine-tuning training will cost $25 per 1M tokens and inference will cost $3.75 per 1M input tokens and $15 per 1M output tokens. Image inputs are first tokenized based on image size, and then priced at the same per-token rate as text inputs. More details can be found on the API Pricing page,” OpenAI said.
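
To put those rates in context, here is a back-of-the-envelope estimate of what a job might cost at the post-October pricing. The token counts below are hypothetical, and actual image token counts depend on image size, as OpenAI notes.

```python
# Rates quoted in the article, in dollars per 1M tokens (after Oct 31, 2024).
TRAINING_PER_1M = 25.00
INPUT_PER_1M = 3.75
OUTPUT_PER_1M = 15.00

# Hypothetical usage: a few thousand image+text examples over several epochs,
# followed by a modest amount of inference traffic.
training_tokens = 5_000_000
input_tokens = 2_000_000
output_tokens = 500_000

training_cost = training_tokens / 1_000_000 * TRAINING_PER_1M            # $125.00
inference_cost = (input_tokens / 1_000_000 * INPUT_PER_1M                # $7.50
                  + output_tokens / 1_000_000 * OUTPUT_PER_1M)           # + $7.50

print(f"training: ${training_cost:.2f}, inference: ${inference_cost:.2f}")
# training: $125.00, inference: $15.00
```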

To get started, developers can visit the fine-tuning dashboard, click ‘create’, and select ‘gpt-4o-2024-08-06’ from the base model drop-down; to learn how to fine-tune GPT-4o with images, they can consult OpenAI's docs, the company said.
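
For developers who prefer the API to the dashboard, a minimal sketch of starting the same job programmatically with the official Python SDK follows; the training file ID is a placeholder for the ID returned by a prior file upload.

```python
from openai import OpenAI

client = OpenAI()

# Kick off a fine-tuning job on the GPT-4o snapshot named in the article.
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",   # hypothetical ID from the earlier upload step
    model="gpt-4o-2024-08-06",     # the snapshot that supports vision fine-tuning
)

print(job.id, job.status)          # poll the job until it reports 'succeeded'
```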
