0-Introduction
1-How it works
2-Evaluation
3-Details of using the models
4-Advice on Prompting
5-Testing Examples
6-Some Insights
0-Introduction
OpenAI has introduced its new o1 series models: large language models trained with reinforcement learning to enhance complex reasoning capabilities. This article provides an overview of the key features of the two newly released models: o1-preview and o1-mini. It is important to note that these models are not merely an iteration of GPT-4o; rather, they represent a novel approach designed to balance cost-effectiveness and performance while significantly improving reasoning abilities.

1-How it works
OpenAI has explained the training methodology for their o1 series models in two key blog posts: "Introducing OpenAI o1-preview" and "Learning to Reason with LLMs". These models are designed to allocate extra computational resources to thinking processes before generating responses. This approach refines cognitive processes, explores diverse problem-solving strategies, and enhances error recognition capabilities. By applying reinforcement learning techniques, the o1 series optimizes its reasoning pathways and decision-making strategies. This development represents an evolution of Chain of Thought (CoT) research, a field that has seen significant advances in reasoning methodologies over the past two years.
The o1 models introduce "reasoning tokens." These tokens allow the models to "think" by breaking down their understanding of the prompt and considering multiple approaches to generating a response. After producing reasoning tokens, the model generates an answer as visible completion tokens and discards the reasoning tokens from its context. The process works as follows:
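A conceptual sketch of that flow in Python may help (this is my own illustration, not OpenAI's implementation; `think` and `respond` are hypothetical placeholders):

```python
# Conceptual sketch (NOT OpenAI's actual implementation): the model
# produces hidden reasoning tokens, derives a visible answer from them,
# and only the visible completion is carried forward in context.

def think(context: list[str]) -> str:
    # Placeholder for the model's hidden chain of reasoning.
    return f"[hidden reasoning about: {context[-1]}]"

def respond(context: list[str], reasoning: str) -> str:
    # Placeholder for the visible answer derived from the reasoning.
    return f"answer to: {context[-1]}"

def chat_turn(history: list[str], prompt: str) -> list[str]:
    """One turn: reasoning tokens are produced, used, then discarded."""
    context = history + [prompt]
    reasoning = think(context)             # billed as output, never shown
    visible = respond(context, reasoning)  # the completion the user sees
    # Only the prompt and the visible completion join the running
    # context; the reasoning tokens are dropped.
    return history + [prompt, visible]
```

The key point the sketch captures: reasoning tokens are billed and count against the output cap, but they never appear in the response or in later turns' context.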
2-Evaluation
Reasoning Tasks
OpenAI showcases the performance of their newly trained models on a diverse set of human exams and benchmarks. The results demonstrate that o1 significantly outperforms GPT-4o on the vast majority of these reasoning-heavy tasks. The following images illustrate the results.

In many reasoning-heavy benchmarks, o1 rivals the performance of human experts. OpenAI evaluates math performance on AIME, an exam designed to challenge America's brightest high school math students. On the 2024 AIME exams, GPT-4o solved only 12% (1.8/15) of problems on average. In contrast, o1 achieved remarkable results:
- 74% (11.1/15) with a single sample per problem
- 83% (12.5/15) with consensus among 64 samples
- 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function
A score of 13.9 places o1 among the top 500 students nationally, surpassing the cutoff for the USA Mathematical Olympiad.
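The "consensus among 64 samples" figure is an instance of the self-consistency technique: sample many independent solutions and take a majority vote over their final answers. A minimal sketch (the sampled answers here are made up for illustration):

```python
# Self-consistency sketch: majority vote over sampled final answers.
from collections import Counter

def consensus_answer(final_answers: list[str]) -> str:
    """Return the most common final answer among the sampled solutions."""
    return Counter(final_answers).most_common(1)[0][0]

# Hypothetical final answers from 5 samples of the same problem:
print(consensus_answer(["204", "204", "113", "204", "072"]))  # -> 204
```

Majority voting helps because independent samples tend to agree on the correct answer more often than on any particular wrong one.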
Human preference evaluation
In addition to exams and academic benchmarks, OpenAI also assessed human preference for o1-preview over GPT-4o on challenging, open-ended prompts across a wide range of domains. Human evaluators prefer o1-preview over GPT-4o in domains that benefit from enhanced reasoning capabilities. However, for writing tasks, people still favor GPT-4o responses. This indicates that while o1 models offer significant advancements in reasoning, they're not meant to supplant GPT-4o across all applications.

For more detailed information about the models' performance, refer to the "Learning to Reason with LLMs" blog post.
3-Details of using the models
API version
The o1 models are available through the chat completions endpoint.
However, there are some limitations in the chat completion API:
- It only supports text input; images are not supported.
- Only user and assistant messages are allowed; system messages are not permitted.
- Streaming mode is not available.
- Tools, function calling, and response format parameters are unsupported.
- Logprobs are not supported.
- Temperature, top_p, and n are fixed at 1, while presence_penalty and frequency_penalty are set to 0.
- The Assistants API and Batch API are not supported.
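Putting these constraints together, a request sketch might look like the following. This assumes the v1 `openai` Python SDK and the `max_completion_tokens` parameter; the helper function and the token value are my own illustrations, not part of the SDK:

```python
# A minimal sketch of an o1-compatible chat-completions payload.
# Assumptions: v1 `openai` Python SDK; `max_completion_tokens` parameter.

def build_o1_request(user_prompt: str) -> dict:
    """Build a chat-completions payload that respects the o1 limits."""
    return {
        "model": "o1-preview",
        # Only "user" and "assistant" roles are allowed; no "system".
        "messages": [{"role": "user", "content": user_prompt}],
        # Caps hidden reasoning tokens AND visible completion tokens.
        "max_completion_tokens": 4096,
        # temperature, top_p, and n are fixed at 1, so we omit them,
        # along with tools, response_format, logprobs, and stream.
    }

# Sending it requires an API key, e.g.:
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(**build_o1_request("Hello"))
#   print(resp.choices[0].message.content)
```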
The o1-preview and o1-mini models offer a context window of 128,000 tokens. Each completion has an upper limit on the maximum number of output tokens, which includes both the invisible reasoning tokens and the visible completion tokens. The maximum output token limits are:
- o1-preview: up to 32,768 tokens
- o1-mini: up to 65,536 tokens
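Because the hidden reasoning tokens and the visible completion share one cap, a long reasoning phase shrinks the space left for the answer. A quick illustration (the reasoning-token count here is hypothetical):

```python
# Reasoning tokens and visible completion tokens share one output cap,
# so heavy hidden reasoning can truncate the visible answer.

O1_PREVIEW_MAX_OUTPUT = 32_768  # o1-preview's output token cap

def visible_budget(reasoning_tokens: int,
                   cap: int = O1_PREVIEW_MAX_OUTPUT) -> int:
    """Tokens left for the visible completion after hidden reasoning."""
    return max(cap - reasoning_tokens, 0)

# If a hard problem consumed 25,000 hidden reasoning tokens:
print(visible_budget(25_000))  # -> 7768 tokens left for the answer
```

This is why it is worth reserving a generous output limit for complex prompts: the visible answer only gets whatever the reasoning phase leaves behind.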
Web
Currently, o1-preview and o1-mini are available to ChatGPT Plus or Team users, but with usage limitations. To start, users have access to 30 messages per week with o1-preview and 50 messages per week with o1-mini.

4-Advice on Prompting
Here is some advice on prompting from the OpenAI documentation:
- Keep prompts simple and direct: We don't need to provide extensive guidance in our prompts. The models excel at understanding and responding to brief, clear instructions.
- Avoid chain-of-thought prompts: These models can perform reasoning internally, so we don't need to prompt them with "think step by step."
- Use delimiters for clarity: Employ delimiters like triple quotation marks, XML tags, or section titles to clearly indicate distinct parts of the input. This helps the model interpret different sections appropriately.
- Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.
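The delimiter advice is easy to apply in code. A sketch of a prompt builder that wraps the data portion in XML-style tags so the model can tell instructions from content (the helper and tag name are my own, not from OpenAI's documentation):

```python
# Sketch of the "use delimiters" advice: wrap distinct parts of the
# input in XML-style tags so instructions and data stay separable.

def build_prompt(instruction: str, document: str) -> str:
    """Combine a brief instruction with clearly delimited input data."""
    return (
        f"{instruction}\n\n"
        f"<document>\n{document}\n</document>"
    )

print(build_prompt(
    "Summarize the following document in one sentence.",
    "o1 models spend extra compute on reasoning before answering.",
))
```

Note the instruction itself stays short and direct, per the first point above; no "think step by step" scaffolding is added.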
5-Testing Examples
Here, I present some personal testing examples comparing GPT-4o with o1-preview. Note that not all results are correct, showcasing the models' varying performance.

API version
- Question: If Alice has 6 brothers and 7 sisters, how many sisters does her brother have?

1. Answer by o1-preview:
To solve this problem, we need to determine the number of sisters that Alice's brother has.
- Total number of sisters in the family: There are 7 sisters (including Alice herself).
- Total number of brothers in the family: There are 6 brothers.
When we consider one of Alice's brothers, his sisters are the same as Alice's sisters, which includes Alice herself and the other 6 sisters, making a total of 7 sisters.
Answer: 7

1.1. Answer Again:
To solve this problem, let's carefully consider the composition of Alice's family.
- Alice has 6 brothers.
- Alice has 7 sisters. Since Alice herself is also a sister, there are a total of 8 sisters in the family (including Alice).
Therefore, the total number of children in the family is:
- 6 brothers (sons)
- 8 sisters (daughters)
- Total children = 6 sons + 8 daughters = 14 children.
Now, we need to determine how many sisters one of Alice's brothers has. Each brother will have:
- 6 brothers (excluding himself)
- 8 sisters
But the question only asks for the number of sisters the brother has.
Answer: 8

1.2. Answer by GPT-4o:
Let's work through the problem step by step.
- Identify Alice's siblings:
  - Alice has 6 brothers.
  - Alice has 7 sisters.
- Count total children in the family:
  - Including Alice, the total number of children is Alice (1) + 6 brothers + 7 sisters = 14 children.
- Determine how many sisters one of Alice's brothers has:
  - Each brother shares the same siblings as Alice (though excluding himself in his own count).
  - Therefore, each of Alice's brothers will have the same 7 sisters as she does.
So, each brother of Alice has 7 sisters.
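The puzzle can be settled with a few lines of Python (a trivial sanity check of my own, not from either model's output):

```python
# Sanity check of the Alice puzzle: model the family directly and
# count one brother's sisters.

brothers = 6          # Alice's brothers
sisters = 7           # Alice's sisters (not counting Alice herself)
girls = sisters + 1   # Alice is also a girl in the family

# Any one of the brothers has every girl in the family as a sister:
print(girls)  # -> 8, so the second o1-preview answer is correct
```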
- Question: What is the smallest integer whose square is between 15 and 30? Please consider the negative.

2. Answer by o1-preview:
To solve this problem, we need to find all integers \( n \) such that \( 15 < n^2 < 30 \), and then identify the smallest integer among them, considering negative integers.
First, let's list the squares of integers to see which ones fall between 15 and 30:
- \( (-6)^2 = 36 \) (Too large)
- \( (-5)^2 = 25 \)
- \( (-4)^2 = 16 \)
- \( (-3)^2 = 9 \) (Too small)
- \( (-2)^2 = 4 \) (Too small)
- \( (-1)^2 = 1 \) (Too small)
- \( 0^2 = 0 \) (Too small)
- \( 1^2 = 1 \) (Too small)
- \( 2^2 = 4 \) (Too small)
- \( 3^2 = 9 \) (Too small)
- \( 4^2 = 16 \)
- \( 5^2 = 25 \)
- \( 6^2 = 36 \) (Too large)
The integers whose squares are between 15 and 30 are \( -5, -4, 4, \) and \( 5 \).
Among these, the smallest integer (the one that is furthest to the left on the number line) is \(-5\).
Answer: -5

2.1. Answer by GPT-4o:
To determine the smallest integer whose square lies between 15 and 30, we need to systematically test suitable values. First, note that the problem requires examining both positive and negative integers.
- Calculate the squares of a few integers:
  - \( (-5)^2 = 25 \)
  - \( (-4)^2 = 16 \)
  - \( (-3)^2 = 9 \)
  - \( (-2)^2 = 4 \)
  - \( (-1)^2 = 1 \)
  - \( 1^2 = 1 \)
  - \( 2^2 = 4 \)
  - \( 3^2 = 9 \)
  - \( 4^2 = 16 \)
  - \( 5^2 = 25 \)
- Next, compare these squares with the interval (15, 30):
  - The square of \( -5 \) and \( 5 \) is \( 25 \), which is between 15 and 30.
  - The square of \( -4 \) and \( 4 \) is \( 16 \), which is also between 15 and 30.
  - The square of \( -3 \) and \( 3 \) is \( 9 \), which is less than 15.
  - The square of \( -2 \) and \( 2 \) is \( 4 \), which is also less than 15.
  - The squares of \( -1 \), 1, and intermediate or smaller values are less than 15.
- Based on these calculations, the integers \( 4, -4, 5, \) and \(-5\) have squares within the desired range. Of these, the smallest integer in magnitude is:
\[
\min\{|4|, |-4|, |5|, |-5|\} = 4
\]
Thus, considering both positive and negative integers, the smallest integer whose square is between 15 and 30 is \( \boxed{-4} \).
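A brute-force check (my own, not from either model) confirms which answer is right:

```python
# Brute-force check: integers whose square lies strictly between 15 and 30.

candidates = [n for n in range(-10, 11) if 15 < n * n < 30]
print(candidates)       # -> [-5, -4, 4, 5]
print(min(candidates))  # -> -5, so o1-preview's answer is correct
```

GPT-4o's mistake is conflating "smallest in magnitude" with "smallest on the number line"; the question asks for the latter.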
- Question: what is the fifth word in the following sentences? I have an apple

3. Answer by o1-preview:
The sentence you provided is:
"I have an apple"
This sentence contains four words:
- I
- have
- an
- apple
Therefore, there is no fifth word in this sentence.

3.1. Answer by GPT-4o: The fifth word in the sentence "I have an apple" is "apple."
- Question: Please exchange the second word and the last second word for the following sentence: I need to go to work even though today is Saturday.

4. Answer by o1-preview:
I is to go to work even though today need Saturday.

4.1. Answer by GPT-4o:
"I though to go to work even need today is Saturday."
The API version doesn't display the thinking process. For the first question, o1-preview initially provided an incorrect answer but gave the right one on the second attempt. This suggests that the new model may still be unstable when handling tricky reasoning questions. However, we can easily perceive that its reasoning capability is superior to GPT-4o's.

Web
I'm unsure if the API version performs identically to the model on the web, so I also tested some cases on the website. The advantage of the website is that we can observe the thinking process while o1-preview answers questions.

6-Some Insights
I've played with the new o1-preview model for a whole day, testing both the API and web versions. I believe this model excels at handling complex tasks and is noticeably more intelligent than its predecessors. However, this doesn't mean it's the go-to model for every task, especially when considering time and cost factors. I see this model as more of a prototype than a mature product like its predecessors. People should carefully evaluate their tasks to determine whether they truly need such a highly intelligent model, given its increased cost and latency.

- Author: Chengsheng Deng
- URL: https://chengshengddeng.com/article/openai-o1
- Copyright: All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!