Understanding gpt-oss-120b hands-on with Python and vLLM

I'm Hattori, a data scientist.

OpenAI has released an open model called gpt-oss. Let's jump straight in and understand what is inside by actually using it.

Model characteristics of gpt-oss
Running on vLLM
Running with the regular tokenizer
Final output and reasoning
Looking at the chat template
Trying tool use
Trying built-in tools
Summary
We Are Hiring!

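Model characteristics of gpt-oss

Plenty of other articles already cover this, so I'll keep it short. OpenAI released two models, gpt-oss-120b and gpt-oss-20b, and both are provided under the Apache 2.0 license.

Both models use a MoE (Mixture of Experts) architecture. The 120b and 20b in the names refer to the total parameter counts, while the active parameters are 5.1b and 3.6b respectively. For the 120b model, that means there are roughly 120b (120 billion) parameters in total, but only about 5.1b (5.1 billion) of them are used in the computation for each token at inference time. Even among openly available MoE models, the 120b model has a large number of experts (128), and the gap between total and active parameters strikes me as notably large.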
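To put the total-versus-active gap into numbers, here is a minimal sketch that only uses the figures quoted above. The nominal totals (120b and 20b) are taken from the model names, so treat the resulting ratios as rough estimates rather than official numbers.

```python
# Figures quoted in the text above, in billions of parameters.
# The totals are the nominal sizes from the model names (an approximation).
MODELS = {
    "gpt-oss-120b": {"total_b": 120.0, "active_b": 5.1},
    "gpt-oss-20b": {"total_b": 20.0, "active_b": 3.6},
}


def active_ratio(total_b: float, active_b: float) -> float:
    """Fraction of parameters used per token at inference time."""
    return active_b / total_b


for name, params in MODELS.items():
    ratio = active_ratio(**params)
    print(f"{name}: {ratio:.2%} of parameters active per token")
```

With these inputs the 120b model activates roughly 4% of its parameters per token, versus roughly 18% for the 20b model.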

The models are published on Hugging Face, and Hugging Face has also put out a blog post on how to use them.

huggingface.co

Running on vLLM

For this article I was given the use of a GPU server on GPUSOROBAN, the GPU cloud service run by Highreso (株式会社ハイレゾ).

highreso.jp

My company has a partnership agreement with Highreso.

www.abejainc.com

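There is official guidance on using the model with vLLM, so I'll follow it while also walking through the steps it doesn't spell out. Starting a server looks like it is just a matter of running vllm serve, so here I'll drive the model from Python instead.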

cookbook.openai.com

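I also tried installing with uv, but for some reason it didn't work on my machine. Instead I started from what appears to be vLLM's gpt-oss-specific image (vllm/vllm-openai:gptoss) and overrode the entrypoint with bash so that it doesn't start serving.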

 docker run --rm -it --gpus all --ipc=host -v /data/models:/app/models -v $PWD:/app -w /app --entrypoint bash vllm/vllm-openai:gptoss

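The mounts are there only for my model directory and working scripts, so adjust them to suit your setup.

To start with, let's run the official sample script. The only thing I changed is the model path.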

import json
from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent,
)
 
from vllm import LLM, SamplingParams
 
# --- 1) Render the prefill with Harmony ---
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
 
convo = Conversation.from_messages(
    [
        Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
        Message.from_role_and_content(
            Role.DEVELOPER,
            DeveloperContent.new().with_instructions("Always respond in riddles"),
        ),
        Message.from_role_and_content(Role.USER, "What is the weather like in SF?"),
    ]
)
 
prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
 
# Harmony stop tokens (pass to sampler so they won't be included in output)
stop_token_ids = encoding.stop_tokens_for_assistant_actions()
 
# --- 2) Run vLLM with prefill ---
llm = LLM(
    model="/app/models/gpt-oss-120b",
    trust_remote_code=True,
)
 
sampling = SamplingParams(
    max_tokens=128,
    temperature=1,
    stop_token_ids=stop_token_ids,
)
 
outputs = llm.generate(
    prompt_token_ids=[prefill_ids],   # batch of size 1
    sampling_params=sampling,
)
 
# vLLM gives you both text and token IDs
gen = outputs[0].outputs[0]
text = gen.text
output_tokens = gen.token_ids  # <-- these are the completion token IDs (no prefill)
 
# --- 3) Parse the completion token IDs back into structured Harmony messages ---
entries = encoding.parse_messages_from_completion_tokens(output_tokens, Role.ASSISTANT)
 
# 'entries' is a sequence of structured conversation entries (assistant messages, tool calls, etc.).
for message in entries:
    print(f"{json.dumps(message.to_dict())}")

Output

{"role": "assistant", "name": null, "content": [{"type": "text", "text": "User asks \"What is the weather like in SF?\" The developer says: Always respond in riddles. So we must answer in a riddle format. Possibly we need to convey the weather; we don't have real-time data. We can give a general typical description: foggy, cool, etc. Must be in riddle. So produce a riddle that hints at typical SF weather. No disallowed content. Provide a riddle answer context. Also we need to be safe. We'll embed hints: \"Morning veil of mist, sun tries to break, wind sighs through the bay.\" Something like that. Should"}], "channel": "analysis"}

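Subjectively, it was really fast. You can feel how few active parameters there are. Note that the output above stops mid-sentence inside the analysis channel, presumably because the sample sets max_tokens=128.

One difference from running other models on vLLM is that encoding and decoding go through openai_harmony. OpenAI Harmony is a wire format for chats and responses, and the gpt-oss models are trained on it. It is apparently meant to handle conversation structure, reasoning output (CoT), and function/tool calls in a stable way.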

Running with the regular tokenizer

You don't have to use OpenAI Harmony. As with other models, you can also run it with the transformers tokenizer.

# Earlier setup (creating llm and sampling) is omitted
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/app/models/gpt-oss-120b")
conversations = [
    {"role": "user", "content": "まどか☆マギカで一番可愛いのは誰？"}
]
input_text = tokenizer.apply_chat_template(
    conversations,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate(
    [input_text],
    sampling_params=sampling,
)

For reference, here is what this Japanese query ("Who is the cutest in Madoka Magica?") produced.

analysisUser asks in Japanese: "In Madoka Magica, who is the cutest?" It's subjective; we can answer with discussion of characters: Madoka, Homura, Mami, etc. Provide a lighthearted answer, maybe mention fan favorites. Also ask for clarification or give answer. Probably respond with a friendly tone.

We can say "It depends on personal preference, but here's top picks." Include some reasoning.

We'll answer in Japanese.

assistantfinalまどか☆マギカは個性的なキャラがたくさん登場するので、誰が一番「可愛い」かは人それぞれの好みで変わります。ファンの間で特に人気が高いキャラをいくつかピックアップしてみましたので、参考にしてみてください！

| キャラ | 可愛さのポイント |
|--------|-------------------|
| **鹿目　まどか** | 透き通るような純粋さと、温かい笑顔が特徴。制服姿やリボンが付いた魔法少女衣装はとてもキュートです。 |
| **暁 美樹（さくや みき）** | おたふく風邪で一瞬だけ見せた無邪気な笑顔がファンのハートをわし掴み。「ねむねむ」な一面が逆に可愛らしさを引き立てます。 |
| **巴 マミ** | 大人びた雰囲気と、いつも持ち歩いているリボンや茶色い帽子がアクセント。優雅に戦う姿と、時に見せるちょっとしたドジっ子感がギャップ萌えです。 |
| **暁 美樹（ほむら）** | クールでミステリアスな佇まいの中に、時折見せる甘えるような笑顔や、ウサギのような可愛らしい仕草がファンに愛されています。 |
| **秋筱山 さやか** | 明るく元気で、スポーティーなファッションとストレートな性格が魅力的。「パチンコ”ゃん”」のシーンで見せる無邪気さは抜群に可愛いです。 |
| **アルティメット・パパ）**（※実は登場人物ではありませんが、MADOKAの「せっかく」可愛らしいキュートさが好きなファンもいます）| ちょっとしたジョークですが、キャラの「可愛い」イメージは幅広いです。

### まとめ

- **純粋さと“守りたい”感が好きなら** → **鹿目 まどか**  
- **クールさと裏の優しさに惹かれるなら** → **暁 美樹（ほむら）**  
- **大人の余裕とちょっぴりドジな一面が好きなら** → **巴 マミ**  
- **元気いっぱいの明るさが好きなら** → **美樹 さやか**  

結局のところ、どのキャラが「一番可愛い」かはあなたがどの側面に一番魅力を感じるかに依ります。ぜひ自分の好きなシーンやセリフ、コスチュームを思い出しながら、好きな子を再確認してみてくださいね！もし他に気になるキャラや、好きなエピソードがあれば教えてください。お話しできるのを楽しみにしています 😊

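Everything up to assistantfinal looks like the reasoning part, and everything after it looks like the final output.

By the way, who on earth is アルティメット・パパ ("Ultimate Papa")?

Final output and reasoning

The word assistantfinal is visible in the output above, but I couldn't find anything in the documentation on how to split on it by hand. It is most likely what remains of the Harmony message boundary once the special tokens are stripped from the decoded text. With OpenAI Harmony, however, the two parts can be separated cleanly.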

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
gen = outputs[0].outputs[0]  # output of vLLM's generate
output_tokens = gen.token_ids
entries = encoding.parse_messages_from_completion_tokens(output_tokens, Role.ASSISTANT)

The contents of entries look like this.

[Message(author=Author(role=<Role.ASSISTANT: 'assistant'>, name=None), content=[TextContent(text='The user asks in Japanese: "まどか☆マギカで一番可愛いのは誰？" meaning ....')], channel='analysis', recipient=None, content_type=None),
 Message(author=Author(role=<Role.ASSISTANT: 'assistant'>, name=None), content=[TextContent(text='まどか☆マギカは登場キャラがみんな個性的で魅力的なので、....')], channel='final', recipient=None, content_type=None)]

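The text fields are abbreviated because they are long. The first message has channel="analysis" and the second has channel="final". In other words, the output is split into the reasoning process (analysis) and the final output (final).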
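Once the messages are parsed, pulling out just the final answer is a small grouping step. The sketch below is an illustration rather than part of openai_harmony: split_channels is a hypothetical helper, and the SimpleNamespace objects are stand-ins that only mimic the channel and content[i].text attributes visible in the repr above, so the snippet runs without a GPU.

```python
from types import SimpleNamespace


def split_channels(entries):
    """Group message text by channel name (hypothetical helper)."""
    by_channel = {}
    for message in entries:
        # Each message holds a list of content parts with a .text attribute.
        text = "".join(part.text for part in message.content)
        by_channel.setdefault(message.channel, []).append(text)
    return by_channel


def fake_message(channel, text):
    """Stand-in for a parsed Harmony Message (assumption, for illustration)."""
    return SimpleNamespace(channel=channel, content=[SimpleNamespace(text=text)])


entries = [
    fake_message("analysis", "The user asks in Japanese ..."),
    fake_message("final", "Here is the answer shown to the user."),
]

channels = split_channels(entries)
print(channels["final"][0])
```

In a real run, the entries list returned by parse_messages_from_completion_tokens should be usable in place of the stand-ins, provided the attributes match what the repr shows.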

Looking at the chat template

Let's look at the prompt after the chat template has been applied.

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>まどか☆マギカで一番可愛いのは誰？<|end|><|start|>assistant

The basics are similar to other models. The knowledge cutoff, the current date, and the reasoning level all appear in the prompt. A knowledge cutoff is also included in the templates of models such as Llama.
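To make the framing in that prompt concrete, here is a rough sketch that pulls the role, the body, and the reasoning level back out of the rendered string with regular expressions. It assumes only the delimiters visible in the prompt above; parse_rendered_prompt and reasoning_level are hypothetical helpers, and the Harmony library is the more robust way to do this for real.

```python
import re

# One complete message: <|start|>HEADER<|message|>BODY<|end|>
MESSAGE_RE = re.compile(
    r"<\|start\|>(?P<header>.*?)<\|message\|>(?P<body>.*?)<\|end\|>",
    re.DOTALL,
)


def parse_rendered_prompt(prompt):
    """Return (header, body) pairs for every complete message."""
    return [(m["header"], m["body"]) for m in MESSAGE_RE.finditer(prompt)]


def reasoning_level(system_body):
    """Extract the value of the 'Reasoning: ...' line, if present."""
    match = re.search(r"^Reasoning: (\w+)$", system_body, re.MULTILINE)
    return match.group(1) if match else None


# The rendered prompt shown above, reproduced as a Python string.
prompt = (
    "<|start|>system<|message|>You are ChatGPT, a large language model "
    "trained by OpenAI.\n"
    "Knowledge cutoff: 2024-06\n"
    "Current date: 2025-08-06\n\n"
    "Reasoning: medium\n\n"
    "# Valid channels: analysis, commentary, final. "
    "Channel must be included for every message."
    "<|end|><|start|>user<|message|>まどか☆マギカで一番可愛いのは誰？"
    "<|end|><|start|>assistant"
)

messages = parse_rendered_prompt(prompt)
print([header for header, _ in messages])
print(reasoning_level(messages[0][1]))
```

The trailing <|start|>assistant has no <|message|> yet, so it is not returned as a message. It is the generation prompt the model continues from.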

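Trying tool use

When you think of OpenAI, you think of tool use, right? So let's try that too.

For tool calling as well, I'll start from the official sample and try an example that can call a simple Python function.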

def get_current_weather(location: str):
    """
        Returns the current weather status at a given location as a string.

        Args:
            location: The location to get the weather for.
    """
    return "Terrestrial."  # We never said this was a good weather tool

chat = [
    {"role": "user", "content": "What's the weather in Paris right now?"}
]

inputs = tokenizer.apply_chat_template(
    chat, 
    tools=[get_current_weather], 
    builtin_tools=["browser", "python"],
    add_generation_prompt=True,
    tokenize=False,
)
outputs = llm.generate(
    inputs,
    sampling_params=sampling,
)

Parsing this output with OpenAI Harmony's parse_messages_from_completion_tokens, as before, gives the following.

[Message(author=Author(role=<Role.ASSISTANT: 'assistant'>, name=None), content=[TextContent(text='The user asks....')], channel='analysis', recipient=None, content_type=None),
 Message(author=Author(role=<Role.ASSISTANT: 'assistant'>, name=None), content=[TextContent(text='{\n  "location": "Paris"\n}')], channel='commentary', recipient='functions.get_current_weather', content_type='<|constrain|>json')]

Looking at the second Message, it contains:

text='{\n "location": "Paris"\n}' channel='commentary' recipient='functions.get_current_weather'

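The channel is commentary and the recipient is functions.get_current_weather. Together with the content of text, this suggests that we should pass the dictionary in text to get_current_weather.

It is a bit of a hack, but the tool can actually be executed and its result obtained like this.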

function_name = entries[1].recipient.split(".")[-1]
args = json.loads(entries[1].content[0].text)
tool_output = eval(function_name)(**args)  # quick hack: looks the function up by name
print(tool_output)  # prints 'Terrestrial.'
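Calling eval on a name chosen by the model is fine for a quick experiment, but a lookup table is a safer way to do the same thing. The following is a minimal sketch under that assumption. TOOL_REGISTRY and dispatch_tool_call are hypothetical names, not part of vLLM or openai_harmony, and the recipient and JSON payload are copied from the parsed message above.

```python
import json


def get_current_weather(location: str) -> str:
    """Same toy tool as in the article."""
    return "Terrestrial."


# Only functions listed here can be called by the model.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def dispatch_tool_call(recipient, arguments_json, registry=TOOL_REGISTRY):
    """Run the tool named by a 'functions.<name>' recipient."""
    namespace, _, name = recipient.partition(".")
    if namespace != "functions" or name not in registry:
        raise KeyError(f"unknown tool: {recipient}")
    kwargs = json.loads(arguments_json)
    return registry[name](**kwargs)


result = dispatch_tool_call(
    "functions.get_current_weather",
    '{\n  "location": "Paris"\n}',
)
print(result)
```

An unknown recipient raises KeyError instead of evaluating arbitrary text, which is the main point of the registry.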

Next, append the tool call that was executed and the tool's result to chat.

# Append the tool call
tool_call_message = {
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": function_name, 
                "arguments": args
            }
        }
    ]
}
chat.append(tool_call_message)
# Append the tool response
tool_response_message = {
    "role": "tool",
    "content": tool_output
}
chat.append(tool_response_message)

Feeding this back into the model and parsing the result with OpenAI Harmony gives:

[Message(author=Author(role=<Role.ASSISTANT: 'assistant'>, name=None), content=[TextContent(text='We have a...')], channel='analysis', recipient=None, content_type=None),
 Message(author=Author(role=<Role.ASSISTANT: 'assistant'>, name=None), content=[TextContent(text='I’m sorry—I wasn’t able to retrieve the current weather for Paris at the moment. You might want to check a trusted weather service or app for up‑to‑date conditions.')], channel='final', recipient=None, content_type=None)]

The final message is the text of the second entry above: an apology saying that it could not retrieve the current weather for Paris, with a suggestion to check a trusted weather service instead.
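The two passes above (generate, run the tool, append, generate again) can be folded into a small loop. This is a sketch under stated assumptions, not the official flow: run_tool_loop is a hypothetical helper, and fake_generate stands in for "apply the chat template, call llm.generate, parse with Harmony" so that the snippet runs without a model.

```python
import json
from types import SimpleNamespace


def msg(channel, text, recipient=None):
    """Stand-in for a parsed Harmony Message (assumption, for illustration)."""
    return SimpleNamespace(
        channel=channel, recipient=recipient, content=[SimpleNamespace(text=text)]
    )


def run_tool_loop(chat, generate_fn, tools, max_rounds=4):
    """Call tools until the model emits a message without a recipient."""
    for _ in range(max_rounds):
        entries = generate_fn(chat)
        call = next((m for m in entries if m.recipient), None)
        if call is None:
            finals = [m for m in entries if m.channel == "final"]
            return finals[-1].content[0].text if finals else None
        name = call.recipient.split(".")[-1]
        args = json.loads(call.content[0].text)
        output = tools[name](**args)
        # Same message shapes as the chat.append calls above.
        chat.append({
            "role": "assistant",
            "tool_calls": [
                {"type": "function", "function": {"name": name, "arguments": args}}
            ],
        })
        chat.append({"role": "tool", "content": output})
    return None


def fake_generate(chat):
    """Fake model: asks for the tool once, then answers."""
    if chat[-1]["role"] == "tool":
        return [msg("analysis", "We have the tool result."),
                msg("final", f"The tool said: {chat[-1]['content']}")]
    return [msg("analysis", "Need the weather tool."),
            msg("commentary", '{"location": "Paris"}',
                recipient="functions.get_current_weather")]


chat = [{"role": "user", "content": "What's the weather in Paris right now?"}]
answer = run_tool_loop(chat, fake_generate, {"get_current_weather": lambda location: "Terrestrial."})
print(answer)
```

The max_rounds cap is a design choice to keep a misbehaving model from looping on tool calls forever.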

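Trying built-in tools

While we're at it, let's also try the predefined built-in tools.

"python" and "browser" appear in the example. My guess is that these are tools the model was trained on, so they work just by being named, without writing a tool definition.

Python

I asked 3050年の1月1日は何曜日？ ("What day of the week is January 1, 3050?"), hoping it would compute the answer with Python.

As before, the channel was commentary, and the text part contained the following code.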

import datetime, math, sys, calendar, itertools, time, json, os, re, collections, random, statistics, fractions, decimal, typing, itertools, functools, hashlib, string, fractions, fractions
datetime.date(3050,1,1).weekday()

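It really looks like it is importing everything in sight just in case, but the code it produced does what was intended. With an environment to execute it in, this looks perfectly usable!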
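As a rough idea of what such an execution environment has to do, here is a minimal sketch that runs a code string and returns the value of its last expression, the way a notebook cell would. run_python_tool is a hypothetical helper, the import line is shortened to what the calculation needs, and there is no sandboxing here, so model-generated code should only be run like this inside an isolated environment.

```python
import ast

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def run_python_tool(code: str):
    """Execute code and return the value of a trailing expression, if any."""
    tree = ast.parse(code, mode="exec")
    namespace = {}
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # Split off the last expression so its value can be returned.
        last = ast.Expression(tree.body.pop().value)
        exec(compile(tree, "<tool>", "exec"), namespace)
        return eval(compile(last, "<tool>", "eval"), namespace)
    exec(compile(tree, "<tool>", "exec"), namespace)
    return None


# Shortened version of the code the model produced above.
generated = "import datetime\ndatetime.date(3050, 1, 1).weekday()"

weekday = run_python_tool(generated)
print(weekday, DAY_NAMES[weekday])  # prints: 1 Tuesday
```

Python's weekday() counts from Monday as 0, so the result 1 means January 1, 3050 falls on a Tuesday.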

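Browser

I asked 2025年の流行語を調べてください。 ("Please look up the buzzwords of 2025."), hoping it would search for them. (Since the cutoff information is in the prompt, it shouldn't try to answer from imagination.)

This is the result I got.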

'{"query": "2025年 流行語大賞", "topn": 10, "source": "news"}'

The recipient was 'browser.search'. I'm curious whether browser offers anything besides search, but with this it looks like web search is feasible.
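Wiring this up means routing the browser.search recipient to some search backend of your own. The sketch below shows only that routing step. handle_browser_call and fake_search are hypothetical, the backend returns placeholder strings rather than real results, and the payload is the one the model produced above.

```python
import json


def fake_search(query, topn, source):
    """Placeholder backend (assumption): returns dummy result strings."""
    return [f"result {i + 1} for {query!r} ({source})" for i in range(topn)]


def handle_browser_call(recipient, payload, search_fn=fake_search):
    """Route a 'browser.search' call to a search function."""
    if recipient != "browser.search":
        # Other browser actions are out of scope for this sketch.
        raise NotImplementedError(recipient)
    args = json.loads(payload)
    return search_fn(
        args["query"],
        topn=args.get("topn", 10),
        source=args.get("source"),
    )


payload = '{"query": "2025年 流行語大賞", "topn": 10, "source": "news"}'
results = handle_browser_call("browser.search", payload)
print(len(results), results[0])
```

The results would then be appended to the conversation as a tool response, in the same way as the weather example earlier.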

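Can multiple tools be used in parallel?

With OpenAI Deep Research and the like, my impression is that they search with several queries at once to investigate from multiple angles. Let's check whether gpt-oss is also designed to use multiple tools in parallel.

I asked スペインとフランスそれぞれの最高気温を調べてください ("Please look up the highest temperatures for Spain and France respectively") to see whether it would issue a tool search for each of Spain and France.

Below is the result after parsing with OpenAI Harmony.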

[Message(author=Author(role=<Role.ASSISTANT: 'assistant'>, name=None), content=[TextContent(text="We need to find the highest temperature records for Spain and France. Likely they want the highest temperature ever recorded in each country (national record high). Let's search.")], channel='analysis', recipient=None, content_type=None),
 Message(author=Author(role=<Role.ASSISTANT: 'assistant'>, name=None), content=[TextContent(text='{"query": "最高気温 スペイン 記録", "topn": 10, "source": "news"}')], channel='analysis', recipient='browser.search', content_type='code')]

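I thought the list might come back with three entries, with two tool calls made at once, but the only thing being searched is Spain. At least in this run, it does not appear to be designed to use multiple tools in parallel.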
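A quick way to check this programmatically is to count the messages that carry a recipient. The snippet is a small sketch: tool_calls is a hypothetical helper, and the SimpleNamespace objects mimic only the channel and recipient attributes of the two messages shown above.

```python
from types import SimpleNamespace


def tool_calls(entries):
    """Return the messages addressed to a tool (recipient is set)."""
    return [m for m in entries if m.recipient is not None]


# Stand-ins mirroring the two parsed messages above.
entries = [
    SimpleNamespace(channel="analysis", recipient=None),
    SimpleNamespace(channel="analysis", recipient="browser.search"),
]

calls = tool_calls(entries)
print(len(calls), [m.recipient for m in calls])
```

One tool call per generation is consistent with a sequential flow: run the search, append the result, and generate again, at which point the model could go on to search for France.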