Introduction to Synthetic Data Generation with NVIDIA Cosmos World Foundation Models: Cosmos Transfer Edition

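This article is the second part of the following article.

tech-blog.abeja.asia

For an overview of NVIDIA Cosmos and for Cosmos Predict, see the first part linked above.

In this second part, we run Cosmos Transfer hands-on and check what it can do.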

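Synthetic data generation with Cosmos Transfer
Data augmentation with Cosmos Transfer
Summary
We Are Hiring!

Synthetic data generation with Cosmos Transfer

Next we generate synthetic data with Cosmos Transfer. Like Cosmos Predict, it uses world foundation models, but Cosmos Transfer focuses on transformation (the "Transfer" in its name): generating video from edge, depth map, or segmentation information, and augmenting existing data.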

github.com

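Environment setup

Prepare a GPU server with an Ampere-generation or newer GPU

As with Cosmos Predict, prepare a GPU server with an Ampere-generation or newer GPU.

I used the environment below. Because inference of the world foundation models runs inside a docker container built from the official repository, the Ubuntu and CUDA versions on the host should matter little.

 - Ubuntu version: 22.04
 - GPU: A100
 - GPU driver version: 575.51.03
 - CUDA version: 12.9
 - GPU memory: 50GB or more
 - CPU memory: 50GB or more
 - Disk space: about 1TB (the pretrained checkpoints of the Cosmos Predict and Cosmos Transfer world foundation models alone need more than 500GB)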

Clone the Cosmos Transfer repository

Clone the cosmos-transfer1 repository with the commands below.
```
 git clone https://github.com/nvidia-cosmos/cosmos-transfer1
 cd cosmos-transfer1
```
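Install Cosmos Transfer

We run everything in a docker container. No official docker image seems to be published for Cosmos Transfer, so build the image and start a container with the commands below (note that the build takes quite a long time).
```
 # Build the docker image (run inside the cloned repository)
 docker build -f Dockerfile . -t nvcr.io/$USER/cosmos-transfer1:latest

 # Start the docker container
 # (the -v paths assume the current directory is the parent of the cosmos-transfer1 clone)
 docker run -it \
     -v $(pwd)/cosmos-transfer1:/workspace \
     -v $(pwd)/cosmos-transfer1/checkpoints:/workspace/checkpoints \
     --gpus all \
     nvcr.io/$USER/cosmos-transfer1:latest
```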
Log in to HuggingFace

The pretrained checkpoints of the Cosmos Transfer world foundation models are also hosted on HuggingFace, so log in beforehand so that the checkpoints can be downloaded.
```
 huggingface-cli login --token ${API_TOKEN}
```
Create the HuggingFace API token passed to the --token argument in advance on the HuggingFace page below.

huggingface.co

Grant access to the models on Hugging Face

Request access on Hugging Face to each model used by the Cosmos Transfer world foundation models.

Download the pretrained models

Download the pretrained checkpoints of the world foundation models used by Cosmos Transfer with the command below. All models together take more than 300GB, so watch your free disk space.
```
 PYTHONPATH=$(pwd) python scripts/download_checkpoints.py --output_dir checkpoints/
```
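
Because the download is this large, it may help to confirm the free space first. The following is a minimal sketch using only the Python standard library; the 300 figure is taken from the estimate above, and the helper names `free_gib` and `has_free_space` are hypothetical, not part of the Cosmos Transfer repository.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / GIB


def has_free_space(path: str, required_gib: float) -> bool:
    """True if the filesystem holding `path` has at least `required_gib` free."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # 300 comes from the article's estimate (the GB/GiB difference is ignored here).
    required = 300
    # "." is used because checkpoints/ may not exist yet.
    if has_free_space(".", required):
        print(f"OK: {free_gib('.'):.0f} GiB free")
    else:
        print(f"Not enough space: {free_gib('.'):.0f} GiB free, {required} GiB needed")
```

Running it in the repository root before `download_checkpoints.py` gives an early warning instead of a failure partway through the download.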

Synthetic data generation from edge information

First, we use the Cosmos Transfer world foundation model to generate, from the edge information of object boundaries, a video in which the colors and materials of the objects have changed.

Connect to the container

 docker run -it \
     -v $(pwd)/cosmos-transfer1:/workspace \
     -v $(pwd)/cosmos-transfer1/checkpoints:/workspace/checkpoints \
     --gpus all \
     nvcr.io/$USER/cosmos-transfer1:latest

Run the inference script

Run the following commands inside the container.

 export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
 export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
 export NUM_GPU="${NUM_GPU:=1}"

 # Run the inference script
 # If GPU memory allows, dropping the --offload_xxx arguments shortens inference time
 PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
     --checkpoint_dir $CHECKPOINT_DIR \
     --video_save_folder outputs/example1_single_control_edge \
     --controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
     --offload_text_encoder_model \
     --offload_guardrail_models \
     --num_gpus $NUM_GPU

The configuration file passed to the --controlnet_specs argument looks like this; it defines the input data, with the input text in the "prompt" field.

assets/inference_cosmos_transfer1_single_control_edge.json

  {
      "prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design. The background features several people working at desks, indicating a busy workplace atmosphere. The main focus is on a robotic interaction at a counter. Two robotic arms, equipped with black gloves, are seen handling a red and white patterned coffee cup with a black lid. The arms are positioned in front of a woman who is standing on the opposite side of the counter. She is wearing a dark vest over a gray long-sleeve shirt and has long dark hair. The robotic arms are articulated and move with precision, suggesting advanced technology. \n\nAt the beginning, the robotic arms hold the coffee cup securely. As the video progresses, the woman reaches out with her right hand to take the cup. The interaction is smooth, with the robotic arms adjusting their grip to facilitate the handover. The woman's hand approaches the cup, and she grasps it confidently, lifting it from the robotic grip. The camera remains static throughout, focusing on the exchange between the robotic arms and the woman. The setting includes a white countertop with a container holding stir sticks and a potted plant, adding to the modern aesthetic. The video highlights the seamless integration of robotics in everyday tasks, emphasizing efficiency and precision in a contemporary office setting.",
      "input_video_path" : "assets/example1_input_video.mp4",
      "edge": {
          "control_weight": 1.0
      }
  }
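
Before launching a long inference run, a spec file of this shape can be sanity-checked. The sketch below is an illustration, not part of the Cosmos Transfer repository: it parses a spec with the standard `json` module (which rejects comments and trailing commas) and lists the files the spec refers to. The helper names `referenced_paths` and `missing_paths` are hypothetical, and the prompt is a placeholder.

```python
import json
import os


def referenced_paths(spec: dict) -> list[str]:
    """Collect the file paths a spec refers to (input video plus any input_control)."""
    paths = []
    if "input_video_path" in spec:
        paths.append(spec["input_video_path"])
    for value in spec.values():
        # Control entries such as "edge" or "depth" are nested objects.
        if isinstance(value, dict) and "input_control" in value:
            paths.append(value["input_control"])
    return paths


def missing_paths(spec: dict, root: str = ".") -> list[str]:
    """Return the referenced paths that do not exist under `root`."""
    return [p for p in referenced_paths(spec)
            if not os.path.exists(os.path.join(root, p))]


# Placeholder prompt; the real prompt is the long description in the listing.
spec_text = """
{
    "prompt": "placeholder prompt",
    "input_video_path": "assets/example1_input_video.mp4",
    "edge": {"control_weight": 1.0}
}
"""
spec = json.loads(spec_text)  # raises json.JSONDecodeError on invalid JSON
print("referenced:", referenced_paths(spec))
print("missing:", missing_paths(spec))
```

Run from the repository root, an empty "missing" list suggests the paths in the spec resolve; the check says nothing about whether the videos themselves are valid.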

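Running the inference script above generates a video like the one below.

Given {text, RGB video, edge video} as input, a photorealistic video is generated in which the colors and materials of objects such as the table and the cup have changed.

Input text	Input video (RGB video + edge video)	Output video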
The video is set in a modern, well-lit office ...

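Synthetic data generation from depth map information

Next, we generate a video from depth map information in which, again, the colors and materials of the objects have changed.

Connect to the container

 docker run -it \
     -v $(pwd)/cosmos-transfer1:/workspace \
     -v $(pwd)/cosmos-transfer1/checkpoints:/workspace/checkpoints \
     --gpus all \
     nvcr.io/$USER/cosmos-transfer1:latest

Run the inference script

Run the following commands inside the container.

 export CUDA_VISIBLE_DEVICES=0
 export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
 export NUM_GPU="${NUM_GPU:=1}"

 # Run the inference script
 # If GPU memory allows, dropping the --offload_xxx arguments shortens inference time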
 PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
     --checkpoint_dir $CHECKPOINT_DIR \
     --video_save_folder outputs/example1_single_control_depth \
     --controlnet_specs assets/inference_cosmos_transfer1_single_control_depth.json \
     --offload_text_encoder_model \
     --offload_guardrail_models \
     --num_gpus $NUM_GPU

The configuration file passed to the --controlnet_specs argument looks like this; it defines the input data, with the input text in the "prompt" field.

assets/inference_cosmos_transfer1_single_control_depth.json

  {
      "prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design. The background features several people working at desks, indicating a busy workplace atmosphere. The main focus is on a robotic interaction at a counter. Two robotic arms, equipped with black gloves, are seen handling a red and white patterned coffee cup with a black lid. The arms are positioned in front of a woman who is standing on the opposite side of the counter. She is wearing a dark vest over a gray long-sleeve shirt and has long dark hair. The robotic arms are articulated and move with precision, suggesting advanced technology. \n\nAt the beginning, the robotic arms hold the coffee cup securely. As the video progresses, the woman reaches out with her right hand to take the cup. The interaction is smooth, with the robotic arms adjusting their grip to facilitate the handover. The woman's hand approaches the cup, and she grasps it confidently, lifting it from the robotic grip. The camera remains static throughout, focusing on the exchange between the robotic arms and the woman. The setting includes a white countertop with a container holding stir sticks and a potted plant, adding to the modern aesthetic. The video highlights the seamless integration of robotics in everyday tasks, emphasizing efficiency and precision in a contemporary office setting.",
      "input_video_path" : "assets/example1_input_video.mp4",
      "depth": {
          "input_control": "assets/example1_depth.mp4",
          "control_weight": 1.0
      }
  }

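Running the inference script above generates a video like the one below.

Given {text, RGB video, depth map video} as input, a photorealistic video is generated in which the colors and materials of objects such as the table and the cup have changed.

Input text	Input video (RGB video + depth map video)	Output video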
The video is set in a modern, well-lit office ...

Synthetic data generation from segmentation information

Here we generate a video from the segmentation information of the objects in which the colors and materials of the objects have changed.

Connect to the container

 docker run -it \
     -v $(pwd)/cosmos-transfer1:/workspace \
     -v $(pwd)/cosmos-transfer1/checkpoints:/workspace/checkpoints \
     --gpus all \
     nvcr.io/$USER/cosmos-transfer1:latest

Run the inference script

Run the following commands inside the container.

 export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
 export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
 export NUM_GPU="${NUM_GPU:=1}"

 # Run the inference script
 # If GPU memory allows, dropping the --offload_xxx arguments shortens inference time
 PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
     --checkpoint_dir $CHECKPOINT_DIR \
     --video_save_folder outputs/example1_single_control_seg \
     --controlnet_specs assets/inference_cosmos_transfer1_single_control_seg.json \
     --offload_text_encoder_model \
     --offload_guardrail_models \
     --num_gpus $NUM_GPU

The configuration file passed to the --controlnet_specs argument looks like this; it defines the input data, with the input text in the "prompt" field.

assets/inference_cosmos_transfer1_single_control_seg.json

  {
      "prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design. The background features several people working at desks, indicating a busy workplace atmosphere. The main focus is on a robotic interaction at a counter. Two robotic arms, equipped with black gloves, are seen handling a red and white patterned coffee cup with a black lid. The arms are positioned in front of a woman who is standing on the opposite side of the counter. She is wearing a dark vest over a gray long-sleeve shirt and has long dark hair. The robotic arms are articulated and move with precision, suggesting advanced technology. \n\nAt the beginning, the robotic arms hold the coffee cup securely. As the video progresses, the woman reaches out with her right hand to take the cup. The interaction is smooth, with the robotic arms adjusting their grip to facilitate the handover. The woman's hand approaches the cup, and she grasps it confidently, lifting it from the robotic grip. The camera remains static throughout, focusing on the exchange between the robotic arms and the woman. The setting includes a white countertop with a container holding stir sticks and a potted plant, adding to the modern aesthetic. The video highlights the seamless integration of robotics in everyday tasks, emphasizing efficiency and precision in a contemporary office setting.",
      "input_video_path" : "assets/example1_input_video.mp4",
      "seg": {
          "input_control": "assets/example1_seg.mp4",
          "control_weight": 1.0
      }
  }

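Running the inference script above generates a video like the one below.

Given {text, RGB video, segmentation video} as input, a photorealistic video is generated in which the colors and materials of objects such as the table and the cup have changed.

Input text	Input video (RGB video + segmentation video)	Output video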
The video is set in a modern, well-lit office ...

Multimodal synthetic data generation

Here we combine the {edge, depth map, segmentation} information of the objects as multimodal input and generate a video in which the colors and materials of the objects have changed.

Connect to the container

 docker run -it \
     -v $(pwd)/cosmos-transfer1:/workspace \
     -v $(pwd)/cosmos-transfer1/checkpoints:/workspace/checkpoints \
     --gpus all \
     nvcr.io/$USER/cosmos-transfer1:latest

Run the inference script

Run the following commands inside the container.

 export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
 export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
 export NUM_GPU="${NUM_GPU:=1}"

 # Run the inference script
 # If GPU memory allows, dropping the --offload_xxx arguments shortens inference time
 PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
     --checkpoint_dir $CHECKPOINT_DIR \
     --video_save_folder outputs/example2_uniform_weights \
     --controlnet_specs assets/inference_cosmos_transfer1_uniform_weights.json \
     --offload_text_encoder_model \
     --offload_guardrail_models \
     --num_gpus $NUM_GPU

The configuration file passed to the --controlnet_specs argument looks like this; it defines the input data.

assets/inference_cosmos_transfer1_uniform_weights.json

  {
      "prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design. The background features several people working at desks, indicating a busy workplace atmosphere. The main focus is on a robotic interaction at a counter. Two robotic arms, equipped with black gloves, are seen handling a red and white patterned coffee cup with a black lid. The arms are positioned in front of a woman who is standing on the opposite side of the counter. She is wearing a dark vest over a gray long-sleeve shirt and has long dark hair. The robotic arms are articulated and move with precision, suggesting advanced technology. \n\nAt the beginning, the robotic arms hold the coffee cup securely. As the video progresses, the woman reaches out with her right hand to take the cup. The interaction is smooth, with the robotic arms adjusting their grip to facilitate the handover. The woman's hand approaches the cup, and she grasps it confidently, lifting it from the robotic grip. The camera remains static throughout, focusing on the exchange between the robotic arms and the woman. The setting includes a white countertop with a container holding stir sticks and a potted plant, adding to the modern aesthetic. The video highlights the seamless integration of robotics in everyday tasks, emphasizing efficiency and precision in a contemporary office setting.",
      "input_video_path" : "assets/example1_input_video.mp4",
      "vis": {
          "control_weight": 0.25
      },
      "edge": {
          "control_weight": 0.25
      },
      "depth": {
          "input_control": "assets/example1_depth.mp4",
          "control_weight": 0.25
      },
      "seg": {
          "input_control": "assets/example1_seg.mp4",
          "control_weight": 0.25
      }
  }
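
A spec with equal weights like the one in this listing could also be generated programmatically. The following is a minimal sketch; `uniform_weight_spec` is a hypothetical helper rather than a Cosmos Transfer API, and the prompt is a placeholder. It follows the listing in giving `vis` and `edge` no `input_control` and giving `depth` and `seg` a precomputed video.

```python
import json


def uniform_weight_spec(prompt: str, input_video_path: str, controls: dict) -> dict:
    """Build a multi-control spec in which every control gets the same weight.

    `controls` maps a control name ("vis", "edge", "depth", "seg") to the path
    of a precomputed control video, or to None when no such file is given.
    """
    if not controls:
        raise ValueError("at least one control is required")
    weight = 1.0 / len(controls)  # four controls -> 0.25 each, as in the listing
    spec = {"prompt": prompt, "input_video_path": input_video_path}
    for name, control_path in controls.items():
        entry = {}
        if control_path is not None:
            entry["input_control"] = control_path
        entry["control_weight"] = weight
        spec[name] = entry
    return spec


spec = uniform_weight_spec(
    "placeholder prompt",
    "assets/example1_input_video.mp4",
    {
        "vis": None,
        "edge": None,
        "depth": "assets/example1_depth.mp4",
        "seg": "assets/example1_seg.mp4",
    },
)
print(json.dumps(spec, indent=4))
```

Writing the result to a file and passing that file to `--controlnet_specs` would mirror the manual setup; dropping a control from the dictionary changes the shared weight accordingly.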

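Running the inference script above generates a video like the one below.

Given {text, RGB video, edge video, depth map video, segmentation video} as input, a photorealistic video is generated in which the colors and materials of objects such as the table and the cup have changed.

Input text	Input video (RGB video, edge video, depth map video, segmentation video)	Output video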
The video is set in a modern, well-lit office ...

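Data augmentation with Cosmos Transfer

Cosmos Transfer can also be used for data augmentation.

Data augmentation pads out a training dataset by applying various deformations and transformations (image rotation, for example) to the data. It is a very accessible and effective way to improve the generalization of AI models.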
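
As a toy illustration of this classical kind of augmentation, the sketch below flips and rotates a tiny image stored as nested lists. It is plain Python written for this explanation, not the PyTorch pipeline from the linked post and not anything Cosmos does; the function names are hypothetical.

```python
def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image):
    """Return the original image together with its transformed copies."""
    return [image, hflip(image), rotate90(image)]


image = [
    [1, 2],
    [3, 4],
]
for variant in augment(image):
    print(variant)
```

One input image becomes three training samples, but every variant still shows the same textures and lighting, which is the limitation that photorealistic augmentation is meant to address.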

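For the benefits of data augmentation in robotics models such as VLA, see the blog post below.

tech-blog.abeja.asia

That post only applies basic data augmentation with PyTorch, but Cosmos enables more advanced, photorealistic data augmentation that is not possible with PyTorch or other data augmentation libraries.

The details are left to a separate article, but this data augmentation with Cosmos Transfer is how Cosmos is actually put to use in robotics model development.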

Connect to the container

 docker run -it \
     -v $(pwd)/cosmos-transfer1:/workspace \
     -v $(pwd)/cosmos-transfer1/checkpoints:/workspace/checkpoints \
     --gpus all \
     nvcr.io/$USER/cosmos-transfer1:latest

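Prepare the list of segmentation images

First, the segmentation images for each video frame must be placed in the segmentation directory, and the JSON files defining the segmentation labels for each video frame must be placed under the segmentation_label directory.

In this sample they are already in place, so this step is not needed.

[Segmentation image example]

[Segmentation label example]

Frame 1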

  {
      "(29, 0, 0, 255)": {
          "class": "gripper0_right_r_palm_vis"
      },
      "(31, 0, 0, 255)": {
          "class": "gripper0_right_R_thumb_proximal_base_link_vis"
      },
      "(33, 0, 0, 255)": {
          "class": "gripper0_right_R_thumb_proximal_link_vis"
      },
      ...
  }

Frame 2

  {
      "(29, 0, 0, 255)": {
          "class": "gripper0_right_r_palm_vis"
      },
      "(31, 0, 0, 255)": {
          "class": "gripper0_right_R_thumb_proximal_base_link_vis"
      },
      "(33, 0, 0, 255)": {
          "class": "gripper0_right_R_thumb_proximal_link_vis"
      },
      ...
  }
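
To make the label format concrete, the sketch below parses the RGBA keys of such a label file and picks out the classes whose names contain a robot-related keyword. This is only an illustration under assumptions: the substring matching and the helper name `robot_colors` are hypothetical and may differ from what the repository's preprocessing script actually does, and the `table_vis` entry is invented for the example.

```python
import ast


def robot_colors(labels: dict, keywords: list[str]) -> set[tuple]:
    """Return the RGBA colors whose class name contains any of the keywords."""
    colors = set()
    for color_text, info in labels.items():
        if any(keyword in info["class"] for keyword in keywords):
            # Keys look like "(29, 0, 0, 255)"; literal_eval turns them into tuples.
            colors.add(ast.literal_eval(color_text))
    return colors


labels = {
    "(29, 0, 0, 255)": {"class": "gripper0_right_r_palm_vis"},
    "(31, 0, 0, 255)": {"class": "gripper0_right_R_thumb_proximal_base_link_vis"},
    "(99, 0, 0, 255)": {"class": "table_vis"},  # hypothetical non-robot entry
}

print(sorted(robot_colors(labels, ["world_robot", "gripper", "robot"])))
```

Pixels in a segmentation image whose color is in the returned set would count as robot (foreground), and everything else as background.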

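Run the preprocessing script that generates mask images

To create the lists of foreground and background mask images for the robot, run the following command inside the container.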
```
 PYTHONPATH=$(pwd) python cosmos_transfer1/auxiliary/robot_augmentation/spatial_temporal_weight.py \
     --setting fg_vis_edge_bg_seg \
     --robot-keywords world_robot gripper robot \
     --input-dir assets/robot_augmentation_example \
     --output-dir outputs/robot_augmentation_example
```
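After this script runs, mask images and per-modality weight files (.pt files such as edge_weights.pt and depth_weights.pt) are written under the outputs/robot_augmentation_example directory, and the inference script that follows uses them.

Run the data augmentation inference script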
```
 export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
 export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
 export NUM_GPU="${NUM_GPU:=1}"

 # Run the inference script
 # If GPU memory allows, dropping the --offload_xxx arguments shortens inference time
 PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 \
 cosmos_transfer1/diffusion/inference/transfer.py \
     --checkpoint_dir $CHECKPOINT_DIR \
     --video_save_folder outputs/robot_example_spatial_temporal_setting1 \
     --controlnet_specs assets/robot_augmentation_example/example1/inference_cosmos_transfer1_robot_spatiotemporal_weights.json \
     --offload_text_encoder_model \
     --offload_guardrail_models \
     --offload_prompt_upsampler \
     --seed 8 \
     --num_gpus $NUM_GPU
```
The configuration file passed to the --controlnet_specs argument looks like this; it defines the input data.
- Configuration file: assets/robot_augmentation_example/example1/inference_cosmos_transfer1_robot_spatiotemporal_weights.json
```
  {
      "prompt": "a robotic grasp an apple from the table and move it to another place.",
      "input_video_path" : "assets/robot_augmentation_example/example1/input_video.mp4",
      "vis": {
          "control_weight": "outputs/robot_augmentation_example/example1/vis_weights.pt"
      },
      "edge": {
          "control_weight": "outputs/robot_augmentation_example/example1/edge_weights.pt"
      },
      "depth": {
          "control_weight": "outputs/robot_augmentation_example/example1/depth_weights.pt"
      },
      "seg": {
          "input_control": "assets/robot_augmentation_example/example1/segmentation.mp4",
          "control_weight": "outputs/robot_augmentation_example/example1/seg_weights.pt"
      }
  }
```
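Running the inference script above generates a video like the one below.

Given {text, RGB video} as input, a photorealistic video is generated in which the colors and materials of objects such as the table and the cutting board have changed, which shows that the data has been augmented.

In particular, the input video here was generated in the Isaac Sim 3D simulator, so it is also worth noting that a video created in a simulator can be augmented into one that looks close to the real world.

Input text	Input video	Output video
a robotic grasp an apple from the table and move it to another place.

Changing the input text in the same way to "A robot grasps a banana from the table and moves it to another location. The table has light reflections rendered with ray tracing." and running inference produced the video output below.

Input text	Input video	Output video
A robot grasps a banana from the table and moves it to another location. The table has light reflections rendered with ray tracing.

By varying the input text and the inference parameters (--blur_strength, --canny_threshold, --seed, and so on), data augmentation that changes the colors and materials of each object in many different ways is also possible, as shown below.

Input text	Input video	Output video
a robotic grasp an apple from the table and move it to another place.

Input text	Input video	Output video
A robot grasps a banana from the table and moves it to another location. The table has light reflections rendered with ray tracing.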
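
A sweep over prompts and seeds like the one described here could be scripted roughly as follows. This is a sketch under assumptions: it only builds and prints the command lines, reusing the flags from the inference command shown earlier; the spec file paths (one per prompt) and the output folder layout are hypothetical.

```python
import itertools
import os
import shlex


def build_command(spec_path, save_folder, seed, num_gpu=1, checkpoint_dir="./checkpoints"):
    """Build one torchrun command line with the flags used in this article."""
    return [
        "torchrun", f"--nproc_per_node={num_gpu}", "--nnodes=1", "--node_rank=0",
        "cosmos_transfer1/diffusion/inference/transfer.py",
        "--checkpoint_dir", checkpoint_dir,
        "--video_save_folder", save_folder,
        "--controlnet_specs", spec_path,
        "--offload_text_encoder_model",
        "--offload_guardrail_models",
        "--seed", str(seed),
        "--num_gpus", str(num_gpu),
    ]


def sweep(spec_paths, seeds, out_root="outputs/sweep"):
    """Return one command per (spec file, seed) combination."""
    commands = []
    for spec_path, seed in itertools.product(spec_paths, seeds):
        stem = os.path.splitext(os.path.basename(spec_path))[0]
        commands.append(build_command(spec_path, f"{out_root}/{stem}_seed{seed}", seed))
    return commands


# Hypothetical spec files, each holding a different prompt.
for command in sweep(["specs/apple.json", "specs/banana.json"], [1, 8]):
    # PYTHONPATH must point at the repository root when the command is actually run.
    print(shlex.join(command))
```

Printing the commands first makes it easy to review the sweep before committing GPU time; each line can then be run inside the container with `PYTHONPATH=$(pwd)` set.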

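Summary

Across the first and second parts of this series, we ran NVIDIA Cosmos hands-on to check its main features.

Cosmos is billed as a platform that uses world foundation models to accelerate physical AI development in areas such as autonomous driving and robotics. More concretely, we saw that the synthetic data Cosmos actually outputs is video data (not just any video, but photorealistic video that respects the laws of physics).

However, training datasets for robotics models such as VLA models need paired data of {robot state vector, robot action vector, observation data (camera images and so on)}, so generated video alone cannot be used as a training dataset as is.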
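
The gap can be pictured with a small data structure. The sketch below is purely illustrative: the `Step` class, its field names, and the 7-dimensional vectors are hypothetical and do not correspond to any specific VLA dataset format.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One training sample for a robot policy (hypothetical layout)."""
    state: list[float]    # robot state vector, e.g. joint angles
    action: list[float]   # robot action vector, e.g. joint commands
    observation: str      # path to the camera frame for this step


def is_complete(step: Step) -> bool:
    """A sample is usable only when state, action, and observation are all present."""
    return bool(step.state) and bool(step.action) and bool(step.observation)


recorded = Step(state=[0.0] * 7, action=[0.0] * 7, observation="frame_0000.png")
video_only = Step(state=[], action=[], observation="frame_0000.png")

print(is_complete(recorded))
print(is_complete(video_only))
```

A generated video fills only the `observation` field, so the state and action for each frame have to come from somewhere else.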

I plan to explain how Cosmos can be used to solve this problem in the next article.