Director Error on Single GPU
Thu Apr  6 18:06:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:09.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
 18:06:24 up 10 days,  3:17,  3 users,  load average: 0.06, 0.05, 0.04
Using config: dmc_vision.
Running task: dmc_walker_walk.
2023-04-06 18:06:26.259836: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-06 18:06:34.003624: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /vol/cuda/11.4.120-cudnn8.2.4/lib64:/vol/cuda/11.4.120-cudnn8.2.4/lib:/usr/lib/x86_64-linux-gnu
2023-04-06 18:06:34.012961: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /vol/cuda/11.4.120-cudnn8.2.4/lib64:/vol/cuda/11.4.120-cudnn8.2.4/lib:/usr/lib/x86_64-linux-gnu
2023-04-06 18:06:34.012991: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
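The libnvinfer warnings above only disable the optional TensorRT integration; they are unrelated to the crash further down. A quick sanity check (a minimal sketch, not part of the original run) that TensorFlow can still see the T4:

    import tensorflow as tf

    # Expect one entry for the Tesla T4; an empty list would indicate a CUDA setup problem.
    print(tf.config.list_physical_devices('GPU'))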
Config:
logdir: /vol/bitbucket/jk3417/explainable-mbhrl/logdir/20230406-180624 (str)
run: train_with_viz (str)
seed: 0 (int)
task: dmc_walker_walk (str)
env.amount: 4 (int)
env.parallel: process (str)
env.daemon: False (bool)
env.repeat: 1 (int)
env.size: [64, 64] (ints)
env.camera: -1 (int)
env.gray: False (bool)
env.length: 0 (int)
env.discretize: 0 (int)
env.lives: False (bool)
env.sticky: True (bool)
env.episodic: True (bool)
env.restart: True (bool)
env.again: False (bool)
env.termination: False (bool)
env.weaker: 1.0 (float)
env.seed: 0 (int)
replay: fixed (str)
replay_size: 1000000.0 (float)
replay_chunk: 64 (int)
replay_fixed.prio_starts: 0.0 (float)
replay_fixed.prio_ends: 1.0 (float)
replay_fixed.sync: 0 (int)
replay_consec.sync: 0 (int)
replay_prio.prio_starts: 0.0 (float)
replay_prio.prio_ends: 1.0 (float)
replay_prio.sync: 0 (int)
replay_prio.fraction: 0.1 (float)
replay_prio.softmax: False (bool)
replay_prio.temp: 1.0 (float)
replay_prio.constant: 0.0 (float)
replay_prio.exponent: 0.5 (float)
tf.jit: False (bool)
tf.platform: gpu (str)
tf.precision: 16 (int)
tf.debug_nans: False (bool)
tf.logical_gpus: 0 (int)
tf.dist_dataset: False (bool)
tf.dist_policy: False (bool)
tf.tensorfloat: True (bool)
tf.placement: False (bool)
tf.growth: True (bool)
eval_dir: (str)
filter: .* (str)
tbtt: 0 (int)
train.steps: 100000000.0 (float)
train.expl_until: 0 (int)
train.log_every: 10000.0 (float)
train.eval_every: 30000.0 (float)
train.eval_eps: 1 (int)
train.eval_samples: 1 (int)
train.train_every: 16 (int)
train.train_steps: 1 (int)
train.train_fill: 10000.0 (float)
train.eval_fill: 10000.0 (float)
train.pretrain: 1 (int)
train.log_zeros: False (bool)
train.log_keys_video: [image] (strs)
train.log_keys_sum: ^$ (str)
train.log_keys_mean: ^$ (str)
train.log_keys_max: ^$ (str)
train.log_timings: True (bool)
train.sync_every: 180 (int)
task_behavior: Hierarchy (str)
expl_behavior: None (str)
batch_size: 16 (int)
transform_rewards: off (str)
expl_noise: 0.0 (float)
eval_noise: 0.0 (float)
eval_state_mean: False (bool)
priority: reward_loss (str)
priority_correct: 0.0 (float)
data_loader: tfdata (str)
grad_heads: [decoder, reward, cont] (strs)
rssm.units: 1024 (int)
rssm.deter: 1024 (int)
rssm.stoch: 32 (int)
rssm.classes: 32 (int)
rssm.act: elu (str)
rssm.norm: layer (str)
rssm.initial: learned2 (str)
rssm.unroll: True (bool)
encoder.mlp_keys: $^ (str)
encoder.cnn_keys: image (str)
encoder.act: elu (str)
encoder.norm: layer (str)
encoder.mlp_layers: 4 (int)
encoder.mlp_units: 512 (int)
encoder.cnn: simple (str)
encoder.cnn_depth: 64 (int)
encoder.cnn_kernels: [4, 4, 4, 4] (ints)
decoder.mlp_keys: $^ (str)
decoder.cnn_keys: image (str)
decoder.act: elu (str)
decoder.norm: layer (str)
decoder.mlp_layers: 4 (int)
decoder.mlp_units: 512 (int)
decoder.cnn: simple (str)
decoder.cnn_depth: 64 (int)
decoder.cnn_kernels: [5, 5, 6, 6] (ints)
decoder.image_dist: mse (str)
decoder.inputs: [deter, stoch] (strs)
reward_head.layers: 4 (int)
reward_head.units: 512 (int)
reward_head.act: elu (str)
reward_head.norm: layer (str)
reward_head.dist: symlog (str)
reward_head.outscale: 0.1 (float)
reward_head.inputs: [deter, stoch] (strs)
cont_head.layers: 4 (int)
cont_head.units: 512 (int)
cont_head.act: elu (str)
cont_head.norm: layer (str)
cont_head.dist: binary (str)
cont_head.outscale: 0.1 (float)
cont_head.inputs: [deter, stoch] (strs)
loss_scales.kl: 1.0 (float)
loss_scales.image: 1.0 (float)
loss_scales.reward: 1.0 (float)
loss_scales.cont: 1.0 (float)
model_opt.opt: adam (str)
model_opt.lr: 0.0001 (float)
model_opt.eps: 1e-06 (float)
model_opt.clip: 100.0 (float)
model_opt.wd: 0.01 (float)
model_opt.wd_pattern: kernel (str)
wmkl.impl: mult (str)
wmkl.scale: 0.1 (float)
wmkl.target: 3.5 (float)
wmkl.min: 1e-05 (float)
wmkl.max: 1.0 (float)
wmkl.vel: 0.1 (float)
wmkl_balance: 0.8 (float)
actor.layers: 4 (int)
actor.units: 512 (int)
actor.act: elu (str)
actor.norm: layer (str)
actor.minstd: 0.03 (float)
actor.maxstd: 1.0 (float)
actor.outscale: 0.1 (float)
actor.unimix: 0.01 (float)
actor.inputs: [deter, stoch] (strs)
critic.layers: 4 (int)
critic.units: 512 (int)
critic.act: elu (str)
critic.norm: layer (str)
critic.dist: symlog (str)
critic.outscale: 0.1 (float)
critic.inputs: [deter, stoch] (strs)
actor_opt.opt: adam (str)
actor_opt.lr: 0.0001 (float)
actor_opt.eps: 1e-06 (float)
actor_opt.clip: 100.0 (float)
actor_opt.wd: 0.01 (float)
actor_opt.wd_pattern: kernel (str)
critic_opt.opt: adam (str)
critic_opt.lr: 0.0001 (float)
critic_opt.eps: 1e-06 (float)
critic_opt.clip: 100.0 (float)
critic_opt.wd: 0.01 (float)
critic_opt.wd_pattern: kernel (str)
actor_dist_disc: onehot (str)
actor_dist_cont: normal (str)
episodic: True (bool)
discount: 0.99 (float)
imag_discount: 0.99 (float)
imag_horizon: 16 (int)
imag_unroll: True (bool)
critic_return: gve (str)
actor_return: gve (str)
return_lambda: 0.95 (float)
actor_grad_disc: reinforce (str)
actor_grad_cont: backprop (str)
slow_target: True (bool)
slow_target_update: 100 (int)
slow_target_fraction: 1.0 (float)
actent.impl: mult (str)
actent.scale: 0.003 (float)
actent.target: 0.5 (float)
actent.min: 1e-05 (float)
actent.max: 100.0 (float)
actent.vel: 0.1 (float)
actent_norm: True (bool)
actent_perdim: True (bool)
advnorm.impl: mean_std (str)
advnorm.decay: 0.99 (float)
advnorm.max: 100000000.0 (float)
retnorm.impl: std (str)
retnorm.decay: 0.999 (float)
retnorm.max: 100.0 (float)
scorenorm.impl: off (str)
scorenorm.decay: 0.99 (float)
scorenorm.max: 100000000.0 (float)
adv_slow_critic: True (bool)
pengs_qlambda: False (bool)
critic_type: vfunction (str)
rewnorm_discount: False (bool)
env_skill_duration: 8 (int)
train_skill_duration: 8 (int)
skill_shape: [8, 8] (ints)
manager_rews.extr: 1.0 (float)
manager_rews.expl: 0.1 (float)
manager_rews.goal: 0.0 (float)
worker_rews.extr: 0.0 (float)
worker_rews.expl: 0.0 (float)
worker_rews.goal: 1.0 (float)
worker_inputs: [deter, stoch, goal] (strs)
worker_report_horizon: 64 (int)
skill_proposal: manager (str)
goal_proposal: replay (str)
goal_reward: cosine_max (str)
goal_encoder.layers: 4 (int)
goal_encoder.units: 512 (int)
goal_encoder.act: elu (str)
goal_encoder.norm: layer (str)
goal_encoder.dist: onehot (str)
goal_encoder.outscale: 0.1 (float)
goal_encoder.unimix: 0.0 (float)
goal_encoder.inputs: [goal] (strs)
goal_decoder.layers: 4 (int)
goal_decoder.units: 512 (int)
goal_decoder.act: elu (str)
goal_decoder.norm: layer (str)
goal_decoder.dist: mse (str)
goal_decoder.outscale: 0.1 (float)
goal_decoder.inputs: [skill] (strs)
worker_goals: [manager] (strs)
jointly: new (str)
vae_imag: False (bool)
vae_replay: True (bool)
vae_span: False (bool)
encdec_kl.impl: mult (str)
encdec_kl.scale: 0.0 (float)
encdec_kl.target: 10.0 (float)
encdec_kl.min: 1e-05 (float)
encdec_kl.max: 1.0 (float)
encdec_opt.opt: adam (str)
encdec_opt.lr: 0.0001 (float)
encdec_opt.eps: 1e-06 (float)
encdec_opt.clip: 100.0 (float)
encdec_opt.wd: 0.01 (float)
encdec_opt.wd_pattern: kernel (str)
explorer: False (bool)
explorer_repeat: False (bool)
expl_rew: adver (str)
manager_dist: onehot (str)
manager_grad: reinforce (str)
manager_actent: 0.5 (float)
adver_impl: squared (str)
manager_delta: False (bool)
goal_kl: True (bool)
expl_rewards.extr: 0.0 (float)
expl_rewards.disag: 0.0 (float)
expl_rewards.vae: 0.0 (float)
expl_rewards.ctrl: 0.0 (float)
expl_rewards.pbe: 0.0 (float)
expl_discount: 0.99 (float)
expl_retnorm.impl: std (str)
expl_retnorm.decay: 0.999 (float)
expl_retnorm.max: 100000000.0 (float)
expl_scorenorm.impl: off (str)
expl_scorenorm.decay: 0.999 (float)
expl_scorenorm.max: 100000000.0 (float)
disag_head.layers: 4 (int)
disag_head.units: 512 (int)
disag_head.act: elu (str)
disag_head.norm: layer (str)
disag_head.dist: mse (str)
disag_head.inputs: [deter, stoch, action] (strs)
expl_opt.opt: adam (str)
expl_opt.lr: 0.0001 (float)
expl_opt.eps: 1e-06 (float)
expl_opt.clip: 100.0 (float)
expl_opt.wd: 0.01 (float)
disag_target: [stoch] (strs)
disag_models: 8 (int)
ctrl_embed.layers: 3 (int)
ctrl_embed.units: 512 (int)
ctrl_embed.act: elu (str)
ctrl_embed.norm: layer (str)
ctrl_embed.dist: mse (str)
ctrl_embed.inputs: [deter, stoch] (strs)
ctrl_head.layers: 1 (int)
ctrl_head.units: 128 (int)
ctrl_head.act: elu (str)
ctrl_head.norm: layer (str)
ctrl_head.dist: mse (str)
ctrl_head.inputs: [current, next] (strs)
ctrl_size: 32 (int)
ctrl_opt.opt: adam (str)
ctrl_opt.lr: 0.0001 (float)
ctrl_opt.eps: 1e-06 (float)
ctrl_opt.clip: 100.0 (float)
ctrl_opt.wd: 0.01 (float)
expl_enc.layers: 4 (int)
expl_enc.units: 512 (int)
expl_enc.act: elu (str)
expl_enc.norm: layer (str)
expl_enc.dist: onehot (str)
expl_enc.outscale: 0.1 (float)
expl_enc.inputs: [deter] (strs)
expl_enc.shape: [8, 8] (ints)
expl_dec.layers: 4 (int)
expl_dec.units: 512 (int)
expl_dec.act: elu (str)
expl_dec.norm: layer (str)
expl_dec.dist: mse (str)
expl_dec.outscale: 0.1 (float)
expl_kl.impl: mult (str)
expl_kl.scale: 0.1 (float)
expl_kl.target: 10.0 (float)
expl_kl.min: 0.01 (float)
expl_kl.max: 1.0 (float)
expl_kl.vel: 0.1 (float)
expl_vae_elbo: False (bool)
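The settings most relevant to GPU memory in this config are tf.precision: 16, batch_size: 16, replay_chunk: 64, and imag_horizon: 16. A training batch gets flattened to batch_size * replay_chunk = 1024 sequence elements, which is exactly the leading dimension of the tensor that fails to allocate in the OOM at the end of this log. A back-of-the-envelope check (a sketch, not part of the run):

    # Size of one decoder activation with the failing shape [1024, 13, 13, 128] in float16.
    batch_size, replay_chunk = 16, 64
    flat = batch_size * replay_chunk               # 1024 flattened sequence elements
    nbytes = flat * 13 * 13 * 128 * 2              # 2 bytes per float16 element
    print(flat, round(nbytes / 2**20, 1), 'MiB')   # 1024, ~42.2 MiB per such tensor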
Encoder CNN shapes: {'image': (64, 64, 3)}
Encoder MLP shapes: {}
Decoder CNN shapes: {'image': (64, 64, 3)}
Decoder MLP shapes: {}
Synced last 0/0 trajectories.
Synced last 0/0 trajectories.
Synced last 0/0 trajectories.
Synced last 0/0 trajectories.
Logdir /vol/bitbucket/jk3417/explainable-mbhrl/logdir/20230406-180624
Fill eval dataset (10000.0 steps).
Saved episode: 20230406T170729-92735c1c31f4439485ed7245847340ae-len1001-rew29.npz
Saved episode: 20230406T170729-e8aca7f8b0c64967aa5e63fcb2869c26-len1001-rew30.npz
Saved episode: 20230406T170729-d4bf7726d9df4e08b6d9895b59ccfeb4-len1001-rew35.npz
Saved episode: 20230406T170729-af46c9a976f64f558e8ea2ceb76eaf3e-len1001-rew29.npz
Saved episode: 20230406T170734-1014cf6331f044af88f4676138e825ea-len1001-rew29.npz
Saved episode: 20230406T170734-85a5360c961f437f8513de73b75f68d4-len1001-rew31.npz
Saved episode: 20230406T170735-dc3b2c1de0684cc0b6ebddace3c49752-len1001-rew33.npz
Saved episode: 20230406T170735-c466903916da43f4ae81af9f93a2cfdc-len1001-rew30.npz
Fill train dataset (10000.0 steps).
Episode has 1000 steps and return 31.2.
────────────────────────────────── Step 4004 ───────────────────────────────────
episode/length 1000 / episode/score 31.21 / episode/reward_rate 0.02 /
replay/replay_steps 4004 / replay/replay_trajs 4
Episode has 1000 steps and return 27.4.
────────────────────────────────── Step 4004 ───────────────────────────────────
episode/length 1000 / episode/score 27.44 / episode/reward_rate 0.01 /
replay/replay_steps 4004 / replay/replay_trajs 4
Saved episode: 20230406T170743-34382ac90ec94bbf8bdb645c1440cc09-len1001-rew31.npz
Saved episode: 20230406T170743-959f2bb241124cea8c8f4d1af9826c62-len1001-rew27.npz
Saved episode: 20230406T170743-f4525518e8e04f54934c0514f8a6bdc7-len1001-rew28.npz
Saved episode: 20230406T170743-c8b4070fd3ee4511a1b396ddaf83017d-len1001-rew31.npz
Episode has 1000 steps and return 28.7.
────────────────────────────────── Step 4004 ───────────────────────────────────
episode/length 1000 / episode/score 28.67 / episode/reward_rate 0.01 /
replay/replay_steps 4004 / replay/replay_trajs 4
Episode has 1000 steps and return 31.6.
────────────────────────────────── Step 4004 ───────────────────────────────────
episode/length 1000 / episode/score 31.56 / episode/reward_rate 0.03 /
replay/replay_steps 4004 / replay/replay_trajs 4
Episode has 1000 steps and return 29.6.
────────────────────────────────── Step 8008 ───────────────────────────────────
episode/length 1000 / episode/score 29.61 / episode/reward_rate 0.01 /
replay/replay_steps 8008 / replay/replay_trajs 8
Episode has 1000 steps and return 33.0.
────────────────────────────────── Step 8008 ───────────────────────────────────
episode/length 1000 / episode/score 32.99 / episode/reward_rate 0.04 /
replay/replay_steps 8008 / replay/replay_trajs 8
Episode has 1000 steps and return 30.5.
────────────────────────────────── Step 8008 ───────────────────────────────────
episode/length 1000 / episode/score 30.45 / episode/reward_rate 0.01 /
replay/replay_steps 8008 / replay/replay_trajs 8
Episode has 1000 steps and return 34.8.
────────────────────────────────── Step 8008 ───────────────────────────────────
episode/length 1000 / episode/score 34.85 / episode/reward_rate 0.05 /
replay/replay_steps 8008 / replay/replay_trajs 8
/vol/bitbucket/jk3417/xmbhrl/lib/python3.10/site-packages/tensorflow/python/data/ops/structured_function.py:256: UserWarning: Even though the `tf.config.experimental_run_functions_eagerly` option is set, this option does not apply to tf.data functions. To force eager execution of tf.data functions, please use `tf.data.experimental.enable_debug_mode()`.
  warnings.warn(
Saved episode: 20230406T170751-28dd26e28f5f4c99b8c65e4f6cc2866b-len1001-rew29.npz
Saved episode: 20230406T170751-fff7d72bdaed417d82da120502a40cec-len1001-rew32.npz
Saved episode: 20230406T170751-c3a72ecb21d04962b053e6b5e45f5e29-len1001-rew30.npz
Saved episode: 20230406T170751-4b07fe4103f0450897ff931a4e3c29e4-len1001-rew34.npz
Found 34318853 model parameters.
Found 2696256 goal parameters.
Optimizer applied weight decay to goal variables:
[x] dense_16/kernel:0
[x] dense_17/kernel:0
[x] dense_18/kernel:0
[x] dense_19/kernel:0
[x] dense_20/kernel:0
[x] dense_21/kernel:0
[x] dense_22/kernel:0
[x] dense_23/kernel:0
[x] dense_24/kernel:0
[x] dense_25/kernel:0
[ ] dense_20/bias:0
[ ] dense_25/bias:0
[ ] norm_22/offset:0
[ ] norm_22/scale:0
[ ] norm_23/offset:0
[ ] norm_23/scale:0
[ ] norm_24/offset:0
[ ] norm_24/scale:0
[ ] norm_25/offset:0
[ ] norm_25/scale:0
[ ] norm_26/offset:0
[ ] norm_26/scale:0
[ ] norm_27/offset:0
[ ] norm_27/scale:0
[ ] norm_28/offset:0
[ ] norm_28/scale:0
[ ] norm_29/offset:0
[ ] norm_29/scale:0
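The [x]/[ ] markers above reflect model_opt.wd_pattern: kernel from the config: weight decay is applied only to variables whose names match the pattern, skipping biases and norm parameters. A minimal sketch of that selection (assumed logic mirroring the printed list; the repo's actual optimizer code may differ):

    import re

    pattern = re.compile('kernel')  # the wd_pattern value printed in the config
    names = ['dense_16/kernel:0', 'dense_20/bias:0', 'norm_22/offset:0']
    for name in names:
        # Matches get weight decay ([x]); biases and norm params do not ([ ]).
        print('[x]' if pattern.search(name) else '[ ]', name)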
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/train.py:124 in <module> │
│ │
│ 121 │
│ 122 │
│ 123 if __name__ == '__main__': │
│ ❱ 124 main() │
│ 125 │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/train.py:103 in main │
│ │
│ 100 │ │ assert config.train.eval_fill │
│ 101 │ │ eval_replay = make_replay('eval_episodes', config.replay_size │
│ 102 │ replay = make_replay('episodes', config.replay_size) │
│ ❱ 103 │ train_with_viz.train_with_viz( │
│ 104 │ │ agent, env, replay, eval_replay, logger, args) │
│ 105 │ elif config.run == 'learning': │
│ 106 │ assert config.replay.sync │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/train_with_viz.py:85 in train_with_viz │
│ │
│ 82 state = [None] # To be writable from train step function below. │
│ 83 assert args.pretrain > 0 # At least one step to initialize variable │
│ 84 for _ in range(args.pretrain): │
│ ❱ 85 │ _, state[0], _ = agent.train(next(dataset_train), state[0]) │
│ 86 │
│ 87 metrics = collections.defaultdict(list) │
│ 88 batch = [None] │
│ │
│ /usr/lib/python3.10/contextlib.py:79 in inner │
│ │
│ 76 │ │ @wraps(func) │
│ 77 │ │ def inner(*args, **kwds): │
│ 78 │ │ │ with self._recreate_cm(): │
│ ❱ 79 │ │ │ │ return func(*args, **kwds) │
│ 80 │ │ return inner │
│ 81 │
│ 82 │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/tfagent.py:57 in train │
│ │
│ 54 │ if key not in self._cached_fns: │
│ 55 │ │ self._cached_fns[key] = fn.get_concrete_function(data, state) │
│ 56 │ fn = self._cached_fns[key] │
│ ❱ 57 │ outs, state, metrics = self._strategy_run(fn, data, state) │
│ 58 │ outs = self._convert_outs(outs) │
│ 59 │ metrics = self._convert_mets(metrics) │
│ 60 │ return outs, state, metrics │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/tfagent.py:86 in _strategy_run │
│ │
│ 83 │ if self.strategy: │
│ 84 │ return self.strategy.run(fn, args, kwargs) │
│ 85 │ else: │
│ ❱ 86 │ return fn(*args, **kwargs) │
│ 87 │
│ 88 def _convert_inps(self, value): │
│ 89 │ if not self.strategy: │
│ │
│ /vol/bitbucket/jk3417/xmbhrl/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153 in error_handler │
│ │
│ 150 │ return fn(*args, **kwargs) │
│ 151 │ except Exception as e: │
│ 152 │ filtered_tb = _process_traceback_frames(e.__traceback__) │
│ ❱ 153 │ raise e.with_traceback(filtered_tb) from None │
│ 154 │ finally: │
│ 155 │ del filtered_tb │
│ 156 │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/agent.py:78 in train │
│ │
│ 75 │ context = {**data, **wm_outs['post']} │
│ 76 │ start = tf.nest.map_structure( │
│ 77 │ │ lambda x: x.reshape([-1] + list(x.shape[2:])), context) │
│ ❱ 78 │ _, mets = self.task_behavior.train(self.wm.imagine, start, context │
│ 79 │ metrics.update(mets) │
│ 80 │ if self.config.expl_behavior != 'None': │
│ 81 │ _, mets = self.expl_behavior.train(self.wm.imagine, start, conte │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/hierarchy.py:118 in train │
│ │
│ 115 │ │ goal = self.feat(traj)[-1] │
│ 116 │ │ metrics.update(self.train_worker(imagine, start, goal)[1]) │
│ 117 │ if self.config.jointly == 'new': │
│ ❱ 118 │ traj, mets = self.train_jointly(imagine, start) │
│ 119 │ metrics.update(mets) │
│ 120 │ metrics['success_manager'] = success(traj['reward_goal']) │
│ 121 │ if self.config.vae_imag: │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/hierarchy.py:149 in train_jointly │
│ │
│ 146 │ metrics = {} │
│ 147 │ with tf.GradientTape(persistent=True) as tape: │
│ 148 │ policy = functools.partial(self.policy, imag=True) │
│ ❱ 149 │ traj = self.wm.imagine_carry( │
│ 150 │ │ policy, start, self.config.imag_horizon, │
│ 151 │ │ self.initial(len(start['is_first']))) │
│ 152 │ traj['reward_extr'] = self.extr_reward(traj) │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/agent.py:241 in imagine_carry │
│ │
│ 238 │ carries = [carry] │
│ 239 │ for _ in range(horizon): │
│ 240 │ states.append(self.rssm.img_step(states[-1], actions[-1])) │
│ ❱ 241 │ outs, carry = policy(states[-1], carry) │
│ 242 │ action = outs['action'] │
│ 243 │ if hasattr(action, 'sample'): │
│ 244 │ │ action = action.sample() │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/hierarchy.py:97 in policy │
│ │
│ 94 │ dist = self.worker.actor(sg({**latent, 'goal': goal, 'delta': delt │
│ 95 │ outs = {'action': dist} │
│ 96 │ if 'image' in self.wm.heads['decoder'].shapes: │
│ ❱ 97 │ outs['log_goal'] = self.wm.heads['decoder']({ │
│ 98 │ │ 'deter': goal, 'stoch': self.wm.rssm.get_stoch(goal), │
│ 99 │ })['image'].mode() │
│ 100 │ carry = {'step': carry['step'] + 1, 'skill': skill, 'goal': goal} │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/nets.py:255 in __call__ │
│ │
│ 252 │ dists = {} │
│ 253 │ if self.cnn_shapes: │
│ 254 │ flat = features.reshape([-1, features.shape[-1]]) │
│ ❱ 255 │ output = self._cnn(flat) │
│ 256 │ output = output.reshape(features.shape[:-1] + output.shape[1:]) │
│ 257 │ means = tf.split(output, [v[-1] for v in self.cnn_shapes.values( │
│ 258 │ dists.update({ │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/nets.py:305 in __call__ │
│ │
│ 302 │ x = tf.reshape(x, [-1, 1, 1, x.shape[-1]]) │
│ 303 │ depth = self._depth * 2 ** (len(self._kernels) - 2) │
│ 304 │ for i, kernel in enumerate(self._kernels[:-1]): │
│ ❱ 305 │ x = self.get(f'conv{i}', ConvT, depth, kernel, **self._kw)(x) │
│ 306 │ depth //= 2 │
│ 307 │ x = self.get('out', ConvT, self._shape[-1], self._kernels[-1])(x) │
│ 308 │ x = tf.math.sigmoid(x) │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/nets.py:445 in __call__ │
│ │
│ 442 │ self._norm = Norm(norm) │
│ 443 │
│ 444 def __call__(self, hidden): │
│ ❱ 445 │ hidden = self._layer(hidden) │
│ 446 │ hidden = self._norm(hidden) │
│ 447 │ hidden = self._act(hidden) │
│ 448 │ return hidden │
│ │
│ /vol/bitbucket/jk3417/xmbhrl/lib/python3.10/site-packages/keras/utils/traceback_utils.py:70 in error_handler │
│ │
│ 67 │ │ │ filtered_tb = _process_traceback_frames(e.__traceback__) │
│ 68 │ │ │ # To get the full stack trace, call: │
│ 69 │ │ │ # `tf.debugging.disable_traceback_filtering()` │
│ ❱ 70 │ │ │ raise e.with_traceback(filtered_tb) from None │
│ 71 │ │ finally: │
│ 72 │ │ │ del filtered_tb │
│ 73 │
│ │
│ /vol/bitbucket/jk3417/xmbhrl/lib/python3.10/site-packages/keras/backend.py:6122 in conv2d_transpose │
│ │
│ 6119 │ │ strides = (1, 1) + strides │
│ 6120 │ │
│ 6121 │ if dilation_rate == (1, 1): │
│ ❱ 6122 │ │ x = tf.compat.v1.nn.conv2d_transpose( │
│ 6123 │ │ │ x, │
│ 6124 │ │ │ kernel, │
│ 6125 │ │ │ output_shape, │
╰──────────────────────────────────────────────────────────────────────────────╯
ResourceExhaustedError: Exception encountered when calling layer 'conv2d_transpose_1' (type Conv2DTranspose).

{{function_node __wrapped__Conv2DBackpropInput_device_/job:localhost/replica:0/task:0/device:GPU:0}} OOM when allocating tensor with shape[1024,13,13,128] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Conv2DBackpropInput]

Call arguments received by layer 'conv2d_transpose_1' (type Conv2DTranspose):
  • inputs=tf.Tensor(shape=(1024, 5, 5, 256), dtype=float16)

srun: error: cloud-vm-42-53: task 0: Exited with exit code 1
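Reading the trace: the OOM is raised while the world-model decoder reconstructs the goal image inside imagine_carry (hierarchy.py:97 calls self.wm.heads['decoder'] at every imagination step), all under the persistent GradientTape opened in train_jointly (hierarchy.py:147). Each of the imag_horizon: 16 steps therefore keeps its ConvT activations alive for the backward pass, and with 1024 flattened sequence elements per step the 15360 MiB T4 runs out. A hedged sketch of the usual mitigations (illustrative only, not a confirmed fix; the names referenced are the config keys printed above):

    import tensorflow as tf

    # Option 1 (done at launch time, not in this snippet): shrink the flattened
    # batch that reaches the decoder, e.g. lower batch_size (16 -> 8) or
    # replay_chunk (64 -> 32), or shorten imag_horizon.

    # Option 2: allocate GPU memory incrementally instead of grabbing it all up
    # front. The config already sets tf.growth: True, which corresponds to:
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)
    # Note: growth only delays allocation; it cannot help if the model genuinely
    # needs more than the T4's ~15 GiB at this batch size.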