@jdubkim
Created April 6, 2023 17:29
Director Error on Single GPU
Thu Apr 6 18:06:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:09.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
18:06:24 up 10 days, 3:17, 3 users, load average: 0.06, 0.05, 0.04
Using config: dmc_vision.
Running task: dmc_walker_walk.
2023-04-06 18:06:26.259836: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-06 18:06:34.003624: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /vol/cuda/11.4.120-cudnn8.2.4/lib64:/vol/cuda/11.4.120-cudnn8.2.4/lib:/usr/lib/x86_64-linux-gnu
2023-04-06 18:06:34.012961: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /vol/cuda/11.4.120-cudnn8.2.4/lib64:/vol/cuda/11.4.120-cudnn8.2.4/lib:/usr/lib/x86_64-linux-gnu
2023-04-06 18:06:34.012991: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Config:
logdir: /vol/bitbucket/jk3417/explainable-mbhrl/logdir/20230406-180624 (str)
run: train_with_viz (str)
seed: 0 (int)
task: dmc_walker_walk (str)
env.amount: 4 (int)
env.parallel: process (str)
env.daemon: False (bool)
env.repeat: 1 (int)
env.size: [64, 64] (ints)
env.camera: -1 (int)
env.gray: False (bool)
env.length: 0 (int)
env.discretize: 0 (int)
env.lives: False (bool)
env.sticky: True (bool)
env.episodic: True (bool)
env.restart: True (bool)
env.again: False (bool)
env.termination: False (bool)
env.weaker: 1.0 (float)
env.seed: 0 (int)
replay: fixed (str)
replay_size: 1000000.0 (float)
replay_chunk: 64 (int)
replay_fixed.prio_starts: 0.0 (float)
replay_fixed.prio_ends: 1.0 (float)
replay_fixed.sync: 0 (int)
replay_consec.sync: 0 (int)
replay_prio.prio_starts: 0.0 (float)
replay_prio.prio_ends: 1.0 (float)
replay_prio.sync: 0 (int)
replay_prio.fraction: 0.1 (float)
replay_prio.softmax: False (bool)
replay_prio.temp: 1.0 (float)
replay_prio.constant: 0.0 (float)
replay_prio.exponent: 0.5 (float)
tf.jit: False (bool)
tf.platform: gpu (str)
tf.precision: 16 (int)
tf.debug_nans: False (bool)
tf.logical_gpus: 0 (int)
tf.dist_dataset: False (bool)
tf.dist_policy: False (bool)
tf.tensorfloat: True (bool)
tf.placement: False (bool)
tf.growth: True (bool)
eval_dir: (str)
filter: .* (str)
tbtt: 0 (int)
train.steps: 100000000.0 (float)
train.expl_until: 0 (int)
train.log_every: 10000.0 (float)
train.eval_every: 30000.0 (float)
train.eval_eps: 1 (int)
train.eval_samples: 1 (int)
train.train_every: 16 (int)
train.train_steps: 1 (int)
train.train_fill: 10000.0 (float)
train.eval_fill: 10000.0 (float)
train.pretrain: 1 (int)
train.log_zeros: False (bool)
train.log_keys_video: [image] (strs)
train.log_keys_sum: ^$ (str)
train.log_keys_mean: ^$ (str)
train.log_keys_max: ^$ (str)
train.log_timings: True (bool)
train.sync_every: 180 (int)
task_behavior: Hierarchy (str)
expl_behavior: None (str)
batch_size: 16 (int)
transform_rewards: off (str)
expl_noise: 0.0 (float)
eval_noise: 0.0 (float)
eval_state_mean: False (bool)
priority: reward_loss (str)
priority_correct: 0.0 (float)
data_loader: tfdata (str)
grad_heads: [decoder, reward, cont] (strs)
rssm.units: 1024 (int)
rssm.deter: 1024 (int)
rssm.stoch: 32 (int)
rssm.classes: 32 (int)
rssm.act: elu (str)
rssm.norm: layer (str)
rssm.initial: learned2 (str)
rssm.unroll: True (bool)
encoder.mlp_keys: $^ (str)
encoder.cnn_keys: image (str)
encoder.act: elu (str)
encoder.norm: layer (str)
encoder.mlp_layers: 4 (int)
encoder.mlp_units: 512 (int)
encoder.cnn: simple (str)
encoder.cnn_depth: 64 (int)
encoder.cnn_kernels: [4, 4, 4, 4] (ints)
decoder.mlp_keys: $^ (str)
decoder.cnn_keys: image (str)
decoder.act: elu (str)
decoder.norm: layer (str)
decoder.mlp_layers: 4 (int)
decoder.mlp_units: 512 (int)
decoder.cnn: simple (str)
decoder.cnn_depth: 64 (int)
decoder.cnn_kernels: [5, 5, 6, 6] (ints)
decoder.image_dist: mse (str)
decoder.inputs: [deter, stoch] (strs)
reward_head.layers: 4 (int)
reward_head.units: 512 (int)
reward_head.act: elu (str)
reward_head.norm: layer (str)
reward_head.dist: symlog (str)
reward_head.outscale: 0.1 (float)
reward_head.inputs: [deter, stoch] (strs)
cont_head.layers: 4 (int)
cont_head.units: 512 (int)
cont_head.act: elu (str)
cont_head.norm: layer (str)
cont_head.dist: binary (str)
cont_head.outscale: 0.1 (float)
cont_head.inputs: [deter, stoch] (strs)
loss_scales.kl: 1.0 (float)
loss_scales.image: 1.0 (float)
loss_scales.reward: 1.0 (float)
loss_scales.cont: 1.0 (float)
model_opt.opt: adam (str)
model_opt.lr: 0.0001 (float)
model_opt.eps: 1e-06 (float)
model_opt.clip: 100.0 (float)
model_opt.wd: 0.01 (float)
model_opt.wd_pattern: kernel (str)
wmkl.impl: mult (str)
wmkl.scale: 0.1 (float)
wmkl.target: 3.5 (float)
wmkl.min: 1e-05 (float)
wmkl.max: 1.0 (float)
wmkl.vel: 0.1 (float)
wmkl_balance: 0.8 (float)
actor.layers: 4 (int)
actor.units: 512 (int)
actor.act: elu (str)
actor.norm: layer (str)
actor.minstd: 0.03 (float)
actor.maxstd: 1.0 (float)
actor.outscale: 0.1 (float)
actor.unimix: 0.01 (float)
actor.inputs: [deter, stoch] (strs)
critic.layers: 4 (int)
critic.units: 512 (int)
critic.act: elu (str)
critic.norm: layer (str)
critic.dist: symlog (str)
critic.outscale: 0.1 (float)
critic.inputs: [deter, stoch] (strs)
actor_opt.opt: adam (str)
actor_opt.lr: 0.0001 (float)
actor_opt.eps: 1e-06 (float)
actor_opt.clip: 100.0 (float)
actor_opt.wd: 0.01 (float)
actor_opt.wd_pattern: kernel (str)
critic_opt.opt: adam (str)
critic_opt.lr: 0.0001 (float)
critic_opt.eps: 1e-06 (float)
critic_opt.clip: 100.0 (float)
critic_opt.wd: 0.01 (float)
critic_opt.wd_pattern: kernel (str)
actor_dist_disc: onehot (str)
actor_dist_cont: normal (str)
episodic: True (bool)
discount: 0.99 (float)
imag_discount: 0.99 (float)
imag_horizon: 16 (int)
imag_unroll: True (bool)
critic_return: gve (str)
actor_return: gve (str)
return_lambda: 0.95 (float)
actor_grad_disc: reinforce (str)
actor_grad_cont: backprop (str)
slow_target: True (bool)
slow_target_update: 100 (int)
slow_target_fraction: 1.0 (float)
actent.impl: mult (str)
actent.scale: 0.003 (float)
actent.target: 0.5 (float)
actent.min: 1e-05 (float)
actent.max: 100.0 (float)
actent.vel: 0.1 (float)
actent_norm: True (bool)
actent_perdim: True (bool)
advnorm.impl: mean_std (str)
advnorm.decay: 0.99 (float)
advnorm.max: 100000000.0 (float)
retnorm.impl: std (str)
retnorm.decay: 0.999 (float)
retnorm.max: 100.0 (float)
scorenorm.impl: off (str)
scorenorm.decay: 0.99 (float)
scorenorm.max: 100000000.0 (float)
adv_slow_critic: True (bool)
pengs_qlambda: False (bool)
critic_type: vfunction (str)
rewnorm_discount: False (bool)
env_skill_duration: 8 (int)
train_skill_duration: 8 (int)
skill_shape: [8, 8] (ints)
manager_rews.extr: 1.0 (float)
manager_rews.expl: 0.1 (float)
manager_rews.goal: 0.0 (float)
worker_rews.extr: 0.0 (float)
worker_rews.expl: 0.0 (float)
worker_rews.goal: 1.0 (float)
worker_inputs: [deter, stoch, goal] (strs)
worker_report_horizon: 64 (int)
skill_proposal: manager (str)
goal_proposal: replay (str)
goal_reward: cosine_max (str)
goal_encoder.layers: 4 (int)
goal_encoder.units: 512 (int)
goal_encoder.act: elu (str)
goal_encoder.norm: layer (str)
goal_encoder.dist: onehot (str)
goal_encoder.outscale: 0.1 (float)
goal_encoder.unimix: 0.0 (float)
goal_encoder.inputs: [goal] (strs)
goal_decoder.layers: 4 (int)
goal_decoder.units: 512 (int)
goal_decoder.act: elu (str)
goal_decoder.norm: layer (str)
goal_decoder.dist: mse (str)
goal_decoder.outscale: 0.1 (float)
goal_decoder.inputs: [skill] (strs)
worker_goals: [manager] (strs)
jointly: new (str)
vae_imag: False (bool)
vae_replay: True (bool)
vae_span: False (bool)
encdec_kl.impl: mult (str)
encdec_kl.scale: 0.0 (float)
encdec_kl.target: 10.0 (float)
encdec_kl.min: 1e-05 (float)
encdec_kl.max: 1.0 (float)
encdec_opt.opt: adam (str)
encdec_opt.lr: 0.0001 (float)
encdec_opt.eps: 1e-06 (float)
encdec_opt.clip: 100.0 (float)
encdec_opt.wd: 0.01 (float)
encdec_opt.wd_pattern: kernel (str)
explorer: False (bool)
explorer_repeat: False (bool)
expl_rew: adver (str)
manager_dist: onehot (str)
manager_grad: reinforce (str)
manager_actent: 0.5 (float)
adver_impl: squared (str)
manager_delta: False (bool)
goal_kl: True (bool)
expl_rewards.extr: 0.0 (float)
expl_rewards.disag: 0.0 (float)
expl_rewards.vae: 0.0 (float)
expl_rewards.ctrl: 0.0 (float)
expl_rewards.pbe: 0.0 (float)
expl_discount: 0.99 (float)
expl_retnorm.impl: std (str)
expl_retnorm.decay: 0.999 (float)
expl_retnorm.max: 100000000.0 (float)
expl_scorenorm.impl: off (str)
expl_scorenorm.decay: 0.999 (float)
expl_scorenorm.max: 100000000.0 (float)
disag_head.layers: 4 (int)
disag_head.units: 512 (int)
disag_head.act: elu (str)
disag_head.norm: layer (str)
disag_head.dist: mse (str)
disag_head.inputs: [deter, stoch, action] (strs)
expl_opt.opt: adam (str)
expl_opt.lr: 0.0001 (float)
expl_opt.eps: 1e-06 (float)
expl_opt.clip: 100.0 (float)
expl_opt.wd: 0.01 (float)
disag_target: [stoch] (strs)
disag_models: 8 (int)
ctrl_embed.layers: 3 (int)
ctrl_embed.units: 512 (int)
ctrl_embed.act: elu (str)
ctrl_embed.norm: layer (str)
ctrl_embed.dist: mse (str)
ctrl_embed.inputs: [deter, stoch] (strs)
ctrl_head.layers: 1 (int)
ctrl_head.units: 128 (int)
ctrl_head.act: elu (str)
ctrl_head.norm: layer (str)
ctrl_head.dist: mse (str)
ctrl_head.inputs: [current, next] (strs)
ctrl_size: 32 (int)
ctrl_opt.opt: adam (str)
ctrl_opt.lr: 0.0001 (float)
ctrl_opt.eps: 1e-06 (float)
ctrl_opt.clip: 100.0 (float)
ctrl_opt.wd: 0.01 (float)
expl_enc.layers: 4 (int)
expl_enc.units: 512 (int)
expl_enc.act: elu (str)
expl_enc.norm: layer (str)
expl_enc.dist: onehot (str)
expl_enc.outscale: 0.1 (float)
expl_enc.inputs: [deter] (strs)
expl_enc.shape: [8, 8] (ints)
expl_dec.layers: 4 (int)
expl_dec.units: 512 (int)
expl_dec.act: elu (str)
expl_dec.norm: layer (str)
expl_dec.dist: mse (str)
expl_dec.outscale: 0.1 (float)
expl_kl.impl: mult (str)
expl_kl.scale: 0.1 (float)
expl_kl.target: 10.0 (float)
expl_kl.min: 0.01 (float)
expl_kl.max: 1.0 (float)
expl_kl.vel: 0.1 (float)
expl_vae_elbo: False (bool)
Encoder CNN shapes: {'image': (64, 64, 3)}
Encoder MLP shapes: {}
Decoder CNN shapes: {'image': (64, 64, 3)}
Decoder MLP shapes: {}
Synced last 0/0 trajectories.
Synced last 0/0 trajectories.
Synced last 0/0 trajectories.
Synced last 0/0 trajectories.
Logdir /vol/bitbucket/jk3417/explainable-mbhrl/logdir/20230406-180624
Fill eval dataset (10000.0 steps).
Saved episode: 20230406T170729-92735c1c31f4439485ed7245847340ae-len1001-rew29.npz
Saved episode: 20230406T170729-e8aca7f8b0c64967aa5e63fcb2869c26-len1001-rew30.npz
Saved episode: 20230406T170729-d4bf7726d9df4e08b6d9895b59ccfeb4-len1001-rew35.npz
Saved episode: 20230406T170729-af46c9a976f64f558e8ea2ceb76eaf3e-len1001-rew29.npz
Saved episode: 20230406T170734-1014cf6331f044af88f4676138e825ea-len1001-rew29.npz
Saved episode: 20230406T170734-85a5360c961f437f8513de73b75f68d4-len1001-rew31.npz
Saved episode: 20230406T170735-dc3b2c1de0684cc0b6ebddace3c49752-len1001-rew33.npz
Saved episode: 20230406T170735-c466903916da43f4ae81af9f93a2cfdc-len1001-rew30.npz
Fill train dataset (10000.0 steps).
Episode has 1000 steps and return 31.2.
────────────────────────────────── Step 4004 ───────────────────────────────────
episode/length 1000 / episode/score 31.21 / episode/reward_rate 0.02 /
replay/replay_steps 4004 / replay/replay_trajs 4
Episode has 1000 steps and return 27.4.
────────────────────────────────── Step 4004 ───────────────────────────────────
episode/length 1000 / episode/score 27.44 / episode/reward_rate 0.01 /
replay/replay_steps 4004 / replay/replay_trajs 4
Saved episode: 20230406T170743-34382ac90ec94bbf8bdb645c1440cc09-len1001-rew31.npz
Saved episode: 20230406T170743-959f2bb241124cea8c8f4d1af9826c62-len1001-rew27.npz
Saved episode: 20230406T170743-f4525518e8e04f54934c0514f8a6bdc7-len1001-rew28.npz
Saved episode: 20230406T170743-c8b4070fd3ee4511a1b396ddaf83017d-len1001-rew31.npz
Episode has 1000 steps and return 28.7.
────────────────────────────────── Step 4004 ───────────────────────────────────
episode/length 1000 / episode/score 28.67 / episode/reward_rate 0.01 /
replay/replay_steps 4004 / replay/replay_trajs 4
Episode has 1000 steps and return 31.6.
────────────────────────────────── Step 4004 ───────────────────────────────────
episode/length 1000 / episode/score 31.56 / episode/reward_rate 0.03 /
replay/replay_steps 4004 / replay/replay_trajs 4
Episode has 1000 steps and return 29.6.
────────────────────────────────── Step 8008 ───────────────────────────────────
episode/length 1000 / episode/score 29.61 / episode/reward_rate 0.01 /
replay/replay_steps 8008 / replay/replay_trajs 8
Episode has 1000 steps and return 33.0.
────────────────────────────────── Step 8008 ───────────────────────────────────
episode/length 1000 / episode/score 32.99 / episode/reward_rate 0.04 /
replay/replay_steps 8008 / replay/replay_trajs 8
Episode has 1000 steps and return 30.5.
────────────────────────────────── Step 8008 ───────────────────────────────────
episode/length 1000 / episode/score 30.45 / episode/reward_rate 0.01 /
replay/replay_steps 8008 / replay/replay_trajs 8
Episode has 1000 steps and return 34.8.
────────────────────────────────── Step 8008 ───────────────────────────────────
episode/length 1000 / episode/score 34.85 / episode/reward_rate 0.05 /
replay/replay_steps 8008 / replay/replay_trajs 8
/vol/bitbucket/jk3417/xmbhrl/lib/python3.10/site-packages/tensorflow/python/data/ops/structured_function.py:256: UserWarning: Even though the `tf.config.experimental_run_functions_eagerly` option is set, this option does not apply to tf.data functions. To force eager execution of tf.data functions, please use `tf.data.experimental.enable_debug_mode()`.
warnings.warn(
Saved episode: 20230406T170751-28dd26e28f5f4c99b8c65e4f6cc2866b-len1001-rew29.npz
Saved episode: 20230406T170751-fff7d72bdaed417d82da120502a40cec-len1001-rew32.npz
Saved episode: 20230406T170751-c3a72ecb21d04962b053e6b5e45f5e29-len1001-rew30.npz
Saved episode: 20230406T170751-4b07fe4103f0450897ff931a4e3c29e4-len1001-rew34.npz
Found 34318853 model parameters.
Found 2696256 goal parameters.
Optimizer applied weight decay to goal variables:
[x] dense_16/kernel:0
[x] dense_17/kernel:0
[x] dense_18/kernel:0
[x] dense_19/kernel:0
[x] dense_20/kernel:0
[x] dense_21/kernel:0
[x] dense_22/kernel:0
[x] dense_23/kernel:0
[x] dense_24/kernel:0
[x] dense_25/kernel:0
[ ] dense_20/bias:0
[ ] dense_25/bias:0
[ ] norm_22/offset:0
[ ] norm_22/scale:0
[ ] norm_23/offset:0
[ ] norm_23/scale:0
[ ] norm_24/offset:0
[ ] norm_24/scale:0
[ ] norm_25/offset:0
[ ] norm_25/scale:0
[ ] norm_26/offset:0
[ ] norm_26/scale:0
[ ] norm_27/offset:0
[ ] norm_27/scale:0
[ ] norm_28/offset:0
[ ] norm_28/scale:0
[ ] norm_29/offset:0
[ ] norm_29/scale:0
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/train.py:12 │
│ 4 in <module> │
│ │
│ 121 │
│ 122 │
│ 123 if __name__ == '__main__': │
│ ❱ 124 main() │
│ 125 │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/train.py:10 │
│ 3 in main │
│ │
│ 100 │ │ assert config.train.eval_fill │
│ 101 │ │ eval_replay = make_replay('eval_episodes', config.replay_size │
│ 102 │ replay = make_replay('episodes', config.replay_size) │
│ ❱ 103 │ train_with_viz.train_with_viz( │
│ 104 │ │ agent, env, replay, eval_replay, logger, args) │
│ 105 │ elif config.run == 'learning': │
│ 106 │ assert config.replay.sync │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/train_with_ │
│ viz.py:85 in train_with_viz │
│ │
│ 82 state = [None] # To be writable from train step function below. │
│ 83 assert args.pretrain > 0 # At least one step to initialize variable │
│ 84 for _ in range(args.pretrain): │
│ ❱ 85 │ _, state[0], _ = agent.train(next(dataset_train), state[0]) │
│ 86 │
│ 87 metrics = collections.defaultdict(list) │
│ 88 batch = [None] │
│ │
│ /usr/lib/python3.10/contextlib.py:79 in inner │
│ │
│ 76 │ │ @wraps(func) │
│ 77 │ │ def inner(*args, **kwds): │
│ 78 │ │ │ with self._recreate_cm(): │
│ ❱ 79 │ │ │ │ return func(*args, **kwds) │
│ 80 │ │ return inner │
│ 81 │
│ 82 │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/tfagent.py: │
│ 57 in train │
│ │
│ 54 │ if key not in self._cached_fns: │
│ 55 │ │ self._cached_fns[key] = fn.get_concrete_function(data, state) │
│ 56 │ fn = self._cached_fns[key] │
│ ❱ 57 │ outs, state, metrics = self._strategy_run(fn, data, state) │
│ 58 │ outs = self._convert_outs(outs) │
│ 59 │ metrics = self._convert_mets(metrics) │
│ 60 │ return outs, state, metrics │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/tfagent.py: │
│ 86 in _strategy_run │
│ │
│ 83 │ if self.strategy: │
│ 84 │ return self.strategy.run(fn, args, kwargs) │
│ 85 │ else: │
│ ❱ 86 │ return fn(*args, **kwargs) │
│ 87 │
│ 88 def _convert_inps(self, value): │
│ 89 │ if not self.strategy: │
│ │
│ /vol/bitbucket/jk3417/xmbhrl/lib/python3.10/site-packages/tensorflow/python/ │
│ util/traceback_utils.py:153 in error_handler │
│ │
│ 150 │ return fn(*args, **kwargs) │
│ 151 │ except Exception as e: │
│ 152 │ filtered_tb = _process_traceback_frames(e.__traceback__) │
│ ❱ 153 │ raise e.with_traceback(filtered_tb) from None │
│ 154 │ finally: │
│ 155 │ del filtered_tb │
│ 156 │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/agent.py:78 │
│ in train │
│ │
│ 75 │ context = {**data, **wm_outs['post']} │
│ 76 │ start = tf.nest.map_structure( │
│ 77 │ │ lambda x: x.reshape([-1] + list(x.shape[2:])), context) │
│ ❱ 78 │ _, mets = self.task_behavior.train(self.wm.imagine, start, context │
│ 79 │ metrics.update(mets) │
│ 80 │ if self.config.expl_behavior != 'None': │
│ 81 │ _, mets = self.expl_behavior.train(self.wm.imagine, start, conte │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/hierarchy.p │
│ y:118 in train │
│ │
│ 115 │ │ goal = self.feat(traj)[-1] │
│ 116 │ │ metrics.update(self.train_worker(imagine, start, goal)[1]) │
│ 117 │ if self.config.jointly == 'new': │
│ ❱ 118 │ traj, mets = self.train_jointly(imagine, start) │
│ 119 │ metrics.update(mets) │
│ 120 │ metrics['success_manager'] = success(traj['reward_goal']) │
│ 121 │ if self.config.vae_imag: │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/hierarchy.p │
│ y:149 in train_jointly │
│ │
│ 146 │ metrics = {} │
│ 147 │ with tf.GradientTape(persistent=True) as tape: │
│ 148 │ policy = functools.partial(self.policy, imag=True) │
│ ❱ 149 │ traj = self.wm.imagine_carry( │
│ 150 │ │ policy, start, self.config.imag_horizon, │
│ 151 │ │ self.initial(len(start['is_first']))) │
│ 152 │ traj['reward_extr'] = self.extr_reward(traj) │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/agent.py:24 │
│ 1 in imagine_carry │
│ │
│ 238 │ carries = [carry] │
│ 239 │ for _ in range(horizon): │
│ 240 │ states.append(self.rssm.img_step(states[-1], actions[-1])) │
│ ❱ 241 │ outs, carry = policy(states[-1], carry) │
│ 242 │ action = outs['action'] │
│ 243 │ if hasattr(action, 'sample'): │
│ 244 │ │ action = action.sample() │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/hierarchy.p │
│ y:97 in policy │
│ │
│ 94 │ dist = self.worker.actor(sg({**latent, 'goal': goal, 'delta': delt │
│ 95 │ outs = {'action': dist} │
│ 96 │ if 'image' in self.wm.heads['decoder'].shapes: │
│ ❱ 97 │ outs['log_goal'] = self.wm.heads['decoder']({ │
│ 98 │ │ 'deter': goal, 'stoch': self.wm.rssm.get_stoch(goal), │
│ 99 │ })['image'].mode() │
│ 100 │ carry = {'step': carry['step'] + 1, 'skill': skill, 'goal': goal} │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/nets.py:255 │
│ in __call__ │
│ │
│ 252 │ dists = {} │
│ 253 │ if self.cnn_shapes: │
│ 254 │ flat = features.reshape([-1, features.shape[-1]]) │
│ ❱ 255 │ output = self._cnn(flat) │
│ 256 │ output = output.reshape(features.shape[:-1] + output.shape[1:]) │
│ 257 │ means = tf.split(output, [v[-1] for v in self.cnn_shapes.values( │
│ 258 │ dists.update({ │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/nets.py:305 │
│ in __call__ │
│ │
│ 302 │ x = tf.reshape(x, [-1, 1, 1, x.shape[-1]]) │
│ 303 │ depth = self._depth * 2 ** (len(self._kernels) - 2) │
│ 304 │ for i, kernel in enumerate(self._kernels[:-1]): │
│ ❱ 305 │ x = self.get(f'conv{i}', ConvT, depth, kernel, **self._kw)(x) │
│ 306 │ depth //= 2 │
│ 307 │ x = self.get('out', ConvT, self._shape[-1], self._kernels[-1])(x) │
│ 308 │ x = tf.math.sigmoid(x) │
│ │
│ /vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/nets.py:445 │
│ in __call__ │
│ │
│ 442 │ self._norm = Norm(norm) │
│ 443 │
│ 444 def __call__(self, hidden): │
│ ❱ 445 │ hidden = self._layer(hidden) │
│ 446 │ hidden = self._norm(hidden) │
│ 447 │ hidden = self._act(hidden) │
│ 448 │ return hidden │
│ │
│ /vol/bitbucket/jk3417/xmbhrl/lib/python3.10/site-packages/keras/utils/traceb │
│ ack_utils.py:70 in error_handler │
│ │
│ 67 │ │ │ filtered_tb = _process_traceback_frames(e.__traceback__) │
│ 68 │ │ │ # To get the full stack trace, call: │
│ 69 │ │ │ # `tf.debugging.disable_traceback_filtering()` │
│ ❱ 70 │ │ │ raise e.with_traceback(filtered_tb) from None │
│ 71 │ │ finally: │
│ 72 │ │ │ del filtered_tb │
│ 73 │
│ │
│ /vol/bitbucket/jk3417/xmbhrl/lib/python3.10/site-packages/keras/backend.py:6 │
│ 122 in conv2d_transpose │
│ │
│ 6119 │ │ strides = (1, 1) + strides │
│ 6120 │ │
│ 6121 │ if dilation_rate == (1, 1): │
│ ❱ 6122 │ │ x = tf.compat.v1.nn.conv2d_transpose( │
│ 6123 │ │ │ x, │
│ 6124 │ │ │ kernel, │
│ 6125 │ │ │ output_shape, │
╰──────────────────────────────────────────────────────────────────────────────╯
ResourceExhaustedError: Exception encountered when calling layer
'conv2d_transpose_1' (type Conv2DTranspose).
{{function_node
__wrapped__Conv2DBackpropInput_device_/job:localhost/replica:0/task:0/device:GPU
:0}} OOM when allocating tensor with shape[1024,13,13,128] and type half on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[Op:Conv2DBackpropInput]
Call arguments received by layer 'conv2d_transpose_1' (type Conv2DTranspose):
• inputs=tf.Tensor(shape=(1024, 5, 5, 256), dtype=float16)
srun: error: cloud-vm-42-53: task 0: Exited with exit code 1
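
For anyone hitting the same ResourceExhaustedError, a rough sanity check on the failing allocation may help. The shape and dtype below come straight from the error message; the interpretation of the batch dimension (batch_size 16 x replay_chunk 64 = 1024 flattened start states) and the per-step accounting are assumptions on my part, not something the log states. A minimal sketch in plain Python:

# Back-of-the-envelope size of the tensor named in the OOM message above.
# Shape and dtype come from the error; the batch-dimension interpretation
# (batch_size 16 x replay_chunk 64 = 1024 flattened starts) is an assumption.
batch, h, w, c = 1024, 13, 13, 128        # shape[1024,13,13,128]
bytes_per_elem = 2                        # "type half" = float16

one_activation = batch * h * w * c * bytes_per_elem
print(f"single conv2d_transpose_1 output: {one_activation / 2**20:.1f} MiB")  # ~42 MiB

# The goal decoder is invoked once per imagination step (imag_horizon: 16), and
# the rollout runs inside tf.GradientTape(persistent=True), so activations like
# this one are retained for the backward pass instead of being freed per step.
imag_horizon = 16
print(f"retained for this layer alone: {one_activation * imag_horizon / 2**20:.0f} MiB")

On the 15360 MiB T4 shown by nvidia-smi, this pattern adds up across all decoder layers and all imagination steps. If one wanted to experiment, the batch_size, replay_chunk, and imag_horizon entries in the config above are the obvious knobs to shrink; that is a guess at a mitigation, not something the log itself confirms.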