- Hanging at the beginning of training: W&B’s multiprocessing can interfere with the multiprocessing from distributed training frameworks.
- Hanging at the end of training: The W&B process does not know when it needs to exit.
Fix hanging at the start
Use W&B Service, which is enabled by default in W&B SDK 0.13.0 and above. If you are on an older version, upgrade your SDK with pip install --upgrade wandb, or use the workaround below that matches your version.
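To see which version range in this section applies, you can check the installed SDK version from Python. The sketch below is illustrative only: parse_version and startup_hang_fix are hypothetical helper names, not part of the wandb API, and the returned labels are just tags for the three cases described here.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Return (major, minor, patch) as ints, ignoring suffixes such as 'rc1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def startup_hang_fix(sdk_version):
    """Map a W&B SDK version string to the fix this section describes for it."""
    parsed = parse_version(sdk_version)
    if parsed >= (0, 13, 0):
        return "service-default"      # W&B Service is already the default
    if parsed >= (0, 12, 5):
        return "require-service"      # enable W&B Service explicitly
    return "start-method-thread"      # set WANDB_START_METHOD


if __name__ == "__main__":
    try:
        installed = version("wandb")
    except PackageNotFoundError:
        print("wandb is not installed")
    else:
        print(installed, "->", startup_hang_fix(installed))
```

Reading the version through importlib.metadata avoids importing wandb itself, so the check has no side effects on a training process.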
For 0.12.5 through 0.12.x, enable W&B Service explicitly by calling wandb.require("service") at the start of your script.
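A minimal sketch of the explicit opt-in follows. It assumes the opt-in should happen before the run is created. The function name start_run and the project name are hypothetical. The wandb module is passed in as a parameter only so the sketch can be exercised with a stand-in object; in a real script you would pass the imported module.

```python
def start_run(wandb_module, project="hang-demo"):
    """Opt in to W&B Service, then create the run.

    `wandb_module` is expected to be the imported `wandb` package.
    """
    # On SDK 0.12.5 through 0.12.x the service is not on by default,
    # so request it before calling init().
    wandb_module.require("service")
    return wandb_module.init(project=project)
```

In a training script this would typically be called once per process, for example `import wandb` followed by `run = start_run(wandb)` inside the `if __name__ == "__main__":` block.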
For 0.12.4 and below, set the WANDB_START_METHOD environment variable to "thread" so that W&B uses multithreading instead of multiprocessing.
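The variable can be exported in the shell that launches training or set from Python. The sketch below sets it in-process, assuming this happens before wandb.init() is called so that the setting is in place when the run starts.

```python
import os

# Ask older W&B SDKs (0.12.4 and below) to start in "thread" mode.
# Assumption: this line runs before wandb.init() in the same process.
os.environ["WANDB_START_METHOD"] = "thread"
```

Setting it in code keeps the workaround next to the training script, at the cost of overriding any value already exported in the environment.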
Fix hanging at the end
Call wandb.finish() at the end of your training script to tell W&B that the run is complete, so the W&B process knows it can exit.
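One way to make sure the call is not skipped when training raises an exception is to place it in a finally block. This is a sketch under assumptions, not the only valid pattern: train_and_finish, the project name, and the placeholder loss are hypothetical, and the wandb module is passed in as a parameter so the example can be run against a stand-in object.

```python
def train_and_finish(wandb_module, epochs=3, project="hang-demo"):
    """Run a toy loop and always mark the W&B run as finished.

    `wandb_module` is expected to be the imported `wandb` package.
    """
    wandb_module.init(project=project)
    try:
        for epoch in range(epochs):
            loss = 1.0 / (epoch + 1)  # placeholder for a real training step
            wandb_module.log({"epoch": epoch, "loss": loss})
    finally:
        # Runs on success and on error, so the run is always closed.
        wandb_module.finish()
```

In a real script, call it as `import wandb` followed by `train_and_finish(wandb)`.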