There are two common reasons training hangs when using W&B with distributed training:
  1. Hanging at the beginning of training: W&B’s multiprocessing can interfere with the multiprocessing from distributed training frameworks.
  2. Hanging at the end of training: The W&B process does not know when it needs to exit.

Fix hanging at the start

Enable W&B Service, which is the default for W&B SDK 0.13.0 and above. If you are on an older version, upgrade your SDK:
pip install --upgrade wandb
For W&B SDK 0.12.5 through 0.12.x, enable W&B Service explicitly:
import wandb

def main():
    wandb.require("service")
    # rest of your script
For W&B SDK 0.12.4 and below, set the WANDB_START_METHOD environment variable:
export WANDB_START_METHOD=thread

Fix hanging at the end

Call wandb.finish() at the end of your training script to tell W&B that the run is complete:
wandb.finish()
This ensures all data is uploaded and the W&B process exits cleanly. For more information, see Distributed training.