|
|
- Reproduce: on Compute Canada, with Singularity, installed `mesa-utils`, `libgl1-mesa-glx`.
|
|
- Solution: use the `--nv` argument in the `singularity shell` command.
- Experience: Mesa needs an NVIDIA graphics card to run correctly here, so at least one GPU is necessary.
|
|
- Solution: `export LIBGL_ALWAYS_INDIRECT=1`, `apt-get install -y mesa-utils libgl1-mesa-glx`
- In PyTorch Lightning, when I use DDP for training, the code crashed when the validation loop finished and the models were checkpointed. The cause is that the number of validation batches is not the same in every process. In my case, there are 465 val batches in total, and each process has 117 val batches
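The arithmetic behind the mismatch can be sketched in plain Python. This is only an illustration under stated assumptions: the note does not say how many processes were used, so `world_size = 4` is a guess (chosen because ceil(465 / 4) = 117), round-robin sharding without padding is assumed, and both helper functions are hypothetical names, not part of PyTorch Lightning.

```python
import math


def batches_per_rank(total_batches: int, world_size: int) -> list[int]:
    """Count the batches each rank receives when batches are dealt
    round-robin across ranks with no padding (an assumed sharding scheme)."""
    return [len(range(rank, total_batches, world_size))
            for rank in range(world_size)]


def padded_batches_per_rank(total_batches: int, world_size: int) -> int:
    """Batches per rank if the tail is padded so every rank gets the same count."""
    return math.ceil(total_batches / world_size)


total_batches = 465  # from the note above
world_size = 4       # assumption: not stated in the note

counts = batches_per_rank(total_batches, world_size)
padded = padded_batches_per_rank(total_batches, world_size)
extra = padded * world_size - total_batches  # batches that must be repeated

print(counts)  # [117, 116, 116, 116]
print(padded)  # 117
print(extra)   # 3
```

With these assumed numbers, only one rank would actually reach 117 batches, so the ranks leave the validation loop at different steps. Making the counts equal on every rank, either by padding (3 repeated batches here) or by dropping the tail, is the usual way to avoid that; `torch.utils.data.DistributedSampler` pads by default and drops the tail with `drop_last=True`.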