PyTorch Lightning is the deep learning framework with “batteries included” for professional AI researchers and machine learning engineers who need maximal flexibility while super-charging performance at scale.
The LightningDataModule was designed as a way of decoupling data-related hooks from the LightningModule so you can develop dataset-agnostic models. The LightningDataModule makes it easy to hot swap different datasets with your model, so you can test and benchmark it across domains. It also makes it possible to share and reuse the exact data splits and transforms across projects.
LIGHTNINGDATAMODULE
A datamodule encapsulates the five steps involved in data processing in PyTorch:

1. Download / tokenize / process.
2. Clean and (maybe) save to disk.
3. Load inside a Dataset.
4. Apply transforms (rotate, tokenize, etc.).
5. Wrap inside a DataLoader.
In normal PyTorch code, the data cleaning and preparation is usually scattered across many files. This makes sharing and reusing the exact splits and transforms across projects impossible.
A DataModule is simply a collection of a train_dataloader, val_dataloader, test_dataloader and predict_dataloader, along with the matching transforms and data processing/download steps required.
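For illustration, here is a minimal sketch of such a DataModule. It assumes torchvision's MNIST dataset, a plain ToTensor transform, and an illustrative 55,000/5,000 train/val split; none of these specifics are mandated by Lightning.

import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST


class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = "./data", batch_size: int = 32):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transforms.ToTensor()

    def prepare_data(self):
        # Download only; do not assign state here (this runs in a single process).
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        # Assign the train/val/test/predict datasets (this runs on every process).
        if stage in ("fit", None):
            mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
            self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])
        if stage in ("test", "predict", None):
            self.mnist_test = MNIST(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=self.batch_size)

    def predict_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=self.batch_size)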
As the complexity of the preprocessing grows (transforms, multi-GPU training), you can let Lightning handle those details for you while making this dataset reusable, so you can share it with colleagues or use it in different projects.
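As a rough usage sketch (MyLitModel is a hypothetical stand-in for any LightningModule), reusing the dataset then amounts to passing the same DataModule instance to the Trainer, and swapping datasets means swapping that single object:

# Hypothetical usage of the MNISTDataModule sketched above.
dm = MNISTDataModule(data_dir="./data", batch_size=64)
model = MyLitModel()  # any LightningModule; the model never touches the data pipeline
trainer = pl.Trainer(max_epochs=3, accelerator="auto", devices="auto")
trainer.fit(model, datamodule=dm)
trainer.test(model, datamodule=dm)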
Downloading and saving data with multiple processes will result in corrupted data. Lightning ensures that prepare_data() is called only within a single process on CPU, so you can safely add your downloading logic within; it is executed only once. In the case of multi-node training, the execution of this hook depends upon prepare_data_per_node. setup() is called after prepare_data(), and there is a barrier in between which ensures that all processes proceed to setup() only once the data is prepared and available for use.
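A rough sketch of how this split typically looks in practice (download_dataset and load_splits are hypothetical helpers standing in for your own I/O code):

class MyDataModule(pl.LightningDataModule):
    def __init__(self):
        super().__init__()
        # Default behaviour: prepare_data() runs on the local rank 0 process of every node.
        # Set this to False to run it only once, on the global rank 0 process.
        self.prepare_data_per_node = True

    def prepare_data(self):
        # Single-process hook: safe place for downloads, tokenization, writing to disk.
        # Avoid assigning state here; it is not broadcast to the other processes.
        download_dataset("./data")  # hypothetical helper

    def setup(self, stage):
        # Runs on every process, after prepare_data() and the barrier,
        # so the files are guaranteed to exist by the time this executes.
        self.train_set, self.val_set = load_splits("./data")  # hypothetical helper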