Goals
I have recently been working on a computer vision task that requires downloading and processing a large volume of data. Doing this in a single thread takes far too long, so running the work in parallel on an HPC system is a better choice.
- Understand the `multiprocessing`, `subprocess`, and `threading` packages in Python
- Understand the workflow for an MPI job
- Transfer the workflow to an HPC system
Multiprocessing package: Process-based parallelism
`Pool` object: parallelizes the execution of a function and distributes the input data across worker processes (data parallelism)
Basic example:
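A minimal sketch of data parallelism with `Pool.map`; `process_item` is a placeholder for the real download/processing step:

```python
from multiprocessing import Pool

def process_item(x):
    # placeholder: replace with the real download / processing work
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:                  # 4 worker processes
        results = pool.map(process_item, range(10))  # distribute the inputs
    print(results)
```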
`Process` class: processes are spawned by creating a `Process` object and then calling its `start()` method.
Basic example:
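A minimal sketch of the `Process` / `start()` / `join()` pattern; `work` is a placeholder task:

```python
from multiprocessing import Process

def work(i):
    print(f"worker {i} running")   # placeholder task

if __name__ == "__main__":
    procs = [Process(target=work, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()   # spawn each worker
    for p in procs:
        p.join()    # wait for every worker to finish
```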
The `join()` method blocks the execution of the main process until the process whose `join()` method was called terminates. Without it, the main process won't wait for its children.
Difference between `subprocess` and `multiprocessing`
- `subprocess` helps Python code spawn a new process to execute external programs, such as C++ binaries or shell commands.
- `multiprocessing` spawns multiple Python processes for data processing or other parallel work.
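For example, a small sketch of running an external command with `subprocess.run` (the command here is purely illustrative; it could just as well be a compiled C++ binary):

```python
import subprocess

# run an external program and capture its output
result = subprocess.run(
    ["echo", "hello from a child process"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```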
Difference between `Pool` and `Process`
In my setting:
- `Pool` for multiple threads
- `Process` for multiple processes
Deploy on HPC
Note: specify `cpus-per-task` $\geq$ (number of `Process` workers) $\times$ (`Pool` size per process).
Ref code
- Spawn `ntasks` processes.
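A sketch of what this part could look like, assuming `NTASKS` mirrors the SLURM `--ntasks` setting and `process_chunk` stands in for the real per-chunk work:

```python
from multiprocessing import Process

NTASKS = 4  # assumed to match --ntasks in the job script

def process_chunk(chunk):
    for item in chunk:
        pass  # placeholder: download / process one item

if __name__ == "__main__":
    files = [f"file_{i:04d}" for i in range(1000)]      # placeholder data
    chunks = [files[i::NTASKS] for i in range(NTASKS)]  # split the shared data
    procs = [Process(target=process_chunk, args=(c,)) for c in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # called only after all processes have started
```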
Notes
- The shared data needs to be split into one chunk per process.
- The `join()` method has to be called only after all processes have been started.
- Monitor `KeyboardInterrupt` and terminate all processes; a `join()` call needs to follow each `terminate()` to make sure the processes are terminated one by one (see the sketch below).
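A sketch, under the same assumptions, of catching `KeyboardInterrupt` and shutting the workers down one by one:

```python
from multiprocessing import Process
import time

def work(i):
    time.sleep(60)  # placeholder long-running task

if __name__ == "__main__":
    procs = [Process(target=work, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    try:
        for p in procs:
            p.join()
    except KeyboardInterrupt:
        for p in procs:
            p.terminate()  # stop the worker
            p.join()       # make sure it has exited before moving on
```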
- Use `Pool` to create `cpus-per-task` threads.
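A sketch of the thread-pool variant; since the note above says "threads", this uses `multiprocessing.pool.ThreadPool`, and it assumes the `SLURM_CPUS_PER_TASK` environment variable is set by the scheduler (`process_item` is a placeholder):

```python
import os
from multiprocessing.pool import ThreadPool

def process_item(x):
    return x * x  # placeholder: replace with the real (I/O-bound) work

if __name__ == "__main__":
    n_threads = int(os.environ.get("SLURM_CPUS_PER_TASK", "4"))
    with ThreadPool(n_threads) as pool:
        results = pool.map(process_item, range(100))
    print(len(results))
```

A process-based `Pool` has the same interface, so it can be swapped in if the work is CPU-bound rather than I/O-bound.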
Notes
- Send the whole shared dataset to the `Pool`; the interpreter dispatches the items to the workers in the background, so there is no need to split the data explicitly.
- If multiple processes want to write to or read from the same file, a `Queue` of write/read operations has to be created to avoid conflicts (see the sketch below).
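One possible sketch of that pattern: a single writer process drains a `multiprocessing.Queue` while the workers only push messages onto it (the file name and worker logic are placeholders):

```python
from multiprocessing import Process, Queue

def worker(i, q):
    # push results onto the queue instead of writing to the file directly
    q.put(f"result from worker {i}\n")

def writer(q, path):
    with open(path, "w") as f:
        while True:
            line = q.get()
            if line is None:   # sentinel: all workers are done
                break
            f.write(line)

if __name__ == "__main__":
    q = Queue()
    w = Process(target=writer, args=(q, "results.txt"))
    w.start()
    workers = [Process(target=worker, args=(i, q)) for i in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    q.put(None)  # tell the writer to stop
    w.join()
```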