Goals
I have recently been working on a computer vision task that requires downloading and processing a large volume of data. Doing this in a single thread takes too much time, so running the work in parallel on an HPC system is a better choice.
- Understand the multiprocessing, subprocess, and threading packages in Python
- Learn the workflow for an MPI job
- Transfer to HPC
Multiprocessing package: Process-based parallelism
`Pool` object: parallelizes the execution of a function across multiple input values by distributing the input data across processes (data parallelism).
Basic example:
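The snippet below is a minimal sketch of `Pool`-based data parallelism, not the original example. The function `square`, the helper `run_pool`, and the worker count of 4 are placeholders chosen for illustration.

```python
from multiprocessing import Pool


def square(x):
    """Toy stand-in for a real per-item processing step."""
    return x * x


def run_pool(items, workers=4):
    # Pool.map splits `items` into chunks, hands them to the worker
    # processes, and returns the results in the original input order.
    with Pool(processes=workers) as pool:
        return pool.map(square, items)


if __name__ == "__main__":
    # The guard keeps child processes from re-running this block
    # when the module is imported by a freshly started worker.
    print(run_pool(range(10)))
```

The caller only passes the full input; how the items are divided among the workers is left to `Pool.map`.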
`Process` class: processes are spawned by creating a `Process` object and then calling its `start()` method.
Basic example:
The `join()` method blocks the main process until the process whose `join()` method is called terminates. Without it, the main process continues without waiting for that process to finish.
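As a rough illustration of `start()` and `join()`, the sketch below launches three processes and waits for all of them. The functions `describe` and `work` are hypothetical names used only for this example.

```python
import os
from multiprocessing import Process


def describe(name):
    """Build a message that shows which process is doing the work."""
    return f"worker {name} running in pid {os.getpid()}"


def work(name):
    print(describe(name))


if __name__ == "__main__":
    procs = [Process(target=work, args=(i,)) for i in range(3)]

    # Start every process first so that they run concurrently.
    for p in procs:
        p.start()

    # Then block until each one has terminated.
    for p in procs:
        p.join()

    print("all workers finished")
```

Because of the `join()` calls, the final line is printed only after every worker has exited.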
Difference between subprocess and multiprocessing
- `subprocess` lets Python code spawn a new process that runs an external program, such as a compiled C++ binary or a shell command.
- `multiprocessing` spawns multiple Python processes for data processing or other parallel work.
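To make the `subprocess` side concrete, here is a small sketch that runs an external command and captures its output. The helper `run_external` is a hypothetical name, and a second Python interpreter stands in for a C++ binary or shell script so that the example stays portable.

```python
import subprocess
import sys


def run_external(cmd):
    """Run an external command and return its standard output as text."""
    completed = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    # Any executable could be placed here instead of the Python interpreter.
    print(run_external([sys.executable, "-c", "print('hello from a child interpreter')"]))
```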
Difference between Pool and Process
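In my setting:
- `Pool` for the multiple workers inside each task (I refer to them as threads, although `multiprocessing.Pool` workers are actually processes)
- `Process` for the multiple top-level processes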
Deploy on HPC
Make sure to specify `cpus-per-task` $\geq$ Processes $\times$ Pools.
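The rule above can be checked at start-up. The sketch below assumes the cluster uses Slurm, which exports `SLURM_CPUS_PER_TASK` when `--cpus-per-task` is given; the scheduler is an assumption, and `allocated_cpus`, `fits_allocation`, and the counts in the usage part are hypothetical.

```python
import os


def allocated_cpus(default=1):
    """CPUs granted to this task (assumes Slurm; falls back to `default`)."""
    return int(os.environ.get("SLURM_CPUS_PER_TASK", default))


def fits_allocation(n_processes, pool_size, cpus):
    """True when n_processes * pool_size does not exceed the CPU allocation."""
    return n_processes * pool_size <= cpus


if __name__ == "__main__":
    cpus = allocated_cpus()
    # Hypothetical layout: 1 process that owns a pool of 4 workers.
    if not fits_allocation(n_processes=1, pool_size=4, cpus=cpus):
        print(f"warning: only {cpus} CPU(s) allocated, workers will share cores")
```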
Ref code
- Spawn `ntasks` processes.
notes
- The shared data needs to be split for each process.
- The `join()` method has to be called only after all processes have been started.
- Monitor `KeyboardInterrupt` and terminate all processes. A `join()` call needs to follow each `terminate()` to make sure the processes are terminated one by one.
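The notes above can be sketched as follows. This is not the original reference code: `split`, `process_chunk`, and `main` are hypothetical names, and the chunking scheme (contiguous, nearly equal slices) is just one reasonable choice.

```python
from multiprocessing import Process


def split(data, n):
    """Split `data` into n contiguous chunks whose sizes differ by at most one."""
    k, r = divmod(len(data), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def process_chunk(rank, chunk):
    print(f"task {rank}: processing {len(chunk)} items")


def main(data, ntasks):
    procs = [
        Process(target=process_chunk, args=(rank, chunk))
        for rank, chunk in enumerate(split(data, ntasks))
    ]
    try:
        for p in procs:  # start everything first ...
            p.start()
        for p in procs:  # ... and only then wait
            p.join()
    except KeyboardInterrupt:
        for p in procs:
            if p.is_alive():  # skips processes that never started or already exited
                p.terminate()
                p.join()      # wait for this one before moving to the next


if __name__ == "__main__":
    main(list(range(10)), ntasks=3)
```

Calling `join()` inside the start loop would serialize the work, since each process would have to finish before the next one is started.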
- Use `Pool` to create `cpus-per-task` workers.
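A possible shape for this step is sketched below, again as an illustration and not the original reference code. It assumes a Slurm job, where `SLURM_CPUS_PER_TASK` holds the `cpus-per-task` value; `process_item` and `run` are hypothetical names.

```python
import os
from multiprocessing import Pool


def process_item(x):
    """Toy stand-in for the real per-item work."""
    return x + 1


def run(data, workers):
    # The whole dataset is passed in; Pool.map distributes it to the workers.
    with Pool(processes=workers) as pool:
        return pool.map(process_item, data)


if __name__ == "__main__":
    # Assumption: Slurm sets this variable; fall back to one worker otherwise.
    workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    print(run(list(range(8)), workers))
```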
notes
- Send the whole shared data to each `Pool`; it dispatches the items to its workers behind the scenes, so the data does not have to be split explicitly.
- If multiple processes want to write to or read from the same file, a `Queue` of write/read operations has to be created to avoid conflicts.
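One common way to realize the queue idea is to let a single writer process own the file while the other processes only put lines on a `Queue`. The sketch below shows that pattern under those assumptions; `format_record`, `producer`, `writer`, the sentinel value, and the output file name are all hypothetical.

```python
import os
import tempfile
from multiprocessing import Process, Queue

SENTINEL = None  # tells the writer that no more lines will arrive


def format_record(rank, value):
    return f"{rank}\t{value}\n"


def producer(rank, values, queue):
    for v in values:
        queue.put(format_record(rank, v))


def writer(path, queue):
    # Only this process touches the file, so writes cannot interleave.
    with open(path, "w") as f:
        while True:
            line = queue.get()
            if line is SENTINEL:
                break
            f.write(line)


def main(path):
    queue = Queue()
    w = Process(target=writer, args=(path, queue))
    w.start()

    producers = [
        Process(target=producer, args=(r, range(r * 3, r * 3 + 3), queue))
        for r in range(2)
    ]
    for p in producers:
        p.start()
    for p in producers:
        p.join()

    queue.put(SENTINEL)  # all producers are done, so let the writer finish
    w.join()


if __name__ == "__main__":
    main(os.path.join(tempfile.gettempdir(), "results.tsv"))
```

The order of lines from different producers is not deterministic, but each line is written whole.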