Wednesday, March 18, 2015

Parallelizing a large dynamic program


I have a performance-critical dynamic program in C++ whose results are placed in an M × N table, roughly on the order of 2000 rows × 30000 columns.

Each entry (r, c) depends on the entries in a few rows of a few other columns of the table.


The most obvious way to parallelize the computation of row r across P processors is to statically partition the data: i.e., have processor p compute the entries (r, p + kP) for all valid k.
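For concreteness, a minimal sketch of this static scheme using std::thread, where compute_entry(r, c) stands in for the real per-entry work (the helper name and signatures are assumptions, not my actual code):

```cpp
#include <thread>
#include <vector>

// Stand-in for the real per-entry work; name and signature are assumed.
void compute_entry(int r, int c);

// Static round-robin partitioning: processor p computes columns
// p, p + P, p + 2P, ... of row r.
void compute_row_static(int r, int num_cols, int P) {
    std::vector<std::thread> workers;
    for (int p = 0; p < P; ++p) {
        workers.emplace_back([=] {
            for (int c = p; c < num_cols; c += P)
                compute_entry(r, c);
        });
    }
    for (auto& t : workers)
        t.join();
}
```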


However, the entries for different columns take noticeably different amounts of time to compute (e.g., one may take five times as long as another), so I would like to partition the data dynamically for better performance: a CPU that finishes early should not stall, but instead steal work from CPUs that are still catching up.


One way to approach this is to keep a global atomic counter that tracks how many columns have been handed out, and to increment it each time a CPU needs more work.
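Sketched under the same assumptions as above, each worker thread (launched just as in the static version) would run a loop like this:

```cpp
#include <atomic>

void compute_entry(int r, int c);  // stand-in, as before

// Body of each worker thread; next_col is shared by all P workers.
void worker_dynamic(int r, int num_cols, std::atomic<int>& next_col) {
    for (;;) {
        // One shared-counter access per entry computed; relaxed ordering
        // is enough just to hand out unique column indices.
        int c = next_col.fetch_add(1, std::memory_order_relaxed);
        if (c >= num_cols)
            break;
        compute_entry(r, c);
    }
}
```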

However, this forces every CPU to access the same global counter after computing each entry in the table, which serializes the program to some extent. Since computing a single entry is quick, this overhead is significant.


So, my question is:

Is there a way to perform this dynamic partitioning in a more scalable fashion, i.e., without every CPU having to access a single global counter after computing each entry?
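For concreteness, the kind of mitigation I have in mind, sketched under the same assumptions as above: have each worker claim a fixed-size chunk of columns per counter access, so the counter is touched once per chunk rather than once per entry. The chunk size below is purely illustrative.

```cpp
#include <algorithm>
#include <atomic>

void compute_entry(int r, int c);  // stand-in, as before

constexpr int kChunk = 64;  // illustrative; would need tuning

// Workers claim kChunk columns per counter access, so the shared counter
// is touched roughly num_cols / kChunk times per row instead of num_cols.
void worker_chunked(int r, int num_cols, std::atomic<int>& next_col) {
    for (;;) {
        int begin = next_col.fetch_add(kChunk, std::memory_order_relaxed);
        if (begin >= num_cols)
            break;
        int end = std::min(begin + kChunk, num_cols);
        for (int c = begin; c < end; ++c)
            compute_entry(r, c);
    }
}
```

Larger chunks mean less counter traffic, but they reintroduce the original imbalance at the end of each row: a worker stuck with a slow chunk can leave the others idle. Hence the question of whether there is a fundamentally more scalable scheme.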



