I have to program an optimized multi-thread implementation of the Levenshtein distance problem. It can be computed using dynamic programming with a matrix, the wikipedia page on Levenshtein distance covers that well enough.

Now, I can compute diagonal elements concurrently. That is all alright.

My problem now comes with caches. Matrices in c++ are normaly saved in memory row by row, correct? Well, that is not good for me as I need 2 element of the previous row and 1 element of the current row to compute my result, that is horrible cache-wise. The cache will hold the current row (or part of it), then I ask for the previous one which it will probably not hold anymore. Then for another one, I need a different part of the diagonal, so yet again, I ask for completely different rows and the cache will not have those ready for me.

Therefore, I would like to save my matrix to memory in blocks or maybe diagoals. That will result in fewer cachce misses and make my implementation faster again.

How do you do that? I tried searching the internet, but I could never find anything that would show me the way. Is it possible to tell c++ how to order that type in memory?

EDIT: As some of you seem confused about the nature of my question. I want to save a matrix (does not matter if I will make it a 2D array or any other way) in a custom way into the MEMORY. Normally, a 2D array will save row after row, I need to work with diagonals therefore caches will miss a lot on the huge matrices I will work at (possibly millions of rows and columns).