Parallel GMT operations fail to output results files in some instances

chintals · September 28, 2022, 7:26pm

When running a c-shell script in parallel (using subprocess in Python) with a number of GMT operations that manipulate grd files (such as grd2xyz, grdsample, grdmath, grdedit, etc.), I am having issues where some of the child processes do not output the final grd file (that is, the final grd file is missing from the directory).

If the number of parallel instances is set to 6-8, all instances run properly and output the files per the csh script. But if the number of parallel instances is increased to, say, 24, some of them do not output final grd files, and I don’t see any errors printed on the console (even with the -v flag).

I am running GMT 6.3, 64-bit, on Ubuntu 20.04, with more than 24 real CPU cores, more than 200 GB of physical RAM, and 4x swap memory. The hard drive configuration is a SATA SSD RAID 0.

Assuming that I might be close to the hard drive throughput limit, I put wait commands in the csh script after every I/O operation, but some instances still have missing files.

Any help or suggestion in this matter is much appreciated,
Thank you,

pwessel · September 28, 2022, 7:47pm

Are you able to post your script so we can see if there are any commands that may not work well in a general parallel situation?

chintals · September 28, 2022, 8:55pm

Hi Paul,
The c-shell scripts I am referring to are primarily from GMTSAR. One of the modified csh scripts is for interferogram generation. The modified script eliminates the for loop and takes the reference image and aligned image directly as arguments. Also, each of the temp files created is uniquely named. But some parallel instances could be accessing the same source file.

pwessel · September 28, 2022, 9:35pm

So that probably means GMT classic mode then (i.e., no gmt begin and end). In general there is no problem if the same file (e.g., a grid) is read by GMT; trouble only applies to writing to the same file!
Also, GMT communicates with future commands via the .gmtcommands hidden file, and this can cause trouble when one thread writes to that file while another tries to read it. You can turn off some of this via --GMT_HISTORY=readonly so that the file is only read from. But if your script has a blank -R after it has been set once, then of course the file needs to be written at least once, and then further calls can read it and get -R that way. So little things like that can trip you up, and the more cores you use the more likely this becomes an issue, which explains why it works OK for just a few.
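
For anyone launching GMT modules directly from Python, a minimal sketch of passing the read-only history setting mentioned above could look like the following. The helper names (with_readonly_history, run_gmt) and the grid file name in the usage note are hypothetical, and the sketch assumes the gmt executable is on your PATH. It does not help when the GMT calls sit inside a csh script, where the flag would have to be added to the commands in the script itself.

```python
import subprocess

# The setting quoted in this thread; it is appended to each GMT call.
READONLY_FLAG = "--GMT_HISTORY=readonly"


def with_readonly_history(gmt_args):
    """Return a copy of a GMT argument list with the read-only history flag appended once."""
    args = list(gmt_args)
    if READONLY_FLAG not in args:
        args.append(READONLY_FLAG)
    return args


def run_gmt(gmt_args):
    """Run one GMT command with read-only history and capture its output (hypothetical helper)."""
    return subprocess.run(with_readonly_history(gmt_args), capture_output=True, text=True)
```

A call such as run_gmt(["gmt", "grdinfo", "/abs/path/to/input.grd"]) would then read the history file but not write to it.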

chintals · September 28, 2022, 10:05pm

Ok, that makes sense. Is a .gmtcommands hidden file created for each parallel instance?
Could a solution be to create a unique temp directory and do the processing in there (so the .gmtcommands file is created within the temp directory) and then move the files to the final directory?

pwessel · September 28, 2022, 10:27pm

No, GMT does not know about your parallel instances. You can avoid most of these issues by working in separate subdirectories, since GMT will read/write to .gmtcommands in the current directory.

chintals · September 28, 2022, 10:41pm

Hi Paul,

Ok, that makes sense. I meant to say create a unique temp directory for each parallel instance, then do processing, and then move the files from subdirectories after all of the parallel GMT processing is completed.

Thank you for your help,

Sunil
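
The approach agreed on in this thread (one unique temp directory per parallel instance, with outputs moved only after all processing is done) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names, the "*.grd" output pattern, and the worker count are hypothetical, and every input path inside each command must be absolute because each job runs with a different working directory.

```python
import shutil
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_in_scratch(cmd, scratch_root):
    """Run one command inside its own freshly created scratch directory."""
    scratch = Path(tempfile.mkdtemp(prefix="gmtjob_", dir=scratch_root))
    # cwd=scratch: any hidden history file GMT writes stays private to this job.
    proc = subprocess.run(cmd, cwd=scratch, capture_output=True, text=True)
    return scratch, proc.returncode, proc.stderr


def run_all(cmds, final_dir, workers=4, pattern="*.grd"):
    """Run all commands in parallel, then move matching outputs to final_dir."""
    final_dir = Path(final_dir).resolve()
    final_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    with tempfile.TemporaryDirectory(prefix="gmtpar_") as root:
        # Threads are enough here: each one just waits on an external process.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda c: run_in_scratch(c, root), cmds))
        # All jobs have finished at this point; collect the outputs.
        for scratch, returncode, stderr in results:
            if returncode != 0:
                print(f"job in {scratch} exited with {returncode}: {stderr}",
                      file=sys.stderr)
            for grd in sorted(scratch.glob(pattern)):
                shutil.move(str(grd), str(final_dir / grd.name))
                moved.append(grd.name)
    return moved
```

Each entry in cmds would be an argument list such as ["csh", "/abs/path/to/script.csh", reference, aligned], where the script path and arguments are placeholders. Output names must be unique across jobs, as they already are in the modified scripts described above, because everything ends up in one final directory. Comparing the returned list against the expected file names also gives a quick way to spot which instance failed to write its grid.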