when I run them in parallel. By sporadic, I mean roughly one failure in a hundred (one or two per run of the test suite), with a different test failing each time, seemingly at random. Any idea how to debug this further?
Please note
yes, I have gs on my path, it’s in /usr/bin/gs
gs version 9.55.0
gmt version 6.4.0
Linux Mint (Ubuntu variant)
each test is run in a different temporary directory
I’ve done a bit of debugging on this. It seems that popen() is not the issue: in these error cases it succeeds and does not modify errno. Instead it is the subsequent gmt_gets(), or more accurately the fgets() therein, which fails, setting errno to 4 (EINTR, “Interrupted system call”) and leaving the line argument as “\0”.
I’m not sure where these signals are coming from. As far as I can see, there’s no use of them in GMT itself, right?
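For anyone wanting to see this failure mode in isolation, here is a minimal sketch (POSIX only, nothing to do with GMT's own code) that provokes it deliberately. It assumes the interrupting signal has a handler installed without SA_RESTART; the function name interrupted_fgets_errno and the use of SIGALRM are just choices for the illustration, not what happens in the test suite.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Empty handler: its only job is to make the blocked read() return early. */
static void on_alarm(int signo) { (void)signo; }

/* Hypothetical demo: blocks in fgets() on an empty pipe, lets a timer signal
 * interrupt it, and returns the errno seen after fgets() fails
 * (0 if fgets() unexpectedly succeeded, -1 if the setup failed). */
int interrupted_fgets_errno(long delay_usec) {
    assert(delay_usec > 0 && delay_usec < 1000000L);

    int fds[2];
    if (pipe(fds) != 0) return -1;
    FILE *fp = fdopen(fds[0], "r");
    if (fp == NULL) { close(fds[0]); close(fds[1]); return -1; }

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                      /* deliberately no SA_RESTART */
    if (sigaction(SIGALRM, &sa, &old) != 0) {
        fclose(fp); close(fds[1]); return -1;
    }

    struct itimerval tv;
    memset(&tv, 0, sizeof tv);
    tv.it_value.tv_usec = delay_usec;     /* one-shot timer */
    setitimer(ITIMER_REAL, &tv, NULL);

    char line[64] = "";
    errno = 0;
    /* Nothing is ever written to the pipe, so this blocks until the signal. */
    char *got = fgets(line, sizeof line, fp);
    int seen = (got == NULL) ? errno : 0;

    sigaction(SIGALRM, &old, NULL);       /* restore the previous disposition */
    fclose(fp);
    close(fds[1]);
    return seen;
}
```

Calling interrupted_fgets_errno(50000) should come back with EINTR on Linux, which matches the symptom described above: popen()/pipe setup is fine, and it is the blocking read inside fgets() that gets interrupted.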
Yes, you are right that it can also fail in the fgets step. Regarding the use of the line contents, I see that movie.c L2704 uses them.
Could you try to tweak that bit of code so that the gmt_fgets failure is not fatal, and see whether that has any consequences in your case? We could then think of a test that is milder when detecting Ghostscript.
Having read around a bit, it seems that this behaviour exists for signal handling. fgets blocks until it completes, so if something goes pear-shaped in the underlying system call, the program could be stuck. So fgets returns NULL and sets errno to EINTR if interrupted by a signal, giving the caller the opportunity to check flags set by the signal handler and handle recovery. In this case there is (I think) no handler in scope, so the “correct” response should be to retry, something along the lines of
    errno = 0;
    while (fgets(...) == NULL)
    {
        if (errno != EINTR)
            return 1;
        errno = 0;
    }
possibly with some counter to limit the number of retries. Now I think it would be a bit ill-mannered of me to suggest that GMT implements this for all fgets calls (I see there are 50+ in the codebase), and I think I can fix my use case by applying essentially the same retry mechanism around the call to GMT_Call_Module, since errno will still be EINTR there on failure. So could I suggest that I make that fix, check that it works, and provide a link to the change here? Then something similar could be implemented by other GMT wrappers if they are affected by the same issue (I’m thinking of PyGMT).
For a bit more context, the code here is a Ruby program which calls the GMT C API in a native extension, and that Ruby program is acceptance tested by BATS, which uses GNU parallel for parallelisation. I’m pretty sure (but not certain) it’s one of these latter two which is generating the problematic signals. It’s in the native extension that I plan to implement the fix; some more notes here.
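To make the retry-with-a-counter idea concrete, here is a rough sketch in plain C. The helper name fgets_retry_eintr and its parameters are made up for this post (nothing like it exists in GMT as far as I know), and the clearerr() call is my own addition, on the assumption that the stream's error flag should be reset before reading again.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper, not part of GMT: like fgets(), but retries up to
 * max_retries times when the read was interrupted by a signal (EINTR).
 * If retries_used is not NULL it receives the number of retries performed. */
char *fgets_retry_eintr(char *buf, int size, FILE *fp,
                        int max_retries, int *retries_used) {
    assert(buf != NULL && size > 0 && fp != NULL);

    int retries = 0;
    char *result;
    for (;;) {
        errno = 0;
        result = fgets(buf, size, fp);
        if (result != NULL) break;          /* got a line */
        if (errno != EINTR) break;          /* EOF or a genuine error */
        if (retries >= max_retries) break;  /* give up; errno is still EINTR */
        clearerr(fp);                       /* reset the stream error flag */
        retries++;
    }
    if (retries_used != NULL) *retries_used = retries;
    return result;
}
```

On a NULL return the caller can still inspect errno: 0 means end of file, EINTR means the retry budget ran out, anything else is a real error. One caveat I have not tested: if the signal arrives after part of a line has been read, that partial data may be lost on retry, which probably does not matter when only the first line of a short-lived command is wanted.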
The gmt_check_executable function (which calls the sometimes-failing code) is not called in many places, basically only by psconvert and when plotting LaTeX expressions. So if it solves this case, I think we could apply your patch directly in gmt_run_process_get_first_line. Can you find out how many times that while loop has to be executed in order to get past the first failure?
I think that the failure occurring in gmt_check_executable is just bad luck: in my particular case it happens to be running at about the same time as a signal is delivered, and occasionally the signal hits during its execution. I think it is reasonable for the GMT library to error in this way: if the user chooses to use it in a signal-heavy environment, then they (I) should be expected to deal with the EINTRs. Arguably the GMT commands (psconvert etc.) could/should do something similar, since the user has no visibility of errno and so cannot reasonably retry on failure. But this is obviously a rare failure mode (else there would be reports), so I guess it is low priority.
In any case, the fix I mentioned above does seem to do the trick – in this change I modify the GMT gem so that if any of the calls to GMT API functions fails and sets errno, then the Errno::<name> exception is raised (in Ruby), and in the Ruby code I have
    begin
      gmt.psconvert(...)
    rescue Errno::EINTR
      retry
    end
and now I run the acceptance tests in parallel numerous times and don’t see any errors.
As a sort of aside, with this new facility of making errno visible via the Ruby-C native extension, I find that all GMT programs set errno to ENOENT (file not found) on calling -^, -? or -+. Not an issue, but a bit odd.
I have no idea why that happens. The only references to ENOENT I find in the GMT code are where it tests whether a directory exists. As in
    GMT_LOCAL int psconvert_make_dir_if_needed (struct GMTAPI_CTRL *API, char *dir) {
        struct stat S;
        int err = stat (dir, &S);
        if (err && errno == ENOENT && gmt_mkdir (dir)) { /* Does not exist - try to create it */
Sure, but popen calls fork and runs a shell against the arguments in the child process. One could avoid the shell by doing the fork and exec yourself, but that adds complexity (and I see that in a few places the shell is actually required, for wildcards etc.).
Sure, I don’t want to spend time on that. But, from memory, there is no fork on Windows. To build GMTSAR with MSVC I had to find some 3rd party implementation of fork.
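For reference, this is roughly what the fork-and-exec route looks like on a POSIX system, as a sketch only. The function name first_line_no_shell is invented for this post, it is not a proposed replacement for anything in GMT, and it does no EINTR handling on the read.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv[0] directly (no shell, so no wildcard or
 * quoting expansion) and stores the first line of its stdout in line.
 * Returns 0 if a line was read, -1 otherwise. */
int first_line_no_shell(char *const argv[], char *line, int size) {
    assert(argv != NULL && argv[0] != NULL && line != NULL && size > 0);

    int fds[2];
    if (pipe(fds) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return -1; }

    if (pid == 0) {                       /* child: stdout -> pipe, then exec */
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            if (dup2(fds[1], STDOUT_FILENO) < 0) _exit(127);
            close(fds[1]);
        }
        execvp(argv[0], argv);
        _exit(127);                       /* only reached if exec failed */
    }

    close(fds[1]);                        /* parent keeps the read end only */
    int status = -1;
    FILE *fp = fdopen(fds[0], "r");
    if (fp != NULL) {
        if (fgets(line, size, fp) != NULL) status = 0;
        fclose(fp);
    } else {
        close(fds[0]);
    }

    int wstatus;
    while (waitpid(pid, &wstatus, 0) < 0 && errno == EINTR) {
        /* reap the child, retrying if the wait itself is interrupted */
    }
    return status;
}
```

Usage would be along the lines of passing an argument vector such as {"gs", "--version", NULL}. That is about forty lines to replace a single popen() call, which illustrates the extra complexity.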