I regularly end up getting "Mismatch between actual (2) and expected (3) fields near line…" when reading files. Is there an option to ignore the lines that cause this mismatch and complete the operation?
$ gmt info -I50 -i8,7,15 data.csv
gmtinfo [WARNING]: Mismatch between actual (2) and expected (3) fields near line 19688 in file data.csv
For reasonable data tables passed to GMT, the number of fields should be the same for all rows. The warnings are actually very useful: they usually mean that the data table passed to GMT is incorrect. So we shouldn't silence these warnings by simply skipping these lines.
cat << EOF > t.txt
> First line
1 2 3
4 5
3 2 1
> second
6 5
6 7 8
9 9 9
EOF
gmt convert t.txt
gmtconvert [WARNING]: Mismatch between actual (2) and expected (3) fields near line 3 in file t.txt
gmtconvert [WARNING]: Mismatch between actual (2) and expected (3) fields near line 6 in file t.txt
> First line
1 2 3
3 2 1
> second
6 7 8
9 9 9
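Until such an option exists, one possible workaround (my suggestion, not a GMT feature) is to pre-filter the file with awk, passing segment headers through and keeping only rows with the expected field count. Note that NF splits on whitespace; a comma-separated file would need awk -F','.

```shell
# Recreate the example table from above
cat << EOF > t.txt
> First line
1 2 3
4 5
3 2 1
> second
6 5
6 7 8
9 9 9
EOF
# Keep segment headers (lines starting with ">") and rows with exactly 3 fields
awk '/^>/ || NF == 3' t.txt
```

Applied to the t.txt above, this prints the same six lines that gmtconvert emits after its warnings.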
But I need more tests to make sure I implemented it where it needs to go.
So the story is that there are 36 GMT modules that read line-by-line. Many (but not all) of these need to do this since they may see gazillions of records, so we don't want to load a giant file into memory. All of those modules check whether the input record is sensible, and if we get < n_columns (e.g., 2 when 3 is expected) the module calls gmt_quit_bad_record and we are out. gmtconvert reads the entire file into memory, so it and other such modules take a different path through the i/o.
Here are some options I can think of, all of which would remain backwards compatible. (1) Some users may wish to have an error exit and check the status of the previous command via $?. How about adding another IO default:
IO_BAD_RECORD = skip|fail (or error) [fail]
With a default of fail we retain current behavior. Folks who wish to simply skip incomplete records (like those above) would set this to skip.
The alternative is of course to (2) leave things as they are, or (3) always skip and never error.
I can entertain a discussion of what the default should be before proceeding.
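If option (1) were adopted, usage might look like this. IO_BAD_RECORD is the proposed setting and does not exist yet; gmt set and the per-command --PAR=value override are existing GMT mechanisms:

```shell
# Hypothetical: the proposed IO_BAD_RECORD default is not yet implemented
gmt set IO_BAD_RECORD skip                # skip incomplete records globally
gmt convert t.txt --IO_BAD_RECORD=skip    # or override for a single command
```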
I concur with @seisman,
Sometimes I download data and/or store analysis results quickly without proper checking. These warnings are a safeguard in that regard.
Whether it should exit or not is indeed debatable. My two cents on it:
If it’s a big project, I’d rather have it stop than use a lot of resources for nothing. But I can fix things manually if I get a proper warning early.
If it’s a small project, it may be faster to check results first and ask questions later. But I don’t improve by hiding the dust under the carpet.
(1) sounds great.
Is gmt convert special, since you mention it specifically? Is it used to pre-process under the hood?
And note: there is absolutely a need to report lines with the wrong number of columns. Even quitting is often sensible.
My experience is that very often, csv files, etc., will have some malformed lines.
This may be due to sloppy work, not enough caffeine, etc. The point being: the safeguard (which, again, should be there) often leads to hair-pulling and frustration. Therefore, it would be great if you could turn it off when you want to, like the scheme presented by @pwessel would allow.
gmtconvert is one of numerous table-readers that call GMT_Read_Data on the input file(s), so the checking is in the belly of the GMT API. gmtinfo is one of the 36 modules that read line by line via GMT_Get_Record, and there the checking is in the module. Hence the tests are applied in two completely separate places (36 places in one case, 1-2 in the other). So anything but option (2) involves editing ~40 source files.