Skip lines with "mismatch between actual (2) and expected (3) fields..."

Andreas · October 9, 2023, 9:32am

I regularly end up getting Mismatch between actual (2) and expected (3) fields near line… when reading files. Is there an option to ignore the lines that cause this mismatch and complete the operation?

 $ gmt info -I50 -i8,7,15 data.csv
gmtinfo [WARNING]: Mismatch between actual (2) and expected (3) fields near line 19688 in file data.csv

jjg · October 9, 2023, 11:06am

A workaround …

cut -d, -f1,2 data.csv | gmt info ...

Andreas · October 9, 2023, 11:18am

Thanks, but it doesn’t help if field 1 or 2 is missing data (blank).

PlanetGus · October 9, 2023, 11:40am

Is filling the empty cells an option (NaN, -99999, False, whatever)?

jjg · October 9, 2023, 11:50am

cut -d, -f1,2 data.csv | grep -E '.+,.+' | gmt info ...
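The same filter can also be sketched in awk, which makes the per-field test explicit. This is only a sketch: it assumes a plain comma-separated file with no quoted fields, and keep_complete is a hypothetical helper name, not a GMT or coreutils command.

```shell
# keep_complete N: pass through only CSV rows whose first N fields are all
# non-empty (fields holding only spaces or tabs count as empty).
# Assumes simple comma-separated input without quoted or escaped commas.
keep_complete() {
    awk -F, -v n="$1" '
        NF >= n {
            ok = 1
            for (i = 1; i <= n; i++)
                if ($i ~ /^[ \t]*$/) ok = 0
            if (ok) print
        }'
}
```

It would slot into the pipeline in place of grep, e.g. cut -d, -f1,2 data.csv | keep_complete 2 | gmt info ...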

pwessel · October 9, 2023, 12:45pm

Not too hard for us to skip the line instead of giving an error and quitting. Any issue with that for the “rappers”? @Joaquim @seisman

seisman · October 9, 2023, 1:05pm

They’re just warnings. No errors and no quitting.

For reasonable data tables passed to GMT, the number of fields should be the same for all rows. The warnings are actually very useful, and they usually mean that the data table passed to GMT is incorrect. So we shouldn’t silence these warnings by simply skipping these lines.

Andreas · October 9, 2023, 1:25pm

Hmm, I need to recheck this. I thought the warnings were show stoppers.

Agree that they are useful and should be printed.

pwessel · October 9, 2023, 1:26pm

I mean we can simultaneously:

Issue the warning but now for every such line (good for finding the problem)
Not stop execution with an ERROR. I get two messages: first a WARNING and then an ERROR with the same message. Then it quits.

Andreas · October 9, 2023, 1:30pm

So it does quit without finishing?

Here’s a new example.
I want to know the range of input cols 8,9 and 10.
Since there is a mismatch…, I get no result:

$ gmt info data.csv -I1 -i8,9,10
gmtinfo [WARNING]: Mismatch between actual (2) and expected (3) fields near line 31695 in file data.csv
$ *blank*

Not sure if it should…

do what it does now
report nan for col 10 or
just print out the range it was able to read

Cols 8 and 9 apparently have everything in order, since if I only check those, I get my result:

$ gmt info data.csv -I1 -i8,9
-R8305707/8312776/657084/660814

seisman · October 9, 2023, 1:46pm

I mean we can simultaneously:

Issue the warning but now for every such line (good for finding the problem)

Not stop execution with an ERROR. I get two messages: first a WARNING and then an ERROR with the same message. Then it quits.

Hmmm, yes, it should warn and continue.

Actually, I find the behavior is inconsistent. For example, the data file is:

1 2 3                                                                           
4 5                                                                                
6 7 8 9

The following command gives a warning and then silently quits without errors. Only the first data point is plotted.

$ gmt plot test.txt -R0/10/0/10 -Sc -pdf map
plot [WARNING]: Mismatch between actual (2) and expected (3) fields near line 2 in file test.txt

pwessel · October 9, 2023, 2:29pm

Please build from this branch, then this works

cat << EOF > t.txt
> First line
1 2 3
4 5
3 2 1
> second
6 5
6 7 8
9 9 9
EOF

gmt convert t.txt
gmtconvert [WARNING]: Mismatch between actual (2) and expected (3) fields near line 3 in file t.txt
gmtconvert [WARNING]: Mismatch between actual (2) and expected (3) fields near line 6 in file t.txt
> First line
1	2	3
3	2	1
> second
6	7	8
9	9	9

But need more tests to make sure I implemented it where it needs to go.

Andreas · October 10, 2023, 6:48am

I just built master, and was hoping that my initial problem would be gone, but it’s still there:

$ gmt info -I50 -i8,7,15 data.csv 
gmtinfo [WARNING]: Mismatch between actual (2) and expected (3) fields near line 19688 in file data.csv

To be expected…?

pwessel · October 10, 2023, 7:14am

Different route inside I think. Could you post a few lines of the file that triggers the message?

Andreas · October 10, 2023, 7:46am

Sure thing.
I have simplified; awk’ed relevant columns so that we don’t have to bother with gmt info -i.. stuff.

File (now) consists of triplets:

$ head -2 temp.csv 
660244.51693753 8305707.6246238025 0.01855089750218317
660244.633878774 8305709.270103678 0.020164467304224117

Let’s find the range:

$ gmt info -I1 temp.csv 
gmtinfo [WARNING]: Mismatch between actual (2) and expected (3) fields near line 19688 in file temp.csv

Ok? Let’s investigate.
(Showing some more lines for context):

$ sed -n '19687,19699p' temp.csv 
660132.4487523573 8312544.907028771 0.12923912446132652
659949.5566811168 8312435.849908027  <--- line 19688, ref. warning from gmt info
659949.3897713334 8312434.227916929 
659949.1878641804 8312432.612078833 
659949.0263897448 8312431.037566451 
659948.8520125422 8312429.443068489 
659948.6819503569 8312427.801996682 0.22122219764235304
659948.5354060156 8312426.172498143 0.23837520364580878
659948.3782451807 8312424.551411953 0.2557883059310993
659948.2247941371 8312422.996433336 0.2623604045800133
659948.1112157609 8312421.435754686 0.2567225807656601
659947.9955018739 8312419.818508312 0.24892660364641217
659947.8664303548 8312418.265789021 0.24445158135137451
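A quick way to list every such row up front, rather than relying on the first warning, is to count fields per line. A minimal sketch, assuming whitespace-separated columns as in temp.csv; report_short_rows is a hypothetical helper name, not part of GMT.

```shell
# report_short_rows N FILE: print the line number and field count of every
# non-blank row that has fewer than N whitespace-separated fields.
report_short_rows() {
    awk -v n="$1" 'NF > 0 && NF < n { printf "%d: %d fields\n", NR, NF }' "$2"
}
```

Called as report_short_rows 3 temp.csv, it should flag lines 19688 to 19692 for the excerpt above, assuming the arrow annotation is not part of the file itself.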

pwessel · October 10, 2023, 9:07am

So the story is that there are 36 GMT modules that read line-by-line. Many (but not all) of these need to do this since they may see gazillions of records, so we don’t want to load that giant file into memory. All of those modules check whether the input record is sensible, and if a record has fewer than n_columns fields (e.g., 2 when 3 are expected) the module calls gmt_quit_bad_record and we are out. gmtconvert reads the entire file into memory, so it and other such modules take a different path through the i/o.

Some options I can think of, which would also remain backwards compatible. (1) Some users may wish to have an error exit and can check the status of the previous command via $?. How about adding another IO default:

IO_BAD_RECORD = skip|fail (or error) [fail]

With a default of fail we retain current behavior. Folks who wish to just skip incomplete records (like above) would set this to skip.

The alternative is of course to (2) leave as it was or (3) just skip, never error.

I can entertain a discussion of what the default should be before proceeding.
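Until something like this exists, the two proposed behaviors can be emulated outside GMT. The sketch below is not GMT code, and IO_BAD_RECORD is only a proposal in this thread; filter_records and its skip/fail argument are hypothetical names that mimic the suggested semantics for whitespace-separated input.

```shell
# filter_records MODE N: emulate the proposed skip|fail handling in a pipeline.
#   skip: warn on stderr about each short record, drop it, and keep going.
#   fail: warn about the first short record, stop, and exit with status 1.
# Segment headers (lines starting with ">") and blank lines pass through.
filter_records() {
    awk -v mode="$1" -v n="$2" '
        BEGIN { bad = 0 }
        /^>/ || NF == 0 { print; next }
        NF < n {
            printf "Mismatch between actual (%d) and expected (%d) fields near line %d\n", NF, n, NR > "/dev/stderr"
            if (mode == "fail") { bad = 1; exit }
            next
        }
        { print }
        END { exit bad }'
}
```

For example, filter_records skip 3 < temp.csv | gmt info -I1 would let the range be computed from the complete rows only, while filter_records fail 3 < temp.csv leaves a non-zero status in $? for scripts that want to stop.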

PlanetGus · October 10, 2023, 9:40am

I concur with @seisman,
Sometimes I download data and/or quickly store analysis results without proper checking. These warnings are a safeguard in that regard.
Whether it should exit or not is debatable indeed. My two cents:
If it’s a big project, I’d rather have it stop than use a lot of resources for nothing. But I can do it manually if I have a proper warning early.

If it’s a small project, it may be faster to check results first and ask questions later. But I don’t improve by hiding the dust under the carpet.

Andreas · October 10, 2023, 9:47am

(1) sounds great.
Is gmt convert special, since you mention it specifically? Is it used to pre-process under the hood?

And note; there is absolutely a need to report lines with wrong number of columns. Even quitting is often sensible.
My experience is that very often, csv files, etc., will have some malformed lines.
This may be due to sloppy work, not enough caffeine, etc. The point being: the safeguard (which, again, should be there) often leads to hair-pulling and frustration. Therefore, it would be great if you could turn it off if you want to, as the scheme presented by @pwessel would allow.

pwessel · October 10, 2023, 10:03am

gmtconvert is one of numerous table readers that call GMT_Read_Data on the input file(s), so the checking is in the belly of the GMT API. gmtinfo is one of 36 modules that read line by line via GMT_Get_Record, and the checking is in the module. Hence the tests are applied in two completely separate places (36 places in one case, 1-2 in the other). So anything but option 2 involves editing ~40 source files.

Andreas · October 10, 2023, 10:57am

Oh, I see. Thanks.

So IO_BAD_RECORD = skip|fail (or error) [fail] is not straightforward to implement?