Computing mean across rows with gmtmath?

The title pretty much states the problem.

Given a table (matrix) of m (cols) by n (rows), is there an easy way, using gmtmath, to compute the mean (i.e., average) across each row?

The table is in binary form; otherwise I would use awk to transpose rows to columns and then feed the result into gmtmath, which works by column.
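
(For a text table, the standard awk transpose would be something like the sketch below, though it buffers the whole file in memory and assumes a constant field count.)

awk '{ for (i = 1; i <= NF; i++) a[i,NR] = $i; ncol = NF }
     END { for (i = 1; i <= ncol; i++) {
               row = a[i,1]
               for (j = 2; j <= NR; j++) row = row OFS a[i,j]
               print row
           } }' infile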

Gmtconvert could be used to put the table in text format and then transpose it, but that adds to the runtime. Actually, the runtime is not that bad, but gmtmath didn’t like the width of the text table, which (after transposing) had 483 columns. I got lots of warnings:

gmtmath [WARNING]: Long input record (4898 bytes) was truncated to first 4094 bytes!

How can I get past this? Any ideas?

That would be so simple with GMT.jl that it even hurts

# Load the packages (Statistics provides mean)
julia> using GMT, Statistics

# Invent a table with 10 columns
julia> gmtwrite("lixo.dat", mat2ds(rand(5,10)))

# Load the pretend-to-be dataset
julia> D = gmtread("lixo.dat")
BoundingBox: [0.0581610695909, 0.980840753115, 0.072613296562, 0.866659845031, 0.107663139704, 0.908576186605, 0.354323700223, 0.678818573041, 0.0271959458778, 0.6064954132, 0.071393531721, 0.927347389372, 0.01304400113, 0.696648710976, 0.0417805833876, 0.827697720595, 0.211541809729, 0.986194363033, 0.0415487557649, 0.732005040647]
5×10 GMTdataset{Float64, 2}
 Row │     col.1      col.2     col.3     col.4      col.5      col.6     col.7      col.8     col.9     col.10
     │   Float64    Float64   Float64   Float64    Float64    Float64   Float64    Float64   Float64    Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.0581611  0.59417    0.107663  0.678819  0.606495   0.114512   0.113356  0.463102   0.439212  0.303902
   2 │ 0.892739   0.86666    0.696239  0.354324  0.540871   0.518044   0.122986  0.728437   0.644147  0.677205
   3 │ 0.460035   0.663536   0.908576  0.606943  0.541868   0.0713935  0.175379  0.596579   0.986194  0.0415488
   4 │ 0.827295   0.502392   0.662726  0.580983  0.0271959  0.927347   0.013044  0.0417806  0.211542  0.732005
   5 │ 0.980841   0.0726133  0.471736  0.531702  0.380023   0.82452    0.696649  0.827698   0.615786  0.704942

# Compute averages of each row
julia> mean(D,dims=2)
5×1 Matrix{Float64}:
 0.34793926271188996
 0.6041650133401001
 0.5052053096779902
 0.45263105080714006
 0.6106508736944

Ouch! Yes, that hurts! I’m afraid that I’ve never met Julia.

Paul, is there a simple way to increase the maximum byte size for the width of the input table? Or is this limited by the C language?

To do what I was hoping to do will require table widths on the order of 500,000 bytes (or more!). That is, I want to compute mean values over ~25,000 columns, all with one command!

It would be just as simple in PyGMT or GMT/MEX, of course, but tricky in bash.
Sounds like your input file is organized the wrong way, i.e., the long direction should run down the columns (same variable in each column), and then it is easy with gmtmath; but presumably you have no control over that file format (?).
Unless you do this all day long and cannot wait for a slow GMT transposing step, I would convert the table to a dummy grid, transpose it with grdedit, dump it to a binary table via grd2xyz, and then run gmtmath on those columns.
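
For example, a sketch of that round-trip for a single-precision table with 710 columns and 483 rows (the -R/-I numbers and the -Et transpose modifier are illustrations to adapt to your file and GMT version):

gmt xyz2grd table.bin -ZTLf -R0/709/0/482 -I1 -Gtmp.grd   # raw floats -> 710 x 483 grid
gmt grdedit tmp.grd -Et -Gtmp_t.grd                       # transpose the grid
gmt grd2xyz tmp_t.grd -ZTLf | gmt math -bi483f -bo483f -Ca -Sl STDIN MEAN = rowmeans.bin

The final record then holds one mean per original row.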

Alternatively, do as Joaquim did (or the equivalent in MATLAB or Python): have your bash script write the MATLAB, Julia, or Python script, then call it, e.g.,

matlab -nodisplay < mymex.m

And then continue with whatever.
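
A minimal sketch of that wrapper pattern, assuming a plain 710-column single-precision table named t1m2.bin (both are my assumptions; adjust to your data):

cat > rowmeans.m <<'EOF'
% fread fills column-wise, so transposing restores one record per row
A = fread(fopen('t1m2.bin', 'r'), [710 Inf], 'float')';
% row means, skipping NaN bins, written as one column of text
dlmwrite('rowmeans.txt', mean(A, 2, 'omitnan'), 'precision', '%.10f');
EOF
matlab -nodisplay < rowmeans.m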

Thanks, Paul, for the reply. The problem can be solved fairly easily in awk.

From a post I discovered on the StackExchange website, something like this can be done to compute the mean of each row of a table:

awk ' { sumrows = 0; lcol = 0
    for (i = 1; i <= NF; i++) {
        # skip NaN fields so they do not contaminate the sum
        if ( $i !~ /NaN/ ) { sumrows += $i; lcol++ }
    }
    # guard against rows that are all NaN
    print (lcol > 0 ? sumrows/lcol : "NaN") } ' infile

In my case, I wanted to avoid NaN values, thus the need for the if-statement.

The awk command produces a single column of mean values, one computed for each row.

The input file is not exactly “wrong,” but each row can vary in its number of non-NaN fields. Here’s the background story:

Each row contains values sampled every 10 m. Each whole row represents a “segment,” and a row can have as many as 710 bins (extending out to 7,100 meters). There can be thousands of these rows (segments) in the file I’m working from.

The 4094 limit is a GMT limitation when reading text files. There is no way around it, not even with my dummy Julia example. But it is no problem if the file is in binary.

Thanks, Joaquim. Understand the limitation.

I think I found a work-around. I’ll convert the binary file to text and pass it through the awk command. There is no need to switch rows to columns, since awk works straight through on the data in rows.

Sorry to take up your time with my troubles. I just needed to search for alternatives.

Hi @John_Geodesy -

Long boring flight from NEW to LAX gave me time to review a paper and implement this PR:

gmt convert -Z

which will transpose the single segment in the input; trailing text obviously gets lost. It seems to work (I added a test in the PR) - give it a spin on your large files.

gmt convert -Z your.txt | gmt math -Ca -Sl STDIN MEAN =

The PR was just merged.

@John_Geodesy you could use Paul’s solution from the dev version.

Thanks for implementing the PR. I updated to the latest dev version.

I’m not entirely clear on what is meant by “single segment.” Perhaps the documentation could phrase this so that most users will understand it.

For example, this command completes successfully on row zero of my large data set (t1m2.bin):

> gmt gmtconvert -bi710f -q0 -Z t1m2.bin | gmt gmtmath -Ca -Sl STDIN MEAN =
0.3302355750

But the command fails if I try to feed in the whole data set (likely because the intermediate output is text, so the long transposed records hit the 4094-byte limit):

> gmt gmtconvert -bi710f -Z t1m2.bin | gmt gmtmath -Ca -Sl STDIN MEAN = tmean2.txt 
gmtmath [WARNING]: Long input record (6433 bytes) was truncated to first 4094 bytes!
gmtmath [WARNING]: Long input record (6378 bytes) was truncated to first 4094 bytes!

However, given that the number of rows in my input table can vary (the number of columns is fixed at 710), the row count can first be ascertained with gmtinfo and then fed to the command to make it complete successfully. This command sequence works:

# Count the rows (records) in the binary table
nrows=$(gmt gmtinfo -bi710f -i0 t1m2.bin | awk ' { print $4 } ')
# Transpose in binary, then average each column of the transposed table
gmt gmtconvert -bi710f -bo${nrows}f -Z t1m2.bin | gmt gmtmath -bi${nrows}f -bo${nrows}f -Ca -Sl STDIN MEAN = tmean2.bin
# The means come back as one ${nrows}-value record; transpose again to a one-column vector
gmt gmtconvert -bi${nrows}f -bo1f -Z tmean2.bin > tmean2vec.bin

This sequence yields a vector of mean values, one computed for each row of the original table. It seems very fast and very efficient. With binary input (used to avoid hitting the 4094-byte limit), it is important to supply the number of rows found in the original table.

Yes, I forgot that for binary we must indicate the number of columns, since there is no way to figure that out otherwise. You could probably get rid of that ugly awk pipe by using -o3 in gmt info instead?

It might be useful to add the essence of your nrows step to the man page, since for large binary files you will need to know the number of rows (it becomes the number of columns after transposition). In fact, perhaps gmt convert could report this under -Vi so that it is available without that second call.

Now, -V will report the dimensions of the transposed matrix.

Regarding your comment

I’m not entirely clear on what is meant by “single segment.” Perhaps this can be written so that most users will understand it

Since the 1990s, GMT datasets can be composed of one or more data segments, separated by “> whatever comments” lines. Since transposing applies to a matrix, there can only be one “matrix” in the file, hence a single segment.
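
For example, a two-segment text table looks like this, where each line starting with “>” opens a new segment:

> first segment
1 2 3
4 5 6
> second segment
7 8 9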

Awesome, Paul. I was thinking along those lines: that the phrase “single segment” referred to tables of data (line) segments with the “>” separator. Yes, in my case, I was working with a single-segment table.

WRT the usage of -V, can an example be provided? I tried it as follows and wonder whether this is how it should be used, where /dev/null is just a place to stuff the unimportant output:

> gmt gmtconvert -bi710f -V t1m2.bin > /dev/null
gmtconvert [INFORMATION]: Processing input table data
gmtconvert [INFORMATION]: Input 710 columns via binary records using format ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
gmtconvert [INFORMATION]: Reading Data Table from File t1m2.bin
gmtconvert [INFORMATION]: Write Data Table to <stdout>
gmtconvert [INFORMATION]: 1 tables concatenated, 483 records passed (input cols = 710; output cols = 710)

What would be the best way to trap the value for the number of records?
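
In the meantime, something like this can scrape the count from the -V report, assuming the log wording stays as shown above (note that -V messages go to stderr):

nrec=$(gmt gmtconvert -bi710f -V t1m2.bin 2>&1 >/dev/null | sed -n 's/.* \([0-9][0-9]*\) records passed.*/\1/p')
echo $nrec    # 483 for this file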

The -o3 suggestion didn’t seem to work, though:

> gmt gmtinfo -bi710f -i0 t1m2.bin -o3
NaN

Compared to the ugly awk pipe:

> gmt gmtinfo -bi710f -i0 t1m2.bin | awk ' { print $4 } '
483

Hm, that is not expected. Could you attach a small piece of the binary file that demonstrates the same problem (so I can debug)?

Attaching the first 11 rows of my t1m2.bin file, now called tpart.bin: tpart.bin.zip (17.8 KB)

Here is what I saw when I ran the following commands:

> gmt gmtinfo -bi710f -i0 tpart.bin
tpart.bin: N = 11	<-0.2901957333/1.0185788870>
> gmt gmtinfo -bi710f -i0 -o3 tpart.bin
NaN

This next command was to see whether gmtinfo could handle a 710-column-wide binary data set without specifying an input column:

> gmt gmtinfo -bi710f tpart.bin
ERROR: Caught signal number 4 (Illegal instruction: 4) at
0   libsystem_c.dylib                   0x00007ff81753f363 __chk_fail_overflow + 16
1   libsystem_c.dylib                   0x00007ff81753f363 __chk_fail_overflow + 16
Stack backtrace:
0   libgmt.6.5.0.dylib                  0x0000000110714df8 gmt_sig_handler_unix + 216
1   libsystem_platform.dylib            0x00007ff81760adfd _sigtramp + 29
2   ???                                 0x0000000000000000 0x0 + 0
3   libsystem_c.dylib                   0x00007ff8174c64fb __cxa_atexit + 0
4   libgmt.6.5.0.dylib                  0x000000011081cae4 GMT_gmtinfo + 15876
5   libgmt.6.5.0.dylib                  0x000000011060df97 GMT_Call_Module + 2535
6   gmt                                 0x000000010fed2299 main + 1785
7   dyld                                0x000000011516752e start + 462

I gave you wrong info. You need -Fd with -o:

gmt gmtinfo -bi710f -i0 tpart.bin -Fd -o2

11

As for the crash, it is related to overly long output ASCII strings, but when I tried -bo I got the same crash, so there is more work to be done there. Thanks for the data and example.

Need help from @Joaquim on this (need branch mathcol-check): I updated gmtinfo to handle output summary strings longer than 4096 by allocating the string, etc. However, I get odd crashes every few runs. The command is

gmt gmtinfo -bi710f tpart.bin

and @John_Geodesy’s file was attached in his earlier message. If you get any clues, please share.