Grdtrack results inconsistent when same data read with Xarray vs from GRD file

rtwalker19 · October 15, 2021, 7:50pm

Hello,

I’m getting slightly different results when using pygmt.grdtrack on the same dataset, depending on whether I read the dataset directly from the GRD file or with xarray.load_dataset(). I strongly suspect it has something to do with float32 vs float64, so I’ll lay out an example as briefly as I can.

I have several GRD files (netCDF 4 format) with data as float32 and the coordinates (lon/lat) as float64, which is necessary because the grid is 128 pixels per degree. For this example, I cut down to a 16x16 grid centered on (45, 45).

rtwalker@pgda101:~$ gmt grd2xyz small.grd --FORMAT_FLOAT_OUT=%.30g
44.98828125	45.01171875	-0.563958942890167236328125
44.99609375	45.01171875	-0.56336867809295654296875
45.00390625	45.01171875	-0.564554393291473388671875
45.01171875	45.01171875	-0.5662910938262939453125
44.98828125	45.00390625	-0.554759800434112548828125
44.99609375	45.00390625	-0.551300346851348876953125
45.00390625	45.00390625	-0.553443372249603271484375
45.01171875	45.00390625	-0.556558072566986083984375
44.98828125	44.99609375	-0.546033084392547607421875
44.99609375	44.99609375	-0.54321348667144775390625
45.00390625	44.99609375	-0.545960903167724609375
45.01171875	44.99609375	-0.5504894256591796875
44.98828125	44.98828125	-0.54240500926971435546875
44.99609375	44.98828125	-0.544292449951171875
45.00390625	44.98828125	-0.545701205730438232421875
45.01171875	44.98828125	-0.547770082950592041015625

Using dem = xarray.load_dataset(‘small.grd’), I’ve verified that the values and coordinates have exactly the same values and are the right data types in xarray. I then define a dataframe that just contains (45, 45).

After running

df = pygmt.grdtrack(df, 'small.grd', newcolname='elevation_grd', coltypes='g')
df = pygmt.grdtrack(df, dem.z, newcolname='elevation_xr', coltypes='g')

I get this from df.head():

longitude	latitude	elevation_grd	elevation_xr
45	        45	        -0.54725	        -0.54848

which is not good.

However, if I do this:

dem64 = dem.assign(z64=dem.z.astype('float64'))
df = pygmt.grdtrack(df, dem64.z64, newcolname='elevation_xr64', coltypes='g')

then df becomes

longitude	latitude	elevation_grd	elevation_xr	elevation_xr64
45	45	-0.54725	-0.54848	-0.54725

so the result is the same as from reading the GRD file directly.

I don’t know the internals, but this looks to me like there’s a problem with the variable and coordinates being different precision floats. (Using float32 for everything would truncate the coordinate values, so that’s my best uninformed guess about why the result is different.) Is there a better way to handle this that doesn’t require casting the entire dataset to float64?

Thanks!
Ryan

pwessel · October 15, 2021, 8:08pm

GMT uses double precision for the coordinates and float32 for the z-values only. Could you post small.grd here for inspection?

rtwalker19 · October 15, 2021, 8:18pm

I’m sorry, I don’t see how I can attach a file in the forum.

rtwalker19 · October 15, 2021, 8:24pm

small.grd.zip (645 Bytes)

rtwalker19 · October 15, 2021, 8:25pm

(Sorry, end of week brain drain)

pwessel · October 15, 2021, 8:40pm

Not a pyGMT user so perhaps I am missing something. Just so we agree on the rules here: The 45,45 point is not one of the original data points so grdtrack will resample the surface using whatever interpolator chosen (controlled by -n on the CLI) and estimate what z might be at (45,45). On the CLI I get (with default bicubic spline interpolation):

echo 45 45 | gmt grdtrack -Gsmall.grd
45	45	-0.547249913914

but choosing a B-spline (-nb) gives

45 45 -0.548869582592

bilinear (-nl) gives

45 45 -0.548479527235

and finally nearest neighbor (-nn) picks the nearest node:

45 45 -0.545960903168

which is (45.00390625, 44.99609375).

Passing in float64 in pyGMT to eventually ends up in a float32 grid so it should not matter, except perhaps the rounding is different or there is something fishy in that section of the API that deals with a double precision matrix. Yet, the x and y coordinates are kept as float64 always.

rtwalker19 · October 15, 2021, 9:25pm

Thanks!

My results with pygmt.grdtrack reading small.grd directly agree with yours for all interpolation methods.

However, I saw that your result for bilinear interpolation matched my result for bicubic interpolation starting from Xarray. Which gave me a thought.

Running the other interpolation methods with pygmt starting from Xarray shows that I get the same result whether I specify bilinear, bicubic, or B-spline. By comparison with your results, it appears that I’m getting bilinear even when I ask for bicubic or B-spline. This is obviously no good. I’m wondering if there’s some setting in Xarray that’s causing this?

rtwalker19 · October 15, 2021, 9:31pm

And while I’m at it, why would forcing the data in Xarray to be float64 fix this?

pwessel · October 15, 2021, 9:41pm

I do not know that part. Let’s wait for someone familiar with the grdtrack wrapper to comment, e.g. @weiji14, @seisman or @maxrjones

rtwalker19 · October 15, 2021, 9:48pm

Have a great weekend! You’ve already saved me from considerable further pain.

weiji14 · October 16, 2021, 10:56am

It seems that xarray.open_dataset() loads data variables as float32 by default (see https://github.com/pydata/xarray/issues/2304), and there is a pending (albeit stagnant) pull request to force loading in data as float64 (https://github.com/pydata/xarray/pull/2751/files). So your method of casting to float64 using dem.z.astype("float64") might be a good interim solution for now for processing xarray grids (assuming that you are interested in millimetre level elevation precision ). Alternatively, you could always process the GRD files directly using pygmt.grdtrack(df, grid="small.grd", ...) which is likely to be faster.

I’m not entirely sure what we can or should do on the PyGMT side. Ideally we would try to convince xarray upstream to have the option of loading data directly into float64, but this still puts the onus on users to know that they need to use a non-default float64 setting for better precision. We could raise a warning in GMT/PyGMT, but float32 is still a perfectly reasonable precision for most applications so it might get annoying for >90% of users that don’t need high numerical precision.

rtwalker19 · October 16, 2021, 1:17pm

So, to clarify – It appears that I’m getting the right data types from Xarray (my data is float32):

dem = xarray.load_dataset(‘small.grd’)
type(dem.z.data[0, 0]), type(dem.x.data[0]), type(dem.y.data[0])
(numpy.float32, numpy.float64, numpy.float64)

What’s going wrong when I try bicubic interpolation is this:

df = pd.DataFrame.from_dict({‘longitude’: [45], ‘latitude’: [45]})
df = pygmt.grdtrack(df, ‘small.grd’, newcolname=‘el_grd_c’, coltypes=‘g’, interpolation=‘c’)
df = pygmt.grdtrack(df, dem.z, newcolname=‘el_xr_c’, coltypes=‘g’, interpolation=‘c’)
longitude latitude el_grd_c el_xr_c
45 45 -0.547249913914 -0.548479527235

If I try bilinear instead:

df = pd.DataFrame.from_dict({‘longitude’: [45], ‘latitude’: [45]})
df = pygmt.grdtrack(df, ‘small.grd’, newcolname=‘el_grd_l’, coltypes=‘g’, interpolation=‘l’)
df = pygmt.grdtrack(df, dem.z, newcolname=‘el_xr_l’, coltypes=‘g’, interpolation=‘l’)
longitude latitude el_grd_l el_xr_l
45 45 -0.548479527235 -0.548479527235

This suggests that the interpolation starting from Xarray is bilinear even when bicubic is specified. (The same happens for B-splines, but nearest neighbor works right.)

I have no idea why casting the data to float64 would remove the problem with apparently calling the wrong interpolation method. I definitely do not need that precision.

The reason I’m trying to use Xarray is that I have thousands of files each containing about 100,000 points that have to be interpolated on multiple grids. And I’m working on a cluster that is unfortunately very slow on I/O, so it should be a fair bit faster to keep the grids in memory via Xarray than to reopen the GRD files millions of times. Ideally, I’d like to construct an interpolator for each grid that could persist in memory, but that might require an entirely different package.

Joaquim · October 16, 2021, 3:00pm

GMT loads all grid data internally as float32, regardless of the original data type (loading images is different). The computations are often done as float64 but the data storage as well as reading are done in float32 variables (there is an compiling switch that can put all in float64 but that’s a different story)

rtwalker19 · October 16, 2021, 3:21pm

From my results and what I’ve learned from the community so far (thanks!), it looks to me like the float precision is OK and the primary problem is that the wrong interpolation method is being used when the input is from Xarray (bicubic specified but bilinear used). I know that Xarray has its own interpolation methods – is it possible that there’s some kind of conflict?

Joaquim · October 16, 2021, 3:36pm

I also can’t help you with the pygmt side but can tell that once the data lands in the GMT memory via memory pointer, it doesn’t matter where it came from (well, aside potential issues trying to free memory that it does not own). So all the methods in a Xarray are invisible (and irrelevant) to GMT, only the data (the float32 array) counts.

rtwalker19 · October 20, 2021, 2:44pm

Just checking again to see if any of the experts on the pygmt interface have any advice on the apparent problem with the wrong interpolation method being used with Xarray input. (It seems clear now that float32 vs float64 is not the problem.) Based on the example I gave above and results for my full grid, I’m getting bilinear interpolation when the input is Xarray and bicubic interpolation is requested.

maxrjones · October 20, 2021, 9:04pm

Thanks for raising this issue; I have opened a bug report in the GMT repository at https://github.com/GenericMappingTools/gmt/issues/5882. I’ll keep you posted regarding a resolution.

rtwalker19 · October 21, 2021, 12:01pm

Great, thanks Meghan!

maxrjones · October 21, 2021, 8:02pm

Hi @rtwalker19, currently the interpolation methods for xarray.DataArray input are limited to nearest-neighbor and bilinear. I have submitted a documentation fix to clarify this in the PyGMT repository and added a warning to GMT such that the user is notified with the interpolation method is reset.

We will also look into supporting other interpolation methods for xarray.DataArray input, tracked at https://github.com/GenericMappingTools/gmt/issues/5886. In the meantime, you would need to use file input in order to get bicubic interpolation.

rtwalker19 · October 21, 2021, 8:46pm

Thanks, the documentation surrounding using xarray.DataArray is very limited. And implementing the other interpolation methods would be helpful – depending on the efficiency of system I/O, having to repeatedly read the GRD files can be a performance issue.

Again, much appreciated, everyone!

Ryan