Using binstats for quantile calculations in PyGMT

Hi,

I have a lon/lat/depth dataset where I want to compute quantiles (e.g., the 95th percentile depth) on a spatial grid. This seems like a good problem for the binstats function, e.g.:

import numpy as np
import pygmt

# statistics bin by bin
bdata = pygmt.binstats(
    data,  # lon/lat/depth data
    statistic="q95",
    empty=np.nan,
    search_radius="10k",
    spacing="5k",
)

While this code runs, it seems to give smaller values in each grid cell than when I ask for the median depth:

# statistics bin by bin
bdata = pygmt.binstats(
    data,  # same depth data
    statistic="m",
    empty=np.nan,
    search_radius="10k",
    spacing="5k",
)

This is a bit puzzling to me; I am not aware of a mathematical situation where the median value could be greater than the 95th percentile. Am I doing something wrong?
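For reference, the quantiles of the pooled depth values behave as expected when computed directly (a minimal check, assuming data is the (N, 3) lon/lat/depth NumPy array used above):

import numpy as np

# Pooled (non-spatial) quantiles of the raw depth column; for any
# one sample, q50 can never exceed q95.
q50, q95 = np.quantile(data[:, 2], [0.50, 0.95])
print(q50, q95)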

Thanks for your help and all of your hard work developing this code!

Daniel

It would be a Nobel prize in mathematics if we could show that the median (q50) is greater than q95. Sadly, it seems slightly more likely that there is a bug involved (or user error). Could you provide a subset of your data and the command that reproduces this craziness, and I can have a productive look at it?

Are your depths negative values, e.g., 0 is sea level and -20 is 20 meters below sea level? Maybe your data needs to be inverted for binstats?
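If so, something like this sign flip before binning would make q95 pick out the deepest values (just a sketch, assuming data is an (N, 3) lon/lat/depth array):

# Hypothetical fix if depths are negative-down: negate the depth
# column so larger values mean deeper before calling binstats.
data_flipped = data.copy()
data_flipped[:, 2] *= -1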

data.txt (739.8 KB)

Thanks for looking into this! I’ve attached the dataset. In this dataset, positive depths are deeper, so q95 is what I want in the end. I’ve tried different quantiles and they give strange, nonmonotonic results, so I think I’m either doing something wrong or there is a bug.

# 95th percentile
print("Q95 Results:")
bdata = pygmt.binstats(
    "data.txt",
    statistic="q95",
    empty=np.nan,
    search_radius="10k",
    spacing="5k",
)
print(bdata)

# Median
print("\nMedian Results:")
bdata = pygmt.binstats(
    "data.txt",
    statistic="m",
    empty=np.nan,
    search_radius="10k",
    spacing="5k",
)
print(bdata)

What are you using for -Rw/e/s/n and -Idx[/dy]? I see your search radius is 10 km, but that is a separate parameter. It is perfectly possible to ask for a 30 arc-sec grid (-I30s) and use all data inside a 10 km circle centered on each node. If the radius is significantly larger than the cell size, then it basically does filtering. Remember, we search for all data points inside those circles, so the results will look different from what the blockm* modules return on a per-rectangular-bin basis.

It sounds like you imply you want blockmedian -T0.95. But I would like to know what -R and -I you used, in case I can hunt down an issue.
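In PyGMT that would look roughly like this (an untested sketch; -T has no named PyGMT alias, so it is assumed here that it can be passed through as a raw single-letter option):

import pygmt

# Per-rectangular-bin 95th percentile via blockmedian -T0.95
# (contrast with binstats, which searches circles around each node).
# The region and spacing below are placeholders; fill in your -R/-I.
q95_bins = pygmt.blockmedian(
    data="data.txt",
    spacing="5k",
    region=[-121.35, -118.5, 37.65, 40.95],
    T=0.95,  # quantile to report per bin (default 0.5 = median)
)
print(q95_bins)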

That’s a good question. In the code above I hadn’t specified the region, but it seemed to be selected appropriately from the span of the data. I tried again with the region hard-coded and got the same result. I think what I’m trying to do is consistent with the idea of binstats, which is to compute a statistic on a final spatial grid. I realize the circles will overlap in this case, but the basic result (median values larger than q95) seems to hold regardless of my binning.

# 95th percentile
print("Q95 Results:")
bdata = pygmt.binstats(
    "data.txt",
    statistic="q95",
    empty=np.nan,
    search_radius="10k",
    spacing="5k",
    region=[-121.35, -118.5, 37.65, 40.95],
)
print(bdata)

# Median
print("\nMedian Results:")
bdata = pygmt.binstats(
    "data.txt",
    statistic="m",
    empty=np.nan,
    search_radius="10k",
    spacing="5k",
    region=[-121.35, -118.5, 37.65, 40.95],
)
print(bdata)
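For what it’s worth, comparing the two grids cell by cell shows the inversion is widespread rather than a one-off (a quick check that reruns the two calls above and keeps both results):

import numpy as np
import pygmt

# Recompute both grids, keeping each result, then count the cells
# where the reported median exceeds the reported 95th percentile.
common = dict(
    data="data.txt",
    empty=np.nan,
    search_radius="10k",
    spacing="5k",
    region=[-121.35, -118.5, 37.65, 40.95],
)
q95_grid = pygmt.binstats(statistic="q95", **common)
med_grid = pygmt.binstats(statistic="m", **common)
print("cells with median > q95:", int((med_grid > q95_grid).sum()))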

Sadly there is no Nobel prize in mathematics; apparently Nobel considered it too theoretical to benefit mankind. You’d probably get an Abel for that though.

Gossip might say he had a grudge against a mathematician…
If you’re under 40, you can try for the Fields Medal too.

Apparently Abel, Fields, and Nobel are out of the question when the nominee caused the exciting result by having bugs in the code. Now fixed in this PR, which will be approved in no time. Thanks for reporting this.

Great! Thank you for looking into this!

Now merged into master.

Hi,

Thank you for solving this. Is there a way to use GMT with the fix (without compiling it)? I have no idea how to install the file with the new code. I tried installing the latest version of PyGMT, but the issue remains.

I do this:

interval = pygmt.binstats(
    data,
    statistic="i",
    spacing=(100, 100),
    region=[min_x, max_x, min_y, max_y],
    search_radius=100,
)

and get negative values (values of q75 lower than the values of q25).

Thanks in advance

Regards

Miguel

Either wait a few days for GMT 6.5 to be released or build from source. Not sure what else needs to be done on the PyGMT side.

You can try installing the dev version of GMT 6.5.0 from conda-forge. The 6.5.0.dev9+0776994 version from https://github.com/conda-forge/gmt-feedstock/pull/279 should have the bugfix (https://github.com/GenericMappingTools/gmt/pull/8243):

conda install -c conda-forge/label/dev gmt=6.5.0
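Once installed, you can confirm which GMT library PyGMT actually picked up:

import pygmt

# Prints PyGMT, the linked GMT C library, and dependency versions;
# the GMT line should show the 6.5.0 dev build after the update.
pygmt.show_versions()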

Thank you very much for your help and recommendations. I will try to install it.
I mentioned PyGMT because I do not know the difference between GMT and PyGMT in terms of how and where they operate (not sure where the bug was :)). This is what I meant.

Regards

Miguel

PyGMT is a wrapper around GMT’s C API, and in this case, the bug in binstats was in the GMT C library, so you would need the newer GMT 6.5.0 dev version (keeping the same PyGMT v0.10.0 version is fine). If you’re unsure, it’s always ok to ask :smiley: