Unexpectedly Large Output File Size and Duplicate Timestamps in gmt sample1d

Hello GMTers,

I’ve encountered a significant issue while using the gmt sample1d command to resample a tidal gauge dataset. Despite defining a specific time range and interval, the output file is excessively large (13 GB) compared to the input file (75 MB). Additionally, an analysis of the output reveals an unexpectedly high number of duplicate timestamps.

Here is the command I executed:

gmt sample1d dummy2 -gx300s -T2018-09-14T07:25:00/2024-02-05T09:00:00/300s -Fa -V > dummy3

Problem Details:

  1. Input File (dummy2):
  • Contains a tidal gauge time series (1-minute sample rate spanning 2018-09-14T07:25:00 to 2024-02-05T09:00:00).
  • The record is mostly continuous within the defined time range, with some manageable gaps and out-of-order periods.
  2. Output File (dummy3):
  • File size: 13 GB.
  • Total lines: 569,083,143.
  • Unique timestamps: 990,725 (determined via cut -d' ' -f1 dummy3 | sort | uniq | wc -l).
  3. Expected Behavior:
  • The defined time range (2018-09-14 to 2024-02-05) with a 5-minute interval corresponds to 567,379 unique timestamps (a quick way to check this count is sketched after this list).
  • The output should align with this interval and contain no duplicate or excess lines.
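For reference, a minimal shell sketch (assuming GNU date is available) of how that expected count can be derived from the range endpoints:

start=$(date -u -d '2018-09-14 07:25:00' +%s)   # start of range as epoch seconds
end=$(date -u -d '2024-02-05 09:00:00' +%s)     # end of range as epoch seconds
echo $(( (end - start) / 300 ))                 # number of 300 s steps -> 567,379 (add 1 to count both endpoints)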

Observations:

  • The verbose output during execution indicates multiple segment headers due to detected data gaps:
sample1d [INFORMATION]: Data gap detected via -g; Segment header inserted near/at line # XXXXXXX
  • This suggests that the gaps in the input file may be handled improperly, leading to unnecessary duplication (a quick way to count the inserted segment headers is shown below).
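In case it helps to quantify this, counting the segment headers that -g inserted into the output (assuming the default ">" segment-header character) is straightforward:

grep -c '^>' dummy3   # number of segment headers written to the output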

Actions Taken:

  1. Verified the number of unique timestamps:
cut -d' ' -f1 dummy3 | sort | uniq | wc -l

Result: 990,725 unique timestamps.
  2. Inspected the input file for discontinuities or anomalies (e.g., unexpected gaps or NaN values).
  3. Attempted to identify the lines carrying duplicate timestamps, but the scale of the file made it difficult to process efficiently (a lighter-weight counting approach is sketched after this list).
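As a possible alternative to sorting the full 13 GB file, a single-pass awk tally is much lighter; this sketch assumes the timestamp is the first whitespace-separated field and that segment-header lines start with ">":

awk '!/^>/ {c[$1]++} END {for (t in c) if (c[t] > 1) print c[t], t}' dummy3 | sort -rn | head
# prints the ten most frequently repeated timestamps with their occurrence counts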

Questions:

Does anyone have an idea why the output file contains such a large number of duplicate lines, far exceeding the expected number of unique timestamps?

Any insights or recommendations for addressing this issue would be greatly appreciated.

Thank you in advance for your help!

Best regards,

Thanks for the detailed report, but can you please attach a trimmed version of the dummy2 file (zipped) so that we can reproduce this issue?

Dear Joaquim,

Thank you for your prompt response.

I have uploaded a truncated version of the input file (trunc_dummy2.dat) for testing. However, I strongly encourage you to try reproducing the issue using the full dataset.

Here is the link to download the full version of the file:

The issue occurs when executing the following command:
gmt sample1d dummy2 -gx300s -T2018-09-14T07:25:00/2024-02-05T09:00:00/300s -Fa -V > dummy3
on GMT 6.5.0 running under Ubuntu 24.04.1 LTS.
Please let me know if you need any further details.

Best regards,
Yiannis
trunc_dummy2.dat (874.6 KB)

Definitely a bug. It seems to get trapped in some NaN logic when it detects a gap and is not able to resume the interpolation on the other side of the gap.


Created an issue that also explains what is happening.

The workaround for now seems to be adding the option -s1 (see the command below).
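For anyone landing here with the same problem, the original command with the workaround applied would presumably read as follows (as I understand it, -s1 suppresses output records whose value in column 1 is NaN):

gmt sample1d dummy2 -gx300s -T2018-09-14T07:25:00/2024-02-05T09:00:00/300s -Fa -s1 -V > dummy3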


I confirm that using -s1 eliminates the excessive output size and duplicate timestamps, and the results now match the expected behavior.

Thank You!

A huge thanks to Joaquim for the quick response and thorough debugging. I really appreciate the support from the GMT community!

Best regards,
Yiannis
