Mapping coordinates to countries in a pandas dataframe? (in a way that scales)

I am trying to add a country field and a boolean flag to each row of a pandas dataframe with a few million records, each containing a lat and a long floating point field per row.

I’ve fetched administration level 2 country land area geojson from osm-boundaries.com and sea area shape files from marineregions.org.

My plan was to add two extra columns to my pandas dataframe: a string column for the country and a boolean for land vs. sea area.

So far, I’ve tried geopandas, but all my attempts at converting the lat/long info to country/land-sea info scaled so badly that even adding the info to a small chunk of my data frame takes ages (something like 50 or 60 rows per second).

I tried a combination of two apply calls on the data frame together with the geopandas “cx” indexer. I have a strong feeling that there should be a much faster and more efficient way to do this operation, but I’m at a loss right now.

I’m hoping, given the audience of this forum, maybe someone here has faced a similar question before.

Is there a way to do this task that takes minutes, not many many hours? Any input would be highly appreciated.

So, you want to know which country each of your million points belongs to and whether it is on land or on water? Or only the latter? Because the latter is easily achievable with grdlandmask and grdtrack and should take only some tens of seconds.
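To illustrate why a mask-based lookup is fast, here is a minimal Python sketch of the general idea: rasterize land/sea once onto a regular grid, then sample that grid at all points in a single vectorized step. This is not GMT's grdlandmask or grdtrack; the function `sample_mask`, the 1-degree resolution, and the rectangular "island" are all made up for the example.

```python
import numpy as np

def sample_mask(mask, lat, lon, lat_min=-90.0, lon_min=-180.0, step=1.0):
    """Look up the grid cell containing each (lat, lon) pair, all at once."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # Convert coordinates to row/column indices of the containing cell.
    rows = np.floor((lat - lat_min) / step).astype(int)
    cols = np.floor((lon - lon_min) / step).astype(int)
    # Keep points sitting exactly on the upper edge inside the grid.
    rows = np.clip(rows, 0, mask.shape[0] - 1)
    cols = np.clip(cols, 0, mask.shape[1] - 1)
    return mask[rows, cols]

# Hypothetical 1-degree global mask: all sea except one rectangular "island"
# covering latitudes 10..20 N and longitudes 20..40 E.
mask = np.zeros((180, 360), dtype=bool)
mask[100:110, 200:220] = True

is_land = sample_mask(mask, [15.0, 15.0, -30.0], [30.0, 100.0, 30.0])
```

The cost per point is a couple of arithmetic operations and one array lookup, so millions of points are no problem. The price is accuracy near coastlines, which is limited by the grid resolution.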

I want to know both. I have one geojson file with all country land areas and one shape file with all country extended sea regions.


I am hoping I can use these in a way that can expand the few million record dataframe in minutes, not (many) hours, with the two extra columns.

GMT also has country polygons in pscoast (DCW) and land/sea areas that can be used to build grid masks. I’m not sure I get what you mean by country extended sea regions. Land + EEZ, for example? GMT does not have that.

Anyway, if you want to assign a country to a dataset of a million points, there is no escape other than to loop over those million points and find the nationality of each. A slow process for sure. grdtrack can do it but would need some programming.

And it would help if you posted the code for what you have tried so far (but be warned, Python is not for me).

What I tried looked something like this:

import pandas as pd
import geopandas as gpd

data = pd.read_csv("data_with_latlong.csv")
land_mass = gpd.read_file("osmboundaries.geojson")[["name", "geometry"]]
sea_mass = gpd.read_file("eez_v11.shp")[["SOVEREIGN1", "geometry"]]

EPS = 0.000001  # side length of the tiny box queried around each point

def find_country(lat, lon):
    # Land polygons take priority; fall back to the sea areas.
    countries = land_mass.cx[lon:lon + EPS, lat:lat + EPS]
    if len(countries):
        return countries["name"].values[0]
    countries = sea_mass.cx[lon:lon + EPS, lat:lat + EPS]
    if len(countries):
        return countries["SOVEREIGN1"].values[0]
    return None

def land_or_sea(lat, lon):
    # True for land, False for sea, None if the point is in neither file.
    if len(land_mass.cx[lon:lon + EPS, lat:lat + EPS]):
        return True
    if len(sea_mass.cx[lon:lon + EPS, lat:lat + EPS]):
        return False
    return None

# One Python-level call per row, done twice: this is the slow part.
data['country'] = data.apply(lambda x: find_country(x['latitude'], x['longitude']), axis=1)
data['land'] = data.apply(lambda x: land_or_sea(x['latitude'], x['longitude']), axis=1)

With respect to the sea area data, here is what the UK sea area looks like in the sea_mass file:

image

and this if Ireland and the UK are added:

As opposed to the standard territorial sea region assigned to a country, the sea areas touch in this file. I'm not sure exactly which legal frameworks result in two distinct sea areas for a country, a standard one and an extended one, but for this use case I need the extended one.

I see. Those areas are what countries have submitted to the UN under the famous Article 76 to claim sovereignty rights over the SEA BOTTOM, which can, in the best cases, extend to 350 NM, … but have to cope with neighbors' claims.

But regarding your case, I see no GMT usage. Only Python.

This is a standard GIS operation, so it would be a lot easier in a GIS such as QGIS.
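For reference, the GIS operation in question is a point-in-polygon spatial join, and geopandas can do it directly with `sjoin`, which goes through a spatial index instead of calling a Python function per row. Below is a minimal sketch, assuming geopandas 0.10 or newer (for the `predicate` keyword) and a dataframe with a unique index. The toy rectangles and the names "Aland" and "Bland" are invented stand-ins for the real osm-boundaries and marineregions files, and the helper `add_country_and_land` is hypothetical.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

# Toy stand-ins for the real polygon files (assumption: in practice these
# would come from gpd.read_file(...) as in the code posted above).
land_mass = gpd.GeoDataFrame(
    {"name": ["Aland", "Bland"]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)],
    crs="EPSG:4326",
)
sea_mass = gpd.GeoDataFrame(
    {"SOVEREIGN1": ["Aland", "Bland"]},
    geometry=[box(-1, -1, 1.5, 2), box(1.5, -1, 4, 2)],
    crs="EPSG:4326",
)
data = pd.DataFrame(
    {"latitude": [0.5, 0.5, 1.5, 10.0], "longitude": [0.5, 2.5, 1.2, 10.0]}
)

def add_country_and_land(data, land_mass, sea_mass):
    """Return a copy of data with 'country' and 'land' columns added."""
    points = gpd.GeoDataFrame(
        geometry=gpd.points_from_xy(data["longitude"], data["latitude"]),
        index=data.index,
        crs="EPSG:4326",
    )
    # Left joins keep every point; unmatched points get NaN attributes.
    on_land = gpd.sjoin(points, land_mass[["name", "geometry"]],
                        how="left", predicate="within")
    at_sea = gpd.sjoin(points, sea_mass[["SOVEREIGN1", "geometry"]],
                       how="left", predicate="within")
    # A point inside overlapping polygons matches more than once;
    # keep only the first match per original row.
    land_name = on_land.loc[~on_land.index.duplicated(), "name"]
    sea_name = at_sea.loc[~at_sea.index.duplicated(), "SOVEREIGN1"]
    land_name = land_name.reindex(data.index)
    sea_name = sea_name.reindex(data.index)

    out = data.copy()
    # Land takes priority; fall back to the sea area's sovereign.
    out["country"] = land_name.where(land_name.notna(), sea_name)
    # True on land, False at sea, missing if the point is in neither layer.
    found = land_name.notna() | sea_name.notna()
    out["land"] = land_name.notna().where(found)
    return out

result = add_country_and_land(data, land_mass, sea_mass)
```

How fast this runs on a few million rows depends on how detailed the polygons are, so it is worth timing on a sample of the real data first.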