What's the most optimal query in PgSql to find the nearest neighbor in a huge dataset?

Refresh

5 weeks ago

Views

19 time

2

I have a huge table (about 40 Millions rows) called nearest_spot representing lines (in the linestring format) and the closest spot they're to (there is about 1500 different spot, stored in another table). The nearest_spot table comes like this :

 data_id || spot_id || spot_name || link_geom 

Where data_id the the primary key, spot_id is a foreign key to the primary key of the spot table, spot_name is the spot name (I know redundancy isn't good but I'm not allowed to modify the database) and link_geom is the line coordinates.


The database is in PostgreSQL 10.6, PostGIS 2.5, there is a gist index for the link_geom column, and a VACUUM ANALYZE has already been done on the nearest_spot table.

My goal is to find the nearest neighbor (in this table) to a point in a data record, as fast as possible.

I already know how to find the nearest neighbor, my problem is the time it takes to find it. I'm pretty new to PostgreSQL and PostGIS and I've been reading their documentations, going through a lot of topics about KNN optimizations, I've been searching for the most effective answer and yet I can't have a result under 5 minutes (and it goes up to 30min sometimes), even when only searching for one row . The different queries I've tried are as follows :

SELECT *
FROM( SELECT A.position, B.spot_id
      FROM data A, nearest_spot B
      WHERE A.id = 1
      AND ST_DWithin(A.position,B.link_geom,20)
      ORDER BY A.position <-> B.link_geom
      LIMIT 10;)
ORDER BY ST_Distance(A.position,B.link_geom)
LIMIT 1;

SELECT *
FROM( SELECT A.position, B.spot_id
      FROM data A, nearest_spot B
      WHERE A.id = 1
      AND ST_Buffer(A.position,20) && B.link_geom
      ORDER BY A.position <-> B.link_geom
      LIMIT 10;)
ORDER BY ST_Distance(A.position,B.link_geom)
LIMIT 1;

SELECT *
FROM( SELECT A.position, B.spot_id
      FROM data A, nearest_spot B
      WHERE A.id = 1
      AND ST_Intersects(ST_Buffer(A.position,20), B.link_geom)
      ORDER BY A.position <-> B.link_geom
      LIMIT 10;)
ORDER BY ST_Distance(A.position,B.link_geom)
LIMIT 1;

The reason I ORDER BY with <-> first, then with ST_Distance is that according to this documentation from PostGIS, <-> is faster but less precise (for bounding boxes) while ST_Distance is more precise but slower.

I also used this documentation about Spatial Indexing, and this one about the <-> operator, both from PostGIS also.

EDIT: I realized all my coordinates are stored as geometry (SRID 4326), so the ST_DWithin call, while having good syntax, was returning all rows not within 20 meters as a thought it would, but all rows within 20 degrees (of earth), so in fact my ST_DWithin wasn't making the result set any smaller, and it might be one of the biggest reason why it was taking so long, and the same goes for ST_Buffer. I'll try to cast all coordinates as geography (with ::geography) before using them with meters and hopefully I'll see an improvement

0 answers