Experiments with in-memory spatial radius queries in Python

Efficient geo-spatial radius queries on a given dataset are a problem we face regularly. Usually we handle it by spinning up a PostgreSQL database and installing PostGIS with all its nice and efficient spatial indexing capabilities. Recently, however, we decided to investigate other possibilities for doing fast geo-spatial radius queries in-memory, to take some load off our live Postgres instances. In this blog post I want to share three alternatives that we discovered during our search and benchmark them against each other.

Our Use Case

Map illustrating the use case (map © OpenStreetMap)

The general problem is that we want to get all points that lie within a given radius of a spatial event (see the figure above). So we're looking for a function of the form:

def radius_search(point_database, spatial_event, radius) -> list

Here point_database is a list of tuples containing the geo-coordinates latitude and longitude. The same goes for the spatial_event variable, which is also just a tuple of latitude and longitude. The radius is the maximum allowed distance (in meters) between the spatial_event and the returned list of nearby points.
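To make the interface concrete, here is a hypothetical call; the coordinates and the radius are made up purely for illustration:

point_database = [
    (52.5200, 13.4050),  # Berlin
    (48.1351, 11.5820),  # Munich
]
spatial_event = (52.5170, 13.3889)  # somewhere in Berlin Mitte

# All points within 5 km of the event -- here, only the Berlin coordinate.
nearby = radius_search(point_database, spatial_event, radius=5000)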

Because the code will work with long-lived objects that are updated maybe once or twice a day, I mostly care about the query runtime while evaluating different implementations of the radius_search function. The time to build the index is considered less important during the comparison.

Our desired query runtime for a point_database containing a couple of hundred thousand points should be within the range of milliseconds.

Looking for ways to solve the problem

First of all, I wanted to solve this problem in Python, since most of our projects are Python-only. The first thing you will most likely do when you encounter such an issue is hit Google. After some hours of searching I basically had a list of potential alternatives to solve our problem:

  • Using R-trees as in the R-tree python pip package: They provide an efficient way to index spatial points and enable fast nearest neighbour queries and bounding box queries.
  • Brute Force Method: Calculate the distance between our query point with every point in the database and then select the points that are within the desired radius.
  • Tree-based radius search based on nearest-neighbour algorithms: There are a couple of methods to speed up nearest-neighbour search using tree-based indices. A very promising candidate I found was the BallTree implemented in scikit-learn.
  • Use geohashes at various precisions (to learn more about geohashes, check out Wikipedia) to discover a set of candidate points that are approximately within the radius, and then perform a brute-force search on the much smaller candidate set.

Taking all these methods into consideration, we decided to try out the brute-force method, the BallTree-based radius search and the geohash-based method. Why not R-trees? They would follow the same logic as the geohash-based method and would have roughly the same time complexity. However, they remain an option for future trials.

If you are interested in how R-trees work with Python, check out this blog post: http://geoffboeing.com/2016/10/r-tree-spatial-index-python/

Performing a small Benchmark

I took some inspiration from a scikit-learn example implementation and set up a small environment to compare the different methods:

  • Generate a spatial dataset (a sketch of this step follows below)
  • Perform radius queries based on different sizes of the spatial dataset
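For the first step, here is a minimal sketch of how such a dataset could be generated; it assumes numpy and uniformly random coordinates, while the exact benchmark code can be found in the repository linked at the end.

import numpy as np

def generate_dataset(n_samples, seed=42):
    """Generate n_samples random (latitude, longitude) pairs in degrees."""
    rng = np.random.RandomState(seed)
    latitudes = rng.uniform(-90.0, 90.0, n_samples)
    longitudes = rng.uniform(-180.0, 180.0, n_samples)
    return list(zip(latitudes, longitudes))

point_database = generate_dataset(100_000)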
Preparing the Benchmark

Implement the brute-force approach

We’re using the pair-wise haversine distance to get the distance between the query point and the list of points in our point database. This allows us to keep only the points that are within the specified radius. Note that the geo-coordinates are expected as latitude and longitude given in radians.

Brute force spatial radius search implementation
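A minimal sketch of such a brute-force searcher, assuming numpy and a simple class-based interface (the exact implementation lives in the linked repository):

import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean earth radius in meters

class BruteForceRadiusSearcher:
    def __init__(self, point_database):
        # Convert the (latitude, longitude) pairs from degrees to radians once.
        self._points = np.radians(np.array(point_database))

    def radius_search(self, spatial_event, radius):
        lat1, lon1 = np.radians(spatial_event)
        lat2, lon2 = self._points[:, 0], self._points[:, 1]
        # Pair-wise haversine distance between the query point and all points.
        a = (np.sin((lat2 - lat1) / 2) ** 2
             + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
        distances = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
        # Keep only the points within the requested radius (in meters).
        return [tuple(p) for p in np.degrees(self._points[distances <= radius])]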

Implement the ball-tree based Index

In order to implement the BallTree-based index we’re using the BallTree already implemented in scikit-learn and wrap it in the same interface as the brute-force method for a fair comparison.

Balltree-based spatial radius search implementation
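A sketch of what this wrapper can look like; note that scikit-learn's haversine metric expects coordinates in radians and returns distances on the unit sphere, so the radius in meters has to be divided by the earth radius:

import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # mean earth radius in meters

class BallTreeRadiusSearcher:
    def __init__(self, point_database):
        # The haversine metric expects (latitude, longitude) in radians.
        self._points = np.radians(np.array(point_database))
        self._tree = BallTree(self._points, metric="haversine")

    def radius_search(self, spatial_event, radius):
        query = np.radians(np.array([spatial_event]))
        # query_radius takes the radius in the metric's units, i.e. on the
        # unit sphere, so we scale the radius in meters by the earth radius.
        indices = self._tree.query_radius(query, r=radius / EARTH_RADIUS_M)[0]
        return [tuple(p) for p in np.degrees(self._points[indices])]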

Implement a geohash-based indexer

In this case it got a little bit more complicated. First of all, a geohash is a representation of spatial points that maps the geo-coordinates to a hash value representing an area on the globe. Points that are nearby will tend to fall into the same hash, whereas points that are far away have different hash values. Geohashes come in different precisions (hashes of different lengths), which basically describe the size of the area a geohash covers. A higher precision (longer hash) leads to a smaller area being covered by the geohash. You can read more about geohashes on Wikipedia.
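As a small illustration (assuming the pygeohash package, although any geohash library works the same way):

import pygeohash as pgh

lat, lon = 52.5200, 13.4050  # Berlin
pgh.encode(lat, lon, precision=5)  # 5 characters -> cell of roughly 4.9 km x 4.9 km
pgh.encode(lat, lon, precision=7)  # 7 characters -> cell of roughly 153 m x 153 m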

In principle the implementation follows the same pattern as the two above. In the constructor we’re building the index (which consumes quite some memory here) and then use it at runtime.

However, it is worth spending a few words on what is happening at query time.

  1. We get a list of geohashes that (inclusively) approximate the area within the radius around the specified latitude and longitude. To do this we’re using this implementation of proximityhash.
  2. We then use this list of geohashes to get all points that lie within one of these geohashes at precision 5. This precision turned out to be a good choice, since its average error of about 2.4 km seems reasonable to prune the candidate space considerably.
  3. Finally, we perform a brute-force search, computing the distance between the query and the candidate points, filter them based on the specified radius, and return the resulting list.
Geohash-based spatial radius search implementation
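In sketch form, the searcher can look like the following. The covering_geohashes helper below is a crude, grid-sampling stand-in for proximityhash (it ignores pole and antimeridian edge cases) and only serves to keep the example self-contained; the real implementation uses the proximityhash package mentioned above, and pygeohash is again assumed as the geohash library.

from collections import defaultdict

import numpy as np
import pygeohash as pgh

EARTH_RADIUS_M = 6_371_000  # mean earth radius in meters
PRECISION = 5  # ~4.9 km x 4.9 km cells, average error of about 2.4 km

def covering_geohashes(lat, lon, radius, precision=PRECISION):
    """Crude stand-in for proximityhash: sample a grid over the bounding box
    of the circle and collect the geohashes of the sampled points."""
    dlat = radius / 111_320 + 0.05  # box half-height in degrees, plus one cell of margin
    dlon = radius / (111_320 * np.cos(np.radians(lat))) + 0.05
    step = 0.02  # finer than a precision-5 cell (~0.044 degrees)
    hashes = set()
    for la in np.arange(lat - dlat, lat + dlat + step, step):
        for lo in np.arange(lon - dlon, lon + dlon + step, step):
            hashes.add(pgh.encode(la, lo, precision=precision))
    return hashes

class GeohashRadiusSearcher:
    def __init__(self, point_database):
        # The index: bucket every point by its precision-5 geohash.
        self._index = defaultdict(list)
        for lat, lon in point_database:
            self._index[pgh.encode(lat, lon, precision=PRECISION)].append((lat, lon))

    def radius_search(self, spatial_event, radius):
        lat, lon = spatial_event
        # 1. Geohashes that (inclusively) cover the circle around the query point.
        hashes = covering_geohashes(lat, lon, radius)
        # 2. Collect the candidate points from those buckets.
        candidates = [p for h in hashes for p in self._index.get(h, [])]
        if not candidates:
            return []
        # 3. Brute-force haversine filter on the much smaller candidate set.
        qlat, qlon = np.radians([lat, lon])
        plat, plon = np.radians(np.array(candidates)).T
        a = (np.sin((plat - qlat) / 2) ** 2
             + np.cos(qlat) * np.cos(plat) * np.sin((plon - qlon) / 2) ** 2)
        distances = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
        return [candidates[i] for i in np.where(distances <= radius)[0]]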

Comparison of the three approaches

Of course, implementing is just half of the work. Let us now validate the outcome. We’re comparing the algorithms based on the average query time on differently sized datasets (determined by n_samples).
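The measurement itself is straightforward; a possible sketch of it, reusing the generate_dataset helper and the searcher classes sketched above:

import time

def average_query_time_ms(searcher, queries, radius=5000):
    """Average wall-clock time per radius_search call, in milliseconds."""
    start = time.perf_counter()
    for query in queries:
        searcher.radius_search(query, radius)
    return (time.perf_counter() - start) / len(queries) * 1000

for n_samples in (10_000, 100_000, 1_000_000, 5_000_000):
    points = generate_dataset(n_samples)
    queries = generate_dataset(100, seed=7)
    for searcher in (BruteForceRadiusSearcher(points),
                     BallTreeRadiusSearcher(points),
                     GeohashRadiusSearcher(points)):
        print(n_samples, type(searcher).__name__,
              f"{average_query_time_ms(searcher, queries):.2f} ms")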

A side note: I’m running this benchmark inside a Docker image on my MacBook Pro (Early 2015) with 16 GB of RAM.

Let us have a look at the numbers.

OK! Good news so far: the BallTree and the geohash approach both outperform the brute-force search, by a factor of roughly 100 at 5 million samples. Let us now have a look at the other two approaches in a bit more detail.

There we can see that the geohash-based approach performs better than the BallTree-based approach. Both approaches grow linearly with the number of samples that need to be queried. This is no surprise, since both still have a linear component: running the brute-force search on the generated candidates. Still, given the relatively low query times, it seems that both prune the search space quite well.

So as a first general conclusion we can say that we reached our goal of enabling a fast lookup of nearby geo objects, using an in-memory index, within a couple of milliseconds on a commodity MacBook.

Training times for the Ball Tree

So far so good. Even though I stressed in the beginning that training or indexing time is not the main criterion for selecting the best solution, let us briefly examine it.

Here we see a different picture: building the BallTree index is roughly 3 times faster (at 5 million samples) than building the geohash index. As a consequence, if you need a lower indexing time, a BallTree-based search algorithm might be a good alternative for you.

Conclusion

  • If you’re using a very small dataset (let’s say up to a couple of thousand geo points) you’re fine with a brute-force search.
  • As the data grows you should think about using either a geohash-based or a BallTree-based index for the search.
  • Those spatial search indices can be held in-memory and be used for radius search very efficiently if you have up to a couple of million examples to index and search.
  • However, as long as these approaches are in-memory they will not scale infinitely. We probably have to add some memory profiling to really get the full picture here.
  • Training and query time is linear for both BallTree and Geohash-based indexes.

So what did we at MiNODES use in the end? We decided to go for the BallTree-based index, because it gave us a good balance between query-time and indexing time.

If you’re interested in using these findings you can find the code here.

As part of this small blog article I improved proximityhash to make it easier to use; my version can be found here. The original version was developed as part of this project.