Faking it: Geospatial search without the prereqs or infrastructure

First, a public service announcement: tomorrow, Nov. 16th, is “GIS Day” which means universities all over will be holding day lectures, in most cases for free. Many of them will be over our heads since we’re but neogeographers, but inspiration can come from anywhere – and it’s a good opportunity to meet others interested in this stuff. I’ll be at the one held at the University of Kansas here in Lawrence – why not look it up and see if your nearby university is having a program? My apologies on the short notice for this though, hopefully you can still make something happen as a professional development day.

I’ve shown more than a few ways we can use the ORM with GeoDjango to do real geographic searches of various types. But what if your limitations aren’t so generous and geospatial libraries and databases are off limits?

What if there were a way we could FAKE it?

There’s the incredibly nasty way which I won’t even put in code. Seriously, don’t do this. For completeness I’ll put it in semi-pseudocode that at least conveys the basic idea.

Create a set, loop over all objects in the database and use our
multi-tool GeoPy's vincentydistance (correct capitalization) method
to calculate the distance to each of them. Store the distance and a
record identifier in a dictionary and append it as a new "row" to our
set. Sort the set, and WOW, we have a really slow, potentially
disastrous in memory usage, but pure Python and minimal prerequisite
implementation of geographic searching.

Still listening? Please don’t do this.

Somebody smarter at math than me had this problem and thought about a way to solve it. It’s called Geohash and essentially it’s goal is to have a way of defining geography with varying degrees of accuracy with the variance in the number of characters. So 9yum8 will match 9yum8yef3vds6 (Lawrence, KS) and we can use a CharField to store, and query with the __startswith operator.

It’s not perfect, though – items JUST across the line which do overlap, but are not centered inside the hash we search for won’t be retured. To get them we need to calculate the neighbors of our hash. (in the case of python-geohash a method called “expand” will return neighbors + our origin box)

There’s a good library called python-geohash (use that name to pip install it, “geohash” on pip is not the right thing) that will do both of these calculations for us – pure python with C for speed if that works on your system.

>>> import geohash
>>> geohash.encode(38.9716689, -95.2352501)
'9yum8yef3vds'
>>> geohash.expand('9yum8yef3vds')
['9yum8yef3vdk', '9yum8yef3vdu', '9yum8yef3vde', '9yum8yef3vd7', '9yum8yef3vdg', '9yum8yef3vdt', '9yum8yef3vdm', '9yum8yef3vdv', '9yum8yef3vds']
>>> from django.db.models import Q
q_object = Q()
>>> for hash in geohash.expand('9yum8yef3vds'):
...     q_object.add(Q(geohash__startswith=hash), Q.OR)
>>> print(q_object)
(OR: ('geohash__startswith', '9yum8yef3vdk'), ('geohash__startswith', '9yum8yef3vdu'), ('geohash__startswith', '9yum8yef3vde'), ('geohash__startswith', '9yum8yef3vd7'), ('geohash__startswith', '9yum8yef3vdg'), ('geohash__startswith', '9yum8yef3vdt'), ('geohash__startswith', '9yum8yef3vdm'), ('geohash__startswith', '9yum8yef3vdv'), ('geohash__startswith', '9yum8yef3vds'))

So what we’ve done is create a Q object you can use on a queryset to fake a geographic search around the general area. This isn’t a very wide area, though. Trial and error will help. Passing a “precision=” kwarg into .encode() to request less accuracy (and a wider search area) like this:

>>> geohash.encode(38.9716689, -95.2352501, precision=4)
'9yum'
>>> geohash.expand('9yum')
['9yuj', '9yut', '9yuk', '9yuh', '9yus', '9yuq', '9yun', '9yuw', '9yum']
>>> q_object = Q()
>>> for hash in geohash.expand('9yum'):
...     q_object.add(Q(geohash__startswith=hash), Q.OR)
...
>>> print(q_object)
(OR: ('geohash__startswith', '9yuj'), ('geohash__startswith', '9yut'), ('geohash__startswith', '9yuk'), ('geohash__startswith', '9yuh'), ('geohash__startswith', '9yus'), ('geohash__startswith', '9yuq'), ('geohash__startswith', '9yun'), ('geohash__startswith', '9yuw'), ('geohash__startswith', '9yum'))

According to Wikipedia’s explanation of the algorithm (see under “Worked Example”) accuracy ranges from +/- 2500km at 1 character to +/- 0.019km at 8 characters. A length of 5 is +/- 2.4km, close to what we were using for circle radius generation in previous work. But drop to 4 and you’re grabbing +/- 20km worth of records.

It’s far from perfect, but when there’s nothing available but plain text search, it’s better than nothing!

Leave a Reply

Your email address will not be published. Required fields are marked *