Monday, January 14, 2008

Reverse Geocoding Using GeoName Data

I have be using GeoNames data for implementing reverse geocoding.

It took me a while to get it working and thought of sharing the steps that I followed ( Postgres + Postgis ) to get the reverse geocoding work ( and it is fast! )
1) Created and loaded the table by following link or link.

2) Created a geometry column
SELECT AddGeometryColumn( 'public', 'geoname', 'latlon_point', 2163, 'POINT', 2 );
Note that I am using 2163 as SRID ( the unit is meter )

3) Populated the column
update geoname set latlon_point =
transform(GeomFromText('POINT(' || longitude || ' ' || latitude || ')',4326),2163)

4) Created a clustered gist index
CREATE INDEX geoname _latlon_place_index
ON geoname
USING gist
ALTER TABLE geoname CLUSTER ON geoname _latlon_place_index;

5) To find nearest 5 records (within 5 kms) for the given lat/long ( 12.97199/77.60483)

SELECT * FROM geoname
WHERE feature_class = 'P'
and st_dwithin(latlon_point, transform(GeomFromText('POINT(77.60483 12.97199)',4326),2163),5000)
order by ST_Distance(latlon_point, transform(GeomFromText('POINT(77.60483 12.97199)',4326),2163))
limit 5

The above query should be extremely fast if the gist index is created properly. One way to confirm whether is it using the index or not, is by looking at the query planner output :

"Limit (cost=9.82..9.82 rows=1 width=175)"
" -> Sort (cost=9.82..9.82 rows=1 width=175)"
" Sort Key: st_distance(latlon_point, '...'::geometry)"
" -> Index Scan using geoname1_latlon_place_index on geoname1 (cost=0.00..9.81 rows=1 width=175)"
" Index Cond: (latlon_point && '..'::geometry)"
" Filter: ((feature_class = 'P'::bpchar) AND (latlon_point && '..'::geometry) AND ('..'::geometry && st_expand(latlon_point, 5000::double precision))

As with any user generated data, there are dirty data present in this dump like duplicate place record, empty place name etc. Also since I am interested only in cities, I added a filter on feature_code = 'PPL' and feature_class = 'P'.

This is just the first pass, and I still need to do couple more processing for cases where the place name contains punctuation, duplicate places across state/country etc. I'm calling it a night! The alarm clock is going to go off in just six hours.