Converting Ordnance Survey's postcode dataset coordinates

Introduction

A project recently required a geo-location lookup via the user’s postcode. Geo-location lookups via an address or postcode can be done through services that offer a geocoding API, such as the Google Maps Platform. However, these typically require you to create an account and add billing details, and they usually bill on a pay-as-you-go basis. For most projects, an existing geocoding API is the correct choice.

The alternative is to find a dataset that maps postcodes to map coordinates, import it into a database and put a small web service in front of it.

For UK postcodes, the Ordnance Survey provides an open geo-location dataset that we can use.

Codepoint Open Coordinate Conversion

The Ordnance Survey provides an open dataset of all the current postcodes in the UK. This is provided under the Open Government Licence, which can be found here.

The dataset that the Ordnance Survey provides uses Easting and Northing coordinates. Mapping services such as Google Maps and Mapbox use Latitude and Longitude coordinates. More specifically, these are WGS84 coordinates.

The Easting and Northing coordinates are in the EPSG:27700 coordinate reference system (the British National Grid). The Latitude and Longitude coordinates used by Google Maps are in EPSG:4326, which is the 2D coordinate reference system for WGS84.

To convert the coordinates we can use pyproj, a Python interface to PROJ.

We can set up a virtual environment and install the pyproj library:

$ mkdir pyproj_test
$ cd pyproj_test
$ python -m venv venv
$ . venv/bin/activate
$ pip install pyproj

To transform our coordinates we need to do roughly the following:

from pyproj import Transformer

# example values; substitute your own easting and northing
easting = 429157
northing = 623009

# initialise the transformer
transformer = Transformer.from_crs('EPSG:27700', 'EPSG:4326')

# do the transform; EPSG:4326 returns (latitude, longitude) in that order
lat, lng = transformer.transform(easting, northing)

print(f'latitude {lat}, longitude {lng}')

For example, the easting/northing coordinates 429157, 623009 should give a latitude of roughly 55.5 and a longitude of roughly -1.54. We can verify this in the Python REPL.

(venv) $ python
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyproj import Transformer
>>> easting = 429157
>>> northing = 623009
>>> 
>>> transformer = Transformer.from_crs('EPSG:27700', 'EPSG:4326')
>>> 
>>> lat, lng = transformer.transform(easting, northing)
>>> 
>>> print(f'latitude {lat}, longitude {lng}')
latitude 55.49999960817628, longitude -1.5400079100517177
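The order of the returned values is easy to get wrong, because EPSG:4326 defines its axes as latitude first, then longitude. As a minimal sketch, pyproj's `always_xy=True` option can be passed when creating the transformer to get the traditional x/y order (longitude, latitude) instead. The name `transformer_xy` is only illustrative.

```python
from pyproj import Transformer

# always_xy=True forces x/y order: (easting, northing) in, (longitude, latitude) out
transformer_xy = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)

lng, lat = transformer_xy.transform(429157, 623009)

print(f"latitude {lat}, longitude {lng}")
```

Whichever order is chosen, it is worth sticking to one convention throughout a script and checking it against a known point such as the one above.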

We can wrap this in a function that does the coordinate conversion using a transformer defined once, globally, in our script.

transformer = Transformer.from_crs('EPSG:27700', 'EPSG:4326')

# snip other code

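def coords_from_uk_easting_northing(easting, northing) -> tuple[float, float]:
    # EPSG:4326 returns (latitude, longitude) in that order
    lat, lng = transformer.transform(easting, northing)

    # returned as (lng, lat), which is the order the calling code unpacks
    return lng, lat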

Working with the Codepoint Open archive

Codepoint Open is available as a choice of two zip archives: one contains a set of CSV files, the other a GeoPackage. It is simplest to read and loop through the CSV files.

If we download and extract the CSV archive we can see the following directory structure:

$ tree . --filelimit=10
├── Data
│   └── CSV  [120 entries exceeds filelimit, not opening dir]
└── Doc
    ├── Codelist.xlsx
    ├── Code-Point_Open_Column_Headers.csv
    ├── licence.txt
    ├── metadata.txt
    ├── NHS_Codelist.xls
    └── readme.txt

The Data directory contains another directory called CSV which has 120 CSV files according to the tree program. There is one CSV file for each postcode area.

If we have a look at one of the CSV files:

$ head Data/CSV/ab.csv
"AB10 1AB",10,394235,806529,"S92000003","","S08000020","","S12000033","S13002842"
"AB10 1AF",10,394235,806529,"S92000003","","S08000020","","S12000033","S13002842"
"AB10 1AG",10,394230,806469,"S92000003","","S08000020","","S12000033","S13002842"
"AB10 1AH",10,394235,806529,"S92000003","","S08000020","","S12000033","S13002842"
"AB10 1AL",10,394296,806581,"S92000003","","S08000020","","S12000033","S13002842"
"AB10 1AN",10,394367,806541,"S92000003","","S08000020","","S12000033","S13002842"
"AB10 1AP",10,394309,806459,"S92000003","","S08000020","","S12000033","S13002842"
"AB10 1AQ",10,394230,806469,"S92000003","","S08000020","","S12000033","S13002842"
"AB10 1AR",10,394235,806529,"S92000003","","S08000020","","S12000033","S13002842"
"AB10 1AS",10,394198,806385,"S92000003","","S08000020","","S12000033","S13002842"

You can see the files have no column headers.

The column headers are described in the Doc/Code-Point_Open_Column_Headers.csv. If we have a peek at the file:

$ cat Doc/Code-Point_Open_Column_Headers.csv
PC,PQ,EA,NO,CY,RH,LH,CC,DC,WC
Postcode,Positional_quality_indicator,Eastings,Northings,Country_code,NHS_regional_HA_code,NHS_HA_code,Admin_county_code,Admin_district_code,Admin_ward_code

The first line holds abbreviated column codes and the second line holds the full column names. From the above we can see that:

Postcode is column 0
Easting is column 2
Northing is column 3
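Rather than counting columns by eye, the indices can be derived from the headers file itself. The sketch below inlines the two lines of Doc/Code-Point_Open_Column_Headers.csv so that it runs on its own; the helper name `load_column_names` is hypothetical.

```python
import csv
import io

# Contents of Doc/Code-Point_Open_Column_Headers.csv, inlined for the sketch
HEADERS_FILE = (
    "PC,PQ,EA,NO,CY,RH,LH,CC,DC,WC\n"
    "Postcode,Positional_quality_indicator,Eastings,Northings,Country_code,"
    "NHS_regional_HA_code,NHS_HA_code,Admin_county_code,Admin_district_code,"
    "Admin_ward_code\n"
)


def load_column_names(handle):
    """Return the full column names, which sit on the second row of the file."""
    rows = list(csv.reader(handle))
    return rows[1]


column_names = load_column_names(io.StringIO(HEADERS_FILE))

postcode_idx = column_names.index("Postcode")
easting_idx = column_names.index("Eastings")
northing_idx = column_names.index("Northings")

print(postcode_idx, easting_idx, northing_idx)  # → 0 2 3
```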

So to do the conversion we will have to do roughly the following:

Find all the CSV files in the Data/CSV directory.
Read each line in the CSV file.
Get the Postcode, the Easting and the Northing from that row.
Convert the coordinates.
Create a dict that places the Postcode, Lat and Lng into it.
Write out those records to another CSV file.
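The steps above can be sketched with only the standard library. This is an outline under stated assumptions rather than the script itself: the function names `convert_directory` and `write_records` are hypothetical, and `convert` stands for any callable that takes an easting and a northing and returns `(lat, lng)`.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, Iterator


def convert_directory(
    csv_dir: Path,
    convert: Callable[[float, float], tuple[float, float]],
) -> Iterator[dict]:
    """Yield one postcode/lat/lng record per row of every CSV file in csv_dir."""
    for csv_path in sorted(csv_dir.glob("*.csv")):
        with open(csv_path, newline="") as handle:
            for row in csv.reader(handle):
                # columns 0, 2 and 3 hold the postcode, easting and northing
                postcode, easting, northing = row[0], int(row[2]), int(row[3])
                lat, lng = convert(easting, northing)
                yield {"postcode": postcode, "lat": lat, "lng": lng}


def write_records(records: Iterable[dict], output_file: Path) -> None:
    """Write the records to a CSV file with a header row."""
    with open(output_file, mode="w", newline="") as handle:
        writer = csv.DictWriter(handle, ["postcode", "lat", "lng"])
        writer.writeheader()
        writer.writerows(records)
```

Passing the conversion in as a parameter keeps the file handling separate from pyproj, so the loop can be exercised with a dummy converter.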

So not too complicated. However, when the script was originally written with the built-in csv module, it took minutes to process each file, and there are 120 files, so a full run would have taken several hours. That was far too slow, so some optimisation was necessary.

To have a rough idea of how many entries are in each file we can just count the number of lines in the file:

$ wc -l Data/CSV/ab.csv 
17329 Data/CSV/ab.csv

So it looks like there are typically tens of thousands of rows per file.
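To get the same count for every file without running wc by hand, a small sketch like the following could be used. The function name `count_rows` is hypothetical, and the directory path is passed in by the caller.

```python
from pathlib import Path


def count_rows(csv_dir: Path) -> dict[str, int]:
    """Map each CSV file name in csv_dir to its number of lines."""
    counts = {}
    for csv_path in sorted(csv_dir.glob("*.csv")):
        with open(csv_path, newline="") as handle:
            counts[csv_path.name] = sum(1 for _ in handle)
    return counts
```

Calling `count_rows(Path("Data/CSV"))` from the extracted archive and summing the values gives the total number of postcodes that the conversion has to handle.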

Pandas can read CSV files much faster than the csv module can. With Pandas, you need to create a data frame before working on the data. Pandas has a read_csv function that we can use to create the data frame.

When creating the data frame, Pandas will try to infer the headers from the CSV file itself, or you can tell it what the headers are. Our CSV files don’t have a header row, so we need to tell Pandas what each column is. This is done via a list of column names that we can define at the top of our script.

CSV_COLUMN_NAMES = [
    "Postcode",
    "Positional_quality_indicator",
    "Eastings",
    "Northings",
    "Country_code",
    "NHS_regional_HA_code",
    "NHS_HA_code",
    "Admin_county_code",
    "Admin_district_code",
    "Admin_ward_code"
]

When we read the CSV with Pandas we need to pass the column names via the names keyword argument.

df = pd.read_csv(input_file, header=None, index_col=False, names=CSV_COLUMN_NAMES)

The other keyword arguments are fairly straightforward.

header=None tells Pandas that the first row of the file isn’t the column headers.
index_col=False tells Pandas not to use the first column as the index; otherwise Pandas may try to infer the index from the first column.
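Since only three of the ten columns are needed, one possible refinement is to ask Pandas to load just those with the `usecols` argument of `read_csv`. The sketch below is not part of the original script; it inlines two rows from ab.csv so that it runs without the archive.

```python
import io

import pandas as pd

CSV_COLUMN_NAMES = [
    "Postcode", "Positional_quality_indicator", "Eastings", "Northings",
    "Country_code", "NHS_regional_HA_code", "NHS_HA_code",
    "Admin_county_code", "Admin_district_code", "Admin_ward_code",
]

# two rows copied from ab.csv, inlined so the sketch needs no files
SAMPLE = (
    '"AB10 1AB",10,394235,806529,"S92000003","","S08000020","","S12000033","S13002842"\n'
    '"AB10 1AF",10,394235,806529,"S92000003","","S08000020","","S12000033","S13002842"\n'
)

df = pd.read_csv(
    io.StringIO(SAMPLE),
    header=None,
    index_col=False,
    names=CSV_COLUMN_NAMES,
    usecols=["Postcode", "Eastings", "Northings"],  # skip columns we never read
)

print(df)
```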

Once the data frame is loaded we can just loop through all rows, and do the conversion:

import pandas as pd

records = []

df = pd.read_csv(input_file, header=None, index_col=False, names=CSV_COLUMN_NAMES)

# the row index is not needed, only the row itself
for _, row in df.iterrows():
    postcode = row["Postcode"]
    eastings = row["Eastings"]
    northings = row["Northings"]

    lng, lat = coords_from_uk_easting_northing(eastings, northings)

    record = {
        "postcode": postcode,
        "lat": lat,
        "lng": lng
    }

    records.append(record)

You will notice that our column names match those in the CSV_COLUMN_NAMES list that was defined earlier.
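Looping with iterrows converts one row at a time. pyproj's `Transformer.transform` also accepts arrays, so a whole column can be converted in a single call. The sketch below shows that alternative; it is not the approach used in the script, and no timings were measured for it. The sample rows are trimmed to the three columns of interest.

```python
import io

import pandas as pd
from pyproj import Transformer

transformer = Transformer.from_crs("EPSG:27700", "EPSG:4326")

# rows from ab.csv, trimmed to postcode, easting and northing for brevity
SAMPLE = (
    '"AB10 1AB",394235,806529\n'
    '"AB10 1AG",394230,806469\n'
    '"AB10 1AL",394296,806581\n'
)

df = pd.read_csv(
    io.StringIO(SAMPLE), header=None, names=["Postcode", "Eastings", "Northings"]
)

# one call converts every row; EPSG:4326 returns (latitude, longitude)
lats, lngs = transformer.transform(
    df["Eastings"].to_numpy(dtype=float),
    df["Northings"].to_numpy(dtype=float),
)

result = pd.DataFrame({"postcode": df["Postcode"], "lat": lats, "lng": lngs})
records = result.to_dict("records")

print(records[0])
```

The resulting `records` list has the same shape as the one built by the loop, so the writing step does not need to change.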

Writing the files

This is straightforward. We can use the csv module to write the file:

import csv

with open(output_file, mode="w", newline='') as csv_file:
    writer = csv.DictWriter(csv_file, ["postcode", "lat", "lng"])
    # no header row is written; call writer.writeheader() first if one is needed
    writer.writerows(records)
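The introduction mentioned importing the converted postcodes into a database. As a rough sketch of that step, the records could be loaded into SQLite with the standard library. The table layout and the helper names `normalise`, `build_lookup` and `lookup` are assumptions made for illustration, and the sample record uses a placeholder postcode with the coordinates from the earlier example.

```python
import sqlite3


def normalise(postcode: str) -> str:
    """Upper-case and strip whitespace so 'ab10 1ab' and 'AB101AB' match."""
    return "".join(postcode.split()).upper()


def build_lookup(records) -> sqlite3.Connection:
    """Load postcode records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postcodes ("
        "postcode TEXT PRIMARY KEY, lat REAL NOT NULL, lng REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO postcodes VALUES (:postcode, :lat, :lng)",
        [{**r, "postcode": normalise(r["postcode"])} for r in records],
    )
    conn.commit()
    return conn


def lookup(conn: sqlite3.Connection, postcode: str):
    """Return (lat, lng) for a postcode, or None when it is unknown."""
    return conn.execute(
        "SELECT lat, lng FROM postcodes WHERE postcode = ?",
        (normalise(postcode),),
    ).fetchone()


# placeholder postcode, paired with the coordinates from the earlier example
conn = build_lookup([{"postcode": "ZZ99 9ZZ", "lat": 55.5, "lng": -1.54}])
print(lookup(conn, "zz99 9zz"))  # → (55.5, -1.54)
```

Normalising the postcode on both insert and lookup means users can type it with or without the space, in any case.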

Source code

The rest of the script is straightforward and isn’t worth going through extensively. The full source code can be found on GitHub.
