The purpose of this Python package is to allow easier navigation of the NOMIS API and easier collection of GSS geocodes from ONS Open Geography Portal. The GSS geocodes are necessary for selecting the right tables in the NOMIS API, which can otherwise be very difficult to navigate.
The lookup tables do not contain all possible lookup tables that are available from ONS's Open Geography Portal. Instead, only a handful of the most useful tables were downloaded.
However, you can easily extend the list of lookup tables by placing a new .csv file in the appropriate folder under /lookups/
after downloading it from the ONS Open Geography Portal. You can even create a custom folder for any new lookup tables that span between years (e.g. 2011 to 2016), but I recommend you stick with the folder naming pattern for your own sanity's sake. If you add more lookup tables, remember to delete the json_data.json
file in the /lookups/
folder - this file will be recreated automatically the next time you use the SmartGeocodeLookup class and will slow the process down at first, but once the json file has been recreated, the next time you use the
class the speed will be back to normal.
To install this package:
python -m pip install git+https://github.com/lb-lewisham/LBLDataAccess.git
Or
python -m pip install LBLDataAccess==0.2.2
If using Conda, you may need to just write:
pip install git+https://github.com/lb-lewisham/LBLDataAccess.git
or
pip install LBLDataAccess==0.2.2
To get help with selecting geocodes, you must first import GeoHelper
from LBLDataAccess.load_geocodes import GeoHelper
geo_help = GeoHelper()
GeoHelper class has a few helpful options to select the correct GSS codes to use with NOMIS. First of all, we can check what tables are available:
all_available_tables = geo_help.files_and_folders
The result is a JSON-like dictionary output that the user can use to navigate the geocode tables. Call it the 'Advanced mode' if you like and you can use this structure as a guide if you really want to do joins manually. This dictionary is organised like:
{year: file_name{'columns': list_of_columns, 'useful_columns': columns_that_end_in_CD}}
where year is the sub-folder in lookups
folder, file_name is the .csv or .xlsx file in that sub-folder, list_of_columns
is a list containing all columns for that table, and columns_that_end_in_CD
is the subset of all columns in the lookup table consisting of columns that end in 'CD'.
A more user-friendly option over the 'Advanced mode' is to explore the available geographical regions. You can do this by first printing the available 'years' sub-folders, from which you can select the sub-folder that interests you and use it as an input to an additional helper method:
years = geo_help.year_options
for year in years:
print(geo_help.available_geographies(year=year))
If you added a custom folder with your own .csv files and deleted the old JSON file, these will also show up with the above command. If you want the available geographies for all tables, simply write:
print(geo_help.available_geographies())
If you need an explanation for any of the codes output by the available_geographies()
method, you can use
print(geo_help.geography_keys())
which will print describe what the first few letters of the columns stand for. These descriptions aren't in-depth descriptions, but do provide some context. To fully understand what each of these areas are supposed to measure, please refer to ONS' Open Geography Portal.
Please note that you need to select the starting and the ending columns from the list of available geographies that are printed using the available_geographies()
method.
Now that you have decided what transformation you want, you're ready to use the SmartGeocodeLookup
class. To import this class:
from LBLDataAccess.load_geocodes import SmartGeocodeLookup
SmartGeocodeLookup
class first finds common columns between tables, and then uses graph theory to find the shortest path between all columns, regardless of whether they are in the same or different tables. This class is particularly helpful when the path between columns requires more than one join operation. As the original reason for creating this class was to find the geocodes for local authorities, we also need to provide a list of local authorities. As an example, let's say we want to get the 2022 GSS codes for Lewisham at Ward level, we'd use:
gss = SmartGeocodeLookup(end_column_max_value_search=True, local_authority_constraint=True, verbose=True)
# changing end_column_max_value_search to True gives different results from setting it False
gss.run_graph(starting_column='LAD22CD', ending_column='WD22CD', local_authorities=['Lewisham'])
# try changing WD22CD to WD21CD and WD23CD to see how this affects the results
geocodes = gss.get_filtered_geocodes(1) # the integer input here sets the number of possible routes returned.
print(geocodes[0])
This returns a Pandas dataframe containing all Wards for Lewisham as at 2022. Note that Pandas dataframe consists of other columns too, so you may need to apply further filters. If you want to know the path (i.e. list of tables that were joined), use:
print(gss.shortest_paths)
As another example, to get 2021 census Output area GSS codes, we'd have to choose output area column for 2021:
gss.run_graph(starting_column='LAD22CD', ending_column='OA21CD', local_authorities=['Lewisham'])
print(gss.get_filtered_geocodes())
Note that we haven't created a new SmartGeocodeLookup object, but rather are just building a new graph based on the data we already have.
The pre-requisite for using the NOMIS class is that you'll need to register for NOMIS and find your API key (www.nomisweb.co.uk) in settings. When you first use the DownloadFromNomis
class, you should set the api_key
and proxies
arguments. If you want, you can additionally set memorize=True
, which will save these variables in the config.json
file in the config
folder. The proxies argument has to be set as a dictionary proxies = {'http': your_http_proxy, 'https': your_https_proxy}
. For example:
from LBLDataAccess.access_nomis import DownloadFromNomis
api_key = 'examplestringthatcouldbeanything'
proxies = {'http': your_http_proxy, 'https': your_https_proxy}
conn = DownloadFromNomis(api_key=api_key, proxies=proxies, memorize=True)
conn.connect()
If you need to change the API key and proxies that are stored in the config.json
file, you can use the update_config()
method:
from LBLDataAccess.access_nomis import DownloadFromNomis
api_key = 'example_api_string'
proxies = {'http': your_http_proxy, 'https': your_https_proxy}
conn = DownloadFromNomis(api_key=api_key, proxies=proxies, memorize=True)
conn.connect()
api_key = 'new_api_string'
proxies = {'http': your_new_http_proxy, 'https': your_new_https_proxy}
conn.update_config(api_key=api_key, proxies=proxies)
Furthermore, if tracking the proxies and the API key becomes a burden, you can also reset these config settings with the reset_config()
method, which deletes the config.json
file in the config
folder as well as removes the api_key
and proxies
class attributes:
conn.reset_config()
print(conn.api_key)
The above will output an error:
AttributeError: 'DownloadFromNomis' object has no attribute 'api_key'
If you haven't removed the configurations, you can always print the stored api_key and proxies with:
print(conn.api_key, conn.proxies)
However, note also that even after calling the conn.reset_config()
method, you can still use the conn.connect()
as the API key has been stored as part of the URL call. This also means that you can still download data from NOMIS too. If this is not desired, you should delete the DownloadFromNomis()
object.
Once you have the geocodes, you can use them to download census data using the NOMIS API. To do so, you need to import the class for downloading data from NOMIS:
from LBLDataAccess.access_nomis import DownloadFromNomis
To use this class, you need to create an object and call the connect()
method:
conn = DownloadFromNomis()
conn.connect()
Note that here we assume that you have set up your api_key
and proxies
previously, saving them with the memorize
argument.
Now that you have created a connection, you can print a list of all available tables from NOMIS:
conn.print_table_info()
Sample of output:
NM_1_1 : Records the number of people claiming Jobseeker's Allowance (JSA) and National Insurance credits at Jobcentre Plus local offices. This is not an official measure of unemployment, but is the only indicative statistic available for areas smaller than Local Authorities.
NM_2_1 : A quartery count of claimants who were claiming Jobseeker's Allowance on the count date analysed by their age and the duration.
NM_4_1 : A monthly count of Jobseeker's Allowance (JSA) claimants broken down by age and the duration of claim. Totals exclude non-computerised clerical claims (approx. 1%). Available for areas smaller than Local Authorities.
.
.
.
.
NM_2096_1 : TS031 - Religion (detailed)
NM_2097_1 : TS075 - Multi religion households
NM_2098_1 : TS076 - Welsh language skills (speaking) by single year of age
Calling the above method prints two important details for every table (more than 1300 currently) available through the NOMIS API, separated by a colon: 1) the code that you'll use when downloading a table or information therein through the API; and 2) the table name, including a brief description of the data contained. To get more detailed information about a table, take the code string before the colon (e.g. NM_2098_1 for the table TS076 - Welsh language skills (speaking) by single year of age) and use it as input to detailed_info_for_table()
method:
conn.detailed_info_for_table(table_name='NM_2097_1')
This will print a more detailed description of the table:
Table id: NM_2097_1
Table description: TS075 - Multi religion households
Status: Current (being actively updated)
Keywords: Religion,Household
Units: Households
contenttype/sources: census_2021_ts
contenttype/geoglevel: oa,msoa,la
SubDescription: All households
Mnemonic: c2021ts075
FirstReleased: 2022-11-29 09:30:00
LastUpdated: 2022-11-29 09:30:00
MetadataTitle0: About this dataset
MetadataText0: This dataset provides Census 2021 estimates that classify households in England and Wales by multi religion households. The estimates are as at Census Day, 21 March 2021.
MetadataTitle1: Protecting personal data
MetadataText1: Sometimes we need to make changes to data if it is possible to identify individuals. This is known as statistical disclosure control. In Census 2021, we:
* Swapped records (targeted record swapping), for example, if a household was likely to be identified in datasets because it has unusual characteristics, we swapped the record with a similar one from a nearby small area. Very unusual households could be swapped with one in a nearby local authority.
* Added small changes to some counts (cell key perturbation), for example, we might change a count of four to a three or a five. This might make small differences between tables depending on how the data are broken down when we applied perturbation.
MetadataCount: 2
Column: GEOGRAPHY, column code: CL_2097_1_GEOGRAPHY
Column: C2021_RELMULT_7, column code: CL_2097_1_C2021_RELMULT_7
Column: MEASURES, column code: CL_2097_1_MEASURES
Column: FREQ, column code: CL_2097_1_FREQ
The DownloadFromNomis
class gives you three main downloading methods: table_to_csv()
, table_to_pandas()
, and get_bulk()
. The difference between the first two methods is that the first one saves the downloaded data to a csv file, while table_to_pandas()
returns a pandas dataframe. The get_bulk()
method takes an argument to specify whether you want to download the data as a csv or output a pandas dataframe. While the table_to_csv()
and table_to_pandas()
methods take six and five arguments, respectively, the get_bulk()
method takes only three. The reason for this is simple: these methods all take the dataset name as an argument, but table_to_pandas()
and table_to_csv()
methods also allow you to provide a dictionary of qualifiers to download just a subset of the data from NOMIS. These two methods also share the ability that you can ask for a specific set of columns to be output, instead of all columns. Likewise, you can also define whether you want the data in counts or as percentages. In contrast, get_bulk()
only takes the name of the dataset, data_format
argument used for specifying whether you want a pandas dataframe or a downlaoded csv file, and the save_location
argument.
To use get_bulk()
, you type:
from LBLDataAccess.access_nomis import DownloadFromNomis
conn = DownloadFromNomis()
conn.connect()
dataset = 'NM_2097_1'
save_location = '../nomis_downloads/'
# to get a pandas dataframe:
bulk_data = conn.get_bulk(dataset=dataset, data_format='df') # you can write either 'df' or 'pandas' for data_format
# or alternatively, if you want to save the data:
conn.get_bulk(dataset=dataset, data_format='csv', save_location=save_location) # you can use 'csv' or 'download' for data_format
# the above command will create a file 'NM_2097_1_bulk.csv' in the folder '../nomis_downloads/'
To use table_to_pandas()
to download data for Lewisham:
from LBLDataAccess.access_nomis import DownloadFromNomis
conn = DownloadFromNomis()
conn.connect()
dataset = 'NM_2097_1'
geographies = ['E09000023'] # E09000023 is the GSS code for Lewisham
qualifiers = {'geography': geographies} # other qualifiers are also possible (e.g. age or sex), but need to be provided as a list of values
table_columns = ['geography_name', 'obs_value'] # optional argument, which limits the columns returned
value_or_percent = 'value' # this argument is used for selecting either raw values or percentages
df = conn.table_to_pandas(dataset=dataset, qualifiers=qualifiers, table_columns=table_columns, value_or_percent=value_or_percent)
To save this information, you can use table_to_csv()
:
from LBLDataAccess.access_nomis import DownloadFromNomis
conn = DownloadFromNomis()
conn.connect()
dataset = 'NM_2097_1'
geographies = ['E09000023'] # E09000023 is the GSS code for Lewisham
qualifiers = {'geography': geographies} # other qualifiers are also possible (e.g. age or sex), but need to be provided as a list of values
file_name = 'lewisham_multi_religion_household.csv' # optional argument, defaults to f"{dataset}_query.csv" - must provide .csv extension
table_columns = ['geography_name', 'obs_value'] # optional argument, which limits the columns returned
save_location = '../nomis_downloads/'
value_or_percent = 'value' # this argument is used for selecting either raw values ('value') or percentages ('percent')
conn.table_to_csv(dataset=dataset, qualifiers=qualifiers, file_name=file_name, table_columns=table_columns, value_or_percent=value_or_percent)
# the above command will create a file 'lewisham_multi_religion_household.csv' in the folder '../nomis_downloads/'
# the downloaded data will include all rows for Lewisham and contain the geography_name and obs_value columns - the values will be the raw
# values.
Let's go through an example where we want to get data about religion at output area level for Lewisham and Waltham Forest from 2021 Census. From nomisweb.co.uk we can say that we want data from the table TS075 - Multi religion households.
First, we import our classes:
from LBLDataAccess.load_geocodes import GeoHelper, SmartGeocodeLookup
from LBLDataAccess.access_nomis import DownloadFromNomis
Then we want to have a look at what geographies are available
geo_help = GeoHelper()
print(geo_help.available_geographies()) # find what available geocodes there are and choose 'LAD22CD' and 'OA21CD' from this list
Output:
All available geographies:
['BUA11CD', 'BUA22CD', 'BUASD11CD', 'CAUTH22CD', 'CCG20CD', 'CED21CD', 'CMWD11CD', 'CTRY22CD', 'CTY21CD', 'CTY22CD', 'EER11CD', 'LAD11CD', 'LAD21CD', 'LAD22CD', 'LAU110CD', 'LAU210CD', 'LPA22CD', 'LSOA11CD', 'LSOA21CD', 'LTLA22CD', 'MSOA11CD', 'MSOA21CD', 'NAT22CD', 'NHSER20CD', 'NUTS106CD', 'NUTS206CD', 'NUTS306CD', 'OA11CD', 'OA21CD', 'PCO11CD', 'PCON11CD', 'RGN11CD', 'RGN22CD', 'SHA11CD', 'STP20CD', 'TTWA11CD', 'UTLA22CD', 'WD11CD', 'WD21CD', 'WD22CD', 'WZ11CD']
We have now explored the geographies and made our choice, and we can define the local authorities too:
start_column = 'LAD22CD'
end_column = 'OA21CD'
local_authorities = ['Lewisham', 'Waltham Forest'] # the input has to be a list and it is case sensitive
Let's get the GSS codes:
gss = SmartGeocodeLookup(end_column_max_value_search=True, local_authority_constraint=True, verbose=True)
gss.run_graph(starting_column=start_column, ending_column=end_column, local_authorities=local_authorities)
oa_geocodes = gss.get_filtered_geocodes(1)
Our function helpfully tells us that the data is all contained in one table and fetches that:
Reading JSON file json_data.json
['OA21_LSOA21_MSOA21_LAD22_EW_LU.csv', 'OA2021_to_LSOA2021_to_MSOA2021_to_LAD2022_DEC_2021.csv']
Same Node
Same Node
Shortest path is in the same table
Shortest path is in the same table
All possible shortest paths:
Path 1
Path 2
Got UnicodeDecodeError 'utf-8' codec can't decode byte 0xf4 in position 49806: invalid continuation byte for file C:\Users\isipila\OneDrive - Lewisham Council\Documents\Github\LBLDataAccess\LBLDataAccess\lookups\2021\OA21_LSOA21_MSOA21_LAD22_EW_LU.csv - changing encoding to latin-1
Above output simply shows that the algorithm identified two equally short paths from the column LAD22CD to column OA21CD. In both cases that path occurs in the same table, but as there are two paths, there are two tables that contain this same information. Indeed, under the line Reading JSON file json_data.json
, we have a list of the tables being joined, which in our case are simply the individual tables from which the data comes from. Lastly, we ignore the UnicodeDecodeError as it simply refers to a method of opening a file in Pandas. We handle this error by changing the encoding to latin-1, as stated in the message.
When we print oa_geocodes[0]
, we get the pandas dataframe:
OA21CD LSOA21CD LSOA21NM MSOA21CD \
15188 E00016112 E01003195 Lewisham 029C E02000681
15189 E00016113 E01003196 Lewisham 029D E02000681
15190 E00016114 E01003197 Lewisham 029E E02000681
15191 E00016115 E01003189 Lewisham 034A E02000686
15192 E00016116 E01003190 Lewisham 034B E02000686
... ... ... ... ...
173838 E00185827 E01004327 Waltham Forest 028A E02000922
173839 E00185828 E01004430 Waltham Forest 026D E02000920
173840 E00185829 E01004424 Waltham Forest 022B E02000916
173844 E00185833 E01004382 Waltham Forest 008C E02000902
173854 E00185843 E01034485 Waltham Forest 014F E02000908
MSOA21NM LAD22CD LAD22NM LAD22NMW
15188 Lewisham 029 E09000023 Lewisham NaN
15189 Lewisham 029 E09000023 Lewisham NaN
15190 Lewisham 029 E09000023 Lewisham NaN
15191 Lewisham 034 E09000023 Lewisham NaN
15192 Lewisham 034 E09000023 Lewisham NaN
... ... ... ... ...
173838 Waltham Forest 028 E09000031 Waltham Forest NaN
173839 Waltham Forest 026 E09000031 Waltham Forest NaN
173840 Waltham Forest 022 E09000031 Waltham Forest NaN
173844 Waltham Forest 008 E09000031 Waltham Forest NaN
173854 Waltham Forest 014 E09000031 Waltham Forest NaN
[1705 rows x 8 columns]
Okay, great, we have the necessary geocodes. Now let's get the NOMIS data as a pandas dataframe too. Before, though, we need to set the API key and possibly also the proxies:
api_key = 'change_me' # needs to be changed!!
proxies = {'http': 'change_me', 'https': 'change_me'} # optional, but can be beneficial
And let's connect:
conn = DownloadFromNomis(api_key=api_key, proxies=proxies, memorize=True) # setting memorize=True stores your API key and proxies for future use
conn.connect()
conn.print_table_info()
This prints the long list of information about the tables contained in NOMIS. We go through the printed text and find that the corresponding
table code for TS075 - Multi religion households, which we now know to be NM_2097_1
.
dataset = 'NM_2097_1'
geographies = list(oa_geocodes[0]['OA21CD'].unique())
qualifiers = {'geography': geographies}
df = conn.table_to_pandas(dataset=dataset, qualifiers=qualifiers, value_or_percent='value')
print(df)
The printed dataframe is:
DATE DATE_NAME ... RECORD_OFFSET RECORD_COUNT
0 2021 2021 ... 0 12888
1 2021 2021 ... 1 12888
2 2021 2021 ... 2 12888
3 2021 2021 ... 3 12888
4 2021 2021 ... 4 12888
... ... ... ... ...
12883 2021 2021 ... 12883 12888
12884 2021 2021 ... 12884 12888
12885 2021 2021 ... 12885 12888
12886 2021 2021 ... 12886 12888
12887 2021 2021 ... 12887 12888
[13640 rows x 28 columns]
There are 13,640 rows because unlike when using bulk downloading, the data is presented depth-wise rather than breadth-wise. This means that as there are eight categories for religion, we need to divide 13,640 by 8 to make sure that the output matches the expected number of records, which is the len(geographies)
or 1705.
13,640/8 = 1705 so we now that we have downloaded the correct number of records, and the only thing left to do is to wrangle the data into another format, but we won't be showing an example of that here.
from LBLDataAccess.load_geocodes import GeoHelper, SmartGeocodeLookup
from LBLDataAccess.access_nomis import DownloadFromNomis
import pandas as pd
pd.set_option('display.max_columns', None)
geo_help = GeoHelper()
print(geo_help.available_geographies()) # find what available geocodes there are and choose 'LAD22CD' and 'OA21CD' from this list
start_column = 'LAD22CD'
end_column = 'OA21CD'
local_authorities = ['Lewisham', 'Waltham Forest'] # the input has to be a list and it is case sensitive
gss = SmartGeocodeLookup(end_column_max_value_search=True, local_authority_constraint=True, verbose=True)
gss.run_graph(starting_column=start_column, ending_column=end_column, local_authorities=local_authorities)
oa_geocodes = gss.get_filtered_geocodes(1)
print(oa_geocodes[0])
api_key = 'change_me' # needs to be changed!!
proxies = {'http': 'change_me', 'https': 'change_me'} # optional, but can be beneficial
conn = DownloadFromNomis(api_key=api_key, proxies=proxies, memorize=True) # setting memorize=True stores your API key and proxies for future use
conn.connect()
conn.print_table_info()
dataset = 'NM_2097_1'
geographies = list(oa_geocodes[0]['OA21CD'].unique())
qualifiers = {'geography': geographies}
df = conn.table_to_pandas(dataset=dataset, qualifiers=qualifiers, value_or_percent='value')
print(df.head(20))