I was exploring the asf_core_data.getters.epc.epc_data module via the function load_england_wales_data() and ran into significant memory constraints on read.
Looking at the read logic, there may be considerable gains to be made if you are willing to be more assertive about data types.
I've written an illustrative example targeting the low-hanging fruit for loading the current Welsh EPC data, which cuts the memory footprint by roughly 28% relative to the current approach: 3.6 GB in memory originally versus 2.6 GB revised.
The approach is to cast categorical data to the pandas categorical dtype where possible, as this can save substantial memory compared to the default object representation (sometimes 100x less). The trade-off is that you need to assert beforehand what the possible categories are; however, in many cases this is feasible (as with the Likert-style efficiency scales).
The main reason not to do this is if you want to maintain the distinction between 'raw' data and preprocessed data.
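As a quick toy illustration of the size difference (not the EPC data itself):

```python
import pandas as pd

# One million repeated labels: object dtype stores a Python string per row,
# while category stores small integer codes plus one copy of each label.
s_obj = pd.Series(["Very Good"] * 1_000_000, dtype="object")
s_cat = s_obj.astype("category")

print(s_obj.memory_usage(deep=True))  # tens of MB
print(s_cat.memory_usage(deep=True))  # ~1 MB of int8 codes
```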
Here is a reproducible example:
First, set some parameters in the base config:
```python
from asf_core_data.config import base_config

# Some references to the downloaded EPC certs.
base_config.ROOT_DATA_PATH = "/home/xxx/projects/data/"
base_config.RAW_DATA_PATH = ""
base_config.RAW_ENG_WALES_DATA_ZIP = "/home/xxx/projects/data/all-domestic-certificates.zip"
base_config.RAW_ENG_WALES_DATA_PATH = ""

# Non-exhaustive set of updates to data types.
# Note that pandas defaults to int64 and float64 unless told otherwise.
base_config.dtypes['CURRENT_ENERGY_EFFICIENCY'] = 'int32'
base_config.dtypes['ENERGY_CONSUMPTION_CURRENT'] = 'float32'
base_config.dtypes['CO2_EMISSIONS_CURRENT'] = 'float32'
base_config.dtypes['CO2_EMISS_CURR_PER_FLOOR_AREA'] = 'float32'
base_config.dtypes['TOTAL_FLOOR_AREA'] = 'float32'
base_config.dtypes['MULTI_GLAZE_PROPORTION'] = 'float32'
base_config.dtypes['NUMBER_HABITABLE_ROOMS'] = 'float32'
base_config.dtypes['LOW_ENERGY_LIGHTING'] = 'float32'
base_config.dtypes['CURRENT_ENERGY_RATING'] = 'category'
base_config.dtypes['POTENTIAL_ENERGY_RATING'] = 'category'
base_config.dtypes['PROPERTY_TYPE'] = 'category'
base_config.dtypes['BUILT_FORM'] = 'category'
base_config.dtypes['MAINS_GAS_FLAG'] = 'category'
base_config.dtypes['FLOOR_ENERGY_EFF'] = 'category'
base_config.dtypes['WINDOWS_ENERGY_EFF'] = 'category'
base_config.dtypes['HOT_WATER_ENERGY_EFF'] = 'category'
base_config.dtypes['WALLS_ENERGY_EFF'] = 'category'
base_config.dtypes['ROOF_ENERGY_EFF'] = 'category'
base_config.dtypes['MAINHEAT_ENERGY_EFF'] = 'category'
base_config.dtypes['MAINHEATC_ENERGY_EFF'] = 'category'
base_config.dtypes['LIGHTING_ENERGY_EFF'] = 'category'
```
Now, pd.concat won't actually preserve the categorical dtype unless all categories are represented in every dataframe being concatenated; otherwise the columns silently fall back to object. This might change in pandas release 2.1, but for now we need to manually ensure that the category sets match. To do this, let's define a new dictionary of categories.
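A minimal sketch of such a dictionary, compatible with `.assign(**categories)` as used in the function below. The exact category sets here are assumptions (A–G rating bands and the standard five-point efficiency scale) and should be verified against the raw data; columns like PROPERTY_TYPE or BUILT_FORM could be handled the same way:

```python
import pandas as pd

# Assumed category sets -- verify against the raw EPC data. Values not in
# the set become NaN, so every expected label must be included.
RATING_BANDS = list("ABCDEFG")
EFF_SCALE = ["Very Poor", "Poor", "Average", "Good", "Very Good"]


def with_categories(col, cats):
    """Return an assign()-compatible callable that recasts `col` with a fixed category set."""
    return lambda df: pd.Categorical(df[col], categories=cats)


categories = {
    col: with_categories(col, RATING_BANDS)
    for col in ["CURRENT_ENERGY_RATING", "POTENTIAL_ENERGY_RATING"]
}
categories.update(
    {
        col: with_categories(col, EFF_SCALE)
        for col in [
            "FLOOR_ENERGY_EFF",
            "WINDOWS_ENERGY_EFF",
            "HOT_WATER_ENERGY_EFF",
            "WALLS_ENERGY_EFF",
            "ROOF_ENERGY_EFF",
            "MAINHEAT_ENERGY_EFF",
            "MAINHEATC_ENERGY_EFF",
            "LIGHTING_ENERGY_EFF",
        ]
    }
)
```

With every frame sharing identical category sets, pd.concat keeps the categorical dtype instead of falling back to object.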
Now we need to adjust the load_england_wales_data() function to use this new category information. Here's a simple example:
```python
import os
from pathlib import Path

import pandas as pd

from asf_core_data.config import base_config
from asf_core_data.getters.epc import data_batches


def load_england_wales_data(
    data_path=base_config.ROOT_DATA_PATH,
    rel_data_path=base_config.RAW_ENG_WALES_DATA_PATH,
    batch=None,
    subset=None,
    usecols=None,
    n_samples=None,
    dtype=base_config.dtypes,
):
    RAW_ENG_WALES_DATA_PATH = data_batches.get_batch_path(
        Path(data_path) / rel_data_path, data_path=data_path, batch=batch
    )
    RAW_ENG_WALES_DATA_ZIP = data_batches.get_batch_path(
        Path(data_path) / base_config.RAW_ENG_WALES_DATA_ZIP,
        data_path=data_path,
        batch=batch,
    )

    # Get all certificate directories (skip hidden entries, notes and zips).
    directories = [
        directory
        for directory in os.listdir(RAW_ENG_WALES_DATA_PATH)
        if not (
            directory.startswith(".")
            or directory.endswith(".txt")
            or directory.endswith(".zip")
        )
    ]

    # Keep only the directories for the requested subset ("England" or "Wales").
    start_with_dict = {"Wales": "domestic-W", "England": "domestic-E"}
    directories = [
        directory
        for directory in directories
        if directory.startswith(start_with_dict[subset])
    ]

    if usecols is not None:
        usecols = [
            col for col in usecols if col not in base_config.scotland_only_features
        ]

    # Read each batch with the asserted dtypes, recast the categorical columns
    # with their full category sets (the module-level `categories` mapping
    # defined above), then concatenate -- identical categories keep the
    # categorical dtype intact through pd.concat.
    epc_certs = pd.concat(
        (
            pd.read_csv(
                RAW_ENG_WALES_DATA_PATH / directory / "certificates.csv",
                dtype=dtype,
                usecols=usecols,
            ).assign(**categories)
            for directory in directories
        ),
        axis=0,
    ).assign(COUNTRY=subset)

    if "UPRN" in epc_certs.columns:
        epc_certs["UPRN"] = epc_certs["UPRN"].fillna(
            epc_certs["BUILDING_REFERENCE_NUMBER"]
        )

    return epc_certs
```
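I can then run the function like (the memory check is just for illustration):

```python
# Load the Welsh EPC certificates with the asserted dtypes and categories.
wales_epc = load_england_wales_data(subset="Wales")

# Check the in-memory footprint (GB).
print(wales_epc.memory_usage(deep=True).sum() / 1e9)
```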
Preprocessing can then occur as usual via `asf_core_data.pipeline.preprocessing.preprocess_epc_data.preprocess_data()`.
Given the size of the EPC data, and the likelihood that it will grow in future, it may be worth exploring dask as a processing option in the pipeline; otherwise, comprehensive processing will be increasingly constrained by RAM availability.
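A rough sketch of what that could look like. The glob pattern and column choice are illustrative, `dask.dataframe` is assumed to be installed, and `base_config.dtypes` is the mapping configured above:

```python
import dask.dataframe as dd

# Lazily read every certificates.csv under the extracted archive
# (the path pattern is illustrative -- adjust to the actual layout).
epc_certs = dd.read_csv(
    "/home/xxx/projects/data/all-domestic-certificates/domestic-W*/certificates.csv",
    dtype=base_config.dtypes,
)

# Nothing is loaded until .compute(); aggregations stream partition by
# partition, so the full dataset never has to fit in RAM at once.
print(epc_certs["CO2_EMISSIONS_CURRENT"].mean().compute())
```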