Memory overhead of unoptimised data types on EPC load #49

Open
danlewis85 opened this issue Mar 7, 2023 · 0 comments
I was exploring the asf_core_data.getters.epc.epc_data module via the function load_england_wales_data() and ran into significant memory pressure on read.

Looking at the read logic, there may be considerable gains to be made if you are willing to be more assertive about data types.

I've written an illustrative example targeting the 'low hanging fruit' for loading the current Welsh EPC data, which cuts the memory footprint by roughly 28% compared to the current approach: 3.6 GB in memory for the original approach versus 2.6 GB for the revised one.

The approach is to cast categorical data to the pandas categorical dtype where possible, as this can save substantial memory compared to the default object representation (sometimes 100x less). The trade-off is that you need to assert beforehand what the possible categories are, but in many cases this is feasible (as with the Likert-style efficiency scales).
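To see why the category dtype helps, here is a small self-contained sketch (the column values are illustrative, not real EPC data) comparing the memory footprint of an object column against the same data stored as a categorical:

```python
import pandas as pd

# Three million rows of a seven-level rating stored as Python strings.
ratings = pd.Series(["C", "D", "B"] * 1_000_000)  # dtype: object

# The same data as a categorical: values become small integer codes
# pointing into a tiny array of category labels.
as_cat = ratings.astype(pd.CategoricalDtype(list("ABCDEFG")))

obj_bytes = ratings.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(f"object: {obj_bytes:,} bytes, category: {cat_bytes:,} bytes")
```

With only seven categories, each value is stored as a single int8 code, so the categorical column is a small fraction of the size of the object column.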

One reason not to do this is if you want to maintain the distinction between 'raw' data and preprocessed data.

Here is a reproducible example:

1. Set some parameters in the base config:

```python
from asf_core_data.config import base_config

# Some references to the downloaded EPC certs.
base_config.ROOT_DATA_PATH = "/home/xxx/projects/data/"
base_config.RAW_DATA_PATH = ""
base_config.RAW_ENG_WALES_DATA_ZIP = "/home/xxx/projects/data/all-domestic-certificates.zip"
base_config.RAW_ENG_WALES_DATA_PATH = ""

# Non-exhaustive set of updates to data types.
# Note that pandas defaults ints and floats to int64 and float64 unless told otherwise.
base_config.dtypes['CURRENT_ENERGY_EFFICIENCY'] = 'int32'
base_config.dtypes['ENERGY_CONSUMPTION_CURRENT'] = 'float32'
base_config.dtypes['CO2_EMISSIONS_CURRENT'] = 'float32'
base_config.dtypes['CO2_EMISS_CURR_PER_FLOOR_AREA'] = 'float32'
base_config.dtypes['TOTAL_FLOOR_AREA'] = 'float32'
base_config.dtypes['MULTI_GLAZE_PROPORTION'] = 'float32'
base_config.dtypes['NUMBER_HABITABLE_ROOMS'] = 'float32'
base_config.dtypes['LOW_ENERGY_LIGHTING'] = 'float32'
base_config.dtypes['CURRENT_ENERGY_RATING'] = 'category'
base_config.dtypes['POTENTIAL_ENERGY_RATING'] = 'category'
base_config.dtypes['PROPERTY_TYPE'] = 'category'
base_config.dtypes['BUILT_FORM'] = 'category'
base_config.dtypes['MAINS_GAS_FLAG'] = 'category'
base_config.dtypes['FLOOR_ENERGY_EFF'] = 'category'
base_config.dtypes['WINDOWS_ENERGY_EFF'] = 'category'
base_config.dtypes['HOT_WATER_ENERGY_EFF'] = 'category'
base_config.dtypes['WALLS_ENERGY_EFF'] = 'category'
base_config.dtypes['ROOF_ENERGY_EFF'] = 'category'
base_config.dtypes['MAINHEAT_ENERGY_EFF'] = 'category'
base_config.dtypes['MAINHEATC_ENERGY_EFF'] = 'category'
base_config.dtypes['LIGHTING_ENERGY_EFF'] = 'category'
```
2. Note that pandas concat silently falls back to the object dtype for a categorical column unless all frames being concatenated share an identical set of categories. This might change in pandas release 2.1, but for now we need to ensure it manually. To do this, let's define a dictionary of category setters:

```python
categories = {'CURRENT_ENERGY_RATING': lambda df: df['CURRENT_ENERGY_RATING'].cat.set_categories(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'INVALID!']),
              'POTENTIAL_ENERGY_RATING': lambda df: df['POTENTIAL_ENERGY_RATING'].cat.set_categories(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'INVALID!']),
              'PROPERTY_TYPE': lambda df: df['PROPERTY_TYPE'].cat.set_categories(['Flat', 'House', 'Park home', 'Bungalow', 'Maisonette']),
              'BUILT_FORM': lambda df: df['BUILT_FORM'].cat.set_categories(['Enclosed Mid-Terrace', 'Detached', 'Semi-Detached', 'Mid-Terrace',
                                                                            'End-Terrace', 'Enclosed End-Terrace', 'NO DATA!']),
              'MAINS_GAS_FLAG': lambda df: df['MAINS_GAS_FLAG'].cat.set_categories(['Y', 'N']),
              'FLOOR_ENERGY_EFF': lambda df: df['FLOOR_ENERGY_EFF'].cat.set_categories(['Very Good', 'Good', 'Average', 'Poor', 'Very Poor', 'NO DATA!']),
              'WINDOWS_ENERGY_EFF': lambda df: df['WINDOWS_ENERGY_EFF'].cat.set_categories(['Very Good', 'Good', 'Average', 'Poor', 'Very Poor', 'NO DATA!']),
              'HOT_WATER_ENERGY_EFF': lambda df: df['HOT_WATER_ENERGY_EFF'].cat.set_categories(['Very Good', 'Good', 'Average', 'Poor', 'Very Poor', 'NO DATA!']),
              'WALLS_ENERGY_EFF': lambda df: df['WALLS_ENERGY_EFF'].cat.set_categories(['Very Good', 'Good', 'Average', 'Poor', 'Very Poor', 'NO DATA!']),
              'ROOF_ENERGY_EFF': lambda df: df['ROOF_ENERGY_EFF'].cat.set_categories(['Very Good', 'Good', 'Average', 'Poor', 'Very Poor', 'NO DATA!']),
              'MAINHEAT_ENERGY_EFF': lambda df: df['MAINHEAT_ENERGY_EFF'].cat.set_categories(['Very Good', 'Good', 'Average', 'Poor', 'Very Poor', 'NO DATA!']),
              'MAINHEATC_ENERGY_EFF': lambda df: df['MAINHEATC_ENERGY_EFF'].cat.set_categories(['Very Good', 'Good', 'Average', 'Poor', 'Very Poor', 'NO DATA!']),
              'LIGHTING_ENERGY_EFF': lambda df: df['LIGHTING_ENERGY_EFF'].cat.set_categories(['Very Good', 'Good', 'Average', 'Poor', 'Very Poor', 'NO DATA!']),
             }
```
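The pitfall can be demonstrated with two toy frames (column name and values illustrative): when the category sets differ, pd.concat quietly produces an object column; once the categories are aligned, the result stays categorical:

```python
import pandas as pd

a = pd.DataFrame({"eff": pd.Categorical(["Good"], categories=["Good", "Poor"])})
b = pd.DataFrame({"eff": pd.Categorical(["Poor"], categories=["Poor"])})

# Category sets differ, so concat falls back to object.
mismatched = pd.concat([a, b])
print(mismatched["eff"].dtype)  # object

# Align the categories first (as the setter dict does), then concat.
b_aligned = b.assign(eff=lambda df: df["eff"].cat.set_categories(["Good", "Poor"]))
matched = pd.concat([a, b_aligned])
print(matched["eff"].dtype)  # category
```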
3. Now we need to adjust the load_england_wales_data() function to use this new information about categories; here's a simple example:

```python
import os
import pandas as pd
from asf_core_data import Path
from asf_core_data.getters.epc import data_batches

def load_england_wales_data(
    data_path=base_config.ROOT_DATA_PATH,
    rel_data_path=base_config.RAW_ENG_WALES_DATA_PATH,
    batch=None,
    subset=None,
    usecols=None,
    n_samples=None,
    dtype=base_config.dtypes,
):
    RAW_ENG_WALES_DATA_PATH = data_batches.get_batch_path(
        Path(data_path) / rel_data_path, data_path=data_path, batch=batch
    )
    RAW_ENG_WALES_DATA_ZIP = data_batches.get_batch_path(
        Path(data_path) / base_config.RAW_ENG_WALES_DATA_ZIP,
        data_path=data_path,
        batch=batch,
    )

    # Get all data directories, skipping hidden files, notes, and archives.
    directories = [
        dir
        for dir in os.listdir(RAW_ENG_WALES_DATA_PATH)
        if not (dir.startswith(".") or dir.endswith(".txt") or dir.endswith(".zip"))
    ]

    # Set subset dict to select the respective subset directories.
    start_with_dict = {"Wales": "domestic-W", "England": "domestic-E"}

    directories = [
        dir for dir in directories if dir.startswith(start_with_dict[subset])
    ]

    if usecols is not None:
        usecols = [
            col for col in usecols if col not in base_config.scotland_only_features
        ]

    # Read each authority's certificates, align the categories, then concat.
    epc_certs = pd.concat(
        (
            pd.read_csv(
                RAW_ENG_WALES_DATA_PATH / directory / "certificates.csv",
                dtype=dtype,
                usecols=usecols,
            ).assign(**categories)
            for directory in directories
        ),
        axis=0,
    ).assign(COUNTRY=subset)

    if "UPRN" in epc_certs.columns:
        epc_certs["UPRN"] = epc_certs["UPRN"].fillna(
            epc_certs.BUILDING_REFERENCE_NUMBER
        )

    return epc_certs
```

I can then run the function like:

```python
data = load_england_wales_data(
    data_path=base_config.ROOT_DATA_PATH,
    rel_data_path=base_config.RAW_DATA_PATH,
    subset='Wales',
    usecols=[
        'ADDRESS1', 'ADDRESS2', 'POSTCODE', 'MAINS_GAS_FLAG', 'NUMBER_HABITABLE_ROOMS',
        'CURRENT_ENERGY_RATING', 'POTENTIAL_ENERGY_RATING', 'CURRENT_ENERGY_EFFICIENCY',
        'ENERGY_CONSUMPTION_CURRENT', 'TENURE', 'MAINHEAT_ENERGY_EFF', 'HOT_WATER_ENERGY_EFF',
        'FLOOR_ENERGY_EFF', 'WINDOWS_ENERGY_EFF', 'WALLS_ENERGY_EFF', 'ROOF_ENERGY_EFF',
        'MAINHEATC_ENERGY_EFF', 'LIGHTING_ENERGY_EFF', 'MAINHEAT_DESCRIPTION',
        'CO2_EMISSIONS_CURRENT', 'CO2_EMISS_CURR_PER_FLOOR_AREA', 'BUILDING_REFERENCE_NUMBER',
        'INSPECTION_DATE', 'BUILT_FORM', 'PROPERTY_TYPE', 'CONSTRUCTION_AGE_BAND',
        'TRANSACTION_TYPE', 'TOTAL_FLOOR_AREA', 'ENERGY_TARIFF', 'UPRN',
        'SECONDHEAT_DESCRIPTION', 'FLOOR_LEVEL', 'LOCAL_AUTHORITY', 'LOCAL_AUTHORITY_LABEL',
        'GLAZED_AREA', 'GLAZED_TYPE', 'PHOTO_SUPPLY', 'OSG_REFERENCE_NUMBER',
        'SOLAR_WATER_HEATING_FLAG', 'LMK_KEY', 'WINDOWS_DESCRIPTION', 'HOTWATER_DESCRIPTION',
        'FLOOR_DESCRIPTION', 'WALLS_DESCRIPTION', 'ROOF_DESCRIPTION', 'LIGHTING_DESCRIPTION',
        'MAIN_HEATING_CONTROLS', 'MULTI_GLAZE_PROPORTION', 'LOW_ENERGY_LIGHTING',
    ],
)
```

Preprocessing can then occur as usual via `asf_core_data.pipeline.preprocessing.preprocess_epc_data.preprocess_data()`.

Given the size of the EPC data, and the likelihood that it will grow in future, it may be worth exploring Dask as a processing option in the pipeline, as comprehensive processing will otherwise be increasingly constrained by available RAM.
