I was exploring the asf_core_data.getters.epc.epc_data module via the function load_england_wales_data() and ran into significant memory constraints on read.
Looking at the read logic, there may be considerable gains to be made if you are willing to be more assertive about data types.
I've written an illustrative example targeting the low-hanging fruit for loading the current Welsh EPC data, which cuts the memory footprint by roughly 28% relative to the current approach: 3.6 GB in memory originally versus 2.6 GB revised.
The approach is to cast categorical data to the pandas categorical dtype where possible, as this can save substantial memory compared to the default object representation (sometimes 100x less). The trade-off is that you need to assert beforehand what the possible categories are; however, in many cases this is feasible (as with the Likert-style efficiency scales).
The main reason not to do this is if you want to maintain the distinction between 'raw' data and preprocessed data.
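As a quick toy illustration of the size difference (not the EPC data itself):

```python
import pandas as pd

# One million repeated labels: object dtype stores a Python string per row,
# while category stores small integer codes plus one copy of each label.
s_obj = pd.Series(["Very Good"] * 1_000_000, dtype="object")
s_cat = s_obj.astype("category")

print(s_obj.memory_usage(deep=True))  # tens of MB
print(s_cat.memory_usage(deep=True))  # ~1 MB of int8 codes
```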
Here is a reproducible example:
First, set some parameters in the base config:
```python
from asf_core_data.config import base_config

# Some references to the downloaded EPC certs.
base_config.ROOT_DATA_PATH = "/home/xxx/projects/data/"
base_config.RAW_DATA_PATH = ""
base_config.RAW_ENG_WALES_DATA_ZIP = "/home/xxx/projects/data/all-domestic-certificates.zip"
base_config.RAW_ENG_WALES_DATA_PATH = ""

# Non-exhaustive set of updates to data types.
# Note that pandas defaults to int64 and float64 unless told otherwise.
base_config.dtypes['CURRENT_ENERGY_EFFICIENCY'] = 'int32'
base_config.dtypes['ENERGY_CONSUMPTION_CURRENT'] = 'float32'
base_config.dtypes['CO2_EMISSIONS_CURRENT'] = 'float32'
base_config.dtypes['CO2_EMISS_CURR_PER_FLOOR_AREA'] = 'float32'
base_config.dtypes['TOTAL_FLOOR_AREA'] = 'float32'
base_config.dtypes['MULTI_GLAZE_PROPORTION'] = 'float32'
base_config.dtypes['NUMBER_HABITABLE_ROOMS'] = 'float32'
base_config.dtypes['LOW_ENERGY_LIGHTING'] = 'float32'
base_config.dtypes['CURRENT_ENERGY_RATING'] = 'category'
base_config.dtypes['POTENTIAL_ENERGY_RATING'] = 'category'
base_config.dtypes['PROPERTY_TYPE'] = 'category'
base_config.dtypes['BUILT_FORM'] = 'category'
base_config.dtypes['MAINS_GAS_FLAG'] = 'category'
base_config.dtypes['FLOOR_ENERGY_EFF'] = 'category'
base_config.dtypes['WINDOWS_ENERGY_EFF'] = 'category'
base_config.dtypes['HOT_WATER_ENERGY_EFF'] = 'category'
base_config.dtypes['WALLS_ENERGY_EFF'] = 'category'
base_config.dtypes['ROOF_ENERGY_EFF'] = 'category'
base_config.dtypes['MAINHEAT_ENERGY_EFF'] = 'category'
base_config.dtypes['MAINHEATC_ENERGY_EFF'] = 'category'
base_config.dtypes['LIGHTING_ENERGY_EFF'] = 'category'
```
Now, pd.concat won't actually preserve the categorical dtype unless all categories are represented in every dataframe being concatenated; otherwise the columns silently fall back to object. This might change in pandas release 2.1, but for now we need to manually ensure that the category sets match. To do this, let's define a new dictionary of categories.
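A minimal sketch of such a dictionary, compatible with `.assign(**categories)` as used in the function below. The exact category sets here are assumptions (A–G rating bands and the standard five-point efficiency scale) and should be verified against the raw data; columns like PROPERTY_TYPE or BUILT_FORM could be handled the same way:

```python
import pandas as pd

# Assumed category sets -- verify against the raw EPC data. Values not in
# the set become NaN, so every expected label must be included.
RATING_BANDS = list("ABCDEFG")
EFF_SCALE = ["Very Poor", "Poor", "Average", "Good", "Very Good"]


def with_categories(col, cats):
    """Return an assign()-compatible callable that recasts `col` with a fixed category set."""
    return lambda df: pd.Categorical(df[col], categories=cats)


categories = {
    col: with_categories(col, RATING_BANDS)
    for col in ["CURRENT_ENERGY_RATING", "POTENTIAL_ENERGY_RATING"]
}
categories.update(
    {
        col: with_categories(col, EFF_SCALE)
        for col in [
            "FLOOR_ENERGY_EFF",
            "WINDOWS_ENERGY_EFF",
            "HOT_WATER_ENERGY_EFF",
            "WALLS_ENERGY_EFF",
            "ROOF_ENERGY_EFF",
            "MAINHEAT_ENERGY_EFF",
            "MAINHEATC_ENERGY_EFF",
            "LIGHTING_ENERGY_EFF",
        ]
    }
)
```

With every frame sharing identical category sets, pd.concat keeps the categorical dtype instead of falling back to object.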
Now we need to adjust the load_england_wales_data() function to use this new category information. Here's a simple example:
```python
import os
from pathlib import Path

import pandas as pd

from asf_core_data.config import base_config
from asf_core_data.getters.epc import data_batches


def load_england_wales_data(
    data_path=base_config.ROOT_DATA_PATH,
    rel_data_path=base_config.RAW_ENG_WALES_DATA_PATH,
    batch=None,
    subset=None,
    usecols=None,
    n_samples=None,
    dtype=base_config.dtypes,
):
    RAW_ENG_WALES_DATA_PATH = data_batches.get_batch_path(
        Path(data_path) / rel_data_path, data_path=data_path, batch=batch
    )
    RAW_ENG_WALES_DATA_ZIP = data_batches.get_batch_path(
        Path(data_path) / base_config.RAW_ENG_WALES_DATA_ZIP,
        data_path=data_path,
        batch=batch,
    )

    # Get all certificate directories (skip hidden entries, notes and zips).
    directories = [
        directory
        for directory in os.listdir(RAW_ENG_WALES_DATA_PATH)
        if not (
            directory.startswith(".")
            or directory.endswith(".txt")
            or directory.endswith(".zip")
        )
    ]

    # Keep only the directories for the requested subset ("England" or "Wales").
    start_with_dict = {"Wales": "domestic-W", "England": "domestic-E"}
    directories = [
        directory
        for directory in directories
        if directory.startswith(start_with_dict[subset])
    ]

    if usecols is not None:
        usecols = [
            col for col in usecols if col not in base_config.scotland_only_features
        ]

    # Read each batch with the asserted dtypes, recast the categorical columns
    # with their full category sets (the module-level `categories` mapping
    # defined above), then concatenate -- identical categories keep the
    # categorical dtype intact through pd.concat.
    epc_certs = pd.concat(
        (
            pd.read_csv(
                RAW_ENG_WALES_DATA_PATH / directory / "certificates.csv",
                dtype=dtype,
                usecols=usecols,
            ).assign(**categories)
            for directory in directories
        ),
        axis=0,
    ).assign(COUNTRY=subset)

    if "UPRN" in epc_certs.columns:
        epc_certs["UPRN"] = epc_certs["UPRN"].fillna(
            epc_certs["BUILDING_REFERENCE_NUMBER"]
        )

    return epc_certs
```
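I can then run the function like (the memory check is just for illustration):

```python
# Load the Welsh EPC certificates with the asserted dtypes and categories.
wales_epc = load_england_wales_data(subset="Wales")

# Check the in-memory footprint (GB).
print(wales_epc.memory_usage(deep=True).sum() / 1e9)
```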
Preprocessing can then occur as usual via `asf_core_data.pipeline.preprocessing.preprocess_epc_data.preprocess_data()`.
Given the size of the EPC data, and the likelihood that it will grow in future, it may be worth exploring dask as a processing option in the pipeline; otherwise, comprehensive processing will be increasingly constrained by RAM availability.
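A rough sketch of what that could look like. The glob pattern and column choice are illustrative, `dask.dataframe` is assumed to be installed, and `base_config.dtypes` is the mapping configured above:

```python
import dask.dataframe as dd

# Lazily read every certificates.csv under the extracted archive
# (the path pattern is illustrative -- adjust to the actual layout).
epc_certs = dd.read_csv(
    "/home/xxx/projects/data/all-domestic-certificates/domestic-W*/certificates.csv",
    dtype=base_config.dtypes,
)

# Nothing is loaded until .compute(); aggregations stream partition by
# partition, so the full dataset never has to fit in RAM at once.
print(epc_certs["CO2_EMISSIONS_CURRENT"].mean().compute())
```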