All notable changes to this project will be documented in this file.
| Maintainer | Email |
| --- | --- |
| Lien Michiels | [email protected] |
| Robin Verachtert | [email protected] |
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Updated PyYAML dependency to 6.0.1 due to a breaking change in Cython 3.0.0
- Updated pandas to 2.x
- Updated torch to 2.x
- algorithms
  - `TimeAwareItemKNN` was moved from experimental to the main algorithms package and renamed to `TARSItemKNNXia`, to distinguish it from other time-aware item KNN models.
  - Created a base class for time-aware variants of `ItemKNN`.
  - Added the following time-aware variants of `ItemKNN`:
    - `TARSItemKNNHermann`, as presented in Hermann, Christoph. "Time-based recommendations for lecture materials." EdMedia+ Innovate Learning. Association for the Advancement of Computing in Education (AACE), 2010.
    - `TARSItemKNNVaz`, as described in Vaz, Paula Cristina, Ricardo Ribeiro, and David Martins de Matos. "Understanding the Temporal Dynamics of Recommendations across Different Rating Scales." UMAP Workshops. 2013.
    - `SequentialRules`, as described in Ludewig, Malte, and Dietmar Jannach. "Evaluation of session-based recommendation algorithms." User Modeling and User-Adapted Interaction 28.4 (2018): 331-390.
    - `TARSItemKNNLee`, from Lee, Tong Queue, Young Park, and Yong-Tae Park. "A time-based approach to effective recommender systems using implicit feedback." Expert Systems with Applications 34.4 (2008): 3055-3062.
    - `TARSItemKNNDing`, from Ding, Yi, and Xue Li. "Time weight collaborative filtering." Proceedings of the 14th ACM International Conference on Information and Knowledge Management. 2005.
    - `TARSItemKNNLiu2012`, based on Liu, Yue, et al. "Time-based k-nearest neighbor collaborative filtering."
- metrics
  - Fixed bug in the `CalibratedRecallK` results property, which caused an error when requesting this property.
- datasets
  - Made Datasets more robust by removing the defaults for `USER_IX`, `ITEM_IX` and `TIMESTAMP_IX`.
  - Fixed bug in DummyDataset that used `num_users` to set `self.num_items`.
- matrix
  - Made `InteractionMatrix` more robust to funky indices. `InteractionMatrix` now verifies that the given shape matches the data.
- tests.test_algorithms
  - Removed unused fixtures and cleaned up naming.
  - Replaced remaining usage of `np.bool` with `bool` to be compatible with numpy 1.24.1.
- algorithms
  - `GRU4Rec`: fixed bug in training, where the model still tried to learn padding tokens.
- preprocessing
  - Made sure DataFrames returned by filters and preprocessors are copies, so that no DataFrame is ever edited in place.
- Added getting started guide for using Hyperopt optimisation
- matrix
  - Added `items_in` function, which selects only those interactions with the given list of items (see the sketch below).
  - Added `active_items` and `num_active_items` properties, to get the active items in a dataset.
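A minimal sketch of the new helpers. The construction of the `InteractionMatrix` (column names, keyword arguments) is illustrative; only `items_in`, `active_items` and `num_active_items` come from the entries above.

```python
import pandas as pd
from recpack.matrix import InteractionMatrix

# Toy interaction log; the column names are made up for this example.
df = pd.DataFrame({"user": [0, 0, 1], "item": [0, 2, 2], "timestamp": [1, 2, 3]})
im = InteractionMatrix(df, item_ix="item", user_ix="user", timestamp_ix="timestamp")

subset = im.items_in([0, 2])   # keep only interactions with items 0 and 2
print(im.active_items)         # items with at least one interaction
print(im.num_active_items)     # number of such items
```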
- datasets
- Added MovieLens100K, MovieLens1M, MovieLens10M and MillionSongDataset (aka TasteProfileDataset).
- Added Globo dataset
- pipelines
  - Added option to use hyperopt to optimise parameters (see the sketch below).
    - The `grid` parameter has been superseded by the `optimisation_info` parameter, which takes an `OptimisationInfo` object. Two relevant subclasses have been defined: `GridSearchInfo` contains a grid with parameters to support grid search; `HyperoptInfo` allows users to specify a hyperopt space, with a timeout and/or a maximum number of evaluations.
  - Extended the optimisation output to also include an `Algorithm` column for easier analysis.
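A sketch of the new optimisation interface. It assumes `GridSearchInfo` and `HyperoptInfo` can be imported from `recpack.pipelines` and that `add_algorithm` accepts the `optimisation_info` keyword; exact import paths and signatures may differ from the release.

```python
from hyperopt import hp
from recpack.pipelines import GridSearchInfo, HyperoptInfo, PipelineBuilder

builder = PipelineBuilder()

# Grid search, previously configured through the `grid` parameter:
builder.add_algorithm("ItemKNN", optimisation_info=GridSearchInfo({"K": [50, 100, 200]}))

# Hyperopt search over a parameter space, bounded by a timeout (in seconds)
# and a maximum number of evaluations:
builder.add_algorithm(
    "Prod2Vec",
    optimisation_info=HyperoptInfo(
        space={"num_components": hp.uniformint("num_components", 10, 300)},
        timeout=3600,
        max_evals=50,
    ),
)
```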
- datasets
  - CosmeticsShop now loads from a zip archive instead of requiring users to manually concatenate the files.
- tests.test_datasets
- Split dataset tests into different files for ease of reading.
- algorithms
  - `RecVAE` and `MultVAE` implementations were updated to use the `get_batches` function.
- pipelines
  - Made `PipelineBuilder` work with `Metric` (without K). Affects only `PercentileRanking`.
  - Fixed issue with optimisation selecting the wrong parameters.
- preprocessing
  - Made sure DataFrames returned by filters and preprocessors are copies, so that no DataFrame is ever edited in place.
  - Updated `Prod2Vec` and `Prod2VecClustered` to remove similarities from and to unvisited items.
- metrics:
  - The `.name` property of a metric now returns the CamelCase name instead of the lowercase name used before.
- splitters
  - Helper function `yield_batches` and class `FoldIterator` were removed, as they were unused in samples. The alternative function `get_batches` from `algorithms.util` should be used instead.
- algorithms
  - `BPRMF`:
    - Changed optimizer to Adagrad instead of SGD.
    - Set a maximum on the std of the embedding initialisation to make sure the initial embedding does not contain values that are too large.
- algorithms
  - `NMF`: updated to sklearn's `alpha_W` and `alpha_H` parameters, while keeping our single `alpha` parameter.
- Removed the `dataclasses` package dependency; it is already included in Python by default.
- Added a step to run example notebooks during testing to make sure these do not break.
- Used DummyDataset for all demo notebooks to avoid long download and run times in the example notebooks.
- algorithms
  - Renamed hyperparameters for clarity (see the sketch below):
    - Renamed `embedding_size` to `num_components` to follow sklearn parameter names.
    - Renamed `U` parameter to `num_negatives`.
    - Renamed `num_neg_samples` to `num_negatives` for consistency.
    - Renamed `J` for warp loss to `num_items`.
  - Deprecated parameter `normalize` of ItemKNN has been removed.
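A before/after sketch of the renames. Which algorithm exposes which renamed parameter is not listed above, so treat this constructor call as an assumption for illustration only.

```python
from recpack.algorithms import Prod2Vec

# Before the rename (illustrative): Prod2Vec(embedding_size=50, num_neg_samples=10)
# After the rename:
algo = Prod2Vec(num_components=50, num_negatives=10)
```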
- scenarios
  - `n_most_recent` parameter for LastItemPrediction was changed to `n_most_recent_in`, and its default value changed to infinity, so that it is in line with the TimedLastItemPrediction scenario.
  - `n` parameter for the `StrongGeneralizationTimedMostRecent` scenario was renamed to `n_most_recent_out`, and the behaviour for a negative value has been removed.
- datasets
  - Renamed `load_dataframe` to `_load_dataframe`; it is now a private member which should not be called directly.
  - Renamed `preprocessing_default` parameter to `use_default_filters` for clarity.
- scenarios
- Added default values for parameters of WeakGeneralization and StrongGeneralization.
- datasets
- Improved performance of load functionality for AdressaOneWeek.
- datasets
  - Changed behaviour of `force=True` on the Adressa dataset: it is now guaranteed to redownload the tar file and will no longer look for a local tar file. The tar file is also deleted at the end of the download method.
  - Added `Netflix` dataset to use the Netflix Prize dataset.
- scenarios
  - Added `TimedLastItemPrediction` scenario, which differs from LastItemPrediction in that it only trains its model on data before a certain timestamp and evaluates only on data after that timestamp, thus avoiding leakage.
- Configured optional extra installs:
  - `pip install recpack[doc]` installs the dependencies needed to generate documentation.
  - `pip install recpack[test]` installs the dependencies needed to run tests.
- Pinned version of sphinx
- algorithms
  - Added `validation_sample_size` parameter to the TorchMLAlgorithm base class and all child classes. It allows a user to evaluate on only a sample of the validation data in every evaluation iteration, which significantly speeds up the evaluation step after every training epoch (see the sketch below).
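A minimal sketch of the new parameter on a TorchMLAlgorithm child class; passing it as the only keyword argument is an assumption, with all other hyperparameters left at their defaults.

```python
from recpack.algorithms import MultVAE

# Evaluate each epoch on a random sample of 1000 validation users
# instead of the full validation set.
model = MultVAE(validation_sample_size=1000)
```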
The 0.3.0 release contains a large number of changes compared to 0.2.2. These changes should make the library more intuitive to use and better serve our users. In addition to interface changes and additions, we also made a big effort to expand the documentation and improve its readability.
- scenarios
  - Parameters of the `WeakGeneralization` scenario have been changed: it now accepts a single `frac_data_in` parameter, which makes sure the validation task is as difficult as the test task.
  - The training data structure has changed. Rather than a single training dataset, we now use two training datasets: `validation_training_data` is used to train models while optimising their parameters, with evaluation happening on the validation data; `full_training_data` is the union of training and validation data, on which the final model should be trained for evaluation on the test dataset.
- pipelines
  - PipelineBuilder's `set_train_data` function has been replaced with `set_full_training_data` and `set_validation_training_data`, to set the two separate training datasets.
  - Added `set_data_from_scenario` function to `PipelineBuilder`, which makes setting the data easy based on a split scenario (see the sketch below).
  - Updated `get_metrics` function, which now returns the metrics as a DataFrame rather than a dict.
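A sketch of the new data-setting flow. `set_data_from_scenario`, the two `set_*_training_data` functions and `get_metrics` come from the entries above; the scenario construction and the `add_algorithm`/`add_metric` keywords are assumptions for illustration.

```python
from recpack.pipelines import PipelineBuilder
from recpack.scenarios import WeakGeneralization

# `im` is assumed to be a previously constructed InteractionMatrix.
scenario = WeakGeneralization(frac_data_in=0.75, validation=True)
scenario.split(im)

builder = PipelineBuilder()
builder.set_data_from_scenario(scenario)  # sets all training and test data at once
# Equivalent manual calls:
# builder.set_full_training_data(scenario.full_training_data)
# builder.set_validation_training_data(scenario.validation_training_data)
builder.add_algorithm("ItemKNN", params={"K": 100})
builder.add_metric("NDCGK", K=10)

pipeline = builder.build()
pipeline.run()
metrics = pipeline.get_metrics()  # now a pandas DataFrame rather than a dict
```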
- Several module changes were made:
  - The splitters module was restructured. The base module is now called scenarios; inside scenarios, the splitters submodule contains the splitter functionality.
  - The data module was removed, and its submodules were turned into modules (`dataset` and `matrix`). The matrix file was split into additional files as well.
- datasets:
  - Default min rating for `Movielens25M` was changed from 1 to 4, in line with the preprocessing of the dataset used in most papers.
  - `load_interaction_matrix` function was renamed to `load`.
- metrics:
  - `DiscountedCumulativeGainK` is renamed to `DCGK`.
  - `NormalizedDiscountedCumulativeGainK` is renamed to `NDCGK`.
- splitters.scenarios
  - Improved tests for the WeakGeneralization scenario to confirm that mismatches between train and validation_in are only due to users without enough items, and not a potential bug.
  - Changed the order of checks when accessing validation data, so that the error raised when no validation data is configured is thrown before the error about splitting.
- algorithms.experimental.time_decay_nearest_neighbour
  - Improved computation speed of the algorithm by changing typing.
  - Added `decay_interval` parameter to improve performance.
- algorithms
  - Standardised train and predict input validation functions have been added, and are used to make sure inputs are InteractionMatrix objects when needed, and contain timestamps when needed.
  - Added `TARSItemKNNLiu` and `TARSItemKNN` algorithms, which compute item-item similarities based on a matrix weighted by the age of events. The `TARSItemKNNLiu` algorithm is defined in Liu, Nathan N., et al. "Online evolutionary collaborative filtering."
  - Added `STAN` algorithm, presented in Garg, Diksha, et al. "Sequence and time aware neighborhood for session-based recommendations: STAN". This is a session KNN algorithm that takes the order of and the time difference between sessions and interactions into account.
- data.matrix
  - Added `last_timestamps_matrix` property, which creates a csr matrix with the last timestamps as its nonzero values.
- data.datasets
  - Added `AdressaOneWeek` dataset.
- preprocessing.filters
  - Added MaxItemsPerUser filter to remove users with extreme numbers of interactions from a DataFrame.
  - Added the `SessionDataFramePreprocessor`, which cuts user histories into sessions while processing data into InteractionMatrices.
- Added a notebook with an implementation of NeuMF for demonstration purposes.
- data.datasets
- Added CosmeticsShop for https://www.kaggle.com/mkechinov/ecommerce-events-history-in-cosmetics-shop
- Added RetailRocket for https://www.kaggle.com/retailrocket/ecommerce-dataset
  - Added parameters to the DummyDataset to define the expected output.
- algorithms.baseline
  - Added boolean parameter `use_only_interacted_items` to the Random baseline, which selects whether all items should be used, or only those interacted with in the training dataset.
- splitters.scenarios
  - Added parameter `n_most_recent` to the `NextItemPrediction` class to limit test_in data to only the N most recent interactions of each user.
- algorithms.wmf
  - Refactored WeightedMatrixFactorization:
    - It now uses PyTorch operations instead of NumPy (for GPU speedups).
    - It now processes batches of users and items instead of individual users and items.
- splitters.scenarios
  - Renamed `NextItemPrediction` to `LastItemPrediction` class, and kept `NextItemPrediction` as a copy with a deprecation warning.
- splitters.scenarios
  - Fixed bug in `NextItemPrediction` scenario: if validation was specified, test_in data contained one too few interactions.
- Removed dependency on numba
- Removed numba decorators in shared account implementation. It's potentially slower now, which we don't consider a problem since it is in the experimental module.
- splitters.scenarios
  - You can now set the seed for scenarios with random components. This allows exact recreation of splits for reproducibility.
- pipelines.pipeline
  - `optimisation_results` property added to Pipeline, to allow users to inspect the results for the different hyperparameters that were tried.
- data.datasets
- Added DummyDataset for easy testing purposes.
- algorithms.nearest_neighbour
  - Added ItemPNN: it is ItemKNN, but instead of selecting the top K neighbours, they are sampled from a uniform, softmax-empirical or empirical distribution of item similarities.
  - Added a few options to ItemKNN (see the sketch below):
    - `pop_discount`: discount relationships with popular items to avoid a large popularity bias.
    - `sim_normalize`: renamed `normalize` to `sim_normalize`. Normalizes the rows in the similarity matrix to counteract artificially large similarity scores when the predictive item is rare. Defaults to False.
    - `normalize_X`: normalizes the rows in the interaction matrix so that the contribution of users who have viewed more items is smaller.
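A sketch of the new options; the keyword names follow the list above, while the chosen values are assumptions.

```python
from recpack.algorithms import ItemKNN

algo = ItemKNN(
    K=200,
    normalize_X=True,    # shrink the contribution of heavy users
    sim_normalize=True,  # normalize similarity rows (previously `normalize`)
    pop_discount=0.5,    # discount similarities involving very popular items
)
```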
- data.datasets
  - `filename` parameter now has a default value for almost all datasets.
  - After initializing a dataset, the code makes sure the specified path exists, creating directories if they were missing.
- Datasets:
  - The `filename` parameter behaviour has changed: it used to expect the full path to the file; it now expects just the filename, while the directory is specified using `path` (see the sketch below).
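A before/after sketch of the changed parameter. The dataset class, module path and file name used here are assumptions for illustration.

```python
from recpack.data.datasets import MovieLens25M

# Before: the full path went into `filename`, e.g.
#   MovieLens25M(filename="data/ml-25m/ratings.csv")
# After: just the file name, with the directory given through `path`.
dataset = MovieLens25M(path="data/ml-25m", filename="ratings.csv")
```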
- preprocessing.preprocessors:
  - Removed the `USER_IX` and `ITEM_IX` members from DataframePreprocessor. Use `InteractionMatrix.USER_IX` and `InteractionMatrix.ITEM_IX` instead.
- util:
  - `get_top_K_values` and `get_top_K_ranks` parameter `k` was changed to `K`, so that it is in line with the rest of RecPack (see the sketch below).
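A small sketch of the renamed parameter; the import path `recpack.util` follows the heading above, and the toy matrix is illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from recpack.util import get_top_K_ranks, get_top_K_values

scores = csr_matrix(np.array([[0.1, 0.5, 0.3],
                              [0.9, 0.0, 0.2]]))

top_values = get_top_K_values(scores, K=2)  # keep the 2 largest values per row
top_ranks = get_top_K_ranks(scores, K=2)    # 1-based ranks of those entries
```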
- Cleaned up the Torch class interface.
- Updated dependencies to support ranges rather than fixed versions.
- algorithms
  - Added `GRU4RecNegSampling`
  - Added `GRU4RecCrossEntropy`
  - Added new loss functions:
    - `bpr_max_loss`
    - `top_1_loss`
    - `top_1_max_loss`
  - Added sequence batch samplers
  - Added `predict_topK` parameter to the TorchMLAlgorithm base class and all children. This parameter is used to cut off predictions in case a dense user x item matrix would be too large to fit in memory (see the sketch below).
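A minimal sketch of `predict_topK` on one of the newly added algorithms; passing it as the sole keyword argument is an assumption.

```python
from recpack.algorithms import GRU4RecNegSampling

# Keep only each user's 100 highest-scoring items when predicting, so the
# dense user x item score matrix never has to be fully materialised.
model = GRU4RecNegSampling(predict_topK=100)
```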
- Added option to install recpack from gitlab pypi repository.
- See README for instructions.
- Very first release of Recpack
- Contains tested and documented code for:
- Preprocessing
- Scenarios
- Algorithms
- Metrics
- Postprocessing
- Contains pipelines for running experiments.