Releases: MuMiN-dataset/mumin-build
v1.10.0
Added
- Added `n_jobs` and `chunksize` arguments to `MuminDataset`, allowing these to be customised.
Changed
- Lowered the default value of `chunksize` from 50 to 10, which also lowers the memory requirements when processing articles and images, as fewer of these are kept in memory at a time.
- Now stores all images as `uint8` NumPy arrays rather than `int64`, reducing the memory usage of images significantly.
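The saving from the `uint8` switch is easy to see with a standalone NumPy sketch (illustrative only, not mumin's code; the image shape is a made-up example):

```python
import numpy as np

# A hypothetical 224x224 RGB image stored as int64 vs uint8.
image_int64 = np.zeros((224, 224, 3), dtype=np.int64)
image_uint8 = image_int64.astype(np.uint8)

print(image_int64.nbytes)  # 1204224 bytes
print(image_uint8.nbytes)  # 150528 bytes, i.e. 8x smaller
```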
v1.9.0
Added
- Added a checkpoint after rehydration. If compilation fails for any reason after this point, the next compilation will resume after the rehydration process.
- Added more unit tests.
Fixed
- Fixed bug on Windows where some tweet IDs were negative.
- Fixed another bug on Windows where the timeout decorator did not work, due to its use of signals, which are not available on Windows machines.
- Fixed bug on macOS causing Python to crash during parallel extraction of articles and images.
Changed
- Refactored the repository to use the more modern `pyproject.toml` with `poetry`.
v1.8.0
Changed
- Now allows instantiation of `MuminDataset` without any Twitter bearer token, neither as an explicit argument nor as an environment variable, which is useful for pre-compiled datasets. If the dataset needs to be compiled, then a `RuntimeError` will be raised when calling the `compile` method.
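A minimal sketch of this behaviour (the class and helper below are hypothetical stand-ins, not mumin's actual internals): the token resolves to `None` when neither source provides it, and only `compile` raises.

```python
import os
from typing import Optional

def resolve_bearer_token(explicit: Optional[str] = None) -> Optional[str]:
    """Return the token from the argument or the environment, else None."""
    return explicit or os.environ.get("TWITTER_API_KEY")

class Dataset:
    def __init__(self, twitter_bearer_token: Optional[str] = None):
        # Instantiation succeeds even without a token.
        self.token = resolve_bearer_token(twitter_bearer_token)

    def compile(self):
        # Compilation needs Twitter access, so only here do we fail.
        if self.token is None:
            raise RuntimeError("A Twitter bearer token is required to compile.")
```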
v1.7.0
Added
- Now allows setting `twitter_bearer_token=None` in the constructor of `MuminDataset`, in which case the environment variable `TWITTER_API_KEY` is used instead; this can be stored in a separate `.env` file. This is now the default value of `twitter_bearer_token`.
Changed
- Replaced `DataFrame.append` calls with `pd.concat`, as the former is deprecated and will be removed from `pandas` in the future.
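The migration is mechanical; a minimal pandas example of the replacement (the data here is made up):

```python
import pandas as pd

df = pd.DataFrame({"tweet_id": [1, 2]})
new_rows = pd.DataFrame({"tweet_id": [3]})

# Deprecated, removed in pandas 2.0:
# df = df.append(new_rows, ignore_index=True)

# Replacement:
df = pd.concat([df, new_rows], ignore_index=True)
print(df.tweet_id.tolist())  # [1, 2, 3]
```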
v1.6.2
Fixed
- Now removes claims that are only connected to deleted tweets when calling `to_dgl`. This previously caused a bug due to a mismatch between nodes in the dataset (which include deleted ones) and nodes in the DGL graph (which do not contain the deleted ones).
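The filtering idea can be sketched with plain Python sets (a toy illustration with made-up identifiers, not mumin's actual code):

```python
# Hypothetical claim-tweet edges; 'deleted_tweets' marks tweets that
# are absent from the DGL graph.
edges = [("claim_1", "tweet_1"), ("claim_2", "tweet_2")]
deleted_tweets = {"tweet_2"}

# Keep only claims with at least one surviving tweet connection,
# so dataset nodes and DGL graph nodes stay in sync.
kept_claims = {claim for claim, tweet in edges if tweet not in deleted_tweets}
print(kept_claims)  # {'claim_1'}
```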
v1.6.1
Fixed
- Now correctly catches `JSONDecodeError` during rehydration.
v1.6.0
Changed
- Changed the download link from Git-LFS to the official data.bris data repository, with URI https://doi.org/10.5523/bris.23yv276we2mll25fjakkfim2ml.
v1.5.0
Changed
- Now using dicts rather than Series in `to_dgl`. This improved the wall time from 1.5 hours to 2 seconds!
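The speed-up comes from repeated scalar lookups being much cheaper on a plain dict than on a pandas Series; a small sketch of the pattern (illustrative, not mumin's actual code):

```python
import pandas as pd

series = pd.Series({"tweet_1": 0, "tweet_2": 1})

# Converting once up front turns millions of subsequent single-key
# lookups into plain hash-table accesses, bypassing pandas indexing.
mapping = series.to_dict()

assert mapping["tweet_2"] == series["tweet_2"] == 1
```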
Fixed
- There was a bug in the call to `dgl.data.utils.load_graphs` that caused `load_dgl_graph` to fail. This is fixed now.
v1.4.1
Changed
- Now only saves the dataset at the end of `add_embeddings` if any embeddings were added.
v1.4.0
Added
- The `to_dgl` method is now parallelised, speeding up export significantly.
- Added convenience functions `save_dgl_graph` and `load_dgl_graph`, which store the Boolean train/val/test masks as unsigned 8-bit integers and handle the conversion. Using the `dgl`-native `save_graphs` and `load_graphs` causes an error, as they cannot handle Boolean tensors. These two convenience functions can be imported simply as `from mumin import save_dgl_graph, load_dgl_graph`.
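The round-trip these helpers implement can be sketched with NumPy (the real functions operate on DGL graphs with torch tensors; this only shows the conversion idea):

```python
import numpy as np

train_mask = np.array([True, False, True])

# Store as uint8, since the serialisation format rejects Boolean tensors...
stored = train_mask.astype(np.uint8)

# ...and convert back to Boolean after loading.
restored = stored.astype(bool)
assert (restored == train_mask).all()
```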