Releases: MuMiN-dataset/mumin-build
Releases · MuMiN-dataset/mumin-build
v0.3.0
Fixed
- Now catches
SSLError
andOSError
when processing images. - Now catches
ReadTimeoutError
when processing articles. - The
(:Tweet)-[:MENTIONS]->(:User)
was missing in the dataset. It has now
been added back in. - Added tokenizer truncation when adding node embeddings.
- Fixed an issue with embedding user descriptions when the description is not
available.
Changed
- Changed the download link to the dataset, which now fetches the dataset from
a specific commit, enabling proper dataset versioning. - Changed the timeout parameter when downloading images from five seconds to
ten seconds. - Now processing 50 articles and images on each worker, compared to the
previous 5. - When loading in an existing dataset, auxilliaries and islands are removed.
This ensures thatto_dgl
works properly.
Removed
- Removed the review warning from the
README
and when initialising the
dataset. The dataset is still not complete, in the sense that we will add
retweets and timelines, but we will instead just keep versioning the dataset
until we have included these extra features.
v0.2.0
Added
- Added claim embeddings to Claim nodes, being the transformer embeddings of
the claims translated to English, as described in the paper. - Added train/val/test split to claim nodes. When exporting to DGL using the
to_dgl
method, the Claim and Tweet nodes will havetrain_mask
,val_mask
andtest_mask
attributes that can be used to control loss and metric
calculation. These are consistent, meaning that tweets connected to claims
will always belong to the same split. - Added labels to both Tweet and Claim nodes.
Fixed
- Properly embeds reviewers of claims in case a claim has been reviewed by
multiple reviewers. - Load claim embeddings properly.
- Catches
TooManyRequests
exception when extracting images. - Load dataset CSVs with Python engine, as the C engine caused errors.
- Disable tokenizer parallelism, which caused warning messages during
rehydration of tweets. - Ensure proper quoting of strings when dumping dataset to CSVs.
- Enable truncation of strings before tokenizing, when embedding texts.
- Convert masks to integers, which caused an issue when exporting to a DGL
graph. - Bug when computing reviewer embeddings for claims.
- Now properly shows
compiled=True
when printing the dataset, after
compilation.
Changed
- Changed disclaimer about review period.
v0.1.4
Fixed
- Include
(:User)-[:POSTED]->(:Reply)
in the dataset, extracted from the
rehydrated reply and quote tweets.
v0.1.3
Fixed
- Compilation error when including images.
- Only include videos if they are present in the dataset.
- Ensure that article embeddings can properly be converted to PyTorch tensors
when exporting to DGL.
v0.1.2
Fixed
- The replies were not reduced correctly when the
small
ormedium
variants
of the dataset was compiled. - The reply features were not filtered and renamed properly, to keep them
consistent with the tweet nodes. - Users without any description now gets assigned a zero vector as their
description embedding. - If a relation does not have any node pairs then do not try to create a
corresponding DGL relation. - Reset
nodes
andrels
attributes when loading dataset. - Add embeddings for
Reply
nodes.
v0.1.1
Changed
- Changed installation instructions in readme to
pip install mumin
.
v0.1.0
Added
- First release, including a
MuminDataset
class that can compile the dataset
dump the compiled dataset to localcsv
files, and export it as adgl
graph.