- Fix issues using `MAKE_NUMERIC_IDENTIFIER` instead of `MAKE_NUMERIC_IDENTIFIER_UDL` on GCC 7.1.1.
- Work around (what we assume is) a bug on MSYS2 where `cmake` would link in additional exception handling libraries that would cause a crash during indexing by building the `mman-win32` library as shared.
- Silence fallthrough warnings on Clang from `murmur_hash`.
- Add an optional `xz{i,o}fstream` to `meta::io` if compiled with liblzma available.
- `util::disk_vector<const T>` can now be used to specify a read-only view of a disk-backed vector.
- `ir_eval::print_stats` now takes a `num_docs` parameter to properly display evaluation metrics at a certain cutoff point, which was always 5 beforehand. This fixes a bug in `query-runner` where the stats were not being computed according to the cutoff point specified in the configuration.
- `ir_eval::avg_p` now correctly stops computing after `num_docs`. Before, if you specified `num_docs` as a smaller value than the size of the result list, it would erroneously keep calculating until the end of the result list instead of stopping after `num_docs` elements.
- `{inverted,forward}_index` can now be loaded from read-only filesystems.
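
The `num_docs` cutoff behavior described for `ir_eval::avg_p` can be illustrated with a small standalone sketch. This is not MeTA's implementation: the function name `average_precision_at`, the boolean relevance vector, and the choice of denominator (the smaller of the cutoff and the number of relevant documents) are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of MeTA: average precision over the first
// `num_docs` entries of a ranked list. `relevant[i]` says whether the
// document at rank i (0-based) is relevant; `total_relevant` is the number
// of relevant documents known for the query.
double average_precision_at(const std::vector<bool>& relevant,
                            std::size_t total_relevant,
                            std::size_t num_docs)
{
    assert(num_docs > 0);
    if (total_relevant == 0)
        return 0.0;

    // Stop at the cutoff even when the result list is longer.
    const std::size_t limit = std::min(num_docs, relevant.size());

    double sum = 0.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < limit; ++i)
    {
        if (relevant[i])
        {
            ++hits;
            sum += static_cast<double>(hits) / static_cast<double>(i + 1);
        }
    }

    // Assumed denominator: no more than `num_docs` relevant documents can
    // appear above the cutoff.
    return sum / static_cast<double>(std::min(num_docs, total_relevant));
}
```

Calling `average_precision_at(relevant, total_relevant, 2)` on a four-element result list only inspects the first two ranks, which is the behavior the fix above describes.
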
- Add an `embedding_analyzer` that represents documents with their averaged word vectors.

- Add a `parallel::reduction` algorithm designed for parallelizing complex accumulation operations (like an E step in an EM algorithm).

- Parallelize feature counting in feature selector using the new `parallel::reduction`.

- Add a `parallel::for_each_block` algorithm to run functions on (relatively) equal sub-ranges of an iterator range in parallel.

- Add a parallel merge sort as `parallel::sort`.

- Add a `util/traits.h` header for general useful traits.

- Add a Markov model implementation in `sequence::markov_model`.

- Add a generic unsupervised HMM implementation. This implementation supports HMMs with discrete observations (what is used most often) and sequence observations (useful for log mining applications). The forward-backward algorithm is implemented using both the scaling method and the log-space method. The scaling method is used by default, but the log-space method is useful for HMMs with sequence observations to avoid underflow issues when the output probabilities themselves are very small.

- Add the KL-divergence retrieval function using pseudo-relevance feedback with the two-component mixture-model approach of Zhai and Lafferty, called `kl_divergence_prf`. This ranker internally can use any `language_model_ranker` subclass like `dirichlet_prior` or `jelinek_mercer` to perform the ranking of the feedback set and the result documents with respect to the modified query.

  The EM algorithm used for the two-component mixture model is provided as the `index::feedback::unigram_mixture` free function and returns the feedback model.

- Add the Rocchio algorithm (`rocchio`) for pseudo-relevance feedback in the vector space model.

- Breaking Change. To facilitate the above two changes, we have also broken the `ranker` hierarchy into one more level. At the top we have `ranker`, which has a pure virtual function `rank()` that can be overridden to provide entirely custom ranking behavior. This is the class the KL-divergence and Rocchio methods derive from, as we need to re-define what it means to rank documents (first retrieving a feedback set, then ranking documents with respect to an updated query).

  Most of the time, however, you will want to derive from the second level `ranking_function`, which is what was called `ranker` before. This class provides a definition of `rank()` to perform document-at-a-time ranking, and expects deriving classes to instead provide `initial_score()` and `score_one()` implementations to define the scoring function used for each document. Existing code that derived from `ranker` prior to this version of MeTA likely needs to be changed to instead derive from `ranking_function`.

- Add the `util::transform_iterator` class and `util::make_transform_iterator` function for providing iterators that transform their output according to a unary function.

- Breaking Change. `whitespace_tokenizer` now emits only word tokens by default, suppressing all whitespace tokens. The old default was to emit tokens containing whitespace in addition to actual word tokens. The old behavior can be obtained by passing `false` to its constructor, or setting `suppress-whitespace = false` in its configuration group in `config.toml`. (Note that whitespace tokens are still needed if using a `sentence_boundary` filter but, in nearly all circumstances, `icu_tokenizer` should be preferred.)

- Breaking Change. Co-occurrence counting for embeddings now uses history that crosses sentence boundaries by default. The old behavior (clearing the history when starting a new sentence) can be obtained by ensuring that a tokenizer is being used that emits sentence boundary tags and by setting `break-on-tags = true` in the `[embeddings]` table of `config.toml`.

- Breaking Change. All references in the embeddings library to "coocur" have changed to "cooccur". This means that some files and binaries have been renamed. Much of the co-occurrence counting part of the embeddings library has also been moved to the public API.

- Co-occurrence counting now is performed in parallel. Behavior of its merge strategy can be configured with the new `[embeddings]` config parameter `merge-fanout = n`, which specifies the maximum number of on-disk chunks to allow before kicking off a multi-way merge (default 8).
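
The HMM entry above mentions a log-space variant of the forward-backward algorithm for avoiding underflow. The following is a minimal sketch of the forward pass in log space for a discrete HMM. It is not MeTA's HMM code: the names `log_add` and `log_forward`, the `matrix` alias, and the parameter layout (row-major log-probability tables) are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// log(exp(a) + exp(b)) computed without leaving log space.
double log_add(double a, double b)
{
    const double neg_inf = -std::numeric_limits<double>::infinity();
    if (a == neg_inf)
        return b;
    if (b == neg_inf)
        return a;
    const double hi = std::max(a, b);
    const double lo = std::min(a, b);
    return hi + std::log1p(std::exp(lo - hi));
}

using matrix = std::vector<std::vector<double>>;

// Log-likelihood of the observation sequence `obs` under a discrete HMM.
// All parameters are log-probabilities (hypothetical layout):
//   log_init[i]     initial log-probability of state i
//   log_trans[i][j] log-probability of moving from state i to state j
//   log_emit[i][o]  log-probability of emitting symbol o in state i
double log_forward(const std::vector<double>& log_init,
                   const matrix& log_trans,
                   const matrix& log_emit,
                   const std::vector<std::size_t>& obs)
{
    assert(!obs.empty());
    const std::size_t n = log_init.size();
    const double neg_inf = -std::numeric_limits<double>::infinity();

    // alpha[i] = log P(obs[0..t], state_t = i)
    std::vector<double> alpha(n);
    for (std::size_t i = 0; i < n; ++i)
        alpha[i] = log_init[i] + log_emit[i][obs[0]];

    for (std::size_t t = 1; t < obs.size(); ++t)
    {
        std::vector<double> next(n, neg_inf);
        for (std::size_t j = 0; j < n; ++j)
        {
            for (std::size_t i = 0; i < n; ++i)
                next[j] = log_add(next[j], alpha[i] + log_trans[i][j]);
            next[j] += log_emit[j][obs[t]];
        }
        alpha.swap(next);
    }

    double total = neg_inf;
    for (std::size_t i = 0; i < n; ++i)
        total = log_add(total, alpha[i]);
    return total;
}
```

Products of probabilities become sums of logs, so very small output probabilities never underflow to zero; the cost is one `exp` and one `log1p` per addition, which is consistent with the scaling method being the default.
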
- Add additional `packed_write` and `packed_read` overloads: for `std::pair`, `stats::dirichlet`, `stats::multinomial`, `util::dense_matrix`, and `util::sparse_vector`
- Additional functions have been added to `ranker_factory` to allow construction/loading of language_model_ranker subclasses (useful for the `kl_divergence_prf` implementation)
- Add a `util::make_fixed_heap` helper function to simplify the declaration of `util::fixed_heap` classes with lambda function comparators.
- Add regression tests for rankers MAP and NDCG scores. This adds a new dataset `cranfield` that contains non-binary relevance judgments to facilitate these new tests.
- Bump bundled version of ICU to 58.2.
- Fix bug in NDCG calculation (ideal-DCG was computed using the wrong sorting order for non-binary judgments)
- Fix bug where the final chunks to be merged in index creation were not being deleted when merging completed
- Fix bug where GloVe training would allocate the embedding matrix before starting the shuffling process, causing it to exceed the "max-ram" config parameter.
- Fix bug with consuming MeTA from a build directory with `cmake` when building a static ICU library. `meta-utf` is now forced to be a shared library, which (1) should save on binary sizes and (2) ensures that the statically built ICU is linked into the `libmeta-utf.so` library to avoid undefined references to ICU functions.
- Fix bug with consuming Release-mode MeTA libraries from another project being built in Debug mode. Before, `identifiers.h` would change behavior based on the `NDEBUG` macro's setting. This behavior has been removed, and opaque identifiers are always on.
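
The NDCG fix listed above concerns the ordering used for the ideal DCG when judgments are non-binary. A minimal sketch of the idea follows. It is not MeTA's `ir_eval` code: the gain `2^rel - 1`, the `log2(rank + 1)` discount, the function names, and the simplification of building the ideal ordering from the retrieved list itself are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// DCG of graded relevance judgments given in rank order.
// Assumed formulation: gain 2^rel - 1, discount log2(rank + 1), ranks from 1.
double dcg(const std::vector<double>& rels)
{
    double total = 0.0;
    for (std::size_t i = 0; i < rels.size(); ++i)
    {
        const double gain = std::exp2(rels[i]) - 1.0;
        const double discount = std::log2(static_cast<double>(i) + 2.0);
        total += gain / discount;
    }
    return total;
}

// NDCG of a ranked list. Simplification: the ideal list is the same set of
// judgments re-sorted, rather than all judgments known for the query.
double ndcg(const std::vector<double>& ranked_rels)
{
    // The ideal ordering places the highest relevance first, so the sort
    // must be descending. An ascending sort would understate the ideal DCG
    // whenever judgments take more than two values.
    std::vector<double> ideal = ranked_rels;
    std::sort(ideal.begin(), ideal.end(), std::greater<double>());

    const double best = dcg(ideal);
    return best > 0.0 ? dcg(ranked_rels) / best : 0.0;
}
```

A list that is already in descending relevance order scores exactly 1 under this sketch, and any other ordering of distinct judgments scores less.
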
- `disk_index::doc_name` and `disk_index::doc_path` have been deprecated in favor of the more general (and less confusing) `metadata()`. They will be removed in a future major release.
- Support for 32-bit architectures is provided on a best-effort basis. MeTA makes heavy use of memory mapping, which is best paired with a 64-bit address space. Please move to a 64-bit platform for using MeTA if at all possible (most consumer machines should support 64-bit if they were made in the last 5 years or so).
- Properly shuffle documents when doing an even-split classification test
- Make forward indexer listen to `indexer-num-threads` config option.
- Use correct number of threads when deciding block sizes for `parallel_for`
- Add workaround to `filesystem::remove_all` for Windows systems to avoid spurious failures caused by virus scanners keeping files open after we deleted them
- Fix invalid memory access in `gzstreambuf::underflow`
- Eliminate excess warnings on Darwin about double preprocessor definitions
- Fix issue finding `config.h` when used as a sub-project via add_subdirectory()
- Add a minimal perfect hashing implementation for `language_model`, and unify the querying interface with the existing language model.

- Add a CMake `install()` command to install MeTA as a library (issue #143). For example, once the library is installed, users can do:

  ```cmake
  find_package(MeTA 2.4 REQUIRED)

  add_executable(my-program src/my_program.cpp)
  target_link_libraries(my-program meta-index) # or whatever other libs you need from MeTA
  ```

- Feature selection functionality added to `multiclass_dataset` and `binary_dataset` and views (issues #111, #149 and PR #150 thanks to @siddshuk).

  ```cpp
  auto selector = features::make_selector(*config, training_vw);
  uint64_t total_features_selected = 20;
  selector->select(total_features_selected);
  auto filtered_dset = features::filter_dataset(dset, *selector);
  ```

- Users can now, similar to `hash_append`, declare standalone functions in the same scope as their type called `packed_read` and `packed_write` which will be called by `io::packed::read` and `io::packed::write`, respectively, via argument-dependent lookup.
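
The argument-dependent lookup hook for `packed_write` can be sketched without MeTA itself. In the example below, `demo_io::packed` is a hypothetical stand-in for `io::packed`, and the 7-bits-per-byte integer encoding, the `geo::point` type, and the `encode_point` helper are assumptions for illustration; MeTA's actual packed format and signatures may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>

namespace demo_io
{
namespace packed
{
// Stand-in integer writer: 7 payload bits per byte, high bit set on every
// byte except the last. Returns the number of bytes written.
inline std::uint64_t write(std::ostream& os, std::uint64_t value)
{
    std::uint64_t bytes = 0;
    do
    {
        unsigned char byte = static_cast<unsigned char>(value & 0x7f);
        value >>= 7;
        if (value != 0)
            byte |= 0x80;
        os.put(static_cast<char>(byte));
        ++bytes;
    } while (value != 0);
    return bytes;
}

// Generic entry point for class types: the unqualified call below is
// resolved by argument-dependent lookup in the namespace of T.
template <class T,
          class = typename std::enable_if<std::is_class<T>::value>::type>
std::uint64_t write(std::ostream& os, const T& value)
{
    return packed_write(os, value);
}
}
}

namespace geo
{
struct point
{
    std::uint64_t x;
    std::uint64_t y;
};

// Standalone function declared next to the type; found via ADL.
inline std::uint64_t packed_write(std::ostream& os, const point& p)
{
    return demo_io::packed::write(os, p.x) + demo_io::packed::write(os, p.y);
}
}

// Serialize a point into an in-memory buffer.
std::string encode_point(const geo::point& p)
{
    std::ostringstream out;
    demo_io::packed::write(out, p);
    return out.str();
}
```

The library-side template never names `geo`; it only issues an unqualified call, so any type that ships a `packed_write` in its own namespace becomes writable without editing the library.
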
- Fix edge-case bug in the succinct data structures
- Fix off-by-one error in `lm::diff`
- Added functionality to the `meta::hashing` library: `hash_append` overload for `std::vector`, manually-seeded hash function
- Further isolate ICU in MeTA to allow CMake to `install()`
- Updates to EWS (UIUC) build guide
- Add `std::vector` operations to `io::packed`
- Consolidated all variants of chunk iterators into one template
- Add MeTA's citation to the README!
- Forward and inverted indexes are now stored in one directory. To make use of your existing indexes, you will need to move their directories. For example, a configuration that used to look like the following

  ```toml
  dataset = "20newsgroups"
  corpus = "line.toml"
  forward-index = "20news-fwd"
  inverted-index = "20news-inv"
  ```

  will now look like the following

  ```toml
  dataset = "20newsgroups"
  corpus = "line.toml"
  index = "20news-index"
  ```

  and your folder structure should now look like

  ```
  20news-index
  ├── fwd
  └── inv
  ```

  You can do this by simply moving the old folders around like so:

  ```bash
  mkdir 20news-index
  mv 20news-fwd 20news-index/fwd
  mv 20news-inv 20news-index/inv
  ```

- `stats::multinomial` now can report the number of unique event types counted (`unique_events()`)

- `std::vector` can now be hashed via `hash_append`.
- Fix rounding bug in language model-based rankers. This bug caused severely degraded performance for these rankers with short queries. The unit tests have been improved to prevent such a regression in the future.
- The bundled ICU version has been bumped to ICU 57.1.
- MeTA will now attempt to build its own version of ICU on Windows if it fails to find a suitable ICU installed.
- CI support for GCC 6.x was added for all three major platforms.
- CI support also uses a fixed version of LLVM/libc++ instead of trunk.
- Parallelized versions of PageRank and Personalized PageRank have been added. A demo is available in `wiki-page-rank`; see the website for more information on obtaining the required data.
- Add a disk-based streaming minimal perfect hash function library. A sub-component of this is a small memory-mapped succinct data structure library for answering rank/select queries on bit vectors.
- Much of our CMake magic has been moved into a separate project included as a submodule: https://github.com/meta-toolkit/meta-cmake, which can now be used in other projects to simplify initial build system configuration.
- Fix parameter settings in language model rankers not being range checked (issue #134).
- Fix incorrect incoming edge insertion in `directed_graph::add_edge()`.
- Fix `find_first_of` and `find_last_of` in `util::string_view`.
- `forward_index` now knows how to tokenize a document down to a `feature_vector`, provided it was generated with a non-LIBSVM analyzer.
- Allow loading of an existing index where its corpus is no longer available.
- Data is no longer shuffled in `batch_train`. Shuffling the data causes horrible access patterns in the postings file, so the data should instead be shuffled before indexing.
- `util::array_view`s can now be constructed as empty.
- `util::multiway_merge` has been made more generic. You can now specify both the comparison function and merging criteria as parameters, which default to `operator<` and `operator==`, respectively.
- Simple utility classes `io::mifstream` and `io::mofstream` have been added for places where a moveable `ifstream` or `ofstream` is desired as a workaround for older standard libraries lacking these move constructors.
- The number of indexing threads can be controlled via the configuration key `indexer-num-threads` (which defaults to the number of threads on the system), and the number of threads allowed to concurrently write to disk can be controlled via `indexer-max-writers` (which defaults to 8).
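
The `util::multiway_merge` entry above describes a merge parameterized by a comparison function and a merging criterion. The sketch below shows one way such a merge can work over in-memory sorted chunks. It is not MeTA's implementation or signature: the `record` type, the function name `multiway_merge_sketch`, and the rule of summing counts when two records should merge are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type: a key with an associated count.
struct record
{
    std::string key;
    std::size_t count;
};

// Merges already-sorted chunks into one sorted output. `less` orders
// records; `should_merge` decides whether two adjacent records are combined.
template <class Less, class ShouldMerge>
std::vector<record>
multiway_merge_sketch(const std::vector<std::vector<record>>& chunks,
                      Less less, ShouldMerge should_merge)
{
    // A cursor is (chunk index, position within that chunk).
    using cursor = std::pair<std::size_t, std::size_t>;
    auto at = [&](const cursor& c) -> const record& {
        return chunks[c.first][c.second];
    };
    // std::priority_queue is a max-heap, so the comparison is inverted to
    // pop the smallest record first.
    auto later = [&](const cursor& a, const cursor& b) {
        return less(at(b), at(a));
    };
    std::priority_queue<cursor, std::vector<cursor>, decltype(later)> heap(
        later);

    for (std::size_t i = 0; i < chunks.size(); ++i)
        if (!chunks[i].empty())
            heap.push(cursor(i, 0));

    std::vector<record> out;
    while (!heap.empty())
    {
        const cursor c = heap.top();
        heap.pop();
        const record& r = at(c);
        if (!out.empty() && should_merge(out.back(), r))
            out.back().count += r.count; // assumed combine rule
        else
            out.push_back(r);
        if (c.second + 1 < chunks[c.first].size())
            heap.push(cursor(c.first, c.second + 1));
    }
    return out;
}
```

Passing key-based `<` and `==` lambdas reproduces the defaults the entry mentions; supplying different callables changes the output order or the grouping without touching the merge loop.
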
- Add the GloVe algorithm for training word embeddings and a library class `word_embeddings` for loading and querying trained embeddings. To facilitate returning word embeddings, a simple `util::array_view` class was added.
- Add simple vector math library (and move `fastapprox` into the `math` namespace).
- Fix `probe_map::extract()` for `inline_key_value_storage` type; old implementation forgot to delete all sentinel values before returning the vector.
- Fix incorrect definition of `l1norm()` in `sgd_model`.
- Fix `gmap` calculation where 0 average precision was ignored
- Fix progress output in `multiway_merge`.
- Improve performance of `printing::progress`. Before, `progress::operator()` in tight loops could dramatically hurt performance, particularly due to frequent calls to `std::chrono::steady_clock::now()`. Now, `progress::operator()` simply sets an atomic iteration counter and a background thread periodically wakes to update the progress output.
- Allow full text storage in index as metadata field. If `store-full-text = true` (default false) in the corpus config, the string metadata field "content" will be added. This is to simplify the creation of full text metadata: the user doesn't have to duplicate their dataset in `metadata.dat`, and `metadata.dat` will still be somewhat human-readable without large strings of full text added.
- Allow `make_index` to take a user-supplied corpus object.
- ZLIB is now a required dependency.
- Switch to just using the standalone `./unit-test` instead of `ctest`. There aren't really many advantages for us to using CTest at this point with the new unit test framework, so just use our unit test executable.
- Fix issue where `metadata_parser` would not consume spaces in string metadata fields. Thanks to @hopsalot on the forum for the bug report!
- Fix build issue on OS X with Xcode 6.4 and `clang` related to their shipped version of `string_view` lacking a const `to_string()` method
- The `./profile` executable ensures that the file exists before operating on it. Thanks to @domarps for the PR!
- Add a generic `util::multiway_merge` algorithm for performing the merge-step of an external memory merge sort.
- Build with the following Xcode versions on Travis CI:
    - Xcode 6.1 and OS X 10.9 (as before)
    - Xcode 6.4 and OS X 10.10 (new)
    - Xcode 7.1.1 and OS X 10.10 (new)
    - Xcode 7.2 and OS X 10.11 (new)
- Index format rewrite: both inverted and forward indices now use the same compressed postings format, and intermediate chunks are now also compressed on-the-fly. There is now a built in tool to dump any forward index to libsvm format (as this is not the on-disk format for that type of index anymore).
- Metadata support: indices can now store arbitrary metadata associated with individual documents with string, integer, unsigned integer, and floating point values
- Corpus configuration is now stored within the corpus directory itself, allowing for corpora to be distributed with their proper configurations rather than having to bake this into the main configuration file
- RAM limits can be set for the indexing process via the configuration file. These are approximate and based on heuristics, so you should always set these to lower than available RAM.
- Forward indices can now be created directly instead of forcing the creation of an inverted index first
- ICU will be built and statically linked if the system provided library is too old on both OS X and Linux platforms. MeTA now will specify an exact version of ICU that should be used per release for consistency. That version is 56.1 as of this release.
- Analyzers have been modified to support both integral and floating point values via the use of the `featurizer` object passed to `tokenize()`
- Documents no longer store any count information during the analysis process
- Postings lists can now be read in a streaming fashion rather than all at once via `postings_stream`
- Ranking is now performed using a document-at-a-time scheme
- Ranking functions now use fast approximate math from fastapprox
- Rank correlation measures have been added to the evaluation library
- Rewrite of the language model library which can load models from the .arpa format
- SyntacticDiff implementation for comparative text mining, which may include grammatical error correction, summarization, or feature generation
- A feature selection library for selecting features for machine learning using chi square, information gain, correlation coefficient, and odds ratio has been added
- The API for the machine learning algorithms has been changed to use `dataset` classes; these are separate from the index classes and represent data that is memory-resident
- Support for regression has been added (currently only via SGD)
- The SGD algorithm has been improved to use a normalized adaptive gradient method which should make it less sensitive to feature scaling
- The SGD algorithm now supports (approximate) L1 regularization via a cumulative penalty approach
- The libsvm modules are now also built using CMake
- Packed binary I/O functions allow for writing integers/floating point values in a compressed format that can be efficiently decoded. This should be used for most binary I/O that needs to be performed in the toolkit unless there is a specific reason not to.
- An interactive demo application has been added for the shift-reduce constituency parser
- A `string_view` class is provided in the `meta::util` namespace to be used for non-owning references to strings. This will use `std::experimental::string_view` if available and our own implementation if not
- `meta::util::optional` will resolve to `std::experimental::optional` if it is available
- Support for jemalloc has been added to the build system. We strongly recommend installing and linking against jemalloc for improved indexing performance.
- A tool has been added to print out the top k terms in a corpus
- A new library for hashing has been added in namespace `meta::hashing`. This includes a generic framework for writing hash functions that are randomly keyed as well as (insertion only) probing-based hash sets/maps with configurable resizing and probing strategies
- A utility class `fixed_heap` has been added for places where a fixed size set of maximal/minimal values should be maintained in constant space
- The filesystem management routines have been converted to use STLsoft in the event that the filesystem library in `std::experimental::filesystem` is not available
- Building MeTA on Windows is now officially supported via MSYS2 and MinGW-w64, and continuous integration now builds it on every commit in this environment
- A small support library for things related to random number generation has been added in `meta::random`
- Sparse vectors now support `operator+` and `operator-`
- An STL container compatible allocator `aligned_allocator<T, Alignment>` has been added that can over-align data (useful for performance in some situations)
- Bandit is now used for the unit tests, and these have been substantially improved upon
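
The idea behind the `fixed_heap` entry above, keeping only the best k values seen so far in bounded space, can be sketched with the standard heap algorithms. The class below is a hypothetical stand-in named `bounded_top_k`; its interface is an assumption for illustration and is not MeTA's `fixed_heap` API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Keeps the k "best" values pushed so far, where comp(a, b) == true means
// that a ranks ahead of b. With the default std::less this keeps the k
// smallest values; with std::greater it keeps the k largest.
template <class T, class Compare = std::less<T>>
class bounded_top_k
{
  public:
    explicit bounded_top_k(std::size_t k, Compare comp = Compare())
        : k_(k), comp_(comp)
    {
        assert(k > 0);
        heap_.reserve(k);
    }

    void push(const T& value)
    {
        if (heap_.size() < k_)
        {
            heap_.push_back(value);
            std::push_heap(heap_.begin(), heap_.end(), comp_);
        }
        else if (comp_(value, heap_.front()))
        {
            // The heap front is the worst value currently kept; replace it
            // when the new value ranks ahead of it.
            std::pop_heap(heap_.begin(), heap_.end(), comp_);
            heap_.back() = value;
            std::push_heap(heap_.begin(), heap_.end(), comp_);
        }
    }

    std::size_t size() const
    {
        return heap_.size();
    }

    // Returns the kept values, best first, and leaves the container empty.
    std::vector<T> extract_sorted()
    {
        std::sort_heap(heap_.begin(), heap_.end(), comp_);
        std::vector<T> result;
        result.swap(heap_);
        return result;
    }

  private:
    std::size_t k_;
    Compare comp_;
    std::vector<T> heap_;
};
```

Storage never grows past k elements, and each push costs O(log k), which suits top-k selection over a long stream of scored items.
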
- `io::parser` deprecated and removed; most uses simply converted to `std::fstream`
- `binary_file_{reader,writer}` deprecated and removed; `io::packed` or `io::{read,write}_binary` should be used instead
- knn classifier now only requests the top k when performing classification
- Fix an issue where uncompressed model files would not be found if using a zlib-enabled build (#101)
- Travis CI integration has been switched to their container infrastructure, and it now builds with OS X with Clang in addition to Linux with Clang and GCC
- Appveyor CI for Windows builds alongside Travis
- Indexing speeds are dramatically faster (thanks to many changes both in the in-memory posting chunks as well as optimizations in the tokenization process)
- If no build type is specified, MeTA will be built in Release mode
- The cpptoml dependency version has been bumped, allowing the use of things like `value_or` for cleaner code
- The identifiers library has been dramatically simplified
- Fix issue with `confusion_matrix` where precision and recall values were swapped. Thanks to @husseinhazimeh for finding this!
- Better unit tests for `confusion_matrix`
- Add functions to `confusion_matrix` to directly access precision, recall, and F1 score
- Create a `predicted_label` opaque identifier to emphasize `class_labels` that are output from some model (and thus shouldn't be interchangeable)
- Fix inconsistent behavior of `utf::segmenter` (and thus `icu_tokenizer`) for different locales. Thanks @CanoeFZH and @tng-konrad for helping debug this!
- Allow for specifying the language and country for locale generation in setting up `utf::segmenter` (and thus `icu_tokenizer`)
- Allow for suppression of `<s>` and `</s>` tags within `icu_tokenizer`, mostly useful for information retrieval experiments with unigram words. Thanks @husseinhazimeh for the suggestion!
- Add a `default-unigram-chain` filter chain preset which is suitable for information retrieval experiments using unigram words. Thanks @husseinhazimeh for the suggestion!
- Fix potential off-by-one when calculating the number of documents in a `line_corpus` when its files do not end in a newline
- Change `score_data` to support floating-point weights on query terms
- Fix missing support for sequence/parser analyzers in the classify tools
- Support building with biicode
- Add Vagrantfile for virtual machine configuration
- Add Dockerfile for Docker support
- Improve `ir_eval` unit tests
- Fix `ir_eval::ndcg` incorrect log base and addition instead of subtraction in IDCG calculation
- Fix `ir_eval::avg_p` incorrect early termination
- Fix issues with system-defined integer widths in binary model files (mainly impacted the greedy tagger and parser); please re-download any parser model files you may have had before
- Fix bug where parser model directory is not created if a non-standard prefix is used (anything other than "parser")
- Silence inconsistent missing overrides warning on clang >= 3.6
- fix potentially incorrect generation of vocabulary map files on 32-bit systems (this appears to have only impacted non-default block sizes)
- fix calculation of average precision in `ir_eval` (the denominator was incorrect)
- specify that labels are required for the `file_corpus` document list; this allows spaces in the path to each document
- additions to the graph library:
    - myopic search
    - BFS
    - preferential attachment graph generation model (supports node attractiveness from different distributions)
    - betweenness centrality
    - eigenvector centrality
- added a new natural language parsing library:
    - parse tree library (visitor-based)
    - shift-reduce constituency parser for generating phrase structure trees
    - reimplementation of evalb metrics for evaluating parsers
    - new filter for Penn Treebank-style normalization
- added a greedy averaged Perceptron-based tagger
- demo application for various basic text processing (profile)
- basic iostreams that support gzip compression (if compiled with ZLib support)
- added iteration method for `stats::multinomial` seen events
- added expected value and entropy functions to `stats` namespace
- added `linear_model`: a generic multiclass classifier storage class
- added `gz_corpus`: a compressed version of `line_corpus`
- added macros for generating type safe identifiers with user defined literal suffixes
- added a persistent stack data structure to `meta::util`
- added operator== for `util::optional<T>`
- better CMake support for building the libsvm modules
- better CMake support for downloading unit-test data
- improved setup guide in README (for OS X, Ubuntu, Arch, and EWS/ENGRIT)
- tree analyzers refactored to use the new parser library (removes dependency on outside toolkits for generating tree files)
- analyzers that are not part of the "core" have been moved into their respective folders (so `ngram_pos_analyzer` is in `src/sequence`, `tree_analyzer` is in `src/parser`)
- `make_index` now checks if the files exist before loading an index, and if they are missing creates a new one (as opposed to just throwing an exception on a nonexistent file)
- cpptoml upgraded to support TOML v0.4.0
- enable extra warnings (-Wextra) for clang++ and g++
- fix `sequence_analyzer::analyze() const` when applied to untagged sequences (was throwing when it shouldn't)
- ensure that the inverted index object is destroyed first before uninverting occurs in the creation of a `forward_index`
- fix bug where `icu_tokenizer` would output spaces as tokens
- fix bugs where index objects were not destroyed before trying to delete their files in the unit tests
- fix bug in `sparse_vector::find()` where it would return a non-end iterator when asked to find an element that does not exist
- demo application for CRF-based POS tagging
- `nearest_centroid` classifier
- basic statistics library for representing relevant probability distributions
- `sparse_vector` utility class
- `ngram_pos_analyzer` now uses the CRF internally (see issue #46)
- `knn` classifier now supports weighted knn
- `filesystem::copy_file()` no longer hangs without progress reporting with large files
- CMake build system now includes `INTERFACE` targets (better inclusion as a subproject in external projects)
- MeTA can now (optionally) be built with C++14 support
- `language_model_ranker` scoring function corrected (see issue #50)
- `naive_bayes` classifier scoring corrected
- several incorrect instances of `numeric_limits<double>::min()` replaced with the intended `numeric_limits<double>::lowest()`
- fix compilation with versions of ICU < 4.4
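
The `numeric_limits<double>::min()` versus `lowest()` fix listed above comes from a common pitfall: for floating-point types `min()` is the smallest positive normalized value, not the most negative one. The sketch below shows the difference with two hypothetical helper functions; they are written for illustration and are not taken from MeTA.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Largest value in `scores`, using the correct sentinel. lowest() is the
// most negative finite double, so any real score replaces it.
double max_score(const std::vector<double>& scores)
{
    double best = std::numeric_limits<double>::lowest();
    for (double s : scores)
        if (s > best)
            best = s;
    return best;
}

// Same loop with the mistaken sentinel. min() is a tiny positive number,
// so negative scores (such as log-probabilities) never replace it.
double max_score_buggy(const std::vector<double>& scores)
{
    double best = std::numeric_limits<double>::min();
    for (double s : scores)
        if (s > best)
            best = s;
    return best;
}
```

With all-negative inputs the first function returns the true maximum, while the second returns the positive sentinel unchanged.
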
- sequence analyzer and CRF implementation
- basic language model
- basic directed and undirected graphs
- restructure CMakeLists
- Initial release.