This repository contains the assignments we covered in CSCI-599 (Content Detection and Big Data Analytics), taught by Prof. Chris Mattmann.
The course provides an in-depth overview of content detection approaches, metadata cataloging, language detection, and machine translation techniques.
Learn byte-based fingerprints of the data via Byte Frequency Analysis (BFA), Byte Frequency Distribution (BFD) Correlation, Byte Frequency Cross-Correlation (BFC), and File Header Trailer (FHT) analysis. The goal is to implement a set of MIME diversity programs and applications that help in better understanding the unknown types in a rich scientific domain: compute the BFA, BFC, and FHT of these unknown (and other) polar data types from the dataset, and build a system that allows visual interaction with and introspection of the MIME diversity in this dataset. These classifications will improve Tika’s overall ability by suggesting new MIME magic for its database, and will improve techniques for MIME detection in the Big Data present in the TREC-DD-Polar dataset. Read more here.
Demo for MIME Diversity for various MIME types: BFA approach
Demo for MIME Diversity for various MIME types: BFC approach
Demo for MIME Diversity for various MIME types: FHT approach
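As a rough illustration of the byte-based fingerprints mentioned above, the following is a minimal sketch and not the code used in this repository. The function names (`byte_frequency`, `similarity`, `header_trailer`) and the default header/trailer length of 4 are assumptions made for this example. The similarity score is a deliberately simple stand-in (one minus the mean absolute difference), not the correlation-strength formula from the BFA literature.

```python
def byte_frequency(data):
    """Return a 256-entry BFA fingerprint, normalized by the most frequent byte."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    peak = max(counts)
    if peak == 0:
        # Empty input: no byte occurs, so the fingerprint is all zeros.
        return [0.0] * 256
    return [count / peak for count in counts]


def similarity(a, b):
    """Simplified score in [0, 1]: 1 minus the mean absolute difference.

    Assumption: this replaces the correlation-strength function used in
    published BFA work with something easier to read.
    """
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / 256


def header_trailer(data, length=4):
    """Return (header, trailer) FHT matrices of shape length x 256.

    Row i has a 1 in the column of the byte found at position i.
    Positions missing from short files are left as all-zero rows.
    """
    header = [[0] * 256 for _ in range(length)]
    trailer = [[0] * 256 for _ in range(length)]
    for i, value in enumerate(data[:length]):
        header[i][value] = 1
    tail = data[max(len(data) - length, 0):]
    for i, value in enumerate(tail):
        trailer[i][value] = 1
    return header, trailer


bfa = byte_frequency(b"aab")
header, trailer = header_trailer(b"%PDF-1.4\n%%EOF", length=4)
```

Averaging these per-file fingerprints over many files of one MIME type gives a per-type fingerprint that an unknown file can be compared against.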
The goal is to significantly enrich the metadata and the automatically extracted text and entities of the TREC Polar Dataset, and to make the dataset easy to relate to and interact with. To do so, you will apply knowledge gained from the lectures on context extraction, metadata, information similarity and clustering, and named entity recognition. Read more.
Demo for Scientific Content Enrichment for the TREC Polar dataset
The goal is to expand the analysis of the TREC-DD-Polar Dataset. Evaluating the efficacy, utility, and overall contribution of your content detection approach is an extremely important and difficult challenge. It raises questions such as: Is my MIME detection good? Are my parsers extracting the right text? Are we selecting the right parser? Is my metadata appropriate? What’s missing? How well is my language detection performing? Are there mixed languages? How good is my machine translation? Do my named entities make sense? Read more here.
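One way to start answering "Is my MIME detection good?" is to compare detected types against a small hand-labeled sample. The sketch below is only an illustration under that assumption: the function name `per_type_precision_recall` and the sample pairs are hypothetical and are not results from the TREC-DD-Polar dataset.

```python
from collections import Counter


def per_type_precision_recall(pairs):
    """Compute (precision, recall) per MIME type.

    pairs: iterable of (expected_type, detected_type) strings.
    """
    true_pos = Counter()
    expected_total = Counter()
    detected_total = Counter()
    for expected, detected in pairs:
        expected_total[expected] += 1
        detected_total[detected] += 1
        if expected == detected:
            true_pos[expected] += 1

    report = {}
    for mime in set(expected_total) | set(detected_total):
        # A type that was never detected (or never expected) scores 0.0.
        precision = true_pos[mime] / detected_total[mime] if detected_total[mime] else 0.0
        recall = true_pos[mime] / expected_total[mime] if expected_total[mime] else 0.0
        report[mime] = (precision, recall)
    return report


# Hypothetical hand-labeled sample: (expected, detected).
sample = [
    ("application/pdf", "application/pdf"),
    ("application/pdf", "application/octet-stream"),
    ("text/html", "text/html"),
    ("image/png", "image/png"),
]
report = per_type_precision_recall(sample)
```

Low recall for a type alongside many `application/octet-stream` detections suggests files that the detector could not identify, which is where new MIME magic is most likely to help.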