Preparatory work toward a pipeline that lets analysts retrieve and query web archives for a given date range and obtain the relevant documents, as assessed by a semi-supervised NLP model.
Example use case: find documents related to the fake letter from the CEO of BlackRock that was sent to several media outlets on January 16, 2019.
Information about this case is available in a 2020 report by the French Financial Markets Authority (Autorité des Marchés Financiers).
Neyret, A. Stock Market Cybercrime. Definition, cases and perspectives. Autorité des Marchés Financiers, 2020.
Common Crawl (https://commoncrawl.org/) provides an open repository of web crawl data from which archives can be accessed. Terms of use are available on the Common Crawl website.
For the example use case, the first experiment focuses on WET archives, which contain the plain text extracted from the crawled pages.
The date range currently selected for the example runs from 15 to 17 January 2019, but it may be extended.
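As a rough illustration of the date-range selection, the sketch below filters a list of WET file paths by the timestamp embedded in each file name. It assumes that the January 2019 crawl is the one labelled CC-MAIN-2019-04, that its path listing (published per crawl as wet.paths.gz) has already been downloaded, and that the first timestamp in a file name is a usable proxy for the capture date. The function name filter_wet_paths and the sample paths are hypothetical.

```python
import re
from datetime import date, datetime
from typing import Iterable, List

# Assumed file-name pattern:
#   CC-MAIN-<YYYYMMDDhhmmss>-<YYYYMMDDhhmmss>-<serial>.warc.wet.gz
_WET_NAME = re.compile(r"CC-MAIN-(\d{14})-(\d{14})-\d+\.warc\.wet\.gz$")


def filter_wet_paths(paths: Iterable[str], start: date, end: date) -> List[str]:
    """Keep the WET paths whose first file-name timestamp falls in [start, end]."""
    kept = []
    for raw in paths:
        path = raw.strip()
        match = _WET_NAME.search(path)
        if match is None:
            continue  # not a WET file name, skip it
        first_day = datetime.strptime(match.group(1), "%Y%m%d%H%M%S").date()
        if start <= first_day <= end:
            kept.append(path)
    return kept


# Illustrative paths only; the segment identifiers are placeholders.
sample_paths = [
    "crawl-data/CC-MAIN-2019-04/segments/0000000000000.00/wet/"
    "CC-MAIN-20190116031807-20190116053807-00000.warc.wet.gz",
    "crawl-data/CC-MAIN-2019-04/segments/0000000000000.00/wet/"
    "CC-MAIN-20190120101010-20190120121212-00001.warc.wet.gz",
]
selected = filter_wet_paths(sample_paths, date(2019, 1, 15), date(2019, 1, 17))
```

Extending the date range then only means changing the two date arguments; the file-name convention should be checked against the actual listing before relying on it.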
After getting the relevant files, the warcio package is used to iterate over their records.
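A minimal sketch of that iteration step is shown below. It assumes a WET file has already been downloaded locally and uses warcio's ArchiveIterator, which reads gzip-compressed archives directly. The helper names decode_wet_payload and iter_wet_documents are hypothetical, and the actual pipeline may structure this differently.

```python
from typing import BinaryIO, Iterator, Tuple


def decode_wet_payload(payload: bytes) -> str:
    """Decode a WET payload as UTF-8, replacing any undecodable bytes."""
    return payload.decode("utf-8", errors="replace").strip()


def iter_wet_documents(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) for each extracted-text record in a WET stream."""
    # Imported here so the decoding helper stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    for record in ArchiveIterator(stream):
        # WET files hold the extracted text in "conversion" records;
        # the leading "warcinfo" record only describes the file itself.
        if record.rec_type != "conversion":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = decode_wet_payload(record.content_stream().read())
        if url and text:
            yield url, text


# Usage, with a hypothetical local file name:
# with open("example.warc.wet.gz", "rb") as stream:
#     for url, text in iter_wet_documents(stream):
#         print(url, len(text))
```

Yielding documents one at a time keeps memory use flat, which matters because a single WET file can hold tens of thousands of records.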
Only documents in English are kept for this case. A topic model is then trained with CorEx to retrieve relevant documents.
Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. "Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge." Transactions of the Association for Computational Linguistics (TACL), 2017.
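The two steps above (language filtering, then anchored topic modeling) could be sketched as follows. Several things here are assumptions rather than the project's confirmed method: that the language is read from the WARC-Identified-Content-Language record header and that its first code is the dominant language, that documents are vectorized with scikit-learn's CountVectorizer, and that the corextopic package is used for CorEx. The function names, the anchor words, and the parameter values are hypothetical.

```python
from typing import List, Optional, Sequence


def is_english(language_header: Optional[str]) -> bool:
    """Return True when the first language code in the header value is 'eng'.

    Assumption: the value comes from the WARC-Identified-Content-Language
    header and looks like "eng" or "eng,fra".
    """
    if not language_header:
        return False
    return language_header.split(",")[0].strip().lower() == "eng"


def train_anchored_corex(
    texts: Sequence[str],
    anchors: List[List[str]],
    n_topics: int = 10,
    anchor_strength: int = 3,
):
    """Fit an anchored CorEx topic model on a binary document-word matrix."""
    # Third-party imports are local so the language filter works without them.
    from corextopic import corextopic as ct
    from sklearn.feature_extraction.text import CountVectorizer

    # CorEx topic models work on binary word occurrences.
    vectorizer = CountVectorizer(binary=True, stop_words="english")
    doc_word = vectorizer.fit_transform(texts)
    words = list(vectorizer.get_feature_names_out())

    model = ct.Corex(n_hidden=n_topics, seed=42)
    # Each anchor group nudges one topic toward the given words; anchor words
    # should be lowercase and present in the vocabulary built above.
    model.fit(doc_word, words=words, anchors=anchors, anchor_strength=anchor_strength)
    return model, vectorizer


# Hypothetical anchors for the example use case:
# anchors = [["blackrock", "letter"], ["fake", "hoax"]]
# model, vectorizer = train_anchored_corex(english_texts, anchors)
# candidate_mask = model.labels[:, 0]  # documents assigned to the first anchored topic
```

Anchoring is what makes the approach semi-supervised: the analyst supplies a few domain words, and the model is left to discover the remaining topics on its own.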