MariellaCC/web-archives-queries

web-archives-queries

Preparatory work towards a pipeline that lets analysts retrieve and query web archives for a given date range, then have the retrieved documents assessed for relevance by a semi-supervised NLP model.

Example use case

Find documents related to the fake letter attributed to the CEO of BlackRock that was sent to several media outlets on January 16, 2019.
Information about this case is available in a 2020 report by the French Financial Markets Authority (Autorité des Marchés Financiers):
Neyret, A. Stock Market Cybercrime. Definition, cases and perspectives. Autorité des Marchés Financiers, 2020.

Archives source

Common Crawl (https://commoncrawl.org/) provides an open repository of web crawl data from which archives can be downloaded. Its terms of use are available on the same site.

Process

1. Web crawl download

For the example use case, the first experiment focuses on WET archives, the files that contain the plain text extracted from crawled pages.
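As a minimal sketch of this step, the snippet below turns a Common Crawl WET path listing into full download URLs. It assumes that the crawl covering January 2019 is `CC-MAIN-2019-04` and that files are served from `https://data.commoncrawl.org/`; both should be checked against the Common Crawl site. The function names are illustrative and are not taken from this repository.

```python
import gzip
import urllib.request

# Assumption: CC-MAIN-2019-04 is the crawl covering January 2019.
# Verify the crawl identifier on commoncrawl.org before relying on it.
CRAWL_ID = "CC-MAIN-2019-04"
BASE_URL = "https://data.commoncrawl.org/"


def wet_paths_url(crawl_id=CRAWL_ID, base_url=BASE_URL):
    """Build the URL of the gzip-compressed listing of WET files for a crawl."""
    return f"{base_url}crawl-data/{crawl_id}/wet.paths.gz"


def parse_wet_paths(listing_bytes, base_url=BASE_URL):
    """Turn the compressed listing (one relative path per line) into full URLs."""
    text = gzip.decompress(listing_bytes).decode("utf-8")
    return [base_url + line.strip() for line in text.splitlines() if line.strip()]


def fetch_wet_urls(crawl_id=CRAWL_ID):
    """Download the listing and return every WET file URL (needs network access)."""
    with urllib.request.urlopen(wet_paths_url(crawl_id)) as response:
        return parse_wet_paths(response.read())
```

A full crawl lists tens of thousands of WET files, so in practice only a subset of the returned URLs would be downloaded for an experiment.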

2. Get relevant archives from web crawl

The date range currently selected for the example runs from 15 to 17 January 2019, but it may be extended.
After getting the relevant files, the warcio package is used to iterate over their records.
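The iteration could look roughly like the sketch below. It assumes that the date filter is applied to each record's `WARC-Date` header and that page text lives in `conversion` records, which is how WET files are laid out; the repository may select files differently. The helper names `in_date_range` and `iter_wet_documents` are hypothetical.

```python
from datetime import date

# Date range from the example use case (inclusive on both ends).
START = date(2019, 1, 15)
END = date(2019, 1, 17)


def in_date_range(warc_date, start=START, end=END):
    """Check a WARC-Date value such as '2019-01-16T08:30:00Z' against the range."""
    day = date.fromisoformat(warc_date[:10])  # keep only the YYYY-MM-DD part
    return start <= day <= end


def iter_wet_documents(path, start=START, end=END):
    """Yield (url, text) for WET records captured inside the date range."""
    # warcio is a third-party package (pip install warcio); it is imported
    # here so the date helper above stays usable without it.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        # ArchiveIterator transparently handles gzip-compressed archives.
        for record in ArchiveIterator(stream):
            # WET files store the extracted page text in 'conversion' records.
            if record.rec_type != "conversion":
                continue
            headers = record.rec_headers
            warc_date = headers.get_header("WARC-Date")
            if warc_date and in_date_range(warc_date, start, end):
                raw = record.content_stream().read()
                text = raw.decode("utf-8", errors="replace")
                yield headers.get_header("WARC-Target-URI"), text
```

Called as `for url, text in iter_wet_documents("example.warc.wet.gz"): ...`, where the file name is a placeholder, this streams records one at a time rather than loading a whole archive into memory.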

3. Processing and modeling

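Only documents in English are kept for this case. A CorEx (Correlation Explanation) topic model is then trained on them; anchoring the model with domain-specific words is what makes the approach semi-supervised and lets it surface the relevant documents.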

Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. "Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge." Transactions of the Association for Computational Linguistics (TACL), 2017.

https://medium.com/pew-research-center-decoded/overcoming-the-limitations-of-topic-models-with-a-semi-supervised-approach-b947374e0455
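A minimal sketch of the modeling step, assuming the `corextopic` and `scikit-learn` packages and a list of texts already filtered to English. The anchor words, the number of topics, and the function names are illustrative assumptions, not values taken from this repository.

```python
# Hypothetical anchor words for the example use case; the first group steers
# topic 0 towards the fake-letter story.
EXAMPLE_ANCHORS = [["blackrock", "letter", "hoax"]]


def filter_anchors(anchors, vocabulary):
    """Keep only anchor words present in the vocabulary; drop empty groups."""
    vocab = set(vocabulary)
    kept = [[word for word in group if word in vocab] for group in anchors]
    return [group for group in kept if group]


def fit_anchored_corex(texts, anchors=EXAMPLE_ANCHORS, n_topics=10, anchor_strength=3):
    """Fit an anchored CorEx topic model on a list of English texts."""
    # Third-party packages (pip install scikit-learn corextopic), imported
    # here so filter_anchors stays usable without them.
    from sklearn.feature_extraction.text import CountVectorizer
    from corextopic import corextopic as ct

    # CorEx works on a binary document-term matrix.
    vectorizer = CountVectorizer(binary=True, stop_words="english", max_features=20000)
    doc_word = vectorizer.fit_transform(texts)
    words = list(vectorizer.get_feature_names_out())

    model = ct.Corex(n_hidden=n_topics, seed=42)
    model.fit(
        doc_word,
        words=words,
        anchors=filter_anchors(anchors, words),
        anchor_strength=anchor_strength,
    )
    return model, vectorizer
```

After fitting, `model.get_topics(topic=0)` lists the top words of the first anchored topic, and `model.p_y_given_x[:, 0]` gives one score per document for that topic, which can be sorted to rank candidate documents for review.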
