web-archives-queries

Preparatory work toward a pipeline that lets analysts retrieve web archives for a given date range, query them, and get the relevant documents as assessed by a semi-supervised NLP model.

Example use case

Find documents related to the fake letter attributed to the CEO of BlackRock, which was sent to several media outlets on January 16, 2019.
Information about this case is available in a 2020 report by the French Financial Markets Authority:
Neyret, A. Stock Market Cybercrime. Definition, cases and perspectives. Autorité des Marchés Financiers, 2020.

Archives source

Common Crawl (https://commoncrawl.org/) provides an open repository of web crawl data from which archives can be downloaded. Its terms of use are available on the Common Crawl website.

Process

1. Web crawl download

For the example use case, the first experiment focuses on the WET archives, which contain the plain text extracted from the crawled pages.
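As a rough illustration of the download step, the sketch below turns a Common Crawl `wet.paths.gz` listing into full download URLs. It is not the repository's `get_archives.sh`; the function name `wet_urls` and the crawl identifier in the usage note are assumptions made for this example.

```python
import gzip

# Public HTTPS prefix under which Common Crawl serves its data files.
BASE_URL = "https://data.commoncrawl.org/"


def wet_urls(paths_listing: bytes) -> list[str]:
    """Turn the gzipped content of a wet.paths.gz listing into download URLs.

    The listing holds one relative path per line; each path is appended
    to BASE_URL. Blank lines are skipped.
    """
    text = gzip.decompress(paths_listing).decode("utf-8")
    return [BASE_URL + line.strip() for line in text.splitlines() if line.strip()]
```

The listing itself could be fetched with `urllib.request.urlopen(BASE_URL + "crawl-data/CC-MAIN-2019-04/wet.paths.gz").read()`, assuming CC-MAIN-2019-04 is the crawl that covers January 2019. Each WET file is large, so downloading only the files needed for the date range keeps the experiment manageable.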

2. Get relevant archives from web crawl

The date range currently selected for the example runs from January 15 to 17, 2019, but it may be extended.
After getting the relevant files, the warcio package is used to iterate over their records.
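The selection and iteration described above could look roughly like the following sketch. It assumes that WET file names embed the start and end timestamps of the capture window (`CC-MAIN-<start>-<end>-<serial>`), which holds for recent Common Crawl crawls but is not stated in this README. The helper names `in_date_range` and `iter_wet_documents` are hypothetical; only `ArchiveIterator` and the record accessors come from warcio.

```python
import re
from datetime import date, datetime

# Assumed naming convention: CC-MAIN-<start timestamp>-<end timestamp>-<serial>...
_TIMESTAMPS = re.compile(r"CC-MAIN-(\d{14})-(\d{14})-")


def in_date_range(path: str, start: date, end: date) -> bool:
    """Return True if the capture window in the file name overlaps [start, end]."""
    match = _TIMESTAMPS.search(path)
    if match is None:
        return False
    first = datetime.strptime(match.group(1), "%Y%m%d%H%M%S").date()
    last = datetime.strptime(match.group(2), "%Y%m%d%H%M%S").date()
    return first <= end and last >= start


def iter_wet_documents(wet_file: str):
    """Yield (url, text) pairs from a local .warc.wet.gz file."""
    # Imported here so the date filter above works without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(wet_file, "rb") as stream:
        for record in ArchiveIterator(stream):
            # WET files store extracted text in "conversion" records;
            # the leading "warcinfo" record is skipped.
            if record.rec_type != "conversion":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            yield url, text
```

For the example use case, paths would be kept when `in_date_range(path, date(2019, 1, 15), date(2019, 1, 17))` is true. Filtering on the file name avoids opening files that fall outside the window; a stricter filter could also check each record's `WARC-Date` header.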

3. Processing and modeling

Only documents in English are kept for this case. A CorEx (Correlation Explanation) topic model is then trained on them to retrieve relevant documents.
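A minimal sketch of the modeling step follows, assuming the `corextopic` package and scikit-learn are installed. The helper names, the default parameter values, and any anchor words are illustrative assumptions, not the settings used in the notebooks. Language filtering is assumed to have happened beforehand.

```python
def keep_known_anchors(anchor_groups, vocabulary):
    """Drop anchor words missing from the vocabulary, then drop empty groups."""
    vocab = set(vocabulary)
    kept = [[word for word in group if word in vocab] for group in anchor_groups]
    return [group for group in kept if group]


def fit_anchored_corex(texts, anchor_groups, n_topics=10, anchor_strength=3, seed=0):
    """Fit an anchored CorEx topic model on a list of English documents.

    n_topics, anchor_strength and seed are illustrative defaults.
    """
    # Third-party imports are kept local so the helper above has no dependencies.
    from sklearn.feature_extraction.text import CountVectorizer
    from corextopic import corextopic as ct

    # CorEx works on binary word-occurrence data.
    vectorizer = CountVectorizer(binary=True, stop_words="english")
    doc_word = vectorizer.fit_transform(texts)
    words = list(vectorizer.get_feature_names_out())

    anchors = keep_known_anchors(anchor_groups, words)
    model = ct.Corex(n_hidden=n_topics, seed=seed)
    model.fit(doc_word, words=words, anchors=anchors, anchor_strength=anchor_strength)
    return model, vectorizer
```

With anchors such as `[["blackrock", "letter"]]` (a hypothetical choice), the first topic is steered toward the anchored words. The fitted model's `labels` attribute can then be inspected to see which documents are assigned to that topic, and `get_topics()` lists the words associated with each topic.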

Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. "Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge." Transactions of the Association for Computational Linguistics (TACL), 2017.

https://medium.com/pew-research-center-decoded/overcoming-the-limitations-of-topic-models-with-a-semi-supervised-approach-b947374e0455

