Skip to content

archive reddit data as offline web pages, but with image support!

Notifications You must be signed in to change notification settings

YoungerDryas89/reddit-html-archiver

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

reddit html archiver

pulls reddit data from the pushshift api and renders offline compatible html pages. uses the reddit markdown renderer.

install

requires python 3 on linux, OSX, or Windows

sudo apt-get install pip
pip install psaw
git clone https://github.com/chid/snudown
cd snudown
sudo python setup.py install
cd ..
git clone [this repo]
cd reddit-html-archiver
chmod u+x *.py

Windows users may need to run

chcp 65001
set PYTHONIOENCODING=utf-8

before running fetch_links.py or write_html.py to resolve encoding errors such as 'codec can't encode character'.

fetch reddit data

data is fetched by subreddit and date range and is stored as csv files in data.

./fetch_links.py politics 2017-1-1 2017-2-1
# or add some link/post filtering to download less data
./fetch_links.py --self_only --score "> 2000" politics 2015-1-1 2016-1-1
# show available filters
./fetch_links.py -h

decrease your date range or adjust pushshift_rate_limit_per_minute in fetch_links.py if you are getting connection errors.

Imgur

If you want to also download Imgur images, you need to add a credentials.ini with your Imgur client id in it. Example:

[MAIN]
imgur_client_id=ID_HERE

write web pages

write html files for all subreddits to r.

./write_html.py
# or add some output filtering for less fluff or a smaller archive size
./write_html.py --min-score 100 --min-comments 100 --hide-deleted-comments
# show available filters
./write_html.py -h

your html archive has been written to r. once you are satisfied with your archive feel free to copy/move the contents of r to elsewhere and to delete the git repos you have created. everything in r is fully self contained.

to update an html archive, delete everything in r aside from r/static and re-run write_html.py to regenerate everything.

hosting the archived pages

copy the contents of the r directory to a web root or appropriately served git repo.

potential improvements

  • fetch_links
    • num_comments filtering
    • thumbnails or thumbnail urls
    • media posts
    • score update
    • scores from reddit with praw
  • real templating
  • choose Bootswatch theme
  • specify subreddits to output
  • show link domain/post type
  • user pages
    • add pagination, posts sorted by score, comments, date, sub
    • too many files in one directory
  • view on reddit.com
  • js powered search page, show no links by default
  • js inline media embeds/expandos
  • archive.org links

see also

screenshots

About

archive reddit data as offline web pages, but with image support!

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 71.1%
  • HTML 21.0%
  • CSS 6.0%
  • JavaScript 1.9%