One approach to localizing images #13

Open
jingyig01 opened this issue Apr 20, 2021 · 1 comment

jingyig01 commented Apr 20, 2021

Hi,
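I used your code and successfully saved a batch of valuable material. Thank you for your work.
Regarding "collect all images that appear on the page, save them locally, and embed all images in the html", my approach is as follows:
First use tiebaImageGet to download the post's images into a local folder (named after the post's PID), then modify the image src in the html file. Images downloaded this way are Tieba thumbnails, which avoids the excessive memory usage caused by the browser loading all the original images at once.

The Python code is as follows: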

import os
import re

from bs4 import BeautifulSoup


def modify_src(folder_path, file_name):
    file_path = os.path.join(folder_path, file_name)

    # Parse the saved page; the with-block closes the file handle.
    with open(file_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    url = [elm.get_text() for elm in soup.find_all("a", href=re.compile(r"^https://tieba\.baidu\.com/p/"))]

    # Some links are http
    if len(url) == 0:
        url = [elm.get_text() for elm in soup.find_all("a", href=re.compile(r"^http://tieba\.baidu\.com/p/"))]

    # get pid: the last 10 characters of the thread link text
    pid = url[0][-10:]

    # modify image src
    # unmodified src: https://imgsa.baidu.com/forum/w%3D580/sign=4d3033fbbdde9c82a665f9875c8080d2/4417d558ccbf6c815f62fb2ab23eb13532fa4035.jpg
    # modified: ./img/6233150605/09d6a94bd11373f0a6c6bb5daa0f4bfbf9ed0488.jpg
    # pattern: ./img/pid/img_name
    # img_name: img["src"][-44:]  (40 hex characters plus ".jpg")
    # unmodified emoticon src: https://gsp0.baidu.com/5aAHeD3nKhI2p27j8IqW0jdnxx1xbK/tb/editor/images/client/image_emoticon72.png
    # modified: ../tieba_emoticon/image_emoticon72.png
    for img in soup.find_all('img', {"src": True}):
        if img["src"].endswith(".jpg"):
            img['src'] = './img/' + pid + '/' + img['src'][-44:]
        elif img['src'].endswith('.png'):
            # keep only the emoticon file name
            emoticon_name = img['src'].split('/')[-1]
            img['src'] = '../tieba_emoticon/' + emoticon_name

    with open(file_path, "w", encoding = "utf-8") as file:
        file.write(str(soup))
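
The path mapping inside the loop can also be pulled out into a small pure function, which makes it easy to check against the example URLs in the comments without parsing any HTML. This is only a sketch: `local_src` and its keyword parameters are hypothetical names, not part of the original code, and it takes the file name from the URL path instead of slicing the last 44 characters.

```python
import posixpath
from urllib.parse import urlparse


def local_src(src, pid, img_root="./img", emoticon_root="../tieba_emoticon"):
    """Map a remote image URL to the local relative path (hypothetical helper)."""
    # Use the last path segment as the file name; any query string is ignored.
    name = posixpath.basename(urlparse(src).path)
    if name.endswith(".jpg"):
        # Post images live under ./img/<pid>/
        return f"{img_root}/{pid}/{name}"
    if name.endswith(".png"):
        # Emoticons are shared by all posts
        return f"{emoticon_root}/{name}"
    # Anything else is left untouched.
    return src
```

Inside `modify_src`, the loop body would then reduce to `img['src'] = local_src(img['src'], pid)`.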

Emoticon files used: tieba_emoticon.zip

Best regards,
Jingyi

hjhee (Owner) commented Sep 11, 2022

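The method you provided is very clear; I will look into it when a suitable opportunity comes up. For now, directories are simply created following the structure of the original URL.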
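
For comparison, one possible reading of "directories following the structure of the original URL" is to mirror the host and path of each image URL under a local root. The sketch below is an assumption for illustration only and is not taken from the project's source; `mirrored_path` and the `root` parameter are hypothetical names.

```python
import posixpath
from urllib.parse import urlparse


def mirrored_path(url, root="img"):
    """Build a local relative path that mirrors the URL's host and path (hypothetical)."""
    parts = urlparse(url)
    # Strip the leading slash so the URL path is joined under root/host
    # instead of being treated as an absolute path.
    return posixpath.join(root, parts.netloc, parts.path.lstrip("/"))
```

With this layout the image src only needs the scheme replaced by the local root, so no PID lookup is required.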
