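Hi,

I used your code and successfully saved a batch of valuable material. Thank you for your work.

Regarding "collect all images that appear on the page, save them locally, and embed them all into the HTML", my approach is as follows:

First use tiebaImageGet to download the thread's images into a local folder (named after the thread PID), then modify the image src attributes in the HTML file. Images downloaded this way are Tieba thumbnails, which avoids the excessive memory usage caused by the browser loading all the original images at once.

The Python code is as follows: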
import os
import re

from bs4 import BeautifulSoup


def modify_src(folder_path, file_name):
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    # The link text of the thread link is the thread URL.
    # Some links are http, others https, so match both.
    url = [elm.get_text() for elm in
           soup.find_all("a", href=re.compile(r"^https?://tieba\.baidu\.com/p/"))]
    if not url:
        raise ValueError("no tieba thread link found in " + file_path)

    # get pid (assumes the link text ends with a 10-digit PID)
    pid = url[0][-10:]

    # modify image src
    # unmodified src: https://imgsa.baidu.com/forum/w%3D580/sign=4d3033fbbdde9c82a665f9875c8080d2/4417d558ccbf6c815f62fb2ab23eb13532fa4035.jpg
    # modified: ./img/6233150605/09d6a94bd11373f0a6c6bb5daa0f4bfbf9ed0488.jpg
    # pattern: ./img/pid/img_name
    # img_name: img["src"][-44:]  (40-character name plus ".jpg")
    # unmodified emoticon src: https://gsp0.baidu.com/5aAHeD3nKhI2p27j8IqW0jdnxx1xbK/tb/editor/images/client/image_emoticon72.png
    # modified: ../tieba_emoticon/image_emoticon72.png
    for img in soup.find_all("img", src=True):
        src = img["src"]
        if src.endswith(".jpg"):
            img["src"] = "./img/" + pid + "/" + src[-44:]
        elif src.endswith(".png"):
            # every .png is treated as an emoticon
            img["src"] = "../tieba_emoticon/" + src.split("/")[-1]

    with open(file_path, "w", encoding="utf-8") as file:
        file.write(str(soup))
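The src rewriting rule used above can also be pulled out into a small pure function, which makes it easy to check against the example URLs from the comments without touching any HTML file. This is only a sketch: `rewrite_src` is a hypothetical helper name, and it takes the last path segment as the image name instead of the fixed 44-character slice, on the assumption that the file name is always the final segment of the URL.

```python
def rewrite_src(src, pid):
    """Map a remote Tieba image URL to a local relative path.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Assumption: the file name is the last "/"-separated segment of the URL.
    name = src.rsplit("/", 1)[-1]
    if src.endswith(".jpg"):
        # Thread images live under ./img/<pid>/
        return "./img/" + pid + "/" + name
    if src.endswith(".png"):
        # Every .png is treated as an emoticon, as in the function above.
        return "../tieba_emoticon/" + name
    # Anything else is left untouched.
    return src
```

For the sample URLs in the comments, the last path segment and the 44-character slice give the same file name, so both variants produce the same local path.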
Emoticon files used: tieba_emoticon.zip

Best regards,
Jingyi
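The method you describe is very clear; I will look into it when a suitable opportunity comes up. For now, directories are simply created following the structure of the original URL.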
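One possible reading of "directories created following the structure of the original URL" is sketched below. This is not taken from the project's code: `url_to_local_path` and its `root` parameter are hypothetical, and decoding percent-escapes in directory names is an assumption about the desired layout.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def url_to_local_path(url, root="img"):
    """Mirror a URL's host and path under a local root directory.

    Hypothetical helper: name, signature and layout are illustrative only.
    """
    parts = urlsplit(url)
    # Decode escapes such as %3D so directory names stay readable (assumption).
    path = unquote(parts.path).lstrip("/")
    # Query string and fragment are ignored in this sketch.
    return str(PurePosixPath(root) / parts.netloc / path)
```

With this layout the local path can be derived from the URL alone, so no per-thread PID lookup is needed when rewriting src attributes. A real implementation would also want to reject path segments such as `..` before writing to disk.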