Skip to content

SiriusZYZ/pixiv-spider

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

English Intro
简体中文简介
changelog
future plan
Contact me at [email protected], your feedbacks will be appreciated.

pixiv-spider [en]

Fetch contents from pixiv.
This module is still under development. Supported features includes:

  • fetch the title, illustid, date, tags, author and other metadata of the listed artworks in the trending page.
  • given the illustid of an artwork, fetch the art content url (original size).
  • module-wise logging.

Notice:
This module may be refactored in the future. The names of defined objects and methods may change.

Dependencies

This module use requests. Nothing third-party else (Currently).
Use the following command in your CLI to install requests. Noted that this module is developing with requests.__version__ >= 2.0.

pip install requests

How to Use

  1. import the module from ./pixivSpider. You can rename this module in your project later if you want.
import pixivSpider
  1. Use pixivSpider.rankingSession to fetch stuff from the trending page.
rs = pixivSpider.rankingSession()
rs.set_proxies(7890)    # if needed. apply for both http and https.

print(rs.valid_modes)   # valid modes of the rankingSession, such as "daily", "weekly", "rookies", etc.
print(rs.valid_contents)# valid contents of the rankingSession, such as "illustration", "manga", etc. 
rs.get_ranking_page(
    mode = "",          # fetch mode, optional parameter.
    content = "illust", # fetch content, Optional parameter.
    date = "20240101",  # trending page date, optional parameter. format (YYYYMMDD)
    page = 1            # trending page num, optional parameter. each page has 50 artworks, pageNum starts from 1.
)   # see help(pixivSpider.rankingSession) for more info
rs.get_ranking_page(
    mode = "",          
    content = "illust", 
    date = "20240101",  
    page = 2
)

for idx, item in enumerate(rs.resolve()):   # .resolve() return the fetched results in a list of item dict.
    print(idx, item)

rs.reset()              # reset proxy setting and clear all the result.
  1. Use pixivSpider.illustpageSession to fetch the url of original-size images from a given illustid
ips = pixivSpider.illustpageSession()
ips.set_proxies(7890)

ips.get_illust_page(
    illust_id = 84421525    # https://www.pixiv.net/artworks/{illustid}
)
ips.get_illust_page(
    illust_id = 93341155
)

for idx, item in enumerate(ips.resolve()):   # .resolve() return the fetched results in a list of item dict.
    print(idx, item)
ips.reset()     # reset proxy setting and clear all the result.
  1. Config the module logger.
  • The logger has set a stream handler in advance, and only show levels that severer than (and include) "info". This results in log popping in the CLI. You may disable this feature using pixivSpider.Logger.silent_stream().
  • The logger also support a file handler to save log into a file, but you need to set it explicitly using pixivSpider.Logger.set_file_handler(path).
import pixivSpider.Logger       # it's a singleton logger for the whole module.

pixivSpider.Logger.set_file_handler(path = "test.log")  # let the logs to be recorded into a file.
pixivSpider.Logger.set_stream_level("info")             # control the level of log that display in the console. levels from verbose to silent are "debug", "info", "warning", "error", "critical". levels name are case-insensitive.  

pixiv-spider [zh-cn]

抓取 pixiv 内容
目前模块仍在开发中, 支持的功能如下:

  • 抓取排行榜中艺术作品的标题, illustid, 日期, 标签, 作者等元数据
  • 抓取指定illustid艺术作品的原图url
  • 模块范围内的日志记录

注意:
模块可能会进行重构, 各定义的对象及方法名也可能因此改变.

依赖

目前只用到了requests 第三方模块.
使用以下命令行命令安装, 注意本模块是在requests.__version__ >= 2.0下开发的:

pip install requests

使用方式

  1. ./pixivSpider中导入模块. 如果需要可以在自己的项目中更改此模块名
import pixivSpider
  1. 使用 pixivSpider.rankingSession 获取排行榜中的内容.
rs = pixivSpider.rankingSession()
rs.set_proxies(7890)    # 设置proxy, 面向http和https

print(rs.valid_modes)   # 有效的爬取模式, 指每日/每周/新人...
print(rs.valid_contents)# 有效的爬取内容, 指插画/漫画/动图
rs.get_ranking_page(
    mode = "",          # 爬取模式, 可选参数.
    content = "illust", # 爬取内容, 可选参数.
    date = "20240101",  # 排行榜日期, 可选参数. 格式为 (年年年年月月日日)
    page = 1            # 排行榜页面, 可选参数. 每页50个作品, 页面数从1开始.
)   # 更多信息请 help(pixivSpider.rankingSession)
rs.get_ranking_page(
    mode = "",          
    content = "illust", 
    date = "20240101",  
    page = 2
)

for idx, item in enumerate(rs.resolve()):   # .resolve() 以列表形式返回爬取到的所有结果, 列表中每一个dict对应一个列出的作品
    print(idx, item)

rs.reset()              # 重置proxy设置并清空结果
  1. 使用 pixivSpider.illustpageSession 来获取 illustid 作品的所有原始大小图片url
ips = pixivSpider.illustpageSession()
ips.set_proxies(7890)

ips.get_illust_page(
    illust_id = 84421525    # https://www.pixiv.net/artworks/{illustid}
)
ips.get_illust_page(
    illust_id = 93341155
)

for idx, item in enumerate(ips.resolve()):   # .resolve() 以列表形式返回爬取到的所有结果, 列表中每一个dict对应一个执行过的每一个illust_id
    print(idx, item)
ips.reset()             # 重置proxy设置并清空结果
  1. 配置日志记录器
  • 日志记录器预先设置了一个流处理程序,且只显示比info更严重的级别的日志提示, 于是可以看到命令行中弹出的日志. 你可以使用pixivSpider.Logger.silent_stream()来禁用这一功能.
  • 记录器也支持文件处理程序来把日志保存到文件中,但你需要使用pixivSpider.Logger.set_file_handler(path)来显式设置.
import pixivSpider.Logger       # 这是一个给整个模块的单例Logger

pixivSpider.Logger.set_file_handler(path = "test.log")  # 让日志写到一个文件中
pixivSpider.Logger.set_stream_level("info")             # 控制控制台中显示的日志级别. 从详细级别到静默级别分别为"debug", "info", "warning", "error", "critical. 级别名称不区分大小写.

Changelog

0.2.1   2024 Apr 11

  • rankingSession now supports original, ai-generate contents, popular in male/female modes.
  • rankingSession now validates the input date and page, some bugs are fixed.
  • new logging system, supports a console (stream) handler and an optional file handler. Access it using pixivSpider.Logger. See this doc for more info.

0.2.0   2024 Apr 7

  • change this repo development from scrips-oriented to module-oriented.
  • api changes. add rankingSession, illustpageSession to support different features.
  • add a demo.
  • README.md support English now.
  • archived digest.py

0.1.2   2020 Mar 25

  • fix possible AttributeError when using CLI.

0.1.1   2020 Mar 19

  • digest.py support socks now.

0.1.0   2020 Mar 18

  • repo base.

Future Plan