Implement article parser #32

Open
PichuChen opened this issue Jan 9, 2021 · 4 comments
Labels
help wanted Extra attention is needed important

Comments

@PichuChen
Member

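There is an important task I forgot to open an issue for: we need to parse articles.

The structure of a BBS article has never been formally described, so we have to define it ourselves.

As a first pass, I would like articles to be parsed into the following format: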

"is_header_modify": {{is_header_modified}},
"author_id": {{author_id}},
"author_name": {{author_name}},
"title": {{title}},
"post_time": {{post_time}},
"board_name": {{board_name}},
"text": {
    "text": {{text}},
    "color_map": {{text_color_map}}
},
"signature": {
    "text": {{signature_text}},
    "color_map": {{signature_color_map}}
},
"sender_info": {
    "site": {{sender_site}},
    "ip_address": {{sender_ip_address}},
    "ip_country": {{sender_ip_country}}
},
"edit_record": [{{edit_record}}],
"push_record": [
    {
        "type": {{push_record.type}},
        "id": {{push_record.pusher_id}},
        "ip_address": {{push_record.pusher_ip}},
        "text": {{push_record.text}},
        "time": {{push_record.time}}
    }
]

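I have no particular approach in mind yet. I would suggest regex or plain if/else. Writing a parser with goyacc should also work, though I suspect such a parser might break on articles from before 2003. Training a neural network model is not out of the question either, but in that case I would probably ship the traditional if/else approach first.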
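
A minimal sketch of the regex plus if/else approach, limited to the three header lines. It assumes the article has already been converted from its stored encoding to UTF-8, and that an unmodified header looks like `作者: id (nickname) 看板: board`, then `標題: ...`, then `時間: ...`; that layout is an assumption to check against the sample files, not something this issue specifies. `Header` and `ParseHeader` are hypothetical names.

```go
package main

import (
	"regexp"
	"strings"
)

// Header holds the fields recovered from the first three lines.
type Header struct {
	AuthorID, AuthorName, BoardName, Title, PostTime string
}

// authorLine matches the assumed first line: "作者: id (nickname) 看板: board".
var authorLine = regexp.MustCompile(`^作者: (\S+) \((.*)\)\s+看板: (\S+)\s*$`)

// ParseHeader returns ok == false when the first three lines do not match
// the assumed layout, which a caller could record as is_header_modify.
func ParseHeader(article string) (Header, bool) {
	lines := strings.SplitN(article, "\n", 4)
	if len(lines) < 3 {
		return Header{}, false
	}
	for i := 0; i < 3; i++ {
		lines[i] = strings.TrimRight(lines[i], "\r")
	}
	m := authorLine.FindStringSubmatch(lines[0])
	if m == nil {
		return Header{}, false
	}
	if !strings.HasPrefix(lines[1], "標題: ") || !strings.HasPrefix(lines[2], "時間: ") {
		return Header{}, false
	}
	return Header{
		AuthorID:   m[1],
		AuthorName: m[2],
		BoardName:  m[3],
		Title:      strings.TrimPrefix(lines[1], "標題: "),
		PostTime:   strings.TrimPrefix(lines[2], "時間: "),
	}, true
}
```

Older articles that use a different header layout would simply return `false` here, so each extra layout can be added as another branch without disturbing the common case.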

PichuChen added the help wanted and important labels Jan 9, 2021
@ifanchu
Contributor

ifanchu commented Jan 27, 2021

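What does the stored file format of an article look like?
Is there a sample that contains everything (pushes, coloring, and so on) that I could take a look at?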

@PichuChen
Member Author

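The test data package already contains samples, but to make discussion easier I converted one of them to UTF-8 and put it in a gist (this sample has no pushes or coloring):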
https://gist.github.com/PichuChen/dcfcc1db826e3a35942985e4442abd19

If you need a more complex sample, you can generate one on pttapp.cc.

@PichuChen
Member Author

https://gist.github.com/PichuChen/4ae56b2d8fda9f7e12df7c4d34befe5a

This is another version, represented in hexadecimal.

@PichuChen
Member Author

A more complete study of the file format is here, though it uses PTT as its example.
