-
-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt to get publication year when auto-titling links #520
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding:
I would put this logic into getMetadata using a custom ruleset. Afaict, looking at the code, you can add new rules (like for date here) and not only extend existing rules. Just add a new key to the ruleset.
I see you are already using getMetadata
.
Why are you extracting the date in different functions? Why not have a single call to getMetadata
with all date rules? You don't have to reimplement the python code 1:1.
You also mention this:
// try to get date from various sources in order of precedence
But the rule order would already specify the precedence:
The order in which rules are defined indicate their preference, with the first rule being the most preferred.
-- https://www.npmjs.com/package/page-metadata-parser#rules
What's interesting is that there was already a PR for this but the package is unmaintained since February 2022: https://github.com/mozilla/page-metadata-parser/pull/122/files
I haven't found a maintained fork or similar though.
I think we should also use the name publishedDate
instead of just date
for the rule because it's more clear what kind of date is meant (even though I don't know what other date could be meant, lol).
Also, do you have example websites where different tags are used for the published date? The url in your comment uses script[type="application/ld+json"]
but would be nice to see examples for the other tags, too.
edit: also please do a rebase, there are conflicts currently
Just a quick update that I've been working on this. I probably won't have new push till next week though. |
Mhh, I think we missed that this is ready for review again. I'll try to do a review tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much better now!
Most important comments:
- I don't understand the purpose of
initDateRule
- TypeError for links where no publication date was found
By the way, it's a good practice to rebase your branches frequently on master to avoid complex conflicts. There are currently no conflicts but this means you can rebase for free. For example, your branch
|
It just adds our custom rule to the ruleset, since it isn't built-in. Without it, the following line wouldn't know how to find the date in the page. If there's a better way/place to do it, I'm open, but this just seemed like the most logical/reliable place to put it. |
Thanks... I will go through them all now. |
3b5a9f6
to
c6a1d93
Compare
Oh, I never added why I don't understand it. I had written a lengthy comment about it but somehow I must have lost it in all my tabs, lol. Github is confusing if you have multiple tabs of it open ... I don't understand it because you're manipulating something you imported at
I am actually surprised this even works - I guess because javascript imports are references? So this changed value is actually reflected where you use it in
But since Anyway, you shouldn't manipulate imports imo. That probably relies on very specific ESM import mechanics and I've never seen this before, lol |
Done, thanks! |
To resolve #51.