Issue 382/ support X-Robots-Tag as a typed http header XRobotsTag #393
1a46b37
to
f3c6d97
Not a bad start. Do take your time to complete it and add sufficient tests. But great start indeed. I added a couple of remarks to give you some more guidance.
rama-http-types/src/lib.rs
Outdated
@@ -131,6 +131,9 @@ pub mod header { | |||
"x-real-ip", | |||
]; | |||
|
|||
// non-std web-crawler info headers | |||
static_header!["x-robots-tag",]; |
static_header!["x-robots-tag",]; | |
// | |
// More information at | |
// <https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Robots-Tag>. | |
static_header!["x-robots-tag"]; |
|
||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
pub struct Element { | ||
bot_name: Option<String>, // or `rama_ua::UserAgent` ??? |
You can copy https://github.com/hyperium/headers/blob/master/src/util/value_string.rs to https://github.com/plabayo/rama/tree/main/rama-http/src/headers/util and use that one here, that's also what the UserAgent typed header uses: https://docs.rs/headers/0.4.0/src/headers/common/user_agent.rs.html#43
use std::str::FromStr; | ||
|
||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
pub struct Element { |
Depending on how you structure it, this actually has to be either:
struct Element { bot_name: Option<HeaderValueString>, rules: Vec<Rule> }
or
enum Element {
BotName(HeaderValueString),
Rule(Rule),
}
Because when a botname is mentioned it applies to all rules that follow it, until another botname is mentioned
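To make the "bot name applies to everything after it" point concrete, here is a minimal sketch of the enum variant. It assumes a plain `String` in place of `HeaderValueString` and a reduced two-variant `Rule`; the helper `rules_with_bot` is a hypothetical name, not something from the PR.

```rust
/// Reduced stand-in for the PR's `Rule` enum (two variants, for illustration only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Rule {
    NoIndex,
    NoFollow,
}

/// Stand-in for `Element`; `String` replaces `HeaderValueString` (assumption)
/// so that the sketch stays self-contained.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Element {
    BotName(String),
    Rule(Rule),
}

/// Hypothetical helper: pairs every rule with the bot name in effect at that
/// position. `None` means no bot name has been mentioned yet.
pub fn rules_with_bot<'a>(elements: &'a [Element]) -> Vec<(Option<&'a str>, &'a Rule)> {
    let mut current_bot: Option<&'a str> = None;
    let mut out = Vec::new();
    for element in elements {
        match element {
            // A bot name switches the scope for all rules that follow.
            Element::BotName(name) => current_bot = Some(name.as_str()),
            Element::Rule(rule) => out.push((current_bot, rule)),
        }
    }
    out
}
```

With the struct variant the same grouping would be stored directly, at the cost of one `Vec<Rule>` per bot name.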
fn from_str(s: &str) -> Result<Self, Self::Err> { | ||
let (bot_name, indexing_rule) = match Rule::from_str(s) { | ||
Ok(rule) => (None, Ok(rule)), | ||
Err(e) => match *s.split(":").map(str::trim).collect::<Vec<_>>().as_slice() { |
You'll need to modify the Element code anyway, so this might disappear, but as extra info: be careful with collecting stuff in a Vec just for temporary reasons. It causes allocations for not much good. It's better to split into a fixed tuple, or to use the iterator directly.
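As a rough illustration of splitting into a tuple instead of collecting, something along these lines avoids the temporary `Vec` entirely. The function name `split_key_value` is made up for this sketch, and it deliberately does not decide whether the key is a bot name or a rule name.

```rust
/// Splits `"key: value"` at the first ':' without allocating.
/// Returns the trimmed key and, if a ':' was present, the trimmed value.
pub fn split_key_value(s: &str) -> (&str, Option<&str>) {
    match s.split_once(':') {
        Some((key, value)) => (key.trim(), Some(value.trim())),
        None => (s.trim(), None),
    }
}
```

Because `split_once` only splits at the first `':'`, a value that itself contains colons (such as a timestamp) stays in one piece.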
MaxVideoPreview(Option<u32>), | ||
NoTranslate, | ||
NoImageIndex, | ||
UnavailableAfter(String), // "A date must be specified in a format such as RFC 822, RFC 850, or ISO 8601." |
This is best validated and enforced in a typed manner :)
use std::str::FromStr; | ||
|
||
#[derive(Clone, Debug, Eq, PartialEq)] | ||
pub(super) enum Rule { |
Don't forget about the custom ones: noai, noimageai, SPC
} | ||
} | ||
|
||
impl TryFrom<&[&str]> for Rule { |
I don't think it's worth implementing a public trait for this. You can just make simple private functions out of it, and instead of collecting into a Vec I would just use the iterator's next method. You can then have one private fn for 1 arg and one for 2 args.
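A possible shape for that, sketched with a reduced `Rule` enum. The names `rule_from_name`, `rule_from_name_and_value` and `parse_rule` are hypothetical, and `MaxSnippet(u32)` is a simplification of how the value would really be modelled.

```rust
/// Reduced stand-in for the PR's `Rule` enum.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Rule {
    NoIndex,
    NoFollow,
    MaxSnippet(u32),
}

/// Private helper for rules that consist of a name only.
fn rule_from_name(name: &str) -> Option<Rule> {
    if name.eq_ignore_ascii_case("noindex") {
        Some(Rule::NoIndex)
    } else if name.eq_ignore_ascii_case("nofollow") {
        Some(Rule::NoFollow)
    } else {
        None
    }
}

/// Private helper for rules that carry a value after the ':'.
fn rule_from_name_and_value(name: &str, value: &str) -> Option<Rule> {
    if name.eq_ignore_ascii_case("max-snippet") {
        value.parse::<u32>().ok().map(Rule::MaxSnippet)
    } else {
        None
    }
}

/// Drives the two helpers straight from the iterator, with no temporary `Vec`.
pub fn parse_rule(s: &str) -> Option<Rule> {
    let mut parts = s.splitn(2, ':').map(str::trim);
    let name = parts.next()?;
    match parts.next() {
        None => rule_from_name(name),
        Some(value) => rule_from_name_and_value(name, value),
    }
}
```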
…ield to 'Vec<Rule>'
0c8c3b3
to
e66d95b
I've attempted to implement your suggestions, but I still have a couple of uncertainties:
Please note that this implementation is not yet complete, as the |
} | ||
|
||
pub fn is_valid(&self) -> bool { | ||
let rfc_822 = r"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s(0[1-9]|[12]\d|3[01])\s(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{2}\s([01]\d|2[0-4]):([0-5]\d|60):([0-5]\d|60)\s(UT|GMT|EST|EDT|CST|CDT|MST|MDT|PST|PDT|[+-]\d{4})$"; |
Given that you are storing it as a string anyway, do we really care about validating these for now? And even if we do, I wonder if this really needs a regexp. This is definitely a file that needs more work, even if we want to go for this.
Is the ValidDate struct even relevant if it does not validate the String? I'm wondering if it wouldn't then make more sense to just revert it to a String in the UnavailableAfter enum variant.
} | ||
|
||
fn check_is_valid(re: &str, date: &str) -> bool { | ||
Regex::new(re) |
I use regexes as little as possible, but even if you use them, you'll want to have them as lazy statics (which you can nowadays do with what the std library offers; in the past you needed the lazy_static crate for that, but no longer).
The compilation step is the heaviest part of regex usage, so you want to do it only once, not every time you call this function.
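For reference, a std-only sketch of the "initialize once, reuse afterwards" pattern, using `std::sync::OnceLock`. To keep it dependency-free, a `HashSet` of month names stands in for the compiled regex (in the real code the closure would hold the `Regex::new(..)` call instead), and `INIT_COUNT` exists purely to make the single initialization observable. All names here are illustrative.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the initializer runs (illustration only).
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Returns the lazily built set; the closure runs on the first call only.
fn month_names() -> &'static HashSet<&'static str> {
    static MONTHS: OnceLock<HashSet<&'static str>> = OnceLock::new();
    MONTHS.get_or_init(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        [
            "Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
        ]
        .iter()
        .copied()
        .collect()
    })
}

/// Cheap per-call check that reuses the shared, already-built value.
pub fn is_month(s: &str) -> bool {
    month_names().contains(s)
}

/// Exposes the initializer count so the "only once" claim can be checked.
pub fn init_count() -> usize {
    INIT_COUNT.load(Ordering::SeqCst)
}
```

On Rust 1.80 and later, `std::sync::LazyLock` expresses the same idea as a plain `static` without the accessor function.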
type Err = OpaqueError; | ||
|
||
fn from_str(s: &str) -> Result<Self, Self::Err> { | ||
match s { |
This is a pretty strict check. Is it case sensitive, or can we be case insensitive here? Also, what about spaces: is there no need to trim first?
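If the answer is that matching should be lenient, a minimal sketch could trim first and then compare with `eq_ignore_ascii_case`. This assumes rule names are to be treated case-insensitively; `InvalidRule` is a placeholder for the `OpaqueError` used in the PR, and the `Rule` enum is reduced to three variants.

```rust
use std::str::FromStr;

/// Reduced stand-in for the PR's `Rule` enum.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Rule {
    NoIndex,
    NoFollow,
    NoArchive,
}

/// Placeholder error type for this sketch (the PR uses `OpaqueError`).
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidRule;

impl FromStr for Rule {
    type Err = InvalidRule;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Trim surrounding whitespace, then compare without regard to ASCII case.
        let s = s.trim();
        if s.eq_ignore_ascii_case("noindex") {
            Ok(Rule::NoIndex)
        } else if s.eq_ignore_ascii_case("nofollow") {
            Ok(Rule::NoFollow)
        } else if s.eq_ignore_ascii_case("noarchive") {
            Ok(Rule::NoArchive)
        } else {
            Err(InvalidRule)
        }
    }
}
```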
Questions like this are answered by thinking about its typical usage. Usually the parsing from the raw http header to the typed header will happen in the background as part of layer or web extractor or the like. So while you could try to return something like a custom error with exposure of the kind of error that happened I don't see it all that being useful here, even if you manually decode. Because even if you know what specific error happened, the result is either way the same, it's an invalid header so either you are okay with it or you are not. TLDR: opaque Error is fine here. If there is ever a strong reason to expose certain error variants, it can be requested by opening a new issue about it. But that's a discussion for then.
The answer is that you'll want to ensure you support this: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Robots-Tag#syntax ; So purely looking at whether it's a valid implementation, you would check against that reference syntax. After that the question will be whether it's an optimal enough implementation, but let's first focus on implementing this correctly.
type Err = OpaqueError; | ||
|
||
fn from_str(s: &str) -> Result<Self, Self::Err> { | ||
let regex = Regex::new(r"^\s*([^:]+?):\s*(.+)$") |
There should be no need for a regex here. It's a pretty linear process, so you should be able to easily parse out rules, e.g. something like read until ':' or EOF
, ':' => ...`.
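A linear scan of that kind might look roughly like the sketch below. `Token` and `tokenize` are hypothetical names; the scan simply records which delimiter ended each piece and leaves interpretation to the caller. It assumes values do not themselves contain `','` or `':'`, so a date after `unavailable_after` would need separate handling.

```rust
/// One piece of the header value, tagged by the delimiter that ended it.
#[derive(Debug, PartialEq, Eq)]
pub enum Token<'a> {
    /// Text that was followed by ':' (a bot name, or a rule name with a value).
    Key(&'a str),
    /// Text that was followed by ',' or by the end of the input.
    Item(&'a str),
}

/// Single pass over the input: read until ',' or ':' or EOF, emit, continue.
pub fn tokenize<'a>(input: &'a str) -> Vec<Token<'a>> {
    let mut tokens = Vec::new();
    let mut start = 0;
    for (idx, ch) in input.char_indices() {
        match ch {
            ':' => {
                tokens.push(Token::Key(input[start..idx].trim()));
                start = idx + 1;
            }
            ',' => {
                tokens.push(Token::Item(input[start..idx].trim()));
                start = idx + 1;
            }
            _ => {}
        }
    }
    // Whatever remains after the last delimiter is the final item, if any.
    let rest = input[start..].trim();
    if !rest.is_empty() {
        tokens.push(Token::Item(rest));
    }
    tokens
}
```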
You might want to read up a bit on how other folks have done it, even in other languages. E.g. look at https://github.com/nielsole/xrobotstag/blob/4cd7d8885a3e26fda9dd1a4075663cd3b80617f0/parser.go#L97. There's a lot to like about that implementation, but of course not to be copied as-is, given there is also plenty not to like about it. But what I personally take away from an implementation like this is:
CustomRule would be a private construct which can be as simple as
As such the only things we really need to expose here are the …
Not sure if you are a fan of the builder pattern, but I like it a lot for cases like this, as for each rule you can expose this method to
The default one would be basically an empty tag, not rendered at all. Which is okay as the
This allows them to be constructed in any way. Feel free to define a private …
The custom rules can be constructed without exposing the …
I think this is pretty elegant and does everything you wanted to do without even having to expose our internal representation of it. It also keeps the structure a lot more flat without really losing much, given the …
TLDR: great job on your work so far. Seems you are getting pretty comfortable with the codebase and definitely know your way around Rust. I hope my feedback and guidance above help you out in finishing it. Feel free to push back on something you do not agree with or to propose alternatives. The goal is mostly to keep this thing as simple and minimal as possible, while still allowing people to also do custom stuff, but without exposing too much of our internal API, so that we keep the flexibility to change stuff internally without breaking stuff. Not that I mind making breaking changes where needed, but it's even better if you can upgrade without needing to break.
I was wondering to maybe add some sort of
|
Out of scope for this PR and overkill for what you need here. A macro_rules has plenty to fill our needs here.
If one day you want to learn to write and read proc macros, you might enjoy this book: https://www.manning.com/books/write-powerful-rust-macros
I could use some help in my venndb crate for a future release.
another thing, do you prefer a separate builder struct (example: |
The reason to introduce a separate builder struct is to rule out calls such as RobotsTag::builder().build(), as at that point there is not yet anything set. In my proposal above I said that, as a shortcut, you can just ignore an empty robots tag and not write it. But if you do want to go this extra mile, we can indeed prevent it at compile time, which I would argue is better. Here is an example playground to show what I mean: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=82444bb713185c3f09ceeb917cc4fd66
The reason why you would need a builder for that is because:
I'm totally okay with you building that out. Good proposal of yours, so why not. Btw, note also that in my example I add both the "consume" variant and the setter variant. In cases where you want to pass it immediately to a function or parent structure, it is easier to just consume:
but in cases where you work with conditionals it's more convenient to have the setter also available, e.g.:
Of course in the case of a bool this seems a bit silly, as you could just assign |
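To illustrate both points (no `build()` before anything is set, and a consuming method next to a setter), here is a small self-contained sketch. It is not the code from the linked playground; `EmptyBuilder`, `Builder`, and all method names are assumptions chosen for illustration, and only two boolean rules are modelled.

```rust
/// Reduced stand-in for the typed header: two boolean rules only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RobotsTag {
    no_index: bool,
    no_follow: bool,
}

impl RobotsTag {
    /// Entry point. The returned type has no `build` method, so
    /// `RobotsTag::builder().build()` does not compile.
    pub fn builder() -> EmptyBuilder {
        EmptyBuilder
    }

    pub fn is_no_index(&self) -> bool {
        self.no_index
    }

    pub fn is_no_follow(&self) -> bool {
        self.no_follow
    }
}

/// Builder state before any rule has been set.
pub struct EmptyBuilder;

impl EmptyBuilder {
    /// Setting a first rule moves to the state that can be built.
    pub fn no_index(self) -> Builder {
        Builder(RobotsTag { no_index: true, no_follow: false })
    }

    pub fn no_follow(self) -> Builder {
        Builder(RobotsTag { no_index: false, no_follow: true })
    }
}

/// Builder state once at least one rule has been set.
pub struct Builder(RobotsTag);

impl Builder {
    /// "Consume" variant: convenient for one-expression chains.
    pub fn with_no_follow(mut self, value: bool) -> Self {
        self.0.no_follow = value;
        self
    }

    /// Setter variant: convenient inside conditionals.
    pub fn set_no_follow(&mut self, value: bool) -> &mut Self {
        self.0.no_follow = value;
        self
    }

    pub fn build(self) -> RobotsTag {
        self.0
    }
}
```

A chained call such as `RobotsTag::builder().no_index().with_no_follow(true).build()` uses the consuming form, while `if cond { builder.set_no_follow(true); }` uses the setter on a `let mut builder`. The sketch does not guard against later unsetting every rule again.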
Initial Implementation of `X-Robots-Tag` Header

This pull request introduces an initial implementation of the `X-Robots-Tag` header for the `rama-http` crate and closes #382.

Summary of Changes:

Header Registration:
- … `x-robots-tag` header to the static headers list in `rama-http-types/src/lib.rs`.

Implementation of the Header:
- … `XRobotsTag` struct that represents the header and implements the `rama_http::headers::Header` trait.
- … `Rule` enum to represent indexing rules such as `noindex`, `nofollow`, and `max-snippet`. Complex rules like `max-image-preview` and `unavailable_after` are also included.
- … `Element` struct to represent a rule optionally associated with a bot name.
- … `XRobotsTag` to iterate over its elements.

File Structure:
- … `rama-http/src/headers/x_robots_tag/`, which includes the following files:
  - `rule.rs`: Defines the `Rule` enum and parsing logic.
  - `element.rs`: Implements the `Element` struct and its parsing/formatting logic.
  - `iterator.rs`: Provides an iterator for `XRobotsTag`.
  - `mod.rs`: Combines and exposes the module’s functionality.

Encoding and Decoding:
- … `Header` trait, supporting CSV-style comma-separated values.

Questions and Feedback Requested:

Code Structure:
- … `x_robots_tag`) appropriate?

Implementation Design:
- … `Rule` and `Element` structs?
- … `Vec<Element>` for `XRobotsTag` suitable, or are there optimizations to consider?

Edge Cases and Standards:

Testing:

I look forward to your feedback and suggestions for improvement. Thank you!