Issue 382/ support X-Robots-Tag as a typed http header XRobotsTag #393

Open
wants to merge 15 commits into
base: main

hafihaf123
Contributor

@hafihaf123 hafihaf123 commented Jan 12, 2025

Initial Implementation of X-Robots-Tag Header

This pull request introduces an initial implementation of the X-Robots-Tag header for the rama-http crate and closes #382

Summary of Changes:

  1. Header Registration:

    • Added the x-robots-tag header to the static headers list in rama-http-types/src/lib.rs.
  2. Implementation of the Header:

    • Created a XRobotsTag struct that represents the header and implements the rama_http::headers::Header trait.
    • Defined a Rule enum to represent indexing rules such as noindex, nofollow, and max-snippet. Complex rules like max-image-preview and unavailable_after are also included.
    • Designed an Element struct to represent a rule optionally associated with a bot name.
    • Added an iterator for XRobotsTag to iterate over its elements.
  3. File Structure:

    • The implementation resides in a new module at rama-http/src/headers/x_robots_tag/, which includes the following files:
      • rule.rs: Defines the Rule enum and parsing logic.
      • element.rs: Implements the Element struct and its parsing/formatting logic.
      • iterator.rs: Provides an iterator for XRobotsTag.
      • mod.rs: Combines and exposes the module’s functionality.
  4. Encoding and Decoding:

    • Encoding and decoding are implemented using the Header trait, supporting CSV-style comma-separated values.

Questions and Feedback Requested:

  1. Code Structure:

    • Is the location of the new module (x_robots_tag) appropriate?
    • Does the filename structure and organization align with project conventions?
  2. Implementation Design:

    • Are there any improvements or alternative approaches you’d suggest for the Rule and Element structs?
    • Is the use of Vec<Element> for XRobotsTag suitable, or are there optimizations to consider?
  3. Edge Cases and Standards:

    • Are there additional rules, formats, or edge cases that should be covered?

Testing:

  • For now, the module is not tested. I will add the tests once the code structure becomes stable.

I look forward to your feedback and suggestions for improvement. Thank you!

@hafihaf123 hafihaf123 force-pushed the issue-382/x-robots-tag branch from 1a46b37 to f3c6d97 Compare January 12, 2025 22:01
Member

@GlenDC GlenDC left a comment


Not a bad start. Do take your time to complete it and add sufficient tests. But great start indeed. I added a couple of remarks to give you some more guidance.

@@ -131,6 +131,9 @@ pub mod header {
"x-real-ip",
];

// non-std web-crawler info headers
static_header!["x-robots-tag",];
Member

Suggested change
static_header!["x-robots-tag",];
//
// More information at
// <https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Robots-Tag>.
static_header!["x-robots-tag"];


#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Element {
bot_name: Option<String>, // or `rama_ua::UserAgent` ???
Member

use std::str::FromStr;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Element {
Member

Depending how you structure it, this actually has to be either:

struct Element { bot_name: Option<HeaderValueString>, rules: Vec<Rule> }

or

enum Element {
    BotName(HeaderValueString),
    Rule(Rule),
}

This is because when a bot name is mentioned, it applies to all rules that follow it, until another bot name is mentioned.
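A minimal sketch of the first option (a bot name plus the rules that follow it), using only the standard library. The `Rule` variants, `parse_rule`, and `parse_elements` are hypothetical stand-ins, `String` replaces `HeaderValueString`, and only three rules are modelled. It also splits naively on commas, so date values that contain commas are not handled.

```rust
// Hypothetical, simplified stand-ins for the PR's types.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Rule {
    NoIndex,
    NoFollow,
    MaxSnippet(u32),
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Element {
    bot_name: Option<String>, // the review suggests HeaderValueString here
    rules: Vec<Rule>,
}

// Parses a single rule such as "noindex" or "max-snippet: 10".
fn parse_rule(s: &str) -> Option<Rule> {
    let s = s.trim();
    if s.eq_ignore_ascii_case("noindex") {
        return Some(Rule::NoIndex);
    }
    if s.eq_ignore_ascii_case("nofollow") {
        return Some(Rule::NoFollow);
    }
    let (key, value) = s.split_once(':')?;
    if key.trim().eq_ignore_ascii_case("max-snippet") {
        return value.trim().parse().ok().map(Rule::MaxSnippet);
    }
    None
}

// Groups rules under the most recently mentioned bot name.
fn parse_elements(header: &str) -> Option<Vec<Element>> {
    let mut elements: Vec<Element> = Vec::new();
    for part in header.split(',') {
        let part = part.trim();
        // Either the whole part is a rule, or it is "<bot name>: <rule>".
        let (bot, rule) = match parse_rule(part) {
            Some(rule) => (None, rule),
            None => {
                let (bot, rest) = part.split_once(':')?;
                let bot = bot.trim();
                if bot.is_empty() {
                    return None;
                }
                (Some(bot), parse_rule(rest)?)
            }
        };
        match bot {
            // A new bot name starts a new element.
            Some(bot) => elements.push(Element {
                bot_name: Some(bot.to_owned()),
                rules: vec![rule],
            }),
            // A bare rule belongs to the element currently being built.
            None => match elements.last_mut() {
                Some(last) => last.rules.push(rule),
                None => elements.push(Element {
                    bot_name: None,
                    rules: vec![rule],
                }),
            },
        }
    }
    Some(elements)
}
```

With this shape, `noindex, googlebot: nofollow, max-snippet: 10` yields one element without a bot name and one `googlebot` element that carries both following rules.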

fn from_str(s: &str) -> Result<Self, Self::Err> {
let (bot_name, indexing_rule) = match Rule::from_str(s) {
Ok(rule) => (None, Ok(rule)),
Err(e) => match *s.split(":").map(str::trim).collect::<Vec<_>>().as_slice() {
Member

You'll need to modify the Element code anyway, so this might disappear, but as extra info: be careful with collecting things into a Vec just for temporary use. It costs allocations for little benefit. It is better to split into a fixed-size tuple, or to use the iterator directly.
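As an illustration of the tuple approach, here is a small sketch that uses `str::split_once` from the standard library instead of collecting into a `Vec`. The helper name `split_key_value` is hypothetical and not part of the PR.

```rust
// Splits "key: value" into trimmed slices without allocating.
// Returns `None` for the value when there is no ':' in the input.
fn split_key_value(s: &str) -> (&str, Option<&str>) {
    match s.split_once(':') {
        Some((key, value)) => (key.trim(), Some(value.trim())),
        None => (s.trim(), None),
    }
}
```

Both returned slices borrow from the input, so no intermediate collection is created.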

MaxVideoPreview(Option<u32>),
NoTranslate,
NoImageIndex,
UnavailableAfter(String), // "A date must be specified in a format such as RFC 822, RFC 850, or ISO 8601."
Member

This best be validated and enforced in a typed manner :)

use std::str::FromStr;

#[derive(Clone, Debug, Eq, PartialEq)]
pub(super) enum Rule {
Member

Don't forget about the custom ones: noai, noimageai, SPC

}
}

impl TryFrom<&[&str]> for Rule {
Member

I don't think it's worth implementing a public trait for this. You can just write simple private functions instead, and rather than collecting into a Vec I would use the iterator's next method. You can then have one private fn for 1 arg and one for 2 args.
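One way this could look, as a sketch under assumptions: the `Rule` enum below is a reduced stand-in with three variants, and the function names `rule_from_name`, `rule_from_name_value`, and `parse_rule` are made up for illustration.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum Rule {
    NoIndex,
    NoFollow,
    MaxSnippet(u32),
}

// Private helper for rules without a value, e.g. "noindex".
fn rule_from_name(name: &str) -> Option<Rule> {
    if name.eq_ignore_ascii_case("noindex") {
        Some(Rule::NoIndex)
    } else if name.eq_ignore_ascii_case("nofollow") {
        Some(Rule::NoFollow)
    } else {
        None
    }
}

// Private helper for rules with a value, e.g. "max-snippet: 10".
fn rule_from_name_value(name: &str, value: &str) -> Option<Rule> {
    if name.eq_ignore_ascii_case("max-snippet") {
        value.parse().ok().map(Rule::MaxSnippet)
    } else {
        None
    }
}

// Dispatches on how many parts the iterator yields, without a temporary Vec.
fn parse_rule(s: &str) -> Option<Rule> {
    let mut parts = s.splitn(2, ':').map(str::trim);
    match (parts.next(), parts.next()) {
        (Some(name), None) => rule_from_name(name),
        (Some(name), Some(value)) => rule_from_name_value(name, value),
        (None, _) => None,
    }
}
```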

@hafihaf123 hafihaf123 force-pushed the issue-382/x-robots-tag branch from 0c8c3b3 to e66d95b Compare January 17, 2025 22:33
@hafihaf123
Contributor Author

I've attempted to implement your suggestions, but I still have a couple of uncertainties:

  1. I've used the OpaqueError type almost everywhere an Error type is required. Is this acceptable, or should I consider a different approach?
  2. I've implemented the FromStr trait for Element, expecting a string slice in the format: "<bot_name:> indexing_rule, indexing_rule_with_value: value <...>". However, I'm unsure if this is the correct interpretation or the best approach.

Please note that this implementation is not yet complete, as the XRobotsTag::decode() function is not functional at this stage.

@hafihaf123 hafihaf123 requested a review from GlenDC January 17, 2025 22:57
}

pub fn is_valid(&self) -> bool {
let rfc_822 = r"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s(0[1-9]|[12]\d|3[01])\s(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{2}\s([01]\d|2[0-4]):([0-5]\d|60):([0-5]\d|60)\s(UT|GMT|EST|EDT|CST|CDT|MST|MDT|PST|PDT|[+-]\d{4})$";
Member

Given that you are storing it as a string anyway, do we really care about validating these for now? And even if we do, I wonder if this really needs a regexp. This is definitely a file that needs more work even if we want to go for this.

Contributor Author

Is the ValidDate struct even relevant if it does not validate the String? I'm wondering if it wouldn't then make more sense to just revert it to a String in the UnavailableAfter enum variant.

}

fn check_is_valid(re: &str, date: &str) -> bool {
Regex::new(re)
Member

@GlenDC GlenDC Jan 19, 2025

I use regexes as little as possible, but even if you use them, you'll want to have them as lazy statics (which you can nowadays do with what the std library offers; in the past you needed the lazy_static crate for that, but no longer).

Compilation is the heavier part of regex usage. As such you want to do that only once, not every time you call this function.
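The pattern can be sketched with `std::sync::OnceLock`. To keep the example free of external crates, a `HashSet` of month names stands in for the compiled regex, and a counter (added only for demonstration) shows that the expensive initialisation runs once. The names `month_names`, `is_month`, and `BUILD_COUNT` are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts how often the expensive value is built (demonstration only).
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

// Returns a lazily initialised, process-wide set. A compiled regex could be
// stored in exactly the same way.
fn month_names() -> &'static HashSet<&'static str> {
    static MONTHS: OnceLock<HashSet<&'static str>> = OnceLock::new();
    MONTHS.get_or_init(|| {
        BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
        [
            "Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
        ]
        .iter()
        .copied()
        .collect()
    })
}

fn is_month(s: &str) -> bool {
    month_names().contains(s)
}
```

However often `is_month` is called, the closure passed to `get_or_init` runs at most once.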

type Err = OpaqueError;

fn from_str(s: &str) -> Result<Self, Self::Err> {
match s {
Member

This is a pretty strict check. Is it case sensitive, or can we be case insensitive here? Also, what about spaces: is there no need to trim first?
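If case-insensitive matching with trimming turns out to be the desired behaviour (an assumption, since the thread leaves the question open), a table lookup with `str::eq_ignore_ascii_case` is one allocation-free way to do it. `SimpleRule` and `parse_simple_rule` are hypothetical names covering only a few value-less rules.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SimpleRule {
    All,
    NoIndex,
    NoFollow,
    NoArchive,
}

// Lookup table of rule names; compared case-insensitively below.
const SIMPLE_RULES: &[(&str, SimpleRule)] = &[
    ("all", SimpleRule::All),
    ("noindex", SimpleRule::NoIndex),
    ("nofollow", SimpleRule::NoFollow),
    ("noarchive", SimpleRule::NoArchive),
];

// Trims surrounding whitespace, then matches without lowercasing a copy.
fn parse_simple_rule(s: &str) -> Option<SimpleRule> {
    let s = s.trim();
    SIMPLE_RULES
        .iter()
        .find(|(name, _)| s.eq_ignore_ascii_case(name))
        .map(|(_, rule)| *rule)
}
```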

@GlenDC
Member

GlenDC commented Jan 19, 2025

1. I've used the `OpaqueError` type almost everywhere an `Error` type is required. Is this acceptable, or should I consider a different approach?

Questions like this are answered by thinking about typical usage. Usually the parsing from the raw http header to the typed header will happen in the background as part of a layer or web extractor or the like. So while you could try to return something like a custom error that exposes the kind of error that happened, I don't see that being all that useful here, even if you decode manually. Because even if you know what specific error happened, the result is the same either way: it's an invalid header, so either you are okay with it or you are not.

TLDR: opaque Error is fine here. If there is ever a strong reason to expose certain error variants, it can be requested by opening a new issue about it. But that's a discussion for then.

2. I've implemented the `FromStr` trait for `Element`, expecting a string slice in the format: `"<bot_name:> indexing_rule, indexing_rule_with_value: value <...>"`. However, I'm unsure if this is the correct interpretation or the best approach.

The answer is that you'll want to ensure you support this: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Robots-Tag#syntax. Whether it's a valid implementation is something you check purely against that reference syntax. After that the question will be whether it's an optimal enough implementation, but let's first focus on implementing this correctly.

type Err = OpaqueError;

fn from_str(s: &str) -> Result<Self, Self::Err> {
let regex = Regex::new(r"^\s*([^:]+?):\s*(.+)$")
Member

There should be no need for a regex here. It's a pretty linear process, so you should be able to easily parse out rules. E.g. something like `read until ':' or EOF, ':' => ...`.

@GlenDC
Member

GlenDC commented Jan 20, 2025

You might want to read up a bit on how other folks have done it, even in other languages. E.g. look at https://github.com/nielsole/xrobotstag/blob/4cd7d8885a3e26fda9dd1a4075663cd3b80617f0/parser.go#L97. There's a lot to like about that implementation, but of course not to be copied as-is, given there is also plenty not to like about it. But what I personally take away from an implementation like this is:

  • I like the idea of just having a single RobotTag, which would look something like:
struct RobotsTag {
     bot_name: Option<HeaderValueString>,
     all: bool,
     no_follow: bool,
     unavailable_after: Option<chrono::DateTime<chrono::Utc>>, 
     // ...
     no_ai: bool,
     // ...
     custom_rules: Option<Vec<CustomRule>>,
}

CustomRule would be a private construct which can be as simple as

struct CustomRule {
      key: HeaderValueString,
      value: Option<HeaderValueString>,
}
  • you are already doing pretty well on parsing, but really there is no need for regexes here, at least not used by us directly
  • you can make use of the chrono crate and try the formats that are specified in the documentation, there are three possibilities: RFC 822, RFC 850, or ISO 8601. Using chrono makes that trivial.

As such the only things we really need to expose here are the XRobotsTag typed header and the RobotsTag as a single "element". So 2 structs.

Not sure if you are a fan of the builder pattern but I like it a lot for cases like this, as for each rule you can expose this method to RobotsTag

fn new() == Default::default

The default one would be basically an empty tag, not rendered at all. Which is okay as the Headers trait allows you to not encode anything in the encode step, which you would do in case all your tags would be empty.

fn all(&self) -> bool { self.all }
fn with_all(mut self, all: bool) -> Self { self.all = all; self }
fn set_all(&mut self, all: bool) ->  &mut Self { self.all = all; self }

This allows them to be constructed in any way. Feel free to define a private macro_rules! to easily create this triplet for the bool variants. It saves you from typing, and is easy enough to do if you combine it with the paste crate already used in other parts of rama.
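A rough sketch of such a macro, under the assumption that the three method names are passed in explicitly. This avoids the paste crate (which would be needed to derive `with_all` and `set_all` from `all`) so that the example compiles with the standard library alone. The macro name `bool_rule` and the reduced `RobotsTag` are hypothetical.

```rust
// Generates the getter / consuming setter / in-place setter triplet for one
// bool field. Method names are spelled out because plain macro_rules! cannot
// concatenate identifiers.
macro_rules! bool_rule {
    ($field:ident, $with:ident, $set:ident) => {
        pub fn $field(&self) -> bool {
            self.$field
        }

        pub fn $with(mut self, value: bool) -> Self {
            self.$field = value;
            self
        }

        pub fn $set(&mut self, value: bool) -> &mut Self {
            self.$field = value;
            self
        }
    };
}

// Reduced stand-in with only three of the bool rules.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct RobotsTag {
    all: bool,
    no_follow: bool,
    no_ai: bool,
}

impl RobotsTag {
    pub fn new() -> Self {
        Self::default()
    }

    bool_rule!(all, with_all, set_all);
    bool_rule!(no_follow, with_no_follow, set_no_follow);
    bool_rule!(no_ai, with_no_ai, set_no_ai);
}
```

Both `RobotsTag::new().with_all(true)` and `tag.set_no_follow(true).set_no_ai(true)` then work on the same struct.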

The custom rules can be constructed without exposing the CustomRule struct. And to get them you can just return an opaque iterator with a tuple &str or w/e.

I think this is pretty elegant and does everything you wanted to do without even having to expose our internal representation of it. Keeps the structure also a lot more flat without really losing much, given the RobotsTag should be able to be represented pretty small. It's in the end just a bunch of booleans and options, with the value parts of the option being on the heap, so pretty great for this purpose.


TLDR: great job on your work so far. It seems you are getting pretty comfortable with the codebase and definitely know your way around rust. I hope my feedback and guidance above helps you finish it. Feel free to push back on anything you do not agree with or to propose alternatives.

Goal is mostly to keep this thing as simple and minimal as possible, while still allowing people to also do custom stuff, but without exposing too much of our internal API as to give the flexibility to change stuff internally without breaking stuff. Not that I mind making breaking changes where needed, but even better if you can upgrade without needing to break.

@hafihaf123
Contributor Author

This allows them to be constructed in any way. Feel free to define a private macro_rules! to easily create this triplet for the bool variants. It saves you from typing, and is easy enough to do if you combine it with the paste crate already used in other parts of rama.

I was wondering to maybe add some sort of #[derive(Builder)] functionality to the builder to generate the boilerplate code but I am wondering whether it is even possible as I am quite new to rust macros. From what I've found, I would probably have needed to define it as a procedural macro and kept it in a separate crate. I found rama-macros in the code base, but in its documentation, it states

There are no more macros for Rama.

@GlenDC
Member

GlenDC commented Jan 20, 2025

I was wondering to maybe add some sort of #[derive(Builder)] functionality to the builder to generate the boilerplate code but

Out of scope for this PR and overkill for what you need here. A macro_rules! is plenty to fill our needs here.

If one day you want to learn to write and read proc macros you might enjoy https://www.manning.com/books/write-powerful-rust-macros

Could use some help in my venndb crate for a future release

@hafihaf123
Contributor Author

Another thing: do you prefer a separate builder struct (example: let robots_tag = RobotsTag::builder().noindex().max_video_preview(5).build()) or building using setters (example: let mut robots_tag = RobotsTag::new(); robots_tag.set_noindex(true).set_max_video_preview(5) or let robots_tag = RobotsTag::new().with_noindex(true).with_max_video_preview(5))?

@GlenDC
Member

GlenDC commented Jan 21, 2025

another thing, do you prefer a separate builder struct (example: let robots_tag = RobotsTag::builder().noindex().max_video_preview(5).build()) or building using setters (example: let robots_tag = RobotsTag::new(); robots_tag.set_noindex(true).set_max_video_preview(5) or let robots_tag = RobotsTag::new().with_noindex(true).with_max_video_preview(5))

The reason to introduce a separate builder struct, e.g. RobotsTagBuilder, is that, when done right, it lets you rule out an invalid state at compile time. For example, it would let you make sure that this does not compile:

RobotsTag::builder().build()

At that point nothing has been set yet. In my proposal above I said that as a shortcut you can just ignore an empty RobotsTag and not write it. But if you do want to go this extra mile we can indeed prevent it at compile time, which I would argue is better.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=82444bb713185c3f09ceeb917cc4fd66 here is an example playground to show what I mean by that

Reason why you would need a builder for that is because:

  1. the most important reason: you avoid polluting your otherwise generic-free struct with generics
  2. less important: it lets you get around the name collision, as we would otherwise have three methods named a on the same struct, which is not allowed in rust

I'm totally okay with you building that out. It's a good proposal, so why not.
Even if you do that I would still add setters to the regular struct as well, as that allows you, e.g. in a proxy setting, to still modify the struct itself. Another option is to be able to turn it back into a builder, e.g. impl From and perhaps also add into_builder. That works fine as well.

Btw note also in my example that I add both the "consume" variant (mut self) and the setter variant (&mut self) to the builder. Please do this as well if you make a builder. Because in some scenarios you want to consume while in another one it is not as easy.

In cases where you want to pass it immediately to a function or parent structure it is easier to just consume:

foo(Foo::builder().a(true).b(42).build())

But in cases where you work with conditionals it's more convenient to have the setter available as well, e.g.:

let mut foo = Foo::builder().a(true);
if bar {
    foo.set_b(true);
}

Of course in the case of a bool this seems a bit silly, as you could just pass bar to set_b. But in cases where the property type is a bit more complex, e.g. an enum, it becomes less easy. As such, I usually provide both options.
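The linked playground is not reproduced here. The following is an independent sketch of the same idea, assuming a typestate builder with hypothetical marker types `Empty` and `NonEmpty`: `build()` exists only once at least one rule has been set, and both consuming and `&mut self` setter variants are offered. The reduced `RobotsTag` with public fields is for illustration only.

```rust
use std::marker::PhantomData;

// Reduced stand-in; the real struct would carry many more rules.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct RobotsTag {
    pub no_index: bool,
    pub no_follow: bool,
    pub max_video_preview: Option<u32>,
}

// Typestate markers (hypothetical names).
pub struct Empty;
pub struct NonEmpty;

// The generic parameter lives on the builder only, so RobotsTag stays generic-free.
pub struct RobotsTagBuilder<State> {
    tag: RobotsTag,
    _state: PhantomData<State>,
}

impl RobotsTag {
    pub fn builder() -> RobotsTagBuilder<Empty> {
        RobotsTagBuilder { tag: RobotsTag::default(), _state: PhantomData }
    }
}

// Consuming variants: available in any state, always end in NonEmpty.
impl<State> RobotsTagBuilder<State> {
    fn into_non_empty(self) -> RobotsTagBuilder<NonEmpty> {
        RobotsTagBuilder { tag: self.tag, _state: PhantomData }
    }

    pub fn no_index(mut self) -> RobotsTagBuilder<NonEmpty> {
        self.tag.no_index = true;
        self.into_non_empty()
    }

    pub fn max_video_preview(mut self, seconds: u32) -> RobotsTagBuilder<NonEmpty> {
        self.tag.max_video_preview = Some(seconds);
        self.into_non_empty()
    }
}

// Setter variants and build(): only once something has been set.
impl RobotsTagBuilder<NonEmpty> {
    pub fn set_no_follow(&mut self, value: bool) -> &mut Self {
        self.tag.no_follow = value;
        self
    }

    pub fn build(self) -> RobotsTag {
        self.tag
    }
}
```

With this shape, `RobotsTag::builder().build()` is rejected by the compiler because `RobotsTagBuilder<Empty>` has no `build` method. One limitation of the sketch: a `&mut self` setter cannot change the builder's type, so it is only offered after the first rule is set, and nothing stops a caller from later resetting a value through a setter.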
