Issue 382/ support X-Robots-Tag as a typed http header XRobotsTag #393

hafihaf123 · 2025-01-12T21:55:51Z

Initial Implementation of `X-Robots-Tag` Header

This pull request introduces an initial implementation of the X-Robots-Tag header for the rama-http crate and closes #382.

Summary of Changes:

Header Registration:
- Added the x-robots-tag header to the static headers list in rama-http-types/src/lib.rs.
Implementation of the Header:
- Created an XRobotsTag struct that represents the header and implements the rama_http::headers::Header trait.
- Defined a Rule enum to represent indexing rules such as noindex, nofollow, and max-snippet. Complex rules like max-image-preview and unavailable_after are also included.
- Designed an Element struct to represent a rule optionally associated with a bot name.
- Added an iterator for XRobotsTag to iterate over its elements.
File Structure:
- The implementation resides in a new module at rama-http/src/headers/x_robots_tag/, which includes the following files:
  - rule.rs: Defines the Rule enum and parsing logic.
  - element.rs: Implements the Element struct and its parsing/formatting logic.
  - iterator.rs: Provides an iterator for XRobotsTag.
  - mod.rs: Combines and exposes the module’s functionality.
Encoding and Decoding:
- Encoding and decoding are implemented using the Header trait, supporting CSV-style comma-separated values.

Questions and Feedback Requested:

Code Structure:
- Is the location of the new module (x_robots_tag) appropriate?
- Does the filename structure and organization align with project conventions?
Implementation Design:
- Are there any improvements or alternative approaches you’d suggest for the Rule and Element structs?
- Is the use of Vec<Element> for XRobotsTag suitable, or are there optimizations to consider?
Edge Cases and Standards:
- Are there additional rules, formats, or edge cases that should be covered?

Testing:

For now, the module is not tested; I will add the tests once the code structure becomes stable.

I look forward to your feedback and suggestions for improvement. Thank you!

GlenDC

Not a bad start. Do take your time to complete it and add sufficient tests. But great start indeed. Added a couple of remarks to help you with some more guidance.

GlenDC · 2025-01-12T22:06:43Z

rama-http-types/src/lib.rs

@@ -131,6 +131,9 @@ pub mod header {
        "x-real-ip",
    ];

+    // non-std web-crawler info headers
+    static_header!["x-robots-tag",];


Suggested change

-    static_header!["x-robots-tag",];
+    //
+    // More information at
+    // <https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Robots-Tag>.
+    static_header!["x-robots-tag"];

GlenDC · 2025-01-12T22:09:53Z

rama-http/src/headers/x_robots_tag/element.rs

+
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct Element {
+    bot_name: Option<String>, // or `rama_ua::UserAgent` ???


You can copy https://github.com/hyperium/headers/blob/master/src/util/value_string.rs to https://github.com/plabayo/rama/tree/main/rama-http/src/headers/util and use that one here, that's also what the UserAgent typed header uses: https://docs.rs/headers/0.4.0/src/headers/common/user_agent.rs.html#43

GlenDC · 2025-01-12T22:13:11Z

rama-http/src/headers/x_robots_tag/element.rs

+use std::str::FromStr;
+
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct Element {


Depending how you structure it, this actually has to be either:

struct Element { bot_name: Option<HeaderValueString>, rules: Vec<Rule> }

or

enum Element { BotName(HeaderValueString), Rule(Rule), }

This is because when a bot name is mentioned, it applies to all rules that follow it, until another bot name is mentioned.
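To make the grouping concrete, here is a minimal sketch of the first shape. It assumes `String` as a stand-in for `HeaderValueString`, a deliberately tiny `Rule` enum, and a hypothetical helper `group_by_bot`; none of these names come from the PR. It only handles value-less rules, so something like `max-snippet: 10` would be misread as a bot name and skipped.

```rust
/// Hypothetical, trimmed-down rule set for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Rule {
    NoIndex,
    NoFollow,
}

/// One bot name (or none) together with every rule that follows it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Element {
    pub bot_name: Option<String>, // stand-in for `HeaderValueString`
    pub rules: Vec<Rule>,
}

/// Groups comma-separated tokens so that a `name:` prefix applies to all
/// following rules until the next `name:` prefix appears.
pub fn group_by_bot(header: &str) -> Vec<Element> {
    let mut out: Vec<Element> = Vec::new();
    for token in header.split(',').map(str::trim).filter(|t| !t.is_empty()) {
        // An optional `bot:` prefix starts a new group.
        let (bot, rule_str) = match token.split_once(':') {
            Some((bot, rest)) => (Some(bot.trim()), rest.trim()),
            None => (None, token),
        };
        let rule = if rule_str.eq_ignore_ascii_case("noindex") {
            Rule::NoIndex
        } else if rule_str.eq_ignore_ascii_case("nofollow") {
            Rule::NoFollow
        } else {
            continue; // unknown rules are skipped in this sketch
        };
        // Without a new bot name the rule joins the current group, if any.
        if bot.is_none() {
            if let Some(last) = out.last_mut() {
                last.rules.push(rule);
                continue;
            }
        }
        // New bot name, or the very first token: open a new group.
        out.push(Element {
            bot_name: bot.map(str::to_owned),
            rules: vec![rule],
        });
    }
    out
}
```

With this sketch, `group_by_bot("noindex, googlebot: nofollow, noindex")` yields two elements: one without a bot name holding `NoIndex`, and one for `googlebot` holding both `NoFollow` and the trailing `NoIndex`.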

GlenDC · 2025-01-12T22:14:37Z

rama-http/src/headers/x_robots_tag/element.rs

+    fn from_str(s: &str) -> Result<Self, Self::Err> {
+        let (bot_name, indexing_rule) = match Rule::from_str(s) {
+            Ok(rule) => (None, Ok(rule)),
+            Err(e) => match *s.split(":").map(str::trim).collect::<Vec<_>>().as_slice() {


You'll need to modify the Element code anyway, so this might disappear, but as extra info: be careful with collecting stuff in a Vec just for temporary reasons. It gives allocations for not much good. Instead, it's better to split into a fixed-size tuple, or use the iterator directly.

GlenDC · 2025-01-12T22:16:01Z

rama-http/src/headers/x_robots_tag/rule.rs

+    MaxVideoPreview(Option<u32>),
+    NoTranslate,
+    NoImageIndex,
+    UnavailableAfter(String), // "A date must be specified in a format such as RFC 822, RFC 850, or ISO 8601."


This best be validated and enforced in a typed manner :)

GlenDC · 2025-01-12T22:17:12Z

rama-http/src/headers/x_robots_tag/rule.rs

+use std::str::FromStr;
+
+#[derive(Clone, Debug, Eq, PartialEq)]
+pub(super) enum Rule {


Don't forget about the custom ones: noai, noimageai, SPC

GlenDC · 2025-01-12T22:19:08Z

rama-http/src/headers/x_robots_tag/rule.rs

+    }
+}
+
+impl TryFrom<&[&str]> for Rule {


Don't think it's worth implementing a public trait for this. You can just make simple private functions from this, and instead of collecting it into a Vec I would just use the iterator's next method. You can then have one private fn for 1 arg and one for 2 args.
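A rough sketch of that shape, assuming a trimmed-down `Rule` enum and hypothetical function names (`rule_from_name`, `rule_from_name_value`, `parse_rule`); the real enum in the PR has more variants and uses `OpaqueError` rather than `Option`:

```rust
/// Trimmed-down rule set for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub(crate) enum Rule {
    NoIndex,
    NoFollow,
    MaxSnippet(u32),
}

/// Parses a rule that takes no value, e.g. `noindex`.
fn rule_from_name(name: &str) -> Option<Rule> {
    match name {
        "noindex" => Some(Rule::NoIndex),
        "nofollow" => Some(Rule::NoFollow),
        _ => None,
    }
}

/// Parses a rule that carries a value, e.g. `max-snippet: 10`.
fn rule_from_name_value(name: &str, value: &str) -> Option<Rule> {
    match name {
        "max-snippet" => value.parse::<u32>().ok().map(Rule::MaxSnippet),
        _ => None,
    }
}

/// Pulls one or two pieces off the iterator with `next()` and dispatches
/// to the matching helper; nothing is collected into a `Vec`.
pub(crate) fn parse_rule(s: &str) -> Option<Rule> {
    let mut parts = s.splitn(2, ':').map(str::trim);
    let name = parts.next()?;
    match parts.next() {
        Some(value) => rule_from_name_value(name, value),
        None => rule_from_name(name),
    }
}
```

Since `splitn(2, ':')` yields at most two pieces, the second `next()` call is enough to tell the one-argument case from the two-argument case.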

hafihaf123 · 2025-01-17T22:56:59Z

I've attempted to implement your suggestions, but I still have a couple of uncertainties:

1. I've used the `OpaqueError` type almost everywhere an `Error` type is required. Is this acceptable, or should I consider a different approach?
2. I've implemented the `FromStr` trait for `Element`, expecting a string slice in the format: `"<bot_name:> indexing_rule, indexing_rule_with_value: value <...>"`. However, I'm unsure if this is the correct interpretation or the best approach.

Please note that this implementation is not yet complete, as the XRobotsTag::decode() function is not functional at this stage.

GlenDC · 2025-01-19T12:01:17Z

rama-http/src/headers/x_robots_tag/valid_date.rs

+    }
+
+    pub fn is_valid(&self) -> bool {
+        let rfc_822 = r"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s(0[1-9]|[12]\d|3[01])\s(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{2}\s([01]\d|2[0-4]):([0-5]\d|60):([0-5]\d|60)\s(UT|GMT|EST|EDT|CST|CDT|MST|MDT|PST|PDT|[+-]\d{4})$";


Given that you are storing it as a string anyway, do we really care about validating these for now? And even if we do, I wonder if this really needs a regexp. This is definitely a file that needs more work, even if we want to go for this.

Is the ValidDate struct even relevant if it does not validate the String? I'm wondering if it wouldn't then make more sense to just revert it to a String in the UnavailableAfter enum variant.

GlenDC · 2025-01-19T12:02:10Z

rama-http/src/headers/x_robots_tag/valid_date.rs

+}
+
+fn check_is_valid(re: &str, date: &str) -> bool {
+    Regex::new(re)


I use regexes as little as possible, but even if you use them, you'll want to have them as lazy statics (which you can nowadays do with what the std library offers; in the past you needed the lazy_static crate for that, but no longer).

Compilation is the heaviest part of regex usage, so you want to do it only once, not every time you call this function.
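A small sketch of the "compile once" idea using only `std::sync::OnceLock`. To keep it dependency-free, `Pattern` is a hypothetical stand-in for a compiled regex (a real version would store something like `regex::Regex`), and `COMPILE_COUNT` exists purely to show that the expensive step runs a single time:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the expensive step runs (for demonstration only).
pub static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a compiled regex; a real implementation would hold
/// something like `regex::Regex` instead.
pub struct Pattern {
    prefix: &'static str,
}

impl Pattern {
    /// Pretend this is the costly compilation step.
    fn compile(prefix: &'static str) -> Self {
        COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
        Pattern { prefix }
    }

    pub fn is_match(&self, s: &str) -> bool {
        s.starts_with(self.prefix)
    }
}

/// Returns the shared pattern, compiling it on first use only.
pub fn weekday_pattern() -> &'static Pattern {
    static PATTERN: OnceLock<Pattern> = OnceLock::new();
    PATTERN.get_or_init(|| Pattern::compile("Mon, "))
}

/// Every call reuses the same `Pattern` instead of rebuilding it.
pub fn check_is_valid(date: &str) -> bool {
    weekday_pattern().is_match(date)
}
```

On recent toolchains `std::sync::LazyLock` expresses the same thing a little more compactly; `OnceLock` is used here only because it has been stable for longer.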

GlenDC · 2025-01-19T12:03:17Z

rama-http/src/headers/x_robots_tag/rule.rs

+    type Err = OpaqueError;
+
+    fn from_str(s: &str) -> Result<Self, Self::Err> {
+        match s {


This is a pretty strict check. Is it case sensitive, or can we be case insensitive here? Also, what about spaces: is there no need to trim first?
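If a lenient reading is what we want (that is an assumption, not something settled in this thread), a sketch could trim first and compare with `eq_ignore_ascii_case`. The `Rule` enum here is a cut-down illustration and `String` replaces `OpaqueError` to keep the example standalone:

```rust
/// Cut-down rule set for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Rule {
    NoIndex,
    NoFollow,
    NoArchive,
}

impl std::str::FromStr for Rule {
    type Err = String; // the PR uses `OpaqueError` here

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Ignore surrounding whitespace before matching.
        let s = s.trim();
        // ASCII case-insensitive comparison, no allocation needed.
        if s.eq_ignore_ascii_case("noindex") {
            Ok(Rule::NoIndex)
        } else if s.eq_ignore_ascii_case("nofollow") {
            Ok(Rule::NoFollow)
        } else if s.eq_ignore_ascii_case("noarchive") {
            Ok(Rule::NoArchive)
        } else {
            Err(format!("unknown rule: {s}"))
        }
    }
}
```

This avoids lowercasing into a temporary `String`, so the lenient match costs no allocation on the success path.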

GlenDC · 2025-01-19T12:09:04Z

1. I've used the `OpaqueError` type almost everywhere an `Error` type is required. Is this acceptable, or should I consider a different approach?

Questions like this are answered by thinking about its typical usage. Usually the parsing from the raw http header to the typed header will happen in the background as part of a layer or web extractor or the like. So while you could try to return something like a custom error that exposes the kind of error that happened, I don't see that being all that useful here, even if you manually decode. Because even if you know what specific error happened, the result is the same either way: it's an invalid header, so either you are okay with it or you are not.

TLDR: opaque Error is fine here. If there is ever a strong reason to expose certain error variants, it can be requested by opening a new issue about it. But that's a discussion for then.

2. I've implemented the `FromStr` trait for `Element`, expecting a string slice in the format: `"<bot_name:> indexing_rule, indexing_rule_with_value: value <...>"`. However, I'm unsure if this is the correct interpretation or the best approach.

The answer is that you'll want to ensure you support this: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Robots-Tag#syntax. So to check whether it's a valid implementation, you would compare it against that reference syntax. After that, the question will be whether it's an optimal enough implementation, but let's first focus on implementing this correctly.

GlenDC · 2025-01-20T12:15:02Z

rama-http/src/headers/x_robots_tag/element.rs

+    type Err = OpaqueError;
+
+    fn from_str(s: &str) -> Result<Self, Self::Err> {
+        let regex = Regex::new(r"^\s*([^:]+?):\s*(.+)$")


There should be no need for a regex here; it's a pretty linear process, so you should be able to easily parse out rules. E.g. something like: read until ':' or EOF, ':' => ...
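As a sketch of what "read until ':' or EOF" could look like, the hypothetical helper `split_name_and_rest` below approximates the job of the regex `^\s*([^:]+?):\s*(.+)$` with plain string slicing. It is not an exact equivalent (for example, it rejects a remainder that is only whitespace):

```rust
/// Hand-written approximation of the pattern `^\s*([^:]+?):\s*(.+)$`:
/// a non-empty name, a colon, then a non-empty remainder.
pub fn split_name_and_rest(s: &str) -> Option<(&str, &str)> {
    // Read until the first ':'; hitting the end of input means no name.
    let colon = s.find(':')?;
    let name = s[..colon].trim();
    let rest = s[colon + 1..].trim();
    if name.is_empty() || rest.is_empty() {
        return None;
    }
    Some((name, rest))
}
```

Because only the first colon is used as the separator, a value that itself contains colons (such as a timestamp) stays intact in the second half.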

GlenDC · 2025-01-20T12:44:24Z

You might want to read up a bit on how other folks have done it, even in other languages. E.g. look at https://github.com/nielsole/xrobotstag/blob/4cd7d8885a3e26fda9dd1a4075663cd3b80617f0/parser.go#L97. There's a lot to like about that implementation, but of course not to be copied as-is, given there is also plenty not to like about it. But what I personally take away from an implementation like this is:

I like the idea of just having a single RobotTag, which would look something like:

struct RobotsTag {
     bot_name: Option<HeaderValueString>,
     all: bool,
     no_follow: bool,
     unavailable_after: Option<chrono::DateTime<chrono::Utc>>, 
     // ...
     no_ai: bool,
     // ...
     custom_rules: Option<Vec<CustomRule>>,
}

CustomRule would be a private construct which can be as simple as

struct CustomRule {
      key: HeaderValueString,
      value: Option<HeaderValueString>,
}

you are already doing pretty well on parsing, but really there is no need for regexes here, at least not by us directly
you can make use of the chrono crate and try the formats that are specified in the documentation, there are three possibilities: RFC 822, RFC 850, or ISO 8601. Using chrono makes that trivial.

As such the only things we really need to expose here are the XRobotsTag typed header and the RobotsTag as a single "element". So 2 structs.

Not sure if you are a fan of the builder pattern but I like it a lot for cases like this, as for each rule you can expose this method to RobotsTag

fn new() == Default::default

The default one would be basically an empty tag, not rendered at all. Which is okay as the Headers trait allows you to not encode anything in the encode step, which you would do in case all your tags would be empty.

fn all(&self) -> bool { self.all }
fn with_all(mut self, all: bool) -> Self { self.all = all; self }
fn set_all(&mut self, all: bool) ->  &mut Self { self.all = all; self }

This allows them to be constructed in any way. Feel free to define a private macro_rules! to easily create this triplet for the bool variants; it saves you from typing and is easy enough to do if you combine it with the paste crate already used in other parts of rama.
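For illustration, a dependency-free sketch of such a macro. The macro name `bool_field` and the three-field `RobotsTag` are assumptions made for this example. Since plain `macro_rules!` cannot concatenate identifiers, all three method names are passed explicitly; with the paste crate the `with_*` and `set_*` names could be derived from the field name instead.

```rust
/// Generates a getter, a consuming `with_*` setter and a borrowing `set_*`
/// setter for one boolean field. All three names are passed explicitly
/// because plain `macro_rules!` cannot concatenate identifiers.
macro_rules! bool_field {
    ($field:ident, $with:ident, $set:ident) => {
        pub fn $field(&self) -> bool {
            self.$field
        }

        pub fn $with(mut self, value: bool) -> Self {
            self.$field = value;
            self
        }

        pub fn $set(&mut self, value: bool) -> &mut Self {
            self.$field = value;
            self
        }
    };
}

/// Reduced to three flags for the sake of the example.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct RobotsTag {
    all: bool,
    no_follow: bool,
    no_ai: bool,
}

impl RobotsTag {
    pub fn new() -> Self {
        Self::default()
    }

    bool_field!(all, with_all, set_all);
    bool_field!(no_follow, with_no_follow, set_no_follow);
    bool_field!(no_ai, with_no_ai, set_no_ai);
}
```

Both styles then work: `RobotsTag::new().with_all(true).with_no_ai(true)` for chaining by value, or `tag.set_no_follow(true)` on an existing mutable value.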

The custom rules can be constructed without exposing the CustomRule struct. And to get them you can just return an opaque iterator with a tuple &str or w/e.
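A sketch of how that could look, assuming `String` in place of `HeaderValueString` and hypothetical method names `with_custom_rule` and `custom_rules`; only `RobotsTag` is public, `CustomRule` never leaves the module:

```rust
/// Private representation; `String` stands in for `HeaderValueString`.
#[derive(Debug, Clone)]
struct CustomRule {
    key: String,
    value: Option<String>,
}

#[derive(Debug, Clone, Default)]
pub struct RobotsTag {
    custom_rules: Option<Vec<CustomRule>>,
}

impl RobotsTag {
    /// Adds a custom rule without exposing `CustomRule` to callers.
    pub fn with_custom_rule(mut self, key: &str, value: Option<&str>) -> Self {
        self.custom_rules
            .get_or_insert_with(Vec::new)
            .push(CustomRule {
                key: key.to_owned(),
                value: value.map(str::to_owned),
            });
        self
    }

    /// Yields `(key, value)` pairs; the concrete iterator type stays hidden
    /// behind `impl Iterator`, so the internals can change freely.
    pub fn custom_rules(&self) -> impl Iterator<Item = (&str, Option<&str>)> + '_ {
        self.custom_rules
            .iter()
            .flatten()
            .map(|rule| (rule.key.as_str(), rule.value.as_deref()))
    }
}
```

Callers only ever see string slices, so swapping the `Vec` for another container later would not be a breaking change.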

I think this is pretty elegant and does everything you wanted to do without even having to expose our internal representation of it. Keeps the structure also a lot more flat without really losing much, given the RobotsTag should be able to be represented pretty small. It's in the end just a bunch of booleans and options, with the value parts of the option being on the heap, so pretty great for this purpose.

TLDR: great job on your work so far. Seems you are getting pretty comfortable with the codebase and definitely know your way around Rust. I hope my feedback and guidance above helps you out in finishing it. Feel free to push back on something you do not agree with or propose alternatives.

Goal is mostly to keep this thing as simple and minimal as possible, while still allowing people to also do custom stuff, but without exposing too much of our internal API as to give the flexibility to change stuff internally without breaking stuff. Not that I mind making breaking changes where needed, but even better if you can upgrade without needing to break.

hafihaf123 · 2025-01-20T21:26:31Z

This allows them to be constructed in any way. Feel free to define a private macro_rules! to easily create this triplet for the bool variants; it saves you from typing and is easy enough to do if you combine it with the paste crate already used in other parts of rama.

I was wondering to maybe add some sort of #[derive(Builder)] functionality to the builder to generate the boilerplate code but I am wondering whether it is even possible as I am quite new to rust macros. From what I've found, I would probably have needed to define it as a procedural macro and kept it in a separate crate. I found rama-macros in the code base, but in its documentation, it states

There are no more macros for Rama.

GlenDC · 2025-01-20T21:32:28Z

I was wondering to maybe add some sort of #[derive(Builder)] functionality to the builder to generate the boilerplate code but

Out of scope for this PR and overkill for what you need here. A macro_rules! is plenty for our needs here.

If one day you want to learn to write and read proc macros you might enjoy https://www.manning.com/books/write-powerful-rust-macros

Could use some help in my venndb crate for a future release

hafihaf123 · 2025-01-20T22:12:40Z

another thing, do you prefer a separate builder struct (example: let robots_tag = RobotsTag::builder().noindex().max_video_preview(5).build()) or building using setters (example: let robots_tag = RobotsTag::new(); robots_tag.set_noindex(true).set_max_video_preview(5) or let robots_tag = RobotsTag::new().with_noindex(true).with_max_video_preview(5))

GlenDC · 2025-01-21T09:06:37Z

another thing, do you prefer a separate builder struct (example: let robots_tag = RobotsTag::builder().noindex().max_video_preview(5).build()) or building using setters (example: let robots_tag = RobotsTag::new(); robots_tag.set_noindex(true).set_max_video_preview(5) or let robots_tag = RobotsTag::new().with_noindex(true).with_max_video_preview(5))

The reason to introduce a separate builder struct, e.g. RobotsTagBuilder, is that, when done right, it lets you rule out an impossible state at compile time. For example, it would let you make sure that this does not compile:

RobotsTag::builder().build()

As at that point there is not yet anything set. In my proposal above I said as a shortcut you can just ignore empty robotstag and not write those. But if you do want to go this extra mile we can indeed prevent it at compile time, which is better so I would argue.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=82444bb713185c3f09ceeb917cc4fd66 here is an example playground to show what I mean by that
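For readers without access to the playground, here is an independent sketch of the idea (not a copy of the linked gist). The marker types `Empty` and `NonEmpty` and the two-flag `RobotsTag` are assumptions made for the example. The builder carries its state as a type parameter, and `build` is only implemented for the `NonEmpty` state:

```rust
use std::marker::PhantomData;

/// Generic-free target struct, reduced to two flags for the example.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct RobotsTag {
    no_index: bool,
    no_follow: bool,
}

/// Marker types for the builder state (hypothetical names).
pub struct Empty;
pub struct NonEmpty;

/// Only the builder is generic; `RobotsTag` itself stays free of generics.
pub struct RobotsTagBuilder<State> {
    tag: RobotsTag,
    _state: PhantomData<State>,
}

impl RobotsTag {
    pub fn builder() -> RobotsTagBuilder<Empty> {
        RobotsTagBuilder { tag: RobotsTag::default(), _state: PhantomData }
    }

    pub fn no_index(&self) -> bool {
        self.no_index
    }

    pub fn no_follow(&self) -> bool {
        self.no_follow
    }
}

// Setting any rule moves the builder into the `NonEmpty` state.
// (Simplification: even passing `false` counts as "something was set".)
impl<State> RobotsTagBuilder<State> {
    pub fn no_index(mut self, value: bool) -> RobotsTagBuilder<NonEmpty> {
        self.tag.no_index = value;
        RobotsTagBuilder { tag: self.tag, _state: PhantomData }
    }

    pub fn no_follow(mut self, value: bool) -> RobotsTagBuilder<NonEmpty> {
        self.tag.no_follow = value;
        RobotsTagBuilder { tag: self.tag, _state: PhantomData }
    }
}

// `build` exists only once at least one rule has been set, so
// `RobotsTag::builder().build()` is rejected by the compiler.
impl RobotsTagBuilder<NonEmpty> {
    pub fn build(self) -> RobotsTag {
        self.tag
    }
}
```

`RobotsTag::builder().no_index(true).build()` compiles, while calling `build()` directly on the fresh builder fails with a "method not found" error because `RobotsTagBuilder<Empty>` has no such method.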

Reason why you would need a builder for that is because:

the most important reason: you avoid that you are polluting your otherwise generic-free struct with generics
less important is that it lets you get around the name collision, as we would otherwise have 3 methods named `a` on the same struct, which is not allowed in Rust

I'm totally okay with you building that out, good proposal of you, so why not.
Even if you do that I would still add setters to the regular struct as well as it allows, e.g. in a proxy setting, to still modify the struct itself. Another option is to be able to turn it back into a builder e.g. impl From and perhaps also add into_builder. That works fine as well.

Btw note also in my example that I add both the "consume" variant (mut self) and the setter variant (&mut self) to the builder. Please do this as well if you make a builder. Because in some scenarios you want to consume while in another one it is not as easy.

in cases where you want to pass it immediately to a function or parent structure it is easier to just consume:

foo(Foo::builder().a(true).b(42).build())

but in cases where you work with conditionals it's more convenient to have the setter also available, e.g.:

let mut foo = Foo::builder().a(true);
if bar {
    foo.set_b(true);
}

Of course in the case of a bool this seems a bit silly, as you could just assign bar to set_b. But for cases where the property type is a bit more complex, e.g. an enum it becomes less easy so. As such, I usually provide both options.

hafihaf123 force-pushed the issue-382/x-robots-tag branch from 1a46b37 to f3c6d97 Compare January 12, 2025 22:01

GlenDC requested changes Jan 12, 2025

hafihaf123 added 12 commits January 17, 2025 23:32

add XRobotsTag, initial implementation

ce6e2b7

add value_string.rs

ff26238

add more context with comments

caefce6

add ValidDate, custom rules

a7b8ebd

fix value_string.rs visibility issues

f696c50

rename Iterator to ElementIter

78c2ba6

fix visibility issues

23c8fef

change trait TryFrom<&[&str]> to private function from_iter

36af384

separate 'split_csv_str' function from 'from_comma_delimited'

4dacfcb

change bot_name field type to 'HeaderValueString' and indexing_rule f…

a57a00b

…ield to 'Vec<Rule>'

implement FromStr for Element

d4fa1ad

reformat with rustfmt

e66d95b

hafihaf123 force-pushed the issue-382/x-robots-tag branch from 0c8c3b3 to e66d95b Compare January 17, 2025 22:33

todo/ fix XRobotsTag::decode()

6d0cf14

hafihaf123 requested a review from GlenDC January 17, 2025 22:57

GlenDC reviewed Jan 19, 2025

Merge branch 'plabayo:main' into issue-382/x-robots-tag

6c350db

GlenDC reviewed Jan 20, 2025

Merge branch 'plabayo:main' into issue-382/x-robots-tag

2c2dcfa
