Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too frustrating: converting from ExaBGP 3.4 to 4.2.21 #1179

Open
he32 opened this issue Sep 18, 2023 · 21 comments
Open

Too frustrating: converting from ExaBGP 3.4 to 4.2.21 #1179

he32 opened this issue Sep 18, 2023 · 21 comments
Assignees

Comments

@he32
Copy link
Contributor

he32 commented Sep 18, 2023

  • The exabgp.conf syntax changed, but the documentation didn't
    (exabgp.conf(5)). No "group" at outer layer anymore. Apparently,
    nobody took the hint to "please keep this updated" and "please make
    this reasonably complete" when I submitted it earlier.
    Also, ExaBGP 4 did not complain when I asked it to validate the old 3.4-based configuration file(!)

  • Running with

[exabgp.daemon]
daemonize = true

[exabgp.log]
destination = syslog
short = true
level = WARNING

and adding the new

[exabgp.api]
ack = false

because I use the healthcheck module and I don't think it consumes any input, I earlier got only

Sep 18 19:09:54 oliven -: healthcheck-resolver[17649]: send announces for EXIT state to ExaBGP

in the log. Nothing else.

Commenting out "daemonize" and "destination" and running with -d gets more detailed information: it turns out that /var/run/exabgp/exabgp.in and exabgp.out were missing, they apparently need to exist as named pipes and be writable by nobody.

  • Now, I use ExaBGP to set up two BGP sessions, one for IPv4 and one
    for IPv6, to announce members of a resolver cluster to our routers.

    The healthcheck module outputs (when run interactively)

announce route a.b.c.d/32 next-hop self med 100
announce route 2001:x:0:y:z::1/128 next-hop self med 101

This used to work nicely with ExaBGP 3.4, but apparently will not with ExaBGP 4.2.21. Running ExaBGP in the foreground with debug reveals

command from process recursor : announce route a.b.c.d/32 next-hop self med 100 
async | recursor | announce route a.b.c.d/32 next-hop self med 100
no neighbor matching the command : announce route a.b.c.d/32 next-hop self med 100

So... ExaBGP is now bone-headed enough to insist on an explicit neighbor specification
from the healthcheck script, even if there is only a single address-family-matching BGP session?
That makes it impossible to share the healthcheck config file between installations, because the neighbor
address (which is "of course" site-specific) creeps over from the ExaBGP config file itself into
the healthcheck configuration file, which is unfortunate.

  • I see from the --help output of healthcheck that one can specify
  --neighbor NEIGHBOR   advertise the route to the selected neigbors

However, how NEIGHBOR goes from singular to plural (neighbors) is not explained.
What is the syntax for supplying multiple NEIGHBORs?

  • And ... possibly the "neighbor" parameter can also be specified in the healthcheck configuration file?
    But ... one probably does not want an IPv4 route to be announced over an IPv6 BGP session, and I'm
    assuming there is no logic in healthcheck to match address families for route and neighborships?
    So this pushes me in the direction of having two separate healthcheck processes, one for IPv4 and
    one for IPv6? That ups the config complexity and the cost for the health checking, which is again unfortunate.

  • Because I think the healthcheck script does not consume input
    (true? I've not been able to find a definitive answer to that...),
    I think I need to run with

[exabgp.api]
ack = false

That has the unfortunate side-effect that exabgpcli gets partially stuck
after executing a command, as has been noted already in #1164.

So, sorry, more of a rant to vent my frustration of the process, which is so far not finished.

I've already read https://github.com/Exa-Networks/exabgp/wiki/Migration-from-3.4-to-4.0

I also tried (and failed) to make sense of
https://github.com/Exa-Networks/exabgp/wiki/Configuration-:-Process

because this is, IMHO, to be charitable "light on explanation of actual semantics / behaviour modifications". I think I figured out that ExaBGP in 4.x needs "encoder text" added to the process section if using the "healthcheck" module, but that is by now more than a guess than anything else.

Answers or hints would be greatly appreciated.

@thomas-mangin
Copy link
Member

thomas-mangin commented Sep 18, 2023

You have quite eloquently expressed how bad I am at being responsible for the maintenance of this code and I will gladly accept some help: I gave you write access to the repository years ago.

neighbour * should only require a few lines of changes if it does not already work.

Now if you wish to know why these decisions were made, even bone-headed ones, happy to explain.

Many came from trying to be nice to users, a mistake I do not do as much anymore, and trying to not add too much complexity to a code base which was authored in 2009, before async, and is way from ideal and would require weeks of work to bring to modern python standards and even more to the level of quality of rustyBGP.

Some people find useful and I am doing my best to keep it working.

I worked on many prototypes over the years in one in Go - with some brain dead ideas, one V - which has decoding and the start of a Yang based cli, and even within this repo for my first attempt to Yang support.
But I would rather rewrite the whole code base in Odin.. or perhaps wait to see Jai released and if I do.

I am also not as young as I used to be. I have a business to run which 50 people rely on for their living. I also need time to relax to remain healthy and time for my family so it is a case of https://xkcd.com/2347/

still open from you:
#1121

and some other motivational issue:
#1038

@he32
Copy link
Contributor Author

he32 commented Sep 18, 2023

Well, now, I realize that was deserved, and I'm sorry if my rant rubbed you the wrong way.

neighbor * would be useful, I think, and would resolve at least three of the points I mentioned.

Getting write access to e.g. keep the exabgp.conf man page updated is one thing, decoding the actual supported grammar from the code is quite another, and me being a python newbie doesn't exactly help in that endavour. I'll see what I can come up with on that front.

@thomas-mangin
Copy link
Member

thomas-mangin commented Sep 18, 2023

I have removed some of the "marketing" blurb on the README. I will look at neighbor *

@thomas-mangin
Copy link
Member

thomas-mangin commented Sep 18, 2023

Support for neighbor *

@thomas-mangin
Copy link
Member

I appreciate and appreciated your help over the years. I currently have code in 4.2.22 which I believe is not correct for all users (following the acceptance of a patch from a user) and can not release 4.2.22 without a rollback of his contribution or figuring out why my own testing led me to believe there is an issue but the main branch should be fine.

@thomas-mangin
Copy link
Member

I am surprised to hear that normal announce did not work for you, I have a test to make sure it did:

❯ ./qa/bin/functional encoding 2     
 ✖ ✖ ✓ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖

❯ ./qa/bin/functional encoding --list

The available tests are:

 0  api-add-remove            L  api-no-respawn            g  conf-ipv46routes4family  
 1  api-announce-star         M  api-notification          h  conf-ipv46routes6family  
 2  api-announce              N  api-open                  i  conf-ipv6grouping        
 3  api-announcement          O  api-reload                j  conf-l2vpn               
 4  api-api                   P  api-rib                   k  conf-largecommunity      
 5  api-attributes-path       Q  api-rr-rib                l  conf-name                
 6  api-attributes-vpn        R  api-teardown              m  conf-new-v4              
 7  api-attributes            S  api-vpls                  n  conf-new-v6              
 8  api-broken-flow           T  api-vpnv4                 o  conf-no-asn4             
 9  api-check                 U  conf-addpath              p  conf-parity              
 A  api-eor                   V  conf-aggregator           q  conf-path-information    
 B  api-fast                  W  conf-attributes           r  conf-prefix-sid          
 C  api-flow                  X  conf-ebgp                 s  conf-split               
 D  api-ipv4                  Y  conf-extended-attributes  t  conf-srv6-mup            
 E  api-ipv6                  Z  conf-flow-redirect        u  conf-template            
 F  api-manual-eor            a  conf-flow                 v  conf-unknowncap          
 G  api-multi-neighbor        b  conf-generic-attribute    w  conf-vpn                 
 H  api-multiple-api          c  conf-group-limit          x  conf-watchdog            
 J  api-nexthop-self          e  conf-ipself4             

@thomas-mangin
Copy link
Member

thomas-mangin commented Sep 18, 2023

ah! you have two sessions, it may be why ...

@he32
Copy link
Contributor Author

he32 commented Sep 18, 2023 via email

@he32
Copy link
Contributor Author

he32 commented Sep 19, 2023

Hm, the first obstacle I butted into was that healthcheck.py didn't want to accept neighbor *. I changed the argument type from ip_address to str to work around that. The second obstacle is that the neighbor spec is still not matching; exabgp (after an initial tweak to the debug output) says "no matching neighbors".

I'm wondering if limit.py should instead do:

def match_neighbor(description, name):
    for string in description:
        if string.strip() == 'neighbor *':

However, with that I still get no matching neighbors (with tweaked logging):

command from process recursor : neighbor * announce route 3.4.5.6/32 next-hop self med 100 
async | recursor | neighbor * announce route 3.4.5.6/32 next-hop self med 100
no neighbor matching the description(s) [['neighbor *']] for command : announce route 3.4.5.6/32 next-hop self med 100

So I expanded logging in announce.py to say

            descriptions, command = extract_neighbors(line)
            peers = match_neighbors(reactor.peers(service), descriptions)
            if not peers:
                self.log_failure('no neighbor matching the description(s) %s (neighbors %s, service %s) for command : %s' % (repr(descriptions), repr(reactor.peers(service)), repr(service), command))

and with that I get logged

command from process recursor : neighbor * announce route 3.4.5.6/32 next-hop self med 100 
async | recursor | neighbor * announce route 3.4.5.6/32 next-hop self med 100
no neighbor matching the description(s) [['neighbor *']] (neighbors [], service 'recursor') for command : announce route 3.4.5.6/32 next-hop self med 100

So... reactor.peers(service) doesn't appear to actually present the list of configured neighbors.
Any idea what should be used instead? Or is this just a mis-configuration on my part?
The service is the name of the process from exabgp.conf which injected this command. What do I use to couple the configured neighbors with the configured process(es)? The output from exabgp started with --validate does not provide any hints in that direction that I can see. Description? Sounds unlikely.

This, unfortunately, comes back to the documentation of exabgp.conf, or points to the massive confusion on my part.

@longmalx
Copy link

I appreciate and appreciated your help over the years. I currently have code in 4.2.22 which I believe is not correct for all users (following the acceptance of a patch from a user) and can not release 4.2.22 without a rollback of his contribution or figuring out why my own testing led me to believe there is an issue but the main branch should be fine.

Does this refer to this change (#1128) ?

We have been running this for quite a while without issue. If you can add a ticket with the issues you see I can try and find time to help have a look to try and root cause what's happening.

@he32
Copy link
Contributor Author

he32 commented Sep 19, 2023

Just for reference, here's a redacted exabgp.conf that I'm currently trying to make work with exabgp 4.2.21:

neighbor 158.38.x.y {
    peer-as 224;
    local-as 65053;
    router-id 158.38.x.z;
    local-address 158.38.x.z;
    family {
        ipv4 unicast;
    }
}
neighbor 2001:700:0::a {
    peer-as 224;
    local-as 65053;
    router-id 158.38.x.z;
    local-address 2001:700:0::b;
    family {
        ipv6 unicast;
    }
}
process recursor {
        run /usr/pkg/bin/python3.10 -m exabgp healthcheck --config /etc/exabgp/resolver-hc.conf;
        encoder text;
}

and my slightly redacted resolver-hc.conf looks like this:

# debug
name = resolver
interval = 5
fast-interval = 1
ip = 158.38.c.d
ip = 2001:700::a
neighbor = *
no-ip-setup
withdraw-on-down
command = /usr/local/bin/resolver-health-check.pl -c /etc/exabgp/resolver-checks.txt -r 127.0.0.1
pid = /var/run/exabgp/resolver-hc.pid
silent

@he32
Copy link
Contributor Author

he32 commented Sep 19, 2023

More...

I've found lib/exabgp/reactor/loop.py which defines peers(self, service=''), and it's

    def peers(self, service=''):
        matching = []
        for peer_name, peer in self._peers.items():
            if service == '':
                matching.append(peer_name)
                continue
            if service.startswith('api-internal-cli'):
                matching.append(peer_name)
                continue
            if service in peer.neighbor.api['processes']:
                matching.append(peer_name)
                continue
        return matching

However, I have not been able to find out how anything ends up in peer.neighbor.api['processes'], much less how that is configured. I did find

        neighbor.api = ParseAPI.flatten(local.pop('api', {}))

but that is way too much magic in one sentence for me to decipher and translate into how that's configured.

I'm reasonably good with find ... grep, but appear to lack the requisite python knowledge.
"Help!"

@thomas-mangin
Copy link
Member

thomas-mangin commented Sep 21, 2023

@he32 Sorry, I deleted the patch I posted here, added two commits on the same evening on the master and 4.2 branch with neighbor * and added some testing. All seemed to be working. I did not test the healthcheck program itself tho - but I modified it to use neighbor *

And for some reason I did not get emails from GH for your posts.

@thomas-mangin
Copy link
Member

thomas-mangin commented Sep 21, 2023

@longmalx oh, thank you for following issues!

Yes, I performed a test (I can not recall what) which showed that in some scenario something was off with what ended up in (or out of) the RIB - as I was in between things I did not document what caused it and was later not able to remember what it was 🤦

It may be a bug in the RIB code itself which is really ugly, right now I do not know. I am pleased to hear that you have no issue.

I wrote when merging "I have merged the patch but now wonder if there will be an issue with API routes? I need to check." .. it may have been what I checked ...

@he32
Copy link
Contributor Author

he32 commented Sep 22, 2023

I did not test the healthcheck program itself tho - but I modified it to use neighbor *

Hmm... In my case the output from the healthcheck program looks reasonably sane:

neighbor * announce route 158.38.0.x/32 next-hop self med 100
neighbor * announce route 2001:700:0::a/128 next-hop self med 101
neighbor * announce route 158.38.0.x/32 next-hop self med 100
neighbor * announce route 2001:700:0::a/128 next-hop self med 101

So the logging I extended is inside exabgp itself (in announce_route() inside
lib/exabgp/reactor/api/command/announce.py, and it's apparently a problem with
knowing which neighbors to match the "neighbor *" against. The code uses the process
name as "service" in reactor.peers(service), and that ends up as an empty list in my case,
so "of course" exabgp complains that it can't find any matching neighbors.

And ... this goes back to the configuration which I posted above -- what (if anything) is missing to make
the reactor.peers(service) where service == "recursor" return the configured
neighbors? I.e. how do you configure so that the process is "related to" the configured
neighbors? (If that is what's needed, that is.)

And for some reason I did not get emails from GH for your posts.

That part I cannot explain, sorry.

@he32
Copy link
Contributor Author

he32 commented Sep 22, 2023

And ... this goes back to the configuration which I posted above -- what (if anything) is missing to make the reactor.peers(service) where service == "recursor" return the configured neighbors? I.e. how do you configure so that the process is "related to" the configured neighbors? (If that is what's needed, that is.)

Never mind, found what's needed to tie a neighbor to a process:

    api {
        processes [ recursor ];
    }

inside the neighbor specification.

The issue I then get stuck on is that when processing the two lines (which occur about the same time):

neighbor * announce route 158.38.0.x/32 next-hop self med 100
neighbor * announce route 2001:700:0::a/128 next-hop self med 101

The IPv4 route gets announced to the IPv4 neighbor, but the IPv6 route does
not get announced to the IPv6 neighbor, because it first tries the IPv4 route to
the IPv6 neighbor, and that won't work, and exabgp logs

the route family is not configured on neighbor

So, it stops doing anything on the first occurrance of that error for the neighbor?

Hm, that pushes me in the direction of two processes, one for IPv4, one for IPv6,
which doubles the cost of the health-checking.

@thomas-mangin
Copy link
Member

Thank you for this investigation, please give me a few days to look into fixing this.

@thomas-mangin
Copy link
Member

thomas-mangin commented Sep 25, 2023

I looked at the code and this string, while reported as an error, does not stop any processing. The route with the wrong family for the peer will be ignored. So if the neighbours have explicit ipv4 only, and ipv6 only, it will be verbose but should work as expected.

@he32
Copy link
Contributor Author

he32 commented Sep 28, 2023

I looked at the code and this string, while reported as an error, does not stop any processing. The route with the wrong family for the peer will be ignored. So if the neighbours have explicit ipv4 only, and ipv6 only, it will be verbose but should work as expected.

That did not match with what I observed at the time on the neighboring BGP speaker (a Juniper router). However, I will re-test to confirm, and look a bit closer at the issue. To be continued.

thomas-mangin added a commit that referenced this issue Oct 31, 2023
The behaviour was to return an issue if any of the peers could
not take the RIB. With the new `neighbor *` feature bulk sending
routes should now be accepted.

If any peer can accept the route as the family was negotiated,
then the announcement was a success, only a failure if no peers
could take it.
@thomas-mangin
Copy link
Member

@he32 I have changed the behaviour on error returning as it made sense, but the patch above may not resolve your issue. I am still looking into it.

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Feb 19, 2024
Allow "neighbor *" in route announce command, to match
"all configured neighbors".
Ref. Exa-Networks/exabgp#1179
Adapt the healthcheck module to allow this argument.

Bump PKGREVISION.
@thomas-mangin
Copy link
Member

But the lack of documentation for the migration from 3.4 to 4.0 (and the need for a 5.0 release soon and the similar associated pain). Is there anything left here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants