Combine batches of successive roles for same nodes (#1345)
base: master
Conversation
batches << [roles, nodes_in_batch] unless nodes_in_batch.empty?
if pnodes_in_batch == nodes_in_batch && !proles.nil?
  # proles is same as batches[-1][0]
  proles << roles
Layout/IndentationWidth: Use 2 (not 4) spaces for indentation. (https://github.com/bbatsov/ruby-style-guide#spaces-indentation)
058b6e4 to ae3a82f (compare)
In principle I think this sounds like a great idea, so thanks a lot for working on this!
As you can see from my comments, there are some serious code legibility issues here. The vast majority of these issues existed with #apply_role for many years before you joined SUSE ;-) However it's super-important that any changes to it should remove technical debt, not add new technical debt, otherwise we just continue the trend of this method getting worse and worse over the years.
In particular, #apply_role has quite clearly delimited blocks of logic, so it lends itself very well to extraction of a few new methods (one commit at a time!). This would make it far easier to see the inputs, outputs, and actions of each chunk, and thus the overall flow of logic. If you could do this before enhancing it then you'd probably get a +1000 from me :-)
@@ -1071,6 +1071,8 @@ def apply_role(role, inst, in_queue, bootstrap = false)

# element_order is an Array where each item represents a batch of roles and
# the batches must be applied sequentially in this order.
pnodes_in_batch = nil
Please use easily understandable variable names. It is not clear from the names what these two variables are for. I'm guessing p means "previously added to batches" rather than "previously seen in element_order", is that right? IIUC the distinction is subtle but crucial to how you've written this code. If so, call them something like nodes_in_prev_batch and roles_for_prev_batch.
Also, by inserting these assignments here, you are divorcing the comment explaining element_order from its usage, so please don't do that.
@@ -1169,7 +1171,14 @@ def apply_role(role, inst, in_queue, bootstrap = false)
    end
  end # roles.each

  batches << [roles, nodes_in_batch] unless nodes_in_batch.empty?
  if pnodes_in_batch == nodes_in_batch && !proles.nil?
The #apply_role method is already WAY too long, and hopelessly unreadable as a result, so I don't want even 8 new lines to be added. Please can you find a way to make this change which makes the method smaller not larger, by extracting out 1 or more blocks into new methods, one commit at a time?
Also, what if nodes_in_batch is a subset of pnodes_in_batch? Couldn't that also be a situation where the batches could be merged?
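The subset case could be sketched like this (a hypothetical mergeable_into_prev? helper, not code from this PR), assuming node lists are plain arrays of name strings:

```ruby
require "set"

# Hypothetical helper illustrating the subset idea: a batch could be folded
# into the previous one not only when the node sets are identical, but also
# when every node in the current batch already appeared in the previous one.
def mergeable_into_prev?(nodes_in_prev_batch, nodes_in_batch)
  return false if nodes_in_prev_batch.nil?
  Set.new(nodes_in_batch).subset?(Set.new(nodes_in_prev_batch))
end
```

Whether such a subset merge is actually safe still depends on run-list ordering, which is the concern raised further down the thread.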
element_order = [
[b1]
[c1,c2]
]
batch 1: node1=[b1] node2=[b1]
batch 2: node1=[c1]
I think this could break if c1 needs b1 to be entirely functional.
this could break, if c1 needs b1 to be entirely functional
Not necessarily, but that demonstrates my point 2) below - it depends on whether all the recipes for b1 were ordered to happen before the recipes for c1. For example:
element_order = [
[b1]
[c1,c2]
]
batch 1: node1=[b1] node2=[b1]
batch 2: node1=[c1] node2=[c1]
This PR would change that to:
batch 1: node1=[b1,c1] node2=[b1,c1]
If some of c1's recipes come before b1's in the runlist then there's a problem. Normally runlist ordering is determined for all roles within a barclamp by chef_order so this should not be an issue (assuming barclamps are deployed in an order matching chef_order?), but this can be overridden per role by element_run_list_order. The only place we actually use that is with pacemaker remotes, but even there I'm not sure we actually use it - maybe the code only has remnants of a previous usage which we reverted when @vuntz came up with a different approach? I can't remember but @vuntz will know for sure.
If some of c1's recipes come before b1's in the runlist then there's a problem.
Yes, this is correct.
I did look at element_run_list_order in nova, ceilometer, swift and other barclamps. I didn't find the issue you mentioned, as each one of them had explicit priorities.
Yeah, I think I must have done the grep wrong last time because I see them now.
batches << [roles, nodes_in_batch] unless nodes_in_batch.empty?
if pnodes_in_batch == nodes_in_batch && !proles.nil?
  # proles is same as batches[-1][0]
  proles << roles
I'm really struggling to understand this because #apply_role is such horrible spaghetti code and I don't look at it often enough to remember all the details (I guess @vuntz is the only one who does - which is maybe why he fails to see the need for it to be refactored, but that's another discussion I'll continue with him separately ;-) ...
But IIUC it seems like this would result in say, ceilometer-agent being appended to proles so that it runs in the same batch as ceilometer-central, and moving the line which appends to batches into the else clause means that no new batch is added during this iteration of the element_order.each loop. Is that right? If so, I think it deserves more comments explaining what's going on, because it just took me about 20 minutes to understand these few lines.
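If that reading is right, the loop being described might look roughly like the following simplified sketch (using the suggested nodes_in_prev_batch/roles_for_prev_batch names, not the PR's actual variables):

```ruby
# Simplified sketch of the merging loop as understood above. Each element of
# element_order is treated here as [roles, nodes]; when the current batch
# targets exactly the same nodes as the previous one, its roles are appended
# to the previous batch instead of opening a new batch.
element_order = [
  [["ceilometer-central"], ["node1"]],
  [["ceilometer-agent"], ["node1"]],
  [["swift-storage"], ["node2"]]
]

batches = []
roles_for_prev_batch = nil
nodes_in_prev_batch = nil

element_order.each do |roles, nodes_in_batch|
  next if nodes_in_batch.empty?
  if nodes_in_batch == nodes_in_prev_batch && !roles_for_prev_batch.nil?
    # roles_for_prev_batch is the same object as batches[-1][0], so this
    # extends the previous batch in place and no new batch is added
    roles_for_prev_batch.concat(roles)
  else
    roles_for_prev_batch = roles.dup
    batches << [roles_for_prev_batch, nodes_in_batch]
  end
  nodes_in_prev_batch = nodes_in_batch
end
```

With this input the first two entries collapse into a single batch for node1, so two chef-client runs are scheduled instead of three.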
Exactly.
Will add comments.
  # proles is same as batches[-1][0]
  proles << roles
else
  proles = roles
OK, so these two lines are only in the else block because that allows merging of even more than two batches into one?
yes
Two more thoughts:
And another:
Sorry to barge in here but:
Totally agree here after having to deal with apply_role, and extracting is exactly what I was thinking of doing shortly. Plus the test coverage for it is minimal. @sjamgade I understand this falls outside of the scope of the card, but let me know if you plan on doing anything to it.
@aspiers gating passed.
I'm mixed here: this is probably fine at the moment, but this would conflict with a potential change where we restrict what chef is doing on apply (which is something we were slowly heading towards). So if anything, we should have a big comment explaining the shortcut and that this is not compatible with the minimal chef-client approach on apply.
Yeah, so I now think this is probably OK (modulo the nitpick), but I agree with @vuntz that it could conflict with the future direction. However if we ever get to that, we can make sure to properly refactor #apply_role (thanks @Itxaka, great to hear you agree with me on this!) and then it should be relatively easy to adapt the logic accordingly.
@@ -1069,6 +1069,8 @@ def apply_role(role, inst, in_queue, bootstrap = false)
proposal = Proposal.where(barclamp: @bc_name, name: inst).first
save_proposal = false

nodes_previousbatch = nil
Nitpick: please be consistent in your variable naming style. "previous batch" is two words, not one. Since the existing style is to put _ between words, put it between all words ("nodes_previous_batch"). Also, since maximising legibility is an important goal, I prefer names which are as close as possible to natural English language, which is why I suggested nodes_in_prev_batch and roles_for_prev_batch. Admittedly that's more of a personal preference, but still worth considering.
YES that's exactly what I was thinking of!
It would make me sooo happy to see this refactoring done :-)
end
Layout/TrailingWhitespace: Trailing whitespace detected. (https://github.com/bbatsov/ruby-style-guide#no-trailing-whitespace)
# role (which would be an improvement over the requirement to
# explicitly list all nodes).
new_nodes.reject! do |node_name|
  not pre_cached_nodes[node_name].nil? && Rails.logger.debug("skipping deleted node #{node_name}")
Style/Not: Use ! instead of not. (https://github.com/bbatsov/ruby-style-guide#bang-not-not)
Metrics/LineLength: Line is too long. [106/100] (https://github.com/SUSE/style-guides/blob/master/Ruby.md#metricslinelength)
# have some alias that be used to assign all existing nodes to a
# role (which would be an improvement over the requirement to
# explicitly list all nodes).
new_nodes.reject! do |node_name|
Style/InverseMethods: Use select! instead of inverting reject!.
# Don't add deleted nodes to the run order, they clearly won't have
# the old role
old_nodes.reject! do |node_name|
  not pre_cached_nodes[node_name].nil? && Rails.logger.debug("skipping deleted node #{node_name}")
Style/Not: Use ! instead of not. (https://github.com/bbatsov/ruby-style-guide#bang-not-not)
Metrics/LineLength: Line is too long. [106/100] (https://github.com/SUSE/style-guides/blob/master/Ruby.md#metricslinelength)
end
# Don't add deleted nodes to the run order, they clearly won't have
# the old role
old_nodes.reject! do |node_name|
Style/InverseMethods: Use select! instead of inverting reject!.
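A RuboCop-clean version of the quoted lines might look like the following sketch, with a stand-in logger so it runs standalone; keeping the logging out of the boolean expression also avoids the low-precedence pitfall of "not":

```ruby
require "logger"

# Stand-in for Rails.logger so this sketch is self-contained.
LOGGER = Logger.new($stdout)

pre_cached_nodes = { "node1" => { name: "node1" }, "node2" => nil }
old_nodes = ["node1", "node2"]

# Reject nodes whose cache entry is nil (i.e. deleted nodes), logging each
# skip as a side effect rather than inside the boolean expression.
old_nodes.reject! do |node_name|
  deleted = pre_cached_nodes[node_name].nil?
  LOGGER.debug("skipping deleted node #{node_name}") if deleted
  deleted
end
```

After this, old_nodes contains only node1; the same shape works for new_nodes.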
  end
  merged_batches << current_batch
end
return merged_batches
Style/RedundantReturn: Redundant return detected. (https://github.com/bbatsov/ruby-style-guide#no-explicit-return)
@@ -919,6 +919,35 @@ def self.proposal_to_role(proposal, bc_name)
  RoleObject.new role
end

# we can speed-up the application of (n+1)th role if both(n,n+1)
# roles are being applied on the same node.
#
Layout/TrailingWhitespace: Trailing whitespace detected. (https://github.com/bbatsov/ruby-style-guide#no-trailing-whitespace)
@@ -919,6 +919,35 @@ def self.proposal_to_role(proposal, bc_name)
  RoleObject.new role
end

# we can speed-up the application of (n+1)th role if both(n,n+1)
# roles are being applied on the same node.
Layout/TrailingWhitespace: Trailing whitespace detected. (https://github.com/bbatsov/ruby-style-guide#no-trailing-whitespace)
# a batch is [roles, nodes]
def mergebatches(batches)
  merged_batches = []
  unless bathces.empty?
batches typo. If you are pushing code before testing it, please clearly label and mark the PR as WIP, otherwise there is a risk broken code will get merged.
# merging batches together
#
# a batch is [roles, nodes]
def mergebatches(batches)
Yes, much better to have this as a separate method!
But as mentioned before, please stay consistent with the existing style, so this should be merge_batches.
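As a rough illustration, an extracted merge_batches might look like this sketch, under the assumption stated in the comment that a batch is [roles, nodes] (this is not the PR's exact implementation):

```ruby
# Sketch of a standalone merge method: adjacent batches that target exactly
# the same node list are collapsed into one batch whose role lists are
# concatenated in order.
def merge_batches(batches)
  merged_batches = []
  batches.each do |roles, nodes|
    if !merged_batches.empty? && merged_batches[-1][1] == nodes
      # same nodes as the previous batch: extend it instead of adding one
      merged_batches[-1][0].concat(roles)
    else
      merged_batches << [roles.dup, nodes]
    end
  end
  merged_batches
end

merged = merge_batches([[["b1"], ["node1", "node2"]],
                        [["c1"], ["node1", "node2"]],
                        [["c2"], ["node3"]]])
```

Here the b1 and c1 batches share the same nodes, so they collapse into one batch while the c2 batch stays separate.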
#
# eg. In our 2node deployment ceilometer{server,central} are always
# applied on the same node, given that they have different priorities,
# they are to be applied, one after the other.
It would be helpful to explicitly mention element_run_list_order here and explain its effect on the run-list, otherwise the reader will be wondering why/how they end up with different priorities.
@@ -919,6 +919,35 @@ def self.proposal_to_role(proposal, bc_name)
  RoleObject.new role
end

# we can speed-up the application of (n+1)th role if both(n,n+1)
Thanks for this comment block - definitely important to have it since this batch merging is not immediately obvious without explanation!
# is run rather than speeding up execution of any single run, by
# merging batches together
#
# a batch is [roles, nodes]
Thanks - I find comments like this really helpful for understanding the data structures being manipulated.
We can speed up the application of the (n+1)th role if both (n, n+1) roles are being applied on the same node. This speeds up deployment of ceilometer by at least 1m20s (measured: 90 sec) and swift by ~20s. E.g. in our 2-node deployment, ceilometer-{server,central} are always applied on the same node; given that they have different priorities, they are to be applied one after the other. This does not violate any order constraints, as the application procedure of the (n+1)th role is transparent to the nth role.
Please do review the latest changes.
# eg. In our 2node deployment ceilometer{server,central} are always
# applied on the same node, given that they have different priorities,
# they are to be applied, one after the other, these priorities come
# from element_run_list_order
Two nitpicks here:
- The priorities come from chef_order by default but can be overridden per role by element_run_list_order
- ceilometer-{server,central} is just one example, but this code will merge any adjacent batches regardless of what run_list priority the roles are in. Is it possible to make the code automatically check that the run_list priority of batch n is lower than the priority of batch n+1, and only merge if that is true? If that was added then I think this would feel safe enough to me for a +1 vote.
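The automatic check being asked for might be sketched like this; representing run-list priorities as a plain hash from role name to priority is an assumption for illustration, since the real values would come from chef_order or element_run_list_order:

```ruby
# Hypothetical guard: only merge two adjacent batches when every role in the
# earlier batch has a priority no higher than every role in the later batch,
# so the merged run-list preserves the original ordering.
def safe_to_merge?(prev_roles, roles, priorities)
  prev_max = prev_roles.map { |role| priorities.fetch(role) }.max
  next_min = roles.map { |role| priorities.fetch(role) }.min
  prev_max <= next_min
end

priorities = { "b1" => 10, "c1" => 20 }
safe_to_merge?(["b1"], ["c1"], priorities)  # earlier batch has lower priority
```

A merge loop could call such a guard in addition to comparing node lists, and fall back to keeping the batches separate when the priority order would be violated.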
we can speed up the application of the (n+1)th role if both (n, n+1)
roles are being applied on the same node. This speeds up deployment of
ceilometer by at least 1m20s (measured: 90 sec) and swift by ~20s.
e.g. in our 2-node deployment, ceilometer-{server,central} are always
applied on the same node; given that they have different priorities, they are
to be applied one after the other.
This does not violate any order constraints, as the application procedure
of the (n+1)th role is transparent to the nth role
Why is this change necessary?
Speed up the chef-client execution