Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rsync option to package command and improve performance #303

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -44,15 +44,15 @@ script:
- echo "Working copy disk usage"; du -chs .
- grunt --version
- grunt help
- grunt --quiet --timer
- grunt --timer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove quiet? Seems like an unrelated change. We might make the test less verbose by setting the environment variable GDT_QUIET.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the --quiet because it was super hard to debug the failing travis tests. Travis already spits out a ton of stuff and seeing what the grunt tasks actually do and if they get errors is important I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was --quiet suppressing errors too? In our case it's used to suppress desktop notifications which are irrelevant from inside travis.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The package command had a bug that was not completing all of its tasks, so without seeing the task output it was difficult to debug. Also hard to see the actual tasks being run through the cloud of all the other output generated in the travis log. I don't consider those desktop notifications irrelevant for travis as they aided me in debugging. And it doesn't hurt to have them, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the --quiet option and GDT_QUIET env var are supposed to only affect desktop notifications (i.e. growl). If the option affects the error reporting from Grunt, we should change the option to avoid conflicts.

- echo "Fully built disk usage"; du -chs .
- grunt install --db-url=sqlite:/`pwd`/.ht.sqlite --quiet
- grunt install --db-url=sqlite:/`pwd`/.ht.sqlite
- grunt serve:test >/dev/null &
- until curl -I -XGET -s $GDT_TEST_URL 2>/dev/null | egrep -q '^HTTP.*200'; do sleep 0.5; done
- vendor/bin/drush --root=build/html en -y simpletest
- grunt test
- sleep 1; while (ps aux | grep '[b]ehat' > /dev/null); do sleep 1; done
- for pid in `ps aux | grep drush | grep runserver | awk '{print $2}'`; do echo "Stopping drush pid $pid"; kill -SIGINT $pid; done;
- grunt package --quiet
- grunt package
- mocha --timeout 30000 node_modules/grunt-drupal-tasks/test/build.js
- mocha node_modules/grunt-drupal-tasks/test/library.js
30 changes: 30 additions & 0 deletions docs/60_PACKAGE.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,35 @@ directory. Defaults to the docroot.
packages directory. This can be overridden by calling grunt package with the
`--name` parameter.

**archive**: Used to enable the `compress` option via the configuration instead
of the command line. Defaults to false.

**rsync**: If set to true, rsync will be used instead of file copying for
improved performance. This automatically enables the `compress` option and will
write the archive to the package destination.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why enable compress by default? The main use case I've had for this requires not doing it, so curious what the thought is. Don't recall if that can be turned off, we should note that in docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a default. You can still set "archive" to false if you don't want it. Just trying to support the best defaults for new projects. I want "rsync=true" to represent the default behavior for "high performance" that enables archive, vm_data, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppose a project leaves it creating an archive, and turns on rsync. What is the use case we're aiming for? Operational code backups? I'm open to calling this a best default, I'm just trying to understand what we're talking about empowering.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By creating a single archive and using rsync you get the highest performance. There is only a single file (the archive) the needs to be moved out of the VM disk and into the NFS mounted system which is hugely faster than moving the non-archived directory.

What they do with the archive is project specific. I expect we'll eventually add a "deploy" command that does all of the git commit/push work from inside grunt, but for now the archived package can be moved to a hosting site, or uncompressed locally and committed to a production repo.

Basically, we are just trying to minimize the total data flow to NFS along with the individual number of files (since each file has NFS overhead, so lots of small files are really bad performance).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, writing the tarball being less I/O, offset by the time to create the tarball. Have you noted the time to create the tarball is less than syncing the files individually? I expect so.


**vmData**: This path is used as the intermediate generation directory for the
package when using the `rsync` option. By selecting a volume mounted in your
VM `/data` area, this can significantly improve performance. The archive is
still moved to the normal package destination when complete. Defaults to
`build/vm_data`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GDT is not targeted at virtual/containerized environments. This piece needs a bit more contextual documentation and to remove /data as an explicit concept.

Seems strange to place it in build/vm_data, why not build/package/tmp or even get a formal /tmp directory generated via os.tmpdir()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the super important part of this PR. Without the virtual intermediate directory there was no way to get anything performant and people are bypassing gdt on their projects because of this. I know GDT is not "targetting" at containers, which is why this option isn't enabled by default.

The idea here is that in your docker container you mount a volume that gdt can use for various purposes to improve performance. Only tackling the packaging right now, but I can imagine other commands using this also. So don't want it to be within "build/package". The /data is just an example that pertains to our own container, but that isn't anywhere in the GDT except the docs as an example. Since our docs already reference many P2 practices, I don't see this as a problem right now (see the CI docs for example).

Still, specific updates to this doc file are welcome if you want to wordsmith it a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure all config options and documentation are not overly bound to containers or virtual environments specifically, even if the changes we have facilitate that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this less specific to Docker and the /data path set up by devtools, could this be recast as a temporary directory path (i.e. tempDir rather than vmData)? We can note in the documentation that the purpose for this path is to allow using a path that performs quickly if the project directory lives in a slow performing location (i.e. NFS).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I think calling it a tempDir option would be fine. I'll make that change to the code and docs. Thanks for the suggestion.


## Performance

When running `grunt package` within a docker container, the default file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we focus more on NFS mounts and less on Docker? I'm fine with Docker in the docs, but it should be almost more like a sidebar detail. It's great that we're looking into accelerating this for use with NFS mounts, it would be great if we could identify similar tricks for Drush Make and Composer (build in an intermediate directory, transfer to the volume mounted space where speedup can better affect the complete codebase.

The major issue that originally drove me to look into rsync and other alternatives to copy was more about size of projects - a built up D8 project or a vanilla Atrium project will break Node under default Node.js settings rather than complete the operation. I'd like to note this -- it's going to be the more common problem than NFS outside Phase2 usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I covered this above, but the whole point of this PR was performance. Switching from copy to rsync gives a small performance improvement. The majority of the improvement comes from using the vm_data.

Yes, it's really about NFS and not docker, so the docs can be wordsmithed a bit.

And yes, this vm_data is definitely something we want to incorporate into other areas of gdt. The package command was just the big target because it was taking 40 minutes on a project.

operations will be using NFS to update the files on your local disk, which will
be very slow. Using the `rsync` option along with creating the `build/vm_data`
intermediate mount point in your docker container can improve performance by
a factor of 20 or more. However, the downside is only the compressed archive of
your package will be written to your local disk. The full package folder will
only be accessible within your docker container.

A sample docker-composer.yml entry to mount the vm_data looks like this:
```
volumes:
- /data/PROJECTNAME/package:/var/www/build/vm_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've picked up this branch and am trying to pull it along. As part of this I'm reviewing and revising the docs to clarify my game plan. I'm confused why we would volume mount the intermediary directory. Wouldn't that slow down the rsync operation compared to leaving the tempDir somewhere in the container but not in a volume mount where NFS slowdowns can catch it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The /data volume is just in the VM, not on the local Mac NFS system. This simply makes the stuff in /var/www/build/vm_data persistent, similar to what we do with the MySQL database. Volumes are only slow if they map the local Mac folders

```

## Packaging for Acquia

The `package` command has the flexibility to support many different use cases,
Expand All @@ -65,6 +94,7 @@ Acquia repository with support for custom hooks and scripts.
"packages": {
"srcFiles": ["!sites/*/files/**", "!xmlrpc.php", "!modules/php/*"],
"projFiles": ["README*", "bin/**", "hooks/**"],
"rsync": true,
"dest": {
"docroot": "docroot"
}
Expand Down
115 changes: 98 additions & 17 deletions tasks/package.js
Original file line number Diff line number Diff line change
Expand Up @@ -61,27 +61,99 @@ module.exports = function(grunt) {

grunt.registerTask('package', 'Package the operational codebase for deployment. Use package:compress to create an archive.', function() {
grunt.loadNpmTasks('grunt-contrib-copy');
grunt.loadNpmTasks('grunt-shell');

var config = grunt.config.get('config.packages');
var srcFiles = ['**', '!**/.gitkeep'].concat((config && config.srcFiles && config.srcFiles.length) ? config.srcFiles : '**');
var srcFiles = (config && config.srcFiles && config.srcFiles.length) ? config.srcFiles : [];
var projFiles = (config && config.projFiles && config.projFiles.length) ? config.projFiles : [];

// Look for a package target spec, build destination path.
var packageName = grunt.option('name') || config.name || 'package';
var destPath = grunt.config.get('config.buildPaths.packages') + '/' + packageName;

// Determine if we are using rsync and vmPath and tarball for performance.
var useRsync = grunt.config('config.packages.rsync') || false;
// Determine if we are creating a tarball archive.
var archive = grunt.config('config.packages.archive') || useRsync;

// When using rsync, generate to a path that can be mounted in the VM.
var vmPath = grunt.config.get('config.packages.vmData') || 'build/vm_data';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per earlier comments, I'd like a more contextual, less tech-specific name for this -- intermediary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specific suggestions are welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per other thread, we're calling it tempDir. Just closing the loop here :) Certainly shorter than intermediary.

var destPath = vmPath + '/' + packageName;
var finalPath = grunt.config.get('config.buildPaths.packages') + '/' + packageName;
if (!useRsync) {
// If not using rsync, then directly generate to final path.
destPath = finalPath;
}

var tasks = [];
var cleanPaths = [destPath];
if (destPath !== finalPath) {
cleanPaths.push(finalPath);
}
grunt.option('package-dest', destPath);
grunt.config.set('clean.packages', cleanPaths);
tasks.push('clean:packages');

var excludePaths = ['bower_components', 'node_modules', '.gitkeep'];

if (useRsync) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As previously explored with https://github.com/phase2/grunt-drupal-tasks/blob/master/tasks/make.js#L71, we need a fallback for Windows which so far has been simply "Got Windows? Okay, we'll still use copy!" Because the configuration for copy and rsync are not fully compatible, I've been thinking of rsync not as a mode, but rsync to copy fallback as a breaking change so we can revise configuration requirements. Open to discussion on this point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code I added makes them compatible. You still use srcFiles, projFiles, and the new excludePaths. The code builds the proper arguments for either copy or rsync. So yes, Windows cannot use the "rsync=true" option.

If there is a simple way to detect if rsync is installed I'd be happy to use that to set the default and have it fallback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a canRsync utility function used by the code I linked to. I'd like that to be checked as part of the condition whether we are using Rsync. We can make that function smarter to detect if rsync is installed in a follow-up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I didn't see the canRsync. I'll add that.

// Pull negate conditions from srcFiles into excludePaths.
for (var srcIndex = 0; srcIndex < srcFiles.length; srcIndex++) {
var item = srcFiles[srcIndex];
if (item.substr(0, 1) === '!') {
item = item.slice(1);
item = item.replace('**', '*');
excludePaths.push(item);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really smart, however, the filesystem matching by the Node minimatch library has more breadth of syntax than is typically supported by Bash/Rsync/etc beyond just negation, such as fancy globbing and simple expressions. That's the incompatibility I alluded to earlier. I think we need to confirm that failure is reasonably graceful as a minimum.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I think we can address that in the documentation. IMHO none of our config options should be tied to a specific implementation (node, rsync, etc). So we should specify how wildcards are handled and then convert to the needed format for either Copy or Rsync.

Right now the only issue is ** vs * and rsync seems to take the ** and still function, so I didn't need to change that.

If people want fancy expressions then we should use some form of RegEx that would be less implementation-specific. But that's work for later. For now if somebody wants to use a fancy Node minimatch expression then they just can't use the new rsync option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Confirm using a minimatch expression results in rsync failing in a reasonably informative and quick manner.
  • Release notes and documentation detail the new restriction to stick with Bash-compatible file matching patterns with rsync mode.

}
}

// Setup default rsync command and options.
var rsync = 'rsync -ahWL --no-l --stats --chmod=Du+rwx ';
for (var pathIndex = 0; pathIndex < excludePaths.length; pathIndex++) {
rsync = rsync + "--exclude " + excludePaths[pathIndex] + ' ';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like only the final excluded file would be included here. Shouldn't we remove the loop and replace with an array join on whitespace?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is adding "--exclude path" to rsync for each item in the array. So maybe a join with "--exclude " would work? Otherwise this loop does actually work fine.

}

// Ensure destination exists and sent the rsync.
var srcPath = grunt.config('config.buildPaths.html');
var destSrc = path.resolve(destPath, grunt.config.get('config.packages.dest.docroot') || '');
var rsyncCommand = rsync + srcPath + '/ ' + destSrc + '/';

grunt.config.set('mkdir.package', {
options: {
create: [destSrc]
}
});

grunt.config('shell.rsync', {
command: rsyncCommand
});
} else {
// Use slower copy if rsync is disabled in options.
srcFiles.unshift('**');
// Add any additional excludePaths to srcFiles.
for (var i = 0; i < excludePaths.length; i++) {
srcFiles.push('!**/' + excludePaths[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array merge seems more clean here. Also, something to dedup might be worthwhile to make sure any code that wants to remove exclusions only has to find a single instance of a mask.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally looped through excludePaths to add the '!**/' and then merged the array with srcFiles but this seemed cleaner. So I guess I'd welcome a code alternative to review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like:

var _ = require('lodash'); at the top.

srcFiles = _.union(srcFiles, _.map(excludePaths, function (item) {
  return '!**/' + item;
}));

Union both merges and unique-ifies.

Looks like we should check on updating lodash.

}
grunt.config('copy.source', {
files: [
{
expand: true,
cwd: '<%= config.buildPaths.html %>',
src: srcFiles,
dest: path.resolve(destPath, grunt.config.get('config.packages.dest.docroot') || ''),
dot: true,
follow: true
}
],
options: {
gruntLogHeader: false,
mode: true
}
});
}

// Always copy any files specified in projFiles.
grunt.config('copy.package', {
files: [
{
expand: true,
cwd: '<%= config.buildPaths.html %>',
src: srcFiles,
dest: path.resolve(destPath, grunt.config.get('config.packages.dest.docroot') || ''),
dot: true,
follow: true
},
{
expand: true,
src: projFiles,
Expand All @@ -96,9 +168,12 @@ module.exports = function(grunt) {
}
});

grunt.config.set('clean.packages', [destPath]);

tasks.push('clean:packages');
if (useRsync) {
tasks.push('mkdir:package');
tasks.push('shell:rsync');
} else {
tasks.push('copy:source');
}
tasks.push('copy:package');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy:package looks like it's been stripped down to some final copy operations of random project files. Maybe this should be renamed for clarity? copy:package-extras?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I even thought of "copy:projFiles" to be really specific. If we call it "package-extras" then having the config be "projFiles" might be confusing too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems fine. This is a stepping stone to #255, otherwise I'd suggest we drop projFiles and rename the config for clarity.


// If the `composer.json` file is being packaged, rebuild composer dependencies without dev.
Expand All @@ -117,25 +192,31 @@ module.exports = function(grunt) {
tasks.push('composer:drupal-scaffold');
}

if (this.args[0] && this.args[0] === 'compress') {
if (archive || (this.args[0] && this.args[0] === 'compress')) {
grunt.loadNpmTasks('grunt-contrib-compress');
grunt.config('compress.package', {
options: {
archive: destPath + '.tgz',
mode: 'tgz',
gruntLogHeader: false
mode: 'tgz'
},
files: [
{
expand: true,
dot: true,
cwd: grunt.config.get('config.buildPaths.packages') + '/' + packageName,
cwd: destPath,
src: ['**']
}
]
});

tasks.push('compress:package');

if (destPath !== finalPath) {
grunt.config('shell.mvArchive', {
command: 'mv ' + destPath + '.tgz ' + finalPath + '.tgz'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of note, mv will not work across NFS volume mounts. We will want to document that limitation a bit to help inform how projects are configured.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that's what I thought too. But it seems to work across docker volumes. Maybe we should change it from a "mv" to a normal grunt copy. Then delete the tarball from the destPath.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, maybe that restriction is only outside the container. mv is an order of magnitude faster than copy, so if it works I think we should keep it until we have reason to switch to copy.

A bigger issue is windows compat. For portability we should just switch to an fs.rename[Sync].

});
tasks.push('shell:mvArchive');
}
}

grunt.task.run(tasks);
Expand Down