
EPIC: New Machine requirement: Windows dockerBuild containers #3286

Closed
8 tasks done
sxa opened this issue Dec 6, 2023 · 44 comments · Fixed by #3702

Comments


sxa commented Dec 6, 2023

I need to request a new machine:

  • New machine operating system (e.g. linux/windows/macos/solaris/aix): Windows
  • New machine architecture (e.g. x64/aarch32/arm32/ppc64/ppc64le/sparc): x64
  • Provider (leave blank if it does not matter): Docker :-)
  • Desired usage: Build containers, similar to what we have for Linux
  • Any unusual specification/setup required:
  • How many of them are required: n/a - they should be created dynamically

Please explain what this machine is needed for:
Running builds in an isolated way where we can achieve SLSA build level 3 compliance on Windows along with the other primary platforms. Ideally we'll be able to create Windows-on-Windows container images which we can share, then download and run the builds in.

As background info:

So the tasks required would be:

  • Identify the appropriate software for running containers and ensure there are no licensing concerns (likely something from the Microsoft site linked above)
  • See if we can verify that a "basic" Dockerfile works in that environment and whether we can map directories into it (the same as -v on Linux) which are read+write in the container (see the sketch after this list)
  • Determine whether we can create a container from the playbooks using a Dockerfile equivalent to the Linux ones
  • Once we create the container, map a directory from the host into it with -v and use that to build Temurin in the container on the mapped volume so that the output is visible on the host system.
  • Understand whether we can reasonably push the resulting container images with the compiler up to dockerhub [Answer: We will push them to gcr.io and mirror to azurecr for better performance on azure dynamic machines.]
  • Integrate this into the build pipelines
  • Implement processes to regenerate the images when playbook updates are made - likely an addition to what we do for Linux in https://github.com/adoptium/infrastructure/blob/master/FAQ.md#what-about-the-builds-that-use-the-dockerbuild-tag
  • Declare SLSA Build level 3 on Windows :-)
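
As a sketch of what the "basic" Dockerfile test mentioned in the list could look like (the base image tag, image name and host path below are illustrative assumptions, not the eventual production setup):

# escape=`
FROM mcr.microsoft.com/windows/servercore:ltsc2022
SHELL ["powershell", "-Command"]
# Trivial filesystem change to prove the image builds and layers commit
RUN New-Item -ItemType Directory -Path C:\workspace
CMD ["powershell"]

Building and running it with a mapped directory would then confirm the read+write behaviour:

docker build -t basic-win-test .
docker run --rm -v c:\temp\hosttest:c:\mapped basic-win-test powershell -Command "Set-Content c:\mapped\probe.txt 'rw-check'"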

Once this level of analysis and expertise is gained it will likely make Windows installer testing and other such activities simpler, and give us more options moving forward.

Related for historic reference:

@RadekCap

Please assign this task to me. Thank you.


sxa commented Jul 4, 2024

Of the three options listed on the Microsoft website:

  • The first (Docker CE / Moby) seems to work well out of the box
  • The second (Mirantis) appears to be a commercial offering
  • The third (Containerd+nerdctl) appears functional, although networking doesn't work out of the box and it seems unable to start the eclipse-temurin container's default jshell process.


sxa commented Jul 4, 2024

OK, first phase done ...

  • docker run -p 5986:5986 -v c:\Users\sxa:c:\sxa mcr.microsoft.com/windows/servercore:ltsc2022
  • Run ConfigureRemotingForAnsible.ps1 with the usual parameters but with the netsh commands disabled (they require Windows Defender, which isn't in the image)
  • Create a user to connect with for the playbooks (MyPassword is not what I've used on the live system!):
net user ansible MyPassword /ADD
net localgroup "Administrators" ansible /ADD
net localgroup "Remote Management Users" ansible /ADD

This allows the machine to be accessible via Ansible running on a remote machine (see the connectivity-check sketch below) :-)

(Also, for my own notes: to debug PowerShell scripts, use Set-PSDebug -Trace 2.)
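
For reference, a minimal sketch of an inventory that reaches the container over WinRM (group name, host alias and password are placeholders; the variables are the standard Ansible WinRM connection ones):

[win_containers]
winconttest ansible_host=127.0.0.1 ansible_port=5986

[win_containers:vars]
ansible_connection=winrm
ansible_winrm_transport=ntlm
ansible_winrm_server_cert_validation=ignore
ansible_user=ansible
ansible_password=MyPassword

after which a quick connectivity check is:

ansible -i hosts.winrm win_containers -m win_ping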

@sxa sxa self-assigned this Jul 4, 2024

sxa commented Jul 5, 2024

Playbook execution notes (a sample invocation follows the list):

  • VS2013 requires the archive under /Vendor_Files/windows, otherwise MSVS_2013 needs to be skipped
  • NTP_TIME needs to be skipped as that has issues that are presumably related to running in a container: FAILED! => {"changed": false, "msg": "Unhandled exception while executing module: Service 'Windows Time (W32Time)' cannot be started due to the following error: Cannot start service W32Time on computer '.'."}
  • In the absence of the fixed layout files for VS2019 and VS2022, the adoptopenjdk tag needs to be skipped to allow the playbook runs to complete successfully
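
A sample invocation reflecting those skips (the inventory filename is a placeholder):

ansible-playbook -i hosts.winrm ansible/playbooks/AdoptOpenJDK_Windows_Playbook/main.yml --skip-tags MSVS_2013,NTP_TIME,adoptopenjdk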

@sxa sxa moved this to In Progress in 2024 3Q Adoptium Plan Jul 5, 2024
@sxa sxa added this to the 2024-07 (July) milestone Jul 5, 2024

sxa commented Jul 5, 2024

Ansible can be run on the host to point at the container if you install Cygwin, which has ansible as one of its installable packages (you probably want to include git too if it's a clean install on the host system). Note that if you use localhost/127.0.0.1 in your hosts file you should specify -e git_sha=12345 (or something appropriate), otherwise the execution will trip up on this task:

- name: Get Latest git commit SHA (Windows Localhost)

Note that WSL could probably be used too, but that requires a system with virtualization extensions available, which is not the case on all systems.
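
Putting the above together and run from a Cygwin shell on the host, the invocation would look something like this (inventory filename and skip list are illustrative):

ansible-playbook -i hosts.local ansible/playbooks/AdoptOpenJDK_Windows_Playbook/main.yml -e git_sha=$(git rev-parse HEAD) --skip-tags adoptopenjdk,NTP_TIME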


sxa commented Jul 26, 2024

Latest attempt is with:
--skip-tags adoptopenjdk,reboot,MSVS_2013,MSVS_2017,NTP_TIME
(Note: MSVS_2013 is skipped because I didn't have the installer on the machine, and 2017 did not work. We could also add Dragonwell to skip that install, which is not required for Temurin.)
Playbook changes to make it complete:

  • Set ansible_connection/ansible_winrm_transport in ansible.cfg
  • Set ansible_user/ansible_password in group_vars/all/adoptopenjdk_variables.yml
  • Remove win_reboot: from Common/roles/main.yml Line 60
  • Remove win_reboot: from MSVS_2013 role line 50
  • Remove win_reboot: from MSVS_2017 role line 37
  • Remove checksum parameters from MSVS_2022 role line 103 as it's been updated
  • Remove win_reboot from WMF_5.1 role line 29
  • Remove win_reboot from cygwin role line 45 (although it's already covered with the reboot tag)

After the ansible run is complete, run the commands shown in this article:

docker ps
docker stop <container>
docker commit <container> win2022_build_image

After which it can be started again and used
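
For example (the host path being mapped in is a placeholder):

docker run -it -v c:\build\workspace:c:\workspace win2022_build_image powershell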


sxa commented Jul 29, 2024

docker commit didn't work on my image:
Error response from daemon: re-exec error: exit status 1: output: mkdir \\?\C:\Windows\SystemTemp\hcs376450290\Files: Access is denied
This is specific to the new image which has had the playbook run on it, and does not occur when attempting to commit an image with only basic changes applied.

EDIT: This seems to be the temporary location where it is storing the entire image before it is committed and the machine ran out of space.

Noting that outside that directory most of the docker data is stored in C:\ProgramData\docker

EDIT 2: The docker commit command on the second machine, which had adequate space, used around 95GB in C:\Windows\SystemTemp to perform the commit (excluding VS2013 and 2017) and took about 40 minutes at the 40-50Mb/sec shown on Resource Monitor, followed by about 10 minutes using another 15GB on C: and then moving data back to the docker directory at a faster rate (maybe ~100Mb/sec).

It did, however, hit an error: Error response from daemon: re-exec error: exit status 1: output: hcsshim::ImportLayer failed in Win32: Access is denied. (0x5) (probably a zero-disk-space condition on C:, since DOCKER_TMPDIR apparently doesn't work to relocate that since docker 25).
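
One option (not tried here) would be to move Docker's whole storage area off C: via the data-root setting in C:\ProgramData\docker\config\daemon.json rather than relying on DOCKER_TMPDIR (D:\docker below is a placeholder, and it is unclear whether this also relocates the hcsshim staging area under C:\Windows\SystemTemp):

{ "data-root": "d:\\docker" }

followed by a Restart-Service docker to pick up the change.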


sxa commented Jul 29, 2024

This is unfortunate. The builds aren't working because the automatic short-name generation (fsutil behavior set disable8dot3 0) does not appear to be working within the container, but it is mandatory for the openjdk build process. Directories can have a short name created manually with fsutil file setshortname "Long name" shortname, but that is not ideal to do for each possible path.

EDIT: Noting that https://github.com/adoptium/infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Windows_Playbook/roles/shortNames/tasks/main.yml already has some explicit short name creation.
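
For reference, the relevant commands are below; the path and short name in the last one are just an example of the manual workaround, and the behaviour setting only affects directories created after it is changed:

fsutil 8dot3name query c:
fsutil behavior set disable8dot3 0
fsutil file setshortname "C:\Program Files (x86)" PROGRA~2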


sxa commented Jul 29, 2024

Manually created a few of the shortnames that the configure step was objecting to and I have a JDK21u build complete in a container, so this seems feasible 👍🏻


sxa commented Jul 30, 2024

Noting that we should look at doing this with the MS build tools installer which is suitable for use by Open Source projects. The jdk21u builds currently use:

10:04:20  * C Compiler:     Version 19.37.32822 (at /cygdrive/c/progra~1/micros~3/2022/commun~1/vc/tools/msvc/1437~1.328/bin/hostx64/x64/cl.exe)
10:04:20  * C++ Compiler:   Version 19.37.32822 (at /cygdrive/c/progra~1/micros~3/2022/commun~1/vc/tools/msvc/1437~1.328/bin/hostx64/x64/cl.exe)

Other references (this numbering is more confusing than I realised - I thought we only had the '2022' vs '19.xx' versioning differences to worry about before today...)


sxa commented Jul 30, 2024

Struggling with the GPG role at the moment, which is called during the ANT role (I'm getting gnupg as a requirement, which supplies gpg2 instead of gpg). Also Wix has to be skipped as I don't have ansible.builtin.runs available.

Other than that, a two-phase Dockerfile is looking quite promising. The first phase sets up WinRM (which will only be invoked locally) and installs Cygwin with git and ansible, then triggers a reboot to ensure the Cygwin path takes effect.

The second phase runs the playbooks as normal, although for now I have it running in multiple layers so that, for test performance, the caching of each layer takes effect independently (see the sketch below):

  1. --skip-tags adoptopenjdk,reboot,ANT,NTP_TIME,Wix,MSVS_2013,MSVS_2017,MSVS_2019,MSVS_2022
  2. -t ANT
  3. -t MSVS_2019
  4. -t MSVS_2022

This is currently using the playbook branch at https://github.com/sxa/infrastructure/tree/sxa_allhosts which makes a few changes to support this execution.
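
As a rough sketch of that layered structure (illustrative only - the base image, script location and inventory name here are assumptions rather than the exact Dockerfile contents):

# escape=`
FROM mcr.microsoft.com/windows/servercore:ltsc2022
# Phase 1: WinRM for local ansible use, plus cygwin with git and ansible
RUN powershell -ExecutionPolicy Bypass -File C:\temp\ConfigureRemotingForAnsible.ps1
# (cygwin installation and PATH setup omitted)
# Phase 2: playbook split into separately cached layers
RUN bash -lc "ansible-playbook -i hosts.local main.yml --skip-tags adoptopenjdk,reboot,ANT,NTP_TIME,Wix,MSVS_2013,MSVS_2017,MSVS_2019,MSVS_2022"
RUN bash -lc "ansible-playbook -i hosts.local main.yml -t ANT"
RUN bash -lc "ansible-playbook -i hosts.local main.yml -t MSVS_2019"
RUN bash -lc "ansible-playbook -i hosts.local main.yml -t MSVS_2022"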

@sxa sxa modified the milestones: 2024-07 (July), 2024-08 (August) Jul 31, 2024

sxa commented Aug 1, 2024

The above approach seemed to work yesterday now that the machine is rebooted after adding Cygwin to the PATH, and I had a system which was able to successfully build jdk21u using two Dockerfiles (the first to configure WinRM, the second to run the playbooks using the individual layers from the previous comment). Next steps are as follows:

  • Verify this on a clean image (I made some changes inside the image after my infrastructure branch was extracted, so those need to be confirmed as captured in the branch)
  • Fix Wix install
  • Fix the git_sha detection
  • Update the MSVS_2022 role to use MS build tools to ensure reproducibility of the builds
  • Ideally test with the MSVS_2013 and 2017 installers available in the image so those roles do not need to be skipped.

Noting that the image without VS2013 or 2017 is 99GB in size.


sxa commented Aug 1, 2024

I've now fixed the path setting so that it only requires one Dockerfile, giving us something consistent with what we have on Linux 👍🏻

It still currently requires a username/password for the authentication, but the password can be passed into the Dockerfile with --build-arg PW=SomeAcceptablePassword on the docker build command.
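
i.e. something like the following (the tag and password are placeholders, and the Dockerfile name assumes the attachment below):

docker build -t win2022-build -f Dockerfile.win2022v2 --build-arg PW=SomeAcceptablePassword .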

I haven't got it picking up the git_sha properly yet, so that is currently hard-coded. Everything else is good enough to run a jdk21u build on, but it's missing the compilers for some earlier versions (we'll need those on the host and mapped in via Vendor_Files, similar to what we do with AWX). Also we'll want the jenkins_user role (currently skipped via the adoptopenjdk tag), unless we're happy with the processes running as an administrator within the container (need to check how well user mapping works in these containers).

Otherwise, here is the dockerfile Dockerfile.win2022v2.txt which uses the playbook changes from https://github.com/sxa/infrastructure/tree/windows_docker_fixes

@sxa sxa pinned this issue Aug 1, 2024
@smlambert smlambert moved this from Todo to In Progress in Adoptium Backlog Nov 7, 2024
@smlambert smlambert moved this to In Progress in 2024 4Q Adoptium Plan Nov 7, 2024

sxa commented Nov 7, 2024

The pipelines PR was merged yesterday, so the code is in and this can now be used once we identify suitable systems which can run docker - bearing in mind that for the current docker tests you need to have jenkins running as an administrative user, which is not the case for our existing machines.

@sxa sxa changed the title New Machine requirement: Windows dockerBuild containers EPIC: New Machine requirement: Windows dockerBuild containers Nov 26, 2024
@sxa sxa unpinned this issue Nov 26, 2024

sxa commented Nov 26, 2024

New machines being tested:

The AMD machine completed a jdk21u build in a container from the command line in just under 3h, so it is possible to build with 4GiB of RAM. It was slightly slower than the numbers from the B2ms systems in the earlier comment, but it's also a different CPU, and those tests were done after the machine had been worked a bit from the start, so they may have been subject to bursting limits. Since the AMD one seems to work, I will also look at loading up its 256GiB C: drive with a normal playbook run (excluding VS2013, 2017 and 2019) so it can act as a drop-in replacement for an existing build machine even without enabling the containerised builds. The first two here are my prototype machines, which will be replaced by the two new ones, but here is a spec comparison so we have the info stored:

| Machine | CPU | Docker Disk | RAM | jdk8u / jdk21u / jdk24 build time |
| --- | --- | --- | --- | --- |
| dh-w22-1 | Xeon 8370C | 250GiB Premium SSD v2 | 8GiB | 57m29 (re-run) |
| dh-w22-sxa1/3 | Xeon 8370C | 200GiB Premium SSD | 8GiB | 33m02 (now deleted) |
| dh-w22-1-intel | Xeon 8171M | 128GiB HDD | 8GiB | TBC (failed when run with docker support) |
| dh-w22-2-amd | AMD EPYC 7763 | 128GiB HDD | 4GiB | 1h04 |
| dh-w22-3-intel | Xeon E5-2673v4 | 128GiB HDD | 4GiB | |


sxa commented Nov 27, 2024

I have "static" containers running jenkins agents on burstable machines to test the process with the ea pipelines this week. This means we can switch to/from this for performance testing without modifying the pipelines (i.e. the new explicit docker support is not yet being enabled). A couple of notes on this:

  1. Other than the first build, this is likely to be quite slow due to the use of burstable machines
  2. The wix label has been removed from these agents as there is a problem with the locales in the containers (Potentially we could install enough to make WIX work on the docker host system) [*]
  3. I have removed the build label from the two existing machines so that they will not be used for the builds, but they are kept online so that the installer jobs requiring wix can run on them.

[*] - Sample failure: https://ci.adoptium.net/job/build-scripts/job/release/job/create_installer_windows/1108/console

Building setup translation for culture "de-de" with LangID "1031"...
Input Error: Can not find script file "C:\Program Files (x86)\Windows Kits\10\bin\10.0.17763.0\x64\WiLangId.vbs".
WiLangId failed with : 1
Failed to generate setup translation of culture "de-de" with LangID "1031".
failed to build translation de-de 1031


sxa commented Nov 28, 2024

I've done a bit of rebasing to allow us to use the updated playbooks (Currently the dockerfile is still pointing at my original fork/branch of the infrastructure repo, which has resulted in it not having recent updates such as ant)

I still need fixes for ant-contrib (the download isn't working - I have to pull it from a copy I've put in place), and #3828 is also preventing parts of the playbook from completing with the latest versions.


sxa commented Nov 29, 2024

Regarding the WiX error above, I have done the following to try to allow it to run on the 8GiB Intel machine:

This means that right now the machine can technically run jobs from three different jenkins agents running on it:

The first two should not be enabled in parallel (same for the other similar machines) as this will overwhelm them. None of these agents are currently running as a service during this prototype phase.


sxa commented Nov 29, 2024

First pass with JDK8

| Machine | Time |
| --- | --- |
| build-docker-1-amd (on dh-2-amd) | 1h06 (35m with RTP off) |
| build-docker-2-intel (on dh-1-intel) | 1h32 |
| build-docker-3-intel (on dh-2-intel) | 1h40 |

With jdk21 (I'll kick off some runs and populate this table once adoptium/installer#1063 is merged):

| Machine | Type | Time | Reproducible? | Notes |
| --- | --- | --- | --- | --- |
| b-d-w22-2-intel | b2ms US-E 8171M 2.6GHz | 1h51 + 2h04 (1h04+1h09 noRTP) | | |
| b-d-w22-1-amd | b2alsV2 .se EPYC7763 | 1h21+1h26 | | Hosted on dh-w22-2-amd |
| b-d-w22-3-intel | b2s E5-2673v4 2.3GHz | 2h06 + 2h24 (1h22+1h33 noRTP) | | |
| b-d-w22-2-amd | b2alsV2 US-E | (1h22+XhXX) (0h55+1h58 noRTP-Burst) | | Hosted on dh-w22-1-amd |
| dh-w22-1 (sxa) | b2ms .se 8370C 2.8GHz+SSD | 56m+55m-noRTP 1h28+1h38-RTP 52m19+57m24-noRTP | link | |
| dh-w22-1-amd | b2alsV2 US-E EPYC7763 | 48m54+54m22 1h06+1h13-RTPnoWS | link, linknoWS | RTPnoWS = real-time Defender excluding C:\workspace |
| dh-w22-1-intel | b2ms | git fetch failure - deprovisioned | | |
| dh-w22-2-amd | b2alsV2 | 1h23+1h45 50m12+54m53-noRTP | link | |
| dh-w22-3-intel | b2s | 1h11+1h20 | link | |
| build-azure-1 | D4s_v3 | 0h42+0h45 1h14+1h02-RTPon | link | |
| dh-w22-4-intel | ? | 1h19+1h24-RTPon | | Weird failure [*] |

[*] Weird failure is:

13:25:30  Reading The SBOM Content & Validating The Structure..
13:25:30  
13:25:30  SBOM Is Structurally Sound.. Extracting Values:
13:25:30  
13:25:30  assertion "cb == jq_util_input_next_input_cb" failed: file "/usr/src/ports/jq/jq-1.6-1.x86_64/src/jq-1.6/src/util.c", line 371, function: jq_util_input_get_position
First 4 jobs section breakdown

b-d-w22-2-intel#1273 8171M:

[2024-12-01T17:25:48.907Z] build.sh : 17:25:48 : Clearing out target dir ...
[2024-12-01T17:25:48.907Z] build.sh : 17:25:48 : Configuring workspace inc. clone and cacerts generation ...
[2024-12-01T17:33:25.302Z] build.sh : 17:33:25 : Initiating build ...
[2024-12-01T18:56:13.193Z] build.sh : 18:56:12 : Build complete ...
[2024-12-01T18:56:13.193Z] build.sh : 18:56:12 : All done!

b-d-w22-1-amd#1267

[2024-11-29T16:29:35.694Z] build.sh : 16:29:35 : Clearing out target dir ...
[2024-11-29T16:29:36.081Z] build.sh : 16:29:35 : Configuring workspace inc. clone and cacerts generation ...
[2024-11-29T16:35:13.621Z] build.sh : 16:35:13 : Initiating build ...
[2024-11-29T17:37:31.500Z] build.sh : 17:37:30 : Build complete ...
[2024-11-29T17:37:31.500Z] build.sh : 17:37:30 : All done!

b-d-w22-3-intel#1276 E5-2673v4

[2024-12-01T20:54:59.134Z] build.sh : 20:54:58 : Clearing out target dir ...
[2024-12-01T20:54:59.134Z] build.sh : 20:54:58 : Configuring workspace inc. clone and cacerts generation ...
[2024-12-01T20:58:20.750Z] build.sh : 20:58:20 : Initiating build ...
[2024-12-01T22:30:13.414Z] build.sh : 22:30:11 : Build complete ...
[2024-12-01T22:30:13.414Z] build.sh : 22:30:11 : All done!

dh-w22-1 (sxa) #1277 8GiB 8370C+SSD

[2024-12-01T21:07:58.152Z] build.sh : 21:07:57 : Clearing out target dir ...
[2024-12-01T21:07:58.152Z] build.sh : 21:07:57 : Configuring workspace inc. clone and cacerts generation ...
[2024-12-01T21:09:31.124Z] build.sh : 21:09:31 : Initiating build ...
[2024-12-01T22:13:54.747Z] build.sh : 22:13:54 : Build complete ...
[2024-12-01T22:13:54.747Z] build.sh : 22:13:54 : All done!


sxa commented Nov 29, 2024

Noting that all -ea builds from this week were run in the static docker containers, and are therefore the first set to be built with machines that only have the MS VS2022 Build Tools installation, which meets the requirement in adoptium/temurin-build#3787


sxa commented Dec 3, 2024

Noting that the reproducible build test (special.system) failed when there was a --with-ucrt-dll parameter containing the parentheses from the Program Files (x86) directory:

syntax error near unexpected token '('

Fortunately with the devkit this is no longer required.
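
(The underlying cause is just bash's handling of unquoted parentheses; a trivial reproduction with a made-up path is:

bash -c "echo --with-ucrt-dll=C:/Program Files (x86)/example"

which fails with the same syntax error, whereas quoting the value avoids it.)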


sxa commented Dec 7, 2024

I've provisioned another non-burstable spot-provisioned 2-core 8GiB system dockerhost-azure-win2022--intel to run some jdk8u sanity.openjdk tests in static containers on the hosts:

| Machine | Type | Grinder | Time | Results |
| --- | --- | --- | --- | --- |
| bd-4-intel | D2sV4 Xeon 8370C | 11993 | 2h10 | Overlapped with build job running on host |
| bd-4-intel | D2sV4 Xeon 8370C | 12000 | 2h01 | |
| bd-2-amd | B2alsV2 | 11997 | 1h32 | Similar to build - ~25% faster than Intel systems |
| test-azure-4 | D2s v4 Xeon 8272CL | 11998 | 2h04 | |
| test-azure-2 | F4s v2 Xeon 8272CL | 11999 | 1h09 | |

For reference, a jdk21u run on bd-4-intel is at https://ci.adoptium.net/job/Grinder/12001/ and took 7h46.

@adamfarley
Contributor

Hi @sxa - I'm seeing an issue on dockerhost-azure-win2022-x64-3-intel:

15:20:06  C:\workspace\openjdk-build>git clean -fdx 
15:20:06  warning: failed to remove test/: Permission denied

This issue is preventing builds from working on that machine.
https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-windows-x86-32-temurin/376/

@adamfarley
Contributor

Git clean continues to fail.

22:17:05  C:\workspace\openjdk-build>git clean -fdx 
22:17:05  warning: failed to remove workspace/: Permission denied

Examples:
https://ci.adoptium.net/job/build-scripts-pr-tester/job/build-test/job/jobs/job/jdk8u/job/jdk8u-windows-x64-temurin/212/console
https://ci.adoptium.net/job/build-scripts-pr-tester/job/build-test/job/jobs/job/jdk11u/job/jdk11u-windows-x64-temurin/220/console
https://ci.adoptium.net/job/build-scripts-pr-tester/job/build-test/job/jobs/job/jdk17u/job/jdk17u-windows-x64-temurin/214/console

The jdk21 build didn't see this issue, but we did have rm failures.

00:02:59  rm: cannot remove 'C:/workspace/openjdk-build/CONTRIBUTING.md': Permission denied
00:02:59  rm: cannot remove 'C:/workspace/openjdk-build/FAQ.md': Permission denied
00:02:59  rm: cannot remove 'C:/workspace/openjdk-build/LICENSE': Permission denied

@sxa - Is this the right place for these bug reports? Maybe you'd prefer separate issues?


sxa commented Dec 17, 2024

This is being discussed in Slack. It's specific to the PR tester, which uses CLEAN_WORKSPACE_AFTER, which the EA builds do not. (I'm not certain why the x86-32 build two comments up hit it though.)

@adamfarley
Contributor

(I'm not certain why the x86-32 build two comments up hit it though)

The 32 bit one was a week ago. Maybe it's fixed now?


sxa commented Dec 17, 2024

We think we've got a solution to the above, which I've just tested. The scenario typically occurs when a job has been run on a dockerhost node without a docker_image specified, at which point some of the permissions within the workspace directory c:\workspace\openjdk-build get messed up in a way that they cannot be correctly cleared.

This PR in the pipelines repository seems to resolve it.


sxa commented Dec 20, 2024

Closing this since the builds are now happening in containers which are being pulled from the Azure registry.
Noting that #3712 is still open regarding capturing the dockerhost setup specifically in a playbook.

@sxa sxa closed this as completed Dec 20, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in 2024 4Q Adoptium Plan Dec 20, 2024