Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

accumulo-proxy #22 - Reduce Docker image size #23

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

volmasoft
Copy link
Contributor

Updated the Dockerfile to change some of the behavior and reduce the image size to 1.46GB from 1.86GB.

I also tried to keep the download_verify_and_extract() function readable given it now downloads the TARs, extracts them and useful symlinks in place.
I moved the HASH strings to SHA512 to match the apache.org artifacts found along side the binaries.

I've put this up early as there's some discussion on the issue #22

@volmasoft
Copy link
Contributor Author

After discussion on the main issue I've pulled the alpine change into this pull.

This is probably where I'm going to stop on this pull, there's a further discussion to be had around Hadoop docs but for now this pull has reduced the image size from 1.86GB to 1.06GB.

@volmasoft volmasoft force-pushed the accumulo-proxy-22 branch from f8137ac to 5f09502 Compare May 5, 2020 14:04
…final image size to 1.46GB from 1.86GB.

I also tried to keep the download_verify_and_extract() function readable given it now downloads the TARs, extracts them and useful symlinks in place.
I moved the HASH strings to SHA512 to match the apache.org artifacts found along side the binaries.
Switched to alpine after discussion on the issue.
@volmasoft volmasoft force-pushed the accumulo-proxy-22 branch from 5f09502 to 8147a65 Compare May 5, 2020 14:56
@@ -13,20 +13,18 @@
# See the License for the specific language governing permissions and
# limitations under the License.

FROM openjdk:8
FROM openjdk:8-alpine3.9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you select alpine? I was looking at the openjdk tags on dockerhub to see what else was small and saw openjdk:8-jre-slim. It seems to be based on Debian, is a similar size, and was updated more recently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I spotted it to be fair, it's about 3M smaller so might be worth it and as you say it seems to be getting kept up to date.

Note, all values here unless specified otherwise are from the Docker command.

I did a quick rebuild with the new one and this is the sizes:

  • openjdk:8-alpine3.9 1058822286 bytes
  • openjdk:8-jre-slim 1136341123 bytes (+73 MB)

So it was actually larger to use the smaller base image, not what I expected, I did have to add wget and removed the installation of bash from the current Dockerfile as it was already present in 8-jre-slim

On disk the pulled images were different to dockerhub values

  • openjdk:8-alpine3.9 - 105MB (disk) 70.09MB (dockerhub)
  • opendjk:8-jre-slim - 184MB (disk) (dockerhub)

Looking into it, it seems that dockerhub reports the compressed size, not the size on disk once you pull the image.

I did the comparison for the final image sizes:

  • openjdk:8-alpine3.9 base - 1.06GB
  • openjdk:8-jre-slim base - 1.14GB

I tested both builds so they both function the same, perhaps taking the 80MB increase is worth it for being up to date?

What are your thoughts, stick with what we have or update to 8-jre-slim?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favor of 8-jre-slim because its more up to date and its based on Debian which I think a larger base of contributors and users may be familiar with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice analysis and write up of the analysis.

@keith-turner
Copy link
Contributor

@volmasoft on the issue removing the hadoop docs folder was mentioned, but I did not see it.

Based on the default Accumulo classpath, maybe everything except ${HADOOP_HOME}/share/hadoop/client/* and the hadoop conf dir could be removed.

@volmasoft
Copy link
Contributor Author

@volmasoft on the issue removing the hadoop docs folder was mentioned, but I did not see it.

Based on the default Accumulo classpath, maybe everything except ${HADOOP_HOME}/share/hadoop/client/* and the hadoop conf dir could be removed.

I wasn't sure if it's allowable to remove it e.g. does the license restrict it, as I'm no expert in this arena I was hoping someone else would be able to provide guidance.

Looking at the official license (https://apache.org/licenses/LICENSE-2.0) it appears that we should be fine (see 4. Redistribution)
"4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:"

So perhaps we can modify this but if so do we need to put any documentation/license information anywhere?

I didn't want to do something in good faith but potentially cause issues for us down the line so hence I haven't included that change currently.

@keith-turner
Copy link
Contributor

So perhaps we can modify this but if so do we need to put any documentation/license information anywhere?

I think we would be fine making a modification w/o doing anything additional. We could also wget the two needed jars from maven central possibly instead of getting the hadoop tar ball and then trimming it.

I didn't want to do something in good faith but potentially cause issues for us down the line so hence I haven't included that change currently.

I think if anything was done to reduce the Hadoop size, it would make sense to do it in its own PR.

@ctubbsii ctubbsii changed the base branch from master to main August 6, 2020 06:17
@milleruntime
Copy link
Contributor

@volmasoft @keith-turner This PR got lost. Either of you OK with merging as-is?

@@ -71,7 +71,7 @@ You can test if this will work for you by executing the following steps

Start the accumulo-proxy container and enter it
```commandline
docker run -it --rm -p 42424:42424 --network="host" --name accumulo-proxy accumulo-proxy:latest bash;
docker run -it --rm -p 42424:42424 --network="host" --name accumulo-proxy accumulo-proxy:latest sh;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the impact of this, but I do know a lot of Accumulo scripts use bash specifically.

@DomGarguilo
Copy link
Member

It seems like there are some good improvements here. If there is no activity on this PR I may open a new one and try to incorporate some of these changes at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants