
Configure using instance metadata #1

Open
wants to merge 18 commits into master

Conversation


@reachfh reachfh commented Aug 8, 2018

I am using this module with Nerves, so logging runs very early in the boot process. I reorganized the initialization and configuration process so that it works reliably.

@pmenhart
Owner

Thanks for the pull request! Nerves is definitely bringing new challenges - I have no experience with Elixir starting faster than the EC2 instance metadata :-)
I forked from the original cloud_watch to use ExAws instead of the AWS Elixir library. Unfortunately, your proposal breaks compatibility with ExAws.
I believe this can fix it: configure_aws/1 is specific to the AWS library and should be moved to CloudWatch.AwsProxy (e.g. refactor AwsProxy.client/4, replace all parameters with state, and return nil if the client is not ready yet).
ExAws has its own credential resolution mechanism, including ECS task roles, EC2 instance roles, assumed STS roles, etc. However, I have no experience with how it would work in your "early boot" scenario. I hope the ExAws implementation of AwsProxy.client(state) can be enhanced to check whether the configuration is valid and return nil if it is not. The next check would then be invoked from your :refresh_creds timer.
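
Roughly the shape I have in mind (a sketch only, not working code; the state fields here are illustrative):

# CloudWatch.AwsProxy, ExAws flavor: build the client from the backend state,
# or return nil while the configuration is not usable yet.
def client(%{region: nil}), do: nil
def client(%{region: region, endpoint: endpoint}) do
  # ExAws resolves credentials itself, so region/endpoint is all we need here.
  %{region: region, endpoint: endpoint}
end

flush/1 would then treat a nil client as "not ready yet", and the next attempt comes from the :refresh_creds timer.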

Additional questions or remarks:

  • I like your implementation of a periodic timer to check AWS client validity. I suggest setting a reasonably large max_buffer_size, as the boot often generates tons of messages. We experienced log overflow and/or failure of flush/1 when the buffer was small (see the config sketch after this list).
  • I wonder why you keep calling :refresh_creds -> configure_aws/1 every 5 minutes even after the credentials have been resolved. Is there a scenario where the EC2 instance metadata or ECS task roles change values over time?
  • What is the timeout of the :hackney.request calls? We use similar calls, and when running the release outside of EC2 (e.g. developers using local Docker containers), the call took 60 seconds to time out. We had to configure such calls with a short timeout (1 sec should be plenty in this case, IMHO).
  • FYI, we'll need more flexibility in log_stream_name in my current project. I am thinking about adding a configurable callback hook, i.e. the name could be generated dynamically in the application (incl. the ability to change dynamically, e.g. every day or every hour).
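
For the buffer size and naming points above, something along these lines is what I mean, assuming the usual config :logger, :cloud_watch block (the last two keys are purely hypothetical, they don't exist today):

config :logger, :cloud_watch,
  log_stream_name: "placeholder-until-instance-id-is-known",
  max_buffer_size: 1_048_576,                                # large enough to absorb the boot burst; check the units the backend expects
  metadata_timeout: 1_000,                                   # hypothetical: short :hackney timeout for metadata calls
  log_stream_name_mfa: {MyApp.LogNaming, :stream_name, []}   # hypothetical: callback hook for dynamic stream names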

Cheers
Peter

@reachfh
Author

reachfh commented Aug 12, 2018

Hi Peter,

Did you see any specific problem with ExAws? That's what I tested with, and it worked for me.

ExAws does its own credential loading, but it doesn't deal with the metadata not being available, and it doesn't dynamically configure things like the region even though it could. So it doesn't hurt to wait for the metadata to be available before trying to use ExAws. Probably the right thing to do is to enhance ExAws a bit.

I didn't get the AWS module working, as it has funny runtime dependencies, e.g. trying to load tzdata from the internet as part of its startup process. I generally like the idea of generating the library from the metadata maintained by Amazon, as it's likely to be up to date and support the sprawl of new services. It doesn't seem to be as popular or well maintained, though. I think this code would work for AWS, but I need to test it in a non-Nerves environment. I wanted to make the PR to you now as you are currently working on it.

It's necessary to periodically refresh the credentials from the instance profile, as they expire.
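
Concretely, the loop is along these lines (a simplified sketch, not the exact code from the PR; the backend is a :gen_event handler, so handle_info returns {:ok, state}):

def handle_info(:refresh_creds, state) do
  # Re-read instance metadata: instance-profile credentials are temporary and rotate.
  state = configure_aws(state)
  Process.send_after(self(), :refresh_creds, 5 * 60 * 1_000)
  {:ok, state}
end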

I didn't specifically set a timeout on hackney. In AWS it generally responds quickly, even when the data is not available yet. Perhaps it would be useful to set it to something short and let the retry logic handle it.
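
Something like this for the metadata calls, with short timeouts (the option names are hackney's; the values are just a guess at something reasonable):

url = "http://169.254.169.254/latest/meta-data/instance-id"
case :hackney.request(:get, url, [], "", connect_timeout: 1_000, recv_timeout: 1_000) do
  {:ok, 200, _headers, ref} -> :hackney.body(ref)   # {:ok, instance_id}
  _other -> :not_available                          # fail fast; retry on the next :refresh_creds tick
end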

A hook for setting the stream name might be interesting. We do something similar with Elasticsearch by logging to time-based indexes which we expire. I am finding CloudWatch Logs to be relatively weak, so I am thinking about logging messages to Kinesis, feeding them to Elasticsearch and simultaneously making them available to the developer in real time.

My overall goal is what I am calling "Cloud Native Elixir", including a minimal legacy-free runtime instance which has monitoring and logging included, deploying using CodeBuild/CodePipeline/CodeDeploy, and integrating well with other Amazon services for configuration (e.g. Parameter Store, KMS) and data storage (S3, RDS, etc).

Cheers,
Jake

PS, perhaps easier to chat in real time. I am jakemorrison on Elixir Slack and Discord, and reachfh on IRC.

@pmenhart
Owner

Hi Jake,

When I run my app outside of EC2, the ExAws client is never activated. There is an endless loop where configure_aws/1 tries to get credentials from non-existent metadata. I have to add cloud_watch config access_key_id: "fake" to get past the if statement on lib/cloud_watch.ex:112, and get the AwsProxy.client.
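
(In config terms that workaround is just the following, assuming the usual config :logger, :cloud_watch block:

config :logger, :cloud_watch, access_key_id: "fake"

which obviously only papers over the check.)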

Please note that all the credentials obtained in configure_aws/1 are useful only for the AWS library. ExAws ignores these credentials! See the AwsProxy.client implementation for ExAws at lib/cloud_watch/aws_proxy.ex:35.
(Only log_stream_name = metadata_instance_id() would be used.)

The core of what you are trying to achieve is a delayed start of the ExAws client. An ideal solution would be to enhance ExAws with something like ExAws.isReady?. This is especially important if there are other AWS components needed early on!
In our case, we are reading some configuration data from DynamoDB tables during startup. We'd need to implement a similar delayed configuration twice, so the DRY principle is important.
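
Something like this is what I picture (ExAws.isReady? does not exist today; this is purely a hypothetical shape built on ExAws.Config.new/1):

# Hypothetical helper: ask ExAws whether its own resolution chain can produce credentials yet.
def ready? do
  config = ExAws.Config.new(:logs)
  is_binary(config[:access_key_id]) and is_binary(config[:secret_access_key])
rescue
  _ -> false
end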

ExAws doesn't dynamically configure things like the region

Strange, I'd love to understand the details... A quick browse shows that https://github.com/ex-aws/ex_aws/blob/master/lib/ex_aws/instance_meta.ex does not retrieve the region.
I see several open ExAws issues related to regions: 516, 365, 521 - but none of them seems to address sourcing the region from metadata. Is this a gap in the library, or do the authors simply expect us to use something like metadata_region/0 explicitly in the application?

As a workaround, put the region into an env variable:

EC2_AVAIL_ZONE=`curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone`
EC2_REGION="`echo \"$EC2_AVAIL_ZONE\" | sed -e 's:\([0-9][0-9]*\)[a-z]*\$:\\1:'`"
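
ExAws can then pick it up from the environment via a {:system, ...} tuple, if I remember the syntax right:

config :ex_aws, region: {:system, "EC2_REGION"}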

AWS trying to load tzdata from the internet:

I ran into such a problem when using Timex for Logger.format. Timex depends on tzdata, so I had to switch to DateTime (at the time I didn't figure out how to disable tzdata updating its timezone info).
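
It looks like tzdata's self-update can be disabled, which should avoid the network access at startup (I have not tried it myself):

config :tzdata, :autoupdate, :disabled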

I contacted you on Elixir Slack as pmenhart.

Thanks
Peter

pmenhart added a commit that referenced this pull request Oct 5, 2018
* This is an experimental code. Two options are explored:
  1. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  2. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  Option #1 is used by default.
  To trigger option #2, set config purge_buffer_if_throttled: true
pmenhart added a commit that referenced this pull request Oct 6, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true
pmenhart added a commit that referenced this pull request Oct 11, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true
* Also added delay after connect_timeout,
  and logging of successful flushes
pmenhart added a commit that referenced this pull request Oct 11, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true
* Also added delay after connect_timeout,
  and logging of successful flushes (only to CloudWatch)
pmenhart added a commit that referenced this pull request Oct 11, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true
* Also added delay after connection/timeout errors
* Added logging of successful flushes (temporary; only to CloudWatch)
pmenhart added a commit that referenced this pull request Oct 20, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true

* Added logging of successful flushes (only to CloudWatch backend).
  Useful for troubleshooting; currently commented out
pmenhart added a commit that referenced this pull request Oct 31, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true

* Added logging of successful flushes (only to CloudWatch backend).
  Useful for troubleshooting; currently commented out
* Added heap limit, restricting the Logger process to a hardwired value
  of 32MiB (including message queue)
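
The heap limit described above would be set along these lines (my reading of the commit, not an excerpt; max_heap_size is given in words, and 32 MiB is about 4_194_304 words on a 64-bit VM):

# Hardwired safety limit on the logger backend process heap.
Process.flag(:max_heap_size, %{size: 4_194_304, kill: true, error_logger: true})
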
pmenhart force-pushed the master branch 11 times, most recently from f72ac7e to ff3f8f5 on April 29, 2021