
Configure using instance metadata #1

Open
wants to merge 18 commits into master

Conversation


@reachfh reachfh commented Aug 8, 2018

I am using this module with Nerves, so logging runs very early in the boot process. I reorganized the initialization and configuration process so that it works reliably.

@pmenhart
Owner

Thanks for the pull request! Nerves is definitely bringing new challenges - I have no experience with Elixir starting faster than the EC2 instance metadata :-)
I forked from the original cloud_watch to use ExAws instead of the AWS Elixir library. Unfortunately, your proposal breaks compatibility with ExAws.
I believe this can fix it: configure_aws/1 is specific to the AWS library and should be moved to CloudWatch.AwsProxy (e.g. refactor AwsProxy.client/4, replace all parameters with state, and return nil if the client is not ready yet).
ExAws has its own credential resolution mechanism, including ECS task roles, EC2 instance roles, assumed STS roles, etc. However, I have no experience with how it would work in your "early boot" scenario. I hope the ExAws implementation of AwsProxy.client(state) can be enhanced to check whether the configuration is valid and return nil if it is not. The next check would then be invoked from your :refresh_creds timer.
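
Roughly the shape I have in mind (a sketch only, not working code; the state fields here are illustrative):

# CloudWatch.AwsProxy, ExAws flavor: build the client from the backend state,
# or return nil while the configuration is not usable yet.
def client(%{region: nil}), do: nil
def client(%{region: region, endpoint: endpoint}) do
  # ExAws resolves credentials itself, so region/endpoint is all we need here.
  %{region: region, endpoint: endpoint}
end

flush/1 would then treat a nil client as "not ready yet", and the next attempt comes from the :refresh_creds timer.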

Additional questions or remarks:

  • I like your implementation of a periodic timer to check AWS client validity. I suggest setting a reasonably large max_buffer_size, as the boot often generates tons of messages. We experienced log overflow and/or failure of flush/1 when the buffer was small (see the config sketch after this list).
  • I wonder why you keep calling :refresh_creds -> configure_aws/1 every 5 minutes even after the credentials have been resolved. Is there a scenario where the EC2 instance metadata or ECS task roles change values over time?
  • What is the timeout of the :hackney.request calls? We use similar calls, and when running the release outside of EC2 (e.g. developers using local Docker containers), the call took 60 seconds to time out. We had to configure such calls with a short timeout (1 sec should be plenty in this case, IMHO).
  • FYI, we'll need more flexibility in log_stream_name in my current project. I am thinking about adding a configurable callback hook, i.e. the name could be generated dynamically in the application (incl. the ability to change dynamically, e.g. every day or every hour).
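
For the buffer size and naming points above, something along these lines is what I mean, assuming the usual config :logger, :cloud_watch block (the last two keys are purely hypothetical, they don't exist today):

config :logger, :cloud_watch,
  log_stream_name: "placeholder-until-instance-id-is-known",
  max_buffer_size: 1_048_576,                                # large enough to absorb the boot burst; check the units the backend expects
  metadata_timeout: 1_000,                                   # hypothetical: short :hackney timeout for metadata calls
  log_stream_name_mfa: {MyApp.LogNaming, :stream_name, []}   # hypothetical: callback hook for dynamic stream names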

Cheers
Peter

@reachfh
Author

reachfh commented Aug 12, 2018

Hi Peter,

Did you see any specific problem with ExAws? That's what I tested with, and it worked for me.

ExAws does its own credential loading, but it doesn't deal with the metadata not being available, and it doesn't dynamically configure things like the region even though it could. So it doesn't hurt to wait for the metadata to be available before trying to use ExAws. Probably the right thing to do is to enhance ExAws a bit.

I didn't get the AWS module working, as it has funny runtime dependencies, e.g. trying to load tzdata from the internet as part of its startup process. I generally like the idea of generating the library from the metadata maintained by Amazon, as it's likely to be up to date and support the sprawl of new services. It doesn't seem to be as popular or well maintained, though. I think this code would work for AWS, but I need to test it in a non-Nerves environment. I wanted to make the PR to you now as you are currently working on it.

It's necessary to periodically refresh the credentials from the instance profile, as they expire.
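
Concretely, the loop is along these lines (a simplified sketch, not the exact code from the PR; the backend is a :gen_event handler, so handle_info returns {:ok, state}):

def handle_info(:refresh_creds, state) do
  # Re-read instance metadata: instance-profile credentials are temporary and rotate.
  state = configure_aws(state)
  Process.send_after(self(), :refresh_creds, 5 * 60 * 1_000)
  {:ok, state}
end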

I didn't specifically set a timeout on hackney. In AWS it generally responds quickly, even when the data is not available yet. Perhaps it would be useful to set it to something short and let the retry logic handle it.
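
Something like this for the metadata calls, with short timeouts (the option names are hackney's; the values are just a guess at something reasonable):

url = "http://169.254.169.254/latest/meta-data/instance-id"
case :hackney.request(:get, url, [], "", connect_timeout: 1_000, recv_timeout: 1_000) do
  {:ok, 200, _headers, ref} -> :hackney.body(ref)   # {:ok, instance_id}
  _other -> :not_available                          # fail fast; retry on the next :refresh_creds tick
end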

A hook for setting the stream name might be interesting. We do something similar with Elasticsearch by logging to time-based indexes which we expire. I am finding CloudWatch Logs to be relatively weak, so I am thinking about logging messages to Kinesis, feeding them to Elasticsearch and simultaneously making them available to the developer in real time.

My overall goal is what I am calling "Cloud Native Elixir", including a minimal legacy-free runtime instance which has monitoring and logging included, deploying using CodeBuild/CodePipeline/CodeDeploy, and integrating well with other Amazon services for configuration (e.g. Parameter Store, KMS) and data storage (S3, RDS, etc).

Cheers,
Jake

PS, perhaps easier to chat in real time. I am jakemorrison on Elixir Slack and Discord, and reachfh on IRC.

@pmenhart
Owner

Hi Jake,

When I run my app outside of EC2, the ExAws client is never activated. There is an endless loop where configure_aws/1 tries to get credentials from non-existent metadata. I have to add cloud_watch config access_key_id: "fake" to get past the if statement on lib/cloud_watch.ex:112, and get the AwsProxy.client.
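
(In config terms that workaround is just the following, assuming the usual config :logger, :cloud_watch block:

config :logger, :cloud_watch, access_key_id: "fake"

which obviously only papers over the check.)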

Please note that all the credentials obtained in configure_aws/1 are useful only for the AWS library. ExAws ignores these credentials! See the AwsProxy.client implementation for ExAws at lib/cloud_watch/aws_proxy.ex:35.
(Only log_stream_name = metadata_instance_id() would be used.)

The core of what you are trying to achieve is a delayed start of the ExAws client. An ideal solution would be to enhance ExAws with something like ExAws.isReady?. This is especially important if there are other AWS components needed early on!
In our case, we are reading some configuration data from DynamoDB tables during startup. We'd need to implement a similar delayed configuration twice, so the DRY principle is important.
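
Something like this is what I picture (ExAws.isReady? does not exist today; this is purely a hypothetical shape built on ExAws.Config.new/1):

# Hypothetical helper: ask ExAws whether its own resolution chain can produce credentials yet.
def ready? do
  config = ExAws.Config.new(:logs)
  is_binary(config[:access_key_id]) and is_binary(config[:secret_access_key])
rescue
  _ -> false
end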

ExAws doesn't dynamically configure things like the region

Strange, I'd love to understand the details... A quick browse shows that https://github.com/ex-aws/ex_aws/blob/master/lib/ex_aws/instance_meta.ex does not retrieve the region.
I see several open ExAws issues related to regions: 516, 365, 521 - but none of them seems to address sourcing the region from metadata. Is this a gap in the library, or do the authors simply expect us to use something like metadata_region/0 explicitly in the application?

As a workaround, put the region into an env variable:

EC2_AVAIL_ZONE=`curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone`
EC2_REGION="`echo \"$EC2_AVAIL_ZONE\" | sed -e 's:\([0-9][0-9]*\)[a-z]*\$:\\1:'`"
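
ExAws can then pick it up from the environment via a {:system, ...} tuple, if I remember the syntax right:

config :ex_aws, region: {:system, "EC2_REGION"}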

AWS trying to load tzdata from the internet:

I ran into such a problem when using Timex for Logger.format. Timex depends on tzdata, so I had to switch to DateTime (at the time I didn't figure out how to disable tzdata updating its timezone info).
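
It looks like tzdata's self-update can be disabled, which should avoid the network access at startup (I have not tried it myself):

config :tzdata, :autoupdate, :disabled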

I contacted you on Elixir Slack as pmenhart.

Thanks
Peter

pmenhart added a commit that referenced this pull request Oct 5, 2018
* This is an experimental code. Two options are explored:
  1. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  2. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  Option #1 is used by default.
  To trigger option #2, set config purge_buffer_if_throttled: true
pmenhart added a commit that referenced this pull request Oct 6, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true
pmenhart added a commit that referenced this pull request Oct 11, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true
* Also added delay after connect_timeout,
  and logging of successful flushes
pmenhart added a commit that referenced this pull request Oct 11, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true
* Also added delay after connect_timeout,
  and logging of successful flushes (only to CloudWatch)
pmenhart added a commit that referenced this pull request Oct 11, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true
* Also added delay after connection/timeout errors
* Added logging of successful flushes (temporary; only to CloudWatch)
pmenhart added a commit that referenced this pull request Oct 20, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true

* Added logging of successful flushes (only to CloudWatch backend).
  Useful for troubleshooting; currently commented out
pmenhart added a commit that referenced this pull request Oct 31, 2018
* This is an experimental code. Two options are explored:
  1. Remove log messages from the buffer (which delays the transfer implicitly).
     This is safer, but some messages are lost.
  2. Delay the transfer then re-try.
     Consequences are unknown. Risk of compromising system stability.
  Option #2 is used by default.
  To trigger option #1, set config purge_buffer_if_throttled: true

* Added logging of successful flushes (only to CloudWatch backend).
  Useful for troubleshooting; currently commented out
* Added heap limit, restricting the Logger process to a hardwired value
  of 32MiB (including message queue)
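
The heap limit described above would be set along these lines (my reading of the commit, not an excerpt; max_heap_size is given in words, and 32 MiB is about 4_194_304 words on a 64-bit VM):

# Hardwired safety limit on the logger backend process heap.
Process.flag(:max_heap_size, %{size: 4_194_304, kill: true, error_logger: true})
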
pmenhart force-pushed the master branch 11 times, most recently from f72ac7e to ff3f8f5 on April 29, 2021