Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing spool size results in truncated lines #252

Open
jlintz opened this issue Aug 21, 2014 · 14 comments
Open

Changing spool size results in truncated lines #252

jlintz opened this issue Aug 21, 2014 · 14 comments

Comments

@jlintz
Copy link

jlintz commented Aug 21, 2014

I originally had lsf running with a spool size of 100, and went to adjust it back to its default of 1024. After changing it back and puppet bouncing the service, all lines were being truncated that were getting sent to the indexers. After poking around a bit, I found that after I removed the .logstash-forwarder file, things resolved themselves.

Is this expected behavior when increasing the spool size? If so, is there a way to handle this gracefully?

@driskell
Copy link
Contributor

Hi @jlintz what do you mean by "truncated". Can you give an example? Was it all lines or just specific ones? Spool size shouldn't affect this so its more likely just down to the restart and some incorrect offset or something inside the ".logstash-forwarder" file.

@jlintz
Copy link
Author

jlintz commented Aug 21, 2014

I was getting pieces of syslog lines, which was causing grok to barf, I need to adjust my settings since I wasnt having grok keep the raw message. It was all lines btw, I got into a very weird state last night where the bad data was causing the logstash indexers to crash and basically had to stop the world, remove the .logstash-forwarder files, and then bring everything back up for it to recover. When looking at the .logstash-forwarder file after the spool change and when the problem was occurring, i didn't see it updating the file with new offsets

@jlintz
Copy link
Author

jlintz commented Aug 22, 2014

Just ran into this again. It seems its not related to changing of the spool size as much as a restart of LSF across sytems? When I bumped the version for LSF to apply your fix @driskell , my logs started coming in out of order and corrupt. Getting parts of lines , it seems the new line delimiting gets corrupt.

Not exactly sure how this is occurring. It will start sending the corrupt lines, then the indexers seems to lock up. Last time my only path of recovery was doing a full stop of all systems, and then bringing them back up slowly

Going to try and do more debugging this time and see if theres additional information I can get

@driskell
Copy link
Contributor

If you can grab one of the lines that gets sent, and what it should have been, that might help. After the first restart - there is possibility it coughs a single line that's half, but all following lines should be fine.

Check out #249 too as it has an offset fix for resume caused by a bad merge earlier.

@jlintz
Copy link
Author

jlintz commented Aug 22, 2014

@driskell yes, this started after I applied that fix and in mass did a restart of LSF ;)

Here are some Example lines
2014-08-22T15:43:01.105646-04:00 nagios01 CRON[29849]: pam_unix(cron:session): session opened for user ubuntu by (uid=0)

and logstash just outputs
(uid=0)

It's different parts of the line for each hosts and log file, some , the datetimstamp is truncated. I saw parts of another line make its way into the next line. Unsure if its LSF or logstash itself

@jlintz
Copy link
Author

jlintz commented Aug 22, 2014

It appears this may be a stampeding herd affect of doing a mass restart of LSF. Perhaps overwhelming the indexers with connections. LSF starts logging errors about connecting and the memory climbs on the indexers till they become unresponsive.

@driskell
Copy link
Contributor

I managed to get my old test suite running on the current branch of logstash-forwarder with the PR #249 merged, and for resume tests it passed so it looks fine.

But yes. The stampede problem, that will be the indexer crash. That's the #180 and the Too many open files problem reported in https://logstash.jira.com/browse/LOGSTASH-2168 and https://logstash.jira.com/browse/LOGSTASH-1253 . See #180 for background, basically connections block waiting for logstash queue to clear, but its so slow LSF reconnects, but old connection does not clear until queue clears, thus it builds up quickly and crashes. Could be something else... but this the only crash I know of really. Rather than patching left right and centre it may be worth you looking at http://github.com/driskell/log-courier which is where all my patches and work ended up.

Back to the corrupt lines... maybe... if its lots of forwarders... if each sends ONE single corrupt line, that's what you see come in, it sends normal lines after. But because of stampede and the #180 issue everything is reconnecting all the time and resending all those logs again, including the first corrupt one. Could this be possible or are all lines corrupt?

@jlintz
Copy link
Author

jlintz commented Aug 22, 2014

It appears all lines are corrupt.

I'll poke around those issues thanks, maybe worth running the latest dev version to see if issues are resolved. We're still in testing phase of our deployment so blowing away data and starting over is fine for now

@jlintz
Copy link
Author

jlintz commented Aug 23, 2014

It appears bumping the network_timeout on LSF has alleviated/resolved(?) this issue. I moved it from the default to 120 seconds and was able to have puppet perform a reload of the services without bringing down the indexers.

@alphazero
Copy link
Contributor

@jlintz That's probably masking an underlying issue. But thank you for the effective workaround. We should at least determine what the actual issue is here. If the scope of a fix turns out to be substantial, we'll probably just advise to remedy with timeout tweaks.

@driskell the the next iteration of LS/F (with an entirely different approach to moving data) is specifically designed to address flow regulation issues. Will update with a heads-up to the community for review and feedback after we're through with our own internal review cycle.

@jlintz
Copy link
Author

jlintz commented Aug 25, 2014

@alphazero yea I agree, thanks for following up

@alphazero
Copy link
Contributor

@jlintz np.

@driskell
Copy link
Contributor

@alphazero The specific issue is at https://github.com/elasticsearch/logstash-forwarder/blob/master/lib/lumberjack/server.rb#L241 which calls into the block specified by the Logstash plugin at https://github.com/elasticsearch/logstash/blob/master/lib/logstash/inputs/lumberjack.rb#L46

In the Logstash plugin, note line 50 where the event is pushed onto the Logstash pipeline. If the pipeline is very busy and you have say, 50 clients attached, this can block for a period of time since the queue is only 10 in size IIRC. If the LSF network timeout is reached while this call blocks, LSF will reconnect - but the old connection will not tear down until the call unblocks - a connection exception won't occur until the subsequent attempt to ACK events which calls a send. So you can see here you'd duplicate events too.

I created a spooler on server side originally, and wrapped Timeout.timeout around the queue. The second way is to write a custom queue - taking the Ruby SizedQueue and taking advantage of the mutex wait timeout, since I can't find a builtin version that lets you use it.

@driskell
Copy link
Contributor

@alphazero Oh - and even if you did wrap around a timeout - you'd then NEED to send an ACK of some sort to prevent LSF network timeout. But you can't ACK all events since not all are queued. The protocol does allow a partial ACK... however, the LSF publisher1.go does not handle it.

All this is in #180 - with the Timeout.timeout approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants