writer: Events get discarded on raw datastore IO errors #29

Open
tomkooij opened this issue Sep 24, 2018 · 5 comments
@tomkooij
Member

Today we discovered that the writer had erroneously been running as root lately, creating raw datastore HDF5 files owned by root:root. A few days ago frome was physically moved to a new location and the server was restarted. The writer was then restarted as user www (as specified in the docs).

The writer, now running as user www, could not write to the raw datastore. All incoming data was dropped:

/var/log/hisparc/hisparc-log.writer
2018-09-24 00:00:05,473 writer.store_events[4758].store_event_list.ERROR: Cannot process event, discarding event (station: 8006)
2018-09-24 00:00:05,473 writer.store_events[4758].store_event_list.ERROR: Cannot process event, discarding event (station: 8006)

The code that generates this error:
https://github.com/HiSPARC/datastore/blob/master/writer/store_events.py#L127-L148

When store_events.store_event_list is unsuccessful, we still remove the incoming pickled data from the partial folder!

Solution: Only remove the pickle if process_data is successful: https://github.com/HiSPARC/datastore/blob/master/writer/writer_app.py#L73
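The proposed fix could look roughly like the sketch below. Note that `process_incoming` and its arguments are illustrative stand-ins, not the actual writer_app.py API: the point is that the unlink moves into the success path, so an IO error while storing leaves the pickle in the partial folder for a later retry instead of silently discarding it.

```python
import logging
import os
import pickle

logger = logging.getLogger(__name__)


def process_incoming(partial_dir, process_data):
    """Drain pickled event files, removing each pickle only on success.

    Hypothetical sketch of the proposed fix: if ``process_data`` raises
    (e.g. on a raw datastore IO error), the pickle is kept for a retry.
    """
    for name in sorted(os.listdir(partial_dir)):
        path = os.path.join(partial_dir, name)
        try:
            with open(path, 'rb') as f:
                data = pickle.load(f)
            process_data(data)  # may raise on raw datastore IO errors
        except Exception:
            logger.exception('Cannot process %s, keeping pickle for retry', path)
        else:
            os.remove(path)  # only discard the input once processing succeeded
```

With this structure, a writer that cannot write to the datastore (e.g. due to wrong file ownership) accumulates pickles in the partial folder rather than flushing data into /dev/null.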

@kaspervd
Contributor

This means we lost several days of raw data…

@153957
Member

153957 commented Sep 24, 2018

Have the permissions of the files incorrectly owned by root been fixed?

I guess this only affects 'old' data being uploaded now for the days during which root was active and initially created the data file. So only several days of raw data are lost for specific stations, plus perhaps a partial day for all stations on the day the writer was restarted as www.

@tomkooij
Member Author

Permissions have been fixed.

However, as these things go, several failures and maintenance events added up, and we lost about a week of data for all active stations:

  • On Tue Sep 18 the writer crashed because CT changed NFS settings on trave. The incoming folder filled up. Meanwhile, pique was being replaced, so we did not notice.
  • On Fri Sep 21, at the end of the day, frome was moved and had to be restarted in a hurry. From that moment on, all incoming data was lost within minutes, but we only discovered this the following Monday morning.
  • Only on Mon Sep 24 did we stop the writer, which was flushing all incoming packets into /dev/null.

@kaspervd
Contributor

Yes, that sums it up. On Friday (after the move and restart) I found out we no longer had NFS access, because all stations gave errors. I fixed that and all stations were correctly pushing data to frome. I started the writer as user www and everything seemed to work. Unfortunately, I didn't check the writer log…

I don't really remember why we decided to check the log yesterday, but luckily we did.

@davidfokkema
Member

davidfokkema commented Sep 25, 2018 via email
