writer: Events get discarded on raw datastore IO errors #29

Open
tomkooij opened this issue Sep 24, 2018 · 5 comments
@tomkooij
Member

Today we discovered that the writer had erroneously been running as root lately, creating raw datastore HDF5 files owned by root:root. A few days ago frome was physically moved to a new location and the server was restarted. The writer was then restarted as user www (as specified in the docs).

The writer, now running as user www, could not write to the raw datastore. All incoming data was dropped:

/var/log/hisparc/hisparc-log.writer
2018-09-24 00:00:05,473 writer.store_events[4758].store_event_list.ERROR: Cannot process event, discarding event (station: 8006)
2018-09-24 00:00:05,473 writer.store_events[4758].store_event_list.ERROR: Cannot process event, discarding event (station: 8006)

The code that generates this error:
https://github.com/HiSPARC/datastore/blob/master/writer/store_events.py#L127-L148

When store_events.store_event_list is unsuccessful, we still remove the incoming pickled data from the partial folder!

Solution: Only remove the pickle if process_data is successful: https://github.com/HiSPARC/datastore/blob/master/writer/writer_app.py#L73
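The proposed fix could look roughly like the sketch below. Note that `process_incoming` and its arguments are illustrative stand-ins, not the actual writer_app.py API: the point is that the unlink moves into the success path, so an IO error while storing leaves the pickle in the partial folder for a later retry instead of silently discarding it.

```python
import logging
import os
import pickle

logger = logging.getLogger(__name__)


def process_incoming(partial_dir, process_data):
    """Drain pickled event files, removing each pickle only on success.

    Hypothetical sketch of the proposed fix: if ``process_data`` raises
    (e.g. on a raw datastore IO error), the pickle is kept for a retry.
    """
    for name in sorted(os.listdir(partial_dir)):
        path = os.path.join(partial_dir, name)
        try:
            with open(path, 'rb') as f:
                data = pickle.load(f)
            process_data(data)  # may raise on raw datastore IO errors
        except Exception:
            logger.exception('Cannot process %s, keeping pickle for retry', path)
        else:
            os.remove(path)  # only discard the input once processing succeeded
```

With this structure, a writer that cannot write to the datastore (e.g. due to wrong file ownership) accumulates pickles in the partial folder rather than flushing data into /dev/null.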

@kaspervd
Contributor

This means we lost several days of raw data…

@153957
Member

153957 commented Sep 24, 2018

Have the permissions of the files incorrectly owned by root been fixed?

I guess this only affects 'old' data being uploaded now for the days during which root was active and initially created the data file. So only several days of raw data are lost for specific stations, plus perhaps a partial day for all stations on the day the writer was restarted as www.

@tomkooij
Member Author

Permissions have been fixed.

However, as these things go, several failures and maintenance events added up, and we lost about a week of data for all active stations:

  • On Tue Sep 18 the writer crashed because CT changed NFS settings on trave. The incoming folder filled up. Meanwhile, pique was being replaced, so we did not notice.
  • On Fri Sep 21, at the end of the day, frome was moved and had to be restarted in a hurry. From that moment on, all incoming data was lost within minutes, but we only discovered this the following Monday morning.
  • Only on Mon Sep 24 did we stop the writer, which was flushing all incoming packets into /dev/null.

@kaspervd
Contributor

Yes, that sums it up. On Friday (after the move and restart) I found out we no longer had NFS access, because all stations gave errors. I fixed that and all stations were correctly pushing data to frome. I started the writer as user www and everything seemed to work. Unfortunately, I didn't check the writer log…

I don't really remember why we decided to check the log yesterday, but luckily we did.

@davidfokkema
Member

davidfokkema commented Sep 25, 2018 via email
