
Pandas-based LMA datafile reader #2

Closed
vbalderdash opened this issue Feb 18, 2019 · 13 comments

@vbalderdash
Collaborator

vbalderdash commented Feb 18, 2019

A minimal, Pandas-based reader for LMA files would be useful. An example of such a method is below with a basic selection within the dataframe, which could be easily extended to other criteria.

Basic Pandas object from .dat.gz file

import pandas as pd
import gzip

class lmafile:
    def __init__(self,filename):
        self.file = filename

    def datastart(self):
        # Find the line number where the data section begins
        with gzip.open(self.file, 'rt') as f:
            for line_no, line in enumerate(f):
                if line.rstrip() == "*** data ***":
                    break
        return line_no

    def readfile(self):
        lmad = pd.read_csv(self.file, delim_whitespace=True, header=None,
                           compression='gzip', skiprows=self.datastart()+1)
        columns = ['time','lat','lon','alt','chi','num','p','charge','mask']
        lmad.columns = columns
        return lmad

Use example:

max_chi  = 1.0
min_station  = 7.0

lma_file = [.dat.gz data file]
lmad = ls.lmafile(lma_file).readfile()  # 'ls' is the module containing the lmafile class

tstart = [minute of day, but should probably be switched to datetime object]
tend   = [minute of day]

sources = lmad[(lmad['time'] >= tstart)  & (lmad['time'] < tend) & 
               (lmad['chi']  <= max_chi)   & (lmad['num'] >= min_station)]
@deeplycloudy
Owner

OK, this is really straightforward. So then some next things to add would be:

  1. Grab the start date from the datastart method; return a numpy datetime64 tstart, tend
  2. Convert the time column to a timedelta64: probably tsec = lmad['time'].astype('timedelta64[s]')
  3. Reassign the time column as tstart+tsec
  4. Calculate the station count from the mask using the function that's already in lmatools
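Steps 1–3 above could be sketched like this; note that `pd.to_timedelta` preserves fractional seconds, whereas `.astype('timedelta64[s]')` would truncate them. The start day and sample times here are hypothetical stand-ins for values parsed from the file header:

```python
import numpy as np
import pandas as pd

# Hypothetical start day parsed from the "Data start time:" header line
tstart = np.datetime64('2019-02-18')
# Sample UT seconds-of-day values like those in the data section
lmad = pd.DataFrame({'time': [12599.289137587, 12599.999582905]})

# Convert seconds-of-day to a timedelta and add to the start day
tsec = pd.to_timedelta(lmad['time'], unit='s')
lmad['datetime'] = pd.Timestamp(tstart) + tsec
```

12599.289 s of day works out to 03:29:59.289 UTC on the start day.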

@vbalderdash
Collaborator Author

Per number 4: the number of stations contributing to each source should already be archived in the LMA file, right? Or are you thinking of a selection based on the list of which stations were contributing?

@deeplycloudy
Owner

Not from all networks. OKLMA data I used back in the day had the station count, but you have to count the bits for the WTLMA data and others I've seen.

Data: time (UT sec of day), lat, lon, alt(m), reduced chi^2, P(dBW), mask
Data format: 15.9f 12.8f 13.8f 9.2f 6.2f 5.1f 5x
Number of events: 257422
*** data ***
12599.289137587   3.16553416 -113.07229594 257639475.42  47.17  70.4 0x364
12599.999582905  28.83720144 -112.52882903 746668.56  38.71  19.7 0x724
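For files like this, with no explicit station-count column, the count has to come from the set bits of the hex mask. A minimal standalone sketch (the real code would use the existing lmatools function):

```python
def station_count(mask_str):
    """Count the set bits in a hex station mask like '0x364'."""
    return bin(int(mask_str, 16)).count('1')

# e.g. 0x364 = 0b1101100100, i.e. 5 contributing stations
```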

@deeplycloudy
Owner

Given the differences in the two formats, it seems that we actually need to parse on the basis of the Data: line above.

@vbalderdash
Collaborator Author

Alright, I've been staring at the one format too long! That should be an easy adjustment.

@vbalderdash
Collaborator Author

Here's a snippet to automatically pull the column headers from the file header. I'm sure there's a more fundamental method to parse the text line from the zipped file, but this does get the job done.

def datastart(self):
    # Line number where "*** data ***" marks the start of the data section
    with open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.rstrip() == "*** data ***":
                break
    return line_no

def dataformat(self):
    # Pull the column names from the "Data:" line of the header
    with open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.startswith('Data:'):
                break
    names = pd.read_csv(self.file,
                        header=None, skiprows=line_no, nrows=1)
    names = names.values.tolist()[0]
    names[0] = names[0][6:]  # drop the leading "Data: "
    return [x.strip(' ') for x in names]

def dataformat_gz(self):
    with gzip.open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.startswith(b'Data:'):
                break
    names = pd.read_csv(self.file, compression='gzip',
                        header=None, skiprows=line_no, nrows=1)
    names = names.values.tolist()[0]
    names[0] = names[0][6:]  # drop the leading "Data: "
    return [x.strip(' ') for x in names]

def datastart_gz(self):
    with gzip.open(self.file, 'rt') as f:
        for line_no, line in enumerate(f):
            if line.rstrip() == "*** data ***":
                break
    return line_no

def readfile(self):
    if self.file.endswith('.dat.gz'):
        lmad = pd.read_csv(self.file, compression='gzip', delim_whitespace=True,
                           header=None, skiprows=self.datastart_gz()+1)
        columns = self.dataformat_gz()
    elif self.file.endswith('.dat'):
        lmad = pd.read_csv(self.file, delim_whitespace=True, header=None,
                           skiprows=self.datastart()+1)
        columns = self.dataformat()
    lmad.columns = columns
    return lmad

@vbalderdash
Collaborator Author

Added some pieces for pulling more of the header information via Pandas and separating out all of the stations in the mask using the mask_to_int function from lmatools. Time issues 1-3 are addressed in the addition of a new column called 'Datetime.'

Potential issue: header information (station locations, % contributions, etc.) is read in with a fixed-width format, since station names may contain white space. I'm unsure how consistent those widths are among existing file formats.

import datetime as dt
import gzip

import numpy as np
import pandas as pd

def mask_to_int(mask):
    """ Convert object array of mask strings to integers"""
    if len(mask.shape) == 0:
        mask_int = np.asarray([], dtype=int)
    else:
        try:
            # mask entries are plain integers
            mask_int = np.fromiter((int(v) for v in mask), int)
        except ValueError:
            # mask entries are strings representing base-16 (hex) numbers
            mask_int = np.fromiter((int(v,16) for v in mask), int)
    return mask_int

class lmafile:
    def __init__(self,filename):
        self.file = filename
        
        with gzip.open(self.file) as f: 
            for line_no, line in enumerate(f):
                if line.startswith(b'Data start time:'):
                    timestring = line.decode().split()[-2:]
                    self.startday = dt.datetime.strptime(timestring[0],'%m/%d/%y')
                    # Full start time and second, might be unneeded
                    # self.starttime = dt.datetime.strptime(timestring[0]+timestring[1],'%m/%d/%y%H:%M:%S')
                    # self.startsecond = (starttime-dt.datetime(starttime.year,starttime.month,starttime.day)).seconds
                # Find rows for station information
                if line.startswith(b'Station information:'):
                    self.station_info_start = line_no
                if line.startswith(b'Station data:'):
                    self.station_data_start = line_no
                if line.startswith(b'Metric file:'):
                    self.station_data_end = line_no
                # Find mask list
                if line.startswith(b'Station mask order:'): 
                    self.maskorder = line.decode().split()[-1]
                # Pull data header
                if line.startswith(b'Data:'): 
                    self.names = [x.strip(' ') for x in line.decode()[5:-1].split(",")]
                # Find start of the data
                if line.rstrip() == b"*** data ***":
                    break
        self.data_starts = line_no
        
        # Station overview information
        self.overview = pd.read_fwf(self.file,compression='gzip',
                colspecs=[[10,11],[13,30],[30,35],[35,43],[43,48],[48,56],[56,61],[61,68],[68,73]],
                names=['ID', 'Name','win(us)', 'dec_win(us)', 'data_ver', 'rms_error(ns)', 
                       'sources','<P/P_m>','active'],
                header=None,skiprows=self.station_data_start+1, 
                nrows=self.station_data_start-self.station_info_start-1)
        # Station Locations
        self.stations = pd.read_fwf(self.file,compression='gzip',
                                    colspecs=[[10,11],[13,32],[32,43],[44,56],[56,66],[66,70]],
                                    names=['ID', 'Name','Lat','Long','Alt','Delay Time'],
                                    header=None,skiprows=self.station_info_start+1, 
                                    nrows=self.station_data_start-self.station_info_start-1)
        
    def readfile(self):
        # Read in data
        lmad = pd.read_csv(self.file,compression='gzip',delim_whitespace=True,
                            header=None,skiprows=self.data_starts+1)
        lmad.columns = self.names
        
        # Convert seconds column to new datetime-formatted column
        lmad.insert(1,'Datetime',pd.to_timedelta(lmad['time (UT sec of day)'], unit='s')+self.startday)
        
        # Parse out which stations contributed into new columns for each station
        for index,items in enumerate(self.maskorder[::-1]):
            lmad.insert(8,items,(mask_to_int(lmad["mask"])>>index)%2)
        # Count the number of stations contributing and put in a new column
        lmad.insert(8,'Station Count',lmad[list(self.maskorder)].sum(axis=1))
        return lmad

@deeplycloudy
Owner

Nice, thanks! Could you add a complete version of your reader class in something like xlma-python/pyxlma/lmalib/io/read.py?

pyxlma will live at the same level as the main README, and will be the eventual package name. We'll probably need to rearrange/rename later. lmalib is to separate the plot-agnostic calculations and readers from any GUI or plotting stuff that makes use of data processed in lmalib.

@deeplycloudy
Owner

Regarding the header format, yeah – there's no Data format: 15.9f 12.8f 13.8f 9.2f 6.2f 5.1f 7x line like there is for the main data section. I think we just live with it. Maybe there are some simple checks that could be done to throw a warning / error if, say, lon lat for stations aren't close to the coord center.
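The suggested sanity check could look something like this sketch, where `stations` is the DataFrame from the fixed-width header parse and `ctr_lat`/`ctr_lon` are the network's coordinate center; the function name, column names, and tolerance are illustrative assumptions, not existing code:

```python
import warnings
import pandas as pd

def check_station_coords(stations, ctr_lat, ctr_lon, tol_deg=2.0):
    """Warn if any parsed station location is far from the network center,
    which usually indicates misaligned fixed-width column specs."""
    bad = stations[(stations['Lat'] - ctr_lat).abs().gt(tol_deg) |
                   (stations['Long'] - ctr_lon).abs().gt(tol_deg)]
    if len(bad) > 0:
        warnings.warn(f"{len(bad)} station(s) far from network center; "
                      "header parse may be misaligned")
    return bad
```

A garbled fixed-width parse tends to produce wildly wrong coordinates (or strings that fail to convert to floats at all), so even a loose tolerance should catch it.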

@vbalderdash
Collaborator Author

Given the lack of a format check, one has been added. Maybe a check on whether the values read in are floats or strings would work? Pandas can also infer the widths instead of having them specified; I haven't had luck with that method, but it might be worth troubleshooting.

@deeplycloudy
Owner

Cool, I just swapped this in to my glueviz example (now added, too) and it works great. I also added some minimal packaging stuff so pyxlma can be installed as a package. Thanks!

Technically we've fulfilled the basics of this issue. Are we ready to start splitting out separate issues for specific things like the header i/o? Or keep this as an open issue until it's a bit more mature?

@vbalderdash
Collaborator Author

It's probably fine to split it into smaller issues and work on the specific things along the way.

I just played around with the glueviz example; it looks like it will be useful!

@deeplycloudy
Owner

Closing this issue, having opened #6 to continue the discussion about the remaining issue about the header format.
