
Pandas-based LMA datafile reader #2

Closed
vbalderdash opened this issue Feb 18, 2019 · 13 comments

@vbalderdash
Collaborator

vbalderdash commented Feb 18, 2019

A minimal, Pandas-based reader for LMA files would be useful. An example of such a method is below with a basic selection within the dataframe, which could be easily extended to other criteria.

Basic Pandas object from .dat.gz file

import pandas as pd
import gzip

class lmafile:
    def __init__(self,filename):
        self.file = filename

    def datastart(self):
        # Find the line number where the data section begins
        with gzip.open(self.file, 'rt') as f:
            for line_no, line in enumerate(f):
                if line.rstrip() == "*** data ***":
                    break
        return line_no

    def readfile(self):
        lmad = pd.read_csv(self.file, delim_whitespace=True, header=None,
                           compression='gzip', skiprows=self.datastart()+1)
        columns = ['time','lat','lon','alt','chi','num','p','charge','mask']
        lmad.columns = columns
        return lmad

Use example:

max_chi  = 1.0
min_station  = 7.0

lma_file = [.dat.gz data file]
lmad = ls.lmafile(lma_file).readfile()  # 'ls' is the module containing the lmafile class

tstart = [minute of day, but should probably be switched to datetime object]
tend   = [minute of day]

sources = lmad[(lmad['time'] >= tstart)  & (lmad['time'] < tend) & 
               (lmad['chi']  <= max_chi)   & (lmad['num'] >= min_station)]
@deeplycloudy
Owner

OK, this is really straightforward. So then some next things to add would be:

  1. Grab the start date from the datastart method; return a numpy datetime64 tstart, tend
  2. Convert the time column to a timedelta64: probably tsec = lmad['time'].astype('timedelta64[s]')
  3. Reassign the time column as tstart+tsec
  4. Calculate the station count from the mask using the function that's already in lmatools
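Steps 1–3 above could be sketched like this; note that `pd.to_timedelta` preserves fractional seconds, whereas `.astype('timedelta64[s]')` would truncate them. The start day and sample times here are hypothetical stand-ins for values parsed from the file header:

```python
import numpy as np
import pandas as pd

# Hypothetical start day parsed from the "Data start time:" header line
tstart = np.datetime64('2019-02-18')
# Sample UT seconds-of-day values like those in the data section
lmad = pd.DataFrame({'time': [12599.289137587, 12599.999582905]})

# Convert seconds-of-day to a timedelta and add to the start day
tsec = pd.to_timedelta(lmad['time'], unit='s')
lmad['datetime'] = pd.Timestamp(tstart) + tsec
```

12599.289 s of day works out to 03:29:59.289 UTC on the start day.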

@vbalderdash
Collaborator Author

Per number 4: the number of stations contributing to each source should already be archived in the LMA file, right? Or are you thinking of a selection based on the list of which stations were contributing?

@deeplycloudy
Owner

Not from all networks. OKLMA data I used back in the day had the station count, but you have to count the bits for the WTLMA data and others I've seen.

Data: time (UT sec of day), lat, lon, alt(m), reduced chi^2, P(dBW), mask
Data format: 15.9f 12.8f 13.8f 9.2f 6.2f 5.1f 5x
Number of events: 257422
*** data ***
12599.289137587   3.16553416 -113.07229594 257639475.42  47.17  70.4 0x364
12599.999582905  28.83720144 -112.52882903 746668.56  38.71  19.7 0x724
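For files like this, with no explicit station-count column, the count has to come from the set bits of the hex mask. A minimal standalone sketch (the real code would use the existing lmatools function):

```python
def station_count(mask_str):
    """Count the set bits in a hex station mask like '0x364'."""
    return bin(int(mask_str, 16)).count('1')

# e.g. 0x364 = 0b1101100100, i.e. 5 contributing stations
```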

@deeplycloudy
Owner

Given the differences in the two formats, it seems that we actually need to parse on the basis of the Data: line above.

@vbalderdash
Collaborator Author

Alright, I've been staring at the one format too long! That should be an easy adjustment.

@vbalderdash
Collaborator Author

Here's a snippet to automatically pull the column headers from the file header. I'm sure there's a more fundamental method to parse the text line from the zipped file, but this does get the job done.

def datastart(self):
    # Line number where "*** data ***" marks the start of the data section
    with open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.rstrip() == "*** data ***":
                break
    return line_no

def dataformat(self):
    # Pull the column names from the "Data:" line of the header
    with open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.startswith('Data:'):
                break
    names = pd.read_csv(self.file,
                        header=None, skiprows=line_no, nrows=1)
    names = names.values.tolist()[0]
    names[0] = names[0][6:]  # drop the leading "Data: "
    return [x.strip(' ') for x in names]

def dataformat_gz(self):
    with gzip.open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.startswith(b'Data:'):
                break
    names = pd.read_csv(self.file, compression='gzip',
                        header=None, skiprows=line_no, nrows=1)
    names = names.values.tolist()[0]
    names[0] = names[0][6:]  # drop the leading "Data: "
    return [x.strip(' ') for x in names]

def datastart_gz(self):
    with gzip.open(self.file, 'rt') as f:
        for line_no, line in enumerate(f):
            if line.rstrip() == "*** data ***":
                break
    return line_no

def readfile(self):
    if self.file.endswith('.dat.gz'):
        lmad = pd.read_csv(self.file, compression='gzip', delim_whitespace=True,
                           header=None, skiprows=self.datastart_gz()+1)
        columns = self.dataformat_gz()
    elif self.file.endswith('.dat'):
        lmad = pd.read_csv(self.file, delim_whitespace=True, header=None,
                           skiprows=self.datastart()+1)
        columns = self.dataformat()
    lmad.columns = columns
    return lmad

@vbalderdash
Collaborator Author

Added some pieces for pulling more of the header information via Pandas and separating out all of the stations in the mask using the mask_to_int function from lmatools. Time issues 1-3 are addressed in the addition of a new column called 'Datetime.'

Potential issue: header information (station locations, % contributions, etc.) is read in with a fixed-width format, since station names may contain white space. I'm unsure how consistent those widths are among existing file formats.

import datetime as dt
import gzip

import numpy as np
import pandas as pd

def mask_to_int(mask):
    """ Convert object array of mask strings to integers"""
    if len(mask.shape) == 0:
        mask_int = np.asarray([], dtype=int)
    else:
        try:
            # mask entries are plain integers
            mask_int = np.fromiter((int(v) for v in mask), int)
        except ValueError:
            # mask entries are strings representing base-16 (hex) numbers
            mask_int = np.fromiter((int(v,16) for v in mask), int)
    return mask_int

class lmafile:
    def __init__(self,filename):
        self.file = filename
        
        with gzip.open(self.file) as f: 
            for line_no, line in enumerate(f):
                if line.startswith(b'Data start time:'):
                    timestring = line.decode().split()[-2:]
                    self.startday = dt.datetime.strptime(timestring[0],'%m/%d/%y')
                    # Full start time and second, might be unneeded
                    # self.starttime = dt.datetime.strptime(timestring[0]+timestring[1],'%m/%d/%y%H:%M:%S')
                    # self.startsecond = (starttime-dt.datetime(starttime.year,starttime.month,starttime.day)).seconds
                # Find rows for station information
                if line.startswith(b'Station information:'):
                    self.station_info_start = line_no
                if line.startswith(b'Station data:'):
                    self.station_data_start = line_no
                if line.startswith(b'Metric file:'):
                    self.station_data_end = line_no
                # Find mask list
                if line.startswith(b'Station mask order:'): 
                    self.maskorder = line.decode().split()[-1]
                # Pull data header
                if line.startswith(b'Data:'): 
                    self.names = [x.strip(' ') for x in line.decode()[5:-1].split(",")]
                # Find start of the data
                if line.rstrip() == b"*** data ***":
                    break
        self.data_starts = line_no
        
        # Station overview information
        self.overview = pd.read_fwf(self.file,compression='gzip',
                colspecs=[[10,11],[13,30],[30,35],[35,43],[43,48],[48,56],[56,61],[61,68],[68,73]],
                names=['ID', 'Name','win(us)', 'dec_win(us)', 'data_ver', 'rms_error(ns)', 
                       'sources','<P/P_m>','active'],
                header=None,skiprows=self.station_data_start+1, 
                nrows=self.station_data_start-self.station_info_start-1)
        # Station Locations
        self.stations = pd.read_fwf(self.file,compression='gzip',
                                    colspecs=[[10,11],[13,32],[32,43],[44,56],[56,66],[66,70]],
                                    names=['ID', 'Name','Lat','Long','Alt','Delay Time'],
                                    header=None,skiprows=self.station_info_start+1, 
                                    nrows=self.station_data_start-self.station_info_start-1)
        
    def readfile(self):
        # Read in data
        lmad = pd.read_csv(self.file,compression='gzip',delim_whitespace=True,
                            header=None,skiprows=self.data_starts+1)
        lmad.columns = self.names
        
        # Convert seconds column to new datetime-formatted column
        lmad.insert(1,'Datetime',pd.to_timedelta(lmad['time (UT sec of day)'], unit='s')+self.startday)
        
        # Parse out which stations contributed into new columns for each station
        for index,items in enumerate(self.maskorder[::-1]):
            lmad.insert(8,items,(mask_to_int(lmad["mask"])>>index)%2)
        # Count the number of stations contributing and put in a new column
        lmad.insert(8,'Station Count',lmad[list(self.maskorder)].sum(axis=1))
        return lmad

@deeplycloudy
Owner

Nice, thanks! Could you add a complete version of your reader class in something like xlma-python/pyxlma/lmalib/io/read.py?

pyxlma will live at the same level as the main README, and will be the eventual package name. We'll probably need to rearrange/rename later. lmalib is to separate the plot-agnostic calculations and readers from any GUI or plotting stuff that makes use of data processed in lmalib.

@deeplycloudy
Owner

Regarding the header format, yeah – there's no Data format: 15.9f 12.8f 13.8f 9.2f 6.2f 5.1f 7x line like there is for the main data section. I think we just live with it. Maybe there are some simple checks that could be done to throw a warning / error if, say, lon lat for stations aren't close to the coord center.
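The suggested sanity check could look something like this sketch, where `stations` is the DataFrame from the fixed-width header parse and `ctr_lat`/`ctr_lon` are the network's coordinate center; the function name, column names, and tolerance are illustrative assumptions, not existing code:

```python
import warnings
import pandas as pd

def check_station_coords(stations, ctr_lat, ctr_lon, tol_deg=2.0):
    """Warn if any parsed station location is far from the network center,
    which usually indicates misaligned fixed-width column specs."""
    bad = stations[(stations['Lat'] - ctr_lat).abs().gt(tol_deg) |
                   (stations['Long'] - ctr_lon).abs().gt(tol_deg)]
    if len(bad) > 0:
        warnings.warn(f"{len(bad)} station(s) far from network center; "
                      "header parse may be misaligned")
    return bad
```

A garbled fixed-width parse tends to produce wildly wrong coordinates (or strings that fail to convert to floats at all), so even a loose tolerance should catch it.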

@vbalderdash
Collaborator Author

Given the lack of a format check, one has been added. Maybe a check on whether the values read in are floats or strings would work? Pandas can also infer the widths instead of having them specified; I haven't had luck with that method, but it might be worth troubleshooting.

@deeplycloudy
Owner

Cool, I just swapped this in to my glueviz example (now added, too) and it works great. I also added some minimal packaging stuff so pyxlma can be installed as a package. Thanks!

Technically we've fulfilled the basics of this issue. Are we ready to start splitting out separate issues for specific things like the header i/o? Or keep this as an open issue until it's a bit more mature?

@vbalderdash
Collaborator Author

It's probably fine to split it into smaller issues and work on the specific things along the way.

I just played around with the glueviz example; it looks like it will be useful!

@deeplycloudy
Owner

Closing this issue, having opened #6 to continue the discussion about the remaining issue about the header format.
