Problems with pyxlma_flash_sort_grid script #51

Open
gewitterblitz opened this issue Aug 26, 2024 · 2 comments


gewitterblitz commented Aug 26, 2024

I found a possible issue while trying to run examples/pyxlma_flash_sort_grid.py. The code breaks and throws an error at the line dataset, start_time = lma_read.dataset(paths_to_read) within the flash_sort_grid function. This occurs because the lmafile class calls the gen_sta_data function:

overview = pd.DataFrame(self.gen_sta_data(),
            columns=['ID','Name','win(us)', 'data_ver', 'rms_error(ns)',
                     'sources','percent','<P/P_m>','active'])

Apparently, this happens when some LYLOUT*.dat.gz files have an inconsistent number of columns in the station data table. For example, here's what I found in two different OKLMA files from the same day:

Notice how the second file doesn't contain any values corresponding to the dec_win(us) column header.

File content in LYLOUT_110524_000000_0600.dat.gz

Screenshot 2024-08-26 at 4 58 58 PM

File content in LYLOUT_110524_205000_0600.dat.gz

Screenshot 2024-08-26 at 5 00 37 PM
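
A quick way to find which files differ is to count the whitespace-separated tokens on each station data row. The following is a minimal sketch, not part of pyxlma: the helper names are hypothetical, and it assumes station data rows start with "Sta_data:" and that the header ends at a line starting with "*** data". Station names containing spaces also change the token count, so a mixed result within a single file is not necessarily a column mismatch.

```python
import gzip
from collections import Counter


def open_gzip_or_plain(path):
    """Open a file in binary mode, handling .gz transparently (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def station_row_token_counts(path, prefix=b"Sta_data:"):
    """Return a Counter mapping token count -> number of station data rows.

    Assumption: station data rows begin with `prefix`, and the header ends
    at the first line starting with b"*** data".
    """
    counts = Counter()
    with open_gzip_or_plain(path) as f:
        for line in f:
            if line.startswith(b"*** data"):
                break  # stop before the source data section
            if line.startswith(prefix):
                counts[len(line.split())] += 1
    return counts
```

Running station_row_token_counts on each LYLOUT file and comparing the results would show which files carry the extra column.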

I figured some flexibility in both gen_sta_data and gen_sta_info functions can deal with this inconsistency. For example, here's what worked for me:

def gen_sta_info(self):
    """ Parse the station info table from the header. Some files do not
    have fixed width columns, and station names may have spaces, so this
    function chops out the space-delimited columns to the left and right
    of the station names.
    """
    nstations = self.station_data_start-self.station_info_start-1
    with open_gzip_or_dat(self.file) as f:
        for i in range(self.station_info_start+1):
            line = next(f)
        for line_num in range(nstations):
            line = next(f)
            parts = line.decode("utf-8").split()

            if line_num == 0:
                slen = len(parts)

            if slen == 9:
                name = ' '.join(parts[2:-5])
                sta_info, code = parts[0:2]
                yield (code, name) + tuple(parts[-5:-1])

            elif slen == 10: # files with one extra station data column
                name = ' '.join(parts[2:-6])
                sta_info, code = parts[0:2]
                yield (code, name) + tuple(parts[-6:-2])

def gen_sta_data(self):
    """ Parse the station data table from the header. Some files do not
    have fixed width columns, and station names may have spaces, so this
    function chops out the space-delimited columns to the left and right
    of the station names.
    """
    nstations = self.station_data_start-self.station_info_start-1

    with open_gzip_or_dat(self.file) as f:
        for i in range(self.station_data_start+1):
            line = next(f)

        for line_num in range(nstations):
            line = next(f)
            parts = line.decode("utf-8").split()

            if line_num == 0:  # Calculate slen only for the first line
                slen = len(parts)

            if slen == 11:
                name = ' '.join(parts[2:-7])
                sta_info, code = parts[0:2]
                yield (code, name) + tuple(parts[-7:])

            elif slen == 12: # files with one extra station data column
                name = ' '.join(parts[2:-8])
                sta_info, code = parts[0:2]
                yield (code, name) + tuple(parts[-7:])

I could run the flash_sort script after these modifications, but it was quite slow compared to simply running lmatools. Ingesting too many files at once crashed the kernel with out-of-memory errors in the xarray data handling. I am not sure whether this script is still a WIP or is meant to replace lmatools eventually, but at the time of testing it did not offer any advantage over good old lmatools' processing speed. I'd love to hear what @deeplycloudy or @wx4stg have to say. Happy to be corrected, of course!


wx4stg commented Aug 27, 2024

I haven't tested the issue yet, but I believe this is largely what #42 is designed to address: handling datasets where the network changes configuration across files. I haven't tested that draft against an inconsistent number of columns across files, only inconsistent column data, but I'll be sure to do so before it gets merged. (Given my current schedule, that PR being polished and merged is probably a ways out, so it might be good to look at fixing this now as a temporary measure.)

As noted in my rambling comments on that PR, lma_analysis outputs a dat file with one more column header in the Sta_data table than there are columns in the following lines describing the actual station data. This is why gen_sta_data uses hardcoded indices for the station information instead of reading the header, which IMO would be the more 'correct' approach. Eric and I discussed a while ago how to resolve this, and the 'solution' we came to is "read the header, check that there are enough data columns to match that header, and if there's one fewer, silently drop the extra column from the header, then proceed, maybe raising a warning or maybe not". If you have any thoughts on this, I'm very open to suggestions, as I don't feel great about this.
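
The header-reconciliation idea described above could look roughly like the sketch below. This is an illustration under stated assumptions, not code from pyxlma or #42: the function name is hypothetical, the column to drop is assumed to be dec_win(us) (the one reported missing earlier in this issue), and the caller is assumed to already know how many data fields a row holds, which is itself nontrivial when station names contain spaces.

```python
import warnings


def reconcile_station_header(header_cols, n_values,
                             extra_col="dec_win(us)", warn=True):
    """Return the header column names that match `n_values` data fields.

    Hypothetical helper. If the header has exactly one more name than the
    row has fields and `extra_col` is present, drop `extra_col`, optionally
    warning. Any other mismatch raises ValueError rather than guessing.
    """
    header_cols = list(header_cols)
    if len(header_cols) == n_values:
        return header_cols
    if len(header_cols) == n_values + 1 and extra_col in header_cols:
        if warn:
            warnings.warn(
                f"Station data header lists {extra_col!r} but rows have no "
                "matching field; dropping it from the header."
            )
        return [c for c in header_cols if c != extra_col]
    raise ValueError(
        f"Header has {len(header_cols)} columns but rows have {n_values} fields"
    )
```

Raising on any other mismatch keeps the silent drop limited to the one known quirk, so a new kind of malformed header would still surface as an error.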

Regarding out-of-memory failures, one thing I haven't added yet (but which would be relatively trivial) is to let lma_read.dataset accept **xarray_kwargs, with the additional arguments passed through to xarray.open_dataset, so that if you run out of memory you could specify chunks='auto' to load the data lazily.

I have done some benchmarking of pyxlma_flash_sort_grid vs lmatools and I remember it being faster, but I think I just compared script runtimes, and lmatools' version also generates the PDF files with the plots. This was a while ago, so retesting with plotting disabled in lmatools would be a good idea. I believe the goal is to eventually replace lmatools, but clearly we aren't there yet.

@gewitterblitz

Thanks for sharing your thoughts, @wx4stg. I'll keep an eye on the updates!
