feat: Process TSV files as streams and validate only the first 1000 rows by default #139

effigies · 2025-01-09T19:40:33Z

This PR is an optimization. In #138 we found a TSV file with more than 300k lines. To limit the number of lines being inspected, I needed to make TSV loading stream-based instead of slurping the entire file into memory.

This PR does the following:

- Refactors the UTF8 enforcement of BIDSFileDeno.text into a stream transformer in files/stream.ts
- Rewrites loadTSV to process files as a stream.
  - UTF8 validation
  - Rechunk as lines of text, splitting on \r?\n
- Rewrites column loading as an array of pre-allocated arrays for efficiency. The ColumnsMap is constructed at the end.
- Adds a --max-rows flag to the CLI and a maxRows variable to validator options.
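
For reviewers who want a concrete picture of the rechunking step, here is a minimal sketch of splitting incoming text chunks on \r?\n and stopping once maxRows lines have been collected. It is not the code in src/files/tsv.ts: LineSplitter and readRows are hypothetical names, the input is assumed to be already-decoded strings, and the row limit here counts every line, including the header.

```typescript
// Hypothetical helper for illustration; not the implementation in this PR.
class LineSplitter {
  private buffer = ''

  // Accept a chunk of decoded text and return every complete line it closes.
  push(chunk: string): string[] {
    const parts = (this.buffer + chunk).split(/\r?\n/)
    // The final piece has no terminator yet, so hold it back for the next chunk.
    // This also covers a chunk that ends in '\r' with its '\n' in the next chunk.
    this.buffer = parts.pop() ?? ''
    return parts
  }

  // Return any unterminated trailing text once the input is exhausted.
  flush(): string[] {
    const rest = this.buffer
    this.buffer = ''
    return rest === '' ? [] : [rest]
  }
}

// Collect lines from a sequence of chunks, stopping after maxRows lines.
// A negative maxRows means "no limit", mirroring the -1 default in the diff.
function readRows(chunks: string[], maxRows: number = -1): string[] {
  const rows: string[] = []
  if (maxRows === 0) return rows
  const splitter = new LineSplitter()
  for (const chunk of chunks) {
    for (const line of splitter.push(chunk)) {
      rows.push(line)
      // Stop early so the remaining chunks are never inspected.
      if (rows.length === maxRows) return rows
    }
  }
  for (const line of splitter.flush()) {
    rows.push(line)
  }
  return rows
}
```

The same push/flush pair could presumably back the transform and flush callbacks of a TransformStream, which is the shape a stream-based loader would need.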

Note that this adds a new error condition: empty lines are tolerated only at the end of a file (<content><LF><EOF>), and an empty line anywhere else is reported as an error. As a side effect, this lets us report the line number of bad TSV lines.
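
To illustrate the empty-line rule together with the pre-allocated column arrays, here is a rough sketch under stated assumptions. rowsToColumns is a hypothetical function, a plain Map stands in for ColumnsMap, the error messages are invented for the example, and the input is assumed to be the full list of lines with a final empty string representing the trailing newline.

```typescript
// Hypothetical sketch; the names and error text are assumptions, not the validator's API.
function rowsToColumns(lines: string[]): Map<string, string[]> {
  const rows = lines.slice()
  // Tolerate one empty line only at the very end (<content><LF><EOF>).
  if (rows.length > 0 && rows[rows.length - 1] === '') rows.pop()
  if (rows.length === 0) return new Map<string, string[]>()

  const headers = rows[0].split('\t')
  const nRows = rows.length - 1
  // One pre-allocated array per column, filled in place as rows are parsed.
  const columns: string[][] = headers.map(() => new Array<string>(nRows))

  for (let i = 1; i < rows.length; i++) {
    const lineNumber = i + 1 // 1-based, counting the header as line 1
    if (rows[i].trim() === '') {
      throw new Error('Empty line at line ' + lineNumber)
    }
    const values = rows[i].split('\t')
    if (values.length !== headers.length) {
      throw new Error(
        'Expected ' + headers.length + ' columns at line ' + lineNumber + ', found ' + values.length,
      )
    }
    for (let j = 0; j < headers.length; j++) {
      columns[j][i - 1] = values[j]
    }
  }

  // Build the header-to-column mapping only once, at the end.
  return new Map(headers.map((h, j): [string, string[]] => [h, columns[j]]))
}
```

Because each row is handled with its index in hand, attaching a line number to an error costs nothing extra.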

I also do not attempt to add maxRows to the TSV cache key, so calling loadTSV() successively on the same file with different maxRows values will return the result from the first call. This does not seem like a problem in terms of running the validator, but might be surprising to future developers. I can look into that, if desired.
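
If the cache behavior does turn out to matter, one possible direction is to fold maxRows into the key. The sketch below is only an illustration of that idea: memoizeByPathAndRows is a hypothetical helper, it keys on a path string rather than a BIDSFile, and it is synchronous, unlike the real loader.

```typescript
// Hypothetical helper; the validator's actual cache is not shown in this PR text.
type Loader<T> = (path: string, maxRows: number) => T

function memoizeByPathAndRows<T>(load: Loader<T>): Loader<T> {
  const cache = new Map<string, T>()
  return (path: string, maxRows: number): T => {
    // Include maxRows in the key so different limits never share a result.
    const key = JSON.stringify([path, maxRows])
    if (cache.has(key)) return cache.get(key) as T
    const value = load(path, maxRows)
    cache.set(key, value)
    return value
  }
}
```

The trade-off is that the same file may be parsed more than once when callers ask for different limits.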

Closes #138.

effigies · 2025-01-09T19:41:39Z

src/files/tsv.ts

-const normalizeEOL = (str: string): string => str.replace(/\r\n/g, '\n').replace(/\r/g, '\n')
-// Typescript resolved `row && !/^\s*$/.test(row)` as `string | boolean`
-const isContentfulRow = (row: string): boolean => !!(row && !/^\s*$/.test(row))
+async function _loadTSV(file: BIDSFile, maxRows: number = -1): Promise<ColumnsMap> {


This file in particular will be easiest to review just by reading the new file, as the contents are almost entirely new.

effigies added 5 commits January 9, 2025 13:39

rf: Abstract out UTF8-enforcing stream handling

b6dfc51

rf: Rewrite loadTSV to accept maxRows

e6ffb65

test: Add loadTSV tests

487fa3a

feat(cli): Add --max-rows flag

d94b0c1

test: createUTF8Stream

f76973d

effigies added 4 commits January 9, 2025 16:27

feat: Add issue text for TSV_EMPTY_LINE

ab1550e

fix(tests): Pass TSV data as stream in regression test

fecaf42

fix: Replace leaky dummy stream

505fc87

chore(build): deno.lock

cfe5120

effigies force-pushed the feat/tsv-maxrows branch from ef7442a to cfe5120 Compare January 9, 2025 21:27

fix: Release reader lock on exceptions

9dc329a
