The primary purpose of the `streambyter` library is to efficiently iterate over a large number of files/streams and execute a regex against each. I created this library out of a need to quickly iterate over thousands of large JSON files and extract a single piece of text from each. The brute-force way to do this is to download each JSON or text file, run a regex on each file, and then return the results. The problems with this approach are:
- Each file must be downloaded in full (speed, bandwidth, and memory cost)
- The regex is run against the full file (speed cost)
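To make those costs concrete, here is a minimal sketch of that brute-force approach using only Node's built-in `fs/promises` (the file paths and regex are illustrative placeholders):

```ts
import { readFile } from 'fs/promises';

// Brute force: every file is read into memory in full before the regex runs.
const filePaths = ['/path/to/some1.json', '/path/to/some2.json'];
const regex = /"foo":"(?<foo>.*?)"/;

const results = await Promise.all(
  filePaths.map(async (path) => {
    const contents = await readFile(path, 'utf8'); // full read: bandwidth + memory cost
    return { path, result: regex.exec(contents)?.groups }; // regex scans the whole file: speed cost
  }),
);
```

Every byte of every file is read before any matching happens, even when the match sits in the first few bytes.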
And this is where `streambyter` comes in. You can use this library to efficiently execute a regex (testing or extracting match groups) against many files, locally or in the cloud. This library doesn't care where the stream is located, just as long as it's a stream.
```sh
npm i -S streambyter
```
In the example below, a file path is provided along with a regex containing named capture groups, and those groups are extracted and returned as a dictionary.
```ts
import { regexGroupPathReader } from 'streambyter';

// Assume there is a `file` with contents: '{"foo":"Hello","bar":"World", /* more content */}'
const filePath = '/path/to/some.json';
const regex = /"foo":"(?<foo>.*?)","bar":"(?<bar>.*?)"/;
const result = await regexGroupPathReader({ path: filePath }, regex);
console.log(result); // prints { path: '/path/to/some.json', result: { foo: "Hello", bar: "World" }}
```
In this example a stream is provided along with a regex containing named capture groups, and those groups are extracted as a dictionary. Note that here you have to create the stream yourself, but the benefit is that you have full control over the stream's options, such as manually setting the `highWaterMark`. Why might you want to do this? Maybe you know for a fact that the data you want is in the first 100 bytes of the JSON; then you'd want to set the `highWaterMark` to `100`, since the `streambyter` library will close the stream after the first match. Note that in the above `regexGroupPathReader` the stream is created with `{ highWaterMark: 512 }` by default.
```ts
import { regexGroupStreamReader } from 'streambyter';
import { createReadStream } from 'fs';

// Assume there is a `file` with contents: '{"foo":"Hello","bar":"World", /* more content */}'
const filePath = '/path/to/some.json';
const regex = /"foo":"(?<foo>.*?)","bar":"(?<bar>.*?)"/;
const result = await regexGroupStreamReader({ stream: createReadStream(filePath, { highWaterMark: 100 }) }, regex);
console.log(result); // prints { path: '/path/to/some.json', result: { foo: "Hello", bar: "World" }}
```
In this example an array of file paths is provided along with a regex containing named capture groups, and the groups are extracted as an array of dictionaries.
```ts
import { regexGroupPathsReader } from 'streambyter';

// Assume there is an array of files, each with contents like: '{"foo":"Hello1","bar":"World1", /* more content */}'
const filePaths = ['/path/to/some1.json', '/path/to/some2.json'];
const regex = /"foo":"(?<foo>.*?)","bar":"(?<bar>.*?)"/;
const objs = filePaths.map((p) => ({ path: p }));
const results = await regexGroupPathsReader(objs, regex);
console.log(results); // prints [{ path: '/path/to/some1.json', result: { foo: "Hello1", bar: "World1" }}, { path: '/path/to/some2.json', result: { foo: "Hello2", bar: "World2" }}]
```
In this example an array of streams is provided along with a regex containing named capture groups, and the groups are extracted as an array of dictionaries.
```ts
import { regexGroupStreamsReader } from 'streambyter';
import { createReadStream } from 'fs';

// Assume there is an array of files, each with contents like: '{"foo":"Hello1","bar":"World1", /* more content */}'
const filePaths = ['/path/to/some1.json', '/path/to/some2.json'];
const regex = /"foo":"(?<foo>.*?)","bar":"(?<bar>.*?)"/;
const objs = filePaths.map((p) => ({ stream: createReadStream(p, { highWaterMark: 100 }) }));
const results = await regexGroupStreamsReader(objs, regex);
console.log(results); // prints [{ path: '/path/to/some1.json', result: { foo: "Hello1", bar: "World1" }}, { path: '/path/to/some2.json', result: { foo: "Hello2", bar: "World2" }}]
```
When dealing with the cloud, the SDK you are using should be able to return a stream. For example, when using the Azure Storage SDK you can obtain a download response for a blob via `const response = await blockBlobClient.download(0);` and read its stream from `response.readableStreamBody`. This is efficient because no contents have actually been downloaded at that point, only a stream that can be read and closed as desired.
So let's say we want to list blobs in an Azure Blob Storage container and replicate one of the above examples.
```ts
import { BlobServiceClient, ContainerClient, StorageSharedKeyCredential } from '@azure/storage-blob';
import { regexGroupStreamsReader } from 'streambyter';

async function downloadBlobAsStream(containerClient: ContainerClient, blobName: string): Promise<NodeJS.ReadableStream> {
  const blockBlobClient = containerClient.getBlockBlobClient(blobName);
  const downloadBlockBlobResponse = await blockBlobClient.download(0);
  if (!downloadBlockBlobResponse.readableStreamBody) {
    throw new Error(`No stream returned for blob: ${blobName}`);
  }
  return downloadBlockBlobResponse.readableStreamBody;
}

const account = 'someaccountname';
const sharedKeyCredential = new StorageSharedKeyCredential(account, '<account key>');
const client = new BlobServiceClient(`https://${account}.blob.core.windows.net`, sharedKeyCredential);
const containerClient = client.getContainerClient('somecontainer');

// Let's assume there is a list of files in the root directory of the container
const blobs = containerClient.listBlobsByHierarchy('/', { prefix: '' });
const objs = [];

// Iterate over each blob returned and build an array of objects containing a stream reference for each blob
for await (const blob of blobs) {
  if (blob.kind !== 'blob') continue; // skip virtual directories
  objs.push({ path: blob.name, stream: await downloadBlobAsStream(containerClient, blob.name) });
}

const regex = /"foo":"(?<foo>.*?)","bar":"(?<bar>.*?)"/;
const results = await regexGroupStreamsReader(objs, regex);
console.log(results); // prints [{ path: 'blob1.json', result: { foo: "Hello1", bar: "World1" }}, { path: 'blob2.json', result: { foo: "Hello2", bar: "World2" }}]
```
You'll notice that objects are being passed instead of just the `path` or the `stream` alone. Why? So that you can map individual results back to their inputs. For example, `await regexGroupPathsReader([{ path: '/path/to/a.txt' }, { path: '/path/to/b.txt' }], regex)` might result in: `[{ path: '/path/to/a.txt', result: { someMatch: '1' }}, { path: '/path/to/b.txt', result: { someMatch: '2' }}]`.
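Since every result carries back the fields you passed in, indexing results by their source is a one-liner. A quick sketch (the `byPath` map below is just an illustration, not part of the library):

```ts
// `results` is the array returned by regexGroupPathsReader above;
// index it by path so each match can be looked up by its source file.
const byPath = new Map(results.map((r) => [r.path, r.result]));
console.log(byPath.get('/path/to/a.txt')); // { someMatch: '1' }
```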
See the `*.spec.ts` files in the `./test` directory for a great reference on using the library.
Note that the library is built with rollup.js, targets CommonJS, and is intended to be used with Node.js.
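Since the build targets CommonJS, the library can also be consumed with `require` instead of `import`; a minimal sketch, assuming the same named exports shown above:

```ts
// CommonJS consumers can load the same named exports via require().
const { regexGroupPathReader } = require('streambyter');
```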
```sh
npm run test
```

```
30 passing (3s)

----------|---------|----------|---------|---------|-------------------
File      | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s
----------|---------|----------|---------|---------|-------------------
All files |     100 |      100 |     100 |     100 |
 index.ts |     100 |      100 |     100 |     100 |
----------|---------|----------|---------|---------|-------------------
```
- `npm i`
- Make code changes
- `npm run test`
- `npm run lint`
- `npm run build`
- Bump the `package.json` version
- `npm publish --access public`
- `git tag vx.y.z`
- `git push origin --tags`