Excel extractor header detection update (#962)
* feat:header detection update

* fix for failing check

---------

Co-authored-by: Ashley Mulligan <[email protected]>
carlbrugger and ashleygmulligan2 authored Dec 7, 2023
1 parent 6006567 commit a9f4dde
Showing 1 changed file with 67 additions and 3 deletions.
70 changes: 67 additions & 3 deletions plugins/extractors/xlsx-extractor.mdx
@@ -42,16 +42,27 @@ icon: "download"
take the raw numbers and disregard how it's displayed in Excel.
</ParamField>

<ParamField path="options.chunkSize" default="10_000" type="number" optional>
<ParamField path="chunkSize" default="10_000" type="number" optional>
The `chunkSize` parameter allows you to specify the number of records to include in
each chunk.
</ParamField>

<ParamField path="options.parallel" default="1" type="number" optional>
<ParamField path="parallel" default="1" type="number" optional>
The `parallel` parameter allows you to specify the number of chunks to process
in parallel.
</ParamField>

<ParamField path="headerDetectionOptions" type="Object" optional>
The `headerDetectionOptions` parameter allows you to specify the options for
detecting headers in the file. By default, the first 10 rows are scanned for
the row with the most non-empty cells.
</ParamField>

<ParamField path="debug" default="false" type="boolean" optional>
The `debug` parameter lets you toggle on/off helpful debugging messages for
development purposes.
</ParamField>

## API Calls

- `api.files.download`
@@ -70,7 +81,7 @@ icon: "download"
- [`@flatfile/[email protected]+`](https://npmjs.com/package/@flatfile/api)
- [`@flatfile/[email protected]+`](https://npmjs.com/package/@flatfile/hooks)
- [`@flatfile/[email protected]`](https://npmjs.com/package/@flatfile/listener)
- [`@flatfile/[email protected]`](../utils/extractor) provides utility functions for extracting and parsing data from various file formats and sources, streamlining data import processes.
- [`@flatfile/[email protected]`](https://npmjs.com/package/@flatfile/util-extractor) provides utility functions for extracting and parsing data from various file formats and sources, streamlining data import processes.
- [`remeda`](https://remedajs.com/) offers a set of utility functions for functional programming and data manipulation in JavaScript, providing a convenient way to work with arrays and objects.
- [`xlsx`](https://sheetjs.com/) allows for reading, writing, and manipulating Microsoft Excel files in JavaScript applications.

@@ -98,6 +109,59 @@ listener.use(ExcelExtractor({ raw: true, rawNumbers: true }));

</CodeGroup>

### Header Detection

Three algorithms are available for detecting headers in the file: `default`, `explicitHeaders`, and `specificRows`. By default, the first 10 rows are scanned for the row with the most non-empty cells, and that row is used as the header row.

#### Default

The `default` algorithm looks at the first `rowsToSearch` rows and takes the row
with the most non-empty cells as the header, preferring the earliest
such row in the case of a tie.

```js
listener.use(ExcelExtractor());
// or...
listener.use(
  ExcelExtractor({
    headerDetectionOptions: {
      algorithm: "default",
      rowsToSearch: 30, // Default is 10
    },
  })
);
```
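
The default strategy described above can be sketched in a few lines. This is only an illustration of the "most non-empty cells, earliest row wins" rule, not the plugin's actual implementation: `detectHeaderRow`, `sampleRows`, and the rule that whitespace-only cells count as empty are all assumptions made for this example.

```js
// Hypothetical helper (not part of the plugin's API): picks the header row
// from a 2D array of cell values.
function detectHeaderRow(rows, rowsToSearch = 10) {
  let bestIndex = 0;
  let bestCount = -1;
  const limit = Math.min(rowsToSearch, rows.length);

  for (let i = 0; i < limit; i++) {
    // Assumption: null, undefined, and whitespace-only cells count as empty.
    const count = rows[i].filter(
      (cell) => cell !== null && cell !== undefined && String(cell).trim() !== ""
    ).length;

    // A strict comparison keeps the earliest row when counts are tied.
    if (count > bestCount) {
      bestCount = count;
      bestIndex = i;
    }
  }

  return { index: bestIndex, header: rows[bestIndex] ?? [] };
}

// Made-up sample data: a title row followed by the real header and one record.
const sampleRows = [
  ["Quarterly report", "", ""],
  ["first", "last", "city"],
  ["Ada", "Lovelace", "London"],
];

const detected = detectHeaderRow(sampleRows);
console.log(detected.index); // 1 (rows 1 and 2 tie at three cells; the earlier row wins)
```

Raising `rowsToSearch` only matters when the real header sits below the scanned window, for example under a long block of title or notes rows.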

#### Explicit Headers

The `explicitHeaders` algorithm returns exactly the list of headers it is given.

```js
listener.use(
  ExcelExtractor({
    headerDetectionOptions: {
      algorithm: "explicitHeaders",
      headers: ["fiRsT NamE", "LaSt nAme", "emAil"],
    },
  })
);
```

#### Specific Rows

The `specificRows` algorithm looks at the rows you name and combines them into a single header. Row numbers are 0-based, so if you know that the header is in the third row, pass `{ rowNumbers: [2] }`.

```js
listener.use(
  ExcelExtractor({
    headerDetectionOptions: {
      algorithm: "specificRows",
      rowNumbers: [2], // 0 based
    },
  })
);
```
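
The text above does not say how multiple rows are combined, so the following is only a sketch of one plausible approach: for each column, join the non-empty cells from the selected rows with a space. The `combineHeaderRows` helper, the `grid` data, and the joining rule are assumptions for illustration and may differ from what the plugin does.

```js
// Hypothetical helper (not part of the plugin's API): merges the given
// 0-based rows into one header row, column by column.
function combineHeaderRows(rows, rowNumbers) {
  const selected = rowNumbers.map((n) => rows[n] ?? []);
  const width = Math.max(0, ...selected.map((row) => row.length));
  const header = [];

  for (let col = 0; col < width; col++) {
    // Assumption: empty cells are skipped and the rest are joined with a space.
    const parts = selected
      .map((row) => row[col])
      .filter(
        (cell) => cell !== null && cell !== undefined && String(cell).trim() !== ""
      )
      .map((cell) => String(cell).trim());
    header.push(parts.join(" "));
  }

  return header;
}

// Made-up sample data with a two-level header in rows 1 and 2.
const grid = [
  ["Export", "", ""],
  ["Contact", "Contact", "Address"],
  ["Name", "Email", "City"],
];

console.log(combineHeaderRows(grid, [2])); // [ 'Name', 'Email', 'City' ]
console.log(combineHeaderRows(grid, [1, 2])); // [ 'Contact Name', 'Contact Email', 'Address City' ]
```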

### Full Example

In this example, the `ExcelExtractor` is initialized with options and then registered as middleware with the Flatfile listener. When an Excel file is uploaded, the plugin will extract the structured data and process it using the extractor's parser.