
chore: xlsx improvements #52

Open · hewliyang wants to merge 10 commits into main

Conversation

hewliyang

a couple of simple heuristics that should be safe to apply in the general case:

  • if any row or column is fully null, we can save tokens by dropping it
  • dataframe column names that are null get auto-filled with Unnamed: {i}, and I think we can scrub this out
  • NaNs in the dataframe materialize as "NaN" in the markdown and can probably be scrubbed out as well (sketched below)
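
For concreteness, here is a rough pandas sketch of what those three heuristics could look like (the function name and exact calls are illustrative, not the actual diff):

```python
import pandas as pd

def clean_sheet(df: pd.DataFrame, na_rep: str = "") -> pd.DataFrame:
    # 1. drop rows/columns whose data cells are entirely null
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    # 2. scrub the auto-generated "Unnamed: {i}" names pandas assigns to blank header cells
    df.columns = ["" if str(c).startswith("Unnamed:") else c for c in df.columns]
    # 3. render remaining NaN cells as na_rep ("" by default) instead of the literal "NaN"
    return df.fillna(na_rep)
```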

Examples


here is what I have expanded the test.xlsx to look like (changes annotated in red):

[image: expanded test.xlsx with the changes annotated in red]

which would produce:


Sheet1

| Alpha | Unnamed: 1 | Beta | Gamma | Delta | Unnamed: 5 |
| --- | --- | --- | --- | --- | --- |
| 89 | NaN | 82 | 100 | 12 | 13.0 |
| 76 | NaN | 89 | 33 | 42 | 43.0 |
| 60 | NaN | 84 | 19 | 19 | 53.0 |
| 7 | NaN | 69 | 10 | 17 | 63.0 |
| 87 | NaN | 89 | 86 | 54 | 312.0 |
| 23 | NaN | 4 | 89 | 25 | 23.0 |
| 70 | NaN | 84 | 62 | 59 | 34.0 |
| 83 | NaN | 37 | 43 | 21 | 321.0 |
| 71 | NaN | 15 | 88 | 32 | 431.0 |
| 20 | NaN | 62 | 20 | 67 | 647.0 |
| 67 | NaN | 18 | 15 | 48 | 878.0 |
| 42 | NaN | 5 | 15 | 67 | NaN |
| 58 | NaN | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 | 346.0 |
| 49 | NaN | 93 | 6 | 38 | 317.0 |
| 82 | NaN | 28 | 1 | 39 | 341.0 |
| 95 | NaN | 55 | 18 | 82 | 135.0 |
| 50 | NaN | 46 | 98 | 86 | 135.0 |
| 31 | NaN | 46 | 47 | 82 | 3634.0 |
| 40 | NaN | 65 | 19 | 31 | 357.0 |
| 95 | NaN | 65 | 29 | 62 | 34.0 |
| 68 | NaN | 57 | 34 | 54 | 901.0 |
| 96 | NaN | 66 | 63 | 14 | 299.0 |
| 87 | NaN | 93 | 95 | 80 | 19.0 |

09060124-b5e7-4717-9d07-3c046eb

| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

with the changes:


Sheet1

| Alpha | Beta | Gamma | Delta |  |
| --- | --- | --- | --- | --- |
| 89 | 82 | 100 | 12 | 13.0 |
| 76 | 89 | 33 | 42 | 43.0 |
| 60 | 84 | 19 | 19 | 53.0 |
| 7 | 69 | 10 | 17 | 63.0 |
| 87 | 89 | 86 | 54 | 312.0 |
| 23 | 4 | 89 | 25 | 23.0 |
| 70 | 84 | 62 | 59 | 34.0 |
| 83 | 37 | 43 | 21 | 321.0 |
| 71 | 15 | 88 | 32 | 431.0 |
| 20 | 62 | 20 | 67 | 647.0 |
| 67 | 18 | 15 | 48 | 878.0 |
| 42 | 5 | 15 | 67 |  |
| 58 | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 | 346.0 |
| 49 | 93 | 6 | 38 | 317.0 |
| 82 | 28 | 1 | 39 | 341.0 |
| 95 | 55 | 18 | 82 | 135.0 |
| 50 | 46 | 98 | 86 | 135.0 |
| 31 | 46 | 47 | 82 | 3634.0 |
| 40 | 65 | 19 | 31 | 357.0 |
| 95 | 65 | 29 | 62 | 34.0 |
| 68 | 57 | 34 | 54 | 901.0 |
| 96 | 66 | 63 | 14 | 299.0 |
| 87 | 93 | 95 | 80 | 19.0 |

09060124-b5e7-4717-9d07-3c046eb

| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

@hewliyang
Author

@hewliyang please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree [company="{your company}"]

@hewliyang
Author


@microsoft-github-policy-service agree

@gagb
Contributor

gagb commented Dec 17, 2024

@hewliyang thanks for the PR.
Our typical use case is including this data as context in an LLM.
Do you think your suggested changes, NaN -> "", Unnamed: {i} -> "", improve generation?

cc @afourney

@gagb
Contributor

gagb commented Dec 17, 2024

If the objective is beautification, what if we added a kwarg called beautify=True so that both options are available to users?

@gagb gagb self-assigned this Dec 17, 2024
@hewliyang
Author

hewliyang commented Dec 17, 2024

> @hewliyang thanks for the PR. Our typical use case is including this data as context in an LLM. Do you think your suggested changes, NaN -> "", Unnamed: {i} -> "", improve generation?
>
> cc @afourney

thanks for reviewing @gagb 🤗

well, the right answer is that it's non-trivial to say, because we would have to curate a QA test set and run experiments on it.

but what I do know for sure is that NaN, Unnamed: {i}, and fully NaN rows/cols are non-informative, so they don't give the LLM any more context than if they were scrubbed out in the first place. it increases the signal while decreasing the noise, plus you save a bunch of input tokens. the assumption here is that LLMs understand that the empty string "" implies a null/empty cell.

> If the objective is beautification, what if we added a kwarg called beautify=True so that both options are available to users?

sounds good!

@vmatt

vmatt commented Dec 17, 2024

For empty values, "" is definitely better than NaN. Although I'm not sure we want to remove empty columns; that would be transforming the structure of the data. Maybe this could be a flag, such as --remove-empty-columns / --remove-empty-rows. What do you think @hewliyang?

@hewliyang
Author

hewliyang commented Dec 17, 2024

> For empty values, "" is definitely better than NaN. Although I'm not sure we want to remove empty columns; that would be transforming the structure of the data. Maybe this could be a flag, such as --remove-empty-columns / --remove-empty-rows. What do you think @hewliyang?

sounds good. one case where we might want to keep fully null rows/cols is when there are subtables in a sheet and we want to maintain some spatial separation (or not). I've added two flags, drop_empty_{rows|cols}, for this.

I'm also now forwarding na_rep="" so the user can provide a custom placeholder for NaNs as well (a usage sketch follows below).
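
A hypothetical usage sketch, assuming these kwargs get forwarded through convert() (the flag names follow this comment and may differ in the final diff):

```python
from markitdown import MarkItDown

md = MarkItDown()
# Hypothetical call: na_rep and the drop_empty_* flags are the kwargs proposed
# in this PR, not part of the released API.
result = md.convert(
    "test.xlsx",
    na_rep="",             # placeholder written for NaN cells
    drop_empty_rows=True,  # drop rows that are entirely null
    drop_empty_cols=True,  # drop columns that are entirely null
)
print(result.text_content)
```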

@afourney
Member

Arguably, if the xlsx cell is empty, the conversion should be empty.

If the excel value is "#N/A" then I believe it should be interpreted as NaN (though it's perhaps not exactly the same).

In any case, I agree with dropping completely empty columns (including empty headers), and with encoding empty cells with "". I'm not sure what to do about column headers yet. Let me think on this more, and test with #N/A values.

@gagb gagb assigned afourney and unassigned gagb Dec 18, 2024
@gagb gagb requested review from afourney and Copilot December 18, 2024 00:45


Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

@gagb gagb added the open for reviewing Invites community to help review this PR. label Dec 20, 2024
@hewliyang
Author

hewliyang commented Dec 22, 2024

> Arguably, if the xlsx cell is empty, the conversion should be empty.
>
> If the excel value is "#N/A" then I believe it should be interpreted as NaN (though it's perhaps not exactly the same).
>
> In any case, I agree with dropping completely empty columns (including empty headers), and with encoding empty cells with "". I'm not sure what to do about column headers yet. Let me think on this more, and test with #N/A values.

I've pushed a change to:

  • only drop columns iff its header is also nullish (see the sketch after the next list)
  • side note: i've seen a case at work where a financial model for some reason had a rogue "-" at row 1,000,000, and pandas blew up after consuming > 15 GB of memory trying to convert the dataframe into HTML. so cleaning is kind of important, especially for unsanitized inputs, i.e. user data

Additionally:

  • rows should be safe because we don't care about indexes
  • i've checked that =NA() or #N/A values in Excel are indeed also parsed to NaN by pandas
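
A rough sketch of the column rule described above (helper names are mine, not the PR's):

```python
import pandas as pd

def header_is_nullish(col) -> bool:
    # blank headers, or the "Unnamed: {i}" placeholder pandas assigns to them
    name = str(col).strip()
    return name == "" or name.startswith("Unnamed:")

def drop_null_columns(df: pd.DataFrame) -> pd.DataFrame:
    # drop a column only if its data is entirely null AND its header is nullish
    keep = [c for c in df.columns if not (header_is_nullish(c) and df[c].isna().all())]
    return df[keep]

# Note: pandas' read_excel maps Excel #N/A (e.g. from =NA()) to NaN by default,
# so those cells flow through the same cleanup path.
```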

Regarding XLSXConverter kwargs, currently:

  • na_rep defaults to "" - covered in tests
  • drop_null_{rows|columns} defaults to False - not covered in tests

Not sure how verbose you would want the tests to be, because I think for the latter we will need to reparse the string representation.
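
For the reparse point, something like the following hypothetical test could work (the kwarg name, test file path, and assertion are illustrative only):

```python
from markitdown import MarkItDown

def test_drop_null_columns():
    # Hypothetical: assumes the drop_null_columns kwarg from this PR is forwarded
    # to the XLSX converter.
    result = MarkItDown().convert("tests/test_files/test.xlsx", drop_null_columns=True)
    # Reparse the markdown output: take the first table header row and check that
    # the scrubbed "Unnamed" column is gone.
    header = next(line for line in result.text_content.splitlines() if line.lstrip().startswith("|"))
    assert "Unnamed" not in header
```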

Also, I see a related PR #169. If we intend to merge this, then another TODO would be to wrap all the .xl{s|sx|sm|...} converters in a common base class and move the cleaning logic to a separate internal method. Lmk what you think.

@vmatt

vmatt commented Dec 23, 2024

Thanks for adding the --drop flags!

> i've seen a case at work where a financial model for some reason had a rogue "-" at row 1,000,000, and pandas blew up after consuming > 15 GB of memory trying to convert the dataframe into HTML. so cleaning is kind of important, especially for unsanitized inputs, i.e. user data

Excel supports ~1.048M rows max, though in my experience it starts to get sluggish once a sheet passes 100k rows; but that's another story.

On the pandas memory usage topic: I'm generally advocating for all projects to replace pandas with polars or duckdb whenever possible, especially when working with large datasets.

I'm actually planning to open a PR myself in January to replace all pandas functions with polars wherever possible. I'm expecting this library will be used to convert vast amounts of data, so we should make it as efficient as possible.
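
A minimal sketch of what the read path might look like with polars, purely as an illustration (the downstream markdown rendering would need to change too):

```python
import polars as pl

# Sketch only: read one sheet into a polars DataFrame instead of pandas.
df = pl.read_excel("test.xlsx", sheet_name="Sheet1")
print(df.head())
```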
