
chore: xlsx improvements #52

Open · hewliyang wants to merge 10 commits into main

Conversation

hewliyang

a couple of simple heuristics that should be safe to apply in the general case:

  • if any row or column is fully null, we can save tokens by dropping it
  • dataframe column names that are null get auto-filled with Unnamed: {i}, and I think we can scrub this out
  • NaNs in the dataframe materialize as "NaN" in the markdown and can probably be scrubbed out as well (sketched below)
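
For concreteness, here is a rough pandas sketch of what those three heuristics could look like (the function name and exact calls are illustrative, not the actual diff):

```python
import pandas as pd

def clean_sheet(df: pd.DataFrame, na_rep: str = "") -> pd.DataFrame:
    # 1. drop rows/columns whose data cells are entirely null
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    # 2. scrub the auto-generated "Unnamed: {i}" names pandas assigns to blank header cells
    df.columns = ["" if str(c).startswith("Unnamed:") else c for c in df.columns]
    # 3. render remaining NaN cells as na_rep ("" by default) instead of the literal "NaN"
    return df.fillna(na_rep)
```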

Examples


here is what I have expanded the test.xlsx to look like (changes annotated in red):

[image: expanded test.xlsx with the changes annotated in red]

which would produce:


Sheet1

| Alpha | Unnamed: 1 | Beta | Gamma | Delta | Unnamed: 5 |
| --- | --- | --- | --- | --- | --- |
| 89 | NaN | 82 | 100 | 12 | 13.0 |
| 76 | NaN | 89 | 33 | 42 | 43.0 |
| 60 | NaN | 84 | 19 | 19 | 53.0 |
| 7 | NaN | 69 | 10 | 17 | 63.0 |
| 87 | NaN | 89 | 86 | 54 | 312.0 |
| 23 | NaN | 4 | 89 | 25 | 23.0 |
| 70 | NaN | 84 | 62 | 59 | 34.0 |
| 83 | NaN | 37 | 43 | 21 | 321.0 |
| 71 | NaN | 15 | 88 | 32 | 431.0 |
| 20 | NaN | 62 | 20 | 67 | 647.0 |
| 67 | NaN | 18 | 15 | 48 | 878.0 |
| 42 | NaN | 5 | 15 | 67 | NaN |
| 58 | NaN | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 | 346.0 |
| 49 | NaN | 93 | 6 | 38 | 317.0 |
| 82 | NaN | 28 | 1 | 39 | 341.0 |
| 95 | NaN | 55 | 18 | 82 | 135.0 |
| 50 | NaN | 46 | 98 | 86 | 135.0 |
| 31 | NaN | 46 | 47 | 82 | 3634.0 |
| 40 | NaN | 65 | 19 | 31 | 357.0 |
| 95 | NaN | 65 | 29 | 62 | 34.0 |
| 68 | NaN | 57 | 34 | 54 | 901.0 |
| 96 | NaN | 66 | 63 | 14 | 299.0 |
| 87 | NaN | 93 | 95 | 80 | 19.0 |

09060124-b5e7-4717-9d07-3c046eb

| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

with the changes:


Sheet1

| Alpha | Beta | Gamma | Delta |  |
| --- | --- | --- | --- | --- |
| 89 | 82 | 100 | 12 | 13.0 |
| 76 | 89 | 33 | 42 | 43.0 |
| 60 | 84 | 19 | 19 | 53.0 |
| 7 | 69 | 10 | 17 | 63.0 |
| 87 | 89 | 86 | 54 | 312.0 |
| 23 | 4 | 89 | 25 | 23.0 |
| 70 | 84 | 62 | 59 | 34.0 |
| 83 | 37 | 43 | 21 | 321.0 |
| 71 | 15 | 88 | 32 | 431.0 |
| 20 | 62 | 20 | 67 | 647.0 |
| 67 | 18 | 15 | 48 | 878.0 |
| 42 | 5 | 15 | 67 |  |
| 58 | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 | 346.0 |
| 49 | 93 | 6 | 38 | 317.0 |
| 82 | 28 | 1 | 39 | 341.0 |
| 95 | 55 | 18 | 82 | 135.0 |
| 50 | 46 | 98 | 86 | 135.0 |
| 31 | 46 | 47 | 82 | 3634.0 |
| 40 | 65 | 19 | 31 | 357.0 |
| 95 | 65 | 29 | 62 | 34.0 |
| 68 | 57 | 34 | 54 | 901.0 |
| 96 | 66 | 63 | 14 | 299.0 |
| 87 | 93 | 95 | 80 | 19.0 |

09060124-b5e7-4717-9d07-3c046eb

| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

@hewliyang
Author

@hewliyang please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree [company="{your company}"]

@hewliyang
Author


@microsoft-github-policy-service agree

@gagb
Contributor

gagb commented Dec 17, 2024

@hewliyang thanks for the PR.
Our typical use case is including this data as context in an LLM.
Do you think your suggested changes, NaN -> "", Unnamed: {i} -> "", improve generation?

cc @afourney

@gagb
Contributor

gagb commented Dec 17, 2024

If the objective is beautification, what if we added a kwarg called beautify=True so that both options are available to users?

@gagb gagb self-assigned this Dec 17, 2024
@hewliyang
Author

hewliyang commented Dec 17, 2024

> @hewliyang thanks for the PR. Our typical use case is including this data as context in an LLM. Do you think your suggested changes, NaN -> "", Unnamed: {i} -> "", improve generation?
>
> cc @afourney

thanks for reviewing @gagb 🤗

well, the right answer is that it's non-trivial to say, because we would have to curate a QA test set and run experiments on it.

but what I do know for sure is that NaN, Unnamed: {i}, and fully NaN rows/cols are non-informative, so they don't give the LLM any more context than if they were scrubbed out in the first place. it increases the signal while decreasing the noise, plus you save a bunch of input tokens. the assumption here is that LLMs understand that the empty string "" implies a null/empty cell.

> If the objective is beautification, what if we added a kwarg called beautify=True so that both options are available to users?

sounds good!

@vmatt

vmatt commented Dec 17, 2024

For empty values, "" is definitely better than NaN. Although I'm not sure we want to remove empty columns; that would be transforming the structure of the data. Maybe this could be a flag, such as --remove-empty-columns / --remove-empty-rows. What do you think @hewliyang?

@hewliyang
Author

hewliyang commented Dec 17, 2024

> For empty values, "" is definitely better than NaN. Although I'm not sure we want to remove empty columns; that would be transforming the structure of the data. Maybe this could be a flag, such as --remove-empty-columns / --remove-empty-rows. What do you think @hewliyang?

sounds good. one case where we might want to keep fully null rows/cols is when there are subtables in a sheet and we want to maintain some spatial separation (or not). I've added two flags, drop_empty_{rows|cols}, for this.

I'm also now forwarding na_rep="" so the user can provide a custom placeholder for NaNs as well (a usage sketch follows below).
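
A hypothetical usage sketch, assuming these kwargs get forwarded through convert() (the flag names follow this comment and may differ in the final diff):

```python
from markitdown import MarkItDown

md = MarkItDown()
# Hypothetical call: na_rep and the drop_empty_* flags are the kwargs proposed
# in this PR, not part of the released API.
result = md.convert(
    "test.xlsx",
    na_rep="",             # placeholder written for NaN cells
    drop_empty_rows=True,  # drop rows that are entirely null
    drop_empty_cols=True,  # drop columns that are entirely null
)
print(result.text_content)
```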

@afourney
Member

Arguably, if the xlsx cell is empty, the conversion should be empty.

If the excel value is "#N/A" then I believe it should be interpreted as NaN (though it's perhaps not exactly the same).

In any case, I agree with dropping completely empty columns (including empty headers), and with encoding empty cells with "". I'm not sure what to do about column headers yet. Let me think on this more, and test with #N/A values.

@gagb gagb assigned afourney and unassigned gagb Dec 18, 2024
@gagb gagb requested review from afourney and Copilot December 18, 2024 00:45


Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

@gagb gagb added the open for reviewing Invites community to help review this PR. label Dec 20, 2024
@hewliyang
Author

hewliyang commented Dec 22, 2024

> Arguably, if the xlsx cell is empty, the conversion should be empty.
>
> If the excel value is "#N/A" then I believe it should be interpreted as NaN (though it's perhaps not exactly the same).
>
> In any case, I agree with dropping completely empty columns (including empty headers), and with encoding empty cells with "". I'm not sure what to do about column headers yet. Let me think on this more, and test with #N/A values.

I've pushed a change to:

  • only drop columns iff its header is also nullish (see the sketch after the next list)
  • side note: i've seen a case at work where a financial model for some reason had a rogue "-" at row 1,000,000, and pandas blew up after consuming > 15 GB of memory trying to convert the dataframe into HTML. so cleaning is kind of important, especially for unsanitized inputs, i.e. user data

Additionally:

  • rows should be safe because we don't care about indexes
  • i've checked that =NA() or #N/A values in Excel are indeed also parsed to NaN by pandas
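
A rough sketch of the column rule described above (helper names are mine, not the PR's):

```python
import pandas as pd

def header_is_nullish(col) -> bool:
    # blank headers, or the "Unnamed: {i}" placeholder pandas assigns to them
    name = str(col).strip()
    return name == "" or name.startswith("Unnamed:")

def drop_null_columns(df: pd.DataFrame) -> pd.DataFrame:
    # drop a column only if its data is entirely null AND its header is nullish
    keep = [c for c in df.columns if not (header_is_nullish(c) and df[c].isna().all())]
    return df[keep]

# Note: pandas' read_excel maps Excel #N/A (e.g. from =NA()) to NaN by default,
# so those cells flow through the same cleanup path.
```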

Regarding XLSXConverter kwargs, currently:

  • na_rep defaults to "" - covered in tests
  • drop_null_{rows|columns} defaults to False - not covered in tests

Not sure how verbose you would want the tests to be, because I think for the latter we will need to reparse the string representation.
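
For the reparse point, something like the following hypothetical test could work (the kwarg name, test file path, and assertion are illustrative only):

```python
from markitdown import MarkItDown

def test_drop_null_columns():
    # Hypothetical: assumes the drop_null_columns kwarg from this PR is forwarded
    # to the XLSX converter.
    result = MarkItDown().convert("tests/test_files/test.xlsx", drop_null_columns=True)
    # Reparse the markdown output: take the first table header row and check that
    # the scrubbed "Unnamed" column is gone.
    header = next(line for line in result.text_content.splitlines() if line.lstrip().startswith("|"))
    assert "Unnamed" not in header
```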

Also, I see a related PR #169. If we intend to merge this, then another TODO would be to wrap all the .xl{s|sx|sm|...} converters in a common base class and move the cleaning logic to a separate internal method. Lmk what you think.

@vmatt

vmatt commented Dec 23, 2024

Thanks for adding the --drop flags!

> i've seen a case at work where a financial model for some reason had a rogue "-" at row 1,000,000, and pandas blew up after consuming > 15 GB of memory trying to convert the dataframe into HTML. so cleaning is kind of important, especially for unsanitized inputs, i.e. user data

Excel supports ~1.048M rows max, though in my experience it starts to get sluggish once a sheet passes 100k rows; but that's another story.

On the pandas memory usage topic: I'm generally advocating for all projects to replace pandas with polars or duckdb whenever possible, especially when working with large datasets.

I'm actually planning to open a PR myself in January to replace all pandas functions with polars wherever possible. I'm expecting this library will be used to convert vast amounts of data, so we should make it as efficient as possible.
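
A minimal sketch of what the read path might look like with polars, purely as an illustration (the downstream markdown rendering would need to change too):

```python
import polars as pl

# Sketch only: read one sheet into a polars DataFrame instead of pandas.
df = pl.read_excel("test.xlsx", sheet_name="Sheet1")
print(df.head())
```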
