Support mix of date/timestamp and strings #184
Comments
IMO, we will have to take a design decision here, I see two paths:
For both cases, it's an opinionated feature, so I'd rather see this in 0.10.0 than in 0.9.1. Thoughts?
Yup, I wanted to have this discussion with you (hence this new issue and my message on Slack 😛). Making string the uber-type may or may not be a good idea. If we go with this option I would like some kind of parameter. But let's start with manual dtype declaration. WDYT?
Sorry, missed the Slack message, I've disabled all notifications 😬 Yeah, let's get 0.9.1 out asap. And I agree, having dtype enforcement is a good start, and it provides a workaround 👍
My team needs some way to handle bad data in Excel files. We eventually want only dates in a date column, but we need to check whether there are cells with non-dates and report where they occur. Right now we are using pandas and openpyxl with dtype=object to read Excel files, so that mixed data types keep their original types. That way we still get a dataframe, and can later check and report an error if there are text cells in a column that should contain only dates. It would be nice if we could use fastexcel (maybe with polars) instead, since openpyxl is so slow.
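To make the check described above concrete, here is a minimal pure-Python sketch. It assumes the column has already been read with its original cell types preserved (for example via `dtype=object`). The helper name `find_non_dates` and the `first_row` parameter are hypothetical and not part of pandas, openpyxl, or fastexcel.

```python
from datetime import date, datetime


def find_non_dates(column, first_row=2):
    """Return (row_number, value) pairs for cells that are not dates.

    `first_row` is the spreadsheet row of the first data cell
    (2 when row 1 holds the header). Empty cells (None) are skipped.
    """
    return [
        (row, value)
        for row, value in enumerate(column, start=first_row)
        if value is not None and not isinstance(value, (date, datetime))
    ]


# Hypothetical column content: two text cells hidden among dates.
cells = [datetime(2024, 1, 5), "n/a", None, date(2024, 2, 1), "05/01/2024"]
bad_cells = find_non_dates(cells)
```

Each entry in `bad_cells` carries the spreadsheet row of the offending cell, which is the kind of location report the use case calls for.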
This is planned for v0.10.0. I'll work on it this weekend. @lukapeschke and I will publish the new release in March for sure!
FYI I'm working on #173 , which should allow to enforce a |
Hopefully there will be an option to disallow conversion. For our use case we ideally need mixed data types in a dataframe (I know pyarrow supports union, but I don't know if it's easy to allow just any type for a column) or at least a list of errors we can trace back to every failed cell by location. For us it also depends on the type/column whether we need strict enforcement. We have had many problems due to incorrect or ambiguous conversion:
For the first two cases I mentioned, for us to be able to use fastexcel instead of openpyxl, we'd need behavior like this:
For the third problem, I have no idea if there is even anything fastexcel could do about this. If percentages and floats are both just converted to floats, we have no way of knowing what the original data was. To address this, we currently have to manually go through every cell for columns that should be percentages with openpyxl and explicitly check the Excel type. Unfortunately, working with Excel to correctly detect bad data is complicated. Not sure if fastexcel can handle our use case, but it would be nice if it could, because openpyxl is painfully slow vs. calamine for the large files we have to work with.
@PrettyWood Hello! I am curious about the timeline for when this will be fixed and released. Thanks!
@armgabrielyan Hello! We just released v0.10.0 this morning. It allows to enforce a dtype for a column: https://fastexcel.toucantoco.dev/fastexcel.html#ExcelReader.load_sheet . If you want a column of mixed strings and timestamps as a string, you can now use @noctuid could you please let us know if this works for you ? For more advanced mixed dtypes, we're thinking about providing a |
@lukapeschke It sounds good. However, we use |
@armgabrielyan I believe it should be possible via |
@lukapeschke You are correct, it is possible to do this via |
@lukapeschke In our case, the column names or the number of columns is not pre-defined. I was wondering if there is any way to specify the data types for all columns to be |
@armgabrielyan No, that's currently not possible, but feel free to open a different issue if you'd like support for that 🙂
@lukapeschke Sounds great! Thank you so much for your support!
It doesn't. Unfortunately |
@noctuid Unfortunately, the percentage edge case would require some work in calamine to properly support the Excel Percent type, as it is not supported there for now: https://github.com/tafia/calamine/blob/master/src/datatype.rs#L23-L44 . Are the other points (apart from the percentage case) working for you with v0.10.0? If yes, can we close this and create a separate issue for the percentage case?
With #197, we could handle the other edge cases I mentioned, but I think it would be preferable for us to only use We'd need some control over the coercion to detect bad data. Will it coerce strings to dates currently? For example, for a dtype string column, we would want coercion to string from number (or anything). However, for a date dtype column, we would want an error for a string in that column, even if it could be coerced to a date. |
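The per-dtype behaviour asked for here could be sketched as follows. This is only an illustration of the requested policy in plain Python, not how fastexcel works: `coerce_cell` and `CellTypeError` are hypothetical names, and the dtype labels `"string"` and `"date"` are assumptions made for the example.

```python
from datetime import date, datetime


class CellTypeError(ValueError):
    """Raised when a cell does not match a strictly enforced dtype."""


def coerce_cell(value, dtype):
    """Lenient coercion for string columns, strict checking for date columns."""
    if value is None:  # empty cells pass through unchanged
        return None
    if dtype == "string":
        # Anything (numbers, dates, booleans) may become a string.
        return value if isinstance(value, str) else str(value)
    if dtype == "date":
        # datetime is a subclass of date, so it must be checked first.
        if isinstance(value, datetime):
            return value.date()
        if isinstance(value, date):
            return value
        # A string is rejected even if it looks like a parseable date.
        raise CellTypeError(f"expected a date, got {type(value).__name__}: {value!r}")
    raise ValueError(f"unsupported dtype: {dtype!r}")
```

With this policy, `coerce_cell(42, "string")` succeeds, while `coerce_cell("2024-01-05", "date")` raises instead of silently parsing the text.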
@noctuid Right now, dates will be coerced to strings only if the dtype of the column is explicitly set to I created an issue to explicitly describe that behaviour in the docs: #215 In the future, it'd be nice to be able to specify how multi-dtype coercion should work |
With #245 mixed dtypes and string columns are now automatically coerced to strings. Since this can be unexpected behaviour for some users, there will be an option to completely disable automatic coercion (tracked in #247). In the long term, we'd like to offer options to support mixed dtypes. This could be with a |
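As a rough illustration of what "coerce to string unless disabled" could mean at the column level, here is a small pure-Python sketch. It is an assumption-laden model, not the fastexcel implementation: `resolve_column_dtype`, `MixedDtypeError`, the `coerce_to_string` flag, and the dtype labels are all hypothetical.

```python
class MixedDtypeError(TypeError):
    """Raised when a column mixes dtypes and no fallback applies."""


def resolve_column_dtype(observed, coerce_to_string=True):
    """Pick one dtype for a column from the dtypes seen in its cells."""
    # "null" marks empty cells and never influences the result.
    dtypes = set(observed) - {"null"}
    if not dtypes:
        return "null"
    if len(dtypes) == 1:
        return next(iter(dtypes))
    # Mixed column: fall back to string only when strings are present
    # and the caller has not disabled automatic coercion.
    if coerce_to_string and "string" in dtypes:
        return "string"
    raise MixedDtypeError(f"column mixes dtypes: {sorted(dtypes)}")
```

Here `resolve_column_dtype(["datetime", "string"])` falls back to `"string"`, while passing `coerce_to_string=False` raises, so bad data surfaces instead of being hidden.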
See #181 (comment)
and #158 (comment)
TEST_FASTEXCEL_MIXED_DATA.xlsx
example-skip-rows.xlsx