Support mix of date/timestamp and strings #184

PrettyWood · 2024-02-19T17:49:40Z

TEST_FASTEXCEL_MIXED_DATA.xlsx
- Same as the prior one, but introducing two additional columns with mixed data types (integer plus string, date plus string, no "NULL" string values this time)
example-skip-rows.xlsx

0: Could not build schema for sheet Sheet2
1: could not figure out column type for following type combination: {Timestamp(Millisecond, None), Utf8}

The text was updated successfully, but these errors were encountered:

lukapeschke · 2024-02-20T08:47:16Z

IMO, we will have to take a design decision here, I see two paths:

Either we consider strings to be some kind of uber-type, on which we can fallback in any case meaning we will have to add support for that on the calamine side if we want to avoid double checks for every cell in case .as_string returns None. Or maybe it should be a new to_string method
Or we do not do that and consider that data that gets converted to the arrow format should be uniform (with tolerance for the current int/bool/float implicit conversions). In that case, we would provide a to_python method, which would return a simple matrix of python objects. Its type would be dict[str, list[scalar | None]] (or ordereddict as the columns appear in a certain order) in case we have column names, and list[list[scalar | None]] in case we don't. scalar would be bool | int | float | datetime | timedelta | str

For both cases, it's an opiniated feature, so I'd rather see this in 0.10.0 than in 0.9.1. Thoughts ?

PrettyWood · 2024-02-20T09:34:50Z

Yup I wanted to have this discussion with you (hence this new issue and my message on Slack 😛)
Let's add that in v0.10.0 and release a new 0.9.1 today.
I guess we could start with the dtype enforcement (see #173) and set for a column that fails the type string to make it work
So we wouldn't read directly the file but if the error is clear enough, the workaround is easy

Making string the uber-type may or may not be a good idea. If we go with this option I would like some kind of parameter. But let's start with manual dtype declaration. WDYT?

lukapeschke · 2024-02-20T11:19:29Z

Sorry, missed the slack message, I've disabled all notifications 😬

Yeah, let's get 0.9.1 out asap.

And I agree, having dtype enforcement is a good start, and it provides a workaround 👍

durgeksh · 2024-02-22T12:45:04Z

I was about to report this error. For below data I am also getting error as:

Caused by:
    0: Could not build schema for sheet Sheet1
    1: could not figure out column type for following type combination: {Timestamp(Millisecond, None), Float64}

noctuid · 2024-02-29T23:34:19Z

My team needs some way to handle bad data in excel files. We eventually want only dates for a date column but need to check if there are cells with non-dates and report where they occur. Right now we are using pandas and openpyxl with dtype=object to read excel files with mixed data types matching the original types, so we can still get a dataframe but later check and report an error if there are text cells in a column that should have all dates. Would be nice if we could use fastexcel (maybe with polars) instead since openpyxl is so slow.

PrettyWood · 2024-02-29T23:48:39Z

This is planned for v0.10.0. I'll work on it this weekend. @lukapeschke and I will publish the new release in March for sure!

lukapeschke · 2024-02-29T23:53:29Z

FYI I'm working on #173 , which should allow to enforce a string dtype for a given column and convert datetime objects to strings

noctuid · 2024-03-01T14:55:00Z

Hopefully there will be an option to disallow conversion. For our use case we ideally need mixed data types in a dataframe (I know pyarrow supports union, but I don't now if it's easy to allow just any type for a column) or at least a list of errors we can trace back to every failed cell by location.

For us it also depends on the type/column whether we need strict enforcement. We have have had many problems due to incorrect or ambiguous conversion:

For date columns, we only want to allow date-type cells and no text cells to avoid ambiguity due to different date formats.
We have some string columns that can have cells that are just numbers. Excel will by default set the cell types to numbers. Therefore, we want to allow numbers in these columns and convert to strings. The problem we've seen is incorrect conversion with pyspark or pandas for numbers. Very large numbers are converted to scientific notation, which is wrong for our case. Haven't tested this but floats could also be an issue, e.g. we would need to read a cell showing 198.0 exactly as "198.0".
We have some columns where all cells need to be percentages. So we need these cells to either be excel's percent type or strings with % at the end (this wouldn't normally happen but isn't ambiguous). But if these columns have a mix of percentages and floats, that's ambiguous/wrong (e.g. 0.2 - we have no way of knowing if this was supposed to be 20% or 0.2%).

For the first two cases I mentioned, for us to be able to use fastexcel instead of openpyxl, we'd need behavior like this:

We get a mix of arbitrary types matching their original in the dataframe - this is the easier since we can handle the complex checks ourselves later. Getting the dataframe doesn't error, so we can later map bad data to specific cells in the original file ourselves just once. This allows us to check/report all errors in one place (we need to report not just type errors but also value errors like "this date is in the future").
Or we could type e.g. a string cell as string or number and handle the conversion ourselves; we could type a date column as date or string and later error ourselves. The downside of this is that if there was wildly bad data (e.g. some unforeseen bad type like a date in a string column), we'd now additionally have to parse any errors given when loading the file and cannot report all errors at once.
Or we could try to rely on fastexcel for all type errors (we type a date column as just date, and fastexcel errors if there are strings). We'd still have to check/report errors in two places in two different ways though.

For the third problem, I have no idea if there is even anything fastexcel could do about this. If percentages and floats are both just converted to floats, we have no way of knowing what the original data was. To address this, we currently have to manually go through every cell for columns that should be percentages with openpyxl and explicitly check the excel type.

Unfortunately, working with excel to correctly detect bad data is complicated. Not sure if fastexcel can handle our use case, but it would be nice if it could because openpyxl is painfully slow vs. calamine for the large files we have to work with.

armgabrielyan · 2024-03-25T10:54:18Z

@PrettyWood Hello! I am curious about the timeline for when this will be fixed and released. Thanks!

lukapeschke · 2024-03-25T11:10:12Z

@armgabrielyan Hello! We just released v0.10.0 this morning. It allows to enforce a dtype for a column: https://fastexcel.toucantoco.dev/fastexcel.html#ExcelReader.load_sheet . If you want a column of mixed strings and timestamps as a string, you can now use dtypes={'my_column': 'string'} (note that if the string dtype is not specified, no conversion will be done implicitly).

@noctuid could you please let us know if this works for you ?

For more advanced mixed dtypes, we're thinking about providing a to_python method, which would return a list of lists of python objects. This would be slower and not very efficient memory-wise, as we wouldn't be able to rely on Arrow for data representation, but would allow for greater flexibility. Progress for that is tracked in #197

armgabrielyan · 2024-03-25T11:18:34Z

@lukapeschke It sounds good. However, we use fastexcel as an engine for reading Excel spreadsheets in polars and it looks like polars does not support this new functionality yet.

lukapeschke · 2024-03-25T11:22:49Z

@armgabrielyan I believe it should be possible via read_options: https://docs.pola.rs/py-polars/html/reference/api/polars.read_excel.html as polars passes it though to load_sheet: https://github.com/pola-rs/polars/blob/53f55367d1428b6d4ab51a7b17a8dbf4c003ac43/py-polars/polars/io/spreadsheet/functions.py#L821

armgabrielyan · 2024-03-25T11:42:33Z

@lukapeschke You are correct, it is possible to do this via read_options. Thank you so much!

armgabrielyan · 2024-03-25T11:45:05Z

@lukapeschke In our case, the column names or the number of columns is not pre-defined. I was wondering if there is any way to specify the data types for all columns to be "string", for example.

lukapeschke · 2024-03-25T12:39:22Z

@armgabrielyan No, that's currently not possible, but feel free to open a different issue if you'd like support for that 🙂

armgabrielyan · 2024-03-25T13:17:24Z

@lukapeschke Sounds great! Thank you so much for your support!

noctuid · 2024-03-25T14:12:28Z

could you please let us know if this works for you ?

It doesn't. Unfortunately to_python wouldn't either because of the percentage edge case I mentioned (assuming both excel percentage type and excel float would be converted to python float, there is no way to know what type the original data was).

lukapeschke · 2024-03-26T13:53:18Z

@noctuid unfortunately, the percentage edge case would require some work in calamine to properly support the excel Percent type, as it is not supported there for now: https://github.com/tafia/calamine/blob/master/src/datatype.rs#L23-L44 .

Are the other points (apart from the percentage case) working for you with v0.10.0 ? If yes, can we close this and create a separate issue for the percentage case ?

noctuid · 2024-03-26T17:20:55Z

With #197, we could handle the other edge cases I mentioned, but I think it would be preferable for us to only use to_python as a fallback if there is bad data.

We'd need some control over the coercion to detect bad data. Will it coerce strings to dates currently? For example, for a dtype string column, we would want coercion to string from number (or anything). However, for a date dtype column, we would want an error for a string in that column, even if it could be coerced to a date.

lukapeschke · 2024-03-28T08:54:24Z

@noctuid Right now, dates will be coerced to strings only if the dtype of the column is explicitly set to string. Implicit coercion will fail, as we did not want to be too smart and return unexpected data.

I created an issue to explicitly describe that behaviour in the docs: #215

In the future, it'd be nice to be able to specify how multi-dtype coercion should work

lukapeschke · 2024-07-02T07:49:46Z

With #245 mixed dtypes and string columns are now automatically coerced to strings. Since this can be unexpected behaviour for some users, there will be an option to completely disable automatic coercion (tracked in #247).

In the long term, we'd like to offer options to support mixed dtypes. This could be with a to_python method (tracked in #247), or through arrow union types.

PrettyWood added this to the v0.9.1 milestone Feb 19, 2024

lukapeschke added 🦀 rust 🦀 Pull requests that edit Rust code feature request labels Feb 20, 2024

lukapeschke added discussion and removed 🦀 rust 🦀 Pull requests that edit Rust code labels Feb 20, 2024

PrettyWood modified the milestones: v0.9.1, v0.10.0 Feb 20, 2024

PrettyWood modified the milestones: v0.10.0, v0.11.0 Mar 21, 2024

noctuid mentioned this issue Apr 19, 2024

Feature Request: Reading (and writing) formatting tafia/calamine#404

Open

PrettyWood mentioned this issue Jul 1, 2024

refactor(dtype): automatically cast mixed string and date columns to strings #245

Merged

PrettyWood closed this as completed in #245 Jul 1, 2024

lukapeschke reopened this Jul 2, 2024

lukapeschke closed this as completed Jul 2, 2024

lukapeschke reopened this Jul 2, 2024

lukapeschke closed this as completed Jul 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support mix of date/timestamp and strings #184

Support mix of date/timestamp and strings #184

PrettyWood commented Feb 19, 2024

lukapeschke commented Feb 20, 2024

PrettyWood commented Feb 20, 2024

lukapeschke commented Feb 20, 2024

durgeksh commented Feb 22, 2024

noctuid commented Feb 29, 2024

PrettyWood commented Feb 29, 2024

lukapeschke commented Feb 29, 2024

noctuid commented Mar 1, 2024

armgabrielyan commented Mar 25, 2024

lukapeschke commented Mar 25, 2024

armgabrielyan commented Mar 25, 2024

lukapeschke commented Mar 25, 2024

armgabrielyan commented Mar 25, 2024

armgabrielyan commented Mar 25, 2024

lukapeschke commented Mar 25, 2024

armgabrielyan commented Mar 25, 2024

noctuid commented Mar 25, 2024

lukapeschke commented Mar 26, 2024

noctuid commented Mar 26, 2024

lukapeschke commented Mar 28, 2024

lukapeschke commented Jul 2, 2024

Support mix of date/timestamp and strings #184

Support mix of date/timestamp and strings #184

Comments

PrettyWood commented Feb 19, 2024

lukapeschke commented Feb 20, 2024

PrettyWood commented Feb 20, 2024

lukapeschke commented Feb 20, 2024

durgeksh commented Feb 22, 2024

noctuid commented Feb 29, 2024

PrettyWood commented Feb 29, 2024

lukapeschke commented Feb 29, 2024

noctuid commented Mar 1, 2024

armgabrielyan commented Mar 25, 2024

lukapeschke commented Mar 25, 2024

armgabrielyan commented Mar 25, 2024

lukapeschke commented Mar 25, 2024

armgabrielyan commented Mar 25, 2024

armgabrielyan commented Mar 25, 2024

lukapeschke commented Mar 25, 2024

armgabrielyan commented Mar 25, 2024

noctuid commented Mar 25, 2024

lukapeschke commented Mar 26, 2024

noctuid commented Mar 26, 2024

lukapeschke commented Mar 28, 2024

lukapeschke commented Jul 2, 2024