Handle empty string in Ergast API (#432) #433

harningle · 2023-08-02T21:56:12Z

Ergast API sometimes returns empty string where it is expected to be an integer. When converting Ergast json response to pandas, we want to convert them to integer, and int() won't accept an empty string as input. We need to check if a variable is empty before casting it to integer.

Working example
Ergast gives '12345', and we do int('12345') to get the number 12345

Bug
Ergast gives ''. We do int('') and get ValueError: invalid literal for int() with base 10: ''

Fix
Check if it's empty before doing int(). If empty, return None

Detailed explanation is here: #432 (comment).

theOehrly · 2023-08-28T12:59:00Z

Hi, sorry for not responding to this for a month now. I didn't have any time for FastF1 devlopment over the last few weeks.

There are a few issues with this PR in its current state:

There are (probably accidental) changes to many other files
We cannot use None instead of an integer, because that breaks the column datatypes in Pandas. Therefore, I'd suggest using -1 as an alterantive error value.
This is the most complicated one. The definition of how the conversion is handled should fully be defined in the "structure tree" of the API. I don't want this to be handled with an if mapping['type'] is .... This goes against the modular approach of the API structure definition. Instead, we should define a function (similar to the datetime and timedelta conversion) def save_int(string) -> int: ... that handles the value error and returns -1 in case of the ValueError. Then, in the structure objects for the subcategories 'type': int should be replaced with 'type': save_int. It's probably a good idea to do this for all instances where int is used directly for conversion from string. But some more thought should be given to this, to determine whether that actually makes sense everywhere.
It's probably a good idea to do the same for all values where float is used for conversion directly. But here, nan can be returned, in case of an error.

(Also, this could be considered as an error in the Ergast data and I'll probably make a list for every place where this occurs and report it to Ergast. But adding better error handling here is still a good idea.)

@harningle do you want to give this a try and update this PR (feel free to force push to override the previous commits or even open a new PR if that's easier for you). Please tell me if you want to work on this or if someone else should do that.

harningle · 2023-08-28T21:42:54Z

Hi, sorry for not responding to this for a month now. I didn't have any time for FastF1 devlopment over the last few weeks.

There are a few issues with this PR in its current state:

There are (probably accidental) changes to many other files

We cannot use None instead of an integer, because that breaks the column datatypes in Pandas. Therefore, I'd suggest using -1 as an alterantive error value.

This is the most complicated one. The definition of how the conversion is handled should fully be defined in the "structure tree" of the API. I don't want this to be handled with an if mapping['type'] is .... This goes against the modular approach of the API structure definition. Instead, we should define a function (similar to the datetime and timedelta conversion) def save_int(string) -> int: ... that handles the value error and returns -1 in case of the ValueError. Then, in the structure objects for the subcategories 'type': int should be replaced with 'type': save_int. It's probably a good idea to do this for all instances where int is used directly for conversion from string. But some more thought should be given to this, to determine whether that actually makes sense everywhere.

It's probably a good idea to do the same for all values where float is used for conversion directly. But here, nan can be returned, in case of an error.

(Also, this could be considered as an error in the Ergast data and I'll probably make a list for every place where this occurs and report it to Ergast. But adding better error handling here is still a good idea.)

@harningle do you want to give this a try and update this PR (feel free to force push to override the previous commits or even open a new PR if that's easier for you). Please tell me if you want to work on this or if someone else should do that.

Yes sure happy to do this! I'm going to Monza this weekend so may update this later.

theOehrly · 2023-08-29T07:38:01Z

@harningle Enjoy the race weekend :)

* add tests for `save_int` and `save_float` in `fastf1.ergast.structure`

harningle · 2023-09-06T21:56:28Z

There are (probably accidental) changes to many other files

This is probably due to CRLF vs LF difference between Windows and Unix. My last commit was done in Mac. Sorry I didn't notice that. Already cleaned it.

We cannot use None instead of an integer, because that breaks the column datatypes in Pandas. Therefore, I'd suggest using -1 as an alterantive error value.

Make sense. We now use -1 where the integer string is not a valid integer, e.g. an empty string.

I also found a separate issue: we have missing's in a should-be-integer column, which eventually makes the column to be float. For instance, in 1954 British GP, Ergast has totalRaceTimeMillis for two drivers only, and thus the final dataframe has totalRaceTimeMillis as a float column, while totalRaceTimeMillis is an integer column for 2021 Belgian GP. Not sure if this is the expected behaviour.

from fastf1.ergast import Ergast

ergast = Ergast()

# `totalRaceTimeMillis` is a float column for 1954 British GP
raw_resp = ergast.get_race_results(season=1954, round=5, result_type='raw')[0]
df = ergast.get_race_results(season=1954, round=5).content[0]
print('Time' in raw_resp['Results'][0])
# True
print('Time' in raw_resp['Results'][5])
# False
print(df['totalRaceTimeMillis'].dtype)
# float64

# `totalRaceTimeMillis` is an integer column for 2021 Belgian GP
df = ergast.get_race_results(season=2021, round=12).content[0]
print(df['totalRaceTimeMillis'].dtype)
# int64

This is the most complicated one. The definition of how the conversion is handled should fully be defined in the "structure tree" of the API. I don't want this to be handled with an if mapping['type'] is .... This goes against the modular approach of the API structure definition. Instead, we should define a function (similar to the datetime and timedelta conversion) def save_int(string) -> int: ... that handles the value error and returns -1 in case of the ValueError. Then, in the structure objects for the subcategories 'type': int should be replaced with 'type': save_int. It's probably a good idea to do this for all instances where int is used directly for conversion from string. But some more thought should be given to this, to determine whether that actually makes sense everywhere.

It's probably a good idea to do the same for all values where float is used for conversion directly. But here, nan can be returned, in case of an error.

(Also, this could be considered as an error in the Ergast data and I'll probably make a list for every place where this occurs and report it to Ergast. But adding better error handling here is still a good idea.)

We now handle the string conversion in save_int and save_float.

For integer strings, I use int only for season and round, where there shouldn't be any invalid integer string. All other places, such as fastestLapTimeMillis, are handled by save_int.
All float strings are handled by save_float.

I can have a look at Ergast data (the csv database) and list all errors if you haven't done so.

theOehrly · 2023-09-08T16:42:21Z

Looks good, very well commented, documented and tested as well 👍

We now handle the string conversion in save_int and save_float.
For integer strings, I use int only for season and round, where there shouldn't be any invalid integer string. All other places, such as fastestLapTimeMillis, are handled by save_int.
All float strings are handled by save_float.

Is there any harm in using save_int there as well? After all, there "shouldn't" be any invalid integer strings anywhere else as well. But if there ever is an error in the returned data, this would then again result in an unhandled exception.

I also found a separate issue: we have missing's in a should-be-integer column, which eventually makes the column to be float.

Yes, this is an issue with Python all the time. I did some digging just now and one solution might be to use the pandas.NA type and then call .convert_dtypes on the DataFrame. That will cast the column to Int64 which is different from int64 and supports NA types (https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html).
For that to work, we need to pass a new pandas=True argument to the flattening methods. Then the save_int function needs a use_na keyword argument and needs to be handled as a special case in the flattening methods. Then, there will still be Int64 and int64 depending on if there are NA types or not.

But I am not a huge fan of pandas' custom data types because they mostly don't play nicely with anything but pandas. For example, they are impossible to plot with matplotlib, because it has no idea what a pandas.NAType value is. If you convert the data to numpy, you are stuck with an object type array (if it includes NAType). Barely anybody knows that these types exist and many users might be confused by them.

The most compatible solution would be to just give up on integers and convert everything numeric to float. That's a bit ridiculous only for NaN support but it certainly is the most compatible solution.

What's your opinion on that? I might ask some other people/users as well and see what they think.

I can have a look at Ergast data (the csv database) and list all errors if you haven't done so.

I haven't looked at that yet.

This is probably due to CRLF vs LF difference between Windows and Unix. My last commit was done in Mac. Sorry I didn't notice that. Already cleaned it.

Yes, I mostly code on Windows. But IMO git auto converts CRLF to LF on commit. I think the issue where file mode changes if I remember correctly. Doesn't really matter, though, it's fixed.

harningle · 2023-09-08T22:26:55Z

Is there any harm in using save_int there as well? After all, there "shouldn't" be any invalid integer strings anywhere else as well. But if there ever is an error in the returned data, this would then again result in an unhandled exception.

Yes I agree. Using save_int for all interger strings makes it more consistent, and avoid confusion like "oh why they use int for this one but not that one".

What's your opinion on that? I might ask some other people/users as well and see what they think.

No I don't like pandas type either, especially pd.NA. They have very different behaviours from np.nan: plotting, boolean of NA, etc. I actually just had this very issue on pd.NA in my job this summer... So IMO just keep it as float.

I can have a look at Ergast data (the csv database) and list all errors if you haven't done so.

I haven't looked at that yet.

I can do this later. Maybe open a new discussion afterwards?

theOehrly · 2023-09-09T16:11:46Z

No I don't like pandas type either, especially pd.NA. They have very different behaviours from np.nan: plotting, boolean of NA, etc. I actually just had this very issue on pd.NA in my job this summer... So IMO just keep it as float.

What I meant was, should we just cast every numeric value to float explicitly. Instead of casting them to int and having pandas fall back to float if NaN values exist. That would just be a consistency thing, so it's always the same type. But I'm not sure if it actually matters that much here. I guess we can just leave it as it is.

I can have a look at Ergast data (the csv database) and list all errors if you haven't done so.

I haven't looked at that yet.

I can do this later. Maybe open a new discussion afterwards?

Yes, a separate discussion for that is a good idea

harningle · 2023-09-09T16:26:41Z

What I meant was, should we just cast every numeric value to float explicitly. Instead of casting them to int and having pandas fall back to float if NaN values exist. That would just be a consistency thing, so it's always the same type. But I'm not sure if it actually matters that much here. I guess we can just leave it as it is.

Oh I misread that. I agree we can leave it as it is, and perhaps add something in the API mapping in documentation, like "the rawdata in those columns are integers, but if we have any missing's in some cell, it will be in float type eventually".

theOehrly · 2023-09-19T14:26:40Z

I haven't forgotten about this, just FYI. I'll merge this once I have time to also maybe add a note to the docs.

(cherry picked from commit dd6b4c5)

harningle closed this Sep 5, 2023

harningle force-pushed the master branch from cf7db8a to 317bacf Compare September 5, 2023 22:50

harningle added 2 commits September 6, 2023 00:40

FIX: handle invalid integer and float string from Ergast API

17759e6

FIX: incorrect regex for matching float string

2b8df2c

* add tests for `save_int` and `save_float` in `fastf1.ergast.structure`

harningle reopened this Sep 6, 2023

refactor: exceed 79 line width

5fab6e1

save_int for all integer string from Ergast

a48b2a7

theOehrly merged commit dd6b4c5 into theOehrly:master Sep 21, 2023
7 checks passed

theOehrly pushed a commit that referenced this pull request Oct 26, 2023

FIX: Handle empty strings in Ergast API response (#433)

dc81d64

(cherry picked from commit dd6b4c5)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle empty string in Ergast API (#432) #433

Handle empty string in Ergast API (#432) #433

harningle commented Aug 2, 2023 •

edited

Loading

theOehrly commented Aug 28, 2023

harningle commented Aug 28, 2023

theOehrly commented Aug 29, 2023

harningle commented Sep 6, 2023

theOehrly commented Sep 8, 2023

harningle commented Sep 8, 2023

theOehrly commented Sep 9, 2023

harningle commented Sep 9, 2023

theOehrly commented Sep 19, 2023

Handle empty string in Ergast API (#432) #433

Handle empty string in Ergast API (#432) #433

Conversation

harningle commented Aug 2, 2023 • edited Loading

theOehrly commented Aug 28, 2023

harningle commented Aug 28, 2023

theOehrly commented Aug 29, 2023

harningle commented Sep 6, 2023

theOehrly commented Sep 8, 2023

harningle commented Sep 8, 2023

theOehrly commented Sep 9, 2023

harningle commented Sep 9, 2023

theOehrly commented Sep 19, 2023

harningle commented Aug 2, 2023 •

edited

Loading