Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle empty string in Ergast API (#432) #433

Merged
merged 4 commits into from
Sep 21, 2023

Conversation

harningle
Copy link
Contributor

@harningle harningle commented Aug 2, 2023

Ergast API sometimes returns empty string where it is expected to be an integer. When converting Ergast json response to pandas, we want to convert them to integer, and int() won't accept an empty string as input. We need to check if a variable is empty before casting it to integer.

Working example
Ergast gives '12345', and we do int('12345') to get the number 12345

Bug
Ergast gives ''. We do int('') and get ValueError: invalid literal for int() with base 10: ''

Fix
Check if it's empty before doing int(). If empty, return None

Detailed explanation is here: #432 (comment).

@theOehrly
Copy link
Owner

Hi, sorry for not responding to this for a month now. I didn't have any time for FastF1 devlopment over the last few weeks.

There are a few issues with this PR in its current state:

  1. There are (probably accidental) changes to many other files
  2. We cannot use None instead of an integer, because that breaks the column datatypes in Pandas. Therefore, I'd suggest using -1 as an alterantive error value.
  3. This is the most complicated one. The definition of how the conversion is handled should fully be defined in the "structure tree" of the API. I don't want this to be handled with an if mapping['type'] is .... This goes against the modular approach of the API structure definition. Instead, we should define a function (similar to the datetime and timedelta conversion) def save_int(string) -> int: ... that handles the value error and returns -1 in case of the ValueError. Then, in the structure objects for the subcategories 'type': int should be replaced with 'type': save_int. It's probably a good idea to do this for all instances where int is used directly for conversion from string. But some more thought should be given to this, to determine whether that actually makes sense everywhere.
  4. It's probably a good idea to do the same for all values where float is used for conversion directly. But here, nan can be returned, in case of an error.

(Also, this could be considered as an error in the Ergast data and I'll probably make a list for every place where this occurs and report it to Ergast. But adding better error handling here is still a good idea.)

@harningle do you want to give this a try and update this PR (feel free to force push to override the previous commits or even open a new PR if that's easier for you). Please tell me if you want to work on this or if someone else should do that.

@harningle
Copy link
Contributor Author

Hi, sorry for not responding to this for a month now. I didn't have any time for FastF1 devlopment over the last few weeks.

There are a few issues with this PR in its current state:

  1. There are (probably accidental) changes to many other files
  2. We cannot use None instead of an integer, because that breaks the column datatypes in Pandas. Therefore, I'd suggest using -1 as an alterantive error value.
  3. This is the most complicated one. The definition of how the conversion is handled should fully be defined in the "structure tree" of the API. I don't want this to be handled with an if mapping['type'] is .... This goes against the modular approach of the API structure definition. Instead, we should define a function (similar to the datetime and timedelta conversion) def save_int(string) -> int: ... that handles the value error and returns -1 in case of the ValueError. Then, in the structure objects for the subcategories 'type': int should be replaced with 'type': save_int. It's probably a good idea to do this for all instances where int is used directly for conversion from string. But some more thought should be given to this, to determine whether that actually makes sense everywhere.
  4. It's probably a good idea to do the same for all values where float is used for conversion directly. But here, nan can be returned, in case of an error.

(Also, this could be considered as an error in the Ergast data and I'll probably make a list for every place where this occurs and report it to Ergast. But adding better error handling here is still a good idea.)

@harningle do you want to give this a try and update this PR (feel free to force push to override the previous commits or even open a new PR if that's easier for you). Please tell me if you want to work on this or if someone else should do that.

Yes sure happy to do this! I'm going to Monza this weekend so may update this later.

@theOehrly
Copy link
Owner

@harningle Enjoy the race weekend :)

@harningle
Copy link
Contributor Author

  1. There are (probably accidental) changes to many other files

This is probably due to CRLF vs LF difference between Windows and Unix. My last commit was done in Mac. Sorry I didn't notice that. Already cleaned it.

  1. We cannot use None instead of an integer, because that breaks the column datatypes in Pandas. Therefore, I'd suggest using -1 as an alterantive error value.

Make sense. We now use -1 where the integer string is not a valid integer, e.g. an empty string.

I also found a separate issue: we have missing's in a should-be-integer column, which eventually makes the column to be float. For instance, in 1954 British GP, Ergast has totalRaceTimeMillis for two drivers only, and thus the final dataframe has totalRaceTimeMillis as a float column, while totalRaceTimeMillis is an integer column for 2021 Belgian GP. Not sure if this is the expected behaviour.

from fastf1.ergast import Ergast

ergast = Ergast()

# `totalRaceTimeMillis` is a float column for 1954 British GP
raw_resp = ergast.get_race_results(season=1954, round=5, result_type='raw')[0]
df = ergast.get_race_results(season=1954, round=5).content[0]
print('Time' in raw_resp['Results'][0])
# True
print('Time' in raw_resp['Results'][5])
# False
print(df['totalRaceTimeMillis'].dtype)
# float64

# `totalRaceTimeMillis` is an integer column for 2021 Belgian GP
df = ergast.get_race_results(season=2021, round=12).content[0]
print(df['totalRaceTimeMillis'].dtype)
# int64
  1. This is the most complicated one. The definition of how the conversion is handled should fully be defined in the "structure tree" of the API. I don't want this to be handled with an if mapping['type'] is .... This goes against the modular approach of the API structure definition. Instead, we should define a function (similar to the datetime and timedelta conversion) def save_int(string) -> int: ... that handles the value error and returns -1 in case of the ValueError. Then, in the structure objects for the subcategories 'type': int should be replaced with 'type': save_int. It's probably a good idea to do this for all instances where int is used directly for conversion from string. But some more thought should be given to this, to determine whether that actually makes sense everywhere.

  2. It's probably a good idea to do the same for all values where float is used for conversion directly. But here, nan can be returned, in case of an error.

(Also, this could be considered as an error in the Ergast data and I'll probably make a list for every place where this occurs and report it to Ergast. But adding better error handling here is still a good idea.)

We now handle the string conversion in save_int and save_float.

  • For integer strings, I use int only for season and round, where there shouldn't be any invalid integer string. All other places, such as fastestLapTimeMillis, are handled by save_int.
  • All float strings are handled by save_float.

I can have a look at Ergast data (the csv database) and list all errors if you haven't done so.

@harningle harningle reopened this Sep 6, 2023
@theOehrly
Copy link
Owner

Looks good, very well commented, documented and tested as well 👍

We now handle the string conversion in save_int and save_float.
For integer strings, I use int only for season and round, where there shouldn't be any invalid integer string. All other places, such as fastestLapTimeMillis, are handled by save_int.
All float strings are handled by save_float.

Is there any harm in using save_int there as well? After all, there "shouldn't" be any invalid integer strings anywhere else as well. But if there ever is an error in the returned data, this would then again result in an unhandled exception.

I also found a separate issue: we have missing's in a should-be-integer column, which eventually makes the column to be float.

Yes, this is an issue with Python all the time. I did some digging just now and one solution might be to use the pandas.NA type and then call .convert_dtypes on the DataFrame. That will cast the column to Int64 which is different from int64 and supports NA types (https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html).
For that to work, we need to pass a new pandas=True argument to the flattening methods. Then the save_int function needs a use_na keyword argument and needs to be handled as a special case in the flattening methods. Then, there will still be Int64 and int64 depending on if there are NA types or not.

But I am not a huge fan of pandas' custom data types because they mostly don't play nicely with anything but pandas. For example, they are impossible to plot with matplotlib, because it has no idea what a pandas.NAType value is. If you convert the data to numpy, you are stuck with an object type array (if it includes NAType). Barely anybody knows that these types exist and many users might be confused by them.

The most compatible solution would be to just give up on integers and convert everything numeric to float. That's a bit ridiculous only for NaN support but it certainly is the most compatible solution.

What's your opinion on that? I might ask some other people/users as well and see what they think.

I can have a look at Ergast data (the csv database) and list all errors if you haven't done so.

I haven't looked at that yet.

This is probably due to CRLF vs LF difference between Windows and Unix. My last commit was done in Mac. Sorry I didn't notice that. Already cleaned it.

Yes, I mostly code on Windows. But IMO git auto converts CRLF to LF on commit. I think the issue where file mode changes if I remember correctly. Doesn't really matter, though, it's fixed.

@harningle
Copy link
Contributor Author

Is there any harm in using save_int there as well? After all, there "shouldn't" be any invalid integer strings anywhere else as well. But if there ever is an error in the returned data, this would then again result in an unhandled exception.

Yes I agree. Using save_int for all interger strings makes it more consistent, and avoid confusion like "oh why they use int for this one but not that one".

What's your opinion on that? I might ask some other people/users as well and see what they think.

No I don't like pandas type either, especially pd.NA. They have very different behaviours from np.nan: plotting, boolean of NA, etc. I actually just had this very issue on pd.NA in my job this summer... So IMO just keep it as float.

I can have a look at Ergast data (the csv database) and list all errors if you haven't done so.

I haven't looked at that yet.

I can do this later. Maybe open a new discussion afterwards?

@theOehrly
Copy link
Owner

No I don't like pandas type either, especially pd.NA. They have very different behaviours from np.nan: plotting, boolean of NA, etc. I actually just had this very issue on pd.NA in my job this summer... So IMO just keep it as float.

What I meant was, should we just cast every numeric value to float explicitly. Instead of casting them to int and having pandas fall back to float if NaN values exist. That would just be a consistency thing, so it's always the same type. But I'm not sure if it actually matters that much here. I guess we can just leave it as it is.

I can have a look at Ergast data (the csv database) and list all errors if you haven't done so.

I haven't looked at that yet.

I can do this later. Maybe open a new discussion afterwards?

Yes, a separate discussion for that is a good idea

@harningle
Copy link
Contributor Author

What I meant was, should we just cast every numeric value to float explicitly. Instead of casting them to int and having pandas fall back to float if NaN values exist. That would just be a consistency thing, so it's always the same type. But I'm not sure if it actually matters that much here. I guess we can just leave it as it is.

Oh I misread that. I agree we can leave it as it is, and perhaps add something in the API mapping in documentation, like "the rawdata in those columns are integers, but if we have any missing's in some cell, it will be in float type eventually".

@theOehrly
Copy link
Owner

I haven't forgotten about this, just FYI. I'll merge this once I have time to also maybe add a note to the docs.

@theOehrly theOehrly merged commit dd6b4c5 into theOehrly:master Sep 21, 2023
7 checks passed
theOehrly pushed a commit that referenced this pull request Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants