Handle empty string in Ergast API (#432) #433
Hi, sorry for not responding to this for a month now. I didn't have any time for FastF1 development over the last few weeks. There are a few issues with this PR in its current state:
(Also, this could be considered an error in the Ergast data, and I'll probably make a list of every place where this occurs and report it to Ergast. But adding better error handling here is still a good idea.) @harningle do you want to give this a try and update this PR? (Feel free to force push to override the previous commits, or even open a new PR if that's easier for you.) Please tell me if you want to work on this or if someone else should do that.
Yes sure, happy to do this! I'm going to Monza this weekend, so I may update this later.
@harningle Enjoy the race weekend :)
* add tests for `save_int` and `save_float` in `fastf1.ergast.structure`
This is probably due to the CRLF vs LF difference between Windows and Unix. My last commit was made on a Mac. Sorry, I didn't notice that. I've already cleaned it up.
Makes sense. We now use
I also found a separate issue: we have missing values in a should-be-integer column, which eventually turns the column into float. For instance, in the 1954 British GP, Ergast has
from fastf1.ergast import Ergast
ergast = Ergast()
# `totalRaceTimeMillis` is a float column for 1954 British GP
raw_resp = ergast.get_race_results(season=1954, round=5, result_type='raw')[0]
df = ergast.get_race_results(season=1954, round=5).content[0]
print('Time' in raw_resp['Results'][0])
# True
print('Time' in raw_resp['Results'][5])
# False
print(df['totalRaceTimeMillis'].dtype)
# float64
# `totalRaceTimeMillis` is an integer column for 2021 Belgian GP
df = ergast.get_race_results(season=2021, round=12).content[0]
print(df['totalRaceTimeMillis'].dtype)
# int64
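The dtype difference above can also be reproduced without a network request. The following is only a minimal sketch with made-up values standing in for `totalRaceTimeMillis`; it assumes nothing about FastF1 itself and only shows how pandas infers the column type.

```python
import pandas as pd

# Made-up values standing in for `totalRaceTimeMillis` (not real Ergast data)
complete = pd.DataFrame({'totalRaceTimeMillis': [5000, 5100, 5200]})
with_gap = pd.DataFrame({'totalRaceTimeMillis': [5000, None, 5200]})

# All values present: pandas infers an integer column
print(complete['totalRaceTimeMillis'].dtype)  # int64

# One missing value: None becomes NaN, so the column falls back to float
print(with_gap['totalRaceTimeMillis'].dtype)  # float64
```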
We now handle the string conversion in
I can have a look at the Ergast data (the CSV database) and list all errors if you haven't done so.
Looks good, very well commented, documented and tested as well 👍
Is there any harm in using
Yes, this is an issue with Python all the time. I did some digging just now and one solution might be to use the
But I am not a huge fan of pandas' custom data types because they mostly don't play nicely with anything but pandas. For example, they are impossible to plot with matplotlib, because it has no idea what a
The most compatible solution would be to just give up on integers and convert everything numeric to float. That's a bit ridiculous only for NaN support but it certainly is the most compatible solution. What's your opinion on that? I might ask some other people/users as well and see what they think.
I haven't looked at that yet.
Yes, I mostly code on Windows. But IMO git auto-converts CRLF to LF on commit. I think the issue was that the file mode changed, if I remember correctly. Doesn't really matter, though, it's fixed.
Yes I agree. Using
No I don't like pandas type either, especially
I can do this later. Maybe open a new discussion afterwards?
What I meant was: should we just cast every numeric value to float explicitly, instead of casting them to int and having pandas fall back to float if NaN values exist? That would just be a consistency thing, so it's always the same type. But I'm not sure if it actually matters that much here. I guess we can just leave it as it is.
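To make the trade-off concrete, here is a rough sketch of the explicit-float option that was considered and not adopted. The values are made up, and the cast is shown on plain pandas Series, not on FastF1's actual conversion code.

```python
import pandas as pd

# Made-up values; one series is complete, the other has a missing entry
complete = pd.Series([5000, 5100, 5200], name='totalRaceTimeMillis')
with_gap = pd.Series([5000, None, 5200], name='totalRaceTimeMillis')

# Implicit behaviour: the dtype depends on whether any value is missing
implicit = [complete.dtype, with_gap.dtype]

# Explicit cast: the dtype is always float64, regardless of the data
explicit = [complete.astype('float64').dtype, with_gap.astype('float64').dtype]

print(implicit)
print(explicit)
```

With the explicit cast, downstream code never has to branch on the column type, at the cost of storing whole numbers as floats.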
Yes, a separate discussion for that is a good idea
Oh, I misread that. I agree we can leave it as it is, and perhaps add something to the API mapping in the documentation, like "the raw data in those columns are integers, but if any cell is missing, the column will end up as float".
I haven't forgotten about this, just FYI. I'll merge this once I have time to also maybe add a note to the docs.
(cherry picked from commit dd6b4c5)
The Ergast API sometimes returns an empty string where an integer is expected. When converting the Ergast JSON response to pandas, we want to convert these values to integers, and `int()` won't accept an empty string as input. We need to check if a variable is empty before casting it to integer.
Working example
Ergast gives `'12345'`, and we do `int('12345')` to get the number `12345`.
Bug
Ergast gives `''`. We do `int('')` and get `ValueError: invalid literal for int() with base 10: ''`.
Fix
Check if it's empty before doing `int()`. If empty, return `None`.
Detailed explanation is here: #432 (comment).
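The fix described above could look roughly like the sketch below. The helper name `to_int_or_none` is hypothetical and chosen for illustration only; it is not the actual `save_int` from `fastf1.ergast.structure`, whose exact behaviour may differ.

```python
from typing import Optional


def to_int_or_none(value: Optional[str]) -> Optional[int]:
    """Hypothetical helper: convert a string field to int, or None if empty."""
    # Check for a missing or empty value first, because int('') raises ValueError
    if value is None or value == '':
        return None
    return int(value)


print(to_int_or_none('12345'))  # 12345
print(to_int_or_none(''))       # None
```

Returning `None` here is what later leads pandas to store the column as float when any value is missing.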