ENH: Decimal year #60391

Open
1 of 3 tasks
dshean opened this issue Nov 21, 2024 · 6 comments
Labels
Datetime (Datetime data dtype), Enhancement, Needs Discussion (Requires discussion from core team before further action)

Comments

@dshean

dshean commented Nov 21, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could use pandas to quickly convert datetime/Timestamp objects to "decimal year" floating point numbers for subsequent visualization and analysis.

A number of plotting packages (e.g., GeoPandas, matplotlib) encounter issues when casting datetime/Timestamp objects to float. For example, I often encounter errors when trying to create a choropleth map to visualize a GeoDataFrame column containing datetime objects. Decimal years also simplify the legend/colorbar labels.

[Image: example decimal year map]

Feature Description

Here is a simple function that does this. It's not perfect, but it does the job. It would need to be re-implemented as a Timestamp and/or dt accessor property (dt.decyear), which should be relatively simple, I think.

# Decimal year (useful for plotting)
from datetime import datetime as dt
import time

def toYearFraction(date):
    """Return `date` as a float year (the year plus the elapsed fraction of it)."""
    def sinceEpoch(date):
        # Seconds since the epoch. time.mktime interprets the tuple as local
        # time, and timetuple() drops sub-second precision.
        return time.mktime(date.timetuple())
    s = sinceEpoch

    year = date.year
    startOfThisYear = dt(year=year, month=1, day=1)
    startOfNextYear = dt(year=year+1, month=1, day=1)

    # Elapsed seconds in this year divided by the year's total seconds
    yearElapsed = s(date) - s(startOfThisYear)
    yearDuration = s(startOfNextYear) - s(startOfThisYear)
    fraction = yearElapsed/yearDuration

    return date.year + fraction

Alternative Solutions

Define and apply a custom function:
df['dt_col_decyear'] = df['dt_col'].apply(toYearFraction)
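
A variant of the per-element function can be sketched with pandas Timestamp arithmetic instead of time.mktime, which avoids the dependence on the machine's local time zone. This is only a minimal sketch and not part of pandas: the name to_year_fraction is hypothetical, and it assumes each input is a single pd.Timestamp (naive or tz-aware) with no missing values.

```python
import pandas as pd


def to_year_fraction(ts: pd.Timestamp) -> float:
    """Hypothetical helper: year plus the elapsed fraction of that year."""
    # Build the year boundaries in the same time zone as the input
    # (ts.tz is None for naive timestamps).
    start = pd.Timestamp(year=ts.year, month=1, day=1, tz=ts.tz)
    end = pd.Timestamp(year=ts.year + 1, month=1, day=1, tz=ts.tz)
    # Dividing one Timedelta by another yields a plain float.
    return ts.year + (ts - start) / (end - start)


df = pd.DataFrame({"dt_col": pd.to_datetime(["2024-05-30", "2025-05-30"])})
df["dt_col_decyear"] = df["dt_col"].apply(to_year_fraction)
```

Like the original function, this is applied element by element, so it will be slow on large columns.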

Additional Context

When attempting to plot a column containing datetime values...

gdf.plot(column='dt_col', legend=True)

File ~/sw/miniconda3/envs/shean_py3/lib/python3.12/site-packages/geopandas/plotting.py:175, in _plot_polygon_collection(ax, geoms, values, color, cmap, vmin, vmax, autolim, **kwargs)
    172 collection = PatchCollection([_PolygonPatch(poly) for poly in geoms], **kwargs)
    174 if values is not None:
--> 175     collection.set_array(np.asarray(values))
    176     collection.set_cmap(cmap)
    177     if "norm" not in kwargs:

File ~/sw/miniconda3/envs/shean_py3/lib/python3.12/site-packages/matplotlib/cm.py:452, in ScalarMappable.set_array(self, A)
    450 A = cbook.safe_masked_invalid(A, copy=True)
    451 if not np.can_cast(A.dtype, float, "same_kind"):
--> 452     raise TypeError(f"Image data of dtype {A.dtype} cannot be "
    453                     "converted to float")
    455 self._A = A
    456 if not self.norm.scaled():

TypeError: Image data of dtype object cannot be converted to float
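
The dtype check that raises in the traceback can be reproduced with NumPy alone. The sketch below is illustrative and does not use the original GeoDataFrame: the sample dates are made up, and the day-resolution decimal year is just one possible float encoding (it ignores the time of day).

```python
import numpy as np
import pandas as pd

# An object array of Timestamps, the same dtype the traceback reports.
values = np.array(
    [pd.Timestamp("2024-05-30"), pd.Timestamp("2025-05-30")], dtype=object
)
# Same test as in ScalarMappable.set_array: object cannot be cast to float.
print(np.can_cast(values.dtype, float, "same_kind"))  # False

# Converting to floats first passes the same check.
s = pd.Series(pd.to_datetime(["2024-05-30", "2025-05-30"]))
days_in_year = np.where(s.dt.is_leap_year, 366, 365)
as_float = (s.dt.year + (s.dt.dayofyear - 1) / days_in_year).to_numpy()
print(np.can_cast(as_float.dtype, float, "same_kind"))  # True
```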
dshean added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Nov 21, 2024
@rhshadrach
Member

Thanks for the request. Can you provide an example input, a proposed syntax for the operation, and the output you would expect?

rhshadrach added the Datetime (Datetime data dtype) and Needs Info (Clarification about behavior needed to assess issue) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Nov 21, 2024
@dshean
Author

dshean commented Nov 21, 2024

Sure. Something like df['dt_col'].dt.decyear could work well, using the dt accessor.

It would convert a column of datetime64 values (e.g., 2024-11-15 12:13:12+00:00) to float64 (e.g., 2024.872976).
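
Until something like dt.decyear exists, a similar syntax can be approximated with a custom Series accessor registered through pd.api.extensions.register_series_accessor. The sketch below is hypothetical and not a pandas feature: the accessor name decyear_sketch, the class name, and the method to_float are made up for illustration. It assumes a datetime64 column with no missing values, and for tz-aware data it works on local wall-clock time, which is an assumption that matters in zones with DST.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("decyear_sketch")
class DecimalYearAccessor:
    """Hypothetical accessor; not part of pandas."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def to_float(self):
        s = self._obj
        # Assumption: drop the time zone and work on wall-clock time.
        if s.dt.tz is not None:
            s = s.dt.tz_localize(None)
        year = s.dt.year
        # Assemble Jan 1 of this year and of the next year for every row.
        start = pd.to_datetime(pd.DataFrame({"year": year, "month": 1, "day": 1}))
        end = pd.to_datetime(pd.DataFrame({"year": year + 1, "month": 1, "day": 1}))
        # timedelta / timedelta gives float64, so leap years are handled per row.
        return year + (s - start) / (end - start)


s = pd.Series(pd.to_datetime(["2024-11-15 12:13:12+00:00"]))
print(s.decyear_sketch.to_float())
```

With the example timestamp above, the result should agree with the 2024.872976 value quoted in the comment.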

@AryanK1511

@rhshadrach if you don't mind, I would love to work on this issue.

rhshadrach added the Needs Discussion (Requires discussion from core team before further action) label and removed the Needs Info (Clarification about behavior needed to assess issue) label on Nov 22, 2024
@rhshadrach
Member

Here is a vectorized version:

import pandas as pd

dates = ["2024-05-30", "2025-05-30"]
df = pd.DataFrame({"date": dates})
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

year = df["date"].dt.year
days = (pd.to_datetime(year+1, format='%Y') - pd.to_datetime(year, format='%Y')).dt.days
result = year + (df["date"] - pd.to_datetime(year, format='%Y')) / (days * pd.to_timedelta(1, unit="D"))
print(result)
# 0    2024.409836
# 1    2025.408219
# Name: date, dtype: float64

@AryanK1511 - I think this needs discussion from the core team. It seems straightforward to calculate this from the existing API, so I'm not sure it warrants inclusion.

@dshean
Author

dshean commented Nov 23, 2024

Thanks @rhshadrach. Nice simple solution. My only suggestion would be to include timestamps as well.

dates = ["2024-05-30 12:00:00", "2024-05-30 12:00:01", "2025-05-30 12:00:00"]
df = pd.DataFrame({"date": dates})
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d %H:%M:%S")
...
pd.set_option("display.precision", 8)
print(result)
#0    2024.41120219
#1    2024.41120222
#2    2025.40958904

I agree this is a straightforward calculation. The request is mostly one of convenience and centralization, so each user doesn't have to implement their own function or include those 3 lines whenever they want to do this.

@rhshadrach
Member

> The request is mostly one of convenience and centralization, so each user doesn't have to implement their own function or include those 3 lines whenever they want to do this.

I do not think using such a criterion is sustainable for the pandas API. Rather, the goal of pandas should be to provide an API with the fundamental tools, so that users can combine various operations in a short and straightforward manner to accomplish their needs. I believe that is already being done here.
