Support TIMESTAMP_NTZ, DATE and TIME datatype #14398
base: master
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #14398 +/- ##
============================================
+ Coverage 61.75% 63.51% +1.76%
- Complexity 207 1568 +1361
============================================
Files 2436 2662 +226
Lines 133233 146806 +13573
Branches 20636 22451 +1815
============================================
+ Hits 82274 93241 +10967
- Misses 44911 46650 +1739
- Partials 6048 6915 +867
Cool! It looks like the PR is not yet finished (there are tests failing due to this new type) but it looks promising! I'll try to review it soon.
Thanks for working on this!
Thanks! I will fix the tests today
As suggested by @Jackie-Jiang, I think a design document would be great. There we can discuss the nomenclature, which from my point of view is not clear. I would like to open the discussion to using the same definitions as Postgres, which would mean having a TIMESTAMP and a TIMESTAMP_WITH_TIME_ZONE (or TIMESTAMP_TZ for short, which is how it is called in Calcite).
TIMESTAMP(LONG, NullValuePlaceHolder.LONG) {
  @Override
  public RelDataType toType(RelDataTypeFactory typeFactory) {
    return typeFactory.createSqlType(SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE);
  }
},
TIMESTAMP_NTZ(LONG, NullValuePlaceHolder.LONG) {
  @Override
  public RelDataType toType(RelDataTypeFactory typeFactory) {
    return typeFactory.createSqlType(SqlTypeName.TIMESTAMP);
  }
},
I guess you did this to follow the Spark terminology, but apart from my preference for the Postgres terminology, I think this breaks backward compatibility. Before this PR a Pinot TIMESTAMP was converted to a Calcite TIMESTAMP, and now we are converting it into a TIMESTAMP WITH LOCAL TIME ZONE.
I may be wrong, but I think what we should be doing here is mapping our TIMESTAMP_TZ to a Calcite TIMESTAMP_TZ.
Thanks for your review! From my understanding, the TIMESTAMP type in Calcite has no time zone, but the TIMESTAMP type in Pinot is interpreted everywhere through java.sql.Timestamp, which uses the local time zone, so I made the change here.
To minimize modifications to the existing TIMESTAMP type in Pinot, I followed the data types in Spark and added the TIMESTAMP_NTZ type.
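To make the distinction concrete, here is a small standalone sketch (not part of the PR; the class name and zones are only illustrative) showing why java.sql.Timestamp behaves like a timestamp with a local time zone, while LocalDateTime is a plain wall-clock value:

import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.util.TimeZone;

public class TimestampZoneDemo {
  public static void main(String[] args) {
    long epochMillis = 0L; // 1970-01-01T00:00:00Z

    // java.sql.Timestamp renders the same instant differently depending on the JVM default zone,
    // i.e. it behaves like a timestamp WITH LOCAL TIME ZONE.
    TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
    System.out.println(new Timestamp(epochMillis));          // 1970-01-01 00:00:00.0
    TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"));
    System.out.println(new Timestamp(epochMillis));          // 1970-01-01 08:00:00.0

    // A LocalDateTime carries no zone at all; it is just a wall-clock value,
    // which matches Calcite's plain (zone-less) TIMESTAMP semantics.
    System.out.println(LocalDateTime.of(1970, 1, 1, 0, 0));  // 1970-01-01T00:00
  }
}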
@@ -490,12 +540,24 @@ public Object toInternal(Object value) {
        return ((boolean) value) ? 1 : 0;
      case TIMESTAMP:
        return ((Timestamp) value).getTime();
      case TIMESTAMP_NTZ:
        return ((LocalDateTime) value).atZone(ZoneId.of("UTC")).toInstant().toEpochMilli();
@bziobrowski IIRC you knew a way to do this without allocating so many objects. Am I right?
That time I used joda.time's MutableDateTime, not java classes.
I agree that this line creates way too many temporary objects and should be rewritten.
In this specific case, this can be rewritten with (AFAIK) zero allocation with:
- return ((LocalDateTime) value).atZone(ZoneId.of("UTC")).toInstant().toEpochMilli();
+ return ((LocalDateTime) value).toEpochSecond(ZoneOffset.UTC) * 1000;
Thanks! But I think this will lose millisecond precision 😂
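For context, a minimal sketch (not from the PR; the class name is illustrative) comparing the seconds-only suggestion with two variants that keep the millisecond part; the second variant stays allocation-free by adding the nanosecond field back in:

import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class EpochMillisDemo {
  public static void main(String[] args) {
    LocalDateTime dt = LocalDateTime.of(2024, 11, 8, 12, 13, 14, 231_000_000);

    // Seconds only: the 231 ms are dropped.
    long secondsOnly = dt.toEpochSecond(ZoneOffset.UTC) * 1000;

    // Zero-allocation variant that keeps the millisecond part.
    long withMillis = dt.toEpochSecond(ZoneOffset.UTC) * 1000 + dt.getNano() / 1_000_000;

    // Single-allocation alternative producing the same value.
    long viaInstant = dt.toInstant(ZoneOffset.UTC).toEpochMilli();

    System.out.println(secondsOnly); // 231 ms smaller than the two values below
    System.out.println(withMillis);
    System.out.println(viaInstant);
  }
}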
@gortiz I think that snippet isn't correct, because the code assumes the offset is constant (and actually 0, while the local offset is likely not :) )
@chunxiaozheng Please consider:
public static void main(String[] args) {
  LocalDateTime dt = LocalDateTime.of(2024, 11, 8, 12, 13, 14, 231000000);
  ZoneId zoneId = ZoneId.of("Europe/Warsaw"); // zone should be cached
  ZoneRules rules = zoneId.getRules();
  ZoneOffset offset = rules.getOffset(dt);
  System.out.println("to epoch (UTC) " + dt.toEpochSecond(ZoneOffset.UTC) * 1000);
  System.out.println("to epoch " + dt.toEpochSecond(offset) * 1000);
  // parentheses added so the millisecond part is summed numerically rather than string-concatenated
  System.out.println("to epoch plus nanos " + (dt.toEpochSecond(offset) * 1000 + dt.getNano() / 1000000));
}
Thanks! Implementations of a TIMESTAMP type without a time zone, such as TIMESTAMP in Flink and TIMESTAMP_NTZ in Spark, parse values against UTC so that the time queried from any time zone is consistent. Therefore, for TIMESTAMP types without a time zone, I believe we need to use a constant (the UTC time zone) for parsing.
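To illustrate the point (a standalone sketch, not code from the PR; the class name and zones are made up): converting against the fixed UTC offset produces the same stored long on every host, whereas converting against some zone makes the stored value depend on that zone.

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class NtzUtcDemo {
  public static void main(String[] args) {
    LocalDateTime dt = LocalDateTime.of(2024, 11, 8, 12, 13, 14);

    // Fixed UTC offset: the stored long is identical everywhere.
    long utcMillis = dt.toInstant(ZoneOffset.UTC).toEpochMilli();

    // Zone-dependent conversion: the stored long shifts with the chosen zone.
    long warsawMillis = dt.atZone(ZoneId.of("Europe/Warsaw")).toInstant().toEpochMilli();
    long shanghaiMillis = dt.atZone(ZoneId.of("Asia/Shanghai")).toInstant().toEpochMilli();

    System.out.println(utcMillis);
    System.out.println(warsawMillis);   // differs from utcMillis by the Warsaw offset
    System.out.println(shanghaiMillis); // differs from utcMillis by the Shanghai offset
  }
}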
Thanks! I will write a design document for this PR, and then we can discuss it together.
if (map.containsKey(SqlTypeName.DATE)) {
  Set<SqlTypeName> fromTypes = new HashSet<>(map.get(SqlTypeName.DATE));
  fromTypes.addAll(TYPES);
  map.put(SqlTypeName.DATE, ImmutableSet.copyOf(fromTypes));
}
if (map.containsKey(SqlTypeName.TIME)) {
  Set<SqlTypeName> fromTypes = new HashSet<>(map.get(SqlTypeName.TIME));
  fromTypes.addAll(TYPES);
  map.put(SqlTypeName.TIME, ImmutableSet.copyOf(fromTypes));
}
I don't think we actually want to support these castings. Instead we should add the functions that use them. For example, in Postgres there is a function that adds date + integer. See https://www.postgresql.org/docs/current/functions-datetime.html
I don't know if we need to support the syntax of cast(12345 as DATE) or cast(12345 as TIME(3)). If we need this syntax, we need to add a similar implementation. Otherwise, when executing the above statement, the mapping rules in org.apache.calcite.sql.type.SqlTypeCoercionRule will be used, which do not support converting the Integer type to Date or Time.
It looks like at runtime we sometimes use longs to implement timestamps, dates and times, but at other times we use LocalDate and LocalTime. That seems a bit odd. I don't know if we are doing the same with Timestamp right now in master; we would need to study whether this is what we want.
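Without asserting how this PR stores them, here is a sketch (class name is illustrative) of how a long-based runtime representation could round-trip with LocalDate and LocalTime, assuming days-since-epoch for DATE and milliseconds-since-midnight for TIME:

import java.time.LocalDate;
import java.time.LocalTime;

public class DateTimeAsLongDemo {
  public static void main(String[] args) {
    // DATE as a long: days since 1970-01-01.
    LocalDate date = LocalDate.of(2024, 11, 8);
    long epochDay = date.toEpochDay();
    LocalDate dateBack = LocalDate.ofEpochDay(epochDay);

    // TIME as a long: milliseconds since midnight.
    LocalTime time = LocalTime.of(12, 13, 14, 231_000_000);
    long millisOfDay = time.toNanoOfDay() / 1_000_000;
    LocalTime timeBack = LocalTime.ofNanoOfDay(millisOfDay * 1_000_000);

    System.out.println(epochDay + " -> " + dateBack);
    System.out.println(millisOfDay + " -> " + timeBack);
  }
}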
Thanks! In fact, I added these new data types based on the implementation of
This PR adds the TIMESTAMP_NTZ, DATE and TIME data types.
In Flink, TIMESTAMP has no time zone, which corresponds to TIMESTAMP_NTZ in Pinot, and TIMESTAMP_LTZ has a local time zone, which corresponds to TIMESTAMP in Pinot.
In Spark, TIMESTAMP has a local time zone, which corresponds to TIMESTAMP in Pinot, and TIMESTAMP_NTZ has no time zone, which corresponds to TIMESTAMP_NTZ in Pinot.
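As an illustration of the intended semantics (a standalone sketch, not taken from the PR; class name and zones are made up), the same stored epoch-millis long reads back differently under the two interpretations:

import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class NtzVsTimestampDemo {
  public static void main(String[] args) {
    long storedMillis = LocalDateTime.of(2024, 11, 8, 12, 13, 14)
        .toInstant(ZoneOffset.UTC).toEpochMilli();

    // TIMESTAMP (local time zone semantics): the long is an instant,
    // so sessions in different zones see different wall-clock values.
    System.out.println(Instant.ofEpochMilli(storedMillis).atZone(ZoneId.of("America/New_York")).toLocalDateTime());
    System.out.println(Instant.ofEpochMilli(storedMillis).atZone(ZoneId.of("Asia/Shanghai")).toLocalDateTime());

    // TIMESTAMP_NTZ semantics: the long encodes a wall-clock value against UTC,
    // so every session reads back the same 2024-11-08T12:13:14.
    System.out.println(LocalDateTime.ofInstant(Instant.ofEpochMilli(storedMillis), ZoneOffset.UTC));
  }
}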
In my case, the schema of my table is as follows:
and the query result is as follows:
The schema JSON of TIME, DATE, TIMESTAMP_NTZ and TIMESTAMP is as follows: