With the test data in Parquet form, the where clause in the following aggregation causes the entry with client_ip:25.152.171.147 to show a count of 1 when it should have been 0.
$ SUPER_VAM=1 super -c 'from data.parquet | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:"99.85.61.193",count:1(uint64)}
{client_ip:"25.152.171.147",count:1(uint64)}
Details
Repro is with super commit fc8ab65. This is a simplification of the mgbench bench2/q4 query.
Starting with the data.zson.gz test data shown above, in the sequential runtime the record with client_ip:25.152.171.147 shows a count of 0, as we'd expect given the filter where log_time >= 2012-10-01T00:00:00Z.
$ super -version
Version: v1.18.0-213-gfc8ab655
$ super -c 'from data.zson.gz | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:99.85.61.193,count:1(uint64)}
{client_ip:25.152.171.147,count:0(uint64)}
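For reference, the behavior the sequential runtime shows here can be modeled in a few lines of Python. This is only an illustrative sketch of the expected semantics, not how super implements the aggregation. The helper name `count_where_by` and the two rows (including their log_time values) are made up for illustration, since the actual contents of the test data are not reproduced in this issue. The point is that the where clause restricts which rows are counted, while every group key still appears in the output.

```python
from datetime import datetime, timezone

# Cutoff taken from the query: log_time >= 2012-10-01T00:00:00Z
CUTOFF = datetime(2012, 10, 1, tzinfo=timezone.utc)

# Hypothetical rows: the client_ip values match the issue, but the
# log_time values are assumptions chosen to land on either side of the cutoff.
rows = [
    {"client_ip": "99.85.61.193",
     "log_time": datetime(2012, 10, 5, tzinfo=timezone.utc)},
    {"client_ip": "25.152.171.147",
     "log_time": datetime(2012, 9, 1, tzinfo=timezone.utc)},
]


def count_where_by(rows, key, predicate):
    """Count rows per group, counting only rows that satisfy the predicate.

    A group whose rows all fail the predicate is still emitted, with count 0.
    """
    counts = {}
    for row in rows:
        group = row[key]
        counts.setdefault(group, 0)  # every group appears in the result
        if predicate(row):
            counts[group] += 1
    return counts


result = count_where_by(rows, "client_ip", lambda r: r["log_time"] >= CUTOFF)
print(result)
```

Under these assumed inputs, the group for 25.152.171.147 is present with a count of 0, which matches the sequential runtime output above.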
However, the problem surfaces if we turn the data into Parquet and execute the query in the vector runtime.
$ super -f parquet -o data.parquet data.zson.gz
$ SUPER_VAM=1 super -c 'from data.parquet | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:"99.85.61.193",count:1(uint64)}
{client_ip:"25.152.171.147",count:1(uint64)}
The problem does not occur, however, if I query the same Parquet file using the sequential runtime, or query the data as CSUP in the vector runtime.
$ super -c 'from data.parquet | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:"99.85.61.193",count:1(uint64)}
{client_ip:"25.152.171.147",count:0(uint64)}
$ super -f csup -o data.csup data.zson.gz
$ SUPER_VAM=1 super -c 'from data.csup | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:99.85.61.193,count:1(uint64)}
{client_ip:25.152.171.147,count:0(uint64)}
I seem to be hitting this same problem when trying to write a SuperSQL equivalent of ClickBench query 10.
The top entry in the aggregation result below has MobilePhoneModel:"" despite the attempt to exclude it via the clause where MobilePhoneModel <> ''.
$ super -version
Version: v1.18.0-227-g66b20d0f
$ SUPER_VAM=1 super -z -c "
from "hits.parquet"
| summarize
by UserID,
MobilePhoneModel
| summarize
u := count(UserID)
where MobilePhoneModel <> ''
by MobilePhoneModel
| sort -r u
| head 10"
{MobilePhoneModel:"",u:16443343(uint64)}
{MobilePhoneModel:"iPad",u:1090347(uint64)}
{MobilePhoneModel:"iPhone",u:45758(uint64)}
{MobilePhoneModel:"A500",u:16046(uint64)}
{MobilePhoneModel:"N8-00",u:5565(uint64)}
{MobilePhoneModel:"iPho",u:3300(uint64)}
{MobilePhoneModel:"ONE TOUCH 6030A",u:2759(uint64)}
{MobilePhoneModel:"GT-P7300B",u:1907(uint64)}
{MobilePhoneModel:"3110000",u:1871(uint64)}
{MobilePhoneModel:"GT-I9500",u:1598(uint64)}
Once I drop SUPER_VAM=1 and run the same query against the same data in the sequential runtime, the entry with MobilePhoneModel:"" no longer appears in the top 10, as expected.
$ super -z -c "
from "hits.parquet"
| summarize
by UserID,
MobilePhoneModel
| summarize
u := count(UserID)
where MobilePhoneModel <> ''
by MobilePhoneModel
| sort -r u
| head 10"
{MobilePhoneModel:"iPad",u:1090347(uint64)}
{MobilePhoneModel:"iPhone",u:45758(uint64)}
{MobilePhoneModel:"A500",u:16046(uint64)}
{MobilePhoneModel:"N8-00",u:5565(uint64)}
{MobilePhoneModel:"iPho",u:3300(uint64)}
{MobilePhoneModel:"ONE TOUCH 6030A",u:2759(uint64)}
{MobilePhoneModel:"GT-P7300B",u:1907(uint64)}
{MobilePhoneModel:"3110000",u:1871(uint64)}
{MobilePhoneModel:"GT-I9500",u:1598(uint64)}
{MobilePhoneModel:"eagle75",u:1492(uint64)}