Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregation's "where" clause not working when querying Parquet in vector runtime #5559

Open
philrz opened this issue Jan 8, 2025 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@philrz
Copy link
Contributor

philrz commented Jan 8, 2025

tl;dr

With this test data in Parquet form:

{log_time:2012-10-01T00:00:02Z,client_ip:99.85.61.193,request:"/courses/cs132/2012/",status_code:304(uint16),object_size:213(uint64)}(=bench2)
{log_time:2012-01-01T00:00:00Z,client_ip:25.152.171.147,request:"/books/Six_Easy_Pieces.html",status_code:404(uint16),object_size:271(uint64)}(=bench2)

The where clause in the following aggregation causes the entry with client_ip:25.152.171.147 to show a count of 1 when it should have been 0.

$ SUPER_VAM=1 super -c 'from data.parquet | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:"99.85.61.193",count:1(uint64)}
{client_ip:"25.152.171.147",count:1(uint64)}

Details

Repro is with super commit fc8ab65. This is a simplification of the mgbench bench2/q4 query.

Starting with the data.zson.gz test data shown above, in sequential runtime we see the record with client_ip:25.152.171.147 showing a count of 0 as we'd expect given the filter where log_time >= 2012-10-01T00:00:00Z.

$ super -version
Version: v1.18.0-213-gfc8ab655

$ super -c 'from data.zson.gz | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:99.85.61.193,count:1(uint64)}
{client_ip:25.152.171.147,count:0(uint64)}

However, the problem surfaces if we turn the data into Parquet and execute the query in the vector runtime.

$ super -f parquet -o data.parquet data.zson.gz 

$ SUPER_VAM=1 super -c 'from data.parquet | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:"99.85.61.193",count:1(uint64)}
{client_ip:"25.152.171.147",count:1(uint64)}

But the problem doesn't happen if I query the same Parquet file using the sequential runtime, or query the data as CSUP in vector runtime.

$ super -c 'from data.parquet | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:"99.85.61.193",count:1(uint64)}
{client_ip:"25.152.171.147",count:0(uint64)}

$ super -f csup -o data.csup data.zson.gz 

$ SUPER_VAM=1 super -c 'from data.csup | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:99.85.61.193,count:1(uint64)}
{client_ip:25.152.171.147,count:0(uint64)}
@philrz philrz added the bug Something isn't working label Jan 8, 2025
@philrz
Copy link
Contributor Author

philrz commented Jan 16, 2025

I seem to be hitting this same problem when trying to write a SuperSQL equivalent of ClickBench query 10.

The top entry in the aggregation result below contains the value with MobilePhoneModel:"" despite the attempt to filter it out via the clause where MobilePhoneModel <> ''.

$ super -version
Version: v1.18.0-227-g66b20d0f

$ SUPER_VAM=1 super -z -c "
from "hits.parquet"
| summarize
  by UserID,
     MobilePhoneModel 
| summarize
  u := count(UserID)
  where MobilePhoneModel <> ''
  by MobilePhoneModel
| sort -r u
| head 10"

{MobilePhoneModel:"",u:16443343(uint64)}
{MobilePhoneModel:"iPad",u:1090347(uint64)}
{MobilePhoneModel:"iPhone",u:45758(uint64)}
{MobilePhoneModel:"A500",u:16046(uint64)}
{MobilePhoneModel:"N8-00",u:5565(uint64)}
{MobilePhoneModel:"iPho",u:3300(uint64)}
{MobilePhoneModel:"ONE TOUCH 6030A",u:2759(uint64)}
{MobilePhoneModel:"GT-P7300B",u:1907(uint64)}
{MobilePhoneModel:"3110000",u:1871(uint64)}
{MobilePhoneModel:"GT-I9500",u:1598(uint64)}

Once I drop the SUPER_VAM=1 and run the same query against the same data in sequential runtime, now the result is filtered out as expected.

$ super -z -c "
from "hits.parquet"
| summarize
  by UserID,
     MobilePhoneModel 
| summarize
  u := count(UserID)
  where MobilePhoneModel <> ''
  by MobilePhoneModel
| sort -r u
| head 10"

{MobilePhoneModel:"iPad",u:1090347(uint64)}
{MobilePhoneModel:"iPhone",u:45758(uint64)}
{MobilePhoneModel:"A500",u:16046(uint64)}
{MobilePhoneModel:"N8-00",u:5565(uint64)}
{MobilePhoneModel:"iPho",u:3300(uint64)}
{MobilePhoneModel:"ONE TOUCH 6030A",u:2759(uint64)}
{MobilePhoneModel:"GT-P7300B",u:1907(uint64)}
{MobilePhoneModel:"3110000",u:1871(uint64)}
{MobilePhoneModel:"GT-I9500",u:1598(uint64)}
{MobilePhoneModel:"eagle75",u:1492(uint64)}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants