
RandomQueryGeneratorCaveats


This page lists known RQG caveats and some workarounds for them.


Generating semantically valid queries

The RQG can generate queries with a variable number of columns and tables. This variability is a problem because the items generated for the FROM clause must then be referenced in the WHERE clause. Using some simple embedded Perl code, it is possible to keep track of the items generated for one part of the query so that they can be reused correctly in another part.

Variable number of tables in a FROM clause

Knowing what tables participate in the FROM clause is important when generating a comprehensive and meaningful WHERE clause. So, we give each table a unique alias and record the total number of aliases that have been assigned:

 table_list:
  new_table_item |
  table_list , new_table_item ;
 
 where_list:
  (where_item) |
  (where_list AND where_item);
 
 where_item:
   existing_table_item . _field = _digit ;
 
 new_table_item:
   _table AS { "table".++$tables };
 
 existing_table_item:
  { "table".$prng->int(1, $tables) };

The new_table_item rule picks a random table, gives it an alias, and increments the number of aliases given so far. The existing_table_item rule picks one such alias. This way, the WHERE clause will only contain table aliases that have actually been generated for the FROM clause.
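
In a full grammar these rules are driven from a top-level query rule, and the $tables counter is typically reset at the start of each query so that aliases from one query do not leak into the next. A minimal sketch, assuming the rule names above (the reset-per-query wiring is an assumption, not part of the grammar shown):

 query:
   { $tables = 0 ; "" } SELECT * FROM table_list WHERE where_list ;

The leading Perl block returns an empty string, so it resets the counter without contributing anything to the generated SQL.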

Variable number of SELECT columns

We can use the same trick to keep track of the SELECT items we generate, giving each one a unique alias:

 select_list:
   new_select_item |
   select_list , new_select_item ;
 
 having_list:
   (having_item) |
   (having_item AND having_list) ;
  
 having_item:
   existing_select_item = _digit ;
 
 new_select_item:
   _field AS { "field".++$fields };
 
 existing_select_item:
   { "field".$prng->int(1, $fields) };

Generating queries that can be compared between server versions

Certain queries can legitimately return different results on different server versions or storage engines. Such false positives can be avoided by careful grammar construction.

LIMIT without full ORDER BY

Even though the ResultsetComparator Validator attempts to account for different ordering of result sets, if a LIMIT clause is specified without a full ORDER BY, two servers may return a different result set altogether. There are two ways to avoid this:

  • ORDER BY using all columns from the SELECT
After your usual ORDER BY, append all columns generated so far, using the aliases assigned by the grammar above (a usage sketch follows this list):
 total_order_by:
   { join(', ', map { "field".$_ } (1..$fields) ) };
  • ORDER BY on the PRIMARY KEY of each participating table (not compatible with LIMIT)
 total_order_by:
   { join(', ', map { "table".$_." . `pk`" } (1..$tables) ) };
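
Either variant of total_order_by can then be appended wherever the grammar emits an ORDER BY. A usage sketch for the first variant, assuming the alias-tracking rules from the previous section (counter resets omitted for brevity):

 query:
   SELECT select_list FROM table_list WHERE where_list ORDER BY total_order_by LIMIT _digit ;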

Different fields in SELECT and GROUP BY

If the SELECT contains non-aggregate fields that are different from the ones used in the GROUP BY, then the result will be undefined. To avoid this:

  • Use the ONLY_FULL_GROUP_BY SQL mode
This will make all such queries invalid. However, it does not improve the chance of generating a valid query, which remains very low.
  • GROUP BY on all non-aggregate SELECT columns.
 select_list:
   select_item |
   select_list , select_item ;
  
 select_item:
   aggregate_select_item | nonaggregate_select_item ;
 
 aggregate_select_item:
   COUNT( _field ) | SUM( _field ) | MIN( _field ) | MAX( _field );
 
 nonaggregate_select_item:
   _field AS { $field_name = "field".++$fields ; push @nonaggregates , $field_name ; $field_name }; 
 
 group_by:
   { scalar(@nonaggregates) > 0 ? " GROUP BY ".join (', ' , @nonaggregates ) : "" };
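
A top-level rule would reset both @nonaggregates and $fields before each query; a minimal sketch, assuming the rule names above:

 query:
   { @nonaggregates = () ; $fields = 0 ; "" } SELECT select_list FROM _table group_by ;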

GROUP BY with ORDER BY

If you use GROUP BY and ORDER BY on a column that is not listed in the SELECT, the result may vary depending on whether sorting happens before or after grouping. To avoid this, always use a column alias in the ORDER BY clause.
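
With the alias-tracking rules above this can be done by ordering on one of the generated aliases, for example by reusing existing_select_item (a sketch that assumes $fields is maintained as in the earlier grammar):

 order_by:
   ORDER BY existing_select_item ;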

Sufficient number of database items in the face of DDL statements

If you run a lot of DDL statements, your DML queries may frequently try to access a database object that does not exist. The following solutions may help to partially alleviate the problem:

  • have many CREATE statements in your query_init rule. They will be run first and will hopefully populate your database sufficiently:
 query_init:
    create ; create ; create ; create ; create ;
  • have way more CREATE than DROP statements:
  dml:
    create | create | create | create | drop ;
  • keep track of the objects that have been created and dropped, and only CREATE objects that do not already exist and DROP objects that do (a DML rule that uses pick_existing_table is sketched below):
 create:
    CREATE TABLE IF NOT EXISTS pick_create_table (F1 INTEGER) ;
 
 drop:
    DROP TABLE IF EXISTS pick_drop_table ;
 
 pick_create_table:
    { if (scalar(@dropped_tables) > 0) { $created_table = shift @dropped_tables } else { $created_table = $prng->letter() } ; push @created_tables, $created_table ; $created_table } ;
 
 pick_drop_table:
    { if (scalar(@created_tables) > 0) { $dropped_table = pop @created_tables } else { $dropped_table = $prng->letter() } ; push @dropped_tables, $dropped_table ; $dropped_table } ;
 
 pick_existing_table:
    { if (scalar(@created_tables) > 0) { $prng->arrayElement(\@created_tables) } else { $prng->letter() } } ;
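
The pick_existing_table rule is meant to be referenced from the DML part of the grammar so that most queries touch tables that actually exist; a minimal sketch (the select rule name is illustrative):

 select:
   SELECT * FROM pick_existing_table ;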
 

Category:RandomQueryGenerator