-
Notifications
You must be signed in to change notification settings - Fork 3
Queries
Conclave supports a subset of SQL queries. Each query is documented below.
Each query is represented within Conclave as a node within a DAG. This allows Conclave to store critical relationships between nodes and optimize workflows. As such, each operation on a dataset should be stored in a variable as follows:
<node> = cc.<query>(<arg1>, <arg2>, ... , <argn>)
Where:
1. <node> is the name of the variable that will store the node. This variable will be passed as input to further
queries.
2. <query> is the query being executed and stored in <node>.
3. *<arg1>, <arg2>, ... , <argn>* are the input parameters to the query.
-
aggregate(): aggregate over a column in a dataset.
- Syntax: aggregate(input_node, output_name, group_column_names, aggregated_column_name, aggregator, output_column_name)
- input_node: The node being used as input to the query.
- output_name: A string to identify the query. This is used internally by Conclave to keep track of nodes.
- group_col_names: A list of column names that will serve as key columns for the query. The column names are expected to match the names of columns present in the input dataset.
- aggregated_column_name: The name of the column that will be aggregated over.
- aggregator: A string that identifies the type of aggregation being performed. Conclave supports the following aggregation types: "sum", "mean", "std_dev"
- output_column_name: If the user wants to rename the output column to identify it as having been aggregated over, the new name can be passed with this argument.
- Returns: An Aggregate node.
-
aggregate_count(): aggregate over a column in a dataset.
- Syntax: aggregate(input_node, output_name, group_column_names,output_column_name)
- input_node: The node being used as input to the query.
- output_name: A string to identify the query. This is used internally by Conclave to keep track of nodes.
- group_col_names: A list of column names that will serve as key columns for the query. The column names are expected to match the names of columns present in the input dataset.
- output_column_name: If the user wants to rename the output column to identify it as having been aggregated over, the new name can be passed with this argument.
- Returns: An Aggregate node.
-
project(): select a list of columns from a dataset.
- Syntax: project(input_node, output_name, selected_column_names)
- input_node: The node being used as input to the query.
- output_name: A string to identify the query. This is used internally by Conclave to keep track of nodes.
- selected_column_names: A list of columns that will be projected out from the input node. The columns will be projected in the order present in the list.
- Returns: A Project node.
-
divide(): Divide the values in a column, either by the values in another column (or columns), a scalar value, or both.
- Syntax: divide(input_node, output_name, target_column_name, operands)
- input_node: The node being used as input to the query.
- output_name: A string to identify the query. This is used internally by Conclave to keep track of nodes.
- target_column_name: Name of the column that will be operated upon.
- operands: List of columns / values that will be applied to the target column. Note -- the first element in this list must be a string, and will be the name of the outputted column. If this element is equal to the name of the target column, then the target column will be operated on in place. Else, A new column will be created with this value as it's name, and store the result of the divide operation.
- Returns: A Divide node.
-
multiply(): Multiply the values in a column, either by the values in another column (or columns), a scalar value, or both.
- Syntax: multiply(input_node, output_name, target_column_name, operands)
- input_node: The node being used as input to the query.
- output_name: A string to identify the query. This is used internally by Conclave to keep track of nodes.
- target_column_name: Name of the column that will be operated upon.
- operands: List of columns / values that will be applied to the target column. Note -- the first element in this list must be a string, and will be the name of the outputted column. If this element is equal to the name of the target column, then the target column will be operated on in place. Else, A new column will be created with this value as it's name, and store the result of the divide operation.
- Returns: A Multiply node.
-
join(): Join two datasets over a list of columns.
- Syntax: join(left_node, right_node, output_name, left_column_names, right_column_names)
- left_node: Left input node to the query.
- right_node: Right input node to the query.
- output_name: A string to identify the query. This is used internally by Conclave to keep track of nodes.
- left_column_names: A list of column names that will be used as key columns.
- right_column_names: A list of column names that will be used as key columns.
- Returns: A Join node.
- NOTE: Although left_column_names and right_column_names are passed to the function as lists, Obliv-C only supports single-column joins. Accordingly, this will be a limitation on Join queries until another MPC framework is integrated within Conclave.
-
concat(): Concatenate two datasets.
- Syntax: concat(input_nodes, output_name, column_names)
- input_nodes: A list of nodes that will be concatenated. All nodes must have the same number of columns.
- output_name: A string to identify the query. This is used internally by Conclave to keep track of nodes.
- column_names: Names assigned to the columns of the output dataset.
- Returns: A Concat node.
-
collect(): Collect data at the end of a computation and reveal it to a compute party.
- Syntax: collect(input_node, target_party)
- input_node: The node being used as input to the query.
- target_party: The compute party that will receive the revealed data.
- NOTE: calls to collect() are not stored in a variable. They just indicate that the input can be collected.
cols_in_a = [
defCol('a', 'INTEGER', [1]),
defCol('b', 'INTEGER', [1]),
defCol('c', 'INTEGER', [1]),
]
cols_in_b = [
defCol('a', 'INTEGER', [2]),
defCol('b', 'INTEGER', [2]),
defCol('c', 'INTEGER', [2]),
]
in1 = create("in1", cols_in_a, {1})
in2 = create("in2", cols_in_b, {2})
cc1 = concat([in1, in2], 'cc1', ['a', 'b', 'c'])
agg1 = aggregate(cc1, "agg1", ['a'], "b", "sum", "b")
collect(agg1, 1)
return {in1, in2}
- Note that this query must be passed to the submit API as a b64 encoded string.