-
Notifications
You must be signed in to change notification settings - Fork 49
Preprocessing in Titus
Often, you want a PFA scoring engine to exactly reproduce the preprocessing (or postprocessing) steps involved in model-production. For instance, if you normalize vectors before building k-means clusters, you must normalize the data using the same offset and scale while scoring.
PFA preprocessing steps are expressed using PFA functions, but the model producer may use its own language to express them. If you're building models in Python/Numpy, you're probably preprocessing the training data as Python lists or Numpy arrays.
The titus.producer.transformation.Transformation
class is intended to help coordinate offline (producer) transformations in Python/Numpy with online (scoring) transformations in PFA.
Download and install Titus and Numpy. This article was tested with Titus 0.8.1 and Numpy 1.8.2; newer versions should work with no modification. Python >= 2.6 and < 3.0 is required.
Launch a Python prompt and import numpy
, titus.producer.transformation
, and some Avro types for generating PFA output:
Python 2.7.6
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> from titus.producer.transformation import Transformation
>>> from titus.datatype import AvroArray, AvroMap, AvroDouble, AvroRecord, AvroField
For the purposes of the Transformation
class, a dataset is a rectangular table of numbers or strings. It may or may not have column names. Although this is similar to R data.frames, Pandas DataFrames, and SQL tables, the Transformation
class expects tables without columns names to be in the form of Numpy 2-D arrays and tables with column names to be Numpy record arrays or Python dictionaries of equal-length Numpy 1-D arrays.
For instance, a table without column names would look like
>>> dataFrame1 = np.random.normal(0.0, 1.0, (100, 3))
and a table with column names would look like
>>> dataFrame2 = np.core.records.fromarrays((np.random.normal(0, 1, 100),
np.random.normal(0, 1, 100),
np.random.normal(0, 1, 100)),
names="x, y, z")
or
>>> dataFrame3 = {"x": np.random.normal(0, 1, 100),
"y": np.random.normal(0, 1, 100),
"z": np.random.normal(0, 1, 100)}
Transformations can accept any of these forms and it outputs the same type it received.
Perhaps the most common type of transformation is just an offset and scaling of the inputs.
>>> means = dataFrame1.mean(axis=0)
>>> stdevs = dataFrame1.std(axis=0)
>>> t1 = Transformation("(_0 - {means[0]})/{stdevs[0]}".format(**vars()),
"(_1 - {means[1]})/{stdevs[1]}".format(**vars()),
"(_2 - {means[2]})/{stdevs[2]}".format(**vars()))
But any transformation from the N column input to an M column output is possible. The following takes each of the three input columns (_0
, _1
, and _2
, because they are unnamed) to two output columns (which will be named _0
and _1
).
>>> t2 = Transformation("_0**2 + _1 * m.sin(_2)", "_0**2 + _1 * m.cos(_2)")
Strings are parsed as PrettyPFA, though an expression built from PFA AST instances (e.g. built using titus.reader.jsonToAst
or titus.reader.yamlToAst
) would also work.
Expressions using named columns are somewhat easier to read.
>>> t3 = Transformation(xprime="x**2 + y * m.sin(z)", yprime="x**2 + y * m.cos(z)")
Before training a model, you can preprocess your Numpy arrays using the transform
method.
>>> normalized = t1.transform(dataFrame1)
>>> dataFrame2Prime = t3.transform(dataFrame2)
>>> dataFrame3Prime = t3.transform(dataFrame3)
Although the transformations were expressed in PrettyPFA, those expressions were converted to Python/Numpy so that fast, vectorized Numpy functions are called, not Titus's PFA engine.
The Transformation
class has three methods to create a PFA snippet. The expr
method creates a PFA expression that can be embedded in a larger calculation (as an argument in model input, for instance), though it is limited to one data column. The let
method creates a PFA "let" block that defines several new variables at once. The new
method creates a PFA "new" expression that creates an array, a map, or a record. Your choice depends on how you need the data formatted.
This is what the normalization t1
looks like as PFA. Since it has unnamed columns, the data type for new
is probably AvroArray(AvroDouble())
.
>>> t1.expr("_0")
{'/': [{'-': ['_0', -0.08392950604]}, 0.985158554796]}
>>> t1.expr("_1")
{'/': [{'-': ['_1', -0.113170895046]}, 1.11869880821]}
>>> t1.expr("_2")
{'/': [{'-': ['_2', 0.0375438926037]}, 0.958713316426]}
>>> t1.let()
{'let': {'_0': {'/': [{'-': ['_0', -0.08392950604]}, 0.985158554796]},
'_2': {'/': [{'-': ['_2', 0.0375438926037]}, 0.958713316426]},
'_1': {'/': [{'-': ['_1', -0.113170895046]}, 1.11869880821]}}}
>>> t1.new(AvroArray(AvroDouble()))
{'new': [{'/': [{'-': ['_0', -0.08392950604]}, 0.985158554796]},
{'/': [{'-': ['_1', -0.113170895046]}, 1.11869880821]},
{'/': [{'-': ['_2', 0.0375438926037]}, 0.958713316426]}],
'type': {'items': 'double', 'type': 'array'}}
This is what the transformation t3
looks like. Since it has named columns, the data type for new
is probably AvroMap(AvroDouble())
(if a record type is not defined) or an AvroRecord
(if it is).
>>> t3.expr("xprime")
{'+': [{'**': ['x', 2]}, {'*': ['y', {'m.sin': ['z']}]}]}
>>> t3.expr("yprime")
{'+': [{'**': ['x', 2]}, {'*': ['y', {'m.cos': ['z']}]}]}
>>> t3.let()
{'let': {'xprime': {'+': [{'**': ['x', 2]}, {'*': ['y', {'m.sin': ['z']}]}]},
'yprime': {'+': [{'**': ['x', 2]}, {'*': ['y', {'m.cos': ['z']}]}]}}}
>>> t3.new(AvroMap(AvroDouble()))
{'new': {'xprime': {'+': [{'**': ['x', 2]}, {'*': ['y', {'m.sin': ['z']}]}]},
'yprime': {'+': [{'**': ['x', 2]}, {'*': ['y', {'m.cos': ['z']}]}]}},
'type': {'values': 'double', 'type': 'map'}}
>>> recType = AvroRecord([AvroField("xprime", AvroDouble()), AvroField("yprime", AvroDouble())], "Prime")
>>> t3.new(recType)
{'new': {'xprime': {'+': [{'**': ['x', 2]}, {'*': ['y', {'m.sin': ['z']}]}]},
'yprime': {'+': [{'**': ['x', 2]}, {'*': ['y', {'m.cos': ['z']}]}]}},
'type': 'Prime'}
These PFA expressions are ready to be inserted into a PFA document.
Return to the Hadrian wiki table of contents.
Licensed under the Hadrian Personal Use and Evaluation License (PUEL).