pycassa is a Cassandra library with the following features:
- Auto-failover single or thread-local connections
- A simplified version of the thrift interface
- A method to map an existing class to a Cassandra ColumnFamily.
- Support for SuperColumns
If you have any Cassandra questions, try the IRC channel #cassandra on irc.freenode.net
thrift: http://incubator.apache.org/thrift/
Cassandra: http://incubator.apache.org/cassandra/ (0.5.0-rc3 or later)
Install thrift with the python bindings.
It comes with the Cassandra python files for convenience, but you can replace them with your own.
The simplest way to get started is to copy the pycassa and cassandra directories to your program. If you want to install, run setup.py as a superuser.
python setup.py install
If you also want to install the cassandra python package (if it's not already included on your system):
python setup.py --cassandra install
All functions are documented with docstrings. To read usage documentation:
>>> import pycassa
>>> help(pycassa.ColumnFamily.get)
For a single connection (which is not thread-safe), pass a list of servers.
>>> client = pycassa.connect() # Defaults to connecting to the server at 'localhost:9160'
>>> client = pycassa.connect(['localhost:9160'])
If you need Framed Transport, pass the framed_transport argument.
>>> client = pycassa.connect(framed_transport=True)
Thread-local connections opens a connection for every thread that calls a Cassandra function. It also automatically balances the number of connections between servers, unless round_robin=False.
>>> client = pycassa.connect_thread_local() # Defaults to connecting to the server at 'localhost:9160'
>>> client = pycassa.connect_thread_local(['localhost:9160', 'other_server:9160']) # Round robin connections
>>> client = pycassa.connect_thread_local(['localhost:9160', 'other_server:9160'], round_robin=False) # Connect in list order
Connections are robust to server failures. Upon a disconnection, it will attempt to connect to each server in the list in turn. If no server is available, it will raise a NoServerAvailable exception.
Timeouts are also supported and should be used in production to prevent a thread from freezing while waiting for Cassandra to return.
>>> client = pycassa.connect(timeout=3.5) # 3.5 second timeout
(Make some pycassa calls and the connection to the server suddenly becomes unresponsive.)
Traceback (most recent call last):
...
pycassa.connection.NoServerAvailable
Note that this only handles socket timeouts. The TimedOutException from Cassandra may still be raised.
You can save login credentials with the .login() function.
>>> client.login('Keyspace1', {'username': 'jsmith', 'password': 'havebadpass'})
These credentials will automatically be re-used on every successful connection. You only need to call login() once.
To use the standard interface, create a ColumnFamily instance.
>>> cf = pycassa.ColumnFamily(client, 'Test Keyspace', 'Test ColumnFamily')
The value returned by an insert is the timestamp used for insertion, or int(time.time() * 1e6). You may replace this function with your own (see Extra Documentation).
>>> cf.insert('foo', {'column1': 'val1'})
1261349837816957
>>> cf.get('foo')
{'column1': 'val1'}
Insert also acts to update values.
>>> cf.insert('foo', {'column1': 'val2'})
1261349910511572
>>> cf.get('foo')
{'column1': 'val2'}
You may insert multiple columns at once.
>>> cf.insert('bar', {'column1': 'val3', 'column2': 'val4'})
1261350013606860
>>> cf.multiget(['foo', 'bar'])
{'foo': {'column1': 'val2'}, 'bar': {'column1': 'val3', 'column2': 'val4'}}
>>> cf.get_count('bar')
2
get_range() returns an iterable. Call it with list() to convert it to a list.
>>> list(cf.get_range())
[('bar', {'column1': 'val3', 'column2': 'val4'}), ('foo', {'column1': 'val2'})]
>>> list(cf.get_range(row_count=1))
[('bar', {'column1': 'val3', 'column2': 'val4'})]
You can remove entire keys or just a certain column.
>>> cf.remove('bar', columns=['column1'])
1261350220106863
>>> cf.get('bar')
{'column2': 'val4'}
>>> cf.remove('bar')
1261350226926859
>>> cf.get('bar')
Traceback (most recent call last):
...
cassandra.ttypes.NotFoundException: NotFoundException()
pycassa retains the behavior of Cassandra in that get_range() may return removed keys for a while. Cassandra will eventually delete them, so that they disappear.
>>> cf.remove('foo')
>>> cf.remove('bar')
>>> list(cf.get_range())
[('bar', {}), ('foo', {})]
... After some amount of time
>>> list(cf.get_range())
[]
You can also map existing classes using ColumnFamilyMap.
>>> class Test(object):
... string_column = pycassa.String(default='Your Default')
... int_str_column = pycassa.IntString(default=5)
... float_str_column = pycassa.FloatString(default=8.0)
... float_column = pycassa.Float64(default=0.0)
... datetime_str_column = pycassa.DateTimeString() # default=None
The defaults will be filled in whenever you retrieve instances from the Cassandra server and the column doesn't exist. If, for example, you add columns in the future, you simply add the relevant column and the default will be there when you get old instances.
IntString, FloatString, and DateTimeString all use string representations for storage. Float64 is stored as a double and is native-endian. Be aware of any endian issues if you use it on different architectures, or perhaps make your own column type.
>>> Test.objects = pycassa.ColumnFamilyMap(Test, cf)
All the functions are exactly the same, except that they return instances of the supplied class when possible.
>>> t = Test()
>>> t.key = 'maptest'
>>> t.string_column = 'string test'
>>> t.int_str_column = 18
>>> t.float_column = t.float_str_column = 35.8
>>> from datetime import datetime
>>> t.datetime_str_column = datetime.now()
>>> Test.objects.insert(t)
1261395560186855
>>> Test.objects.get(t.key).string_column
'string test'
>>> Test.objects.get(t.key).int_str_column
18
>>> Test.objects.get(t.key).float_column
35.799999999999997
>>> Test.objects.get(t.key).datetime_str_column
datetime.datetime(2009, 12, 23, 17, 6, 3)
>>> Test.objects.multiget([t.key])
{'maptest': <__main__.Test object at 0x7f8ddde0b9d0>}
>>> list(Test.objects.get_range())
[<__main__.Test object at 0x7f8ddde0b710>]
>>> Test.objects.get_count(t.key)
7
>>> Test.objects.remove(t)
1261395603906864
>>> Test.objects.get(t.key)
Traceback (most recent call last):
...
cassandra.ttypes.NotFoundException: NotFoundException()
Note that, as mentioned previously, get_range() may continue to return removed rows for some time:
>>> Test.objects.remove(t)
1261395603756875
>>> list(Test.objects.get_range())
[<__main__.Test object at 0x7fac9c85ea90>]
>>> list(Test.objects.get_range())[0].string_column
'Your Default'
To use SuperColumns, pass super=True to the ColumnFamily constructor.
>>> cf = pycassa.ColumnFamily(client, 'Test Keyspace', 'Test SuperColumnFamily', super=True)
>>> cf.insert('key1', {'1': {'sub1': 'val1', 'sub2': 'val2'}, '2': {'sub3': 'val3', 'sub4': 'val4'}})
1261490144457132
>>> cf.get('key1')
{'1': {'sub2': 'val2', 'sub1': 'val1'}, '2': {'sub4': 'val4', 'sub3': 'val3'}}
>>> cf.remove('key1', super_column='1')
1261490176976864
>>> cf.get('key1')
{'2': {'sub4': 'val4', 'sub3': 'val3'}}
>>> cf.get('key1', super_column='2')
{'sub3': 'val3', 'sub4': 'val4'}
>>> cf.multiget(['key1'], super_column='2')
{'key1': {'sub3': 'val3', 'sub4': 'val4'}}
>>> list(cf.get_range(super_column='2'))
[('key1', {'sub3': 'val3', 'sub4': 'val4'})]
You may also use a ColumnFamilyMap with SuperColumns:
>>> Test.objects = pycassa.ColumnFamilyMap(Test, cf)
>>> t = Test()
>>> t.key = 'key1'
>>> t.super_column = 'super1'
>>> t.string_column = 'foobar'
>>> t.int_str_column = 5
>>> t.float_column = t.float_str_column = 35.8
>>> t.datetime_str_column = datetime.now()
>>> Test.objects.insert(t)
>>> Test.objects.get(t.key)
{'super1': <__main__.Test object at 0x20ab350>}
>>> Test.objects.multiget([t.key])
{'key1': {'super1': <__main__.Test object at 0x20ab550>}}
These output values retain the same format as given by the Cassandra thrift interface.
pycassa currently returns Cassandra Columns and SuperColumns as python dictionaries. Sometimes, though, you care about the order of elements. If you have access to an ordered dictionary class (such as collections.OrderedDict in python 2.7), then you may pass it to the constructor. All returned values will be of that class.
>>> cf = pycassa.ColumnFamily(client, 'Test Keyspace', 'Test ColumnFamily',
dict_class=collections.OrderedDict)
You may also define your own Column types for the mapper. For example, the IntString may be defined as:
>>> class IntString(pycassa.Column):
... def pack(self, val):
... return str(val)
... def unpack(self, val):
... return int(val)
...
All of the underlying Cassandra interface functions are available through the connection.
>>> client = pycassa.connect()
>>> client.get_string_property('version')
'0.5.1'
>>> client.get_string_list_property('keyspaces')
['Test Keyspace', 'system']
>>> client.describe_keyspace('system')
{'LocationInfo': {'Type': 'Standard', 'CompareWith': 'org.apache.cassandra.db.marshal.UTF8Type', 'Desc': 'persistent metadata for the local node'}, 'HintsColumnFamily': {'CompareSubcolumnsWith': 'org.apache.cassandra.db.marshal.BytesType', 'Type': 'Super', 'CompareWith': 'org.apache.cassandra.db.marshal.UTF8Type', 'Desc': 'hinted handoff data'}}
As of Cassandra 0.5.1, valid arguments for get_string_property() are:
"cluster name", "config file", "token map", "version"
Valid arguments for get_string_list_property() are:
"keyspaces"
All the functions have the exact same functionality as their thrift counterparts, but it may be hidden as keyword arguments.
ColumnFamily.__init__()
Parameters
----------
client : cassandra.Cassandra.Client
Cassandra client with thrift API
keyspace : str
The Keyspace this ColumnFamily belongs to
column_family : str
The name of this ColumnFamily
buffer_size : int
When calling get_range(), the intermediate results need to be
buffered if we are fetching many rows, otherwise the Cassandra
server will overallocate memory and fail. This is the size of
that buffer.
read_consistency_level : ConsistencyLevel
Affects the guaranteed replication factor before returning from
any read operation
write_consistency_level : ConsistencyLevel
Affects the guaranteed replication factor before returning from
any write operation
timestamp : function
The default timestamp function returns:
int(time.mktime(time.gmtime()))
Or the number of seconds since Unix epoch in GMT.
Set timestamp to replace the default timestamp function with your
own.
super : bool
Whether this ColumnFamily has SuperColumns
dict_class : class (must act like the dict type)
The default dict_class is dict.
If the order of columns matter to you, pass your own dictionary
class, or python 2.7's new collections.OrderedDict. All returned
rows and subcolumns are instances of this.
ColumnFamily.get()
Parameters
----------
key : str
The key to fetch
columns : [str]
Limit the columns or super_columns fetched to the specified list
column_start : str
Only fetch when a column or super_column is >= column_start
column_finish : str
Only fetch when a column or super_column is <= column_finish
column_reversed : bool
Fetch the columns or super_columns in reverse order. This will do
nothing unless you passed a dict_class to the constructor.
column_count : int
Limit the number of columns or super_columns fetched per key
include_timestamp : bool
If true, return a (value, timestamp) tuple for each column
super_column : str
Return columns only in this super_column
read_consistency_level : ConsistencyLevel
Affects the guaranteed replication factor before returning from
any read operation
ColumnFamily.multiget()
Parameters
----------
keys : [str]
A list of keys to fetch
columns : [str]
Limit the columns or super_columns fetched to the specified list
column_start : str
Only fetch when a column or super_column is >= column_start
column_finish : str
Only fetch when a column or super_column is <= column_finish
column_reversed : bool
Fetch the columns or super_columns in reverse order. This will do
nothing unless you passed a dict_class to the constructor.
column_count : int
Limit the number of columns or super_columns fetched per key
include_timestamp : bool
If true, return a (value, timestamp) tuple for each column
super_column : str
Return columns only in this super_column
read_consistency_level : ConsistencyLevel
Affects the guaranteed replication factor before returning from
any read operation
ColumnFamily.get_count()
Parameters
----------
key : str
The key with which to count columns
super_column : str
Count the columns only in this super_column
read_consistency_level : ConsistencyLevel
Affects the guaranteed replication factor before returning from
any read operation
ColumnFamily.get_range()
Parameters
----------
keys : [str]
A list of keys to fetch
columns : [str]
Limit the columns or super_columns fetched to the specified list
column_start : str
Only fetch when a column or super_column is >= column_start
column_finish : str
Only fetch when a column or super_column is <= column_finish
column_reversed : bool
Fetch the columns or super_columns in reverse order. This will do
nothing unless you passed a dict_class to the constructor.
column_count : int
Limit the number of columns or super_columns fetched per key
include_timestamp : bool
If true, return a (value, timestamp) tuple for each column
super_column : str
Return columns only in this super_column
read_consistency_level : ConsistencyLevel
Affects the guaranteed replication factor before returning from
any read operation
ColumnFamily.insert()
Insert or update columns for a key
Parameters
----------
key : str
The key to insert or update the columns at
columns : dict
Column: {'column': 'value'}
SuperColumn: {'column': {'subcolumn': 'value'}}
The columns or supercolumns to insert or update
write_consistency_level : ConsistencyLevel
Affects the guaranteed replication factor before returning from
any write operation
ColumnFamily.remove()
Parameters
----------
key : str
The key to remove. If columns is not set, remove all columns
columns : list
Delete the columns or super_columns in this list
super_column : str
Delete the columns from this super_column
write_consistency_level : ConsistencyLevel
Affects the guaranteed replication factor before returning from
any write operation
ColumnFamilyMap.__init__()
Parameters
----------
cls : class
Instances of cls are generated on get*() requests
column_family: ColumnFamily
The ColumnFamily to tie with cls
raw_columns: boolean
Whether all columns should be fetched into the raw_columns field in
requests