DataStax has done it again!!
So far I’ve attended Cassandra Day London 2014, Cassandra Summit 2014, today’s Cassandra Day London 2015 and several Cassandra meetups, all organised by DataStax, and I can only admire them, both for the organisation itself (food, merchandising, sponsors, etc.) and, most importantly, for the quality of the content they deliver. I can arguably say that Cassandra would not be the same without DataStax.
But now let’s focus on what’s really important to us: the content. I usually take notes on the important things I hear at conferences and then transcribe them here for further reading and sharing.
Cassandra resilience through a catastrophe’s post-mortem.
by @cjrolo
They lost slightly more than 50% of their data center, and their experience was that, after some tweaks and sleepless nights, Cassandra could still ingest all the data.
Their setup:
- 1TB of writes per day
- Three-node cluster
- Write consistency: ONE
- Read consistency: QUORUM
Their recommendations:
- Five-node cluster (RF=3)
- Network links faster than 1Gb/s
- SSDs
- Avoid Batches and Counters.
They claim to have been using Cassandra since its pre-release days, and that particular catastrophe happened to them before DataStax had released OpsCenter at all, so I was curious to know how they were monitoring their cluster. They were using the bundled Graphite reporter along with StatsD.
Using Cassandra in a microservices environment.
by @mck_sw
Mick’s talk was mostly about tools, particularly highlighting two:
- Zipkin: A distributed tracing system developed by Twitter.
- Useful for debugging, tracing and profiling distributed services.
- Grafana: An open-source graphing dashboard.
- Very useful because it integrates easily with tools like Graphite or Cyanite.
One of the most interesting parts was, once again, the emphasis on monitoring.
Lessons learnt building a data platform.
by @jcasals & @jimanning, from British Gas Connected Homes
They are building the Connected Homes product at British Gas, which is basically an IoT system that monitors temperature and boilers, with several benefits for customers.
They receive data back from users every two minutes.
And the lessons are:
- Spark has overhead, so make sure it’s worth using.
- Basically, Spark takes advantage of parallelism and distribution across nodes, so if all the computation is going to happen on a single node then maybe you don’t need Spark.
- Upsert data from different sources
Given this structure:
CREATE TABLE users (
  id int,        -- note: the CQL integer type is 'int', not 'integer'
  name text,
  surname text,
  birthdate timestamp,
  PRIMARY KEY (id)
);

We can UPSERT like this:

INSERT INTO users (id, name, surname) VALUES (1, 'Carlos', 'Alonso');
INSERT INTO users (id, birthdate) VALUES (1, 1368438171000);

Resulting in a complete record.
- Tweak Spark jobs to avoid them killing Cassandra. Bear in mind that Spark is much more powerful than Cassandra and can easily exhaust its memory. Check this little comic below for more info 😉
- Gain velocity by breaking the barrier between Data Scientists and Developers in your team.
Amaze yourself with this visualisation of London’s energy consumption they showed!
Cassandra at Hailo Cabs
by Chris Hoolihan, Infrastructure Engineer at Hailo
At Hailo Cabs they run Cassandra on Amazon AWS; in particular they use:
- m1.xlarge instances in development systems
- c3.2xlarge instances in production systems
- striped ephemeral disks
- 3 availability zones per DC
Again, one of the most interesting parts was the monitoring. They showed several really interesting tools, some of them developed in-house!
- Grafana
- CTOP (top for Cassandra).
- The Cassandra metrics graphite plugin.
And GoCassa, a wrapper around the Go Cassandra driver that they developed themselves, basically to encourage best practices.
Finally he gave one last piece of advice: don’t put too much data in!!
Antipatterns
by @CHBATEY, Apache Cassandra evangelist at DataStax
This talk was simply awesome. It’s been a long time since I last had to take notes so fast and concentrate so hard to avoid missing a word. Here they are!
Make sure every operation hits ONLY ONE SINGLE NODE.
Easy to explain, right? The more nodes involved, the more connections and therefore the more time spent resolving your query.
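To make it concrete, here’s a minimal sketch; the events table and its columns are invented for illustration:

CREATE TABLE events (
  sensor_id text,
  ts timestamp,
  value double,
  PRIMARY KEY (sensor_id, ts)   -- sensor_id is the partition key
);

-- Good: restricted to a single partition, served by one replica set.
SELECT * FROM events WHERE sensor_id = 'sensor-1';

-- Bad: no partition key restriction, so this scan touches every node.
SELECT * FROM events;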
Use Cassandra Cluster Manager.
This is a development tool for creating local Cassandra clusters. It can be found here.
Use query TRACING.
It’s the best way to profile how your queries perform (see the cqlsh sketch below).
- Good queries trace small.
- Bad queries trace long.
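For example, in cqlsh (a minimal sketch reusing the invented events table from above):

TRACING ON;
SELECT * FROM events WHERE sensor_id = 'sensor-1';
-- cqlsh now prints the trace: each step, the node it ran on and the elapsed time.
TRACING OFF;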
Cassandra cannot join or aggregate, so denormalise.
You have to find the balance between denormalisation and too much duplication. Also bear in mind that User Defined Types are very useful when denormalising.
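As a sketch of what that can look like (the address type is invented for illustration; note that in the Cassandra 2.1 era UDTs must be frozen):

CREATE TYPE address (
  street text,
  city text,
  post_code text
);

CREATE TABLE users_by_id (
  id int PRIMARY KEY,
  name text,
  address frozen<address>   -- nested in the row instead of a joined table
);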
‘Bucketing’ is good for time series.
It can help you distribute load among the different nodes and also achieve the first principle here: “Make sure every operation hits only one single node”.
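A minimal sketch of a day-bucketed time series (names invented for illustration):

CREATE TABLE events_by_day (
  sensor_id text,
  day text,                           -- the bucket, e.g. '2015-04-22'
  ts timestamp,
  value double,
  PRIMARY KEY ((sensor_id, day), ts)  -- partition = (sensor, day)
);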
It is better to have several asynchronous ‘gets’ hitting only one node each than a single ‘get’ query that hits several nodes.
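Sketching both styles against the bucketed table above, the IN version being the one to avoid:

-- One multi-partition query: the coordinator must gather rows from
-- several nodes before it can answer.
SELECT * FROM events_by_day
WHERE sensor_id = 'sensor-1' AND day IN ('2015-04-20', '2015-04-21');

-- Better: one single-partition query per bucket, issued asynchronously
-- from the driver, merging the results client-side.
SELECT * FROM events_by_day WHERE sensor_id = 'sensor-1' AND day = '2015-04-20';
SELECT * FROM events_by_day WHERE sensor_id = 'sensor-1' AND day = '2015-04-21';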
Unlogged batches
Beware that these batches do not guarantee completion.
Unlogged batches can save on network hops, but the coordinator will be very busy processing the batch while the other nodes sit mostly idle. It’s better to run individual queries and let the driver load-balance them and manage the responses. Only if all parts of the batch are to be executed on the same partition is the batch a good choice.
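A sketch of the acceptable case, reusing the bucketed table from above: every statement targets the same partition, so the whole batch is a single mutation for one replica set.

BEGIN UNLOGGED BATCH
  INSERT INTO events_by_day (sensor_id, day, ts, value)
  VALUES ('sensor-1', '2015-04-22', '2015-04-22 10:00:00+0000', 1.0);
  INSERT INTO events_by_day (sensor_id, day, ts, value)
  VALUES ('sensor-1', '2015-04-22', '2015-04-22 10:02:00+0000', 2.0);
APPLY BATCH;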
Logged batches
These ones guarantee completion by first saving the batch to a dedicated batch log.
Logged batches are much slower than their unlogged counterparts (~30%), so only use them if consistency is ABSOLUTELY mandatory.
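Syntax-wise the only difference is dropping the UNLOGGED keyword. A sketch of the classic use case, keeping denormalised tables in sync, reusing the users table from earlier and the invented users_by_id table from above:

-- Either both writes eventually happen or neither does.
BEGIN BATCH
  INSERT INTO users (id, name, surname) VALUES (2, 'Jane', 'Doe');
  INSERT INTO users_by_id (id, name) VALUES (2, 'Jane');
APPLY BATCH;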
Shared mutable data is dangerous also in Cassandra.
This always reminds me of this tweet with a very descriptive explanation of how dangerous it is 😉
There are two main ways to avoid it:
- Upserting (explained above)
- Event sourcing: basically just appending new data as it comes (see the sketch after this list).
- As this doesn’t scale on its own, it’s good to combine it with some snapshotting technique (e.g. taking a snapshot every night in a batch job).
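Here’s the sketch promised above; the account example is invented for illustration:

-- Append-only: every change is a new row, nothing is mutated in place.
CREATE TABLE account_events (
  account_id int,
  event_id timeuuid,
  event_type text,
  amount decimal,
  PRIMARY KEY (account_id, event_id)
);

INSERT INTO account_events (account_id, event_id, event_type, amount)
VALUES (1, now(), 'deposit', 100.00);

-- A nightly snapshot table keeps reads from replaying the whole history.
CREATE TABLE account_snapshots (
  account_id int,
  snapshot_date timestamp,
  balance decimal,
  PRIMARY KEY (account_id, snapshot_date)
);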
Cassandra does not rollback
So it’s pointless to retry failed inserts unless they failed at the coordinator, because if the write reached the coordinator, it will store a hint and retry it later itself.
Don’t use Cassandra as a queue!!
Cassandra doesn’t actually delete data; instead it marks it as deleted with tombstones, and those records stick around for a while, which will affect reads.
TTLs also generate tombstones, so beware!! (unless you use DateTieredCompactionStrategy)
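For instance, with an invented sessions table: the row expires after an hour, and the expiry leaves a tombstone behind.

CREATE TABLE sessions (
  session_id text PRIMARY KEY,
  user_id int
);

-- Expires after 3600 seconds, turning into a tombstone on expiry.
INSERT INTO sessions (session_id, user_id) VALUES ('abc123', 42) USING TTL 3600;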
Secondary Indexes
As Cassandra doesn’t know the cardinality of the indexed column, it saves the index in local tables.
Local tables exist on every node and only contain references to data that can be found on that same node.
Therefore, a query that uses them has to run on all the nodes.
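A quick sketch on the users table from earlier:

-- The index data lives in a hidden table local to each node.
CREATE INDEX users_by_surname ON users (surname);

-- There is no partition key to route by, so this fans out to all nodes,
-- each consulting its own local index.
SELECT * FROM users WHERE surname = 'Alonso';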
You can see slides of this last talk here: http://www.slideshare.net/chbatey/webinar-cassandra-antipatterns-45996021
And that was it!! Amazing, huh?