Cassandra Day London 2015

DataStax has done it again!!

So far I’ve attended Cassandra Day London 2014, Cassandra Summit 2014, several Cassandra meetups and today’s Cassandra Day London 2015, all organised by DataStax, and I can only admire them, both for the organisation itself (food, merchandising, sponsors, etc.) but most importantly for the quality of the contents they deliver. I think it’s fair to say that Cassandra would not be the same without DataStax.

But now let’s focus on what’s really important to us: the contents. I usually take notes on the important things I hear at conferences and then transcribe them here for further reading and sharing.

Cassandra resilience through a catastrophe’s post-mortem.

by @cjrolo

They lost slightly more than 50% of their data center and their experience was that, after some tweaks and sleepless nights, Cassandra could still ingest all the data.

Their setup:

  • 1TB of writes per day
  • Three-node cluster
  • Write consistency: ONE
  • Read consistency: QUORUM

Their recommendations:

  • Five-node cluster (RF=3)
  • >1Gb links
  • SSDs
  • Avoid Batches and Counters.

They claim to have been using Cassandra since its pre-release days, and that particular catastrophe happened before DataStax had released OpsCenter at all, so I was curious to know how they were monitoring their cluster. They were using the bundled Graphite reporter along with StatsD.

Using Cassandra in a microservices environment.

by @mck_sw

Mick’s talk was mostly about tools, particularly highlighting two:

  • Zipkin: A distributed tracing system developed by Twitter.
    • Useful for debugging, tracing and profiling distributed services.
  • Grafana: An open-source graphing dashboard.
    • Very useful because it integrates easily with tools like Graphite or Cyanite.

One of the most interesting parts was, once again, the emphasis on monitoring.

Lessons learnt building a data platform.

by @jcasals & @jimanning, from British Gas Connected Homes

They are building the Connected Homes product at British Gas, which is basically an IoT system that monitors temperature and boilers, with several benefits for customers.

They receive data back from users every two minutes.

And the lessons are:

  • Spark has overhead, so make sure it’s worth using.
    • Basically, Spark takes advantage of parallelism and distribution across nodes, so if all computations are to be done on a single node then maybe you don’t need Spark.
  • Upsert data from different sources

Given this structure:

CREATE TABLE users (
  id int,
  name text,
  surname text,
  birthdate timestamp,
  PRIMARY KEY (id)
);

We can UPSERT like this:

INSERT INTO users (id, name, surname) VALUES (1, 'Carlos', 'Alonso');
INSERT INTO users (id, birthdate) VALUES (1, 1368438171000);

Resulting in a complete record.
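
As a quick check (my own query, not from the talk), reading the row back shows the two partial inserts merged into a single record:

SELECT * FROM users WHERE id = 1;

-- Returns a single row with id, name and surname from the first
-- insert and birthdate from the second.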

  • Tweak Spark jobs to avoid them killing Cassandra. Bear in mind that Spark is much more powerful than Cassandra and can kill its memory. Check the little comic below for more info 😉

Spark vs Cassandra Comic

  • Gain velocity by breaking the barrier between Data Scientists and Developers in your team.

Amaze yourself with this visualisation of London’s energy consumption they showed!

London's Energy consumption

 

Cassandra at Hailo Cabs

by Chris Hoolihan, infrastructure engineer at Hailo

At Hailo Cabs they use Amazon AWS as the infrastructure supporting Cassandra; in particular they use:

  • m1.xlarge instances in development systems
  • c3.2xlarge instances in production systems.
  • striped-ephemeral disks
  • 3 availability zones per DC

Again, one of the most interesting parts was the monitoring. They showed several really interesting tools, some of them developed by themselves!

  • Grafana
  • CTOP (Top for Cassandra).
  • The Cassandra metrics graphite plugin.

And GoCassa, a wrapper they developed themselves for the Go Cassandra driver, basically to encourage best practices.

Finally, he gave one last piece of advice: Don’t put too much data in!!

Antipatterns

by @CHBATEY, Apache Cassandra evangelist at DataStax

This talk was simply awesome. It had been a long time since I last had to take notes so fast and concentrate so hard to avoid missing a word. Here they are!

Make sure every operation hits ONLY ONE SINGLE NODE.

Easy to explain, right? The more nodes involved, the more connections and therefore the more time spent resolving your query.

Use Cassandra Cluster Manager.

This is a development tool for creating local Cassandra clusters. Can be found here.

Use query TRACING.

It is the best way to profile how your queries perform.

  • Good queries trace small.
  • Bad queries trace long.
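
For instance (my own sketch, reusing the users table from above), you can turn tracing on in cqlsh and see exactly which nodes were contacted and how long each step took:

TRACING ON;

SELECT * FROM users WHERE id = 1;
-- cqlsh now prints the trace: the coordinator, the replicas contacted
-- and the elapsed time of every internal step.

TRACING OFF;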

Cassandra cannot join or aggregate, so denormalise.

You have to find the balance between denormalisation and too much duplication. Also bear in mind that User Defined Types are very useful when denormalising.
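
As a minimal sketch of that last point (mine, not from the talk), a User Defined Type lets you embed an address directly in a users table instead of keeping a separate table for it:

-- Hypothetical example of denormalising with a UDT (Cassandra 2.1+).
CREATE TYPE address (
  street text,
  city text,
  postcode text
);

CREATE TABLE users_by_id (
  id int PRIMARY KEY,
  name text,
  home_address frozen<address>
);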

‘Bucketing’ is good for time series.

It can help you distribute load among the different nodes and also achieve the first principle here: “Make sure every operation hits only one single node”.
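
A minimal sketch of bucketing (my own example, not from the talk): partitioning a sensor’s readings by day keeps any single partition from growing without bound, and a query for one day still hits a single partition:

-- Hypothetical time series table bucketed by day.
CREATE TABLE readings_by_day (
  sensor_id int,
  day text,              -- e.g. '2015-04-22'
  reading_time timestamp,
  value double,
  PRIMARY KEY ((sensor_id, day), reading_time)
);

SELECT * FROM readings_by_day
WHERE sensor_id = 42 AND day = '2015-04-22';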

It is better to have several asynchronous ‘gets’ hitting only one node each than a single ‘get’ query that hits several nodes.

Unlogged batches

Beware that these batches do not guarantee completion.

Unlogged batches can save network hops, but while the coordinator will be very busy processing the batch, the other nodes will be mostly idle. It’s better to run individual queries and let the driver load-balance them and manage the responses. Only if all parts of the batch are to be executed on the same partition is a batch a good choice.
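
A sketch of that one good case (my own example, reusing the bucketed table above): every statement targets the same partition, so the whole unlogged batch becomes a single mutation on a single node:

-- Hypothetical: both inserts share the partition key (sensor_id, day).
BEGIN UNLOGGED BATCH
  INSERT INTO readings_by_day (sensor_id, day, reading_time, value)
  VALUES (42, '2015-04-22', '2015-04-22 10:00:00', 1.2);
  INSERT INTO readings_by_day (sensor_id, day, reading_time, value)
  VALUES (42, '2015-04-22', '2015-04-22 10:00:02', 1.3);
APPLY BATCH;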

Logged batches

These guarantee completion by saving the batch to a dedicated batch log.

Logged batches are much slower than their unlogged counterpart (~30%) so only use them if consistency is ABSOLUTELY mandatory.
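
For contrast, a logged batch (again my own sketch, not from the talk) is the tool for keeping two denormalised tables in step when that consistency really is mandatory, here assuming a hypothetical users_by_name table:

BEGIN BATCH
  INSERT INTO users_by_id (id, name) VALUES (1, 'Carlos');
  INSERT INTO users_by_name (name, id) VALUES ('Carlos', 1);
APPLY BATCH;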

Shared mutable data is dangerous also in Cassandra.

This always reminds me of this tweet with a very descriptive explanation of how dangerous it is 😉

There are two main ways to avoid it:

  • Upserting (explained above)
  • Event sourcing: basically just appending new data as it comes (see the sketch after this list).
    • As this doesn’t scale forever, it’s good to combine it with some snapshot technique (e.g. taking a snapshot every night in a batch job).
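
A sketch of what event sourcing could look like in CQL (my own example, not from the talk): every change is appended as a new row, and the nightly snapshot means reads only have to replay the events since it was taken:

-- Hypothetical append-only table of account events, newest first.
CREATE TABLE account_events (
  account_id int,
  event_time timeuuid,
  event_type text,       -- e.g. 'deposit', 'withdrawal', 'snapshot'
  amount decimal,
  PRIMARY KEY (account_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Read everything since the last snapshot.
SELECT * FROM account_events
WHERE account_id = 7 AND event_time > maxTimeuuid('2015-04-21 00:00:00');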

Cassandra does not rollback

So it’s pointless to retry failed inserts unless they failed to reach the coordinator, because if the write did reach the coordinator, it will store a hint and retry it later.

Don’t use Cassandra as a queue!!

Cassandra doesn’t actually delete; instead it marks data as deleted, and those records stick around for a while, which affects reads.

TTLs also generate tombstones, so beware!! (unless you use DateTieredCompaction)
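
A sketch of that exception (my own example): time series rows written with a TTL into a table using DateTieredCompactionStrategy, so whole expired SSTables can be dropped instead of tombstones piling up:

-- Hypothetical metrics table whose rows expire after 7 days.
CREATE TABLE metrics (
  metric text,
  ts timestamp,
  value double,
  PRIMARY KEY (metric, ts)
) WITH compaction = {'class': 'DateTieredCompactionStrategy'};

INSERT INTO metrics (metric, ts, value)
VALUES ('cpu', '2015-04-22 10:00:00', 0.75)
USING TTL 604800;  -- 7 days, in seconds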

Secondary Indexes

As Cassandra doesn’t know the cardinality of the indexed column, it saves the index in local tables.

Local tables exist on every node and only contain references to data that can be found on the corresponding node.

Therefore, a query that uses them will have to run on all the nodes.
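
A minimal sketch (mine, not from the slides) of a secondary index and the kind of query that ends up fanning out to every node:

-- Hypothetical index on a non-key column of the users table above.
CREATE INDEX users_surname_idx ON users (surname);

-- No partition key in the WHERE clause, so the coordinator has to
-- ask every node to consult its local index.
SELECT * FROM users WHERE surname = 'Alonso';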

You can see slides of this last talk here: http://www.slideshare.net/chbatey/webinar-cassandra-antipatterns-45996021

And that was it!! Amazing, huh?

iOS Data Collection Using Swift

Here at MyDrive we are completely data-driven and as such we need to collect
as much data as possible. This data is then analysed by our data scientists and/or
used for Machine Learning purposes. To collect all that data we have designed and implemented our
own iOS data collection app (a.k.a. the iOS awesome app) that silently records
all our movements (lovely, isn’t it?).

Now let’s get to the technical part. The application has been written in Apple’s
brand new programming language, Swift, and we started developing it while it was
still in beta, which means that parts of our codebase had to be rewritten to
adopt the incoming changes of every newly released beta.

This was my first hands-on experience with Swift and I have to say that I like it,
although I used to like it more in one of the betas than in the current first release.
The idea of writing

if var v = some_method() {

}

or

if some_var? {

}

rather than:

var v = some_method()
if v != nil {

}

was, IMO, a much cleaner way of testing variables for nil, but for some reason
Apple decided to remove that feature.

Apart from that detail, I think the language has some ‘must haves’ and cool
features such as closure support, tuples, multiple return values, variable
type inference and functional programming patterns. It also has some that are
not so cool, like optionals, dodgy casting errors, different parameter names
inside and outside functions, and the fact that it is still very young and things
are likely to change in the short/mid term.

With my first impressions on Swift out of the way, let’s now move on to the so-called
iOS awesome app, and particularly to its most technically interesting part:
how we managed to record accelerometer data for hours without running out of memory.
Because, as you may have guessed, our first approach was to store all the
accelerometer observations in an array and then, when the recording finished,
dump it for gzipping and submission to cloud storage.

That ‘all in memory/brute force’ approach didn’t work badly for a while, given that
we were collecting data at a 1Hz frequency, but when we needed to start recording
data at higher frequencies (30 and 60Hz), problems soon appeared.

After spotting the cause of the issue, we decided to create a custom NSOperation,
run outside the main NSOperationQueue, that from time to time simply dumps the
contents of the array holding the accelerometer data to a file on disk through an
NSOutputStream. That worked fine, except that after some time using the app we
realised that the last batch wasn’t being fully dumped
(we failed to wait for the dumping queue to finish before reading the file for gzipping).

Once solved, the code looks more or less like this:

func addDataRow(...) {
  data.append(...)

  // Once the in-memory buffer reaches 300 rows, hand a copy to the
  // background dump operation and start again with an empty array.
  if data.count >= 300 {
    let toBeDump = data
    data = [Row]()
    dumpArray(toBeDump)
  }
}

This piece of code simply appends data to an array and, when it reaches a certain
size, copies it to another variable and resets it to avoid it growing too large.

func dumpArray(data: [Row]!) {
  let op = CSVDumpOperation(file: filePath, data: data)
  // Chain operations so dumps run sequentially and rows stay in order.
  if lastOp != nil && !lastOp!.finished {
    op.addDependency(lastOp!)
  }
  lastOp = op
  dumpQueue.addOperation(op)
}

The copy is then given to a custom NSOperation to be dumped to disk outside the
main operation queue. Those operations are executed sequentially to avoid the data
ending up out of order.

The dump operation looks like this:

class CSVDumpOperation: NSOperation {

  let data: [Row]
  let os: NSOutputStream

  init(file: String, data: [Row]) {
    // Append to the CSV file rather than overwriting previous batches.
    os = NSOutputStream(toFileAtPath: file, append: true)
    self.data = data

    super.init()

    os.open()
  }

  override func main() {
    // Write each row as a line of comma-separated values.
    for row in data {
      let rowStr = "\(row.x),\(row.y),\(row.z)...\n"
      if let rowData = rowStr.dataUsingEncoding(NSUTF8StringEncoding, allowLossyConversion: false) {
        let bytes = UnsafePointer<UInt8>(rowData.bytes)
        os.write(bytes, maxLength: rowData.length)
      }
    }

    os.close()
  }
}

This CSVDumpOperation simply opens an NSOutputStream to the file and writes the
CSV-formatted contents of the given array to it.

And that’s it! With this simple approach for this simple application we intend
to collect hundreds of hours of different activities for further analysis.