Cassandra Day London 2015

DataStax has done it again!!

So far I’ve attended Cassandra Day London 2014, Cassandra Summit 2014, today’s Cassandra Day London 2015 and several Cassandra Meetups, all organised by DataStax, and I can only admire them, both for the organisation itself (food, merchandising, sponsors, etc.) and, most importantly, for the quality of the contents they deliver. I can arguably say that Cassandra would not be the same without DataStax.

But now let’s focus on what’s really important to us: the contents. I usually take notes on the important things I hear at conferences and then transcribe them here for further reading and sharing.

Cassandra resilience through a catastrophe’s post-mortem.

by @cjrolo

They lost slightly more than 50% of their data center and their experience was that, after some tweaks and sleepless nights, Cassandra could still ingest all the data.

Their setup:

  • 1TB of writes per day
  • Three-node cluster
  • Write consistency: ONE
  • Read consistency: QUORUM

Their recommendations:

  • Five-node cluster (RF=3)
  • >1Gbps links
  • SSDs
  • Avoid Batches and Counters.

They claim to have been using Cassandra since its pre-release days, and that particular catastrophe happened to them before DataStax had released OpsCenter at all, so I was curious to know how they were monitoring their cluster. They were using the bundled Graphite Reporter along with StatsD.

Using Cassandra in a micro services environment.

by @mck_sw

Mick’s talk was mostly about tools, particularly highlighting two:

  • Zipkin: A distributed tracing system developed by Twitter.
    • Useful for debugging, tracing and profiling on distributed services.
  • Grafana: Open-source graphing dashboard.
    • Very useful because it integrates easily with tools like Graphite or Cyanite.

One of the most interesting parts was, once again, the emphasis on monitoring.

Lessons learnt building a data platform.

by @jcasals & @jimanning, from British Gas Connected Homes

They are building the Connected Homes product at British Gas, which is basically an IoT system that monitors temperature and boilers, with several benefits for the customers.

They receive data back from users every two minutes.

And the lessons are:

  • Spark has overhead, so make sure it’s worth using.
    • Basically, Spark takes advantage of parallelism and distribution across nodes, so if all computations are to be done in a single node then maybe you don’t need Spark.
  • Upsert data from different sources

Given this structure:

CREATE TABLE users(
  id int,
  name text,
  surname text,
  birthdate timestamp,
  PRIMARY KEY (id));

We can UPSERT like this:

INSERT INTO users (id, name, surname) VALUES (1, 'Carlos', 'Alonso');
INSERT INTO users (id, birthdate) VALUES (1, 1368438171000);

Resulting in a completed record.

  • Tweak Spark jobs to avoid them killing Cassandra. Bear in mind that Spark is much more powerful than Cassandra and can kill its memory. Check this little comic below for more info 😉

Spark vs Cassandra Comic

  • Gain velocity by breaking the barrier between Data Scientists and Developers in your team.

Amaze yourself with this visualisation of London’s energy consumption they showed!

London's Energy consumption


Cassandra at Hailo Cabs

by Chris Hoolihan, Infrastructure Engineer at Hailo

At Hailo Cabs they use Amazon AWS as the infrastructure supporting Cassandra; in particular they use:

  • m1.xlarge instances in development systems
  • c3.2xlarge instances in production systems.
  • striped-ephemeral disks
  • 3 availability zones per DC

Again, one of the most interesting parts was the monitoring. They showed several really interesting tools, some of them developed by themselves!

  • Grafana
  • CTOP (top for Cassandra).
  • The Cassandra metrics graphite plugin.

And GoCassa, a Go wrapper for the Go Cassandra driver that they developed themselves, basically to encourage best practices.

Finally he gave one last piece of advice: Don’t store too much data!!

Antipatterns

By @CHBATEY, Apache Cassandra Evangelist at DataStax

This talk was simply awesome. It’s been a long time since I last had to take notes so fast and concentrate so hard to avoid missing a word, and here they are!

Make sure every operation hits ONLY ONE SINGLE NODE.

Easy to explain, right? The more nodes involved, the more connections and therefore the more time spent resolving your query.

Use Cassandra Cluster Manager.

This is a development tool for creating local Cassandra clusters. It can be found here.

Use query TRACING.

It is the best way to profile how your queries perform (a minimal example follows the list below).

  • Good queries trace small.
  • Bad queries trace long.
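
As a minimal sketch, assuming the users table from the upsert example above, tracing is toggled around the query you want to profile in cqlsh:

TRACING ON;
SELECT name, surname FROM users WHERE id = 1;
TRACING OFF;

The trace lists every node, step and elapsed time involved in answering the query.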

Cassandra cannot join or aggregate, so denormalise.

You have to find the balance between denormalisation and too much duplication. Also bear in mind that User Defined Types are very useful when denormalising.
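
As a minimal sketch of denormalising with a User Defined Type (the address type and the users_by_city table are hypothetical names, not from the talk):

CREATE TYPE address (
  street text,
  city text,
  post_code text
);

CREATE TABLE users_by_city (
  city text,
  user_id uuid,
  name text,
  home_address frozen<address>,
  PRIMARY KEY (city, user_id)
);

The same user data can be duplicated into several such query-specific tables instead of joining at read time.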

‘Bucketing’ is good for time series.

It can help you distribute load among the different nodes and also achieve the first principle here: “Make sure every operation hits only one single node”.

It is better to have several asynchronous ‘gets’ hitting only one node each than a single ‘get’ query that hits several nodes.
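
As a minimal sketch of day-based bucketing (hypothetical table and column names) that also plays well with those asynchronous single-partition gets:

CREATE TABLE temperature_by_day (
  sensor_id uuid,
  day text,
  reading_time timeuuid,
  value double,
  PRIMARY KEY ((sensor_id, day), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

Each sensor/day pair gets its own partition, so a query for one day hits a single node, and several days can be fetched as independent asynchronous queries.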

Unlogged batches

Beware that these batches do not guarantee completion.

Unlogged batches can save on network hops, but the coordinator is going to be very busy while processing the batch while the other nodes remain mostly idle. It’s better to run individual queries and let the driver load-balance them and manage the responses. Only if all parts of the batch are to be executed on the same partition is the batch a good choice.
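
As a minimal sketch of that acceptable case, a single-partition unlogged batch against the hypothetical temperature_by_day table from above:

BEGIN UNLOGGED BATCH
  INSERT INTO temperature_by_day (sensor_id, day, reading_time, value)
  VALUES (123e4567-e89b-12d3-a456-426655440000, '2015-04-22', now(), 21.5);
  INSERT INTO temperature_by_day (sensor_id, day, reading_time, value)
  VALUES (123e4567-e89b-12d3-a456-426655440000, '2015-04-22', now(), 21.7);
APPLY BATCH;

Both rows share the same partition key, so a single set of replicas does all the work and no extra nodes sit waiting.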

Logged batches

These guarantee completion by first saving the batch to a dedicated batch log.

Logged batches are much slower than their unlogged counterparts (~30%), so only use them if the atomicity guarantee is ABSOLUTELY mandatory.

Shared mutable data is dangerous also in Cassandra.

This always reminds me of this tweet with a very descriptive explanation of how dangerous it is 😉

There are two main ways to avoid it:

  • Upserting (explained above)
  • Event sourcing: Basically just appending new data as it comes.
    • As this doesn’t scale, it’s good to combine it with some snapshot technique (taking a snapshot every night in a batch job).

Cassandra does not rollback

So it’s pointless to retry failed inserts unless they failed at the coordinator, because if the write reached the coordinator it will have stored a hint and will retry it later.

Don’t use Cassandra as a queue!!

Cassandra doesn’t actually delete data; instead it marks it as deleted, and those records stick around for a while, which will affect reads.

TTLs also generate tombstones, so beware!! (unless you use DateTieredCompaction)

Secondary Indexes

As Cassandra doesn’t know the cardinality, it saves the index in local tables.

These local tables exist on every node and only contain references to data that can be found on the corresponding node.

Therefore, to use them, a query may have to run on all the nodes.
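
As a minimal sketch (hypothetical index name, reusing the users table from earlier), a secondary index query cannot be routed by partition key, so it fans out:

CREATE INDEX users_surname_idx ON users (surname);

SELECT * FROM users WHERE surname = 'Alonso';

The SELECT has no partition key restriction, so every node has to consult its local index table.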

You can see slides of this last talk here: http://www.slideshare.net/chbatey/webinar-cassandra-antipatterns-45996021

And that was it!! Amazing, huh?

16 Notes on Cassandra Summit Europe 2014 (Part II)

Cassandra Summit 2014 Header

And here we are again! Transcribing my handwritten notes before I forget the interesting details from the Summit. If you’re new here, I’d advise you to read the first part of this series, covering the first day at the Summit.

But before beginning I want to congratulate all the organisers of this event for making it such a big and high-quality one. DataStax, thank you very much, you are simply GREAT!

So this second day was a talks day: lots of talks organised in groups of four running simultaneously, so hard decisions had to be made every 45 minutes or so.

The conference day began with a keynote by Billy Bosworth and Jonathan Ellis speaking about Cassandra and the upcoming 3.0, and I took a few interesting notes:

Use DTCS for time series compaction.

It optimises time range queries but not time sorting queries.
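
As a minimal sketch with a hypothetical time-series table, the compaction strategy is chosen per table:

CREATE TABLE readings (
  sensor_id uuid,
  reading_time timestamp,
  value double,
  PRIMARY KEY (sensor_id, reading_time)
) WITH compaction = {'class': 'DateTieredCompactionStrategy'};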

Lightweight transactions by using conditional statements

Typical example: balance transfers. We update after reading, using the read value to be sure it has not been updated in between.

BEGIN BATCH
  UPDATE accounts SET balance = 180 WHERE ... IF balance = 100;
  UPDATE accounts SET balance = 100 WHERE ... IF balance = 180;
APPLY BATCH;

User Defined Types

CREATE TYPE address (
   ...
);

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  addresses map<text, frozen<address>>
);

Beware that every value in the addresses map will be serialised as a blob, so even an update to a single field of an address implies rewriting the whole entity.

Collections indexing

C* 3.0 will allow us to query on collections, e.g.:

 SELECT ... WHERE field CONTAINS X
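
As a minimal sketch (adding a hypothetical emails set to the users table defined above and indexing it):

ALTER TABLE users ADD emails set<text>;
CREATE INDEX users_emails_idx ON users (emails);

SELECT * FROM users WHERE emails CONTAINS 'carlos@example.com';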

Counters…

Prior to v2.1 only the increment was written to the commit log, and therefore the main problem was to ensure the commit log was never replayed. In 2.1 and afterwards the full counter value is also stored in the commit log, making a replay idempotent, just like any other operation.

JSON integration

JSON arrives in Cassandra! C* will map a JSON-formatted string to a matching table structure and will allow something like:

INSERT INTO user JSON '{"name": "Carlos", "surname": "Alonso", ...}';

User defined functions

Like any other stored procedure language.

Local indexes -> Global indexes

Each node indexes its own data, denormalised as another table. With global indexes the index table is in its own partition, making it faster and easier to query.

After a few good talks, the awesome Patrick McFadin came along with a few pieces of good advice on performance.

Use prepared statements: they were conceived for performance.

Once a statement is prepared, using it is a shortcut that saves all the parsing work.

Use execute async whenever possible

It benefits from parallel processing. This is speed for free!!

Batches are for atomicity

And should be small!

Row cache is better than caching the whole partition

Row cache just caches a fixed number (defined on table creation) of rows (hot rows), rather than the whole partition.
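
A minimal sketch with hypothetical names of how that fixed number of rows is declared in the table options:

CREATE TABLE events_by_user (
  user_id uuid,
  event_time timeuuid,
  payload text,
  PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
  AND caching = {'keys': 'ALL', 'rows_per_partition': '100'};

Only the 100 newest events per partition (the hot rows) end up in the row cache.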

And again, after a few more good talks and very close to the end Theo Hultberg filled the last lines of my notebook.

If you need to read a very long row, split it and read parts concurrently

Sounds clever and simple, huh?

Beware tombstones!

Don’t just assume Cassandra is fast at compacting them. Again, test!

Cassandra is written in Java

So make sure you know the differences between data types in your language and Java.

Atomic operations?

Make sure you know what is atomic and what isn’t from the docs, otherwise you’ll soon be in trouble.

And that’s all folks!! A really nice compilation of notes, ideas and concepts from a really awesome event, have I mentioned that before?

But I don’t want to say goodbye without mentioning a very good product presented there. I didn’t take any notes on it because I’m sure I won’t forget it. It is Stratio Cassandra and it seems to fix one of the biggest problems C* has: lack of flexibility. If you happen to require a new data access pattern, it’s very likely that your model simply doesn’t support it.

Andrés de la Peña and Daniel Higuero, from Stratio, introduced the product, which consists of indexing the data on each node using Lucene on top of Cassandra, allowing you to access the data with the flexibility Lucene provides. Simply brilliant.

And that’s all! Hope you find those useful.

16 Notes on Cassandra Summit Europe 2014

Cassandra Summit 2014 Header

Today I attended Cassandra Summit Europe 2014 in London and it has been simply awesome, both for the technical content itself and for the overall event organisation, as usual when the DataStax guys are on it. The summit consisted of two days: the first was a training day and the second a conference day. On the training day I attended the Data Modelling Advanced Track, led by Patrick McFadin, and here is the transcription of the notes I took.

1. Before you begin: Understand your data.

Yes, as simple as that. Fully and deeply understand the data model you are working with before even thinking of switching on your editor or even your laptop!

2. Query-Driven data modelling.

Cassandra data modelling ABSOLUTELY REQUIRES you to know how you are going to access the data upfront. Otherwise you’ll be in trouble soon.

3. Clustering order can be specified when creating the table definition.

Yes, I didn’t know this and it’s totally necessary for use cases like: “Give me the last 5 events…”
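
A minimal sketch with hypothetical names showing exactly that use case:

CREATE TABLE events (
  user_id uuid,
  event_time timeuuid,
  payload text,
  PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

SELECT * FROM events WHERE user_id = 123e4567-e89b-12d3-a456-426655440000 LIMIT 5;

Because rows are stored newest-first, the LIMIT 5 returns the last 5 events without any sorting at read time.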

4. You can define static fields

Again, I didn’t know this feature. A static field is only stored once per partition and, therefore, has only one value in the whole partition.
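
A minimal sketch with hypothetical names: the balance column below is static, so every purchase row in an account’s partition shares the same single value:

CREATE TABLE purchases_by_account (
  account_id uuid,
  purchase_id timeuuid,
  amount decimal,
  balance decimal STATIC,
  PRIMARY KEY (account_id, purchase_id)
);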

5. Map data types are fully serialised/deserialised when writing/reading.

So make sure that you are ok with what this implies:

  • Impossible to filter results based on values contained in it.
  • Whole field is read and deserialised even if only one of the key-value pairs is needed.
  • Although the theoretical limit is much higher, the recommended size is up to several hundred pairs.

6. TimeUUID = Timestamp + UUID

Sounds natural, doesn’t it? But what’s the point? Well, if you want to use a timestamp as part of the primary key that, for example, stores the time at which a particular record was created, and entities of this type are created really fast, chances are that two different entities will share the same UNIX timestamp and, therefore, the latter will override the former. Use a timeuuid to be on the safe side.
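
A minimal sketch with hypothetical names; the now() function generates a timeuuid, so two clicks recorded in the same millisecond still get distinct clustering keys:

CREATE TABLE clicks (
  page_id uuid,
  click_id timeuuid,
  user_agent text,
  PRIMARY KEY (page_id, click_id)
);

INSERT INTO clicks (page_id, click_id, user_agent)
VALUES (123e4567-e89b-12d3-a456-426655440000, now(), 'Mozilla/5.0');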

7. Don’t use ALLOW FILTERING in your queries.

Although it is permitted, it is a good indicator that your model is poor.

8. Batches have a really big performance impact.

That’s because the coordinator node has to make sure that all the statements are executed sequentially and distributed among the required replica nodes before returning.

9. Pre-calculate and estimate the sizes of your partitions in production.

Once you have them at adequate numbers, TEST WITH PRODUCTION DATA. Use this formula: Ncells = Nrows × (Ncols − Npk − Nstatic) + Nstatic, and try to keep it below 1M by (see the worked example after this list):

  • Splitting partitions if too big.
    • By adding another existing field to the partition key.
    • By creating a new convenience column and then add it to the partition key.
    • By defining buckets.
  • Grouping partitions using a bucket PK if too small.
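
As a quick worked example with made-up numbers: a partition holding 50,000 rows of a table with 12 columns, 3 of them in the primary key and 1 static, gives Ncells = 50,000 × (12 − 3 − 1) + 1 = 400,001 cells, comfortably below the 1M guideline; ten times as many rows would blow past it and call for one of the splitting techniques above.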

10. If your time series gives you lots of data, it is probably a good idea to store different grain sizes and let the finer-grained ones expire using TTL.

This is a widely used technique: keep very fine-grained data for the most recent period, but when you query older data, at some point you won’t get the finest grain but rather a coarser one.
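
A minimal sketch with hypothetical tables: the fine-grained readings are inserted with a TTL so they expire after a week, while coarser roll-ups would be written to a separate table without one:

CREATE TABLE readings_raw (
  sensor_id uuid,
  reading_time timeuuid,
  value double,
  PRIMARY KEY (sensor_id, reading_time)
);

INSERT INTO readings_raw (sensor_id, reading_time, value)
VALUES (123e4567-e89b-12d3-a456-426655440000, now(), 21.5)
USING TTL 604800;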

11. Use maps to reduce the number of columns

But only for columns that:

  • You’re not gonna query on.
  • You don’t mind getting back all together.

12. Batches are good for managing duplicated fields.

Whenever you have duplicated fields, e.g. in lookup tables, batches are great for inserting/updating all of them or none.
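
A minimal sketch with hypothetical lookup tables; a logged batch keeps the two denormalised copies consistent, writing both or neither:

CREATE TABLE users_by_id (
  id uuid PRIMARY KEY,
  email text
);

CREATE TABLE users_by_email (
  email text PRIMARY KEY,
  id uuid
);

BEGIN BATCH
  INSERT INTO users_by_id (id, email)
  VALUES (123e4567-e89b-12d3-a456-426655440000, 'carlos@example.com');
  INSERT INTO users_by_email (email, id)
  VALUES ('carlos@example.com', 123e4567-e89b-12d3-a456-426655440000);
APPLY BATCH;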

13. Cassandra is not a data warehouse for analytics.

Because sooner or later you’ll need to query your data in a way that your model doesn’t support.

14. Cassandra is not a transactional database.

As we have said, transactions are quite expensive in Cassandra, so heavy transaction use is going to put us in trouble.

15. Don’t use supercolumns.

They were a bad decision from the early Thrift days and are deprecated nowadays, therefore they should not be used.

16. Use Solr on top of Cassandra for full text searches

Apache Solr integrates nicely with Cassandra and is happy to index text columns stored in Cassandra.

And these are the 16 notes I hand-wrote during the whole training day. Let’s see what comes from the conference day!
UPDATE: The second part of this series has been published!