Scalable Data Modelling by example – Cassandra Summit ’16

Two facts are the motivations for this talk:

  • The first is that the data model cannot easily be changed once it is in production. Well, you can change it, but only by migrating the data to the new model using external tools such as Spark.
  • The second is that one of the most common pain points in Cassandra deployments out there is performance issues caused by bad data models.

So, in order to provide a Cassandra overview and a ‘checklist’ to follow when data modelling, here you have my talk.

And the slides are here

Just as a final comment, I’d like to say that it was my fault that, during the talk, I forgot to acknowledge DataStax for all the amazing work they’re doing; most of the content of this talk is taken from their Academy website. So once again, thanks DataStax for all your efforts, for such an amazing Summit and for all your contributions to the Cassandra Open Source project.

Scalable Data Modelling By Example – Cassandra London Meetup

This time I spoke at the Cassandra London Meetup and had the chance to share the stage with the amazing Patrick McFadin!!

My talk is about the concepts Cassandra is built on that definitely make a difference in how we have to model our data. It reviews those concepts theoretically and then does some data modelling by example.

Here you have the slides

You can see the video here: https://skillsmatter.com/skillscasts/7762-scalable-data-modelling-by-example#video

Codemotion 2015 – Cassandra for impatients

This talk tries to provide the basic but fundamental concepts required to build scalable Cassandra data models. I think that we, technical people, are impatient, and that sometimes leads to errors. Those errors are usually acceptable, but when we’re dealing with Cassandra models, the cost of detecting why a model is not scaling and then having to modify it live will always be very expensive and painful. Basically, you don’t want to build and deploy models that don’t scale.

In order to get those concepts, here you have the slides and the video.

Cassandra Day London 2015

DataStax has done it again!!

So far I’ve attended Cassandra Day London 2014, Cassandra Summit 2014, today’s Cassandra Day London 2015 and several Cassandra Meetups, all organised by DataStax, and I can only admire them, both for the organisation itself (food, merchandising, sponsors, etc…) but, most importantly, for the quality of the contents they deliver. I think it’s fair to say that Cassandra would not be the same without DataStax.

But now let’s focus on what’s really important to us: the contents. I usually take notes on the important things I hear at conferences and then transcribe them here for further reading and sharing.

Cassandra resilience through a catastrophe’s post-mortem.

by @cjrolo

They lost slightly more than 50% of their data center, and their experience was that, after some tweaks and sleepless nights, Cassandra could still ingest all the data.

Their setup:

  • 1Tb of writes per day
  • Three-node cluster
  • Write consistency: ONE
  • Read consistency: QUORUM

Their recommendations:

  • Five-node cluster (RF=3)
  • >1Gb links
  • SSDs
  • Avoid Batches and Counters.

They claim to have been using Cassandra since its pre-release days, and that particular catastrophe happened to them before DataStax had released any OpsCenter whatsoever, so I was curious to know how they were monitoring their cluster. They were using the bundled Graphite Reporter along with Statsd.

Using Cassandra in a micro services environment.

by @mck_sw

Mick’s talk was mostly about tools, particularly highlighting two:

  • Zipkin: A distributed tracing system developed by Twitter.
    • Useful for debugging, tracing and profiling on distributed services.
  • Grafana: OpenSource Graphs dashboard.
    • Very useful because it can easily integrate with tools like Graphite or Cyanite.

One of the most interesting parts was, once again, the emphasis on monitoring.

Lessons learnt building a data platform.

by @jcasals & @jimanning, from British Gas Connected Homes

They are building the Connected Homes product at British Gas, which is basically an IoT system that monitors temperature and boilers, with several benefits for the customers.

They receive data back from users every two minutes.

And the lessons are:

  • Spark has overhead, so make sure it’s worth using.
    • Basically, Spark takes advantage of parallelism and distribution across nodes, so if all computations are to be done on a single node then maybe you don’t need Spark.
  • Upsert data from different sources

Given this structure:

CREATE TABLE users(
  id int,
  name text,
  surname text,
  birthdate timestamp,
  PRIMARY KEY (id));

We can UPSERT like this

INSERT INTO users (id, name, surname) VALUES (1, 'Carlos', 'Alonso');
INSERT INTO users (id, birthdate) VALUES (1, 1368438171000);

Resulting in a complete record.
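
As a quick sanity check (using the users table just defined), reading the row back should show both writes merged into a single row:

SELECT * FROM users WHERE id = 1;
-- Returns one row with name, surname and birthdate all populated.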

  • Tweak Spark jobs to avoid them killing Cassandra. Bear in mind that Spark is much more powerful than Cassandra and can exhaust its memory. Check this little comic below for more info 😉

Spark vs Cassandra Comic

  • Gain velocity by breaking the barrier between Data Scientists and Developers in your team.

Amaze yourself with this visualisation of London’s energy consumption they showed!

London's Energy consumption

 

Cassandra at Hailo Cabs

by Chris Hoolihan, Infrastructure Engineer at Hailo

At Hailo Cabs they use Amazon AWS as the infrastructure supporting Cassandra; in particular they use:

  • m1.xlarge instances in development systems
  • c3.2xlarge instances in production systems.
  • striped-ephemeral disks
  • 3 availability zones per DC

Again, one of the most interesting parts was the monitoring. They showed several really interesting tools, some of them developed by themselves!

  • Grafana
  • CTOP (top for Cassandra).
  • The Cassandra metrics graphite plugin.

And GoCassa, a Go-language wrapper for the Go Cassandra driver, which they developed themselves basically to encourage best practices.

Finally, he gave one last piece of advice: don’t store too much data!!

Antipatterns

By @CHBATEY, Apache Cassandra Evangelist at DataStax

This talk was simply awesome. It had been a long time since I last had to take notes so fast and stay so focused trying not to miss a word, and here they are!

Make sure every operation hits ONLY ONE SINGLE NODE.

Easy to explain, right? The more nodes involved, the more connections and therefore the more time spent resolving your query.

Use Cassandra Cluster Manager.

This is a development tool for creating local Cassandra clusters. It can be found here.

Use query TRACING.

It is the best way to profile how your queries perform.

  • Good queries trace small.
  • Bad queries trace long.
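
A minimal example from cqlsh (the table and query are just illustrative):

TRACING ON;
SELECT * FROM users WHERE id = 1;
-- cqlsh now prints the trace: each internal step, the node it ran on and the elapsed microseconds.
TRACING OFF;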

Cassandra cannot join or aggregate, so denormalise.

You have to find the balance between denormalisation and too much duplication. Also bear in mind that User Defined Types are very useful when denormalising.
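
As an illustration, here is a sketch of denormalising an address into a user table with a User Defined Type (the type and table names are made up):

CREATE TYPE address (
  street text,
  city text,
  postcode text);

CREATE TABLE users_by_email (
  email text PRIMARY KEY,
  name text,
  home_address frozen<address>);
-- The address travels with the user, so no join is needed at read time.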

‘Bucketing’ is good for time series.

It can help you distribute load among the different nodes and also achieve the first principle here: “Make sure every operation hits only one single node”.
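
A sketch of what a day-bucketed time series could look like (hypothetical sensor readings table):

CREATE TABLE readings_by_sensor_day (
  sensor_id text,
  day text,                -- the bucket, e.g. '2015-04-22'
  reading_time timeuuid,
  value double,
  PRIMARY KEY ((sensor_id, day), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- A query for one sensor and one day hits exactly one partition, and so one node:
SELECT * FROM readings_by_sensor_day
WHERE sensor_id = 'boiler-42' AND day = '2015-04-22';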

It is better to have several asynchronous ‘gets’ hitting only one node each than a single ‘get’ query that hits several nodes.

Unlogged batches

Beware that these batches do not guarantee completion.

Unlogged batches can save on network hops, but the coordinator will be very busy processing the batch while the other nodes sit mostly idle. It’s better to run individual queries and let the driver load balance and manage the responses. Only if all parts of the batch are to be executed on the same partition is a batch a good choice.
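
For example, a single-partition unlogged batch (reusing the hypothetical bucketed table from above, with the same partition key in every statement):

BEGIN UNLOGGED BATCH
  INSERT INTO readings_by_sensor_day (sensor_id, day, reading_time, value)
  VALUES ('boiler-42', '2015-04-22', now(), 20.5);
  INSERT INTO readings_by_sensor_day (sensor_id, day, reading_time, value)
  VALUES ('boiler-42', '2015-04-22', now(), 21.0);
APPLY BATCH;
-- Both rows land in the same partition, so the coordinator applies them as a single mutation.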

Logged batches

These guarantee completion by first saving the batch to the batchlog.

Logged batches are much slower than their unlogged counterpart (~30%) so only use them if consistency is ABSOLUTELY mandatory.

Shared mutable data is dangerous also in Cassandra.

This always reminds me of this tweet with a very descriptive explanation of how dangerous it is 😉

There are two main ways to avoid it:

  • Upserting (explained above)
  • Event sourcing: basically just appending new data as it comes (see the sketch below).
    • As appending forever doesn’t scale, it’s good to combine it with some snapshot technique (taking a snapshot every night in a batch job).
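
A minimal sketch of the event sourcing option (hypothetical table and column names):

CREATE TABLE account_events (
  account_id uuid,
  event_time timeuuid,
  event_type text,
  amount decimal,
  PRIMARY KEY (account_id, event_time));

-- Never update in place, just append what happened:
INSERT INTO account_events (account_id, event_time, event_type, amount)
VALUES (uuid(), now(), 'deposit', 100.00);
-- The current state is rebuilt by reading the partition (plus the latest snapshot).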

Cassandra does not rollback

So it’s pointless to retry failed inserts unless the failure happened at the coordinator itself, because if the write reached the coordinator it will store a hint and retry it later.

Don’t use Cassandra as a queue!!

Cassandra doesn’t delete data right away; instead it marks it as deleted (tombstones), and those records stay around for a while, which will affect reads.

TTLs also generate tombstones, so beware!! (unless you use DateTieredCompactionStrategy)

Secondary Indexes

As Cassandra doesn’t know the cardinality of the indexed column, it saves the index in local tables.

Local index tables exist on every node and only contain references to data that can be found on that same node.

Therefore, a query that uses a secondary index will run on all the nodes.
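
As an illustration, a secondary index on the hypothetical users table from earlier in this post (a sketch of the mechanics, not a recommendation):

CREATE INDEX users_by_surname ON users (surname);

-- This query cannot be routed to a single partition, so every node has to
-- check its local index before the coordinator can assemble the answer:
SELECT * FROM users WHERE surname = 'Alonso';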

You can see slides of this last talk here: http://www.slideshare.net/chbatey/webinar-cassandra-antipatterns-45996021

And that was it!! Amazing, huh?

16 Notes on Cassandra Summit Europe 2014

Cassandra Summit 2014 Header

Today I attended Cassandra Summit Europe 2014 in London and it has been simply awesome, both for the technical content itself and for the overall event organisation, as usual when the DataStax guys are on it. The summit consisted of two days: the first was a training day and the second a conference day. On the training day I attended the Data Modelling Advanced Track, driven by Patrick McFadin, and here is the transcription of the notes I took.

1. Before you begin: Understand your data.

Yes, as simple as that. Fully and deeply understand the data you are working with before even thinking of switching on your editor, or even your laptop!

2. Query-Driven data modelling.

Cassandra data modelling ABSOLUTELY REQUIRES you to know how you are going to access the data upfront. Otherwise you’ll be in trouble soon.

3. Clustering order can be specified when creating the table definition.

Yes, I didn’t know that, and it’s totally necessary for use cases like: “Give me the last 5 events…”
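
For example (hypothetical events table):

CREATE TABLE events_by_user (
  user_id uuid,
  event_time timeuuid,
  event_type text,
  PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- "Give me the last 5 events" becomes a simple LIMIT, with no sorting at read time:
SELECT * FROM events_by_user
WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
LIMIT 5;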

4. You can define static fields

Again, I didn’t know about this feature: a static field is only stored once per partition and therefore has a single value for the whole partition.
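
A quick sketch (made-up table):

CREATE TABLE orders_by_customer (
  customer_id uuid,
  order_id timeuuid,
  customer_name text STATIC,   -- stored once per partition
  amount decimal,
  PRIMARY KEY (customer_id, order_id));

-- Updating customer_name once updates it for every order in that partition.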

5. Map data types are fully serialised/deserialised when writing/reading.

So make sure that you are ok with what this implies:

  • Impossible to filter results based on values contained in it.
  • Whole field read and deserialised even if only reading one of the key-value pairs.
  • Although the theoretical limit is much higher, the recommended size is up to several hundred pairs.

6. TimeUUID = Timestamp + UUID

Sounds natural, doesn’t it? But what’s the point? Well, say you want to use a timestamp as part of the primary key, for example to store when a particular record was created. If entities of this type are created really fast, chances are that two different entities share the same UNIX timestamp and, therefore, the latter will override the former. Use a timeuuid to be on the safe side.
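
A sketch with a made-up table: two inserts landing in the same millisecond still produce two distinct rows, because now() generates unique timeuuid values:

CREATE TABLE clicks_by_page (
  page_id text,
  click_id timeuuid,
  user_id uuid,
  PRIMARY KEY (page_id, click_id));

INSERT INTO clicks_by_page (page_id, click_id, user_id) VALUES ('/home', now(), uuid());
INSERT INTO clicks_by_page (page_id, click_id, user_id) VALUES ('/home', now(), uuid());
-- With a plain timestamp as the clustering column these two writes could collide
-- and the second would overwrite the first.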

7. Don’t use ALLOW FILTERING in your queries.

Although it is permitted, it is a good indicator that your model is poor.

8. Batches have a really big performance impact.

That’s because the coordinator node has to make sure that all the statements are sequentially executed and distributed among the required replica nodes before returning.

9. Pre-calculate and estimate sizes of your partitions in production.

Once you have them in adequate numbers, TEST WITH PRODUCTION DATA. Use this formula: Ncells = Nrows × (Ncols − Npk − Nstatic) + Nstatic (a worked example follows the list below), and try to keep it below 1M by:

  • Splitting partitions if too big.
    • By adding another existing field to the partition key.
    • By creating a new convenience column and then add it to the partition key.
    • By defining buckets.
  • Grouping partitions using a bucket PK if too small.
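
For instance, with completely made-up numbers: a partition holding 10,000 rows in a table with 8 columns, 3 of them in the primary key and 1 static, gives

Ncells = 10000 × (8 − 3 − 1) + 1 = 40,001 cells

which is comfortably below the 1M guideline.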

10. If your time series gives you lots of data, it is probably a good idea to store different grain-size levels and let the finer ones expire using TTLs.

This is a widely used technique. Keep very fine-grained data for the most recent period; then, if you query older data, at some point you won’t get the finest grain but rather a coarser one.
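
One way of doing that (a sketch; the table name and the 30-day TTL are made up) is to give the fine-grained table a default TTL so old raw rows expire on their own, while coarser aggregates live in a separate table without one:

CREATE TABLE readings_fine (
  sensor_id text,
  day text,
  reading_time timeuuid,
  value double,
  PRIMARY KEY ((sensor_id, day), reading_time)
) WITH default_time_to_live = 2592000;   -- raw readings expire after 30 days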

11. Use maps to reduce the number of columns. But only for columns that:

  • You’re not going to query on.
  • You don’t mind getting all together.
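
For example (a hypothetical preferences map):

CREATE TABLE users_with_prefs (
  id int PRIMARY KEY,
  name text,
  preferences map<text, text>   -- read and written as a whole, never queried by key
);

INSERT INTO users_with_prefs (id, name, preferences)
VALUES (1, 'Carlos', {'language': 'en', 'timezone': 'Europe/London'});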

12. Batches are good for managing duplicated fields.

Whenever you have duplicated fields, e.g. in lookup tables, batches are great for inserting/updating all of them or none.
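
For instance, keeping two hypothetical lookup tables in sync with a logged batch:

BEGIN BATCH
  INSERT INTO users_by_email (email, id, name) VALUES ('carlos@example.com', 1, 'Carlos');
  INSERT INTO users_by_id (id, email, name) VALUES (1, 'carlos@example.com', 'Carlos');
APPLY BATCH;
-- Either both lookup tables eventually get the row or neither does.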

13. Cassandra is not a data warehouse for analytics.

Because sooner or later you’ll need to query your data in a way that your model doesn’t support.

14. Cassandra is not a transactional database.

As we have said, transactions are quite expensive in Cassandra, so heavy transaction use is going to put us in trouble.

15. Don’t use supercolumns.

They were a bad decision from the early Thrift days and are nowadays deprecated, therefore should not be used.

16. Use Solr on top of Cassandra for full text searches

Apache Solr nicely integrates with Cassandra and is happy to index text columns on Cassandra.

And these are the 16 notes I hand wrote during the whole day training. Let’s see what comes from the conference day!
UPDATE: The second part of this series has been published!