16 Notes on Cassandra Summit Europe 2014

Today I attended Cassandra Summit Europe 2014 in London and it was simply awesome, both the technical content and the overall organisation of the event, as usual when the DataStax guys are behind it. The summit consists of two days: the first is a training day and the second a conference day. On the training day I attended the Data Modelling Advanced Track, led by Patrick McFadin, and here is a transcription of the notes I made.

1. Before you begin: Understand your data.

Yes, as simple as that. Fully and deeply understand the data model you are working with before even thinking of opening your editor, or even switching on your laptop!

2. Query-Driven data modelling.

Cassandra data modelling ABSOLUTELY REQUIRES you to know upfront how you are going to access the data. Otherwise you’ll soon be in trouble.

3. Clustering order can be specified when creating the table definition.

Yes, I didn’t know this, and it’s essential for use cases like: “Give me the last 5 events…”
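As a minimal sketch of what this looks like in CQL, here are two statements held as Python strings, the form in which you would hand them to a driver. The table and column names (events_by_sensor, sensor_id, event_time, payload) are made up for illustration and do not come from the training.

```python
# Hypothetical table: all names are illustrative, not from the training.
CREATE_EVENTS = """
CREATE TABLE events_by_sensor (
    sensor_id  text,
    event_time timestamp,
    payload    text,
    PRIMARY KEY (sensor_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
"""

# With newest-first clustering, "the last 5 events" are simply the first
# 5 rows of the partition, so the query needs no ORDER BY clause.
# The "?" is a bind marker for a prepared statement.
LAST_FIVE_EVENTS = """
SELECT event_time, payload
FROM events_by_sensor
WHERE sensor_id = ?
LIMIT 5;
"""
```

Rows inside a partition are stored in the declared clustering order, so this query reads from the head of the partition instead of sorting at read time.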

4. You can define static fields.

Again, I didn’t know about this feature. A static field is stored only once per partition and therefore has a single value shared by the whole partition.

5. Map data types are fully serialised/deserialised when writing/reading.

So make sure that you are ok with what this implies:

  • Impossible to filter results based on the values it contains.
  • The whole field is read and deserialised even if you only need one of the key-value pairs.

Although the theoretical limit is much higher, the recommended size is up to several hundred pairs.

6. TimeUUID = Timestamp + UUID

Sounds natural, doesn’t it? But what’s the point? Well, suppose you want a timestamp as part of the primary key, for example the time at which a particular record was created. If entities of this type are created really fast, chances are that two different entities will share the same UNIX timestamp and, therefore, the later one will overwrite the earlier one. Use a timeuuid to be on the safe side.
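The collision can be shown without a cluster. The sketch below uses plain Python dictionaries as a stand-in for a partition, and Python's uuid.uuid1(), which generates version 1 (time-based) UUIDs, the kind that a CQL timeuuid holds. The timestamp value and the event names are arbitrary.

```python
import uuid

# Two events created within the same clock tick (the value is arbitrary).
same_instant = 1417600000
events = ["first event", "second event"]

# Keyed by a plain timestamp: the second write silently replaces the first.
by_timestamp = {}
for event in events:
    by_timestamp[same_instant] = event

# Keyed by a version 1 (time-based) UUID: each write gets its own key.
by_timeuuid = {}
for event in events:
    by_timeuuid[uuid.uuid1()] = event

print(len(by_timestamp))  # 1
print(len(by_timeuuid))   # 2
```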

7. Don’t use ALLOW FILTERING in your queries.

Although it is permitted, it is a good indicator that your model is poor.

8. Batches have a really big performance impact.

That’s because the coordinator node has to make sure that all the statements are executed sequentially and distributed among the required replica nodes before returning.

9. Pre-calculate and estimate the sizes of your partitions in production.

Once you have them in adequate numbers, TEST WITH PRODUCTION DATA. Use this formula: Ncells = Nrow × (Ncols - Npk - Nstatic) + Nstatic, and try to keep the result below 1M by:

  • Splitting partitions if too big.
    • By adding another existing field to the partition key.
    • By creating a new convenience column and then adding it to the partition key.
    • By defining buckets.
  • Grouping partitions using a bucket PK if too small.
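To make the arithmetic concrete, here is a small sketch of the formula together with one possible bucketing scheme. The figures (one row per minute, 10 columns, 2 primary key columns, 1 static column) and the day_bucket helper are assumptions chosen for illustration, not numbers from the training.

```python
from datetime import datetime, timezone

MAX_CELLS = 1_000_000  # the 1M ceiling from the note above


def cells_per_partition(n_rows, n_cols, n_pk, n_static):
    """Ncells = Nrow x (Ncols - Npk - Nstatic) + Nstatic."""
    return n_rows * (n_cols - n_pk - n_static) + n_static


def day_bucket(ts):
    """Hypothetical bucket column: one partition per UTC day."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


# Made-up figures: one row per minute, 10 columns, 2 PK columns, 1 static.
one_year = cells_per_partition(n_rows=365 * 1440, n_cols=10, n_pk=2, n_static=1)
one_day = cells_per_partition(n_rows=1440, n_cols=10, n_pk=2, n_static=1)

print(one_year, one_year <= MAX_CELLS)  # 3679201 False
print(one_day, one_day <= MAX_CELLS)    # 10081 True
print(day_bucket(datetime(2014, 12, 3, 15, 30, tzinfo=timezone.utc)))  # 2014-12-03
```

With these figures a partition holding a whole year of rows is well over the limit, while adding the day bucket to the partition key keeps each partition at roughly ten thousand cells.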

10. If your time series produce lots of data, it is probably a good idea to store several levels of granularity and let the finer ones expire using TTL.

This is a widely used technique. Keep very fine-grained data for the most recent period; then, when you query older data, at some point you won’t get the finest grain but a coarser one.
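The roll-up step itself is simple. The sketch below is one way to do it, assuming samples arrive as (epoch_seconds, value) pairs and that averaging is an acceptable way to summarise them; the roll_up function and the sample readings are made up for illustration.

```python
from collections import defaultdict


def roll_up(samples, grain_seconds):
    """Average (epoch_seconds, value) samples into buckets of grain_seconds."""
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % grain_seconds)  # floor to the bucket start
        buckets[bucket_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Made-up raw readings, one every 20 minutes.
raw = [(0, 10.0), (1200, 20.0), (2400, 30.0), (3600, 40.0), (4800, 60.0)]
hourly = roll_up(raw, grain_seconds=3600)
print(hourly)  # {0: 20.0, 3600: 50.0}
```

In a setup like this the raw rows would be written with a short TTL and the hourly rows with a longer one (or none), so the fine-grained data disappears on its own while the coarser summary remains.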

11. Use maps to reduce the number of columns.

But only for columns that:

  • You’re not gonna query on.
  • You don’t mind getting all together.

12. Batches are good for managing duplicated fields.

Whenever you have duplicated fields, e.g. in lookup tables, batches are great for inserting/updating all of them or none.
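A minimal sketch of such a batch in CQL, held as a Python string of the kind you would pass to a driver. The two tables (users and users_by_email) and their columns are hypothetical; they stand for the same user stored under two different keys.

```python
# Hypothetical lookup tables: the same user stored under two different keys.
# The "?" placeholders are bind markers for a prepared statement.
UPSERT_USER = """
BEGIN BATCH
    INSERT INTO users (user_id, email, name) VALUES (?, ?, ?);
    INSERT INTO users_by_email (email, user_id, name) VALUES (?, ?, ?);
APPLY BATCH;
"""
```

Both inserts are applied or neither is, which keeps the duplicated fields in step, at the performance cost described in note 8.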

13. Cassandra is not a data warehouse for analytics.

Because sooner or later you’ll need to query your data in a way that your model doesn’t support.

14. Cassandra is not a transactional database.

As we have said, transactions are quite expensive in Cassandra, so heavy use of them is going to get us into trouble.

15. Don’t use supercolumns.

They were a bad decision from the early Thrift days and are now deprecated, so they should not be used.

16. Use Solr on top of Cassandra for full text searches.

Apache Solr integrates nicely with Cassandra and is happy to index its text columns.

And these are the 16 notes I wrote by hand during the full-day training. Let’s see what comes from the conference day!
UPDATE: The second part of this series has been published!
