Benchmarking Cassandra Models

During the last couple of weeks we’ve been very focused on deeply understanding Cassandra’s write and read paths in order to build our models in the best possible way.

After all that investigation I can summarise the steps we follow to assess our models’ performance.

1. Keep good practices and anti-patterns in mind.

Cassandra is known to be a very resilient and highly performant platform, but only so long as you follow the rules and work with the underlying data model.

There are quite a few of these rules, so my recommendation is to read through them quickly before thinking about your model.

There are lots of places online where you can read about the good practices to follow and the anti-patterns to avoid when data modelling for Cassandra, but I keep my own repository; here are the links:

Once you have all these ideas fresh in your mind, it’s time to start thinking about your data model.

2. Design your data model and alternatives

Following the ideas and principles learnt in the previous step, design your data model and try to think of alternatives. These alternatives usually come as minor adjustments that can be applied to the model, but best practices alone can’t tell you whether one or the other is the better choice.

Here are two examples we’ve recently had:

  1. Given a bucketed time series with a maximum of 86,400 rows per partition, which is the better way to read an entire partition?
    a) Reading the whole partition at once
    b) Reading the whole partition in chunks (2 halves, 4 quarters, …)
  2. Given a model that stores a discretised distribution on each record, which is the better way to save the bins?
    a) Using a List element that will contain the 100 bins
    b) Having 100 columns, one for each bin

The resulting models will meet all the good practices and avoid all the anti-patterns regardless of the final decision, so how do you decide which way to go?

3. Benchmark your alternatives

For this purpose I’ve created a script (Ruby in this case) that:

  1. Creates the table proposed by the model under test
  2. Generates PRODUCTION DATA and saves it (in memory or on disk, depending on the size)
  3. Benchmarks the applicable access patterns, in our case:
    3.1. Insertion of all the data generated in step 2
    3.2. Reading of all the data
    3.3. Updating the data

It is important that the access patterns are exercised exactly the same way they’ll be in production; otherwise the result of the benchmark is completely useless.

This script should be adapted and run for every single alternative.

Here you can see the scripts used for both alternatives proposed in example No. 2 above:
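The scripts themselves aren’t reproduced here, but their overall shape, using Ruby’s Benchmark module, is roughly the following (the actual Cassandra calls are stubbed out as placeholder workloads):

```ruby
require 'benchmark'

# Placeholder workloads -- in the real scripts each lambda would issue the
# corresponding CQL statements through the Cassandra driver instead.
insert_all = -> { 10_000.times { |i| i * 2 } }   # stand-in for the inserts
read_all   = -> { 10_000.times { |i| i + 1 } }   # stand-in for the reads
update_all = -> { 10_000.times { |i| i - 1 } }   # stand-in for the updates

# Benchmark.bm prints user/system/total/real times for each workload,
# which is exactly the output format you'll see further down.
Benchmark.bm(10) do |x|
  x.report('Writing:')  { insert_all.call }
  x.report('Reading:')  { read_all.call }
  x.report('Updating:') { update_all.call }
end
```
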

4. Let data decide

We’re almost there!!

Now we have to collect the data for each execution and compare them to work out which of the candidates is our final model.

There are two sources of data you should look at:

  1. The output of the script’s execution.
    1.1 The script prints an output for every workload benchmarked (Insert, Read and Update in our example)
  2. DataStax OpsCenter’s graphs.
    2.1 DataStax OpsCenter is probably the most advanced and easiest to use Cassandra monitoring tool.

In our previous example we could see this output from the scripts:

calonso@XXX-XXX-XXX-XXX: bundle exec ruby lists_benchmark.rb
                user     system        total          real
Writing:  133.840000   5.180000   139.020000   (171.499862)
Reading:   24.000000   0.350000    24.350000   ( 47.897192)
Updating:   2.560000   0.210000     2.770000   (  4.135555)

calonso@XXX-XXX-XXX-XXX: bundle exec ruby cols_benchmark.rb
                user     system        total          real
Writing:  133.730000   2.710000   136.440000   (144.749167)
Reading:   30.340000   0.410000    30.750000   ( 41.759687)
Updating:   1.950000   0.090000     2.040000   (  3.020997)

So we could say that the columns model performs better than the list-based one, but let’s confirm our initial assessment by looking at OpsCenter’s graphs:

In all the graphs we can see two peaks: the first one was generated while benchmarking the list-based model and the second one while benchmarking the columns-based one.

Absolutely every graph comparison points towards the columns-based model as the better performing one:

reads

  • These graphs show the total reads per second received by coordinator nodes across the whole cluster and the average time taken to respond to them.

writes

  • These graphs show the total writes per second received by coordinator nodes across the whole cluster and the average time taken to respond to them.

os_load

  • Average measure of the amount of work a computer performs. A load of 0 means no work at all, and a load of 1 means 100% of work for a single core; therefore this value depends on how many cores are available (2 in our deployment).
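As a quick illustration of how to read this metric, normalising the reported load by the core count gives per-core utilisation (the 2-core default reflects our deployment):

```ruby
# Normalise an OS load average by the number of available cores.
def load_per_core(load, cores = 2)
  load / cores.to_f
end

p load_per_core(2.0)   # 1.0 -> both cores fully busy
p load_per_core(1.0)   # 0.5 -> running at half capacity
```
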

gcs

  • Number of times each of the JVM GCs runs per second and the time elapsed in each run.

local_reads

  • Total reads per second received by the specific column families being used and the average time taken to respond to them.

local_writes

  • Total writes per second received by the specific column families being used and the average time taken to respond to them.

And that’s all for us for now!

Troubleshooting Cassandra performance issues

A couple of weeks ago we released a feature whose performance was unexpectedly poor, and here I want to share the steps and tools we used to get to the root cause of the problem.

To give a little bit of background, the feature was something really common nowadays: saving a bunch of time series in Cassandra.

Step 1: Look for evidences in metrics

The natural first step, which I think everyone takes, is to look at the graphs (here we use Graphite) and try to find evidence of what’s going on.

In our case we had very clear evidence that something was wrong, as the time consumed by the process had increased by around 600% after the deploy!!

events_latency_increase

But that means that not only is something wrong in the feature but also, even more scarily, in our whole process!! How can such a poorly performing piece of code have reached production without anyone noticing it before? Ok, we don’t have performance tests as part of our CI process, but we test every feature in our pre-production environments before going to production, and this should have shown up there!! That would simply have been unacceptable, and processes are easy and strong here @_MyDrive, so after scrolling a little bit along the graphs we found an explanation: the tests run in the pre-production environment were ok!

events_latency_increase_with_labs

Ok, so we have somewhere to look: something in our production environment is performing poorly and, at first glance, our stats are not showing it at all.

Step 2: Profile suspicious code

At this step we used the fantastic RubyProf, wrapping the suspicious code in a RubyProf.profile block and saving the results to analyse later.

require 'ruby-prof'

results = RubyProf.profile do
  # [Code to profile]
end

printer = RubyProf::GraphPrinter.new(results)
printer.print(File.new('/tmp/profiled-events-insert.txt', 'w'), min_percent: 2)

Reading the saved files I could clearly see that the time was going into the Cassandra-related calls, so I assumed the problem would be somewhere in the model/queries.

I could have read a little more of the profiling result, which would probably have saved some steps here, but as we were issuing several thousands of inserts asynchronously and the first lines of the profiling report were pointing to Cassandra, everything looked crystal clear.

Step 3: Trace queries

There’s only one thing we’re doing here: INSERT, so…

cqlsh:carlos_test> TRACING ON
cqlsh:carlos_test> INSERT INTO ...
activity                          | timestamp    | source       | source_elapsed 
----------------------------------+--------------+--------------+----------------
execute_cql3_query                | 10:26:35,809 | 10.36.136.42 | 0 
Parsing INSERT INTO ...           | 10:26:35,836 | 10.36.136.42 | 26221 
Preparing statement               | 10:26:35,847 | 10.36.136.42 | 37556 
Determining replicas for mutation | 10:26:35,847 | 10.36.136.42 | 37867 
Acquiring switchLock read lock    | 10:26:35,848 | 10.36.136.42 | 38492 
Appending to commitlog            | 10:26:35,848 | 10.36.136.42 | 38558 
Adding to events memtable         | 10:26:35,848 | 10.36.136.42 | 38600 
Request complete                  | 10:26:35,847 | 10.36.136.42 | 38926 

Looking at these results, something looks broken in the query parsing, and running the same thing in our pre-production environment confirms that something is wrong in production:

cqlsh:benchmark> TRACING ON
cqlsh:benchmark> INSERT INTO ...
activity                          |  timestamp   |    source     | source_elapsed 
----------------------------------+--------------+---------------+---------------- 
execute_cql3_query                | 10:27:40,390 | 10.36.168.248 | 0 
Parsing INSERT INTO ...           | 10:27:40,390 | 10.36.168.248 | 75 
Preparing statement               | 10:27:40,390 | 10.36.168.248 | 233 
Determining replicas for mutation | 10:27:40,390 | 10.36.168.248 | 615 
Acquiring switchLock read lock    | 10:27:40,390 | 10.36.168.248 | 793 
Appending to commitlog            | 10:27:40,390 | 10.36.168.248 | 827 
Adding to events memtable         | 10:27:40,390 | 10.36.168.248 | 879 
Request complete                  | 10:27:40,391 | 10.36.168.248 | 1099

But what could be so wrong with parsing a query?

Step 4: Simplify the problem

At this point I decided to write a small Ruby program that:

  1. Connects to the Cassandra cluster
  2. Creates a test keyspace
  3. Creates a column family within the test keyspace
  4. Runs an insert like the one profiled above
  5. Drops the test keyspace

and profiles it all using RubyProf to try to spot something obvious.

Running this script in production showed something useful: more than 98% of the time was spent in the Cluster#connect method! Furthermore, almost 100% of the time inside that method was going to IO.select, which means that the time was being wasted outside Ruby itself.

Actually I could have saved some time, because this exact same thing was also clear in the first profiling I did in step 2, but my premature assumption made me walk some extra unnecessary steps.

Ok, we had a new symptom but no idea where to look, so after some desperate and useless attempts, like tcpdumping the communications between the client and the cluster, I decided to go back to the script I wrote and…

Step 5: Enable Logging

Maybe this should have been the first step, right?

But this was really revealing! For some reason, the driver was trying to connect to four nodes when our ring only has three!! Of course, the connection to this extra node was failing on timeout (10 seconds), and that was the source of our poor performance!

A quick Google search with the symptoms as the query and… ta-dah!! We were facing a Cassandra bug!!

We quickly applied the fix and were saving our 10 seconds per execution again.

Conclusions

  1. Measure everything
    • Metrics and monitoring let us know we were facing an unexpected performance issue
  2. Keep calm, read to the end
    • In the first profiling report I could have spotted that the issue was in Cluster#connect, but in my eagerness to find a fix I made a wrong assumption that cost more time.
  3. Thank the community
  4. Take advantage of learning opportunities!
    • These kinds of unexpected situations normally push you out of your comfort zone and really test your limits, which is really good for your personal development.

Notes on Bath Ruby 2015

UPDATE: I’ve just received this amazing timelapse of the event! (Don’t miss the Happy Friday Hug moment and the massive movement during breaks xD)

Friday 13 March 2015, 5.30am. My alarm clock rings. I don’t feel tired but excited instead!
Why? Lots of hours to enjoy learning Ruby from some of the top minds!!

So after several hours by train and car here we are!!
We made it just in time to grab a coffee, and then the talks started!

Linda Liukas

Linda showed us her successful Kickstarter project, HelloRuby: definitely a really nice idea
to introduce kids to the passion for computing and programming.

Her talk gave us the idea of following in her footsteps and trying to get a driving-related project
off the ground using Kickstarter funds. Would you like to see that happen? Let us know!

Ben Orenstein

Listening to Ben is something very familiar to me, as I’ve been following his Robots Podcast
for several years, but meeting him in person was awesome!! Especially because of the great talk he gave.

I love live coding during conferences, and his was a really good one, beginning with working code and refactoring it
to improve its quality.

Running tests should be fast, and it can be fast, so if it doesn’t feel like that, then you’re doing something wrong.

equation
Where n is the number of keystrokes needed to run the required test(s)

He also made a note on The SOLID Principles.

Saron Yitbarek

Saron’s talk was the classic story of someone who moves her career from marketing to coding, and along the way
she dropped some interesting ideas:

Everything changes all the time. Welcome to programming!!

Well, that’s undeniably true, isn’t it? And in my experience this happens at all levels: business requirements and specifications,
the best frameworks and technologies for a particular task, even your own fondness for a particular technology… countless examples, you name it.

Reading code is one of the best things to gain experience

I think this is undeniably true again, and I’d like to highlight the code-hunting process she described,
which she followed in order to find the best pieces of code to go through… failing. It is not easy to know
which pieces of code are good or bad before fully understanding them, and once they are understood, the learning process
has already taken place because of all the questions she asked about that code!! So, in order to pick
good pieces of code to read and learn from, just make sure the size is something you can easily manage.

Over her whole learning process she spent $11k and several full-time months. One of the clearest things to her
was the strength of the community, and the idea of allowing people with fewer resources to learn Ruby as well led her
to organise CodeClub, TwitterChats and The CodeNewbie Podcast.

Apart from her talk, we loved the OpenSource Board,
which is one more piece of evidence that the Ruby community is absolutely strong.

Katrina Owen

Katrina spoke about a very bad, and unfortunately common, situation in the development world
that we should definitely eradicate:
The Fundamental Attribution Error

It has two sides:

  • Companies and organisations should strive to avoid it by using codes of conduct or contract clauses.
  • Each developer should decide to COOPERATE
    • When using someone else’s code: by just trying to understand the code itself rather than complaining that the author was an idiot or a bad developer. You don’t know what their situation was when they wrote it, and, ultimately, everyone can have a bad day.
    • When committing: by making sure the upcoming processes on this code (code reviews or next developers) are the smoothest and most pleasant coding experience ever.

This talk was very inspirational; I even heard about someone who wanted to become a better
person after it 😉

Tom Stuart

Tom’s talk was, to me, a myth killer. He basically explained the abstraction of numbers
from scratch to end up with:

Mathematics = spotting patterns and building reusable abstractions.

And then wondering… Isn’t this EXACTLY the same as programming?

The explanation is that maths is not only the theory we studied and have
associated with the term. So typical quotes such as…

You don’t need maths to program

Or

You need to be VERY good at maths to program

become somehow FALSE, as it all depends on which kind of programming you’re up to and
which parts of maths it requires.

Sandi Metz

Sandi’s talk showed two interesting programming patterns starting from the idea that she,
as a former purely functional programmer, hates conditionals.

  • First was an example of the so called Null Object Pattern, and how it removes the conditionals. You can see the code of this example along with her explanations on her blog
  • Second was an example of how Inheritance differs from Composition and how the latter can be used to remove conditionals again.
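As a rough sketch of the first idea (my own toy example, not the code from Sandi’s talk): instead of returning nil and forcing callers into conditionals, you return an object that answers the same interface with neutral behaviour.

```ruby
# The Null Object Pattern: NullAnimal stands in for a missing Animal,
# so callers never need an `if animal` check.
class Animal
  def initialize(name)
    @name = name
  end

  def sound
    "#{@name} makes a sound"
  end
end

class NullAnimal
  def sound
    "silence"
  end
end

def find_animal(name)
  name ? Animal.new(name) : NullAnimal.new
end

# No conditional needed at the call site:
p find_animal('dog').sound   # "dog makes a sound"
p find_animal(nil).sound     # "silence"
```
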

To wrap up, and as it was Friday, we sent a massive Happy Friday Hug and then went for some
free beers and snacks that I’m sure everyone enjoyed.

And that was it! But before closing this post, we’d like to remark how well the
conference was organised. We really enjoyed it and we’re really looking forward to the 2016 edition!!

Ruby Thread Synchronisation

Multi-threaded programming is one of those things I like doing from time to
time. I find it particularly useful when automating any task that involves
downloading from the network.

My approach is usually publish-subscribe using Ruby’s very handy
Queue class.

Sometimes I build it in several stages; e.g., the last time I used it I was building
a tool for downloading and processing a whole S3 bucket, and I designed it with
two messaging layers. The first one publishes object names within the bucket
for the second one to pick up, actually download and process, and finally
publish the results for an aggregator thread (the main program’s thread).

2_stages_pub_sub_diagram
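A minimal sketch of that two-stage pipeline (names and payloads are illustrative; the real tool talks to S3):

```ruby
names   = Queue.new   # stage 1 -> stage 2: object names
results = Queue.new   # stage 2 -> aggregator: processed payloads

# Stage 1: publish object names (stand-in for listing the S3 bucket)
lister = Thread.new do
  5.times { |i| names << "object-#{i}" }
  names << :done                        # signal the end of the stream
end

# Stage 2: pick names up, "download and process" them, publish results
downloader = Thread.new do
  while (name = names.pop) != :done
    results << "processed #{name}"
  end
  results << :done
end

# Aggregator: the main thread collects everything
aggregated = []
while (result = results.pop) != :done
  aggregated << result
end

lister.join
downloader.join
p aggregated
```
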

My approach here is to leverage all the synchronisation in queues rather
than waiting for threads or passing messages between them.
But don’t worry, that’s not a crazy or hacky approach at all; it’s just Ruby’s
recommended way.

So, what I’m going to do here is explain by example how to do it
properly, avoiding errors and obtaining a clean end. Let’s start off by showing a
failing example.

require 'thread'

queue = Queue.new

producer = Thread.new {
  5.times do |i|
    queue << i
    sleep 1
  end
  p 'Producer exiting'
}

consumer = Thread.new {
  while producer.alive?
    p "Popped #{queue.pop}"
  end
  p 'Consumer exiting'
}

producer.join
consumer.join

This code sets up two threads, a publisher (producer) and a subscriber (consumer).
The producer publishes a value to the queue and sleeps a second, five times.
The consumer simply pulls messages from the queue as soon as they’re available.

The producer’s exit condition is very straightforward: as soon as it finishes its
job, it simply exits. The consumer, on its end, monitors the producer’s status
and will exit as soon as it detects the producer is not alive anymore.

Finally, the main thread waits for both to finish with Thread#join.

All looks good, doesn’t it? But when we run it… Crash!!

"Popped 0"
"Popped 1"
"Popped 2"
"Popped 3"
"Popped 4"
"Producer exiting"
`join': No live threads left. Deadlock? (fatal)

Investigating a bit you’ll find that this error is raised at the consumer’s queue.pop
invocation. When the consumer checked the producer’s status it was still alive, but by the
time it called queue.pop the producer had already finished, so the consumer blocks forever
on an empty queue and Ruby detects the deadlock. We could try several approaches, but the
best and most robust one, I think, is to use what I call end of operation objects.

Those objects are simply instances of a dummy class whose purpose is to signal the end
of the operations queue. Using end of operation objects we can rewrite our piece
of code as follows:

require 'thread'

class EndOfOp ; end

queue = Queue.new

producer = Thread.new {
  5.times do |i|
    queue << i
    sleep 1
  end
  p 'Producer exiting'
  queue << EndOfOp.new
}

consumer = Thread.new {
  while obj = queue.pop
    break if obj.instance_of? EndOfOp
    p "Popped #{obj}"
  end
  p 'Consumer exiting'
}

producer.join
consumer.join

Now the producer pushes an instance of the `EndOfOp` dummy class just before exiting
to signal the consumer that it has finished its job. The consumer, on its end, just
tests that every pulled object is not an `EndOfOp` before continuing.

Executing this code we would see:

"Popped 0"
"Popped 1"
"Popped 2"
"Popped 3"
"Popped 4"
"Producer exiting"
"Consumer exiting"
[Finished in 5.1s]
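If you’re on Ruby 2.3 or newer there’s also a built-in alternative to hand-rolled sentinel objects: Queue#close. Once a queue is closed and drained, pop returns nil, which ends the consumer’s loop naturally:

```ruby
queue  = Queue.new
popped = []

producer = Thread.new do
  5.times { |i| queue << i }
  queue.close                 # signals consumers: no more items will arrive
  p 'Producer exiting'
end

consumer = Thread.new do
  # pop returns nil once the queue is closed and empty
  while (obj = queue.pop)
    popped << obj
    p "Popped #{obj}"
  end
  p 'Consumer exiting'
end

producer.join
consumer.join
```
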

And that’s all. Happy pubsubbing!!

Construyendo y publicando nuestra primera app multiplataforma (II) [Building and publishing our first cross-platform app, part II]

In this second session of the Interlat Webinars series I review how the variety of devices on the market, in terms of brands (and underlying operating systems), screen sizes and resolutions, input methods, available sensors, etc., which is something really good for consumers, constitutes a big problem for those planning to develop an application or game.

The problem is technically called ‘fragmentation’ and appears when we realise how much effort will be required for our product to run on as many devices as possible.

A possible solution is proposed: the use of technologies such as HTML5, CSS and JS, optionally with a framework like jQuery Mobile, to develop a mobile web application.

Also during the presentation I show a little example of a well-known videogame with a world ranking to demonstrate some of the possibilities this set of technologies offers. The main features are:

  • HTML5 Canvas based 2D game
  • Usage of websockets to obtain updates to the ranking in real time.
  • Usage of Google Maps and a table to display the ranking.
  • Usage of jQuery Mobile to quickly build all pages and UI.
  • Backend built as an API using Grape, PostgreSQL and ActiveRecord and deployed to Heroku.

You can see the source of the project in my js interlat demo app github repo.

Rails Friendly URLs Gem

Great! After several months working on this, I have finally released the first stable version of my Rails Friendly URLs gem. This is a very interesting feature that I started developing at OffsideGaming as a feature request for a particular partner, and I was surprised to realise that no gems for it already existed! Basically, the gem allows administrators of any web service to dynamically configure any URL they want into a SEO-friendly one and instantly see it applied on their website. Apart from how cool it is to see a friendly URL rather than a machine-generated one, it definitely has its SEO advantages, and therefore I thought it would be a nice contribution to the open source community. It performs three changes for every configured URL:

  • Makes the application recognise the new route and act exactly as the originally configured one.
  • Makes the original one redirect to the new one, to avoid search engines penalising your site.
  • Makes the Rails url and path helpers use the friendly URL, so you don’t need to change a single line of code in order to use them. And all this is done dynamically, without even needing to restart your application server!

If you want to know more, just visit the Rails Friendly URLs Gem Github Repo, where you will find all the source code and instructions to get set up and running easily. Also, if you want to see it in action and even play a little bit with it, I’ve uploaded an example application to Heroku. The url is this: https://rails-friendly-urls-test.herokuapp.com/

I’ll be more than happy to hear from you if you ever use it, or even if you find any issue using it. Leave your comments below!

Cheers!