Scalable Data Modelling by example – Cassandra Summit ’16

Two facts motivated this talk:

  • First, a data model cannot be changed once it is in production. Well, it can, but only by migrating the data to the new model with external tools such as Spark.
  • Second, one of the most common pain points in Cassandra deployments out there is performance issues caused by bad data models.

So, to provide a Cassandra overview and a ‘checklist’ to follow when data modelling, here you have my talk.

And the slides are here

As a final comment, it was my bad that, during the talk, I forgot to acknowledge DataStax, both for all the amazing work they’re doing and because most of the content of this talk is taken from their Academy website. So once again, thanks DataStax for all your efforts, for such an amazing Summit and for all your contributions to the Cassandra Open Source project.

Scalable Data Modelling By Example – Cassandra London Meetup

This time I spoke at the Cassandra London Meetup and had the chance to share the stage with the amazing Patrick McFadin!!

My talk is about the concepts Cassandra is built on that definitely make a difference in how we have to model our data. It reviews those concepts theoretically and then does some data modelling by example.

Here you have the slides

You can see the video here:

Cassandra instantaneous in place node replacement

At some point everyone using Cassandra faces the situation of having to replace nodes, either because the cluster needs to scale and some nodes are too small, because a node has failed, or because your virtualisation provider is going to remove it.

Whatever the reason, the situation we face is that a new node has to be streamed in to hold the exact same data the old one had. Why wait for a whole streaming process, with the network and CPU overhead it requires, when we could just copy the data onto the new node and have it join the ring in place of the old one?

That’s what we at MyDrive have been doing for a while, and we want to share the exact process we follow with the community in case it helps someone.


The main idea behind this process is to have the replacement node up and running as quickly as possible by cutting out the step that takes longest: streaming data.

The key points of the process are:

  1. Data will be copied from the old node to the new one using an external volume instead of transmitting it through the network.
  2. The new node will be given the exact same configuration as the replaced one. Therefore, the replacement node will be responsible for the same tokens as the replaced one, and will also have the same Host-ID, so, when it joins the ring, the other nodes won’t even notice the difference!

All our infrastructure is in AWS, so we used EBS volumes to back up and restore the Cassandra data. You may use whichever data transfer method suits your infrastructure better.


  1. Set up the new node, paying special attention to the following configuration parameters:
    1. listen_address
    2. rpc_address
    3. seeds
  2. Create the external volume you’re going to use to transfer the data from the old node to the new one.
  3. Rsync data and commitlog directories to the external volume
    1. Mount the external volume into the old node in /mnt/backup
    2. Copy the Cassandra data directory into the volume (note the trailing slashes: without them rsync nests the source directory inside the destination): rsync -av --progress --delete /var/lib/cassandra/data/ /mnt/backup/data/
    3. Copy the Cassandra commitlog directory into the volume: rsync -av --progress --delete /var/lib/cassandra/commitlog/ /mnt/backup/commitlog/
    4. Unmount and disconnect the volume. Connect and mount it into the replacement node.
    5. Copy the Cassandra data directory: rsync -av --progress --delete /mnt/backup/data/ /var/lib/cassandra/data/
    6. Copy the Cassandra commitlog: rsync -av --progress --delete /mnt/backup/commitlog/ /var/lib/cassandra/commitlog/
  4. Drain the old node: nodetool drain
  5. Stop Cassandra in the old node: sudo service cassandra stop
    1. And make sure it doesn’t accidentally come back (e.g. if you’re running chef, supervisor or any other tool that may restart it automatically). This is EXTREMELY important: if the replacement node tries to join the ring while the old one is still alive, the new host will be assigned a new Host ID and the ring will be rebalanced as if we were adding a new node instead of replacing one.
  6. Do a final rsync to catch any last changes (repeat all the sub-steps of step 3).
  7. Ensure the Cassandra data and commitlog folders are owned by the cassandra user (rsync copies the owner’s UID along with the data, and that UID may not be the appropriate one on the new machine).
  8. Start the new node. sudo service cassandra start
  9. Check that everything is working properly:
    1. In the replacement’s logs you should see a message like this, indicating that the new node is replacing the old one: WARN <time> Not updating host ID <host ID> for /<replaced node IP address> because it's mine
    2. In the replacement’s logs you should also see one message like the following per token, indicating that the new node is becoming primary owner of the replaced node’s tokens: INFO <time> Nodes /<old IP address> and /<new IP address> have the same token <a token>. Ignoring /<old IP address>
    3. In the other nodes’ logs you should see a message like this, indicating that the other nodes acknowledge the change: Host ID collision for <Host ID> between /<replaced IP address> and /<replacement IP address>; /<replacement IP address> is the new owner
    4. nodetool [status] should show the new node’s IP owning the replaced Host ID and the old one shouldn’t appear anymore.
    5. Everything should look normal.
  10. Update the other nodes’ seeds list if the replaced node was a seed.
  11. You can now safely destroy your old machine.

And voilà! By following these steps carefully you will be able to replace nodes and have them running quickly, avoiding any token movement or streaming.
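The log checks from step 9 can be automated with a small script. What follows is only a minimal sketch: the regular expressions are modelled on the messages quoted above (their exact wording may differ between Cassandra versions), and the function name and the sample lines are made up for illustration.

```python
import re

# Patterns modelled on the log messages quoted in step 9; the exact wording
# is an assumption and may change between Cassandra versions.
REPLACEMENT_PATTERNS = {
    "host_id_kept": re.compile(r"Not updating host ID \S+ for /\S+ because it's mine"),
    "token_taken": re.compile(r"Nodes /\S+ and /\S+ have the same token \S+\.\s+Ignoring /\S+"),
    "peer_ack": re.compile(r"Host ID collision for \S+ between /\S+ and /\S+; /\S+ is the new owner"),
}


def count_replacement_messages(log_lines):
    """Count how many lines match each of the expected replacement messages."""
    counts = dict((name, 0) for name in REPLACEMENT_PATTERNS)
    for line in log_lines:
        for name, pattern in REPLACEMENT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Made-up sample lines, only to show the call.
sample = [
    "WARN 12:00:01 Not updating host ID 1a2b for /10.0.0.5 because it's mine",
    "INFO 12:00:01 Nodes /10.0.0.5 and /10.0.0.9 have the same token -123. Ignoring /10.0.0.5",
]
print(count_replacement_messages(sample))
```

On the replacement node you would expect one "host_id_kept" hit and one "token_taken" hit per token; on the other nodes, a "peer_ack" hit.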

Codemotion 2015 – Cassandra for impatients

This talk tries to provide the basic but fundamental concepts required to build scalable Cassandra data models. I think that we technical people are impatient, and that sometimes leads to errors. Usually those errors are acceptable, but when we’re dealing with Cassandra models, the cost of detecting why models are not scaling and having to modify them live will always be very high and painful. Basically, you don’t want to build and deploy models that don’t scale.

In order to get those concepts, here you have the slides and the video.

Fitting IPython Notebooks, Spark and Cassandra all together

Yesterday, after more than a whole week working on this, I finally managed to set this whole stack up.

This stack is one of the hottest trending topics nowadays, as it is very useful for Big Data, specifically for data exploration purposes.

My starting point was a 3-node Cassandra cluster, intended for analytics, data exploration and ad hoc reporting. This cluster was running DSE (DataStax Enterprise) 4.6, which ships with Cassandra 2.0. It is worth mentioning that the cluster was running in Cassandra-only mode.

At the latest Cassandra Summit ’15 it was made clear that Spark should not be used with any Cassandra earlier than 2.1.8, because the integration was buggy. Therefore:

  1. Upgrade DSE to the latest version (4.8), which includes Cassandra 2.1.10.
  2. The next step is to enable Spark on the cluster. This is one of the points where relying on something like DSE comes in handy, as the DSE distribution comes with a Spark installation and the Cassandra-Spark connector already configured and optimised for maximum compatibility and throughput. It is also really easy to enable Spark on the nodes.

  3. Once I had Cassandra and Spark running on all nodes and one of them was designated as the Spark Master (first of the Cassandra seeds by default, but you can check it with dse client-tool spark master-address), it was time to install IPython and all its dependencies:
    1. First step is to install all system required packages (use apt-get, yum, etc depending on your OS): sudo apt-get install build-essential libcurl4-openssl-dev libssl-dev zlib1g-dev libpcre3-dev gfortran libblas-dev libblas3gf liblapack3gf liblapack-dev libncurses5-dev libatlas-dev libatlas-base-dev libscalapack-mpi1 libscalapack-pvm1 liblcms-utils python-imaging-doc python-imaging-dbg libamd2.3.1 libjpeg-turbo8 libjpeg8 liblcms1 libumfpack5.6.2 python-imaging libpng12-0 libpng12-dev libfreetype6 libfreetype6-dev libcurl4-gnutls-dev python-pycurl-dbg git-core cython libhdf5-7 libhdf5-serial-dev python-egenix-mxdatetime vim python-numpy python-scipy pandoc openjdk-7-jdk
    2. Next step is to install Python Virtualenv, as IPython depends on Python and changing the system’s Python installation can be dangerous: sudo pip install virtualenv
    3. Then, create a folder for your installation. This folder will contain the virtual environment for the notebooks installation (ipynb in this example). mkdir ipynb and cd ipynb
    4. Create a virtual environment: virtualenv ipython. Where ipython is the name of the virtual environment we’re creating.
    5. To begin using the virtual environment we need to activate it: source ipython/bin/activate. At this point our prompt will indicate that we’re inside the ipython virtual env.
    6. Install all IPython’s dependencies (using pip): pip install uwsgi numpy freetype-py pillow scipy python-dateutil pytz six scikit-learn pandas matplotlib pygments readline nose pexpect cython networkx numexpr tables patsy statsmodels sympy scikit-image theano xlrd xlwt ipython[notebook]
  4. Let’s create an IPython default profile (we won’t use it, but it’s safer to create it to avoid bugs and strange issues)
    • ./ipython/bin/ipython profile create --profile=default --ipython-dir .ipython
  5. Then we create the pyspark ipython profile we’ll use.
    • ./ipython/bin/ipython profile create --profile=pyspark --ipython-dir .ipython
  6. Now install MathJax extension:
    • python -c "from IPython.external.mathjax import install_mathjax; install_mathjax(replace=True, dest='~/ipynb/ipython/lib/python2.7/site-packages/IPython/html/static/mathjax')"
  7. Now paste the following contents into ~/ipynb/.ipython/profile_pyspark/

      c = get_config()
      # The IP address the notebook server will listen on.
      # If set to '*', will listen on all interfaces.
      c.NotebookApp.ip= '*'
      # Port to host on (e.g. 8888, the default)
      c.NotebookApp.port = 8888
      # Open browser (probably want False)
      c.NotebookApp.open_browser = False
  8. Create now the following file under ~/ipynb/.ipython/profile_pyspark/startup/

      import os
      import sys
      spark_home = os.environ.get('SPARK_HOME', None)
      if not spark_home:
        raise ValueError('SPARK_HOME environment variable is not set')
      sys.path.insert(0, os.path.join(spark_home, 'python'))
      sys.path.insert(0, os.path.join(spark_home, 'python/lib/'))
      execfile(os.path.join(spark_home, 'python/pyspark/'))
  9. Prepare the environment:

    1. export SPARK_HOME=/usr/share/dse/spark
    2. export PYSPARK_SUBMIT_ARGS='--master spark://<spark_master_host>:<spark_master_port> pyspark-shell'
  10. Start your Notebooks server!!:

    • ipython/bin/ipython notebook --profile=pyspark

Now you should be able to navigate to <host_running_notebooks_server>:8888 and see the WebUI!

Finally, check that everything is working by creating a new notebook, typing sc into it and running it. You should see something like <pyspark.context.SparkContext at 0x7fc70ac8af10>


  1. I’ve followed the steps but when I type and run sc, an empty string is returned.

    • Make sure environment variables defined at step 9 are properly set.
    • Make sure the actually exists under SPARK_HOME/python/lib (Update your version of it on the startup script accordingly).
    • That’s because the startup script (saved at step 8 under …startup/) hasn’t run properly. Try running its contents as a notebook to debug what’s happening.
  2. When running the startup script as a notebook I get: Java gateway process exited before sending the driver its port number

    • You are running a Java JRE instead of a JDK.
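Before digging into the notebook itself, the environment from step 9 can be sanity-checked from plain Python. This is a minimal sketch: the function name is made up, and the rules it applies (a python directory under SPARK_HOME, a spark:// master and a trailing pyspark-shell in PYSPARK_SUBMIT_ARGS) simply mirror what the steps above set up rather than any official validation.

```python
import os


def check_pyspark_env(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(os.path.join(spark_home, "python")):
        problems.append("SPARK_HOME has no 'python' directory: " + spark_home)

    submit_args = env.get("PYSPARK_SUBMIT_ARGS", "")
    if "--master spark://" not in submit_args:
        problems.append("PYSPARK_SUBMIT_ARGS does not point at a Spark master")
    if not submit_args.strip().endswith("pyspark-shell"):
        problems.append("PYSPARK_SUBMIT_ARGS should end with 'pyspark-shell'")

    return problems


# Check the environment of the current process and report anything suspicious.
for problem in check_pyspark_env(os.environ):
    print(problem)
```

Running it in the same shell that launches the notebook server prints nothing when both variables look like the ones exported in step 9.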

Cassandra Workshop: Cassandra from scratch in one day

This is an opportunity the ShuttleCloud people gave me after learning that I held the official Cassandra Developer certificate.

We organised it along with some DataStax guys and published it as a Madrid Cassandra Users Meetup.

I prepared and gave the workshop to a group of highly skilled and motivated people over one intensive day.

The experience was absolutely awesome and, following comments and ratings, people also enjoyed it.

Notes on Cassandra Summit 2015

DataStax Cassandra Summit is becoming a GREAT conference!! With each new edition more and more people, talks and activities are adding themselves to the event to make it really awesome.

This year has been very special and intense for me, for several reasons.

  • First because of the location. This year the Summit was held in Santa Clara, CA, which meant an opportunity for me to get to know the area, including San Francisco and the Bay. I’ve really enjoyed walking around and visiting the city’s tourist attractions. San Francisco is a lovely city and the weather has been just amazing, but I wouldn’t like to close this note without mentioning the reality of the streets. The number of homeless people and people suffering from mental illness you can see just by walking around the city is simply incredible. And it seems that the situation is getting worse as housing prices increase. This article in the news particularly caught my attention: Tech bus drivers forced to live in cars to make ends meet.

  • Second reason is because I have spoken there!! This has been, by far, the biggest conference I’ve had the pleasure to speak at. And I really enjoyed it!! The talk was actually a live coding demo (best kind of talks for sure, aren’t they?) showing how I, being a Cassandra developer, not admin, tackled a production issue a couple of months ago. You can see the exact steps in this previous blog post and also in this video of this same talk in the London Cassandra Meetup.

  • Third reason is because I’ve been officially certified as an Apache Cassandra Developer by DataStax and O’Reilly Media:

  • And finally, because I’ve been named one of the DataStax MVPs of the year 2015!! And, to be honest, I couldn’t be more excited. I love Cassandra and I love contributing to OS projects, and getting some recognition for it is always welcome. I’d like to congratulate every other MVP of this year (full roster to be announced soon) and the whole DataStax team for such a great and well organised event.

Tech Notes. Day 1

And after all these personal notes, let’s go into the technical notes I made from the talks I attended. Hope you find them interesting!!

project and its GIS tools for Hadoop subproject

This first one is a project to check out and always bear in mind when working with geospatial data and the corresponding visualisations.

Time series writing performance can be improved by buffering

This is pretty simple. If you do some buffering in your application and write bigger chunks of data, rather than doing one write per observation, it’s likely that your overall performance will go up.
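As a rough illustration of the idea (not something shown in the talk), here is a minimal buffering sketch. BufferedWriter and flush_fn are hypothetical names; flush_fn stands for whatever performs one write per chunk in your application, and here it just appends the chunk to a list.

```python
class BufferedWriter(object):
    """Collects observations and hands them to flush_fn in chunks."""

    def __init__(self, flush_fn, max_size=100):
        self.flush_fn = flush_fn  # hypothetical callable doing one write per chunk
        self.max_size = max_size
        self._buffer = []

    def add(self, observation):
        self._buffer.append(observation)
        if len(self._buffer) >= self.max_size:
            self.flush()

    def flush(self):
        # Swap the buffer out first so a failing write does not block new data.
        if self._buffer:
            chunk, self._buffer = self._buffer, []
            self.flush_fn(chunk)


written_chunks = []
writer = BufferedWriter(written_chunks.append, max_size=3)
for reading in range(7):
    writer.add(reading)
writer.flush()  # push out the last, partially filled chunk
print(written_chunks)
```

The trade-off is that anything still sitting in the buffer is lost if the application dies before the next flush, so the chunk size is a balance between throughput and how much data you can afford to lose.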

PagerDuty’s ‘One year of Cassandra failures’

PagerDuty’s talk about their infrastructure was really interesting to me, as it somehow resembles ours.

Watch the dropped message counts. They anticipate bigger issues.

Upgrade Cassandra, at least to the version shipped with the latest DSE.

The Cassandra Danger Metrics page.

Having a Dashboard of danger metrics to look at can be very useful. This should include pending tasks and latencies at least.

The Weather Company

Robbie Strickland’s talk was one of my favourites. I found it really dense and intense. Lots of notes here.

The Weather Company Lambda Architecture

First of all, check out Robbie’s book, Cassandra High Availability.

Be careful when using DTCS. It is dangerous.

Compactions in Cassandra have to be deeply understood, otherwise they will simply bite you sooner or later.

Process and filter your events at ingestion time.

That makes sense. Instead of having unstructured data, parse and process your data before ingesting it and that way you’ll be able to filter invalid data if necessary. If you don’t do it, you’ll have to parse and process every time you read, which is definitely worse.
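A minimal sketch of what parsing and filtering at ingestion time could look like. The JSON input format and the field names (sensor_id, timestamp, value) are assumptions made for the example, not something taken from the talk.

```python
import json


def parse_event(raw):
    """Return a structured event dict, or None if the raw line is invalid."""
    try:
        data = json.loads(raw)
    except ValueError:
        return None
    if not isinstance(data, dict):
        return None
    try:
        # Hypothetical schema: coerce every field once, at ingestion time.
        return {
            "sensor_id": str(data["sensor_id"]),
            "timestamp": int(data["timestamp"]),
            "value": float(data["value"]),
        }
    except (KeyError, TypeError, ValueError):
        return None


def ingest(raw_lines):
    """Parse every line once, keeping only the valid events."""
    parsed = (parse_event(line) for line in raw_lines)
    return [event for event in parsed if event is not None]


raw = ['{"sensor_id": "a", "timestamp": 1, "value": "2.5"}', "not json"]
print(ingest(raw))
```

Only the structured, validated events would then be written to Cassandra, so readers never have to re-parse or re-validate them.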

And speaking particularly on his lambda architecture (in the picture), a couple of ideas:

  • Back up data daily from C* to S3 + Parquet. C* writes quickly but is more expensive than S3 for long term storage. Compute analytics at C* and dump when data is not likely to be required anymore.
  • Beware of versions: use Spark only with Cassandra >= 2.1.8
  • Using secondary indexes to help Spark read data is a good practice, as long as index cardinality is kept low (<= 50k)
  • Beware of wide rows: it only takes one to get you in trouble. Use nodetool toppartitions or nodetool cfstats Max row bytes
  • Make your data visualisable ASAP: Zeppelin project

Tech Notes. Day 2


Christos Kalantzis and his colleagues’ talk about how they survived the AWS re:boot also yielded lots of notes:

  • Run periodic checks on each node’s health, using, for example, Jenkins.
  • They use internal products Atlas and Priam for managing and monitoring the cluster.
  • Have idempotent processes.
  • Retry with exponential back-off.
  • Collect data/stats on your clusters to predict failures.
  • No ops team. Everyone acts as devops and gets on-call.
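The "idempotent processes" and "retry with exponential back-off" points go hand in hand: retrying is only safe when repeating the operation does no harm. Here is a minimal sketch of the retry part; the function name and parameters are made up, and production code would usually catch narrower exceptions and often add some random jitter to the delays.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation until it succeeds, doubling the wait after each failure.

    Only safe if operation is idempotent, since it may run several times.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Example: an operation that fails twice before succeeding.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


print(retry_with_backoff(flaky, sleep=lambda seconds: None))
```

Passing sleep in as a parameter keeps the sketch easy to test without actually waiting.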

Troubleshooting with Cassandra

This talk from a DataStax support engineer left several notes as well:

  • If cache is full but the hit rate is low, increase cache capacity.
  • If memtableflushwriter:All time blocked jobs is significant, it means there is disk pressure on writes.
  • proxyhistograms command shows the full time to process the request, including the network round-trip from the coordinator to the required replicas. By contrast, the cfhistograms command shows the local latency for a single replica.
  • Use describecluster for schema disagreements.
  • Cassandra logs a warning with the ‘Compacting large partition’ message when it encounters a big partition that you should take care of.

Lessons learnt with Cassandra and Spark

This talk by Christopher Bradford from OpenSourceConnections gave me some architectural notes.

  • Their architecture is composed of Cassandra to store data, Spark for ETL and Solr for search. An example of this setup is in their github account:
  • Build balanced models: spread data/load evenly across all nodes.
  • Use vnodes in small clusters. Single token nodes if you’re big (Apple, Netflix, …)
  • Have a look at Metrics: Dropwizard‘s Java profiler project.

Extreme Cassandra Optimization.

I must admit this talk by Al Tobey was a bit out of reach for me. There were so many advanced optimisation tips that I can only link to his guide and come back to it once I reach the appropriate level: Al’s Cassandra 2.1 tuning guide

Repeatable, Scalable, Reliable, Observable: Cassandra

This talk by Aaron Morton had sooooo many notes on how to monitor Cassandra that at a given point I decided to give up and simply grab the slides later to review them. REALLY GOOD!!!

Case Study: Troubleshooting production issues as a developer.

And that was my talk!! A live coding demo!! I have to massively thank everyone who attended, as it was the last slot and I know everyone was already kind of burnt out on Cassandra. I hope they enjoyed it and found it valuable.

Here you can see a video of the same talk during the Cassandra London Meetup and the slides as well:

And that’s about it!! Hope these notes are as useful for you guys as they are for me, and a HUGE THANKS to everyone who spoke at the Summit for sharing such valuable knowledge.