Ruby closures, how are they possible?

This talk is the result of my curiosity. I like understanding things in detail and programming closures had me scratching my head for a while until I decided to investigate it, came across the great Ruby under a microscope book and voilà, everything makes sense now!

Here you can see the slides and the video from my closures talk at madrid.rb


Scalable Data Modelling by example – Cassandra Summit ’16

Two facts are the motivations for this talk:

  • First one is that the model cannot be changed once it is in production, well, you can, but by migrating data away to the new model using external tools such as Spark.
  • The second one is that one of the most common pain points in Cassandra deployments out there is actually performance issues caused by bad data models.

So in order to provide a Cassandra overview and a ‘checklist’ to follow when data modelling here you have my talk.

And the slides are here

Just as a final comment I’d like to remark that it was my bad that, during the talk, I forgot to acknowledge DataStax for all the amazing work they’re doing and because most of the content of this talk is taken from their Academy website. So once again, thanks DataStax for all your efforts, for such an amazing Summit and for all your contributions into the Cassandra Open Source project.

Scalable Data Modelling By Example – Cassandra London Meetup

This time I spoke at the Cassandra London Meetup and had the chance to share the stage with the amazing Patrick McFadin!!

My talk is about those concepts on top of which Cassandra lies that definitely make a difference in how we have to model our data. Theoretically reviewing those concepts and do some data modelling by example.

Here you have the slides

You can see the video here:

Cassandra instantaneous in place node replacement

At some point everyone using Cassandra faces the situation of having to replace nodes. Either because the cluster needs to scale and some nodes are too small or because a node has failed or because your virtualisation provider is going to remove it.

Whichever the reason the situation we face is that a new node has to be streamed in, to hold the exact same data the old one had. Why do we need to wait for a whole streaming process, with the network and CPU overhead this requires when we could just copy the data into the new node and have it join the ring replacing the old one?

That’s what we, at MyDrive have been doing for a while and we want to share the exact process we follow with the community shall it help someone.


They main idea behind this process is to have the replacement node up and running as quick as possible by cutting down the process where it takes longer, streaming data.

The key points of the process are:

  1. Data will be copied from the old node to the new one using an external volume instead of transmitting it through the network.
  2. The new node will be given the exact same configuration as the replaced one. Therefore, the replacement node will be responsible for the same tokens as the replaced one, and will also have the same Host-ID, so, when it joins the ring, the other nodes won’t even notice the difference!

All our infrastructure is in AWS, therefore, we used EBS volumes to backup and restore cassandra data. You may use a different data transfer method which suits you better in your infrastructure.


  1. Setup the new node, paying special attention to the following configuration parameters:
    1. listen_address
    2. rpc_address
    3. seeds
  2. Create the external volume you’re going to use to transfer the data from the old node to the new one.
  3. Rsync data and commitlog directories to the external volume
    1. Mount the external volume into the old node in /mnt/backup
    2. Copy the Cassandra data directory into the volume: rsync -av --progress --delete /var/lib/cassandra/data /mnt/backup/data
    3. Copy the Cassandra commitlog directory into the volume: rsync -av --progress --delete /var/lib/cassandra/commitlog /mnt/backup/commitlog
    4. Unmount and disconnect the volume. Connect and mount it into the replacement node.
    5. Copy the Cassandra data directory: rsync -av --progress --delete /mnt/backup/data /var/lib/cassandra/data
    6. Copy the Cassandra commitlog: rsync -av --progress --delete /mnt/backup/commitlog /var/lib/cassandra/commitlog
  4. Drain the old node: nodetool drain
  5. Stop Cassandra in the old node: sudo service cassandra stop
    1. And make sure it doesn’t accidentally come back (i.e. if you’re running chef, supervisor or any other tool that may restart it automatically). This is EXTREMELY important, as if the replacement node tries to join the ring when the old one is alive, the new host will be assigned a new Host ID and the ring will be rebalanced as if we were adding a new node instead of replacing one.
  6. Do a final rsync. This one is to catch any last changes. (Repeat all steps from step 3)
  7. Ensure Cassandra data and commitlog folders are owned by the cassandra user (rsync copies the owner’s UID along with the data and that UID may not be the appropriate in the new machine).
  8. Start the new node. sudo service cassandra start
  9. Check that everything is working properly:
    1. In the replacement’s logs you should see a message like this: WARN <time> Not updating host ID <host ID> for /<replaced node IP address> because it's mine Indicating that the new node is replacing the old one.
    2. In the replacement’s logs you should also see one message like the following per token: INFO <time> Nodes /<old IP address> and /<new IP address> have the same token <a token>. Ignoring /<old IP address> Indicating that the new node is becoming primary owner of the replaced’s tokens.
    3. In the other nodes’ logs you should see a message like: Host ID collision for <Host ID> between /<replaced IP address> and /<replacement IP address>; /<replacement IP address> is the new owner Indicating that the other nodes acknowledge the change
    4. nodetool [status] should show the new node’s IP owning the replaced Host ID and the old one shouldn’t appear anymore.
    5. Everything should look normal.
  10. Update other nodes’ seeds list if the replaced node was a seed one.
  11. You can now safely destroy you old machine.

And voilà! By following these steps carefully you will be able to replace nodes and have them running quickly, avoiding any tokens movement or streaming.

Codemotion 2015 – Cassandra for impatients

This talk tries to provide with the basic but fundamental concepts required to build scalable Cassandra data models. I think that we, technical people, are impatient, and that sometimes may lead to errors, usually acceptable on the other hand, but in this case, when we’re dealing with Cassandra models, the cost in detecting why models are not scaling and having to modify them live will always be very expensive and painful. Basically, you don’t want to build and deploy models that don’t scale.

In order to get those concepts, here you have the slides and the video.

Notes on 2015

14264985868265codemesh215logotransparentThis was my first CodeMesh edition. I had been told to attend several times by my former colleague Félix López and it was really good. Probably the conference with highest technical level I’ve ever attended so far and as such I made lots of notes that you can see right now.

Concurrency + Distribution = Availability + Scalability

by Francesco Cesarini

This talk was a really useful step by step recipe to distributed systems with a final summary in the form of a 10 bullet points list:

  1. Split up your system’s functionality into manageable, stand-alone nodes.
  2. Decide what distributed architectural pattern you are going to use.
  3. Decide what network protocols your nodes, node families and clusters will use when communicating with each other.
  4. Define your node interfaces, state and data model.
  5. For every interface function in your nodes, you need to pick a retry strategy.
  6. For all your data and state, pick your sharing strategy across node families, clusters and types, taking into consideration the needs of your retry strategy.
  7. Reiterate through steps 1 to 6 until you have the trade-offs which suit your specification.
  8. Design you cluster blueprint, looking at node ratios for scaling up and down.
  9. Identify where to apply back-pressure and load regulation.
  10. Define your Ops & Management approach, defining system and business alarms, logs and metrics.

You definitely want to have this list close when designing distributed systems, don’t you?

From irrational configuration system to functional data store

by Rob Martin

This talk was an interesting story about how they migrated a highly complicated and inflexible Oracle data store into a functional one based on Tries. With the migration they achieved, apart from a much more simple and manageable system, an impressive read latency speedup (from more than 15 seconds to less than 10 ms). Two take aways from this talk:

Target your response times below 10 ms.

It is always good to have reference values, this is a good one.

Build tools so that your juniors can work on it.

That means:

  • Simplicity
  • You can hire juniors
  • You can develop seniors (not only hire them).

Contemporary approach to data at scale

by Ben Stopford

This talk was a really nice journey through all concepts, problems and available solutions for having data at

Storage engine:

  • Have your indexes in RAM
  • Use indexed batches (Cassandra SSTables)
  • Brute force approach: Columnar. Save one file per row and compress.

Data access:

  • Consistent hashing: Agents know where the data is stored. Very efficient and scalable.
  • Broadcast: Simpler but may cause network overhead when getting bigger.

Architectures (click to enlarge):


Find the slides at

Exercise analysis

by Jan Machacek

This talk was specially interesting as the usage of mobile sensors to classify user’s activity is something we’re working on at MyDrive as well therefore very interesting takeaways for us here:

  • Use OpenCV library for classification.
  • Classify data on the phone.

And two ideas to improve the product that are potentially applicable for us at MyDrive as well:

  • Have a ML model per user.
  • Implement live sessions instead of submitting all the data to the backend at the end.

Beyond Lists: High Performance Data Structures

by Phil Trelford

This talk was a very good and dense one, therefore I’m really looking forward the video to be released, but until then…

Immutability + memory affinity for high speed.

And three new data structures to look at:

Boot my (secure)->(portable) clouds

by Nicki Watt

This talk was really interesting as well because we at MyDrive, as pretty much everyone out there I think, we rely on cloud system providers as Amazon AWS, for our deployments.

The takeaways from this talk are the Cloud computing ASAP principles and three lessons:

  • Automate everything
  • Separate config from code
  • API driven clouds & tools
  • Prefer modular, open source tools

Lesson 1: There’s NO single common cloud API, therefore use Terraform

This is something we are already doing, but I also made a note on checking Packer out. A tool to manage base OS images (we currently rely on Ubuntu’s AMIs)

Lesson 2: Decouple infrastructure and configuration management tools using automated configuration management tools, such as Chef, Ansible, …

Again, here we’re doing good at MyDrive using Chef, but also an interesting note on checking FreeIPA out to manage user accounts was made.

Lesson 3: Don’t roll your own security tools, use Vault to manage secrets and sensitive information.

This is again a tool to be checked out, but we, at MyDrive, are already following the rule using Citadel, a tool to store secrets in Amazon S3.

OpenCredo‘s guys have already compiled this whole talk in their own Boot my secure government cloud blog post. Read recommended!!

Reactive Design Patterns

by Roland Kuhn

This talk was a good review of concepts and patterns from The Reactive Manifesto

  • Responsiveness: The system responds in a timely manner.
  • Elasticity: Capability of scaling both up and down, both to meet requirements and keep costs controlled.
  • Resilience: The system responds in the presence of failures.
  • Message driven: This means that the system is decoupled with a share nothing architecture.

And also some architecture patterns for it:

  • Single Responsibility Principle:
    • A component must do only one thing and do it completely.
    • In order to achieve it, keep splitting responsibilities into nodes until you have single components, but not any further. Avoid trivial components.
  • Circuit breaker Pattern:
    • Protect services by breaking connection to failing components
  • Request – Response Pattern:
    • Include a return address in the message in order to receive the response.
  • Saga Pattern:
    • Divide long-lived and distributed transactions into quick local ones with compensating transactions for failure cases.
    • Partition your data so that every operation remains local.
    • Example: Bank transfer avoiding lock of both accounts:
    • T1: Transfer money from X to a local working account.
    • C1: Compensate failure by transferring money back to X.
    • T2: Transfer money from local working account to Y.

Panel Discussion

Well, this was a very interesting closing event, having the chance to hear people with so many years of experience and knowledge as the panelists was a pleasure. I was specially excited to listen to Tony Hoare talking about his story and a little bit of the future.

The panelists were:

As you can imagine I can only advice you to watch the video, but there were two specially interesting ideas:

  • A very big improvement for computer science would be to fix the so problematic mutable shared state
  • The tools is where the action is:


Fitting IPython Notebooks, Spark and Cassandra all together

Fitting IPython Notebooks, Spark and Cassandra all together

Yesterday after more than a whole week working on this I’ve finally managed to set all this stack up.

This stack is one of the most hot and trending topics nowadays as it is very useful for BigData, specifically for data exploration purposes.

My starting point was a 3 nodes Cassandra cluster, intended for analytics, data exploration and adhoc reporting. This cluster was running
DSE (DataStax Enterprise) 4.6, which ships with Cassandra 2.0. Worth mentioning that the cluster was running in Cassandra only mode.

After having attended to the latest Cassandra Summit ’15, it was made clear not to use
Spark with any Cassandra earlier than 2.1.8, because the integration was buggy. Therefore:

  1. Upgrade DSE to latest (4.8) version, that includes Cassandra 2.1.10.
  2. Next step is to enable Spark on the cluster. This is one of the points where relying on something like DSE comes in handy, as the DSE distribution comes
    with a Spark installation and the Cassandra-Spark connector already configured and optimised for maximum compatbility and throughput. It is also really easy to
    enable Spark on the nodes.

  3. Once I had Cassandra and Spark running on all nodes and one of them was designated as the Spark Master (first of the Cassandra seeds by default, but you can check it with dse client-tool spark master-address), it’s time to install IPython and all its dependencies:
    1. First step is to install all system required packages (use apt-get, yum, etc depending on your OS): sudo apt-get install build-essential libcurl4-openssl-dev libssl-dev zlib1g-dev libpcre3-dev gfortran libblas-dev libblas3gf liblapack3gf liblapack-dev libncurses5-dev libatlas-dev libatlas-base-dev libscalapack-mpi1 libscalapack-pvm1 liblcms-utils python-imaging-doc python-imaging-dbg libamd2.3.1 libjpeg-turbo8 libjpeg8 liblcms1 libumfpack5.6.2 python-imaging libpng12-0 libpng12-dev libfreetype6 libfreetype6-dev libcurl4-gnutls-dev python-pycurl-dbg git-core cython libhdf5-7 libhdf5-serial-dev python-egenix-mxdatetime vim python-numpy python-scipy pandoc openjdk-7-jdk
    2. Next step is to install Python Virtualenv, as IPython depends on Python and changing the system’s Python installation can be dangerous: sudo pip install virtualenv
    3. Then, create a folder for your installation. This folder will contain the virtual environment for the notebooks installation (ipynb in this example). mkdir ipynb and cd ipynb
    4. Create a virtual environment: virtualenv ipython. Where ipython is the name of the virtual environment we’re creating.
    5. To begin using the virtual environment we need to activate it: source ipython/bin/activate. At this point our prompt will indicate that we’re inside the ipython virtual env.
    6. Install all IPython’s dependencies (using pip): pip install uwsgi numpy freetype-py pillow scipy python-dateutil pytz six scikit-learn pandas matplotlib pygments readline nose pexpect cython networkx numexpr tables patsy statsmodels sympy scikit-image theano xlrd xlwt ipython[notebook]
  4. Let’s create a IPython default profile (we’ll not use it, but it’s safe to create it to avoid bugs and strange issues)
    • ./ipython/bin/ipython profile create --profile=default --ipython-dir .ipython
  5. Then we create the pyspark ipython profile we’ll use.
    • ./ipython/bin/ipython profile create --profile=pyspark --ipython-dir .ipython
  6. Now install MathJax extension:
    • python -c "from IPython.external.mathjax import install_mathjax; install_mathjax(replace=True, dest='~/ipynb/ipython/lib/python2.7/site-packages/IPython/html/static/mathjax')"
  7. Now paste the following contents into ~/ipynb/.ipython/profile_pyspark/

      c = get_config()
      # The IP address the notebook server will listen on.
      # If set to '*', will listen on all interfaces.
      c.NotebookApp.ip= '*'
      # Port to host on (e.g. 8888, the default)
      c.NotebookApp.port = 8888
      # Open browser (probably want False)
      c.NotebookApp.open_browser = False
  8. Create now the following file under ~/ipynb/.ipython/profile_pyspark/startup/

      import os
      import sys
      spark_home = os.environ.get('SPARK_HOME', None)
      if not spark_home:
        raise ValueError('SPARK_HOME environment variable is not set')
      sys.path.insert(0, os.path.join(spark_home, 'python'))
      sys.path.insert(0, os.path.join(spark_home, 'python/lib/'))
      execfile(os.path.join(spark_home, 'python/pyspark/'))
  9. Prepare the environment:

    1. export SPARK_HOME=/usr/share/dse/spark
    2. export PYSPARK_SUBMIT_ARGS='--master spark://<spark_master_host>:<spark_master_port> pyspark-shell'
  10. Start your Notebooks server!!:

    • ipython/bin/ipython notebook --profile=pyspark

Now you should be able to navigate to <host_running_notebooks_server>:8888 and and see the WebUI!

Finally, check that everything is working by creating a new notebook and typing and running sc into it. You should see <pyspark.context.SparkContext at 0x7fc70ac8af10>


  1. I’ve followed the steps but when I type and run sc, an empty string is returned.

    • Make sure environment variables defined at step 9 are properly set.
    • Make sure the actually exists under SPARK_HOME/python/lib (Update your version of it on the startup script accordingly).
    • That’s because the startup script (saved at step 8 under …startup/ hasn’t run properly. Try to run it’s contents as a notebook to debug what’s happening.
  2. When running the startup script as a notebook I get: Java gateway process exited before sending the driver its port number

    • You are not running using Java JDK but JRE instead.