Why the site went down (or how not to send messages using Camel)

Recently, we went down because of an issue that was hard to find.

The last release had gone out the day before, so we didn’t immediately suspect the outage was release-related.

However, it was.
This is some new code that was in the release:

class UserActionNotifier(producerTemplate: org.apache.camel.ProducerTemplate, serialization: Serialization) {
  private[this] val userActionsEndpointUri: String = "activemq:topic:VirtualTopic.UserActions"

  def notifyOfUserAction(userId: Int, userActionType: String): Unit = {
    val ua = new UserAction(s"u$userId", 1, userActionType, new DateTime().getMillis)
    // The endpoint is passed as a String URI, so Camel has to look it up on every send
    producerTemplate.sendBody(userActionsEndpointUri, serialization.serialize(ua))
  }
}

Looks pretty innocuous, right?

Wrong.

What happens here is that we’re sending a message to a Virtual Topic in ActiveMQ.
When you send a message via a Camel ProducerTemplate and pass in a String for the endpoint, Camel does a lookup to get the actual endpoint to send to:

    protected Endpoint addEndpointToRegistry(String uri, Endpoint endpoint) {
        ObjectHelper.notEmpty(uri, "uri");
        ObjectHelper.notNull(endpoint, "endpoint");

        // if there is endpoint strategies, then use the endpoints they return
        // as this allows to intercept endpoints etc.
        for (EndpointStrategy strategy : endpointStrategies) {
            endpoint = strategy.registerEndpoint(uri, endpoint);
        }
        endpoints.put(getEndpointKey(uri, endpoint), endpoint);
        return endpoint;
    }

    protected EndpointKey getEndpointKey(String uri, Endpoint endpoint) {
        if (endpoint != null && !endpoint.isSingleton()) {
            int counter = endpointKeyCounter.incrementAndGet();
            return new EndpointKey(uri + ":" + counter);
        } else {
            return new EndpointKey(uri);
        }
    }

Source here and here (we are using Apache Camel 2.9.2).
As you can see, on line 15 of the snippet (the if statement in getEndpointKey), Camel checks whether the Endpoint is a singleton. A Virtual Topic in ActiveMQ is not configured as a singleton, since its intent is to have multiple consumers.
However, when you’re sending messages, this goes wrong.

For each UserAction message that the service sends, a lookup is done to get the Endpoint and then a new EndpointKey is returned (with an incremented counter), which is added to the endpoints map (line 10 of the snippet, the endpoints.put call in addEndpointToRegistry). This leads to an ever-increasing map, which leads to the service blowing up and the site going down.
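
To make the growth concrete, here is a minimal Java sketch of the mechanism. It is a simplified model and not the Camel classes themselves: the class EndpointLeakDemo, its method names, and the assumption that the lookup uses the plain URI as key are all ours, for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the registry behaviour described above.
// These are NOT Camel classes; an endpoint is reduced to a plain Object.
class EndpointLeakDemo {
    private final Map<String, Object> endpoints = new HashMap<>();
    private final AtomicInteger endpointKeyCounter = new AtomicInteger();
    private final boolean singleton;

    EndpointLeakDemo(boolean singleton) {
        this.singleton = singleton;
    }

    // Same idea as getEndpointKey: non-singletons get a unique key per registration.
    private String registrationKey(String uri) {
        return singleton ? uri : uri + ":" + endpointKeyCounter.incrementAndGet();
    }

    // Models a send that passes the endpoint as a String URI.
    void sendByUri(String uri) {
        Object endpoint = endpoints.get(uri);   // assumed: lookup by the plain URI
        if (endpoint == null) {                 // never found for non-singletons
            endpoints.put(registrationKey(uri), new Object());
        }
    }

    int registrySize() {
        return endpoints.size();
    }

    // Sends the given number of messages and reports how many entries were registered.
    static int sizeAfterSends(boolean singleton, int sends) {
        EndpointLeakDemo registry = new EndpointLeakDemo(singleton);
        for (int i = 0; i < sends; i++) {
            registry.sendByUri("activemq:topic:VirtualTopic.UserActions");
        }
        return registry.registrySize();
    }

    public static void main(String[] args) {
        System.out.println("singleton:     " + sizeAfterSends(true, 1000));  // prints 1
        System.out.println("non-singleton: " + sizeAfterSends(false, 1000)); // prints 1000
    }
}
```

In this model a singleton endpoint is registered once and found again on every later send, while a non-singleton endpoint adds one map entry per message.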

The following screenshots from Eclipse Memory Analyzer show the problem clearly:

[Screenshot: Geert1]

[Screenshot: Geert2]

In order to prevent this in the future, send messages using the actual Endpoint (instead of the endpoint name), like so:

class UserActionNotifier(producerTemplate: org.apache.camel.ProducerTemplate, context: org.apache.camel.CamelContext, serialization: Serialization) {

  private[this] val userActionsEndpoint: Endpoint = context.getEndpoint("activemq:topic:VirtualTopic.UserActions")

  def notifyOfUserAction(userId: Int, userActionType: String): Unit = {
    val ua = new UserAction(s"u$userId", 1, userActionType, new DateTime().getMillis)
    producerTemplate.sendBody(userActionsEndpoint, serialization.serialize(ua))
  }

}
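
A leak like this can also be caught in a test before it reaches production. The sketch below is one possible approach and not something we had in place: a small helper that runs an action many times and reports how much some observed size grew. The class RegistryGrowthCheck and its method are hypothetical names. With Camel, the action could be a call to notifyOfUserAction and the size could come from the number of endpoints registered in the CamelContext (for example via getEndpoints().size()); that wiring is an assumption and is left out so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

// Hypothetical helper: measures how much an observed collection grows
// while an action is repeated. Not part of Camel.
class RegistryGrowthCheck {

    // Runs the action `iterations` times and returns the change in observed size.
    static int growthAfter(int iterations, Runnable action, IntSupplier size) {
        int before = size.getAsInt();
        for (int i = 0; i < iterations; i++) {
            action.run();
        }
        return size.getAsInt() - before;
    }

    public static void main(String[] args) {
        List<String> registry = new ArrayList<>();

        // Leaky action: every call adds an entry, like our send by String URI did.
        System.out.println(growthAfter(100, () -> registry.add("entry"), registry::size)); // prints 100

        // Well-behaved action: the observed size stays the same.
        System.out.println(growthAfter(100, () -> { }, registry::size)); // prints 0
    }
}
```

A test would then assert that the growth after a few hundred sends is zero, or at least bounded by a small constant.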

DevOps at Marktplaats

The journey

How did we get from this:

[Image: old-marktplaats]

...to this?

[Image: new-marktplaats]

TL;DR

We deploy to production multiple times per week, run tests after every commit, and push only to master instead of working on branches. You should do it too: it’s awesome.

[Image: pipeline-dashboard-edited]

The DevOps story

When we started on the big Migration project, there was not a lot of communication between developers and system administrators (SiteOps). It kind of looked like this:

[Image: wall-of-confusion]

It didn’t work: there was a lot of confusion and miscommunication.

The next step was to put system administrators in the development teams. The developers really liked it, because they could ask (stupid) questions whenever they wanted.

How did the SiteOps guys like it, you ask?

Not so much.

 

Where we are now:

We have a Sysadmin Support team which does releases and picks up the smaller tasks like environment configuration changes.

Next to this, there is a Sysadmin Solutions team for longer-running, more architectural issues (e.g. configuring a Hadoop cluster).

 

As developers, we are pretty happy with the current situation, since the tickets we create are being picked up quickly.

 


How to release?

A big question in the DevOps movement is: how can we make it easier to do a release?

 

Here’s how we solved it:

  1. We have a service-oriented architecture.
  2. Developers push code to different services, to the master branch.
  3. The different services each have their own git repo, which is polled by Jenkins.
  4. Jenkins compiles, runs tests and deploys to the test environment (called ‘integration’).
  5. Acceptance tests (e.g. Selenium tests) run against the full test environment.
  6. When these tests are green, a deployable tarball is created (with a specific version)*.
  7. The tarball is deployed to the next environment (called ‘demo’).
  8. On demo, the integration with other platforms is checked. This is also where QA manually checks the stories that have just been implemented.
  9. The (versioned) tarball is deployed to the Load & Performance test environment.
  10. Load & Performance tests are run against the L&P environment.
  11. If all is green, it can be deployed to the production environment (this is a manual step).
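
The promotion logic in the steps above can be sketched in a few lines of Java. This is an illustration under assumptions and not our actual Jenkins setup: the class PromotionPipelineDemo, the Environment enum and the promote method are hypothetical names, and the checks for each environment are reduced to a single yes/no predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of the release flow: a build moves to the next
// environment only while the checks on the current one are green.
class PromotionPipelineDemo {
    enum Environment { INTEGRATION, DEMO, LOAD_AND_PERFORMANCE, PRODUCTION }

    // Returns the environments the build was deployed to, in order.
    // Production is only reached when all earlier checks are green
    // AND someone approves the release manually.
    static List<Environment> promote(Predicate<Environment> checksGreen, boolean manualApproval) {
        List<Environment> deployed = new ArrayList<>();
        for (Environment env : Environment.values()) {
            if (env == Environment.PRODUCTION) {
                if (manualApproval) {
                    deployed.add(env);
                }
                break;
            }
            deployed.add(env);
            if (!checksGreen.test(env)) {
                break; // a red environment stops the promotion
            }
        }
        return deployed;
    }

    public static void main(String[] args) {
        // All checks green, release approved: reaches all four environments.
        System.out.println(promote(env -> true, true));
        // Checks fail on demo: the build stops there.
        System.out.println(promote(env -> env != Environment.DEMO, true));
    }
}
```

The point of the model is that every environment acts as a gate, and that the last gate is a human decision, not an automated one.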

 

* This is how the tarball is created:

  1. All services deployed to the test environment are checked for their version.
  2. Such a version consists of a git commit hash and a timestamp.
  3. The services are then copied from the test environment and packaged into a tarball.
  4. The tarball is versioned using a timestamp.
  5. The tarball is archived for later use.
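
As an illustration of the versioning scheme, here is a small Java sketch. The exact formats are assumptions: the post only states that a service version consists of a git commit hash and a timestamp, and that the tarball is versioned by timestamp. The class ReleaseManifestDemo, the separator characters, the release- prefix and the manifest layout are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical naming scheme for service versions and release tarballs.
class ReleaseManifestDemo {
    // Assumed timestamp layout, in UTC so names sort chronologically.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    // A service version: git commit hash plus build timestamp.
    static String serviceVersion(String commitHash, Instant builtAt) {
        return commitHash + "-" + STAMP.format(builtAt);
    }

    // The tarball itself is versioned by timestamp only.
    static String tarballName(Instant packagedAt) {
        return "release-" + STAMP.format(packagedAt) + ".tar.gz";
    }

    // One line per service, sorted by service name, recording what went into the tarball.
    static String manifest(Map<String, String> serviceVersions) {
        StringBuilder lines = new StringBuilder();
        new TreeMap<>(serviceVersions).forEach(
                (service, version) -> lines.append(service).append('=').append(version).append('\n'));
        return lines.toString();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2013-05-01T12:30:00Z");
        System.out.println(serviceVersion("a1b2c3d", now)); // prints a1b2c3d-20130501123000
        System.out.println(tarballName(now));               // prints release-20130501123000.tar.gz
    }
}
```

Recording the per-service versions next to the tarball makes it possible to tell later exactly which commits a given release contained.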

 

This is our overview screen, to easily see what version of the application is deployed on which environment:

[Image: pipeline]


Doing DevOps as a developer

Doing DevOps will not work if developers don’t change. Luckily, we did change.

These are some things that we’re doing differently than before:

Devs make Puppet changes (reviewed through Gerrit).

Our local development environment is managed by Puppet and configured to be as much like production as possible (using Boxen, Vagrant and Homebrew, to name a few).

#1 Tip

Tip #1 for a developer to become more DevOps-savvy: set up a new environment. Some benefits include, but are not limited to:

  • direct communication and tight integration with sysadmins
  • learning the conventions and consistencies in naming
  • improved knowledge of the communication flows within the application (hosts, ports, proxies, DB access, access to the messaging system, firewalls, etc.)
  • more recognition (and maybe beer) from your SiteOps colleagues

 

 

Sources

devopsweekly.com

“Release It!” by Michael Nygard

news.ycombinator.com

Awesome colleagues 😉