Failure recovery, total loss of a broker

  • Failure recovery, total loss of a broker


    I'm testing the AMQP/Rabbit code, specifically in failover and recovery situations. My test client uses a MessageListenerContainer and a POJO to receive messages which I send to the broker cluster from a different client or host. My rabbit cluster (3 machines) is behind a load balancer that detects if a broker is up or down using a port check every 10 seconds. The cluster and LB work fine in normal situations.

    If my Spring client detects a shutdown of the broker it is connected to, then it logs the shutdown exception, reconnects (back through the same VIP on the LB) to a different cluster member and continues to receive messages - this happens very quickly.

    However, if the connection between client and cluster is simply severed (by removing a network cable, or by powering off the broker machine without any server or OS shutdown), then the client doesn't reconnect at all and simply stops receiving messages. It never seems to detect the loss of connectivity, even after I restart the broker or restore the network link. Only if I kill the client app and restart it will it obtain a new connection to the cluster and receive all of the messages that had built up in the cluster.

    Is this a limitation of the AMQP or Rabbit Spring code, or of the broker, or is it something that can be configured on the client side?

  • #2
    This is a classic problem with TCP connections.

    You can enable heartbeats on the underlying Rabbit ConnectionFactory...
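
    To illustrate why heartbeats help here: when a cable is pulled or a machine loses power, no FIN or RST ever reaches the client, so a socket that is only waiting to read sees nothing wrong. A heartbeat turns "silence" into a detectable condition. In the Rabbit Java client the interval is requested with `ConnectionFactory.setRequestedHeartbeat(int)` (in seconds). The sketch below is not the client's implementation; `HeartbeatWatchdog` and its `missedBeforeDead` parameter are hypothetical names used only to show the idea in plain Java.

```java
import java.time.Duration;

/**
 * Hypothetical illustration of heartbeat-based dead-peer detection.
 * Not part of the RabbitMQ client or spring-amqp.
 */
public class HeartbeatWatchdog {
    private final long timeoutNanos;
    private long lastSeenNanos;

    /**
     * @param interval         expected time between frames from the peer
     * @param missedBeforeDead how many silent intervals to tolerate (assumed value, tune as needed)
     * @param nowNanos         current time, e.g. System.nanoTime()
     */
    public HeartbeatWatchdog(Duration interval, int missedBeforeDead, long nowNanos) {
        this.timeoutNanos = interval.toNanos() * missedBeforeDead;
        this.lastSeenNanos = nowNanos;
    }

    /** Call whenever any frame (data or heartbeat) arrives from the peer. */
    public void frameReceived(long nowNanos) {
        lastSeenNanos = nowNanos;
    }

    /** True once the peer has been silent for longer than the allowed window. */
    public boolean peerPresumedDead(long nowNanos) {
        return nowNanos - lastSeenNanos > timeoutNanos;
    }
}
```

    A caller would poll `peerPresumedDead(System.nanoTime())` on a timer and close the connection when it returns true, which is what lets a reconnect through the load balancer happen at all.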


    • #3
      Thanks Gary. I tried that and I get very erratic results with it; certainly nothing that would give me any confidence about using it in a production system. The heartbeat errors show up after the configured heartbeat interval, and the consumer then reconnects to a different server via the LB. But then the consumer begins to receive messages at a much slower rate than they are being produced, and only receives every second message. This is with a producer that has a stable, unbroken connection to the cluster and is sending one message every 250 ms in my test.

      Only after the failed server machine is restored to the cluster does the consumer catch up with both the messages that appeared to go missing, and the production rate. Very odd, but repeatable every time I run the tests and even with a completely rebuilt cluster.

      Last edited by davison; Oct 27th, 2012, 02:54 PM.


      • #4
        Hmmm... can you share a debug (or preferably TRACE) level log for the consumer showing good->erratic->good ??

        Also, given that spring-amqp is a thin layer on top of the rabbit client, this is something you might want to bring up on the rabbit list, but I'd be happy to take a quick look at a spring-amqp log if you like.