How to use Apache Kafka in AWS - A quick guide to Amazon MSK

Introduction

Setting up and scaling an Apache Kafka cluster can be a pain, which is why AWS came up with the MSK service back in 2019. Amazon MSK is a fully managed cloud service for Apache Kafka, so developers don't have to worry about the underlying infrastructure. It's a good solution if you want to use Apache Kafka without the operational overhead.

In this post, I want to give you a quick overview of what this service is about, show you how to start using it, and compare it with some of the alternatives so you can make the right decision about which one to use.

What you will learn:

  • What is AWS MSK and why would you want to use it
  • How to create an Apache Kafka cluster with it
  • What MSK Serverless is about
  • Some hints on how to migrate your current infrastructure to AWS MSK
  • How it compares with Amazon Kinesis or Confluent Cloud

    What is MSK in AWS?

    MSK stands for Managed Streaming for Apache Kafka. It's the service Amazon offers so that you don't have to self-manage your Apache Kafka clusters and spend a lot of time securing, scaling, and patching them and keeping them available. The same goes for Apache ZooKeeper, which Apache Kafka depends on for cluster coordination.

    As Apache Kafka gets more popular, this service is also getting more attention, and it now offers three main tools:

      Basic MSK
      MSK Serverless
      MSK Connect

    I'll cover each of these next.

    Why would you want to use Amazon MSK?

    Some of the key features of Amazon MSK:

  • You won't have to manage the servers: It's a managed service that runs your Apache Kafka brokers and Apache ZooKeeper nodes without you having to provision EC2 instances.
  • High availability: If a node fails, Amazon MSK will automatically replace it without downtime for your apps. This means you don't have to worry about starting, stopping, or directly accessing the nodes yourself.
  • Really scalable: You can add brokers, change broker sizes, and add more storage on the fly. With the serverless option, capacity scales automatically.
  • Observability: You can monitor logs and metrics via Amazon CloudWatch, or export JMX metrics to Prometheus with Open Monitoring.
  • Automatic updates: It also automatically deploys software patches as needed. There is one simple API to upgrade your Apache Kafka cluster with no downtime.
  • Apache Kafka security: MSK supports SSL/TLS-based security and SASL/SCRAM authentication, and makes both easy to configure.
  • Deeply integrated: One of the great advantages of using AWS is the number of integrations we have available out of the box (including AWS IAM).

    How does Amazon MSK work?

    When you first enter the Amazon MSK console, you will find two main menu sections: MSK Clusters and MSK Connect.

    Overview of the Amazon MSK service

    MSK Cluster

    Clusters. You can get started here. When creating a new cluster, we have two main options: "Quick create" and "Custom create". The first comes with lots of defaults that follow best practices; the second allows more customization.

    After that, the second decision we need to make is whether we want to provision servers for the brokers ourselves or go serverless and let Amazon handle this for us.

    When you create a new cluster, it will exist within a special VPC managed by Amazon MSK, but you can select the Availability Zones (AZs) it is deployed in so that it sits close to your servers, which keeps latency low.
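
    To make the provisioned option more concrete, here is a minimal sketch of the request you could send through the AWS SDK instead of the console. It only builds the parameters; the field names follow the MSK CreateCluster API as exposed by boto3, while the function name, the Kafka version, the instance type, the volume size, and all IDs are assumptions I picked for illustration.

```python
from typing import Any


def build_provisioned_cluster_request(
    cluster_name: str,
    subnet_ids: list[str],
    security_group_ids: list[str],
    kafka_version: str = "3.5.1",           # assumed version, check what MSK currently offers
    instance_type: str = "kafka.m5.large",  # assumed broker size
    volume_size_gib: int = 100,             # assumed EBS volume size per broker
) -> dict[str, Any]:
    """Build keyword arguments for the MSK CreateCluster call (provisioned brokers)."""
    if len(subnet_ids) < 2:
        raise ValueError("provide at least two subnets, each in a different AZ")
    return {
        "ClusterName": cluster_name,
        "KafkaVersion": kafka_version,
        # This sketch places one broker in each subnet (one subnet per AZ).
        "NumberOfBrokerNodes": len(subnet_ids),
        "BrokerNodeGroupInfo": {
            "InstanceType": instance_type,
            "ClientSubnets": subnet_ids,
            "SecurityGroups": security_group_ids,
            "StorageInfo": {"EbsStorageInfo": {"VolumeSize": volume_size_gib}},
        },
    }


# Placeholder IDs; replace them with subnets and a security group from your own VPC.
request = build_provisioned_cluster_request(
    "demo-cluster", ["subnet-aaa", "subnet-bbb", "subnet-ccc"], ["sg-123"]
)
# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   boto3.client("kafka").create_cluster(**request)
```

    The subnets you pass in are how you choose the AZs: each subnet lives in exactly one AZ, so picking subnets next to your application servers is what places the brokers near them.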

    Cluster configuration. When you create a new cluster, it comes with a default configuration that follows best practices, but if you want to change it, you can do so in this section.

    Some of these properties affect a broker and others relate to the ZooKeeper cluster. You can find the complete list of properties here.
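
    A custom configuration is essentially a small properties file. As a rough sketch, the helper below renders a dictionary of overrides into that format. The three property names are regular Apache Kafka broker settings, but the values and the helper name are only assumptions for the example, not recommendations.

```python
def render_server_properties(properties: dict[str, object]) -> bytes:
    """Render broker overrides as key=value lines, the format MSK configurations use."""
    lines = []
    for key in sorted(properties):
        value = properties[key]
        # Kafka expects lowercase booleans, not Python's True/False.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}={value}")
    return ("\n".join(lines) + "\n").encode("utf-8")


# Example overrides (illustrative values only).
overrides = {
    "auto.create.topics.enable": False,
    "default.replication.factor": 3,
    "min.insync.replicas": 2,
}
payload = render_server_properties(overrides)
print(payload.decode("utf-8"))
# With boto3, the bytes could be passed as ServerProperties:
#   boto3.client("kafka").create_configuration(Name="demo-config", ServerProperties=payload)
```

    Keeping the overrides in a dictionary makes it easy to review them in code before they are applied to a cluster.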

    Does AWS MSK support Kafka Connect? - MSK Connect

    This feature makes it easy for you to deploy connectors that move data between Apache Kafka clusters and external systems such as databases, file systems, and search indices.

    MSK Connect is fully compatible with Kafka Connect, so you can run your applications with no changes.
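
    Since the configuration format is the same as in Kafka Connect, a connector is described by a plain map of string keys to string values. The sketch below normalizes such a map; `connector.class`, `tasks.max`, and `topics` are standard Kafka Connect keys, while the connector class, topic name, file path, and helper name are placeholders I chose for the example.

```python
def to_connector_configuration(settings: dict[str, object]) -> dict[str, str]:
    """Convert every value to a string, since connector configurations are string maps."""
    config = {}
    for key, value in settings.items():
        if isinstance(value, bool):
            config[key] = "true" if value else "false"
        else:
            config[key] = str(value)
    return config


connector_config = to_connector_configuration({
    # Placeholder connector; use the class shipped in your own connector plugin.
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": 1,
    "topics": "orders",         # hypothetical topic name
    "file": "/tmp/orders.txt",  # hypothetical output path
})
print(connector_config)
```

    A map like this is what you would supply as the connector configuration when creating a connector in MSK Connect, alongside the plugin, capacity, and cluster details.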

    How to connect to Amazon MSK from your local machine?

    Once you have your cluster set up, you may wonder how to connect to it from your local machine. You have a couple of options here:

  • The easiest option, and the one you should definitely not use in production, is to create a custom cluster open to the internet.
  • Another option is to use a bastion host to proxy traffic from your localhost to your MSK cluster.
  • You can also set up the Kafka REST Proxy, open-sourced by Confluent, to access the MSK cluster from the outside world via a REST API. It is not a full-fledged Apache Kafka client and doesn't allow all operations, but it's good enough.

    NOTE: You can easily test the connection using a free program like KafkaIDE.

    An example of how easy it is to use KafkaIDE
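
    Whichever option you pick, a quick sanity check is to see whether the broker ports are reachable at all from your machine. Here is a small sketch that uses only the Python standard library: it splits a bootstrap string into host and port pairs and tries a plain TCP connection to each. The host names are placeholders, and a successful TCP connection only tells you the network path is open, not that authentication will work.

```python
import socket


def parse_bootstrap_servers(bootstrap: str) -> list[tuple[str, int]]:
    """Split 'host1:port,host2:port' into a list of (host, port) tuples."""
    endpoints = []
    for entry in bootstrap.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        endpoints.append((host, int(port)))
    return endpoints


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder brokers; copy the real bootstrap string from your cluster's client information.
brokers = parse_bootstrap_servers("b-1.example.internal:9092,b-2.example.internal:9092")
print(brokers)
```

    To run the check, call `is_reachable(host, port)` for each pair in `brokers`. If every broker comes back unreachable, the problem is usually in security groups, routing, or the tunnel rather than in Kafka itself.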

    Amazon MSK networking

    When you create an Apache Kafka cluster with MSK, it’s deployed into a managed VPC with brokers in private subnets (one per Availability Zone you specify when creating the cluster). Amazon MSK also creates the Apache ZooKeeper nodes in the same private subnets.

    The brokers in the cluster are made accessible to your VPC through elastic network interfaces (ENIs) that appear in your account.

    The magic of AWS MSK Serverless

    When using AWS MSK with provisioned brokers, you can run into some problems:

  • Your demand can vary a lot, which means that you will either have unused capacity when demand is low or you will reach the capacity limit when it's high.
  • You will need to rebalance partitions when scaling your cluster up or down.

    MSK Serverless solves these problems. As I mentioned earlier, you can select the serverless option when creating a new cluster.
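
    To show how little there is to size with the serverless option, here is a minimal sketch of the parameters for the CreateClusterV2 API (`create_cluster_v2` in boto3). The structure follows that API, while the function name and IDs are placeholders. Notice that the request contains no broker count, instance type, or storage size.

```python
from typing import Any


def build_serverless_cluster_request(
    cluster_name: str,
    subnet_ids: list[str],
    security_group_ids: list[str],
) -> dict[str, Any]:
    """Build keyword arguments for the MSK CreateClusterV2 call (serverless)."""
    return {
        "ClusterName": cluster_name,
        "Serverless": {
            # Only networking is specified; capacity is handled by the service.
            "VpcConfigs": [
                {"SubnetIds": subnet_ids, "SecurityGroupIds": security_group_ids}
            ],
            # This sketch enables IAM access control for clients.
            "ClientAuthentication": {"Sasl": {"Iam": {"Enabled": True}}},
        },
    }


# Placeholder IDs; replace them with values from your own VPC.
serverless_request = build_serverless_cluster_request(
    "demo-serverless", ["subnet-aaa", "subnet-bbb"], ["sg-123"]
)
# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   boto3.client("kafka").create_cluster_v2(**serverless_request)
```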

    Amazon MSK alternatives within AWS: Amazon Kinesis

    We have been talking about MSK, but another popular Amazon managed streaming platform is Amazon Kinesis. The two serve similar use cases; if you want to dig deeper into their differences, you can find a fantastic comparison here by Noel Anson.

    Amazon MSK alternatives outside AWS: Confluent Cloud

    Of course, there are also fully managed services outside AWS, and one of them is Confluent Cloud. Some points to consider when deciding whether Confluent Cloud may work better for you:

  • Confluent is a company dedicated almost exclusively to Apache Kafka, which makes it a go-to option.
  • They take care of all the details.
  • They have a dedicated CLI to communicate with your clusters.
  • You can find a good overview of the platform here.

    Is it easy to migrate to Amazon MSK?

    We are now near the end of the post, and if you have decided to give this service a try, you may wonder how you can migrate your existing cluster over to AWS.

    The process can be a bit complicated, but luckily there is this awesome post by Sandeep Mehta that guides you step by step.

    Conclusion

    Amazon MSK is a very good solution for running Apache Kafka on AWS. As we discussed, it's fully managed, so it can save you a lot of time that would otherwise go into managing your Apache Kafka cluster.