Splitting a topic into smaller topics can improve performance and make management easier, especially when several consumers read from the same topic. After the split, each consumer reads only the data that is relevant to it, which reduces load on both the consumers and the network.
Do we force consumers to filter out 80% of the data on a given topic, or force them to subscribe to 7+ topics for a comprehensive dataset?
It's not always easy to determine when a topic has become too large. Several factors play a role, such as the number of messages per second, the size of the messages, and the number of consumers. Persistent latency or throughput problems are a common sign that a topic may have become too large.
More topics are harder to maintain. Users need guidance on which topics to read from. If users need extra fields in just one of the smaller topics, do we also apply the change to the others? Make sure you check out kafkaide.
Splitting also duplicates information across topics. If a message that has been copied to other topics needs to be rewritten, every copy has to be rewritten as well.
The claim that smaller topics are inherently more scalable is false. The number of partitions matters, not the size of the topic. A single large topic with multiple partitions can be more scalable than multiple smaller topics.
Whether you have one topic or several, give each topic at least as many partitions as there are brokers in the cluster. That way partition leaders can be spread evenly among the brokers, which avoids network peaks on individual brokers.
Leading the only partition of a topic that receives 10 MB/s is very different from leading one of several partitions that each receive 1 MB/s.
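To make the difference concrete, here is a minimal sketch that spreads partition leaders over brokers and sums the incoming rate each broker handles as a leader. The round-robin placement, the even spread of traffic over partitions, and the function name `leader_ingress_per_broker` are all assumptions made for illustration; real leader assignment depends on the cluster state.

```python
from collections import defaultdict


def leader_ingress_per_broker(num_brokers, num_partitions, topic_mb_per_s):
    """Return the leader ingress in MB/s for each broker, for one topic.

    Assumes leaders are placed round-robin and producers spread traffic
    evenly over partitions (both are simplifications).
    """
    per_partition = topic_mb_per_s / num_partitions
    ingress = defaultdict(float)
    for partition in range(num_partitions):
        broker = partition % num_brokers  # hypothetical round-robin placement
        ingress[broker] += per_partition
    # Brokers that lead no partition are reported with 0.0 MB/s.
    return [ingress[broker] for broker in range(num_brokers)]


single = leader_ingress_per_broker(num_brokers=10, num_partitions=1, topic_mb_per_s=10)
spread = leader_ingress_per_broker(num_brokers=10, num_partitions=10, topic_mb_per_s=10)

print("single partition, busiest broker:", max(single), "MB/s")
print("ten partitions, busiest broker:  ", max(spread), "MB/s")
```

With one partition, a single broker absorbs the whole stream while the others sit idle. With as many partitions as brokers, the same stream is shared by all of them.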
When consuming data from smaller topics, network traffic is reduced since each consumer is only reading data that is relevant to them. This can improve performance and reduce the load on the network.
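The saving can be estimated with a few lines of arithmetic. The sketch below compares the bytes a consumer fetches from one large topic, where it must filter client-side, with the bytes it fetches from a topic that contains only the events it wants. The event types, sizes, and function names are made up for the example.

```python
# Hypothetical sample of events: (event_type, payload_size_in_bytes).
# Two of ten events are relevant, so 80% would be filtered out.
events = [("order", 500)] * 2 + [("click", 500)] * 8


def bytes_fetched_single_topic(events):
    """The consumer fetches every record, then discards what it does not need."""
    return sum(size for _, size in events)


def bytes_fetched_split_topic(events, wanted_type):
    """Only records routed to the wanted topic cross the network."""
    return sum(size for event_type, size in events if event_type == wanted_type)


fetched_all = bytes_fetched_single_topic(events)
fetched_split = bytes_fetched_split_topic(events, "order")

print("single topic:", fetched_all, "bytes")
print("split topic: ", fetched_split, "bytes")
print("bytes fetched and thrown away:", fetched_all - fetched_split)
```

The calculation ignores compression and batching, so treat it as a rough estimate rather than a measurement.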
If you want to keep the master topic, make sure you have enough storage space in your brokers since all the data will be duplicated. Remember that you can use retention policies to control how long data is kept in each topic.
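A rough way to size the extra storage is to multiply throughput by retention time and replication factor. The sketch below does that for a master topic plus its derived topics. All numbers and the name `retained_gb` are assumptions for illustration, and the estimate ignores compression, indexes, and segment rollover. In Kafka, time-based retention is set per topic with `retention.ms`.

```python
def retained_gb(mb_per_s, retention_hours, replication_factor):
    """Rough upper bound on disk used by a topic under time-based retention."""
    seconds = retention_hours * 3600
    return mb_per_s * seconds * replication_factor / 1024


# Hypothetical setup: the master topic keeps 7 days of data, while the
# derived topics together carry the same stream but keep only 1 day.
master = retained_gb(mb_per_s=10, retention_hours=7 * 24, replication_factor=3)
derived = retained_gb(mb_per_s=10, retention_hours=24, replication_factor=3)

print(f"master topic:   {master:.0f} GB")
print(f"derived topics: {derived:.0f} GB")
print(f"total:          {master + derived:.0f} GB")
```

A shorter retention on the derived topics keeps the cost of duplication well below a full second copy of the master topic.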
Another thing to consider is whether the events in your topic have a relative order that consumers must respect. If so, they should all be on one topic. You might also want to create some convenience topics that help people find the information they need more easily.
For example, if you want to study user flows, it is not a good idea to put different user actions on different topics. A page_visited topic for page visits and a separate button_clicked topic for button clicks would make it hard to read the events in a consistent order.
Kafka guarantees ordering only within a partition, not across topics. If you subscribe to both topics, you may see a button click arrive before the visit to the page that contains the button. Eventual consistency can be your worst enemy here.
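The problem can be illustrated with a small simulation that does not need a Kafka cluster. The events, timestamps, and function names below are invented for the example. The two-topic consumer models one possible interleaving, in which the batch from button_clicked is processed before the batch from page_visited.

```python
# Each event is (timestamp, user_id, event_type). The values are made up.
page_visited = [(1, "u1", "page_visited"), (4, "u1", "page_visited")]
button_clicked = [(2, "u1", "button_clicked"), (5, "u1", "button_clicked")]


def consume_two_topics(first_batch, second_batch):
    """Simulate a consumer that processes one topic's batch before the other's.

    There is no ordering guarantee across topics, so either batch may come first.
    """
    return list(first_batch) + list(second_batch)


def consume_single_topic(*sources):
    """Simulate a single partition to which events were appended in timestamp order."""
    return sorted(event for source in sources for event in source)


def is_in_event_order(events):
    """Return True if the events are processed in non-decreasing timestamp order."""
    timestamps = [timestamp for timestamp, _, _ in events]
    return timestamps == sorted(timestamps)


split_view = consume_two_topics(button_clicked, page_visited)
single_view = consume_single_topic(page_visited, button_clicked)

print("two topics in order:  ", is_in_event_order(split_view))
print("single topic in order:", is_in_event_order(single_view))
```

In the two-topic case the first click is processed before the page visit that preceded it. Keeping both event types on one topic, keyed so that a user's events land on the same partition, avoids having to reorder them in the consumer.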
When it comes to Apache Kafka, splitting a topic into several smaller topics can be a good way to ensure that each consumer is only reading data that is relevant to them, which can improve performance and reduce network load.
If you're seeing latency or performance problems, it's likely that your topic has become too large. The number of partitions is crucial, whether you have one or multiple topics, so make sure you check that first.
However, it's important to consider the trade-offs before making this decision. Smaller topics can be more difficult to maintain and may result in duplicated data.