In my limited Pub/Sub experience, this seems to be how it works. You publish to ...

alexdean · on Nov 20, 2015

I think this sentence [1] helps to explain the difference:

> When you create a subscription, the system establishes a sync point. That is, your subscriber is guaranteed to receive any message published after this point.

[1] https://cloud.google.com/pubsub/subscriber

With Kafka or Kinesis, I can write events to a stream/topic completely independently of any consumer. I can then bring as many consumers online as I want, and they can start processing from the beginning of my stream if they want. If one of my consumers has a bug in it, I can ask it to go back and start again. That's what I mean by an immutable stream in Kafka or Kinesis.

jganetsk · on Nov 20, 2015

Cloud Pub/Sub engineer here. You can create as many consumers as you want. You can create them offline and bring them up and down whenever you want. Each consumer will receive a full copy of the stream, starting with its sync point (subscriber creation). Each message is delivered, and redelivered, to each consumer until that consumer acks that message.

If I understand your point correctly, the only expectation we haven't matched is the ability to "go back and start again". We hear you.

alexdean · on Nov 20, 2015

From your comment it sounds like you haven't used Kinesis or Kafka yourself - rather than take my word for it, I'd suggest your team give both of those platforms a serious try-out to really understand the capability gaps. I'd be surprised if a lot of your [prospective] customers weren't asking for these kinds of unified log capabilities in Cloud Pub/Sub.

jganetsk · on Nov 20, 2015

We hear you.

Let me see if I'm understanding the criticism: when creating a consumer, the sync point of a new consumer really should start from the very beginning of the topic, at a predictable explicit start point, rather than at the current end of the topic. This makes a lot of sense, and yes, there is a disconnect between the models. We think the capabilities you are talking about are great and those use cases are important. All I can say is keep your eyes open.

We went with defaults from Google's internal use of Pub/Sub, which is older than the public release of Kinesis and Kafka. Internal use involves an approach where topics and consumers are very long-lived. Topics are high throughput, in terms of bytes published per unit time. Retaining all messages and starting consumers from the very beginning wasn't a sensible default; our focus was more centered on making sure that, once topics and consumers were set up, consumers could keep up over time.

One example use case to help illustrate this thinking is doing real-time sentiment analysis on tweets: https://www.youtube.com/watch?v=O3mfuc-syTI

In the work described by that video, they were essentially publishing tweets in real time into a Cloud Pub/Sub topic, thus making an "all tweets on Twitter in realtime" topic. This is a great example of a topic where producers and consumers are completely decoupled from each other. It doesn't necessarily make sense to retain all tweets forever by default (although there certainly are use cases for that). There are plenty of use cases where a consumer might want to say "ok, please start retaining all tweets made from here on out" rather than starting from a specific tweet.

alexdean · on Nov 20, 2015

Thanks for the detailed explanation jganetsk.

> when creating a consumer, the sync point of a new consumer really should start from the very beginning of the topic, at a predictable explicit start point, rather than at the current end of the topic

I'll talk about Kinesis because that's the technology we use more at Snowplow. When creating a Kinesis consumer, I can specify whether I want to start reading from a) TRIM_HORIZON (which is the earliest events in the stream which haven't yet been expired aka "trimmed"), b) LATEST which is the Cloud Pub/Sub capability, c) AT_SEQUENCE_NUMBER {x} which means from the event in the stream with the given offset ID or d) AFTER_SEQUENCE_NUMBER {x} which is the event immediately after c).

Kinesis streams or Kafka topics don't themselves care about the progress of any individual consumer - consumers are responsible for tracking their own position in the stream via sequence numbers / offset IDs.

> It doesn't necessarily make sense to retain all tweets forever by default (although there certainly are use cases for that)

Completely agree. I think a good point of distinction between pub/sub systems and unified log is: use pub/sub when the messages are a means-to-an-end (which is feeding one or more downstream apps); use unified log when the events are an end-in-themselves (i.e. you would still want to preserve the events even if there were no consumers live).

Anyway, I could talk about this stuff all day :-) - if you'd like to chat further, my details are in my profile!

boundlessdreamz · on Nov 20, 2015

I'm not familiar with Kafka

1. Can you direct the consumer to a point in stream? (ideally time based i.e messages from 16 Nov UTC)

2. Can old events be auto removed defined by rules?

herriojr · on Nov 20, 2015

I haven't played with kafka in a while, but basically,

1. each group id represents a point in the stream that a consumer is processing off of. You could technically have multiple processes consuming off of a single group id.

2. there was a configuration on time to keep things there as well as space if I remember correctly, but basically, there has to be. There's a pretty hard limit on what all you can store on disk.

edit: changed consumer id to group id. If you want more info, feel free to ping me about the ecosystem