thatdot/quine

Support plain HTTP as an ingest source.


LeifW commented

We currently support ingest sources such as files, server-sent events, WebSockets, Kafka, Kinesis, stdin, and named pipes, but a plain HTTP source sounds desirable (e.g. to consume Twitter's streaming API). It should also have the option to pass an authorization header (e.g. a bearer token for Twitter's API).
I imagine this would be analogous to our file ingest source, just reading from HTTP instead of a file.

Hi @LeifW, I'm interested in trying this out. My guess is:

  • define a new IngestStreamConfiguration; let's call it HttpSourceIngest for now (see the sketch after this list)
  • handle the new HttpSourceIngest in ingest/package.scala
  • implement ingest.HttpSource
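
As a rough illustration of the first bullet, here is a minimal sketch of what such a configuration case class might look like. All field names, defaults, and the exact shape of IngestStreamConfiguration are assumptions here, not Quine's actual API:

// Hypothetical sketch only - names and defaults are guesses, not Quine's API.
final case class HttpSourceIngest(
  url: String,
  method: String = "GET",                   // optional HTTP method, defaulting to GET
  headers: Map[String, String] = Map.empty, // e.g. Authorization: Bearer <token>
  body: Option[String] = None,              // raw request body, if any
  format: FileIngestFormat                  // reuse the existing framing/parsing options
) extends IngestStreamConfiguration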

There's also the question of request format. One immediate possibility, where the body is hardcoded to JSON, could be:

{
  "url": "https://api.twitter.com/2/tweets/search/stream/rules",
  "type": "HttpSourceIngest",
  "headers": { "Authorization": "Bearer <SECRET_KEY>", "X-Header2": "Val2" },
  "body": {
    "arr1": [ { "k1": "v1" }, { "k2": "v2" } ],
    "obj1": { "k1": "v1", "k2": "v2" }
  }
}

what do you think?

LeifW commented

That looks good to me!
Our FileIngestFormat seems like it works in a similar way - an Akka Stream that can handle a series of JSON objects being passed in - perhaps some code there could be shared?

I was wondering about the format in which JSON objects come back from the Twitter stream API - if there's one JSON object per line, the stream can be split on newlines, and one JSON object parsed per line. If not, Akka Streams also supports chunking up the stream based on matching opening and closing {}'s - https://doc.akka.io/docs/alpakka/current/data-transformations/json.html. I believe our existing FileIngestFormat supports both of these options.
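
To illustrate the two framing options, here's a small sketch using Akka Streams (not Quine's actual FileIngestFormat code; the frame-size limits are arbitrary):

import akka.NotUsed
import akka.stream.scaladsl.{Flow, Framing, JsonFraming}
import akka.util.ByteString

// Option 1: one JSON object per line - split the byte stream on newlines.
val lineFraming: Flow[ByteString, ByteString, NotUsed] =
  Framing.delimiter(ByteString("\n"), maximumFrameLength = 1 << 20, allowTruncation = true)

// Option 2: brace-matched chunking - scan for complete {...} objects.
val braceFraming: Flow[ByteString, ByteString, NotUsed] =
  JsonFraming.objectScanner(maximumObjectLength = 1 << 20)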

I assume by the body part that this is meant to be a POST request? On reading the Twitter API docs, it sounds to me like you do a POST to that stream/rules endpoint to add rules, but then do a subsequent GET to the stream endpoint to get the filtered stream you just configured? In that case, it sounds like maybe you'd do the POST on your own (e.g. with curl), and could then create the ingest stream with a GET request? Basically I'm wondering if this would really need to support POSTing things, or would just GET be sufficient?

> ... POST /tweets/search/stream/rules endpoint. Once you've added rules and connect to your stream using the GET /tweets/search/stream endpoint, only those Tweets that match your rules will be delivered in real-time through a persistent streaming connection.

https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule
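
For a GET-only MVP like the one discussed above, a minimal end-to-end sketch might look like this (assumed code, not from Quine; the TWITTER_BEARER_TOKEN env var is hypothetical):

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import akka.http.scaladsl.model.headers.{Authorization, OAuth2BearerToken}
import akka.stream.scaladsl.JsonFraming

implicit val system: ActorSystem = ActorSystem("http-ingest-sketch")
import system.dispatcher

// Hypothetical env var holding the bearer token for Twitter's v2 API.
val bearerToken = sys.env("TWITTER_BEARER_TOKEN")

// HttpRequest defaults to GET; only the auth header is added.
val request = HttpRequest(
  uri = "https://api.twitter.com/2/tweets/search/stream",
  headers = List(Authorization(OAuth2BearerToken(bearerToken)))
)

// Open the persistent stream and frame the chunked body into JSON objects.
Http().singleRequest(request).foreach { response =>
  response.entity.dataBytes
    .via(JsonFraming.objectScanner(maximumObjectLength = 1 << 20))
    .map(_.utf8String)
    .runForeach(println) // a real ingest would hand these to Quine's pipeline
}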

> That looks good to me! Our FileIngestFormat seems like it works in a similar way - an Akka Stream that can handle a series of JSON objects being passed in - perhaps some code there could be shared?
>
> I was wondering about the format in which JSON objects come back from the Twitter stream API - if there's one JSON object per line, the stream can be split on newlines, and one JSON object parsed per line. If not, Akka Streams also supports chunking up the stream based on matching opening and closing {}'s - https://doc.akka.io/docs/alpakka/current/data-transformations/json.html. I believe our existing FileIngestFormat supports both of these options.

Thanks for the tip! I'll figure out how to refactor what's already available.

> I assume by the body part that this is meant to be a POST request? On reading the Twitter API docs, it sounds to me like you do a POST to that stream/rules endpoint to add rules, but then do a subsequent GET to the stream endpoint to get the filtered stream you just configured? In that case, it sounds like maybe you'd do the POST on your own (e.g. with curl), and could then create the ingest stream with a GET request? Basically I'm wondering if this would really need to support POSTing things, or would just GET be sufficient?

Honestly unsure at this moment, but I'm definitely not targeting Twitter only; it should be general enough to support any of GET/POST/PUT/PATCH/OPTIONS/TRACE/CONNECT - or should we stick to GET first? I reckon having an optional method option could at least allow for extensibility (see the sketch below). As a first use case, yes, I will try it with the Twitter API.
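
For what that optional method option might look like, here's a hypothetical helper (assuming Akka HTTP's HttpMethods constants) that maps the config string to a method, defaulting to GET:

import akka.http.scaladsl.model.{HttpMethod, HttpMethods}

// Hypothetical helper: map the optional "method" config string to an
// Akka HTTP method, defaulting to GET when absent.
def toHttpMethod(name: Option[String]): HttpMethod =
  name.map(_.toUpperCase).getOrElse("GET") match {
    case "GET"     => HttpMethods.GET
    case "POST"    => HttpMethods.POST
    case "PUT"     => HttpMethods.PUT
    case "PATCH"   => HttpMethods.PATCH
    case "OPTIONS" => HttpMethods.OPTIONS
    case other     => throw new IllegalArgumentException(s"Unsupported method: $other")
  }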

LeifW commented

Awesome, thanks!

As far as different methods vs only GET - whatever you think is best for now. Seems it should be easy enough to remove / add later if they're not used / if there's a demand.
I'm not used to seeing those other methods return a substantial body, but it's certainly possible. I was just thinking it'd be a simpler MVP to only support GET, and from there other features could be added as needed?

You're right, I'll try to put up the GET MVP PR soonish!

I tried this same thing from another angle: creating a separate executable that reads the Twitter stream and writes it to stdout, and just having Quine read that from stdin. In my case, I repurposed some existing sample code that reads the Twitter sample stream (v1.1 - this is old sample code). There is a simple shell script to call it:
https://github.com/Andy-42/twitter-stream-analysis/blob/master/quine-twitter.sh

I agree that it would be nice to have the ability to read a JSON lines stream from HTTP built into Quine, even though it isn't strictly necessary.