The Twitter River plugin allows index twitter stream using Elasticsearch rivers feature.
In order to install the plugin, simply run: bin/plugin -install elasticsearch/elasticsearch-river-twitter/2.0.0
.
- For 1.0.x elasticsearch versions, look at master branch.
- For 0.90.x elasticsearch versions, look at 1.x branch.
Twitter River Plugin | elasticsearch | Release date |
---|---|---|
2.1.0-SNAPSHOT | 1.0.0.RC1 -> master | XXXX-XX-XX |
2.0.0 | 1.0.0.RC1 -> master | 2014-03-08 |
2.0.0.RC1 | 1.0.0.RC1 -> master | 2014-01-15 |
The twitter river indexes the public twitter stream, aka the hose, and makes it searchable.
You need to get an OAuth token in order to use Twitter river. Please follow Twitter documentation, basically:
- Login to: https://dev.twitter.com/apps/
- Create a new Twitter application (let's say elasticsearch): https://dev.twitter.com/apps/new You don't need a callback URL.
- When done, click on
Create my access token
. - Open
OAuth tool
tab and noteConsumer key
,Consumer secret
,Access token
andAccess token secret
.
Creating the twitter river can be done using:
curl -XPUT localhost:9200/_river/my_twitter_river/_meta -d '
{
"type" : "twitter",
"twitter" : {
"oauth" : {
"consumer_key" : "*** YOUR Consumer key HERE ***",
"consumer_secret" : "*** YOUR Consumer secret HERE ***",
"access_token" : "*** YOUR Access token HERE ***",
"access_token_secret" : "*** YOUR Access token secret HERE ***"
}
},
"index" : {
"index" : "my_twitter_river",
"type" : "status",
"bulk_size" : 100,
"flush_interval" : "5s"
}
}
'
The above lists all the options controlling the creation of a twitter river.
By default, the twitter river will read a small random of all public statuses using sample API.
But, you can define statuses type you want to read:
- sample: the default one
- filter: track for text, users and locations. See Filtered Stream
- firehose: all public statuses (restricted access)
For example:
curl -XPUT localhost:9200/_river/my_twitter_river/_meta -d '
{
"type" : "twitter",
"twitter" : {
"oauth" : {
"consumer_key" : "*** YOUR Consumer key HERE ***",
"consumer_secret" : "*** YOUR Consumer secret HERE ***",
"access_token" : "*** YOUR Access token HERE ***",
"access_token_secret" : "*** YOUR Access token secret HERE ***"
},
"type" : "firehose"
}
}
'
Note that if you define a filter (see next section), type will be automatically set to filter
.
Tweets will be indexed once a bulk_size
of them have been accumulated (default to 100
)
or every flush_interval
period (default to 5s
).
Filtered stream can also be supported (as per the twitter stream API). Filter stream can be configured to
support tracks
, follow
, locations
and language
. The configuration is the same as the twitter API (a single comma
separated string value, or using json arrays). Here is an example:
{
"type" : "twitter",
"twitter" : {
"oauth" : {
"consumer_key" : "*** YOUR Consumer key HERE ***",
"consumer_secret" : "*** YOUR Consumer secret HERE ***",
"access_token" : "*** YOUR Access token HERE ***",
"access_token_secret" : "*** YOUR Access token secret HERE ***"
},
"filter" : {
"tracks" : "test,something,please",
"follow" : "111,222,333",
"locations" : "-122.75,36.8,-121.75,37.8,-74,40,-73,41",
"language" : "fr,en"
}
}
}
Note that locations use geoJSON order (longitude, latitude).
Note that if you want to use language filtering you need also to define at least one of tracks
,
follow
or locations
filter.
Supported languages identifiers are BCP 47. You can filter
whatever language defined in Twitter Advanced Search.
Here is an array based configuration example:
{
"type" : "twitter",
"twitter" : {
"oauth" : {
"consumer_key" : "*** YOUR Consumer key HERE ***",
"consumer_secret" : "*** YOUR Consumer secret HERE ***",
"access_token" : "*** YOUR Access token HERE ***",
"access_token_secret" : "*** YOUR Access token secret HERE ***"
},
"filter" : {
"tracks" : ["test", "something"],
"follow" : [111, 222, 333],
"locations" : [ [-122.75,36.8], [-121.75,37.8], [-74,40], [-73,41]],
"language" : [ "fr", "en" ]
}
}
}
By default, elasticsearch twitter river will convert tweets to an equivalent representation
in elasticsearch. If you want to index RAW twitter JSON content without any transformation,
you can set raw
to true
:
curl -XPUT localhost:9200/_river/my_twitter_river/_meta -d '
{
"type" : "twitter",
"twitter" : {
"oauth" : {
"consumer_key" : "*** YOUR Consumer key HERE ***",
"consumer_secret" : "*** YOUR Consumer secret HERE ***",
"access_token" : "*** YOUR Access token HERE ***",
"access_token_secret" : "*** YOUR Access token secret HERE ***"
},
"raw" : true
}
}
'
Note that you should think of creating a mapping first for your tweets. See Twitter documentation on raw Tweet format:
curl -XPUT localhost:9200/my_twitter_river/status/_mapping -d '
{
"status" : {
"properties" : {
"text" : {"type" : "string", "analyzer" : "standard"}
}
}
}
'
If you don't want to index retweets (aka RT), just set ignore_retweet
to true
(default to false
):
curl -XPUT localhost:9200/_river/my_twitter_river/_meta -d '
{
"type" : "twitter",
"twitter" : {
"oauth" : {
"consumer_key" : "*** YOUR Consumer key HERE ***",
"consumer_secret" : "*** YOUR Consumer secret HERE ***",
"access_token" : "*** YOUR Access token HERE ***",
"access_token_secret" : "*** YOUR Access token secret HERE ***"
},
"ignore_retweet" : true
}
}
'
If you need to stop the Twitter river, you have to remove it:
curl -XDELETE http://localhost:9200/_river/my_twitter_river/
This software is licensed under the Apache 2 license, quoted below.
Copyright 2009-2014 Elasticsearch <http://www.elasticsearch.org>
Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.