tuplejump/calliope

Issue connecting to remote Cassandra instance

jmahonin opened this issue · 2 comments

I'm testing out the Thrift and CQL3 integration with Cassandra, but I find that when I create an RDD via the following, it only connects to my local Cassandra instance, not the remote one. This is pretty much straight out of the example (http://tuplejump.github.io/calliope/show-me-the-code.html)

val cassandra_server = "some-example-host"
val cassandra_keyspace = "some-keyspace"
val cassandra_thrift_cf = "some-thrift-column-family"
val cassandra_cql_cf = "some-cql-column-family"

val thrift_rdd = sc.thriftCassandra[String, Map[String, String]](cassandra_server, "9160", cassandra_keyspace, cassandra_thrift_cf)

val cql_rdd = sc.cql3Cassandra[Map[String, String], Map[String, String]](cassandra_server, "9160", cassandra_keyspace, cassandra_cql_cf)

println(thrift_rdd.count(), cql_rdd.count())

Changing the value of cassandra_server seems to have no effect at all. I've also tried creating a CasBuilder as follows, with no effect:

val cassandra_port = "9160"
val cas = new Cql3CasBuilder(cassandra_keyspace, cassandra_cql_cf, cassandra_server, cassandra_port)
val cql_rdd = sc.cql3Cassandra[Map[String, String], Map[String, String]](cas)

Forgot to mention, I'm using the following dependencies:

libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "0.9.0-incubating",
    "com.tuplejump" %% "calliope" % "0.9.0-EA"
)

@jmahonin The Cassandra server passed to the CasBuilder is only the initial contact point in the cluster that Calliope will connect to. From that node it plans out which Spark nodes to run the tasks on, trying its best to ensure data locality.

So if you give an initial node that is remote, but the node running the standalone shell/Spark worker is also a Cassandra node in the same cluster and has the data to be processed available locally, then the job will run on the local node and connect to the local Cassandra.

If the remote node has a Spark worker running on it, if the local C* node doesn't have the data, or if the cluster isn't running Cassandra and Spark on the same machines, then the Spark worker will connect to the remote node.
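The preference described above can be sketched as a toy rule, completely outside Spark and Calliope (the object and method names here are illustrative, not part of either API): given the set of Cassandra replicas that hold a piece of data, a locality-aware scheduler connects a worker to itself when the worker is one of those replicas, and only falls back to a remote replica otherwise.

```scala
// Toy illustration of data-locality preference (not Calliope's actual
// scheduler): a task running on `worker` connects to the local Cassandra
// if the worker is itself a replica for the data, otherwise to a remote one.
object LocalityPreference {
  def connectTo(worker: String, replicas: Seq[String]): String =
    if (replicas.contains(worker)) worker // data is local: stay local
    else replicas.head                    // data is remote: go to a replica

  def main(args: Array[String]): Unit = {
    val replicas = Seq("cass-1", "cass-2", "cass-3")
    // Worker co-located with a replica: the task stays local.
    println(connectTo("cass-2", replicas))          // cass-2
    // Worker that holds no replica: the task connects out.
    println(connectTo("spark-only-host", replicas)) // cass-1
  }
}
```

This is why changing `cassandra_server` can appear to "have no effect": it changes where Calliope first asks about the ring, not where each task ultimately reads its data from.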

I hope that answers your query... If not, please feel free to comment and reopen the issue.