Netflix/aegisthus

Functional example?

matschaffer opened this issue · 9 comments

Is there any doc on a functional example? I'm trying to just process a test data set:

cqlsh> SELECT * FROM mattest.users;

 userid      | emails              | first_name | last_name | todo | top_scores
-------------+---------------------+------------+-----------+------+------------
 matschaffer | {'mat@schaffer.me'} |        mat |  schaffer | null |       null

(1 rows)

And running this on EMR:

hadoop jar aegisthus-hadoop-0.3.0-SNAPSHOT.jar com.netflix.Aegisthus -input s3://bucket/mattest2.0/mattest/users/snapshots/1460510053352/mattest-users-jb-1-Data.db -output s3://bucket/outputs2.0/1

The job completes, but all I get back is one JSON file that doesn't look like anything I can use:

6d61747363686166666572  {"6d61747363686166666572":{"deletedAt":-9223372036854775808,"columns":[["000000","",1460510043566000],["0006656d61696c7300000f6d61744073636861666665722e6d6500","",1460510043566000],["00096c6173745f6e616d6500","7363686166666572",1460510043566000],["000a66697273745f6e616d6500","6d6174",1460510043566000]]}}

I tested this on snapshots from both Cassandra 2.0 and 2.1 (via the https://hub.docker.com/_/cassandra/ containers) with the same result. Here's the Aegisthus build line:

Running Aegisthus version 0.3.0-SNAPSHOT built from change 3305276 on host ip-10-0-2-78 on 2016-04-12_20:34:02 with Java 1.8.0_77

Thanks in advance!

Ah ha!

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import org.apache.cassandra.utils.ByteBufferUtil;

ByteBuffer buffer = ByteBufferUtil.hexToBytes("6d6174");
String string = new String(buffer.array(), Charset.forName("UTF-8"));
System.out.println(string);  // => "mat"

So this is getting me closer:

hadoop jar \
  aegisthus-hadoop-0.3.0-SNAPSHOT.jar \
  com.netflix.Aegisthus \
  -Daegisthus.columntype=UTF8Type \
  -Daegisthus.keytype=UTF8Type \
  -Daegisthus.column_value_type=UTF8Type \
  -input s3://bucket/mattest2.0/mattest/users/snapshots/1460510053352/mattest-users-jb-1-Data.db \
  -output s3://bucket/outputs2.0/1
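
For what it's worth, those -D flags seem to select Cassandra marshalling types for the key, column name, and column value. Here's a minimal sketch of the same conversion done by hand with Cassandra's UTF8Type (that this is what Aegisthus does internally is my assumption):

import java.nio.ByteBuffer;
import org.apache.cassandra.db.marshal.UTF8Type;
import org.apache.cassandra.utils.ByteBufferUtil;

// Deserialize the hex row key from the earlier output with the UTF8 marshaller,
// presumably what -Daegisthus.keytype=UTF8Type selects.
ByteBuffer key = ByteBufferUtil.hexToBytes("6d61747363686166666572");
System.out.println(UTF8Type.instance.getString(key));  // => "matschaffer"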

But I'm getting non-printables in the column names:

matschaffer {"matschaffer":{"deletedAt":-9223372036854775808,"columns":[["\u0000\u0000\u0000","",1460510043566000],["\u0000\u0006emails\u0000\u0000\u000Fmat@schaffer.me\u0000","",1460510043566000],["\u0000\tlast_name\u0000","schaffer",1460510043566000],["\u0000\nfirst_name\u0000","mat",1460510043566000]]}}

This could have something to do with how I created the table in the first place, so I'll try recreating the data, but I thought I'd mention it here in case this is an issue people have encountered before.

Hi Mat,

Aegisthus's support for CQL is not great at the moment. @danchia added some support for reading from CQL-defined tables here: https://github.com/Netflix/aegisthus/blob/3305276264abfe5ecddb4f46c7a7a6940dc41093/aegisthus-hadoop/src/main/java/org/coursera/SSTableExport.java

We still use Apache Pig to read Aegisthus output, and for the minority of tables that are in CQL, the casting/parsing is handled by Pig scripts.

I would like to have proper CQL support; it would make Aegisthus a lot nicer, but I don't have any definite plans to work on it.

Sorry I don't have a better answer for now.

Hi Mat,

At Coursera we use the following additional job, which I wrote a while back, to extract logical CQL rows:
https://github.com/coursera/aegisthus/blob/coursera-columnar-20/aegisthus-hadoop/src/main/java/org/coursera/SSTableExport.java

In our case we chose to map the columns to an Avro file and save it, but you could conceivably write JSON or the like.
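
For illustration only, here's a minimal sketch of writing one such logical row with Avro's GenericRecord API; the schema, field names, and output file are made up to match the mattest.users example above, not taken from the actual job:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Hypothetical schema mirroring the scalar columns of mattest.users.
Schema schema = SchemaBuilder.record("users").fields()
        .requiredString("userid")
        .optionalString("first_name")
        .optionalString("last_name")
        .endRecord();

GenericRecord row = new GenericData.Record(schema);
row.put("userid", "matschaffer");
row.put("first_name", "mat");
row.put("last_name", "schaffer");

// Write an Avro container file that Hive, Pig, etc. can read directly.
try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
    writer.create(schema, new File("users.avro"));
    writer.append(row);
}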

Thanks @danchia and @danielbwatson! Cassandra and EMR are still a new space for me.

What's the advantage of writing out as Avro?

And @danielbwatson are there any pig scripts you could share that handle the Aegisthus output? (Presumably you mean the JSON output?)

Maybe https://github.com/Netflix/aegisthus/wiki/Pig-Loader is still accurate?

Thanks!

Avro is one of the commonly used Hadoop storage formats, so it should interop pretty nicely with the rest of the Hadoop ecosystem (Hive, etc).

There are plenty of other formats that would work equally well (Parquet, ORC, etc.), but the choice of format should have minimal effect; the core CQL processing logic of the job I shared would be pretty similar either way.

@matschaffer I believe that the PigLoader should still work. The Pig scripts in the wiki looked accurate to me. I think if you use it as is, you are going to have to parse the map keys, because it looks like the emails set value is stored in the column name.

Some other hints: the column names are CompositeType-encoded, meaning each component is prefixed with its two-byte big-endian length and followed by a \u0000 end-of-component byte. So in \u0000\nfirst_name\u0000 the \n is decimal 10 (\u000A), the length of first_name, and in \u0000\tlast_name\u0000 the \t is decimal 9, the length of last_name.
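
To make that concrete, here's a minimal sketch of decoding one component under that assumed layout (two-byte big-endian length, component bytes, 0x00 terminator); readComponent is a hypothetical helper, not something in Aegisthus:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Reads one component of a CompositeType-encoded column name.
static String readComponent(ByteBuffer buf) {
    int length = buf.getShort() & 0xFFFF; // e.g. the bytes 0x00 0x0A read as 10, the length of "first_name"
    byte[] bytes = new byte[length];
    buf.get(bytes);
    buf.get();                            // consume the trailing 0x00 end-of-component byte
    return new String(bytes, StandardCharsets.UTF_8);
}

// readComponent(ByteBuffer.wrap(new byte[]{0, 10, 'f','i','r','s','t','_','n','a','m','e', 0}))
// => "first_name"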

Sorry this is so ugly right now; hopefully we can get better CQL support in Aegisthus in the future.

Closing all issues since the project is archived.