brmson/yodaqa

Scale-Out Architecture?

Opened this issue · 15 comments

k0105 commented

Hi,

I just finished building another physical system, so I could run YodaQA on 24 cores and 60GB of RAM in parallel (even though 16/48 is more realistic - I have some other stuff to do). Executing Yoda with all its backends on an octocore with 24GB RAM currently takes 7 seconds, so with a cluster of 24 cores real-time might become feasible. [OK, it doesn't, but don't burst my bubble.]

I know that performance is no core goal right now, but I'm still interested: Is there anything in place or planned to do this? UIMA-AS, OpenStack, MPI / OpenMP, Docker + CoreOS? Which direction do you want to take and how far away is it?

Best,
Joe

pasky commented

Hi! Scale out is certainly something we have in mind and was in fact one of my primary motivations for using UIMA at all (DKpro the other half of that motivation).

However, when I first seriously looked into using UIMA-AS and DUCC, it turned out that I would have to rewrite the whole pipeline code of YodaQA (that uses UIMAfit heavily right now), and that non-scaled-out setup of YodaQA would probably get significantly more complicated.

I don't know if I didn't misjudge anything here, though, maybe it's not as bad as it seems. Anyhow, I don't plan to work on scaleout again in the near future, personally. Patches are welcome, but writing them will probably involve some learning curve within the UIMA ecosystem.

A technology that might be most suitable for all this is Leo:

https://github.com/department-of-veterans-affairs/Leo

It seems like something like UIMAfit for UIMA-AS (i.e. simple Java code instead of XML madness). But we use UIMAfit heavily now and I think there are little to none existing examples that use UIMAfit + Leo, though its author mentioned on uima-users a few months ago that it should in principle be possible.

(P.S.: For the web interface, an extra-UIMA "question dashboard" is instituted that various parts of the pipeline access. It's mainly for reporting extra metadata to the web interface, while the pipeline is still running; this code would have to be modified to be scaleout/network aware, or just temporarily disabled for starters. Dealing with this is not a major effort, just a heads-up that it's there.)

k0105 commented

Just for future reference - the post Petr referred to should be: http://permalink.gmane.org/gmane.comp.apache.uima.general/6270

Thanks for the answer, interesting problem. As you know I also have my hands full with an important integration, but this is yet another item on my future work list. I'll revisit this once my knowledge of UIMA has advanced.

pasky commented

I'll keep this open as it's a general YodaQA issue that will need to be solved sooner or later.

Regarding DUCC vs. Mesos, here are my 2 cents: DUCC should be easier to integrate because it is developed for UIMA. Mesos is widely deployed, proven to be highly scalable, and also supports Containers/Docker.
If you have a lot of available time, then Mesos is the way to go. You don't have to integrate the Zookeeper in the beginning.

pasky commented

Hi! It seems jbauer's commentss did not appear in the github issue, just
in email, is that possible?

OK, after an upgrade I now have 20 cores and 80GB of RAM in 3 machines (connected via ethernet and KVM switch) for future Yoda development and over the next months [end of March, early April] I'll set up a test cluster. I'll assume for scaling itself you target DUCC instead of Hadoop / Behemoth? What about Beowolf, MPICH2 or even OpenStack? Do you have any preliminary ideas / preferences about what the infrastructure for this should look like?

OK, so when I want to start scaling Yoda (I strongly assume via DUCC an early April) does anyone have any hints about how to connect multiple computers to one cluster? Should I go DUCC only or does it make sense / is it possible to use any framework below like MPICH2, Mesos (Hydra), Beowolf or even OpenStack?

Overally, cluster processing is one area where my personal experience
with frameworks etc. is almost completely blank. So, I'll have to defer
this to the choices of others... DUCC would be my first instict though
as that's what UIMA itself uses.

However, in general, I think the right way to do scaleout is to look at
this at a bit more abstract level:

(i) use the computing power to do preprocessing, e.g. preparsing
some of the corpus [it would be pretty important for YodaQA to also gain
the capability to persistently cache intermediate results];

(ii) decouple a lot of the computationally intensive tasks like parsing
behind REST APIs that we can call - again, the obvious candidate here is
parsing, but a lot of other stuff that takes long to initialize would be
nice as well. We already do this a bit with entity linking. Maybe this
is what some of the architectures you mention are about, I have no idea.
We can then stuff this behind a loadbalancer and deploy many times.

(iii) a lot of our bottleneck is currently in database lookup, so
intuitively scaling out just the database engine (and documenting that)
would be perfect...

k0105 commented

Very interesting.

a) Maybe we can already start with the JoBim Text (JBT) integration? What if I write a little REST interface for it and just throw this into YodaQA so it's generally available together with a tutorial on how to set it up (create models and DBs)? Its first processing step is based on Hadoop and hence should already scale rather well with a Mesosphere.
[I can also provide an example how to call it. Can you estimate when you will find time to answer me? Is cz.brmlab.yodaqa.analysis.tycor.LATMatchTyCor.java a good place to include an example call?]

b) I could easily package the data backends including JBT into Docker containers just like I did with YodaQA itself. Again, this should help if we follow a Mesos route (in addition to DUCC)... [For the DUCC part I still need some time to really know what I'm doing.]

pasky commented

I hope to answer re JBT tonight. :) Sorry, I was doing some focused work over the weekend. But it'd be best to have that tracked in a different github issue, I guess...

jbauer's comments did not appear in Github issue. Strange!

I don't have experience working with the technologies (DUCC, Mesos, etc) mentioned previously. So, my comments are based on whatever I have read so far about these technologies.

I see the problem of scaling as consisting of two unrelated problems as follows:

(1) How to scale within an instance of YodaQA? This problem is solved by UIMA-AS. Seems that we don't need UIMA-AS if YodaQA code is reentrant so that multiple user-inputs (i.e. questions) are processed simultaneously.

(2) How to scale to multiple instances of YodaQA? As an example of a requirement: YodaQA must scale to thousands of instances of YodaQA. This problem is solved by DUCC, Mesos, etc. This also includes connecting thousands of nodes/servers in a cluster.

To me, Preprocessing is separate from Normal-operation. Normal operation involves executing user input and producing a result to the user. Preprocessing can always be done on nodes where the data (corpus, etc) resides. During normal operation, the same preprocessed data should be available to all yodaqa instances, and some of these instances may reside on the same nodes that have the preprocessed data.

Regarding (1) above: If preprocessing is NOT done, and during normal operation if YodaQA is still heavily I/O-bound (involving database lookups) then I/O is a major problem. Running multiple questions simultaneously would somewhat alleviate this problem. Running multiple questions implies reentrant code.
Scaling the databases may require making multiple copies, placing a copy in each rack, caching on the same nodes that have yodaqa instances, and if database size permits then placing copies on the same nodes' disks that have the yodaqa instances.

pasky commented
k0105 commented

In case I haven't mentioned this: I did some measurements and also found IO to be a bottleneck, so I added an SSD and additional ethernet adapters to my systems. During my evaluation (scheduled in 5 weeks) I will run YodaQA and its backends in several configurations (SSD vs HDD, backends on one system vs backends on different systems, with octocore and 32GB RAM vs Quadcore and 8GB RAM) so I get a feeling for what gets results and how close to real-time I can get with simple tricks. I'll keep you posted.

Yes, the focus should be on minimizing the end-to-end delay.
Making the code reentrant helps in reducing the idle time of the processor. But this should be low on the priority list.

pasky commented

When a thread blocks, the processor runs another thread/process that is runnable. If nothing is runnable, then the processor sits idle.

I will run YodaQA and its backends in several configurations (SSD vs HDD, backends on one
system vs backends on different systems, with octocore and 32GB RAM vs Quadcore and
8GB RAM)

Look forward to your results.
There is, obviously, a trade-off between price and speed. Inexpensive off-the-shelf severs have HDD. SSD comes with a much higher price that adds up as the number of servers increase.

k0105 commented

The containerization part is now discussed in a dedicated thread: #41

Update: Done. I have build Dockerfiles for all data backends as well as Yoda itself and have orchestrated them via Docker Compose such that all of Yoda can be launched fully automatically.