XIVStats/XIVStats-Gatherer-Java

Performance optimizations, speed and MORE speed

Closed this issue · 61 comments

I wanted to set up a new site with support for the recently introduced jobs. All went well, but finding the highest character ID was a bit tedious. I had to hand-edit the number and play around for an hour, and I am still not sure I have the highest valid ID at all.

Would it be possible to add a way to the Java collector to find the highest ID?
Thank you.

Example, why it would be useful:

My script is at: 1,803,264
The fresh stat page (mine) says: 242,560 players.
According to my hand guess, the highest ID is: 19,096,969
Meanwhile, the "official" site lists 10 million players, so at this rate my script is really off.

No,

It's a good idea, but I'm not really up for adding a feature that's being requested by someone competing with us unless you're going to be feeding back into the project yourself.

In my opinion it's a good idea. I currently shift the number forwards each time I do a run, and it's not the greatest approach, as occasionally I miss 100k IDs or so.

I'm guessing a good way would be to have it quit if it finds 5,000 missing pages in a row or something. If you have any solid suggestions let me know.

Thank you @Pricetx !

I love you guys' work; I really don't know why he was so offended. I mean, if I intended to "steal your work", I most certainly would not have come here. Anyone could just copy the code, run it, and then make any design they wanted.

Anyhow, about the ID finding... I had the same thought.
Try a high number, then gradually go higher.
Once it hits a missing character, it should back off bit by bit; basically just what I did by hand anyway.
I'm wondering if there is a reliable way to find a "fairly recent ID" somewhere to use as a starting marker.

Will check around and comment if I find anything. Thank you again.

A couple of bits of info that may be useful to know:

  • The early IDs (the first million or so) tend to have a lot of gaps. I'm not sure if this is some quirk of characters coming across from 1.0, or whether players had to log in at least once, or something else.
  • You will find the odd gap here and there in normal data; this is just players who have deleted their character (and maybe permanently banned characters? I can't remember what happens with those).
  • I'm currently seeing IDs up to around 19,800,000.

Yeah, see my opening post with the numbers. (Included at the bottom).

The weird bit (the reason I started the whole ID-finding idea) is that it looked like my "try and guess" method was not good enough, but it turns out you also rely on a similar one. As you can see, it's almost at 2 million (ID-wise) and has only found 250k players.

Meanwhile, the live XIVCensus has 10.8 million players, and that is with pre-Stormblood data.
I reckon many people made new characters after Stormblood came out.
So if I extrapolate to 10 × 250k, that's only 2.5 million, way below the live pre-Stormblood figure.

That worries me. The site can be found here btw: http://ffstat.agx.co

(I am slapping together some ugly Java code to test this number-finding game. I've added the link and warning I talked about; I hadn't bothered until now since I haven't even got a working site with proper data yet, of course. At the moment it's at ID 1,849,291 and 249,303 players. The projected final player count is ~2,667,542.)

If it's any consolation, as surprising as those numbers may seem, they match what I used to get with the Ruby gatherer and what I get with the Java gatherer that runs the live site. I'll do a quick bit of data analysis for you; give me a few minutes.

Of course, I am by no means in a hurry; sorry if it seemed so. And thank you for the very quick reply. I'll get back to the number game.

I certainly agree that this is a good feature to add, I'm not in a position to add it myself at the moment.

In reference to creating your own instance of the stats site, I can see perfectly well why you want to do this - we've not updated the stats on the site for quite some time due to a lack of updates for Heavensward; however, we now have those updates in place and will be executing our own run shortly.

Yes, I saw the dev site on some other ticket, but as the comments show, the IDs and player numbers seem to be off quite a bit. At the moment I am not even trying to get the full site up, but rather to help you guys track this down. And of course, maybe implement the number-guessing part.

It would be a fairly complex algorithm for how simple the task sounds, at least based on my estimation.

It'd need to build in 'stepping' up in increments, then scanning along from the last hit if it gets a 404 for that increment, then stepping up further, and so on. The problem is how you properly evaluate the 'terminal' state.
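
For illustration only, here is a rough Java sketch of that stepping idea (not code from the gatherer; the probe URL, step sizes and the "N consecutive misses" terminal condition are all assumptions, and 429 handling / back-off is omitted entirely): step upwards in large increments while a small window around the probe still contains a live character, then scan forward one ID at a time from the last hit and stop after a long run of misses.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class HighestIdFinder {

    // Lodestone character URL pattern; assumed here, adjust to whatever the gatherer uses.
    private static final String CHAR_URL = "https://eu.finalfantasyxiv.com/lodestone/character/%d/";

    // Treat a 404 as "no such character"; any other status counts as a hit.
    static boolean exists(int id) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(String.format(CHAR_URL, id)).openConnection();
        int code = conn.getResponseCode();
        conn.disconnect();
        return code != 404;
    }

    // Probe a small window so a single deleted character can't end the coarse phase early.
    static boolean anyExistNear(int id, int window) throws IOException {
        for (int probe = id; probe < id + window; probe++) {
            if (exists(probe)) {
                return true;
            }
        }
        return false;
    }

    static int findHighestId(int start, int bigStep, int window, int terminalMisses)
            throws IOException {
        // Phase 1: coarse stepping upwards while the probed window still has live characters.
        int probe = start;
        while (anyExistNear(probe, window)) {
            probe += bigStep;
        }
        // Phase 2: fine scan from the start of the last window that had a hit; a long
        // run of consecutive misses is treated as the terminal state.
        int id = probe - bigStep;
        int highest = id;
        int misses = 0;
        while (misses < terminalMisses) {
            id++;
            if (exists(id)) {
                highest = id;
                misses = 0;
            } else {
                misses++;
            }
        }
        return highest;
    }
}

Something like findHighestId(18_000_000, 100_000, 50, 5_000) would mirror the "5,000 missing pages in a row" suggestion from earlier in this thread.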

How does your test run match up, @ReidWeb? The dev site, that is. How many IDs did you scan, and how many players did it find? Would it get to a "reasonable number" if you allowed it to do a full run?

I mean, as you can see, my ongoing run is getting very low results. (Pricetx is looking into this right now.)

@Balls0fSteel I didn't get to fully run it myself; @Pricetx is handling that. Results should be available soon for our small test run, and we hope to kick off the complete run on September 1st.

Here's the results of analysing the June dataset:

0 to 999,999 = 114,939
1,000,000 to 1,999,999 = 145,234
2,000,000 to 2,999,999 = 617,615
3,000,000 to 3,999,999 = 706,858
4,000,000 to 4,999,999 = 677,867
5,000,000 to 5,999,999 = 611,694
6,000,000 to 6,999,999 = 623,244
7,000,000 to 7,999,999 = 657,453
8,000,000 to 8,999,999 = 635,399
9,000,000 to 9,999,999 = 692,838
10,000,000 to 10,999,999 = 657,786
11,000,000 to 11,999,999 = 675,689
12,000,000 to 12,999,999 = 629,330
13,000,000 to 13,999,999 = 620,318
14,000,000 to 14,999,999 = 594,508
15,000,000 to 15,999,999 = 585,848
16,000,000 to 16,999,999 = 714,455
17,000,000 to 17,999,999 = 771,092
18,000,000 to 18,999,999 = 86,319

I have to say I'm a little surprised that only 70% or so of all IDs actually exist anymore. I guess a lot of characters get deleted.

Anyway, this agrees with my previous statement about the low IDs having more gaps than the higher IDs.

@Pricetx Wow, that is really detailed. Thank you very much.
Since I already set up the machine, I will let it run for a bit longer. I set it up around a day ago and it's at 2 million, so I reckon it will take roughly 10 days at this rate to get a full stat result.

How many threads do you guys use, by the way?
I realize no one should really hammer SE's servers, but how much is too much? (I mean, I have absolute control over the config there, but what is ethical to use?)

Now I had a thought...
What if...

  • I made a VM image that contains a fresh system + SQL + the Java gatherer.
  • I deployed 20 (for example) VMs from that image and divided the IDs between them (slaves).
  • This way the gathering could finish fast. It would not put less load on SE of course, but it would finish reasonably fast. And I don't think pushing back 429 replies results in 0% CPU load on SE's servers either; if the script did not produce so many 429s, it would simply finish faster and thus result in less load on the other end.

I hard-limited it to 64 both to save resources and for, as you put it, 'ethical' reasons. There's also a diminishing return with the number of threads used: because of the way threads are handled by the CPU, it is faster to run with more threads, but beyond the number of cores (or virtual cores, i.e. with hyperthreading or virtualisation) the returns begin to diminish significantly.

The main reasons to run with fewer threads would be to save on memory, and if you're hitting a lot of 429 error codes on the Lodestone.

With our census hopefully kicking off on the 1st, I'd be concerned that we might see increased rates of 429s between us.

Out of interest, what version of the gatherer are you running? If you're already running the latest version, we may be able to collaborate by splitting the run up for this month and then merging the databases at the end.

Note that older versions of the gatherer are no longer functional due to changes in the Lodestone.

It's actually an interesting topic, and caused us to find a bug in the application a while ago.

It appears that Square Enix are running the website behind an nginx instance. This nginx instance has some sort of rate limiting set up. I've noticed that I hit the rate limit more often on weekends and evenings, so my best guess is that the rate limit is a global number of active connections or something.

I currently have the thread count at 64, which I find to be a sweet spot between running well during quiet periods and not triggering too many 429s (rate-limit failures) during peak hours.

EDIT: I apparently forgot about the 64-thread hardcoded limit. My points remain, though. During absolute peak traffic I've found that somewhere around 30-40 threads will just about max out the throughput.

Out of interest, what version of the gatherer are you running?

The XIVStats-Gatherer-Java-v1.4.0.jar version is running at the moment. So far so good. If the September 1st run goes ahead, I will stop my generation; of course, I don't want to cause any trouble for the site, as I have used it numerous times in the past (and I intend to keep using it, it's superb).

Should I try this split generation tomorrow? I could set up the image and deployment in a matter of hours. Based on my experiments, the 429 errors start piling up at ~500+ threads.

Quick (horrible) math:
2,000 threads = ~2 million IDs / day (with 429s, that is).
So 500 threads (no or rare 429s) = ~500k / day.

Let's aim for a one-day generation. That's 40 virtual machines and 24 hours.
The cheapest Vultr VPS is 25GB SSD + 1GB RAM + 1 core, which should be enough.

$0.168 / 24 hours. So we need 40 machines: 40 * $0.168 ≈ $7. Not a huge amount, and if I aim for two days (less stress on SE and less chance the Japanese Special Forces will knock on my door), then it's only $3.50.

And yeah, going back to the ethical part, even two days with a 20-node cluster means roughly 10k constant requests. It's really hard to even guess how much stress X threads put on their system. Maybe they have a whole cluster and it's literally no load; or maybe it's just a small cluster for FFXIV and many threads might hammer it. It's really hard to "feel it" too, because from here (Japanese servers) it always feels slow.

@Pricetx - I've seen the same kind of limiting in Apache. But yeah, it achieves the same thing really: it sends clients away until they slow down. I used Japanese VPS machine(s) to do the work (less round-trip, less load, faster response; might be just placebo though).

Sorry for the lengthy post!

With that said, I ran a website with way higher traffic (millions and millions of requests) with only a single PC behind it (a plebeian simple i7), and we had negligible load. Of course, a lot depends on the code and content too. Images and such were off-loaded using HAProxy to a CDN machine; again, just another modest PC that served the images.

Those results are interesting. It sounds like SE might have finally invested in better infrastructure for the site around the same time they updated the layout for Stormblood.

I had also considered latency assisting somewhat, but my bottleneck previously has been the request rate at the lodestone end so I hadn't really done any proper testing (although thinking back, if the limitation was a global connection one, having my connections freed up more rapidly from lower latency would help I guess).

By all means go ahead with your testing, it would be interesting to see. If you really can push those numbers then I wouldn't be too concerned about interfering with my comparatively trivial 64 threads. Just don't get into trouble with anyone! ;)

You mentioned using Japanese VPSes. I'd considered writing a new version for AWS Lambda that could run in the Tokyo region; it would cost around $2.80 for a month (15 million players). @Pricetx and I have talked at length about getting him those resources back. Serverless is the best way to go IMO, as you don't pay for any idle time.

I've also confirmed that they don't do any CDN/regional distribution of the HTML (I think they do for other assets, but not the HTML which is all we need), all requests are served from Japan AFAIK.

@ReidWeb I always steered away from AWS; it's just way too costly for me. Vultr, Linode, RamNode, OVH, Hetzner - they all provide more and are way cheaper. AWS cannot be beaten if you are an enterprise and need to deploy a 1,000-machine cluster in a matter of minutes, but as a regular person with small needs, I've always found the other providers to be a better deal. I used Vultr for my current node since they have a Japanese location too.

The VPS providers based on my experience:

Best price / value: Hetzner. (+snapshot, +large bandwidth, +vCore, +fast connection [Germany location only])
Best price / value if you need 10TB of bandwidth: OVH. (+very large bandwidth any package; -100mbps only [several locations for VPS])
Best price / value if you need speed, fast deployment, global presence: Vultr. (+global, +snapshot, +fast IO+CPU, +fast connection, +custom ISO, -Hetzner/OVH both offer +1 core for the same price.)
Still very good: Ramnode, Linode. (+global, -no snapshot, -worse than the others)

I've also confirmed that they don't do any CDN/regional distribution of the HTML (I think they do for other assets, but not the HTML which is all we need), all requests are served from Japan AFAIK.

Yes indeed, they only have a Japanese IP. But it might be Square Enix's cluster behind a front-end proxy. Good question why SE doesn't pay for Cloudflare, for example; it would greatly benefit the Lodestone, I think. Oh well, you cannot argue with the suits (SE's leadership).

So, I am off to baking this VM image, @Pricetx. We shall see what happens. If the Special Forces get me, please delete my browsing history!

One last data bit: 2,000 threads on a very fast connection with high activity results in only ~30% usage of one core on a VPS. Memory-wise it also eats only ~700MB. I saw the ticket about rewriting the code base for the Java program, but I must say, it works very well in its current form.

Well, I have done what I planned.
I started 20 cute slaves and began gathering data.

  • The gatherer seemed to stall from time to time, even though these are different IPs / machines. The script just sat there idle, doing nothing for seconds. This happened on some machines and not on others, and they did not receive a 429 or anything.
  • The main website was responsive, and some nodes were gathering super fast, so I have no idea why some IPs performed worse than others.
  • There was this error, even though I used a private network + MySQL to gather, and the MySQL node had maybe 5% CPU usage at most. So the server is fully ready to accept new connections, but some other issue is preventing the connection:
Connection failed! Please see output console
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
        at sun.reflect.GeneratedConstructorAccessor38.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
        at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:981)
        at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:339)
        at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2253)
        at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2286)
        at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2085)
        at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:795)
        at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:44)
        at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
        at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:400)
        at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:327)
        at java.sql.DriverManager.getConnection(DriverManager.java:664)
        at java.sql.DriverManager.getConnection(DriverManager.java:247)
        at com.ffxivcensus.gatherer.GathererController.openConnection(GathererController.java:604)
        at com.ffxivcensus.gatherer.GathererController.writeToDB(GathererController.java:467)
        at com.ffxivcensus.gatherer.Gatherer.run(Gatherer.java:38)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection timed out (Connection timed out)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:211)
        at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:298)
        ... 17 more

tl;dr: I don't think there is much we can do about the gathering speed, sadly. 👎

Just to confirm, were all of the slaves pausing at the same time, or were the times unique?

As for the MySQL error, that's interesting. I'm not sure what the cause would be there.

  • The pause was pretty much random.
  • The SQL too.

It is entirely possible that Vultr's private network implementation sucks and that causes the hiccups, though I doubt it. I mean, it's only what, a few bytes per transaction? A home internet connection could handle this, seriously... So yeah, I would vote for "alien powers at work" on this one.

Here is a cool Matrix-like video: https://youtu.be/jQNOLZ9ZjHg
In the video I only have 4 nodes open, but they all produce the same thing, totally randomly, all over the place.

I will try another thing. One big fat node with muscle and with threads. Threads and more threads.

Never mind, the internal IPs are all identical, so that may be causing some issues. (I took a snapshot and then deployed it over and over again.) I will correct the IPs and test again. (PuTTY/KiTTY could really use tabs; it's a pain to manage this many machines, lol.)

Report: I fixed up the IP configs, and not much changed. I also gave MySQL a lot more resources in its config, but nope, same issue. For some reason, if there are too many threads hammering MySQL, it makes the gatherer fart itself (see the exception copied above).

I am trying the big method now.

Seems like the big method is working. Will report back with numbers in a few hours. If you guys want, I can let this finish (1.5 days) and then simply upload the compressed DB. There should be no issues along the way; it seems stupid fast and stable.

Well, some more number pr0n:

  • Started generation about ~13 hours ago.
  • The generation is at ID 1,227,488, so the maximum rate for one client to collect is ~1 million IDs / 12 hours.
  • Pushing the thread count too high seems to make the Apache/nginx rate-limiting module clamp the rate down.

And I am sorry about spamming this ticket, I will make a new one for the original idea, but it was an interesting experiment.

By any chance are you still running yours? I'm seeing this: https://i.imgur.com/zAs5Lx9.png

If nothing else, it proves the global limit imposed at their end.

No @Pricetx, I stopped my run hours ago. It seems to vary by machine / IP / region / time of day and all those factors. Some machines were capable of maintaining a high number of threads; others got stuck at the limit right away.

So it's not me.

@Balls0fSteel ah, no worries. Being the beginning of the month there might be other people doing similar things for other projects. It should calm down later.

Thanks for your testing though, you've certainly pushed it harder than either of us have before. It's just a shame it mostly revealed that we're at the bottleneck already.

Well, my best idea so far is to:

  • Push the thread count high; even if requests get delayed, they return sooner or later. So even something like 5,000 can work and certainly doesn't eat too many resources. I used 30,000 though, which produced the numbers above (12 hours ≈ 1.3 million). Website-wise, yeah, someone could cry about "muh resources" and "muh costs", but as I said, I have run huge sites on cheap machines, and this many requests doesn't require that many resources (even if the site is complex or badly written).
  • Deploy machines. I used Vultr. I simply made a machine, added Java, MySQL and phpMyAdmin (I'm lazy, sue me), and set up the database config in the Java gatherer. I made a script that simply calls the jar with -s and -f, like: java -jar ~/blah.jar -s arg1 -f arg2
    I also made a text file where I split the IDs up by the million (so I know which node gets which portion). This could be further automatized (see the sketch after this list); I thought about setting the hostnames to 1, 2, 3, 4, whatever, and having the script simply assign the range by that.
    The cheapest $5 machine can do 1 million IDs in 24 hours, so 20 machines will only cost you a few bucks for 24-48 hours. Once they are done, you only have to export the SQL and upload it to a final machine where you generate the stats. As you have seen, the SQL runs into an exception when I try to use it non-locally. Maybe if a "while" retry was added, like for the 429s, it could eventually connect, and that way a central place could already gather all the data without this sewing process at the end.
    This way, whole database = 2 days.
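
As a throwaway illustration of the "split it up by node" step described above, here is a small Java sketch that prints one gatherer invocation per node. The -s / -f flags and blah.jar come from the post above; the node names, the 19,800,000 ceiling and the node count are placeholders, not project values.

// Splits the ID space into equal chunks and prints one command line per node.
public class RangeSplitter {
    public static void main(String[] args) {
        int lowest = 0;
        int highest = 19_800_000;   // roughly the highest ID mentioned earlier in this thread
        int nodes = 20;

        int chunk = (highest - lowest + nodes - 1) / nodes;   // ceiling division
        for (int n = 0; n < nodes; n++) {
            int start = lowest + n * chunk;
            int finish = Math.min(start + chunk - 1, highest);
            System.out.printf("node%02d: java -jar blah.jar -s %d -f %d%n", n + 1, start, finish);
        }
    }
}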

Well, once the generation is done, I will try a distributed run; it's just too much fun to pass up on. It's interesting how Java, SQL and the site behave.

Code-wise, there are a couple of things that stand out to me that could be optimised quite nicely to get us some raw improvements in speed:

1. Connection Pooling - In-Progress

It looks like we're currently opening and closing raw connections, which is generally considered expensive. I've started looking at adding in a very basic connection pool implementation, and initial tests locally show a 1000-character run going from 59s to 49s.
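
Purely as an illustration of the idea (the actual branch may use a different pool and different settings), a minimal sketch using HikariCP; the table and column names in the insert are placeholders, not the real schema.

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PooledWriter {
    private final HikariDataSource pool;

    public PooledWriter(String jdbcUrl, String user, String password) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);        // e.g. jdbc:mysql://localhost:3306/ffdb
        config.setUsername(user);
        config.setPassword(password);
        config.setMaximumPoolSize(16);     // a handful of connections shared by all worker threads
        this.pool = new HikariDataSource(config);
    }

    // Each write borrows a pooled connection instead of opening a fresh socket every time.
    public void writeCharacter(int id, String name) throws SQLException {
        String sql = "INSERT INTO tblplayers (id, name) VALUES (?, ?)";   // illustrative columns only
        try (Connection conn = pool.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, id);
            ps.setString(2, name);
            ps.executeUpdate();
        }   // try-with-resources returns the connection to the pool rather than closing it
    }

    public void shutdown() {
        pool.close();
    }
}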

2. Thread Pooling - In-Progress

Java provides a thread pool mechanism that should allow us to load up as many jobs as we want while internally re-using the same worker threads to run the actual gatherer jobs. The differences are marginal over a small run, but over time they add up.
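
Again only a sketch of the mechanism being described, using the standard ExecutorService; gatherCharacter() is a stand-in for the real Gatherer worker, and the numbers are illustrative.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GathererPool {
    public static void main(String[] args) throws InterruptedException {
        int threads = 64;   // matches the current hard limit mentioned above
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // Submit one job per character ID; the pool re-uses its worker threads instead of
        // spawning a new Thread per ID. In practice you would feed IDs in batches rather
        // than queueing millions of Runnables up front.
        for (int id = 0; id <= 1_000_000; id++) {
            final int characterId = id;
            pool.submit(() -> gatherCharacter(characterId));
        }

        pool.shutdown();                          // accept no new jobs, let the queue drain
        pool.awaitTermination(14, TimeUnit.DAYS); // generous upper bound for a full run
    }

    static void gatherCharacter(int id) {
        // stand-in for the real parse + database write
    }
}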

I'm working on a code refactoring branch at the moment where I've started to implement some of these already, so I'll push that into the main repo as I get it tidied up.

Beyond that, however, I think there's some changes to the general approach that could do with investigating:

  1. Don't re-parse dead characters - I think, based on SE's notices, that once a character is deleted, they're "gone-gone". That means we can probably skip any characters we've previously seen the 404 page for and save ourselves a bucket of time (a rough sketch follows below). That does somewhat imply we need a consistent DB that we keep over time 😄, and we also need to figure out how to separate dead characters from the end of the stack (probably through some trim operation at the end of the batch).
  2. Message queuing - One way to aid scalability would be to use a message queue (see: AWS Simple Queue Service), so that we could split the Gatherer into one part that creates jobs and another that does the actual work per character. An MQ would then be the glue between creating parse jobs and seeing them done. AWS Lambda looks like it may give us an alternative to this as well, but I need to read more and understand the mechanics there.
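
To make item 1 concrete, here is a hedged sketch of what the "skip known 404s" check could look like over JDBC; the tbldeleted table and its single id column are hypothetical, not part of the current schema.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DeadCharacterFilter {

    // Returns true if this ID already produced a 404 in a previous run and can be skipped.
    static boolean isKnownDead(Connection conn, int id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("SELECT 1 FROM tbldeleted WHERE id = ?")) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    // Record a freshly observed 404 so future runs never fetch this ID again.
    static void markDead(Connection conn, int id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("INSERT IGNORE INTO tbldeleted (id) VALUES (?)")) {
            ps.setInt(1, id);
            ps.executeUpdate();
        }
    }
}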

Things to explore from an architecture perspective. My main concern at the moment would be not overwhelming the Lodestone servers with too many simultaneous connections. We don't want the census to look like a DDoS attack to their servers, as that's just going to get us blocked. I need to read up more on how nginx identifies and tries to mitigate attacks, but it's also safe to assume that there will be firewalls in front of nginx which are likely to provide their own, and probably more aggressive, detection and prevention. I think we need a good balance between volume and efficiency here, so that we're only doing requests that matter and we can appropriately throttle ourselves so as not to take unfair advantage of the fact that SE allows us to "freely" access their DB through the Lodestone. Just my 2 cents 😄

Yeah, the original topic really went astray. I started a snippet when I brought it up, but then it all turned into performance talk and whatnot instead.

  1. Great to hear, sounds like a lot if scaled up!
  2. The biggest issue is that the clients cannot use a shared DB. If you start, say, 4 of them on a network and point them at one database server, it will just throw errors at you (see my posts above). I don't know what causes that. Maybe a while() loop with a random wait would work, like what the code already does when the site slaps a 429 in your face (sketched below).
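
For what it's worth, a sketch of that idea: retrying the JDBC connect with a random wait, much like the existing 429 handling. This is not the gatherer's actual code; the attempt count and wait range are arbitrary.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.ThreadLocalRandom;

public class RetryingConnector {

    // Retry the connect with a short randomised wait so a burst of slaves
    // doesn't hammer MySQL in lockstep.
    static Connection openWithRetry(String url, String user, String pass, int maxAttempts)
            throws SQLException, InterruptedException {
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return DriverManager.getConnection(url, user, pass);
            } catch (SQLException e) {   // CommunicationsException (seen above) is a SQLException
                last = e;
                Thread.sleep(ThreadLocalRandom.current().nextLong(500, 3000));
            }
        }
        throw last;   // give up after maxAttempts
    }
}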

Tbh there are three possibilities.

  1. They don't care and just allow us to do this, because there is virtually no cost to them. Maybe 0.00001 CPU cycles? WOW! I mean yeah, it's negligible.
  2. They care about someone making the entire DB available. But then it's not about the load, and the Census has been around for some time, so I doubt this is the case.
  3. They could simply make a DB dump available for you guys; really, it would take a DB guy a few minutes to cook up a query.

But hey, it's SE. They don't care about RMT, about bots, about selling, about anything. So I would bet 1st.
(I know this sounds harsh, but it's true. Sadly. I would be the happiest person if they cared one bit about players, bots, sellers, etc. But yeah, taking action costs $, and losing $ is awful!)

While this thread is currently being used for miscellaneous performance discussion I thought I would update you on the current live run:

We're 6 days and 18 hours into this month's run currently. It is running as a single 64-thread gatherer. It is just about to approach ID 8,800,000. Based on this I would estimate that it will finish scanning IDs around the 15th. From there I imagine it will take roughly half a day to generate the page from the data.

I will try to record when those pieces of the process happen. As it stands, any improvement on a little over 2 weeks is more than welcome.

Welcome to the official performance thread ;)

  1. I think the DB issues are due to resources not being shut down properly. When I first added the connection pool implementation, I locked up a single JVM simply by running more than 64 characters, and it turned out things just weren't being shut down properly. I'm going to run some tests when I get home, but I suspect that with everything closing down properly, that should sort out the multiple-instances issue (although that somewhat depends on the MySQL configuration as well - max connections et al.).

As another quick thought, I'm currently in Japan this week (hurrah!) and thought I'd try a loading test in Chrome. It turns out loading a character takes around 1s longer on the eu.finalfantasyxiv.com domain versus jp.finalfantasyxiv.com. The same goes for the character search, which was almost 2x as fast. That makes me wonder whether they've got actual replica sites running, versus the same site being available over a CDN. It might be worth running the same test elsewhere to better understand how much location impacts performance - there would be little gain hosting the gatherer in JP if we're hitting the EU URL 😄.

Also, did you say half a day to build the page?? O_o

Also, did you say half a day to build the page?? O_o

Yeah, the SQL queries take quite some time to run. PHP may look like the problem to anyone initially viewing the code, but in fact any language would experience the same bottleneck with queries.

The queries could likely be optimised, our DB choice could be at fault (MariaDB), and our database engine may need more tuning, but the issue will still occur.

@Pricetx Well, my idea still stands, albeit it may be a bit late. Use machines from Vultr or any other cloud provider, install SQL, add a copy of the Java gatherer, and fire it up. Just chop the range up by hand, e.g. two machines doing IDs 1-10 million and 10-20 million.

I mean... grabbing all the deps takes around 1 minute. I just copy-pasted my SQL password across several open SSH windows back then; it really took me only a minute to set things up.


sudo apt -y install apache2 mysql-server phpmyadmin php libapache2-mod-php php-mcrypt php-mysql openjdk-8-jdk-headless

Making a "one-in-all" tarball that contains a /var/lib/mysql, and a /root/ with the Gatherer + SQL set up is also helpful. I would add a script to /root/run.sh. It would invoke the .jar with 2 arguments for -f and -s. The SQL config could be already set in the gatherer files, as all nodes would use the same SQL pass/user anyway.

  1. You deploy X machines.
  2. You SSH into the machine with the key/pw, and execute the following command:

sudo apt update && sudo apt -y dist-upgrade && sudo apt -y install apache2 mysql-server phpmyadmin php libapache2-mod-php php-mcrypt php-mysql openjdk-8-jdk-headless && sudo service mysql stop && wget http://amazonawslink/pack.zip && tar xf pack.zip -C / && sudo sync && sudo reboot

A new deployment with this method would take maybe 10 minutes (regardless of node amount) anytime.

FAQ (not for you personally, but for people who may stumble upon this post):
Why separate nodes? Because nginx rate-limits by IP, so the more IPs you have, the faster you can go.

Why use SQL on all servers? If you check the previous comments, the gatherer can fail to insert data into a remote SQL. So this way you gather smaller packs and just stitch them together at the end.

Isn't this a DDoS? Not really. No one said you should spin up 10,000 machines on AWS and run this on them. Worst case, your provider gets a cease-and-desist email and your account is banned forever. Just use common sense.

Why didn't I link the tarball?! Well, I could, but I don't know what user/password you want for the SQL user and DB. I would deploy a new node, add phpMyAdmin, set things up, shut down MySQL with service mysql stop, and then create the tarball.

We should probably containerize the gatherer in docker, then we can deploy it in whatever configuration makes the most sense.

Then we could deploy 15/16 containers to AWS micros with a 1M range each, use RDS for the DB to cut out the extra VM for the database (RDS is priced based on data size, not uptime).

Thinking through the original problem (what's the "latest ID"?), I'm leaning towards either:

  1. We need a pre-process that searches for the highest ID, probably something fuzzy down to the nearest 1000 block
  2. Give the gatherer(s) a way to automatically stop themselves when they run out of verifiable characters (a rough stop-condition sketch follows below)

Either way, deploying multiple workers requires them to be able to coordinate. That can be done via the DB, but from experience it's a PITA to get working reliably.
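
As a rough sketch of option 2 (not actual project code), the workers could share a stop condition such as the one below. With many threads working out of order the "consecutive" count is only approximate, and the threshold simply echoes the "5,000 missing pages in a row" idea from earlier in the thread.

import java.util.concurrent.atomic.AtomicInteger;

// Shared stop condition: every worker reports hits and misses here, and all of them
// stop polling for new IDs once a long enough run of misses has been seen.
public class StopCondition {
    private final int threshold;
    private final AtomicInteger consecutiveMisses = new AtomicInteger(0);

    public StopCondition(int threshold) {
        this.threshold = threshold;   // e.g. 5,000 missing pages in a row
    }

    public void reportHit() {
        consecutiveMisses.set(0);     // any real character resets the run
    }

    public void reportMiss() {
        consecutiveMisses.incrementAndGet();
    }

    public boolean shouldStop() {
        return consecutiveMisses.get() >= threshold;
    }
}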

For the PHP timings, are you using the MyISAM or InnoDB engine? We should get a pretty significant boost from adding an index to the server name column - but you'll need InnoDB.

The rest is then query rewrites, but I'll raise a separate ticket for that. The technology (MySQL / MariaDB) isn't the problem here.

We should probably containerize the gatherer in docker, then we can deploy it in whatever configuration makes the most sense.

This is the notion I've been moving towards; I have the orchestration tools in place already.

@ReidWeb I looked at the PHP file, and what if it used threads? No need to over-complicate: just a thread per query, so if a host has more cores it would scale up. I am not a PHP guru though. Theoretically, this should speed up the PHP generation.

@ReidWeb I looked at the PHP file, and what if it used threads? No need to over-complicate: just a thread per query, so if a host has more cores it would scale up. I am not a PHP guru though. Theoretically, this should speed up the PHP generation.

@Balls0fSteel my knowledge of and experience with PHP is extremely limited, so I can't really comment on this myself; @Pricetx will be able to comment further.

Reflecting on @Pricetx's decision, it was a poor technology choice in terms of appropriateness for the problem once scaling up is considered, but the choice was made based on his experience - which is fair. Architectural decisions are always a balance of knowledge and experience vs the use case. A language such as Python, F#, Go or Node would likely be the appropriate technology choice as the problem stands today.

Can any discussions pertaining to the web page generation be logged and discussed against XIVStats/XIVStats? This repo pertains purely to the Java gatherer implementation.

For the PHP - go here.

For historical context, the PHP file you see today is just a modified version of the very first version of the site. It was originally written to be run only once, as I was curious to see what visualisations of the stats would look like. Quite literally no time was given to its design or optimisation, hence why it's just a single block of instructions that run sequentially. It could be rewritten in pretty much any language of your choosing, but the limitation is always going to be the database rather than the language doing the data manipulation anyway.

Multi-threading may provide some performance benefit, but I wouldn't expect to see a huge benefit unless the database is sharded, as I imagine the natural bottleneck would be a disk one.

The real performance gains would come from reducing the number of queries needed to generate all of the data. The cause of the slow-downs is the large number of count() queries, which are used to produce totals of player characters under a range of conditions. These count queries obviously have to read every single row and won't really benefit from indexing. However, if you can reduce the total number of queries run, then you will greatly reduce the total execution time.
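
To make that concrete, a single grouped query can replace a whole family of separate count() calls. A hedged JDBC sketch follows; tblplayers and grand_company appear elsewhere in this thread, while the realm column name is an assumption.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

public class GroupedCounts {

    // One full-table pass that returns every (realm, grand_company) count at once,
    // instead of issuing a separate COUNT(*) query per combination.
    static Map<String, Long> countByRealmAndGrandCompany(Connection conn) throws SQLException {
        Map<String, Long> counts = new HashMap<>();
        String sql = "SELECT realm, grand_company, COUNT(*) AS total "
                   + "FROM tblplayers GROUP BY realm, grand_company";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                counts.put(rs.getString("realm") + "/" + rs.getString("grand_company"),
                           rs.getLong("total"));
            }
        }
        return counts;
    }
}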

In terms of MySQL itself, it is running the InnoDB engine. Performance of the live process should be a bit quicker than the last successful run due to being on a newer server with faster I/O. There's not a lot of tuning that can happen however that will improve count() queries. The underlying filesystem is currently ZFS and I have tuned it for MySQL, so there shouldn't be any major performance losses there.

Yeah, I reckon some black-belt DB guru could make a query and DB structure that would result in a 10s php page generation (exaggerated, but yeah, never underestimate DB ninjas).

The threading is just the lazy man's method (that's me!).
By the way, if you guys use my method next time:

  • Generating 5 million IDs takes 4 days.
  • 2GB of RAM and 1 core is enough; literally the cheapest machine on Vultr.
  • Just use the tarball method I described; the container approach might not be worth the hassle.
  • When you are done, you have to export your DBs with (so you don't have CREATE/DROP):
    mysqldump --no-create-info -u root -p ffdb | gzip > ffdb1.sql.gz
  • Then you just copy all the databases you saved over to the processing machine (I used an Amazon EC2 instance, also piss cheap), and do:
    zcat /path/to/file.sql.gz | mysql -u 'root' -p ffdb

(In my case I had an "ffdb" database + user. I created the DB + user with these statements:
DROP DATABASE ffdb;
DROP USER 'ffdb'@'localhost';
CREATE USER 'ffdb'@'localhost' IDENTIFIED WITH mysql_native_password AS 'changeme';
GRANT USAGE ON *.* TO 'ffdb'@'localhost' REQUIRE NONE WITH MAX_QUERIES_PER_HOUR 0 MAX_CONNECTIONS_PER_HOUR 0 MAX_UPDATES_PER_HOUR 0 MAX_USER_CONNECTIONS 0;
CREATE DATABASE IF NOT EXISTS ffdb;
GRANT ALL PRIVILEGES ON ffdb.* TO ffdb@localhost IDENTIFIED BY 'changeme';
exit;)

Hope my bits helped.
So, 4 cheap Vultr machines for 4 days is about $6 overall. I would go with 6 machines, but even 4 is OK; 4 days for the entire generation is quite comfortable. The stitching together also goes fast; only the PHP generation takes time.

Generation of 5 million characters in 4 days sounds pretty similar to what we're currently seeing, but good to see that the distributed method works.

I am still noticing a pronounced increase in rate limit around the beginning of the month and on weekends. This seems to suggest the limit is a global one rather than a per IP one (there could of course be both).

I suspect some performance gain may come from being closer to the server, however: although retries usually occur within 1-10ms, it takes an RTT of at least 180ms or so for the application to even know there was a 429 in the first place, so by the time it tries again and receives a proper result, almost half a second could have passed. If you are able to perform comparison testing in Europe and Japan, I would be very interested to see how that turns out.

It must be both. At night I had fast rates on each node while I was testing, but also, if I increased the thread count way too much, I got hit. So it may well be both. :/

Heads up! I used my horrible math skills, and it looks like the final PHP generation would take almost 6-7 days with the old script, as opposed to only 30 minutes with the new pull request. So IMO fixing up the pull request on the PHP side is the utmost priority, as the gathering will be done soon, I reckon.

I tried to help over there, but I could only collect the reported errors, as I am no SQL/PHP wizard. (Oracle classes during uni were more than enough for my taste, lol.)

The request in question: XIVStats/XIVStats#21

Wow, that's quite the miscalculation!

Don't worry, the live run for this month is still set up to use the old PHP. I was going to test the improvements before swapping it in, but I think I'll leave it for this run.

Alright, just wanted to give you a heads-up. Maybe it is as Matthew says and they are just harmless errors, but my HTML missed a few servers here and there and whatnot, so I think it's something deeper than that. But yeah, the improvement is crucial; it reduces the generation time hugely.

Oh, sorry, I think I misinterpreted what you were saying. I thought you were saying your changes would actually take 6-7 days to execute, rather than 30 minutes. I'll still wait until we've definitely ironed out any data errors, but that's a huge improvement (although I don't think I've ever seen the existing PHP take more than about half a day).

Seriously, once you have the result with the new improved script, try it with the old one. Just start it in tmux, in the background. I think it will run for 6-7 days, 24/7; at least that's what the current data suggests on a high-frequency Xeon with Turbo.

From experience, the time it takes MySQL to work is more dependent on memory allocation than processor speed.

This is because InnoDB is able to load data into memory, and so a typical production system would have at least 1.5x data size allocated as memory for MySQL so that it can cache everything off disk and supercharge read times!

#FunFact

The test machine had 16GB of RAM and the script ate ~700MB per big query, so I don't know. Maybe a bigger machine would run the script faster?

Did you change the configuration of MySQL itself? There's a few config options you need to set to allow the service to use that 16GB in the first place - table_cache being the usual culprit, iirc. (Obligatory discussion thread reference).

Nope.

45 ffst localhost ffst Query 252608 Sending data SELECT grand_company,count(*) FROM tblplayers GROUP BY grand_company

That's the original script running on test data for days.
What should I modify?
In comparison, your script finishes in 30 minutes tops.

I'll try running the new one on my test data tonight if I get time. If it produces the same results as my existing test page, I'll put it live. Worst-case scenario, I've wasted 30 minutes before I set it to execute the old one instead.

September run has completed: https://ffxivcensus.com/

Thank you. Looks pretty bad all around for Europe. :/
There is only Ragnarok left as a viable choice (and a French and a German server).
It sucks that SE moved the NA servers. :( RIP.

Welp, this ticket can be closed, I guess; the new joined SQL queries make the generation super fast. As for the collection, the 4-server solution is fast enough IMO: it only takes 60 minutes to set up and then 4 days to run. Then you just need a local machine, or just one of the nodes, to import all the SQL (make sure you export each DB without "create"; it's an argument I posted above), and that's all. 5 days for the entire Lodestone is pretty darn quick.

Summary of updates from this thread:

  • Connection pooling has now been implemented
  • Thread pooling has now been implemented
  • The PHP has been re-designed so that it runs a minimal number of queries and builds all of the information it needs from the result sets; it now takes a matter of minutes rather than hours
  • We never did do anything fancier with the underlying infrastructure. If anything, it probably got a smaller resource pool when moving from a physical host to a VM. For cost reasons this is unlikely to change, but we're not closed to the idea

Thanks for your time and input everyone!