Note to self: Execution time is unnecessarily high
jacobwb opened this issue · 11 comments
Consider the following:
- XML parse time for 10 files: 0.0003 seconds.
- HashOver execution time with 10 XML files: 0.0025 seconds.
Now let's say there are 1,000 files:
HashOver's execution time would be 2.5 seconds, whereas the XML parse time is only 0.3 seconds. Parsing accounts for just 12% of the total (0.3 / 2.5), which means 88% of the execution time is HashOver itself. Considering HashOver is doing very little work to display the comments, and most of it is done in JavaScript in the client's web browser, this is unacceptable. It needs to be fixed before HashOver becomes like Disqus :-)
Memory usage is also high, with a script memory peak of 0.51 MB and a system memory peak of 0.75 MB for only 10 XML files. These numbers should be around 0.22 MB, and for 1,000 files, not 10.
Does it have to do with the way PHP parses XML files? I'm just trying to educate myself a bit more about PHP, and found this interesting tip, for example:
http://stackoverflow.com/a/1835324
If that even covers part of this problem.
I've looked into XMLReader, and various other methods of reading XML files. What I've found is that XMLReader is indeed faster, but only on large files. Since HashOver uses many small files, and uses everything in each file, XMLReader isn't really practical here, and isn't any faster on files this small either.
However, currently each file is read and parsed separately, meaning `simplexml_load_file()` is called multiple times. What I've done so far is change it to load every file into a single string and then call `simplexml_load_string()` once on that large string. This has improved performance a little, by about 17.64% it seems. So I may do the same thing with JSON.
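For anyone curious, here is a minimal sketch of that idea, not the actual HashOver code; the file path and the wrapping `<comments>` root element are assumptions. It reads every comment file into one string, wraps the lot in a single root, and parses everything with one `simplexml_load_string()` call:

```php
<?php
// Sketch only: combine many small comment files into one parse call.
// Paths and the <comments> wrapper are assumptions, not HashOver's code.
$files = glob('comments/*.xml');

$combined = '<comments>';

foreach ($files as $file) {
    $xml = file_get_contents($file);

    // Strip each file's XML declaration so the concatenated string stays well-formed
    $combined .= preg_replace('/<\?xml[^?]*\?>/', '', $xml);
}

$combined .= '</comments>';

// One parse call instead of one per file
$comments = simplexml_load_string($combined);

foreach ($comments->children() as $comment) {
    // Each child here is the root element of one original file
}
```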
That being said, 88% of the execution time is spent on HashOver's other functions, and that is the problem that really has to be addressed. XML is slightly slower to parse than JSON, and than some SQL configurations, but the difference between them is quite small. To demonstrate this I recently wrote this code: https://github.com/jacobwb/lorem-ipsum-article-generator
It turns out I did my math wrong. The execution time above is for 10,000 files, not 1,000. For 1,000 files, the XML parse time is 0.03 seconds and the total execution time is 0.25 seconds. The percentages are the same, and the memory usage is still high for only 10 files, so I won't close this just yet.
It seems to me that heavy disk input/output is often an explanation for slowness. Have you thought about storing all comments in only one file? You could concatenate the XML strings, one per comment, into a single file per page, so you could still easily add one.
It has been considered, but ultimately rejected. If all comments are in a single file, a single XML error or wrong file permission would bring down every comment, and it would make finding and fixing the XML error very difficult. Keeping comments in separate, individual files is one of the benefits of the XML storage.
I did not suggest using only one XML file, but rather concatenating the XML files (kept separate in memory) into one raw file to reduce disk I/O. Have you considered using PharData? http://php.net/manual/en/class.phardata.php
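A rough sketch of what that could look like, just to make the suggestion concrete; the archive name and entry names are made up, this is not proposed HashOver code:

```php
<?php
// Sketch: pack one page's comments into a single tar archive on disk,
// while each comment stays a separate XML entry inside it.
$archive = new PharData('page-1.comments.tar');

// Writing: store each comment's XML as an entry in the one archive file
$archive->addFromString('comment-1.xml', '<comment><name>Example</name></comment>');

// Reading: iterate the archive entries instead of opening many small files
foreach (new RecursiveIteratorIterator($archive) as $entry) {
    $comment = simplexml_load_string(file_get_contents($entry->getPathname()));
}
```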
Very interesting idea. While this may technically introduce the same issues I mentioned before (corrupt the archive file and you lose all the comments, or fail to update the archive file and new or edited comments won't appear), I would not be opposed to this feature for all of the flat-file storage formats (XML and JSON). Provided it actually improves performance.
You may have noticed the odd repository on my account named "lorem-ipsum-article-generator". That repository was created during a discussion I was having with some people on Google+ about flat-file storage format performance. The discussion involved running those tests in RAM to remove any possible performance bottlenecks from filesystem read/write, and if I remember correctly, the read/write from the hard drive wasn't very detrimental to performance.
Stupid question perhaps, but: why not JSON?
What is the need for storing all the data as XML files?
JSON encoding/decoding is built into PHP core, is incredibly reliable and fast, and the files themselves will be considerably smaller.
Am I missing something?
Is server-side parsing necessary at all? (Whether it's JSON or XML doesn't matter too much, IMO.) For a different project I render XML to HTML in the browser, both for tiny XML files and for consolidated, larger ones. See (oh, the server is down right now for another 2 hours) http://rec.mro.name/stations/b2/2016/08/10/
JSON is being considered, actually.
I would recommend you read the "JSON" section of the "Data storage format" issue.
While it's true that JSON encoding/decoding is built into PHP core, so is XML. Also, the performance difference isn't very significant, especially compared to a database; in fact, JSON performs worse than XML when writing files. Here is some code I wrote that tests the read/write performance of a few different data formats: lorem-ipsum-article-generator
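For reference, a stripped-down sketch of the kind of read/write comparison that code performs. This is not the actual test code; the sample data and file names are made up:

```php
<?php
// Micro-benchmark sketch: time a full encode/write/read/decode cycle
// for JSON and XML on the same sample data.
$data = array('name' => 'Example User', 'body' => str_repeat('Lorem ipsum. ', 200));

// JSON: encode, write, read back, decode
$start = microtime(true);
file_put_contents('test.json', json_encode($data));
json_decode(file_get_contents('test.json'), true);
$jsonTime = microtime(true) - $start;

// XML: build, write, read back, parse
$start = microtime(true);
$xml = new SimpleXMLElement('<comment/>');
$xml->addChild('name', $data['name']);
$xml->addChild('body', $data['body']);
$xml->asXML('test.xml');
simplexml_load_file('test.xml');
$xmlTime = microtime(true) - $start;

printf("JSON: %.6fs, XML: %.6fs\n", $jsonTime, $xmlTime);
```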
The main fault with JSON is that it isn't easily readable by a human, whereas XML is, which is why XML is the current default format. By "easily readable by a human" I mean that if you ever need to edit a comment file manually, it would be quite difficult, since the entire comment, perhaps spanning multiple paragraphs, would be on a single line (with or without pretty print), and this is very inconvenient.
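A throwaway example of that readability point, not HashOver code: even with `JSON_PRETTY_PRINT`, a multi-paragraph comment body stays on one line, because JSON strings must escape newlines.

```php
<?php
// Even pretty-printed JSON keeps the whole body on a single line.
$comment = array(
    'name' => 'Example User',
    'body' => "First paragraph of a long comment.\n\nSecond paragraph..."
);

echo json_encode($comment, JSON_PRETTY_PRINT);
// {
//     "name": "Example User",
//     "body": "First paragraph of a long comment.\n\nSecond paragraph..."
// }
```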
Yes and no.
Technically speaking the XML files could indeed be used as-is.
However, where enabled, the comment files will store sensitive user information, such as e-mail addresses and passwords. For this reason the files cannot simply be displayed as-is on a web page, or transmitted asynchronously via XMLHttpRequest in JavaScript; some parsing is required to remove this information before the comments are sent to visitors.
Specifically, HashOver uses the files to generate the HTML necessary to display the comments in a structured and style-able way, as well as to load and display avatars, remotely hosted images, etc. The parsing that is done is to display the comments in a user-friendly way; I doubt this could be achieved by simply displaying the XML files themselves.
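Roughly speaking, the sanitization step amounts to something like the following sketch. This is not HashOver's actual code; the file path and element names ("email", "password") are assumptions for illustration only:

```php
<?php
// Sketch: load a comment file, drop sensitive fields, and hand only
// the safe data to the client-side JavaScript.
$comment = simplexml_load_file('comments/1.xml');

// Strip fields that must never reach a visitor's browser
unset($comment->email);
unset($comment->password);

// Send only the sanitized data to the client
echo json_encode(array(
    'name' => (string) $comment->name,
    'date' => (string) $comment->date,
    'body' => (string) $comment->body
));
```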