Incredibly abysmal performance.
amcgregor opened this issue · 10 comments
import textile
src = u'This @example@ is "neither":/defn/neither complete *nor* trite, *though _simple_*.'
bigsrc = src * 100 # Amplify any issues.
%timeit textile.textile(src)
#1000 loops, best of 3: 1.4 ms per loop
This doesn't seem bad. Until you compare it to my experimental parser, which is almost 3x faster.
from marrow.texting.engine import Parser
%timeit Parser(src).render()
#1000 loops, best of 3: 510 µs per loop
The output of the engines on this sample is nearly identical: (only difference is an inexplicable leading tab in the python-textile
version)
textile.textile(src).strip() == Parser(src).render() # True
It gets worse when using the 100x larger bigsrc
value, but it gets worse in a very unusual way:
%timeit Parser(bigsrc).render()
#10 loops, best of 3: 39.3 ms per loop
%timeit textile.textile(bigsrc)
#1 loop, best of 3: 6.18 s per loop
That's six seconds, or 4285x slower to process 100x as much text. (Vs. 28x slower to process 100x as much text, which is actually an increase in efficiency and performance vs. the smaller source.) This is baffling.
Important Note: I do not mean this ticket to be a spammy advertisement for my own package. My package was an experiment in parser design five years ago and not suitable for any real use.
I'm in the process of re-organizing the entire in the hope of making the code easier to follow in the short run and hopefully easier to optimize in the long run. It was always structured almost identically to the original php version, and I'm not a fan of the "god object" style employed there. I have little doubt that this project suffers from some needless cruft due to the fact that it was built like a line-by-line translation from php. (It's like reading a book originally written in another language that's been poorly translated.)
I appreciate your explanation into the reason for the current state of affairs, which is enlightening.
While marrow.texting it is not what I might consider production-ready yet (which is subjective; it's simply missing a full test suite) it is an example processing architecture (avoiding regular expression use) that has clear performance benefits over the current one and generates "identical" output. Ignoring that weird leading tab, again. (I recall that back then I wanted to get it as close to real textile output as possible; I don't fully recall how far I got, but do recall difficulty with footnotes.)
I hope there are bits in my own experiments that can help. It's MIT licensed, which is liberal to the extreme; do not hesitate to use anything you see, either code-wise or architecture-wise, and I'm in ##python-friendly
on Freenode if you ever have any questions.
This whole thing cropped up today as an extension of BBCode rendering for forums on IRC, and when I ran the timeit tests against various renderers, I checked for existing tickets, found none, so opened one.
I don't think the bug's been fixed yet, though.
Edited to add: here's the Gist I dug up that was the problem solving scratchboard for the parser itself, separate from the rendering bits. It's a bit more digestible than the more complete version which integrates rendering from another package. This Gist version will be less complete, but also potentially (substantially) faster still.
I'll just add a minor note that projects at work account for just over 1% of this package's weekly downloads. (Yay, regular deploy cycle.) The abysmal performance isn't just theoretical, it also exhibits itself when rendering job descriptions in our job tracking / distribution system, for example; this is a real problem, affecting real projects. (Try taking a two page Word document, rendered in terrible HTML, and pass it through the Textile renderer.)
Sorry, I have to weigh in here.
- "Incredibly abysmal performance" is not a constructive criticism, it's nonspecific trash talk.
- "I made a better version of this" is also not constructive, and is extremely obnoxious.
- Accounting for 1% of a project's downloads does not make you a big cheese.
- Claiming you have "real" projects impacted by this is insulting to the other people relying on this software (often for equally "real" projects).
- You are not the maintainer's manager.
- Even if you were, it would be tremendously unprofessional to bring this level of rudeness and entitlement to a project review.
I first started using this project before ikirudennis took over, and it was very primitive and years behind the PHP spec. Since then it's improved dramatically, and we rely on it heavily at work.
I've seen a number of projects wither and die when the maintainers decided it wasn't worth the trouble to keep working on them. If you are haranguing and insulting the maintainer, possibly making them question whether it's worth continuing to work on this project, that's a pretty serious issue for me.
So yes, to rephrase your (apparently now retracted) earlier comment: please bring constructive feedback and a helpful attitude or GTFO.
Well, there goes the pleasant discussion Dennis and I had on IRC. Great job @aclark73.
"it's nonspecific trash talk."
Actually, the timeit tests pretty clearly confirm the initial statement. Performance is abysmal. And I had proof. A simple troll wouldn't have gone to the effort of actually running the bloody tests, and prettily Markdown formatting the result for a ticket.
"is also not constructive"
So, providing a complete alternate approach that is provably better is not constructive? "You're doing it wrong, here's a better way to do it." That's an interesting planet you live on. Regular expressions as used by the current engine do not work efficiently, and performance is geometrically worse the larger the input data, as again demonstrated in the timeit tests of the initial ticket.
"does not make you a big cheese"
From the discussion on IRC, the actual maintainer apparently had no idea that this package was used for anything serious. That the problem is "real" was a surprise, and partially explained his initial negative reaction. From IRC:
dburke: I had no idea anyone used this thing for processing documents.
That… did go a fair distance in explaining the initial "this ticket must be a troll" response.
"You are not the maintainer's manager."
Literally: what? If I were the maintainer I would have swapped out the underlying engine five years ago, since writing Python code to pretend to be PHP code is… an initial porting approach, but not an efficient one long-term. You aren't the maintainer's manager, either, not sure how this simple pair of facts has any bearing on the problem at hand. In this situation the library is the vendor, and I'm the customer. I could only imagine how well this type of conversation would go over:
"Hi, I just bought XYZ but either it's crashing on startup or it's just taking way too long."
"Pull request or GTFO."
"Uh… what? It's, like, taking minutes to start."
"We don't believe you, and unless you can fix the problem for us, we won't believe you."
A customer service failure is rarely the fault of the customer, and presentation does not invalidate the issue being raised.
"tremendously unprofessional to bring this level of rudeness"
Welcome to the internet. One does not have the right to not be offended. On the other hand, assuming any negative post is immediately a troll and should be casually dismissed is similarly rather rude. Not every user has the time, inclination, or ability to submit a pull request or properly diagnose the source of a problem for you, and pointing out that a performance problem exists when there was no ticket with any indication that anyone was aware of a problem is better than remaining silent and continuing with the status-quo, no? I mean, this performance problem has been a problem for at least five years and has only gotten worse (in our IRC discussions we identified that develop branch is ~1.57x slower than master on the same test).
We were both in the wrong with our tones, which is why we both mutually agreed to edit our comments.
"I first started using this project before ikirudennis took over"
As did I. Had the amusing note that my experiment five years ago was also before the current maintainership. I wish I had filed the ticket then instead of forgetting about the problem for so long, despite the fact that this problem alone is the only reason we have a data-layer caching system in our applications at work.
@aclark73 You, on the other hand aren't helping at all. Imagine the profiling test you could have run in the time it took you to write that comment. The hypocrisy is too stronk.
@ikirudennis Mind if I link the IRC logs with my two at-work links redacted? (Might be useful anyway as a record of ticket follow-up.)
(Heh, love Github's new Facebook-style responses, too.)
Firstly, I have some things to say on the conversation here, aside from the issue itself:
@amcgregor, you can post a link to whatever you like. Just know you're running the risk of me locking the conversation on this issue. Yes, this is the internet, and while I don't have a right to not be offended, I do have some means of limiting your ability to purposefully offend me. I had hoped that github was better than the internet at large. Let's all do our part to keep things civil.
For what it's worth, I was feeling rather disheartened by this issue and your subsequent follow-ups, and the IRC conversation didn't do much to change that. I would like to thank @aclark73 for his input as it bolsters my point: the way you reported this issue comes across as a seagull manager. What sort of response were you expecting on this issue?
I didn't mean to offend by closing the issue, but I thought I addressed the fact that I'm aware that this project is not particularly efficient and that it's something I'd like to improve. It seemed to me, every project on github would have a "make it faster" open ticket which never closes. Why keep an open ticket for a situation which is always the goal? @amcgregor, I was considering asking you to open a new ticket which sticks to just the facts: "python-textile is not speedy nor is it efficient. What can be done to improve it?" But would that change anything?
Since this is an open source free project, which I use only minimally at work, I don't have a ton of time to devote to it. I apologize for that. Perhaps this project deserves better. And perhaps I'd even accept some seagull management if I were getting paid for this work. But there's no one else working on this, and no one's paying me.
Finally, about the actual issue:
@amcgregor, as I mentioned on IRC, I do apologize that this project has forced you to increase the complexity of your system. You mention that HTML generated from a Word document is particularly troublesome. Firstly, an example document would be IMMENSELY helpful for a project maintainer to test against. I don't have Word, and can't remember the last time I used it. (I am aware of how ugly Word's HTML is, I just never imagined someone might use textile to clean it up. Perhaps there's even a better way to clean up the HTML first before processing it with textile.) Out of curiosity, how does marrow.texting compare to python-textile in a race with such a document?
Also, I tried pointing out on IRC that your example is flawed. It generates a long string, with punctuation which does not resemble what would be considered typical input. My initial thought was to just add a space to the end of src
, but that didn't affect much. I tested again with two newlines instead, and that brought the time down from about 20s to 487ms (my computer wasn't getting anywhere near your speeds :| ). Yes, there are issues with python-textile and its processing speed. Textile is intended for processing plaintext input. If you're feeding it ugly HTML and complaining about the time it takes to process it, it kind of boils does to "Doc, it hurts when I do this." Don't be surprised when the response is, "Don't do that."
Let's all do our part to keep things civil.
That using a word for its literal meaning (abysmal: extremely bad, appalling) and providing evidence to confirm the fact is offensive is incomprehensible to me excepting misinterpretation as troll, which it is not. That the method of the issue being raised is discussed more fervently than the actual issue is unfortunate.
The way you reported this issue comes across as a seagull manager.
You're not my employee, I'm your customer. A customer with a problem with your product that has demonstrated harm when used. That's a far better way of visualizing this situation. Customer service situations are somewhat different, no?
What sort of response were you expecting on this issue?
Your suggestion on IRC to try testing against the master
branch was an excellent point and added an additional data point to the problem. It also offers a simpler method to diff
the versions to see what change would result in that substantial of a further negative impact. The idea that the test case is artificial and unrealistic is also a very valid point of discussion, but ancillary to the underlying mechanisms scaling badly.
I didn't mean to offend by closing the issue… every project on github would have a "make it faster" open ticket which never closes. Why keep an open ticket for a situation which is always the goal?
Offended? It's your project. Surprised, is all.
Because "go faster" isn't the goal. Not scaling geometrically is. 100x slower for 100x more text is not too unreasonable. 1000x slower for 100x more text would indicate a ten-to-one complexity amplification problem. Being more than 4000x slower on the larger test indicates there's something seriously wrong.
If going faster were the goal, sure, it'd be a forever-ticket. But when you have a test case so clearly illustrating a problem the ticket has an explicit "victory condition" for closure. (That is, achieving something approaching 1:1 scaling.) You mentioned an in-progress cleanup. Potentially the completion of that large update would warrant closing of the ticket… if it actually solves the problem.
I was considering asking you to open a new ticket which sticks to just the facts: "python-textile is not speedy nor is it efficient. What can be done to improve it?" But would that change anything?
Am I unclear on the facts in my first two comments? "Here is a test case illustrating the problem, here is an alternate text processing algorithmic structure which doesn't suffer this problem as an example of a better way to do this." I agree, everything after that point can be ignored or deleted, but there's no need to create a new ticket. If there's anything offensive remaining in my first two comments I won't hesitate to correct any problem that can actually be pointed at.
Since this is an open source free project, which I use only minimally at work, I don't have a ton of time to devote to it. I apologize for that.
No worries; I understand. If I didn't have my own obligations I would have done more than just submit timeit tests. (Wiring up a function-annotated line profiler is somewhat tedious.) Interestingly, from the original discussion that lead to running the test in the first place, here's the bbcode
result, providing another alternate approach which is a hybrid of the two. (It uses a few regular expressions, but also stream processes the text as text. Don't casually dismiss this example because the syntaxes or capabilities are different; I'm mentioning it, like my experiment, only for the text processing approach.)
import bbcode
bbsrc = 'This [code]example[/code] is [url=/defn/neither]neither[/url] complete [b]nor[/b] trite, [b]though [i]simple[/i][/b]. '
bbbig = bbsrc * 100
%timeit bbcode.render_html(bbbig)
# 10 loops, best of 3: 62.7 ms per loop
Perhaps this project deserves better. And perhaps I'd even accept some seagull management if I were getting paid for this work. But there's no one else working on this, and no one's paying me.
I maintain around 40 open source projects, without being paid for it, and this excuse for issue dismissal has never occurred to me.
I do apologize that this project has forced you to increase the complexity of your system. You mention that HTML generated from a Word document is particularly troublesome. Firstly, an example document would be IMMENSELY helpful for a project maintainer to test against.
Unfortunately all of the examples of truly troublesome documents (60 second+ rendering) I could provide are confidential client job postings which I can't really share. (Not sure my own clients would want to be associated with content that funky, either. ;) Due to the content, it would be extremely difficult to anonymize, but I'll see what I can do. I have the main application from work currently searching for the true worst case and turning my laptop into a space heater at the moment.
That's why I provided a simplified (medium terribleness) test case, though. To expand it, take bigsrc = (src * 6 + "\n\n") * 12
to simulate a 12-paragraph heavily formatted document.
Out of curiosity, how does marrow.texting compare to python-textile in a race with such a document?
Once the worst-case search is complete I'll run the offender through texting to see what happens. I doubt it'll be pretty. ;) Importantly, though, I'm running this content through textile not to clean up the HTML, but to allow our clients to post simplified job postings (like the second link I gave over IRC) easily or use full HTML, or use any combination of the two. It's the people copying and pasting from Word that throw the wrench into the works.
Also, I tried pointing out on IRC that your example is flawed. It generates a long string, with punctuation which does not resemble what would be considered typical input.
Unfortunately that basically doesn't matter. The performance issue exists in both the simplified test case and when using "real data".
Textile is intended for processing plaintext input.
Uhm… then why does the format actually explicitly mention free mixing of textile and HTML? That is a key design feature that encouraged me to begin using this package so many years ago.
If you're feeding it ugly HTML and complaining about the time it takes to process it, it kind of boils does to "Doc, it hurts when I do this." Don't be surprised when the response is, "Don't do that."
Okay then. Just wow. Another excuse I've never used in response to a user's issue when that user has proof in hand. (And I've successfully used BOFH excuse #144: solar flares!)
If this ticket remains unlocked I'll add another comment with the "realistic worst case" result, and I'll try hard to clean up the source for sharing.
So, having regenerated 1/4 of the job data (roughly 111,000 records) so far just under 1% have multi-second generation times. One in particular has a 31.37 second generation time, and as you noted, my laptop is unusually speedy–this is one that would exhibit as greater than 60 seconds in production, if we didn't have a cache which reduces it to around 4ms. (At the expense of needing to store all recently accessed rendered versions.)
What's really interesting is that note about newlines you mentioned. This particular copy/paste from Word has no newlines and totals around 32KiB of HTML. That's a highly unusual case! User data, am I right? ;)
Replacing </p>
with </p>\n
in the source and re-running results in a much, much more acceptable time: 86ms! Huzzah, the real bug has been tracked down. Whatever processing is being done within the context of a single line is the specific code that needs to be rewritten. Conveniently, that's precisely what that gist experiment is for, since the gist version doesn't handle block-level concerns like texting does.
So… the original "unrealistic test case" is, in fact, the test case for the actual bug.
Sample code: (warning: Word will make you gouge out your eyes)
<font size="3" face="Times New Roman"><br><br></font><p style="margin: 0in 0in 0pt; mso-pagination: none; tab-stops: -1.0in -.5in 0in 1.25in 256.5pt 310.5pt 6.5in 7.0in 7.5in 8.0in 8.5in 9.0in 9.5in 10.0in 10.5in 11.0in 11.5in 12.0in 12.5in;" class="MsoNormal"><span style='font-family: "CG Times","serif"; font-size: 11pt; layout-grid-mode: line; mso-bidi-font-size: 10.0pt; mso-bidi-font-style: italic;' lang="EN-CA">The primary duty of the </span><span style='font-family: "CG Times","serif"; font-size: 11pt; layout-grid-mode: line; mso-bidi-font-size: 10.0pt;' lang="EN-CA">JOB TITLE HERE <span style="mso-bidi-font-style: italic;"> is to words and more words and more words and more words<br>even more words more words more words more words more words more words<br>some words more words even more words<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></span></span></p><font size="3" face="Times New Roman"><br><br></font><p style="margin: 0in 0in 0pt;" class="MsoNormal"><span style='font-family: "Arial","sans-serif"; font-size: 4pt; mso-bidi-font-size: 12.0pt;' lang="EN-CA"><o:p> </o:p></span></p><font size="3" face="Times New Roman"><br><br></font><p style="background: rgb(224, 224, 224); margin: 0in 0in 0pt; text-align: center;" class="MsoNormal" align="center"><strong><span style='font-family: "Arial","sans-serif"; font-size: 11pt; mso-bidi-font-size: 12.0pt;' lang="EN-CA">A HEADING OR SOMETHING</span></strong><span style='color: rgb(0, 102, 153); font-family: "Arial","sans-serif"; font-size: 10pt;' lang="EN-CA"><br></span><span style='font-family: "Arial","sans-serif"; font-size: 10pt; mso-bidi-font-size: 12.0pt;' lang="EN-CA"><span style="mso-tab-count: 1;"> </span></span><strong><span style='font-family: "Arial","sans-serif"; font-size: 11pt; mso-bidi-font-size: 12.0pt;' lang="EN-CA"><o:p></o:p></span></strong></p><font size="3" face="Times New Roman"><br><br></font><p style="margin: 0in 0in 0pt 0.25in; text-indent: -0.25in; mso-pagination: none; tab-stops: list .25in; mso-list: l0 level1 lfo1;" class="MsoNormal"><span style="font-family: Symbol; font-size: 8pt; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;" lang="EN-CA"><span style="mso-list: Ignore;">·<span style='font: 7pt/normal "Times New Roman"; font-size-adjust: none; font-stretch: normal;'> <br></span></span></span><span style="font-size: 11pt; mso-bidi-font-size: 12.0pt;" lang="EN-CA"><font face="Times New Roman">A bullet point because people typing a dot<br>and a space apparently makes a list<o:p></o:p></font></span></p><font size="3" face="Times New Roman"><br><br></font><p style="margin: 0in 0in 0pt; mso-pagination: none;" class="MsoNormal"><span style="font-size: 11pt; mso-bidi-font-size: 12.0pt;" lang="EN-CA"><o:p><span style="font-family: Times New Roman;"> </span></o:p></span></p><font size="3" face="Times New Roman"><br><br></font><p style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-pagination: none; tab-stops: list .5in; mso-list: l0 level2 lfo1;" class="MsoNormal"><span style='font-family: "CG Times","serif"; font-size: 11pt; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: "CG Times"; mso-bidi-font-family: "CG Times";' lang="EN-CA"><span style="mso-list: Ignore;">-<span style='font: 7pt/normal "Times New Roman"; font-size-adjust: none; font-stretch: normal;'> <br></span></span></span><font face="Times New Roman"><span style="font-size: 11pt; mso-bidi-font-size: 12.0pt; mso-bidi-font-style: italic;" lang="EN-CA">Whoever originally created<br>this document will be the first against the wall when the revolution comes<br></span><span style="font-size: 11pt; mso-bidi-font-size: 12.0pt;" lang="EN-CA"><o:p></o:p></span></font></p><font size="3" face="Times New Roman"><br><br></font><p style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-pagination: none; tab-stops: list .5in; mso-list: l0 level2 lfo1;" class="MsoNormal"><span style='font-family: "CG Times","serif"; font-size: 11pt; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: "CG Times"; mso-bidi-font-family: "CG Times";' lang="EN-CA"><span style="mso-list: Ignore;">-<span style='font: 7pt/normal "Times New Roman"; font-size-adjust: none; font-stretch: normal;'> <br></span></span></span><span style="font-size: 11pt; mso-bidi-font-size: 12.0pt;" lang="EN-CA"><font face="Times New Roman">Even more "bullet points"<br> and text;<o:p></o:p></font></span></p><font size="3" face="Times New Roman"><br><br></font><p style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-pagination: none; tab-stops: list .5in; mso-list: l0 level2 lfo1;" class="MsoNormal"><span style='font-family: "CG Times","serif"; font-size: 11pt; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: "CG Times"; mso-bidi-font-family: "CG Times";' lang="EN-CA"><span style="mso-list: Ignore;">-<span style='font: 7pt/normal "Times New Roman"; font-size-adjust: none; font-stretch: normal;'> <br></span></span></span><span style="font-size: 11pt; mso-bidi-font-size: 12.0pt;" lang="EN-CA"><font face="Times New Roman">Microsoft, you make<br>my eyes bleed;<o:p></o:p></font></span></p><font size="3" face="Times New Roman"><br><br></font><p style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-pagination: none; tab-stops: list .5in; mso-list: l0 level2 lfo1;" class="MsoNormal"><span style='font-family: "CG Times","serif"; font-size: 11pt; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: "CG Times"; mso-bidi-font-family: "CG Times";' lang="EN-CA"><span style="mso-list: Ignore;">-<span style='font: 7pt/normal "Times New Roman"; font-size-adjust: none; font-stretch: normal;'> <br></span></span></span><span style="font-size: 11pt; mso-bidi-font-size: 12.0pt;" lang="EN-CA"><font face="Times New Roman">Job responsibilities would be going here<br>instead of this replacement <br>text;<o:p></o:p></font></span></p><font size="3" face="Times New Roman"><br><br></font><p style="margin: 0in 0in 0pt 0.5in; text-indent: -0.25in; mso-pagination: none; tab-stops: list .5in; mso-list: l0 level2 lfo1;" class="MsoNormal"><span style='font-family: "CG Times","serif"; font-size: 11pt; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: "CG Times"; mso-bidi-font-family: "CG Times";' lang="EN-CA"><span style="mso-list: Ignore;">-<span style='font: 7pt/normal "Times New Roman"; font-size-adjust: none; font-stretch: normal;'> <br></span></span></span><span style="font-size: 11pt; mso-bidi-font-size: 12.0pt;" lang="EN-CA"><span style="font-family: Times New Roman;">So much code, so little content;<o:p></o:p></span></span></p>…
Phew, that actually came across as a single line. I redacted the actual content with ostensibly pithy quips instead. Kinda stupidly hard to tell if it's properly redacted, though. On the original content, as a single line, texting takes 33.7ms.
Now we've got an actual issue with something to sink my teeth into.
As I mentioned earlier, perhaps cleaning up the html first will help speed things up. I found https://github.com/mozilla/bleach which seems to be a good fit. With it, you can easily define acceptable and unacceptable markup in the HTML. It might be an extra step in the process, but I'd be surprised if this didn't provide a marked improvement. I don't have the time to run tests right now, sorry. (I've got my priorities: Game of Thrones comes first, dammit.)
Yes, textile can process embedded html, but that's not its primary purpose. Feeding the system incredibly abysmal data resulted in incredibly abysmal performance. Garbage in, garbage out. (That was my point earlier with the doctor and patient.) We've established that it does not scale well, but that is largely due to having to process very long lines of data. It splits the input text by lines and processes each line. If a line contains html tags, it runs the contents of each tag through the system. And that's just how textile is meant to run. If it didn't do that, it wouldn't match the spec. ¯_(ツ)_/¯
Now... I was tempted to lock the conversation, but I gave you one last chance and it paid off. You dialed down the snark, provided some actual data, and we're finally having a productive conversation. That's where this issue report actually starts being an issue report. Please be aware of your tone when reporting issues. If you have specific data relevant to the discussion, don't wait until your third reply to even mention it. I'm happy to help if I can, but wading through the snark to get there isn't fun. And apparently, I wasn't the only one who found the the initial report and its replies unhelpful and combative.
The real hero here is @aclark73. Because he chimed in, you now have a means to test improvements to your system. I've thanked him before, but I'll thank him again. Thanks, man. I owe you one. 👏 👍
Up front: feel free to skip this if you don't want to "wade through the snark", as I'm continuing to reply to various bits of minutiae which have nothing to do with the problem with the code.
Except for the fact that the original test case was more than sufficient to help identify the problem and included literally no snark, sure. You almost made the newline-as-the-issue identification during our IRC conversation, so the extra effort to convince you there was actually a problem is effectively wasted time, but it's cold out and I needed the space heater.
I'm aware of solutions like Bleach, and yes, we do run some sanitization on user input. We must balance our sanitization routines against the fact that we accept data from a number of unusual sources including automated systems (mmm, IBM XML is atrocious) and can't simply throw away "bad user input" while expecting to keep the client. Law 36: Practicality beats purity.
Absolutely none of this "it hurts when I do this, so don't do it" nonsense is helpful in the face of a clear fault in your codebase. But, the attitudes and philosophies demonstrated are excellent encouragement to look for other solutions.
Please be aware of your tone when reporting issues.
Re-read the first post. Please be aware of your tone when responding to issues; you wouldn't have edited your reply after, as I edited my reply, if we didn't agree on this point, no? If I had that data at the time, it would have been provided. Amongst my circle of peers, I'm not the only one who found the initial response baffling, either.
Because he chimed in, you now have a means to test improvements to your system.
I don't even know what you are referring to. ¯_(ツ)_/¯
I'm done with this ticket. I do sincerely hope the codebase improves for those who continue using it.