sotorrent/db-scripts

Improve escaping/unescaping newline characters

sbaltes opened this issue · 2 comments

We currently use the following regular expression to replace the escaped newline characters present in file PostHistory.xml, which is part of the official Stack Overflow data dump:

((?:
|
)?
)

The head of PostHistory.xml looks like this:

[Screenshot: 2020-06-11 12_47_07-Window]

In some cases, this may break posts containing the character sequence &#xD;&#xA;.
One example is this post; others can be found using Stack Overflow's search feature.

The &#xD;&#xA; sequences themselves are escaped within the posts:
[image]
We have to make sure that those sequences are preserved while the newlines are replaced.
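
To illustrate the requirement, here is a minimal Python sketch. It assumes that newlines in the dump are encoded as the character references &#xD;&#xA; and that a literal occurrence inside a post is stored with its ampersands escaped as &amp;. The names `decode_field` and `decode_field_wrong_order` are hypothetical and not taken from the SOTorrent scripts; only `&amp;` is handled, other entities are ignored.

```python
import re

# Assumption: newline markers in the dump are these character references.
NEWLINE_MARKER = re.compile(r"&#xD;&#xA;|&#xD;|&#xA;")


def decode_field(raw: str) -> str:
    """Replace newline markers first, then resolve '&amp;' (order matters)."""
    text = NEWLINE_MARKER.sub("\n", raw)
    return text.replace("&amp;", "&")


def decode_field_wrong_order(raw: str) -> str:
    """Resolving '&amp;' first turns literal sequences into newline markers."""
    text = raw.replace("&amp;", "&")
    return NEWLINE_MARKER.sub("\n", text)


# One real line break, followed by a literal (escaped) sequence in the post text.
raw = "line one&#xD;&#xA;use &amp;#xD;&amp;#xA; in XML"

print(repr(decode_field(raw)))              # literal sequence is preserved
print(repr(decode_field_wrong_order(raw)))  # literal sequence becomes a newline
```

The sketch only demonstrates that the newline markers must be replaced before `&amp;` is resolved; it is not a drop-in replacement for the regular expression used in the scripts.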

We use the same character sequence when exporting the SOTorrent dataset versions, thus our export and import scripts are also affected.

Thanks @laitingsheng for pointing this out to me.

Actually, I am wondering whether you could replace every & in the raw text with &amp;, which works like an escape sequence in HTML, but only for &. That way, you can guarantee that &#xD; or &#xA; only appear where you append them to the output. All original &#xD; and &#xA; in the raw text will be escaped to &amp;#xD; and &amp;#xA;, respectively.
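
A rough Python sketch of this suggestion, for the export side, might look as follows. The function names `escape_field` and `unescape_field` are hypothetical, and the choice of &#xD;&#xA; as the newline marker is an assumption based on the discussion above. CRLF and lone CR are normalised to LF before export, so the round trip is only exact for text that uses LF line endings.

```python
def escape_field(text: str) -> str:
    """Escape '&' first, then encode line breaks as newline markers."""
    # After this step, no unescaped '&#xD;' or '&#xA;' from the raw text remains.
    escaped = text.replace("&", "&amp;")
    # Normalise CRLF and lone CR to LF before appending markers.
    escaped = escaped.replace("\r\n", "\n").replace("\r", "\n")
    return escaped.replace("\n", "&#xD;&#xA;")


def unescape_field(escaped: str) -> str:
    """Inverse of escape_field: decode markers first, then resolve '&amp;'."""
    text = escaped.replace("&#xD;&#xA;", "\n")
    return text.replace("&amp;", "&")


original = "first line\nliteral &#xD;&#xA; stays & survives"
exported = escape_field(original)
print(exported)
assert unescape_field(exported) == original
```

Because every marker in the exported string was appended by `escape_field` itself, the import side can replace markers unconditionally before resolving `&amp;`.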