Improve escaping/unescaping newline characters
sbaltes opened this issue · 2 comments
We currently use the following regular expression to replace the escaped newline characters present in the file PostHistory.xml, which is part of the official Stack Overflow data dump:
((?:…|…)?…)
The head of PostHistory.xml looks like this:
In some cases, this may break posts containing the character sequence `&#xD;&#xA;`.
One example is this post, others can be found using Stack Overflow's search feature.
The `&#xD;&#xA;` sequences themselves are escaped within the posts:
We have to make sure that those sequences are preserved while the newlines are replaced.
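A minimal sketch of one way to meet this requirement when reading raw attribute values. It assumes (the issue does not confirm this) that real newlines appear as the entities `&#xD;&#xA;` or `&#xA;`, and that a sequence typed literally into a post appears with its ampersands escaped as `&amp;`. The pattern and the function name `decode_text` are hypothetical and are not the regular expression used by the project. The point is only the ordering: newline entities are replaced first, and `&amp;` is undone afterwards.

```python
import re

# Assumed shape of an escaped newline: an optional CR entity followed by an LF entity.
NEWLINE_ENTITY = re.compile(r"(?:&#xD;)?&#xA;")


def decode_text(raw: str) -> str:
    """Turn escaped newlines into real ones while preserving literal sequences."""
    # Step 1: replace genuine newline entities. Literal sequences are still
    # written as "&amp;#xD;&amp;#xA;" at this point, so the pattern cannot match them.
    text = NEWLINE_ENTITY.sub("\n", raw)
    # Step 2: only now undo the ampersand escaping (other entities are ignored here).
    return text.replace("&amp;", "&")


raw = "first&#xD;&#xA;second &amp;#xD;&amp;#xA; third"
print(repr(decode_text(raw)))
```

Reversing the two steps would first turn `&amp;#xD;&amp;#xA;` into `&#xD;&#xA;` and then replace it with a newline, which is the kind of breakage described above.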
We use the same character sequence when exporting the SOTorrent dataset versions, thus our export and import scripts are also affected.
Thanks @laitingsheng for pointing this out to me.
Actually, I am wondering whether you could replace every `&` in the raw text with `&amp;`, which works like an escape character in HTML, but only for `&`. You could then guarantee that `&#xD;` or `&#xA;` only exist where you append them to the output. All original `&#xD;` and `&#xA;` in the raw text would be escaped to `&amp;#xD;` and `&amp;#xA;`, respectively.
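The suggestion above can be sketched as an export/import round trip. This is an illustration under stated assumptions, not the project's actual scripts: the function names `escape_newlines` and `unescape_newlines` are hypothetical, and only `&`, carriage returns, and line feeds are handled.

```python
def escape_newlines(text: str) -> str:
    """Export side: produce a single-line representation of a post."""
    # Escape ampersands first, so any "&#xD;" or "&#xA;" already present
    # in the post becomes "&amp;#xD;" or "&amp;#xA;".
    escaped = text.replace("&", "&amp;")
    # Every "&#xD;" / "&#xA;" added from here on stands for a real newline character.
    return escaped.replace("\r", "&#xD;").replace("\n", "&#xA;")


def unescape_newlines(escaped: str) -> str:
    """Import side: reverse the steps of escape_newlines in the opposite order."""
    text = escaped.replace("&#xD;", "\r").replace("&#xA;", "\n")
    return text.replace("&amp;", "&")


post = "a & b\r\nliteral &#xD;&#xA; stays\nend"
exported = escape_newlines(post)
print(exported)
print(unescape_newlines(exported) == post)
```

The order matters on both sides: ampersands are escaped before newlines on export and restored after newlines on import, so a literal sequence in a post is never confused with an appended one.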
Should be fixed in the most recent database versions (2020-08-31 and 2020-11-16).
I'm now keeping the newlines, hence I had to switch to SQL dumps instead of CSV files.
MySQL's CSV export is broken, see:
- https://issuetracker.google.com/issues/35906027
- https://stackoverflow.com/questions/12418381/data-between-quotes-and-field-separator
- https://stackoverflow.com/questions/24610691/valid-csv-filed-import-fails-with-data-between-close-double-quote-and-field
- https://stackoverflow.com/questions/41774233/best-practice-to-migrate-data-from-mysql-to-bigquery