Wrong detection of code snippets
martincabello opened this issue · 5 comments
Hello,
I am part of a group of University College London MSc. students working on the MSR Mining Challenge. We have been running certain queries on BigQuery to see if we can get access to the data we need in order to answer our proposed question. We are currently stuck with the following problem.
We have written the following query.
SELECT
posts.Id,
posts.Title,
versions.CreationDate,
versions.SuccPostHistoryId,
blocks.Length,
blocks.LineCount,
blocks.PostHistoryId,
blocks.PostBlockTypeId,
blocks.LocalId,
blocks.Content
FROM
`sotorrent-org.2018_09_23.Posts` AS posts
INNER JOIN
`sotorrent-org.2018_09_23.PostBlockVersion` AS blocks
ON
posts.Id = blocks.PostId
INNER JOIN
`sotorrent-org.2018_09_23.PostVersion` AS versions
ON
posts.Id = versions.PostId AND blocks.PostHistoryId = versions.PostHistoryId
WHERE
(posts.Id = 49311839 OR posts.Id = 46695379) AND
posts.Tags LIKE "%python%"
--AND blocks.PostBlockTypeId = 2
AND versions.SuccPostHistoryId IS NULL
ORDER BY
posts.CreationDate DESC, blocks.LocalId ASC
LIMIT
1000;
Which gives us the following results:
Our impression is that the extraction of post versions is not working with this particular post. We see that the PostBlockVersions are not being properly identified. We have seen this problem with some other posts as well, but we provide this one as an example.
We provide one more additional example, where we get 71 PostBlockVersions with PostBlockType = CodeBlock if we use the same query.
SO Post II
We hope this is not an issue with our query. We will be very grateful for any feedback on this issue.
Hi, I am part of the same team. We invested the issue further and found that there are 4 spaces in the HTML, even though they are not rendered if you simple open the SO page.
Traceback (most recent call last):
File "C:\Users\elmou\AppData\Local\Programs\Python\Python36-32\lib\socketserver.py", line 639, in process_request_thread
self.finish_request(request, client_address)
File "C:\Users\elmou\AppData\Local\Programs\Python\Python36-32\lib\socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "C:\Users\elmou\AppData\Local\Programs\Python\Python36-32\lib\socketserver.py", line 696, in
Hence, it is probably parsed as SO Markdown, producing the inaccurate detection of CodeBlockVersions.
Thanks for reporting this. I can confirm that for the mentioned posts, the post block extraction does not work correctly. I'm currently working on a fix.
This should be fixed now in the posthistory-extractor. We are still running some tests to check whether the changes didn't break anything else. If I'll find the time, there will be a new release based on data dump 2018-09-05 next week, but I can't promise that. What I can promise is a new release based on the December data dump, which I except to be released end of next week. I'll keep this issue open until you can confirm the fix using a new release of the dataset.
Since Stack Exchange released their new data dump a bit earlier (2018-12-02), there won't be a new release based on dump 2018-09-05. I'm currently downloading the new data dump and will run the extraction afterwards. Expect a new SOTorrent release beginning of next week.
The extraction seems to be correct in the new SOTorrent release 2018-12-09:
MySQL:
SELECT *
FROM PostBlockVersion
WHERE PostId=49311839
ORDER BY PostHistoryId ASC, LocalId ASC;
BigQuery:
SELECT *
FROM `sotorrent-org.2018_12_09.PostBlockVersion`
WHERE PostId=49311839
ORDER BY PostHistoryId ASC, LocalId ASC;