sotorrent/db-scripts

Wrong detection of code snippets

martincabello opened this issue · 5 comments

Hello,
I am part of a group of University College London MSc. students working on the MSR Mining Challenge. We have been running certain queries on BigQuery to see if we can get access to the data we need in order to answer our proposed question. We are currently stuck with the following problem.

We have written the following query.

SELECT
  posts.Id,
  posts.Title,
  versions.CreationDate,
  versions.SuccPostHistoryId,
  blocks.Length,
  blocks.LineCount,
  blocks.PostHistoryId,
  blocks.PostBlockTypeId,
  blocks.LocalId,
  blocks.Content
FROM
  `sotorrent-org.2018_09_23.Posts` AS posts
INNER JOIN
  `sotorrent-org.2018_09_23.PostBlockVersion` AS blocks
ON
  posts.Id = blocks.PostId
INNER JOIN
  `sotorrent-org.2018_09_23.PostVersion` AS versions
ON
  posts.Id = versions.PostId AND blocks.PostHistoryId = versions.PostHistoryId
WHERE
  (posts.Id = 49311839 OR posts.Id = 46695379) AND
  posts.Tags LIKE "%python%"
  --AND blocks.PostBlockTypeId = 2
  AND versions.SuccPostHistoryId IS NULL
ORDER BY
  posts.CreationDate DESC, blocks.LocalId ASC
LIMIT
  1000;

Which gives us the following results:

Results Google Sheet

Our impression is that the extraction of post versions is not working with this particular post. We see that the PostBlockVersions are not being properly identified. We have seen this problem with some other posts as well, but we provide this one as an example.

SO Post

We provide one more additional example, where we get 71 PostBlockVersions with PostBlockType = CodeBlock if we use the same query.
SO Post II

We hope this is not an issue with our query. We will be very grateful for any feedback on this issue.

Hi, I am part of the same team. We invested the issue further and found that there are 4 spaces in the HTML, even though they are not rendered if you simple open the SO page.

Traceback (most recent call last):
  File "C:\Users\elmou\AppData\Local\Programs\Python\Python36-32\lib\socketserver.py", line 639, in process_request_thread
    self.finish_request(request, client_address)
  File "C:\Users\elmou\AppData\Local\Programs\Python\Python36-32\lib\socketserver.py", line 361, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "C:\Users\elmou\AppData\Local\Programs\Python\Python36-32\lib\socketserver.py", line 696, in 

Hence, it is probably parsed as SO Markdown, producing the inaccurate detection of CodeBlockVersions.

Thanks for reporting this. I can confirm that for the mentioned posts, the post block extraction does not work correctly. I'm currently working on a fix.

This should be fixed now in the posthistory-extractor. We are still running some tests to check whether the changes didn't break anything else. If I'll find the time, there will be a new release based on data dump 2018-09-05 next week, but I can't promise that. What I can promise is a new release based on the December data dump, which I except to be released end of next week. I'll keep this issue open until you can confirm the fix using a new release of the dataset.

Since Stack Exchange released their new data dump a bit earlier (2018-12-02), there won't be a new release based on dump 2018-09-05. I'm currently downloading the new data dump and will run the extraction afterwards. Expect a new SOTorrent release beginning of next week.

The extraction seems to be correct in the new SOTorrent release 2018-12-09:

MySQL:

SELECT *
FROM PostBlockVersion
WHERE PostId=49311839 
ORDER BY PostHistoryId ASC, LocalId ASC;

2018-12-12 12_32_01-mysql workbench

BigQuery:

SELECT *
FROM `sotorrent-org.2018_12_09.PostBlockVersion`
WHERE PostId=49311839 
ORDER BY PostHistoryId ASC, LocalId ASC;

capturfiles-20181215_101857