Data4Democracy/assemble

4chan link extraction & cleanup

Closed this issue · 6 comments

Problem:

  • We have collected text data from the 4chan API. Unfortunately, the text returned in the "com" field is pretty rough: it includes HTML markup and other random garbage.

Ex:

"com": "<a href=\"#p116190305\" class=\"quotelink\">&gt;&gt;116190305</a>
<br>redacted are simply jealous of redacted.<br>https://youtu.be/k4yXQkG2s1E"

Additional info:

  • Clean all HTML tags, leaving only the plain text. Extract links to external sites and quoted threads.
  • Expect lots of malformed HTML and weird junk mixed in.
  • Err on the side of capturing too much vs too little (especially when identifying links).
  • I've set up a small public test dataset on S3, which you can load directly into pandas via df = pd.read_csv('https://s3.amazonaws.com/far-right/fourchan/chan_example.csv', parse_dates=['created_at'])

After cleaning, the example above should produce something along the lines of the following (use your own judgement after playing with the data):

{
    "text": "redacted are simply jealous of redacted",
    "external_links": ["https://youtu.be/k4yXQkG2s1E"]
}
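For anyone picking this up, here is a minimal sketch of one possible approach, using only the Python standard library. It is not the code from any PR on this issue. The function name clean_comment, the regexes, and the extra "quoted_posts" key are all assumptions made for illustration, and regex-based tag stripping is deliberately crude so that it tolerates malformed markup.

```python
import html
import re

# All names below are hypothetical; they are not taken from the project.
QUOTELINK_RE = re.compile(
    r'<a\b[^>]*class="quotelink"[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL
)
QUOTE_ID_RE = re.compile(r'href="[^"]*#p(\d+)"')
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
# Greedy on purpose: better to capture too much of a link than too little.
URL_RE = re.compile(r'https?://[^\s<>"]+', re.IGNORECASE)


def clean_comment(com):
    """Return plain text, external links, and quoted post ids for one comment."""
    if not isinstance(com, str):  # e.g. NaN for posts without a comment
        com = ""

    quoted = []

    def _collect_quote(match):
        # Record the quoted post id, then drop the whole quote link.
        quoted.extend(QUOTE_ID_RE.findall(match.group(0)))
        return " "

    text = QUOTELINK_RE.sub(_collect_quote, com)
    text = BR_RE.sub("\n", text)   # line breaks become whitespace
    text = TAG_RE.sub("", text)    # drop remaining tags without adding spaces,
                                   # so URLs split by inline tags are rejoined
    text = html.unescape(text)     # &gt; -> >, &#039; -> ', etc.

    links = URL_RE.findall(text)
    text = URL_RE.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()

    return {"text": text, "external_links": links, "quoted_posts": quoted}
```

With the test dataset loaded as df, something like df['com'].apply(clean_comment) would give one dict per post. Trailing punctuation is kept in both the text and the links; trimming it is a judgement call left to whoever works on this.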

Warning: this work requires you to deal with highly explicit and offensive content from the 4chan pol board.

Starting this

Issue is still open for anyone looking to get started.

If nobody is assigned this task, I would love to try my hand at it. This will be my first attempt at contributing to D4D.

Sounds good @subbuvenk94. @carolph3232 if you find time during the weekend hackathon feel free to drop into chat and tag team.

@subbuvenk94 I've got a pretty good start on this, but it's not perfect. I'll submit a PR so you can see what I've done and we can collaborate.

Update: here's the PR: #55

@carolph3232 Nice work there! I think you have it covered all by yourself. I didn't see this earlier, my bad. Thanks for the offer to collaborate 👍