/dail_debates

Scraping Dáil (parliament, Republic of Ireland) debates in Python

  • The goal is to examine Dáil debates and map them to facts
  • Facts will be defined as figures published by the CSO (the Central Statistics Office)
  • This is (loosely) inspired by FullFact, a UK initiative
  • But this will be the lazy, automated version

Dáil Data

  • The transcripts of all the debates are available online
  • It appears to go back to 1919, which is pretty impressive
  • I started with the code below
from lxml import html
import requests as r
page = r.get("http://oireachtasdebates.oireachtas.ie/debates%20authoring/debateswebpack.nsf/takes/dail2017011700003?opendocument#A00100")
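  • Presumably the next step with lxml would have been something like the sketch below (not something I actually ran, just the obvious continuation)
# parse the fetched page into an element tree and pull out the paragraph text
tree = html.fromstring(page.content)
paragraphs = tree.xpath('//p/text()')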
  • But then I noticed that there’s a common string in the URLs, that debateswebpack.nsf stuff
  • So I went to the bare takes view URL (dail in the code below), and found gloriousness
  • At least for my purposes
  • It’s a simple table filled with links
  • There are some buttons to paginate at the top
  • I should try faking those in case there’s a pattern (a sketch follows the code below)
from bs4 import BeautifulSoup
dail = "http://oireachtasdebates.oireachtas.ie/debates%20authoring/debateswebpack.nsf/takes?OpenView&Start=1"
dail2 = "http://oireachtasdebates.oireachtas.ie/debates%20authoring/debateswebpack.nsf/takes/dail2017011800001?opendocument"
base = r.get(dail)

soup = BeautifulSoup(base.content, 'lxml')
tables = soup.find_all('table')
# dirty hack: the second table on the page is the one that holds the debate links
links = tables[1].find_all('a')
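  • Below is a rough sketch of faking those buttons by stepping the Start parameter; the 30-row step per page is a guess I haven’t confirmed
# page through the listing by bumping the Start value in the query string
listing_base = "http://oireachtasdebates.oireachtas.ie/debates%20authoring/debateswebpack.nsf/takes?OpenView&Start="
all_links = []
for start in range(1, 301, 30):
    listing = BeautifulSoup(r.get(listing_base + str(start)).content, 'lxml')
    listing_tables = listing.find_all('table')
    if len(listing_tables) > 1:
        # same dirty hack as above: the second table holds the debate links
        all_links.extend(listing_tables[1].find_all('a'))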
  • So I probably need some regular expressions.
# grab every anchor tag on the page
href = soup.find_all('a')
  • The code above gets all of the links
  • I can just convert to a list
  • This will then allow me to iterate more easily, and use re
# I feel weird importing regular expressions
import re
href_str = list(href)
debate_urls = []
debate_names = []
for each in href_str:
    # pull the relative link and the debate name out of each anchor
    res = re.search(r'(debates.*OpenDocument).*(dail[0-9]+)', str(each))
    if res:
        debate_urls.append(res.group(1))
        debate_names.append(res.group(2))
    

  • Hmmm, interesting style discovery
  • Listcomps are awesome, see below
relurls = [re.search(r'(".*").*dail[0-9]+', str(x)) for x in href_str]
relnames = [re.search(r'dail[0-9]+', str(x)) for x in href_str]
urls_names = zip(relurls, relnames)  # still full of Nones for the anchors that didn’t match
  • But it’s hard to put them back together
  • The first block above (the for loop) was actually much easier to write and get working
  • Avoiding Nones in one’s results appears to be good practice (one way to filter them out is sketched below)
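  • For what it’s worth, here is one way to keep the comprehension style and still avoid the Nones, pairing things up in a single pass with the same regex as the loop above
# build (url, name) pairs in one go, dropping anything the regex didn’t match
matches = (re.search(r'(debates.*OpenDocument).*(dail[0-9]+)', str(x)) for x in href_str)
urls_and_names = [(m.group(1), m.group(2)) for m in matches if m is not None]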
  • So now, following all of that, I can take the URLs I’ve extracted and see if I can collect all of the responses in a list.
  • The difficulty is what to do once I get to the actual text.
  • I need to decide how to organise and store it
  • I can probably just split by year at first, and keep all the information in the names so I can re-arrange later (a grouping sketch follows)
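  • As a quick sketch of that, assuming the names always follow the dailYYYYMMDD… pattern seen so far, the year digits can be sliced straight out of the name
from collections import defaultdict

# bucket (name, url) pairs by the four year digits after "dail"
debates_by_year = defaultdict(list)
for name, url in zip(debate_names, debate_urls):
    year = name[4:8]
    debates_by_year[year].append((name, url))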
import os
results = []
for url in debate_urls:
    # naively join each extracted link onto the listing URL
    path = os.path.join(dail, url.lower())
    res = r.get(path)
    results.append(res)
  • Bugger, those links built by joining onto the listing URL don’t work properly.
  • However, if I use dail2 above, I can get the actual response.
base_debates = "http://oireachtasdebates.oireachtas.ie/debates%20authoring/debateswebpack.nsf/takes/dail2017011800001?opendocument"
testlink = r.get(base_debates)
testsoup = BeautifulSoup(testlink.content, 'lxml')
para = testsoup.find_all('p', {'class': 'tocsubitem'})
  • So this works
  • If I grab all of the paragraphs with the tocsubitem class (note the dict syntax for classes, which is presumably consistent with other attributes)
  • I then have links to each individual question (extracting them is sketched below)
  • I’m a little unsure if the entire text is already there
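  • Here’s a sketch of pulling those per-question links out, assuming each tocsubitem paragraph wraps a single anchor with a relative href
# collect the relative hrefs sitting inside the toc paragraphs
question_links = []
for p in para:
    a = p.find('a')
    if a is not None and a.get('href'):
        question_links.append(a['href'])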
base_debates = "http://oireachtasdebates.oireachtas.ie/debates%20authoring/debateswebpack.nsf/takes/dail1919012200012?opendocument"
  • That’s not the base string
  • But I just grabbed one of the debate names from earlier
  • And was able to get to the correct page
  • Which is pretty nice (a sketch that puts the whole URL pattern together is at the end of these notes)
  • Man, this is so much easier in Python than it would be in R
  • This is presumably why I was never able to do this (scraping) well before
  • It’s odd, because Hadley Wickham built loads of tooling for exactly this
  • I suppose R’s lack of proper dictionaries and Python’s much, much broader pool of programmers probably account for most of this
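  • Pulling the pieces together, here is a sketch of building full debate URLs from the names extracted earlier, assuming the takes/<name>?opendocument pattern holds for every debate
takes_base = "http://oireachtasdebates.oireachtas.ie/debates%20authoring/debateswebpack.nsf/takes/"

# fetch and parse each debate page by slotting its name into the common URL pattern
debate_pages = {}
for name in debate_names:
    resp = r.get(takes_base + name.lower() + "?opendocument")
    if resp.ok:
        debate_pages[name] = BeautifulSoup(resp.content, 'lxml')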