/TellMeWho

Search engines exploit knowledge graphs to provide meaningful and often "structured" information in response to user queries. We can exploit knowledge graphs in the context of web search in different ways, including through infobox creation and question answering.


COMS E6111 Project2
============================================================================
*Group members

Di Li - dl2943
Bingjie Sun - bs2888
============================================================================
*List of files

main.py - main code for the 3 input modes
matching.py - configuration that maps Freebase API properties to our
			  program's data structure
printable.py - prints the customized table
question.py - MQLquery for part 2
README - readme file
transcript.txt - sample run results (please view this full screen for proper table views)

============================================================================
*How to run?

- Dependency libraries (Python):
urllib
json
collections (OrderedDict)

- To run the code:

Basic mode : to submit a single query in string format "<q>"
	python main.py -k <account_key> -q <query> -t [infobox|question]

File input mode : to read queries from an input file, one query per line
	python main.py -k <account_key> -f <file_with_queries> -t [infobox|question]

Interactive mode : to keep entering queries in the terminal until you are done
	python main.py -k <account_key>

- GoogleAPI Search Account Key

AIzaSyB-LJI6QQ9P_D1WyKuCLT6yABME20lrYwM

- Requests per second per user

Same as the default: 10
 
============================================================================
*Internal Project Design

main.py accepts five options (-k, -q, -t, -f, -h). The check_args() function
extracts the input mode and its fields, and the usage() function prints
instructions on how to use the system.
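
A minimal sketch of how the mode detection described above could look, assuming
getopt-style parsing. The name detect_mode and the returned (mode, fields) pair
are hypothetical and only for illustration; the real check_args() may be
structured differently.

```python
import getopt


def detect_mode(argv):
    """Return (mode, fields) for a list of command-line arguments.

    mode is 'basic', 'file', 'interactive', or 'help' (hypothetical labels).
    """
    # -k, -q, -t and -f each take a value; -h does not.
    opts, _ = getopt.getopt(argv, "k:q:t:f:h")
    fields = {flag.lstrip("-"): value for flag, value in opts}

    # Every real run needs an account key; -h always shows the usage text.
    if "h" in fields or "k" not in fields:
        return "help", fields
    # A single query and a query file cannot be combined.
    if "q" in fields and "f" in fields:
        return "help", fields
    if "q" in fields or "f" in fields:
        # -t must name one of the two supported outputs.
        if fields.get("t") not in ("infobox", "question"):
            return "help", fields
        return ("basic" if "q" in fields else "file"), fields
    # Only -k was given: fall through to the interactive loop.
    return "interactive", fields
```

In this sketch, a caller would print the usage text whenever the mode is
'help' and otherwise dispatch on the returned mode.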

For mode 1, we run either infobox.py or question.py depending on the input
parameter -t. If the user wants an infobox, the system passes the API key
and the query to the run(api_k, query) function inside infobox.py.

For mode 2, as in mode 1, we run either infobox.py or question.py depending
on the input parameter -t, once for each line of the input file.

For mode 3, we wrote an infinite loop that breaks only on KeyboardInterrupt
(Control+C). In each iteration, we read a line of raw input and check whether
it is a "Who created"/"who created" question. If it is, we call question.py;
otherwise we call infobox.py.
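
The routing in mode 3 can be sketched as follows. This is an illustration
under assumptions, not the code in main.py: route() and interactive() are
hypothetical names, the reader and handlers are passed in so the loop can be
tried without a terminal, and catching EOFError (end of input) in addition to
KeyboardInterrupt is an extra that the description above does not mention.

```python
def route(line):
    """Pick the handler name for one line of interactive input."""
    text = line.strip()
    # Only these two spellings count as a question, as described above.
    if text.startswith(("Who created", "who created")):
        return "question"
    return "infobox"


def interactive(read_line, handlers):
    """Read lines until interrupted and dispatch each one to a handler.

    read_line: callable returning the next input line (e.g. input).
    handlers:  dict with 'question' and 'infobox' callables (hypothetical).
    """
    while True:
        try:
            line = read_line()
        except (KeyboardInterrupt, EOFError):
            break
        if line.strip():  # skip empty lines
            handlers[route(line)](line)
```

For a real session one would call interactive(input, {...}) with the two
functions that run the question and infobox code paths.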

	
How infobox.py runs - 

First, the search(query) function queries the Freebase API and returns a list
of mids, whose entities may carry more types than we need. Two helper
functions handle this: valid_topic() keeps only the mids that fall into the
6 accepted categories, and cleanup_type() resolves conflicting categories
(e.g. Author and League).
For the accepted mids, the topic() function queries the Freebase API and gets
the data in Dict format.
Then we call assemble_infobox() to convert the raw Dict into our own
design: a Dict of lists, where each list holds either text values or a
nested layer of dicts. In each nested dict, the keys and values form one row
covering all sub-fields of the outer property. We used OrderedDict for this
design so that every output follows a fixed field order. For entities that
lack some of the properties, we put an empty list in the corresponding
position.
Finally, the printable() function is called to display the data.
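
A minimal sketch of the "Dict of lists" design, for illustration only. The
function names texts() and assemble() are hypothetical, and the shape of the
raw data (each property holding a "values" list whose items carry "text" and,
for compound properties, a nested "property" dict) is an assumption about the
topic response; the sample data is hand-written, not real API output.

```python
from collections import OrderedDict


def texts(raw, prop):
    """Collect the 'text' of every value stored under one property."""
    return [v.get("text", "") for v in raw.get(prop, {}).get("values", [])]


def assemble(raw, prop_map):
    """Turn raw topic properties into an OrderedDict of lists."""
    box = OrderedDict()
    for prop, label in prop_map.items():
        if isinstance(label, dict):
            # Compound property: one nested dict (one table row) per value.
            rows = []
            for value in raw.get(prop, {}).get("values", []):
                sub = value.get("property", {})
                rows.append(OrderedDict(
                    (child_label, ", ".join(texts(sub, child)))
                    for child, child_label in label["children"].items()
                ))
            box[label["name"]] = rows
        else:
            # Simple property: list of text values, empty if missing.
            box[label] = texts(raw, prop)
    return box


# Hand-written stand-ins for a slice of the mapping and of the raw data.
prop_map = OrderedDict([
    ("/type/object/name", "Name"),
    ("/people/person/date_of_birth", "Birthday"),
    ("/people/person/spouse_s", {
        "name": "Spouse(s)",
        "children": OrderedDict([
            ("/people/marriage/spouse", "Spouse Name"),
            ("/people/marriage/from", "Marriage From"),
        ]),
    }),
])
raw = {
    "/type/object/name": {"values": [{"text": "Example Person"}]},
    "/people/person/spouse_s": {"values": [{"property": {
        "/people/marriage/spouse": {"values": [{"text": "Example Spouse"}]},
        "/people/marriage/from": {"values": [{"text": "2000-01-01"}]},
    }}]},
}
infobox = assemble(raw, prop_map)
```

Here "Birthday" is absent from the raw data, so it ends up as an empty list
while keeping its position between "Name" and "Spouse(s)".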

How question.py runs - 

First, the extractX() function is called to get the query term. We accept
only "Who created"/"who created" questions as valid input, and we query the
API only for the two given types ('Author', 'BusinessPerson').
Then we call the MQLquery() function to query the API. Note that, like the
reference program, we provide table-like output in interactive mode and text
output in basic and file-input modes. We implemented this distinction simply
by calling the printable() function in mode 3, while in modes 1 and 2 we
output lines of concatenated strings.
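
The extraction step could look roughly like the sketch below. The name
extract_x and the handling of the trailing question mark are assumptions made
for illustration; the real extractX() may treat edge cases differently.

```python
# The two entity types queried for a question, as listed above.
QUESTION_TYPES = ("Author", "BusinessPerson")


def extract_x(question):
    """Return X from 'Who created X?', or None for any other input."""
    text = question.strip()
    for prefix in ("Who created", "who created"):
        if text.startswith(prefix):
            # Drop the prefix, surrounding spaces and a trailing '?'.
            x = text[len(prefix):].strip().rstrip("?").strip()
            return x or None
    return None
```

A caller would skip the query when extract_x() returns None and otherwise
query once per entry in QUESTION_TYPES.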


How printable.py works - 

This function is written entirely from scratch. It displays nicely formatted
tables and allows nested columns. To make the table wider, just change the whole parameter.
The function first checks whether header data was passed in as a parameter, so
that we can print Name+Categories for an infobox and no such header for questions.
Then, for each key and value (a list) in the given Dict, we check the type of list[0].
If it is a Dict, we divide the columns according to the nested fields we
have. If not, we simply print out the list values. All of these steps apply
automatic line breaking to prevent the table display from overflowing.
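
A much-simplified sketch of the wrapping idea, using textwrap from the
standard library. The name render(), the width and label_width parameters and
the border characters are all hypothetical, and nested dicts are flattened
into "field: value" text here instead of being split into sub-columns as
described above.

```python
import textwrap


def render(data, width=60, label_width=14):
    """Return the lines of a two-column table for a dict of lists."""
    rule = " " + "-" * (width - 2)
    body = width - label_width - 5  # space left after borders and label
    lines = [rule]
    for label, values in data.items():
        if values and isinstance(values[0], dict):
            # Nested rows: flatten each dict to 'field: value' pairs.
            cells = ["; ".join("%s: %s" % kv for kv in row.items())
                     for row in values]
        else:
            cells = [str(v) for v in values]
        wrapped = []
        for cell in cells:
            # Break long values so that no line exceeds the table width.
            wrapped.extend(textwrap.wrap(cell, body) or [""])
        for i, chunk in enumerate(wrapped or [""]):
            # The label is printed only on the first line of its row.
            head = (label + ":")[:label_width] if i == 0 else ""
            lines.append("| %-*s %-*s |" % (label_width, head, body, chunk))
        lines.append(rule)
    return lines
```

Calling print("\n".join(render(some_dict))) would show the table; every
bordered line has exactly `width` characters.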

============================================================================
*Additional Points

Mapping lists: convert Freebase properties to our data structure (from matching.py)

- The mapping we used to keep only the six accepted categories of entities:

accepted_type_list = OrderedDict([
	('/people/person', 'Person'),
	('/people/deceased_person', 'Person'),
	('/book/author', 'Author'),
	('/film/actor', 'Actor'),
	('/tv/tv_actor', 'Actor'),
	('/organization/organization_founder', 'BusinessPerson'),
	('/business/board_member', 'BusinessPerson'),
	('/sports/sports_league', 'League'),
	('/sports/sports_team', 'SportsTeam'),
	('/sports/professional_sports_team', 'SportsTeam'),
])
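
As an illustration of how such a mapping can be used, the sketch below reduces
an entity's Freebase type ids to its category labels. The function name
categories() is hypothetical, and the conflict resolution done by
cleanup_type() is not covered because its rules are not spelled out here.

```python
from collections import OrderedDict

ACCEPTED = OrderedDict([
    ('/people/person', 'Person'),
    ('/people/deceased_person', 'Person'),
    ('/book/author', 'Author'),
    ('/film/actor', 'Actor'),
    ('/tv/tv_actor', 'Actor'),
    ('/organization/organization_founder', 'BusinessPerson'),
    ('/business/board_member', 'BusinessPerson'),
    ('/sports/sports_league', 'League'),
    ('/sports/sports_team', 'SportsTeam'),
    ('/sports/professional_sports_team', 'SportsTeam'),
])


def categories(type_ids):
    """Map Freebase type ids to category labels, without duplicates.

    Labels come out in the order of the mapping, so the result is stable
    regardless of the order in which the API lists the types.
    """
    found = []
    for type_id, label in ACCEPTED.items():
        if type_id in type_ids and label not in found:
            found.append(label)
    return found
```

An entity whose result is an empty list belongs to none of the six categories
and would be skipped.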

- The mapping we used to match properties with our data structure (an OrderedDict of lists):

information_map = OrderedDict([
	('/people/person', OrderedDict([
		('/type/object/name', 'Name'),
		('/people/person/date_of_birth', 'Birthday'),
		('/people/person/place_of_birth', 'Place of Birth'),
		('/people/person/sibling_s', {
			'name': 'Siblings',
			'children': {
				'/people/sibling_relationship/sibling': 'Sibling',
			},
		}),
		('/people/person/spouse_s', {
			'name': 'Spouse(s)',
			'children': {
				'/people/marriage/spouse': 'Spouse Name',
				'/people/marriage/from': 'Marriage From',
				'/people/marriage/to': 'Marriage To',
				'/people/marriage/location_of_ceremony': 'Ceremony Location',
			},
		}),
		('/common/topic/description', 'Description'),
	])),
	('/people/deceased_person', OrderedDict([
		('/people/deceased_person/date_of_death', 'Death Date'),
		('/people/deceased_person/place_of_death', 'Death Place'),
		('/people/deceased_person/cause_of_death', 'Death Cause'),
	])),
	('/book/author', OrderedDict([
		('/book/author/works_written', 'Books'),
		('/book/book_subject/works', 'Books About The Author'),
		('/influence/influence_node/influenced', 'Influenced'),
		('/influence/influence_node/influenced_by', 'Influenced By'),
	])),
	('/film/actor', OrderedDict([
		('/film/actor/film', {
			'name': 'Films',
			'children': {
				'/film/performance/character': 'Character',
				'/film/performance/film': 'Film',
			},
		}),
	])),
	('/tv/tv_actor', OrderedDict([
		('/tv/tv_actor/guest_roles', {
			'name': 'TV Series',
			'children': {
				'/tv/tv_guest_role/character': 'Character',
				'/tv/tv_guest_role/episodes_appeared_in': 'TV Series',
			}
		}),
		('/tv/tv_actor/starring_roles', {
			'name': 'TV Series',
			'children': {
				'/tv/regular_tv_appearance/character': 'Character',
				'/tv/regular_tv_appearance/series': 'TV Series',
			},
		}),
	])),
	('/organization/organization_founder', OrderedDict([
		('/organization/organization_founder/organizations_founded', 'Founded'),
	])),
	('/business/board_member', OrderedDict([
		('/business/board_member/leader_of', {
			'name': 'Leadership',
			'children': {
				'/organization/leadership/from': 'From',
				'/organization/leadership/to': 'To',
				'/organization/leadership/organization': 'Organization',
				'/organization/leadership/role': 'Role',
				'/organization/leadership/title': 'Title',
			},
		}),
		('/business/board_member/organization_board_memberships', {
			'name': 'Board Membership',
			'children': {
				'/organization/organization_board_membership/from': 'From',
				'/organization/organization_board_membership/to': 'To',
				'/organization/organization_board_membership/organization': 'Organization',
				'/organization/organization_board_membership/role': 'Role',
				'/organization/organization_board_membership/title': 'Title',
			},
		}),
	])),
	('/sports/sports_league', OrderedDict([
		('/type/object/name', 'Name'),
		('/sports/sports_league/championship', 'Championship'),
		('/sports/sports_league/sport', 'Sport'),
		('/organization/organization/slogan', 'Slogan'),
		('/common/topic/official_website', 'Website'),
		('/common/topic/description', 'Description'),
		('/sports/sports_league/teams', {
			'name': 'Teams',
			'children': {
				'/sports/sports_league_participation/team': 'Team',
			},
		}),
	])),
	('/sports/sports_team', OrderedDict([
		('/type/object/name', 'Name'),
		('/common/topic/description', 'Description'),
		('/sports/sports_team/sport', 'Sport'),
		('/sports/sports_team/arena_stadium', 'Arena'),
		('/sports/sports_team/championships', 'Championships'),
		('/sports/sports_team/coaches', {
			'name': 'Coaches',
			'children': {
				'/sports/sports_team_coach_tenure/coach': 'Name',
				'/sports/sports_team_coach_tenure/position': 'Position',
				'/sports/sports_team_coach_tenure/from': 'From',
				'/sports/sports_team_coach_tenure/to': 'To',
			},
		}),
		('/sports/sports_team/founded', 'Founded'),
		('/sports/sports_team/league', {
			'name': 'League(s)',
			'children': {
				'/sports/sports_league_participation/league': 'League',
			},
		}),
		('/sports/sports_team/location', 'Location'),
		('/sports/sports_team/roster', {
			'name': 'PlayerRoster',
			'children': {
				'/sports/sports_team_roster/player': 'Name',
				'/sports/sports_team_roster/position': 'Position',
				'/sports/sports_team_roster/number': 'Number',
				'/sports/sports_team_roster/from': 'From',
				'/sports/sports_team_roster/to': 'To',
			},
		}),
	])),
	('/sports/professional_sports_team', OrderedDict([
		### empty
	])),
])
 
============================================================================