This repository serves as initial research for a web-based game called Polo.
- Node v7.0 with the harmony flag for async/await support
- TypeScript as programming language
- Firebase as schemaless database
Using third-party libraries (a BIG shout-out to himalaya) and require, I crawl websites and store some information about their HTML content. The data is stored in Firebase, a real-time database powered by Google.
The data has the following structure:
type siteAnalytics = {
  website: string,
  maxDepth: number,
  elements: { [tag: string]: number },
  hrefObjs: string[],
  childrenCount: { [count: number]: number },
  isDeadEnd: boolean
};
`maxDepth` represents the deepest hierarchical level in the DOM for a website.
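
As a rough illustration of how such a depth could be computed, here is a minimal sketch. The `DomNode` shape and the `maxDepth` function are hypothetical stand-ins, not the parser's actual output or the repository's code, and counting the root as level 1 is an assumption.

```typescript
// Hypothetical minimal node shape; the real parser output may differ.
interface DomNode {
  tagName: string;
  children: DomNode[];
}

// Depth of the tree rooted at `node`. Assumption: a node without children has depth 1.
function maxDepth(node: DomNode): number {
  let deepest = 0;
  for (const child of node.children) {
    deepest = Math.max(deepest, maxDepth(child));
  }
  return deepest + 1;
}

const page: DomNode = {
  tagName: "html",
  children: [
    { tagName: "head", children: [] },
    {
      tagName: "body",
      children: [
        { tagName: "div", children: [{ tagName: "a", children: [] }] },
      ],
    },
  ],
};

console.log(maxDepth(page)); // 4 (html > body > div > a)
```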
`elements` is a dictionary whose keys are HTML tags (e.g. `a`, `div`, ...) and whose values are the number of occurrences of that tag in the page `website`. Only W3C valid tags are included.
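
The tag count can be sketched as a recursive walk that increments a counter per tag. Everything named here (`DomNode`, `VALID_TAGS`, `countElements`) is hypothetical, and the allowlist is a tiny illustrative subset rather than the full list of valid HTML elements.

```typescript
// Hypothetical minimal node shape; the real parser output may differ.
interface DomNode {
  tagName: string;
  children: DomNode[];
}

// Illustrative subset only; a real allowlist would cover every valid HTML element.
const VALID_TAGS = new Set<string>(["html", "head", "body", "div", "a", "p", "span"]);

// Walks the tree and counts occurrences of each allowlisted tag.
function countElements(
  root: DomNode,
  counts: { [tag: string]: number } = {}
): { [tag: string]: number } {
  const tag = root.tagName.toLowerCase();
  if (VALID_TAGS.has(tag)) {
    counts[tag] = (counts[tag] || 0) + 1;
  }
  for (const child of root.children) {
    countElements(child, counts);
  }
  return counts;
}

const body: DomNode = {
  tagName: "body",
  children: [
    { tagName: "div", children: [{ tagName: "a", children: [] }] },
    { tagName: "div", children: [] },
    // Not in the allowlist, so it is skipped; its children are still visited.
    { tagName: "my-widget", children: [{ tagName: "a", children: [] }] },
  ],
};

const elements = countElements(body);
console.log(elements); // { body: 1, div: 2, a: 2 }
```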
`childrenCount` is another dictionary. Each key is a number of children x, and the corresponding value is how many nodes in the page have exactly x children.
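
In other words, `childrenCount` is a histogram over the number of direct children. A minimal sketch, assuming a hypothetical `DomNode` shape and a hypothetical `countChildren` helper:

```typescript
// Hypothetical minimal node shape; the real parser output may differ.
interface DomNode {
  tagName: string;
  children: DomNode[];
}

// Builds a histogram: key = number of direct children, value = nodes with that many.
function countChildren(
  root: DomNode,
  histogram: { [count: number]: number } = {}
): { [count: number]: number } {
  const n = root.children.length;
  histogram[n] = (histogram[n] || 0) + 1;
  for (const child of root.children) {
    countChildren(child, histogram);
  }
  return histogram;
}

const tree: DomNode = {
  tagName: "html",
  children: [
    { tagName: "head", children: [] },
    {
      tagName: "body",
      children: [
        { tagName: "div", children: [{ tagName: "a", children: [] }] },
        { tagName: "div", children: [] },
      ],
    },
  ],
};

const childrenCount = countChildren(tree);
console.log(childrenCount); // { '0': 3, '1': 1, '2': 2 }
```

Here `html` and `body` each have two children, the first `div` has one, and `head`, the second `div`, and `a` have none.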
`hrefObjs` is an array of the hrefs found in the page. These will come in handy in the game that will be developed.
`isDeadEnd` is true if no new hrefs are found.
New hrefs are stored in an array in the application scope. When we crawl a website, we look at every href found: if its subdomain + domain has not been seen yet, we add the href to the array of hrefs still to crawl; otherwise, it is discarded.
We keep crawling as long as there are hrefs left to crawl.
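
The crawl loop described above can be sketched as follows. This is not the repository's code: `hostKey`, `Frontier`, and the in-memory `fakeWeb` link graph are hypothetical names used for illustration, and the hostname (as parsed by Node's `URL` class) is assumed to stand in for "subdomain + domain". No network requests are made.

```typescript
import { URL } from "url";

// Returns the hostname (subdomain + domain), or null for unparsable or host-less hrefs.
function hostKey(href: string): string | null {
  try {
    const host = new URL(href).hostname;
    return host === "" ? null : host;
  } catch (err) {
    return null;
  }
}

class Frontier {
  private seen = new Set<string>();
  private queue: string[] = [];

  // Queues hrefs whose hostname has not been seen yet; returns how many were added.
  offer(hrefs: string[]): number {
    let added = 0;
    for (const href of hrefs) {
      const key = hostKey(href);
      if (key !== null && !this.seen.has(key)) {
        this.seen.add(key);
        this.queue.push(href);
        added++;
      }
    }
    return added;
  }

  // Next href to crawl, or undefined when nothing is left.
  next(): string | undefined {
    return this.queue.shift();
  }
}

// Stand-in for real crawling: maps a page to the hrefs found on it.
const fakeWeb: { [url: string]: string[] } = {
  "https://a.example.com/": ["https://b.example.com/x", "https://a.example.com/about"],
  "https://b.example.com/x": ["https://a.example.com/", "https://b.example.com/y"],
};

const frontier = new Frontier();
frontier.offer(["https://a.example.com/"]);

const deadEnds: { [url: string]: boolean } = {};
let current = frontier.next();
while (current !== undefined) {
  const hrefs = fakeWeb[current] || [];
  // A page is a dead end when none of its hrefs leads to a new hostname.
  deadEnds[current] = frontier.offer(hrefs) === 0;
  current = frontier.next();
}

console.log(deadEnds);
// { 'https://a.example.com/': false, 'https://b.example.com/x': true }
```

Deduplicating by hostname rather than by full URL means each site is visited at most once, which keeps the crawl bounded at the cost of ignoring other pages on the same host.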