Banko is my contribution to the Bankin's challenge. I'm very proud to submit my very first NodeJS project/script :)
```sh
npm install
node index.js
```
To write the result to a file:

```sh
node index.js > output.json
```

To benchmark the script:

```sh
time node index.js
```
The fake webpage contains a list of transactions. The script has to return a JSON list of these transactions with the following properties:
- account
- transaction
- amount
- currency
The transaction list is rendered as an HTML table:

```
<tr> account | transaction | amountAndCurrency </tr>
```
The data can be captured by the following regexp:

```
/<tr>\s*
<td>(?<account>.*)<\/td>\s*
<td>(?<transaction>.*)<\/td>\s*
<td>\s*(?<leftCurrency>\D*)\s*(?<amount>\d(?:.*\d)?)\s*(?<rightCurrency>\D*)\s*<\/td>\s*
<\/tr>/gU
```
If this pattern changes, you have to rewrite the `getTransactionList` function.
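A minimal JavaScript sketch of the extraction follows. Note that the PCRE-style `U` (ungreedy) flag doesn't exist in JavaScript, so lazy quantifiers (`*?`) stand in for it; `html` is assumed to already hold the page markup:

```js
// JS adaptation of the pattern above, with named capture groups.
const ROW_PATTERN = /<tr>\s*<td>(?<account>.*?)<\/td>\s*<td>(?<transaction>.*?)<\/td>\s*<td>\s*(?<leftCurrency>\D*?)\s*(?<amount>\d(?:.*\d)?)\s*(?<rightCurrency>\D*?)\s*<\/td>\s*<\/tr>/g;

function getTransactionListFromHtml(html) {
  return Array.from(html.matchAll(ROW_PATTERN), (m) => ({
    account: m.groups.account,
    transaction: m.groups.transaction,
    amount: m.groups.amount, // kept as a string, as rendered by the website
    currency: m.groups.leftCurrency || m.groups.rightCurrency,
  }));
}
```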
I noticed the response can be displayed in the main page or in an iframe. The transaction list can be displayed immediately, after a while, or after clicking (possibly several times) on a button. The browser can also trigger a dialog (like an alert box) when it detects an error.
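The dialog case can be neutralized with a one-liner, assuming Puppeteer (which the `Page`/`Frame` vocabulary of the script suggests):

```js
// Dismiss any alert/confirm dialog so it cannot block the scraping.
page.on('dialog', (dialog) => dialog.dismiss());
```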
Transforming a string to a float is costly, and a response served as `application/json` would not be lighter. Moreover, the `amount` is kept exactly as the website renders it.
The script doesn't send anything to localize the response. The means to unlocalize the data should be stored/configured alongside the URL to scrape.
The currency position depends on the page's localization: it can be on the left or the right side of the amount. On the fake page, the currency is a single character on the right side.
I noticed there are more pages than can be reached by browsing: the links and buttons on the page are not exhaustive, and more transaction pages appear when typing the full URL. So the scraper does not navigate; it scrapes a URL pattern directly.
Sequential scraping takes the sum of every page's time; parallel scraping takes the time of the slowest page.
Returning an ordered transaction list is not mandatory, so the parallel scraping is allowed to shuffle the final result. Why order by transaction label if you would rather sort/filter the list on your favorite attribute?
The JSON is written to the console. I didn't configure any CLI arguments. It's not the exercise ;)
The output could be a stream; in that case, write into the stream as soon as a URL is parsed. But the exercise doesn't evaluate any transformation of the results.
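As a hypothetical sketch of that variant (neither `streamTransactions` nor `parsePage` exists in the script):

```js
// Hypothetical streaming variant: emit each page's transactions as soon as
// its URL has been parsed, instead of buffering the whole list.
async function streamTransactions(urls, out = process.stdout) {
  for (const url of urls) {
    const transactions = await parsePage(url); // hypothetical per-URL helper
    out.write(JSON.stringify(transactions) + '\n'); // one JSON document per page
  }
}
```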
A complete benchmark on jsPerf shows that `Array.concat` is faster than `Array.prototype.push.apply`.
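For reference, the two accumulation strategies compared look like this (the list names are the ones from the pseudocode below):

```js
// Faster, per the jsPerf benchmark: concat returns a new array.
globalTransactionList = globalTransactionList.concat(lastTransactionList);

// Slower alternative: push in place via apply.
Array.prototype.push.apply(globalTransactionList, lastTransactionList);
```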
`IS_RUNNING` is turned off if (see the sketch below):

- the `start` parameter is over its max
- a transaction list is shorter than the usual length (defined as `step` in `URL_PARAMETER_START`)
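A minimal sketch of that check; `start` and `lastTransactionList` are the names used in the pseudocode further down:

```js
// Stop crawling when the URL parameter is exhausted or a short page is found.
if (start > URL_PARAMETER_START.max ||
    lastTransactionList.length < URL_PARAMETER_START.step) {
  IS_RUNNING = false;
}
```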
- `URL = 'https://web.bankin.com/challenge/index.html?start={start}'` is the pattern of URLs to scrape.
- `URL_PARAMETER_START = {}` defines how the `start` URL parameter evolves:
  - `min : 0` is the starting value
  - `max : 12345` is the maximum value
  - `step : 50` is the increment
- `PAGE_CONFIG = {}` holds the custom page parameters:
  - `nbPageReload : 3` is the number of allowed page reloads if there is a network issue like a broken wire
  - `loadTimeout : 10000` is the maximum number of milliseconds allowed for loading the DOM
  - `scriptTimeout : 30000` is the maximum number of milliseconds allowed for the onload script execution
  - `mutableSelector : 'tr, iframe'` lists the elements that the scraper observes
  - `reloadButtonSelector : '#btnGenerate'` is the selector of the reload button
- `async function getTransactionList(frame) { ... }` is the function which extracts the transaction list
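Assembled as plain constants, that configuration would look like the following sketch (the values are the ones listed above; the object-literal shape is my assumption):

```js
const URL = 'https://web.bankin.com/challenge/index.html?start={start}';

const URL_PARAMETER_START = {
  min: 0,     // starting value of the `start` URL parameter
  max: 12345, // maximum value
  step: 50,   // increment
};

const PAGE_CONFIG = {
  nbPageReload: 3,               // allowed reloads on a network issue
  loadTimeout: 10000,            // max ms for loading the DOM
  scriptTimeout: 30000,          // max ms for the onload script execution
  mutableSelector: 'tr, iframe', // elements the scraper observes
  reloadButtonSelector: '#btnGenerate', // the reload button
};
```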
Simple algo:

```
parse(page, url)
----------------
goto url (reload up to %nbPageReload% times if there is a network error)
click on %reloadButtonSelector% if it exists
wait until %mutableSelector% exists
foreach frame of page do
    lastTransactionList = getTransactionList(frame)
    if lastTransactionList is OK then break
done
append lastTransactionList to globalTransactionList
if no transactionList then stop
end
```
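Here is a hedged sketch of that `parse` step, assuming Puppeteer (whose `Page`/`Frame`/`Browser` types match the API summary below); it illustrates the steps rather than reproducing the exact code:

```js
// Sketch of parse(page, url); PAGE_CONFIG is the object defined above.
async function parse(page, url) {
  // goto url, reloading up to nbPageReload times on a network error
  for (let attempt = 1; ; attempt++) {
    try {
      await page.goto(url, { timeout: PAGE_CONFIG.loadTimeout });
      break;
    } catch (err) {
      if (attempt >= PAGE_CONFIG.nbPageReload) throw new Error('network');
    }
  }
  // click on the reload button if it exists
  if (await page.$(PAGE_CONFIG.reloadButtonSelector)) {
    await page.click(PAGE_CONFIG.reloadButtonSelector);
  }
  // wait until a row or an iframe shows up
  await page.waitForSelector(PAGE_CONFIG.mutableSelector, {
    timeout: PAGE_CONFIG.scriptTimeout,
  });
  // the list can live in the main frame or in an iframe: take the first non-empty one
  for (const frame of page.frames()) {
    const list = await getTransactionList(frame);
    if (list.length > 0) return list;
  }
  return [];
}
```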
Parallelism algo:

```
run(browser)
------------
crawlers = []
repeat NB_PARALLEL_PROCESS do
    crawlers.push(crawl())
done
// shutdown
wait all running crawlers
```
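A compact sketch of that fan-out in plain JavaScript (the value of `NB_PARALLEL_PROCESS` is illustrative, not taken from the script):

```js
const NB_PARALLEL_PROCESS = 4; // illustrative value

async function run(browser) {
  const crawlers = [];
  for (let i = 0; i < NB_PARALLEL_PROCESS; i++) {
    crawlers.push(crawl(browser)); // each crawl() loops on getUrl() while IS_RUNNING
  }
  await Promise.all(crawlers); // shutdown: wait for all running crawlers
}
```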
```
Array<Object> getTransactionList(Frame frame)
void          appendTransactionList(Frame frame)
void          parseAnyFrame(Page page)
void          crawl(Browser browser) throws Error("network")
string        getUrl()
Page          newPage(Browser browser)
void          run(Promise<Browser> browserPromise) throws Error("network")
```
1e1 - 2.71828183E0+bankin-challenge at gmail.com