dwyl/learn-postgresql

Codeface: Get Org, Repo & Profile Data for Example

nelsonic opened this issue · 5 comments

As part of the example app we are assembling #51 we need to source fresh data.
I propose writing a "crawler" to index all of @dwyl's repositories and people.
The crawler should:

  • start by crawling the org home page https://github.com/dwyl
  • add any URLs for repos it finds to the log table as next_page
  • org
    • members
    • repo
      • "starer"
  • person (profile)
    • repo
      • "starer"
    • followers
    • following
  • once the crawler has finished indexing a page (saving it's data to the DB),
    it should select the next_page to be crawled from the DB and keep going ...

progress: (all the next pages to crawl are being inserted for an Org)
image

SELECT
 next_page, 
 COUNT (next_page) AS c
FROM
 logs
WHERE next_page IS NOT null
GROUP BY
 next_page
ORDER BY 
 c asc
limit 1;

image

Context: I'm trying to get the next_page we need to view that has not been viewed before. 🔍

SELECT
 next_page, 
 COUNT (next_page) AS c
FROM
 logs
WHERE next_page IS NOT null
AND next_page NOT IN (
    SELECT path
    FROM logs
    WHERE path IS NOT NULL
)
GROUP BY
 next_page
ORDER BY 
 c ASC
LIMIT 1;

image

Stars are being saved! ⭐️
image