How to use goclone for site search?

Question

anborg opened this issue 4 years ago · 1 comments

How can I use goclone to extract the content of internal websites? I could put each page in a struct like the one below and then send the JSON to Elasticsearch to build site-search functionality.

P.S: I'm new to golang. :)

/**
Plan: import this struct in a crawling program, extract just the text content (no images/JS), and index it for site search.
**/
import (
	"fmt"
	"time"
)

type WebPage struct {
	Id      string    `json:"id"`      // some id string or number
	Url     string    `json:"url"`     // URL of the page to index
	Title   string    `json:"title"`   // Title of the page
	Content string    `json:"content"` // Page content //TODO remove javascript, try to extract only core content
	Time    time.Time `json:"time"`    // TODO timestamp of page creation time
}

// Print writes a short summary of the page to stdout, with the content redacted.
func (document *WebPage) Print() {
	// enc := json.NewEncoder(os.Stdout)
	// enc.SetIndent("", "  ")
	// document.Content = ""
	// enc.Encode(document)
	fmt.Printf("page:  {\n  title: %s, \n  url : %s, \n  content:%s, \n  time:%s \n}\n", document.Title, document.Url, "-redacted-", document.Time)
}

Answer 1 · 2020-07-13T21:27:24.000Z

Hey there!

To begin, Goclone is a CLI tool for copying/mirroring sites from a given URL or domain. If you want to extract specific data from a web page and use that data for some task, I would recommend writing a web scraper instead. A great web-scraping tool that I used in this project is Colly. A quick Colly example:

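package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Log every URL just before it is requested
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	// Start crawling from this page
	c.Visit("http://go-colly.org/")
}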

Some other great examples of building a web scraper can be found here! I hope this answers your question :)