How to use goclone for site search?
anborg opened this issue · 1 comment
anborg commented
How can I use goclone to extract the content of internal websites? I could put it in a struct like the one below, and then push the JSON into Elastic to build site-search functionality.
P.S.: I'm new to Go. :)
/**
Plan: import this struct in a crawling program, extract just the text content (no images/JS), and index it for site search.
**/
type WebPage struct {
	Id      string    `json:"id"`      // some id string or number
	Url     string    `json:"url"`     // URL of the page to index
	Title   string    `json:"title"`   // title of the page
	Content string    `json:"content"` // page content // TODO: remove javascript, try to extract only core content
	Time    time.Time `json:"time"`    // TODO: timestamp of page creation time
}

func (document *WebPage) Print() {
	// enc := json.NewEncoder(os.Stdout)
	// enc.SetIndent("", "  ")
	// document.Content = ""
	// enc.Encode(document)
	fmt.Printf("page: {\n title: %s,\n url: %s,\n content: %s,\n time: %s\n}\n", document.Title, document.Url, "-redacted-", document.Time)
}
imthaghost commented
Hey there!
So to begin, Goclone is a CLI tool for copying/mirroring sites via a given URL or domain. If you would like to extract specific data from a web page and use that data for some task, I would recommend writing a web scraper. A great web-scraping tool that I used in this project is Colly. Quick Colly example:
import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()
	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})
	c.Visit("http://go-colly.org/")
}
Some other great examples of building a web scraper can be found here! I hope this answers your question :)