This repository is an implementation of DOM-based content extraction via text density. Tested on Korean web pages.

Primary language: Go. License: MIT.

godensity

This repository implements DOM-based Content Extraction via Text Density in Go. The project is particularly useful for extracting main content from web pages by analyzing text density, and it has been thoroughly tested on Korean web pages.

Features

Extracts main content from web pages by removing unnecessary elements (e.g., ads, navigation bars, sidebars). Analyzes text density to determine relevant content blocks. Traverses the DOM efficiently, using goquery for HTML parsing and manipulation.
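
To make the text density idea concrete, here is a minimal sketch, not this repository's actual code. It assumes the basic definition of text density as the number of characters in a subtree divided by the number of tags in that subtree, with a tag count of zero treated as one. The `Node` type and the `el`, `text`, and `Density` names are hypothetical and exist only for this illustration. A real implementation would walk a parsed HTML document (for example via goquery) instead of a hand-built tree. Characters are counted as runes rather than bytes, which matters for Korean text.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Node is a hypothetical, minimal DOM node used only for this sketch.
// An empty Tag marks a text node; element nodes carry children.
type Node struct {
	Tag      string
	Text     string
	Children []*Node
}

// el and text are small helpers for building example trees by hand.
func el(tag string, children ...*Node) *Node {
	return &Node{Tag: tag, Children: children}
}

func text(s string) *Node { return &Node{Text: s} }

// count returns the number of characters (runes) and the number of
// descendant element tags found under n.
func count(n *Node) (chars, tags int) {
	if n.Tag == "" {
		return utf8.RuneCountInString(n.Text), 0
	}
	for _, c := range n.Children {
		cc, ct := count(c)
		chars += cc
		tags += ct
		if c.Tag != "" {
			tags++ // the child element itself counts as one tag
		}
	}
	return chars, tags
}

// Density computes characters per tag for the subtree rooted at n.
// A subtree with no descendant tags is treated as having one tag
// (an assumption of this sketch) to avoid division by zero.
func Density(n *Node) float64 {
	chars, tags := count(n)
	if tags == 0 {
		tags = 1
	}
	return float64(chars) / float64(tags)
}

func main() {
	// A link-heavy navigation block: little text spread over many tags.
	nav := el("nav",
		el("a", text("Home")),
		el("a", text("News")),
		el("a", text("Blog")),
	)
	// A content block: more text per tag.
	article := el("article",
		el("p", text("안녕하세요")),
		el("p", text("반갑습니다")),
	)
	fmt.Printf("nav: %.2f\n", Density(nav))         // nav: 4.00
	fmt.Printf("article: %.2f\n", Density(article)) // article: 5.00
}
```

Blocks with a higher score hold more text per tag and are more likely to be main content, while navigation bars and ad containers tend to score low. This sketch ignores link text, thresholds, and any smoothing, so it only illustrates the basic ratio.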

How to run?

# Clone the repository
gh repo clone minarc/godensity

# Change to the project directory
cd godensity

# Run tests
go test -v .

Contribution

Feel free to open issues or submit pull requests if you find any bugs or have suggestions for improvement. Contributions are always welcome!
