com-puter-tips/Links-Extractor

BUG found: Unicode issue.

GhbSmwc opened this issue · 3 comments

Can you add a feature to write the output to a file instead of printing it directly to the console window? As new lines are printed, the oldest lines get erased because of the console's buffer limit, and I do not want to lose them. I did try using >>Outputfile.txt, but the resulting file contained this:



https://www.uchinokomato.me/chara/show/154058
---------------------


18 internal links found:


'charmap' codec can't encode characters in position 14-23: character maps to <undefined>

Compare that to the output in the console window:

https://www.uchinokomato.me/chara/show/154058
---------------------


18 internal links found:


/users/sign_in
https://www.uchinokomato.me/legal
https://www.uchinokomato.me/recruit
/tags/ファンタジー
/search3
https://www.uchinokomato.me/about
/users/sign_up
/chara/birthday/11/7
/lobby
/?f=ex
/tags/ポケ擬
/users/sign_up?from=medal-modal
https://www.uchinokomato.me/termsofuse
/tags/主人公
/search?world=ポケタリアクロニクル-英雄の軌跡-
https://www.uchinokomato.me/?f=ex
https://www.uchinokomato.me/privacy
/user/show/26774


13 external links found:


https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/326/330/original/9751eb6a38a5d056fa242da79f6521a7.jpg?1529667087
http://twitter.com/uchinoko_beta
https://twitter.com/share
https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/373/764/original/57d027867d8e17d46c3e365275effc34.png?1545748783
http://uchinoko.dou-jin.com/
https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/327/127/original/57d027867d8e17d46c3e365275effc34.png?1529853167
https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/373/765/original/24a7613a0a284ac9f891fba8da287019.jpg?1545748799
https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/381/594/original/04d2c534525a7a1e11404f5c44d822a7.jpg?1548338586
https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/326/328/original/57d027867d8e17d46c3e365275effc34.png?1529667058
http://www.5thfloor.co.jp
https://doc.uchinokomato.me/
https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/362/565/original/57d027867d8e17d46c3e365275effc34.png?1541767222
https://s3-ap-northeast-1.amazonaws.com/uchinoko/chara_images/pictures/000/326/329/original/aa11c29fa81359b4ed70e56be18e5fe3.jpg?1529667083
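
For anyone trying to reproduce the 'charmap' error in the redirected file, here is a minimal sketch. It assumes the redirected output was encoded with a Windows code page such as cp1252 (the reporter's actual code page is not stated in this issue) and uses one of the internal links listed above. The first characters that cannot be encoded start at index 14, which appears to line up with the "position 14-23" in the error.

```python
# One of the internal links from the console output above.
link = "/search?world=ポケタリアクロニクル-英雄の軌跡-"

message = ""
first_bad_index = None

# Assumption: the redirected stream used a legacy code page like cp1252.
# Encoding the link by hand with that code page triggers the same kind of error.
try:
    link.encode("cp1252")
except UnicodeEncodeError as exc:
    message = str(exc)
    first_bad_index = exc.start  # index of the first character that cannot be encoded

print(message)
print(first_bad_index)
```

The console shows the links correctly because it is written to as Unicode, while a redirected stream falls back to the locale's code page, which has no mapping for the Japanese characters.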

Ack, I found a “solution”, but using encode("utf-8") only on lines 41 and 46 makes each URL print as “b'<URL>'” (a lowercase b and two apostrophes). Is there a way to fix that?
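
A possible explanation, sketched below: encode("utf-8") returns a bytes object, and printing a bytes object shows its repr, which is where the b'...' wrapper comes from. One way around it is to write the encoded bytes to a binary stream instead of passing them to print(). The helper name `write_line` is hypothetical and is not part of extractor.py.

```python
import io

link = "/tags/ファンタジー"  # one of the internal links from the output above

# str() of a bytes object is its repr, hence the b'...' wrapper in the file.
wrapped = str(link.encode("utf-8"))

def write_line(text, binary_stream):
    """Hypothetical helper: write text plus a newline as raw UTF-8 bytes."""
    binary_stream.write(text.encode("utf-8") + b"\n")

# An in-memory stream stands in for sys.stdout.buffer here.
buffer = io.BytesIO()
write_line(link, buffer)
```

In the script itself the call would presumably look like `write_line(link, sys.stdout.buffer)`. Mixing this with ordinary print() calls can reorder the output unless sys.stdout is flushed first.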

To save the output to a file, you can run:

% python3 extractor.py https://com.puter.tips > out.txt
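
If the redirect still hits the encoding error, the script could also write the file itself with an explicit encoding. This is only a sketch: `save_links` and the sample list are hypothetical and do not come from extractor.py. On Python 3.7 or newer, calling `sys.stdout.reconfigure(encoding="utf-8")` at the start of the script may be a smaller change that keeps the redirect working.

```python
import os
import tempfile

def save_links(path, links):
    """Hypothetical helper: write one link per line, always encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for item in links:
            handle.write(item + "\n")

# Two links taken from the console output earlier in this issue.
sample = ["/tags/主人公", "https://www.uchinokomato.me/about"]

out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
save_links(out_path, sample)

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as handle:
    saved = handle.read().splitlines()
```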