Remove the hyperlink text from lyrics scraper
bdubs1991 opened this issue · 4 comments
When you use your package to scrape lyrics, it includes text for the hyperlinks at the end of the lyrics; see the attached screenshot. For a reproducible example, I have attached this in a Jupyter notebook.
This can be removed with some regex code I have created below. I am agnostic about whether this should be done for all lyrics or only when remove_section_headers=True is selected.
Potential Solution:
hyperlinks_removed = re.sub(r"[0-9]+EmbedShare URLCopyEmbedCopy",'',lyrics)
Example for reproduction
import lyricsgenius as lg
import genius_token as gt
genius = lg.Genius(gt.token,  # Client access token from Genius Client API page
                   skip_non_songs=True, excluded_terms=["(Remix)", "(Live)"],
                   remove_section_headers=True)
songs = (genius.search_artist('Kanye-west', max_songs=1, sort='popularity')).songs
s = [song.lyrics for song in songs]
print(s[0][-30:])
Thanks for the regex, I had exactly the same problem.
I slightly modified the regex to
hyperlinks_removed = re.sub(r"[0-9]*URLCopyEmbedCopy",'',lyrics)
because the other one failed for songs that had zero shares.
Yeah, I'm hitting this as well. I'll add the regex to my code as a short-term fix.
I had to slightly modify Vuizur's solution because it was only matching the URLCopy part
(JavaScript):
let re = /[0-9].*URLCopyEmbedCopy/
lyrics.match(re)
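For anyone comparing the patterns above, here is a minimal Python sketch that tries to cover both cases raised in this thread: a share count that may be missing, and the full "EmbedShare URLCopyEmbedCopy" text rather than only its tail. It assumes the residue always sits at the very end of the lyrics and looks exactly like the strings quoted above; `strip_footer` is a hypothetical helper name, not part of lyricsgenius.

```python
import re

# Assumed shape of the trailing residue, taken from the patterns quoted in
# this thread: an optional share count followed by the widget text.
# Anchoring with $ keeps digits elsewhere in the lyrics untouched.
FOOTER = re.compile(r"[0-9]*EmbedShare URLCopyEmbedCopy\s*$")


def strip_footer(lyrics: str) -> str:
    """Remove the trailing embed/share text if present (hypothetical helper)."""
    return FOOTER.sub("", lyrics)


if __name__ == "__main__":
    # Made-up strings for illustration, not real scraped lyrics.
    print(strip_footer("Last line of the song12EmbedShare URLCopyEmbedCopy"))
    print(strip_footer("Last line of the songEmbedShare URLCopyEmbedCopy"))
    print(strip_footer("99 problems in verse 1\nLast line"))
```

One caveat: if the lyrics themselves end in a digit, that digit is indistinguishable from the share count and would be stripped too, so this is a stopgap rather than a proper fix in the scraper.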
Thanks! I was wondering why my song lyrics had this in the output!