johnwmillr/LyricsGenius

Remove the hyperlink text from lyrics scraper

bdubs1991 opened this issue · 4 comments

When you use your package to scrape lyrics, the output includes the text for the hyperlinks at the end of the lyrics (see the attached screenshot). For a reproducible example, I have attached a Jupyter notebook. This text can be removed with the regex I have written below. I am agnostic as to whether this should be applied to all lyrics or only when remove_section_headers=True is selected.

Potential Solution:
hyperlinks_removed = re.sub(r"[0-9]+EmbedShare URLCopyEmbedCopy",'',lyrics)

Example for reproduction

import lyricsgenius as lg
import genius_token as gt
genius = lg.Genius(gt.token,  # Client access token from Genius Client API page
                   skip_non_songs=True, excluded_terms=["(Remix)", "(Live)"],
                   remove_section_headers=True)

songs = (genius.search_artist('Kanye-west', max_songs=1, sort='popularity')).songs
s = [song.lyrics for song in songs]

print(s[0][-30:])

Thanks for the regex, I had exactly the same problem.

I slightly modified the regex to
hyperlinks_removed = re.sub(r"[0-9]*URLCopyEmbedCopy",'',lyrics)
because the other one failed for songs that had zero shares.

Yeah, I'm hitting this as well. I'll add the regex to my code as a short-term fix.

I had to slightly modify Vuizur's solution because it was only matching the URLCopy part.

(JavaScript):
let re = /[0-9].*URLCopyEmbedCopy/
lyrics.match(re)

Thanks! I was wondering why my song lyrics output included this!
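
For anyone landing here, the patterns in this thread can be combined into one helper. This is only a sketch. It assumes the trailing text is always an optional share count followed by the literal "EmbedShare URLCopyEmbedCopy", which is inferred from the regexes above rather than from the library itself. The name strip_embed_footer is hypothetical and is not part of lyricsgenius.

```python
import re

# Assumed footer shape (inferred from the regexes in this thread):
# zero or more digits (the share count), then the literal widget text,
# then optional trailing whitespace at the very end of the lyrics.
_FOOTER = re.compile(r"[0-9]*EmbedShare URLCopyEmbedCopy\s*$")


def strip_embed_footer(lyrics: str) -> str:
    """Return lyrics with the trailing embed/share text removed, if present."""
    return _FOOTER.sub("", lyrics)


if __name__ == "__main__":
    # Made-up sample strings, not real scraper output.
    print(repr(strip_embed_footer("Last line of the song12EmbedShare URLCopyEmbedCopy")))
    print(repr(strip_embed_footer("Last line of the songEmbedShare URLCopyEmbedCopy")))
```

Anchoring the pattern to the end of the string keeps it from touching anything in the middle of the lyrics. One caveat: if a lyric itself ends in digits directly before the footer, those digits are removed too, which is also true of the original patterns.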