Post by NJtoTX on Sept 12, 2017 22:42:52 GMT
Post by Admin on Sept 13, 2017 0:13:35 GMT
It's evolution, baby!
Post by cupcakes on Sept 13, 2017 18:24:03 GMT
Post by Terrapin Station on Sept 13, 2017 18:32:53 GMT
Yeah, I was wondering that, too. They have terms like "paperback," "cuerpo," and "like diamond" on there. There's no way those were among the most common terms in lyrics.
Post by Deleted on Sept 13, 2017 23:33:10 GMT
Because once everything's been done enough times, it's deemed necessary to push the envelope to stay relevant.
Post by NJtoTX on Sept 13, 2017 23:39:07 GMT
I used the XML and RCurl packages to scrape song and artist names from each Wikipedia entry. I then used that list to scrape lyrics from sites with predictable URL strings (for example, metrolyrics.com uses metrolyrics.com/SONG-NAME-lyrics-ARTIST-NAME.html). If the first site's scrape failed, I moved on to the second, and so on. About 78.9% of the lyrics were scraped from metrolyrics.com, 15.7% from songlyrics.com, and 1.8% from lyricsmode.com. About 3.6% (187/5100) were unavailable.

The dataset has 5100 observations with the features rank (1-100), song, artist, year, lyrics, and source. The artist feature is fairly standardized thanks to Wikipedia, but there is still quite a bit of noise when it comes to artist collaborations (Justin Timberlake featuring Timbaland, for example). Errors in the scraped lyrics, such as spelling mistakes or variants like "nite" instead of "night," haven't been corrected.

The method seems to be: attribute the lyrics to a decade, break them into ngrams, calculate the log likelihood for each ngram/decade pair, and rank those scores to produce a list of the most characteristic words/phrases for each decade. kaylinwalker.com/50-years-of-pop-music/
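The pipeline described above can be sketched roughly like this. The original work used R's XML and RCurl packages; this is a minimal Python sketch, and both the slug rules in `metrolyrics_url` and the use of Dunning's log-likelihood (G2) in `characteristic_terms` are my assumptions about the details, not the author's actual code.

```python
import math
import re
from collections import Counter

def metrolyrics_url(song, artist):
    """Build the predictable URL pattern quoted above:
    metrolyrics.com/SONG-NAME-lyrics-ARTIST-NAME.html
    (the exact slug rules are an assumption)."""
    slug = lambda s: re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
    return f"http://www.metrolyrics.com/{slug(song)}-lyrics-{slug(artist)}.html"

def log_likelihood(a, b, c, d):
    """Dunning log-likelihood (G2) for a term occurring `a` times in a
    target corpus of `c` tokens and `b` times in a reference corpus of
    `d` tokens; higher means more characteristic of the target."""
    e1 = c * (a + b) / (c + d)  # expected count in target corpus
    e2 = d * (a + b) / (c + d)  # expected count in reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def characteristic_terms(decade_tokens, other_tokens, top_n=10):
    """Rank the terms most characteristic of one decade's lyrics
    against all other decades pooled together."""
    target, ref = Counter(decade_tokens), Counter(other_tokens)
    c, d = sum(target.values()), sum(ref.values())
    scores = {t: log_likelihood(target[t], ref.get(t, 0), c, d) for t in target}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy example: "twist" is overrepresented in the 60s sample.
print(metrolyrics_url("Twist and Shout", "The Beatles"))
print(characteristic_terms("twist and shout twist again let's twist".split(),
                           "love love baby love night baby yeah".split(),
                           top_n=1))
```

The same scoring extends from single words to bigrams or trigrams by counting ngrams instead of individual tokens before passing them in.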