At Next Big Sound, we have always been fascinated by the power of data to predict tomorrow’s music stars. Recently we developed an algorithm that creates a list of the emerging artists who are most likely to break out this year. Over time we tweaked this formula to enhance its forecasting ability, to the point that we’ve been able to patent its powers of prediction. This article describes how to pinpoint breakout artists 500 times better than random chance, up to a year in advance.

For instance, in June 2012, both Kendrick Lamar and A$AP Rocky had released acclaimed mixtapes but no studio albums or hit Billboard songs. We predicted both to explode based on their social media data in June 2012. Their debut studio albums opened at #1 and #2 months later on the Billboard 200, with Lamar now counting a total of nine singles that have hit the Hot 100 and A$AP Rocky a total of three. 



Each week Billboard releases the Hot 100, charting the most popular songs in the U.S. based on sales, radio, and online streaming. We chose this as the indicator of success because the chart represents a long-running, consistent, and clearly defined measure of the often-debated concept of success in the music industry. From July 2012 to July 2013, a total of 250 different artists appeared on the Billboard Hot 100 chart. Excluding the majority who had charted before, only 44 of these were true breakouts, in the sense that they achieved a Hot 100 hit song for the very first time.

These are the artists:

Sammy Adams, Little Mix, Greg Bates, Chief Keef, Lil Reese, Juicy J, Foxes, Tyler Farr, TJR, Emeli Sandé, Kacey Musgraves, Ariana Grande, Slaughterhouse, Cedric Gervais, Swedish House Mafia, Labrinth, Cher Lloyd, Detail, Olly Murs, Kendrick Lamar, The Weeknd, Icona Pop, Kid Ink, Zedd, Krewella, Ed Sheeran, ASAP Rocky, Casey James, Becky G, Alabama Shakes, Thomas Rhett, The Neighbourhood, Capital Cities, Driicky Graham, Florida Georgia Line, Brett Eldredge, Guy Sebastian, Passion Pit, Britt Nicole, Rizzle Kicks, Hadouken!, Friends, Amanda Brown, Nero

How rare are these elusive breakout artists? We found 130,000 artists with up-to-date social media data in July 2012. Randomly choosing the hit artists from that group would yield a success rate of only 0.03% (44/130,000).

In an attempt to pinpoint these artists more accurately, we used the data available in July 2012 to generate a predictive model. The inputs were each artist’s total, daily change, and rate of growth for fans, plays, and views on eight different networks, including Facebook, Twitter, SoundCloud, and YouTube. We then identified the artists who charted on the Hot 100 before July 2012 to use as training cases for what a breakout pattern looks like. The resulting supervised learning model, outlined in more detail below, was applied to artists’ data as of July 1, 2012 to generate their likelihood of success in the 12 months from July 2012 to July 2013.

The algorithm’s top 44 predicted artists included six who went on to reach the Billboard Hot 100, a success rate of 14%. The top 100 predicted breakouts did even better with 16 hits, a success rate of 16%. That is over 500 times better than random chance, and 16 times better than selecting the artists with the most YouTube views.


Imagine you are a record label investing in 100 potential up-and-coming artists. We are correctly identifying 16 out of the 44 breakout artists, or 36%, in this batch of suggested targets. In other words, we can give you a list of 100 artists today and expect over a third of the breakthrough artists next year to come from that list.
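The arithmetic behind these figures can be checked in a few lines. This is only a worked calculation from the numbers quoted in this article; the variable names are ours. Note that the lift comes out at roughly 470 against the unrounded baseline and roughly 530 against the rounded 0.03% figure, which appears to be the basis for "over 500 times."

```python
# Figures quoted in the article.
total_artists = 130_000  # artists with up-to-date social media data, July 2012
breakouts = 44           # first-time Hot 100 artists, July 2012 to July 2013
hits_in_top_44 = 6
hits_in_top_100 = 16

baseline = breakouts / total_artists          # success rate of a random pick
precision_at_44 = hits_in_top_44 / 44         # share of the top 44 that charted
precision_at_100 = hits_in_top_100 / 100      # share of the top 100 that charted
recall_at_100 = hits_in_top_100 / breakouts   # share of all breakouts captured

lift_exact = precision_at_100 / baseline      # against the unrounded baseline
lift_rounded = precision_at_100 / 0.0003      # against the quoted 0.03% baseline

print(f"baseline:         {baseline:.4%}")
print(f"precision at 44:  {precision_at_44:.1%}")
print(f"precision at 100: {precision_at_100:.1%}")
print(f"recall at 100:    {recall_at_100:.1%}")
print(f"lift (exact):     {lift_exact:.0f}x")
print(f"lift (rounded):   {lift_rounded:.0f}x")
```

Precision answers "how many of my picks hit?", while recall answers "how many of the eventual hits were on my list?" Both matter to a label deciding how long a list it can afford to act on.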

Here is that list of top 100 artists from July 2012, ranked in order of predicted success chance via text size, along with an indication in red of whether each artist did in fact reach the Hot 100:

image


Predicting future breakouts with data is a well-defined supervised machine learning problem, using an algorithm to generalize from past instances to predict unseen future cases. Many different techniques for supervised learning exist; here we employ stochastic gradient boosting, optimized for accuracy and performance by creating an ensemble of simpler classification tree models with subsampling. 
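To make the idea concrete, here is a minimal sketch of stochastic gradient boosting for a yes/no outcome, written in plain Python. It is not our production model: it uses one-split trees (stumps), fits each one to the log-loss residuals on a random subsample of artists, and runs on made-up toy data. The function names, the two features, and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, r, idx):
    """Least-squares one-split tree fitted to residuals r, using only rows idx."""
    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in idx})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r[i] for i in idx if X[i][j] <= t]
            right = [r[i] for i in idx if X[i][j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # sampled rows are identical: fall back to a constant
        mean = sum(r[i] for i in idx) / len(idx)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    j, t, left_value, right_value = stump
    return left_value if x[j] <= t else right_value

def fit_sgb(X, y, n_rounds=50, learning_rate=0.3, subsample=0.75, seed=0):
    """Boost stumps on log-loss residuals; y must contain both 0s and 1s."""
    rng = random.Random(seed)
    n = len(X)
    p0 = sum(y) / n
    f0 = math.log(p0 / (1.0 - p0))      # start from the base-rate log-odds
    scores = [f0] * n
    stumps = []
    k = max(2, int(subsample * n))
    for _ in range(n_rounds):
        idx = rng.sample(range(n), k)   # the "stochastic" part: a random subsample
        residuals = [y[i] - sigmoid(scores[i]) for i in range(n)]
        stump = fit_stump(X, residuals, idx)
        stumps.append(stump)
        for i in range(n):
            scores[i] += learning_rate * predict_stump(stump, X[i])
    return f0, learning_rate, stumps

def predict_proba(model, x):
    f0, learning_rate, stumps = model
    return sigmoid(f0 + learning_rate * sum(predict_stump(s, x) for s in stumps))

# Toy data: two hypothetical standardized metrics per artist, label 1 = charted.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.4], [0.2, 0.3],
     [0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.9, 0.7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = fit_sgb(X, y)
print(predict_proba(model, [1.0, 1.0]))  # high on both metrics
print(predict_proba(model, [0.0, 0.0]))  # low on both metrics
```

Fitting each small tree on a different random subsample is what distinguishes the stochastic variant from plain gradient boosting; it tends to reduce overfitting and speeds up each round.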

The inputs for the model comprise transformations for each of the network metrics: Facebook page likes, YouTube video views, Twitter mentions, etc. The transformations – calculated and stored in our Find database designed specially for A&R purposes – include lifetime totals, daily and % change over the last 7, 30 and 90 days, and a virality measure of exponential growth, calculated by fitting a second-order polynomial and combining the magnitude of the coefficient with the goodness of fit.
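As an illustration of these transformations, the sketch below computes trailing-window changes and a virality score for one metric's daily time series. The exact way we combine the polynomial coefficient with the goodness of fit is not spelled out here, so the product of the quadratic coefficient and R² used below is an assumption, as are the function names.

```python
def window_changes(series, windows=(7, 30, 90)):
    """Average daily change and percent change over each trailing window."""
    out = {}
    latest = series[-1]
    for w in windows:
        if len(series) <= w:
            continue  # not enough history for this window
        start = series[-1 - w]
        out[f"daily_change_{w}d"] = (latest - start) / w
        out[f"pct_change_{w}d"] = (latest - start) / start * 100 if start else None
    return out

def solve(A, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_quadratic(ys):
    """Least-squares fit of y = a*t^2 + b*t + c for t = 0..n-1; returns (a, b, c, r2)."""
    ts = range(len(ys))
    s = [sum(t ** k for t in ts) for k in range(5)]           # power sums of t
    rhs = [sum(y * t ** k for t, y in zip(ts, ys)) for k in (2, 1, 0)]
    A = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    a, b, c = solve(A, rhs)
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (a * t * t + b * t + c)) ** 2 for t, y in zip(ts, ys))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return a, b, c, r2

def virality(ys):
    """Assumed combination: curvature weighted by how well the curve fits."""
    a, _, _, r2 = fit_quadratic(ys)
    return a * r2

steady = [100 + 10 * t for t in range(91)]        # linear growth
accelerating = [100 + t * t for t in range(91)]   # growth that speeds up
print(window_changes(steady))
print(virality(steady), virality(accelerating))
```

A steadily growing artist scores near zero on this measure, while an artist whose daily gains keep increasing scores high, which is the pattern the virality input is meant to surface.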


With the model procedure, training data, and inputs in place, validation testing was the final step. The testing process unearthed three interesting challenges, which are likely to be common issues for anyone performing social media analytics:
  • named entity matching
  • adjusting for social media creep over time
  • handling missing data across networks

Incorporating artist data across multiple sources leads to a name matching issue. For example, JAY Z is often erroneously spelled as Jay-Z, Jay Z, and others. We addressed the name matching problem with a list of alternate spellings for each artist, but perfect matching continues to be an ongoing challenge.
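A simple version of the alternate-spellings approach might look like the following sketch. The alias table and function names are hypothetical; a real table would be far larger and curated by hand.

```python
import re
import unicodedata

def normalize(name):
    """Lowercase, strip accents, and drop everything except letters, digits, and '$'."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9$]", "", no_accents.lower())

# Hypothetical alias table: canonical name -> known alternate spellings.
ALIASES = {
    "JAY Z": ["Jay-Z", "Jay Z", "Jayz"],
    "A$AP Rocky": ["ASAP Rocky", "Asap Rocky"],
}

def build_lookup(aliases):
    """Map every normalized spelling to its canonical artist name."""
    lookup = {}
    for canonical, alternates in aliases.items():
        for spelling in [canonical, *alternates]:
            lookup[normalize(spelling)] = canonical
    return lookup

LOOKUP = build_lookup(ALIASES)

def canonical_name(raw):
    """Return the canonical name, or None when the spelling is unknown."""
    return LOOKUP.get(normalize(raw))

print(canonical_name("jay-z"))
print(canonical_name("ASAP  Rocky"))
print(canonical_name("Emeli Sandé"))
```

Normalization catches differences in case, spacing, punctuation, and accents automatically, while the alias list covers substitutions such as "$" for "S" that no general rule would handle safely.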

A Facebook page like today is not the same as a page like a year ago. To counter the natural growth of social media over time, we transformed each metric on the inverse hyperbolic sine scale, then standardized each to have mean 0 and variance 1. The inverse hyperbolic sine behaves like a log transformation for large values, but is also defined for negative values and zeros, a key advantage when dealing with social media data.
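Here is a minimal sketch of that transformation using only the Python standard library. Standardizing all artists' values for one metric within a single snapshot is our reading of the step above, and the sample numbers are invented.

```python
import math
import statistics

def asinh_standardize(values):
    """Apply asinh, then rescale to mean 0 and variance 1 across artists."""
    transformed = [math.asinh(v) for v in values]
    mean = statistics.fmean(transformed)
    std = statistics.pstdev(transformed)
    if std == 0:
        return [0.0 for _ in transformed]  # every artist has the same value
    return [(t - mean) / std for t in transformed]

# For large x, asinh(x) is nearly log(2x); unlike log, it also handles 0 and negatives.
print(math.asinh(1_000_000), math.log(2_000_000))
print(math.asinh(0), math.asinh(-50))

# Invented daily changes in page likes for five artists on one date.
daily_change = [-50, 0, 120, 4000, 2_500_000]
print(asinh_standardize(daily_change))
```

Because each value is expressed relative to the distribution at the time of the snapshot, an artist's score stays comparable from one year to the next even as overall activity on a network grows.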

Missing values are another common issue when dealing with social media data, especially when incorporating multiple networks for an entire set of artists. One key finding from our testing was that the missing at random (MAR) assumption does not hold for artists across networks. The absence of an artist on a particular network is not random and actually has predictive value for the likelihood of future success. As a result, we account for missing values with surrogate variables.
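One simple way to let a model learn from absence is to add an explicit "missing" flag next to each metric. The sketch below shows that simpler technique as a stand-in; it is not the surrogate-variable approach described above, and the metric names and fill value are assumptions.

```python
def add_missing_indicators(rows, metrics, fill_value=0.0):
    """Fill gaps with a constant and record which values were absent."""
    out = []
    for row in rows:
        features = {}
        for metric in metrics:
            value = row.get(metric)
            features[metric] = fill_value if value is None else value
            features[f"{metric}_missing"] = 1 if value is None else 0
        out.append(features)
    return out

# Hypothetical artists: the second has no SoundCloud presence at all.
artists = [
    {"facebook_likes": 12000, "soundcloud_plays": 3400},
    {"facebook_likes": 800},
]
for features in add_missing_indicators(artists, ["facebook_likes", "soundcloud_plays"]):
    print(features)
```

With the flag in place, the model can treat "not on this network" as a signal in its own right instead of confusing it with a true count of zero.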

Finally, we leave you with the list of the 100 artists we deem most likely to reach the Billboard Hot 100 in the next year starting today. For the top 25, we include a short description of why they are hot right now. Which ones do you pick to be the breakouts? Leave your comments below!

1. Schoolboy Q - Performed at BET Hip Hop Awards, has track on NBA Live 2014
2. Maluma - Performed at 2013 Latin Grammys, nominated for Best New Artist
3. Hedley - Released new album in November, performed at 2013 Grey Cup halftime
4. Noel Schajris - New music video out for new single, album coming soon
5. Reik - Released new live album, En Vivo desde el Auditorio Nacional and a new single
6. Camila - Performed at 2013 Latin Grammys
7. Ben Howard - "Oats in the Water" featured in a recent episode of The Walking Dead
8. Axwell - Released two singles in 2013, both reaching number one on Beatport
9. Kalimba - coach of The Voice Peru, performed new single on show
10. Band Of Skulls - Announced 2014 tour and new album, "Himalayan," and single
11. Calle 13 - Released new single off upcoming album (March 2014)
12. Marco Antonio Solis - Recently released a duet with Enrique Iglesias
13. Alejandro Fernandez - Just started a highly anticipated US tour
14. Gerardo Ortiz - Musical guest on Mira Quien Baila, recent single release
15. Carlos Vives - Just won Latin Grammy for song of the year
16. Voz De Mando - Regional Mexican group who performed at the Latin Grammys
17. BANKS - Her latest album named "Best of 2013" by iTunes, supporting the Weeknd
18. Cut Copy - Aussie electronic pop crew hit up Late Night with Jimmy Fallon Nov. 19
19. R5 - released a new single Aug. 30 and just announced a new world tour last week.
20. Franco de Vita - won a Latin Grammy in 2012 and is on tour in the US right now
21. David Bisbal - new album, new hairstyle, recent breakup
22. Curren$y - new mixtape out, new track with Wiz Khalifa, signed Mary Gold to roster
23. Fitz & The Tantrums - currently on tour with Capital Cities, current popular single
24. J Balvin - popular new album released late October, featured in many press articles
25. Childish Gambino - promoting new album coming Dec 9, new song releases out

The rest: Naughty Boy, Los Tigres del Norte, Gloria Trevi, Arcangel, Bonnie McKee, Abraham Mateo, Iggy Azalea, James Arthur, Local Natives, Colton Dixon, Jessica Sanchez, Banda El Recodo, la fouine, Metronomy, Deitrick Haddon, Emmanuel, Brandon Flowers, Alesso, Santigold, Bombay Bicycle Club, Chris Tomlin, Intocable, Newsboys, Armin van Buuren, Mikky Ekko, Death Grips, Shane Harper, OV7, Victor Manuelle, Julian Casablancas, American Authors, tobyMac, Jeremy Camp, Dean Brody, Nervo, Hillsong United, Priyanka Chopra, Black Veil Brides, The 1975, Third Day, foxes, Cash Cash, Jake Bugg, Hollywood Undead, Axel, Matthew West, Lamb of God, Manchester Orchestra, Eros Ramazzotti, Calibre 50, Ricardo Arjona, Rich Kidz, Stereophonics, Linda Teodosiu, Mandisa, Casting Crowns, Natalia Kills, Amos Lee, Yuna, Mat Zo, Pablo Alboran, Rico Love, Katy B, Conor Maynard, Kany Garcia, Oak Ridge Boys, B. Smyth, Tenth Avenue North, Alejandra Guzman, Jarabe de Palo, Steel Panther, CA$H OUT, La Factoria, Blood Orange, Red Café