Making Next Big Sound

Posts from the team

Iterating on Iterations - The Year-Long Evolution of the Way We Work at Next Big Sound

dzwieback

Jun 13, 2014

[Originally posted in two parts on popforms.com.]

With epidemically low employee engagement, being highly effective and happy at work is an exception, not the norm. Why are we failing to engage at work? Why are healthy, high performance teams so rare?

At least part of the answer to these troubling questions lies in the fact that most companies are organized in inflexible, hierarchical, command-and-control silos. These organizational structures are arguably ill-suited even to the assembly lines where they originated over a century ago, let alone today’s knowledge workforce. Even more surprisingly, of the many companies that have adopted a modern, iterative approach to product development (known as “Lean” or “Agile”), only a few take the same iterative approach to their organizations.

At Next Big Sound, we are committed to iterating not only on our products but also on the way that we work.

Our fundamental approach is rooted in openness. We want everyone to be directly involved in deciding to work in a particular way, and able to easily learn the history and the rationale behind past decisions. We continuously ask “Why?” and never settle for “I don’t know” or, worse, “Because we’ve always done it this way.”

Organizations are complex systems that exhibit surprising, emergent behaviors. We can’t predict the future–no one really knows if changing the organization in a specific way will have the intended results, in the same way that no one knows if adding a new feature will help make software successful. (Though we do know a thing or two about who might enter the Billboard 200 next year). However, we are an organization that’s willing to quickly experiment with various ways of working and adjust based on what we’ve learned.

This is a story of the evolution of the way that we work at Next Big Sound, a record of the things we’ve tried and tweaked over the last year.

Since July 2013, we’ve been iteratively building a healthier, more flexible, high-performance organization, a place where highly engaged and happier folks could do some of the best work of their lives.

We’re far from done, and this account is also meant to provide the necessary context and encouragement for everyone to continue asking “Is there a better way to do it?”

July 2013: Creating product-focused, self-organized teams

In the spring of 2013, we were working in several product teams and a “core” team which was responsible for infrastructure, storage, and our API. The idea was that each of these teams would have all (human) resources necessary to do the work planned for each product. In reality, though, it wasn’t always clear what teams were (or should be) working on, and there was sometimes a lack of focus with multiple projects going on at the same time within each team.

To address some of these issues, the entire company gathered to discuss a proposal for working in a different way.

First, we agreed to do away with strictly product-focused teams, and instead introduced project-focused teams. We defined a “project” as 2-4 weeks of focused work, and agreed that there would only be one project at a time per team. We also encouraged everyone to keep the teams small, in order to minimize communication overhead and maximize speed, and independent, in order to minimize external dependencies.

Before the start of the project, each team would scope the work and define a clear and measurable outcome. At the project’s completion, we would show everyone the progress during a “demo day”. Teams would also conduct retrospectives to learn what we did well or could do better.

We also outlined the role of management as simply “to provide clear business goals, and to help teams maximize productivity, minimize distraction, and to remove roadblocks”. You might notice that, by omission, the role of management was (and still is) not to tell people what to do, or how to do it.

Instead of top-down management, teams would self-organize and self-manage, with everyone encouraged to take on the team lead role. (In fact, as of today, everyone at the company has served as a team lead on at least one project.) At the time, the role of team leads was loosely defined, with the main focus on ensuring communication within the team and with the rest of the company. We offered some loose guidelines, but each team had the choice to follow, not follow, or amend them.

Over the previous four years and several versions of our flagship product, we had accumulated significant technical debt: an aging storage system nearing capacity, and two similar but not identical versions of our analytics dashboard in production. With that in mind, we agreed to have at least one “non-project” team to pay down technical debt and fix bugs at all times.

At the time, we thought that the most significant difference from prior iterations was that teams would now self-organize to complete specific projects. That is, people could join (or ask others to join) a team at any time, not just at the beginning of a project.

In retrospect, the more important change that we agreed to try was a new method of working that we later started calling “self-selection”. A year later, it is still a cornerstone of the way that we work at Next Big Sound: you get to pick what you work on, whom you work with, and where you work.

This is not a startup “perk”, or a recruiting tactic; it is rooted in a deeply held belief that everyone should have the autonomy to work in the most engaged way that makes them happiest and thus most productive.

After a brief discussion, we dove right in, self-organizing project teams and selecting team leads. Watching folks self-select into projects was nerve-racking (Will it actually work? Will people select the difficult, unglamorous, but critical projects aimed at paying down our technical debt? Will it all devolve into chaos?), and yet it proceeded in a remarkably matter-of-fact fashion.

August 2013: Arrival of the BugBusters

With a month of working in this new way under our belt, everyone in the company met to discuss what we’d learned and to see how we might improve. While initial results were very positive – the sharpened focus and the self-selecting teams were working well – we decided we needed to do three things: shorten the length of projects (now called “iterations”); clarify the mechanics of an iteration; and reduce interruptions.

Interestingly, most of the teams had chosen one month as the length of the first iteration. A month is practically an eternity in startup time; accurately planning such a significant amount of work is difficult in any organization. Most teams had experienced significant changes in the scope of their iterations, which either ballooned unexpectedly or had to be cut before completion. As a result, we decided to limit the length of iterations to two weeks, a practice that we stuck with until April of 2014.

There was also some confusion about the mechanics of iterations, which we clarified by specifying things like when iteration scopes should be defined, where they should be documented, and when retrospectives should be conducted. Because iterations were now fixed-length, it became possible to start and end all iterations on the same day (typically every other Wednesday), which became a company-wide demo day.

We also noticed that there was a non-trivial amount of work required to fix bugs or data import issues, and address systems/ops-related alerts or outages. We wanted to keep these interruptions to a minimum, so we introduced an evolution of the “non-project” team idea: a 1-week BugBusters engineer rotation, which started each Wednesday at noon.

The BugBuster was tasked with incident management, i.e., triage of any issue or bug that might arise during the rotation. The BugBuster was not expected to be able to fix every issue, and would ask for help as necessary. Things that couldn’t be fixed quickly (e.g., within about a day) would become projects and go through the normal prioritization and self-selection processes.

BugBusters was perhaps the most impactful change we introduced at this time, especially because, in recognition of the community-service aspect of the rotation, engineers could now choose to spend an entire hack week before or after their rotation working on whatever they wanted at Next Big Sound.

This currently adds up to about four weeks of individual hack time per year for each engineer (with an additional two weeks of company-wide hack days). It’s also worth noting that although we recognized the critical nature of BugBusters, we did not mandate that there always be an engineer on rotation or that every engineer participate in BugBusters.

Instead, we chose to treat BugBusters as one of the projects up for self-selection, something that we had to clarify and “tune up” a few months later. Still, the BugBusters rotation (also fondly called “HackBusters”) is alive and well today, and is responsible for some of the continued innovation at Next Big Sound. Some of the projects that came out of hack weeks included Tunebot, explorations of Zipf’s Law for Facebook page likes, a Next Big Sound Charts iPhone app, and countless product experiments and improvements.

November 2013: When projects go un-selected

When people hear about self-selection, the first question is usually whether there are projects that don’t get selected. What about that “shit project” that no one wants to work on? What about projects with external customer deadlines?

Yes, it’s true, it happens: sometimes seemingly important projects don’t get selected. When this happens, no one is ever coerced or forced to work on a project that they did not select. Instead, we ask a lot of questions, starting with, not surprisingly, “Why?”.

If the project’s importance is obvious, why did it still not get selected? Was its importance clearly communicated? Does the team have the necessary skills to complete the work? Are we working on other projects that are higher priority? If we think the project is, in fact, important, we have at most two weeks to advocate for its selection for the following iteration. Otherwise, we could be thankful to the “wisdom of the crowd” for showing that the project is not as critical or time-sensitive as initially thought.

With that in mind, we introduced the concept of a project advocate, a person who could provide the necessary context to the team during self-selection. Ideally, the advocate would cover things like why this project is important to do during the coming iteration, and how it ties to company themes and goals. In addition, we decided that each project idea should explicitly list the skills required to complete it (e.g., front-end developer, Java, design).

We also noticed that the timing of communication of iteration scopes, updates, and retrospective results was somewhat haphazard. (All these are communicated via e-mail to the entire company).

Because accountability–which, in this case, is literally the responsibility to provide an account of what’s happening to the entire team–is such a critical part of self-selection, we agreed to adhere to a clear communication schedule, which specified the exact timing of initial iteration scopes, mid-iteration progress updates, scope changes, and retrospectives.

In the two months since the last tune up to the way that we were working, it also became increasingly clear that BugBusters was a project that we had to have someone select at all times (not letting it go during a week when no one selected it, as we had done before).

For at least one two-week iteration, no one selected to be on BugBusters, which actually resulted in higher interrupt levels for most engineers. In addition, certain tasks (like managing over 100 data sources) were falling disproportionately to several engineers and client services folks, who were unable to fully dedicate their time to other projects.

Most importantly, we realized that having the rotation was required for self-selection to apply fully to all engineers: we did not want to have a single engineer dedicated to BugBusters/incident management at all times, because she would not be able to fully participate in self-selection. More fundamentally, having someone on ops at all times (not just when someone wants to be) is required for us to operate the Next Big Sound service.

After a lengthy and heated all-hands discussion, we agreed to have someone on BugBusters at all times and created a place for engineers to track (and trade) their upcoming rotations. To be clear, this still does not mean that every engineer must do BugBusters (although by this point, every engineer has completed at least one rotation). Simply put, as part of self-selection, we trust everyone to do what’s right for them, their team, and NBS.

March 2014: Doing demo day

With the mechanics of self-selection mostly worked out, we continued what has been (according to the founders) the longest sustained period of high productivity in the company’s 5-year history. We next turned our attention to demo day – a high point of the iteration when everyone gets to show their work and celebrate our progress as a team.

Perhaps the best way to illustrate the issues with demo day was to compare its intention with how it was actually practiced:

In theory

  • demos are short (under 7 minutes)
  • demo day should last about 1 hour
  • demos highlight “the difference between what was initially planned and what was accomplished, including identifying any loose ends”
  • demos are explicit opportunities to learn from others, and the most salient parts of retrospectives are emphasized
  • demo day artifacts (e.g., presentations) should be easily found
  • the entire history of any iteration should be easily accessible at any time to anyone in the company
  • we intend the software that we write to be tested, documented, and shipped to production during the iteration

In practice

  • demos are sometimes short, and frequently go over 7 minutes
  • demo days have lasted as long as 2 hours
  • demos sometimes highlight the difference between the original scope and what was completed; remaining work is sometimes documented as Trello cards
  • demos sometimes refer to lessons learned and stop/start/continue items from retrospectives
  • there is no central place for demo day artifacts
  • only the original and updated scopes and results of retrospectives can be found in e-mail (if you were at the company when it was sent); it’s not always clear what was actually shipped and how that might differ from what was planned
  • the status of testing, documentation, and shipping is sometimes mentioned during demos, and not consistently documented
  • many demos use PowerPoint presentations, with “the rate of information transfer … asymptotically approaching zero”

The first thing that we nipped in the bud was the (over)use of PowerPoint. Instead, we opted to write out iteration summaries, and store them in a central place (Google Drive). We created an (optional) summary template, and agreed to keep summaries under six pages in length. We then decided to try reading the summaries simultaneously as a group during demo day, in the same way that we’ve done during several all-hands meetings (inspired by the same practice during meetings at Amazon).

With the introduction of the iteration summaries, we now had a rich, historic, narrative record of each iteration. Due to the high-bandwidth communication of the written word (and an occasional animated gif), the team now had a higher degree of awareness about the many projects going on than ever before.

However, because we also limited the actual demos to two minutes, the social aspect of presenting your project in front of the entire team was greatly diminished. The high-energy demo days became subdued. Before we could address this issue, though, we identified another gap between how we thought about the way we worked vs how we actually worked that we had to dive into first.

April 2014: Getting things done

Starting in July of 2013, we settled into the comfortable rhythm of two-week iterations. However, in reality, few big projects divide neatly into two-week chunks. As a result, either “work [expanded] so as to fill the time available for its completion” (according to Parkinson’s law), or people took on additional work to fill the available time within the iteration. Both of these situations were common, but not easily visible (which is why it took us this long to address them).

In addition, because we were emphasizing iterations that are exactly two weeks in length, shorter projects and leftover items from previous iterations that took less than two weeks to complete were effectively “second-class citizens.”

That is, two-week iterations subtly encouraged the selection of projects that took at least two weeks to complete. As a result, some projects took a long time to finish, staying at the “almost complete” mark for extended periods of time, and incurring a significant context-switching cost to get to 100% complete.

We also had a whole class of projects (namely data science reports and data journalism articles) that were much more dynamic in nature, with constantly changing priorities (driven by customers or partners) and highly variable effort required to complete them. In recognition of this, some team members were already working outside the structure of normal two-week iterations. A recent experience of cramming several smaller research projects into a two-week iteration had further highlighted the awkward fit of rigid iteration lengths for this type of work.

To address the above shortcomings of fixed length iterations, in early April, we agreed to try working in iterations that could vary in length, not to exceed two weeks. To minimize the cost of context-switching, we also encouraged folks to stay on projects for their duration (not just for an iteration). As one engineer put it, “One fully complete project is better than five halfway complete ones.”

We also agreed to make context-switching explicit: recognizing the fact that we can only work on one thing at a time, we can only mark one task as “in progress” at any given time, marking all others as “blocked”. (We use Trello for tracking our work).

We also changed demo day to accommodate the new, variable-length iterations. During the bi-weekly demo day, folks present the results of any iteration that has been completed before that Wednesday. That might be one or 10 iterations. If an iteration was not fully completed before a particular demo day, its results would be presented during the next one.

We’ve also returned to more interactive demo days. While we still ban PowerPoint and generate iteration summaries (which everyone is encouraged to read offline), we’ve shifted the focus of demo days back on actual demos.

Today

What has worked in the past may not work in the future. As we grow, we remain committed to collaboratively iterating not only on our products, but also on the way that we work.

Unwilling to settle for cookie cutter approaches, we’ll continue to experiment until we find methods that best fit our culture and the challenges ahead.

This evolution is not driven solely by management; in fact, many of the changes described above were championed by designers, client services folks, data journalists, data scientists, and engineers. We strongly believe that this is one of our competitive advantages, and one of the conditions that helps people at Next Big Sound stay highly engaged.

Acknowledgements

I would like to thank my colleagues Liv Buli and Karl Sluis for their excellent and tireless feedback and advice on writing this story. I would also like to thank the entire Next Big Sound team for so fearlessly embracing the iterative approach to the way we work.

Zipf’s Law for Facebook Fans: Building Intuition for Big Numbers

adamhajari

May 30, 2014

Have you heard of Lydia Loveless? She is an alt-country singer songwriter from Ohio with three full length albums under her belt, and she has played more than 150 shows across the United States.

Loveless is more popular than 90% of all musicians on Facebook. With more than 7,000 page likes, she is 16 times more popular than the typical* artist. When viewed from this perspective, 7,000 seems like a large number. But when you consider that the average number of page likes across all artists on Facebook is 22,000 – over three times the value for the Loveless page – things are less clear.

This seeming incongruity is due to the highly skewed distribution of Facebook page likes amongst musicians. On Facebook, as with most things on the internet, a very large percentage of engagement belongs to a very small percentage of artists. One of the original observers of this “Law of Unfairness” was Vilfredo Pareto, an early 20th century Italian economist who, when looking at the distribution of land ownership in Italy, noticed that 80% of land was owned by 20% of the population. The Pareto Distribution (also known as the Power Law Distribution) still applies to the distribution of wealth today, as well as to many other naturally occurring and man-made phenomena.

A more extreme version of this skewed distribution does a decent job of characterizing the distribution of fan engagement for musicians on various social media platforms (including Facebook, Twitter, and YouTube). On Facebook, for instance, 95% of all page likes are associated with only 5% of artists’ pages. So while a successful, full-time musician like Lydia Loveless gets significantly fewer page likes than the average across all artists, Eminem’s page has 3,500 times more likes than average, 10,000 times more than the Loveless page, and 160,000 times more than the typical artist on Facebook.

Our linear world of normal distributions has done little to prepare our numerical intuition for scales of that magnitude. Fortunately, there is another phenomenon that follows a similar distribution to fan engagement with which most people have a much greater level of familiarity: word frequency in the English language.

The most popular word in the English language, “the”, accounts for 6% of all written text, and the second most popular word “be” accounts for 4% (this power law distribution for word usage was noticed by the linguist George Zipf). While the distribution of page likes on Facebook isn’t quite so top heavy, the fact that 94% of all written English text is made up of 6% of the words in the English language suggests that the two distributions are remarkably similar**. And while even music savvy folks may only be familiar with a few hundred bands, the average person knows on the order of tens of thousands of words.

The problem with a statement like “Lydia Loveless has 7,000 Facebook page likes” is that it lacks context. And even when given some context (“Eminem has 88 million Facebook page likes”), the non-linear nature of the distribution across all artists makes interpretation based on only a few data points very difficult. But a statement like “Lydia Loveless is about as popular as the word sabbatical” provides an analogical bridge from the unfamiliar language of page likes to the been-using-it-my-whole-life English language.

The Eminems and Rihannas of Facebook (i.e. those with over 70 million page likes) are about as popular as words like “it”, “with”, “for”, and “and”. The Black Keys, with a tenth the number of Eminem’s Facebook fans, is about as popular as words like “big”, “every”, and “turn”. Bluegrass prodigy and recent Grammy nominee Sarah Jarosz is about as popular as the words “plethora” and “consonant”. And the local New York City rock-band-from-the-future The Sky Captains of Industry, with 810 page likes (nearly twice the median value), is about as popular as the words “quare” and “pleuritic”.

If you’re interested in putting a little context around the “number of page likes” for a given artist try this: Imagine you’re reading a novel that is approximately the size of To Kill a Mockingbird but where all the words have been replaced by their artist name popularity equivalent (“it” replaced by “Eminem”, “turn” replaced by “The Black Keys” and so on). Divide the number of Facebook page likes for the artist of interest by 80,000 and that’s the number of times you could expect that artist to appear in the words-to-artist-names novel.

So Lady Gaga, with 62.5 million Facebook page likes, would appear in our metaphorical novel 775 times – 2 to 3 times per page. The Sky Captains of Industry, on the other hand, appear in a single novel 0.008 times, which means you’d probably have to read around 125 of these books before you’d come across a mention of “The Sky Captains of Industry”.
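For readers who want to play with the arithmetic themselves, here is a minimal Python sketch of the calculation above. The 80,000 divisor comes from the previous paragraph; the function names are illustrative, and the printed values only roughly match the rounded figures quoted in this post.

    # One "word slot" in the metaphorical novel per 80,000 Facebook page likes
    # (the divisor suggested in the thought experiment above).
    LIKES_PER_APPEARANCE = 80_000

    def appearances_in_novel(page_likes):
        """Expected number of times an artist 'appears' in the words-to-artist-names novel."""
        return page_likes / LIKES_PER_APPEARANCE

    def novels_until_first_mention(page_likes):
        """Roughly how many such novels you'd read before encountering the artist once."""
        return 1 / appearances_in_novel(page_likes)

    print(appearances_in_novel(62_500_000))   # Lady Gaga: ~780, close to the ~775 quoted above
    print(novels_until_first_mention(810))    # The Sky Captains of Industry: on the order of a hundred novels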

We’ve built a web app to help you build some Zipfian intuition for artist popularity on Facebook. Given an artist’s name, the app will show you a list of 10 words that are as popular as the given artist is on Facebook. Check it out on the Next Big Sound’s labs page: http://labs.nextbigsound.com/zipf.



*The median value of Facebook page likes is around 440. The “typical” artist is bigger than 50% of artists on Facebook, and smaller than 50% of artists on Facebook. The average value is the total page likes across all artists divided by the total number of artists on Facebook. For a normal distribution, the average value and the median value are almost the same, but for a highly skewed distribution, they can be very different.

**Statistics on word usage were calculated using data from the American National Corpus (http://www.anc.org/data/anc-second-release/frequency-data/), which aggregated unique word counts for 18.5 billion words in nearly 11,000 written documents. I use the word “word” loosely. The corpus provided by the ANC includes many proper nouns, foreign words, and misspellings.

Data Architecture @ NBS

ericczechnbs

May 13, 2014

Tracking online activity is hardly a new idea, but doing it for the entire music industry isn’t easy. Half a billion music video streams, track downloads, and artist page likes occur each day and measuring all of this activity across platforms such as Spotify, iTunes, YouTube, Facebook, and more, poses some interesting scalability challenges.

Our data growth rate has been close to exponential, and early adoption of distributed systems has been crucial in keeping up. With over 100 tracked sources from both public and proprietary providers, dealing with the heterogeneous nature of this data has required some novel solutions that go beyond the features that come for free with modern distributed databases.

We’ve also transitioned between full cloud providers (Slicehost), hybrid providers (Rackspace), and colocation (Zcolo), all while running with a small engineering staff using nothing but open source systems. There has been no shortage of lessons learned in building Next Big Sound, and what follows are some highlights of what we did and how we did it.

Stats

Platform

  • Hosting: Colocation via ZColo

  • Operating System: Ubuntu 12.04 LTS for VMs and physical servers

  • Virtualization: OpenStack (2x Dell R720 compute nodes, 96GB RAM, 2x Intel 8-core CPU, 15K SAS drives)

  • Servers: mainly Dell R420, 32GB RAM, 4x 1TB 7.2K SATA data drives, 2x Intel 4-core CPU

  • Deployment: Jenkins

  • Hadoop: Cloudera (CDH 4.3.0)

  • Configuration: Chef

  • Monitoring: Nagios, Ganglia, Statsd + Graphite, Zenoss, Cube, Lipstick

  • Databases: HBase, MySQL, MongoDB, Cassandra (dropped recently in favor of HBase)

  • Languages: PigLatin + Java for data collection/integration, Python + R + SQL for data analysis, PHP (Codeigniter + Slim), JavaScript (AngularJS + Backbone.js + D3)

  • Processing: Impala, Pig, Hive, Oozie, RStudio

  • Networking: Juniper (10Gig, redundant core layer w/ auto failover, 1 Gig access switches on racks)

Storage Architecture

Storing timeseries data is relatively simple with distributed systems like Cassandra and HBase, but managing how that data evolves over time is much less so. At Next Big Sound, aggregating data from 100+ sources involves a somewhat traditional Hadoop ETL pipeline where raw data is processed via MapReduce applications, Pig, or Hive and results are written to HBase for later retrieval via Finagle/Thrift services; but with a twist. All data stored within Hadoop/HBase is maintained by a special version control system that supports changes in ETL results over time, allowing for changes in the code that defines the processing pipeline to align with the data itself.

Our “versioning” approach for managing data within Hadoop is an extension of techniques like those used in the LinkedIn data cycle, where results from Hadoop are recomputed, in full, on a recurring basis and atomically swapped out with old result sets in Voldemort in a revertible, versioned way. The difference with our system is that versioning doesn’t occur only at the global level; it occurs at a configurable number of deeper levels. This means, for example, that if we’re recording the number of retweets by country for an artist on Twitter and we find that our logic for geocoding tweet locations was wrong for a few days, we can simply create new versions of the data for just those days rather than rebuilding the entire dataset. Different values will now be associated with each of these new versions, and access to each version can be restricted to certain users; developers might see only the newest versions, while normal users will see the old version until the new data is confirmed as accurate. “Branching” data like this has been critical for keeping up with changes in data sources and customer requests, as well as for supporting efficient, incremental data pipelines.
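To make the versioning idea more concrete, here is a deliberately tiny, hypothetical Python sketch of version-aware reads. The class, method names, audience handling, and example numbers are illustrative assumptions only, not the actual HBlocks implementation described below.

    from collections import defaultdict

    class VersionedStore:
        """Toy illustration: each (metric, day) cell keeps a list of versions, and
        readers resolve which version they are allowed to see."""

        def __init__(self):
            # (metric, day) -> list of {"version": int, "value": ..., "published": bool}
            self._cells = defaultdict(list)

        def write(self, metric, day, value, published=False):
            versions = self._cells[(metric, day)]
            versions.append({"version": len(versions) + 1, "value": value, "published": published})

        def publish(self, metric, day, version):
            # Atomically swap which version normal users see.
            for v in self._cells[(metric, day)]:
                v["published"] = (v["version"] == version)

        def read(self, metric, day, audience="user"):
            versions = self._cells[(metric, day)]
            if not versions:
                return None
            if audience == "developer":
                return versions[-1]["value"]  # developers see the newest version
            published = [v for v in versions if v["published"]]
            return published[-1]["value"] if published else None  # users see only published data

    # Re-geocoded retweet counts for a bad day become a new version; users keep seeing
    # the old (published) numbers until the fix is verified and published.
    store = VersionedStore()
    store.write("retweets_by_country", "2013-09-01", {"US": 1200, "DE": 90}, published=True)
    store.write("retweets_by_country", "2013-09-01", {"US": 1100, "DE": 95, "GB": 60})  # corrected data
    print(store.read("retweets_by_country", "2013-09-01", audience="user"))       # old, published version
    print(store.read("retweets_by_country", "2013-09-01", audience="developer"))  # newest version
    store.publish("retweets_by_country", "2013-09-01", version=2)
    print(store.read("retweets_by_country", "2013-09-01", audience="user"))       # now the corrected data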

For some extra detail on this system, this diagram portrays some of the key differences described above.

For even more details, check out our white paper for HBlocks, the system we use to make this happen.

The Hadoop infrastructure aside, there are plenty of other challenges we face as well. Mapping the relationships of entities within the music industry across social networks and content distribution sites, building web applications for sorting/searching through millions of datasets, and managing the collection of information over millions of API calls and web crawls all require specialized solutions. We do all of this using only open source tools and a coarse overview of how these systems relate to one another is shown below.

Products and Services

  • Data Presentation: The construction of our metric dashboard has always been an ongoing project guided in large part by our customers. Striking the right balance between flexibility and learning curve is a moving target with so many different datasets, and maintaining a coherent JavaScript/PHP codebase to manage it all only gets harder with each new customer and feature. Here are some highlights on how we’ve dealt with this so far:

    1. Started as simple Codeigniter app, tried to incorporate Backbone as much as possible, now shifting towards Angular (aggressively)
    2. Memcache for large static objects (e.g. country to state mappings)
    3. Local storage for metric data caching and history (i.e. when you reload a page, this is how we know what you were looking at before)
    4. Graphing all done with D3, previously used Rickshaw

    Also, we don’t do anything fancy for feature flags, but we use our basic implementation of them incessantly. They’ve been one crucial (though sometimes messy) constant in a codebase that’s consistently being rewritten, and there are many things we would have been unable to do without them. (A minimal sketch of the idea appears after this list.)

  • FIND: We’ve invested heavily in building products that give our users the ability to search through our data for interesting artists or songs based on a number of criteria (we call our premier version of this the “FIND” product). As something akin to a stock screener for music, this product lets users sort results after filtering by criteria like “Rap artists within the 30th - 40th percentile of YouTube video views that have never previously appeared on a popularity chart of some kind”. The bulk of the infrastructure for this resides in MongoDB where heavily indexed collections are fed by MapReduce jobs and provide nearly instantaneous search capabilities over millions of entities.

    Building this type of product on MongoDB has worked well but indexing limits have been an issue. We’re currently exploring systems better suited to this kind of use case, specifically ElasticSearch.

  • Internal Services: All metric data used by our products and APIs is served from an internal Finagle service that reads from HBase and MySQL. This service is separated into tiers (all running the same code) where a more critical, low-latency tier is used directly by our products and a second tier capable of much greater throughput, but with a much higher 90th percentile latency, is used by programmatic clients. The latter of the two tends to be much more bursty and unpredictable so using separate tiers like this helps to keep response times as low as possible for customers. This is also a convenient split because it means we can build smaller, virtual machines for the critical tier and then just colocate the other array of Finagle servers on our Hadoop/HBase machines.

  • Next Big Sound API: We’ve gone through a lot of iterations on the primary API we expose externally as well as use internally to power our products. Here are some of the best practices we’ve found to be the most influential:

    1. Don’t build an API that just exposes methods, build an API that models the entities in your system and let HTTP verbs (e.g. GET, PUT, POST, HEAD, PATCH, DELETE) imply the behaviors of those entities. This makes the structure of the API much easier to infer and experiment with.
    2. For methods requiring entity relationships, use something like a “fields” parameter for the main entity and let the existence of fields in that parameter imply what relationships actually need to be looked up. For us, this means that our API exposes an “artist” method with a “fields” parameter that would only return the artist’s name if the fields are set as “id,name”, but could also return data about the artist’s YouTube channel and any videos on it if the fields are set as “id,name,profiles,videos”. Fetching the relationships between entities can be expensive, so this is a good way to save database queries without having to write a bunch of ugly, combination methods like “getArtistProfiles” or “getArtistVideos”. (A hypothetical sketch of this pattern appears after this list.)
    3. Using an externally exposed API to build your own products is always a good idea, but one more subtle advantage of this we’ve seen is with the on-boarding of new web developers. We used to put a good bit of PHP code between our JS code and the API calls but are now trying to limit interactions to be strictly between JavaScript and the API. This means web devs can focus on the browser code they know so well and it plays much more nicely with their favorite frameworks like Backbone and Angular.
  • Alerts and Benchmarks: There are always a lot of things going on in the world of music, and one way we try to dig up significant events in all the noise is by benchmarking data across whole platforms (e.g. establishing overall trends in the number of Facebook likes happening every day) and by alerting our customers when the artists they care about see significant spikes in activity. We had some early scalability issues with this, but we’ve solved most of those by committing to using only Pig/Hadoop for it, with results stored in MongoDB or MySQL. The remaining issues center around how to set thresholds for what is “significant,” and our biggest takeaway so far has been that online activity tends to trend and spike globally, so baselines have to take into account as much data as possible without focusing solely on single entities (or artists in our case). Deviations from these more holistic baselines are a good indicator of real changes in behavior.

  • Billboard Charts: We license two charts to Billboard magazine, one for overall popularity of artists online (the Social 50 Chart) and one basically attempting to predict which artists are most likely to make that list in the future (the Next Big Sound Chart). Calculating these charts doesn’t introduce any dramatic scaling challenges since it’s just a reverse sort by some computed score across all artists, but producing a polished, de-duplicated, production-worthy list takes some consideration. We have a lot of problems with duplicate or near-duplicate artists within our system as well as the associations of those artists to their online profiles (e.g. Is Justin Bieber’s twitter username “justinbieber”, “bieber”, or “bieberofficial”?). Solving problems like this usually takes some combination of automation and human interaction, but when it’s very important not to have false-positives in filtering routines (i.e. removing even a single legitimate artist is a big problem), manual curation is necessary. We’ve found though that augmenting this curation with systems that remember actions taken and then have the ability to replay those actions is pretty effective and easy to implement.

  • Predictive Billboard Score: One of the more interesting analytical results we’ve ever produced is a patented algorithm for calculating the likelihood with which an artist will “breakout” in the next year. This process involves the application of a stochastic gradient boosting technique to predict this likelihood based on the “virality” of different social media numbers. The math aside, this is difficult to do because many of the tools we use for it don’t have Hadoop-friendly implementations and we’ve found that Mahout just doesn’t work beyond basic applications. Our architecture for a process like this then involves input data sets built and written to MongoDB or Impala by MapReduce jobs, pulled into R via R-Mongo and R-Impala, and then processed on a single giant server using some of R’s parallel processing libraries like multicore. Doing most of the heavy lifting with Hadoop and leaving the rest to a single server has some obvious limitations and it’s unclear exactly how we’ll eventually address them. RHadoop might be our best hope.
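Returning to the feature flags mentioned under Data Presentation above, here is a deliberately simple, hypothetical sketch in Python of the kind of basic flag implementation we have in mind; the flag names, storage, and rollout rule are illustrative assumptions rather than our actual PHP code.

    import hashlib

    # Hypothetical flag table: a flag is either a simple on/off switch or a percentage rollout.
    FLAGS = {
        "new_angular_dashboard": {"enabled": True},
        "experimental_find_filters": {"percent": 10},  # visible to roughly 10% of users
    }

    def flag_enabled(name, user_id=None):
        """Return True if the named flag should be on for this user."""
        flag = FLAGS.get(name)
        if flag is None:
            return False  # unknown flags default to off
        if "enabled" in flag:
            return flag["enabled"]
        # Deterministic percentage rollout: hash (flag, user) into a bucket from 0-99.
        bucket = int(hashlib.md5(f"{name}:{user_id}".encode()).hexdigest(), 16) % 100
        return bucket < flag["percent"]

    # Example: render the experimental FIND filters only for users in the rollout bucket.
    if flag_enabled("experimental_find_filters", user_id=42):
        print("show experimental filters")
    else:
        print("show standard filters")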
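And here is a minimal sketch of the “fields” pattern from the API notes above: the requested fields determine which lookups actually run, so cheap requests never pay for expensive relationship queries. The helper names and example data are hypothetical stand-ins, not the real Next Big Sound API code.

    # Stand-ins for real database / service lookups (illustrative only).
    def lookup_name(artist_id):
        return "Example Artist"

    def lookup_profiles(artist_id):
        return [{"service": "youtube", "channel": "exampleartist"}]

    def lookup_videos(artist_id):
        return [{"title": "Example Video", "views": 12345}]

    def get_artist(artist_id, fields="id,name"):
        """Return an artist record, performing only the lookups implied by `fields`."""
        requested = {f.strip() for f in fields.split(",")}
        artist = {"id": artist_id}
        if "name" in requested:
            artist["name"] = lookup_name(artist_id)          # cheap primary read
        if "profiles" in requested:
            artist["profiles"] = lookup_profiles(artist_id)  # extra query, only when asked for
        if "videos" in requested:
            artist["videos"] = lookup_videos(artist_id)      # expensive lookup, only when asked for
        return artist

    print(get_artist(123))                                    # {'id': 123, 'name': 'Example Artist'}
    print(get_artist(123, fields="id,name,profiles,videos"))  # also includes channel and video data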

Hosting

  • Rolling your own networking solutions sucks. If you’re going to do it as a small team, make sure you’ve got someone dedicated to the task that has done it before and if you don’t, find someone. This has pretty easily been our biggest pain point with colocation (and the cause of some pretty scary outages).

  • Moving between hosting providers is always tricky but doesn’t have to be risky if you budget for the extra money you’ll inevitably spend with machines running in both environments, doing more or less the same thing. Aside from a few unavoidable exceptions, we always ended up duplicating our architecture, in full and usually with some enhancements, within our new provider before shutting any old resources down. Sharing systems between the providers never seems to go well and usually the money saved isn’t worth the lack of sleep and uptime.

  • With ~90% of our capacity dedicated to Hadoop/HBase and a really consistent workload, it’s hard to beat the price point that came with owning your own servers. Our workloads aren’t bursty on a daily basis due to user traffic since that traffic is small compared to all the internal number crunching going on. We do have to increase capacity regularly but doing it as a step function is fine since those increases usually coincide with the acquisition of large customers or data partners. This is why we saved ~$20k/month by moving on to our own hardware.

Lessons Learned

  • If you’re aggregating data from a lot of sources and running even modest transformations on it, you’re going to make mistakes. Most of the time, these mistakes will probably be obvious and you can fix them before they make it to production, but the rest of the time, you’ll need something in place to handle them once they’re there. Here’s the sort of scenario we went through way too many times before realizing this:

    1. Collect terabyte-sized dataset for followers of Twitter artists, load it into the database in a day or two.
    2. Let customers know the data is ready, high-five ourselves for being awesome.
    3. (a month later) Wait, why do 20% of all followers live in bumblefuck, Kansas?
    4. Oh, the code that converts location names to coordinates translates “US” to the coordinates for the middle of the country.
    5. Ok, well, since customers are still using the correct part of the dataset and we can’t delete the whole thing, let’s just reprocess it, write it to a new table, update all code everywhere to read from both tables, only take records from the old table if none exist in the new one, and delete the old table after the reprocessing is done (easy, right?).
    6. A hundred lines of hacky spaghetti code (that never go away) and a few days later, job complete.

    There might be a smarter way to do things in some cases like this, but when you run into enough of them it becomes pretty clear you need a good way to update production data that can’t just be completely removed and rebuilt. This is why we went through the trouble of building a system for it.

  • Most of our data is built and analyzed using Pig. It is incredibly powerful, virtually all of our engineers know how to use it, and it has functioned very well as the backbone of our storage system. Figuring out what the hell it’s doing half the time, though, is still a work in progress, and we’ve found Lipstick, from Netflix, very helpful for that. We’ve also found that, in lieu of great visibility, keeping the length of development iterations down with Pig is crucial. Putting time into intelligently building sample input datasets for longer-running scripts that spawn 20+ Hadoop jobs is a must before trying to test them.

  • We used Cassandra for many years, beginning with version 0.4, and despite a terrifying experience early on, it was really awesome by the time we moved away from it. That move didn’t really have much to do with Cassandra; it was just a consequence of wanting to use Cloudera’s platform as we rebuilt our storage architecture. The lesson we learned after using it and HBase extensively, though, is that arguing about which to use is probably just a waste of time for most people. They both worked reliably and performed well once we understood how to tune them, and focusing on our data model (e.g., key compression schemes, capping row sizes, using variable-length integers, query access patterns) made a much bigger difference than anything else.

Predicting Next Year’s Breakout Artists

victorlovesdata

Nov 27, 2013

At Next Big Sound, we have always been fascinated by the power of data to predict tomorrow’s music stars. Recently we developed an algorithm that creates a list of the emerging artists who are most likely to break out this year. Over time we tweaked this formula, enhancing its forecasting ability to the point that we’ve been able to patent its powers of prediction. This article describes how to pinpoint breakout artists 500 times better than random chance, up to a year in advance.

For instance, in June 2012, both Kendrick Lamar and A$AP Rocky had released acclaimed mixtapes but no studio albums or hit Billboard songs. Based on their social media data, we predicted both to explode. Their debut studio albums opened at #1 and #2 months later on the Billboard 200, with Lamar now counting a total of nine singles that have hit the Hot 100 and A$AP Rocky a total of three.


Introducing the Tech Blog, Making Next Big Sound

victorlovesdata

Nov 26, 2013

We are starting this blog to share some of the techniques that we use to understand how people discover and engage with music. Over the last four years, we have collected data on hundreds of thousands of artists worldwide, drawing from fan interactions on the radio, YouTube, Twitter, iTunes, concerts, and more. In the tech blog we will delve more deeply into topics such as:

  • how we accurately predict breakout artists a year in advance
  • defining and detailing the spread of DevOps culture
  • how to use a music robot to DJ for your office
  • technical challenges we’ve overcome as a Big Data company
  • what exactly Granger causality is and why it matters
  • and more!

Enjoy a behind-the-scenes look at Making Next Big Sound!  

davidhoffman

Nov 17, 2013

Reppin’ the NBS Mousepad. 

thecool

Nov 04, 2013

We all want to change the world