Golden Nuggets at Berlin Buzzwords 2014

If you haven’t heard of Berlin Buzzwords, it’s the conference covering all the latest buzzwords surrounding the biggest buzzword of them all: Big Data. And boy is it buzzwordy! I’m doing my research in a field that somewhat overlaps with big data, but as a first-time attendee I was blown away by the sheer number of words buzzing around: ElasticSearch, Cassandra, YARN, Riak, Kibana, Storm, Mesos, ansible, docker, puppet, hadoop, cascading … I could probably go on and fill a couple more paragraphs, but let’s not waste our precious screen real estate!

Just to make it clear before it sounds like it’s all just buzz: the conference was actually a very good experience, and the location (KulturBrauerei) was indeed awesome! My personal highlight was meeting so many people with very diverse backgrounds, working on all angles of big data. For me personally, as someone not so familiar with how big data is seen outside of the academic environment, this was eye-opening! All the abstract knowledge of it being useful in the ad industry, banking, retail, etc. was finally filled with meaningful examples. For instance, I didn’t know that every time a page is accessed, real-time bidding for its ads takes place on a huge ad exchange before the page actually loads, flooding ad services with thousands of requests per second. My knowledge here was limited to Google AdSense and nothing else.

The second highlight was the keynote, where Ralf Herbrich from Amazon Berlin gave a look back on how he turned research ideas into real products. The special treat here was the story about Microsoft’s TrueSkill online matchmaking system for Xbox Live. Herbrich and his colleagues used machine learning methods to pair up gamers playing at a similar skill level, so that matches are actually interesting and don’t end with one team dominating. Combining his personal hobby of gaming with his passion for research sure sounded like the thing for me!
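
To give a flavour of the idea: TrueSkill models each player’s skill as a Gaussian that is updated after every match, and a good pairing is one with a high predicted draw probability. Here is a minimal sketch assuming the third-party trueskill Python package (an open-source implementation of the algorithm); the players and outcomes are made up.

```python
# Flavour of TrueSkill-style matchmaking. Assumes the third-party
# `trueskill` package (pip install trueskill), an open-source
# implementation of the algorithm; players and outcomes are made up.
import trueskill

# Every new player starts with the same prior (mu=25, high sigma).
alice, bob = trueskill.Rating(), trueskill.Rating()

# Alice beats Bob a few times: her mean skill rises, and both
# uncertainties shrink as evidence accumulates.
for _ in range(3):
    alice, bob = trueskill.rate_1vs1(alice, bob)

# Match quality is roughly the predicted draw probability: high means
# a fair, interesting game. This is what the matchmaker maximises.
print(trueskill.quality_1vs1(alice, bob))
```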

There were also some minor gripes I had with BBuzz. I know that presentations re-inventing the wheel will never go away. At BBuzz, though, some speakers really stretched my patience with talks that had glamorous titles but nothing to show for them. You get this at scientific conferences as well, but I’ve never experienced it at this kind of mind-boggling level. Luckily it was just one or two of them.

Another part of the conference I expected more from was the barcamp. It was my first one ever, so I don’t have a point of reference, but to me it felt like regular conference talks, just with less preparation on the speakers’ side. As a contrast, five years ago I was at WikiSym, which used another style of unconference that I liked way better. To briefly elaborate: it was held in a big hall, where everyone would go to any free spot on the wall and put up a post-it with a topic dear to her or his heart. Afterwards, people would roam around, and spontaneous discussion clusters would form. As confusing as it sounds, I had the feeling that many more interesting things emerged there; maybe voting with your feet during the discussions helps maintain focus better than a barcamp-style schedule decided up front.

I guess I’ve rambled on long enough already, but I still want to share some real gold nuggets I found at Buzzwords:

  • T-Digest: if you need an estimate of the median (or any other quantile) in your data (stream), look no further. It’s fast, memory-efficient, and highly accurate even for skewed distributions. Very helpful if you need that kind of data for downstream tasks like anomaly detection in time series, which brings me to the second point. (A small sketch follows after this list.)
  • Deep Learning for Anomaly Detection: an anomaly is defined as a deviation from the “normal” data distribution, e.g. exceeding the 99.99th percentile (which can be neatly measured using T-Digest). However, defining “normal” is not so straightforward if your input signal or distribution is complex. One way to cope with this problem is deep learning, where the actual underlying (latent) structures can be learned; an anomaly is then a deviation from the learned model (see the toy example below). If this was too abstract, I recommend watching Ted Dunning’s talk on YouTube.
  • Hadoop break-even point: a very interesting tidbit that is pretty obvious in hindsight, but easily overlooked. Hadoop was designed with Google scale in mind; however, most users have neither clusters nor problems of that scale. Jobs that run for less than 50,000 cpu-hours are actually hurt by the fault-tolerance mechanism of checkpointing (hopefully I’m quoting the number correctly here). The reason is that hardware might be unreliable, but not THAT unreliable, and provisioning for failures in short jobs just kills performance without a real benefit. In case something really does go wrong, the job can simply be restarted from scratch, not much harm done. (The back-of-envelope calculation below shows the shape of the argument.)
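
Here is the promised T-Digest sketch. It assumes the third-party tdigest Python package, a port of Dunning’s algorithm; the lognormal stream is just an example of a skewed distribution.

```python
# Streaming quantile estimation with a t-digest. Assumes the
# third-party `tdigest` package (pip install tdigest); the data
# stream here is synthetic.
import random

from tdigest import TDigest

digest = TDigest()

# Feed observations one at a time, as they would arrive in a stream.
for _ in range(100_000):
    digest.update(random.lognormvariate(0, 1))  # skewed distribution

# Query quantiles without ever holding the full stream in memory.
print("median:", digest.percentile(50))
print("p99.99:", digest.percentile(99.99))
```

That p99.99 line is exactly the kind of threshold the anomaly-detection nugget relies on.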
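
And a toy version of the anomaly-detection idea: train a model to reconstruct “normal” data and flag points it reconstructs poorly. This is only a sketch; scikit-learn’s MLPRegressor stands in for a real deep autoencoder (it is not Dunning’s actual setup), and all shapes and thresholds are invented.

```python
# Toy anomaly detection via reconstruction error. A small net learns
# to reproduce "normal" inputs; inputs it reconstructs poorly are
# flagged. Sketch only: MLPRegressor stands in for a deep autoencoder,
# and all parameters are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# "Normal" training data: 2-D points lying near the line y = x.
x = rng.normal(size=(5000, 1))
normal = np.hstack([x, x + 0.1 * rng.normal(size=(5000, 1))])

# The one-unit bottleneck forces the net to learn the latent structure.
model = MLPRegressor(hidden_layer_sizes=(8, 1, 8), activation="tanh",
                     max_iter=500)
model.fit(normal, normal)

def reconstruction_error(points):
    return np.mean((model.predict(points) - points) ** 2, axis=1)

# Threshold on the error distribution of normal data, e.g. its 99.99th
# percentile (which a t-digest could track in a streaming setting).
threshold = np.percentile(reconstruction_error(normal), 99.99)

probe = np.array([[1.0, 1.05],    # fits the learned structure
                  [1.0, -3.0]])   # violates it -> anomaly
print(reconstruction_error(probe) > threshold)
```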
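
Finally, the break-even point as a back-of-envelope calculation. All numbers are invented for illustration; this is not Hadoop’s actual cost model, just the shape of the argument: at realistic failure rates, a short job pays more for the constant checkpointing overhead than it would ever lose to the occasional restart.

```python
# Back-of-envelope model of the checkpointing break-even point.
# All numbers are invented; this is not Hadoop's actual cost model.

job_hours = 1_000       # total cpu-hours of the job
failure_rate = 1e-5     # probability a given cpu-hour hits a failure
overhead = 0.05         # fractional slowdown caused by checkpointing

# With checkpointing: always pay the overhead; failures cost ~nothing.
with_ckpt = job_hours * (1 + overhead)

# Without checkpointing: restart from scratch on failure. A failure
# strikes on average halfway through, wasting ~job_hours / 2.
p_fail = 1 - (1 - failure_rate) ** job_hours
without_ckpt = job_hours + p_fail * job_hours / 2

print(f"with checkpointing:   {with_ckpt:8.1f} expected cpu-hours")
print(f"restart from scratch: {without_ckpt:8.1f} expected cpu-hours")
```

With these made-up numbers the restart strategy wins comfortably; crank job_hours up toward Google scale and the inequality flips, which is exactly the break-even effect the talk described.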

By the way, all three of these are thanks to Ted Dunning. Besides giving a great presentation and having a deep understanding of the field, explaining things at a level of detail that is truly insightful, he is also a very nice guy. Seeing him alone was worth going to Berlin Buzzwords 2014!
