Posted in Community, Presentations, Troubleshooting | Posted on 27-02-2014
I am extremely happy to say that my proposal was accepted for Hadoop Summit 2014 in Amsterdam ;) The title of my presentation is “Hadoop operations powered by … Hadoop”. I will talk about the various metrics, logs and files that Hadoop generates, and how to analyze them … using Hadoop (plus open-source tools and simple scripts) to learn more about Hadoop and avoid guesstimates!
The “official” abstract of the presentation is below:
At Spotify we collect huge volumes of data for many purposes. Reporting to labels, powering our product features, and analyzing user growth are some of the most common ones. Additionally, we collect many operational metrics related to the responsiveness, utilization and capacity of our servers. To store and process this data, we use a scalable and fault-tolerant multi-system infrastructure, and Apache Hadoop is a key part of it.
Surprisingly or not, Apache Hadoop generates large amounts of data in the form of logs and metrics that describe its behaviour and performance. To process this data in a scalable and performant manner, we use … Hadoop again! During this presentation, I will talk about how we analyze various logs generated by Apache Hadoop, using custom scripts (written in Pig or Java/Python MapReduce) and available open-source tools, to get data-driven answers to many questions related to the behaviour of our 690-node Hadoop cluster. At Spotify we frequently leverage these tools to learn how fast we are growing and when to buy new nodes, to calculate an empirical retention policy for each dataset, to optimize the scheduler, to benchmark the cluster, to find its biggest offenders (both people and datasets), and more.
Hope to see you there!