Getting started with EMR + Apache Spark

It's been a while since my last post. Given that I've been playing around with Apache Spark on AWS EMR lately, I figured I'd write about it.

At work, we're trying to solve a specific problem: we want to know how many times a particular user has taken a particular action over a rolling window of time (ranging from seconds to hours). Further, we want this data to be calculated in near real time so that actions can be taken based on this "aggregation" by another process while the user is navigating our site, as opposed to a few hours later (which is what we have done so far, since we rely on our data warehouse to be the source of information for the calculations). For example, at LinkedIn, we'd want to know how many times a given user has viewed another user's profile over the last 10 minutes. A simple use case for this aggregation might be to calculate the 99th percentile of profile views over the 10-minute rolling window.

I'll write a separate, more detailed blog post about the architectural choices we made and what the tradeoffs were, but for now, I just wanted to list some things that I found interesting while working with EMR and Spark 2.2.0 (again, more details in one or several separate blog posts):

  • There's no easy way to load the Spark UI when running an EMR cluster. You have to set up an SSH tunnel to get in, and in our case, this caused issues since our VPC for the project (we have isolated VPCs so that we do not adversely impact our production workloads with our 'analytics' workloads, at least not yet) won't allow us to SSH into machines in it from inside our corporate VPN (I don't understand why, and I don't understand VPCs in general, really).
  • The best way to actually kick off a Spark Streaming job is to use an EMR Spark Step. Steps let you define specific actions you want EMR to take once the cluster has been bootstrapped. You can also tell EMR to shut down the cluster once all steps assigned to it have been executed.
  • A good way to install packages your streaming job needs is to use Bootstrap actions. In my case, I needed boto3 installed on the workers so that once each batch of my Python streaming job was done, I could send the results into a Kinesis stream.
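Putting the last two points together, here's a sketch of launching a cluster with boto3's `run_job_flow`: a bootstrap action to install dependencies, a Spark step to submit the job, and `KeepJobFlowAliveWhenNoSteps: False` so the cluster shuts itself down afterwards. The S3 URIs, instance types, and cluster name are placeholders, not what we actually use.

```python
def build_run_job_flow_request(script_s3_uri, bootstrap_s3_uri, log_uri):
    """Build the arguments for boto3's emr.run_job_flow(**request):
    install Python deps on every node via a bootstrap action, run one
    Spark step, and terminate the cluster when the step finishes.
    All names/URIs here are illustrative placeholders."""
    return {
        "Name": "streaming-aggregation",
        "ReleaseLabel": "emr-5.8.0",  # an EMR release that ships Spark 2.2.0
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            # Shut the cluster down once all steps have run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "BootstrapActions": [{
            "Name": "install-python-deps",
            # e.g. a shell script on S3 that runs: sudo pip install boto3
            "ScriptBootstrapAction": {"Path": bootstrap_s3_uri},
        }],
        "Steps": [{
            "Name": "spark-streaming-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# To actually launch (requires AWS credentials and real S3 paths):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# emr.run_job_flow(**build_run_job_flow_request(
#     "s3://my-bucket/jobs/stream.py",
#     "s3://my-bucket/bootstrap/install_deps.sh",
#     "s3://my-bucket/emr-logs/"))
```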

I'll talk in more detail about each one of these (and some other things) in follow-up blog posts.
