• Seminar Series 2013-2014: Ninth Seminar
    Matei Zaharia
    Databricks and MIT
    Date: Friday, July 4, 2014 at 2 pm.
    Location: AL 116
    Title: Making Big Data Interactive with Spark

    Abstract: The rapid growth in data volumes requires new computer systems that scale out across hundreds of machines. While early programming models, such as MapReduce, handled large-scale batch processing, the demands on these systems have also grown: in particular, users quickly needed to run (1) more interactive ad-hoc queries, (2) more complex multi-pass algorithms (e.g. machine learning), and (3) real-time processing on large data streams. In this talk, we present a single programming model, resilient distributed datasets (RDDs), that supports all of these emerging workloads. RDDs form the basis of Apache Spark, an open source cluster computing system that supports real-time and sophisticated analytics on big data. Spark runs up to 100x faster than previous systems like Hadoop MapReduce, while offering clean, easy-to-use interfaces in Java, Scala and Python. Spark has quickly become the most active project in the Apache big data ecosystem, with over 100 developers contributing in the past year, and we will cover industry applications as well as the ideas behind the project.

    Biography: Matei Zaharia is an assistant professor at MIT and CTO at Databricks, the startup company commercializing Spark. He received his undergraduate degree from the University of Waterloo and his PhD from UC Berkeley. While at Berkeley, Matei started Spark as a research project.