Data Engineering

Spark

Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark extends the MapReduce model to efficiently cover more types of computation, including interactive queries and stream processing. One of Spark's key features is in-memory cluster computing, which increases an application's processing speed. The MapReduce model itself comes from Google, where it was introduced as an approach to processing large datasets across many machines.
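
To make the MapReduce model mentioned above more concrete, here is a minimal single-machine sketch in plain Python. It is not Spark's API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration. It shows the three conceptual steps for a word count: emit key/value pairs, group them by key, then combine each group.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

Calling word_count(["spark extends mapreduce", "spark keeps data in memory"]) counts "spark" twice and every other word once. In a real cluster framework, the map and reduce steps would run in parallel on partitions of the data; Spark's in-memory approach means intermediate results like the grouped pairs can stay in memory between steps rather than being written to disk each time.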

Links

Thoughts

  • Data engineering is more software than analytics.
  • Data engineering is a lot more structured than data science and requires more deliberate design.