When comparing Apache Spark vs Pachyderm, the Slant community recommends Apache Spark for most people. In the question "What are the best processing tools for Big Data?", Apache Spark is ranked 1st while Pachyderm is ranked 3rd. The most important reason people chose Apache Spark is:
Pandas-like data frame syntax for succinctly doing powerful aggregations, with no need to pollute your code with batching or scaling logic.
Pros
Pro High level API that scales
Pandas-like data frame syntax for succinctly doing powerful aggregations, with no need to pollute your code with batching or scaling logic.
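As a sketch of what that looks like, the same groupby/aggregate pattern reads almost identically in pandas and in Spark. The example below runs on plain pandas; the commented lines show the equivalent PySpark DataFrame calls, which distribute the same aggregation across a cluster with no batching logic in user code. The column names and data are made up for illustration.

```python
import pandas as pd

# Toy event data; on Spark this could be billions of rows.
df = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b"],
    "spend": [10.0, 20.0, 5.0, 5.0, 15.0],
})

# High-level aggregation: total spend per user, no scaling logic needed.
totals = df.groupby("user")["spend"].sum().reset_index()

# The PySpark DataFrame equivalent (not executed here) is nearly identical:
#   from pyspark.sql import functions as F
#   totals = df.groupBy("user").agg(F.sum("spend").alias("spend"))

print(totals)
```

Spark also ships a `pyspark.pandas` API that accepts the pandas-style spelling directly, so much of this code can move to a cluster unchanged.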
Pro Cheap clusters with spot instances
Using AWS spot instances or GCE preemptible VMs allows you to start cheap clusters, averaging ~10% of the price of an on-demand instance.
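A back-of-the-envelope sketch of the savings. The ~10% figure is from the point above; the hourly rate, cluster size, and runtime are hypothetical numbers chosen only to make the arithmetic concrete.

```python
# Hypothetical on-demand price per instance-hour (illustrative only).
on_demand_hourly = 0.50
spot_fraction = 0.10        # spot/preemptible averaging ~10% of on-demand
nodes, hours = 20, 8        # hypothetical 20-node cluster running 8 hours

on_demand_cost = on_demand_hourly * nodes * hours
spot_cost = on_demand_cost * spot_fraction

print(f"on-demand: ${on_demand_cost:.2f}, spot: ${spot_cost:.2f}")
```

The trade-off is that spot/preemptible instances can be reclaimed at any time, which Spark tolerates reasonably well since it can recompute lost partitions.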
Cons
Con High learning curve
It takes time to understand the memory and time complexity of transforms, which leads to a rocky start with Spark while you keep hitting memory spills.
Con Reinventing the wheel
Pachyderm recreates a "modern" Hadoop MapReduce and HDFS, except with git-like versioned data. This ties you to their system, and you cannot run tools such as Spark or Hive on top.
Con Low level MapReduce API
The only ways to express transforms are map and reduce. Developer time is spent re-expressing high-level concepts such as counting, windowing, and graph traversal in this framework, instead of having Spark or Hive provide them.
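To make the contrast concrete, here is a word count written two ways in plain Python: once constrained to only map and reduce (the shape a MapReduce-only API forces on you, per the point above), and once with a high-level counting primitive of the kind Spark and Hive provide out of the box. The sample data is made up; the point is the extra ceremony the restricted version needs for the same result.

```python
from functools import reduce
from collections import Counter

words = ["spark", "hive", "spark", "spark", "hive", "pachyderm"]

# --- Constrained to map & reduce only ---
# map: turn each word into a one-element count dict
mapped = map(lambda w: {w: 1}, words)

# reduce: merge the per-word dicts into one running total
def merge(acc, kv):
    for k, v in kv.items():
        acc[k] = acc.get(k, 0) + v
    return acc

counts_mr = reduce(merge, mapped, {})

# --- High-level primitive (what Spark/Hive give you directly) ---
counts_hl = Counter(words)

print(counts_mr, dict(counts_hl))
```

In Spark this collapses to something like `df.groupBy("word").count()` or a SQL `GROUP BY`; in a map/reduce-only system, you write and maintain the merge logic yourself for every such aggregation.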