A friend asked me this week what the difference is between using Hadoop and its related ecosystem for data storage and analysis, and using a traditional Data Warehouse.
You might want to skip this post if you’re already way ahead on this topic, but for everyone else, I thought I’d try and clarify…
Hive is a SQL-like interface onto MapReduce. It feels nice and familiar to analysts who are used to thinking in a SQL paradigm, but it has some nasty gotchas that can make jobs verrrrrry slow or make them fail altogether. Either way, you waste a lot of time, blood pressure, and machine hours.
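To make "a SQL-like interface onto MapReduce" a bit more concrete, here is a rough sketch in plain Python (not actual Hive or Hadoop code) of how a query along the lines of `SELECT country, COUNT(*) FROM page_views GROUP BY country` might break down into map, shuffle and reduce steps. The table name, columns and rows are all made up for illustration, and real Hive query plans are considerably more involved.

```python
from collections import defaultdict

# Hypothetical rows of an imaginary "page_views" table: (user_id, country)
rows = [(1, "UK"), (2, "FR"), (3, "UK"), (4, "DE"), (5, "UK")]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed on the GROUP BY column."""
    for _user_id, country in rows:
        yield country, 1


def shuffle(pairs):
    """Collect values by key, roughly what happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

The point of the sketch is that even a one-line query implies a full pass over the data plus a regrouping step, which is part of why a small change to a query can have a large effect on how long a job runs.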
I went to a great talk recently by Philip Tromans at the London Hive meetup, which covered some very useful Hive optimisation tips. His full deck is here, but I’ve shamelessly copied a couple of the most useful points below: