Using DBOutputFormat in Hadoop


February 2019




When using DBOutputFormat with Hadoop, say the final result is to go to a MySQL database. Will Hadoop create a separate connection each time a result has to be written? (Would the database be burdened with too many open connections?) I have not used this output format, so any suggestions are welcome. Would it have a performance advantage over Sqoop? Sqoop can also be used to export an output file to a database. Please share your views.

1 answer


Here's an explanation I found in this blog post from Cloudera:

The DBOutputFormat writes to the database by generating a set of INSERT statements in each reducer. The reducer’s close() method then executes them in a bulk transaction. Performing a large number of these from several reduce tasks concurrently can swamp a database. If you want to export a very large volume of data, you may be better off generating the INSERT statements into a text file, and then using a bulk data import tool provided by your database to do the database import.
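To make the text-file alternative from the quote concrete, here is a minimal sketch in plain Java. The table `word_counts`, its columns `word` and `total`, and the helper names are all hypothetical, and the quoting shown is deliberately simplistic; real data needs escaping rules that match your database.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: turn (word, total) rows into INSERT statements
// and write them to a text file for a later bulk import.
class InsertFileSketch {

    // Quote a value as a SQL string literal by doubling single quotes.
    // Assumption: this is enough for the illustration; it is not a
    // complete escaping routine for any particular database.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Build one INSERT statement for a single row.
    static String toInsert(String table, String word, long total) {
        return "INSERT INTO " + table + " (word, total) VALUES ("
                + quote(word) + ", " + total + ");";
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(toInsert("word_counts", "hadoop", 42));
        lines.add(toInsert("word_counts", "o'reilly", 7));

        // Write the statements to a temporary .sql file, one per line.
        Path out = Files.createTempFile("inserts", ".sql");
        Files.write(out, lines, StandardCharsets.UTF_8);
        System.out.println("Wrote " + lines.size() + " statements to " + out);
    }
}
```

The resulting file could then be loaded with whatever import tool your database provides, so the database sees one controlled load instead of many reducers writing at once.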

So it appears that each reducer opens only one connection, which means the database should see at most one connection per concurrently running reduce task rather than one per record. The bulk writes from many reducers at once could still cause performance issues, though. I don't know for sure, but Sqoop is probably slightly more efficient and robust.
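The pattern described above (one connection per writer, rows queued as a batch, everything executed and committed on close) can be sketched with plain JDBC. This is a simplified illustration, not Hadoop's actual source; the class name, table, and columns are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch of a writer that holds ONE connection for its
// whole lifetime and sends all queued rows when it is closed.
class BatchedWriterSketch implements AutoCloseable {
    private final Connection connection;
    private final PreparedStatement statement;
    private int pending = 0;

    BatchedWriterSketch(Connection connection) throws SQLException {
        this.connection = connection;
        // Disable auto-commit so the whole batch is one transaction.
        connection.setAutoCommit(false);
        this.statement = connection.prepareStatement(
                "INSERT INTO word_counts (word, total) VALUES (?, ?)");
    }

    // Queue one row; nothing is executed against the database yet.
    void write(String word, long total) throws SQLException {
        statement.setString(1, word);
        statement.setLong(2, total);
        statement.addBatch();
        pending++;
    }

    // Number of rows queued so far.
    int pending() {
        return pending;
    }

    // Execute the queued rows and commit; roll back if anything fails.
    @Override
    public void close() throws SQLException {
        try {
            statement.executeBatch();
            connection.commit();
        } catch (SQLException e) {
            connection.rollback();
            throw e;
        } finally {
            statement.close();
            connection.close();
        }
    }
}
```

A caller would pass in a connection obtained from `DriverManager.getConnection(...)` with its own JDBC URL and credentials, call `write` once per output record, and close the writer at the end of the task. With this shape, the number of open connections tracks the number of concurrent writers, not the number of records.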