ETL - Read, transform and stream to Hadoop


I need to build a server that reads large CSV data files (100 GBs) in a directory, transforms fields, and streams them to a Hadoop cluster.

These files are copied over from other servers at random times (100s of times/day). It takes a long time to finish copying a file.

I need to:

  1. regularly check for new files to process (i.e., encrypt and stream)
  2. check whether a CSV has been completely copied over, so that encryption can be kicked off
  3. process and stream multiple files in parallel, but prevent two processes from streaming the same file
  4. mark files that are being streamed
  5. mark files that were streamed unsuccessfully, and restart the streaming process.

My question is: is there an open source ETL tool that provides all of these 5, and works with Hadoop/Spark Stream? I assume this process is standard, but I couldn't find one yet.

Thank you.

Flume or Kafka will serve your purpose. Both are integrated with Spark and Hadoop.
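Whichever tool does the streaming, the file-handling steps from the question (detect finished copies, let only one worker take a file, mark success or failure) can be sketched on their own. The following is a minimal sketch under stated assumptions, not a Flume or Kafka API: the directory layout (`incoming`, `work`, `done`, `failed`), the function names, and the `handler` callback (where the encrypt-and-stream step would go) are all hypothetical. It treats a file as fully copied when its size and modification time stop changing, which is only a heuristic, and it relies on `os.rename` being atomic, which assumes all directories are on the same filesystem.

```python
import os
import time
from pathlib import Path
from typing import Callable, Dict, Optional


def is_stable(path: Path, wait_seconds: float = 5.0) -> bool:
    """Heuristic: the copy is assumed finished if size and mtime
    are unchanged after waiting wait_seconds."""
    try:
        before = path.stat()
        time.sleep(wait_seconds)
        after = path.stat()
    except FileNotFoundError:
        return False  # another worker claimed the file in the meantime
    return (before.st_size, before.st_mtime_ns) == (after.st_size, after.st_mtime_ns)


def claim(path: Path, work_dir: Path) -> Optional[Path]:
    """Move the file into work_dir. If several workers race for the same
    file, only one rename succeeds; the others see FileNotFoundError."""
    target = work_dir / path.name
    try:
        os.rename(path, target)
    except FileNotFoundError:
        return None
    return target


def process_once(
    incoming: Path,
    work: Path,
    done: Path,
    failed: Path,
    handler: Callable[[Path], None],
    wait_seconds: float = 5.0,
) -> Dict[str, str]:
    """One polling pass: claim every finished CSV, run handler on it,
    then mark it by moving it to done/ or failed/."""
    results: Dict[str, str] = {}
    for path in sorted(incoming.glob("*.csv")):
        if not is_stable(path, wait_seconds):
            continue  # still being copied; retry on the next pass
        claimed = claim(path, work)
        if claimed is None:
            continue  # another worker got it first
        try:
            handler(claimed)  # hypothetical: encrypt and stream the file
            os.rename(claimed, done / claimed.name)
            results[claimed.name] = "done"
        except Exception:
            os.rename(claimed, failed / claimed.name)
            results[claimed.name] = "failed"
    return results
```

Calling `process_once` from a loop or a cron job covers the regular check, and several workers can run it against the same directories in parallel. To restart a failed stream, move the file from `failed` back to `incoming`.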

