I need to build a server that reads large CSV data files (100s of GBs) from a directory, transforms some fields, and streams them to a Hadoop cluster.
These files are copied over from other servers at random times (100s of times/day), and it takes a long time for a file to finish copying.
I need to:
- regularly check for new files to process (i.e., encrypt and stream)
- check whether a CSV has been copied over completely before kicking off encryption
- process and stream multiple files in parallel, while preventing two processes from streaming the same file
- mark files that are being streamed
- mark files that were streamed unsuccessfully, and restart the streaming process for them.
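
To make the five requirements concrete, here is a rough Python sketch of the behaviour I am after. It is only an illustration under several assumptions, not a real solution: the marker suffixes (`.processing`, `.done`, `.failed`) and all function names are made up, `stream` is a placeholder for the actual encrypt-and-stream step, "copy finished" is guessed from the file size and modification time staying unchanged for a short interval, and claiming a file relies on `os.rename` being atomic, which holds when source and target are on the same local filesystem.

```python
import os
import time
from pathlib import Path


def is_copy_finished(path, wait_seconds=1.0):
    """Heuristic: the copy is done if size and mtime do not change while we wait."""
    try:
        before = path.stat()
        time.sleep(wait_seconds)
        after = path.stat()
    except FileNotFoundError:
        return False  # another worker claimed or removed the file
    return (before.st_size, before.st_mtime_ns) == (after.st_size, after.st_mtime_ns)


def try_claim(path):
    """Claim a file by renaming it; only one worker can win the rename."""
    claimed = path.with_name(path.name + ".processing")
    try:
        os.rename(path, claimed)
    except FileNotFoundError:
        return None  # somebody else renamed it first
    return claimed


def process_one(path, stream, wait_seconds=1.0):
    """Return 'skipped', 'done' or 'failed' for a single CSV file."""
    if not is_copy_finished(path, wait_seconds):
        return "skipped"
    claimed = try_claim(path)
    if claimed is None:
        return "skipped"
    try:
        stream(claimed)  # placeholder for encrypt + stream to the cluster
    except Exception:
        os.rename(claimed, path.with_name(path.name + ".failed"))
        return "failed"
    os.rename(claimed, path.with_name(path.name + ".done"))
    return "done"


def scan(directory, stream, wait_seconds=1.0):
    """One polling pass over the directory; call this on a timer."""
    return {
        path.name: process_one(path, stream, wait_seconds)
        for path in sorted(Path(directory).glob("*.csv"))
    }


def requeue_failed(directory):
    """Rename *.csv.failed back to *.csv so the next scan retries them."""
    for failed in Path(directory).glob("*.csv.failed"):
        os.rename(failed, failed.with_name(failed.name[: -len(".failed")]))
```

Several workers could each run `scan` on a schedule against the same directory; the rename in `try_claim` is what would keep two of them from streaming the same file. What I am hoping for is a tool that already does this robustly.
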
My question is: is there an open source ETL tool that provides all 5 of these and works well with Hadoop/Spark streaming? I assume this process is fairly standard, but I couldn't find one yet.
Thank you.