Abstract:
With the development of technology, enormous amounts of data are being generated continuously from social media, IoT devices, the web, and other sources, ushering in the era of big data. Many researchers are paying attention to the processing of massive data streams that arrive at a rapid rate, in order to gain valuable information in real time or to make immediate decisions. The data ingestion stage is an important part of a data stream processing system: it is responsible for collecting data from different locations and delivering that data for processing. The most important requirements of data ingestion are low latency, high throughput, and scalability with many data producers and consumers, since ingestion can influence the performance of the entire stream processing pipeline. In big data stream computing, the speed at which data is created and the explosive growth of data volume lead to new challenges. One challenge is to accurately ingest heterogeneous stream data into a processing platform or a data storage platform. Existing data stream ingestion systems typically use a combination of Apache NiFi and Kafka: Apache NiFi collects and preprocesses structured and unstructured data feeds, while Kafka handles message distribution. However, NiFi processors such as MergeRecord can be memory, I/O, and CPU intensive. As a result, processing massive data streams created at high speed can cause heavy memory usage, input/output bottlenecks, or central processing unit (CPU) bottlenecks, which degrades the performance of the stream processing layer and is not appropriate for time-sensitive applications. In this paper, we propose using a combination of StreamSets Data Collector and Kafka to collect and transform structured and unstructured feeds from various sources.