Problem Statement:
Design a scalable pipeline using Spark to read customer reviews from an S3 bucket and store them in HDFS. Schedule the pipeline to run once every hour.
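The hourly schedule can be sketched in plain Python as shown here. This is only an illustration of the timing logic, not a prescribed scheduler: the function names `seconds_until_next_hour` and `run_hourly` are hypothetical, and in practice a cron entry or a workflow scheduler would likely take the place of this loop.

```python
import time
from datetime import datetime, timedelta


def seconds_until_next_hour(now: datetime) -> float:
    """Return the number of seconds from `now` to the top of the next hour."""
    next_hour = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return (next_hour - now).total_seconds()


def run_hourly(job, max_runs=None, sleep=time.sleep, clock=datetime.now):
    """Call `job()` at the top of each hour.

    `max_runs`, `sleep`, and `clock` are hypothetical hooks added so the loop
    can be bounded and exercised without waiting in real time.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        sleep(seconds_until_next_hour(clock()))  # wait for the hour boundary
        job()                                    # run one pipeline iteration
        runs += 1
    return runs
```

Passing the ingestion job as `job` and leaving `max_runs` as `None` keeps the loop running indefinitely; a manual trigger is simply a direct call to the job itself.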
Create a folder in the S3 bucket where customer reviews in JSON format can be uploaded. The scheduled big data pipeline can be triggered manually or automatically; each run reads the data from the S3 bucket and writes it to HDFS.
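A minimal PySpark sketch of the S3-to-HDFS step could look like the following. The bucket name, folder name, HDFS root, and the `dt=/hour=` output layout are all assumptions made for illustration, as are the function names. The sketch also assumes the cluster has the `hadoop-aws` (S3A) connector and AWS credentials configured, and it re-reads the whole folder on every run instead of tracking which files are new.

```python
from datetime import datetime


def build_paths(bucket: str, folder: str, hdfs_root: str, run_time: datetime):
    """Return (source, target) URIs for one hourly run.

    The hour-partitioned target layout is an assumption, not a requirement.
    """
    source = f"s3a://{bucket}/{folder}/"
    target = f"{hdfs_root.rstrip('/')}/dt={run_time:%Y-%m-%d}/hour={run_time:%H}"
    return source, target


def ingest_reviews(spark, source: str, target: str) -> None:
    """Read JSON reviews from S3 and write them unchanged to HDFS as JSON."""
    reviews = spark.read.json(source)
    # Overwriting the hour's directory keeps a re-run of the same hour idempotent.
    reviews.write.mode("overwrite").json(target)


def main() -> None:
    # Imported here so the path helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("review-ingest").getOrCreate()
    # Hypothetical bucket, folder, and HDFS root; replace with real values.
    src, dst = build_paths("my-review-bucket", "customer-reviews",
                           "hdfs:///data/reviews", datetime.now())
    ingest_reviews(spark, src, dst)
    spark.stop()
```

Calling `main()` by hand covers the manual trigger; pointing a scheduler at the same function covers the automatic one.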
Use Spark machine learning (MLlib) to perform sentiment analysis on the customer reviews stored in HDFS.
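One possible shape for the sentiment step is a Spark ML pipeline of tokenizer, TF-IDF features, and logistic regression, sketched here under several assumptions: the reviews are stored as JSON with a text column named `review_text` and a numeric `rating` column, and ratings above 3 are treated as positive. The column names, the threshold, and the choice of classifier are illustrative only; the problem statement does not specify them.

```python
POSITIVE_THRESHOLD = 3.0  # assumption: ratings above 3 count as positive


def label_from_rating(rating: float, threshold: float = POSITIVE_THRESHOLD) -> float:
    """Plain-Python statement of the labeling rule applied in `train` below."""
    return 1.0 if rating > threshold else 0.0


def build_sentiment_pipeline():
    """Tokenizer -> TF-IDF -> logistic regression. Needs an active SparkSession."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer

    tokenizer = Tokenizer(inputCol="review_text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 16)
    idf = IDF(inputCol="tf", outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
    return Pipeline(stages=[tokenizer, tf, idf, lr])


def train(spark, hdfs_path: str):
    """Fit the pipeline on reviews read from HDFS; return (model, test predictions)."""
    from pyspark.sql import functions as F

    # Drop rows missing either assumed column so tokenizing and labeling are safe.
    reviews = spark.read.json(hdfs_path).dropna(subset=["review_text", "rating"])
    labeled = reviews.withColumn(
        "label",
        F.when(F.col("rating") > POSITIVE_THRESHOLD, 1.0).otherwise(0.0),
    )
    train_df, test_df = labeled.randomSplit([0.8, 0.2], seed=42)
    model = build_sentiment_pipeline().fit(train_df)
    return model, model.transform(test_df)
```

The returned predictions carry a `prediction` column next to `label`, so they can be passed to an evaluator such as `BinaryClassificationEvaluator` to measure how well the model separates positive from negative reviews.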
Data: You can use any customer review dataset from an online source such as the UCI Machine Learning Repository.