Looking for an Expert in:
· Building Custom Data Pipelines in Python that Clean, Transform and Aggregate Data from many different Sources (a minimal sketch follows this list)
· Big Data Technologies such as MapReduce, Hadoop, Spark, HBase, Hive, Elasticsearch
· Data Structures and Data Processing Algorithms and Frameworks
· Data Migration and high-throughput Data Pipelines
· Massively Parallel Processing using Python Tools
· Analyzing Performance Issues in Big Data Environments
· Data Modelling, Data Transfer and Storage, Partitioning, Indexing and Caching Techniques
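By way of illustration for the first bullet, a minimal sketch of the kind of clean/transform/aggregate pipeline meant there, using pandas; the source files and the column names (user_id, ts, amount) are hypothetical:

    import pandas as pd

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        # Drop exact duplicates and rows missing the join key
        return df.drop_duplicates().dropna(subset=["user_id"])

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # Normalise timestamps to UTC dates for daily aggregation
        df["date"] = pd.to_datetime(df["ts"], utc=True).dt.date
        return df

    def aggregate(df: pd.DataFrame) -> pd.DataFrame:
        # Total amount per user per day across all merged sources
        return df.groupby(["user_id", "date"], as_index=False)["amount"].sum()

    # Hypothetical CSV sources; real pipelines would also pull from APIs, databases, etc.
    sources = ["orders_web.csv", "orders_mobile.csv"]
    merged = pd.concat(pd.read_csv(src) for src in sources)
    result = aggregate(transform(clean(merged)))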
Well experienced with:
· Large-Scale Data Modelling from a Big Data Perspective
· Big Data Structures in Python
· PyData, Anaconda, NumPy, PyTables, DataFrames, Jupyter Notebook
· PyHive, PySpark
· JSON/Parquet Data Formats (a minimal PySpark sketch follows this list)
· Real-time Streaming with either Spark Streaming or Kafka (a second sketch follows this list)
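To illustrate the PySpark and JSON/Parquet items, a minimal sketch that reads hypothetical JSON events, aggregates them, and writes date-partitioned Parquet; the S3 paths and the ts column are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

    # Hypothetical input path; one JSON record per line
    raw = spark.read.json("s3://bucket/raw/events/")

    # Count events per calendar day
    daily = (
        raw.withColumn("date", F.to_date("ts"))
        .groupBy("date")
        .agg(F.count("*").alias("events"))
    )

    # Partition by date so downstream queries can prune files
    daily.write.partitionBy("date").mode("overwrite").parquet("s3://bucket/curated/daily/")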
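And for the streaming item, a minimal Spark Structured Streaming sketch against a hypothetical Kafka topic; it assumes the spark-sql-kafka connector is on the classpath and a broker at localhost:9092:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

    # Kafka source: broker address and topic name are hypothetical
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "events")
        .load()
    )

    # Kafka delivers key/value as binary; cast the key and count messages per key
    counts = events.select(col("key").cast("string")).groupBy("key").count()

    # Print running counts to the console; "complete" re-emits the full aggregate each batch
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()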
Good to have:
· Familiarity with PyPI
· Workflow Management Tools such as Luigi, Apache Airflow, Snowflow or similar (a minimal Airflow sketch follows this list)
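Finally, to illustrate the workflow-management item, a minimal Apache Airflow DAG; it assumes Airflow 2.4+, and the extract/load callables are hypothetical placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Hypothetical: pull raw records from a source system
        print("extracting")

    def load():
        # Hypothetical: write transformed records to the warehouse
        print("loading")

    with DAG(
        dag_id="daily_etl",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task  # load runs only after extract succeeds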