Experienced senior big data engineer with a demonstrated history of working in the information technology industry. Extensive experience with multiple big data technologies and ecosystems and with the full Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies. Experienced in ingesting and processing unstructured petabyte-scale datasets through complex data pipelines that enable faster and better data analytics for business users. Skilled in big data technologies such as Spark and Hive; AWS services such as Redshift, S3, and EMR; and Azure services such as Azure Data Lake and Stream Analytics. Extensive knowledge of Flume, Hive, Hadoop, Spark, machine learning, statistical methods, databases, Python, SQL, Java, Pandas, and NumPy.
Developed data pipelines using Flume, Sqoop, Pig, Java MapReduce, and Spark to ingest customer behavioral data and purchase histories into HDFS for analysis. Connected Tableau to AWS Redshift to extract live data for real-time analysis. Created on-demand tables over S3 files using Lambda functions and AWS Glue with Python and PySpark. Developed an ETL framework using Spark and Hive, including daily runs, error handling, and logging.
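The on-demand table flow above can be sketched as a Lambda handler that derives a Glue partition spec from each new S3 object key. This is a minimal illustration, not the actual implementation: the bucket layout, key pattern, and function names are assumptions, and the Glue registration call itself is only noted in a comment.

```python
# Hedged sketch: deriving a Glue partition spec from an S3 object key inside
# a Lambda handler. The key layout ('raw/purchases/dt=YYYY-MM-DD/...') and
# function names are illustrative assumptions, not the original code.
import re


def partition_from_key(key):
    """Extract a dt=YYYY-MM-DD partition value from a key such as
    'raw/purchases/dt=2021-06-01/part-0000.parquet'."""
    m = re.search(r"dt=(\d{4}-\d{2}-\d{2})", key)
    if not m:
        raise ValueError(f"no partition in key: {key}")
    return {"dt": m.group(1)}


def handler(event, context=None):
    # An S3 put event triggers registration of the new partition so the
    # table becomes queryable on demand; the actual boto3
    # glue.batch_create_partition call is omitted from this sketch.
    return [partition_from_key(r["s3"]["object"]["key"])
            for r in event["Records"]]
```

A Lambda wired to S3 put events would receive the `Records` list shown in the test event and register each derived partition with the Glue Data Catalog.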
Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, Informatica BDM, T-SQL, Spark SQL, and Azure Data Lake Analytics. Ingested data into one or more Azure services and processed it in Azure Databricks. Implemented an ETL process with Informatica BDM and Python scripting to load data from Denodo into ThoughtSpot. Built CI/CD pipelines and developed features, scenarios, and step definitions for BDD and TDD using Cucumber, Gherkin, and Ruby.
Responsible for architecting Hadoop clusters and analysis. Migrated Hive UDFs and queries to Spark SQL for faster request handling. Configured Spark Streaming to receive real-time data from Apache Kafka and stored the streamed data in HDFS using Scala. Hands-on experience with Spark and Spark Streaming, creating RDDs and applying operations. Developed multiple POCs using Scala.
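The stream-to-HDFS persistence described above can be illustrated by how micro-batches from a Kafka topic might be bucketed into event-time HDFS directories before writing. This is a pure-Python sketch under assumed names; the base path, topic names, and year/month/day layout are illustrative, and the original work used Scala with Spark Streaming rather than these helper functions.

```python
# Hedged sketch: bucketing records from a Kafka-fed micro-batch into
# event-time HDFS directories. Path layout and names are assumptions.
from datetime import datetime, timezone


def hdfs_path_for(topic, epoch_seconds, base="/data/streams"):
    """Map a topic and event timestamp to a partitioned HDFS directory."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base}/{topic}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"


def bucket_batch(records, base="/data/streams"):
    """Group (topic, epoch_seconds, payload) records by target directory,
    as a micro-batch writer would before persisting each group to HDFS."""
    buckets = {}
    for topic, epoch_seconds, payload in records:
        buckets.setdefault(hdfs_path_for(topic, epoch_seconds, base), []).append(payload)
    return buckets
```

In a Spark Streaming job, each micro-batch would be grouped this way and each group written to its directory, keeping HDFS data partitioned for downstream Hive or Spark SQL queries.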