Description
Summary:
* An accomplished Hadoop/Spark developer experienced in ingestion, storage, querying, processing and analysis of big data.
* Extensive knowledge of Hadoop architecture and its components.
* Extensive experience working with various Hadoop distributions, including the enterprise versions of Cloudera and Hortonworks, with good knowledge of the MapR distribution and Amazon EMR.
* In-depth experience using Hadoop ecosystem tools such as HDFS, MapReduce, YARN, Pig, Hive, Sqoop, Spark, Storm, Kafka, Oozie, Elasticsearch, HBase, and ZooKeeper.
* Experienced in implementing schedulers using Oozie, crontab and shell scripts.
* Good working experience using Sqoop to import data into HDFS from RDBMS and vice versa.
* Extensive experience importing and exporting streaming data into HDFS using stream processing platforms such as Flume and the Kafka messaging system.
* Worked on the ELK stack (Elasticsearch, Logstash, Kibana) for log management.
* Exposure to data lake implementation using Apache Spark; developed data pipelines and applied business logic using Spark.
* Well-versed in Spark components such as Spark SQL, MLlib, Spark Streaming and GraphX.
* Extensively worked with Spark Streaming and Apache Kafka to fetch live stream data.
* Used Scala and Python to convert Hive/SQL queries into RDD transformations in Apache Spark.
* Expertise in writing Spark RDD transformations, actions, DataFrames and case classes for the required input data, and performed data transformations using Spark Core.
* Experience integrating Hive queries into the Spark environment using Spark SQL (illustrated in the sketch following this summary).
* Expertise in performing real-time analytics on big data using HBase and Cassandra.
* Experience developing data pipelines using Pig, Sqoop, and Flume to extract data from weblogs and store it in HDFS.
* Created User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) in Pig and Hive.
* Experienced in data cleansing using Pig Latin operations and UDFs.
* Hands-on experience with tools like Oozie and Airflow to orchestrate jobs.
* Proficient in NoSQL databases including HBase, Cassandra and MongoDB, and their integration with a Hadoop cluster.
* Expertise in cluster management and configuring the Cassandra database.
* Involved in maintaining big data servers using Ganglia and Nagios.
* Great familiarity with creating Hive tables, Hive joins and HQL for querying databases, eventually leading to complex Hive UDFs.
* Accomplished in developing Pig Latin scripts and using Hive Query Language for data analytics.
* Worked with different compression codecs (LZO, Snappy, GZIP).
* Developed cross-platform products while working with different Hadoop file formats such as SequenceFile, RCFile, ORC, Avro and Parquet.
* Extracted data from various data sources including OLE DB, Greenplum, Excel, flat files and XML.
* Good understanding of MPP databases such as HP Vertica and Impala.
* Experience in practical implementation of cloud-specific AWS technologies including IAM and Amazon cloud services such as Elastic Compute Cloud (EC2), ElastiCache, Simple Storage Service (S3), CloudFormation, Virtual Private Cloud (VPC), Route 53, Lambda and EBS.
* Built secured AWS solutions by creating VPCs with public and private subnets.
* Migrated an existing on-premises application to AWS; used EC2 and S3 for small data set processing and storage.
* Worked on data warehousing and ETL tools such as Informatica, Talend, and Pentaho.
* Expertise working with Java/J2EE, JDBC, ODBC, JSP, Eclipse, JavaBeans, EJB and Servlets.
* Worked in various programming languages using IDEs such as Eclipse, NetBeans, and IntelliJ.
* Excelled in using version control tools such as PVCS, SVN, VSS and Git.
* Performed web-based UI development using JavaScript, JSP, Java Swing, CSS, jQuery, HTML, HTML5 and XHTML.
* Proficient with application servers such as WebSphere, WebLogic, JBoss and Tomcat.
* Development experience with databases such as Oracle, MS SQL Server, Teradata, and MySQL.
* Developed stored procedures and queries using PL/SQL.
* Experience with best practices of web services development and integration (both REST and SOAP).
* Experienced in using build tools such as Ant, Gradle, SBT and Maven to build and deploy applications to the server.
* Experience automating scripts using Unix shell scripting to perform database activities.
* Knowledge of Unified Modeling Language (UML) and expertise in Object Oriented Analysis and Design (OOAD).
* Experience in the complete Software Development Life Cycle (SDLC) in both Waterfall and Agile methodologies.
* Knowledge of creating dashboards and data visualizations using Tableau to provide business insights.
* Experienced in ticketing tools such as Jira and ServiceNow.
* Working experience with Linux distributions such as Red Hat and CentOS.
* Good understanding of all aspects of testing, such as unit, regression, Agile, white-box and black-box testing.
* Good analytical, communication and problem-solving skills; adore learning new technical and functional skills.
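
The following is a brief, illustrative Scala sketch of the Spark SQL over Hive work summarized above; the database, table, column and path names are hypothetical and shown only to indicate the general approach.

    import org.apache.spark.sql.SparkSession

    object HiveToParquet {
      def main(args: Array[String]): Unit = {
        // SparkSession with Hive support, so Spark SQL can query existing Hive tables
        val spark = SparkSession.builder()
          .appName("HiveToParquet")
          .enableHiveSupport()
          .getOrCreate()

        // Run a Hive query through Spark SQL (sales.orders is a hypothetical table),
        // aggregate, and persist the result as Parquet in HDFS
        val orders = spark.sql("SELECT order_id, customer_id, amount, order_date FROM sales.orders")
        val dailyTotals = orders.groupBy("order_date").sum("amount")
        dailyTotals.write.mode("overwrite").parquet("hdfs:///warehouse/daily_order_totals")

        spark.stop()
      }
    }
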
* Experienced with SparkContext, Spark SQL and Spark on YARN.
* Implemented Spark SQL with various data sources such as JSON, Parquet, ORC and Hive.
* Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing.
* Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
* Worked on loading Avro/Parquet/TXT files into the Spark framework using Java/Scala, created Spark DataFrames and RDDs to process the data, and saved the files in Parquet format in HDFS for loading into the fact table using the ORC reader.
* Good knowledge of setting up batch intervals, slide intervals and window intervals in Spark Streaming.
* Implemented data quality checks using Spark Streaming and flagged records as passable or bad (see the sketch at the end of this project).
* Implemented Hive partitioning and bucketing on the collected data in HDFS.
* Involved in data querying and summarization using Hive and Pig and created UDFs, UDAFs and UDTFs.
* Implemented Sqoop jobs for large data exchanges between RDBMS and Hive clusters.
* Extensively used ZooKeeper as a backup server and job scheduler for Spark jobs.
* Knowledge of the MLlib (Machine Learning Library) framework for auto-suggestions.
* Developed traits and case classes in Scala.
* Developed Spark scripts using Scala shell commands as per the business requirements.
* Worked on the Cloudera distribution deployed on AWS EC2 instances and experienced in maintaining the Hadoop cluster on AWS EMR.
* Extracted data from various data sources including OLE DB, Greenplum, Excel, flat files and XML.
* Involved in running Hadoop streaming jobs to process terabytes of text data.
* Experienced in loading real-time data into a NoSQL database such as Cassandra.
* Experienced in using the DataStax Spark connector to store data in the Cassandra database from Spark.
* Involved in NoSQL (DataStax Cassandra) database design, integration and implementation; wrote scripts and invoked them using cqlsh.
* Well versed in using data manipulation, compactions and tombstones in Cassandra.
* Experience retrieving data from a Cassandra cluster by running queries in CQL (Cassandra Query Language).
* Worked on connecting the Cassandra database to the Amazon EMR File System for storing the database in S3.
* Implemented Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
* Deployed the project on Amazon EMR with S3 connectivity for backup storage.
* Developed schedulers that communicated with cloud-based services (AWS) to retrieve data.
* Well versed in using Elastic Load Balancer for auto scaling of EC2 servers.
* Implemented ETL standards utilizing proven data processing patterns with open-source tools such as Talend and Pentaho for more efficient processing.
* Well versed in data warehousing and ETL concepts using Informatica PowerCenter, OLAP, OLTP and AutoSys.
* Configured workflows that involve Hadoop actions using Oozie.
* Experienced with faceted search and full-text search querying using Solr.
* Used Python for pattern matching in build logs to format warnings and errors.
* Coordinated with the Scrum team in delivering agreed user stories on time for every sprint.
Environment: Hadoop YARN, Spark SQL, Spark Streaming, AWS S3, AWS EMR, GraphX, Scala, Python, Kafka, Hive, Pig, Sqoop, Solr, Cassandra, Cloudera, Oracle 10g, Linux.
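
A minimal sketch of the Spark Streaming data-quality checks described in this project, assuming a hypothetical Kafka topic named "events", a hypothetical broker address, and a simple comma-delimited record layout; the pass/bad flagging rule shown is only a placeholder.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object StreamQualityCheck {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamQualityCheck")
        val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second batch interval

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "broker1:9092",          // hypothetical broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "quality-check",
          "auto.offset.reset"  -> "latest"
        )

        // Direct stream from the (hypothetical) "events" topic
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

        // Flag each record: records with the expected field count pass, the rest are marked bad
        val flagged = stream.map(_.value()).map { line =>
          val flag = if (line.split(",", -1).length == 5) "pass" else "bad"
          s"$flag,$line"
        }
        flagged.saveAsTextFiles("hdfs:///landing/events/flagged")

        ssc.start()
        ssc.awaitTermination()
      }
    }

The direct (receiver-less) Kafka stream keeps a one-to-one mapping between Kafka partitions and RDD partitions, which is why it is commonly used for this kind of ingestion.
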
Company: Sherwin Williams, Cleveland, OH (Jan 2015 - Mar 2016)
Hadoop Developer
Responsibilities:
* Involved in the review of functional and non-functional requirements.
* Responsible for collection and aggregation of large amounts of data from various sources, ingested into the Hadoop file system (HDFS) using Sqoop and Flume; the data was transformed for business use cases using Pig and Hive.
* Collected and aggregated large amounts of weblogs and unstructured data from different sources such as web servers and network devices using Apache Flume, and stored the data in HDFS for analysis.
* Developed and maintained data integration programs in RDBMS and Hadoop environments with both RDBMS and NoSQL data stores for data access and analysis.
* Responsible for coding multiple MapReduce jobs in Java for data cleaning and processing, and for testing and debugging the MapReduce programs.
* Experienced in implementing MapReduce programs to handle semi-structured and unstructured data such as JSON, XML, Avro data files and sequence files for log files.
* Hands-on experience with the Hortonworks Hadoop distribution.
* Developed various Python scripts to find vulnerabilities in SQL queries by performing SQL injection, permission checks and analysis.
* Experienced in writing Spark applications in Scala and Python.
* Designed and implemented Spark jobs to support distributed data processing.
* Worked on importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
* Developed and implemented MapReduce jobs to support distributed processing using Java, Hive and Apache Pig.
* Executed Hive queries on Parquet tables stored in Hive to perform data analysis and meet the business requirements; implemented partitioning, dynamic partitions and buckets in Hive.
* Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
* Configured various views in Ambari such as the Hive view, Tez view, and YARN Queue Manager.
* Used Spark SQL to load data into Hive tables and wrote queries to fetch data from these tables (see the sketch at the end of this section).
* Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
* Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in Amazon EMR.
* Developed Pig scripts and UDFs as per the business logic.
* Used Pig to import semi-structured data from Avro files to make serialization faster.
* Used Oozie workflows and Java schedulers to manage and schedule jobs on a Hadoop cluster.
* Created Hive external tables using the Accumulo connector and indexed documents using Elasticsearch.
* Developed multi-hop Flume agents using an Avro sink to process web server logs and loaded them into MongoDB for further analysis.
* Implemented business logic, transformations and data quality checks using Flume interceptors in Java.
* Experience working with MongoDB for distributed storage and processing.
* Responsible for using a Flume sink to remove data from the Flume channel and deposit it in MongoDB.
* Implemented collections and the aggregation framework in MongoDB.
* Involved in maintaining Hadoop clusters using the Nagios server.
* Configured the Oozie workflow engine to automate MapReduce jobs.
* Collaborated with database, network, application and BI teams to ensure data quality and availability.
* Good knowledge of using Python scripts to handle data manipulation.
* Created Pig Latin scripts to sort, group, join and filter the enterprise-wide data.
* Experience processing large volumes of data and skills in parallel execution of processes using Talend functionality.
* Configured Hadoop clusters and coordinated with big data administrators for cluster maintenance.
* Experienced in using agile approaches including Test-Driven Development, Extreme Programming, and Agile Scrum.
Environment: Hortonworks HDP, Hadoop, Spark, Flume, Elasticsearch, AWS, EC2, S3, Pig, Hive, Python, MapReduce, HDFS.
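
To illustrate the Spark SQL loading and Hive partitioning/bucketing mentioned in this role, a small hypothetical Scala sketch follows; the weblog path, table name and the log_date/user_id columns are assumptions, not the actual schema.

    import org.apache.spark.sql.SparkSession

    object LoadWeblogsToHive {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("LoadWeblogsToHive")
          .enableHiveSupport()
          .getOrCreate()

        // Read semi-structured weblog JSON landed in HDFS (path is hypothetical)
        val logs = spark.read.json("hdfs:///landing/weblogs/")

        // Persist as a Hive table partitioned by date and bucketed by user for faster queries
        logs.write
          .mode("overwrite")
          .partitionBy("log_date")   // assumed date column
          .bucketBy(32, "user_id")   // assumed user column
          .sortBy("user_id")
          .saveAsTable("analytics.weblogs_clean")

        spark.stop()
      }
    }
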
Company: Chameleon Integrated Services, Mattoon, IL (Oct 2013 - Dec 2014)
Hadoop Developer
Responsibilities:
* Worked on analyzing the Hadoop cluster using different big data analytics tools including Pig, Hive, HBase and MapReduce.
* Extracted everyday customer transaction data from DB2, exported it to Hive and set up online analytical processing.
* Installed and configured Hadoop, MapReduce, and HDFS clusters.
* Created Hive tables, loaded the data and performed data manipulations using Hive queries in MapReduce execution mode.
* Developed MapReduce programs to cleanse data in HDFS obtained from heterogeneous data sources to make it suitable for ingestion into the Hive schema for analysis.
* Loaded the structured data that resulted from MapReduce jobs into Hive tables.
* Analyzed user request patterns and implemented various performance optimization measures, including partitions and buckets in HiveQL.
* Identified issues in behavioral patterns and analyzed the logs using Hive queries.
* Analyzed and transformed stored data by writing MapReduce or Pig jobs based on business requirements.
* Used Flume to collect, aggregate, and store web log data from different sources such as web servers, mobile and network devices, and imported it to HDFS.
* Using Oozie, developed workflows to automate the tasks of loading data into HDFS and pre-processing it with Pig scripts.
* Worked on various compression techniques such as GZIP and LZO.
* Integrated MapReduce with HBase to import bulk data using MR programs (see the sketch at the end of this section).
* Used Maven extensively for building jar files of MapReduce programs and deployed them to the cluster.
* Worked on developing Pig scripts for change data capture and delta record processing between newly arrived data and existing data in HDFS.
* Developed a data pipeline using Sqoop, Pig and Java MapReduce to ingest behavioral data into HDFS for analysis.
* Installed the Oozie workflow engine to run multiple Hive and Pig jobs which run independently based on time and data availability.
* Used Pig as an ETL tool to perform transformations, event joins and some pre-aggregations before storing the data in HDFS.
* Used SQL queries, stored procedures, User Defined Functions (UDFs) and database triggers, with tools like SQL Profiler and Database Tuning Advisor (DTA).
* Installed a cluster, commissioned and decommissioned data nodes, performed name node recovery, capacity planning, and slots configuration adhering to business requirements.
Environment: HDFS, MapReduce, Pig, Hive, Oozie, Sqoop, Flume, HBase, HiveQL, Java, Maven, Cloudera, AWS EC2, Avro, Eclipse and Shell Scripting.
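
As a closing illustration of the HBase integration listed above, a minimal Scala sketch of writing a single row through the HBase client API; the table name, row key, column family and values are hypothetical, and the bulk imports described in this role were done through MapReduce jobs rather than individual Puts.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object HBaseWriteExample {
      def main(args: Array[String]): Unit = {
        // Connect using the hbase-site.xml found on the classpath
        val conf = HBaseConfiguration.create()
        val connection = ConnectionFactory.createConnection(conf)
        val table = connection.getTable(TableName.valueOf("transactions")) // hypothetical table

        try {
          // Write one row: row key, column family "cf", qualifier "amount"
          val put = new Put(Bytes.toBytes("cust-001#2014-10-01"))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"), Bytes.toBytes("42.50"))
          table.put(put)
        } finally {
          table.close()
          connection.close()
        }
      }
    }
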