A Data Engineer designs, builds, and maintains the data infrastructure that holds your enterprise's BI and Advanced Analytics capabilities together. These capabilities allow your enterprise to turn multiple, disconnected streams of data into rational, data-driven decisions and customer engagement.
This role is similar, but not identical, to that of the traditional ETL (Extract/Transform/Load) developer. The main differences are the far larger volume of data that companies want to process and the number of additional technologies and skills that the Data Engineer needs to master, such as different file formats, ingestion engines, stream processing, batch processing, batch SQL, cloud platforms, data storage, cluster management, transactional databases, web frameworks, data visualization, and machine learning.
What exactly does a Data Engineer do?
Data engineers primarily focus on the following areas:
Build and maintain the organization's data pipeline systems
Data pipelines encompass the journey and processes that data undergoes within a company. Data Engineers are responsible for creating those pipelines. Creating a data pipeline may sound easy or trivial, but at big data scale it means bringing together multiple big data technologies. More importantly, a Data Engineer is the one who understands the various technologies and frameworks in depth, and knows how to combine them into data pipelines that enable a company's business processes.
Clean and wrangle data into a usable state
Data Engineers make sure the data the organization is using is clean, reliable, and prepped for whatever use cases may present themselves. Data Engineers wrangle data into a state that can then have queries run against it by BI Developers and Data Scientists. Data wrangling is a significant problem when working with Big Data, especially if you haven't been trained to do it, or you don't have the right tools to clean and validate data in an effective and efficient way.
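To make the cleaning and validation step concrete, here is a minimal sketch in plain Python. The records, the field names (customer_id, email, amount), and the validation rules are all invented for illustration; a real pipeline would apply its own rules, usually with a dedicated tool.

```python
# Hypothetical raw records, as they might arrive from an upstream system.
RAW_ROWS = [
    {"customer_id": " 101 ", "email": "Ann@Example.com ", "amount": "19.90"},
    {"customer_id": "102", "email": "", "amount": "5"},                      # missing email
    {"customer_id": "101", "email": "ann@example.com", "amount": "19.90"},   # duplicate
    {"customer_id": "103", "email": "bob@example.com", "amount": "n/a"},     # bad amount
]


def clean_rows(rows):
    """Return (clean, rejected) rows after trimming, validating, and de-duplicating."""
    clean, rejected, seen = [], [], set()
    for row in rows:
        # Normalize: trim whitespace everywhere and lowercase the email.
        record = {key: value.strip() for key, value in row.items()}
        record["email"] = record["email"].lower()
        # Validate: the id and amount must parse as numbers.
        try:
            record["customer_id"] = int(record["customer_id"])
            record["amount"] = float(record["amount"])
        except ValueError:
            rejected.append(row)
            continue
        # Validate: the email is a required field.
        if not record["email"]:
            rejected.append(row)
            continue
        # De-duplicate on the full normalized record.
        key = (record["customer_id"], record["email"], record["amount"])
        if key in seen:
            continue
        seen.add(key)
        clean.append(record)
    return clean, rejected


clean, rejected = clean_rows(RAW_ROWS)
print(len(clean), "clean,", len(rejected), "rejected")
```

Keeping the rejected rows instead of silently dropping them is a deliberate choice: it lets you report data quality problems back to the source system.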
What makes a good Data Engineer?
While most successful Data Engineers will have computer science or IT backgrounds, many great Data Engineers come from a range of engineering backgrounds - frequently, but not limited to, computer engineering.
Data Engineers will need to recommend and sometimes implement ways to improve data reliability, efficiency, and quality. To do so, they will need to blend the practical, creative problem solving of an engineer with "a variety of languages and tools to marry systems together or try to hunt down opportunities to acquire new data from other system-specific codes...can become information in further processing by BI Developers and Data Scientists."
Team oriented and collaborative
As companies increasingly recognize the need to balance analysis with data management, they are looking to build data science teams instead of hiring unicorn data scientists. For the Data Engineer, that means that in order to excel, they need to be able to collaborate effectively within IT and cross-enterprise teams. This requires not only bringing up-to-date Data Engineering expertise to the table, but also achieving alignment with broader enterprise needs, so that everyone is able to advance enterprise objectives.
Curious, and never stop learning
Given the endless problem-solving that Data Engineers face on a daily basis, a curiosity to know how things work and how to make them better is essential. Given the fast pace of change in our world, an outstanding Data Engineer has embraced a desire - a passion even - for continuous learning. Lifelong learning enables him or her to remain current regarding cutting edge technologies relevant for Data Engineering. Attending industry events and staying in the know is crucial.
10 Proven Steps to become a Data Engineer
1. Get comfortable with Linux/Unix commands
In your day-to-day activities, you will be using these commands constantly. For example, you will be moving files to different locations on your server or in the data lake, creating Bash scripts to run Hadoop jobs on the cluster, creating directories, changing file permissions, reading the first 10 lines of a file, and so on.
Some of these commands are very basic, but you need to become fluent in Linux so you can be more productive in your job.
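The same file tasks often need to be scripted as well. The sketch below mirrors a few of the commands mentioned above (mkdir -p, mv, chmod, head) using Python's standard library. The function names stage_file and head, the file name events.csv, and the lake/raw directory layout are made up for illustration.

```python
import shutil
import tempfile
from itertools import islice
from pathlib import Path


def stage_file(source: Path, landing_dir: Path) -> Path:
    """Move a file into a landing directory and set its permissions."""
    landing_dir.mkdir(parents=True, exist_ok=True)        # like: mkdir -p landing_dir
    target = landing_dir / source.name
    shutil.move(str(source), str(target))                 # like: mv source landing_dir/
    target.chmod(0o644)                                   # like: chmod 644 target
    return target


def head(path: Path, n: int = 10) -> list:
    """Return the first n lines of a text file, like: head -n 10 path."""
    with path.open() as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Demo in a throwaway directory so nothing on the real file system is touched.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "events.csv"
    source.write_text("\n".join(f"row {i}" for i in range(25)) + "\n")
    staged = stage_file(source, Path(workdir) / "lake" / "raw")
    print(staged.name, len(head(staged)))
```

Knowing the shell commands is still the priority. The point here is only that each of them has a direct scripted counterpart.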
2. Refresh your RDBMS and NoSQL Knowledge
No matter where you work, you will have traditional database systems and/or newer NoSQL document-based storage systems. You should already know these technologies if you are a software developer, but it is important to emphasize that you still need to understand how to access, read, write, and update these data storage systems. So practice interacting with databases such as MySQL, DynamoDB, or Cassandra.
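The read, write, and update operations mentioned above can be practiced without installing a server. The sketch below uses SQLite through Python's standard sqlite3 module as a stand-in for a database such as MySQL. The customers table and its rows are made up for illustration, and other databases use their own drivers and slightly different SQL dialects.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)

# Write: parameterized statements keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ann", "Lisbon"), (2, "Bob", "Oslo")],
)

# Update one row.
conn.execute("UPDATE customers SET city = ? WHERE id = ?", ("Porto", 1))
conn.commit()

# Read the rows back.
rows = conn.execute("SELECT id, name, city FROM customers ORDER BY id").fetchall()
print(rows)
conn.close()
```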
3. Get familiar with Apache Hive
After working hard to clean, transform, and give schemas to your datasets, you will need a presentation layer for data analysts or for people who might not know how to access your data through code. This is what the Apache Foundation says about Apache Hive: "The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL." In other words, it is a tool that reads large datasets and returns results based on SQL statements.
4. Understand where your data is stored and how it can be useful
As a future Data Engineer, you need to know conceptually where your data is located. Data can be in different formats, in different databases, on different servers, and so on. Once you pinpoint the location of the data that you want or need to work with, you can transfer it to the data lake so you can begin working on your data projects.
5. Familiarize yourself with different File Formats
Data will come to you in many different formats: CSV, XML, ORC, Parquet, and more. The Data Engineer's job is to quickly identify which format they have to deal with, as well as the schema (column names, data types, etc.). Data technologies such as Apache Spark have built-in readers for formats such as text, CSV, and Parquet, so you don't have to do anything special to read them.
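As a small illustration of identifying a schema, the sketch below infers column names and rough data types from a CSV sample using only Python's standard library. The sample data, the helper names infer_type and infer_schema, and the four type labels are assumptions made for this example. It is not how Spark or any other engine infers schemas.

```python
import csv
import io

# Hypothetical sample file contents.
SAMPLE_CSV = """id,name,price,in_stock
1,widget,9.99,true
2,gadget,12.50,false
3,gizmo,7,true
"""


def infer_type(values):
    """Return the narrowest type label that fits every value in a column."""
    for type_name, parser in (("int", int), ("float", float)):
        try:
            for value in values:
                parser(value)
            return type_name
        except ValueError:
            continue
    if all(value.lower() in {"true", "false"} for value in values):
        return "bool"
    return "string"


def infer_schema(text):
    """Map each CSV column name (from the header row) to an inferred type."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return {name: infer_type([row[name] for row in rows]) for name in reader.fieldnames}


print(infer_schema(SAMPLE_CSV))
```

Self-describing formats such as Parquet and ORC store the schema in the file itself, which is one reason they are popular in data lakes.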
6. Learn Apache Spark
There are several technologies available to work with data. Apache Spark is widely used in the industry to run batch jobs, process real-time streaming data, and even write and run machine learning algorithms. You can write Spark applications in several languages, including Java, Python, and Scala.
7. Learn about Real-Time Data and Streaming technologies
Today, data moves much faster than in the past. For some companies, data must be ingested as events occur, and decisions based on that data must be made very quickly. Several technologies allow you to ingest and process data in "real time".
Tools include Spark Streaming, Apache NiFi, Apache Kafka, StreamSets, Amazon Kinesis, and others. Your job will be to understand which tool is best for your use case.
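To show what processing data "as events occur" means, independent of any specific tool, here is a minimal plain-Python sketch of a tumbling (fixed-size) window count. The event data and the function name tumbling_counts are invented for illustration, and the sketch assumes events arrive in timestamp order, which real streaming systems cannot always rely on.

```python
from collections import Counter

# Hypothetical click events as (timestamp_in_seconds, page) pairs, ordered by time.
EVENTS = [(0, "home"), (2, "cart"), (4, "home"), (11, "home"), (13, "cart"), (25, "checkout")]


def tumbling_counts(events, window_seconds):
    """Yield (window_start, counts) as soon as each fixed-size window closes."""
    current_window = None
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - timestamp % window_seconds
        if current_window is not None and window_start != current_window:
            # An event from a later window arrived: the current window is complete.
            yield current_window, dict(counts)
            counts = Counter()
        current_window = window_start
        counts[key] += 1
    if current_window is not None:
        # Flush the final window once the input ends.
        yield current_window, dict(counts)


for window_start, page_counts in tumbling_counts(EVENTS, window_seconds=10):
    print(window_start, page_counts)
```

Because the function is a generator, each result is available as soon as its window closes instead of after the whole input has been read, which is the essential difference from a batch job.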
8. Learn and use Cloud Technologies
According to the Amazon Web Services (AWS) website, "Cloud computing is the on-demand delivery of compute power, database storage, applications, and other IT resources through a cloud services platform via the internet with pay-as-you-go pricing."
If you have expertise and experience with AWS or Microsoft Azure, you will be in greater demand on the job market.
9. Develop Data Architecture Skills
Writing code to clean the data is not the only task that you will be performing as a Data Engineer. You will also be asked to think outside the box and to model and architect the new data pipelines that you will be creating. Even though you will not be doing this alone, you still need to have a solid understanding of what your data looks like and how everything fits together from a high-level point of view. Data modeling or architecture also means being comfortable whiteboarding ideas before you actually implement them in your data pipelines.
10. Understand Big Data Parallel Processing
Traditional systems can no longer support the volumes of data being generated today, and much of that data is semi-structured or unstructured. To store and process this data, new programming paradigms have been created that process data in parallel across clusters of many computers.
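One widely known example of such a paradigm is map and reduce: each partition of the data is processed independently, then the partial results are combined. The sketch below imitates that idea on a single machine with Python's standard library. The partitions and function names are made up for illustration, and a thread pool stands in for what a real framework would distribute across many machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical input, already split into partitions as a cluster would see it.
PARTITIONS = [
    "big data needs big clusters",
    "clusters process data in parallel",
    "parallel processing scales",
]


def count_words(partition):
    """Map step: count words within one partition, independently of the others."""
    return Counter(partition.split())


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def parallel_word_count(partitions, workers=3):
    # Threads keep this sketch portable. For CPU-bound work on one machine,
    # ProcessPoolExecutor offers the same map() interface.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    return reduce(merge_counts, partials, Counter())


totals = parallel_word_count(PARTITIONS)
print(totals["big"], totals["data"], totals["parallel"])
```

The key property is that count_words never needs to see another partition, so the map step can run anywhere, in any order, and only the small partial results have to be brought together.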