How to Become a Data Engineer: Step-By-Step Guide

Travis Cuzick

You could easily argue that data is the next big gold rush, with almost every company in every industry collecting, storing, and analyzing it.

It helps them make better business decisions, learn more about their customers, predict sales and trends, or even train machine learning algorithms and A.I.

Add in the fact that cloud computing is still booming and there’s never been a better time to get involved with data and become a Data Engineer.

In this guide, I’ll pull back the curtains for you and show you exactly what it takes to become a Data Engineer and answer important questions like:

  • Do you need a degree to get started or get hired? Nope
  • Are there job opportunities? Yep... 353,409 in the US alone
  • How much does it pay? A lot!... $129,000+ is the average of those available jobs

I’ll also cover exactly what a Data Engineer does, the skills required in the role, and how to get the experience you need to land a job.

By the end of this guide, you’ll be able to take the first steps toward starting a new career!

Why listen to me?

My name is Travis Cuzick, I’m a self-taught developer and an instructor here at Zero to Mastery, teaching a wide range of data-focused courses.

I’ve also been architecting and coding data solutions for well over a decade now, for some of the biggest companies on the Fortune 500.

In my day job, I work as a Data Solutions Engineer (which is quite rare and ever so slightly different from a Data Engineer, but more on this in a second), and use my various skills to query and manipulate multi-terabyte enterprise data stores for major U.S. financial institutions. (Both ‘big’ and important data!).

However, as a self-taught developer, I know how daunting it can be to start a brand new career. So I’ll keep this guide as simple as I can, and give you the mile-high view and clear steps so you can take action without being overwhelmed.

So grab a cup of coffee and a notepad and let’s go!

What's the difference between a Data Solutions Engineer, a Data Engineer, a Data Analyst, and a Data Scientist?

There are so many different roles within data collection and data analysis that it can get a little confusing to know just who does what - especially when some roles have similarities or overlaps.

Fellow ZTM instructor Diogo Resende covers the differences between the 3 major roles in more detail in this guide, but here’s a mile-high overview:

Data Solutions Engineers

Data Solutions Engineers are a pretty rare role, but they design and architect comprehensive data solutions for an organization.

For example

A business knows it needs to track and collect data better, but isn’t always sure how to do this.

Or maybe they already collect data, but due to demand and scale, they don’t know how to keep collecting it in line with their current rate of growth. (A good problem to have).

The Data Solutions Engineer will speak with stakeholders so they can understand the business data requirements and end goals. Then they plan and design how data should ideally flow through the system to meet those agreed business objectives and propose the appropriate technologies to make that happen.

Data Solutions Engineers typically oversee the entire data ecosystem and may lead a team of data professionals. That being said, some smaller companies might have the Data Engineer fulfill this planning instead.

Data Engineers

Data Engineers are responsible for making this proposed solution work.

They focus on the development and maintenance of data infrastructure, pipelines, and systems for ingesting, processing, storing, and managing large volumes of data.

In simple terms, they set up systems to collect the data, clean the data, and keep it all running smoothly.

Data Analysts

Data Analysts take the data that the Data Engineer collects and then analyze what it all means. This way they can gain insights into what’s happening right now, and inform decision-making within an organization.

Data Scientists

Data Scientists take it a step further. They take that same data and apply advanced analytical and statistical techniques to gain additional insights, build predictive models, and solve complex business problems using data.

Basically, they’re using that current and past data to predict what might happen in the future.

TL;DR

If I can use a rough analogy:

  • The Data Solutions Engineer is like an architect, who helps you plan your dream house
  • The Data Engineer is the builder who creates that house based on those plans
  • The Data Analyst is you, making the most of the space in your new house and adding your furniture
  • While the Data Scientist is like an interior decorator, who can help make that home even more awesome - based on new styles and trends

Now that you understand the difference between these roles, let’s get a bit more technical and dive into the Data Engineer role in more detail.

What does a Data Engineer do?

The main responsibilities of a Data Engineer typically include:

1. Collaboration and cross-team communication

Data Engineers collaborate with cross-functional teams, including Data Scientists, Data Analysts, Software Engineers, and Data Solutions Engineers.

This means that they need to be able to communicate effectively to understand data requirements, provide technical expertise, and deliver solutions that meet the needs of the organization.

2. Designing and building Data Pipelines

This is the core task that people think of when we’re talking about a Data Engineer.

Data Engineers are responsible for designing, constructing, and maintaining data pipelines that extract, transform, and load (ETL) data from various sources into data storage systems such as data warehouses, data lakes, or databases.

This involves understanding data sources, defining data ingestion strategies, and implementing efficient ETL processes.
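
To make the extract, transform, and load steps more concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: the CSV content, the `orders` table, and the function names are hypothetical, and the standard library's `sqlite3` stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw data; in practice this would come from a file, API, or database.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,not_a_number
3,carol,7.25
"""

def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast types and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            clean.append(
                (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
            )
        except ValueError:
            continue  # a real pipeline would log or quarantine bad rows
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(total)
```

Real pipelines add scheduling, retries, and monitoring on top, but the three-stage shape stays the same.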

(Again, more on this later once we get into the exact skills and resources to learn).

3. Data Modeling and Schema Design

Rather than just collecting the data, they also design data models and schemas to structure and organize that data in a way that facilitates efficient storage, retrieval, and analysis, while also optimizing data structures for performance and scalability.

This does mean that a base knowledge of how Data Structures and Algorithms work is vital for this role.

4. Data Integration and Ingestion

Data Engineers will often have to integrate data from disparate sources, including databases, APIs, flat files, streaming data sources, and more.

They also need to develop processes for ingesting data accurately and consistently, even when dealing with large volumes of data in real time. Fortunately, there are tools to help with this.
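
As a rough illustration of what integrating disparate sources can look like, the sketch below maps a flat file and an API-style JSON payload onto one common record shape. The sample data, field names, and helper functions are all hypothetical.

```python
import csv
import io
import json

# Hypothetical sources: a flat file export and an API response.
CSV_SOURCE = "id,email\n1,a@example.com\n2,b@example.com\n"
JSON_SOURCE = '[{"userId": 3, "contact": {"email": "c@example.com"}}]'

def from_csv(text):
    """Map flat-file rows onto the common record shape."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"id": int(row["id"]), "email": row["email"], "source": "csv"}

def from_json(text):
    """Map nested API objects onto the same record shape."""
    for item in json.loads(text):
        yield {"id": item["userId"], "email": item["contact"]["email"], "source": "api"}

# Downstream code only ever sees one consistent structure.
users = list(from_csv(CSV_SOURCE)) + list(from_json(JSON_SOURCE))
print(users)
```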

5. Data Transformation and Cleansing

Now that the data has been collected, it needs to be cleaned, preprocessed, and transformed from raw data to make it suitable for analysis. (Bad data would lead to inaccurate information and poor business decisions).

This involves tasks such as data cleansing, deduplication, normalization, and aggregation to ensure data quality and consistency.
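
Here is a small sketch of those four tasks in plain Python, assuming a hypothetical list of customer records. In practice you would more likely do this in SQL or with a dataframe library, but the logic is the same.

```python
from collections import defaultdict

# Hypothetical raw records with inconsistent formatting.
raw = [
    {"email": " Alice@Example.com ", "country": "us", "spend": 20.0},
    {"email": "alice@example.com", "country": "US", "spend": 20.0},  # duplicate once normalized
    {"email": "bob@example.com", "country": "gb", "spend": 5.0},
    {"email": "", "country": "us", "spend": 3.0},  # missing key field
]

def normalize(rec):
    """Normalization: bring values into one consistent format."""
    return {
        "email": rec["email"].strip().lower(),
        "country": rec["country"].upper(),
        "spend": rec["spend"],
    }

# Cleansing: normalize every record and drop those without an email.
cleaned = [r for r in map(normalize, raw) if r["email"]]

# Deduplication: keep the first record seen for each email.
seen = set()
deduped = []
for r in cleaned:
    if r["email"] not in seen:
        seen.add(r["email"])
        deduped.append(r)

# Aggregation: total spend per country.
spend_by_country = defaultdict(float)
for r in deduped:
    spend_by_country[r["country"]] += r["spend"]

print(dict(spend_by_country))
```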

6. Performance Optimization

As businesses grow and tools evolve, Data Engineers need to continue to optimize data pipelines and storage systems for further performance, scalability, and reliability.

This usually involves tuning of database queries, improving ETL processes, implementing caching mechanisms, and utilizing parallel processing techniques to enhance data processing speed and efficiency.
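
Two of those techniques, caching and parallel processing, can be sketched with the Python standard library alone. The `lookup_region` function and its data are hypothetical stand-ins for a slow lookup, and threads are used here only for simplicity; CPU-heavy work would usually call for processes or a distributed engine instead.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=None)
def lookup_region(country_code):
    """Hypothetical slow lookup; repeat calls are served from the cache."""
    return {"US": "Americas", "GB": "Europe"}.get(country_code, "Other")

def enrich(chunk):
    """Process one chunk of records independently of the others."""
    return [(code, lookup_region(code)) for code in chunk]

# Split the work into chunks and process them in parallel.
chunks = [["US", "GB"], ["US", "FR"], ["GB", "US"]]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(enrich, chunks))  # map() keeps the input order

print(results)
```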

7. Data Quality Assurance

We need to make sure that our data continues to be correct, so Data Engineers develop and implement processes to monitor and maintain data quality throughout the data lifecycle.

They also establish data quality standards, perform data validation checks, and address any issues or anomalies that arise to ensure the accuracy and reliability of the data.
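
A data validation check can be as simple as a function that returns the rules a record breaks. The rules, field names, and sample records below are made up for illustration; real teams often use a dedicated validation framework for this.

```python
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # hypothetical data quality standard

def validate(record):
    """Return a list of rule violations for one record (empty means valid)."""
    errors = []
    if not record.get("order_id"):
        errors.append("missing order_id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    if record.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("unknown currency")
    return errors

records = [
    {"order_id": 1, "amount": 19.99, "currency": "USD"},
    {"order_id": 2, "amount": -5, "currency": "USD"},
    {"order_id": None, "amount": 10, "currency": "XYZ"},
]

# Build a report of problems per record, and keep only the clean rows.
report = {i: validate(r) for i, r in enumerate(records)}
valid = [r for r in records if not validate(r)]
print(report)
```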

8. Infrastructure Management

Finally, Data Engineers are often involved in managing the infrastructure required to support data processing and storage, including cloud services, databases, and big data technologies such as Hadoop, Spark, and Kafka.

This means that they deploy, configure, and maintain the necessary hardware and software components to support data operations.

Speaking of data and ‘big data’...

What’s the difference between a Data Engineer and a ‘Big Data’ Engineer?

In theory, both Data Engineers and Big Data Engineers do the same job - it’s just that the tools they use will change so that ‘big data’ can be handled.

It’s a little easier to understand once you know why it’s called ‘big’ data.

What exactly is ‘Big’ Data?

In simple terms, data is classified as ‘big’ once it can no longer be handled by traditional Data Engineering tools, such as Excel, or even more specific tools such as MySQL, etc.

This is usually due to either one or a combination of the following:

Volume

Big data usually involves datasets that are extremely large, often ranging from terabytes to petabytes or even exabytes in size.

Velocity

Big data is also often generated at high velocity, meaning that it's produced and updated rapidly.

This can include multiple data streams from sources such as sensors, social media feeds, transactional systems, or web logs. This means that either real-time or near-real-time processing may be required to analyze and derive insights from such data streams.

Think Black Friday sales, live stream events, or even the stock exchange.

Variety

Big data can also come in various forms and formats, including structured data (e.g. relational databases), semi-structured data (e.g. JSON, XML), and unstructured data (e.g. text, images, videos).

Managing and analyzing diverse data types is a key challenge in big data processing, and sometimes you just need more specialist tools for the job.

Complexity

Big data can also be fairly complex in terms of its structure, relationships, and interconnections.

This complexity may require advanced analytics techniques, such as machine learning and natural language processing, to extract meaningful insights.

TL;DR

Data Engineering and Big Data Engineering are basically the same job - just with a specialization of tools for specific big data requirements. That being said, Big Data Engineers are often paid slightly more due to this further specialization.

Speaking of which, let’s take a look at the career prospects in this industry…

Is Data Engineering a good career choice?

Yep. The global Big Data and Data Engineering Services market size was valued at $51.7 billion in 2022 and is expected to expand at a CAGR of 18.15% during the forecast period, reaching $137.5 billion by 2028.

Global Big Data + Data Engineering Market Size 2024-2028

So yeah, pretty big industry growth!
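
If you want to sanity-check growth figures like these yourself, the compound growth arithmetic is only a couple of lines of Python. This sketch assumes growth compounds once a year from 2022 to 2028; the variable names are mine.

```python
start_value = 51.7     # market size in USD billions, 2022
end_value = 137.5      # forecast in USD billions, 2028
cagr = 0.1815          # quoted compound annual growth rate
years = 2028 - 2022

# Project forward from the 2022 figure using the quoted rate.
projected = start_value * (1 + cagr) ** years

# Work backward: which yearly rate connects the two quoted values?
implied_cagr = (end_value / start_value) ** (1 / years) - 1

print(round(projected, 1), round(implied_cagr * 100, 1))
```

Under that assumption the projection lands a little above the quoted 2028 figure, and the implied rate a little below the quoted one. A gap that small is most likely down to rounding or a different base year in the underlying report.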

And as for jobs? There are currently 353,409 Data Engineering jobs available in the US right now on ZipRecruiter.

How much do Data Engineers get paid?

If we look at the average salary of those jobs on offer, it works out to around $129,716 per year, with some as high as $177,500.


This can vary based on location and experience, but $120,000+ seems pretty great if you ask me! And if you work hard, there’s no reason you can’t be making ~$200,000 with some years of experience under your belt.

Do I need a degree to become a Data Engineer?

No, not really. Some big tech companies may ask for a degree, but most tech companies don’t care. All they care about is that you can do the job, and have a portfolio of work to prove it.

Heck - if your portfolio is good enough, some big tech companies would also hire you without a degree.

However, you will need to learn and understand data structures and algorithms (a core concept taught in CS degrees) so that you can understand how systems scale. This is vital as a Data Engineer. You can pick up that skill from online courses though.

Speaking of skills - let’s take a look at what else you’ll need to know.

What skills do I need to become a Data Engineer?

The Data Engineering role is very technical and requires a fairly wide range of skills and experience.

I’ll go into them in more detail as we get to the roadmap, but here’s a rough overview:

  • Proficiency in programming languages such as Python, SQL, and possibly others like Java or Scala
  • Knowledge of data modeling concepts and techniques
  • Experience with database systems, both relational (e.g. PostgreSQL, MySQL) and NoSQL (e.g. MongoDB, Cassandra)
  • Familiarity with data warehousing technologies and concepts
  • Understanding of ETL (Extract, Transform, Load) processes and tools
  • Ability to design and optimize data pipelines for efficiency and scalability
  • Experience with big data technologies such as Hadoop, Spark, or Kafka (depending on the job requirements)
  • Familiarity with cloud platforms and services (e.g. AWS, Azure, GCP) for data storage and processing
  • Strong problem-solving and analytical skills
  • Knowledge of data governance and security best practices
  • Effective communication and collaboration skills to work with cross-functional teams and stakeholders

How long does it take to become a Data Engineer?

As a rough estimate, I would say around 12+ months or so - depending on how much time you can dedicate to learn each week.

The reason is that there’s quite a lot to learn for this role, and it’s partly why most Data Engineers don’t start out here as their first role in tech. (Some do, but not all).

In fact, most usually have prior experience working as Data Scientists or Software Engineers first with a focus on data management, before moving into these roles.

That being said, you can still start here and get hired as your first tech role, you just need to learn all the required skills, which will affect your time frame.

Also, the time to complete will depend on what you know already, but as we get into the roadmap in the next section, I’ll try and add some rough time estimates so you can judge for yourself.

How to become a Data Engineer: A step-by-step roadmap

Step #2 and Step #3 in this roadmap are where you’ll be spending the bulk of your time, and there are a lot of topics to learn. It’s also why the average wage is so high though!

Also, the majority of the skills you learn in Step #2 could get you hired in a lot of tech roles, which is why Data Engineering is usually a career progression role and not always someone's entry point.

With that out of the way, let’s take a look at these steps, along with resources where possible.

Step 1: Set yourself up for success

This first step is completely optional but highly recommended, because here’s the thing: Most people don’t know how to learn effectively.

It’s not their fault. Schools teach basic rote methods of learning which are pretty inefficient. They say the thing, and you try to remember the thing, and it's not great - especially if you require certain learning styles to learn best.

This means that topics you might do well with are harder to remember or apply, so it takes longer to learn.

The thing is, there are multiple different learning techniques that you can use that make all of your future learning efforts far more effective. This means you can understand faster and more efficiently, so less back and forth.

You can learn a lot of the key techniques for free right now in this guide, or better still, learn every important technique inside of Andrei’s learning how to learn course.


Estimated Time Required For This Step: 5 days.

I know it might feel like a step backward or even a detour, but think about it like this:

  • You can learn the core principles in a few days and then immediately start putting them into practice
  • You're going to learn everything else from now on 2x faster and retain way more as well
  • This is a skill that you can keep developing over time and will serve you for your entire career, guaranteed

Bear in mind that there are multiple skills that you need to pick up to become a Data Engineer, and each of them can take weeks or even months of work to complete.

So why not learn how to cut down on that time, improve your comprehension, and pick up skills faster and easier first? The time and energy savings will seriously compound as you go through the rest of the content you need to learn.

Then, once you’ve gone through that course and figured out how to learn faster, you can jump into learning Data Engineering at a more accelerated pace.

Step 2: Build a solid programming foundation

Start off by learning the fundamentals of computer science. One of the easiest ways to grasp this is to learn how data structures and algorithms work so that you know how data scales.
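
Here is a tiny example of why this matters for data work. Both approaches below give the same answer, but they scale very differently as the data grows. The IDs are made up for illustration.

```python
existing_ids = list(range(100_000))          # IDs already stored
incoming_ids = [5, 99_999, 100_000, 250_000]  # IDs arriving in a new batch

# A list checks membership by scanning element by element: O(n) per lookup.
new_via_list = [i for i in incoming_ids if i not in existing_ids]

# A set uses hashing, so each lookup is O(1) on average.
existing_set = set(existing_ids)
new_via_set = [i for i in incoming_ids if i not in existing_set]

print(new_via_list == new_via_set, new_via_set)
```

With four incoming IDs the difference is invisible, but with millions of them the list version can take hours while the set version takes seconds.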

I cannot stress enough how pivotal this is to all of Data Engineering, so check out Andrei’s course below or check out the first few lessons for free.


You’re also going to need to learn common programming concepts, such as how computers store data, what a data structure is, etc.



Once you’ve grasped those core concepts, you’re also going to need to become proficient in programming languages, specifically ones that are most commonly used in Data Engineering such as:

  • Python: Used for scripting, data manipulation, and building data pipelines (and usually the first language you’ll learn as a beginner)
  • SQL: Essential for storing and querying structured data in relational databases. Relational databases also give you transactional guarantees, which make your data easier to reason about, and SQL is a common interface for data warehouses and data lakes too. You’ll also need to know more advanced SQL features such as GROUP BY and window functions, as well as efficient schema design
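
To give you a feel for how the two languages work together, here is a short sketch that runs a GROUP BY query and a window function from Python, using the built-in `sqlite3` module. The `sales` table and its rows are hypothetical, and window functions assume SQLite 3.25 or newer (which ships with most current Python installs).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, day TEXT, amount REAL);
INSERT INTO sales VALUES
  ('east', '2024-01-01', 100), ('east', '2024-01-02', 50),
  ('west', '2024-01-01', 70),  ('west', '2024-01-02', 30);
""")

# GROUP BY collapses the rows into one total per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# A window function keeps every row and adds a running total per region.
running = conn.execute("""
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales
    ORDER BY region, day
""").fetchall()

print(totals)
print(running)
```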

Sidenote: You’ll also need to learn Java or Scala to work with big data tools but hold off on learning this until later. (I’ll remind you when in the steps).

The reason I recommend you wait till later with these two, is that you’re far better off becoming proficient in the other languages I’ve listed here first so you can start the job, and then further skill up into big data and learn Java later on.

Finally, learn Linux or Bash so that you can also run command line scripts and automations.

Why? Well, a lot of Data Engineering is done on virtual machines (for many reasons, one of which being that your home rig probably can’t run some of the tools), and so you need this skill also.

Take the courses I linked to above and make sure to complete the projects that are inside of each.

Estimated Time Required For This Step: 5-7 months depending on how much time you're spending learning.

Step 3: Learn Data Engineering concepts

This step is another huge section of your learning process, but I promise after this, it gets much easier.

Important: There are a lot of tools in Data Engineering and it’s easy to be overwhelmed. (Often there are 2 or 3 options for the exact same task).

With that in mind, focus on trying to learn the core concepts behind why we use a tool more than just trying to master every tool, as they are always being updated.

But by understanding what you want them to do and what you want to achieve, you can then apply that to any tool or update.

Top tip: You can also look at any relevant job posts at companies that you want to work for and see which of these tools they use, so you can then focus on those first. (A lot of tool experience is picked up on the job, but you still need basic comprehension and experience).

data engineer at meta


If we look at this current job post at Meta (Facebook), they don’t ask for experience with any one specific tool, other than the programming languages that we already recommended.

Instead, they ask for experience with tools that can do a specific thing in Data Engineering. In this case, they want experience with MapReduce, a processing model you can work with in both Apache Hadoop and Apache Spark.

So if you wanted to work at Facebook as a Data Engineer, you would be wise to focus on learning and using these tools vs other options.
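
If MapReduce is new to you, the idea is easier to see in code than in prose. This is a single-machine sketch of the map, shuffle, and reduce phases using a word count, with made-up input. Hadoop and Spark run the same pattern spread across many machines, and this is not how you would call either tool's API.

```python
from collections import defaultdict
from functools import reduce

documents = ["big data big pipelines", "data pipelines scale"]

# Map: emit a (key, value) pair for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values that share the same key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each key's values into a single result.
word_counts = {
    word: reduce(lambda a, b: a + b, counts) for word, counts in groups.items()
}

print(word_counts)
```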

With that covered, let’s look at some of the areas you need to learn:

Study the core Data Engineering concepts, such as:

  • Data pipelines
  • Data modeling
  • Database systems
  • Data storage and orchestration
  • ETL processes
  • Data warehousing
  • Batch and stream processing
  • Big Data technologies
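
To show one of these concepts in miniature, here is a sketch of the difference between batch and stream processing, using a simple average over some made-up sensor readings. The function names are hypothetical.

```python
def batch_average(readings):
    """Batch: the full dataset is available, so process it in one go."""
    return sum(readings) / len(readings)

def streaming_average(readings):
    """Stream: update the result incrementally as each event arrives."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count  # an up-to-date answer after every event

events = [10, 20, 30, 40]
batch_result = batch_average(events)
stream_results = list(streaming_average(iter(events)))

print(batch_result, stream_results)
```

The batch version gives one answer at the end, while the streaming version always has a current answer without needing to hold the whole dataset in memory.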

I’ve tried to add a few resources as we go, but you can grab any of these topics off this list and throw them into YouTube and get the core concepts. Just make sure that the videos are fairly recent.

For example

FreeCodeCamp has a great beginner video to Data Engineering that’s 3 hours long, which is a great place to start.


I also recommend the Seattle Data Guy blog, as well as his YouTube channel for all things Data Engineering specific.

Data Modeling and Warehousing

Learn basic data modeling principles and gain experience with data warehousing concepts, including:

  • Dimensional modeling (star schema, snowflake schema)
  • Data warehouse design and optimization
  • Extracting insights from structured and unstructured data sources
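
As a rough picture of dimensional modeling, here is a tiny star schema built with Python's `sqlite3` module: one fact table of sales in the middle, with product and date dimension tables around it. All table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who, what, when" of each event.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);

-- The fact table stores the measurements, keyed by the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Hardware'), (2, 'IDE license', 'Software');
INSERT INTO dim_date VALUES (1, '2024-01-15', '2024-01'), (2, '2024-02-03', '2024-02');
INSERT INTO fact_sales VALUES (1, 1, 2, 2000.0), (2, 1, 5, 500.0), (1, 2, 1, 1000.0);
""")

# A typical analytical query: join facts to dimensions, then aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

print(revenue_by_category)
```

A snowflake schema follows the same idea, but splits the dimensions further (for example, moving `category` into its own table).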

Again, you can throw these topics in YouTube and find some great resources.





Database Management

You also need a comprehensive knowledge of database systems and expertise in working with various types of databases, including relational databases, NoSQL databases, and data warehouses.

The course I shared earlier on SQL also covers database systems, but you're still going to need to learn about NoSQL and Data Warehousing, so pick one from each and learn them.



Data Processing and Extract, Transform, Load

Then, go ahead and get some experience with tools and frameworks for data processing, extraction, transformation, and loading (ETL), such as the Hadoop, Spark, and Kafka tools mentioned earlier.


Again, you don’t need to know everything. Check the tools your ideal company uses first, and focus on those.

You can learn Kafka directly from the creators.

Data Quality and Governance

Gain some knowledge of data quality assurance techniques and data governance best practices, including:

  • Data profiling and cleansing
  • Data validation and verification
  • Metadata management and lineage tracking

Infrastructure Management

Learn how to set up, configure, and manage data infrastructure components, including cloud services, databases, and big data technologies.

Estimated Time Required For This Step: 6-8 months depending on how much time you're spending learning, and which tools you decide to use.

Step 4: Get job ready

Alright, it's almost time to apply for entry-level Data Engineering jobs.

However, before you can apply, you want to make the best first impression that you can, and that involves 3 things:

  • Make sure that your LinkedIn profile looks professional and up to date. Even if you don't use the platform to apply for jobs, potential employers may look there to check you out. Not to mention, you can even get approached by headhunters and get job offers without you even applying!
  • Create a one-page resume for applications. Some will ask you to submit it when you apply online, so get one made
  • Create a portfolio of your project work. Companies are going to want proof that you can do the work required, so it’s important to have a portfolio of projects you’ve completed
  • Bonus: Some sites will possibly ask you to attach a cover letter of why you want to work there. It's not all the time, but just be aware that if they give the option, you should write a custom one each time

The good news?

Fellow ZTM instructor Dan Schifano goes through this in detail in his course on personal branding, including how to set up a professional portfolio that stands out amongst your peers. (As well as some other great tips to help you stand out even further).


Estimated Time Required For This Step: Actually building your portfolio site, resume, etc (i.e. the stuff Dan Schifano covers in his branding course) should only take you 1-2 weeks to set-up and prepare.

Go ahead and build your portfolio and then add your programming projects from Step #2 into it. Then, let’s look at adding some more specific Data Engineering type projects to it.

Step 5: Gain further experience

The goal here is to simply apply everything you’ve been learning and show more relevant experiences in your portfolio.

You can always have a quick Google search for new projects using current tools, but as a rule of thumb I recommend trying to find some that cover the following:

  • Practice writing SQL queries to manipulate and analyze data stored in relational databases
  • Work on projects that involve building data pipelines, extracting, transforming, and loading data, and setting up data storage and retrieval systems

Here are two great end-to-end projects that you can check out:



Estimated Time Required For This Step: 1-2 months depending on how many projects you decide to build.

As a final note on portfolio projects, you’re better off having fewer projects that are more specific and detailed, versus a lot of smaller projects with no relevance, so focus on those that will look best for your job applications.

Speaking of which, it's time to get hired and apply for some jobs!

Step 6: Apply for jobs

Now it's time to apply for jobs and get hired!

Trust me, you'll never feel 100% ready but if you've followed along so far, you are ready to start working in the real-world.

If you're anxious then understand this: The simple truth is that you don't need to know every detail about everything to get hired.

In fact, you'll pick up a lot of skills and experience simply by doing the job. It's about having the requirements to get started, and you already have that so start applying already!

We have an entire guide on applying for tech roles, but here are a few extra tips also.

Tech jobs are more than just tech skills

In addition to the technical know-how that you’ve built up through courses and certifications, interviewers will be evaluating your soft skills.

  • Be prepared with examples showing how you’ve collaborated with co-workers or led teams or projects in the past
  • Be able to explain the decisions you made for the projects in your portfolio and discuss various trade-offs that you made
  • Make sure to demonstrate strong communication skills in writing and during the interviews (whether virtual or in-person)... Even very basic things like using proper grammar and having no spelling mistakes, sending a thank you email within 24 hours of your interview, etc.

Have specific examples of how you’ve solved problems

A Data Engineer does a lot of troubleshooting and problem-solving during their normal work, but be prepared to talk through a situation or two where you saved the day by solving a complex or business-critical problem.

If you don’t have work-related examples, share stories from school or community projects.

The usual interview prep

Like any other kind of interview, it’s always good to:

  • Research the company. Learn what you can about their Data Engineering needs and why they’re hiring for your role
  • Learn what you can about the people you’ll be interviewing with, and what their potential areas of focus will be. You can always ask when they offer the interview, and they will happily let you know
  • Practice, practice, practice. Do a mock interview with friends or family, or even just interview yourself, speaking your answers out loud. It’s amazing the difference this makes, and how much more polished you’ll be on the big day
  • Be on time (or even a little bit early) for the interview
  • Dress the part. Figure out the “norm” for the company’s culture (jeans and T-shirt or more professional?) and dress to fit in. If you’re unsure, err on the side of dressing “up”

Do all this, and you’ll smash the interview and get the job.

Estimated Time Required For This Step: Usually somewhere between 1-6 months given all the potential factors, multiple applications, time to hear back, etc.

Step 7: Continue to skill up

By this point, you’re already knowledgeable enough to be a Data Engineer and have hopefully been hired.

However, it’s important in tech to keep skilling up (especially if you want to get paid even more), so I recommend 3 things:

1. Learn Java

Java is commonly used for building scalable data processing applications with frameworks like Apache Spark and other big data tools.


If you want to get into Big Data, then you need to know Java.

Speaking of which...

2. Become a ‘Big Data’ Engineer

We recommended a few of these tools earlier, but if you haven’t learned them already, go ahead and get familiar with big data platforms and technologies for handling large volumes of data, such as Hadoop, Spark, and Kafka.


3. Learn to use A.I. tools to make your life easier

Although you don't have to do this, I highly recommend that you learn to use AI tools to supplement what you do already.


You don't need to work solely with A.I. to see the benefit either. By learning to use these tools, you can increase your output and perform repeatable tasks in minutes vs hours or days.

And sure - the tools are not perfect. You still need the core knowledge that you've learned above, but by combining that experience with automation, you'll not only make your life easier - you'll also be more in demand.

A.I. won't steal your job. But people who can do their job faster and more effectively because they can use the tools, are going to be in high demand.

So add it to your skills, make work easier, and be the one that employers fight over!

We have a few courses on this that you can check out:

Check those out and see how they can help you.

Also, depending on when you read this, there may be new A.I. tools specific to your role, so have a quick Google search, see if there is anything that can help, and play around with it.

So what are you waiting for? Get started today!

So there you have it - the entire roadmap to becoming a Data Engineer within the next 12 months, or sooner, depending on how long you can dedicate each week.

Data Engineering is a great career to get into right now, with high demand (over 300,000 jobs in the US alone!), a great salary, and interesting topics to learn.

It’s not the easiest thing to pick up as a beginner, as you’ll be learning a lot of different skills and tools, but it is achievable.

You could even learn some of the core elements and get hired as a Data Scientist or Software Engineer first, so you can get hired and get paid. Then you can continue to skill up further and fill out the other gaps as recommended.

Heck, a lot of the content you’ll learn in Step #2 of this roadmap alone can open up a whole range of tech roles. So don’t feel like you need to have completed absolutely everything on this list before you can start getting paid.

P.S.

We don’t cover absolutely everything for this role here at ZTM, but we do cover a lot of the courses that I’ve mentioned above, which will get you 80-90% of the way there.

Those courses, and more, are all part of the Zero To Mastery Academy.

This means that if you become a member, then you have access to all of these courses right away and will have everything you need in one place.

Plus, as part of your membership, you'll get to join me and 1,000s of other people (some who are alumni mentors and others who are taking the same courses that you will be) in the ZTM Discord.

Ask questions, help others, or just network with other Data Analysts, Scientists, Engineers and other tech professionals.

Make today the day you take a chance on YOU. There's no reason why you couldn't be applying for Data Engineering jobs just 12 months from now.

So what are you waiting for 😀? Come join me and get started on becoming a Data Engineer today!
