You could easily argue that data collection and analysis are the next big gold rush, with almost every company in every industry collecting, storing, and using data.
It helps them make better business decisions, learn more about their customers, predict sales and trends, or even train machine learning algorithms and A.I.
Add in the fact that cloud computing is still booming and there’s never been a better time to get involved with data and become a Data Engineer.
In this guide, I’ll pull back the curtains for you and show you exactly what it takes to become a Data Engineer and answer important questions like:
I’ll also cover exactly what a Data Engineer does, the skills required in the role, and how to get the experience you need to land a job.
By the end of this guide, you’ll be able to take the first steps toward starting a new career!
My name is Travis Cuzick, I’m a self-taught developer and an instructor here at Zero to Mastery, teaching a wide range of data-focused courses.
I’ve also been architecting and coding data solutions for well over a decade now, for some of the biggest companies on the Fortune 500.
In my day job, I work as a Data Solutions Engineer (which is quite rare and ever so slightly different from a Data Engineer, but more on this in a second), and use my various skills to query and manipulate multi-terabyte enterprise data stores for major U.S. financial institutions (both ‘big’ and important data!).
However, as a self-taught developer, I know that it can be daunting to start a brand new career. So I’ll keep this guide as simple as I can, and give you the mile-high view and clear steps so you can take action without being overwhelmed.
So grab a cup of coffee and a notepad and let’s go!
There are so many different roles within data collection and data analysis that it can get a little confusing to know just who does what - especially when some roles have similarities or overlaps.
Fellow ZTM instructor Diogo Resende covers the differences between the 3 major roles in more detail in this guide, but here’s a mile-high overview:
Data Solutions Engineer is a pretty rare role. People in this role design and architect comprehensive data solutions for an organization.
For example:
A business knows it needs to track and collect data better, but isn’t always sure how to do this.
Or maybe they already collect data, but due to demand and scale, they don’t know how to continue collecting in line with their current rate of growth. (A good problem to have).
The Data Solutions Engineer will speak with stakeholders so they can understand the business data requirements and end goals. Then they plan and design how data should ideally flow through the system to meet those agreed business objectives and propose the appropriate technologies to make that happen.
Data Solutions Engineers typically oversee the entire data ecosystem and may lead a team of data professionals. That being said, some smaller companies might have the Data Engineer fulfill this planning instead.
Data Engineers are responsible for making this proposed solution work.
They focus on the development and maintenance of data infrastructure, pipelines, and systems for ingesting, processing, storing, and managing large volumes of data.
In simple terms, they set up systems to collect the data, clean the data, and keep it all running smoothly.
Data Analysts take the data that the Data Engineer collects and then analyze what it all means. This way they can gain insights into what’s happening right now, and inform decision-making within an organization.
Data Scientists take it a step further. They take that same data and apply advanced analytical and statistical techniques to gain additional insights, build predictive models, and solve complex business problems using data.
Basically, they’re using that current and past data to predict what might happen in the future.
If I can use a rough analogy:
Now that you understand the difference between these roles, let’s get a bit more technical and dive into the Data Engineer role in more detail.
The main responsibilities of a Data Engineer typically include:
Data Engineers collaborate with cross-functional teams, including Data Scientists, Data Analysts, Software Engineers, and Data Solutions Engineers.
This means that they need to be able to communicate effectively to understand data requirements, provide technical expertise, and deliver solutions that meet the needs of the organization.
This is the core task that people think of when we’re talking about a Data Engineer.
Data Engineers are responsible for designing, constructing, and maintaining data pipelines that extract, transform, and load (ETL) data from various sources into data storage systems such as data warehouses, data lakes, or databases.
This involves understanding data sources, defining data ingestion strategies, and implementing efficient ETL processes.
(Again, more on this later once we get into the exact skills and resources to learn).
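To make the extract, transform, and load idea a bit more concrete, here's a minimal sketch in Python. It's an illustration under a few assumptions, not a production setup: the sample data, the `orders` table, and the function names are all made up for this example, and Python's built-in `csv` and `sqlite3` modules stand in for a real source system and data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file, API, or database.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,5.00
3,alice,not_a_number
"""

def extract(raw_text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: cast types, tidy names, and drop rows that fail to parse."""
    clean = []
    for row in rows:
        try:
            clean.append(
                (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
            )
        except ValueError:
            # Skip malformed rows; a real pipeline would log or quarantine them.
            continue
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

def run_pipeline(raw_text, conn):
    """Run all three stages and return (row_count, total_amount) from the target."""
    load(transform(extract(raw_text)), conn)
    return conn.execute(
        "SELECT COUNT(*), ROUND(SUM(amount), 2) FROM orders"
    ).fetchone()

if __name__ == "__main__":
    print(run_pipeline(RAW_CSV, sqlite3.connect(":memory:")))
```

Keeping each stage as its own function is a deliberate choice here: it means you can test the transform logic on its own, without touching a database.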
Rather than just collecting the data, they also design data models and schemas to structure and organize that data in a way that facilitates efficient storage, retrieval, and analysis, while also optimizing data structures for performance and scalability.
This means that a basic knowledge of how data structures and algorithms work is vital for this role.
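As a rough illustration of what a data model can look like, here's a small sketch using Python's built-in `sqlite3` module. The layout (one "fact" table of sales pointing at one "dimension" table of customers) is a common warehouse pattern that I've picked for the example, and every table, column, and value in it is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: descriptive attributes live in the dimension table,
# while measurable events live in the fact table and reference it by key.
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    region      TEXT NOT NULL
);
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),
    sale_date   TEXT NOT NULL,
    amount      REAL NOT NULL
);
""")

conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "Alice", "West"), (2, "Bob", "East")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 1, "2024-01-05", 40.0), (2, 1, "2024-01-06", 10.0), (3, 2, "2024-01-06", 25.0)],
)

def sales_by_region(connection):
    """Join facts to the dimension and aggregate, as an analyst's query might."""
    query = """
        SELECT c.region, SUM(s.amount)
        FROM fact_sales AS s
        JOIN dim_customer AS c ON c.customer_id = s.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    return connection.execute(query).fetchall()

if __name__ == "__main__":
    print(sales_by_region(conn))
```

The point of structuring it this way is that customer details are stored once, so an analyst can slice sales by region (or any other customer attribute) with a single join.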
Data Engineers will often have to integrate data from disparate sources, including databases, APIs, flat files, streaming data sources, and more.
They also need to develop processes for ingesting data accurately and consistently, even when dealing with large volumes of data in real time. Fortunately, there are tools to help with this.
Now that the data has been collected, it needs to be cleaned, preprocessed, and transformed from raw data to make it suitable for analysis. (Bad data would lead to inaccurate information and poor business decisions).
This involves tasks such as data cleansing, deduplication, normalization, and aggregation to ensure data quality and consistency.
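Here's a small Python sketch of those four tasks on a handful of made-up records. The field names and the rules (trim and lowercase emails, uppercase country codes, drop rows with no email) are assumptions for the example; real cleaning rules depend entirely on your data. "Normalization" here means putting values into a consistent format.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a signup form export.
RAW = [
    {"email": " Alice@Example.com ", "country": "us", "spend": "10.50"},
    {"email": "alice@example.com", "country": "US", "spend": "10.50"},  # duplicate once normalized
    {"email": "bob@example.com", "country": "gb", "spend": "7"},
    {"email": "", "country": "US", "spend": "3"},  # missing key field
]

def clean(records):
    """Cleanse, normalize, and deduplicate records into (email, country, spend) tuples."""
    seen = set()
    cleaned = []
    for record in records:
        email = record["email"].strip().lower()      # normalize formatting
        if not email:                                # cleanse: drop unusable rows
            continue
        row = (email, record["country"].strip().upper(), float(record["spend"]))
        if row in seen:                              # deduplicate exact repeats
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned

def spend_by_country(rows):
    """Aggregate: total spend per country."""
    totals = defaultdict(float)
    for _email, country, spend in rows:
        totals[country] += spend
    return dict(totals)

if __name__ == "__main__":
    rows = clean(RAW)
    print(rows)
    print(spend_by_country(rows))
```

Notice that the first two records only show up as duplicates after normalization, which is why the order of these steps matters.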
As businesses grow and tools evolve, Data Engineers need to continue to optimize data pipelines and storage systems for further performance, scalability, and reliability.
This usually involves tuning of database queries, improving ETL processes, implementing caching mechanisms, and utilizing parallel processing techniques to enhance data processing speed and efficiency.
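As one small example of query tuning, the sketch below uses Python's built-in `sqlite3` module to compare a query plan before and after adding an index. The `events` table and the `idx_events_user` index are hypothetical names, and the exact plan wording varies between SQLite versions, so I haven't shown the output here.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it intends to run a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(i % 100, "x") for i in range(10_000)],
)

LOOKUP = "SELECT * FROM events WHERE user_id = ?"

# Without an index, the database has to scan every row to find matches.
plan_before = query_plan(conn, LOOKUP, (42,))

# With an index on the filtered column, it can jump straight to matching rows.
conn.execute("CREATE INDEX idx_events_user ON events(user_id)")
plan_after = query_plan(conn, LOOKUP, (42,))

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
```

The same habit carries over to bigger databases: check the plan first, then change one thing, then check it again.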
We need to make sure that our data continues to be correct, so Data Engineers develop and implement processes to monitor and maintain data quality throughout the data lifecycle.
They also establish data quality standards, perform data validation checks, and address any issues or anomalies that arise to ensure the accuracy and reliability of the data.
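A data validation check can be as simple as a function that looks over a batch of records and reports anything suspicious. Here's a minimal sketch, assuming two made-up rules (every row needs a unique `order_id` and a non-negative numeric `amount`). Dedicated data quality tools exist for this, but the underlying idea is the same.

```python
def validate(rows):
    """Return a list of human-readable problems found in a batch of records."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        order_id = row.get("order_id")
        if order_id is None:
            problems.append(f"row {i}: missing order_id")
        elif order_id in seen_ids:
            problems.append(f"row {i}: duplicate order_id {order_id}")
        else:
            seen_ids.add(order_id)

        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            problems.append(f"row {i}: invalid amount {amount!r}")
    return problems

if __name__ == "__main__":
    # Hypothetical batch with one clean row and two problem rows.
    batch = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 1, "amount": 5.0},
        {"order_id": None, "amount": -2},
    ]
    for problem in validate(batch):
        print(problem)
```

Returning a list of problems, rather than stopping at the first one, makes it easier to decide whether to reject the whole batch or just quarantine the bad rows.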
Finally, Data Engineers are often involved in managing the infrastructure required to support data processing and storage, including cloud services, databases, and big data technologies such as Hadoop, Spark, and Kafka.
This means that they deploy, configure, and maintain the necessary hardware and software components to support data operations.
Speaking of data and ‘big data’...
In theory, both Data Engineers and Big Data Engineers do the same job - it’s just that the tools they use will change so that ‘big data’ can be handled.
It’s a little easier to understand once you know why it’s called ‘big’ data.
In simple terms, data is classified as ‘big’ once it can no longer be handled by traditional data tools such as Excel, or even more specialized tools such as MySQL.
This is usually due to either one or a combination of the following:
Big data usually involves datasets that are extremely large, often ranging from terabytes to petabytes or even exabytes in size.
Big data is also often generated at high velocity, meaning that it's produced and updated rapidly.
This can include multiple data streams from sources such as sensors, social media feeds, transactional systems, or web logs. This means that either real-time or near-real-time processing may be required to analyze and derive insights from such data streams.
Think Black Friday sales, live stream events, or even the stock exchange.
Big data can also come in various forms and formats, including structured data (e.g. relational databases), semi-structured data (e.g. JSON, XML), and unstructured data (e.g. text, images, videos).
Managing and analyzing diverse data types is a key challenge in big data processing, and sometimes you just need more specialist tools for the job.
Big data can also be fairly complex in terms of its structure, relationships, and interconnections.
This complexity may require advanced analytics techniques, such as machine learning and natural language processing, to extract meaningful insights.
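To give a feel for the high-velocity case, here's a tiny Python sketch that groups timestamped events into fixed-size time windows and counts them, which is the kind of calculation a stream processor runs continuously. It's a simplified stand-in that works on a small in-memory list, and the event data and function name are made up for the example.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size time window.

    `events` is an iterable of (timestamp_in_seconds, key) pairs. Each event
    is assigned to the window that starts at the nearest lower multiple of
    `window_seconds`.
    """
    counts = Counter()
    for timestamp, _key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))

if __name__ == "__main__":
    # Hypothetical click events: (seconds since start, page).
    clicks = [(0, "home"), (3, "cart"), (9, "home"), (10, "checkout"), (25, "home")]
    print(tumbling_window_counts(clicks, window_seconds=10))
```

With real streams the events never stop arriving, so dedicated tools handle the hard parts for you, like late data and spreading the work across machines.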
Data Engineering and Big Data Engineering are basically the same job - just with a specialization of tools for specific big data requirements. That being said, Big Data Engineers are often paid slightly more due to this further specialization.
Speaking of which, let’s take a look at the career prospects in this industry…
Yep. The global Big Data and Data Engineering Services market size was valued at $51.7 billion in 2022 and is expected to expand at a CAGR of 18.15% during the forecast period, reaching $137.5 billion by 2028.
So yeah, pretty big industry growth!
And as for jobs? ZipRecruiter currently lists 353,409 Data Engineering jobs in the US.
If we look at the average salary of those jobs on offer, it works out to around $129,716 per year, with some as high as $177,500.
This can vary based on location and experience, but $120,000+ seems pretty great if you ask me! And if you work hard, there’s no reason you can’t be making ~$200,000 with some years of experience under your belt.
No, not really. Some big tech companies may ask for a degree, but most tech companies don’t care. All they care about is that you can do the job, and have a portfolio of work to prove it.
Heck - if your portfolio is good enough, some big tech companies would also hire you without a degree.
However, you will need to make sure you learn and understand data structures and algorithms (a core concept taught in CS degrees) so that you can understand how systems scale. This is vital as a Data Engineer. You can pick up that skill from online courses like I shared above, though.
Speaking of skills - let’s take a look at what else you’ll need to know.
The Data Engineering role is very technical and requires a fairly wide range of skills and experience.
I’ll go into them in more detail as we get to the roadmap, but here’s a rough overview:
As a rough estimate, I would say 12+ months, depending on how much time you can dedicate to learning each week.
The reason is that there’s quite a lot to learn for this role, which is partly why most Data Engineers don’t start out here as their first role in tech. (Some do, but not all).
In fact, most usually have prior experience working as Data Scientists or Software Engineers first with a focus on data management, before moving into these roles.
That being said, you can still start here and get hired into Data Engineering as your first tech role. You just need to learn all the required skills, which will affect your time frame.
Also, the time to complete will depend on what you know already, but as we get into the roadmap in the next section, I’ll try and add some rough time estimates so you can judge for yourself.
Step #2 and Step #3 in this roadmap are where you’ll be spending the bulk of your time, and there are a lot of topics to learn. It’s also why the average wage is so high though!
Also, the majority of the skills you learn in Step #2 could get you hired in a lot of tech roles. Hence why this is usually a career progression role and not always someone's entry point.
With that out of the way, let’s take a look at these steps, along with resources where possible.
This first step is completely optional but highly recommended, because here’s the thing: Most people don’t know how to learn effectively.
It’s not their fault. Schools teach basic rote methods of learning which are pretty inefficient. They say the thing, and you try to remember the thing, and it's not great - especially if you require certain learning styles to learn best.
This means that topics you might do well with are harder to remember or apply, so it takes longer to learn.
The thing is, there are multiple different learning techniques that you can use that make all of your future learning efforts far more effective. This means you can understand faster and more efficiently, so less back and forth.
You can learn a lot of the key techniques for free right now in this guide, or better still, learn every important technique inside of Andrei’s learning how to learn course.
Estimated Time Required For This Step: 5 days.
I know it might feel like a step backward or even a detour, but think about it like this:
Bear in mind that there are multiple skills that you need to pick up to become a Data Engineer, and each of them can take weeks or even months of work to complete.
So why not learn how to cut down on that time, improve your comprehension, and pick up skills faster and easier first? The time and energy savings will seriously compound as you go through the rest of the content you need to learn.
Then, once you’ve gone through that course and figured out how to learn faster, you can jump into learning Data Engineering at a more accelerated pace.
Start off by learning the fundamentals of computer science. One of the easiest ways to grasp this is to learn how data structures and algorithms work so that you know how data scales.
I cannot stress enough how pivotal this is to all of Data Engineering, so check out Andrei’s course below or check out the first few lessons for free.
You’re also going to need to learn common programming concepts, such as how computers store data, what a data structure is, etc.
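Here's a quick Python sketch of what "knowing how data scales" means in practice. It counts how many steps two classic search algorithms take to find a value, so you can see the gap widen as the data grows. The function names are my own, and the example assumes the data is already sorted (which binary search requires).

```python
def linear_search_steps(items, target):
    """Check items one by one; return how many items were examined."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps

def binary_search_steps(sorted_items, target):
    """Halve the search range each time; return how many items were examined."""
    steps = 0
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        steps += 1
        middle = (low + high) // 2
        if sorted_items[middle] == target:
            break
        if sorted_items[middle] < target:
            low = middle + 1
        else:
            high = middle - 1
    return steps

if __name__ == "__main__":
    for size in (1_000, 1_000_000):
        data = list(range(size))
        target = size - 1  # worst case for linear search: the last item
        print(size, linear_search_steps(data, target), binary_search_steps(data, target))
```

Going from a thousand items to a million makes the linear search do a thousand times more work, while the binary search only needs a handful of extra steps. That difference is exactly what matters once your datasets get large.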
Once you’ve grasped those core concepts, you’re also going to need to become proficient in programming languages, specifically ones that are most commonly used in Data Engineering such as:
Sidenote: You’ll also need to learn Java or Scala to work with big data tools but hold off on learning this until later. (I’ll remind you when in the steps).
The reason I recommend you wait till later with these two, is that you’re far better off becoming proficient in the other languages I’ve listed here first so you can start the job, and then further skill up into big data and learn Java later on.
Finally, learn Linux or Bash so that you can also run command line scripts and automations.
Why? Well, a lot of Data Engineering is done on virtual machines (for many reasons, one of which being that your home rig probably can’t run some of the tools), and so you need this skill also.
Take the courses I linked to above and make sure to complete the projects that are inside of each.
Estimated Time Required For This Step: 5-7 months depending on how much time you're spending learning.
This step is another huge section of your learning process, but I promise after this, it gets much easier.
Important: There are a lot of tools in Data Engineering and it’s easy to be overwhelmed. (Often there are 2 or 3 options for the exact same task).
With that in mind, focus on trying to learn the core concepts behind why we use a tool more than just trying to master every tool, as they are always being updated.
But by understanding what you want them to do and what you want to achieve, you can then apply that to any tool or update.
Top tip: You can also look at any relevant job posts at companies that you want to work for and see which of these tools they use, so you can then focus on those first. (A lot of tool experience is picked up on the job, but you still need basic comprehension and experience).
If we look at this current job post at Meta (Facebook), they don’t ask for experience with any one specific tool, other than the programming languages that we already recommended.
Instead, they ask for experience with tools that can do a specific thing in Data Engineering. In this case, they want experience with MapReduce, the programming model implemented by Apache Hadoop, whose map and reduce operations also appear in Apache Spark.
So if you wanted to work at Facebook as a Data Engineer, you would be wise to focus on learning and using these tools vs other options.
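If you haven't come across MapReduce before, here's the core idea as a small, single-machine Python sketch using the classic word count example. To be clear, this isn't how you'd use Hadoop or Spark themselves, and the function names are my own. It just shows the three phases those tools run for you across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(documents):
    """Run map, shuffle, and reduce over a collection of documents."""
    pairs = [pair for document in documents for pair in map_phase(document)]
    return reduce_phase(shuffle_phase(pairs))

if __name__ == "__main__":
    # Hypothetical input: each string stands in for one file or record.
    print(word_count(["big data big tools", "data pipelines"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can be split across many machines, which is what makes the model a good fit for big data.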
With that covered, let’s look at some of the areas you need to learn:
Study the core Data Engineering concepts, such as:
I’ve tried to add a few resources as we go, but you can grab any of these topics off this list and throw them into YouTube and get the core concepts. Just make sure that the videos are fairly recent.
For example:
FreeCodeCamp has a great beginner video to Data Engineering that’s 3 hours long, which is a great place to start:
I also recommend the Seattle Data Guy blog, as well as his YouTube channel for all things Data Engineering specific.
Learn basic data modeling principles and experience with data warehousing concepts, including:
Again, you can throw these topics in YouTube and find some great resources:
You also need a comprehensive knowledge of database systems and expertise in working with various types of databases, including:
The course I shared earlier on SQL also covers database systems, but you're still going to need to learn about NoSQL and Data Warehousing, so pick one from each and learn them.
Then, go ahead and get some experience with tools and frameworks for data processing, extraction, transformation, and loading (ETL), such as:
Again, you don’t need to know everything. Check the tools your ideal company uses first, and focus on those.
You can learn Kafka directly from the creators.
Gain some knowledge of data quality assurance techniques and data governance best practices, including:
Learn how to set up, configure, and manage data infrastructure components, including:
Estimated Time Required For This Step: 6-8 months depending on how much time you're spending learning, and which tools you decide to use.
Alright, it's almost time to apply for entry-level Data Engineering jobs.
However, before you can apply, you want to make the best first impression that you can, and that involves 3 things:
The good news?
Fellow ZTM instructor Dan Schifano goes through this in detail in his course on personal branding, including how to set up a professional portfolio that stands out amongst your peers. (As well as some other great tips to help you stand out even further).
Estimated Time Required For This Step: Actually building your portfolio site, resume, etc. (i.e. the stuff Dan Schifano covers in his branding course) should only take you 1-2 weeks to set up and prepare.
Go ahead and build your portfolio and then add your programming projects from Step #2 into it. Then, let’s look at adding some more specific Data Engineering type projects to it.
The goal here is to simply apply everything you’ve been learning and show more relevant experiences in your portfolio.
You can always have a quick Google search for new projects using current tools, but as a rule of thumb I recommend trying to find some that cover the following:
Here are two great end-to-end projects that you can check out:
Estimated Time Required For This Step: 1-2 months depending on how many projects you decide to build.
As a final note on portfolio projects, you’re better off having fewer projects that are more specific and detailed, versus a lot of smaller projects with no relevance, so focus on those that will look best for your job applications.
Speaking of which, it's time to get hired and apply for some jobs!
Now it's time to apply for jobs and get hired!
Trust me, you'll never feel 100% ready, but if you've followed along so far, you are ready to start working in the real world.
If you're anxious then understand this: The simple truth is that you don't need to know every detail about everything to get hired.
In fact, you'll pick up a lot of skills and experience simply by doing the job. It's about having the requirements to get started, and you already have that so start applying already!
We have an entire guide on applying for tech roles, but here are a few extra tips also.
In addition to the technical know-how that you’ve built up through courses and certifications, interviewers will be evaluating your soft skills.
A Data Engineer does a lot of troubleshooting and problem-solving during their normal work, but be prepared to talk through a situation or two where you saved the day by solving a complex or business-critical problem.
If you don’t have work-related examples, share stories from school or community projects.
Like any other kind of interview, it’s always good to:
Do all this, and you’ll smash the interview and get the job.
Estimated Time Required For This Step: Usually somewhere between 1-6 months given all the potential factors, multiple applications, time to hear back, etc.
By this point, you’re already knowledgeable enough to be a Data Engineer and have hopefully been hired.
However, it’s important in tech to keep skilling up (especially if you want to get paid even more), so I recommend 3 things:
Java is commonly used for building scalable data processing applications with frameworks like Apache Spark and other big data tools.
If you want to get into Big Data, then you need to know Java.
Speaking of which...
We recommended a few of these tools earlier, but if you haven’t learned them already, go ahead and get familiar with big data platforms and technologies for handling large volumes of data, including:
Although you don't have to do this, I highly recommend that you learn to use AI tools to supplement what you do already.
You don't need to work solely with A.I. to see the benefit either. By learning to use these tools, you can increase your output and perform repeatable tasks in minutes vs hours or days.
And sure - the tools are not perfect. You still need the core knowledge that you've learned above, but by combining that experience with automation, you'll not only make your life easier - you'll also be more in demand.
A.I. won't steal your job. But people who can do their job faster and more effectively because they can use the tools, are going to be in high demand.
So add it to your skills, make work easier, and be the one that employers fight over!
We have a few courses on this that you can check out:
Check those out and see how they can help you.
Also, depending on when you read this, there may be new A.I. tools specific to your role, so have a quick Google search, see if there's anything that can help, and play around with it.
So there you have it - the entire roadmap to becoming a Data Engineer within the next 12 months, or sooner, depending on how much time you can dedicate each week.
Data Engineering is a great career to get into right now, with high demand (over 300,000 jobs in the US alone!), a great salary, and interesting topics to learn.
It’s not the easiest thing to pick up as a beginner, as you’ll be learning a lot of different skills and tools, but it is achievable.
You could even learn some of the core elements and get hired as a Data Scientist or Software Engineer first, so you can start getting paid. Then you can continue to skill up further and fill in the other gaps as recommended.
Heck, a lot of the content you’ll learn in Step #2 of this roadmap alone can open up a whole range of tech roles. So don’t feel like you need to have completed absolutely everything on this list before you can start getting paid.
We don’t cover absolutely everything for this role here at ZTM, but we do cover a lot of the courses that I’ve mentioned above, that will get you 80-90% of the way there.
And more, are all part of the Zero To Mastery Academy.
This means that if you become a member, then you have access to all of these courses right away and will have everything you need in one place.
Plus, as part of your membership, you'll get to join me and 1,000s of other people (some who are alumni mentors and others who are taking the same courses that you will be) in the ZTM Discord.
Ask questions, help others, or just network with other Data Analysts, Scientists, Engineers and other tech professionals.
Make today the day you take a chance on YOU. There's no reason why you couldn't be applying for Data Engineering jobs just 12 months from now.
So what are you waiting for 😀? Come join me and get started on becoming a Data Engineer today!