Data engineering from design to non-trivial processing.
TRAINING FORMAT
ONLINE
WHO IT'S FOR
JUNIOR/MIDDLE
LEARN HOW TO PROPERLY PREPARE DATA OF ANY SIZE AND COMPLEXITY
Training samples for machine learning and beautiful charts for reports don't appear by themselves: data needs to be collected, stored, validated and combined from different sources, while reacting quickly to changes in its structure.
STANDARD PATH:
YOU START WORKING WITH THE DATA
→
YOU'RE TRYING TO MAKE IT SYSTEMATIC AND SCALABLE
→
YOU REALISE YOU DON'T HAVE THE KNOWLEDGE TO COVER THE ENTIRE DWH ARCHITECTURE
×
To work effectively with data, one tool is not enough - you need to consider all the interrelationships of a large warehouse, understand the customer's needs, and treat the data as an end product.
A strong data engineer, through breadth of knowledge and understanding of DWH architecture, is able to select the right tools for any task and deliver results to data consumers.
YOUR CV IN 5 MONTHS
Roy Mudd
Data Engineer
- I work with relational databases, including MPP systems, and understand the specifics of distributed systems built on Greenplum
- I know how to build and automate ETL/ELT pipelines with Apache Airflow
- I have experience with big data in Hadoop and Spark and can write complex SQL queries in Apache Hive
- I understand data warehouse (DWH) architecture, multidimensional modelling, anchor modelling and Data Vault techniques
- I have hands-on experience running Spark in Kubernetes and understand the basic approaches to building data warehouses in the cloud
- I understand how BI tools such as Tableau work and how to prepare data for them
- I apply ML models to big data, know how to prepare data for training, and understand approaches to dataset versioning with Data Version Control
- I know the basic approaches to data management based on DMBOK
DESIRED SALARY FROM
$150,000 per annum
HOW TRAINING TAKES PLACE
COURSE DETAILS
The lecturers will talk about the course and its content. You will learn what the value of each module is and how the knowledge gained will help you in your future work.
TRAINING FORMAT
- Training takes place in an intensive format of 3 lessons per week
- Homework assignments are done on real infrastructure
- All lectures and supplementary materials are available on the education platform and remain with you after the course is over
- Our students spend an average of 10 hours per week on their studies
WORK WITH DATA IN ANY SYSTEM
- Learn data warehouse architecture and approaches to data warehouse design
- Compare Hadoop-based Big Data solutions and relational MPP DBMSs in practice
- Learn to work with clouds and automate ETL processes with Airflow
UTILISE OUR INFRASTRUCTURE
- Work with all the tools you need on a dedicated server
- Improve your skills with Hadoop, Greenplum, PostgreSQL, Airflow, Spark, Hive and Kubernetes
ASK ANY SUPPORT QUESTIONS
- Discuss challenges and projects with market experts
- Your mentors will be data engineers from leading companies
WHO THIS COURSE IS FOR:
BI DEVELOPER
You are involved in developing business intelligence systems and want to master the architecture of modern data warehouses and learn how to design them.
DATA ENGINEER
Already working with data warehouses, but want to systematise your knowledge and dive deeper into the technologies involved.
DATA ANALYST
Constantly interacting with databases, but want to better understand ETL processes and take analytics to the next level.
BACKEND DEVELOPER
Have backend development experience and want to apply it to big data storage and processing challenges.
RECOMMENDED LEVEL:
PYTHON
> Knowledge of language syntax
> Understanding of basic data structures (list, dictionary, tuple)
> Mastery of OOP basics (class, object)
INFRASTRUCTURE
> Ability to work with the command line
> Knowledge of basic Linux commands
> Experience with Git
SQL
> Knowledge of basic syntax (SELECT, WHERE, GROUP BY, HAVING)
> Ability to create subqueries and make all kinds of JOINs
> Skill in working with window functions
COURSE PROGRAMME ://
We will start our dive into data engineering by getting acquainted with relational and MPP databases. We will look at their architecture, discuss popular solutions, and find out when MPP databases are better than traditional ones. We will learn how to work with PostgreSQL and with MPP databases, using Greenplum as an example.
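For a taste of what this looks like in practice, here is a minimal sketch of creating a Greenplum table with an explicit distribution key from Python via psycopg2; the connection details, table and column names are made up for illustration.

```python
# A minimal sketch: creating a Greenplum fact table with an explicit
# distribution key. Connection parameters and names are illustrative.
import psycopg2

conn = psycopg2.connect(host="gp-master.example.com", port=5432,
                        dbname="dwh", user="etl_user", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sales_fact (
            sale_id     bigint,
            customer_id bigint,
            sale_dt     date,
            amount      numeric(12, 2)
        )
        DISTRIBUTED BY (customer_id)  -- rows are spread across segments by this key
    """)
```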
ETL is a key process in data warehouse management. We will look at its principles and the main stages of building it, get acquainted with the popular Airflow tool, examine its main components in detail and learn how to automate ETL pipelines with it.
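As an illustration of the kind of pipeline automation covered here, below is a minimal Airflow DAG sketch (Airflow 2.x style; the task names and logic are illustrative): one extract task followed by one load task, scheduled daily.

```python
# A minimal Airflow DAG sketch: extract, then load, once a day.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source system")

def load():
    print("loading data into the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds
```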
We will get acquainted with the mechanisms of distributed big data storage based on Hadoop and analyse the main patterns for implementing distributed processing. We will consider fault tolerance and recovery from failures, and talk about streaming data processing and about methods and tools for monitoring and profiling Spark jobs.
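As a small, hedged example of the distributed processing pattern described above (the paths and column names are made up), the sketch below reads Parquet files from HDFS with Spark and computes a daily aggregate.

```python
# A minimal PySpark sketch: distributed read from HDFS and a grouped aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events_daily_agg").getOrCreate()

events = spark.read.parquet("hdfs:///data/raw/events/")  # distributed read
daily = (events
         .groupBy("event_date", "event_type")
         .agg(F.count("*").alias("events"),
              F.countDistinct("user_id").alias("users")))
daily.write.mode("overwrite").parquet("hdfs:///data/marts/events_daily/")
```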
A data warehouse is a centralised store of data from different sources. We will get acquainted with its top-level logical architecture, consider its main components and try out different approaches to designing the detailed DWH layer in practice.
We will consider cloud solutions and tools for building a DWH and a Data Lake, get acquainted with Kubernetes and learn how to use it for working with data. We will work with the cloud in practice and walk through installing and configuring JupyterHub and Spark in Kubernetes.
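To make the Spark-in-Kubernetes part more concrete, here is a minimal sketch of starting a Spark session from a notebook running inside the cluster; the API server address, namespace, container image and resource settings are all illustrative assumptions.

```python
# A minimal sketch: a Spark session whose executors run as Kubernetes pods.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("k8s://https://kubernetes.default.svc:443")
         .appName("jupyter-spark")
         .config("spark.kubernetes.namespace", "data")
         .config("spark.kubernetes.container.image", "example-registry/spark-py:3.5.0")
         .config("spark.executor.instances", "3")
         .config("spark.executor.memory", "4g")
         .getOrCreate())

spark.range(1_000_000).count()  # work is distributed across the executor pods
```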
We will consider the basic principles of working with data from the point of view of visualisation and learn to look at data through the eyes of its consumers. We will get acquainted with Tableau, a flexible and powerful BI tool, learn how it interacts with databases and use it to build an interactive dashboard for monitoring a DWH platform.
We will get acquainted with the theory of distributed machine learning. We will learn how to work with the popular Spark ML module and consider approaches to training and applying models on big data.
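For illustration, a minimal Spark ML sketch is shown below; the feature and label column names, and the prepared DataFrame `df`, are assumptions. It assembles features and trains a distributed logistic regression inside a Pipeline.

```python
# A minimal Spark ML sketch: feature assembly + logistic regression in a Pipeline.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

assembler = VectorAssembler(inputCols=["age", "income", "num_orders"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# df: a prepared Spark DataFrame with the feature columns and a binary "label"
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)          # training is distributed across executors
predictions = model.transform(test_df)  # scoring is also a regular Spark job
```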
In their work, engineers often have to prepare data for training ML models. We will consider tools for building ML pipelines, versioning datasets, and tracking and cataloguing models.
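As a small example of dataset versioning with Data Version Control, the sketch below reads a DVC-tracked file pinned to a specific revision via the dvc.api Python interface; the repository URL, file path and tag are made up.

```python
# A minimal sketch: reading a DVC-versioned dataset pinned to an exact revision.
import pandas as pd
import dvc.api

with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-project",  # illustrative repository
    rev="v1.2.0",  # git tag or commit that fixes the dataset version
) as f:
    train = pd.read_csv(f)
```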
In practice, you often have to deal with diverse data and a huge number of integrations and processes that perform various transformations on it. We will get acquainted with popular approaches to data management and discuss tools for data quality control and data provenance tracking.
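Dedicated data quality tools wrap this idea in a declarative form, but a minimal hand-rolled sketch of the underlying checks on a Spark DataFrame might look like this (the column names and rules are illustrative).

```python
# A minimal sketch of hand-rolled data quality checks on a Spark DataFrame.
from pyspark.sql import functions as F

def check_quality(df):
    total = df.count()
    checks = {
        "no_null_keys": df.filter(F.col("customer_id").isNull()).count() == 0,
        "unique_keys": df.select("customer_id").distinct().count() == total,
        "positive_amounts": df.filter(F.col("amount") < 0).count() == 0,
    }
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        raise ValueError(f"Data quality checks failed: {failed}")
    return True
```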
ALUMNI FEEDBACK/
I was satisfied with the course: I learnt new technologies (in an applied, rather than overview format) and closed gaps in my fundamental understanding. And most importantly, I got the idea of deploying my data solution in the cloud. As a result, I took a server on DigitalOcean and made my workspace there: I deployed clusters, Jupyter, Superset for visualisation, Airflow for automation, as well as Spark and ClickHouse, following all the recommendations from the lessons. I was very pleased with it.
Now I'm rebuilding my pet project and transferring it to this server - with process building as we discussed in the course. Of course, I don't have BigData, everything is much more prosaic and smaller, but now I have real experience ;).
Kevin
I worked with machine learning and analytics, doing scoring and recommendation models. In my previous job, I managed a team of data engineers. And I wanted to tighten up my competences. Now I've changed jobs because of the move. The company is smaller, so somewhere I do analytics, somewhere I act as an engineer, and somewhere I do development.
At first I took courses on Stepik, and from there I learnt about the Hard ML course. I return to my own Hard ML notes regularly to better solve work tasks. I had no doubts when buying the data engineering course, although I had high expectations after the Hard ML course. Results: overall everything I wanted to learn, I learnt. The theoretical videos were interesting and informative. I liked the block on cloud storages, I had an opportunity to deploy something of my own right away. Sometimes I revisit the block on ETL - the knowledge from there helps me to solve work tasks. A bit lacking in practice. I would like more assignments to write code. In terms of format - it is good that all lectures are recorded in advance. I think it's right - the lecturers don't get tired or exhausted. It's nice that a community has formed around the courses, and both students and professors help in chat rooms.
Nicole
TUITION FEE
Start mastering the data engineering profession, get access to remote server work and support from our instructors.
> Relational and MPP DBMS
> ETL process automation
> Big Data
> DWH design
> Cloud storage
> Data visualisation
> Big ML
> Model management
> Data management
> Support from teachers
> Working on a remote server
To pay for the course, you need to register on our education platform with your first name, last name and email.
If you already have an account, you can use it.
ASK A QUESTION
We will contact you and answer any questions you may have about the course.
FAQ
Yes, we carry out educational activities on the basis of a state licence.
To study comfortably on the course, you need to be able to write code in Python and compose SQL queries against databases. You will not need any specialised knowledge of data engineering.
You can watch the lectures from any device, but you will need a computer or laptop to write code. There are no hardware requirements - we will provide all the necessary infrastructure on a remote server. At the start of training, you don't need to install any special software - you will only need a browser and standard applications for communication: Telegram, Zoom, and Slack.
On average, our students study 10 hours a week. This is enough time to be able to watch lectures and complete homework on time.
We have organised the training in such a way that you can combine it with your work, study and personal life. You can study at any time and at a pace that suits you - all lectures are pre-recorded and broken down into short 15-30 minute videos, and there are soft two-week deadlines for homework.
The training lasts for 5 months. There will be three lessons each week, released gradually. The lessons consist of video lectures, notes and practical assignments with a two-week deadline; you keep access to the assignments after the deadline has passed. If you encounter difficulties during the course, you can seek help from mentors.
It is quite normal to get "stuck" on a task during training. In this case, we have a support team that will help you to solve a difficult task.
If things do not go according to plan and you feel that you are falling behind on the programme, please let the course supervisors know. Together we will find ways to make your learning experience more convenient.
All MPP DBMSs are based on the basic principles of distributing data across nodes and generating a parallel query plan from a sequential one. Once you have learnt these principles using Greenplum as an example, you can confidently use any other databases, including HP Vertica and Teradata. ClickHouse is a specialised database with a number of limitations: for example, it is difficult to join two derived tables that do not fit in memory. Greenplum has no such disadvantages, so we chose it.
Any MPP RDBMS has the same basic principles as Greenplum. If your company uses any MPP RDBMS (e.g. Vertica or Teradata), you will be able to apply all the knowledge gained during the course without any restrictions. If your company does not use an MPP RDBMS, then after the training you will either be able to propose its implementation or realise that there is no need for it.
Yes. In this module, we'll tell you how Tableau works internally, explain how query results are cached, and teach you how to configure connectors to different sources. We'll also talk about extracts and the different architectures of Tableau's work with databases, discuss how to merge data on the tool side (and whether it's worth it), and figure out when it's better to use long sources in Tableau and when it's better to use wide sources.
Creating a data mart that combines multiple sources is quite a complex process. We can't give you a universal guide, but we will explain all the steps in detail: designing the mart, working with the Hadoop stack, and interacting with analytical DBMSs and code-driven ETL platforms. Once you understand these steps, you will be able to solve the task at hand.
ANY QUESTIONS?
Fill out the form, we will contact you, answer all your questions and tell you more about the course.