mirror of https://github.com/kamranahmedse/developer-roadmap.git synced 2025-09-25 00:21:28 +02:00

add last batch of content for DE roadmap. Ready to PR

This commit is contained in:
Javi Canales
2025-08-22 12:18:20 +02:00
parent efd5d20089
commit 3b809d9f81
26 changed files with 191 additions and 26 deletions

View File

@@ -1 +1,3 @@
# Realtime
Real-time processing, also known as streaming processing, is the immediate ingestion and analysis of data as it is generated, providing instantaneous insights and enabling timely decisions in time-sensitive applications like financial trading, medical monitoring, and autonomous vehicles. Unlike batch processing, which collects data and processes it later in scheduled batches, real-time processing relies on continuous data streams, low latency, and high availability to deliver immediate outcomes for critical tasks.

View File

@@ -4,9 +4,9 @@ Relational databases are a type of database management system (DBMS) that organi
Visit the following resources to learn more:
- [@course@Databases and SQL](https://www.edx.org/course/databases-5-sql)
- [@article@Relational Databases](https://www.ibm.com/cloud/learn/relational-databases)
- [@article@51 Years of Relational Databases](https://learnsql.com/blog/codd-article-databases/)
- [@article@Intro To Relational Databases](https://www.udacity.com/course/intro-to-relational-databases--ud197)
- [@video@What is Relational Database](https://youtu.be/OqjJjpjDRLc)
- [@feed@Explore top posts about Backend Development](https://app.daily.dev/tags/backend?ref=roadmapsh)

View File

@@ -1 +1,7 @@
# Reusability
One of the goals of Infrastructure as Code (IaC) is to create modular, standardized units of code, such as modules or templates, that can be reused across multiple projects, environments, and teams, embodying the "Don't Repeat Yourself" (DRY) principle. This approach significantly boosts efficiency, consistency, and maintainability: it allows rapid deployment of identical infrastructure patterns, enforces organizational standards, simplifies complex setups, and improves collaboration by providing shared, tested building blocks for infrastructure management.
Visit the following resources to learn more:
- [@article@What is Infrastructure as Code (IaC)?](https://www.redhat.com/en/topics/automation/what-is-infrastructure-as-code-iac)

View File

@@ -1 +1,9 @@
# Reverse ETL
Reverse ETL is the process of extracting data from a data warehouse, transforming it to fit the requirements of operational systems, and then loading it into those other systems. This approach contrasts with traditional ETL, where data is extracted from operational systems, transformed, and loaded into a data warehouse.
While ETL and ELT focus on centralizing data, Reverse ETL aims to operationalize this data by making it actionable within third-party systems such as CRMs, marketing platforms, and other operational tools.
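As a rough sketch of the idea, the snippet below reads an aggregated metric out of a warehouse table (with sqlite3 standing in for a real warehouse) and pushes each row to a hypothetical CRM endpoint; the table name, URL, and API key are all illustrative placeholders:

```python
# A minimal Reverse ETL sketch: read aggregated rows from the warehouse
# and push them into an operational tool. sqlite3 stands in for a real
# warehouse; the endpoint and credentials are hypothetical.
import sqlite3

import requests

conn = sqlite3.connect("warehouse.db")  # placeholder warehouse connection
rows = conn.execute(
    "SELECT customer_id, lifetime_value FROM customer_metrics"  # assumed table
).fetchall()

for customer_id, lifetime_value in rows:
    # Transform the warehouse row into the shape the CRM expects.
    payload = {"id": customer_id, "traits": {"ltv": lifetime_value}}
    resp = requests.post(
        "https://api.example-crm.com/v1/contacts",  # hypothetical endpoint
        json=payload,
        headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
        timeout=10,
    )
    resp.raise_for_status()  # fail loudly so broken syncs are noticed
```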
Visit the following resources to learn more:
- [@article@What is Reverse ETL? A Helpful Guide](https://www.datacamp.com/blog/reverse-etl)
- [@video@What is Reverse ETL?](https://www.youtube.com/watch?v=DRAGfc5or2Y)

View File

@@ -1 +1,7 @@
# S3
Amazon S3 (Simple Storage Service) is an object storage service offered by Amazon Web Services (AWS). It provides scalable, secure, and durable storage on the internet. Designed for storing and retrieving any amount of data from anywhere on the web, it underpins a wide range of workloads, including mobile applications, websites, backup and restore, archiving, enterprise applications, IoT devices, and big data analytics.
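A minimal boto3 sketch of the common upload-and-list flow is shown below; it assumes AWS credentials are already configured (environment variables, `~/.aws`, or an instance role), and the bucket and key names are placeholders:

```python
# Upload a local file to S3, then list the objects under the same prefix.
import boto3

s3 = boto3.client("s3")

# Store a local file as an object in the (placeholder) bucket.
s3.upload_file("report.csv", "my-example-bucket", "reports/report.csv")

# List what is stored under that prefix.
response = s3.list_objects_v2(Bucket="my-example-bucket", Prefix="reports/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```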
Visit the following resources to learn more:
- [@official@S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html)

View File

@@ -1 +1,7 @@
# Segment
Segment is an analytics platform that provides a single API for collecting, storing, and routing customer data from various sources. With Segment, data engineers can add analytics tracking to their app without integrating with each analytics tool individually: Segment acts as a single point of integration, routing data to multiple downstream tools through one API.
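A minimal sketch with Segment's analytics-python library might look like the following; the write key, user ID, and event names are placeholders:

```python
# Identify a user and record an event with Segment's Python library
# (pip install analytics-python). Segment fans these calls out to
# whichever destinations are enabled for the source.
import analytics

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder from Segment settings

# Attach traits to a user, then record an event for that user.
analytics.identify("user-123", {"email": "ada@example.com", "plan": "pro"})
analytics.track("user-123", "Report Generated", {"rows": 1024})

analytics.flush()  # events are batched; flush before the process exits
```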
Visit the following resources to learn more:
- [@official@flutter_segment](https://pub.dev/packages/flutter_segment)

View File

@@ -1 +1,8 @@
# Sentry
Sentry tracks your software performance, measuring metrics like throughput and latency, and displaying the impact of errors across multiple systems. Sentry captures distributed traces consisting of transactions and spans, which measure individual services and individual operations within those services.
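A minimal sentry-sdk setup might look like the sketch below; the DSN is a placeholder for the one shown in your Sentry project settings:

```python
# Initialize Sentry error and performance monitoring (pip install sentry-sdk).
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    traces_sample_rate=0.2,  # sample 20% of transactions for tracing
)

# Unhandled exceptions are reported automatically; handled ones can be
# captured explicitly.
try:
    1 / 0
except ZeroDivisionError as exc:
    sentry_sdk.capture_exception(exc)
```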
Visit the following resources to learn more:
- [@official@Sentry](https://sentry.io)
- [@official@Sentry Documentation](https://docs.sentry.io/)

View File

@@ -1 +1,8 @@
# Serverless Options
Serverless data storage involves using cloud provider services for databases and object storage that automatically scale infrastructure and implement a consumption-based, pay-as-you-go model, eliminating the need for developers to manage, provision, or maintain any physical or virtual servers. This approach simplifies development, reduces operational overhead, and offers cost-effectiveness by charging only for the resources used, allowing teams to focus on applications rather than infrastructure management.
Visit the following resources to learn more:
- [@official@What Is Serverless Computing?](https://www.ibm.com/think/topics/serverless)

View File

@@ -1 +1,9 @@
# Slowly Changing Dimension - SCD
Slowly Changing Dimensions (SCDs) are a data warehousing technique used to track changes in dimension data over time. Instead of simply overwriting old data with new data, SCDs allow you to maintain historical records of how dimension attributes have changed. This is crucial for accurate analysis of historical trends and business performance.
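As a toy illustration of the common Type 2 pattern, the sketch below closes the current version of a changed row and appends a new one with its own validity window; all table and column names are illustrative:

```python
# SCD Type 2 in miniature: never overwrite a changed attribute — close the
# current row and append a new version with its own validity window.
from datetime import date

dim_customer = [
    {"customer_id": 42, "city": "Madrid",
     "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True},
]

def apply_scd2(rows, customer_id, new_city, change_date):
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # nothing changed
            # Close out the old version instead of overwriting it.
            row["valid_to"] = change_date
            row["is_current"] = False
    # Append the new version with an open-ended validity window.
    rows.append({"customer_id": customer_id, "city": new_city,
                 "valid_from": change_date, "valid_to": None,
                 "is_current": True})

apply_scd2(dim_customer, 42, "Lisbon", date(2025, 6, 1))
# History is preserved: one closed Madrid row, one current Lisbon row.
```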
Visit the following resources to learn more:
- [@article@Mastering Slowly Changing Dimensions (SCD)](https://www.datacamp.com/tutorial/mastering-slowly-changing-dimensions-scd)
- [@article@Implementing Slowly Changing Dimensions (SCDs) in Data Warehouses](https://www.sqlshack.com/implementing-slowly-changing-dimensions-scds-in-data-warehouses/)

View File

@@ -1 +1,8 @@
# Smoke Testing
Smoke testing is a software testing process that determines whether a deployed software build is stable. A passing smoke test gives the QA team confirmation to proceed with further testing. It consists of a minimal set of tests run on each build to exercise core software functionality.
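A smoke suite can be as small as the sketch below: a few fast pytest-style checks against a deployed environment. The base URL is a placeholder for your own deployment:

```python
# Minimal smoke tests: quick checks that gate deeper testing of a build.
import requests

BASE_URL = "https://staging.example.com"  # hypothetical environment

def test_health_endpoint_is_up():
    resp = requests.get(f"{BASE_URL}/health", timeout=5)
    assert resp.status_code == 200

def test_homepage_renders():
    resp = requests.get(BASE_URL, timeout=5)
    assert resp.status_code == 200
    assert "<title>" in resp.text  # the page rendered at all
```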
Visit the following resources to learn more:
- [@article@Smoke Testing | Software Testing](https://www.guru99.com/smoke-testing.html)
- [@feed@Explore top posts about Testing](https://app.daily.dev/tags/testing?ref=roadmapsh)

View File

@@ -1 +1,10 @@
# Snowflake
Snowflake is a cloud-based data platform that provides a data warehouse as a service. It allows organizations to store, analyze, and share data, offering features like data engineering, data governance, and collaboration capabilities. Snowflake is known for its scalability, ease of use, and ability to handle diverse workloads, including data warehousing, data lakes, and machine learning.
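A minimal sketch with the snowflake-connector-python package might look like this; all credentials and object names are placeholders:

```python
# Connect to Snowflake and run a query
# (pip install snowflake-connector-python).
import snowflake.connector

conn = snowflake.connector.connect(
    user="YOUR_USER",            # placeholder credentials
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",      # e.g. "xy12345.eu-west-1"
    warehouse="COMPUTE_WH",      # assumed warehouse/database/schema names
    database="ANALYTICS",
    schema="PUBLIC",
)

cur = conn.cursor()
try:
    cur.execute("SELECT CURRENT_VERSION()")
    print(cur.fetchone())
finally:
    cur.close()
    conn.close()
```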
Visit the following resources to learn more:
- [@official@Snowflake Docs](https://docs.snowflake.com/)
- [@official@Snowflake in 20 minutes](https://docs.snowflake.com/en/user-guide/tutorials/snowflake-in-20minutes)
- [@article@Snowflake Tutorial For Beginners: From Architecture to Running Databases](https://www.datacamp.com/tutorial/introduction-to-snowflake-for-beginners)
- [@video@Learn Snowflake in 2 Hours](https://www.youtube.com/watch?v=mP3QbYURT9k)

View File

@@ -1 +1,10 @@
# Snowflake
Snowflake is a cloud-based data platform that provides a data warehouse as a service. It allows organizations to store, analyze, and share data, offering features like data engineering, data governance, and collaboration capabilities. Snowflake is known for its scalability, ease of use, and ability to handle diverse workloads, including data warehousing, data lakes, and machine learning.
Visit the following resources to learn more:
- [@official@Snowflake Docs](https://docs.snowflake.com/)
- [@official@Snowflake in 20 minutes](https://docs.snowflake.com/en/user-guide/tutorials/snowflake-in-20minutes)
- [@article@Snowflake Tutorial For Beginners: From Architecture to Running Databases](https://www.datacamp.com/tutorial/introduction-to-snowflake-for-beginners)
- [@video@Learn Snowflake in 2 Hours](https://www.youtube.com/watch?v=mP3QbYURT9k)

View File

@@ -1 +1,3 @@
# Sources of Data
Sources of data are origins or locations from which data is collected, categorized as primary (direct, firsthand information) or secondary (collected by others). Common primary sources include surveys, interviews, experiments, and sensor data. Secondary sources encompass databases, published reports, government data, books, articles, and web data like social media posts. Data sources can also be classified as internal (within an organization) or external (from outside sources).

View File

@@ -1 +1,11 @@
# Star vs Snowflake Schema
A star schema is a way to organize data in a database, namely in data warehouses, to make it easier and faster to analyze. At the center, there's a main table called the **fact table**, which holds measurable data like sales or revenue. Around it are **dimension tables**, which add details like product names, customer info, or dates. This layout forms a star-like shape.
A snowflake schema is another way of organizing data. In this schema, dimension tables are split into smaller sub-dimension tables to keep data more normalized and detailed, producing a branching layout that resembles a snowflake.
The star schema is simple and fast, making it ideal when you need to extract data for analysis quickly. The snowflake schema, on the other hand, is more detailed: it prioritizes storage efficiency and managing complex data relationships.
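The toy sketch below contrasts the two layouts: in the star schema the product dimension is one wide, denormalized table, while in the snowflake schema its category attributes are split into a linked sub-dimension. All names are illustrative:

```python
# Star: one wide dimension row, joined to the fact table directly.
star_dim_product = {
    "product_id": 1, "name": "Espresso Machine",
    "category": "Appliances", "category_manager": "Dana",
}

# Snowflake: the same dimension normalized into linked tables.
snowflake_dim_product = {"product_id": 1, "name": "Espresso Machine",
                         "category_id": 10}
snowflake_dim_category = {"category_id": 10, "name": "Appliances",
                          "manager": "Dana"}

fact_sales = {"product_id": 1, "units": 3, "revenue": 899.97}

# Star answers the question in one lookup; snowflake needs a second hop.
print(star_dim_product["category_manager"])
print(snowflake_dim_category["manager"])
```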
Visit the following resources to learn more:
- [@official@Star Schema vs Snowflake Schema: Differences & Use Cases](https://www.datacamp.com/blog/star-schema-vs-snowflake-schema)

View File

@@ -1 +1,3 @@
# Streaming
Streaming processing, also known as real-time processing, is the immediate ingestion and analysis of data as it is generated, providing instantaneous insights and enabling timely decisions in time-sensitive applications like financial trading, medical monitoring, and autonomous vehicles. Unlike batch processing, which collects data and processes it later in scheduled batches, streaming relies on continuous data streams, low latency, and high availability to deliver immediate outcomes for critical tasks.

View File

@@ -1 +1,12 @@
# Streamlit
Streamlit is a free and open-source framework for rapidly building and sharing machine learning and data science web apps. It is a Python-based library designed specifically for data and machine learning engineers. Data scientists and machine learning engineers are typically not web developers, and they are not interested in spending weeks learning a full web framework to build apps. Instead, they want a tool that is easier to learn and use, as long as it can display data and collect the parameters needed for modeling.
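A complete Streamlit app can be as short as the sketch below; save it as `app.py` and run it with `streamlit run app.py`:

```python
# A tiny Streamlit app: one widget collects a parameter, one chart shows data.
import numpy as np
import pandas as pd
import streamlit as st

st.title("Random Walk Demo")

# Widgets collect parameters without any HTML or JavaScript.
steps = st.slider("Number of steps", min_value=10, max_value=1000, value=200)

# Streamlit reruns the script top to bottom on every interaction.
walk = pd.DataFrame({"value": np.random.randn(steps).cumsum()})
st.line_chart(walk)
```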
Visit the following resources to learn more:
- [@official@Streamlit Docs](https://docs.streamlit.io/)
- [@official@Streamlit Python: Tutorial](https://www.datacamp.com/tutorial/streamlit)
- [@video@Streamlit Explained: Python Tutorial for Data Scientists](https://www.youtube.com/watch?v=c8QXUrvSSyg)

View File

@@ -1 +1,8 @@
# Tableau
Tableau is a powerful data visualization tool used extensively by data analysts worldwide. Its primary role is to transform raw, unprocessed data into an understandable format without requiring technical skills or coding. Data analysts use Tableau to create visualizations, reports, and dashboards that help businesses make more informed, data-driven decisions. They also use it for tasks like trend analysis, pattern identification, and forecasting, all within a user-friendly interface. Moreover, Tableau's visualization capabilities make it easier for stakeholders to understand complex data and act on insights quickly.
Learn more from the following resources:
- [@official@Tableau](https://www.tableau.com/en-gb)
- [@video@What is Tableau?](https://www.youtube.com/watch?v=NLCzpPRCc7U)

View File

@@ -1 +1,12 @@
# Terraform
Terraform is an open-source infrastructure as code (IaC) tool developed by HashiCorp, used to define, provision, and manage cloud and on-premises infrastructure using declarative configuration files. It supports multiple cloud providers like AWS, Azure, and Google Cloud, as well as various services and platforms, enabling infrastructure automation across diverse environments. Terraform's state management and modular structure allow for efficient scaling, reusability, and version control of infrastructure. It is widely used for automating infrastructure provisioning, reducing manual errors, and improving infrastructure consistency and repeatability.
Visit the following resources to learn more:
- [@roadmap@Visit Dedicated Terraform Roadmap](https://roadmap.sh/terraform)
- [@official@Terraform Documentation](https://www.terraform.io/docs)
- [@official@Terraform Tutorials](https://learn.hashicorp.com/terraform)
- [@article@How to Scale Your Terraform Infrastructure](https://thenewstack.io/how-to-scale-your-terraform-infrastructure/)
- [@course@Complete Terraform Course](https://www.youtube.com/watch?v=7xngnjfIlK4)
- [@feed@Explore top posts about Terraform](https://app.daily.dev/tags/terraform?ref=roadmapsh)

View File

@@ -1 +1,9 @@
# Testing
Testing is a systematic process used to evaluate the functionality, performance, and quality of software or systems to ensure they meet specified requirements and standards. It involves various methodologies and levels, including unit testing (testing individual components), integration testing (verifying interactions between components), system testing (assessing the entire system's behavior), and acceptance testing (confirming it meets user needs). Testing can be manual or automated and aims to identify defects, validate that features work as intended, and ensure the system performs reliably under different conditions. Effective testing is critical for delivering high-quality software and mitigating risks before deployment.
Visit the following resources to learn more:
- [@article@What is Software Testing?](https://www.guru99.com/software-testing-introduction-importance.html)
- [@article@Testing Pyramid](https://www.browserstack.com/guide/testing-pyramid-for-test-automation)
- [@feed@Explore top posts about Testing](https://app.daily.dev/tags/testing?ref=roadmapsh)

View File

@@ -1 +1,8 @@
# Tokenization
Tokenization is the step where raw text is broken into small pieces called tokens, and each token is given a unique number. A token can be a whole word, part of a word, a punctuation mark, or even a space. The list of all possible tokens is the model's vocabulary. Once text is turned into these numbered tokens, the model can look up an embedding for each number and start its math. By working with tokens instead of full sentences, the model keeps the input size steady and can handle new or rare words by slicing them into familiar sub-pieces. After the model finishes its work, the numbered tokens are turned back into text through the same vocabulary map, letting the user read the result.
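As a toy illustration of the encode/decode loop, the sketch below uses a tiny hand-written whitespace vocabulary; real models use learned subword vocabularies such as BPE or WordPiece:

```python
# A toy tokenizer: map pieces of text to integer IDs and back again.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
inverse = {i: tok for tok, i in vocab.items()}

def encode(text):
    # Unknown words fall back to <unk>; subword tokenizers would instead
    # split them into familiar pieces.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids):
    return " ".join(inverse[i] for i in ids)

ids = encode("the cat sat on the mat")
print(ids)          # [1, 2, 3, 4, 1, 5]
print(decode(ids))  # "the cat sat on the mat"
```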
Visit the following resources to learn more:
- [@article@Explaining Tokens — the Language and Currency of AI](https://blogs.nvidia.com/blog/ai-tokens-explained/)
- [@article@What is Tokenization? Types, Use Cases, Implementation](https://www.datacamp.com/blog/what-is-tokenization)

View File

@@ -1 +1,8 @@
# Transactions
Transactions in SQL are units of work that group one or more database operations into a single, atomic unit. They ensure data integrity by following the ACID properties: Atomicity (all or nothing), Consistency (database remains in a valid state), Isolation (transactions don't interfere with each other), and Durability (committed changes are permanent). Transactions are essential for maintaining data consistency in complex operations and handling concurrent access to the database.
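As a minimal illustration using Python's built-in sqlite3 module, the transfer below either commits both updates or rolls both back, preserving atomicity:

```python
# Two updates form one atomic unit: money leaves one account only if it
# arrives in the other.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()          # both updates become permanent together
except sqlite3.Error:
    conn.rollback()        # on any failure, neither update is applied

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('alice', 70), ('bob', 80)]
```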
Learn more from the following resources:
- [@article@Transactions](https://www.tutorialspoint.com/sql/sql-transactions.htm)
- [@article@A Guide to ACID Properties in Database Management Systems](https://www.mongodb.com/resources/basics/databases/acid-transactions)

View File

@@ -1 +1,3 @@
# Types of Data Ingestion
The primary types of data ingestion are Batch, Streaming, and Hybrid. Batch ingestion processes data in large, scheduled chunks, suitable for non-time-sensitive tasks like monthly reports. Streaming (or Real-time) ingestion handles data as it arrives, ideal for time-sensitive applications such as fraud detection or IoT monitoring. Hybrid ingestion combines both methods, offering flexibility for diverse business needs.

View File

@@ -1 +1,9 @@
# Unit Testing
Unit testing is where individual **units** (modules, functions/methods, routines, etc.) of software are tested to ensure their correctness. This low-level testing ensures smaller components are functionally sound while taking the burden off of higher-level tests. Generally, a developer writes these tests during the development process and they are run as automated tests.
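A minimal example with pytest (one of several common Python test runners) might look like the sketch below; the function under test is illustrative:

```python
# One small function, a few focused checks. Save as test_slug.py and
# run with `pytest`.
def slugify(title: str) -> str:
    """Turn a title into a URL-friendly slug."""
    return "-".join(title.lower().split())

def test_basic_title():
    assert slugify("Hello World") == "hello-world"

def test_collapses_whitespace():
    assert slugify("  Data   Engineering ") == "data-engineering"

def test_already_lowercase():
    assert slugify("ok") == "ok"
```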
Visit the following resources to learn more:
- [@article@Unit Testing Tutorial](https://www.guru99.com/unit-testing-guide.html)
- [@video@What is Unit Testing?](https://youtu.be/3kzHmaeozDI)
- [@feed@Explore top posts about Testing](https://app.daily.dev/tags/testing?ref=roadmapsh)

View File

@@ -1 +1,3 @@
# What and why use them?
In data engineering, messaging systems act as central brokers for data communication, allowing different applications and services to send and receive data in a decoupled, scalable, and fault-tolerant way. They are crucial for handling high-volume, real-time data streams, building resilient data pipelines, and enabling event-driven architectures by acting as buffers and communication channels between data producers and consumers. Key benefits include decoupling systems for agility, ensuring data reliability through queuing and retries, and horizontal scalability to manage growing data loads, while common examples include Apache Kafka and message queues like RabbitMQ and AWS SQS.
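As a toy illustration of the decoupling idea, the sketch below connects a producer and a consumer only through an in-process queue; a broker like Kafka or RabbitMQ plays the same role across services and machines:

```python
# Producer and consumer share only the queue, not each other's code;
# the queue also buffers bursts, like a broker does at scale.
import queue
import threading

events = queue.Queue(maxsize=100)

def producer():
    for i in range(5):
        events.put({"event_id": i, "payload": f"reading-{i}"})
    events.put(None)  # sentinel: no more events

def consumer():
    while True:
        event = events.get()
        if event is None:
            break
        print("processed", event["event_id"])

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```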

View File

@@ -1 +1,8 @@
# Data Warehouse
**Data Warehouses** are data storage systems designed for analyzing, reporting, and integrating with transactional systems. The data in a warehouse is clean, consistent, and often transformed to meet a wide range of business requirements. Hence, data warehouses provide structured data but require more processing and management compared to data lakes.
Learn more from the following resources:
- [@article@What Is a Data Warehouse?](https://www.oracle.com/database/what-is-a-data-warehouse/)
- [@video@What is a Data Warehouse?](https://www.youtube.com/watch?v=k4tK2ttdSDg)

View File

@@ -1 +1,7 @@
# Apache Hadoop YARN
Apache Hadoop YARN (Yet Another Resource Negotiator) is the part of Hadoop that manages resources and runs jobs on a cluster. It has a ResourceManager that controls all cluster resources and an ApplicationMaster for each job that schedules and runs tasks. YARN lets different tools like MapReduce and Spark share the same cluster, making it more efficient, flexible, and reliable.
Visit the following resources to learn more:
- [@video@Hadoop Yarn Tutorial](https://www.youtube.com/watch?v=6bIF9VwRwE0)