15 docs tagged with "#data"

Data Drift

Data drift refers to the phenomenon where the statistical properties of a dataset used for machine learning or analysis change over time. This alteration can be due to various factors, such as shifts in data collection processes, changes in the underlying distribution of the data, or modifications in the environment from which the data originates. Detecting and addressing data drift is crucial to maintaining the performance and reliability of machine learning models and analytical systems.
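One way to make "statistical properties change over time" concrete is to compare a reference sample of a feature with a more recent one. The sketch below is only an illustration, not a method taken from the linked doc: it computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) using the standard library alone. The names `ks_statistic` and `has_drifted`, and the `0.1` threshold, are assumptions chosen for the example.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Return the largest gap between the empirical CDFs of two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    # The empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def has_drifted(reference, current, threshold=0.1):
    """Flag drift when the KS statistic exceeds a cutoff.

    The threshold is a hypothetical value for illustration; in practice
    it would be tuned or replaced by a proper significance test.
    """
    return ks_statistic(reference, current) > threshold


# Synthetic data: one recent sample from the same distribution as the
# reference, and one whose mean has shifted.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]

print("same distribution drifted:", has_drifted(reference, same))
print("shifted distribution drifted:", has_drifted(reference, shifted))
```

A per-feature check like this only catches changes in a single variable's distribution. Shifts in the relationship between features, or between features and the target, need other checks.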

Data Engineering

Apache Spark is an open-source, distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark extends the MapReduce model to efficiently cover more types of computation, including interactive queries and stream processing. One of Spark's key features is in-memory cluster computing, which increases an application's processing speed.

Data Engineering with dbt

This is a book about data engineering, with a sprinkle of dbt as well. It is not a book on dbt; it is most definitely a book on data engineering. It contains data engineering knowledge and ways of working.

Data Science

What is data science? It is a bunch of different jobs lumped together and given an AI label to make a company sound innovative.

Data Science

What is data science? It is a bunch of different jobs lumped together and given an AI label to make a company sound innovative.

Data Science the Hard Parts

This book dives into the difficult aspects of data science: business value proposition, communication, and measuring impact. It discusses each of these topics and presents methods for handling them well.

Data Strategy

This book is about strategy, and data is the context in which strategy is discussed. Some things, like the McKinsey data maturity model, are discussed, but the main gist is the strategy. ‘Change is inevitable. … Change is constant.’ This is an important aspect of the entire book.

Database

In the context of business, everything is a database. Databases are the bedrock of how we design things nowadays.

DBT

dbt is an open-source command-line tool that enables data transformation and modeling in a structured and efficient manner. It allows data engineers and analysts to define and manage the data transformation pipeline using SQL queries. With dbt, you can write modular and reusable SQL code called "models," which define the transformations required to convert raw data into structured and analysis-ready data. These models can be organized, tested, and documented within the dbt framework. dbt leverages the power of SQL and provides a layer of abstraction on top of the data warehouse, making it easier to develop, test, and maintain complex data transformations. It promotes best practices such as version control, testing, and documentation, enabling collaborative and maintainable data modeling workflows. dbt integrates with various data warehouses and can be used in conjunction with other data tools and orchestration platforms to create a robust and reliable data pipeline.

Fabric

In the fabric section, I write as much as I am able about the Microsoft data warehouse ecosystem and use Microsoft Fabric as an overarching theme for all things considered. I will not cover Databricks or Snowflake on Azure here, as they have their own sections.

GCP

I have mostly worked with GCP on my own. It is the little brother of the

Getting Started with Streamlit for Data Science

This book is a welcoming introduction to a Python module that has seen rapid growth. It offers a brief overview of the application's capabilities and shows how its user-friendly nature makes it an inclusive tool for both new and experienced data scientists.

Hands-On Unsupervised Learning Using Python

This book is an introduction to unsupervised machine learning techniques and practices. It introduces methods of unsupervised learning for clustering, correlations and time series analysis. It analyses models and provides guidance on how to use them.

Interpretable Machine Learning

This book is about methods for understanding AI and data modeling, and about how to use the different ways of interpreting machine learning models. It lays out the basics and then dives into the different types of models and methods you can use. It separates the methods into those specific to a model or model family and those that are model-agnostic.

Programming Internals

I made this article to filter out a lot of the more computer sciency things in programming.