Version 2.8.0 Gives You Early Access to Zookeeper-Less Kafka

Introduction

Apache Kafka 2.8.0 is finally out, and you can now get early access to KIP-500, which removes the Apache ZooKeeper dependency. Instead, Kafka now relies on an internal Raft quorum that can be activated through Kafka Raft metadata mode (KRaft). The new feature simplifies cluster administration and infrastructure management, and marks a new era for Kafka itself.

Zookeeper-less Kafka

In this article, we are going to discuss why there was a need to remove the ZooKeeper dependency in the first place. Additionally, we will discuss how ZooKeeper has been replaced by KRaft mode as of version 2.8.0 …
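As a rough sketch of what the early-access mode looks like in practice: KRaft is driven by a handful of broker settings (the node id, listener names, and quorum address below are illustrative assumptions, not values from the article):

```
# config/kraft/server.properties (excerpt) — roles and quorum for KRaft mode
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
```

Before the broker starts for the first time, the storage directory has to be formatted with a cluster id, e.g. via `kafka-storage.sh random-uuid` followed by `kafka-storage.sh format`.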


Save time when converting large Spark DataFrames to Pandas

Converting a PySpark DataFrame to Pandas is quite trivial thanks to the toPandas() method. However, this is probably one of the most costly operations, and it must be used sparingly, especially when dealing with fairly large volumes of data.

Why is it so costly?

Pandas DataFrames are stored in memory, which means that operations over them are faster to execute. However, their size is limited by the memory of a single machine.

On the other hand, Spark DataFrames are distributed across the nodes of the Spark cluster, which consists of at least one machine, and thus the size of the DataFrames is limited by the size of…


Discussing various ways for renaming columns in PySpark DataFrames

Introduction

Renaming columns in PySpark DataFrames is one of the most common yet simple operations one can apply. In today’s article, we are going to discuss various ways for renaming columns. More specifically, we will explore how to do so using

  • withColumnRenamed() method
  • selectExpr() method
  • alias() method
  • Spark SQL

Additionally, we will discuss when to use one method over the other.

First, let’s create an example DataFrame that we’ll reference throughout this guide in order to demonstrate a few concepts.

from pyspark.sql import SparkSession

# Create an instance of a Spark session
spark_session = SparkSession.builder \
    .master('local[1]') \
    .appName('Example') \
    .getOrCreate()
#…


Exploring some of the most powerful UI monitoring tools for Apache Kafka clusters

Introduction

Apache Kafka is among the fastest-growing products out there and has been widely adopted by many companies across the globe. If you are using Kafka in production, it is very important to be able to monitor and manage the cluster.

This article contains an updated list of the most popular and powerful monitoring tools for Apache Kafka clusters. You can also find an older list of recommendations that I proposed in an article written a few years back.

Specifically, in this article we will cover the following tools and services that can help you manage and monitor the health of…


Discussing how to select multiple columns from PySpark DataFrames by column name, index or with the use of regular expressions

Introduction

When working with Spark, we typically need to deal with a fairly large number of rows and columns and thus, we sometimes have to work only with a small subset of columns.

In today’s short guide we will explore different ways for selecting columns from PySpark DataFrames. Specifically, we will discuss how to select multiple columns

  • by column name
  • by index
  • with the use of regular expressions

First, let’s create an example DataFrame that we’ll reference throughout this article to demonstrate a few concepts.

from pyspark.sql import SparkSession
# Create an instance of spark session
spark_session = SparkSession.builder \…


Discussing how to shuffle the rows of pandas DataFrames

Introduction

Data shuffling is a common task usually performed prior to model training in order to create more representative training and testing sets. For instance, consider that your original dataset is sorted based on a specific column.

If you split the data then the resulting sets won’t represent the true distribution of the dataset. Therefore, we have to shuffle the original dataset in order to minimise variance and ensure that the model will generalise well to new, unseen data points.

In today’s short guide we will discuss how to shuffle the rows of pandas DataFrames in various ways. …
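The most idiomatic option, sketched below, is sample(frac=1), which returns all rows in a random order (random_state is fixed here only for reproducibility):

```python
import pandas as pd

df = pd.DataFrame({'x': range(5), 'y': list('abcde')})

# sample(frac=1) shuffles every row; reset_index(drop=True) discards
# the now-scrambled original index
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
```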


Discussing how to delete specific rows from pandas DataFrames based on column values

Introduction

Deleting rows from pandas DataFrames based on specific conditions relevant to column values is among the most commonly performed tasks. In today’s short guide we are going to explore how to perform row deletion when

  • a row contains (i.e. is equal to) specific column value(s)
  • a particular column value of a row is not equal to another value
  • a row has null value(s) in a specific column
  • a row has non-null column values
  • multiple conditions (combination of the above) need to be met

First, let’s create an example DataFrame that we’ll reference across this article in order to demonstrate a…
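The deletions listed above typically boil down to boolean indexing; a quick sketch with an assumed toy DataFrame (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 3, 2], 'colB': ['a', None, 'c', 'd']})

# delete rows where colA equals 2 (i.e. keep rows where it does not)
not_two = df[df['colA'] != 2]
# delete rows with a null value in colB
no_nulls = df[df['colB'].notna()]
# combine multiple conditions with & (and) or | (or)
combined = df[(df['colA'] != 2) & (df['colB'].notna())]
```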


Discussing how to iterate over keys and values of dictionaries in Python

Introduction

Dictionaries are among the most useful data structures in Python. Iterating over the keys and values of dictionaries is one of the most commonly used operations that we need to perform while working with such objects.

In today’s short guide, we are going to explore how to iterate over dictionaries in Python 3. Specifically, we’ll discuss how to

  • iterate over just keys
  • iterate over just values
  • iterate over both keys and values in one go

First, let’s create an example dictionary that we’ll reference across the article to demonstrate a few concepts.

my_dict = {'a': 1, 'b': 2, 'c': 3…
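In sketch form, the three iteration styles look like this (the dictionary contents are illustrative):

```python
my_dict = {'a': 1, 'b': 2, 'c': 3}

# iterate over just the keys (iterating a dict yields keys by default)
keys = [k for k in my_dict]
# iterate over just the values
values = [v for v in my_dict.values()]
# iterate over keys and values in one go
pairs = [(k, v) for k, v in my_dict.items()]
```

Since Python 3.7, dictionaries preserve insertion order, so all three iterate in the order the keys were added.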


How to use predict and predict_proba methods over a dataset in order to perform predictions

Introduction

When training models (and more precisely, supervised estimators) with sklearn, we sometimes need to predict the actual class, while on other occasions we may want to predict the class probabilities.

In today’s article we will discuss how to use predict and predict_proba methods over a dataset in order to perform predictions. Additionally, we’ll explore the differences between these methods and discuss when to use one over the other.

First, let’s create an example model that we’ll reference throughout this article in order to demonstrate a few concepts. In our examples, we will be using the Iris dataset which is…
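A minimal sketch with the Iris dataset the article mentions; the choice of classifier (logistic regression) is an assumption, not necessarily the article's:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

labels = clf.predict(X[:3])       # hard class labels, one per sample
probs = clf.predict_proba(X[:3])  # one probability per class; each row sums to 1
```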


Exploring some of the most useful virtual environment commands for everyday programming with Python

Introduction

Python applications usually make use of third-party modules and packages that are not included in the standard library. Additionally, the package under development may require a specific version of another library in order to work properly.

This means that some applications may require a specific package version (say version 1.0) while others may need a different one (say 2.0.1). We then end up with a conflict: installing either of the two versions will break the packages that require the other.

In today’s article, we are going to discuss how to deal with…
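The standard-library answer is sketched below using the built-in venv module (the directory name is an arbitrary choice, and which commands the article actually covers is an assumption):

```shell
# Create an isolated environment in ./.venv (venv ships with Python 3.3+)
python3 -m venv .venv
# Activate it; on Windows the script is .venv\Scripts\activate
. .venv/bin/activate
# pip now installs into .venv only, so per-project version pins don't clash
python -m pip --version
deactivate
```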

Giorgos Myrianthous

Machine Learning Engineer | Python Developer | https://www.buymeacoffee.com/gmyrianthous
