Version 2.8.0 Gives You Early Access to ZooKeeper-Less Kafka

Photo by Christian Lambert on Unsplash

Apache Kafka 2.8.0 is finally out, and you can now get early access to KIP-500, which removes the Apache ZooKeeper dependency. Instead, Kafka now relies on an internal Raft quorum that can be activated through Kafka Raft metadata mode (KRaft). The new feature simplifies cluster administration and infrastructure management and marks a new era for Kafka itself.

ZooKeeper-less Kafka

In this article, we are going to discuss why there was a need to remove the ZooKeeper dependency in the first place. Additionally, we will discuss how ZooKeeper has been replaced by KRaft mode as of version 2.8.0 …


Save time when converting large Spark DataFrames to Pandas

Photo by Noah Bogaard on unsplash.com

Converting a PySpark DataFrame to Pandas is quite trivial thanks to the toPandas() method; however, this is probably one of the most costly operations and it must be used sparingly, especially when dealing with fairly large volumes of data.
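
As a rough sketch of what the conversion looks like in practice (the DataFrame contents and app name below are made up for illustration; enabling Arrow-based transfer is one common optimization, though the full article may take a different approach):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-pandas-demo").getOrCreate()

# Optional: Arrow-based columnar transfer usually speeds up toPandas()
# (this config key applies to Spark 3.x; older versions use a different one).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

sdf = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])

# toPandas() collects the entire DataFrame onto the driver, so it should
# only be called on data that fits in the driver's memory.
pdf = sdf.toPandas()
print(pdf.head())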

Why is it so costly?

Pandas DataFrames are stored in memory, which means that operations over them are faster to execute; however, their size is limited by the memory of a single machine.

On the other hand, Spark DataFrames are distributed across the nodes of the Spark cluster, which consists of at least one machine, and thus the size of the DataFrames is limited by the size of…


Discussing 4 features that would make Medium a better place for Technical Writers

Photo by Thought Catalog on Unsplash

Medium has been a great place for tech people such as myself, as it gives me access to a myriad of useful articles that help me grow as a professional, and even as a person, in my work environment.

Even though the platform constantly adds features that help writers share more appealing articles and communicate what they want to more easily, there’s still some more work that needs to be done.

Today, I am going to discuss four features that I would personally love to see on Medium, based on my experience as…


Discussing the main differences between supervised, unsupervised and semi-supervised learning in Machine Learning

Photo by Gertrūda Valasevičiūtė on Unsplash

In the field of Machine Learning, there are two fundamental learning types, namely supervised and unsupervised methods. Depending on the problem we want to solve, the questions we need to answer and the data we have access to, we need to choose a suitable learning algorithm.

Therefore, the overall learning procedure relies on the answers given to the questions raised above. And given that these answers may vary, we first need to clarify which learning type suits the nature of the problem we are trying to solve, before choosing a specific learning algorithm.

In supervised learning, the dataset…
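
As a minimal sketch of the distinction (not taken from the article; the dataset and models below are arbitrary choices for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Labelled data: features X together with known target labels y.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Supervised learning: the model is fit on both X and the labels y.
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised learning: the model only sees X and discovers structure
# (here, two clusters) without any labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])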


Discussing the difference between select() and selectExpr() methods in Spark

Photo by Alexander Schimmeck on Unsplash

Column selection is definitely among the most commonly used operations performed over Spark DataFrames (and DataSets). Spark comes with two built-in methods that can be used for doing so, namely select() and selectExpr().

In today’s article we are going to discuss how to use both of them and also explain their main differences. Additionally, we will discuss when you should use one over the other.

Before discussing how to perform selection using either select() or selectExpr(), let’s create a sample DataFrame that we’ll use as a reference throughout this article.

from pyspark.sql import SparkSession

spark_session = SparkSession.builder \…
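
To give a rough idea of the comparison (the column names and values below are assumptions for illustration, not the article’s actual example):

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("select-demo").getOrCreate()

df = spark_session.createDataFrame(
    [(1, "Alice", 3000), (2, "Bob", 4000)],
    ["id", "name", "salary"],
)

# select() accepts column names (or Column objects).
df.select("name", "salary").show()

# selectExpr() accepts SQL expressions as strings, so simple
# transformations can be written inline.
df.selectExpr("name", "salary * 1.1 AS adjusted_salary").show()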


How to continuously measure, visualize and improve the performance of your Python code in production

Photo by Chris Liverani on Unsplash

Application performance is one of the most critical and challenging aspects of the software development lifecycle. Emerging technologies require more and more resources over time, and thus, it is crucial for developers to be able to identify bottlenecks in the code and improve them.

In today’s article, we will discuss how code profilers can help us track, visualize and improve the performance of applications in production. Additionally, we will explore gProfiler, an open-source code profiler that you can potentially include in your toolbox for code profiling and performance improvement.

A code profiler is a powerful tool that can help developers…
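
As a simple illustration of what a profiler surfaces, here is a sketch using Python’s built-in cProfile (gProfiler itself is a different tool, aimed at continuous, production-wide profiling):

import cProfile
import pstats

def slow_sum(n: int) -> int:
    # Deliberately naive loop so the profile has something to show.
    total = 0
    for i in range(n):
        total += sum(range(i % 1000))
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(20_000)
profiler.disable()

# Print the 5 most expensive calls, sorted by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)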


Discussing how to replace null values in PySpark using fillna() and fill()

Photo by Kelly Sikkema on Unsplash

Replacing null values in PySpark DataFrames is one of the most commonly performed operations. This can be achieved by using either the DataFrame.fillna() or the DataFrameNaFunctions.fill() method. In today’s article we are going to discuss the main difference between these two functions.

While working with Spark DataFrames, many operations that we typically perform over them may return null values in some of the records. From that point onwards, some other operations may result in errors if null/empty values are observed, and thus we have to somehow replace these values in order to keep processing the DataFrame.

Additionally, when reporting tables…
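
As a brief sketch of the two methods side by side (the DataFrame below is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fillna-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, None), (2, "foo"), (None, "bar")],
    ["id", "label"],
)

# DataFrame.fillna(): replace nulls, per column, using a dict of values.
df.fillna({"id": 0, "label": "unknown"}).show()

# DataFrameNaFunctions.fill(), reached through df.na, behaves the same way.
df.na.fill({"id": 0, "label": "unknown"}).show()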


Exploring some of the most useful pip commands for everyday programming

Photo by Kelli McClintock on Unsplash

Introduction

Even though the Python standard library offers a wide range of tools and functionality, developers still need to use external (to the standard library) packages that are widely available. Most programming languages have their own standard package managers that allow users to manage the dependencies of their projects.

The most commonly used package manager in Python is definitely pip. In today’s article, we are going to explore a few useful pip commands that you can use to properly manage the dependencies of your Python projects.

What is pip

pip is a Python package manager that enables the installation of packages that are not included in…


Exploring how to append to or extend a list in Python

Photo by Tom Wilson on Unsplash

Lists are probably the most commonly used data structures in Python as their characteristics are suitable for many different use-cases. Since this particular object type is mutable (i.e. it can be modified in place), adding or removing elements is even more common.

In today’s article, we are going to explore the difference between the two built-in list methods append() and extend() that can be used to expand a list object by adding more elements to it. Finally, we will also discuss how to insert elements at specific indices in lists.

A list is a Python data structure that is an…
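
As a quick sketch of the three operations mentioned above:

nums = [1, 2, 3]

# append() adds its argument as a single element.
nums.append([4, 5])       # [1, 2, 3, [4, 5]]

# extend() unpacks an iterable and adds each of its elements.
nums = [1, 2, 3]
nums.extend([4, 5])       # [1, 2, 3, 4, 5]

# insert() places an element at a specific index.
nums.insert(0, 0)         # [0, 1, 2, 3, 4, 5]
print(nums)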


Exploring 3 different options for changing dtypes of columns in pandas

Photo by Meagan Carsience on Unsplash

One of the most common actions one needs to undertake when working with pandas DataFrames is data type (or dtype) casting. In today’s article, we are going to explore 3 distinct ways of changing the type of columns in pandas. These methods are:

  • astype()
  • to_numeric()
  • convert_dtypes()

Before we start discussing the various options you can use to change the type of certain column(s), let’s first create a dummy DataFrame that we’ll use as an example throughout the article.

import pandas as pd

df = pd.DataFrame(
    [
        ('1', 1, 'hi'),
        ('2', 2, 'bye'),
        ('3', 3, 'hello'),
        ('4', 4, 'goodbye'),
    ],
    columns=list('ABC')
)
print(df)
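
As a hedged sketch of how the three options listed above might be applied to this DataFrame (the full article may use different arguments):

# astype(): explicitly cast column A (currently strings) to integers.
df['A'] = df['A'].astype(int)

# to_numeric(): parse values into numbers; errors='coerce' turns
# non-numeric values (like the strings in column C) into NaN.
df['C_numeric'] = pd.to_numeric(df['C'], errors='coerce')

# convert_dtypes(): let pandas infer the best (nullable) dtype per column.
df_converted = df.convert_dtypes()
print(df_converted.dtypes)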

Giorgos Myrianthous

Machine Learning Engineer | Python Developer
