Apache Kafka 2.8.0 is finally out, and it ships an early-access implementation of KIP-500, which removes the Apache ZooKeeper dependency. Instead, Kafka now relies on an internal Raft quorum that can be activated through Kafka Raft metadata mode (KRaft). This new feature simplifies cluster administration and infrastructure management, and marks a new era for Kafka itself.
In this article, we are going to discuss why there was a need to remove the ZooKeeper dependency in the first place. Additionally, we will discuss how ZooKeeper has been replaced by
KRaft mode as of version 2.8.0 …
Converting a PySpark DataFrame to pandas is quite trivial thanks to the toPandas() method. However, this is probably one of the most costly operations in PySpark and it must be used sparingly, especially when dealing with fairly large volumes of data.
Pandas DataFrames are stored in memory, which means that operations over them are fast to execute; however, their size is limited by the memory of a single machine.
On the other hand, Spark DataFrames are distributed across the nodes of the Spark cluster, which consists of at least one machine, and thus their size is limited by the size of…
Medium has been a great place for tech people such as myself, as it gives me access to myriads of useful articles that can help me grow as a professional and even as a person in my work environment.
Even though the platform constantly adds features that help writers publish more appealing articles and communicate what they want to more easily, there is still some work that needs to be done.
Today, I am going to discuss four features that I would personally love to see on Medium out of my experience as…
In the field of Machine Learning there are two fundamental learning types, namely supervised and unsupervised learning. Depending on the problem we want to solve, the questions we need to answer, and the data we have access to, we need to choose a suitable learning algorithm.
Therefore, the overall learning procedure relies on the answers given to the questions raised above. And given that these answers may vary, we first need to clarify which learning type suits the nature of the problem we are trying to solve, before choosing a specific learning algorithm.
In supervised learning, the dataset…
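To make the distinction concrete, here is a minimal sketch using scikit-learn (not mentioned in the article; chosen purely for illustration): a supervised classifier learns from labelled examples, while an unsupervised clusterer only sees the features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[0.0], [0.1], [0.9], [1.0]])  # features
y = np.array([0, 0, 1, 1])                  # labels available -> supervised setting

# Supervised: the model is fit on (features, labels) pairs.
clf = LogisticRegression().fit(X, y)
pred_supervised = clf.predict(X)

# Unsupervised: the model only sees the features and discovers structure itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
pred_unsupervised = km.labels_
```

The key difference is visible in the `fit` calls: the classifier receives `y`, the clusterer does not.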
Column selection is definitely among the most commonly used operations performed over Spark DataFrames (and Datasets). Spark comes with two built-in methods that can be used for doing so, namely select() and selectExpr().
In today’s article we are going to discuss how to use both of them and also explain their main differences. Additionally, we will discuss when you should use one over the other.
Before discussing how to perform selection using either select() or selectExpr(), let's create a sample DataFrame that we'll use as a reference throughout this article.
from pyspark.sql import SparkSession

spark_session = SparkSession.builder \
…
Application performance is one of the most critical and challenging aspects of the software development lifecycle. Emerging technologies require more and more resources over time, and thus, it is crucial for developers to be able to identify bottlenecks in the code and improve them.
In today’s article, we will discuss how code profilers can help us track, visualise and improve the performance of applications in production. Additionally, we will explore gProfiler, an open-source code profiler that you can potentially include in your toolbox for code profiling and performance improvement.
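gProfiler itself works at the system level, but as a simpler, purely illustrative example of what a code profiler captures, Python's built-in cProfile module can record how much time each function call consumes (the function below is a deliberately naive stand-in for a real workload):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately naive loop so the profiler has something to measure.
    total = 0
    for i in range(n):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Dump the collected statistics into a string for inspection.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
report = stream.getvalue()
```

The resulting report lists each function with its call count and cumulative time, which is exactly the kind of data that lets you spot bottlenecks.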
The replacement of null values in PySpark DataFrames is one of the most commonly performed operations. This can be achieved using either the DataFrame.fillna() or the DataFrameNaFunctions.fill() method. In today's article we are going to discuss the main difference between these two functions.
While working with Spark DataFrames, many of the operations that we typically perform over them may return null values in some of the records. From that point onwards, other operations may raise errors when null/empty values are encountered, and thus we have to somehow replace these values in order to keep processing the DataFrame.
Additionally, when reporting tables…
Even though the Python standard library offers a wide range of tools and functionality, developers still need to use external (to the standard library) packages that are widely available. Most programming languages come with a standard package manager that allows users to manage their projects' dependencies.
The most commonly used package manager in Python is definitely
pip. In today’s article we are going to explore a few useful
pip commands that you can use to manage the dependencies of your Python projects properly.
pip is a Python package manager that enables the installation of packages that are not included in…
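A few commonly used pip commands, sketched below (the package name and version pin are purely illustrative):

```shell
# Install a package, optionally pinning a specific version
pip install requests==2.31.0

# List the packages installed in the current environment
pip list

# Write the environment's exact dependencies to a file
pip freeze > requirements.txt

# Install everything listed in a requirements file
pip install -r requirements.txt

# Show metadata (version, location, dependencies) for one package
pip show requests
```

Pinning versions in a requirements file is what makes an environment reproducible across machines.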
Lists are probably the most commonly used data structures in Python as their characteristics are suitable for many different use-cases. Since this particular object type is mutable (i.e. it can be modified in place), adding or removing elements is even more common.
In today’s article, we are going to explore the difference between the two built-in list methods append() and extend(), both of which can be used to expand a list object by adding more elements to it. Finally, we will also discuss how to insert elements at specific indices in lists.
A list is a Python data structure that is an…
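A quick sketch of the difference, plus insert() for index-based insertion:

```python
# append() adds its argument as a single element.
a = [1, 2]
a.append([3, 4])      # a is now [1, 2, [3, 4]]

# extend() adds each element of an iterable individually.
b = [1, 2]
b.extend([3, 4])      # b is now [1, 2, 3, 4]

# insert() places an element at a specific index.
c = [1, 2, 4]
c.insert(2, 3)        # c is now [1, 2, 3, 4]
```

Note the pitfall in the first call: appending a list produces a nested list, which is rarely what you want when merging two lists.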
One of the most common actions one needs to undertake when working with pandas DataFrames is data type (or dtype) casting. In today’s article, we are going to explore three distinct ways of changing the type of columns in pandas.
Before we start discussing the various options you can use to change the type of certain column(s), let's first create a dummy DataFrame that we'll use as an example throughout the article.
import pandas as pd

df = pd.DataFrame([
    ('1', 1, 'hi'),
    ('2', 2, 'bye'),
    ('3', 3, 'hello'),
    ('4', 4, 'goodbye'),
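The article's own list of methods is cut off above; three commonly used options, shown here purely as an assumption (column names are illustrative), are astype(), pd.to_numeric(), and convert_dtypes():

```python
import pandas as pd

df = pd.DataFrame(
    [('1', 1, 'hi'), ('2', 2, 'bye')],
    columns=['a', 'b', 'c'],
)

# astype(): explicit cast to a given dtype.
df['a_int'] = df['a'].astype(int)

# pd.to_numeric(): cast to a numeric dtype, with options for handling errors.
df['a_num'] = pd.to_numeric(df['a'])

# convert_dtypes(): let pandas pick the best (nullable) dtypes column by column.
converted = df.convert_dtypes()
```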
Machine Learning Engineer | Python Developer