How to print huge PySpark DataFrames

Photo by Mika Baumeister on Unsplash

In the big data era, it is quite common to work with DataFrames that consist of hundreds or even thousands of columns. In such cases, even printing them out can be tricky, as you need to ensure the data is presented in a clear but also efficient way.

In this article, I am going to explore three basic ways to display a PySpark DataFrame in a table format. …
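
As a quick taste, the workhorse here is DataFrame.show(). Below is a minimal sketch, assuming a local SparkSession and some made-up sample data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("show-demo").getOrCreate()

# Made-up sample data, just to have something to print.
df = spark.createDataFrame(
    [(1, "Alice", 34), (2, "Bob", 45)],
    ["id", "name", "age"],
)

# Print up to 20 rows as an ASCII table, truncating long cell values.
df.show(n=20, truncate=True)

# For very wide DataFrames, vertical mode prints one line per column,
# which is much easier to read than a table with thousands of columns.
df.show(n=2, truncate=False, vertical=True)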


The root of the language’s flexibility

Photo by Samuel-Elias Nadler on Unsplash

If you have a background in languages such as Java, C or C++, which are compiled and statically typed, you might find the way Python works a bit confusing. For instance, when we assign a value to a variable (say a = 1), how the heck does Python know that the variable a is an integer?

The Dynamic Typing model

In statically typed languages, the types of variables are determined at compile time. In most languages that support this static typing model, programmers must specify the type of each variable. …
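
To see the dynamic typing model in action, here is a minimal sketch; note how the same name can be re-bound to objects of different types:

# Types are attached to objects, not to names, so Python infers the
# type from whatever object a name is currently bound to.
a = 1
print(type(a))  # <class 'int'>

a = "one"
print(type(a))  # <class 'str'>

a = [1, 2, 3]
print(type(a))  # <class 'list'>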


All you need to know about sets in Python

Photo by Maxwell Nelson on Unsplash

A Python set is a collection type introduced back in version 2.4. It is one of the most powerful data structures of the language, as its characteristics prove useful and practical in numerous use cases. In this article, we’ll take a quick look at the theory behind sets and then discuss the most common set operations, as well as a few use cases where sets come in handy.

What are sets?

A set is a mutable and unordered collection of hashable (i.e. immutable) objects with no duplicates. …
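
Here is a minimal sketch of those characteristics and the most common operations:

# Duplicates are dropped automatically and order is not preserved.
s = {1, 2, 2, 3, 3}
print(s)  # {1, 2, 3}

a = {1, 2, 3}
b = {3, 4, 5}
print(a | b)  # union: {1, 2, 3, 4, 5}
print(a & b)  # intersection: {3}
print(a - b)  # difference: {1, 2}
print(a ^ b)  # symmetric difference: {1, 2, 4, 5}

# Sets are mutable: elements can be added or removed in place...
a.add(10)
a.discard(2)

# ...but members must be hashable: {[1, 2]} raises a TypeError.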


5 of the most exciting features of the new release of Apache Spark 3.0

A new major release of Apache Spark was made available on the 10th of June 2020. Version 3.0, the result of more than 3,400 resolved tickets, builds on top of version 2.x and comes with numerous new features, bug fixes and performance improvements.

Ten years after its initial release as an open-source project, Apache Spark has become one of the core technologies of the big data era. …


When and how a main method is executed in Python

Photo by Blake Connally on Unsplash

If you are new to Python, you might have noticed that it is possible to run a Python script with or without a main method. And the notation used in Python to define one (i.e. if __name__ == '__main__') is definitely not self-explanatory, especially for newcomers.

In this article, I am going to explore the purpose of a main method and what to expect when you define one in your Python applications.

What is the purpose of __name__ ?

Before executing a program, the Python interpreter assigns the name of the Python module to a special variable called __name__. Depending on whether you are executing the program from the command line or importing the module into another module, the value assigned to __name__ will vary. …
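
A minimal sketch of that behaviour (the file name greet.py is made up for illustration):

# greet.py

def main():
    print("Hello from main()")

# Prints "__name__ is '__main__'" when run directly (python greet.py)
# and "__name__ is 'greet'" when imported from another module.
print(f"__name__ is {__name__!r}")

if __name__ == "__main__":
    main()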


What is the difference between SparkSession, SparkContext and SQLContext?

In the big data era, Apache Spark is probably one of the most popular technologies, as it offers a unified engine for processing enormous amounts of data in a reasonable amount of time.

In this article, I am going to cover the various entry points for Spark applications and how these have evolved across releases. Before doing so, it might be useful to go through some basic concepts and terms, so that we can then jump more easily to the entry points, namely SparkSession, SparkContext and SQLContext.

Photo by Kristopher Roller on Unsplash

Spark Basic Architecture and Terminology

A Spark application consists of a driver program and a group of executors on the cluster. The driver is a process that executes the main program of your Spark application and creates the SparkContext, which coordinates the execution of jobs (more on this later). The executors are processes running on the worker nodes of the cluster and are responsible for executing the tasks that the driver process has assigned to them. …
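
As a preview of where the article is heading, here is a minimal sketch of the modern entry point, assuming a local cluster:

from pyspark.sql import SparkSession

# Since Spark 2.0, SparkSession has been the single, unified entry point.
spark = (
    SparkSession.builder
    .appName("entry-points-demo")
    .master("local[*]")  # assumption: running locally
    .getOrCreate()
)

# The older entry points are still reachable through the session:
sc = spark.sparkContext  # the underlying SparkContext
print(sc.appName)

# Before Spark 2.0, they were created separately, roughly like this:
# from pyspark import SparkContext
# from pyspark.sql import SQLContext
# sc = SparkContext(master="local[*]", appName="entry-points-demo")
# sqlContext = SQLContext(sc)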


Understand the basic functions required for computing time complexity

Photo by Aron Visuals on Unsplash

One of the most important factors to take into account when designing and implementing algorithms is time complexity, which is computed during algorithm analysis.

Time complexity corresponds to the amount of time an algorithm requires to run over the provided input in order to generate the required output. In this article, we will go through the most common functions that are useful in the context of algorithm analysis. …
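
To get a feeling for these functions, here is a minimal sketch that evaluates the usual suspects at a single, modest input size:

import math

# Common growth functions from algorithm analysis, evaluated at n = 20
# to show how dramatically they diverge even for a small input.
n = 20
for label, value in [
    ("O(1)", 1),
    ("O(log n)", math.log2(n)),
    ("O(n)", n),
    ("O(n log n)", n * math.log2(n)),
    ("O(n^2)", n ** 2),
    ("O(2^n)", 2 ** n),
]:
    print(f"{label:10} -> {value:,.1f}")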


Do proper module imports and make your life easier

Photo by Leone Venter on Unsplash

tl;dr

  • Use absolute imports (see the sketch after this list)
  • Append your project’s root directory to PYTHONPATH. In any environment where you run your Python application (Docker, Vagrant, your virtual environment, etc.), run the command below; if you are using virtualenv, you can add it to bin/activate:
export PYTHONPATH="${PYTHONPATH}:/path/to/your/project/"
  • Avoid using sys.path.append("/path/to/your/project/")
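
To make the first bullet concrete, here is a sketch; the project layout and the names app, db and get_connection are made up for illustration:

# Hypothetical layout, assumed for this sketch:
#
#   project/
#   ├── app/
#   │   ├── __init__.py
#   │   ├── db.py    # defines get_connection()
#   │   └── main.py
#
# Inside app/main.py, prefer the absolute import, rooted at the
# project directory that PYTHONPATH points to:
from app.db import get_connection

# ...over a path-hacked alternative such as:
# import sys; sys.path.append("/path/to/your/project/")
# from db import get_connection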

Module imports can certainly frustrate people, especially those who are fairly new to Python. …


Save time when converting large Spark DataFrames to Pandas

Photo by Noah Bogaard on Unsplash

Converting a PySpark DataFrame to pandas is quite trivial thanks to the toPandas() method; however, this is probably one of the most costly operations, and it should be used sparingly, especially when dealing with fairly large volumes of data.

Why is it so costly?

Pandas DataFrames are stored in memory, which means that operations over them are fast to execute; however, their size is limited by the memory of a single machine.

Spark DataFrames, on the other hand, are distributed across the nodes of the Spark cluster, which consists of at least one machine, and thus the size of the DataFrames is limited by the size of the cluster. …
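
Here is a minimal sketch of the conversion, assuming Spark 3.x (on Spark 2.3/2.4 the Arrow setting is spark.sql.execution.arrow.enabled instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-pandas-demo").getOrCreate()

# Arrow-based conversion speeds up toPandas() considerably.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# A million rows of made-up data.
df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled")

# toPandas() collects the whole distributed DataFrame onto the driver,
# so the result must fit in a single machine's memory.
pdf = df.toPandas()
print(pdf.shape)  # (1000000, 2)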

About

Giorgos Myrianthous

Python | Data | ML
