In the big data era, it is quite common to have dataframes that consist of hundreds or even thousands of columns. In such cases, even printing them out can be tricky, as you need to ensure that the data is presented in a clear yet efficient way.
In this article, I am going to explore three basic ways to display a PySpark dataframe in a table format. …
If you have a background in compiled or statically typed languages such as Java, C or C++, you might find the way Python works a bit confusing. For instance, when we assign a value to a variable (say a = 1), how the heck does Python know that the variable a is an integer?
In statically typed languages, the types of variables are determined at compile time. In most languages that support this static typing model, programmers must specify the type of each variable. …
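Python, by contrast, infers the type from the value at runtime: types live on objects, not on variables. A quick sketch of the difference:

```python
# The interpreter determines the type of the object a is bound to at runtime.
a = 1
print(type(a))   # <class 'int'>

# The same name can later be bound to an object of a different type.
a = "hello"
print(type(a))   # <class 'str'>
```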
A Python set is a collection type introduced back in version 2.4. It is one of the most powerful data structures of the language as its characteristics can prove useful and practical in numerous use cases. In this article, we’ll have a quick look at the theory behind sets and later on will discuss the most common set operations as well as a few use-cases where sets come in handy.
A set is a mutable and unordered collection of hashable objects (in practice, typically immutable ones such as numbers, strings and tuples) with no duplicates. …
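A short sketch of these characteristics and the most common set operations:

```python
# Duplicates are dropped on construction and order is not guaranteed.
s = set([1, 2, 2, 3])
print(s == {1, 2, 3})      # True

evens = {2, 4, 6}
odds = {1, 3, 5}
print(evens | odds)        # union
print(evens & {4, 6, 8})   # intersection
print(evens - {2})         # difference
```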
A new major release of Apache Spark was made available on the 10th of June 2020. Version 3.0, the result of more than 3,400 tickets, builds on top of version 2.x and comes with numerous features: new functionality, bug fixes and performance improvements.
Ten years after its initial release as an open source project, Apache Spark has become one of the core technologies of the big data era. …
If you are new to Python, you might have noticed that it is possible to run a Python script with or without a main method. And the notation used in Python to define one (i.e. if __name__ == '__main__':) is definitely not self-explanatory, especially for newcomers.
In this article, I am going to explore the purpose of a main method and what to expect when you define one in your Python applications.
Before executing a program, the Python interpreter assigns the name of the Python module to a special variable called __name__. The value assigned to __name__ varies depending on whether you are executing the program from the command line or importing the module into another module. …
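A minimal sketch of the pattern: __name__ equals "__main__" when the file is executed directly (e.g. python demo.py), and equals the module's own name when it is imported.

```python
def main():
    # Code placed here runs only when the file is executed as a script,
    # not when the module is imported elsewhere.
    print("running as a script")

if __name__ == "__main__":
    main()
```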
In the big data era, Apache Spark is probably one of the most popular technologies, as it offers a unified engine for processing enormous amounts of data in a reasonable amount of time.
In this article, I am going to cover the various entry points for Spark applications and how these have evolved across releases. Before doing so, it might be useful to go through some basic concepts and terms, so that we can then jump more easily to the entry points, namely SparkSession, SparkContext and SQLContext.
A Spark Application consists of a Driver Program and a group of Executors on the cluster. The Driver is a process that executes the main program of your Spark application and creates the SparkContext that coordinates the execution of jobs (more on this later). The executors are processes running on the worker nodes of the cluster which are responsible for executing the tasks the driver process has assigned to them. …
One of the most important factors to take into account when designing and implementing algorithms is time complexity, which is computed during algorithm analysis.
Time complexity corresponds to the amount of time an algorithm requires to run over the provided input in order to generate the required output. In this article, we will go through the most common functions that are useful in the context of algorithm analysis. …
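A rough illustration of how these common growth functions compare for a given input size n (the function and labels below are purely illustrative):

```python
import math

def growth(n):
    # Approximate number of "steps" for each common complexity class.
    return {
        "constant O(1)": 1,
        "logarithmic O(log n)": math.log2(n),
        "linear O(n)": n,
        "linearithmic O(n log n)": n * math.log2(n),
        "quadratic O(n^2)": n ** 2,
    }

# Even at a modest n, the gap between classes is already dramatic.
for name, value in growth(1024).items():
    print(f"{name}: {value:.0f}")
```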
PYTHONPATH: In any environment where you wish to run your Python application, such as Docker, Vagrant or a virtual environment (e.g. by adding it to bin/activate in case you are using virtualenv), run the command below:
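A typical form of that command is sketched below; the project path is a placeholder, not a fixed value:

```shell
# Append your project root to PYTHONPATH so its modules become importable;
# /path/to/your/project is an illustrative placeholder.
export PYTHONPATH="${PYTHONPATH}:/path/to/your/project"
```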
Module imports can certainly frustrate people, especially those who are fairly new to Python. …
Converting a PySpark DataFrame to a pandas DataFrame is quite trivial thanks to the toPandas() method. However, this is probably one of the most costly operations, and it must be used sparingly, especially when dealing with fairly large volumes of data.
Pandas DataFrames are stored in memory, which means that operations over them are faster to execute; however, their size is limited by the memory of a single machine.
On the other hand, Spark DataFrames are distributed across the nodes of the Spark cluster, which consists of at least one machine, and thus the size of the DataFrames is limited by the size of the cluster. …